PurePerformance - Chaos Engineering: The art of breaking things purposefully with Adrian Hornsby
Episode Date: September 2, 2019
In 2018 Adrian Cockcroft was quoted with: "Chaos Engineering is an experiment to ensure that the impact of failures is mitigated"! In 2019 we sit down with one of his colleagues, Adrian Hornsby (@adhorn), who has been working in the field of building resilient systems over the past years and who is now helping companies to embed chaos engineering into their development culture. Make sure to read Adrian's chaos engineering blog and then listen in and learn about the 5 phases of chaos engineering: Steady State, Hypothesis, Run Experiment, Verify, Improve. Also learn why chaos engineering is not limited to infrastructure or software but can also be applied to humans.
Adrian on Twitter: https://twitter.com/adhorn
Adrian's Blog: https://medium.com/@adhorn/chaos-engineering-ab0cc9fbd12a
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson.
Bet you didn't know that.
And as always, my guest is, can you guess, Andy Grabner.
Hey Andy, how are you doing today?
Good, good.
Hi Brian.
Thanks for, well, great to be back on actually doing some recordings.
I know the audience wouldn't really know.
They have no idea. They're clueless. They think we always do it on the two-week schedule. But you've been traveling, right?
Yes, I went back to New Jersey, visited some friends and family, and got to go to the beach, or as we call it in New Jersey, the shore. Sure to please. I also found out, here's a stupid little tidbit: I don't know if you recall Hurricane Sandy. Probably a lot of our listeners overseas never heard of it, but it was this huge hurricane that really did a lot of damage to the whole eastern coast of the United States. After that, New Jersey was trying to rebuild along the coast, and they came up with this saying, "Jersey Strong." You know, there's a lot of guidos in New Jersey, like, "Yeah, Jersey Strong, we're gonna come back strong."
And it turns out somebody opened a gym, a workout gym, called Jersey Strong as well. So now it's confusing, because if someone has a Jersey Strong bumper sticker, you don't know if they're taking pride in rebuilding New Jersey or if they're like, "Yeah, I go to this gym and lift weights, bro." So that's my tidbit.
That's what I learned on my trip to New Jersey.
And you've been running around a lot too, haven't you, Andy?
I believe so, yeah.
Well, not running, flying Europe.
I did a tour through Poland, a couple of meetups,
a couple of conferences, also visited our lab up in Gdansk, which was the hottest weekend of the year with 36 degrees on the Baltic Sea, which is kind of warm, I would say.
But fortunately, the water was at least, you know, refreshing still.
And yeah, but we're back now.
And we have...
Sounds like there's been a lot of chaos.
Sounds like there's been a lot of chaos, exactly.
And actually, I think distance-wise, Gdansk in Poland, in the north of Poland,
is probably not that far away from the hometown of our guest today.
And with this, I actually want to toss it over to Adrian,
and he should tell us how the summer is up there in Helsinki.
Welcome back, Adrian.
Hi, guys. Hi, Andy. Hi, Brian. How are you?
Good, how are you?
Yeah, thanks a lot for having me back.
Well, the weather in Finland, since that was the first chaotic question,
is indeed chaotic.
Last year we had the hottest summer in maybe 50 years, and now we had one of the coldest summers in 50 years.
It's pretty weird. Today is nice actually.
So you did not get the extreme heat wave all the way up there? That's interesting.
No, heat waves here are 25 degrees.
Okay. Yeah, I think we got 40 here in Austria, and 35 up there in the north of Poland. Crazy, crazy stuff.
Hey, Adrian, I'm gonna try a quick conversion here, hold on. Man, when are you finally... let's see, so it's about 30, it's going up to about 35 all week here.
Yeah, wow.
It's hot.
It's hot, yeah.
Yeah, when are we going to finally get on to the... yeah, we're never going to do that. Socialism.
Yeah, I know.
So, Adrian, coming to chaos, we talk about chaos.
Well, Brian mentioned it, I mentioned it. Today's topic, after having you on the first call, it was really cool to hear about best practices of building modern architectures.
I really love this stuff.
And I used it again, actually, on my trip to Poland. I did a presentation in Warsaw, and I gave you credit on the slides that I repurposed for retries, timeouts, backoff, jitter. This was really well received.
Awesome. It's great to hear.
Yeah, thanks again for that. It's great when you get on stage and people think you're smart because you told them something new, but it's also great to admit that all the smartness comes from other people that are all sharing the same spirit of sharing.
Sharing is caring.
So thanks for that.
Thanks for allowing me to share your content.
Yeah, I don't want to take the credit on that either because I learned it from someone else, you know, or books or articles.
At the end of the day, it's all about sharing, as you said, and teach what we had a problem with six months ago.
Yeah.
Hey, so today's topic is chaos engineering.
And I will definitely post a link to your Medium blog post.
I think part one is out and part two is coming.
But it's a really great introduction into chaos engineering.
And to be quite honest with you,
I hadn't used the term chaos engineering until recently. I always just said chaos monkeys or chaos testing, and I think that's because I was just so influenced by the first time I learned about introducing chaos, which was, I believe, Netflix. At least when I read about it, Netflix was the one that came up with the Chaos Monkeys. Is this correct?
Yes, correct.
Yeah.
You know, it's all part of the resiliency engineering, in a way, kind of field, right? So I think the term Chaos Monkeys was really made popular by Netflix.
And their awesome technology blog told the story about how they developed those monkeys.
Hey, and so, I mean, again, everybody should read the blog post.
It's amazing the background you give.
But maybe in your words, when you're introducing somebody into chaos engineering,
when you are teaching at conferences or just visiting customers and clients and enterprises.
How do you explain chaos monkey or chaos engineering?
Let's call it chaos engineering.
How do you explain the benefit of chaos engineering and what it actually is and why people should do it?
That's a very good question.
And I'll try to make it short because I think we could spend hours trying to explain just the discipline of chaos engineering. But in short, what I say is: chaos engineering is a sort of scientific experiment where we try to identify failures that we didn't think of, and especially figure out how to mitigate them.
It's very good at taking the unknown unknown out of the equation.
I think traditionally we've used tests to make sure our code is resilient to known conditions. Chaos engineering is kind of a different approach. It's a discipline that really tries to take things that we might not have thought about out of the darkness and bring them out, so that we can learn how to mitigate those failures before they happen in production.
And examples: I think in your blog post you also have a graphic in there where you're talking about different levels or different layers of the architecture where you can basically introduce chaos, right? So infrastructure is obviously clear.
What else?
Yeah, I mean, it's very similar to resiliency, right?
I think chaos can be introduced at many different layers, from the obvious infrastructure level, like removing, for example, an instance or a server and figuring out if the overall system is able to recover from it, to the network and data layer. So removing, let's say, a network connection or degrading the network connection, adding some latency on the network, dropping packets, or making a black hole for a particular protocol.
That's on the network and data level.
On the application level,
you can just simply add the exception,
make the application throw some exceptions
randomly inside the code
and see how the outside world or the rest of the system is behaving around it.
But you can also apply it at the people level.
And I mean, I talk a little bit about it in the blog.
The first experiment I always do is trying to identify the technically sound people in teams, or the kind of semi-gurus, as I like to call them, and take them out of the equation and send them home without their laptops and see how the company is behaving.
Because very often you'll realize that information is not spread equally around team members. Well, I call that the bus factor: if that very technically sound person is not at work or gets hit by a bus, well, you know, how do we recover from failures?
I like that idea. I mean, Brian, I know we always try to joke a little bit, but I think if Adrian would ever come to us,
I think he would definitely not send the two of us home
because there are more technically sound people in our organization.
It's not necessarily the technically sound people.
It's very often the connector.
It's the person that knows everyone inside the organization
or in the company that can act very fast,
that has basically a lot of history
inside the company, and that kind of recalls, or is able to recall, every problem that happened and how they recovered from it. It's kind of a walking encyclopedia, you know.
Yeah. But Andy, I think you can attest to the fact that I was just on vacation for about a week and a half and the whole company fell apart. Right. So I am really, really important to the organization. Now, one idea behind chaos testing or chaos engineering is that you can't just start with chaos engineering.
You have to have prerequisites in your environment. You talked about, you know, all your prerequisites
to chaos, building a resilient system, resilient software, resilient network. So you really have to,
and correct me if I'm wrong, but from what I was reading, before you start the chaos engineering
part, you have to build in as much resiliency into all layers of your system
as you can possibly think of first.
Then you start seeing what did we miss, right?
You don't just say, hey, we just stood up our application
and our infrastructure.
Now we're going to throw something at it because, of course,
it's going to fail and it's going to be catastrophic.
So you really want to start from a good place, right?
Exactly.
And that's the whole point of chaos engineering.
It's really, I would say, a scientific experiment, where we test a hypothesis.
For example, we believe we've built a resilient system.
We spend a lot of time.
We say, okay, our system is resilient to database failures.
So let's make a hypothesis and say, you know,
what happens if my database goes down?
And then you take the entire organization,
the entire team from the product owner to the designers
to the software developers,
and you ask them actually what they think is going to happen.
Very often I ask them not to talk about it,
but to write it on the paper,
just to avoid the kind of mutual agreements, you know, like everyone comes with a consensus
by talking about it. So if you write it on the paper in private, you realize everyone in the
team has often very different ideas of what should happen if a database goes down.
And a good thing to do at that moment is to stop and ask,
how is it possible that we have different understandings of our specifications?
Then you go back to the specification. Very often people fail to read it carefully, or you improve it.
But once you have everyone in the team having a real good understanding
of what should happen when the database goes down,
then you make an experiment and you actually want to verify that
through that experiment, and with all the measurements you're doing around it, make sure that the system goes back to the initial state.
We call that the steady state of the application.
And also that it does that in the appropriate time, right?
In the time you thought about.
And you have to verify all this.
Yeah, and I think there's a lot to unravel there.
But even before that, right?
Even before you run that first experiment,
are there, and Andy, I don't know if you've maybe seen any, or Adrian, are there some best practices? Like, I know there's the concept of doing multiple availability zones. Let's say, if we're talking about AWS, you can set up your application in multiple zones, so this way, if one zone has an issue, you still have your two, like, you know, the rule of threes. If we go way back in time, Steve Souders wrote the best practices for web design, you know, web performance. And I know you've written some articles and all, but is there a good checklist for someone starting out, or an organization, before they start these experiments, to say, here are some of the starting points that you should set up before you even entertain your next set of hypotheses, right?
Because obviously if you don't do this basic set, you're going to have catastrophic failures.
Is there something written out yet or is it mostly just things we've been talking about and word of mouth and people's blogs?
Well, it's a very good question, actually.
I'll take the note and make sure I can build a list.
Actually, that's a very good idea.
But I would say I think the 12-factor app kind of patterns
is a very good place to start with.
I did put some lists around,
and I talk about resilient architectures in different blogs.
I think it's really software engineering basics: timeouts, retries, exception handling, these kinds of things. Make sure that you have redundancy built in, that your system is self-healing, because, you know, at the end of the day you don't want humans to recover; you actually want the system to recover automatically. So that's something that is usually not well thought out, or not at every level.
I always say, so if you don't do infrastructure as code
and automatic deployment, maybe don't do chaos
or actually don't do chaos
because you're going to have big problems.
You know, your system should be automatically kind of deployed and managed and self-healed in a way.
And then, of course, on the operation level, you should have the full observability and
complete monitoring of your system.
If you have no visibility on what's happening in your application, there's no way you can conduct a sound experiment
or even verify that that experiment
first has been successful
or has negatively impacted the running system.
So measure, measure, measure as much as you can.
And of course, you need to have an incident response team, in a way, or a practice, so that you know what you should do as soon as an alert comes in, treat the tickets, and be able to have this full incident response.
Yep.
That'll do, right?
Mm-hmm.
Yep.
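As a rough illustration of the "timeouts, retries, exception handling" basics Adrian lists as prerequisites, here is a minimal Python sketch of a retry wrapper with exponential backoff and full jitter. The function names and limits are illustrative assumptions, not code taken from Adrian's libraries.

```python
import random
import time

# Minimal sketch of retry with exponential backoff and full jitter.
# MAX_ATTEMPTS and BASE_DELAY are illustrative values, not recommendations.
MAX_ATTEMPTS = 5
BASE_DELAY = 0.2   # seconds
MAX_DELAY = 5.0    # cap so retries never sleep "forever"


def call_with_retries(operation, *args, **kwargs):
    """Call `operation`, retrying transient failures with backoff + jitter."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return operation(*args, **kwargs)
        except (ConnectionError, TimeoutError):
            if attempt == MAX_ATTEMPTS:
                raise  # give up: let the caller's error handling take over
            # Exponential backoff with full jitter: sleep a random amount
            # between 0 and the capped exponential delay.
            delay = min(MAX_DELAY, BASE_DELAY * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))


# Hypothetical usage: wrap a flaky network call.
# call_with_retries(requests.get, "https://example.com/orders", timeout=2)
```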
And I know Andy's going to want to dive into all the phases, so let's go on.
You mentioned kind of an overview of the experiment, the observations, the hypothesis and all that.
So Andy, I know you probably have 20,000 questions and things to talk about.
So let's dive into that now.
I mean, again, before I actually dive in, because you just mentioned monitoring and incident response at the end: I think in your blog post you also mentioned that in one of your cases,
you inflicted chaos and you actually saw
that the incident response or the alerting
was also impacted by the chaos.
For instance, not able to, let's say,
send Slack messages or send emails or stuff like that.
I think there's such a huge variety of chaos, like in the hypotheses that we need to test against.
Yeah, it's very common to build systems that kind of host or power our own response system. And you see, actually, there's been a recent outage with Cloudflare, for example, and they were not able to use their internal tools easily, simply because their tools were too secure. The people hadn't used their maintenance tools long enough, or recently enough, so the system had just removed the credentials.
And when you're in a panic situation,
having systems like this that actually are impacted
by your own behavior is hard.
But DNS is another one.
Very often, if DNS goes down,
your own tools are not accessible
because they use or might use DNS.
So all these kind of things like this are usually very important
to test as well while doing the experiment.
That's why you need to start from or simulate the real outage situation.
So now I got it.
Before going into the phases, because I know you have in the
blog post, you have like five phases that you talk about, but one question that came
up while you were earlier talking about application exceptions, because, you know, Brian and I,
we kind of live and breathe applications because we've been, you know, with Dynatrace,
especially we monitor applications and services, but obviously the full stack,
but we are very, I think, knowledgeable on applications.
And when you said earlier chaos in applications, like you can, you know,
just throw exceptions that you would normally not throw and see how the
application behaves.
Do you then imply there are tools that would, for instance,
use dynamic code injection to force exceptions?
Is that the way it works?
Or do you just, I don't know,
change configuration parameters of the application
that you know it will result in exceptions?
There are several types of tools.
There are libraries that you can directly use
to inject failures.
Either in Python, very often it's using a decorator function that you wrap around the function, and that intercepts the function call and then throws some exception.
Actually, I'm building my chaos library toolkit around that, using that concept
in Python to inject failures in Lambda functions in Python. It's being used as well for JavaScript,
similar technique. There are also techniques to have a proxy as well. So you have proxies between two different systems
and then the proxy kind of hijacks the connection,
kind of man in the middle in a way,
and kind of alter the connection,
can add latency, can throw some exception,
can inject, you know, like packet drops or things like this. So there's a lot of different techniques.
There's also one very common technique that Gremlin is using, which is agent-based. So they have an agent running on an instance or a Docker container that can inject failures locally, these kinds of things. So they can intercept process-level type of things and just throw exceptions or make it burst or things like this.
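Here is a rough sketch of the decorator technique Adrian describes: a Python decorator that wraps a function and randomly raises an exception (or adds latency) so you can watch how the rest of the system behaves. The decorator name, the failure rate, and the environment-variable switch are illustrative assumptions, not code from Adrian's chaos library.

```python
import functools
import os
import random
import time


def inject_failure(exception=Exception("chaos: injected failure"),
                   rate=0.2, delay_seconds=0.0):
    """Wrap a function and randomly raise an exception / add latency.

    Only active when the (hypothetical) CHAOS_ENABLED env var is set,
    so the same code can run untouched in normal operation.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if os.environ.get("CHAOS_ENABLED") == "1":
                if delay_seconds:
                    time.sleep(delay_seconds)          # simulate latency
                if random.random() < rate:
                    raise exception                    # simulate a failure
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_failure(exception=TimeoutError("chaos: simulated timeout"), rate=0.3)
def get_recommendations(user_id):
    # Real lookup would go here; kept trivial for the sketch.
    return ["item-1", "item-2"]
```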
Cool.
So moving on to the phases,
if you kind of look at your blog posts,
we already talked about the steady state, right?
Steady state, basically, I think for me,
what it means, first of all, you have to have a system that is kind of stable and steady.
Because if you have a system that in itself is not predictable, I guess inflicting chaos on it is probably not making it easier to figure out: is this situation now normal, or caused by chaos? So I guess steady state really means a system that is stable and you know what that state is, right?
Yeah, and it's predictable, as you said. I think a big mistake people usually make is they use system attribute metrics, type of CPU, memory, or things like this,
and look at this as a way to measure the health
and the sanity of an application.
Actually, a steady state should have nothing to do with this,
or at least not entirely, and should be a mix
of operational metrics and customer experience.
I write in the blog about the way Amazon does that, the number of orders.
And you can easily imagine that the CPU on an instance has actually no impact on the number of orders at big scale.
And similarly for Netflix, they use the number of times people press on the play button globally to define their steady state. They call that the pulse of Netflix, which I think is beautiful because it's a relation between the customer and the system,
right? If you press several times, then of course it means the system is not responding the way it
should, right? So it's experiencing some issues. And similarly, if you can't place an order on
Amazon retail page, it means the system is not working as it's designed. So it's a very good kind of a steady state.
But it's important to work on that.
It's not easy actually to define it.
And I see a lot of customers having big trouble
first defining the steady states
or their steady states.
You can have several of them.
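To make the "steady state as a business metric" idea concrete, here is a hedged sketch of what a steady-state check might look like: compare a customer-facing metric (orders per minute) against an expected band, alongside a supporting infrastructure metric, which also anticipates the point made below about not ignoring the rest of the stack. The metric sources and tolerances are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Expected band for a business metric plus a supporting infra metric."""
    min_orders_per_minute: float
    max_orders_per_minute: float
    max_instance_count: int


def within_steady_state(orders_per_minute: float,
                        instance_count: int,
                        expected: SteadyState) -> bool:
    """True if the system looks 'normal' before/after an experiment."""
    business_ok = (expected.min_orders_per_minute
                   <= orders_per_minute
                   <= expected.max_orders_per_minute)
    # Supporting metric: did recovery quietly double our footprint?
    infra_ok = instance_count <= expected.max_instance_count
    return business_ok and infra_ok


# Hypothetical numbers: 10,000 orders/min +/- 10%, at most 40 instances.
expected = SteadyState(9_000, 11_000, 40)
print(within_steady_state(orders_per_minute=10_250, instance_count=38,
                          expected=expected))  # True
```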
But for me, I mean, the way you explain it to me,
it's kind of like a business metric, right?
Like orders, the pulse.
I don't know if I would apply it to what we do at Dynatrace,
the number of trial signups
or the number of agents that send data to Dynatrace.
I have a hard time believing, but I'm sure you're right, that companies have a hard time defining that steady state metric,
because if they don't know what their steady state metric is,
they don't even know whether their business is doing all right
or is currently impacted.
Is that really the case?
I guess you'd be surprised.
It's not that they don't know what the business is doing; the business might be doing several things. And I think usually the business metrics or things like this are used, let's say, by higher-level monitoring that might go back to the managers or the C-levels, versus what we want is actually that kind of metric to go down to the engineering team in a very distilled way, and very easily accessible.
And okay, now that makes a lot of sense.
And I have another question though.
So you said CPU is typically not a metric because it doesn't matter how many,
what CPU usage you have as long as the system gets back to steady state.
But do you still include these metrics?
The reason why I'm asking is what if you are having a, let's say,
a constant, like if you look at the orders, right?
You have, let's say, 10,000 orders per second coming in, and you know you're using X CPUs in a steady-state environment,
then you're bringing chaos into the system.
The system recovers, goes back to the 10,000 orders,
but all of a sudden you have, I don't know, twice as many CPUs.
Isn't that, shouldn't you still include at least some of the
metrics across the stack
to validate not only that you are
that the system itself is back
in a steady state but the supporting infrastructure
is at least
kind of going back to normal
or is this something you would just not do
at all because it is not the focus point?
No, you're totally right. The steady state is kind of the one metric that is important,
but you do need all the supporting
metrics as well.
It doesn't mean, when I say the steady
state is the most important one, it doesn't
mean you should not include the rest.
On the contrary, it's actually
you should have as
much metrics as you can.
It doesn't mean you should try to overdo the things, right?
But I think the most essential one,
especially for the cloud,
if you say you have 10,000 orders
and your infrastructure to support 10,000 orders
is the one that you are currently using, you make a chaos experiment, and then that infrastructure or the need to support that 10,000 orders is kind of double,
then of course you should raise a big alarm, right?
And make sure that this is looked at, of course.
Cool.
All right, so steady state, first state or first phase of chaos engineering.
What comes next?
So after the steady states, we make the hypothesis.
Once we understand the system, then we need to make the hypothesis.
And this is the what, what if, or what when, because failure is going to happen.
So what happens when the recommendation engine stops, for example, or what happens when the database fails
or things like this?
So if you're first-timer in chaos engineering,
yeah, definitely start with a small hypothesis.
You know, don't tackle the big problems right away.
Build your confidence, build your skills,
and then grow from there.
But yeah, this is endless possibilities, right?
So what I love to do is usually look at an architecture and kind of look at the critical
systems first.
So you look at your kind of APIs and what are the critical systems for each of the APIs. And then you tackle the critical systems first, right?
And see if these are really as resilient as you expected,
or if you can uncover some kind of failure mode
that you didn't think of, and things like this.
And then you can go into less critical dependencies, but usually the most important thing is to make sure that the critical components are tackled first.
And this is also the case with the hypothesis. If I just recount or repeat what you said earlier: you explained you do the exercise where you go into a room and you bring everybody in the room
and then you discuss the hypothesis
and let everyone kind of write down
what they believe is going to happen, right?
Exactly.
And this is for me,
it's one of the most important part of that exercise
because you want to uncover different understandings
and figure out why people have different understandings of a hypothesis.
And usually this will uncover some problems already,
like a lack of specifications or a lack of communication
or simply people have forgotten what it's supposed to be.
Yeah. And also if you think about this,
let's take the database as an example.
If you say, what happens if the database is gone?
And maybe one team says, well,
my system that is relying on the database
is just retrying it later.
And then the other team that is responsible
for the database maybe says, well,
but that's not the intended way.
We thought you could just go back to a, like, I don't know, a static version of the data,
blah, blah, blah.
I think that's, as you say, uncovering a lot of problems that actually have nothing to
do with technology, but actually have to do with lack of communication or lack of understanding
of the system.
Yeah.
And very often what I've noticed,
the big difference is usually between the design team
or UI teams and then the backend teams, right?
Many times the UI team will not have thought about this,
like what happens if your database goes down?
Well, for the technical team, it might be obvious
to move into a read-only mode and say,
okay, we move requests to the read replicas of the database. And eventually, after maybe a minute, we fail over to the new master.
So for one minute, your application will be in a state of, hey, there's no database.
So what do you do?
Very often, people in the UI world won't think about it
because it's not a case they were asked about.
Read-only mode is quite weird, actually, to think of.
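A minimal sketch of the read-only fallback Adrian describes: if the primary is unavailable while failover is running, degrade reads to a replica and surface a clear read-only error for writes instead of failing silently. The class and method names are illustrative assumptions.

```python
class PrimaryDown(Exception):
    """Raised by the (hypothetical) primary client while failover is running."""


class OrdersStore:
    def __init__(self, primary, replica):
        self.primary = primary      # read/write client
        self.replica = replica      # read-only client
        self.read_only = False      # degraded-mode flag

    def get_order(self, order_id):
        # Reads can still be served from a replica during failover.
        source = self.replica if self.read_only else self.primary
        try:
            return source.get(order_id)
        except PrimaryDown:
            self.read_only = True               # degrade instead of failing
            return self.replica.get(order_id)

    def place_order(self, order):
        if self.read_only:
            # The UI has to handle this case explicitly, which is exactly
            # the kind of behaviour the hypothesis discussion surfaces.
            raise RuntimeError("orders are temporarily read-only")
        try:
            return self.primary.put(order)
        except PrimaryDown:
            self.read_only = True
            raise RuntimeError("orders are temporarily read-only")
```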
Hey, Andy, I was also thinking,
when reading through these steps
and also sitting on the hypothesis
and the steady state piece,
how this sounds like this would be a wonderful place for performance
testers, engineers, whatever to transition into. But specifically with this hypothesis phase,
I love the idea of making this not just for chaos, but for performance as well. Let's say there's a
new release being built, right? Gather everybody together and say, all right, we're going to run a certain load against this.
What do you all think is going to happen
in a way to make them think,
well, wait, what did I just code and what might happen?
Again, like classic database issue
where suddenly we added four more database queries
to a statement.
Well, hey, are you thinking about that?
But just even if it's not for performance,
this idea of getting everybody together to say,
what do you think is going to happen if X,
whether it be a chaos experiment or performance,
some sort of a load test or something else,
to say, what do you think is going to happen?
I think that sort of communication with people
opens their mind to actually thinking about what they're doing
and what impacts they might have.
I think regardless if it's for chaos,
I think it's just a great practice for organizations to put together.
Yeah, exactly.
And if you think of it, I mean,
executing load or putting a load on the system is a form of chaos.
Anyway, I love this.
Yeah.
It's great.
Yeah.
So hypothesis is done.
Then I guess you run the experiment, right?
Correct.
And this is first designing and running the experiment.
And I think here the most important thing is blast radius, right? So you have to keep the customers in mind, right?
And of course, initially, if it's your first time,
you might not want to do that in production
and really do this in test environment
and make sure that you are able to control the blast radius
during your experiments.
And this is very important to think about.
It's like, you know, how many customers might be affected,
what functionality is impaired, or if you are a multi-site,
which locations are going to be impacted by this experiment.
So, you know, this is very, very important to think about.
And, you know, then it's about running it, right,
and identifying also the metrics that you need to measure for that experiment.
So as you mentioned earlier, you know,
you might want to make sure that you control the number of instances
that you need for a particular steady state
and make sure that you return to that same number after the experiment.
And this is definitely part of the overall metrics
that you need to check for your application.
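As a sketch of "controlling the blast radius," here is one way an experiment runner might cap how much of the fleet an experiment can touch and abort as soon as the steady-state check fails. Everything here (the percentage cap, the abort behaviour, the callback names) is an illustrative assumption, not a specific tool's behaviour.

```python
import random


def run_experiment(instances, attack, check_steady_state,
                   blast_radius=0.05, rollback=None):
    """Attack at most `blast_radius` of the fleet; abort if steady state breaks.

    `attack(instance)` injects the failure, `check_steady_state()` returns
    True while the business metric stays inside its expected band, and
    `rollback(instance)` undoes the injection (all supplied by the caller).
    """
    if not instances:
        return
    max_targets = max(1, int(len(instances) * blast_radius))
    targets = random.sample(instances, max_targets)
    touched = []
    try:
        for instance in targets:
            attack(instance)
            touched.append(instance)
            if not check_steady_state():
                print("steady state broken, aborting experiment")
                break
    finally:
        # Always clean up, even if the experiment aborted half-way.
        if rollback:
            for instance in touched:
                rollback(instance)
```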
From a running perspective, I think you mentioned earlier,
you have your Python-based library that can run some experiments.
Did Netflix also release their Chaos Monkey libraries to the world?
Yeah, I mean, the Chaos Monkey was one of the first tools ever released
to do chaos engineering.
That's now part of a continuous delivery tool called Spinnaker, together with the rest of what they call the Simian Army.
And that's, you know, there's a bunch of monkeys.
There's the Chaos Monkey, which is the original one,
which kills randomly instances.
There's Chaos Gorilla,
which impairs an entire availability zone in AWS.
And then there's the big gorilla, the big Kong, Chaos Kong, which shuts down an entire AWS region.
And they practice that in production,
you know, maybe once a month or something like this
while people are watching Netflix.
Just to make sure that their initial design is still valid, right?
Yeah.
I wonder what would happen if AWS shut down an entire region.
Let's not find out.
You'd be surprised, actually.
I mean, we do chaos engineering as well. I mean, we started very early on, but, you know, we don't do chaos engineering on paying customers. We do a different level of chaos engineering, right? And Prime Video does that in production, actually, and I'm writing a blog post about that, so it should be out maybe in a month or two, something like this.
Cool.
Hey, and then I think you mentioned Gremlin earlier, right?
One of the agent-based companies.
So that's also cool.
Do you have any insights into what they are particularly doing?
I mean, as I said, Gremlin is part of the agent-based kind of chaos tool.
So they do a lot of cool things
on the per-instance level.
So they can, you know,
corrupt an instance
by making a CPU run wild.
So 100% CPU utilization
and see how your application reacts.
They can take the memory out.
They can add a big file on the drive to make it run out of disk space.
And there's a lot of different things, state attacks as well,
like make it terminate or restart or reboot, pause, all these kind of things.
And the UI is beautiful to use.
That's what is cool.
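For the "make the CPU run wild" style of attack Adrian mentions, here is a hedged, generic sketch (not Gremlin's implementation): spin one busy-looping worker per core for a fixed duration and watch how the application responds.

```python
import multiprocessing
import time


def _burn(stop_at):
    # Busy loop until the deadline: keeps one core at roughly 100% utilisation.
    while time.time() < stop_at:
        pass


def cpu_attack(duration_seconds=60, cores=None):
    """Pin `cores` CPUs (default: all of them) at 100% for the duration."""
    cores = cores or multiprocessing.cpu_count()
    stop_at = time.time() + duration_seconds
    workers = [multiprocessing.Process(target=_burn, args=(stop_at,))
               for _ in range(cores)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


if __name__ == "__main__":
    cpu_attack(duration_seconds=30)   # small and time-boxed by design
```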
There's another very good framework that I really like
is Chaos Toolkit.
And that's more on the API level.
So it's kind of a framework, an open framework, that you can build extensions with.
So there's an AWS extension that kind of wraps the AWS CLI.
And then you can also do API queries to get the steady state, do some actions, probe the system and things like this.
And the whole template for the experiment is written in JSON.
And then you can integrate that in your CI/CD pipeline as well. So, I mean, and actually the Chaos Toolkit does integrate with Gremlin as well.
So, I mean, all the tools are really working together in a way.
And I think helping people
to make better chaos experiments.
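To give a feel for the JSON experiment template Adrian mentions, here is a rough sketch written as a Python dict that mirrors the Chaos Toolkit format (title, steady-state hypothesis with probes, method with actions, rollbacks). The concrete URLs, module paths, and tolerances are placeholder assumptions; check the Chaos Toolkit documentation for the exact schema of any extension you use.

```python
import json

# A Chaos Toolkit-style experiment sketched as a Python dict; dump it to
# JSON and it resembles the templates described above. Values below are
# placeholders, not a tested experiment.
experiment = {
    "title": "Orders survive losing one instance",
    "description": "Hypothesis: killing a single instance does not break checkout.",
    "steady-state-hypothesis": {
        "title": "Checkout responds",
        "probes": [
            {
                "type": "probe",
                "name": "checkout-endpoint-is-healthy",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    "url": "https://shop.example.com/health",
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "terminate-one-instance",
            "provider": {
                "type": "python",
                "module": "chaos_lib.instances",   # hypothetical module
                "func": "terminate_random_instance",
                "arguments": {"tag": "checkout", "count": 1},
            },
        }
    ],
    "rollbacks": [],
}

with open("experiment.json", "w") as fh:
    json.dump(experiment, fh, indent=2)
```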
Perfect. Cool.
I think it's always interesting to...
The reason I was asking,
people that want to go into chaos engineering,
they probably also want to figure out
how do I inflict chaos?
Are there tools out there?
Are there frameworks out there?
And that's why I wanted to ask the question.
On the proxy level, there is a very beautiful tool called Toxiproxy
that has been released by Shopify.
And that's like a proxy-based kind of chaos tool,
which you can put that proxy between your application
and, for example, a database or Redis or Memcached, and it injects some of what they call toxics: adding some latency, or doing some errors, dropping packets, let's say, 40% of the time, and things
like this.
It's very interesting.
And then, of course, you have the old-school Linux tools, right? WRT, or of course corrupting the iptables as well, and things
like this.
Or the very old fashioned pull the plug.
Yeah, yeah. That's actually how, let's say, Jesse Robbins famously did that at Amazon retail in the early 2000s. He would walk around data centers and pull plugs from servers, and even pull the plug on entire data centers.
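In the spirit of the proxy tricks and old-school Linux tools mentioned above, here is a hedged sketch of adding and removing network latency with tc/netem, driven from Python. It assumes a Linux host with the iproute2 tools installed, root privileges, and an interface named eth0; run something like this only on disposable test machines.

```python
import subprocess

IFACE = "eth0"   # assumption: adjust to the interface under test


def add_latency(delay_ms=200, jitter_ms=50):
    """Delay all egress packets on IFACE using the netem queueing discipline."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{delay_ms}ms", f"{jitter_ms}ms"],
        check=True,
    )


def clear_latency():
    """Remove the netem qdisc, restoring normal networking."""
    subprocess.run(
        ["tc", "qdisc", "del", "dev", IFACE, "root", "netem"],
        check=True,
    )


if __name__ == "__main__":
    add_latency()
    try:
        input("Latency injected; press Enter to restore the network...")
    finally:
        clear_latency()
```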
One question about these tools then,
because the next part of this is all verifying, right?
And you mentioned some metrics
and these aren't host or machine type of metrics; these are human metrics, things like time to detect, time to notification. Do any of these tools that you mentioned have the ability to either pick up on, like, maybe time to notification, or allow for human entry to say, okay, we detected it at X time? Basically, do they track these reaction metrics that you talk about?
Do you know if any of those tools have anything built in or is that something you'd be running
on, you know, keeping track of on your own?
And then let's talk about what those are as well.
Right.
So if you look at the chaos toolkit, since it's API based,
you can query kind of all your system if it supports that to kind of add in the chaos
engineering experiment report. So every time you run an experiment, it will print a report that you can then analyze. As for the Gremlin, of course, it's more agent-based.
So there's no kind of complete report like this.
But I mean, neither Chaos Toolkit nor Gremlin type of things
have a full reporting that would satisfy
let's say, the most careful team.
If you want to write a COE, you're going to have
to do a lot of human work
in figuring out all these
different things like time to detect,
notification, escalation, public notification,
self-healing, recovery until all clear and stable, right?
So there's nothing yet that really covers everything.
So there's definitely space for a competitor
if you want to build your own company, right?
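Since, as noted, no tool tracks these "human" reaction metrics for you, here is a minimal sketch of recording them by hand during a game day and turning the timestamps into the durations a COE-style write-up needs. The field names are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class GameDayTimeline:
    """Timestamps recorded by a human scribe during a chaos experiment."""
    injected_at: datetime
    detected_at: Optional[datetime] = None      # first alert fired
    notified_at: Optional[datetime] = None      # on-call actually paged
    escalated_at: Optional[datetime] = None
    recovered_at: Optional[datetime] = None     # steady state restored
    all_clear_at: Optional[datetime] = None

    def durations_seconds(self) -> Dict[str, float]:
        """Time-to-X metrics, relative to the moment chaos was injected."""
        out = {}
        for name in ("detected_at", "notified_at", "escalated_at",
                     "recovered_at", "all_clear_at"):
            stamp = getattr(self, name)
            if stamp is not None:
                key = "time_to_" + name.replace("_at", "")
                out[key] = (stamp - self.injected_at).total_seconds()
        return out
```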
I have an idea.
I actually just thought of,
obviously on the Dynatrace side,
we do the automated problem detection,
and we have APIs where external tools can feed events to Dynatrace.
So, for instance, if you set up the hypothesis,
you could tell Dynatrace about the hypothesis,
meaning you can set thresholds on your business metrics,
like order rate or, I don't know, conversion rates.
And you can then also tell Dynatrace,
I don't know, Friday, six o'clock,
we're starting the chaos run.
And in case Dynatrace detects a problem
based on your metric going out of the norm,
it would immediately then open up a problem
and would collect all the evidence,
including the event that you sent.
So I think in a way,
we could actually measure a lot of these things.
And by integrating it with these chaos tools,
we would even allow you to automatically
set up your hypotheses
and tell Dynatrace about when you started the test run,
when this test run was ended,
and then Dynatrace can tell you
when was the problem detected,
when were the notifications sent out,
and when was the problem gone,
when did the problem go away.
So I think we need to...
It's a great idea.
Actually, the Chaos Toolkit supports extensions currently, for OpenTracing.
So if it's something that is going to be used at Dynatrace,
it's definitely something you should look at.
It supports Prometheus probes and Monarch, CLI, InstaNOW.
There's a bunch of extensions.
So I think you guys should definitely write an extension for the Chaos Toolkit to actually send the, what's it called, the Chaos Toolkit report to Dynatrace, to add visibility to the whole thing. Definitely, it's a very good idea.
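As a rough sketch of the integration Andy describes, here is what pushing a "chaos experiment started" annotation into Dynatrace could look like. It assumes the Events API v2 ingest endpoint and an API token with event-ingest permission; treat the endpoint, payload fields, and entity selector as assumptions to verify against your Dynatrace version's documentation.

```python
import requests

DT_BASE_URL = "https://YOUR-ENVIRONMENT.live.dynatrace.com"   # placeholder
DT_API_TOKEN = "dt0c01.REPLACE_ME"                            # placeholder


def annotate_chaos_run(title, hypothesis, entity_selector):
    """Send a custom annotation marking the start of a chaos experiment."""
    payload = {
        "eventType": "CUSTOM_ANNOTATION",
        "title": title,
        "entitySelector": entity_selector,        # e.g. a tagged service
        "properties": {
            "hypothesis": hypothesis,
            "source": "chaos-experiment",
        },
    }
    response = requests.post(
        f"{DT_BASE_URL}/api/v2/events/ingest",
        headers={"Authorization": f"Api-Token {DT_API_TOKEN}"},
        json=payload,
        timeout=5,
    )
    response.raise_for_status()
    return response.json()


# Hypothetical usage:
# annotate_chaos_run(
#     title="Chaos run: kill one checkout instance",
#     hypothesis="Order rate stays within 10% of steady state",
#     entity_selector='type(SERVICE),tag("app:checkout")',
# )
```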
Yeah. Cool. So learn and verify, um,
what else is there to know? Learn, verify, and then I guess also optimize and learn and fix things, right?
Well, you know, I think the big part of verifying is also the postmortem, right?
I think you should always go through the postmortem part.
And in my opinion, it's one of the interesting things, because you're going to deep dive on the reasons for the failures, right, if there are failures. So if your chaos experiment is successful, then good for you, right? You should also write about this. But if it's unsuccessful, and if you've created or resurfaced failures that you didn't think about, that's the post-mortem, and then you have to deep dive really, really well on the topic
and figuring out what happened, the timeline of it,
what was the impact, and why did the failure happen.
This is the moment where you kind of want to go to the root cause.
I know root cause is something that is very, very difficult to get
because in a failure,
it's never one reason.
It's always a collection of small reasons that get together to create this kind
of big failure, but trying to find as many reasons as possible at different layers, different
levels.
And then, of course, what we learned, right?
And how do you prevent that from happening in the future?
So how are you going to fix it?
And those are really hard to answer.
They sound easy when I talk about it,
but it's very, very difficult to answer carefully all that stuff.
In your blog post, you talk a lot about,
I think you start the blog post actually with comparing chaos engineering with firefighters
because I think Jesse Robbins, he was actually a firefighter,
if I kind of remember this correctly,
and he brought Game Day to Amazon, right?
Yes.
And looking at this, so firefighters, I think you said something like 600 hours of training, and in general 80% of that time is always training, training, training. Is there, and this might be a strange thought now, but is there a training facility for chaos, to actually learn chaos engineering? Is there a demo environment where you have, let's say, a reference architecture of, let's say, a web shop running, and then you can play around with chaos engineering?
Yeah, I think that's called having kids.
But you're totally right. I mean, one very, very good point of chaos engineering, even when not done in production, or in production,
I mean, in any case,
is the fact that you can actually practice recovering from failures, right?
So you inject failure in your system,
and then you let the team handle it the same way as it would be an outage, right?
So you will, you know, practice and practice and develop what the firefighters want to develop,
like the intuition for understanding errors and behaviors and failures in general.
Like, you know, if you see, for example, an NGINX kind of CPU consumption curve, or a concurrent connections curve, and all of a sudden it becomes flat, you build the intuition for what it might be.
You know, definitely the first thing to look at is the Linux security configuration on your instance. You know, max connections is probably too low, all these kinds of things.
And you can't build that intuition if you've never kind of debugged the system
or tried to recover from an outage,
because those are extreme conditions.
And it's exactly that, right?
It's practice, build intuition,
and then make those failures come out.
Very cool.
Is there anything
we missed in the phases? I think we are at fixing it.
Wow.
Who wants to do that?
You've done all the fun right now,
so you have to fix it. And this is, in my opinion,
I'll say something very important here.
Unfortunately, I see a lot of
companies doing
chaos engineering and very brilliant COEs or corrections of error post-mortems, but
then the management don't give them the time to actually fix the problem.
So I actually was with one of these companies a few months ago, and two weeks after the
chaos experiment, which surfaced some
big issues in the infrastructure, they had this real outage in production and were down 16 hours, right? So they could have fixed it before, but they didn't. They didn't stop the features, or they didn't prioritize that, and eventually they paid a bigger price, right? So it's very important to
not just do those chaos experiments, but actually to get the management buy-in
and make sure that when you have something serious,
stop everything else and just fix it.
And that's super important.
But let me ask you,
why fix it if you can just reboot the machine
and make it work again?
Exactly.
You've been using the JVM for some time, right?
Yeah.
It's like reboot Fridays.
Is that what you have as well?
Yeah.
I imagine.
So besides obviously fixing it, as Andy was going on, was there anything, obviously we
want to fix it.
Beyond that, again, I can't recommend enough to read the blog by everybody,
but is there anything
that we didn't cover
that you want to make sure?
Yeah, just the side effects.
I think, you know,
chaos engineering is great
at uncovering failures,
but actually I think
the side effects
on the companies
are even more interesting
and they are mostly cultural.
Like the fact that companies
that start to do chaos
engineering eventually, when successful (I've seen non-successful ones, but most of them are successful), move to what I like to call this kind of non-blaming culture, and move away from pointing fingers to, you know, how do we fix that? And yeah, I think that's a beautiful place to be, right?
For developers, for owners as well.
Because it's a culture that accepts failure,
embraces failures, and kind of wants to fix things
instead of blaming people.
And that's also how we work at AWS and Amazon.
And I really like that.
You know, our COEs and post-mortems are kind of blame-free,
and that's one very, very important part of writing that postmortem.
And, you know, I think it's great, because if you point at someone that is making a mistake, eventually it will come back to you, right?
You will suffer a mistake and be blamed.
And that's never a good place to be.
Yeah.
I think another good side effect too is going back to the hypothesis phase
where people will start thinking probably about what they're going to do
in terms of the hypothesis in mind.
So before they actually implement something,
they'll probably start thinking
more about what its effect might be.
Yeah, exactly. Yeah. You think more about the overall system versus just the part
that you build, right? I think that's an important thing. And of course we didn't mention, but
a very, very good side effect is sleep, right?
You get more sleep because you fix outages
before they happen in production.
So you get a lot more sleep.
Awesome.
Hey, Andy, would you like to summon the Summonerator?
I would love if you summon the Summonerator.
Do it now.
All right.
You've rehearsed that one, right?
No, so, well, Adrian, thanks again for yet another great educational session on a topic.
Thank you.
It's, you know, it's a topic, chaos engineering, that I think is still kind of in its infancy stage when it comes to broader adoption and everybody really understanding what it is.
I really like a quote that I think you took from Adrian Cockcroft, who said,
Chaos engineering is an experiment to ensure that the impact of failures is mitigated, which is also a great way of explaining it.
I really encourage everyone out there, read Adrian's blog. The five phases for me are,
I mean, I think the first two for me are amazing because steady state means you first of all need to work on an architecture that is ready for chaos
and you need to know what the steady state actually is and have a system that is steady. But then I really like your, kind of, I call it experiments now, experiments with the people to figure out what they think should happen when a certain condition comes in, like working on your hypothesis, because you can find and fix a lot of problems already before you really run your chaos engineering experiments. So thanks, thanks for that insight.
And I really hope that you will come back on another show because I'm sure
there's a lot of more things
you can tell us.
I'd love to.
Thanks a lot, Andy and Brian,
for having me once more on the show.
I really enjoyed it.
It was an absolute pleasure.
I would just like to add one thing as well.
For anybody who's listening
who might be in the performance or load side of things,
Andy and I have talked many times,
especially early on in the podcast,
about the idea of leveling up.
I'm listening to this bit about chaos engineering
and all I keep thinking of,
wow, what a great place to level up to.
Or not even level up to.
Let's say you're like,
ah, I'm kind of done what I feel I can do
in the load environment or that whole field.
This is kind of, I feel, a continuation of it. It really boils down
to hypothesis, experiment, analysis. And obviously you have the fixing at the end, which is very much
the same as, you know, doing your performance and load testing. So, and it's a brand new, you know,
chaos engineering is not very widespread yet. Obviously the big players are doing it.
So there's a lot of opportunity out there
to get involved with that.
So definitely if I were still on the other side of the fence,
not being a pre-sales engineer,
I would probably start looking into this a bit more.
Yeah, it's a lovely place.
I mean, people are amazing.
They love sharing.
So I highly recommend everyone
to get involved in the space.
Yeah. Sounds like it's just a sidestep over to a new world of experimentation.
Anyway, Adrian, thanks again, as Andy said.
Thank you.
Welcome back anytime. Anytime you got something new, please come on. It's great stuff you have
here. We will put up your
information. We're going to put a link to this article. So everyone go check out Spreaker
slash Pure Performance or Dynatrace.com slash Pure Performance. You can get the links
to the article. If anybody has any questions, comments, you can tweet them at Pure underscore
DT, or you can send an old-fashioned email to pureperformance@dynatrace.com. Thank you all for listening. Adrian, thank you once again for being on. Andy.
Thank you so much guys. Ciao. Ciao. Andy. Bye. Bye. Thanks. Bye.