PurePerformance - Encore - How to build distributed resilient systems with Adrian Hornsby
Episode Date: August 3, 2020

Adrian Hornsby (@adhorn) has dedicated his last years helping enterprises around the world to build resilient systems. He wrote a great blog series titled "Patterns for Resilient Architectures" and has given numerous talks about this, such as Resiliency and Availability Design Patterns for the Cloud at DevOne in Linz earlier this year. Listen in and learn more about why resiliency starts with humans, why we need to version everything we do, why default timeouts have to be flagged, how to deal with retries and backoffs, and why every distributed architect has to start designing systems that provide different service levels depending on the overall system health state.

Links:
Adrian on Twitter: https://twitter.com/adhorn
Medium Blog Post: https://medium.com/@adhorn/patterns-for-resilient-architecture-part-1-d3b60cd8d2b6
Adrian's DevOne talk: https://www.youtube.com/watch?v=mLg13UmEXlw
DevOne Intro video: https://www.youtube.com/watch?v=MXXTyTc3SPU
Transcript
Hello everybody out there in Pure Performance land. Due to summer vacations and many other
factors, Andy and I do not have a new show for you this week. Never fear though, because we've
already scheduled some new recordings. We'll have them out to you as soon as possible.
In the meantime, we hope you enjoy this encore presentation of
How to Build Distributed Resilient Systems with Adrian Hornsby from August 19th, 2019.
Andy sounds really funny in this one, so enjoy.
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello, everybody, and welcome to another episode of Pure Performance.
My name is Brian Wilson. Bet you didn't know that.
And as always, my guest is, can you guess, Andy Grabner.
Hey, Andy, how are you doing today?
Good, good. Hi, Brian. Thanks for, well, great to be back on actually doing some recordings. I know the audience wouldn't really know. They have no idea.
Yeah, they have no idea. Yeah. They're clueless. They think we do it always on like on the two
week schedule. But it's been, you've been traveling, right? Yes, I went back to New Jersey,
visited some friends and family and got to go to the beach or as we call it, the shore in New Jersey, shore to please. Also found out, here's
a stupid little tidbit. So after, I don't know if you recall Hurricane Sandy, probably a lot of our
listeners overseas never heard of it, but it was this huge hurricane that really did a lot of damage
to the whole eastern coast of the United States. But after that, you know, New Jersey was trying
to rebuild along the coast, and they came up with this saying, Jersey Strong. Like, you know, there's a lot of guidos in New Jersey, like, yeah, Jersey Strong, we're gonna come back strong. And, all right, and, uh, turns out somebody opened a gym, like a workout gym, called Jersey Strong as well. So now it's confusing, because if someone has a Jersey Strong bumper sticker, you don't know if they're taking pride in rebuilding New Jersey or if they're like, yeah, I go to this gym and lift weights, bro.
So that was our – that's my tidbit.
That's what I learned on my trip to New Jersey.
And you've been running around a lot too, haven't you, Andy?
I believe so, yeah.
Well, not running, flying around Europe.
I did a tour through Poland, a couple of meetups, a couple of
conferences, also visited
our lab up in Gdansk,
which was the hottest weekend
of the year with 36 degrees
on the Baltic Sea,
which is kind of warm,
I would say, but fortunately
the water was at least
refreshing still.
And yeah, but we're back now.
And we have...
Sounds like there's been a lot of chaos.
Exactly.
And actually, I think distance-wise,
Gdansk in Poland, in the north of Poland,
is probably not that far away from the hometown of our guest today.
And with this, I actually want to toss it over to Adrian
and he should tell us how the summer is up there in Helsinki.
Hi guys! Hi Andy, hi Brian, how are you?
Good, how are you?
Thanks a lot for having me back. Well, the weather in Finland, since that was the first chaotic question, is indeed chaotic. Last year we had the hottest summer in maybe 50 years, and now we had one of the coldest summers in 50 years. It's pretty weird. Today is nice, actually.
So you did not get the extreme heat wave all the way up there, huh? That's interesting.
No heat waves here, uh, 25 degrees.
Okay, well, yeah, we got, I think we got 40 here in Austria and, uh, 35 up there in the north of Poland. Crazy, crazy stuff.
Hey Adrian, I'm gonna try a quick conversion here, hold on. 95 Fahrenheit to Celsius... man, when are we finally... let's see, so it's about 30, it's going up to about 35 like all week here.
Yeah, wow, it's hot. It's hot, yeah.
Yeah, when are we gonna finally get on to the... yeah, we're never gonna do that, that's socialism. Yeah, I know.
Adrian, coming to chaos, we talked about chaos, Brian mentioned it, I mentioned it. So today's topic, after having you on the first call,
it was really cool to hear about best practices of building modern architectures.
I really love this stuff.
And I used it again, actually, on my trip to Poland.
I did a presentation in Warsaw.
And I gave you credit on the slides that I repurposed for retries, timeouts,
back off, jitter. This was really well received. Awesome. It's great to hear. Yeah, thanks again
for that. It makes me, it's great when you get on stage and people think you're smart
because you told them something new, but it's also great to admit that all the smartness comes from other people
that are all sharing the same spirit of sharing.
Sharing is caring.
So thanks for that.
Thanks for allowing me to share your content.
Yeah, I don't want to take the credit on that either
because I learned it from someone else,
you know, or books or articles.
At the end of the day,
it's all about sharing, as you said,
and teach what we had
problems with six months ago, right? Yeah.
Hey, so today's topic is chaos engineering, and I will definitely post a link to your Medium blog post. I think part one is out and part two is coming. But it's a really great introduction into chaos engineering. And to be quite honest with you, I haven't used the term chaos engineering until recently.
I always just said chaos monkeys or chaos testing.
And I think because I was just so influenced by the first time I learned about introducing chaos,
which was, I believe, Netflix, at least when I read about it. Netflix was the ones that came up with the chaos monkeys, is this correct?
Yes, correct, yeah. You know, it's all part of the resiliency engineering kind of field, right? So I think the term chaos monkeys was really made popular by Netflix and their awesome technology blog that told the story about how they developed those monkeys.
Hey, and so, I mean, again, everybody should read the blog post.
It's amazing the background you give.
But maybe in your words, when you're introducing somebody into chaos engineering, when you are teaching at conferences or just visiting customers and clients and enterprises,
how do you explain chaos monkey or chaos engineering?
Let's call it chaos engineering. How do you explain the benefit of chaos engineering and what it actually is and why people should do it?
That's a very good question, and I'll try to make it short, because I think we could spend hours trying to explain just the discipline of chaos engineering. But in short, what I say is chaos engineering is a sort of scientific experiment where we try to identify failures that we didn't think of, and especially figure out how to mitigate them.
It's very good at taking the unknown unknown out of the equation.
I think traditionally we've used tests to try to test and make sure our code is resilient to known conditions.
Chaos engineering is kind of a different approach.
It's a discipline that really tries to take things that we might not have thought about out of the darkness and bring them out so that we can learn how to mitigate those failures before they happen in production.
And examples: I think in your blog post, you also had this graphic in there where you're talking about different levels or different layers of the architecture where you can basically introduce chaos, right?
So infrastructure is obviously clear.
What else?
Yeah, I mean, you know, it's very similar to resiliency, right?
I think chaos can be introduced at many different layers
from the obvious infrastructure level,
like removing, for example, an instance or a server
and figuring out if the overall system is able to recover from it,
from the network and data layer.
So removing, let's say, a network connection
or degrading the network connection,
adding some latency on the network,
dropping packets or making a black hole for a particular protocol.
That's on the network and data level.
On the application level, you can just simply add the exception,
make the application
throw some exceptions
randomly inside the code and see
how the outside world
or the rest of the system is behaving
around it. But you can also
apply it at the people level.
And I mean, I talk a little
bit about it in the blog.
The first experiment I always do is
trying to identify
the technically sound people in teams, or the kind of semi-gurus,
as I like to call them, and take them out of the equation
and send them home without the laptops and see how the company is behaving.
Because very often you'll realize that information is not spread equally
around team members.
And, well, I call that the bus factor.
If that very technically sound person is not at work or gets hit by a bus,
well, how do we recover from failures?
I like that idea.
I mean, and Brian, I know we always try to joke a little bit, but I think if Adrian would ever come to us, he would definitely not send the two of us home, because there are more technically sound people in our organization.
It's not necessarily the technically sound people, you know. It's very often the connector, it's the person
that knows everyone inside the organization
or in the company that
can act very fast, that
has basically a lot
of history inside the company and
kind of recalls
or is able to recall every
problem that happened
and how they recovered from
it. It's kind of a
walking encyclopedia.
Yeah.
But Andy, I think you can attest for the fact
that I was just on vacation for about a week and a half
and the whole company fell apart, right?
So I am really, really important to the organization.
Before we dive into a lot of this though,
one thing I saw on your blog
that I think is really, really, really important
for people to understand
before even entertaining the idea of chaos testing or
chaos engineering is you can't just start with chaos engineering you have to have uh prerequisites
in your environment you talked about um you know all your prerequisites to chaos building a
resilient system resilient software resilient network so you really to, and correct me if I'm wrong,
but what I was reading it, before you start the chaos engineering part, you have to build in as
much resiliency into all layers of your system as you can possibly think of first. Then you start
seeing what did we miss, right? You don't just say, hey, we just stood up our application in
our infrastructure. Now we're going to throw something at it because of course it's going
to fail and it's going to be catastrophic. So you really want to start from a good place,
right? Exactly. And that's the whole point of chaos engineering, right? It's really,
I would say, an experiment, a scientific experiment where we, you know, test a hypothesis, right? For example, we believe we've built a resilient system, and we spend a lot of time, we say, okay, our system is resilient to database failures. So let's make a hypothesis and say, you know, what happens if my database goes down? And then you take the entire organization, the entire team, from the product owner to the designers to the software developers, and you ask them actually what they think is going to happen.
Very often I ask them not to talk about it,
but to write it on the paper,
just to avoid the kind of mutual agreements,
you know, like everyone comes with a consensus
by talking about it.
So if you write it on the paper in private,
you realize everyone in the team has often very different ideas
of what should happen if a database goes down.
And a good thing to do at that moment is to stop and ask,
how is that possible that we have different understanding
of our specifications?
So then you go back into the specification, because very often people fail to read it carefully, or into improving it. But once you have everyone in the team having a really good understanding of what should happen when the database goes down, then you make an experiment and you
actually want to verify that through that experiment, with all the measures you're doing around it: from,
you know, making sure that the system goes back to the initial state,
we call that the steady state of the application.
And also that it does that in the appropriate time, right?
In the time you thought about.
And you have to verify all this.
Yeah, and I think there's a lot to unravel there.
But even before that, right?
Even before you run that first experiment, um, are there, and Andy, I don't know if you've maybe
seen any or Adrian, are there some best practices? Like I know there's the concept of doing multi
zones. Let's say if we're talking about AWS, you can set up your application in multiple zones.
So this way, if one zone has an issue, you still have your two, like the rule of threes.
Is there, way back, if we go way back in time, Steve Souders wrote the best practices for web design, web performance. Is there, and I know you've written some articles and all,
but is there a good checklist for someone starting out or an organization before they
start these experiments to say, here are some of the starting points that you should set up before you even entertain
your next set of hypotheses, right?
Because obviously if you don't do this basic set, you're going to have catastrophic failures.
Is there something written out yet?
Or is it mostly just things we've been talking about and word of mouth and people's blogs?
Well, it's a very good question, actually. I'll take a note and make sure I can build a list. Actually, that's a very good idea.
But I would say I think the 12
factor app kind of
patterns are
a very good place to start with.
I did put some
lists around and I talk
about resilient architectures
in different blogs.
I think it's really software engineering basics: timeouts, retries, exception handling, these kinds of things.
Make sure that you have redundancy built in, that your system is self-healing
because at the end of the day, you don't want humans to recover.
You want actually the system to recover automatically.
So that's something that is usually not well thought through, or not at every level.
I always say, so if you don't do infrastructure as code and automatic deployment,
maybe don't do chaos or actually don't do chaos because you're going to have big problems.
Your system should
be automatically
deployed
and managed and self-healed
in a way.
Then, of course, on the operation level, you should
have the full observability
and complete monitoring of your system.
If you have
no visibility on what's happening in your application,
there's no way you can conduct a sound experiment
or even verify that that experiment first has been successful
or has negatively impacted the running system, right?
So measure, measure, measure as much as you can.
And of course, you need to have an incident response team, in a way, or practice, so that you know what you should do as soon as an alert comes in, treat the tickets, and be able to have this full incident response.
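Adrian's list of basics (timeouts, retries, exception handling) maps to the retry-with-backoff-and-jitter pattern Andy mentioned earlier in the episode. A minimal Python sketch of that pattern, assuming a hypothetical dependency call and purely illustrative numbers:

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0, timeout=2.0):
    """Call a flaky dependency with a timeout, exponential backoff and full jitter.

    `call` is any function accepting a `timeout` keyword; all constants here
    are illustrative defaults, not recommendations.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call(timeout=timeout)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and let the caller's exception handling take over
            # exponential backoff capped at max_delay, with "full jitter"
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```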
Yep.
Right.
So, and I know Andy is going to want to dive into all this, the phases.
So let's go on.
You mentioned kind of an overview of the experiment, the observations,
the hypothesis and all that.
So Andy, I know you probably have 20,000 questions and things to talk about.
So let's dive into that now.
I mean, again, before I actually dive into,
because you just mentioned in the end, monitoring incident response.
I think in your blog post, you also mentioned that when you are in one of your cases, you inflicted chaos.
And you actually saw that the incident response or the alerting was also impacted by the chaos.
For instance, not being able to, let's say, send Slack messages or send emails or stuff like that. I think there's such a huge variety of chaos there that our hypotheses need to test against.
Yeah, it's very common to build systems that kind of host or power our own response systems. And you see actually, I mean,
there's been really a recent outage
and with Cloudflare, for example,
and they were not able to use the internal tool easily
simply because the tools were too secure.
Kind of, the people hadn't used their maintenance tools long enough or recently enough.
So the system had just removed the credentials.
And when you're in a panic situation,
having systems like this that actually are impacted by your own behavior is hard.
DNS is another one.
Very often, if DNS goes down,
your own tools are not accessible
because they use or might use DNS.
So all these kind of things like this
are usually very important to test as well
while doing the experiment.
That's why you need to start from
or simulate the real outage situation.
So now I got it.
Before going into the phases, because I know you have in the blog post,
you have like five phases that you talk about.
But one question that came up while you were earlier talking about application exceptions,
because Brian and I, we've been kind of live and breathe applications
because we've been, you know, with Dynatrace, especially we monitor applications and services,
but obviously the full stack, but we are very, I think, knowledgeable on applications.
And when you said earlier chaos in applications, like you can, you know, just throw exceptions
that you would normally not throw and see how the application behaves.
Do you then imply there are tools that would, for instance,
use dynamic code injection to force exceptions?
Is that the way it works?
Or do you just, I don't know, change configuration parameters
of the application that you know it will result in exceptions?
There are several types of tools. There are libraries that you can directly use to inject failures. In Python, very often it's using a decorator function that you wrap around the function, and that intercepts the function and then throws some exception.
Actually, I'm building my chaos library toolkit
around using that concept in Python
to inject failures in Lambda functions in Python.
It's being used as well for JavaScript, similar technique.
There are also techniques to have a proxy as well.
So you have proxies between two different systems, and then the proxy kind of hijacks the connection, kind of a man in the middle in a way, and kind of alters the connection: it can add latency, can throw some exceptions, can inject, you know, packet drops or things like this. So there's a lot of different techniques.
There's also one very common technique that Gremlin is using, which is agent-based.
So they have an agent running on an instance or a Docker container that can inject failure locally, all these kind of things.
So they can intercept process level type of things
and just throw exceptions or make it burst or things like this.
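A minimal sketch of the decorator-style injection Adrian describes for Python; this is not his chaos library, just an illustration, and the wrapped business function is hypothetical:

```python
import functools
import random
import time


def inject_failure(exception=RuntimeError("injected failure"), rate=0.2, delay_seconds=0.0):
    """Decorator that randomly raises an exception and/or adds latency.

    `rate` is the probability of injecting the failure on each call;
    the values used here are purely illustrative.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                if delay_seconds:
                    time.sleep(delay_seconds)  # simulate a slow dependency
                raise exception                # simulate a failing dependency
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_failure(rate=0.1)
def get_recommendations(user_id):
    # hypothetical business function wrapped by the chaos decorator
    return ["item-1", "item-2"]
```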
Cool.
So moving on to the phases, if you kind of look at your blog posts,
we already talked about the steady state.
Steady state, basically, I think for me, what it means is: first of all, figure out, is this situation now normal or caused by chaos?
I guess steady state really means a system that is stable
and you know what that state is, right?
Yeah, and it's predictable, as you said.
I think a big mistake people make is usually they use system attribute metrics, the type of CPU, memory or things like this,
and look at this as a way to measure the health
and the sanity of an application.
Actually, a steady state should have nothing to do with this,
or at least not entirely,
and should be a mix of operational metrics and customer experience.
I write in the blog about the way Amazon does that is the number of orders.
And you can easily imagine that CPU on an instance has actually no impact on the number of orders at big scale.
And similarly for Netflix, they use the number of times people press on the play button
globally to define their steady state. They call that the pulse of Netflix, which I think is
beautiful because it's a relation between the customer and the system. If you press several
times, then of course it means the system is not responding the way it should. So it's experiencing
some issues. And similarly, if you can't place an order on Amazon retail page,
it means the system is not working as it's designed.
So it's a very good kind of a steady state.
But it's important to work on that.
It's not easy actually to define it.
And I see a lot of customers having a big trouble
first defining the steady states
or their steady states.
You can have several of them. But for me, I mean, the way big trouble first defining the steady states or their steady states. You can have several of them.
But for me, I mean, the way you explain it to me,
it's kind of like a business metric, right?
Like orders, the pulse.
I don't know if I would apply it to what we do at Dynatrace,
the number of trial signups
or the number of agents that send data to Dynatrace.
I have a hard time believing, but I'm sure you're right,
but I have a hard time believing that companies have a hard time
defining that steady state metric,
because if they don't know what their steady state metric is,
they don't even know whether their business is doing all right
or is currently impacted.
Is that really the case?
I guess you'd be surprised.
It's not that they don't know what the business is doing, that's something else; the business might be doing several things. And, you know, I think usually the business metrics or things like this are used, let's say, by higher-level monitoring that might go back to the managers or the C-levels, versus what we want is actually that kind of metric to go down to the engineering team in a very distilled way, very easily accessible.
And okay, now that makes a lot of sense.
And I have another question, though.
So you said CPU is typically not a metric
because it doesn't matter how many,
what CPU usage you have
as long as the system gets back to steady state.
But do you still include these metrics?
The reason why I'm asking is what if you are,
what if you are having a, let's say, a constant, like if you look at the orders, right?
If, let's say, 10,000 orders per second is coming in,
and you know you're using X of CPUs in a steady-state environment,
then you're bringing chaos into the system.
The system recovers, goes back to the 10,000 orders,
but all of a sudden you have, I don't know, twice as many CPUs.
Isn't that, shouldn't you still include at least some of the metrics
across the stack to validate not only that you are,
that the system itself is back in a steady state,
but the supporting infrastructure is at least kind of going back to normal?
Or is this something you would just not do at all because it is not the focus point?
No, you're totally right. You know, the steady state is kind of the one metric that is important, but you do need all the supporting metrics as well, right? When I say the steady state is the most important one, it doesn't mean you should not include the rest. On the contrary, you should have as many metrics as you can. It doesn't mean you should try
to overdo the things, right? But I think the most essential one, especially, I think, especially
for the cloud is, you know, if you say you have 10,000 orders and your infrastructure
to support 10,000 orders is the one that you are currently using,
you make a chaos experiment and then that infrastructure, or what's needed to support those 10,000 orders, has kind of doubled,
then of course you should raise a big alarm and make sure that this is looked at, of course. Cool.
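To make the steady-state discussion concrete, here is a rough sketch of such a check: a business metric (orders per minute) as the primary signal, plus the supporting infrastructure metric (instance count) Andy asked about. The metric-fetching callables and all thresholds are hypothetical placeholders for whatever monitoring you use:

```python
def check_steady_state(get_orders_per_minute, get_instance_count,
                       expected_orders=10_000, orders_tolerance=0.05,
                       expected_instances=40, instance_tolerance=0.25):
    """Return (ok, details): has the system returned to its steady state?

    The primary signal is the business metric (orders per minute); the
    instance count is a supporting metric that should not silently double
    after an experiment. All numbers here are illustrative.
    """
    orders = get_orders_per_minute()
    instances = get_instance_count()

    orders_ok = abs(orders - expected_orders) <= expected_orders * orders_tolerance
    instances_ok = instances <= expected_instances * (1 + instance_tolerance)

    return orders_ok and instances_ok, {
        "orders_per_minute": orders,
        "instance_count": instances,
        "orders_ok": orders_ok,
        "instances_ok": instances_ok,
    }
```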
All right, so steady state, the first phase of chaos engineering.
What comes next?
So after the steady states, we make the hypothesis.
Once we understand the system, then we need to make the hypothesis. And this is the what if, or rather the what when, because failures are going to happen, right? So what happens when the
recommendation engine stops, for example, or what happened when the database fails or things like
this. So if you're first timer in chaos engineering, yeah, definitely start with a
small hypothesis. Don't tackle the big problems right away.
Build your confidence, build your skills, and then grow from there. But yeah, this is endless possibilities, right?
So what I love to do is usually look at an architecture and kind of look at the critical
systems first.
So you, you know, you look at your kind of APIs and what are the critical systems for each of the APIs,
and then you tackle first the critical systems and see if these are really as resilient as
you expected, or if you can uncover some kind of failure mode that you didn't think of, and things like this. And then you can go into less critical dependencies, but usually the most important thing is to make sure that the critical components are tackled first.
This is also the case with the hypothesis if I just recount or repeat what you
said earlier.
This is also, was it, you explained, you do the exercise,
you go into a room and you bring everybody in the room and then you discuss the hypothesis and let everyone kind of write down what they believe is going to
happen, right?
Exactly.
And this is for me is one of the most important part of that exercise because you want to uncover uh different understandings and and making sure that
why do people have different understanding uh from an hypothesis and usually this will uncover
some problems already like a lack of specifications or a lack of communication or simply people have forgotten what's supposed
to be.
Yeah.
And also, if you think about this, you know, let's take the database as an example.
If you say, what happens if the database is gone?
And maybe one team says, well, my system that is relying on the database is just retrying
it later.
And then the other team that is responsible for the database
maybe said, well, but that's not the intended way.
We thought you could just go back to a, like, I don't know,
a static version of the data, blah, blah, blah.
I think that's, as you say, uncovering a lot of problems
that actually have nothing to do with technology,
but actually have to do with lack of communication
or lack of understanding of the system.
Yeah, and very often what I've noticed,
the big difference is usually between the design team
or UI teams and then the backend teams, right?
Many times the UI team will have not thought about this,
like what happens if your database goes down?
Well, for the technical
team, it might be obvious to move into a
read-only mode and say, okay, we move a request
to the read replicas
of the database, and
eventually, after
maybe a minute, we fail over
the new master. So, for one minute,
your application will be in a state of,
hey, there's no database.
So, what do you do?
Very often people in the UI world won't think about it, because it's not a case they were asked about. Read-only mode is quite weird, actually, to think of.
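A small sketch of the graceful degradation Adrian describes, falling back to a read replica in read-only mode while the primary fails over; the client objects and their methods are hypothetical:

```python
class ServiceDegraded(Exception):
    """Signal the UI layer to show a degraded (read-only) experience."""


class OrdersRepository:
    """Sketch only: `primary` and `replica` are hypothetical database clients
    that expose the same insert/fetch interface."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica
        self.read_only = False

    def place_order(self, order):
        if self.read_only:
            raise ServiceDegraded("orders are temporarily read-only")
        try:
            return self.primary.insert("orders", order)
        except ConnectionError:
            self.read_only = True  # flip to read-only until failover completes
            raise ServiceDegraded("orders are temporarily read-only")

    def get_order(self, order_id):
        # reads keep working from the replica while the primary is gone
        source = self.replica if self.read_only else self.primary
        return source.fetch("orders", order_id)
```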
Hey, Andy, I was also thinking,
when reading through these steps
and also sitting on the hypothesis
and the steady state piece, how this sounds like this would be a wonderful place for performance testers, engineers, whatever, to transition into.
But specifically with this hypothesis phase, I love the idea of making this not just for chaos, but for performance as well.
Let's say there's a new release being built, right?
Gather everybody together and say,
all right, we're going to run a certain load against this.
What do you all think is going to happen?
In a way to make them think,
well, wait, what did I just code and what might happen?
Again, like classic database issue
where suddenly we added four more database queries to a statement.
Well, hey, are you thinking about that? But just even if it's not for performance,
this idea of getting everybody together to say, what do you think is going to happen if X,
whether it be a chaos experiment or performance, some sort of a load test or something else to say,
what do you think is going to happen? I think that sort of communication with people
opens their mind to actually thinking about what they're doing
and what impacts they might have.
I think regardless if it's for chaos,
I think it's just a great practice for organizations to put together.
Yeah, exactly.
And if you think of it, I mean,
executing load or putting load on the system is a form of chaos.
Yes.
Anyway, I love this
hypothesis.
It's great.
So hypothesis is done, then
I guess you run the experiment, right?
Correct.
This is first designing and running
the experiment. And I think here
the most important thing is
blast radius, right? So you have
to
keep the customers in mind, right?
And of course, initially, if it's your first time, you might not want to do that in production and
really do this in test environment and make sure that you are able to control the blast radius
during your experiments. And this is very important to think about. It's like,
you know, how many customers might be affected, what functionality is impaired, or if you are
a multi-site, which location are going to be impacted by this experiment. So, you know,
this is very, very important to think about.
And then it's about running it and identifying also the metrics that you need to measure for that experiment.
So as you mentioned earlier,
you might want to make sure that you control the number of instances
that you need for a particular steady state
and make sure that you return to that same number after the experiment. And this is definitely part of the overall metrics
that we need to check for your application. From a running perspective, I think you mentioned
earlier, you have your Python-based library that can run some experiments.
Did Netflix also release their Chaos Monkey libraries to the world?
Yeah, I mean, the Chaos Monkey was one of the first tools ever released
to do chaos engineering.
That's now part of a continuous delivery tool called Spinnaker, together with the rest of what they call the Simian Army. And, you know, there's a bunch of monkeys. There's the Chaos Monkey, which is the original one, which randomly kills instances. There's the Chaos Gorilla, which impairs an entire availability zone in AWS. And then there's the big one, Chaos Kong, which shuts down an entire AWS region. And they practice that in production, you know, maybe once a month or something like this, while people are watching Netflix.
Just to make sure that their initial design is still valid, right?
Yeah.
I wonder what would happen if AWS shut down an entire region.
Let's not find out.
You'd be surprised, actually.
We do, I mean, We do chaos engineering as well.
I mean, we started very early on,
but we don't do chaos engineering on paying customers,
but we do a different level of chaos engineering.
And Prime Video does that in production, actually.
I'm writing a blog post about that, so it should be out maybe in a month or two, something like this.
Cool. Hey, and then I think you mentioned Gremlin earlier, right? One of the agent-based companies. So that's also cool. Do you have any insights into what they are particularly doing?
As I said, Gremlin is part of the agent-based chaos tools.
So they do a lot of cool things on the per-instance level.
So they can corrupt an instance by making a CPU run wild.
So 100% CPU utilization
and see how your application reacts.
They can take the memory out.
They can add a big file on the drive
to make it run out of disk space.
And there's a lot of different things,
state attacks as well,
like make it terminate or restart or reboot, pause, all these kind of things.
And the UI is beautiful to use.
That's what is cool.
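As a crude illustration of the kind of resource attack described here (a toy, not how Gremlin implements its attacks, and only for systems you are allowed to disturb):

```python
import multiprocessing
import os
import time


def _burn(stop_at):
    # spin until the deadline to keep one core busy
    while time.time() < stop_at:
        pass


def cpu_attack(duration_seconds=60, cores=None):
    """Pin N cores at ~100% CPU for a while, then let the system recover.

    Run this from a __main__ guard; duration and core count are illustrative.
    """
    cores = cores or os.cpu_count()
    stop_at = time.time() + duration_seconds
    workers = [multiprocessing.Process(target=_burn, args=(stop_at,)) for _ in range(cores)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```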
There's another very good framework that I really like, which is Chaos Toolkit. And that's more on the API level.
So it's kind of a framework,
an open framework that you can build extension with.
So there's an AWS extension
and that kind of wraps the AWS CLI.
And then you can also like do API queries
to get the steady state,
do some actions, probe the system and things like this.
And the whole template for the experiment is written in JSON.
And then you can integrate that in your CI/CD pipeline as well. So, I mean, and actually the Chaos Toolkit does integrate with Gremlin as well.
So, I mean, all the tools are really working together in a way.
I think helping people
to make better chaos experiments.
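For a feel of what such an experiment template looks like, here is a rough sketch of a Chaos Toolkit-style definition, written as a Python dict for consistency with the other snippets (the toolkit itself reads this as a JSON file). The URL, tolerance, and action provider are made up for illustration; check the Chaos Toolkit documentation and the extension you use for the exact schema:

```python
experiment = {
    "title": "API stays healthy when one instance is terminated",
    "description": "Hypothesis: terminating a single instance does not break the steady state.",
    "steady-state-hypothesis": {
        "title": "Orders API responds",
        "probes": [
            {
                "type": "probe",
                "name": "orders-endpoint-returns-200",
                "tolerance": 200,
                # hypothetical health endpoint used as the steady-state probe
                "provider": {"type": "http", "url": "https://example.com/orders/health"},
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "terminate-one-instance",
            # hypothetical provider; the module/function depend on the extension you install
            "provider": {"type": "python", "module": "chaosaws.ec2.actions", "func": "terminate_instance"},
        }
    ],
    "rollbacks": [],
}
```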
Perfect. Cool.
Yeah, I think it's always interesting to, the reason I was
asking, you know, people that
want to go into chaos engineering, they probably
also want to figure out, so, how do
I inflict chaos?
Are there tools out there, are there frameworks out there?
And that's why I wanted to ask the question.
On the proxy level, there is a very beautiful tool called Toxiproxy, right?
That has been released by Shopify.
And that's like a proxy-based kind of chaos tool,
which you can put that proxy between your application
and, for example, a database or Redis or Memcached, and this then injects some of what they call toxics, and injects some latency or does some errors, like dropping packets, let's say, 40% of the time
and things like this.
So it's very interesting.
And then, of course, you have the old-school Linux tools, right?
WRT or, of course, corrupting the IP tables as well
and things like this.
Or the very old-fashioned pull-the-plug.
Yeah, that's actually how, let's say,
Jesse Robbins famously did that at Amazon on retail
in the early 2000s.
He would walk around data center
and pull plugs from servers
and even pull plug of entire data centers.
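A toy illustration of the proxy idea mentioned above (far simpler than Toxiproxy itself): a small TCP proxy that rejects a fraction of incoming connections and delays the rest before forwarding to the real dependency. The addresses and numbers are made up:

```python
import random
import socket
import threading
import time

LISTEN_ADDR = ("127.0.0.1", 6380)    # clients connect here instead of the real service
UPSTREAM_ADDR = ("127.0.0.1", 6379)  # the real dependency, e.g. a local Redis
LATENCY_SECONDS = 0.2                # added once per connection
DROP_RATE = 0.4                      # fraction of connections rejected outright


def pipe(src, dst):
    # copy bytes one way until either side closes
    try:
        while data := src.recv(4096):
            dst.sendall(data)
    except OSError:
        pass
    finally:
        src.close()
        dst.close()


def handle(client):
    if random.random() < DROP_RATE:
        client.close()            # simulate a dropped connection
        return
    time.sleep(LATENCY_SECONDS)   # simulate added network latency
    upstream = socket.create_connection(UPSTREAM_ADDR)
    threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
    threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()


def main():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(LISTEN_ADDR)
    server.listen()
    while True:
        client, _ = server.accept()
        threading.Thread(target=handle, args=(client,), daemon=True).start()


if __name__ == "__main__":
    main()
```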
One question about these tools then,
because the next part of this is all verifying, right? And you mentioned some metrics, and these aren't host or machine type of metrics, these are human metrics, things like time to detect, time for notification. Do any of these tools that you mentioned have either the ability to pick up on, like, maybe time to notification, or allow for human entry to say, okay, we detected it at X time, to basically track these reaction metrics that you talk about? Do you know if any of those tools have anything built in, or is that something you'd be keeping track of on your own?
and then let's talk about what those are as well.
Right. So if you look at the Chaos toolkit,
since it's API-based, you can query all your system
if it supports that to add in the Chaos Engineering
experiment report.
So every time you run an experiment,
it will print a report that you
can then analyze. As for the Gremlin, of course, it's more agent-based. So there's no kind
of a complete report like this. But I mean, neither Chaos Toolkit nor Gremlin type of things have a full reporting that would satisfy,
let's say, the most, let's say, careful team.
Like if you want to write a COE,
you're going to have to do a lot of human work
in figuring out all these different things
like time to detect, notification, escalation,
public notification, self-healing,
recovery, until all clear and
stable. There's
nothing yet that really covers
everything.
There's definitely space for
a competitor if you want to build your own company.
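Until the tooling catches up, these reaction metrics can be tracked by hand; a minimal sketch of what that bookkeeping might look like:

```python
import time


class ExperimentTimeline:
    """Rough sketch for hand-tracking the reaction metrics discussed above
    (time to detect, notify, recover) while running an experiment."""

    def __init__(self):
        self.marks = {}

    def mark(self, event):
        # e.g. "injected", "detected", "notified", "all_clear"
        self.marks[event] = time.time()

    def duration(self, start_event, end_event):
        return self.marks[end_event] - self.marks[start_event]


# usage: call timeline.mark("injected") when the failure is injected,
# timeline.mark("detected") when the first alert fires, then
# timeline.duration("injected", "detected") gives time-to-detect in seconds.
```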
I think there's an idea
Well, I actually just thought of, you know, obviously on the Dynatrace side we do the automated problem detection, and we have APIs where external tools can feed events to Dynatrace. So for instance, if you set up the hypothesis, you could tell Dynatrace about the hypothesis, meaning you can set thresholds on your business metrics,
like order rate or conversion rates.
And you can then also tell Dynatrace,
I don't know, Friday, 6 o'clock, we're starting the chaos run.
In case Dynatrace detects a problem based on your metric going out of the norm, it would
immediately then open up a problem and would collect all the evidence, including the event
that you sent.
So I think in a way we could actually measure a lot of these things.
And by integrating it with these chaos tools, we would even allow you to automatically set up your hypotheses and tell Dynatrace when you started the test run and when this test run ended. And then Dynatrace can tell you when the problem was detected, when the notifications were sent out, and when the problem went away. So I think we need to...
It's a great idea.
Actually, the Chaos Toolkit supports an extension currently for OpenTracing.
So if it's something that is
going to be used at
Dynatrace, it's definitely something you should
look at. It supports Prometheus
probes and
CLI, Instana.
There's a bunch of extensions.
So I think you guys should definitely write an extension
for Chaos Toolkit to actually send
what is called the Chaos Toolkit report to Dynatrace
to add visibility to the whole thing.
Definitely. It's a very good idea.
Yeah. Cool.
So learn and verify.
What else is there to know?
Learn, verify, and then I guess also optimize and learn and fix things, right?
Well, you know, I think the big part of verifying is also the postmortem, right?
I think you should always go through the postmortem part.
And in my opinion, it's one of the interesting things
because you're going to deep dive on the reasons of the failures, right?
If there are failures.
So if your chaos experiment is successful, then good for you, right?
You should also write about this.
If it's unsuccessful and if you've created or resurfaced failures
that you didn't think about, that's the post-mortem.
And then you have to deep dive really, really well on the topic
and figuring out what happened, the timeline of it,
what was the impact, and why did the failure happen.
This is the moment where you kind of want to go to the root cause. I know root cause is something that is very, very difficult to get, because a failure is never one reason. It's always a collection of small reasons that come together to create this kind of big failure. But
trying to find as many reasons as possible at different layers, different levels.
And then, of course, what we learned, right?
And how do you prevent that from happening in the future?
So how are you going to fix it?
Those are really hard to answer.
They sound easy when I talk about it, but it's very, very difficult
to answer carefully all that stuff.
In your blog post, you talk a lot about,
I think you start the blog post actually
with comparing chaos engineering with firefighters
because I think Jesse Robbins, he was actually a firefighter,
if I kind of remember this correctly,
and he brought Game Day to Amazon, right?
Yes.
And looking at this, so firefighters,
I think you said something like 600 hours of training,
and in general, 80% of their time is always training, training, training.
Is there, and this might be a strange thought, but is there a training facility for chaos, to actually learn chaos engineering? Is there a demo environment
where you have, let's say, a reference architecture
of a, let's say, web shop running
and then you can play around with chaos engineering?
Andy, I think that's called having kids.
You're totally right.
I mean, one of the very, very good point of chaos engineering,
and even when not done in production or in production,
I mean, in any case,
is the fact that you can actually practice recovering from failures.
So you inject failure in your system,
and then you let the team handle it the same way as it would be an outage, right?
So you will, you know, practice and practice and develop what the firefighters want to develop,
like the intuition for understanding errors and behaviors and failures in general.
Like, you know, if you see, for example, an NGINX kind of CPU consumption curve
or a concurrent connection,
and all of a sudden it becomes flat, you have to build the intuition for what it might be.
You know, definitely the first thing to look
is the Linux security configuration on your instance.
You know, max connection is probably low,
all these kind of things.
And you can't build that intuition
if you've never kind of debugged the system
or tried to recover from an outage,
because those are extreme conditions.
And it's exactly that, right?
It's practice, build intuition,
and then make those failures come out.
Very cool.
Is there anything we missed in the phases?
I think we are...
Fixing it.
Wow.
Who wants to do that?
Yeah, it's like, you know,
you've done all the fun right now,
so you have to fix it.
And this is, in my opinion,
I'll say something very important here is,
unfortunately, I see a lot of companies
doing chaos engineering and very brilliant COEs
or corrections of error post-mortems,
but then the management don't give them the time
to actually fix the problem.
So I actually was with one of these companies
a few months ago, and two
weeks after the chaos experiment, which surfaced some big issues in the infrastructure, they
had this real outage in production and were down 16 hours, right? So they could have fixed it before, but they didn't stop the feature work, or they didn't prioritize that, and eventually they paid a bigger price, right? So it's very important to not just do those chaos experiments, but actually to get the management buy-in and make sure that when you have something serious, you stop everything else and just fix it. That's super important.
But let me ask you, why fix it if you can just reboot the machine and make it work again?
Exactly.
You've been using the JVM for some time, right?
Yeah.
It's like reboot Fridays.
Is that what you have as well, like, at Dynatrace?
Yeah.
Imagine.
So besides obviously fixing it, as Andy was going on,
was there anything, you know, obviously we want to fix it. Beyond that, we kind of can't recommend enough that everybody read the blog, but is there anything that we didn't cover that you want to make sure we mention?
Yeah, there's the side effects. I think, you know, chaos engineering is great at uncovering failures, but actually I think the side effects on the companies are even more interesting.
And they are mostly cultural.
Like the fact that companies that start to do chaos engineering eventually move, when successful (I've seen non-successful ones, but most of them are successful), to what I like to call this non-blaming culture, and move away from pointing fingers to, you know, how do we fix that?
Yeah, I think that's a beautiful place to be, right?
For developers, for owners as well.
Because it's a culture that accepts failure, embraces failures,
and kind of want to fix things instead of blaming people.
And that's also how we work at AWS and Amazon. And
I really like that, you know, our COEs and postmortems are kind of blame-free. And that's
one very, very important part of writing that postmortem. And, you know, I think it's great,
because if you point someone that is making a mistake, eventually it will come back to you, right?
You will suffer a mistake and be blamed.
And that's never a good place to be.
Yeah.
I think another good side effect too is going back to the hypothesis phase
where people will start thinking probably about what they're going to do
in terms of the hypothesis in mind.
So before they actually implement something,
they'll probably start thinking more about what its effect might be.
Yeah, exactly.
You think more about the overall system versus just the part that you build, right?
I think that's an important thing.
And of course, we didn't mention, but very, very good side effect is sleep, right?
You get more sleep
because you fix outages
before they happen in production.
So you get a lot more sleep.
Awesome.
Hey, Andy,
would you like to summon the Summarytor?
I would love if you summon the Summarytor.
Do it now.
All right.
You've rehearsed that one, right?
No, so, well, Adrian, thanks again for yet another great educational session. It's a topic, chaos engineering, that I think is still kind of in its infancy stage when it comes to broader adoption and everybody really understanding what it is.
I really like a quote that I think you took from Adrian Cockcroft, who said,
Chaos engineering is an experiment to ensure that the impact of failures is mitigated,
which is also a great way of explaining it.
I really encourage everyone out there, read Adrian's blog.
The five phases for me are, I mean, I think the first two for me are amazing because steady state means you, first of all, need to work on an architecture that is ready for chaos.
And you need to know what the steady state actually is and have a system that is steady.
But then I really like your kind of, I call it experiments now too, experiment with the people to figure out
what they think should happen
when a certain condition comes in.
Like working on your hypotheses
because you can fix and find a lot of
problems already before
you really run your chaos engineering
experiments.
Thanks for that insight
and I really hope
that
you will come back on another show
because I'm sure there's a lot of more things you can tell us.
I'd love to.
Thanks a lot, Andy and Brian, for having me once more on the show.
I really enjoyed it.
It was an absolute pleasure.
I would just like to add one thing as well.
For anybody who's listening, who might be on the performance or load side of things:
Andy and I have talked many times, especially early on the podcast about the idea of leveling up.
I'm listening to this bit about chaos engineering and all.
I keep thinking, wow, what a great place to level up to, or not even level up to. Let's say you're like, I'm kind of done with what I feel I can do in the load environment or that whole field. This is, I feel, kind of the continuation of it. It really boils down to hypothesis, experiment, analysis, and obviously you have the fixing at the end, which is very much the same as, you know, doing your performance and load testing. And it's brand new, you know, chaos engineering is not very widespread yet.
Obviously, the big players are doing it.
So there's a lot of opportunity out there
to get involved with that.
So definitely, if I were still on the other side of the fence,
not being a pre-sales engineer,
I would probably start looking into this a bit more.
Yeah, and it's a lovely place.
I mean, people are amazing.
They love sharing.
So I highly recommend everyone to get involved in the space.
Sounds like it's just a sidestep over to a new world of experimentation.
Anyway, Adrian, thanks again, as Andy said.
Thank you.
Welcome back anytime.
Anytime you get something new, please come on.
It's great stuff you have here.
We will put up your information.
We're going to put a link to this article,
so everyone go check out Spreaker slash Pure Performance
or Dynatrace.com slash Pure Performance.
You can get the links to the article.
If anybody has any questions, comments,
you can tweet them at Pure underscore DT,
or you can send an old-fashioned email to pureperformance at dynatrace.com.
Thank you all for listening.
Adrian, thank you once again for being on.
Andy.
Thank you so much, guys.
Ciao, ciao, Andy.
Thank you.
Bye-bye.
Thanks.
Bye.