PurePerformance - Chaos Engineering: The art of breaking things purposefully with Adrian Hornsby
Episode Date: September 2, 2019
In 2018 Adrian Cockcroft was quoted with: "Chaos Engineering is an experiment to ensure that the impact of failures is mitigated"! In 2019 we sit down with one of his colleagues, Adrian Hornsby (@adhorn), who has been working in the field of building resilient systems over the past years and who is now helping companies to embed chaos engineering into their development culture. Make sure to read Adrian's chaos engineering blog and then listen in and learn about the 5 phases of chaos engineering: Steady State, Hypothesis, Run Experiment, Verify, Improve. Also learn why chaos engineering is not limited to infrastructure or software but can also be applied to humans.
Adrian on Twitter: https://twitter.com/adhorn
Adrian's Blog: https://medium.com/@adhorn/chaos-engineering-ab0cc9fbd12a
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson.
Bet you didn't know that.
And as always, my guest is, can you guess, Andy Grabner.
Hey Andy, how are you doing today?
Good, good.
Hi Brian.
Thanks for, well, great to be back on actually doing some recordings.
I know the audience wouldn't really know.
They have no idea. They're clueless. They think we always do it on the two-week schedule. But you've been traveling, right?
Yes, I went back to New Jersey, visited some friends and family, and got to go to the beach, or as we call it in New Jersey, the shore. Sure to please. I also found out, here's a stupid little tidbit: I don't know if you recall Hurricane Sandy. Probably a lot of our listeners overseas never heard of it, but it was this huge hurricane that really did a lot of damage to the whole eastern coast of the United States. After that, New Jersey was trying to rebuild along the coast, and they came up with this saying, "Jersey Strong." You know, there's a lot of guidos in New Jersey, like, "Yeah, Jersey Strong, we're gonna come back strong."
And it turns out somebody opened a gym, a workout gym, called Jersey Strong as well. So now it's confusing, because if someone has a Jersey Strong bumper sticker, you don't know if they're taking pride in rebuilding New Jersey or if they're like, "Yeah, I go to this gym and lift weights, bro." So that's my tidbit.
That's what I learned on my trip to New Jersey.
And you've been running around a lot too, haven't you, Andy?
I believe so, yeah.
Well, not running, flying Europe.
I did a tour through Poland, a couple of meetups,
a couple of conferences, also visited our lab up in Gdansk, which was the hottest weekend of the year with 36 degrees on the Baltic Sea, which is kind of warm, I would say.
But fortunately, the water was at least, you know, refreshing still.
And yeah, but we're back now.
And we have...
Sounds like there's been a lot of chaos.
Sounds like there's been a lot of chaos, exactly.
And actually, I think distance-wise, Gdansk in Poland, in the north of Poland,
is probably not that far away from the hometown of our guest today.
And with this, I actually want to toss it over to Adrian,
and he should tell us how the summer is up there in Helsinki.
Welcome back, Adrian.
Hi, guys. Hi, Andy. Hi, Brian. How are you?
Good, how are you?
Yeah, thanks a lot for having me back.
Well, the weather in Finland, since that was the first chaotic question,
is indeed chaotic.
Last year we had the hottest summer in maybe 50 years, and now we had one of the coldest summers in 50 years.
It's pretty weird. Today is nice actually.
So you did not get the extreme heat wave all the way up there? That's interesting.
No, heat waves here are 25 degrees.
Okay. Yeah, I think we got 40 here in Austria, and 35 up there in the north of Poland. Crazy, crazy stuff.
Hey, Adrian, I'm gonna try a quick conversion here, hold on. Man, when are you finally... let's see, so it's about 30, it's going up to about 35 all week here.
Yeah, wow.
It's hot.
It's hot, yeah.
Yeah, when are we going to finally get on to the... yeah, we're never going to do that. Socialism.
Yeah, I know.
So, Adrian, coming to chaos, we talk about chaos.
Well, Brian mentioned it, I mentioned it. Today's topic, after having you on the first call, it was really cool to hear about best practices of building modern architectures.
I really love this stuff.
And I used it again, actually, on my trip to Poland. I did a presentation in Warsaw, and I gave you credit on the slides that I repurposed for retries, timeouts, backoff, jitter. This was really well received.
Awesome. It's great to hear.
Yeah, thanks again for that. It's great when you get on stage and people think you're smart because you told them something new, but it's also great to admit that all the smartness comes from other people that are all sharing the same spirit of sharing.
Sharing is caring.
So thanks for that.
Thanks for allowing me to share your content.
Yeah, I don't want to take the credit on that either because I learned it from someone else, you know, or books or articles.
At the end of the day, it's all about sharing, as you said, and teach what we had a problem with six months ago.
Yeah.
Hey, so today's topic is chaos engineering.
And I will definitely post a link to your Medium blog post.
I think part one is out and part two is coming.
But it's a really great introduction into chaos engineering.
And to be quite honest with you,
I hadn't used the term chaos engineering until recently. I always just said chaos monkeys or chaos testing, and I think that's because I was just so influenced by the first time I learned about introducing chaos, which was, I believe, Netflix. At least when I read about it, Netflix was the one that came up with the Chaos Monkeys. Is this correct?
Yes, correct.
Yeah.
You know, it's all part of the resiliency engineering, in a way, kind of field, right? So I think the term Chaos Monkeys was really made popular by Netflix.
And their awesome technology blog told the story about how they developed those monkeys.
Hey, and so, I mean, again, everybody should read the blog post.
It's amazing the background you give.
But maybe in your words, when you're introducing somebody into chaos engineering,
when you are teaching at conferences or just visiting customers and clients and enterprises.
How do you explain chaos monkey or chaos engineering?
Let's call it chaos engineering.
How do you explain the benefit of chaos engineering and what it actually is and why people should do it?
That's a very good question.
And I'll try to make it short because I think we could spend hours trying to explain just the discipline of chaos engineering. But in short, what I say is: chaos engineering is a sort of scientific experiment where we try to identify failures that we didn't think of, and especially figure out how to mitigate them.
It's very good at taking the unknown unknown out of the equation.
I think traditionally we've used tests to make sure our code is resilient to known conditions. Chaos engineering is kind of a different approach. It's a discipline that really tries to take things that we might not have thought about out of the darkness and bring them out, so that we can learn how to mitigate those failures before they happen in production.
And examples: I think in your blog post you also have a graphic in there where you're talking about different levels or different layers of the architecture where you can basically introduce chaos, right? So infrastructure is obviously clear.
What else?
Yeah, I mean, it's very similar to resiliency, right?
I think chaos can be introduced at many different layers, from the obvious infrastructure level, like removing, for example, an instance or a server and figuring out if the overall system is able to recover from it, to the network and data layer. So removing, let's say, a network connection or degrading the network connection, adding some latency on the network, dropping packets, or making a black hole for a particular protocol.
That's on the network and data level.
On the application level,
you can just simply add the exception,
make the application throw some exceptions
randomly inside the code
and see how the outside world or the rest of the system is behaving around it.
But you can also apply it at the people level.
And I mean, I talk a little bit about it in the blog.
The first experiment I always do is trying to identify the technically sound people in teams, or the kind of semi-gurus, as I like to call them, and take them out of the equation and send them home without their laptops and see how the company is behaving.
Because very often you'll realize that information is not spread equally around team members. Well, I call that the bus factor: if that very technically sound person is not at work or gets hit by a bus, well, you know, how do we recover from failures?
I like that idea. I mean, Brian, I know we always try to joke a little bit, but I think if Adrian would ever come to us,
I think he would definitely not send the two of us home
because there are more technically sound people in our organization.
It's not necessarily the technically sound people.
It's very often the connector.
It's the person that knows everyone inside the organization
or in the company that can act very fast,
that has basically a lot of history
inside the company, and that kind of recalls, or is able to recall, every problem that happened and how they recovered from it. It's kind of a walking encyclopedia, you know.
Yeah. But Andy, I think you can attest to the fact that I was just on vacation for about a week and a half and the whole company fell apart. Right. So I am really, really important to the organization. Now, one idea behind chaos testing or chaos engineering is that you can't just start with chaos engineering.
You have to have prerequisites in your environment. You talked about, you know, all your prerequisites
to chaos, building a resilient system, resilient software, resilient network. So you really have to,
and correct me if I'm wrong, but from what I was reading, before you start the chaos engineering
part, you have to build in as much resiliency into all layers of your system
as you can possibly think of first.
Then you start seeing what did we miss, right?
You don't just say, hey, we just stood up our application
and our infrastructure.
Now we're going to throw something at it because, of course,
it's going to fail and it's going to be catastrophic.
So you really want to start from a good place, right?
Exactly.
And that's the whole point of chaos engineering.
It's really, I would say, a scientific experiment, where we test a hypothesis.
For example, we believe we've built a resilient system.
We spend a lot of time.
We say, okay, our system is resilient to database failures.
So let's make a hypothesis and say, you know,
what happens if my database goes down?
And then you take the entire organization,
the entire team from the product owner to the designers
to the software developers,
and you ask them actually what they think is going to happen.
Very often I ask them not to talk about it,
but to write it on the paper,
just to avoid the kind of mutual agreements, you know, like everyone comes with a consensus
by talking about it. So if you write it on the paper in private, you realize everyone in the
team has often very different ideas of what should happen if a database goes down.
And a good thing to do at that moment is to stop and ask,
how is it possible that we have different understandings of our specifications?
Then you go back to the specification. Very often people fail to read it carefully, or you improve it.
But once you have everyone in the team having a real good understanding
of what should happen when the database goes down,
then you make an experiment and you actually want to verify that
through that experiment, and with all the measurements you're doing around it, make sure that the system goes back to the initial state.
We call that the steady state of the application.
And also that it does that in the appropriate time, right?
In the time you thought about.
And you have to verify all this.
Yeah, and I think there's a lot to unravel there.
But even before that, right?
Even before you run that first experiment,
are there, and Andy, I don't know if you've maybe seen any, or Adrian, are there some best practices? Like, I know there's the concept of doing multiple availability zones. Let's say, if we're talking about AWS, you can set up your application in multiple zones, so this way, if one zone has an issue, you still have your two, like, you know, the rule of threes. If we go way back in time, Steve Souders wrote the best practices for web design, you know, web performance. And I know you've written some articles and all, but is there a good checklist for someone starting out, or an organization, before they start these experiments, to say, here are some of the starting points that you should set up before you even entertain your next set of hypotheses, right?
Because obviously if you don't do this basic set, you're going to have catastrophic failures.
Is there something written out yet or is it mostly just things we've been talking about and word of mouth and people's blogs?
Well, it's a very good question, actually.
I'll take the note and make sure I can build a list.
Actually, that's a very good idea.
But I would say I think the 12-factor app kind of patterns
is a very good place to start with.
I did put some lists around,
and I talk about resilient architectures in different blogs.
I think it's really software engineering basics: timeouts, retries, exception handling, these kinds of things. Make sure that you have redundancy built in, that your system is self-healing, because, you know, at the end of the day you don't want humans to recover; you actually want the system to recover automatically. So that's something that is usually not well thought out, or not at every level.
I always say, so if you don't do infrastructure as code
and automatic deployment, maybe don't do chaos
or actually don't do chaos
because you're going to have big problems.
You know, your system should be automatically kind of deployed and managed and self-healed in a way.
And then, of course, on the operation level, you should have the full observability and
complete monitoring of your system.
If you have no visibility on what's happening in your application, there's no way you can conduct a sound experiment
or even verify that that experiment
first has been successful
or has negatively impacted the running system.
So measure, measure, measure as much as you can.
And of course, you need to have an incident response team, in a way, or a practice, so that you know what you should do as soon as an alert comes in, treat the tickets, and be able to have this full incident response.
Yep.
That'll do, right?
Mm-hmm.
Yep.
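As a rough illustration of the "timeouts, retries, exception handling" basics Adrian lists as prerequisites, here is a minimal Python sketch of a retry wrapper with exponential backoff and full jitter. The function names and limits are illustrative assumptions, not code taken from Adrian's libraries.

```python
import random
import time

# Minimal sketch of retry with exponential backoff and full jitter.
# MAX_ATTEMPTS and BASE_DELAY are illustrative values, not recommendations.
MAX_ATTEMPTS = 5
BASE_DELAY = 0.2   # seconds
MAX_DELAY = 5.0    # cap so retries never sleep "forever"


def call_with_retries(operation, *args, **kwargs):
    """Call `operation`, retrying transient failures with backoff + jitter."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return operation(*args, **kwargs)
        except (ConnectionError, TimeoutError):
            if attempt == MAX_ATTEMPTS:
                raise  # give up: let the caller's error handling take over
            # Exponential backoff with full jitter: sleep a random amount
            # between 0 and the capped exponential delay.
            delay = min(MAX_DELAY, BASE_DELAY * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))


# Hypothetical usage: wrap a flaky network call.
# call_with_retries(requests.get, "https://example.com/orders", timeout=2)
```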
And I know Andy's going to want to dive into all the phases, so let's go on.
You mentioned kind of an overview of the experiment, the observations, the hypothesis and all that.
So Andy, I know you probably have 20,000 questions and things to talk about.
So let's dive into that now.
I mean, again, before I actually dive in, because you just mentioned monitoring and incident response at the end: I think in your blog post you also mentioned that in one of your cases,
you inflicted chaos and you actually saw
that the incident response or the alerting
was also impacted by the chaos.
For instance, not able to, let's say,
send Slack messages or send emails or stuff like that.
I think there's such a huge variety of chaos, like in the hypotheses that we need to test against.
Yeah, it's very common to build systems that kind of host or power our own response system. And you see, actually, there's been a recent outage with Cloudflare, for example, and they were not able to use their internal tools easily, simply because their tools were too secure. The people hadn't used their maintenance tools long enough, or recently enough, so the system had just removed the credentials.
And when you're in a panic situation,
having systems like this that actually are impacted
by your own behavior is hard.
But DNS is another one.
Very often, if DNS goes down,
your own tools are not accessible
because they use or might use DNS.
So all these kind of things like this are usually very important
to test as well while doing the experiment.
That's why you need to start from or simulate the real outage situation.
So now I got it.
Before going into the phases, because I know you have in the
blog post, you have like five phases that you talk about, but one question that came
up while you were earlier talking about application exceptions, because, you know, Brian and I,
we kind of live and breathe applications because we've been, you know, with Dynatrace,
especially we monitor applications and services, but obviously the full stack,
but we are very, I think, knowledgeable on applications.
And when you said earlier chaos in applications, like you can, you know,
just throw exceptions that you would normally not throw and see how the
application behaves.
Do you then imply there are tools that would, for instance,
use dynamic code injection to force exceptions?
Is that the way it works?
Or do you just, I don't know,
change configuration parameters of the application
that you know it will result in exceptions?
There are several types of tools.
There are libraries that you can directly use
to inject failures.
Either in Python, very often it's using a decorator function that you wrap around the function, and that intercepts the function call and then throws some exception.
Actually, I'm building my chaos library toolkit around that, using that concept
in Python to inject failures in Lambda functions in Python. It's being used as well for JavaScript,
similar technique. There are also techniques to have a proxy as well. So you have proxies between two different systems
and then the proxy kind of hijacks the connection,
kind of man in the middle in a way,
and kind of alter the connection,
can add latency, can throw some exception,
can inject, you know, like packet drops or things like this. So there's a lot of different techniques.
There's also one very common technique that Gremlin is using, which is agent-based. So they have an agent running on an instance or a Docker container that can inject failures locally, these kinds of things. So they can intercept process-level type of things and just throw exceptions or make it burst or things like this.
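Here is a rough sketch of the decorator technique Adrian describes: a Python decorator that wraps a function and randomly raises an exception (or adds latency) so you can watch how the rest of the system behaves. The decorator name, the failure rate, and the environment-variable switch are illustrative assumptions, not code from Adrian's chaos library.

```python
import functools
import os
import random
import time


def inject_failure(exception=Exception("chaos: injected failure"),
                   rate=0.2, delay_seconds=0.0):
    """Wrap a function and randomly raise an exception / add latency.

    Only active when the (hypothetical) CHAOS_ENABLED env var is set,
    so the same code can run untouched in normal operation.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if os.environ.get("CHAOS_ENABLED") == "1":
                if delay_seconds:
                    time.sleep(delay_seconds)          # simulate latency
                if random.random() < rate:
                    raise exception                    # simulate a failure
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_failure(exception=TimeoutError("chaos: simulated timeout"), rate=0.3)
def get_recommendations(user_id):
    # Real lookup would go here; kept trivial for the sketch.
    return ["item-1", "item-2"]
```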
Cool.
So moving on to the phases,
if you kind of look at your blog posts,
we already talked about the steady state, right?
Steady state, basically, I think for me,
what it means, first of all, you have to have a system that is kind of stable and steady.
Because if you have a system that in itself is not predictable, I guess inflicting chaos on it is probably not making it easier to figure out: is this situation now normal, or caused by chaos? So I guess steady state really means a system that is stable and you know what that state is, right?
Yeah, and it's predictable, as you said. I think a big mistake people usually make is they use system attribute metrics, type of CPU, memory, or things like this,
and look at this as a way to measure the health
and the sanity of an application.
Actually, a steady state should have nothing to do with this,
or at least not entirely, and should be a mix
of operational metrics and customer experience.
I write in the blog about the way Amazon does that, the number of orders.
And you can easily imagine that the CPU on an instance has actually no impact on the number of orders at big scale.
And similarly for Netflix, they use the number of times people press on the play button globally to define their steady state. They call that the pulse of Netflix, which I think is beautiful because it's a relation between the customer and the system,
right? If you press several times, then of course it means the system is not responding the way it
should, right? So it's experiencing some issues. And similarly, if you can't place an order on
Amazon retail page, it means the system is not working as it's designed. So it's a very good kind of a steady state.
But it's important to work on that.
It's not easy actually to define it.
And I see a lot of customers having big trouble
first defining the steady states
or their steady states.
You can have several of them.
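To make the "steady state as a business metric" idea concrete, here is a hedged sketch of what a steady-state check might look like: compare a customer-facing metric (orders per minute) against an expected band, alongside a supporting infrastructure metric, which also anticipates the point made below about not ignoring the rest of the stack. The metric sources and tolerances are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Expected band for a business metric plus a supporting infra metric."""
    min_orders_per_minute: float
    max_orders_per_minute: float
    max_instance_count: int


def within_steady_state(orders_per_minute: float,
                        instance_count: int,
                        expected: SteadyState) -> bool:
    """True if the system looks 'normal' before/after an experiment."""
    business_ok = (expected.min_orders_per_minute
                   <= orders_per_minute
                   <= expected.max_orders_per_minute)
    # Supporting metric: did recovery quietly double our footprint?
    infra_ok = instance_count <= expected.max_instance_count
    return business_ok and infra_ok


# Hypothetical numbers: 10,000 orders/min +/- 10%, at most 40 instances.
expected = SteadyState(9_000, 11_000, 40)
print(within_steady_state(orders_per_minute=10_250, instance_count=38,
                          expected=expected))  # True
```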
But for me, I mean, the way you explain it to me,
it's kind of like a business metric, right?
Like orders, the pulse.
I don't know if I would apply it to what we do at Dynatrace,
the number of trial signups
or the number of agents that send data to Dynatrace.
I have a hard time believing, but I'm sure you're right, that companies have a hard time defining that steady state metric,
because if they don't know what their steady state metric is,
they don't even know whether their business is doing all right
or is currently impacted.
Is that really the case?
I guess you'd be surprised.
It's not that they don't know what the business is doing; the business might be doing several things. And I think usually the business metrics or things like this are used, let's say, by higher-level monitoring that might go back to the managers or the C-levels, versus what we want is actually that kind of metric to go down to the engineering team in a very distilled way, and very easily accessible.
And okay, now that makes a lot of sense.
And I have another question though.
So you said CPU is typically not a metric because it doesn't matter how many,
what CPU usage you have as long as the system gets back to steady state.
But do you still include these metrics?
The reason why I'm asking is what if you are having a, let's say,
a constant, like if you look at the orders, right?
You have, let's say, 10,000 orders per second coming in, and you know you're using X CPUs in a steady-state environment,
then you're bringing chaos into the system.
The system recovers, goes back to the 10,000 orders,
but all of a sudden you have, I don't know, twice as many CPUs.
Isn't that, shouldn't you still include at least some of the
metrics across the stack
to validate not only that you are
that the system itself is back
in a steady state but the supporting infrastructure
is at least
kind of going back to normal
or is this something you would just not do
at all because it is not the focus point?
No, you're totally right. The steady state is kind of the one metric that is important,
but you do need all the supporting
metrics as well.
It doesn't mean, when I say the steady
state is the most important one, it doesn't
mean you should not include the rest.
On the contrary, it's actually
you should have as
much metrics as you can.
It doesn't mean you should try to overdo the things, right?
But I think the most essential one,
especially for the cloud,
if you say you have 10,000 orders
and your infrastructure to support 10,000 orders
is the one that you are currently using, you make a chaos experiment, and then that infrastructure or the need to support that 10,000 orders is kind of double,
then of course you should raise a big alarm, right?
And make sure that this is looked at, of course.
Cool.
All right, so steady state, first state or first phase of chaos engineering.
What comes next?
So after the steady states, we make the hypothesis.
Once we understand the system, then we need to make the hypothesis.
And this is the what, what if, or what when, because failure is going to happen.
So what happens when the recommendation engine stops, for example, or what happens when the database fails
or things like this?
So if you're first-timer in chaos engineering,
yeah, definitely start with a small hypothesis.
You know, don't tackle the big problems right away.
Build your confidence, build your skills,
and then grow from there.
But yeah, this is endless possibilities, right?
So what I love to do is usually look at an architecture and kind of look at the critical
systems first.
So you look at your kind of APIs and what are the critical systems for each of the APIs. And then you tackle the critical systems first, right?
And see if these are really as resilient as you expected,
or if you can uncover some kind of failure mode
that you didn't think of, and things like this.
And then you can go into less critical dependencies, but usually the most important thing is to make sure that the critical components are tackled first.
And this is also the case with the hypothesis. If I just recount or repeat what you said earlier: you explained you do the exercise where you go into a room and you bring everybody in the room
and then you discuss the hypothesis
and let everyone kind of write down
what they believe is going to happen, right?
Exactly.
And this is for me,
it's one of the most important part of that exercise
because you want to uncover different understandings
and figure out why people have different understandings of a hypothesis.
And usually this will uncover some problems already,
like a lack of specifications or a lack of communication
or simply people have forgotten what it's supposed to be.
Yeah. And also if you think about this,
let's take the database as an example.
If you say, what happens if the database is gone?
And maybe one team says, well,
my system that is relying on the database
is just retrying it later.
And then the other team that is responsible
for the database maybe says, well,
but that's not the intended way.
We thought you could just go back to a, like, I don't know, a static version of the data,
blah, blah, blah.
I think that's, as you say, uncovering a lot of problems that actually have nothing to
do with technology, but actually have to do with lack of communication or lack of understanding
of the system.
Yeah.
And very often what I've noticed,
the big difference is usually between the design team
or UI teams and then the backend teams, right?
Many times the UI team will not have thought about this,
like what happens if your database goes down?
Well, for the technical team, it might be obvious
to move into a read-only mode and say,
okay, we move requests to the read replicas of the database. And eventually, after maybe a minute, we fail over to the new master.
So for one minute, your application will be in a state of, hey, there's no database.
So what do you do?
Very often, people in the UI world won't think about it
because it's not a case they were asked about.
Read-only mode is quite weird, actually, to think of.
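A minimal sketch of the read-only fallback Adrian describes: if the primary is unavailable while failover is running, degrade reads to a replica and surface a clear read-only error for writes instead of failing silently. The class and method names are illustrative assumptions.

```python
class PrimaryDown(Exception):
    """Raised by the (hypothetical) primary client while failover is running."""


class OrdersStore:
    def __init__(self, primary, replica):
        self.primary = primary      # read/write client
        self.replica = replica      # read-only client
        self.read_only = False      # degraded-mode flag

    def get_order(self, order_id):
        # Reads can still be served from a replica during failover.
        source = self.replica if self.read_only else self.primary
        try:
            return source.get(order_id)
        except PrimaryDown:
            self.read_only = True               # degrade instead of failing
            return self.replica.get(order_id)

    def place_order(self, order):
        if self.read_only:
            # The UI has to handle this case explicitly, which is exactly
            # the kind of behaviour the hypothesis discussion surfaces.
            raise RuntimeError("orders are temporarily read-only")
        try:
            return self.primary.put(order)
        except PrimaryDown:
            self.read_only = True
            raise RuntimeError("orders are temporarily read-only")
```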
Hey, Andy, I was also thinking,
when reading through these steps
and also sitting on the hypothesis
and the steady state piece,
how this sounds like this would be a wonderful place for performance
testers, engineers, whatever to transition into. But specifically with this hypothesis phase,
I love the idea of making this not just for chaos, but for performance as well. Let's say there's a
new release being built, right? Gather everybody together and say, all right, we're going to run a certain load against this.
What do you all think is going to happen
in a way to make them think,
well, wait, what did I just code and what might happen?
Again, like classic database issue
where suddenly we added four more database queries
to a statement.
Well, hey, are you thinking about that?
But just even if it's not for performance,
this idea of getting everybody together to say,
what do you think is going to happen if X,
whether it be a chaos experiment or performance,
some sort of a load test or something else,
to say, what do you think is going to happen?
I think that sort of communication with people
opens their mind to actually thinking about what they're doing
and what impacts they might have.
I think regardless if it's for chaos,
I think it's just a great practice for organizations to put together.
Yeah, exactly.
And if you think of it, I mean,
executing load or putting a load on the system is a form of chaos.
Anyway, I love this.
Yeah.
It's great.
Yeah.
So hypothesis is done.
Then I guess you run the experiment, right?
Correct.
And this is first designing and running the experiment.
And I think here the most important thing is blast radius, right? So you have to keep the customers in mind, right?
And of course, initially, if it's your first time,
you might not want to do that in production
and really do this in test environment
and make sure that you are able to control the blast radius
during your experiments.
And this is very important to think about.
It's like, you know, how many customers might be affected,
what functionality is impaired, or if you are a multi-site,
which locations are going to be impacted by this experiment.
So, you know, this is very, very important to think about.
And, you know, then it's about running it, right,
and identifying also the metrics that you need to measure for that experiment.
So as you mentioned earlier, you know,
you might want to make sure that you control the number of instances
that you need for a particular steady state
and make sure that you return to that same number after the experiment.
And this is definitely part of the overall metrics
that you need to check for your application.
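As a sketch of "controlling the blast radius," here is one way an experiment runner might cap how much of the fleet an experiment can touch and abort as soon as the steady-state check fails. Everything here (the percentage cap, the abort behaviour, the callback names) is an illustrative assumption, not a specific tool's behaviour.

```python
import random


def run_experiment(instances, attack, check_steady_state,
                   blast_radius=0.05, rollback=None):
    """Attack at most `blast_radius` of the fleet; abort if steady state breaks.

    `attack(instance)` injects the failure, `check_steady_state()` returns
    True while the business metric stays inside its expected band, and
    `rollback(instance)` undoes the injection (all supplied by the caller).
    """
    if not instances:
        return
    max_targets = max(1, int(len(instances) * blast_radius))
    targets = random.sample(instances, max_targets)
    touched = []
    try:
        for instance in targets:
            attack(instance)
            touched.append(instance)
            if not check_steady_state():
                print("steady state broken, aborting experiment")
                break
    finally:
        # Always clean up, even if the experiment aborted half-way.
        if rollback:
            for instance in touched:
                rollback(instance)
```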
From a running perspective, I think you mentioned earlier,
you have your Python-based library that can run some experiments.
Did Netflix also release their Chaos Monkey libraries to the world?
Yeah, I mean, the Chaos Monkey was one of the first tools ever released
to do chaos engineering.
That's now part of a continuous delivery tool called Spinnaker, together with the rest of what they call the Simian Army.
And that's, you know, there's a bunch of monkeys.
There's the Chaos Monkey, which is the original one,
which kills randomly instances.
There's Chaos Gorilla,
which impairs an entire availability zone in AWS.
And then there's the big gorilla, the big Kong, Chaos Kong, which shuts down an entire AWS region.
And they practice that in production,
you know, maybe once a month or something like this
while people are watching Netflix.
Just to make sure that their initial design is still valid, right?
Yeah.
I wonder what would happen if AWS shut down an entire region.
Let's not find out.
You'd be surprised, actually.
I mean, we do chaos engineering as well. I mean, we started very early on, but, you know, we don't do chaos engineering on paying customers. We do a different level of chaos engineering, right? And Prime Video does that in production, actually, and I'm writing a blog post about that, so it should be out maybe in a month or two, something like this.
Cool.
Hey, and then I think you mentioned Gremlin earlier, right?
One of the agent-based companies.
So that's also cool.
Do you have any insights into what they are particularly doing?
I mean, as I said, Gremlin is part of the agent-based kind of chaos tool.
So they do a lot of cool things
on the per-instance level.
So they can, you know,
corrupt an instance
by making a CPU run wild.
So 100% CPU utilization
and see how your application reacts.
They can take the memory out.
They can add a big file on the drive to make it run out of disk space.
And there's a lot of different things, state attacks as well,
like make it terminate or restart or reboot, pause, all these kind of things.
And the UI is beautiful to use.
That's what is cool.
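For the "make the CPU run wild" style of attack Adrian mentions, here is a hedged, generic sketch (not Gremlin's implementation): spin one busy-looping worker per core for a fixed duration and watch how the application responds.

```python
import multiprocessing
import time


def _burn(stop_at):
    # Busy loop until the deadline: keeps one core at roughly 100% utilisation.
    while time.time() < stop_at:
        pass


def cpu_attack(duration_seconds=60, cores=None):
    """Pin `cores` CPUs (default: all of them) at 100% for the duration."""
    cores = cores or multiprocessing.cpu_count()
    stop_at = time.time() + duration_seconds
    workers = [multiprocessing.Process(target=_burn, args=(stop_at,))
               for _ in range(cores)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


if __name__ == "__main__":
    cpu_attack(duration_seconds=30)   # small and time-boxed by design
```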
There's another very good framework that I really like
is Chaos Toolkit.
And that's more on the API level.
So it's kind of a framework, an open framework, that you can build extensions with.
So there's an AWS extension that kind of wraps the AWS CLI.
And then you can also do API queries to get the steady state, do some actions, probe the system and things like this.
And the whole template for the experiment is written in JSON.
And then you can integrate that in your CI/CD pipeline as well. So, I mean, and actually the Chaos Toolkit does integrate with Gremlin as well.
So, I mean, all the tools are really working together in a way.
And I think helping people
to make better chaos experiments.
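To give a feel for the JSON experiment template Adrian mentions, here is a rough sketch written as a Python dict that mirrors the Chaos Toolkit format (title, steady-state hypothesis with probes, method with actions, rollbacks). The concrete URLs, module paths, and tolerances are placeholder assumptions; check the Chaos Toolkit documentation for the exact schema of any extension you use.

```python
import json

# A Chaos Toolkit-style experiment sketched as a Python dict; dump it to
# JSON and it resembles the templates described above. Values below are
# placeholders, not a tested experiment.
experiment = {
    "title": "Orders survive losing one instance",
    "description": "Hypothesis: killing a single instance does not break checkout.",
    "steady-state-hypothesis": {
        "title": "Checkout responds",
        "probes": [
            {
                "type": "probe",
                "name": "checkout-endpoint-is-healthy",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    "url": "https://shop.example.com/health",
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "terminate-one-instance",
            "provider": {
                "type": "python",
                "module": "chaos_lib.instances",   # hypothetical module
                "func": "terminate_random_instance",
                "arguments": {"tag": "checkout", "count": 1},
            },
        }
    ],
    "rollbacks": [],
}

with open("experiment.json", "w") as fh:
    json.dump(experiment, fh, indent=2)
```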
Perfect. Cool.
I think it's always interesting to...
The reason I was asking,
people that want to go into chaos engineering,
they probably also want to figure out
how do I inflict chaos?
Are there tools out there?
Are there frameworks out there?
And that's why I wanted to ask the question.
On the proxy level, there is a very beautiful tool called Toxiproxy
that has been released by Shopify.
And that's like a proxy-based kind of chaos tool,
which you can put that proxy between your application
and, for example, a database or Redis or Memcached, and it injects some of what they call toxics: adding some latency, or doing some errors, dropping packets, let's say, 40% of the time, and things
like this.
It's very interesting.
And then, of course, you have the old-school Linux tools, right? WRT, or of course corrupting the iptables as well, and things
like this.
Or the very old fashioned pull the plug.
Yeah, yeah. That's actually how, let's say, Jesse Robbins famously did that at Amazon retail in the early 2000s. He would walk around data centers and pull plugs from servers, and even pull the plug on entire data centers.
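In the spirit of the proxy tricks and old-school Linux tools mentioned above, here is a hedged sketch of adding and removing network latency with tc/netem, driven from Python. It assumes a Linux host with the iproute2 tools installed, root privileges, and an interface named eth0; run something like this only on disposable test machines.

```python
import subprocess

IFACE = "eth0"   # assumption: adjust to the interface under test


def add_latency(delay_ms=200, jitter_ms=50):
    """Delay all egress packets on IFACE using the netem queueing discipline."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{delay_ms}ms", f"{jitter_ms}ms"],
        check=True,
    )


def clear_latency():
    """Remove the netem qdisc, restoring normal networking."""
    subprocess.run(
        ["tc", "qdisc", "del", "dev", IFACE, "root", "netem"],
        check=True,
    )


if __name__ == "__main__":
    add_latency()
    try:
        input("Latency injected; press Enter to restore the network...")
    finally:
        clear_latency()
```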
One question about these tools then,
because the next part of this is all verifying, right?
And you mentioned some metrics
and these aren't host or machine type of metrics; these are human metrics, things like time to detect, time to notification. Do any of these tools that you mentioned have the ability to either pick up on, like, maybe time to notification, or allow for human entry to say, okay, we detected it at X time? Basically, do they track these reaction metrics that you talk about?
Do you know if any of those tools have anything built in or is that something you'd be running
on, you know, keeping track of on your own?
And then let's talk about what those are as well.
Right.
So if you look at the chaos toolkit, since it's API based,
you can query kind of all your system if it supports that to kind of add in the chaos
engineering experiment report. So every time you run an experiment, it will print a report that you can then analyze. As for the Gremlin, of course, it's more agent-based.
So there's no kind of complete report like this.
But I mean, neither Chaos Toolkit nor Gremlin type of things
have a full reporting that would satisfy
let's say, the most careful team.
If you want to write a COE, you're going to have
to do a lot of human work
in figuring out all these
different things like time to detect,
notification, escalation, public notification,
self-healing, recovery until all clear and stable, right?
So there's nothing yet that really covers everything.
So there's definitely space for a competitor
if you want to build your own company, right?
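Since, as noted, no tool tracks these "human" reaction metrics for you, here is a minimal sketch of recording them by hand during a game day and turning the timestamps into the durations a COE-style write-up needs. The field names are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class GameDayTimeline:
    """Timestamps recorded by a human scribe during a chaos experiment."""
    injected_at: datetime
    detected_at: Optional[datetime] = None      # first alert fired
    notified_at: Optional[datetime] = None      # on-call actually paged
    escalated_at: Optional[datetime] = None
    recovered_at: Optional[datetime] = None     # steady state restored
    all_clear_at: Optional[datetime] = None

    def durations_seconds(self) -> Dict[str, float]:
        """Time-to-X metrics, relative to the moment chaos was injected."""
        out = {}
        for name in ("detected_at", "notified_at", "escalated_at",
                     "recovered_at", "all_clear_at"):
            stamp = getattr(self, name)
            if stamp is not None:
                key = "time_to_" + name.replace("_at", "")
                out[key] = (stamp - self.injected_at).total_seconds()
        return out
```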
I have an idea.
I actually just thought of,
obviously on the Dynatrace side,
we do the automated problem detection,
and we have APIs where external tools can feed events to Dynatrace.
So, for instance, if you set up the hypothesis,
you could tell Dynatrace about the hypothesis,
meaning you can set thresholds on your business metrics,
like order rate or, I don't know, conversion rates.
And you can then also tell Dynatrace,
I don't know, Friday, six o'clock,
we're starting the chaos run.
And in case Dynatrace detects a problem
based on your metric going out of the norm,
it would immediately then open up a problem
and would collect all the evidence,
including the event that you sent.
So I think in a way,
we could actually measure a lot of these things.
And by integrating it with these chaos tools,
we would even allow you to automatically
set up your hypotheses
and tell Dynatrace about when you started the test run,
when this test run was ended,
and then Dynatrace can tell you
when was the problem detected,
when were the notifications sent out,
and when was the problem gone,
when did the problem go away.
So I think we need to...
It's a great idea.
Actually, the Chaos Toolkit supports extensions currently, for OpenTracing.
So if it's something that is going to be used at Dynatrace,
it's definitely something you should look at.
It supports Prometheus probes and Monarch, CLI, InstaNOW.
There's a bunch of extensions.
So I think you guys should definitely write an extension for the Chaos Toolkit to actually send the, what's it called, the Chaos Toolkit report to Dynatrace, to add visibility to the whole thing. Definitely, it's a very good idea.
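As a rough sketch of the integration Andy describes, here is what pushing a "chaos experiment started" annotation into Dynatrace could look like. It assumes the Events API v2 ingest endpoint and an API token with event-ingest permission; treat the endpoint, payload fields, and entity selector as assumptions to verify against your Dynatrace version's documentation.

```python
import requests

DT_BASE_URL = "https://YOUR-ENVIRONMENT.live.dynatrace.com"   # placeholder
DT_API_TOKEN = "dt0c01.REPLACE_ME"                            # placeholder


def annotate_chaos_run(title, hypothesis, entity_selector):
    """Send a custom annotation marking the start of a chaos experiment."""
    payload = {
        "eventType": "CUSTOM_ANNOTATION",
        "title": title,
        "entitySelector": entity_selector,        # e.g. a tagged service
        "properties": {
            "hypothesis": hypothesis,
            "source": "chaos-experiment",
        },
    }
    response = requests.post(
        f"{DT_BASE_URL}/api/v2/events/ingest",
        headers={"Authorization": f"Api-Token {DT_API_TOKEN}"},
        json=payload,
        timeout=5,
    )
    response.raise_for_status()
    return response.json()


# Hypothetical usage:
# annotate_chaos_run(
#     title="Chaos run: kill one checkout instance",
#     hypothesis="Order rate stays within 10% of steady state",
#     entity_selector='type(SERVICE),tag("app:checkout")',
# )
```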
Yeah. Cool. So learn and verify, um,
what else is there to know? Learn, verify, and then I guess also optimize and learn and fix things, right?
Well, you know, I think the big part of verifying is also the postmortem, right?
I think you should always go through the postmortem part.
And in my opinion, it's one of the interesting things, because you're going to deep dive on the reasons for the failures, right, if there are failures. So if your chaos experiment is successful, then good for you, right? You should also write about this. But if it's unsuccessful, and if you've created or resurfaced failures that you didn't think about, that's the post-mortem, and then you have to deep dive really, really well on the topic
and figuring out what happened, the timeline of it,
what was the impact, and why did the failure happen.
This is the moment where you kind of want to go to the root cause.
I know root cause is something that is very, very difficult to get
because in a failure,
it's never one reason.
It's always a collection of small reasons that get together to create this kind
of big failure, but trying to find as many reasons as possible at different layers, different
levels.
And then, of course, what we learned, right?
And how do you prevent that from happening in the future?
So how are you going to fix it?
And those are really hard to answer.
They sound easy when I talk about it,
but it's very, very difficult to answer carefully all that stuff.
In your blog post, you talk a lot about,
I think you start the blog post actually with comparing chaos engineering with firefighters
because I think Jesse Robbins, he was actually a firefighter,
if I kind of remember this correctly,
and he brought Game Day to Amazon, right?
Yes.
And looking at this, so firefighters, I think you said something like 600 hours of training, and in general 80% of that time is always training, training, training. Is there, and this might be a strange thought now, but is there a training facility for chaos, to actually learn chaos engineering? Is there a demo environment where you have, let's say, a reference architecture of, let's say, a web shop running, and then you can play around with chaos engineering?
Yeah, I think that's called having kids.
But you're totally right. I mean, one very, very good point of chaos engineering, even when not done in production, or in production,
I mean, in any case,
is the fact that you can actually practice recovering from failures, right?
So you inject failure in your system,
and then you let the team handle it the same way as it would be an outage, right?
So you will, you know, practice and practice and develop what the firefighters want to develop,
like the intuition for understanding errors and behaviors and failures in general.
Like, you know, if you see, for example, an NGINX kind of CPU consumption curve, or a concurrent connections curve, and all of a sudden it becomes flat, you build the intuition for what it might be.
You know, definitely the first thing to look at is the Linux security configuration on your instance. You know, max connections is probably too low, all these kinds of things.
And you can't build that intuition if you've never kind of debugged the system
or tried to recover from an outage,
because those are extreme conditions.
And it's exactly that, right?
It's practice, build intuition,
and then make those failures come out.
Very cool.
Is there anything
we missed in the phases? I think we are at fixing it.
Wow.
Who wants to do that?
You've done all the fun right now,
so you have to fix it. And this is, in my opinion,
I'll say something very important here.
Unfortunately, I see a lot of
companies doing
chaos engineering and very brilliant COEs or corrections of error post-mortems, but
then the management don't give them the time to actually fix the problem.
So I actually was with one of these companies a few months ago, and two weeks after the
chaos experiment, which surfaced some
big issues in the infrastructure, they had this real outage in production and were down 16 hours, right? So they could have fixed it before, but they didn't. They didn't stop the features, or they didn't prioritize that, and eventually they paid a bigger price, right? So it's very important to
not just do those chaos experiments, but actually to get the management buy-in
and make sure that when you have something serious,
stop everything else and just fix it.
And that's super important.
But let me ask you,
why fix it if you can just reboot the machine
and make it work again?
Exactly.
You've been using the JVM for some time, right?
Yeah.
It's like reboot Fridays.
Is that what you have as well?
Yeah.
I imagine.
So besides obviously fixing it, as Andy was going on, was there anything, obviously we
want to fix it.
Beyond that, again, I can't recommend enough to read the blog by everybody,
but is there anything
that we didn't cover
that you want to make sure?
Yeah, just the side effects.
I think, you know,
chaos engineering is great
at uncovering failures,
but actually I think
the side effects
on the companies
are even more interesting
and they are mostly cultural.
Like the fact that companies
that start to do chaos
engineering eventually, when successful (I've seen non-successful ones, but most of them are successful), move to what I like to call this kind of non-blaming culture, and move away from pointing fingers to, you know, how do we fix that? And yeah, I think that's a beautiful place to be, right?
For developers, for owners as well.
Because it's a culture that accepts failure,
embraces failures, and kind of wants to fix things
instead of blaming people.
And that's also how we work at AWS and Amazon.
And I really like that.
You know, our COEs and post-mortems are kind of blame-free,
and that's one very, very important part of writing that postmortem.
And, you know, I think it's great, because if you point at someone that is making a mistake, eventually it will come back to you, right?
You will suffer a mistake and be blamed.
And that's never a good place to be.
Yeah.
I think another good side effect too is going back to the hypothesis phase
where people will start thinking probably about what they're going to do
in terms of the hypothesis in mind.
So before they actually implement something,
they'll probably start thinking
more about what its effect might be.
Yeah, exactly. Yeah. You think more about the overall system versus just the part
that you build, right? I think that's an important thing. And of course we didn't mention, but
a very, very good side effect is sleep, right?
You get more sleep because you fix outages
before they happen in production.
So you get a lot more sleep.
Awesome.
Hey, Andy, would you like to summon the Summonerator?
I would love if you summon the Summonerator.
Do it now.
All right.
You've rehearsed that one, right?
No, so, well, Adrian, thanks again for yet another great educational session on a topic.
Thank you.
It's, you know, it's a topic, chaos engineering, that I think is still kind of in its infancy stage when it comes to broader adoption and everybody really understanding what it is.
I really like a quote that I think you took from Adrian Cockcroft, who said,
Chaos engineering is an experiment to ensure that the impact of failures is mitigated, which is also a great way of explaining it.
I really encourage everyone out there, read Adrian's blog. The five phases for me are,
I mean, I think the first two for me are amazing because steady state means you first of all need to work on an architecture that is ready for chaos
and you need to know what the steady state actually is and have a system that is steady. But then I really like your, kind of, I call it experiments now, experiments with the people to figure out what they think should happen when a certain condition comes in, like working on your hypothesis, because you can find and fix a lot of problems already before you really run your chaos engineering experiments. So thanks, thanks for that insight.
And I really hope that you will come back on another show because I'm sure
there's a lot of more things
you can tell us.
I'd love to.
Thanks a lot, Andy and Brian,
for having me once more on the show.
I really enjoyed it.
It was an absolute pleasure.
I would just like to add one thing as well.
For anybody who's listening
who might be in the performance or load side of things,
Andy and I have talked many times,
especially early on in the podcast,
about the idea of leveling up.
I'm listening to this bit about chaos engineering
and all I keep thinking of,
wow, what a great place to level up to.
Or not even level up to.
Let's say you're like,
ah, I'm kind of done what I feel I can do
in the load environment or that whole field.
This is kind of, I feel, a continuation of it. It really boils down
to hypothesis, experiment, analysis. And obviously you have the fixing at the end, which is very much
the same as, you know, doing your performance and load testing. So, and it's a brand new, you know,
chaos engineering is not very widespread yet. Obviously the big players are doing it.
So there's a lot of opportunity out there
to get involved with that.
So definitely if I were still on the other side of the fence,
not being a pre-sales engineer,
I would probably start looking into this a bit more.
Yeah, it's a lovely place.
I mean, people are amazing.
They love sharing.
So I highly recommend everyone
to get involved in the space.
Yeah. Sounds like it's just a sidestep over to a new world of experimentation.
Anyway, Adrian, thanks again, as Andy said.
Thank you.
Welcome back anytime. Anytime you got something new, please come on. It's great stuff you have
here. We will put up your
information. We're going to put a link to this article. So everyone go check out Spreaker
slash Pure Performance or Dynatrace.com slash Pure Performance. You can get the links
to the article. If anybody has any questions, comments, you can tweet them at Pure underscore
DT, or you can send an old-fashioned email to pureperformance@dynatrace.com. Thank you all for listening. Adrian, thank you once again for being on. Andy.
Thank you so much guys. Ciao. Ciao. Andy. Bye. Bye. Thanks. Bye.