PurePerformance - Encore - How to build distributed resilient systems with Adrian Hornsby
Episode Date: August 3, 2020

Adrian Hornsby (@adhorn) has dedicated his last years helping enterprises around the world to build resilient systems. He wrote a great blog series titled "Patterns for Resilient Architectures" and has given numerous talks about this, such as Resiliency and Availability Design Patterns for the Cloud at DevOne in Linz earlier this year. Listen in and learn more about why resiliency starts with humans, why we need to version everything we do, why default timeouts have to be flagged, how to deal with retries and backoffs, and why every distributed architect has to start designing systems that provide different service levels depending on the overall system health state.

Links:
Adrian on Twitter: https://twitter.com/adhorn
Medium Blog Post: https://medium.com/@adhorn/patterns-for-resilient-architecture-part-1-d3b60cd8d2b6
Adrian's DevOne talk: https://www.youtube.com/watch?v=mLg13UmEXlw
DevOne Intro video: https://www.youtube.com/watch?v=MXXTyTc3SPU
Transcript
Hello everybody out there in Pure Performance land. Due to summer vacations and many other
factors, Andy and I do not have a new show for you this week. Never fear though, because we've
already scheduled some new recordings. We'll have them out to you as soon as possible.
In the meantime, we hope you enjoy this encore presentation of
How to Build Distributed Resilient Systems with Adrian Hornsby from August 19th, 2019.
Andy sounds really funny in this one, so enjoy.
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello, everybody, and welcome to another episode of Pure Performance.
My name is Brian Wilson. Bet you didn't know that.
And as always, my guest is, can you guess, Andy Grabner.
Hey, Andy, how are you doing today?
Good, good. Hi, Brian. Thanks for, well, great to be back on actually doing some recordings. I know the audience wouldn't really know. They have no idea.
Yeah, they have no idea. Yeah. They're clueless. They think we do it always on like on the two
week schedule. But it's been, you've been traveling, right? Yes, I went back to New Jersey,
visited some friends and family and got to go to the beach or as we call it, the shore in New Jersey, shore to please. Also found out, here's
a stupid little tidbit. So after, I don't know if you recall Hurricane Sandy, probably a lot of our
listeners overseas never heard of it, but it was this huge hurricane that really did a lot of damage
to the whole eastern coast of the United States. But after that, you know, New Jersey was trying
to rebuild along the coast, and they came up with this saying, Jersey Strong. Like, you know, there's a lot of guidos in New Jersey, like, yeah, Jersey Strong, we're gonna come back strong. And, all right, and, uh, turns out somebody opened a gym, like a workout gym, called Jersey Strong as well. So now it's confusing, because if someone has a Jersey Strong bumper sticker, you don't know if they're taking pride in rebuilding New Jersey or if they're like, yeah, I go to this gym and lift weights, bro.
So that was our – that's my tidbit.
That's what I learned on my trip to New Jersey.
And you've been running around a lot too, haven't you, Andy?
I believe so, yeah.
Well, not running, flying around Europe.
I did a tour through Poland, a couple of meetups, a couple of
conferences, also visited
our lab up in Gdansk,
which was the hottest weekend
of the year with 36 degrees
on the Baltic Sea,
which is kind of warm,
I would say, but fortunately
the water was at least
refreshing still.
And yeah, but we're back now.
And we have...
Sounds like there's been a lot of chaos.
Exactly.
And actually, I think distance-wise,
Gdansk in Poland, in the north of Poland,
is probably not that far away from the hometown of our guest today.
And with this, I actually want to toss it over to Adrian
and he should tell us how the summer is up there in Helsinki.
Hi guys! Hi Andy, hi Brian, how are you?
Good, how are you?
Thanks a lot for having me back. Well, the weather in Finland, since that was the first chaotic question, is indeed chaotic. Last year we had the hottest summer in maybe 50 years, and now we had one of the coldest summers in 50 years. It's pretty weird. Today is nice, actually.
So you did not get the extreme heat wave all the way up there, huh? That's interesting.
No heat waves here, uh, 25 degrees.
Okay, well, yeah, we got, I think we got 40 here in Austria and, uh, 35 up there in the north of Poland. Crazy, crazy stuff.
Hey Adrian, I'm gonna try a quick conversion here, hold on. 95 Fahrenheit to Celsius... man, when are we finally... let's see, so it's about 30, it's going up to about 35 like all week here.
Yeah, wow, it's hot. It's hot, yeah.
Yeah, when are we gonna finally get on to the... yeah, we're never gonna do that, that's socialism. Yeah, I know.
Adrian, coming to chaos, we talked about chaos, Brian mentioned it, I mentioned it. So today's topic, after having you on the first call,
it was really cool to hear about best practices of building modern architectures.
I really love this stuff.
And I used it again, actually, on my trip to Poland.
I did a presentation in Warsaw.
And I gave you credit on the slides that I repurposed for retries, timeouts,
back off, jitter. This was really well received. Awesome. It's great to hear. Yeah, thanks again
for that. It makes me, it's great when you get on stage and people think you're smart
because you told them something new, but it's also great to admit that all the smartness comes from other people
that are all sharing the same spirit of sharing.
Sharing is caring.
So thanks for that.
Thanks for allowing me to share your content.
Yeah, I don't want to take the credit on that either
because I learned it from someone else,
you know, or books or articles.
At the end of the day,
it's all about sharing, as you said,
and teach what we had
problems with six months ago, right? Yeah.
Hey, so today's topic is chaos engineering, and I will definitely post a link to your Medium blog post. I think part one is out and part two is coming. But it's a really great introduction into chaos engineering. And to be quite honest with you, I haven't used the term chaos engineering until recently.
I always just said chaos monkeys or chaos testing.
And I think because I was just so influenced by the first time I learned about introducing chaos,
which was, I believe, Netflix, at least when I read about it. Netflix was the ones that came up with the chaos monkeys, is this correct?
Yes, correct, yeah. You know, it's all part of the resiliency engineering kind of field, right? So I think the term chaos monkeys was really made popular by Netflix and their awesome technology blog that told the story about how they developed those monkeys.
Hey, and so, I mean, again, everybody should read the blog post.
It's amazing the background you give.
But maybe in your words, when you're introducing somebody into chaos engineering, when you are teaching at conferences or just visiting customers and clients and enterprises,
how do you explain chaos monkey or chaos engineering?
Let's call it chaos engineering. How do you explain the benefit of chaos engineering and what it actually is and why people should do it?
That's a very good question, and I'll try to make it short, because I think we could spend hours trying to explain just the discipline of chaos engineering. But in short, what I say is chaos engineering is a sort of scientific experiment where we try to identify failures that we didn't think of, and especially figure out how to mitigate them.
It's very good at taking the unknown unknown out of the equation.
I think traditionally we've used tests to try to test and make sure our code is resilient to known conditions.
Chaos engineering is kind of a different approach.
It's a discipline that really tries to take things that we might not have thought about out of the darkness and bring them out so that we can learn how to mitigate those failures before they happen in production.
And examples: I think in your blog post, you also had this graphic in there where you're talking about different levels or different layers of the architecture where you can basically introduce chaos, right?
So infrastructure is obviously clear.
What else?
Yeah, I mean, you know, it's very similar to resiliency, right?
I think chaos can be introduced at many different layers
from the obvious infrastructure level,
like removing, for example, an instance or a server
and figuring out if the overall system is able to recover from it,
from the network and data layer.
So removing, let's say, a network connection
or degrading the network connection,
adding some latency on the network,
dropping packets or making a black hole for a particular protocol.
That's on the network and data level.
On the application level, you can just simply add the exception,
make the application
throw some exceptions
randomly inside the code and see
how the outside world
or the rest of the system is behaving
around it. But you can also
apply it at the people level.
And I mean, I talk a little
bit about it in the blog.
The first experiment I always do is
trying to identify
the technically sound people in teams, or the kind of semi-gurus,
as I like to call them, and take them out of the equation
and send them home without the laptops and see how the company is behaving.
Because very often you'll realize that information is not spread equally
around team members.
And, well, I call that the bus factor.
If that very technically sound person is not at work or gets hit by a bus,
well, how do we recover from failures?
I like that idea.
I mean, and Brian, I know we always try to joke a little bit, but I think if Adrian would ever come to us, he would definitely not send the two of us home, because there are more technically sound people in our organization.
It's not necessarily the technically sound people, you know. It's very often the connector, it's the person
that knows everyone inside the organization
or in the company that
can act very fast, that
has basically a lot
of history inside the company and
kind of recalls
or is able to recall every
problem that happened
and how they recovered from
it. It's kind of a
walking encyclopedia.
Yeah.
But Andy, I think you can attest for the fact
that I was just on vacation for about a week and a half
and the whole company fell apart, right?
So I am really, really important to the organization.
Before we dive into a lot of this though,
one thing I saw on your blog
that I think is really, really, really important
for people to understand
before even entertaining the idea of chaos testing or
chaos engineering is you can't just start with chaos engineering you have to have uh prerequisites
in your environment you talked about um you know all your prerequisites to chaos building a
resilient system resilient software resilient network so you really to, and correct me if I'm wrong,
but what I was reading it, before you start the chaos engineering part, you have to build in as
much resiliency into all layers of your system as you can possibly think of first. Then you start
seeing what did we miss, right? You don't just say, hey, we just stood up our application in
our infrastructure. Now we're going to throw something at it because of course it's going
to fail and it's going to be catastrophic. So you really want to start from a good place,
right? Exactly. And that's the whole point of chaos engineering, right? It's really,
I would say, an experiment, a scientific experiment where we, you know, test a hypothesis, right? For example, we believe we've built a resilient system, and we spend a lot of time, we say, okay, our system is resilient to database failures. So let's make a hypothesis and say, you know, what happens if my database goes down? And then you take the entire organization, the entire team, from the product owner to the designers to the software developers, and you ask them actually what they think is going to happen.
Very often I ask them not to talk about it,
but to write it on the paper,
just to avoid the kind of mutual agreements,
you know, like everyone comes with a consensus
by talking about it.
So if you write it on the paper in private,
you realize everyone in the team has often very different ideas
of what should happen if a database goes down.
And a good thing to do at that moment is to stop and ask,
how is that possible that we have different understanding
of our specifications?
So then you go back into the specification, because very often people fail to read it carefully, or into improving it. But once you have everyone in the team having a really good understanding of what should happen when the database goes down, then you make an experiment and you
actually want to verify that through that experiment, with all the measures you're doing around it: from,
you know, making sure that the system goes back to the initial state,
we call that the steady state of the application.
And also that it does that in the appropriate time, right?
In the time you thought about.
And you have to verify all this.
Yeah, and I think there's a lot to unravel there.
But even before that, right?
Even before you run that first experiment, um, are there, and Andy, I don't know if you've maybe
seen any or Adrian, are there some best practices? Like I know there's the concept of doing multi
zones. Let's say if we're talking about AWS, you can set up your application in multiple zones.
So this way, if one zone has an issue, you still have your two, like the rule of threes.
Is there, way back, if we go way back in time, Steve Souders wrote the best practices for web design, web performance. Is there, and I know you've written some articles and all,
but is there a good checklist for someone starting out or an organization before they
start these experiments to say, here are some of the starting points that you should set up before you even entertain
your next set of hypotheses, right?
Because obviously if you don't do this basic set, you're going to have catastrophic failures.
Is there something written out yet?
Or is it mostly just things we've been talking about and word of mouth and people's blogs?
Well, it's a very good question, actually. I'll take a note and make sure I can build a list. Actually, that's a very good idea.
But I would say I think the 12
factor app kind of
patterns are
a very good place to start with.
I did put some
lists around and I talk
about resilient architectures
in different blogs.
I think it's really software engineering basics: timeouts, retries, exception handling, these kinds of things.
Make sure that you have redundancy built in, that your system is self-healing
because at the end of the day, you don't want humans to recover.
You want actually the system to recover automatically.
So that's something that is usually not well thought through, or not at every level.
I always say, so if you don't do infrastructure as code and automatic deployment,
maybe don't do chaos or actually don't do chaos because you're going to have big problems.
Your system should
be automatically
deployed
and managed and self-healed
in a way.
Then, of course, on the operation level, you should
have the full observability
and complete monitoring of your system.
If you have
no visibility on what's happening in your application,
there's no way you can conduct a sound experiment
or even verify that that experiment first has been successful
or has negatively impacted the running system, right?
So measure, measure, measure as much as you can.
And of course, you need to have an incident response team, in a way, or practice, so that you know what you should do as soon as an alert comes in, treat the tickets, and be able to have this full incident response.
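Adrian's list of basics (timeouts, retries, exception handling) maps to the retry-with-backoff-and-jitter pattern Andy mentioned earlier in the episode. A minimal Python sketch of that pattern, assuming a hypothetical dependency call and purely illustrative numbers:

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0, timeout=2.0):
    """Call a flaky dependency with a timeout, exponential backoff and full jitter.

    `call` is any function accepting a `timeout` keyword; all constants here
    are illustrative defaults, not recommendations.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call(timeout=timeout)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and let the caller's exception handling take over
            # exponential backoff capped at max_delay, with "full jitter"
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```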
Yep.
Right.
So, and I know Andy is going to want to dive into all this, the phases.
So let's go on.
You mentioned kind of an overview of the experiment, the observations,
the hypothesis and all that.
So Andy, I know you probably have 20,000 questions and things to talk about.
So let's dive into that now.
I mean, again, before I actually dive into,
because you just mentioned in the end, monitoring incident response.
I think in your blog post, you also mentioned that when you are in one of your cases, you inflicted chaos.
And you actually saw that the incident response or the alerting was also impacted by the chaos.
For instance, not being able to, let's say, send Slack messages or send emails or stuff like that. I think there's such a huge variety of chaos there that our hypotheses need to test against.
Yeah, it's very common to build systems that kind of host or power our own response systems. And you see actually, I mean,
there's been really a recent outage
and with Cloudflare, for example,
and they were not able to use the internal tool easily
simply because the tools were too secure.
Kind of, the people hadn't used their maintenance tools long enough or recently enough.
So the system had just removed the credentials.
And when you're in a panic situation,
having systems like this that actually are impacted by your own behavior is hard.
DNS is another one.
Very often, if DNS goes down,
your own tools are not accessible
because they use or might use DNS.
So all these kind of things like this
are usually very important to test as well
while doing the experiment.
That's why you need to start from
or simulate the real outage situation.
So now I got it.
Before going into the phases, because I know you have in the blog post,
you have like five phases that you talk about.
But one question that came up while you were earlier talking about application exceptions,
because Brian and I, we've been kind of live and breathe applications
because we've been, you know, with Dynatrace, especially we monitor applications and services,
but obviously the full stack, but we are very, I think, knowledgeable on applications.
And when you said earlier chaos in applications, like you can, you know, just throw exceptions
that you would normally not throw and see how the application behaves.
Do you then imply there are tools that would, for instance,
use dynamic code injection to force exceptions?
Is that the way it works?
Or do you just, I don't know, change configuration parameters
of the application that you know it will result in exceptions?
There are several types of tools. There are libraries that you can directly use to inject failures. In Python, very often it's using a decorator function that you wrap around the function, and that intercepts the function and then throws some exception.
Actually, I'm building my chaos library toolkit
around using that concept in Python
to inject failures in Lambda functions in Python.
It's being used as well for JavaScript, similar technique.
There are also techniques to have a proxy as well.
So you have proxies between two different systems, and then the proxy kind of hijacks the connection, kind of a man in the middle in a way, and kind of alters the connection: it can add latency, can throw some exceptions, can inject, you know, packet drops or things like this. So there's a lot of different techniques.
There's also one very common technique that Gremlin is using, which is agent-based.
So they have an agent running on an instance or a Docker container that can inject failure locally, all these kind of things.
So they can intercept process level type of things
and just throw exceptions or make it burst or things like this.
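A minimal sketch of the decorator-style injection Adrian describes for Python; this is not his chaos library, just an illustration, and the wrapped business function is hypothetical:

```python
import functools
import random
import time


def inject_failure(exception=RuntimeError("injected failure"), rate=0.2, delay_seconds=0.0):
    """Decorator that randomly raises an exception and/or adds latency.

    `rate` is the probability of injecting the failure on each call;
    the values used here are purely illustrative.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                if delay_seconds:
                    time.sleep(delay_seconds)  # simulate a slow dependency
                raise exception                # simulate a failing dependency
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_failure(rate=0.1)
def get_recommendations(user_id):
    # hypothetical business function wrapped by the chaos decorator
    return ["item-1", "item-2"]
```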
Cool.
So moving on to the phases, if you kind of look at your blog posts,
we already talked about the steady state.
Steady state, basically, I think for me, what it means is: first of all, figure out, is this situation now normal or caused by chaos?
I guess steady state really means a system that is stable
and you know what that state is, right?
Yeah, and it's predictable, as you said.
I think a big mistake people make is usually they use system attribute metrics, the type of CPU, memory or things like this,
and look at this as a way to measure the health
and the sanity of an application.
Actually, a steady state should have nothing to do with this,
or at least not entirely,
and should be a mix of operational metrics and customer experience.
I write in the blog about the way Amazon does that is the number of orders.
And you can easily imagine that CPU on an instance has actually no impact on the number of orders at big scale.
And similarly for Netflix, they use the number of times people press on the play button
globally to define their steady state. They call that the pulse of Netflix, which I think is
beautiful because it's a relation between the customer and the system. If you press several
times, then of course it means the system is not responding the way it should. So it's experiencing
some issues. And similarly, if you can't place an order on Amazon retail page,
it means the system is not working as it's designed.
So it's a very good kind of a steady state.
But it's important to work on that.
It's not easy actually to define it.
And I see a lot of customers having a big trouble
first defining the steady states
or their steady states.
You can have several of them. But for me, I mean, the way big trouble first defining the steady states or their steady states. You can have several of them.
But for me, I mean, the way you explain it to me,
it's kind of like a business metric, right?
Like orders, the pulse.
I don't know if I would apply it to what we do at Dynatrace,
the number of trial signups
or the number of agents that send data to Dynatrace.
I have a hard time believing, but I'm sure you're right,
but I have a hard time believing that companies have a hard time
defining that steady state metric,
because if they don't know what their steady state metric is,
they don't even know whether their business is doing all right
or is currently impacted.
Is that really the case?
I guess you'd be surprised.
It's not that they don't know what the business is doing, that's something else; the business might be doing several things. And, you know, I think usually the business metrics or things like this are used, let's say, by higher-level monitoring that might go back to the managers or the C-levels, versus what we want is actually that kind of metric to go down to the engineering team in a very distilled way, very easily accessible.
And okay, now that makes a lot of sense.
And I have another question, though.
So you said CPU is typically not a metric
because it doesn't matter how many,
what CPU usage you have
as long as the system gets back to steady state.
But do you still include these metrics?
The reason why I'm asking is what if you are,
what if you are having a, let's say, a constant, like if you look at the orders, right?
If, let's say, 10,000 orders per second is coming in,
and you know you're using X of CPUs in a steady-state environment,
then you're bringing chaos into the system.
The system recovers, goes back to the 10,000 orders,
but all of a sudden you have, I don't know, twice as many CPUs.
Isn't that, shouldn't you still include at least some of the metrics
across the stack to validate not only that you are,
that the system itself is back in a steady state,
but the supporting infrastructure is at least kind of going back to normal?
Or is this something you would just not do at all because it is not the focus point?
No, you're totally right. You know, the steady state is kind of the one metric that is important, but you do need all the supporting metrics as well, right? When I say the steady state is the most important one, it doesn't mean you should not include the rest. On the contrary, you should have as many metrics as you can. It doesn't mean you should try
to overdo the things, right? But I think the most essential one, especially, I think, especially
for the cloud is, you know, if you say you have 10,000 orders and your infrastructure
to support 10,000 orders is the one that you are currently using,
you make a chaos experiment and then that infrastructure, or what's needed to support those 10,000 orders, has kind of doubled,
then of course you should raise a big alarm and make sure that this is looked at, of course. Cool.
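To make the steady-state discussion concrete, here is a rough sketch of such a check: a business metric (orders per minute) as the primary signal, plus the supporting infrastructure metric (instance count) Andy asked about. The metric-fetching callables and all thresholds are hypothetical placeholders for whatever monitoring you use:

```python
def check_steady_state(get_orders_per_minute, get_instance_count,
                       expected_orders=10_000, orders_tolerance=0.05,
                       expected_instances=40, instance_tolerance=0.25):
    """Return (ok, details): has the system returned to its steady state?

    The primary signal is the business metric (orders per minute); the
    instance count is a supporting metric that should not silently double
    after an experiment. All numbers here are illustrative.
    """
    orders = get_orders_per_minute()
    instances = get_instance_count()

    orders_ok = abs(orders - expected_orders) <= expected_orders * orders_tolerance
    instances_ok = instances <= expected_instances * (1 + instance_tolerance)

    return orders_ok and instances_ok, {
        "orders_per_minute": orders,
        "instance_count": instances,
        "orders_ok": orders_ok,
        "instances_ok": instances_ok,
    }
```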
All right, so steady state, the first phase of chaos engineering.
What comes next?
So after the steady states, we make the hypothesis.
Once we understand the system, then we need to make the hypothesis. And this is the what if, or rather the what when, because failures are going to happen, right? So what happens when the
recommendation engine stops, for example, or what happened when the database fails or things like
this. So if you're first timer in chaos engineering, yeah, definitely start with a
small hypothesis. Don't tackle the big problems right away.
Build your confidence, build your skills, and then grow from there. But yeah, this is endless possibilities, right?
So what I love to do is usually look at an architecture and kind of look at the critical
systems first.
So you, you know, you look at your kind of APIs and what are the critical systems for each of the APIs,
and then you tackle first the critical systems and see if these are really as resilient as
you expected, or if you can uncover some kind of failure mode that you didn't think of, and things like this. And then you can go into less critical dependencies, but usually the most important thing is to make sure that the critical components are tackled first.
This is also the case with the hypothesis if I just recount or repeat what you
said earlier.
This is also, was it, you explained, you do the exercise,
you go into a room and you bring everybody in the room and then you discuss the hypothesis and let everyone kind of write down what they believe is going to
happen, right?
Exactly.
And this is for me is one of the most important part of that exercise because you want to uncover uh different understandings and and making sure that
why do people have different understanding uh from an hypothesis and usually this will uncover
some problems already like a lack of specifications or a lack of communication or simply people have forgotten what's supposed
to be.
Yeah.
And also, if you think about this, you know, let's take the database as an example.
If you say, what happens if the database is gone?
And maybe one team says, well, my system that is relying on the database is just retrying
it later.
And then the other team that is responsible for the database
maybe said, well, but that's not the intended way.
We thought you could just go back to a, like, I don't know,
a static version of the data, blah, blah, blah.
I think that's, as you say, uncovering a lot of problems
that actually have nothing to do with technology,
but actually have to do with lack of communication
or lack of understanding of the system.
Yeah, and very often what I've noticed,
the big difference is usually between the design team
or UI teams and then the backend teams, right?
Many times the UI team will have not thought about this,
like what happens if your database goes down?
Well, for the technical
team, it might be obvious to move into a
read-only mode and say, okay, we move a request
to the read replicas
of the database, and
eventually, after
maybe a minute, we fail over
the new master. So, for one minute,
your application will be in a state of,
hey, there's no database.
So, what do you do?
Very often people in the UI world won't think about it, because it's not a case they were asked about. Read-only mode is quite weird, actually, to think of.
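A small sketch of the graceful degradation Adrian describes, falling back to a read replica in read-only mode while the primary fails over; the client objects and their methods are hypothetical:

```python
class ServiceDegraded(Exception):
    """Signal the UI layer to show a degraded (read-only) experience."""


class OrdersRepository:
    """Sketch only: `primary` and `replica` are hypothetical database clients
    that expose the same insert/fetch interface."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica
        self.read_only = False

    def place_order(self, order):
        if self.read_only:
            raise ServiceDegraded("orders are temporarily read-only")
        try:
            return self.primary.insert("orders", order)
        except ConnectionError:
            self.read_only = True  # flip to read-only until failover completes
            raise ServiceDegraded("orders are temporarily read-only")

    def get_order(self, order_id):
        # reads keep working from the replica while the primary is gone
        source = self.replica if self.read_only else self.primary
        return source.fetch("orders", order_id)
```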
Hey, Andy, I was also thinking,
when reading through these steps
and also sitting on the hypothesis
and the steady state piece, how this sounds like this would be a wonderful place for performance testers, engineers, whatever, to transition into.
But specifically with this hypothesis phase, I love the idea of making this not just for chaos, but for performance as well.
Let's say there's a new release being built, right?
Gather everybody together and say,
all right, we're going to run a certain load against this.
What do you all think is going to happen?
In a way to make them think,
well, wait, what did I just code and what might happen?
Again, like classic database issue
where suddenly we added four more database queries to a statement.
Well, hey, are you thinking about that? But just even if it's not for performance,
this idea of getting everybody together to say, what do you think is going to happen if X,
whether it be a chaos experiment or performance, some sort of a load test or something else to say,
what do you think is going to happen? I think that sort of communication with people
opens their mind to actually thinking about what they're doing
and what impacts they might have.
I think regardless if it's for chaos,
I think it's just a great practice for organizations to put together.
Yeah, exactly.
And if you think of it, I mean,
executing load or putting load on the system is a form of chaos.
Yes.
Anyway, I love this
hypothesis.
It's great.
So hypothesis is done, then
I guess you run the experiment, right?
Correct.
This is first designing and running
the experiment. And I think here
the most important thing is
blast radius, right? So you have
to
keep the customers in mind, right?
And of course, initially, if it's your first time, you might not want to do that in production and
really do this in test environment and make sure that you are able to control the blast radius
during your experiments. And this is very important to think about. It's like,
you know, how many customers might be affected, what functionality is impaired, or if you are
a multi-site, which location are going to be impacted by this experiment. So, you know,
this is very, very important to think about.
And then it's about running it and identifying also the metrics that you need to measure for that experiment.
So as you mentioned earlier,
you might want to make sure that you control the number of instances
that you need for a particular steady state
and make sure that you return to that same number after the experiment. And this is definitely part of the overall metrics
that we need to check for your application. From a running perspective, I think you mentioned
earlier, you have your Python-based library that can run some experiments.
Did Netflix also release their Chaos Monkey libraries to the world?
Yeah, I mean, the Chaos Monkey was one of the first tools ever released
to do chaos engineering.
That's now part of a continuous delivery tool called Spinnaker, together with the rest of what they call the Simian Army. And, you know, there's a bunch of monkeys. There's the Chaos Monkey, which is the original one, which randomly kills instances. There's the Chaos Gorilla, which impairs an entire availability zone in AWS. And then there's the big one, Chaos Kong, which shuts down an entire AWS region. And they practice that in production, you know, maybe once a month or something like this, while people are watching Netflix.
Just to make sure that their initial design is still valid, right?
Yeah.
I wonder what would happen if AWS shut down an entire region.
Let's not find out.
You'd be surprised, actually.
We do, I mean, We do chaos engineering as well.
I mean, we started very early on,
but we don't do chaos engineering on paying customers,
but we do a different level of chaos engineering.
And Prime Video does that in production, actually.
I'm writing a blog post about that, so it should be out maybe in a month or two, something like this.
Cool. Hey, and then I think you mentioned Gremlin earlier, right? One of the agent-based companies. So that's also cool. Do you have any insights into what they are particularly doing?
As I said, Gremlin is part of the agent-based chaos tools.
So they do a lot of cool things on the per-instance level.
So they can corrupt an instance by making a CPU run wild.
So 100% CPU utilization
and see how your application reacts.
They can take the memory out.
They can add a big file on the drive
to make it run out of disk space.
And there's a lot of different things,
state attacks as well,
like make it terminate or restart or reboot, pause, all these kind of things.
And the UI is beautiful to use.
That's what is cool.
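As a crude illustration of the kind of resource attack described here (a toy, not how Gremlin implements its attacks, and only for systems you are allowed to disturb):

```python
import multiprocessing
import os
import time


def _burn(stop_at):
    # spin until the deadline to keep one core busy
    while time.time() < stop_at:
        pass


def cpu_attack(duration_seconds=60, cores=None):
    """Pin N cores at ~100% CPU for a while, then let the system recover.

    Run this from a __main__ guard; duration and core count are illustrative.
    """
    cores = cores or os.cpu_count()
    stop_at = time.time() + duration_seconds
    workers = [multiprocessing.Process(target=_burn, args=(stop_at,)) for _ in range(cores)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```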
There's another very good framework that I really like, which is Chaos Toolkit. And that's more on the API level.
So it's kind of a framework,
an open framework that you can build extension with.
So there's an AWS extension
and that kind of wraps the AWS CLI.
And then you can also like do API queries
to get the steady state,
do some actions, probe the system and things like this.
And the whole template for the experiment is written in JSON.
And then you can integrate that in your CI/CD pipeline as well. So, I mean, and actually the Chaos Toolkit does integrate with Gremlin as well.
So, I mean, all the tools are really working together in a way.
I think helping people
to make better chaos experiments.
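For a feel of what such an experiment template looks like, here is a rough sketch of a Chaos Toolkit-style definition, written as a Python dict for consistency with the other snippets (the toolkit itself reads this as a JSON file). The URL, tolerance, and action provider are made up for illustration; check the Chaos Toolkit documentation and the extension you use for the exact schema:

```python
experiment = {
    "title": "API stays healthy when one instance is terminated",
    "description": "Hypothesis: terminating a single instance does not break the steady state.",
    "steady-state-hypothesis": {
        "title": "Orders API responds",
        "probes": [
            {
                "type": "probe",
                "name": "orders-endpoint-returns-200",
                "tolerance": 200,
                # hypothetical health endpoint used as the steady-state probe
                "provider": {"type": "http", "url": "https://example.com/orders/health"},
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "terminate-one-instance",
            # hypothetical provider; the module/function depend on the extension you install
            "provider": {"type": "python", "module": "chaosaws.ec2.actions", "func": "terminate_instance"},
        }
    ],
    "rollbacks": [],
}
```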
Perfect. Cool.
Yeah, I think it's always interesting to, the reason I was
asking, you know, people that
want to go into chaos engineering, they probably
also want to figure out, so, how do
I inflict chaos?
Are there tools out there, are there frameworks out there?
And that's why I wanted to ask the question.
On the proxy level, there is a very beautiful tool called Toxiproxy, right?
That has been released by Shopify.
And that's like a proxy-based kind of chaos tool,
which you can put that proxy between your application
and, for example, a database or Redis or Memcached, and this then injects some of what they call toxics, and injects some latency or does some errors, like dropping packets, let's say, 40% of the time
and things like this.
So it's very interesting.
And then, of course, you have the old-school Linux tools, right?
WRT or, of course, corrupting the IP tables as well
and things like this.
Or the very old-fashioned pull-the-plug.
Yeah, that's actually how, let's say,
Jesse Robbins famously did that at Amazon on retail
in the early 2000s.
He would walk around data center
and pull plugs from servers
and even pull plug of entire data centers.
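A toy illustration of the proxy idea mentioned above (far simpler than Toxiproxy itself): a small TCP proxy that rejects a fraction of incoming connections and delays the rest before forwarding to the real dependency. The addresses and numbers are made up:

```python
import random
import socket
import threading
import time

LISTEN_ADDR = ("127.0.0.1", 6380)    # clients connect here instead of the real service
UPSTREAM_ADDR = ("127.0.0.1", 6379)  # the real dependency, e.g. a local Redis
LATENCY_SECONDS = 0.2                # added once per connection
DROP_RATE = 0.4                      # fraction of connections rejected outright


def pipe(src, dst):
    # copy bytes one way until either side closes
    try:
        while data := src.recv(4096):
            dst.sendall(data)
    except OSError:
        pass
    finally:
        src.close()
        dst.close()


def handle(client):
    if random.random() < DROP_RATE:
        client.close()            # simulate a dropped connection
        return
    time.sleep(LATENCY_SECONDS)   # simulate added network latency
    upstream = socket.create_connection(UPSTREAM_ADDR)
    threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
    threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()


def main():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(LISTEN_ADDR)
    server.listen()
    while True:
        client, _ = server.accept()
        threading.Thread(target=handle, args=(client,), daemon=True).start()


if __name__ == "__main__":
    main()
```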
One question about these tools then,
because the next part of this is all verifying, right? And you mentioned some metrics, and these aren't host or machine type of metrics, these are human metrics, things like time to detect, time for notification. Do any of these tools that you mentioned have either the ability to pick up on, like, maybe time to notification, or allow for human entry to say, okay, we detected it at X time, to basically track these reaction metrics that you talk about? Do you know if any of those tools have anything built in, or is that something you'd be keeping track of on your own?
and then let's talk about what those are as well.
Right. So if you look at the Chaos toolkit,
since it's API-based, you can query all your system
if it supports that to add in the Chaos Engineering
experiment report.
So every time you run an experiment,
it will print a report that you
can then analyze. As for the Gremlin, of course, it's more agent-based. So there's no kind
of a complete report like this. But I mean, neither Chaos Toolkit nor Gremlin type of things have a full reporting that would satisfy,
let's say, the most, let's say, careful team.
Like if you want to write a COE,
you're going to have to do a lot of human work
in figuring out all these different things
like time to detect, notification, escalation,
public notification, self-healing,
recovery, until all clear and
stable. There's
nothing yet that really covers
everything.
There's definitely space for
a competitor if you want to build your own company.
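Until the tooling catches up, these reaction metrics can be tracked by hand; a minimal sketch of what that bookkeeping might look like:

```python
import time


class ExperimentTimeline:
    """Rough sketch for hand-tracking the reaction metrics discussed above
    (time to detect, notify, recover) while running an experiment."""

    def __init__(self):
        self.marks = {}

    def mark(self, event):
        # e.g. "injected", "detected", "notified", "all_clear"
        self.marks[event] = time.time()

    def duration(self, start_event, end_event):
        return self.marks[end_event] - self.marks[start_event]


# usage: call timeline.mark("injected") when the failure is injected,
# timeline.mark("detected") when the first alert fires, then
# timeline.duration("injected", "detected") gives time-to-detect in seconds.
```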
I think there's an idea
Well, I actually just thought of, you know, obviously on the Dynatrace side we do the automated problem detection, and we have APIs where external tools can feed events to Dynatrace. So for instance, if you set up the hypothesis, you could tell Dynatrace about the hypothesis, meaning you can set thresholds on your business metrics,
like order rate or conversion rates.
And you can then also tell Dynatrace,
I don't know, Friday, 6 o'clock, we're starting the chaos run.
In case Dynatrace detects a problem based on your metric going out of the norm, it would
immediately then open up a problem and would collect all the evidence, including the event
that you sent.
So I think in a way we could actually measure a lot of these things.
And by integrating it with these chaos tools, we would even allow you to automatically set up your hypotheses and tell Dynatrace when you started the test run and when this test run ended. And then Dynatrace can tell you when the problem was detected, when the notifications were sent out, and when the problem went away. So I think we need to...
It's a great idea.
Actually, the Chaos Toolkit supports an extension currently for OpenTracing.
So if it's something that is
going to be used at
Dynatrace, it's definitely something you should
look at. It supports Prometheus
probes and
CLI, Instana.
There's a bunch of extensions.
So I think you guys should definitely write an extension
for Chaos Toolkit to actually send
what is called the Chaos Toolkit report to Dynatrace
to add visibility to the whole thing.
Definitely. It's a very good idea.
Yeah. Cool.
So learn and verify.
What else is there to know?
Learn, verify, and then I guess also optimize and learn and fix things, right?
Well, you know, I think the big part of verifying is also the postmortem, right?
I think you should always go through the postmortem part.
And in my opinion, it's one of the interesting things
because you're going to deep dive on the reasons of the failures, right?
If there are failures.
So if your chaos experiment is successful, then good for you, right?
You should also write about this.
If it's unsuccessful and if you've created or resurfaced failures
that you didn't think about, that's the post-mortem.
And then you have to deep dive really, really well on the topic
and figuring out what happened, the timeline of it,
what was the impact, and why did the failure happen.
This is the moment where you kind of want to go to the root cause. I know root cause is something that is very, very difficult to get, because a failure is never one reason. It's always a collection of small reasons that come together to create this kind of big failure. But
trying to find as many reasons as possible at different layers, different levels.
And then, of course, what we learned, right?
And how do you prevent that from happening in the future?
So how are you going to fix it?
Those are really hard to answer.
They sound easy when I talk about it, but it's very, very difficult
to answer carefully all that stuff.
In your blog post, you talk a lot about,
I think you start the blog post actually
with comparing chaos engineering with firefighters
because I think Jesse Robbins, he was actually a firefighter,
if I kind of remember this correctly,
and he brought Game Day to Amazon, right?
Yes.
And looking at this, so firefighters,
I think you said something like 600 hours of training,
and in general, 80% of their time is always training, training, training.
Is there, and this might be a strange thought, but is there a training facility for chaos, to actually learn chaos engineering? Is there a demo environment
where you have, let's say, a reference architecture
of a, let's say, web shop running
and then you can play around with chaos engineering?
Andy, I think that's called having kids.
You're totally right.
I mean, one of the very, very good point of chaos engineering,
and even when not done in production or in production,
I mean, in any case,
is the fact that you can actually practice recovering from failures.
So you inject failure in your system,
and then you let the team handle it the same way as it would be an outage, right?
So you will, you know, practice and practice and develop what the firefighters want to develop,
like the intuition for understanding errors and behaviors and failures in general.
Like, you know, if you see, for example, an NGINX kind of CPU consumption curve
or a concurrent connection,
and all of a sudden it becomes flat, you have to build the intuition for what it might be.
You know, definitely the first thing to look
is the Linux security configuration on your instance.
You know, max connection is probably low,
all these kind of things.
And you can't build that intuition
if you've never kind of debugged the system
or tried to recover from an outage,
because those are extreme conditions.
And it's exactly that, right?
It's practice, build intuition,
and then make those failures come out.
Very cool.
Is there anything we missed in the phases?
I think we are...
Fixing it.
Wow.
Who wants to do that?
Yeah, it's like, you know,
you've done all the fun right now,
so you have to fix it.
And this is, in my opinion,
I'll say something very important here is,
unfortunately, I see a lot of companies
doing chaos engineering and very brilliant COEs
or corrections of error post-mortems,
but then the management don't give them the time
to actually fix the problem.
So I actually was with one of these companies
a few months ago, and two
weeks after the chaos experiment, which surfaced some big issues in the infrastructure, they
had this real outage in production and were down 16 hours, right? So they could have fixed it before, but they didn't stop the feature work, or they didn't prioritize that, and eventually they paid a bigger price, right? So it's very important to not just do those chaos experiments, but actually to get the management buy-in and make sure that when you have something serious, you stop everything else and just fix it. That's super important.
But let me ask you, why fix it if you can just reboot the machine and make it work again?
Exactly.
You've been using the JVM for some time, right?
Yeah.
It's like reboot Fridays.
Is that what you have as well, like, at Dynatrace?
Yeah.
Imagine.
So besides obviously fixing it, as Andy was going on,
was there anything, you know, obviously we want to fix it. Beyond that, we kind of can't recommend enough that everybody read the blog, but is there anything that we didn't cover that you want to make sure we mention?
Yeah, there's the side effects. I think, you know, chaos engineering is great at uncovering failures, but actually I think the side effects on the companies are even more interesting.
And they are mostly cultural.
Like the fact that companies that start to do chaos engineering eventually move, when successful (I've seen non-successful ones, but most of them are successful), to what I like to call this non-blaming culture, and move away from pointing fingers to, you know, how do we fix that?
Yeah, I think that's a beautiful place to be, right?
For developers, for owners as well.
Because it's a culture that accepts failure, embraces failures,
and kind of want to fix things instead of blaming people.
And that's also how we work at AWS and Amazon. And
I really like that, you know, our COEs and postmortems are kind of blame-free. And that's
one very, very important part of writing that postmortem. And, you know, I think it's great,
because if you point someone that is making a mistake, eventually it will come back to you, right?
You will suffer a mistake and be blamed.
And that's never a good place to be.
Yeah.
I think another good side effect too is going back to the hypothesis phase
where people will start thinking probably about what they're going to do
in terms of the hypothesis in mind.
So before they actually implement something,
they'll probably start thinking more about what its effect might be.
Yeah, exactly.
You think more about the overall system versus just the part that you build, right?
I think that's an important thing.
And of course, we didn't mention, but very, very good side effect is sleep, right?
You get more sleep
because you fix outages
before they happen in production.
So you get a lot more sleep.
Awesome.
Hey, Andy,
would you like to summon the Summarytor?
I would love if you summon the Summarytor.
Do it now.
All right.
You've rehearsed that one, right?
No, so, well, Adrian, thanks again for yet another great educational session. It's a topic, chaos engineering, that I think is still kind of in its infancy stage when it comes to broader adoption and everybody really understanding what it is.
I really like a quote that I think you took from Adrian Cockcroft, who said,
Chaos engineering is an experiment to ensure that the impact of failures is mitigated,
which is also a great way of explaining it.
I really encourage everyone out there, read Adrian's blog.
The five phases for me are, I mean, I think the first two for me are amazing because steady state means you, first of all, need to work on an architecture that is ready for chaos.
And you need to know what the steady state actually is and have a system that is steady.
But then I really like your kind of, I call it experiments now too, experiment with the people to figure out
what they think should happen
when a certain condition comes in.
Like working on your hypotheses
because you can fix and find a lot of
problems already before
you really run your chaos engineering
experiments.
Thanks for that insight
and I really hope
that
you will come back on another show
because I'm sure there's a lot of more things you can tell us.
I'd love to.
Thanks a lot, Andy and Brian, for having me once more on the show.
I really enjoyed it.
It was an absolute pleasure.
I would just like to add one thing as well.
For anybody who's listening, who might be on the performance or load side of things:
Andy and I have talked many times, especially early on the podcast about the idea of leveling up.
I'm listening to this bit about chaos engineering and all.
I keep thinking, wow, what a great place to level up to, or not even level up to. Let's say you're like, I'm kind of done with what I feel I can do in the load environment or that whole field. This is, I feel, kind of the continuation of it. It really boils down to hypothesis, experiment, analysis, and obviously you have the fixing at the end, which is very much the same as, you know, doing your performance and load testing. And it's brand new, you know, chaos engineering is not very widespread yet.
Obviously, the big players are doing it.
So there's a lot of opportunity out there
to get involved with that.
So definitely, if I were still on the other side of the fence,
not being a pre-sales engineer,
I would probably start looking into this a bit more.
Yeah, and it's a lovely place.
I mean, people are amazing.
They love sharing.
So I highly recommend everyone to get involved in the space.
Sounds like it's just a sidestep over to a new world of experimentation.
Anyway, Adrian, thanks again, as Andy said.
Thank you.
Welcome back anytime.
Anytime you get something new, please come on.
It's great stuff you have here.
We will put up your information.
We're going to put a link to this article,
so everyone go check out Spreaker slash Pure Performance
or Dynatrace.com slash Pure Performance.
You can get the links to the article.
If anybody has any questions, comments,
you can tweet them at Pure underscore DT,
or you can send an old-fashioned email to pureperformance at dynatrace.com.
Thank you all for listening.
Adrian, thank you once again for being on.
Andy.
Thank you so much, guys.
Ciao, ciao, Andy.
Thank you.
Bye-bye.
Thanks.
Bye.