PurePerformance - Why you should look into Chaos Engineering with Ana Medina
Episode Date: November 16, 2020
Daylight savings can bring chaos to systems, such as rogue processes consuming CPU or memory, and therefore impact your critical systems. The question is: how do your systems react to this chaos? How can you test for this? And how can you make your systems more resilient against this chaos? In this episode we talk with Ana Margarita Medina, Chaos Engineer at Gremlin. In her previous job, Ana (@Ana_M_Medina) was a Site Reliability Engineer at Uber, where she helped cope with the "chaos" on New Year's Eve or Halloween. Ana gives us great insights into the discipline of Chaos Engineering, that it's really about running controlled experiments, and that everyone who has an interest in contributing to more resilient systems can get started. Here are the additional links we promised during the recording: Drift into Failure, Chaos Engineering Community, Chaos Engineering and System Resilience in Practice.
https://www.linkedin.com/in/anammedina/
https://twitter.com/Ana_M_Medina
https://eng.uber.com/nye/
https://www.amazon.com/Drift-into-Failure-Sidney-Dekker/dp/1409422216
https://www.gremlin.com/community/
https://www.amazon.com/Chaos-Engineering-System-Resiliency-Practice/dp/1492043869
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always I have with me my wonderful co-host Andy Grabner.
Andy, how are you doing?
I am fantastic, knowing that this is our third attempt to get this recording going and now it's even better than the first and the second attempt.
Might even be our fifth attempt.
And I think, you know what, I just have to call out Ringer on this.
I don't know what the heck's going on, but first of all,
it looks like you're not using any APM, at least in the front end.
So Ringer, give us a call, dynatrace.com, we'll help you out.
But also, Ringer, please, maybe you should listen to this episode,
because this is the kind of stuff you need to be looking at possibly, because I think we just found a great scenario for the subject of our topic today.
Anyhow, since we've lost so much time, Andy, let's go right on into it.
Right on.
I think you're talking about a little chaos that we just experienced.
Yes.
At least on our end, from the end user perspective of using that service.
Chaos engineering is an awesome topic
and we've been bringing this up a couple of times
in the last couple of months.
And I'm very glad that we have a very cool partner
with us today, a talking partner, a guest.
Hola, Ana Margarita.
Talking partner.
Is that a guest? I'm just laughing. A synonym, let me look up a synonym for guest. Oh, talking partner. Okay. So first of all, right, English is not my first language, it's late in the day, and that's what it is. So anyway, just to make sure that this is going to be continuing as it is right now, as fun as it is:
Hola, Ana Margarita Medina.
¿Cómo estás?
Hola, ¿cómo estás?
Muy bien.
Muchas gracias por tenerme hoy.
Perfecto.
Creo que...
Bienvenido.
Bienvenido.
Exactly.
That's why you probably have no clue what we're talking about.
So let's switch back to...
No, I do.
I took Spanish in school.
Come on.
Oh, you did?
But I'm not fluent.
But you had like an elementary conversation right there.
I could follow that one.
That's true.
I took Spanish...
But you didn't...
Go on.
I took Spanish on Duolingo.
So the one thing I always laugh about with colleagues or friends
who went through the U.S. school system of taking Spanish as a language or any foreign languages,
you take the lessons and they play a tape,
and this was back in the tape days,
and the person would be like,
Hola, me llamo Miguel. ¿Cómo estás?
And then you hear someone speak Spanish,
and it's 30 times faster than that.
Like, I can't follow. How am I supposed to learn this?
Anyway, because that's chaotic. There's a very chaotic situation as well. See what I did there?
So, Ana, welcome on the show. Thanks for being our talking partner, quote unquote, our guest.
Ana, maybe you want to introduce yourself?
Yes. Thank you once again for having me. I'm very excited to be here and very excited for the conversation we're about to have.
My name is Ana Margarita Medina.
I'm a senior chaos engineer at Gremlin.
I've been with this company for about two years and seven months,
focusing on helping our customers and creating a chaos engineering community.
So I do a lot of public speaking.
I lead our educational program.
I help out customers to get started and run some experiments for them to understand why
reliability really needs to matter. Prior to doing Gremlin work, I was a site reliability engineer at Uber. That is actually how I got started in chaos engineering. They were a huge microservice shop, and they really had to make sure that everything in their bare metal data centers was going to be up and running for their two largest holidays, New Year's and Halloween. And then every day in between, they also had to make sure that they were up and running across their data centers.
And prior to doing that, I did a lot of different software engineering.
I got started early on in my career with just front end technology.
So I built a lot of JavaScript, HTML, CSS websites, then transitioned over to some back end and somehow ended up doing iOS and Android applications before
getting into systems and in the SRE space. So very excited to share a little bit of all my
knowledge across all the stacks of the space today. Yeah, that's awesome. You have a great
history of stuff going on, but I have to say where you're at now is probably my favorite area. If I wasn't a sales engineer, I'd probably be wanting to go into the chaos engineering realm.
But I don't think I'll ever leave being a sales engineer.
Once you get in, you really can't get out.
It's interesting because I kind of feel the same way.
When I was leaving Uber, I had transitioned from site reliability engineer to software engineer.
And as I was
looking for my next gig, it was like, what company actually attracts me? What does the mission statement of these companies make me want to go do? I didn't want to go to big tech or a lot of the other companies that were recruiting me. And then this company came about. They had gotten started like a year, two years before I joined, and I was like, yes, this makes sense. Chaos engineering, a hundred percent. Build a platform that lets folks kind of understand why this matters and makes it easy for them to get started.
Very cool. Hey, I mean, before we go into the topic, I definitely learned something new already. I never thought that Halloween is the second largest day, or one of the two largest days, of traffic for Uber.
But I guess.
Yeah.
I think it's such an interesting story to tell because it's like everyone is allowed to have a different traffic day.
Like it doesn't have to be the same one for everyone.
And that's kind of when it kind of, I had just gotten started into the space.
So it kind of clicked for me then of like, yeah, this makes sense. We provide a service for people that are out drinking, kids that are like hanging out, or people that don't really want to take their car out, and they're going to be just on the road. It really, really makes sense. And then when you start seeing what the trends were when the company started, leading up to the Halloweens that I saw, it was really interesting, because Halloween, I think, had some of the largest outages that happened prior to them getting started in the reliability engineering space. Because it was like, oh, our infrastructure is literally on fire on this day, and surge is going on, and all of this. And I think it's really interesting to just kind of realize that sometimes Halloween is going to be more busy than New Year's, and some years it was New Year's more busy than Halloween, because it all depended on that weekend. But it's really neat, because you get to realize how you operate systems of such scale that provide so many rides and services and all of this.
It's really, really, really cool learning.
I love being able to look back on that experience and realize how much I grew as an engineer in just the two years that I was there.
Hey, Ana. So for the people that may have not followed what we've been doing together in the past: the two of us, we've done some work around chaos engineering and Dynatrace, or Gremlin and Dynatrace. We did a so-called Performance Clinic, a webinar. So in case people are interested in seeing you, quote unquote, live, or recorded on YouTube, however you want to call it, check out the webinar we did. I thought it was really interesting, where you gave a great overview of chaos engineering, also more on your history, and then how chaos experiments, how this all looks like, how it works, and how people can get started. I really want to focus today on some of the things that may interest people in moving into this direction.
Just as you said, this is an interesting field.
Brian keeps repeating, this is a great field.
I would like to be a chaos engineer if I wouldn't be a sales engineer.
So can you tell us a little bit about chaos engineering, and what it takes to become a chaos engineer, and what chaos engineering actually means? Like a quick 101 on what it is and how to get started.
Yes, totally. So first, anyone can be a chaos engineer. You don't really have to go and change your entire career just to start doing chaos engineering, or start getting into this way of thinking about your systems and your software as you build it. So first, starting off with that 101 discussion, it's very much about defining what chaos engineering is. We look at the term and we think chaotic, and we're just like breaking things or anything like that. But the purpose of chaos engineering is that it gets created as a science, where you then use the scientific method to be thoughtful and plan out experiments that allow for you to understand your system more. The purpose of it is that you want to make them more resilient, more reliable. So the end goal is that
you just want to learn more, you want to do better. So that is like really, really neat. And then when
we start looking at it, a lot of people kind of get turned off by the term chaos engineering just
because of what it could mean. But the practice itself is about planning it out. Like, this is
not about just randomly running a chaos engineer experiment without letting anyone know and bringing down production or anything like that.
You actually want to tell people that you're running experiments.
It's not about being that bad person in your team and just doing this.
And with that, too, is that you also have to think about the experiment that you're running.
One of the largest things is that you can say,
hey, what happens if my entire data center of the West Coast goes down,
but I still have my East Coast data center?
That is like a really large experiment if that is your first experiment.
You want to start out maybe a little bit small.
How about what happens if this one server in this cluster that I have that has 100 servers goes down?
Let's start out really small.
Let's not go and attack all of our fleet.
So what Chaos Engineering has is that it has some terminology.
We have blast radius.
Blast radius is what you're attacking in an experiment. So whether that is one host, 10% of your fleet, one Kubernetes cluster, one pod, one deployment.
And then you also have magnitude. Magnitude is how intense the experiment that you're running is.
So whether you're starting out by increasing CPU, memory, IO, don't inject a hundred percent increase. Go ahead and start
small. Start with just what happens if I spike CPU 10%. And then after you see the results of
that, go ahead and make it 15, 20%. And the same thing goes for all the other attacks, like injecting latency, dropping packets, and such. It's all about science: you start small, you don't just want to go full blast on it. And the other thing is that you always have abort conditions. Abort conditions are those things that are going to cause you to halt that experiment. And I think this is one of those nice topics that we really talked about when we did our Dynatrace-Gremlin webinar, that abort conditions are those things that sometimes
are really geared towards your KPIs or what really matters to the business. So that could be
like your SLA just kind of not really working out, like just breaking it, or maybe your latency spikes
up, or you see a traffic rate drop down, or you see that your customer's error rate just goes up a lot.
Maybe it's HTTP 400, 500 errors on your front end.
So all these little things, you actually have to define them before you start the experiment.
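To make that terminology a little more concrete, here is a minimal sketch in Python of how such a small, planned experiment might be written down before it runs. The attack, metric names, thresholds, and the should_abort helper are hypothetical placeholders for illustration, not Gremlin's API.

```python
# Hypothetical sketch: a small, planned chaos experiment with an explicit
# blast radius, magnitude, and abort conditions. Names and thresholds are
# made up for illustration; this is not a real vendor API.

experiment = {
    "hypothesis": "A 10% CPU spike on one host does not affect checkout latency",
    "blast_radius": {"hosts": ["web-01"]},          # attack one host, not the whole fleet
    "magnitude": {"attack": "cpu", "percent": 10},  # start small, grow later (15%, 20%, ...)
    "duration_seconds": 300,
    "abort_conditions": {                           # tied to KPIs / SLAs that matter to the business
        "p99_latency_ms": 800,
        "error_rate_percent": 2.0,
    },
}

def should_abort(metrics: dict) -> bool:
    """Halt the experiment as soon as any abort condition is breached."""
    limits = experiment["abort_conditions"]
    return (
        metrics.get("p99_latency_ms", 0) > limits["p99_latency_ms"]
        or metrics.get("error_rate_percent", 0) > limits["error_rate_percent"]
    )
```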
And that's why when you're planning out the chaos engineering experiments, you actually end up learning so much that sometimes you don't even have to execute it to realize like, oh, no,
this is actually not going to run. Let me go make my system more reliable. And then I can actually
pick up this conversation again and get ready to execute it. And the other one is that conversation of people thinking that with chaos engineering, you can only learn things if you're running in production. That is a really bad myth.
Like go ahead and do chaos engineering in testing, QA, staging, whatever other pre-prod environments you have around. You're going to learn so much about your systems then. And sometimes what you learn, you can actually apply to your production systems. At Gremlin, we ran a chaos engineering experiment on our monitoring tooling in staging, and we learned that we needed to make our dashboards better and change some other stuff. We changed our staging dashboards, and all of a sudden that same change that we put in there gets applied to our production dashboards. So all of a sudden we're making our production operations a lot more reliable and trustworthy, but we never had to go implement these chaos engineering experiments in production to learn that.
That's a great point because I agree with you, right?
I mean, a lot of people think, well, are you crazy?
You're doing this in production and you're breaking things
and then you're impacting our business.
So I think the first thing is important to know
that you have these abort conditions
in case you really do this in production.
But from a pre-production perspective,
and I brought up this term prior to us starting the recording,
what you were basically explaining is kind of what I, you know, would name test-driven operations, because basically you want to test drive your operations.
You want to figure out,
are you, is the system behaving,
or how does the system behave right now
in a pre-prod environment?
And what would that mean to operations?
You don't want to start in operations first.
You want to first test it out in a pre-prod environment.
And I think that you were doing this with your monitoring tools is actually great
because you want to actually know,
does your monitoring actually give you all the data that you need?
Do you have the right dashboards?
Do the right people get notified and alerted
in case there is a problem? And testing this out pre-prod obviously makes much more sense
because you don't want to realize that, hey, we don't have the data that we need. We didn't
get alerted on an actual problem in production. And then you're scrambling and running around
like crazy, like chaos monkeys. Exactly.
Yes.
Yeah. I think there's a lot of great
points in just that opening, and I gotta agree, the idea of testing your monitoring is really awesome. But also the idea of pushing it pre-prod, of doing any kind of testing pre-production, right? Because if you think about it, at least the way I think, is that everything should be treated like code, no matter what you're doing.
You put it out and you test it and you make sure everything's working as expected before you put it in production.
You find out things that might be completely flawed about your approach before you get to production.
Just because we have fast, nimble systems in production doesn't mean the old cliche of production is my new QA.
Of course there's going to be situations you can't test for real world conditions, right? You can test as well
as you can, but you know, sometimes we see, you know, people put something to production and I'm
like, would you have put a new bit of code to production without checking it first? Or I'm
going to drop a new framework in, let's drop it, you know, treat it that way. The one joke I needed
to make though, because I didn't want to interrupt you, but it was phenomenally well said that chaos engineering is about planned and purposeful, putting bad things out and making bad things happen in a planned and purposeful way.
Because I think if that were not the case, then we could probably call almost every developer a chaos engineer.
Yes.
You actually bring two big points, like treating everything like code.
One of the things that I love saying on that chaos engineering space is that you're testing your people, you're testing your processes, you're testing your configuration.
You're going through the entire cycle of what development and operations is.
So kind of like what Andy said, like test-driven operations, it's the term that should be getting picked up more as an actual industry practice. And for many reasons, like,
I think because of the pandemic, you know,
like reliability is starting to pop up a little bit more,
the systems are being used more and we see the incidents are spiking up a lot more.
A lot of new engineers are coming on call that had never met their team.
They've never learned how to be on call.
So that is something that's really, really interesting.
But the other point that y'all made was of like making sure that we do start off with
monitoring as a good point of chaos
engineering. One of the other resources that we plug when we have those conversations is actually
looking at the site reliability engineering like hierarchy. It's that little triangle and the first
one is monitoring. And then when you start going up that cone, you see incident response, you see
postmortems, you see testing and release procedures, capacity planning, development, product.
So one of the things that we say when we look at that hierarchy, we say, go ahead and start
by validating that your monitoring is set up properly. Then go ahead and run some experiments
to test out your incident response. Go and look at your
postmortems, recreate some of those outages, make sure that if those conditions were to happen
again, you're actually resilient to them. Then go see how your release procedures and testing is
actually going on. Then you can bring in chaos engineering to your capacity planning. That was
actually something that got done at Uber very much for the Halloweens and the New Years. And then really bringing it up to the product of like, how early on can we actually
start doing chaos engineering? One of the neat things at Gremlin is that we also do progressive
delivery. So we use feature flags. So by using feature flags and chaos engineering together,
we see that we have this really amazing value that gets really, really hard to do otherwise, where you now have your new features hidden behind feature flags. So we can turn on that feature flag, and then for those accounts that are in the feature flag, that are testing out a new product or something new that we're launching, we go ahead
and we run chaos engineering experiments on that. And we're able to really nail down that scope and that blast
radius is really small. But when we're launching new features that are consuming more resources,
that are introducing more complexity into our software, that is where you really, really want
to make sure to run them because you're adding things
and now you're about to unleash it to the rest of your servers
and you don't know how that's going to behave
when you're getting new traffic and new users coming in.
So you want to make sure that you're so ready for launch
and so reliable by the time you launch
and that you've already thought about some of these things that can go wrong.
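As a rough sketch of that feature-flag idea, scoping the experiment to only the flagged accounts might look like the Python below. The flag lookup, account IDs, and the latency attack call are hypothetical placeholders, not a specific vendor's API.

```python
# Hypothetical sketch: scope a chaos experiment to the accounts that are
# behind a feature flag, so the blast radius stays small. The flag lookup
# and attack function are placeholders, not a real vendor API.

def is_flag_enabled(account_id: str, flag: str) -> bool:
    # Placeholder: in practice this would ask your feature-flag service.
    enabled_accounts = {"acct-42", "acct-77"}
    return flag == "new-checkout" and account_id in enabled_accounts

def run_latency_attack(account_id: str, extra_latency_ms: int) -> None:
    # Placeholder: in practice this would call your chaos tooling.
    print(f"Injecting {extra_latency_ms}ms latency for {account_id}")

def run_experiment(all_accounts: list[str]) -> None:
    targets = [a for a in all_accounts if is_flag_enabled(a, "new-checkout")]
    for account in targets:                 # blast radius = flagged accounts only
        run_latency_attack(account, extra_latency_ms=100)

run_experiment(["acct-1", "acct-42", "acct-77"])
```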
Very cool.
Very well said.
Yeah.
We both paused. I was like, yeah, yeah. I think there was nothing, it was just like, yeah, makes sense. So, and I'm coming back to my question in the beginning. I know you said you were a site reliability engineer at Uber, and now you're working as a chaos engineer at Gremlin.
Have you, for people that are listening in and say,
yeah, this is really cool, I want to get started,
do you have any recommendations on some literature that you want to look at
or some talks from certain folks that you want to listen to?
Obviously, your talks that you've been doing over the last couple of months and years.
But is there anything, like, you know, basically, is there a great book? Like we always reference when we talk about continuous delivery or DevOps, we always reference a couple of books, whether it's from Jez Humble, whether it's from Gene Kim. Is there anything equivalent on the chaos engineering side?
So we have a pretty good tutorial, like, not tutorial, sorry, we have a really good blog that talks about the history and the principles of chaos engineering. It's a 15-minute read, it's over on gremlin.com slash community. That's usually what I point folks to. There is also a book called Drift into Failure that talks about how do you actually understand your complex systems. Adrian Cockcroft really recommends this one as a starting point, like, how do you really think about your systems' complexity for you to actually understand that the last thing you want to do as an engineer, ever, is to drift into failure. And that really touches base on chaos engineering
very, very much where you're able to be like,
I already saw this issue happen.
I saw this incident happen.
How do I make sure that I take all those learnings
so I don't drift to failure in 10 months,
two days or anything like that?
And in terms of like literature,
I know that Nora Jones and Casey Rosenthal recently wrote a book around chaos engineering.
I haven't gotten a chance to read it, but it talks about how they started with chaos engineering at Netflix and have started to see other companies kind of like pick up on it.
And that's it's a neat thing because a lot of folks kind of say chaos engineering is brand new.
This is a buzzword.
There's a marketing term.
Like, why are we reinventing all these things?
And it's kind of interesting because, yes, there is a little bit of that.
There is a little bit of that new word being created, and it's being used in a way that people are like, wait, I've been doing this before. Like, what do you mean? So we see that chaos engineering gets coined when Netflix has to move over to AWS and really
think about how do they make sure that their systems are scaling globally and they coin
chaos engineering by open sourcing also chaos monkey. So we look at that and we see that as
like where the chaos engineering industry kind of really, really got started. But when we look back, I think six years before that over at Amazon, we have Jesse Robbins that is literally just unplugging data centers and running game days in order to have their engineers really think about what happens if something fails in our systems.
And that was just kind of part of what the culture was there. You really want to be customer focused and you want to prepare for failure, but they never created a term for it. So sometimes you actually might be doing chaos engineering, but the managers are sitting in with their team and doing
tabletop exercises of like, let's pull up our architecture diagram and let's talk about
what happens if this third-party dependency that we have actually has an incident,
or maybe that latency is 200 milliseconds. Would that actually cause any timeouts? Are we actually
testing for that? So sometimes it's about the way that you look at it.
You know what, too?
It's funny.
We went through the same struggle with when you're talking about terms, right?
About chaos, these concepts have been around.
And Andy, I don't know how often you run into it, but I know quite often people go,
oh, we're really interested in observability.
And every time I hear people talking about observability, initially I'd be like, come on, really?
Because it's been around similar chaos like forever.
If we take a look back at what the better APM tools did, right?
We've been doing observability or at least the pillars of observability since almost the beginning.
And suddenly it becomes, I think, very similar,
an open-source DIY project for a lot of companies,
and it became a big thing.
So instead of getting upset when people talk about observability now,
I just really sit back and get really happy because for my whole life in performance testing and stuff,
it's always been trying to get people to pay attention to it.
And what I've come to realize is, no, this is a really good thing because it means people are really starting to take it seriously.
But it's that similar thing where you look back and like, okay, hey, you know, I'm just happy it's taken off.
I think you bring an interesting point.
Like, you just want people to care about it.
Like, call it whatever you want, but please care about this.
And I actually do think that with observability, it's kind of similar, where it's like the dev didn't care.
The dev was like, it runs on my machine.
But that was it.
And I think that kind of happens a lot also in this chaos engineering space where it was like, that was a mentality of the dev.
I'm not going to be on call for my code.
There's somebody else's problem.
I don't have to care for it.
We've seen that we've shifted left.
A lot of folks are in DevOps, site reliability engineering, and we see that people now have to care, like whether
upper management makes them, whether their systems are so complex that they actually need to take
notice, whether they're actually going to networking events and picking up all these
new industry trends or just watching social media, they're starting to realize that, no,
like I could actually still be a developer and be conscious of these things.
Like chaos engineering doesn't have to live within ops or site reliability engineering.
You can have a developer that actually says, no, I actually want to make sure that I don't
have any memory leaks in my code, or I want to make sure that I'm thinking about whether any of these HTTP calls that I'm building into my Java application actually get handled properly.
Like, how do I make sure that I'm actually injecting the proper chaos into the software?
And I think, too, it goes beyond just the wants and desires of people who are the developers and teams like this.
I think businesses are finally starting to understand.
In the past, whether it's problems due to chaos or problems because people weren't testing or doing their observability, there was always the idea, we'll just throw people at the problem and have everybody pull all-nighters until we get this problem fixed.
And that's a heck of a lot easier than revamping our process. But I don't think that works anymore. To come work at your organization, you have people
in a position just like you were, where you said, where do I want to work?
I have a lot of great skill and experience, and I can either go to this place where they're
doing the stuff the old-fashioned way, and I'm going to be pulling all-nighters and pulling
all-weekenders, or I can go to a place that's looking forward.
And so it's becoming also a tool to bring in talent so that they can stay competitive.
Oh, totally.
You nailed down things that really speak to me.
It's like people are starting to realize that the cost of downtime is really expensive.
You may have been someone that didn't calculate it or your business just wasn't sharing.
But we also see that this pandemic really made a lot of people go virtual that didn't really want to be a hundred percent virtual, or online-driven, in the operations of their companies. So I think that is getting more expensive. But then when it comes to pager fatigue and burnout, there was a lot of that that happened at Uber, that really positioned me to leave. I felt really burned out, and a lot of it was that culture of ops and site reliability engineering working late nights and weekends in order to kind of play catch-up.
So I do think that the industry is starting to be a little bit more sensitive to that, like,
oh, these are actually humans that I hired, not little robots, like, y'all need food and sleep
and rest and vacation, hold on, hold on. Like you're having to actually tell engineers,
no, like there's a work-life balance.
And I mean, especially in this pandemic too,
like it's harder when you have kids
and you have like schooling and all this.
And I feel bad for any parent that's listening.
Thank you for taking some time
to take some learning on your free time.
I know this pandemic is really hard for them,
but I think it's like people didn't realize the harm that they were causing in their industries by pushing their engineers to burnout constantly. And we see that a lot in operations. We try to make this
culture a little bit better, I think in site reliability engineering, but those that haven't
gone through that transformation, they still feel that pain.
And with that, it's like talking about it, you know, like burnout happens to everyone. And you should be having those conversations internally in your organizations, whether it's just one person or with your manager, and being able to talk to them, like, no, this application is really paging me a lot. It's a little too much. What can we do to make it better?
And it actually segues to getting started with chaos engineering.
We do say that sometimes you actually do want to look at those applications and high-severity incidents that are going on, like the things that you should run to, and try to figure out how you can actually do chaos engineering on them, because you'll be able to really see the fruits of what chaos engineering can do. So you might just want to go look at what are the departments that are having the most amount of incidents. Maybe there's a service in there that you can actually just focus on for a quarter, maybe even less time, and try to make more reliable.
And then once you make that one more reliable,
you will see that maybe those engineers have less pages
and then they can actually start supporting other teams
and such and like really scaling out such operations.
Yeah, and I took a couple of notes
and I have a couple of questions
that go in kind of two directions now.
And I think I want to start with the first one that says,
are there any parallels of chaos engineering of things that have been done in
other industries?
Meaning if I think back on continuous delivery,
we always draw a lot of analogies to the automobile industry with their,
with the way they optimized and automated the delivery of cars.
We also draw a lot of analogies to lean management
when we talk about DevOps
and kind of streamlining our processes.
From a chaos engineering perspective,
are there any other industries that have done this
maybe many, many years already,
but just, you know, in different industries, in a different way, but still kind of enforcing chaos? And if so, is there anything we can learn from them?
So I know for sure other industries have. I feel like I've not done my research in order to be like, hey, these are all the industries and this is what they learned 20 years ago, but two things kind of come to mind. You mentioned the automobile industry, English is not my first language, so some words are really hard, like that one. And with that, it's like you have to do so much testing and preparing to get that car into anyone's hands. And with that comes safety testing, like,
how does it do with the mileage and gas and all this, but you can kind of think about it when
they do those dummy testings of the airbags and stuff where like, what is the worst thing that
can happen to this car? We're going to go ahead and do it and see what happens when you put all those conditions in place. What do we learn, and how do we make it better? And I think that airbag example is something really concrete for folks to kind of think of: we're going to not have the little dummy person inside the car wear a seat belt, we're going to have the car going 100 miles per hour, we're going to have more weight in the car. What are all these conditions that can happen? Let's just throw it at it. And aviation also, of course, had to go through this. I think aviation is one that the reliability and resilience engineering community continues going back to, as there is a lot to learn, from how pilots get trained to how it is that we learn from the black boxes that are left behind when incidents do happen in aviation.
And I mean, the cost of an incident in an airplane is insanely high, because we're talking about lives, and you can't really put a money sign to that whatsoever. And the other one, that is a little bit touchy due to the current situation, is vaccines, where you're injecting something harmful into your system in order to build immunity. So those are kind of the other parallels that we do see, that bringing chaos into a system is only going to make it better, because you now get to actually see how this system will behave, and whether that's a person, a vaccine, or anything like that, you do get to learn from it.
Yeah. Then a follow-up on this, because I remember when we did the webinar, we had a couple of questions in the end, and I remember one question that was asked. Somebody was saying, well, you're making a lot of artificial assumptions about the chaos that you're enforcing, right? I think somebody said this is not realistic, and why would you ever test something that is completely unrealistic? And I wanted to know, what is your response to that?
Well, it's like, unrealistic in whose mind?
Like, because we're building on software and systems that are so brittle, anything can happen.
And it can be like, yeah, there is no way that 50% of my data center is going to catch on fire.
But we know Murphy's law, we know that everything that can fail will fail. So you have to kind of always go back to that: anything that can go wrong will go wrong. So when you start shifting your mind into that, you actually start learning that you can be like, oh no, wait, I can actually prepare myself for this type of failure. And data centers catching on fire, to me, was something interesting, because it was something where you kind of just assume that you have enough coolers, you have enough people on site that could easily put out a fire. But learning from many companies whose data centers caught on fire, and those engineers, those are some really interesting stories. Or even looking back, no one in 2017 would have thought that us-east-1 for S3 buckets on AWS would go down. That is just something that's like, oh no, they're AWS, they're 100% reliable. Like, no. You kind of have to be pessimistic: anything that can go wrong will go wrong, so how do I make sure that when and if that happens, I am so ready for it, like, bring it on. And that could be making sure you're testing all of your vendors, all your dependencies. What happens if our main cloud provider goes down? Do we have a failover for it? When was the last time we executed it? Things can happen where maybe you also kind of want to know, what happens if your lead SRE goes on vacation for a week? Just make them not respond to emails for a day. What type of chaos can that kind of provide?
Yeah, I remember, Brian,
when we had Adrian Hornsby on,
I think he also talked about
social chaos engineering,
like taking the laptops away from people
or as you said,
don't allow them to answer emails
or something like that, yeah.
I think he said he sends them
on a week vacation when they come in.
But Andy, I got to say, to that argument that you proposed, I think that's a completely irrelevant argument nowadays.
Because anytime someone brings that up, all you got to do is respond and say, hey, 2020.
I mean, for real, it's so ridiculous.
If you were to take 2020, everything that's gone on this year, put it into a book or try to put it into a movie treatment and go to anybody, they would say that's the most outlandish thing.
No one would ever believe that.
This is like the most over-the-top ridiculous, like even more crazy than a Jerry Bruckheimer movie, right?
And it happened.
So when you have an argument, you're like, oh, well, that won't happen.
Dude, come on.
Just watch.
Just give me a year.
That's a very good point.
Hey, the other direction,
the question that I want to make sure you answer.
So we have a lot of listeners,
and I think, Brian, you can agree,
we have a lot of listeners
that have a background in performance engineering.
So I assume a lot of our listeners
do some type of performance testing
as part of the day job.
And if they are now interested in, hey, this chaos engineering is actually pretty cool.
Can we give them some ideas on while they're still mainly focusing on their regular job, performance engineering, running performance tests, performance analysis?
Are there any, let's say, entry-level chaos experiments, something where people can say, hey, you know what, next time I run a performance test I am trying out this little trick here that I learned from this podcast? Like, I don't know, you tell me. So what is it, what can people start doing?
So the hello world of chaos engineering, to me, is just injecting a CPU spike into your system, because it's something that's easy to think about.
It's easy to do.
Like you can use a tool to do it.
You can just build a little script
that like increases the CPU.
And that could actually show
some interesting things about your system.
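For anyone who wants to try that hello world without any tooling, a little CPU-burn script could look like the sketch below. The worker count and duration are arbitrary assumptions; run it only on a machine you own and can safely disturb.

```python
# Minimal sketch of the "hello world" CPU spike: busy-loop worker processes
# for a fixed duration, so you can watch what your dashboards and alerts do.
import multiprocessing
import time

def burn_cpu(duration_seconds: int) -> None:
    """Spin in a tight loop to keep one core busy for the given duration."""
    end = time.time() + duration_seconds
    while time.time() < end:
        pass  # busy-wait on purpose

if __name__ == "__main__":
    workers = 2       # spike only a couple of cores, not the whole machine
    duration = 60     # one minute is plenty for a first experiment
    procs = [multiprocessing.Process(target=burn_cpu, args=(duration,)) for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print("CPU spike finished - now check what your monitoring and alerts did.")
```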
Some other folks use
the shutting down a server,
shutting down an instance
as the other hello world.
How is it that maybe
you have an application
that's running across a cluster,
just various different servers,
if you were to lose one of those nodes,
one of those hosts,
when that server goes down,
like you have less resources.
So does a new host come up in order to offload that?
Or does this application now have less capacity to run on,
but all of a sudden you still have the same heavy application on it? So sometimes that is one of
them. But there's also that part that you kind of get to do any chaos engineering experiment
alongside a load test. With Uber, they actually did that, like leading up to that Halloween,
they did just huge amounts of training
with the team that was going to be on call
for all that Halloween weekend.
And they were running like failovers three times a week,
like full, like from one data center to the other,
completely moving all of our applications.
And what they were doing is that they had,
we had built an internal tool
for chaos engineering called UDestroy.
And we had also built another internal tool
for load testing, Hailstorm.
So by using Hailstorm and UDestroy together,
they were like really trying to think about
how everything kind of happens.
And that's like another place
that you kind of get to put those two together.
But I would definitely say that
if you're just trying to really get started,
a CPU experiment could be sometimes
like the fastest way to think about it.
And if anyone wants something a little bit more advanced,
you could always do that.
Like what happens if I just lose access
to this dependency that this application uses?
Is it that every single call that I have in my application keeps trying to reach the server that is not responsive?
How do all those logs kind of like ramp up into my system that can make that performance really, really slow down and really have a bad customer impact?
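As a tiny illustration of what that kind of experiment tends to expose, the difference often comes down to whether calls to the unresponsive dependency are bounded at all. The URL and timeout below are made-up placeholders.

```python
# Hypothetical sketch: what a "lost dependency" experiment usually exposes is
# whether calls to it are bounded. Without a timeout, every request hangs and
# piles up; with one, you fail fast and can degrade gracefully.
import requests

def fetch_recommendations(user_id: str) -> list:
    try:
        resp = requests.get(
            f"https://recs.internal.example/users/{user_id}",
            timeout=2,            # bound the wait on the dependency
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        return []                 # fall back instead of hanging the caller
```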
Hey, I want to ask you a question about that then.
So a friend of our show, and I don't want to say co-founder,
but inspiration for our show, Mark Tomlinson,
we've had conversations about testing things like microservices
and testing scaling systems.
And it almost sounds like the testing of a scaling system
would fall under the
realm of chaos engineering. But, you know, you're an expert in this area,
so I wanted to run this by you. So let's say you're doing some performance tests. The idea
being if you're running on a scaling system, let's say you have Kubernetes set up or something
else where if our CPU hits 70%, scale up another instance. So the concept is, besides just testing performance,
test your scaling.
Make sure that's working and doing that.
Now, does that fall under just another hello world example?
Would something as simple as that even fall under,
hey, I'm starting to expand into chaos testing
by pushing my system non-traditionally?
In a performance test, you're usually doing a set load.
But here I'm going to run my test. So it pushes the system into doing its scaling and seeing if it actually
is doing what it's supposed to do. Oh my God. Yes. So much. Yes.
Because that's a really simple one, I think, right? For any performance tester,
we love just throwing load and trying to break a system.
Yeah. And that is the perfect moment to start on it, because that's what I mean. CPU, to me, is something that anyone can kind of really easily think about, whether it's your own server at home or your personal laptop, it's something usable. And I do a lot of work in the cloud native space, specifically Kubernetes. Last year I got a chance to speak at KubeCon, and we sat down, and there are websites that have all these Kubernetes failure stories.
We sat down and we read postmortems and we were just kind of like giving points and like, was this a DNS outage?
Was this a latency, like a scaling issue, a cloud provider issue?
We ended up seeing that like 50% of those incidents were scaling issues.
They did not set up auto scaling,
whether it was horizontal or vertical or on the cloud.
They just didn't even do that.
And the thing too is that nobody really thinks
that you need to do that
because you kind of just buy into the promises of platforms
such as Kubernetes of like,
everything's going to be up and running.
This thing scales.
That's just kind of like what the pitch is.
And you don't really kind of ask questions like, what does it mean by scale? Do I need to set this up? Do I need to give it resource limits? Do I actually need to make sure that it scales properly, and that it's scaling at the proper pace? Because sometimes, I've done chaos engineering on all sorts of cloud providers and monitoring tools.
And that interesting one is like, all right, I set up my cluster.
It's going to auto scale when CPU hits 60% so that when it actually has more traffic coming onto it, we have enough time for the server to ramp up.
Well, how long does it take AWS to spin up that EC2 instance?
How long does it take your primary Kubernetes node
to pick up that new instance
and bring it into the cluster?
How long does it take your monitoring tool
to detect this new instance and be like,
hey, you, you're reporting.
Let me bring you in.
Like all these things take so much time.
And that's kind of when we really do talk
about that complexity of our systems now.
It just gets larger and larger, and you just have abstraction layers around it, and you need to go back and really think about what does it all take?
What are all these granular steps that get me to my end goal of my system being reliable?
So yes, go ahead and try to figure out where is that breaking point for your system to make sure
that things are scaling properly, and make sure it releases once it's done.
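To tie that back to the scaling question, here is a rough sketch of how a performance tester might watch those granular steps while a load test runs elsewhere. The deployment name, namespace, and polling interval are hypothetical, and the script simply shells out to kubectl.

```python
# Hypothetical sketch: while a load test runs elsewhere, poll how many ready
# replicas a deployment has, so you can see how long autoscaling really takes
# (instance spin-up, node join, monitoring pick-up, and so on).
import json
import subprocess
import time

def ready_replicas(deployment: str, namespace: str) -> int:
    out = subprocess.run(
        ["kubectl", "get", "deployment", deployment, "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    status = json.loads(out.stdout).get("status", {})
    return status.get("readyReplicas", 0)

if __name__ == "__main__":
    start = time.time()
    for _ in range(60):                      # observe for roughly ten minutes
        count = ready_replicas("checkout", "prod-like")
        print(f"{time.time() - start:6.0f}s  ready replicas: {count}")
        time.sleep(10)
```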
Yeah. Brian, I'm not sure how you feel about this, but every time we're making it through these episodes, it's just fascinating to learn so many new cool things. It's really cool.
You know, I like to think of it, Andy,
I was thinking about it as we were starting the show,
one of the reasons why I think I love the idea of chaos testing so much, or chaos engineering,
is, you know, in the performance side of the world,
we're trying to simulate a load to break a system.
Well, when you go into the chaos side,
you're trying to simulate a breakage
instead of the load, right, under the load. And so it's very much related. But we love, at least I used to love, breaking the system. I wouldn't purposely go out and do it, but when I would get a decent simulation and things would break, that's when I would get excited, because now we have something to do. And the idea of chaos engineering is like, we're going to design a breakage. I don't know, that's really awesome, it sounds so fun.
You know, by the way, I did some research while we were talking on the different links that we mentioned over the course of this podcast, like the Drift into Failure book, the Gremlin community page, the Chaos Engineering book from Nora Jones, and then also I found the Uber blog on New Year's Eve and Halloween. So that's really cool. I will make sure to post them in the proceedings of the podcast. Ana, is there anything else? I know, as with many topics that we're all passionate about, we could probably talk forever, but kind of getting to a closing of this episode, is there anything
else you want to make sure our listeners know if they want to get into this field of chaos
engineering, or if they want to have a conversation maybe with the leadership about the importance
of chaos engineering?
Is there anything else we want to add at the end of this episode?
I think I just definitely want to add very much on that.
It's really easy to get started. It's really not intimidating. And if you're still trying to figure
out how to pitch it to your manager, to your upper leadership, it's that you're just preparing for
those moments that matter. You want to make sure that you're reliable for your Halloween, for your
New Year's, like whatever that peak traffic event
is, or just for day-to-day operations too, especially in this virtual world of the pandemic
that we have for who knows how much longer, you want to make sure that things are kind of like
running smooth for them. And it's like not that hard to get started. Like I do work for a vendor,
so we do have a tool that does make it simple.
But you can easily get started in chaos engineering just by running some tabletop
exercises with your team. So you don't need to do this re-engineering, or buy things, or bring some tooling in. You can actually start having those conversations of
what happens to my systems, if my server goes down, if I inject
CPU here, if my memory spikes up here, maybe I forget to rotate my logs and my disk fills up,
what happens? All these little hypothetical resource layer, network layer conversations
you can have without implementing the experiment and you still learn so much. Of course, you want to go ahead and start
that experiment in your pre-testing environments and get to the point that you have them automated
in production at the end of the day. So it is a little bit of a really long road, but that's kind
of like what reliability is about. It's not just press a button and all of a sudden you have 99.999 availability. You had to go through many other nines before.
You bring up a good point too. And I think that the idea of the conversation about it, not just with the team, but from what I've heard and learned from some of the people who've been on talking about chaos engineering in the past, is that part of the exercise, whenever you're going to do a chaos experiment, is before you do anything,
first, obviously, you outline what you're going to break,
but then everybody has to hypothesize what they think is going to happen,
what they expect to happen.
I'm just curious if that's the way you see it as well,
but the whole discussion part is a huge part of it,
is this idea of it's a scientific experiment. We're going to, we're going to do something, we're going
to hypothesize on what's going to happen, what we expect to happen based on our designs, and then
test it to make sure that it fulfills that. Yeah. And this is like, it's such a cool space to be in
because of that. It's not that hard to actually just get started talking about hypotheses. And then you start realizing that everyone just has a different mental
model of a system and everyone's hypothesis is going to be different.
Whether it's because you've been at that organization five years,
because you're a new grad,
because you've used this technology before,
like you're bringing in different knowledge and you now get to share it among
each other.
And that's going to have an interesting conversation.
You're sharing knowledge.
You're building up your team.
But then you also up-level any new person that's in that organization, whether it's an intern, a new grad, someone returning to the workforce that doesn't know what cloud native is.
You're really getting a chance to teach them about your organization, your systems, technology, and really kind of really get started back into it all.
So I think it's really valuable.
And I think at the end of the day, like one of the largest things that chaos engineering
is about is about learning.
And you're already having incidents and you're not learning from them sometimes.
So how is it that you're spending all this money already?
You already invested in this incident.
This downtime was expensive.
Why don't you go ahead and learn from it
by doing little portions of dissecting that post-mortem,
looking at the conditions that happened,
running some tabletop exercises around that,
and eventually get to the point of recreating that incident
with those conditions and see what happens to your system.
Very well said.
Awesome, cool. Hey, you know, we should probably schedule another podcast with you in a couple of months and talk, you know, once this pandemic is over, hopefully, maybe at one point after all this chaos is over, we can actually get together and meet in person
and then have another conversation that we record on chaos engineering.
But normally I'll do a little recap on what I learned.
We call it the, you know, I'm summarizing what I learned.
There's so many things I learned today.
It's hard to summarize.
But still, Brian, do you want to summon the Summarator?
I have a couple of things to say.
Do it now.
Yes, summon the Summarator.
We usually put an Arnold Schwarzenegger quote in there.
Because the whole thing is, you know, Andy's Austrian, Arnold's Austrian.
The Austrian-English accent sounds very much like Arnold Schwarzenegger.
So it just became the Summarator instead of the Terminator.
For anybody who's wondering about the history and the etymology of the term, there you all go.
It'll be a quiz next week.
So a couple of things that I want to repeat or summarize.
I think it's very important to learn about our systems.
Every system is different, so we want to understand
how our systems behave,
how they expect to behave,
but also potentially
what can go wrong.
And I think that's a great opportunity
to figure out
how do the systems actually behave
if things don't go as planned.
I also like what you said
in the beginning,
start small, go big,
meaning start with a small experiment,
maybe just turning off a host
or turning off a service instance
or even just a little CPU spike.
I also liked the terminology
that you kind of gave us an overview
of the blast radius.
So what's kind of the impact of your experiments
and the magnitude of the experiment.
Also very important: when you run experiments, you always have the ability to abort an experiment, so you need the abort conditions. I think we also learned that everyone can think about chaos engineering in their own line of work right now, like we mentioned the performance engineers that may want to just, as you said, run the test, but then maybe add a little CPU spike and see how the system behaves.
I think you did a great job
in giving a lot of reference material.
So we'll definitely make sure
that we have all of these links available.
And it seems there's a couple
of very easy Hello World examples
that everybody can try out.
I know there's more advanced use cases
that are out there that are especially brought in by tools like Gremlin that you can then use
and run your experiments on. I also think that it's important that we think about chaos
engineering holistically. It's not just a technology thing,
like enforcing chaos on technology,
but also on the humans.
I think that's also very important, what I learned.
And yeah, after all, I'm just very excited about this topic.
It's really cool.
And as Brian said, right, if you would have the chance, there would definitely be a cool future job to be in.
Yeah, yes, definitely continue learning. It's a good skill to have. And I think any form of making sure that you're always thinking about what is the worst thing that can happen, in life, in the year, at work, in my system, you should definitely try to pick that up as a good exercise. Awesome.
Well, Ana, thank you very much for being on.
We really love this topic, as you can see.
And you just have so much experience and great knowledge about this.
So we really appreciate you sharing that with us and to our listeners.
Also, thank you to our listeners for finding the time, what's coming up to about an hour, to listen to this.
I know it's probably a lot harder to find time now that no one's commuting,
but we really appreciate you being our audience
and hope you're all doing well.
Andy, thanks again for helping make this possible.
I really appreciate you as well.
And I say that with a little sarcasm, but I really mean it.
And yeah, if anybody has any other questions, comments,
you can reach out at pure underscore DT on Twitter,
or you can send us an old-fashioned email
at pure underscore performance, right?
At dynatrace.com.
I think it's pure performance, one word, yeah.
Well, people can try it out.
Let's enforce some chaos on our mail server.
Try it and see what happens.
See if you get a bounce. Ana, do you have any place you'd like people to follow you, LinkedIn or Twitter or anything?
Yes, I'm available on all social media. You can just Google Ana Margarita Medina and you'll find a lot of my stuff. You can find me on Twitter easily as Ana underscore M underscore Medina. Twitter is probably the best way to get a hold of me. So just reach out there.
And if you're interested in learning more about chaos engineering, shoot a message. I'm always
happy to send out more resources, jump on a call and share a little bit more about my experience.
Gremlin does have a freemium model. So if you're interested in getting a quick start within chaos engineering, go ahead and check
Gremlin out and always around for any questions too. So whether it's about getting started in
site reliability engineering, transitioning from somewhere else in the software engineering space
to operations and chaos engineering, feel free to shoot a message out. And I would like to thank Brian and Andy for having me today.
Super fun conversation.
Amidst the little chaos that 2020 brought us
and our little platform issues earlier.
It's part of the fun of getting to be in the space.
Excellent. Thank you.
And be sure to check out, we'll have a bunch of those links
down in the show description on the Spreaker page.
So make sure to check those out.
Or as Ana said, reach out to her directly.
Thank you again so much, Ana.
Thanks for taking the time today.
Muchas gracias.
Adios.