PurePerformance - Chaos Engineering Stories that could have prevented a global pandemic
Episode Date: January 25, 2021
Nobody foresaw the global pandemic that put a lot of chaos into all our lives recently. Let's just hope we learn from 2020 to better prepare for what might be next. The same preparation and learning also goes for chaos in the distributed systems that power our digital lives. And to learn from those stories and better prepare for common resiliency issues, we brought back Ana Medina (@ana_m_medina), Chaos Engineer at Gremlin. As a follow-up to our previous podcast with Ana, she now shares several stories from her chaos engineering engagements across different industries such as finance, eCommerce, and travel. Definitely worth listening in, as chaos engineering was also put into the top five technologies to look into in 2021 by the CNCF.
https://twitter.com/Ana_M_Medina
https://www.spreaker.com/user/pureperformance/why-you-should-look-into-chaos-engineeri
https://twitter.com/CloudNativeFdn/status/1329863326428499971
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have with me my harbinger of good and bad things and mediocre things and anything just period to come.
Andy Grabner, did you know that you're the generic all-purpose harbinger of the year 2021?
I have no idea what a harbinger is. What is a harbinger? Like a... um, great, now put me on the spot. I mean, I know the word, now I gotta define it formally. Harbinger... it's usually someone who, uh, like someone who brings things on, like a bringer of bad news, or like, uh...
I mean, this is where everyone listening is going to be like, what's wrong with this guy?
You know, not quite the same as an omen, but a harbinger is a person or an event or something, right?
Okay.
That would indicate what is going to be happening.
Okay. Sort of. So like a... like a moderator, like a bringer or whatever.
Yeah, it's okay. Maybe.
Well, then say bringer and don't use these strange words that non-native speakers have no clue about.
But I also learned something new.
I don't think that's an English word, though.
It's probably not even an English word.
How do I know?
You know what?
Let's skip the subject of the entire episode and just discuss the etymology of…
Yeah, let's bring somebody in, right?
If we have a bringer, if I'm the bringer, I'll bring somebody in.
But that would throw a monkey wrench into my plans. That's not what I've prepared for. I don't know what will happen if we bring somebody in.
Andy, what do you mean? Like, you feel like we're in a chaotic situation now?
Yeah, suddenly. What are we gonna do? I don't know. We probably need to learn how to react to chaotic situations, so in case they come up again in the future, we don't freak out, but do everything gracefully, and everything stays stable and healthy.
And now we start.
You're such a harbinger of good ideas.
I see.
All right.
Let's cut to the chase.
We obviously talk about chaotic stories and situations.
It was a chaotic year, 2020, that we left behind.
Obviously, at the time of the recording, we still have two weeks to go,
so we don't know what's going to happen.
Assuming everything is fine, it is 2021.
It's going to be a bright future for us.
But we had a chaotic year behind us,
and hopefully we all learn from chaotic situations how we dealt with it
so that we make it better.
And therefore, we have the person that we had on a recent podcast with us again.
Ana, hola, como estas?
Hola, really excited to be back.
Thanks for bringing me back.
Now I'm utterly scared for the next two weeks of 2020.
Let me just say now I'm like, where can I knock on wood to make sure that my next two weeks go really well and my year doesn't get more chaotic than
everything we've seen so far.
Well, you know what you should do in 2020? You just self-quarantine and don't go out of your flat, so everything will be good. You have to also unplug the internet.
I like that idea. I really like being in my house, disconnected from the world for two more weeks, and making sure that I don't blow anything up and that my life doesn't get more chaotic.
Just don't take that approach by going to some cabin in the woods and cutting yourself off, because usually when you do that, nothing good comes of that, if movies are to teach us anything.
It's funny you say that because I totally looked at Airbnb places for like two weeks.
I am going on holiday break soon.
And I was like, this is too expensive.
I can't justify dropping this much money.
But yeah, scary movies teach me not to do that because then things end up not well.
And I want to see this podcast episode launch great.
So last time we had you on... So, Ana, for the...
Go on, Andy.
Yep. So, um, I think the last time we had you on, sorry to interrupt you here, uh, we had a podcast, it was called Why You Should Look Into Chaos Engineering. And I remember that, you know, in the end, I think we said,
well, we would have so many more things to talk about,
especially stories that you have collected over the years.
Because, I mean, you have an amazing history
as a site reliability engineer, as a chaos engineer.
You worked at Uber and other companies
and helped them build reliable systems.
And then you said, hey, Andy, Brian, I have a lot of stories.
So now I want to kind of play story roulette.
So we'll pick a story or we'll pick an industry,
and then we'll see what story you have in store for us.
And then we want to learn from these chaotic situations
that you have experienced.
And yeah, actually to learn what we can learn from these stories.
That's perfect.
Yeah, it sounds perfect.
It makes me think that I should change my title from chaos engineer to chaos storyteller.
So I'm liking this.
Yeah.
Now, I know you gave us a couple of, let's say, hints on industry and whether this chaotic
situation was related to, let's say, the monitoring
solution not working or whether you were actually using chaos engineering and load testing,
which is a topic that Brian and I really love.
And I think I would actually like to start with this one.
So you mentioned something about chaos engineering and load testing.
And just fill us in on the story, with as much background as you can give us, and then we'll see what Brian and I can learn, and obviously what the listeners can learn from that.
Yes, um, perfect. So this is a finance company, it's a customer of ours, I can't name-drop, sadly. But this company has been working on launching a new product, and they basically have to have, like, those five nines, six nines of uptime. And it's almost like their leadership team has said they do need that a
hundred percent uptime. And what they do in order to protect, what they do in order to prepare for
a product launch is that they do go ahead and they do a lot of load testing. They do chaos experiments.
And that's kind of nice because that's kind of where we did touch on last episode,
where doing these two things together is the best way for you to plan on capacity
and for chaotic situations. So they wanted to basically go through their entire stack. They
wanted to validate their monitoring was working. They wanted to make
sure that their nodes were auto-scaling properly, that they can handle some latency happening from
their service to another service. They are using the cloud. So of course, making sure that if the
S3 buckets went offline, that they were resilient to that. And lastly, they wanted to make sure they
were resilient for region failovers and making sure that if an availability zone went down, their application
was still usable. So the best thing about this is that they're not one of those companies or
products that does it after they've launched. They actually do load testing and chaos engineering
experiments while they're developing the product.
And those are like my favorite type of customers because it's really making sure that you bring
reliability in as one of the core focuses of your launch. And that allows for you to be successful and not scrambling in the last sprint before launching. And it means that you're developing the product, ensuring that it's going to be reliable from launch date to the next six months, one year of the product being around.
And it was interesting because they found a lot of stuff.
It allowed for them to really iron out some of the latencies that they were seeing between service to service, but they also had a mental model of the dependencies
that they had in the service. And they didn't realize that when you go three to four levels deep, there were more dependencies around. So it helps to just make sure that you're mapping out those architecture diagrams as early as possible.
Because when you think about a finance company that has five nines, six nines of uptime,
having that fourth dependency down is going to really affect that uptime. So it really means
that they have to sit down with other service teams and preach to them what the reliability
of their new product is,
but also make sure that those teams are ready to be reliable. So that means maybe that they
went down to that team and it's like, wait, what chaos engineering experiments are y'all running?
What type of load testing are you doing for this launch? And I really like that type of culture
because one of those things is that chaos engineering is not just you're going to break things to make things better.
You're changing the culture.
You're embracing failure.
You're celebrating failure.
And you're coming together as an organization to put reliability in the forefront.
So now you're preparing for this launch maybe in four months, but you're bringing more teams together and saying, we're launching together.
You're part of this.
You're one of the dependencies to this launch.
And I think in current times, that means a lot to people.
You get to feel part of something, part of the organization that you're with.
And especially since we're all virtual, it's the little things that make work nice right now.
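A quick back-of-the-envelope sketch of why that deep dependency chain matters: if the availability of serial dependencies multiplies, a handful of five-nines services already eats most of a five-nines budget. The numbers below are illustrative assumptions, not figures from the episode.

```python
# Back-of-the-envelope availability math for a chain of serial dependencies.
# Numbers are illustrative assumptions, not figures from the episode.
def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * 365 * 24 * 60

deps = [0.99999] * 4        # four services in the call path, five nines each
combined = 1.0
for a in deps:
    combined *= a           # serial dependencies multiply

print(f"one five-nines service: {downtime_minutes_per_year(0.99999):.1f} min/year")   # ~5.3
print(f"four-deep call chain:   {downtime_minutes_per_year(combined):.1f} min/year")  # ~21
```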
Hey, I got a question for you because this is fascinating because the whole thing with
the number of dependencies and how deep in the dependency tree you go, you figure out
things that you didn't know.
But I have a question for you, because it was brought up to me at a recent conference. I spoke at SREcon, and I talked about top performance problems in distributed architectures. And I basically made the statement that the more dependencies you have, the more complexity you have in your architecture, and therefore you need to prepare better for it; it's going to be harder to build a resilient system. And then on one of the questions I got, I got challenged on this, and he said, well, this is not true if on every service you are applying all of the best practices around reliability engineering, like proper failover, you know, being resilient if a depending service is actually no longer available, and stuff like that. And I agreed with that person, I don't remember who it was. But in the end, I think, would you just confirm, if I heard you correctly: we always have a picture in mind of our architecture, but the reality is often different. That means there are new dependencies that we have not thought of. Somebody brought in a dependency
willingly or unwillingly, or let's say knowingly or unknowingly. And if you don't know that you brought something in, you most likely have also forgotten to think about resiliency: what do you do when the service is acting weird or is not available? And I think this is a great exercise, what you're saying, right? Find your true dependency tree, and then on every service level, educate people on how you build resiliency into that service and all of its depending neighbors.
Yeah, totally.
And I'm really glad,
like really awesome to hear
that you spoke at SREcon.
That's one of my favorite conferences
around reliability.
And I don't know, like I feel like the more dependencies you have, the more complexity you are adding due to
mathematics, like just in terms of doing the math of the amount of services that you're going to
have or the amount of possible failures that you can have in terms of like distributed systems.
That's the thing is like the systems are so complex that when you are trying to aggregate all the possible scenarios,
like computers can't really compute that. So that's interesting. Like that is an interesting
discussion topic of, wait, is it really more complex or is it just you're thinking that it's
more complex, but not really, but it is that you need to make sure that
every single dependency is following reliability guidelines. And it's funny that we kind of come back to this topic, because this is one of the pieces of work Uber got to do: they were doing a production-readiness checklist of, you are a tier-zero service that we need in order to take a trip.
You need to have gone through this checklist.
You need to make sure that you have all the things to be reliable from run books to monitoring
tools, making sure on-call is set up properly.
And I think it's interesting because sometimes you think that those things are done, but
if the tool that you're using didn't build in the right verification, you also might just be seeing check marks. And that's the nice thing about chaos engineering, because you get to bring it all together and truly verify what would happen if this would happen to my production system right now.
And you know what, I want to interject here, just because as you're saying this, I can't get this thought out of my mind, and it's going to be a little bit of a silly side note here. But as I'm hearing these stories, as I'm hearing, you know, people saying you bring the unknowns in, uh, you have all these different systems with these dependencies that you have to make sure everything can handle when some of those dependencies break down.
It was something the way Andy was describing it. It made me go back and think of the movie Alien.
And I'm not sure if you've all seen the movie recently or ever, but it's just, you know,
the basic idea was these people got infected to a possible external organism and then they
broke quarantine protocol and then everything else fell apart because they didn't have any
backup protocols.
But if you think of any movie, right, I think chaos theory, or the lack of following chaos engineering, is the reason why everything bad happens in every movie, because someone makes a mistake and they have no other way to recover from the mistake. You know, you think of any typical horror movie, like, no, you don't go, and don't split up, duh, you know, like all those stupid things. I just bring it up because it amuses me.
It just popped in my head.
I'm like, this is totally the same thing.
And the other reason why I think it's relevant
is just when I first learned about the idea of DevOps
and, you know, reading The Phoenix Project and The Unicorn Project
and the whole Toyota factory idea
of how to bring these fix faster,
fail faster cycles to DevOps.
My mind just started going into all these places of how that can apply to so many more
things besides factories and IT.
It can apply almost everywhere, you know?
And I think the same thing for chaos.
And that was just my sort of stupid little way of saying that the ideas of chaos apply outside of IT, outside of computers.
But you could really start seeing it everywhere when you're aware of the concept, which is really kind of cool.
Brian, are you trying to say we could have prepared for 2020?
To some extent.
Well, you know, again, without going into the reality of what happened, I think there was some preparation for 2020 that was thrown out.
But we will not go into that side of things.
But, you know, to an extent, yeah, it all depends.
And I think this is the challenge that all these organizations that we're hearing about, you know, the stories from today face, right?
How much money and time do we want to put into preparing for these possibilities that what are the chances?
I'm going through this right now.
I'm looking at my finances and possible things, planning for future, old age and all those kind of things.
And we're looking at different insurance options and long-term care and life insurance and all these things.
And you start questioning, okay, all these people want my money
for these insurances for maybe a really tiny risk, right?
And, but that decision process that I'm going through
is the same thing that all these organizations
are going through of how much time do we want to spend
setting up and preparing
and maybe even onboarding a team to do this testing
for something that with luck will never happen.
But then again, when it does, you know, it is the insurance game in a way, but with a lot more devastating effects. And as we see more and more... uh, what was it? Azure the other day just went down. Um, we had... Google went down.
Oh, Google. Gmail.
Gmail. And then we had the, the security... Like, it's happened. It's happening. You know, it's not a matter of if, it's a matter of when. And then when is it going to impact you? But are you going to be prepared for some form of it?
Yeah.
Brings up a lot.
Yeah, there's two things that kind of come up.
It was nice because KubeCon North America just happened
and one of the things that they tweeted out
after the conference is that they see five trends happening in the cloud native world in 2021.
And chaos engineering is one of them, which was like, yeah, we are seeing that things are getting
more complex and people are realizing. So that was really neat. And the second thing I wanted to chime
in on is that I actually have a story, if we want to talk about what the investment in chaos engineering is versus doing it otherwise.
Go ahead.
Uh, so actually we're back to a finance company. Um, but we do know we have a few other examples. They're doing, um, chaos engineering in QA. So we do talk about the end goal being to run chaos engineering in production,
but we want to build our confidence slowly and incrementally. So this is one of our companies
that is doing chaos engineering in QA. They have a Kafka cluster and they want to start thinking
of what happens when it becomes partially unavailable. So after they ran all these
experiments, they were able to say the chaos
engineering experiments only took two hours to implement. But if they would have used regular
engineering to do this, this was something that was going to take them four or five months to
replicate. And this is just in the QA environment. So it could have actually been a little bit longer
for production and stuff. But what they wanted to see is what failures would happen when the brokers of Kafka would
actually be unavailable.
So this allowed for them to see that they actually hadn't set up monitoring properly.
The monitoring tools weren't alerting the team when the nodes were lost.
And they started to see that running basically black hole attacks that stopped the connection to just half of those brokers meant that those brokers crashed. They didn't have alerts, and an engineer had to manually come and run a command for these
nodes to come back up. So if you would have to do that every single time you see a failure in your Kafka cluster, that's first of all, you might get pager fatigue and totally burn out. So they didn't really get to see why this was
happening until they broke it. And when you do lose one of the nodes of the Kafka cluster,
this meant that you actually need to manually configure it again for it to reboot. So the
amount of time that kind of was happening for them to manually intervene the few times that they ran the
experiments, they realized that they needed to fix a lot more things in their cluster and the
environment and the configuration. And that's kind of where they were like, we wouldn't have really
seen this failure right now. And if we would have discovered it via an incident, they would have
taken a lot more time to engineer for that failure
versus proactively jumping on that failure.
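To make the black-hole idea concrete, here's a minimal sketch of the underlying technique, dropping traffic to one broker with iptables on a disposable test host. The broker address and port are assumptions, and this is not the vendor tooling described in the episode.

```python
# Rough "black hole" sketch: drop outbound traffic to one Kafka broker.
# Broker address/port are hypothetical; run only on a disposable test host.
import subprocess, time

BROKER_IP = "10.0.0.12"    # hypothetical broker address
BROKER_PORT = "9092"       # assumed default Kafka port

RULE = ["OUTPUT", "-p", "tcp", "-d", BROKER_IP, "--dport", BROKER_PORT, "-j", "DROP"]

def iptables(action: str) -> None:
    cmd = ["iptables", action] + RULE
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

try:
    iptables("-A")     # start dropping packets to that broker
    time.sleep(300)    # observe: do alerts fire? does the cluster recover on its own?
finally:
    iptables("-D")     # always remove the rule, even if observation is cut short
```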
And obviously, if this would have happened
later on in production,
as you said, it would have taken them a long, long time.
It would have meant a huge downtime,
huge penalty payments,
especially if it's finance, right?
I mean, that's probably, I think,
hard to measure actually how much money they would have lost.
And therefore the upfront investment makes a lot of sense
and pays off immediately.
Yeah.
And finance is interesting because like you mentioned,
it's like you're going to get charged on fines
depending on the countries that you're operating in. Like, that cost-of-downtime bill could get incrementally really, really high for every minute that you're down.
Hey, you mentioned monitoring
and that they figured out
that maybe not the right people were alerted
or got alerted,
which reminds me of a thing that we,
as an industry, I think in general,
try to promote a lot is everything is code
and everything is configuration.
So we also talk about, you know,
making sure that your delivery pipelines can,
well, let's say that all of the tools
that you use from dev to ops
can be automatically configured
through configuration as code
and that you then also apply it accordingly.
And I think this is a great way also for you to test this for your monitoring solution. Meaning, if you are right now only using monitoring in production, and you have everything manually configured because you've never needed to replicate it in a pre-production environment, well, then it's about time that you figure out how you can configure your monitoring fully automatically, and then also use that configuration and propagate it
from dev to QA into prod
so that you always get the same alerts
and the same thresholds are used and all that.
So I think that's especially for our listeners
that are using maybe the monitoring tool
that Brian and I try to bring to the world every day.
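One way to picture that "monitoring configuration as code" idea is to keep a single alert definition in version control and render it per environment, so dev, QA, and prod always get the same rules. Everything below (metric names, thresholds, environments) is made up for illustration; it's not any particular monitoring product's format.

```python
# Hypothetical monitoring-as-code sketch: one alert definition, rendered per
# environment so dev/QA/prod stay in sync. Names and thresholds are invented.
import json, pathlib

ALERTS = [
    {"name": "kafka_broker_down", "metric": "broker.online.count", "condition": "< expected_brokers"},
    {"name": "api_latency_high",  "metric": "http.latency.p95.ms", "condition": "> 500"},
]

ENVIRONMENTS = {"dev": {"expected_brokers": 1}, "qa": {"expected_brokers": 3}, "prod": {"expected_brokers": 9}}

for env, params in ENVIRONMENTS.items():
    rendered = [
        {**alert, "condition": alert["condition"].replace("expected_brokers", str(params["expected_brokers"]))}
        for alert in ALERTS
    ]
    out = pathlib.Path(f"alerts-{env}.json")
    out.write_text(json.dumps(rendered, indent=2))
    print(f"wrote {out} ({len(rendered)} alert rules)")
```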
And it's funny you mentioned that
because it's weird to,
like we saw this a lot with prospects and customers.
They didn't have monitoring set up.
And it was like, how are you operating at scale?
And then maybe they would only have monitoring in production,
but then they're not doing this in any pre-prod environment.
And it's like, wait, but how are you making sure
that you're not building in more failures
as you're deploying through your environments?
But then that also allows for me to touch base.
At Gremlin, we use Gremlin for chaos engineering on Gremlin.
And that was actually one of the wins that we saw.
We were able to run a game day in staging
and we realized that our dashboards
needed a little bit more granularity
and a little bit of a better way
to understand what the system was doing.
But we were able to implement that
in staging monitoring.
And then you can take that exact same win
and put it into your production dashboard.
And it's like I never had to run it on production
for me to make that improvement.
Yeah.
So another lesson learned for everyone.
Whatever tools you use, and if it is, for instance, monitoring,
you need to test.
You can test the monitoring in your lower-level environments
while you run your chaos engineering experiments,
and then if you can automatically propagate these configurations
to the upstream environments like production,
you automatically get the same data that you know you will need in case chaos strikes and the system doesn't heal itself. Because that's obviously the next thing we can also do with chaos engineering: we not only enforce chaos, find the alerting gaps, and get better at fixing things manually, but also fix things, or let things fix themselves automatically, through auto-remediation.
But Andy, an episode or two ago,
you were just advocating the idea that
the less environments you have, the more mature you are.
How do we test in lower environments
if we're trying to get rid of them?
Yeah, well, if you have listened closely to that episode, then just having, let's say, only production doesn't mean that you cannot safely deploy into an environment in production where you can run your experiments up front, before you release this to the whole world, right? You still have a separation between your regular production traffic and, let's say, your canary traffic, or whatever you want to call it.
But you're right, obviously.
And that also means, right, if you're doing production only,
if you really are that mature,
that you only have one single environment
and you're deploying something new,
you want to make sure that you have the chance
to automatically test, configure,
and run experiments on these new components
before you maybe release it to the whole world.
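A minimal sketch of that separation between regular production traffic and canary traffic: hash users into a small, sticky bucket that gets the new version, so experiments and monitoring can be validated before the full rollout. The service names and rollout percentage below are assumptions.

```python
# Minimal percentage-based canary split (names and percentage are hypothetical).
import hashlib

CANARY_PERCENT = 5  # assumed rollout fraction

def is_canary(user_id: str) -> bool:
    # Hash the user id so the same user always lands in the same bucket.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < CANARY_PERCENT

def route(user_id: str) -> str:
    return "checkout-v2 (canary)" if is_canary(user_id) else "checkout-v1 (stable)"

for uid in ["alice", "bob", "carol", "dave"]:
    print(uid, "->", route(uid))
```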
And that's why I think canary deployments and feature flags are so quick.
And I think that's one thing people have to think about when people say things like,
oh, you only need one environment.
It's like, yes, you need one environment, but that environment has to have those, let's call them subdivisions. It's not really truly one environment. It's one set of everything, but you have little pieces that you turn on and off, and you have the ability to treat it as if it's a partially separate environment, so that you can control... a controlled single environment. It's a good way to think of it. I think, you know, if people are not fully paying attention or fully thinking it through, they just might be panicking, like, what, we just put everything 100% to prod? But yeah, anyway, there's some great podcasts by, um, who, who does those, Andy? Is that, um, Pure Performance? Who's talked about, uh, some canary releases and A/B testing stuff in the past. Be good to check out. Maybe our friends from Pure Performance, who knows. Yeah, anyway, I want to get back to some stories. Sorry.
Uh, yeah, I do want to say that the movement of progressive delivery and chaos engineering, when you bring both of those together, it's so awesome.
Like, I really want to see more companies just really using both of them a lot more and being able to talk about it, because it comes back to being customer focused and putting a focus on experimentation. And I think when we're looking at how distributed things are in the cloud native world, we need to be doing that. Like, to me, it should be mandatory. Like, when we do have a talk on cloud native and adopting this stuff, it's like, well, you might as well take all the new movements going on and really make sure that you're getting ahead of those failures, but you're also not stopping innovation and being able to slowly release. It's like, it's perfect. Like, I want to see more of that.
Yeah. Hey, Ana, um, as we started in the beginning with, uh, reflecting on the chaotic year of 2020, and obviously COVID has been one of the main reasons for this chaotic year.
Do you have any stories maybe related to COVID?
Anything that you've learned from organizations
that may have seen different traffic patterns
or different things due to COVID?
Yes.
So there are multiple companies that we were able to hear from about the increase in capacity that they had, where all of a sudden their systems were seeing a huge increase in people reaching their services, and they didn't do load testing or they didn't capacity plan well enough.
So it's really interesting because some
of these companies were not expecting it because they're not really the companies that you would
use because things went to virtual or online. It's not Zoom, it's not Slack, and it's not your
favorite grocery stores or anything like that. But the one example that I am able to talk about
today is an airline company that had seen some interesting
changes due to COVID. And a lot of this was based because the only way that they were able to talk
to their customers was their mobile application. And in their mobile application, they were using
this to broadcast messages to the user base. This meant that they had around 60,000 people getting messages every single time they sent a push
notification. So when you get 60,000 people to get a push notification, this meant a lot of those
folks were actually going into their mobile application. This also meant the increase in
calls to the APIs once you open the application. And the backend, which was RDS, was the biggest thing that was impacted.
And they were just like, I don't know what's going on.
Like their mobile app kept on crashing.
And this really meant that they needed a way
to prepare their system for database connectivity,
latency, increase capacity,
or just that the RDS system went down. So they ended up doing a lot
of latency experiments. They were also black holing all traffic from their mobile app to their
internal service and black holing RDS specifically. And some of the findings that they had is that
their login screen, they weren't able to like exit out of that. You basically had to log
in every single time you wanted to open this application. You also saw that the login would
kind of like just crash and go all black. And then they also had the mobile app had a spinner
that you just couldn't get out of that page. And all of this was due to latency, or RDS being just at max capacity on the reads that it was able to handle. And it's really interesting because it's like, yeah, I wouldn't
expect that due to COVID people are more likely to open a push notification because they have the
downtime or this is an airline that they're using to go visit their family. And when we're talking
about COVID, it's like every single month has been very different for every single country.
And that mobile app is sometimes the only way they know what the COVID regulations are for travel between countries or within their own cities.
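For readers who want a feel for what a latency experiment like that can look like at the lowest level, here's a rough sketch using Linux tc/netem on a test host. The interface name and delay are assumptions, and this is only the underlying idea, not the tooling described in the episode.

```python
# Rough latency-injection sketch with tc/netem; run only on a disposable test host.
import subprocess, time

IFACE = "eth0"      # assumed interface carrying traffic toward the database
DELAY = "300ms"     # assumed added latency

def sh(args):
    print("+", " ".join(args))
    subprocess.run(args, check=True)

try:
    sh(["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "delay", DELAY])
    time.sleep(120)   # observe the app: spinners, timeouts, fallback screens, alerts
finally:
    sh(["tc", "qdisc", "del", "dev", IFACE, "root"])   # roll the experiment back
```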
It's funny because that's like a marketing team's dream, right? We send a push notification and, you know,
especially we see during games like the Super Bowl or the Olympics or whatever it might be, that, oh, there's this terrible celebration when someone's website crashes, you know, by their own company, like, hey, we crashed our website. No, that's terrible. But this ties right into it, right? You send a notification. I mean, I think this is a great story, because this could just be turned into that what-if. Besides, even, you know, take the COVID factor out, which you know you shouldn't, but taking the COVID factor out, what happens if your customers respond as marketing would dream that they do? You know, because it's always like, what, maybe 10%, right, that really actually come in. But what happens if, suddenly, something is so successful?
You know, back when I worked at, um, WebMD, there was a scenario that was similar. This is, I think, way before the idea of chaos theory was even... like, maybe in principle the idea was there, but it didn't have a name. Um, this was when, uh, Obama was the president and Michelle Obama was the first lady, and she was taking up the health initiative. And the idea was like, if she decides to publish a health blog on our site, we're suddenly going to get slammed with more traffic than we could have ever anticipated. Right? So are we prepared for that? And that was one of the questions. So it kind of goes back to that whole bit. And I think the good point you make there is that with COVID, who would have thought it would actually go that high? But that's the question: what can you handle?
That's great. Yeah, and we actually have been having a lot of those conversations with finance companies. Like, just in the industry, those were the companies that also weren't necessarily expecting an increase in usage of their internet services,
but all their branches are shutting down.
So online banking has to be as reliable as ever from transfers to credit card payments,
viewing your statements, or even being able to call the call center
and cancel your credit card payments because you've been laid off.
So it is one of those things that you have to be ready
for those moments that you can't expect.
Yeah.
I mean, there's two thoughts here that I have.
The first one is on this push message example.
In the end, it comes back to dependency mapping,
even though it might not be a dependency
between a front-end service to a back-end service to a database,
but it's a dependency from one feature to another feature.
And that might be implemented by completely independent teams.
But you still need to know that dependency
and then you need to do the proper load
or I don't know what the right terminology for that is.
How many people are potentially jumping from that feature
to the other feature and therefore causing a lot of load.
So for me, this is just another example of dependency mapping
between features and not just between services in the backend.
The other thought that I had though,
and this comes back to what we said earlier,
if you would go back to 12 months in the past,
like December, January last year,
and if somebody says,
hey, I think we need to prepare for a global pandemic. And then maybe the product managers say, sure,
add it to the list of the backlog,
but probably it's not going to be highly prioritized
because it's very unlikely.
So in the end, it comes back to how you prioritize these things
and somebody needs to make an assessment
on what is the chance of this really happening.
And obviously with COVID, we just hit the checkbox this year
um, that probably nobody really saw happening this year.
And I think you do iron that out, where it's like, this is a big case of dependency mapping, and I think the other word is capacity planning, where you don't know how many calls you can actually sustain in your systems until you're seeing all the traffic come in, and it's like, oh no, our servers are on fire, we're about to crash. And getting to that point sucks, but it's true. When you have those conversations with your product teams, your managers, how do you tell them, I expect an increase in traffic during this time? And having capacity planning conversations... Like, I know in the last episode I talked about what that was like at Uber: you have to do planning for three to six months. When you're working in a bare-metal data center, it's like six to twelve months. And when you're on the cloud, you also might need six to twelve months, depending on, like, what you pay your cloud provider.
And you can't really just pull the plug on having more capacity, because you now need to implement it and pay that bill.
And then you have to make sure everything works.
And that's that part where it's like, yeah,
I think I pressed the button and things are all going to auto scale.
But until you test it, you don't know.
And I had a question about the dependency side.
I think where the two of you were going with this is, at least for me, a new concept in dependency.
So if I'm thinking about this the right way with this push notification one, right?
If we take a step back and using some sort of APM tool as an example,
I don't know, maybe Dynatrace, where when you look at your dependency map of your application,
right, that's learned by basically all the dependencies of the code execution
going through your system. Now, when we take that in the push notification, that's one system,
right? Messages get queued up, they get sent out and pushed to the end users. Within that message is a link, right? But that link has nothing to do with the backend system that pushed the notification. That link goes through another set of applications and maybe another set of infrastructure or clusters or whatever it might be. So you have two systems with a soft dependency that only occurs with that link. Meaning there'd be no way to really trace that through with any sort of tooling or monitoring, because there's basically that air gap of having to touch the link and hit it. So how do you plan for that sort of dependency?
Or how do you make people aware that part of the push notification's dependency is the system that gets hit if a person engages with it?
It's a proper communication between the different,
whoever is the owner of these individual features, right?
I mean, it's proper communication.
But do people think about that? I mean, is that... it's a brand new concept to me.
I mean, that's something I never would have even thought of.
But is that something that people actually are thinking of?
Do we see that happening?
Or is that something that people need to really start opening their eyes to?
I think we're seeing the conversations happen,
but we need to see them more.
And it goes back to bringing reliability to every single product team.
I think sometimes they don't even think about it.
They just launch without thinking the largest impact that their product can have.
And it's funny because that actually has been architecture questions
in software engineering interviews that I've been part of.
But even though you get those questions in interviews,
doesn't mean that every single team out of companies is asking that.
And I think a lot of people just don't know what their dependencies are. You ask a staff engineer,
maybe they might know senior engineers, they might know who to ask or what architecture diagram to look at. But then when was the last time that architecture diagram or mental model got verified?
Like you don't necessarily ever know how distributed your system is until you see the state, like the trace down of what that call was. So I think it's interesting because like observability would allow for you maybe not to like really pin down what some of those things are. But then there's a portion that with chaos engineering, by just
closing out one of those dependencies, you'll see if your system failed. So it's like, are you going
to go and maybe look at logs and really try to figure out all those dependencies? Or do you just
go and close a connection and try to figure out where those breakages happen.
And I guess there's a way to automate some of this from the front end, again, using that push notification as an example, or whether it be a web page or anything: if you're at least somehow scanning whatever that user interface is to understand what all the endpoints that it can connect to are.
You can then at least identify those and say,
all right, this is connecting.
We know 90% of them are going to the main site that we expect,
but hey, look, there's these three links here
that go to a completely different system
that we have to make sure if we're driving more traffic,
those people need to know.
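A tiny sketch of that scanning idea: walk the markup a client renders and inventory the external hosts it can send users to. The sample HTML and host names are invented; a real inventory would crawl the actual app or notification templates.

```python
# Sketch: list the external hosts a page (or notification template) links out to.
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value and value.startswith("http"):
                self.hosts.add(urlparse(value).netloc)

SAMPLE = """
<a href="https://www.example-airline.com/checkin">Check in</a>
<a href="https://covid-info.example.org/rules">Travel rules</a>
<img src="https://cdn.example-airline.com/logo.png">
"""

collector = LinkCollector()
collector.feed(SAMPLE)
print("external hosts this page can drive traffic to:")
for host in sorted(collector.hosts):
    print(" -", host)
```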
And you bring up an interesting one... like, that use case itself,
because I would then argue that you can just build with failure in mind
and you can assume that loading that page could fail and you just load a static page that says be right back or something or a main link to log in.
So I think a lot of it is just going back to that whiteboard and being like, okay, what is everything that can fail and let's start building around that.
Like one of the examples that always gets talked about in chaos engineering
are Netflix examples. They're known as one of the pioneers in the space. And when you listen to
their front end engineers talk about chaos engineering, one of the use cases that has
always stayed in my head is that you log into Netflix... their main, like their main KPI is just, like, seconds to stream. But when you look at the first page, there is a continue-watching, like, division, and it's like all the shows or movies that you've been watching. But if that service was to go down, you don't see empty boxes; you end up just seeing, like, top movies or TV shows in the United States. But that's building with failure in mind, of, like, wait, no, this service, if it times out at 300 milliseconds, the website just won't load it.
And it's really, really nice to have experiences like that.
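The pattern behind that continue-watching example is just a tight timeout plus a safe fallback. Here's a hedged sketch of the idea; the URL, timeout, and fallback titles are made up and don't reflect any real service.

```python
# Graceful-degradation sketch: time out fast and fall back to generic content
# instead of an empty row. URL, timeout, and titles are hypothetical.
import requests

FALLBACK_ROW = ["Top Movies", "Top TV Shows"]

def continue_watching(user_id: str) -> list[str]:
    try:
        resp = requests.get(
            f"https://api.example.com/users/{user_id}/continue-watching",
            timeout=0.3,   # give up after ~300 ms rather than hang the page
        )
        resp.raise_for_status()
        return resp.json()["titles"]
    except (requests.RequestException, KeyError, ValueError):
        # Personalization is slow or down: show a generic row instead of empty boxes.
        return FALLBACK_ROW

print(continue_watching("user-123"))
```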
It's incredible what some of these sites do.
When you think about it, we take for granted that you go on Netflix or you go on Amazon or anything, and within, you know, a heartbeat you're doing what you want to do. I mean, you don't even think about what's all behind there. But these stories really give us a great appreciation for understanding everything that went into that.
It also shows us that chaos engineering, and let's say it that way, designing with resiliency in mind, goes far beyond what you would maybe normally think of. Like, you know, when the database is gone, okay, this is one problem. But if a widget on the website doesn't load, then I want to make sure that the website is still functional and also doesn't look like everything is broken. So I think that's also very interesting, right, that we teach and we make
engineers from the front end to the back end aware of resiliency engineering and
that you have to, let's say, think
about resiliency in every layer of your architecture, whether it's front end,
the mobile app, the browser, or the backend.
Hey, Ana, do you have one more example that we should talk about?
Yes. So funny enough, we were just talking about that capacity planning. And the last example I had was an e-commerce company that wanted to verify they set up auto scaling properly,
but they were using Kubernetes and they just didn't have any regular, like, just regular node auto-scaling. They went ahead and they implemented horizontal
pod auto scaling. So they usually traditionally use load testing to verify any sort of auto
scaling, but due to the way that horizontal
pod autoscaling works with Kubernetes, it's based on resource consumption. So they ended up running
CPU attacks on every single service for them to really isolate every single application to a
single container. And that's how they were able to realize that HPA was set up properly and their Kubernetes clusters were actually auto-scaling appropriately, which is kind of nice because one, more people need to do Kubernetes auto-scaling, whether you're doing it from the cloud provider on the node standpoint, or you're actually setting up HPA, which I highly recommend.
Sometimes you really hope that you set it up properly, but unless you test it, you don't
know. So when I was looking for these stories, I was like, oh, I'm really excited to see this e-commerce company going and making sure to run CPU experiments so that, if their resources actually pass the resource consumption that they have allocated for, they are seeing things
actually auto-scale and the customers won't
get impacted.
Because when we look at the outages of Kubernetes, sometimes they are just based on capacity
planning.
And it's always like, oh, yeah, people assume that things are going to auto-scale just because
it's cloud native.
But wait, did you forget that you had to configure it?
And you have to architect for it as well.
Just, I mean, because you have an app or a service in a container
doesn't mean it is ready for auto-scaling, right?
There's a lot of things that have to be considered
and then those affected into the architecture.
So, yeah, very important. Very cool.
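If you want to try something in the spirit of those CPU experiments, a crude way is to saturate the cores inside a test pod and watch whether the HorizontalPodAutoscaler actually adds replicas (for example with `kubectl get hpa -w`). The duration and worker count below are assumptions, and this is not the vendor tooling from the episode.

```python
# Crude CPU-burn sketch for a test pod: keep every core busy for a while,
# then check whether the HPA scaled the deployment as expected.
import multiprocessing, time

DURATION_S = 120                          # assumed experiment length
WORKERS = multiprocessing.cpu_count()     # one busy loop per core

def burn(seconds: float) -> None:
    end = time.time() + seconds
    while time.time() < end:
        _ = 123456789 ** 2                # busy work to keep the core near 100%

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=burn, args=(DURATION_S,)) for _ in range(WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print("CPU burn finished; check HPA replica count and CPU metrics.")
```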
Now it's the beginning of 2021,
at least when people listen to this,
any predictions for 2021?
Or I think you said earlier at KubeCon,
they mentioned that chaos engineering is going to be one of the hot topics of
2021, which
I completely agree with.
Is there anything else maybe to kind of conclude
this session today?
Anything else that people should
look into that you would
like to put under their virtual
Christmas tree, which I know is happening for us
in the future, for them in the past,
but is there anything that you would wish people would also do in 2021
when it comes to perform chaos engineering?
I think, for sure, number one is: go perform chaos engineering. But the one that I will continue pushing forward is: do it to onboard your engineers. So, on call: I just continue hearing of instances of engineers, like, getting put on call, they're thrown into pagers, and those engineers never learned what they had to do in a safe space. They never got to really explore the systems that they were working on. So when I talk about onboarding engineers with chaos engineering, it's about running an experiment, having them open up that runbook, having them get paged, having them acknowledge the page, communicating with the team that the incident was started.
Whether you're using a tool for incident command and really jotting down everything, like going through the entire process as if things were actually on fire in production would really be helpful because sometimes psychologically you get paged in the middle
of the night. This is not the time that you want to be awake and you just kind of forget maybe some
of the steps. So I really want to push that one forward for the purpose of pager fatigue going
down, making sure runbooks and links to dashboards are up to date, and just making sure that
we are trying to build a healthy reliability engineering and mental health culture.
Cool.
Well, this hopefully now sparks an idea for tomorrow when people finish listening to this
pod and they go back to the office tomorrow and say
hey, I want to voluntarily sign up for being on call, and I'll make sure...
That's cool. Thank you so much.
Cool. Um, Brian, anything else from your end?
Yeah, I got one idea, or one wish, for 2021 in terms of chaos. But it's also a setup for Ana: in case it does exist already, then we can say, well, then great, you'll have to come back on the show and discuss it with us.
But what I think would be cool to have in the world of chaos is categories of chaos, right? Because if we think about this idea that we bring up quite a lot with you on COVID, right, this world pandemic: were people prepared?
Did they think of people responding, right?
The overwhelm, the, I guess the way that might overwhelm someone's mind would be thinking of like,
well, if I think of everything that can go wrong in the world that we might have to prepare for,
it's going to be too big, right?
But I was just thinking about that.
I'm like, well, no, it's not necessarily the individual characteristics
of the event. It's the type of events.
So if we think about besides all of the, what if our database goes down,
what if AWS has an issue or whatever,
we think about what if there's a global event?
What if there's a local event?
What if there are different kinds of events and situations?
So I'm curious to see if there are categories of types of events that can help guide people into creating and thinking of ways to test different types of chaos, from server level, host level, all the way to global community level.
I'm not sure if that's something that exists in the Chaos world yet, if that's something that would be useful in the Chaos world, but it'd be interesting
to see at least what it might look like.
Yeah, so I'll definitely do the plug
that I do work for a vendor, and one of the products that Gremlin does offer is Scenarios. And in Scenarios, you get to choose a technology that you're using, and we give you recommended scenarios to run on your infrastructure. Like, for example, with databases it's: prepare for the lack of memory in your MySQL, make sure that your database cache is set up properly so you can block that traffic, what is the timeout of your DynamoDB?
So we do try that.
And then I know the open source community,
like with Litmus Chaos,
they have been able to also do something similar
where like the experiments that they offer
is based on recommendations of failures in Kubernetes.
And the other plug for it is that
I just recommend people to read incident postmortems.
Go ahead and look up your favorite technologies,
replicate those conditions,
and know that your organization is resilient
if that was to happen.
So go read the GitHub outage of last year.
Go read the Google Cloud one that happened, like, a few weeks ago. Gmail just went down. Go read them and understand what happened to lead to that failure, and then replicate it.
I think that brings up a good point. I was just thinking as you were talking there: it's not necessarily what event happens,
right? If we take, again, the global pandemic, it's not necessarily thinking about something global happening; it's about some system getting overloaded in some way, right? So if you're reading those stories and reading those postmortems about what systems got overloaded or got destroyed, it's more about thinking of your systems. Again, you don't have to necessarily think from the outside world of what happens, let's say, if vampires come to life tomorrow. And, you know, well... if you do want to think about it that way, you think about, okay, well, how would the vampires use the internet? We might suddenly have to deal with a lot more nighttime traffic. Can we handle nighttime traffic if we're doing our maintenance windows, right? Um, but it's more about, I guess, the systems. And where I was just going there with my previous question was thinking, again, kind of in the wrong way, of the actual event, not the system. The system is your customer, in a way, in chaos, right? We shouldn't say your customer; your subject, maybe, is the better way to think of it.
All right, cool. And I would say, right, as you've said, you know, do this and this, and make sure you listen to the podcast from our friends at Pure Performance. It is called Why You Should Look Into Chaos Engineering, with Ana Medina.
Yes. Those people are wonderful.
Yeah.
All right.
That's all for me.
Andy, anything else from you?
No, just, you know, we all have it in our hands to make 2021 a better year than the last one. It shouldn't be too hard, but it's all in our hands. First of all, start washing your hands more often, and then do something useful with them.
No, but it's great to learn from people like you, Ana.
It's really a pleasure.
It opens up our minds.
It teaches us something new.
And in the end, we will all just get better in engineering.
And that's why I really want to say thank you so much for coming on the show.
Thanks for having me.
Very much open to talk about reliability and just making this world a better place and more reliable as humans too.
So feel free to reach out.
I'm on social media as Ana underscore M underscore Medina.
Awesome.
Thank you so much.
We'll put that link in show notes as well.
And if anybody has any questions, comments, please feel free to tweet us at Pure underscore
DT on Twitter, or you can send us an old-fashioned email at pureperformance@dynatrace.com.
Love any questions,
comments,
feedback,
or ideas.
And happy 2021.
Here's looking to the future,
everyone.
Thanks.
Bye.