Screaming in the Cloud - Episode 22: The Chaos Engineering experiment that is us-east-1
Episode Date: August 8, 2018

Trying to convince a company to embrace the theory and idea of Chaos Engineering is an uphill battle. When a site keeps breaking, Gremlin's plan involves breaking things intentionally. How do you introduce chaos as a step toward making things better? Today, we're talking to Ho Ming Li, lead solutions architect at Gremlin. He takes a strategic approach to deliver holistic solutions, often diving into the intersection of people, process, business, and technology. His goal is to enable everyone to build more resilient software by means of Chaos Engineering practices.

Some of the highlights of the show include:
- Ho Ming Li previously worked as a technical account manager (TAM) at Amazon Web Services (AWS) to offer guidance on architectural and operational best practices
- The difference between, and his transition from, the solutions architect role to the TAM role at AWS
- The role of the TAM as the voice and face of AWS for customers
- The ultimate goal is to bring services back up and make sure customers are happy
- Amazon Leadership Principles: it's mutually beneficial to have the customer get what they want, be happy with the service, and achieve success together
- Chaos Engineering isn't about breaking things to prove a point
- Chaos Engineering takes a scientific approach
- Other than during carefully staged DR exercises, DR plans usually don't work
- Availability Theater: a passive data center is not enough; exercise the DR plan
- Chaos Engineering brings it down to a level where you exercise it regularly to build resiliency
- Start small when dealing with availability
- Chaos Engineering is a journey of verifying, validating, and catching surprises in a safe environment
- Get started with Chaos Engineering by asking: what could go wrong?
- Embrace failure and prepare for it; business process resilience
- Gremlin's GameDay and Chaos Conf allow people to share experiences

Links:
- Ho Ming Li on Twitter
- Gremlin
- Gremlin on Twitter
- Gremlin on Facebook
- Gremlin on Instagram
- Gremlin: It's GameDay
- Chaos Engineering Slack
- Chaos Conf
- Amazon Leadership Principles
- Adrian Cockcroft and Availability Theater
- DigitalOcean
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This week's episode of Screaming in the Cloud is generously sponsored
by DigitalOcean. I would argue that every cloud platform out there biases for different things.
Some bias for having every feature you could possibly want offered as a managed service
at varying degrees of maturity. Others bias for, hey, we heard there's some money to be made in
the cloud space. Can you give us some of it?
DigitalOcean biases for neither. To me, they optimize for simplicity. I polled some friends of mine who are avid DigitalOcean supporters about why they're using it for various things,
and they all said more or less the same thing. Other offerings have a bunch of shenanigans
around root access and IP addresses.
DigitalOcean makes it all simple.
In 60 seconds, you have root access to a Linux box with an IP.
That's a direct quote, albeit with profanity about other providers taken out.
DigitalOcean also offers fixed price offerings.
You always know what you're going to wind up paying this month,
so you don't wind up having a minor heart issue when the bill comes in.
Their services are also understandable without spending three months going to cloud school.
You don't have to worry about going very deep to understand what you're doing. It's click a button or make an API call, and you receive a cloud resource.
They also include very understandable monitoring and alerting.
And lastly, they're not
exactly what I would call small time. Over 150,000 businesses are using them today. So go ahead and
give them a try. Visit do.co slash screaming, and they'll give you a free $100 credit to try it out.
That's do.co slash screaming. Thanks again to DigitalOcean for their support of Screaming in the Cloud.
Hello and welcome to Screaming in the Cloud.
I'm Corey Quinn.
This week, I am on location at the Gremlin offices in San Francisco.
I'm joined by Ho Ming Li, who, well, we'll talk about Gremlin in a minute.
First, I want to talk a little bit about what you've been doing historically.
First, welcome to the show.
Sure, absolutely. Thank you for having me and happy to be here on the show.
Nope, always great to wind up talking with new and interesting people.
So before you went to Gremlin, you were a TAM at AWS, or Technical Account Manager for
people who don't eat, sleep, and breathe cloud computing acronyms.
Yep.
So without breaking any agreements or saying anything that's going to cause AWS to come after me with knives in the night again,
what was it like to work inside of AWS?
Yeah, I actually want to step back a little bit further.
And when I joined AWS, this was probably four years ago, I joined as a solutions architect over there. I am one of the only, if not the only person
that actually went from the solutions architect role
into the technical account manager role.
Interesting.
Out of curiosity, what is the difference
between a solutions architect and a technical account manager?
I get that a lot, for sure.
And from a technical perspective, both roles are very technical.
The expectation is that you're technical and you're able to help the customers.
I would say for solutions architects, you are working a little bit more with the architecture,
whereas with the technical account management role, you're a lot more on the operations side,
so you're a lot more on the ground.
When you say that you transitioned
from being a solutions architect
into a technical account manager,
and that was a rare transition,
is that because transitions themselves are rare,
or is that particular direction of transfer the rarity?
That's a great question.
Particular direction of transition.
And there is a tendency for people to start with a support role.
Technical account manager has an enterprise support element to it.
Now, as technical account managers,
instead of being reactive on a lot of support cases,
you're actually proactively thinking about
how we can make their operations a lot smoother
and best practices around that.
So I'll give you an example.
Every year, there's usually very critical events
for a lot of businesses.
If you think of Amazon, the natural thing to think of
is Black Fridays and Cyber Mondays.
A lot of these.
Prime Day as well recently.
Oh, absolutely.
Happy Prime Day, belatedly.
Exactly.
And so for these events, there's actually a lot of planning, a lot of thinking ahead.
And so as technical account managers, you work with the customers to make sure that they are ready for the event.
Now, technical account manager is an interesting role in that you are in between the customers as well as AWS services.
And so in that sense, you're helping services, you're helping the service team educate the
customer to let them know how to properly use a service.
Because a lot of people use services in an unintended way,
which makes things very interesting, to say the least.
Oh, absolutely.
I found that Secrets Manager, for example,
makes a terrific database.
And every time I talk to someone
who knows what a database is supposed to do,
they just stare at me, shake their head sadly,
and then try to take my computer away.
I have to say there are interesting misuses, though,
because there are interesting use cases
that the AWS service team may not have thought of.
And so the other part is actually getting some of this feedback
from our customers and bringing it back to the service team
so that the service team enhances features.
So it's a really interesting role in that you're between both the customers and the service team.
What's interesting to me about the TAM role historically has been how, I want to say maligned it is.
In that when you speak to large enterprise clients, fairly frequently something I will hear from engineers on the ground is,
oh, the TAMs are terrible.
They have no idea what they're doing and it's awful.
And okay, I suspend disbelief.
I wind up getting engaged from time to time
in those environments and speaking with the TAMs myself.
And I come away every time very impressed
by the people that I've been speaking with
on the AWS side. So I understand
that it's a role where you are effectively the voice and face of AWS to your customers.
And that means that you're effectively the person in the room who gets blamed for any
slight real or imagined that AWS winds up committing. You will get blamed for all of their sins.
It's a maligned role.
What do you wish that people understood more about the TAM position?
I think some part of that is really just about
what people are hearing anecdotally and what word gets out.
This is like reviews.
You generally tend to see more bad reviews than good reviews.
But in my experience as a TAM, working with a lot of our customers, I've actually worked
with a lot of good customers and I'm thankful and lucky to be in that position where a lot
of our customers actually come to us with very reasonable requests and very reasonable
incident management.
So working with them in finding out what the root cause is,
and it could be something on the AWS side,
it also can be something on the customer side.
Now, I do understand where the sentiment comes from
because there are definitely certain customers
that like to just initially blame AWS.
And sometimes when they don't have visibility into it,
it's easy for them to blame AWS.
And that might be where some of the sentiments come from.
But for the most part, the customers I work with have been really good in that it is usually a joint investigation
to find out what's wrong, as the ultimate
goal is really to kind of bring services back up and make sure our customers are healthy
and happy going forward.
It always amuses me to talk to large enterprise customers who are grousing about enterprise
support and why they don't want to pay it.
They're debating turning it off.
The few times I've seen companies actually
do that, it lasts at most a month and then they turn it right back on and, oops, that was a
mistake. We're really, really sorry about that. Not because you need enterprise support in order
to get things done, but rather that you need it in order to get visibility into problems that only
really crop up at significant scale. That's not, incidentally,
a condemnation of AWS in the least. That's the nature of dealing with large, complex platforms.
And the one thing that has always surprised me about speaking with TAMs, even off the record,
after I've poured an embarrassing number of drinks into some of them, is that I don't ever get any of them to break character and say,
oh, that customer is the worst. There's a genuine bias for caring about the customer and having a
level of empathy that I don't think I've ever encountered in another support person for any
company. Is that just because there are electric shock collars hidden on people, or implants? Or
is this because it's something that they bias for in the hiring process?
Yeah, totally.
So kind of dialing back a little bit on what you mentioned
with how enterprise customers are getting value with enterprise support.
I think there's an element of embracing enterprise support.
You have to really embrace it and work with the AWS staff to really get
value out of it.
And so the more you embrace it, the more value you'll get out of it.
Now, to the point of why all the staff at AWS, not just TAMs,
are hardwired to really help customers: it's part of the Amazon leadership
principles.
I really think that's something Amazon has gotten right in its culture:
the hiring process requires people to read up on the leadership principles,
embrace them, and really present them as you go through your hiring journey,
and then keep living them as an Amazonian,
which is what they call the staff at Amazon.
Yay, cutesy names for employees.
Every tech company has them.
Exactly.
But the leadership principles speak really well,
and one of them is customer obsession.
And so every staff member within Amazon is customer obsessed.
And so at the end of the day,
it should be mutually beneficial to have the customer
get what they want, be very happy with the service,
as well as the TAM achieving success with the customer
to make sure that everybody's happy in the equation.
Absolutely, and I think that's a very fair point.
So I think that's probably enough time dwelling on the past.
Let's talk about what you're doing now.
You left AWS around the beginning of this year,
and then you came here to Gremlin,
which is an awesomely named company,
especially once you delve a little bit into what they do,
which is chaos engineering.
So effectively, from a naive perspective on my part,
where I haven't ever participated in the chaos engineering exercises,
it looks to me from the outside like what you do
is you've productized or servicized, if that's a thing,
turning off things randomly in other people's applications.
How do you expect to find a market for this
when Amazon already does this for free
in US East 1 all the time?
Well, there's a lot going on in US East 1
and a lot of new shiny toys happen to be there too.
US East 1 is definitely an interesting region.
That said, there's definitely a lot of misconceptions
in chaos engineering.
We have heard where some people would go
and break other people's things
and just break things to prove a point.
As a company, we're definitely not advocating for that.
That's not the direction that we expect our customers to take.
Really, chaos engineering is about planned, thoughtful experiments to reveal weaknesses in your systems.
You look at companies that are out there that are doing similar things like Netflix, like Amazon themselves.
The intent really is to build resilience.
With microservices, and you hear this word a ton
in the industry for sure,
it's very difficult to understand all the interactions
and to really have a good grasp,
even as an architect.
I've been an architect and I'm here as a solutions architect as well.
Even as an architect, it's impossible to know
all the interactions and ensure that your systems are resilient.
And so one good way is to be thoughtful
and plan out these experiments to reveal weaknesses and then to build resilience
over time. One of the things that I appreciate when I speak to chaos engineers is that they
always seem to take a much more scientific approach to this. It's a mindset shift where
other people, sorry, other departments call themselves site reliability engineering.
There's very little engineering.
There's very little scientific rigor that goes into that.
It's more or less, in many shops, an upskilled version of a systems administrator.
Whereas every time I go in depth with chaos engineers, there's always a discussion about the process, about the methodology that ties into it.
So let's delve into this a little bit.
What is chaos engineering?
Because it's easy to interpret this, I think, as just having a DR plan that is just better
implemented and imagined than a dusty old binder that assumes that one data center completely
died, but everything else is okay and just fine.
What is chaos engineering?
It's interesting you bring up the DR plans.
Definitely most companies have a business continuity plan.
They, like you said, have a DR plan so that if their primary data center fails, they
would fail over to a secondary data center.
Spoiler, other than during carefully staged DR exercises, they almost never work.
That's exactly what I want to bring up. There's actually a term recently
used by Adrian Cockcroft for this: availability theater. Just having it in the binder, just having a passive data center
is not enough.
And so first of all is to exercise it.
You have to actually exercise your DR plan.
And if you go out and ask people,
I honestly don't know how many
properly exercise their DR plan.
Now the other thing is,
I can understand why so many people
are reluctant to exercise the DR plan
because the huge blast radius is very dangerous to do
and it just takes a lot of planning,
a lot of effort to do.
So if you take that effort
and shrink it down to
a very small outage or
a very small issue in your architecture
and ask the question,
what could go wrong in this environment?
It could be as simple as a network link going down.
And that's much easier to do, and that's much easier to practice.
So the idea behind chaos engineering really is just bringing it down to a level where
you exercise it regularly so that you can build resiliency.
But it has to be thoughtful, it still has to be planned.
So we like to think of it as an experiment where you ask the question, in this environment,
what could go wrong? And then once you start understanding
what parts can go wrong,
you want to experiment against it.
You have a hypothesis that if I disconnect
my application from the database,
I'm able to pull data from cache.
I'm able to still show this information to my end users. That's the hypothesis. Now,
as much as you think that that's going to happen, you don't really know for sure that that's
actually going to happen. So you would want to actually test it and experience it. And so the
key here is actually injecting the fault, actually disconnecting the link to your database, and seeing whether it is actually showing you cached data, or maybe in some cases it fails horribly, which is actually still a learning.
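To make that concrete, here is a minimal sketch of the kind of experiment described here: hypothesize that the service keeps serving cached data when its database is unreachable, inject that single fault, observe, and always clean up. The endpoint and the two fault-injection helpers are illustrative assumptions for the sketch, not Gremlin's actual API or any specific product's interface.

```python
import time
import requests  # third-party HTTP client, used only to probe the service under test

APP_URL = "http://localhost:8080/products/42"  # hypothetical endpoint of the service under test


def block_database_traffic():
    """Stand-in for the real fault injection (a firewall rule, a network
    blackhole, etc.); how you actually do this depends on your tooling."""
    pass


def restore_database_traffic():
    """Undo the fault so the experiment always ends with a clean environment."""
    pass


def run_experiment():
    # Hypothesis: with the database unreachable, the service still returns a
    # successful response served from cache, rather than an error.
    baseline = requests.get(APP_URL, timeout=2)
    if not baseline.ok:
        raise SystemExit("Abort: the system is unhealthy before the experiment even starts.")

    block_database_traffic()
    try:
        time.sleep(5)  # give the fault a moment to take effect
        try:
            during_fault = requests.get(APP_URL, timeout=2)
        except requests.RequestException as exc:
            print(f"Hypothesis failed hard ({exc}); still a learning, not a wasted experiment.")
        else:
            if during_fault.ok:
                print("Hypothesis held: cached data was served while the database was unreachable.")
            else:
                print(f"Hypothesis failed: HTTP {during_fault.status_code} during the fault.")
    finally:
        restore_database_traffic()  # shrink the blast radius back to zero, no matter what


if __name__ == "__main__":
    run_experiment()
```

Note that the sketch starts with a health check and ends in a finally block: the thoughtful, planned part is baked into the harness rather than left to memory.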
And so ultimately it comes down to learning about your systems and building resiliency over time. One of the aspects that I think appeals to me the most is that it doesn't need to be
a world-changing disaster that can be modeled.
A lot of times it's something on the order of you add 100 milliseconds of latency to
every database call.
What kind of degradation does that cause?
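That 100 milliseconds is the sort of fault you can inject with very ordinary tools. As one hedged example, on a Linux host with root access, the kernel's netem queueing discipline can add a fixed delay to an interface; this sketch simply shells out to tc, which is not how Gremlin implements it, and the interface name and delay are assumptions you would adapt.

```python
import subprocess


def add_latency(interface: str = "eth0", delay_ms: int = 100) -> None:
    """Add a fixed delay to all traffic leaving the interface using tc/netem.
    Requires Linux and root privileges; adjust the interface for your host."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", interface, "root", "netem", "delay", f"{delay_ms}ms"],
        check=True,
    )


def remove_latency(interface: str = "eth0") -> None:
    """Remove the netem qdisc, restoring normal latency on the interface."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)


if __name__ == "__main__":
    add_latency()  # every call over eth0 now takes roughly 100ms longer
    try:
        input("Latency injected; run your measurements, then press Enter to restore...")
    finally:
        remove_latency()  # always undo the fault, even if interrupted
```

One caveat worth noting: this delays all traffic on the interface, which is a larger blast radius than targeting only database calls, so it belongs in a test environment first.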
And that's a fascinating idea to me just because so many DR plans that I've seen are built
on ridiculous fantasies of assume the city lies in ruins,
but magically our employees care more about their jobs than they do about their families.
And they're still valiantly trying to get to work. It never works out that way. There's also a
general complete disregard for network effects. For example, how many DR plans have you ever seen that say,
what if all of US East 1 goes down? Okay. And we're going to just automatically try failing
over to US West 2 in Oregon, for example. Okay. But somehow we're going to magically pretend that
we are the only company in the world that has thought ahead to do this. There's no plan
put in place for things like half the internet is doing exactly what you're talking about. Perhaps
provisioning calls will be particularly latent, so you may want to have instances already spun up.
What if there are weird issues that wind up clogging network links where, oh, we want to
shove a bunch of data there. Oh, well, first, we can't get to that data in its original case,
and oops, there was a failure in planning.
Or instead, it winds up being way too long since this was tested,
and there are entire services that this was never thought about.
So it feels to me, on some level,
like chaos engineering is, in some respects,
an end run around the crappy history
of disaster recovery and business continuity planning?
It just allows people to be a lot more granular,
in my opinion.
Like I said, the DR efforts need to be there still.
I'm not saying that they don't serve any purpose and they don't have a place.
You should think about that.
But it allows you to think more granularly:
in the AWS world,
what happens if an availability zone goes down?
It's not always about an entire region going down.
From a chaos engineering practice perspective,
we actually advocate for starting small.
You want to start asking questions around,
what if one link fails?
What if one host fails?
What if just a small component fails?
Are you able to handle that?
Then you start dialing up this blast radius
and think about what happens if a wider array of things fails.
What happens if your entire fleet of servers goes down?
Eventually, gradually, you definitely do want to get to the point
like what Netflix can do.
They can fail over regions.
But to want to get there on day one
is an extraordinary amount of work.
So you can start small.
So what happens is a lot of people want to do that
and they just find it too difficult
and throw their hands up in the air.
I agree.
I think there's also a challenge
where you see people who tried something,
it was hard, and they back away from it
and don't want to do it again.
I'm fascinated by stories of failure in an infrastructure context, not because I enjoy
pointing out the failures of others, but because this is something that we can all learn from.
Only a complete jerk watches their competitors' websites struggle and fail under load and is happy
about this, because it could be you tomorrow.
There's the idea in the operations world of hug ops.
If someone else works at a company that is a direct competitor to yours, you still hope they get past their outages reasonably quickly.
I don't think as an industry comprised of operational professionals and engineers that we are particularly competitive in that sense. We want to have the best technology,
but not at the expense of our compatriots
disappointing their customers.
Yeah, absolutely.
We all feel the pain, that's definitely for sure.
Whether it's AWS, whether it's other cloud providers,
whether it's Gremlin,
our approach is to really embrace failure,
but really is to learn from this failure to then build resilience.
Absolutely. And that's not something that ever comes out of a box. I mean, I remember a few
years back, a particular monitoring company on Twitter saw that a company was having an outage
and chimed in with, if you were using our service, you'd probably be back up now. And they were
roundly condemned by most of the internet
for that. It turned out that a marketing intern had gone a little too far in how do we wind up
being relevant to what's going on on the internet right now without the contextual awareness to
understand what was going on. And that was something that became very, I guess, heartwarming
in a sense in that we're all in this together and we don't use it
as an opportunity to capitalize on the misfortune of others. Maybe that's something that runs
counter to what they teach at Harvard Business School. But fundamentally, it speaks to an
empathetic world in which I prefer to exist. There's totally a human side to this.
For sure. So at this point, it seems to me that trying to convince a company to embrace the
theory and the idea of chaos engineering has got to be a bit of an uphill battle in some cases.
So our problem is that our site keeps breaking and falling over when things stop working. And
your genius plan is to go ahead and start breaking things intentionally. Why do we pay you again? Seems like one of those natural
objections to the theory. In fact, that was not me mocking someone else. That's what I said the
first time that someone floated it past me. How do you introduce chaos as a step towards making
things better and not get laughed out of the room? So I think there's another misconception here where a lot of people might think about chaos
engineering as something you just go in and break things on purpose.
We are breaking things, but an important part of thinking about chaos engineering is that
thoughtful and planned nature of it, where you are thinking about the experiments,
you are planning ahead of time,
you're communicating both to upper management
as well as to other services
on what you're trying to achieve.
The goal is resilience.
The goal is not things breaking.
The hard part, of course, is getting there.
Are there some companies that are frankly too immature
for chaos engineering to be a viable strategy?
If a company's struggling to keep its website up,
it feels like introducing failure that early in the game
may not be the best path.
Maybe that's wrong.
I think it's a journey.
Chaos engineering is a journey.
You're not jumping into the deep end right away.
So we don't advocate for you not knowing what you're doing
and just running a bunch of things that break production.
You're not going to just say,
let's shut down all our servers in production
and see what happens.
That's not the intent of chaos engineering.
I like to dial it back to the thoughtful and planned approach all the time
because you are trying to do things that are in a very controlled manner.
You almost want to know what the outcome is
and you're just verifying, validating it
and also catching some surprises in a safe environment.
Well, I do understand the answer to this question may very well be,
hey, Gremlin, but I'd like a slightly more nuanced answer to:
how do you get started with the idea of chaos engineering?
For people who are listening to this right now and saying,
that's fantastic, I'm going to implement that right now.
And because people are listening while driving,
they ram a bridge abutment to see what happens.
Don't try that.
How do people get started once they safely get to the office?
Just pay Gremlin.
Serious note, as I mentioned,
I think it's a little bit of a mindset
as well as using the right tools.
You can simply look at your service,
look at your architecture, look at your
architecture diagram and ask the same questions.
What could go wrong?
Are there some hard dependencies?
Are there some soft dependencies?
Even if you know ins and outs of your code and your architecture, there's always some
learning by experiencing some of these failures.
And so to really get started, you want to ask yourself: what could go wrong?
Now, in terms of resources,
we actually have a Chaos Engineering Slack
that is not just about Gremlin
but about the overall chaos engineering practice,
and you're welcome to join the Chaos Engineering Slack.
Oh, wonderful. I'll throw a link to that in the show notes.
One question I do have,
and this might be a little on the sensitive side,
so if it is, I apologize.
But it feels like in many cases,
there are some companies,
Netflix, sorry, something stuck in my nose there,
that like to go on stage
and talk about things that they are in fact running.
And they talk about the cases in which it works,
but they don't talk about
edge cases and things where that doesn't wind up applying. A classic example of this might be,
for example, a company gets on stage and talks about how everything they do is cattle instead
of pets. And they're happy and thrilled with this. And then you go and look at their environment and
well, okay, yeah, their web tier is completely comprised of cattle. There are no pets, but their payroll is running on an AS400 somewhere. And the databases that handle
transactions are absolutely bespoke unicorns. To that end, how much of chaos engineering as
implemented today in the outside world is done holistically throughout the system or is it mostly focused on one particular area
and then broadened out over time?
I like that question.
That's a pretty interesting question.
To put it into perspective,
even a company the size of Netflix, as you mentioned,
has different teams,
different organizations,
pretty widespread.
They use different types of technologies.
The important thing about chaos engineering
is to embrace failure and prepare for failure.
So whether it is Chaos Monkey
that is shutting down hosts
or just ensuring that if my tool's not working, I can still use a
spreadsheet to track something. The important things with chaos engineering really is about
having that failure mindset and making sure that you're prepared for failure.
You just said something that flipped a bit in my mind. The idea of using a spreadsheet when a tool isn't available,
is that something that you talk about as an aspect of chaos engineering?
Most discussions I've seen have been purely focused on technical failover
and technical resilience.
You're talking about resilience of business process.
Absolutely, because when we talk about game days,
that's actually an element that we haven't discussed deeply.
So let me very quickly just talk a little bit about game days.
A game day is a time when you can bring in different people and collaboratively run experiments to reveal weaknesses, as well as just to learn about a service or a system.
And so as you're doing the game day, your experiments are not only looking at the technical
aspects, whether an application is able to handle a failure by doing retries or timing
out or some graceful degradation.
You're actually also validating your observability, whether you can see what's going on in the system,
whether you're getting the proper alerts
when certain thresholds are crossed,
and then all the way to the point where your on-call person,
when they get the alert,
they have enough information to take action.
That's all part of the experiments
and your learning as a whole.
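As a rough illustration of that last point, a game day experiment can assert on the monitoring itself, not just the application: after the fault is injected, poll the alerting system and confirm that the alert the on-call engineer depends on actually fires. The endpoint, alert name, and response shape below are invented for the sketch; substitute your own monitoring API.

```python
import time
import requests

ALERTS_API = "http://monitoring.internal/api/v1/alerts"  # hypothetical alerting endpoint
EXPECTED_ALERT = "checkout-latency-high"                 # hypothetical alert the on-call relies on


def alert_is_firing(name: str) -> bool:
    """Ask the (assumed) monitoring API whether the named alert is currently firing."""
    alerts = requests.get(ALERTS_API, timeout=5).json()
    return any(a.get("name") == name and a.get("state") == "firing" for a in alerts)


def verify_alert_fires(timeout_s: int = 300, poll_s: int = 15) -> bool:
    """After the fault is injected during the game day, confirm the alert fires
    within the window the team agreed on. A failure here is an observability gap."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if alert_is_firing(EXPECTED_ALERT):
            print("Alert fired: the observability part of the hypothesis held.")
            return True
        time.sleep(poll_s)
    print("No alert within the window: the gap is in monitoring or paging, not the application.")
    return False
```

A check like this turns "are we getting the proper alerts?" into something the game day can answer with a yes or no, rather than a hunch.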
This is actually an interesting aspect to chaos engineering.
A lot of people experience on-call by just being given a pager
and here's a runbook.
Here, catch.
Exactly, go ahead.
And the real learning comes from your first incident,
when you're very nervous about it and you don't know what to do.
So by thoughtfully planning and executing these experiments,
you're also allowing your on-call person to get ahead
and know what's coming so that they're more calm
as they execute it.
Even with runbooks, on-call engineers have made mistakes before
because they're just in this very nervous state.
And so if you calm them down in training,
helping them out to understand what the flow looks like
so that they know what to expect,
I'm sure they will feel better when a real incident comes in.
Which makes an awful lot of sense.
So there's also a conference coming up, if I'm not mistaken,
where you talk about the joy, glory, and pain
that is chaos engineering. And I'll throw a link to that in the show notes. What do you find,
or what are you hoping that comes out of a community gathering to talk about the principles
of chaos? Yep, absolutely. There's going to be a Chaos Conf that's happening in September in San
Francisco. And we have a great speaker lineup that is going to talk about their experiences in chaos engineering,
talking about failure scenarios that they've experienced and how they handle the situation.
As well as just overall a good gathering of like-minded people to talk about failures, talk about how to embrace it,
how to prepare for failures,
as well as how to handle certain situations.
It sounds like it'll be a lot of fun.
I know I'm looking forward to it.
Thank you once again for being so generous with your time.
I definitely appreciate it.
And I have to say, it's nice being here in the office
and recording with you.
And not once during this entire session
did the lights go out or a wall fall over.
So living with chaos engineering
does not mean that every minute is a disaster.
Thank you for having me.
Always a pleasure.
This has been Ho Ming Li with Gremlin,
and I'm Corey Quinn.
This is Screaming in the Cloud.
This has been this week's episode of Screaming in the Cloud.
You can also find more Corey at
screaminginthecloud.com or wherever
fine snark is sold.