Software at Scale - Software at Scale 28 - Tammy Butow: Principal SRE, Gremlin
Episode Date: July 27, 2021

Tammy Butow is a Principal SRE at Gremlin, an enterprise Chaos Engineering platform that makes it easy to build more reliable applications in order to prevent outages, innovate faster, and earn customer trust. She’s also the co-founder of Girl Geek Academy, an organization to encourage women to learn technology skills. She previously held IC and management roles in SRE at Dropbox and DigitalOcean.

Apple Podcasts | Spotify | Google Podcasts

In this episode, we talk about reliability engineering and Chaos Engineering. We talk about the growing trend of outages across the internet and their underlying reasons. We explore common themes in outages, like marketing events and lack of budgets/planning, the impact of such outages on businesses like online retailers, and how tools and methodologies from Chaos Engineering and SRE can help.

Highlights
01:00 - Starting as the seventh employee at Gremlin
04:00 - An analysis of recent outages and their root causes
09:00 - A mindset shift on software reliability
14:00 - If you’re suddenly in charge of the reliability of thousands of MySQL databases, what do you do? How do you measure your own success?
25:00 - Why is it important to know exactly how many nodes your service requires to run reliably?
30:00 - What attracts customers to Chaos Engineering? Do prospects get concerned when they hear "chaos” or “failure as a service”?
43:00 - Regression testing failure in CI/CD
51:00 - Trends of interest in Chaos Engineering over time

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Hey, welcome to another episode of the Software at Scale podcast.
Joining me today is Tammy Butow, who is a Principal SRE at Gremlin. Thank you for joining me.
Thanks so much for having me. It's great to be here.
Yeah, so maybe you could tell listeners a little bit about what Gremlin does.
And I would love to hear your story on why you started as the seventh employee at Gremlin.
Yeah, sure. So at Gremlin, our mission is to build a more reliable internet.
Something that you've probably noticed, like everyone listening, is that over the last few years
there have been a lot more outages reported in the news, and some of them are absolutely huge,
massive outages. For example, Robinhood actually had a sequence, a series of outages, and then they ended up getting a $70 million fine from the regulatory board here in the US.
And so that's a really interesting issue that's happening, I think, more and more.
And my background, like I started working at the National Australia Bank many years ago now, about 12 years ago.
And I was working on critical systems: mortgage broking, foreign exchange, internet banking.
And whenever there would be an outage, then you definitely could get a fine from... In Australia,
it's called APRA, the Australian Prudential Regulation Authority. And I remember getting my first
fine and it was this really big deal. And it was really bad. The CTO came to my desk and told me,
like, oh, you're the one that's responsible for this fine. You better make sure we don't get
another one. And I was just straight out of, you know, university, out of college. I'd studied
computer science and this was my first job. And I'd only been there for a few weeks and had this big outage. And so I really understood that it's not just that we build these
systems and we make them as awesome as we can and hopefully people love them. It's that they
really impact people's lives day to day, their livelihood. When you're working at a bank,
if your banking systems don't work like internet banking, ATMs, anything like that,
then people maybe can't eat that night because they can't get money out to feed their family.
And I would often see people writing in onto Twitter saying exactly that. So it made me really
understand that it's important what we do as engineers. And at Gremlin, we built a platform
which actually allows you to do chaos engineering,
or what folks call failure injection.
And there's a number of different out-of-the-box attack types.
For example, latency, packet loss.
One of my favorites is black hole where you can make something unavailable like an internal
or external dependency.
And so a lot of our customers are in the finance space, healthcare space, retail. Some of our customers
include Target, JPMC, like really big names that you've heard of because they care too about making
sure that everything is always up and running for their customers and working as expected.
So that's what Gremlin does. And the reason I decided to join Gremlin almost four years ago now, really early, was I was working at Dropbox at
the time. I was a site reliability engineering manager, and I was doing a lot of chaos engineering
as part of my role leading the databases team and also Magic Pocket block storage. And as part of
that work we realized that we could reduce the number of critical, high-severity incidents by
injecting failure to learn more about our systems.
And some interesting examples there were,
we did a lot of work on something called SQL proxy
because we wanted to understand
like how it failed in different ways.
And if we could actually make it more reliable,
and we did, we were able to figure that out.
But a lot of it is really having this
like very scientific mindset of,
this is my hypothesis.
This is what I think is going to happen when I inject this failure. And then you actually do it, measure the results, make some fixes afterwards, and then test again. And just so many
different use cases came out of that work, which we can dig into as well. But that's really what
got me excited in the first place: let's actually just make the internet a lot better. More and more people are getting online, and we need to make sure that
it actually works when everyone arrives.
Yeah, I think that's super interesting. And just digging into the first point we just spoke about, outages are just increasing. Why do you think that is? Is it just that there's more services that we depend on? Are we just building software more unreliably? Why do you think the trend is towards things getting worse?
Yeah.
I recently did some analysis of outages. So there's a few GitHub repos where you can see publicly
reported outages, and when you look at different types of outages there's definitely different
results. So as a whole, if you look at outages as a whole, what is the most common reason for outages?
A lot of the time, it'll be some type of configuration-related change or issue
that caused an outage. And that's really complex to solve because config can change in so many
different ways. And it could be like something just to do with spinning up
machines. It could be a specific type of configuration for a specific type of software,
like for example, Kafka or MySQL. Could be some type of managed service that you're using that
you haven't configured correctly. So that's a really, really hard one to solve. And that's
the majority of the outages there. But then if you dig into specific technologies like Kubernetes
and you look at why do those outages commonly occur,
it's actually really different because each of the cloud providers
has their own managed Kubernetes service
or you can roll your own Kubernetes.
But then digging into it is really interesting.
Like a lot of outages are actually related to CPU,
but it is maybe not what
you would expect. It's like CPU spiking, CPU throttling, downstream impact caused by that
as well, or also configuration not being set up correctly for auto scaling based on CPU.
So it's really complicated, I think, actually, to understand all of these fine,
you know, very detailed
elements of your systems when you're creating them and spinning them up. Because just using
the cloud, it's not so simple. It's different. And I think like the thing is, back in the day,
when we would build our own software, and you did everything in house, you knew what all of those
little details were, because you did it all yourself and you had
to go through it all and you understood it and you memorized the code and you knew what
the configuration was. It's so different now. You just pick something up off the shelf and you
try and make it work with what you already have, which a lot of it is really more like Lego or
plumbing, taking things and making them work together. And, you know, plumbing has leaks. That's actually what happens, right?
And if you're taking things from different places and trying to get them to work,
it's actually really, really hard. And I think the role of an engineer that's focused on reliability,
preventing data loss, improving uptime, it's really different to what it was, you know, 10, 12 years ago, for sure. I think the other thing too is
the speed of change. Companies are rolling out more products, they're trying out new technologies
faster and faster to be competitive. I remember when I first started out, we had, you know,
two products, and we were like, yeah, we'll build another product, but we're going to give ourselves
two to four years. Now you hear folks say, I want to ship a new product in six months or three months,
and just like really, really fast.
That means you need to learn a lot faster
and that can introduce a lot of failure much faster.
So it's an exciting time.
But to me, that totally makes sense
as to why we do have more outages
and why there's reliability issues.
Yeah.
Yeah, that makes sense.
I still remember when the COVID vaccine tracker
thing came out and we had to book our appointments on the California system. That site used to go
down all the time. And it makes me wonder as an engineer, this doesn't seem so hard, but I
can totally imagine there was pressure to ship this out super quickly and there's
millions of people trying to log in at the same time.
Yeah, exactly. And then there's also like the budget constraints around that, right? Because say if you're trying to build something that works reliably, but also doesn't blow your budget and
your costs, because often like you can't just throw more servers at it, because that's not a
cost effective way to handle that. So then you have to think about how can I do this with code
or with adding like, you know, some sort of queuing system to there. But that's the tough thing, right? That
also takes time to be able to figure that all out and make it work technically. So I think like,
whenever I see big outages like that, yeah, I would do exactly what you did and think through,
I wonder how long they actually had to build this? Did they have two weeks? Did they have a month? How long was it, for real?
Okay, so now that makes sense in terms of outages. So how would you think about
building software from day one that's actually reliable, or can withstand, you know,
a load spike or a CPU spike or some throttling? What is a mindset shift that you've seen
that's been super effective for people?
Yeah. I think over the last few years, you know, I
started out my career working on on-prem systems, and then I moved to working on the cloud,
using AWS, Azure, GCP. And I think actually those cloud providers have created really great
tools to help folks that are using the cloud, like AWS's Well-Architected Framework.
I think that's a really good tool to be able to look at and, you know, try and utilize whenever you're building something new.
That's got a lot of great tips.
The other thing, too, though, that I think about is, you know, I like Google's work.
It's interesting.
Like I think each of the cloud providers has done really interesting things,
like AWS's Well-Architected Framework.
With Google, I like their focus on SLOs and SLIs and error budgets,
but planning those out from the beginning,
not once you've already shipped everything and it's already running in production.
But if you can think through your SLIs and your SLOs
during the design phase,
when you're actually planning out your new system.
I think that's a great thing to do.
And obviously like that takes more time
and it takes maybe someone with,
that has an interest and an understanding of reliability
and how to meet those certain SLOs that you set.
But it also allows you to have this really great conversation,
which is to me like,
if you're building a whole new product, what are the top five to 10 most critical
pieces that you need to have SLOs and SLIs for? Because, you know, maybe you don't want to create
them for every single thing that you build, right? It could be some tiny little piece of a system
that's not that critical. If it's not available, it's okay. You can go through something else,
or there's a failover mode that works well. So I would focus more on that, and that enables you to prioritize. And I think,
as I've gone through my career, obviously that's something that you become better and better
at all the time, and it's a great skill as an engineer to learn how to prioritize what you're
doing, because you have limited time in the day. So then it's like, what are the top things that I want to get done
that are going to help me prevent failure down the road?
And so during that design phase, that's what I would say:
think about those two things.
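To make the SLO and error budget idea concrete, here is a minimal sketch of what writing those targets down during the design phase might look like. The SLI descriptions, targets, and 30-day window are hypothetical examples, not figures from the episode.

```python
# Minimal sketch: SLOs written down at design time, plus the error budget math.
# SLI descriptions, targets, and the 30-day window are hypothetical examples.

SLOS = {
    # SLI description                                  : target over a 30-day window
    "checkout availability (good requests / total)":    0.999,
    "checkout latency (share of requests under 300ms)": 0.990,
}

WINDOW_MINUTES = 30 * 24 * 60  # 30-day rolling window

def error_budget_minutes(target: float) -> float:
    """Minutes of budgeted 'badness' per window for a given SLO target."""
    return (1.0 - target) * WINDOW_MINUTES

def budget_remaining(target: float, bad_minutes_so_far: float) -> float:
    """Fraction of the window's error budget still left."""
    budget = error_budget_minutes(target)
    return max(0.0, (budget - bad_minutes_so_far) / budget)

if __name__ == "__main__":
    for sli, target in SLOS.items():
        print(f"{sli}: {target:.1%} -> budget {error_budget_minutes(target):.1f} min / 30 days")
    # A 99.9% availability target leaves roughly 43 minutes of budget,
    # so 10 bad minutes would leave about 77% of it.
    print(f"remaining: {budget_remaining(0.999, 10):.0%}")
```

The point of writing it down this early is exactly what she describes: the top five or so critical pieces get explicit targets before anything ships.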
And then once you've moved on from that,
I like to think about like, how can I codify this?
That's like a really interesting part.
Sometimes folks call it like shifting left
and figuring out how you can
do this work more in your CI/CD pipelines, working with your build team to do more proactive
testing. But I also really like this approach that JPMC has been doing, and it's all about
using chaos engineering, the approach of chaos engineering, but to create patterns to
inject failure proactively. Every time they're going to ship a new product or a new feature or
service to production, they'll run a gauntlet or a suite of these different failure injection
experiments, and then they make sure that it passes those. And I think that's awesome. It's
interesting, right? It sounds like common sense. Of course we should have been doing that years ago.
But there's just a lot of stuff that people haven't been doing, you know,
maybe it was hard to build those different types of things into our systems, or we weren't sure what
the pattern should look like. But also we didn't have a lot of stuff. We didn't
have auto scaling from AWS. That's only been around for a few years now. So it's also creating these new
types of patterns or like gauntlets of experiments that you can run in an automated way that's
codified. And then teams can feel confident when they ship their code that it's going to work when
it gets to production because they've already run this series of tests. And I just love that so much.
It's like a really empowering, scalable way to give folks the
tools to feel confident in what they're building rather than it getting to production. And then
everyone's like, hey, your software doesn't work, here's why it doesn't work. That's
also a hard thing, right? If you spent all this time building something and you're
really excited about it, and then it's got some huge major flaws and maybe you have to do a
big rollback. And that's very difficult if you've done a press release and a whole big launch around
something. So yeah, I really like that idea because it's going to help engineers feel more
confident in what they're building. And it enables us to just do a lot more proactive testing.
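As a rough illustration of the "gauntlet" pattern described above, codifying a suite of failure injection experiments that a build has to pass before it ships, here is a minimal sketch. The experiment names, hypotheses, and stubbed runner are hypothetical placeholders, not JPMC's or Gremlin's actual definitions.

```python
# Hypothetical sketch of a codified "gauntlet" of failure injection experiments
# run before a release. The experiments and the runner below are illustrative
# stubs, not a real product's API.
import sys

GAUNTLET = [
    # (experiment name, hypothesis written down before running it)
    ("dependency_blackhole_60s", "checkout still works when the ad service is unreachable"),
    ("db_latency_300ms",         "p95 latency stays within the SLO with a slow database"),
    ("cpu_spike_to_90pct",       "autoscaling adds capacity and no requests are dropped"),
]

def run_experiment(name: str) -> bool:
    """Stub: trigger the named experiment with your chaos tooling and return
    True if monitoring stayed healthy (i.e. the hypothesis held)."""
    print(f"running {name} ...")
    return True  # placeholder result

def main() -> int:
    failed = [name for name, hypothesis in GAUNTLET if not run_experiment(name)]
    if failed:
        print(f"gauntlet failed: {failed}")
        return 1  # non-zero exit fails the CI/CD stage
    print("gauntlet passed, build can be promoted")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```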
We can meet the needs of our other teams that are relying on us, like the product
teams or the marketing teams, the business teams. The CEO of your company will be a lot happier.
So I'm very excited about that.
Cool. And you were talking about some of your previous roles, and you mentioned that you were a site reliability engineering manager, and that's not a job
title you hear too often. Maybe you can walk us
through, you know, what does that role really mean? And then also, if you can talk publicly
about it, what were you doing as the manager of a databases team at a company
that manages a thousand database nodes? How do you make sure to keep your backups safe,
and make sure that, you know, your availability stays up? That'll be pretty interesting.
Yeah, for sure.
So yeah, it was really interesting being in that role as a site reliability engineering manager.
I think, you know, the reason I was really excited to take that role too is because I
just worked on a lot of really critical systems and I really believe that reliability is like
core and it's so important. And, you know, to me,
reliability is feature number one, because if your product doesn't work, like it's not up
and running, then no one can use it. So it doesn't matter what features you've built, it's just
not even available to people. And I saw a lot of those issues where we didn't focus enough
on reliability and then you know maybe your very senior sales executives would be doing a demo
of your product. And it just wasn't even up for them to be able to show these customers that are
like VIP customers. And that's a really bad situation. And so I like the idea of being
able to just focus on reliability. I thought that was really awesome. And so when Dropbox
reached out about that role, I thought that was great. And I've always loved databases
because I just love data.
I think that it's just really cool, actually.
Like I'm definitely a data nerd.
I like the idea of being able to store all of our data
and making it available on the internet,
like the data that we choose to share with others.
The idea of you can basically read any book now
because it's on the internet.
I think that's so cool.
And like that you can watch movies, you know, coming from Australia, it was hard to get
books sometimes because like when I was young, like, what am I going to do?
Buy it from America and it would take like months to get there.
It's just like really like coming from a small island, it's like very far away and very remote
from everywhere else.
The internet is like a lifeline that keeps you connected
to knowledge and to other people.
And I was a big fan of Dropbox.
I've been a Dropbox customer since I was in university
and I used to use it with the other folks that were studying computer science
when we were doing our projects work.
We would share things like in our Dropbox folder.
And so I thought that was pretty amazing.
And then I started to use it too
at the National Australia Bank.
We were also using Dropbox there.
And all my friends and family use it as well.
And I know like a lot of people use it
for really interesting use cases.
And that's always what matters to me.
It was like, I know that like really famous bands
make their music on Dropbox.
Like that's how they share the files around. I know that like lawyers use it when they're doing huge court cases
And so being able to say, okay, when that huge court case is going to trial, Dropbox
will be up and running and they'll be able to access all of their data, because my team's helping
to make sure that that happens. That made me very motivated as an individual.
And also then meeting the team,
like it's like a superstar team of folks
a lot of folks who come from YouTube,
from Percona, Booking.com,
like a lot of amazing MySQL experts
who had just been working with databases
for years and years.
And they were like amazing at all sorts of things
like performance tuning, Linux, the Linux kernel,
being able to do backups, restores,
like building automated systems to test restores of backups,
which was just happening all the time.
Building like web UI interfaces to be able to manage backups
and see which ones were working
which ones failed. Like, this is all stuff that I'd just never seen anywhere else. When I'd gone
to visit companies or talked to friends, have you ever seen anything like this? They were like, no, that's
so cool. And so I just love this idea that basically the Dropbox databases team was building
startups at Dropbox, but specifically dev tools for databases, which was awesome.
I'm like, wow, this is amazing. And so there's still a lot of things that, you know, the Dropbox
SREs and database engineers, block storage engineers, a lot of things that they built
that do not exist anywhere else, except maybe like at companies like Google or something like that.
But they're not things that you can just use, you know, they're not products. And so it was really cool to be able to see that and see what everyone had
built. And the reason too, why they had to build that was super small team. It's like, you know,
I think when I joined, we had 200 million customers and there were only four database engineers.
And then when I left, we had 500 million customers and like five database engineers.
So we just had to do a lot of automation and a lot of, you know, large scale, like looking
after systems, not with adding extra bodies, but by trying to be smart and intelligent
and building systems.
And that was like also very motivating for me.
I love that as well.
Yes. And what was the role
once you started working there? Like, how do you measure that you're successful?
Is it just like, oh, if the site is up at three nines, I'm doing my job? Or how do you
go deeper than that?
Yeah, yeah, that's a great question. So I think these days everyone measures
things pretty differently. But back then when I started doing that, it was, you know, maybe seven years ago
now or something like that, six years ago now.
When I very first came onto the job, I mean, even during the interviews, I asked that question.
I think that's a great question to ask if you're an SRE during the interview.
How will my success be measured?
Because if you want to have an amazing career, a great journey, if you're on a mission to do really great work, then it's a good
thing to ask. And so I said, hey, what are the big problems that you would want me
to help you solve? That's an interview question I always ask. And they were like, well,
we actually have a pretty high amount of on-call pages that are happening.
And we're not sure if they're like actual problems or if it's like noisy pages, if there's like automation that we could add in, if it's toil.
We want to be able to like dig into that and then reduce it.
And I was like, yeah, sure.
Like that's a really great first project because I'm going to learn so much from doing that.
Right.
When you get assigned that project, you're like, yeah, I'm going to learn all about the systems,
all about the different failure modes. I'm going to try and actually decrease the number of
incidents that are happening so that we aren't getting paged at 2am in the morning anymore.
Because who wants that? That's annoying. And it's also really bad for customers too and for
the business. So I started on the team and I asked folks, yeah, like, you know,
do you have any idea why it's so high?
Do you think we'd be able to reduce it?
And they were just like pretty, I'd say like, you know,
it was tough at that time because maybe they were getting like a lot of pages
through the night and it's hard to like step out of that sort of like mindset
when you're constantly being bombarded with pages and
you're just like getting hit with them all the time it's hard to like step back and think like
how can i stop this from happening because you're just trying to keep everything up and running
and they're like an amazing team so they built all these great tools and all this great software
and that's an awesome thing about having a new team member join the team right like they're
able to come in and look at it just from a different approach,
different angle, like fresh set of eyes.
And I always love when you add a new team member to your team and they do this.
And so, yeah, just ask a few questions and then started to,
I just pulled all of the data because I love data, like totally a data nerd.
I pulled all of the data for, I think,
six months of incidents that had happened, like every single
page, and then analyzed all those pages by like just crunching that data and being able to pull
out patterns and trends. And then that gave me like more interesting questions that I could ask
the team, you know? So I came back that next week. We had a weekly on-call session on Wednesday
mornings, like an on-call handover. And I was like, I noticed that, you know, 80% of our pages are related to this one page,
this one specific database system. Why? And I was like, oh, that's really interesting.
Maybe we can prevent that from happening. That's interesting, let's dig into that. We
can do a project around that. And so that got everyone really excited about it.
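A rough sketch of the kind of page-crunching she describes here, pulling a few months of paging data and grouping it to see which systems dominate. The CSV export and column names are hypothetical; real data would come from your paging or monitoring tool.

```python
# Hypothetical sketch: find which systems generate most of the on-call pages.
import pandas as pd

pages = pd.read_csv("pages_last_6_months.csv", parse_dates=["triggered_at"])

# Share of pages per alerting source, largest first
by_source = pages.groupby("alert_source").size().sort_values(ascending=False)
print((by_source / by_source.sum()).head(10).to_string(float_format="{:.1%}".format))

# How many pages land in the middle of the night
pages["hour"] = pages["triggered_at"].dt.hour
print(pages[pages["hour"].between(0, 5)].shape[0], "pages between midnight and 6am")
```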
It was the file journal. So then we worked with that other team, which owned the file journal, to be able to collaborate with them to do some interesting failure injection experiments,
to understand how and why this system failed with the database as the backend for all of the data
there. And we had a really good understanding of like what we needed to fix
and also what we needed to prevent going forwards.
And we could create some good like patterns
for how we work together as a team.
And that was awesome.
And also much better reporting,
much better understanding of that.
So that was like one of the key things that we did.
And fixing those issues,
injecting that failure,
doing that chaos engineering work,
that ended up getting an incident
reduction. And then we did more, like I mentioned, with another system. There's another system
called SQL proxy, and that one also was causing a lot of issues, but the code had been
around for a long time and no one really understood it well. So injecting failure is just a way quicker
and easier way to understand it, by doing things like process killing, shutting down
nodes, understanding how many do you need to have, how many proxies running, what is the
sweet spot, do you have too little, do you have too many, what does it need to be. And just
being like a real scientist, which I like about this approach now too with chaos engineering:
you know, we study computer science, so let's bring the science into it and experiment more and learn more. And I don't know, I feel like that's
why it's so exciting to be an engineer, the fact that you do get to experiment and dig into things
and analyze the data and then be like, I think if I do this, this, and this, then I'll be able to make
an improvement of, you know, 20%, or 10%, or 100%, whatever it is. And I'm also
a bit of a gambling woman, I'd say. Like I do like to, I like to play pool. If anyone ever wants to
play pool, I'm always down. But I always like to say like, I'm going to hit my ball from here to
that, like, you know, to the right corner. And then I'm going to hit that other ball. And then
we're going to go into the pocket. Or I'm going to jump that ball and then that other ball is going to hit there,
that corner, and then we're going to go into the pocket. So I don't know. I just kind of think it's
more fun to like call out what you're going to do. And then it's also more impressive. Like it's not
like it was a fluke, right? Because you said it, you're like, I'm going to do this, this, this,
and then you do it and you have done it. Like you've demonstrated that you could do it.
And I definitely learned that playing pool for like many,
many hours when I was in university, which is pretty funny.
But it's a great thing, great skill.
And I think it helps you as an engineer as well.
And so that's a thing too I think of when you're doing this work,
it's really important to communicate what you're doing.
And I think often
as engineers we don't focus on the communication of what we're doing. But it's: tell everyone what you're going to do, then do it, then tell everyone what you did and what the
improvement was. It's really basic, but that's my framework for it as well.
Cool. And I have a lot of questions about chaos engineering, but first I need you to elaborate on one thing, which I think sounds super basic, but I think is important to just understand. Why is it important to know exactly how many nodes your service needs to run? Say that knowledge belonged to somebody else and they've left the company, and now you have the service and you have no idea whether
you're over-provisioned or under-provisioned. Why is it important to actually know that?
Yeah, oh, that's
a great question. I love that. And I think, you know, so that specifically is so you can
serve the traffic that you have. That's the basic answer. And it's difficult,
like if you have fluctuating traffic on different days,
different hours of the day, like sometimes you can have massive traffic spikes where you need
more nodes. And that could be like, you know, maybe Monday morning for some types of services
and then say like Sundays are really nothing much. So you could have way less nodes in your
fleet for a specific service because you just don't have as much traffic that requires those
nodes. So that's basically what it is. And I like to think of it as a fleet of nodes.
But knowing the right amount is difficult because, like I said, it can fluctuate.
And the thing too is, yeah, if you join a team, you just are not going to know, is
this the right amount, is this too little, is this too high? Until you dig into the data and understand it and look at the patterns that have been happening.
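A back-of-the-envelope sketch of that sizing question, how many nodes the fleet needs for the traffic it actually sees, with some headroom for spikes. Every number here is a hypothetical example, not from the episode.

```python
# Hypothetical back-of-the-envelope node count from observed traffic.
# All numbers below are illustrative; the real inputs come from your metrics.
import math

peak_rps = 12_000      # peak requests/sec observed (e.g. Monday morning)
quiet_rps = 1_500      # quiet-period requests/sec (e.g. Sunday)
per_node_rps = 800     # what one node sustains before latency degrades
headroom = 1.4         # 40% headroom for spikes and node failures

def nodes_needed(rps: float) -> int:
    return math.ceil(rps * headroom / per_node_rps)

print(f"Peak:  {nodes_needed(peak_rps)} nodes")   # 21
print(f"Quiet: {nodes_needed(quiet_rps)} nodes")  # 3
```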
And then also, the other important thing there to learn too is those patterns can change at an
instant. Say, for example, what happens if your marketing team does a huge campaign? This happened
while I was at Dropbox, where Dropbox did a massive campaign for Dropbox Business.
That was all over the news.
And that's huge, right?
So that's going to change all of your patterns.
The other campaign was one which was an integration with Samsung mobile phones.
So every time someone set up their mobile phone, it would call out to Dropbox.
So that's also a lot of traffic, new traffic that you'll get.
And then that made me realize as an SRE,
it's super important to be actually talking and watching
and seeing like what your marketing team is doing actually,
which like I never, ever thought of doing that
in the early days as an engineer,
because I feel like being an engineer is so far away
from what marketing is.
But actually, if you know, okay,
we're going to do this huge, like, marketing campaign.
There's going to be billboards.
There's going to be TV ads.
There's going to be, like, a big push on social media.
We estimate we're going to get, you know, a million, 100 million,
whatever it is, new users.
Then you can know, like, to prepare for that.
And then also you want to understand, like,
what are the usage patterns going to be?
Like, what will those people be doing? Like, how will they be using our API,
for example? Like, what are going to be the common calls that they'll be making?
Are we trying to get users to do something totally different than they were doing before?
So like, those are all the questions that I ask now. And that's like, definitely not something
that I knew coming out of university at all. You know, this idea of trying to predict what things would be like once a new,
completely new product launches, that didn't exist.
And also, I just don't think we even have meetings like that where marketing and engineering
sit down together and go, okay, like how do we prepare for this to make sure it's reliable,
which we should do.
So I've definitely been encouraging folks to do that.
It's important to have reliability in the design phase of a new product, but also when you do launch, because you want to know,
is marketing putting millions of dollars behind this launch? Because that's going to change
things. If it's a soft launch, then that's totally different as well, you know, it doesn't matter as
much. But yeah.
Yeah, I've heard anecdotally that, you know, Uber Eats just provisions a lot of servers for Super Bowl Day.
Yeah.
And I've also heard like Prime Day is like a six month event at Amazon to make sure that everything is correct and ready for prime time, I guess.
Yeah, that's exactly right.
Like, you know, when some of those peak days are... I've also heard with Uber, obviously New Year's Eve is a huge day for Uber.
So they make sure to have enough nodes,
enough machines provisioned then as well.
So there's like some things you can kind of guess.
Like I think I need to be ready for this,
but better to even just have like,
I don't think this ever existed yet,
but me and you, we can like riff on it
and come up with ideas.
It's like, what about a reliability calendar or something? We just know these are the points that matter for our
business. And in your first week when you join a new company, you could say, hey, what are
our most important days of the year, that happen when we get loads of traffic? I want
to be ready for those days and make sure that we always crush it and do an amazing job. I
think that's a great question to ask too. Yeah.
Yeah.
We don't want an embarrassing moment on like the day we spent millions of
dollars.
Exactly.
Like the ball.
Yeah.
So then let me ask you a little bit about chaos engineering.
We've spoken about the problem, right?
There's outages, there's like on-call toil,
which is like a really important thing for SREs to solve.
And I think these are like approachable problems.
Like people are generally aware,
like these are real problems and we need to solve them.
How does chaos engineering help, I guess, is the first piece.
But I think what I'm interested in is
how do you productize a solution?
A lot of these solutions are, you know,
something that I would think
the company has to implement internally.
What was the initial idea, if you can talk about, you know, Gremlin? And how do you sell a solution to customers is something that I'm super curious about.
Yeah, for sure. So I guess,
you know, to think about what is chaos engineering and how does it help you, for example, reduce
outages. One of the things is, you can think about this like, I personally choose
chaos engineering, and I have for the past, you know, 12 years, as my favorite way to be able
to make sure that systems are more reliable, because there's a number of different things
that you can do. But to me, I've picked chaos engineering because I feel like
it's the thing that gives me the biggest reward in the shortest amount of time. And it gives me the best long-lasting understanding of my systems
that I'm working on and the best knowledge of the systems and also the customers, the product,
just everything. And I'm always looking for what's the most efficient but also impactful way to learn
about something. And so the reason that I say that is I've done lots of different things.
So say, for example, you join a new team,
you have this service that you pick up and you're told,
hey, this service is not reliable.
I'd like you to try and improve it.
It currently has, you know, 500 pages a week.
Everyone's too scared to make code changes
because it's really like old piece of software.
We can't deprecate it yet
because we're not sure actually how it even works.
We're not sure what would happen if we did deprecate it.
We're not sure like what it even connects to,
what the dependencies are,
like upstream, downstream,
what cascading failures we might have.
Like you just like think through all of those things
of like what this system,
the damage could be if something went wrong.
Say if you did a code change and then suddenly it actually
made things way worse.
That would totally happen.
And it's kind of like also from building things
and getting a bit burned.
You realize you have to be a bit more careful.
So it sounds interesting, but actually chaos engineering
is a more careful way
to understand systems and how they fail
than like just making code change
to see like what happens now,
like pushing code into production.
That's like, to me, too dangerous of an approach.
And it feels like you're going in blind,
like just doing a code change.
And so the idea there is like,
okay, if I'm to think like a scientist,
I'm not just going to randomly change stuff. I want to do an experiment. And so if I want to understand
this system, I want to understand like, how does this system impact other systems within my
architecture? So I can do little tiny experiments that allow me to inject failure. So one really
good example, like say if you've got this, maybe it's say an ad service
within an e-commerce store.
That could be our system.
And we've got a lot of problems
with this ad service,
but we want it to work.
But let's just fail it.
But specifically,
we could do something
called a black hole attack,
which means you can make this service
unavailable for 60 seconds.
You don't have to do anything else.
That's like a Gremlin
specific type of attack.
And we're just going to make it unavailable for 60 seconds and see what happens to other services
around it. And you don't even have to do this in production, right? You can do this in dev.
You can do this in your pre-prod environments. And you can see when the ad service doesn't work,
is everything else still functional? Can I check out items, can I purchase items, can I add
things to my cart, can I look at the catalog, all that sort of stuff? Am I getting any other pages
from any other systems that are trying to call the ad service and it's not there, and then that's
causing problems for those services? Those are the things that I would do. And then from there,
you can go, okay, you either learn that this is really badly hard
coded and there's a lot of different issues, but you actually would know all the systems that
it has issues with, what hard-coded dependencies are there on the ad service, what do you need to
then prioritize fixing. Or you're like, boom, in 60 seconds I learned that this service does not
have any issues if we just take it away. That's awesome. How fast is that?
If you just imagine any other way to learn, how else can you learn something in 60 seconds?
And so that's why I love to nerd out about it. It's just such a fast way to be able to learn.
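A rough sketch of what verifying that kind of hypothesis might look like: while the ad service is made unreachable for about 60 seconds by your chaos tooling, keep exercising the user-facing flows and check that they still succeed. The environment URL, endpoints, and timing below are hypothetical, and this is not Gremlin's API.

```python
# Hypothetical sketch: while a dependency (e.g. the ad service) is blackholed
# for ~60 seconds by your chaos tooling, poll the user-facing flows and verify
# they still work. Endpoints and thresholds below are illustrative.
import time
import requests

STOREFRONT = "https://staging.example.com"   # hypothetical pre-prod environment
CRITICAL_FLOWS = ["/catalog", "/cart", "/checkout/health"]

def flows_healthy(timeout_s: float = 2.0) -> bool:
    """Return True if every critical flow responds OK within the timeout."""
    for path in CRITICAL_FLOWS:
        try:
            r = requests.get(STOREFRONT + path, timeout=timeout_s)
            if r.status_code != 200:
                return False
        except requests.RequestException:
            return False
    return True

# Hypothesis: with the ad service unavailable, catalog/cart/checkout stay up.
failures = 0
deadline = time.time() + 60          # roughly the length of the blackhole attack
while time.time() < deadline:
    if not flows_healthy():
        failures += 1
    time.sleep(5)

print("hypothesis held" if failures == 0 else f"hypothesis failed {failures} times")
```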
And it wasn't always like that. To be able to do a black hole attack in 60 seconds,
that's something that we built at Gremlin and built into the product. And it works for everybody. Like everyone can use that on Windows, on Linux.
We even have a serverless feature, ALFI, application-level failure injection,
where you're able to do something like that as well.
If you write in Java, that's a beta product right now
and you just integrate it with your code.
But like we created
that because I remember in the past, when I would try and do activities like that, a failover activity
is what you would call it, right? It was just, wow, what a nightmare. Doing them at the National
Australia Bank 12 years ago, it was something that we had to plan for probably three months
to be able to do an experiment like that. We would have to book out the weekend.
We'd have to go to a separate office.
We'd have to make sure that all the other teams that might get paged
that might see an issue were in the room at the same time.
We'd need to all be sitting there together live.
Like we didn't have tools like Slack.
Like we couldn't just communicate with each other.
We didn't have PagerDuty where we could quickly pull up the reports
of the pages, or, you know, software like Datadog and New Relic for really awesome monitoring. It was a lot of
logging, and, you know, Splunk didn't exist back then either. So it's just, now that we have all
these tools, we can learn amazing things in 60 seconds if we just inject failure and learn
quickly, and then you just turn it off and everything's good as gold again. So it goes
back to the state that you were at previously. So yeah, that's why I really like it.
Do you feel like customers get scared when they hear about the concept of
chaos engineering the first time? And how do you help them get over that initial barrier?
Yeah, I think a lot of people do get scared, mostly because they think that they have to do
chaos engineering in production first, which is not true. Obviously you don't have to start in
production. It's very powerful to start in other environments, definitely. So I would say that
is really like, as soon as everyone hears that, they're like, oh, that makes sense. I'm like,
yeah, it's a journey. Like, I never started in production. When I was doing it at Dropbox, I started in the staging environments,
in our dev environments for databases, because we had these staging databases that
we could do all of our experimental work on, that were in a totally different, safe environment. So
you could just test things and try things out, and it was much better, like you don't have to be
worried. And then once you're ready, then we could do it in production. But I always say that it's a journey to get to production. It could take years,
and that's totally okay. Sometimes it takes folks two to three years to be able to get to that point,
and maybe some folks might never get to production, and that's also okay. You can learn so much
from doing it in your environments before production. So that's probably the main reason
they get worried.
And then I think also sometimes the name just scares them
like chaos engineering because it sounds like very chaotic.
But I think like, you know, for me,
I practice chaos engineering as a reliability engineer.
I always bring it back to that.
Like our goal is to make systems more reliable.
We're going to actually create chaos in the system,
but it's going to actually
help us uncover issues and make our system reliable. So that's really what it's all about.
And it's like controlled chaos. I like to say that too. You know, we're not just going to,
I don't like the idea of randomly injecting failure. I love this like experimental approach.
Let's be a scientist and let's learn that way.
Yeah, yeah, that makes a lot of sense. And it's super similar, I think, to the idea of Chaos Monkey
that Netflix released a few years ago. Is one of the founders of Gremlin somebody who
wrote Chaos Monkey a long time back?
Yeah. So Chaos Monkey, that was released, I think, in 2010 or 2011, something like that.
So it's like, wow, 10 years ago now.
And Kolton, our CEO, he worked at Netflix.
He created something which was very similar to actually what we have
called ALFI.
He created something called FIT,
which is the failure injection testing framework at Netflix.
So, yeah, he worked on that.
He also built something called Gremlin at AWS,
which is a lot like Gremlin that you can use. And he did that before working at Netflix. So
they were also doing chaos engineering there, but they called it failure injection. And so yeah,
he's been doing this work for such a long time, you know, maybe 20 years, something like that,
doing chaos engineering and failure injection.
Yeah. And what is the product, really? I'm curious about, you know, where are some of the complexities
of where the product comes in. So the mental model of Gremlin in my
head right now is just, you have a system which lets you run these tests against,
you know, like maybe like a set of services
or like a set of instances and lets you decide,
you know, run these like chaos tests.
But then what are all of the knobs and stuff that you have to tune
and like how do those help the customer
is something I'm pretty interested in.
Yeah, totally.
Yeah, so it's definitely changed a lot over the years. Gremlin, what it looks like right now, if you're to log in and there is a free version,
if you go to gremlin.com slash buttons, you can try it out. Buttons is my nickname. So that's
what that is. But so if you go there, you'll actually be able to see there's a few key
features. So one of the ones that we just released is service discovery.
So once you have our agent running, you just install our agent, you know, the daemon, you have
it running on your machines wherever you'd like it to run, or you can run it as a Helm chart if
you're using Kubernetes, OpenShift, something like that. And so then what you can do is you can
automatically see all of your services within that. You can see how many nodes does your service live on,
like how many hosts. You can see how many pods does it have. That allows you to understand
like what you would be attacking. Say, for example, if you go, I want to attack just one of the three
hosts that this service is on, or I want to attack 50%, something like that, we'll actually be able
to then pull that data for you
and allow you to then inject the failure.
And there's also a visualization tool.
So it actually shows you a map of like all of your different nodes
and your pods if you're using Kubernetes, for example,
and where they fit, and it will highlight them
when you're creating your experiment.
And then what we want to do there is actually think through,
okay, I understand like the specific service I pick,
like ad service, for example,
I understand that that lives across two hosts,
there's two pods.
And then the next thing that you want to think is
what kind of failure do I want to inject?
And we have 11 different types of failures,
which are just out of the box.
And so a lot of the time what folks do is
they'll either start with just one type,
like maybe packet loss, latency, process killer,
could be like something like CPU, IO, memory, disk, spiking.
And they're thinking through like a specific use case.
Like for example, what if you want to test auto scaling?
So you can inject CPU to spike CPU using that attack. And what you might do is
chain three or four CPU attacks together with a little bit of a delay in between. So say,
let's inject CPU here, now let's have a little delay, inject more, inject more, until suddenly your
CPU is spiked. Then it should kick in auto scaling, and it should work. And then your CPU is also
going to go back down, so then auto scaling
should go back to the situation you were in before, like you should release the extra nodes that
you created. And that's something that you definitely want to test before you're suddenly
getting a ton of traffic and need to use auto scaling and it doesn't work. A lot of outages
have been caused by incorrectly configured auto scaling, just as one simple use case. But that's often what I see folks do is,
and linking back to, you know,
this idea of creating patterns and codifying your work
and, you know, being able to think through,
if I was to do this work in a CI/CD pipeline,
you know, not just manually creating these tests,
but having them run over and over,
then you're thinking through like,
I want to have auto-scaling as a pattern that I test, and I want to just create a Gremlin experiment to do that,
codify everything, and then just make sure every time you ship something new, a new feature, a new
addition, a new piece of code that adds to that service, you're just going to run this again
and make sure that everything works correctly. And we also have another feature called status
checks, which checks your monitoring before and after. So that's really cool, right? You hook it in to say,
let's check that the service is up and running. Yep. Now let's run the attacks. Yep. Now let's
check. Actually, our monitoring still says everything's good. We didn't suddenly have
an outage or like the system crashed or the service is no longer available or the service,
you know, SLI went down. We're no longer meeting our SLO.
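A rough sketch of that before/attack/after flow as a CI/CD gate: check the monitoring, run the chaos scenario, check again, and fail the build if the SLI degraded. The functions and thresholds are hypothetical stubs, not Gremlin's actual status checks feature.

```python
# Hypothetical CI/CD gate around a chaos scenario: status check before,
# run the experiment, status check after. All functions below are stubs
# standing in for real monitoring and chaos tooling.
import sys

def sli_error_rate() -> float:
    """Stub: query your monitoring system for the service's current error rate."""
    return 0.0005  # placeholder value

def run_chaos_scenario(name: str) -> None:
    """Stub: trigger the named scenario (e.g. a chained CPU spike) via your chaos tool."""
    print(f"running scenario: {name}")

SLO_MAX_ERROR_RATE = 0.001  # hypothetical SLO: 99.9% of requests succeed

def gate() -> int:
    if sli_error_rate() > SLO_MAX_ERROR_RATE:
        print("pre-check failed: service already unhealthy, not injecting failure")
        return 1
    run_chaos_scenario("cpu-spike-autoscaling-test")
    if sli_error_rate() > SLO_MAX_ERROR_RATE:
        print("post-check failed: SLI degraded during the experiment, blocking deploy")
        return 1
    print("status checks passed, promoting build")
    return 0

if __name__ == "__main__":
    sys.exit(gate())
```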
So that's what I see a lot of customers doing is they start there
by thinking of what are their specific use cases
that they want to test for.
Then they go and make those into experiments in Gremlin.
We call them scenarios.
And then they'll look at how can I automate this
within my CICD pipelines.
So that's like a really cool thing I
see with the integration of status checks too.
Yeah, that is so interesting. So it's basically
like you're adding regression tests, in a sense. That's what the example sounds like, right?
Check if my auto scaling is working at this scenario. And only if the status check says
you're approved, should you move on to the next step of, you know,
pushing to all of production or something like that.
Exactly. Yep. That's exactly it.
And I think a lot about regression testing
when thinking through these types of experiments.
You know, like I remember there was a huge regression testing project
that one of the engineers on my team, the databases team was working on.
And she was looking at making all of the pages
on Dropbox.com much faster to run.
And like they went through and identified
all of the pages, how fast they ran right now,
which ones were the slowest,
looked at making improvements.
But the thing that you always got to remember
is like, dang, like someone can come in tomorrow
and ruin all this great work that we did.
So you've got to have regression testing in place.
Like we all know, someone can write
one bad SQL query
and then everything's blown, you know, with your metrics for perf.
And I think like that's an interesting thing too.
Like chaos engineering really appeals to folks that are SREs,
but also performance engineers,
because we have attacks that you can run that inject latency.
So you can actually say,
what happens if I add latency to this service?
Would I know? How does it impact my service? What's going to happen to other dependencies
on it that work with this service? And then also a lot of like QA engineers who are looking to do
more automation and shifting left, like integrating with CICD, they also are interested in chaos
engineering because they're like, wow, this is cool. Like, I can prevent these issues before they get to production in like a really nice way
and build out a super scalable system that just tests all of these services.
So yeah, that's cool, too.
Yeah.
And I guess if you integrate it as part of the CI/CD pipeline, you also don't have to worry
about, is anybody actually going to fix the bugs caused
by... So one thing that I always thought about was, we would do these monthly DRTs
at my previous job at Dropbox. And we'd always have to prioritize whatever we found out. And
sometimes you just don't do those action items. So what's the point? But if you again, shift left,
you need to make sure these things are fixed before you roll
out a new version, which is pretty interesting.
Yeah, I really like that approach too, because a lot of
folks say that exact thing. Say if you're, you know, doing a DRT exercise in production or,
you know, a different environment, you go through, you identify those issues. You know, some teams are
good about it, they'll be like, I'm going to get this done. But then sometimes you can't, because maybe you have some other team pushing you to deliver
something else, maybe related to their items that came out of their DRT.
So it can be really hard.
Like which team gets priority for you to help them, especially, you know, the team that
you were in, you're like helping every team across the whole company.
So it's really hard to prioritize at that point.
And so I really like this idea of doing it within
the CI/CD pipeline. And also then everyone has the metrics, everyone has the data, everyone knows what's
passing and what's failing. It's just a way better, more visible approach. And then that helps
push back to the management and the leadership and say, hey, you know, whenever we're trying
to ship new features to this service, it doesn't work well.
I think we need more headcount on that service.
Like, you know, we need to put engineering team,
like, you know, resources.
We need to put folks on there that can help
because it's kind of hard to prove
that you need more people on your team sometimes.
And I feel like that's always like
the constant battle in engineering.
And this is like a real way to do it with data as well.
So all of this makes sense to me. When you were interviewing with Gremlin, like a few years ago,
you must have had like a certain idea of, you know, what the product is and how it can help
customers. And you decided to join. What is something like unexpected you've learned,
like on the way, like on the journey, like how are customers using the product differently from
how you thought about it when, you know, you designed the product, or how Gremlin engineers and EPD were thinking about things?
What is something different that you've learned about it?
Yeah, I mean, definitely, when I first saw the product, I was just like, wow, I was blown away. And this is, you know, back
when it was seed round, pre-Series A. And at Dropbox, I had built a lot of the tooling to do the chaos
engineering at Dropbox. And a lot of it was just not as advanced, you know, as what Gremlin is.
It just wasn't. Like Gremlin, when I first saw it was this amazing like UI that you logged into,
it was super easy to deploy the agent. You then had access to all these different attack types.
And I just never thought
of some of the attack types that exist. That's what I thought was really cool. So for example,
at Dropbox we'd done a lot of process killer attacks, which I thought was pretty awesome, like
that's much more advanced, a lot of people don't do that, and then shutdown attacks as well. But
things like, hey, let's inject latency, let's inject packet loss, like we'd never done networking-related
types of attacks, like injecting networking failure. We'd done networking-related experiments
where we're trying to understand, you know, why is this being throttled by the network, and
then we were able to figure it out, but it took a long time to get to that point. And if you can
actually just debug the network
by injecting failure into the network,
that helps you prove it.
So I love that because there's always this saying of like,
it's never the network.
And sometimes it is, like I've just proved it that it is,
but it takes you, I feel like,
say it takes you a day to prove
a basic normal type of issue is happening as an SRE.
For a networking one, I feel like it was
always weeks to be able to prove that it was the network, and to get the networking engineering team
to back you up and work with you to resolve it, because they've also got a lot of work
that they have to do. So to get them to listen to you and work with you is hard, but this helps you
get that data. But the thing, the experiment or the attack type that I was most impressed with was definitely black hole. And that's because I had always seen failover, like say, you know,
region failover, service failover, being able to switch to a backup, like hot-hot, like let's
just shut down one region and make sure that the other one works. I'd always seen it as that, not what we have as black hole.
Like a lot of the time it was like, let's shut down completely.
Like let's tear down this data center.
We're just going to shut down everything.
And then we're going to see what happens.
And that's a very destructive approach, actually,
because you're thinking of that, you're shutting everything down,
like doing a power outage.
And then you're going to have to bring everything back up. And just any act of taking it down and bringing it back up, that
can introduce a lot of extra failure that you don't really want to introduce. And also it just
takes a lot of time because of that whole process of bringing everything back up. And so the
idea of a black hole, that you could do a failover exercise in 60 seconds without having to turn
anything off, just making it kind of
invisible for a period of time, like wow, that's cool. That's just a great pattern that's so
much safer and i like gremlin's approach of safety that was like one of the biggest um focuses and
security and simplicity that was like the three values since i joined the other really cool thing
too is a halt all button.
So there's this button in the UI at the top right,
and it says halt all.
And you can just stop all experiments that are running at any time.
And the agent will just like stop running them.
There's a dead man's switch, which we built into it, which I love as well.
So it's just like all these really cool like safety and security features, which I just would nerd out about as well.
I'm like, that's such a great idea.
So yeah, those are still my favorites.
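As a very rough sketch of those two ideas combined (this is not Gremlin's implementation), a black-hole-style experiment plus a dead man's switch might look something like this on a single Linux host; the target CIDR and the timeout are made-up values:

```python
# Rough sketch, not Gremlin's implementation: a "black hole" experiment that drops
# traffic to a dependency instead of shutting anything down, plus a dead man's
# switch that reverts the rule automatically if the operator disappears.
# Assumptions: Linux, root privileges, iptables available; 10.0.0.0/24 is made up.
import subprocess
import threading

TARGET_CIDR = "10.0.0.0/24"   # hypothetical dependency we want to make "invisible"
RULE = ["OUTPUT", "-d", TARGET_CIDR, "-j", "DROP"]

def start_blackhole():
    subprocess.run(["iptables", "-I"] + RULE, check=True)

def halt():
    # The "Halt All" idea: revert immediately, no restart or re-provisioning needed.
    # check=False so a second call (rule already removed) is harmless.
    subprocess.run(["iptables", "-D"] + RULE, check=False)

def run_experiment(max_seconds=300):
    start_blackhole()
    # Dead man's switch: if nothing cancels this timer, the experiment reverts itself.
    timer = threading.Timer(max_seconds, halt)
    timer.start()
    return timer  # caller can cancel the timer and call halt() earlier

if __name__ == "__main__":
    t = run_experiment(max_seconds=60)
    input("Press Enter to halt the experiment early...")
    t.cancel()
    halt()
```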
And I think you were mentioning customers earlier. Have you just generally seen customers that care a lot about uptime? And what is the trend of customer interest? Like, have you seen, you know, more people finding out about the term chaos engineering? What have you seen over the last few years?
Yeah, it's definitely changed. So, like, you know, back four years ago when I joined Gremlin, chaos engineering was popular, but it was really just starting to take off. I think a lot of people had heard about it, they'd heard of Chaos Monkey, but I would say most people hadn't done chaos engineering. They'd just heard about it. And then I was having fun speaking at a few
conferences, but I didn't want to just talk. I wanted to like help people do chaos engineering.
I'm like, this is like a fun thing to do and you're going to learn a lot. So I was teaching
this workshop while I was at Dropbox, just a chaos engineering bootcamp. And it was really fun. It
was like, I got everyone to
spin up a Kubernetes cluster and then inject failure into the cluster. So at the start,
I'd be like, hey, put your hand up if you've done chaos engineering, and like two people would put their hand up. And then at the end, it's like 300 people put their hand up out of
the workshops. That was like a fun way for me to help everyone across the industry just get to
actually try it out. And a lot of people, when they did it, they're like, oh my gosh, I love that so much. Like I would get them to inject packet loss and
they'd be like, I can see it visually that everything's running slower. Like that's
really cool that I'm able to do that. It's a very visceral feeling. Like you're trying to
use your service and you just can't because it's no longer working as expected. And I think it was
rare. Like it's still rare to see what your service,
your application looks like during a failure mode, like when there is an issue.
And coming from Australia, there are so many networking problems, with latency and packet loss. I'm like, you know, it's just a thing; other people don't have to experience that. I'm like, oh, I've been buffering videos for what feels like a lifetime, coming from Australia.
And in America, you never have to buffer anything.
So, you know, that's like a really funny thing.
But I would say it's really shifted a lot too, in that now I see engineers measuring not just SLOs and SLIs and uptime and how many nines, but also dollar value.
So I can ask customers, hey, how much money is it for you
if your company is down for a minute?
And then they're able to give the dollar value of that.
Well, a one-minute outage costs us this much as a company.
And once you've got that number, you know,
and you might need to talk to a few different people,
finance team, product team, to be able to calculate that.
But we have some like very well-known customers that
a lot of people use every single day. And they're able to figure out that dollar value. And that's
really powerful because then as an engineering team, you're able to say like, every one minute
outage that we are preventing by doing this work is saving us this much money. And it's easy too
to be able to go, well, last year we had this many hours or minutes
of outages.
And this year we did all this preventative like chaos engineering work and we've reduced
it by this much.
You know, that's a great way to be able to show that value back.
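Just to make that arithmetic concrete, here's a sketch with made-up numbers (the figures below are purely illustrative, not from Gremlin or any customer):

```python
# Illustrative sketch with made-up numbers: turning downtime into a dollar value.
# cost_per_minute would come from the finance and product teams, as described above.
cost_per_minute = 10_000          # hypothetical: $10k of revenue lost per minute of downtime

downtime_last_year_min = 180      # hypothetical: 3 hours of outages last year
downtime_this_year_min = 45       # hypothetical: after a year of chaos engineering work

cost_last_year = downtime_last_year_min * cost_per_minute
cost_this_year = downtime_this_year_min * cost_per_minute
savings = cost_last_year - cost_this_year

print(f"Last year's outages cost roughly ${cost_last_year:,}")
print(f"This year's outages cost roughly ${cost_this_year:,}")
print(f"Estimated value of the reliability work: ${savings:,}")
```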
Yeah.
Yeah, that makes sense.
And then over time, as you know, more engineers and managers see the value of these things, and that's how a new term, similar to observability, just expands over time. Maybe a final question just to wrap up: if I'm a new SRE and I'm starting off in a role where the site's going down too much, or, you know, on-call toil is just too high, and I want to start with chaos engineering, but I'm not going to just pitch, like, let's buy this product in week one.
What are some baby steps that I can take? You know, how do I inject some ideas, or how do I show the organization that, you know, this is an important thing to do?
And how do I start with that?
How do I prove to myself?
How do I prove to my team that, you know, I should start with a little bit of chaos
engineering?
Yeah, I think that's a great question. So one of the things that I think is really
powerful is for engineers to learn about software and techniques like this, practices like this,
and then do your own demo internally within your organization. And so that's like, you know,
you can use Gremlin for free, you go to gremlin.com slash buttons, but then you can spin up your own demo environment
yourself where you actually inject failure and learn about it. And it doesn't have to be your
work product, right? When you're first learning about chaos engineering, it could be a demo
environment. Google has a really cool one that runs on Kubernetes called Bank of Anthos. And I
feel like that's a really good demo because it's a bank: it does deposits, it does withdrawals, and people can see how serious that is. But you can inject all sorts of different failures and see how it impacts it. For example, if you black hole one of the services, I think it's the balance reader, then the balance will show up as zero. And so that's interesting, like, people think they have no money in their account. But that's a really good demo to just show people internally; instead of telling them what chaos engineering is, you can visually help them experience it. And I think that's great, especially for this world during COVID where you're remote a lot, so you can do an actually interesting, fun demo. Or it could be something like, you know,
a tech demo, a lunch and learn.
If you do those, you can be like,
hey, like let's just chat about chaos engineering,
but I want to share this demo that I created.
I think that's a great way to do it. I'd recommend that first.
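To make the Bank of Anthos example above a bit more concrete, here's one rough way a demo could approximate a black hole on the balance reader using a Kubernetes NetworkPolicy via the Python client. The "default" namespace and the "app: balancereader" label are assumptions about how the demo is deployed, and this is not Gremlin's attack implementation:

```python
# Rough sketch (not Gremlin's blackhole attack): isolate the Bank of Anthos
# balance reader by applying a deny-all NetworkPolicy, then watch the frontend
# show a zero balance. Assumes the demo runs in the "default" namespace, the pods
# carry an "app: balancereader" label, and the cluster's CNI enforces NetworkPolicies.
from kubernetes import client, config

def blackhole_balancereader(namespace="default"):
    config.load_kube_config()  # use the local kubeconfig for the demo cluster
    policy = client.V1NetworkPolicy(
        metadata=client.V1ObjectMeta(name="blackhole-balancereader"),
        spec=client.V1NetworkPolicySpec(
            pod_selector=client.V1LabelSelector(match_labels={"app": "balancereader"}),
            policy_types=["Ingress", "Egress"],  # no rules listed => deny all traffic
        ),
    )
    client.NetworkingV1Api().create_namespaced_network_policy(namespace, policy)

def halt(namespace="default"):
    # "Halt": delete the policy to restore traffic; nothing was shut down or restarted.
    client.NetworkingV1Api().delete_namespaced_network_policy(
        "blackhole-balancereader", namespace
    )

if __name__ == "__main__":
    blackhole_balancereader()
    input("Check the frontend, then press Enter to restore traffic...")
    halt()
```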
And then the other thing too,
is to just like read about it.
Like it's great folks are listening to this podcast.
So this is a cool way to learn more about it as well.
Just say like, you know, if you Google chaos engineering,
read through some of the articles there,
watch some videos on YouTube.
I created a series of videos called chaos engineering
in 60 seconds, like gone in 60 seconds.
And those are like really short videos
which show all the different attack types.
So you can see like,
what does a black hole attack actually look like?
Like, just look up "chaos engineering in 60 seconds black hole" and you'll find it. So yeah, that's what I was saying: have fun with it. This is a cool thing that's going to help you with your career for many, many years to come, so it's a great practice to invest your time in. I do not regret it at all, learning chaos engineering.
Cool. Well, thank you
so much for being a guest. I feel like I learned a lot. I need to look into the Bank of Anthos; that's not something I knew about, so I'm going to take a look at that. Thank you again.
Thanks so much for having me. I really enjoyed it.