Software Misadventures - Oliver Leaver-Smith - On how "just a monitoring change" took down the entire site and resilience engineering - #5
Episode Date: February 19, 2021
Oliver Leaver-Smith, better known as Ols, is a Senior DevOps Engineer at Sky Betting and Gaming. In this episode, we discuss how a seemingly simple monitoring change ended up taking down the entire site. We also talk about chaos and resilience engineering. We discuss how the team at Sky Betting and Gaming conducts fire drills (chaos engineering exercises) where they not only test the resiliency of their software systems but also their people systems. We walk through a recent example of a fire drill, how they have evolved over the past few years and the lessons learned in the process.
Transcript
If you have a team that is focusing solely on features and new shiny things in your application,
that's fine. But there comes a point where it doesn't matter how many new features you add,
if you suddenly have an outage and every system crashes because there's been no thought put into
the resiliency of that system. It doesn't matter how fancy your application is, if no one can get to it because
you've not thought about how it handles failure.
Welcome to the Software Misadventures podcast, where we sit down with software and DevOps
experts to hear their stories from the trenches about how software breaks in production.
We are your hosts, Ronak, Austin, and Guang.
We've seen firsthand how stressful it is when something breaks in production,
but it's the best opportunity to learn about a system more deeply.
When most of us started in this field, we didn't really know what to expect,
and wish there were more resources on how veteran engineers overcame the daunting task
of debugging complex systems. In these conversations, we discuss the principles and practical tips to build resilient software,
as well as advice to grow as technical leaders.
Hey everyone, this is Ronak here.
In this episode, Austin and I speak with Oliver Leaver-Smith, better known as Ols.
Ols is a Senior DevOps Engineer at Sky Betting and Gaming.
He has been interested in technology,
specifically how it breaks, from a very young age. This interest of his aligns very much with ours,
and we had a lot of fun speaking with him. We discussed how a seemingly simple monitoring
change ended up taking down the entire site. We also talked about chaos and resilience engineering,
a topic Ols deeply cares about.
We discuss how his team at Sky Betting and Gaming conducts fire drills, in other words, chaos engineering exercises,
where they not only test the resiliency of their software systems, but also their people systems.
We walk through a recent example of a fire drill that Ols lit himself.
We talk about how these fire drills have evolved over the last few years and the lessons learned in the process. Please enjoy this fun conversation with Ols.
Ols, welcome to the show. We are super excited to have you here.
Thank you very much. It's good to be here.
So we were researching for this episode and I was reading about you on the internet.
There was one bit which I found which kind of stood out.
At least I was fascinated by it and I want to know more about this.
I found a bio which said that back in 2003, you were learning more about setting up Red Hat
and you unintentionally upgraded your dad's Windows XP machine to Red Hat 9.
I'm very curious about how that happened.
Can you share that story with us?
Yeah, so I use the term upgraded.
He used the term ruined.
I think I upgraded him.
So basically, he had this book of Sam's Teach Yourself Red Hat on his bookshelf.
And he had a CD in it to run like a live instance
of Red Hat 9.
So I put that in his computer
because I didn't have my own at the time.
And I clicked around the live CD a bit
and thought, this is quite interesting.
I'm going to install it.
I thought it was just like, you know,
when you install a game or anything,
there's a little install button on the desktop. So I thought, I'll click that.
I went through the install steps, and it said it's now safe to restart your computer.
So I thought, okay, I don't know why I need to restart, but fine. And when I went to restart, all I had was GRUB and the option for Red Hat 9. And I couldn't work out what I'd done, because I was at the stage where I was dangerous enough to know how to do things, but not why I was doing them and what the actual effect would be.
This did actually result in getting my own computer to tinker on, though, so I see it as a positive, really.
Oh yeah, there is a bright side to it. But I can't imagine, was your dad pissed off?
Pretty much, yeah. Oh, I got a terrible computer out of it. It was like a reject from work or something that nobody wanted.
Well, at least you got your own computer to play with.
Exactly, yeah.
So can you tell us a little bit about your background?
I know a lot of listeners would want to know how you started off.
I saw your LinkedIn profile and it said that you started off as a network engineer and
now you're more in the DevOps space.
So we would love to hear from you. Yeah, so I started off on like a help desk type thing
at an ISP.
So the natural progression there for me
was to go into networking as a discipline.
So I went up sort of through the ranks in the help desk
and then started being a real network engineer.
And then I moved to another ISP
and got more in the weeds of networking.
And then I branched out from ISPs
and started in the gambling sector
as a network engineer still.
But I found the environment, the fast pace,
the ridiculously short down times that
you were permitted, all that sort of thing, I found that really, really interesting. And
I kind of saw what these DevOps engineers were doing, and all these infrastructure engineers,
and I saw how they were not automating themselves out of a job, but doing more with the time they had by doing less actual work and toil, and spending more time working out how they could automate their jobs.
So I did quite a bit to automate the boring stuff that we had to do as network engineers.
So like device config audits and all that sort of stuff.
And that really like piqued my interest in the automation side of things.
So I saw a job advert for a DevOps engineer.
And I always had Linux in my back pocket and that sort of stuff.
So I thought, you know what, I'll make the jump.
I know this DevOps thing from a networking perspective.
I know a bit about Linux, so why not?
And it's gone from there.
But it's good to have a specialism that isn't necessarily just DevOps
because if you want to get that full view of the whole stack as a team,
you really need people that have got the T-shaped developer,
if you like, or the T-shaped engineer.
Oh, absolutely.
I've got the specialism.
So it's worked out all right for me.
Yeah.
I mean, I know a lot of DevOps engineers
who come from many different backgrounds.
I mean, including Austin and myself,
we come from, I don't know if you would call it unconventional
if everyone is coming from unconventional backgrounds. But yeah, having expertise in one domain certainly helps.
So now that you're a DevOps engineer at Sky Betting and Gaming, can you tell us a little bit about
your team, your role, like what you do day to day and what does your team structure look like?
Yeah, so Sky Betting and Gaming itself uses the tribal model that Spotify invented and
made famous. So the tribe I am in is called Core. And how the tribes work, it's kind of like they're
all individual companies that take resources off each other as if they are individual businesses.
So in Core Tribe, where I am, what we focus on is a lot of sort of key account functionality.
So we don't really talk, we don't really deal with the betting or the casino side of things. We're primarily user registration and identity verification,
payments, like taking payments from customers and sending withdrawals out.
So we're sort of like the beating heart, if you like, that a lot of other tribes within the
company utilize.
The team I'm in is a specific platform team.
So there are feature squads that have different areas of expertise in their different domains,
different applications.
But the platform squad that I'm in kind of sits underneath all that and supports the
development and the rollout of new features and new products.
So we do it in kind of a few different rather interesting ways.
So we'll sometimes get parachuted into a team to be like some SWAT-style platform resource
that just needs to spin up a database cluster or something quickly to allow a team to start developing something. But other times we'll be kind of pulled
into this, the phrase we use is a pop-up squad, which is like a single-use squad from
different domains that can all come together and do good things. A most recent example of this is we had some GDPR work to do,
which is, for those that don't know,
is the EU data privacy regulation stuff.
So this needed some developers to make changes on some of their systems.
It required platform to ensure that database backups were being kept
for the right amount of time, and all this sort of thing.
So it's playing quite hard and loose with the definition of a feature squad.
But like I say, it works for us,
and it's good to get the different exposure to different areas of the business that you wouldn't necessarily if you were just being a platform engineer working on just platforms.
That makes sense.
And it's actually very interesting, like the tribe structure that you mentioned.
I actually want to dig in a little bit into that, if you don't mind.
So you mentioned you're part of the core tribe team
So does a tribe have multiple teams within it? And the other teams you work with, are they
part of different tribes, or would they be part of the same tribe?
So the teams that are in the Core tribe, I'll try not to leave any out in case anyone from work is listening, because that'd
be terrible.
So obviously there's the most important one, which is platform.
And then there's a squad that is focused on account as a service. So that includes the actual account bit you see when you log in, so like changing your details and
your credentials and everything, and also things like any exclusions you want to put on
your account if you feel that you're spending too much money on site. All the tools that
we have there to help you manage that as a customer are all part of that
team. There's also then the payments squad, which solely look after taking all the money and giving
it back. And then another squad is the onboarding squad that handle getting customers through the door in a responsible way, and also ensuring
that we can verify they are who they say they are, you know, whether that be using third-party
identity providers or manually verifying documentation that the customer will provide.
I think that's it.
That's it, yeah.
There's a lot of principal engineers that kind of float around different squads
depending on where the resource is needed,
but they're the main squads.
And that pattern of a tribe made up of multiple squads
that have a specific domain to look after, that is what is replicated across the business in different tribes.
I see. Makes sense. It's a fascinating concept. And how many people are on the core tribe in general? How many engineers total?
Engineers, I would guess probably around 60 to 80, including all disciplines like test and software dev and platform.
I see. Pretty good, though. And for some of our listeners who might not be fully aware, can you tell us a little bit about what Sky Betting and Gaming as a company does?
Yeah, so we are, I think, the biggest online bookmaker in the UK. So we do your traditional sportsbook betting, so betting on football, soccer, and horse racing and things like that. And then we also have online gaming platforms, so your traditional sort of slot machines online, live casino with croupiers spinning roulette wheels and things like that. And then we also have a lot of products that are free to play. So we have things like a prize machine where it's free to spin and you win money or free spins elsewhere.
We have things like where you can put a free guess on the outcome of a few
different football matches.
And if that matches the
actual results, then you win money. We're lucky in that we were closely
affiliated with Sky, the company, which is quite a good brand, and it's
very much the brand you think of when you think about sports, at least in
the UK,
because they've been sort of the home of Premier League football for a long time.
Makes sense.
So considering there is a lot of payments involved,
people are betting,
so I would imagine performance and reliability
would be of paramount importance
for all the systems that you're working with.
And the requirements would be extremely tight.
Yeah, so we're unfortunate, I guess you could say,
in that everyone in the business relies on us
and our availability.
So if one of the other tribes that,
say if the BET tribe has a problem with their website,
the gaming tribe, they can continue to run their products.
Whereas if our services go down,
then every single consumer of our services
is having the same problem.
So we are, rightly so, we are held to a very high standard
in terms of our system performance anyway.
Nice. So, I know you published a blog post recently on your website about
how a seemingly benign monitoring change resulted in an outage, making your system
grind to a halt. I want to dig more into that.
And Austin here is on the monitoring infrastructure team at LinkedIn.
So I'm going to let him drive this part because he is extremely excited to talk to you about this.
Yeah, so I'm on the monitoring infrastructure team.
We provide a monitoring platform pretty much
for the wide variety of applications at LinkedIn.
And we expect it to run smoothly all the time; it shouldn't affect the applications in most circumstances.
So this is really interesting for me.
Can you give us a little bit of background on the kind of the systems that you were monitoring
for this particular incident? Yeah, so I'm fine talking about this because I was the one that did
it. So I don't mind throwing the engineer that did it under the bus at all because that engineer was
me. It's healthy to talk about your failures, right? So it's good to talk about it.
So the specific application that we were wanting to monitor in this instance was part of the voodoo backend,
the very legacy backend that talks directly to the database
sort of systems, rather than anything further up the stack.
And we're in the situation where this particular application that talks directly to the database
is one that is provided to us by a third party, and it's closed source. We have a route into
them for bug fixes and feature releases and that sort of thing, but that's on a consultancy basis.
So something that we requested from them was a metrics endpoint that we could scrape to tell us
how many unfulfilled payments were in the queue of payments waiting to be fulfilled.
So how our payment fulfillment works, not just at Sky Betting and Gaming,
anywhere, is there'll be an initial sort of hold on the bank account that says,
is this money available? The bank says, yes, that's fine. And then at a later date,
the actual fulfillment, taking the money from the bank, will happen. So this queue is the payments that are in that state between yes, the money's there,
and actually having taken the money. So we can see from this, if this queue grows and it doesn't seem
to be coming down, that maybe there's a problem with actually taking the money that
customers have asked us to take from their account. And it's easily rectified. We just need to talk to, you know,
whoever owns that service, get them to maybe restart it and everything's happy again.
So that's what we wanted to monitor. And we asked the third party that managed that application for us to provide that metrics endpoint and they
did and it worked. Yeah, there's a metrics endpoint, there's some metrics on it, cool,
we'll come to that in a bit when we've got some more time to actually implement a proper
monitoring check around it. And then yeah, it kind of sat for a good few months.
The people that were working on it initially moved on to a different thing, different projects.
And then that's when I came into the picture and started actually looking at it again.
Interesting.
And that's when the fun started.
Yeah.
So it's interesting that you mentioned the third-party application, just like a third-party providing the metrics endpoint for this particular use case.
You mentioned that there's this backend, I guess, legacy database.
Was your team unable to access the database directly,
and this was just something that the third party had kind of like the sole access to.
What I'm trying to get at is I'm really interested about like the trade-off of,
you know, asking the third party to provide, you know, a solution to you guys,
or was it something that you guys could also build yourself,
but it's one of those things where it's like, you know, it's just not worth our time.
They're the subject matter experts on this.
Let's let them do it. Yeah, so we have access to the databases if we need them. And as Skybetting and Gaming, not necessarily my team, because we don't have a reason to get into that database
because of the information that's contained within it.
So separation of privileges and all that sort of thing.
So us as a company have access to those databases,
but my team specifically don't.
We had built something that did kind of this sort of monitoring in the past
based on the information that we had to hand,
which was basically tailing the log
files and checking for any errors there.
And that gave us an indication that there were failures to fulfil the payments, but
what it didn't tell us is if payment fulfilment was just not running for a particular reason.
So that slight nuance is why we needed to actually get into the application
to get that further detail.
And like I say, it's not an open source application that's run,
so the only route we had was via the third party.
Got it. All right. That's really interesting.
And I kind of want to take a step back a little bit.
I know Ronik and myself are very familiar with this.
We've worked with Prometheus as a monitoring solution.
And so you mentioned like this query exporter from this third party.
Can you kind of briefly explain, to the audience that may not be familiar with this,
what a Prometheus query exporter is and what they may not be aware of?
So I'm also new to Prometheus
and the world of query exporters,
which is where a lot of the failure came from in this.
So my understanding, at least,
a query exporter is something
that's built into the application,
which will provide metrics
on how an application is behaving or not behaving
so that your Prometheus server,
which is a time series database, will be able to scrape that endpoint and pull in those metrics
and observe what the application is doing. So my understanding, this is where it all fell down,
really. My understanding was a Prometheus exporter, a query exporter,
will just present a static metrics page that is updated by the application.
However, what I've since learned after looking into this is that best practices from Prometheus
actually dictate that when you hit that metrics endpoint, it then
does the work to generate the metrics. And that's, yeah, that's a bit that I wasn't aware of
and is what made this so entertaining. Yeah, that's super interesting because I
intuitively would have thought exactly like what you were talking about is the process itself is
responsible for updating it.
And, you know, that kind of has a nice separation of concerns. It looks like it's an interesting
trade-off that I guess Prometheus made on like how fresh the data is. So that's super interesting.
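To make that concrete, here is a minimal sketch of the pattern being described, using the Python prometheus_client library. It is not the vendor's closed-source exporter; the metric name and the stand-in query are hypothetical. The point is that a custom collector's collect() method runs on every scrape of /metrics, so whatever work it does is repeated at the scrape interval.

```python
# Minimal sketch of a "query on every scrape" exporter (hypothetical names).
import time

from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY


def count_unfulfilled_payments():
    # Stand-in for the real database query; in the incident this scanned a
    # week of payment records and took tens of seconds to return.
    time.sleep(16)
    return 1_000_000


class UnfulfilledPaymentsCollector:
    def collect(self):
        # Called on EVERY scrape of /metrics: with a 30-second scrape
        # interval, the expensive query fires twice a minute, whether or not
        # Prometheus is still waiting for the answer.
        yield GaugeMetricFamily(
            "unfulfilled_payments_total",
            "Payments authorised but not yet fulfilled (hypothetical metric)",
            value=count_unfulfilled_payments(),
        )


if __name__ == "__main__":
    REGISTRY.register(UnfulfilledPaymentsCollector())
    start_http_server(9100)  # serves /metrics on port 9100
    while True:
        time.sleep(60)
```

The alternative Ols expected, a static page that the application refreshes in the background on its own schedule, keeps the cost of a scrape constant, at the price of slightly staler data.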
And when the third party provided this query exporter to you,
I'm curious, was this going to be something that just ran on one machine or was this something that you would have to roll out to probably multiple VMs at this point?
So how did that rollout process work, given that it's a third party?
So the query exporter application itself was going to run on all the machines that were responsible
for doing that fulfillment process. So we have multiple machines that do it,
like on a round-robin queue basis. And yeah, this metrics endpoint was going to
run on all of them, but not necessarily...
Because it was looking at what was left in the database,
the number of items that were left in the database,
it didn't matter which of the servers
was running the fulfillment process at that moment in time
because any of them could be hit on the metrics endpoint
and still get the same data.
Got it.
And so you mentioned you had rolled this out
and it sat there for several months.
And I recall reading from the blog,
there were some firewall things
that you guys were trying to work through
and all sorts of things which probably added to the delay.
So kind of like fast forwarding
to maybe the exciting part,
once that firewall kind of said, cool, yeah, you guys are good to go,
can you kind of talk a little bit about the events that
unfolded after that?
Yeah, sure. So I got everything running as far as Prometheus was
concerned. It was attempting to scrape the endpoint with our default settings,
which was to have a timeout of 10 seconds and scrape every 30 seconds.
And then when you look on the Prometheus list of targets
that it is scraping on the web UI,
you'll see that it says whether the target is healthy or not.
And the query exporter that we were looking at said
connection reset or something along those lines.
So we think, oh, firewall, right.
Put the firewall request in, I'm going home.
See you tomorrow, sort of thing.
And then the firewall request,
how our firewall requests work is they're largely automated, in terms of working out which firewall it needs to go on, which interfaces, which groups of IP addresses, etc.
And also the actual implementation is automated as well.
So this went through the automation process
and the firewall rule was put in place. And at that point, Prometheus says, right,
let me at it. And it starts polling the metrics endpoint. Now, here is the interesting bit
in that, as I mentioned earlier, it's not a static metrics page that is populated by
the application; it's something that runs every time the metrics endpoint is hit. And the request
that was being made is quite a big one, because we are looking at the total number of unfulfilled
payments in the past week, which is a big number. It's bringing back millions and
millions of records every time this request is made to the database. So that starts to slow down
the database a little, because it's doing quite a lot of work. And it's taking probably 16 seconds to return the data.
We're timing out after 10 seconds.
We don't really care.
And the query exporter doesn't care that we're timing out from Prometheus
because it's run the query
and it's waiting for the response
regardless of what Prometheus thinks.
So it starts to take a little longer than 16, 20 seconds.
It starts to creep up, creep up a little longer.
And then we're at the stage where it's taking longer to run the query
than the interval of the query itself.
So we've got multiple queries,
ten copies of this query,
very much queued up.
And all of a sudden, the database that contains these payments records, which also contains things like user credentials, is not able to be read anymore because it's just too busy. That results in logins failing for a start, and issues with people being able to place bets if they are already logged in. You know, this is a total outage, essentially,
because this query is just running itself into the ground.
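Rough arithmetic for the pile-up being described: Prometheus was scraping every 30 seconds with a 10-second timeout, but the timeout only abandons the HTTP request; the database query keeps running. Once the query takes longer than the scrape interval, copies overlap, roughly ceil(duration / interval) of them at steady state (a simplification that ignores the query slowing down further as load grows).

```python
# Back-of-the-envelope for overlapping copies of the heavy query.
import math

SCRAPE_INTERVAL = 30  # seconds, the default mentioned above
for query_duration in (16, 31, 60, 120):
    overlapping = math.ceil(query_duration / SCRAPE_INTERVAL)
    print(f"{query_duration:>3}s query -> about {overlapping} copies in flight at once")
```

Each extra copy slows the database further, which lengthens the query, which admits more copies, which is how a monitoring check turns into a feedback loop.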
Interesting.
Yeah, and you mentioned in the blog about the breakdown of communication
and understanding what the query exporter application was doing.
But even beyond that, too, of just like, you know,
not everyone's familiar with query exporters,
probably just learning, figuring this stuff out.
But also from the third-party team,
were they able to provide any sort of documentation
about this thing that they had just shipped to you guys?
Or is this also maybe just something new to them too?
So there wasn't any documentation that I saw.
It was just like a handover from one team member to another.
But when they found out what we were doing with that query,
they were very shocked that that's how we decided to do things.
Oh, interesting.
So we were not following their best practices
of how to get that data.
I see.
They said, yeah, that's a pretty heavy query
to be running every 30 seconds.
You should be doing that every 20 minutes maybe.
If you're trending how many payments have failed
to be fulfilled over the past week, that's not really data that needs to be renewed every 30
seconds. You can have a half-an-hour to an hour delay on that data.
Got it. So I guess moving forward, now that you guys were able to root cause
that, like, okay, yeah, this query pattern is going to generally be expensive. We can't afford to keep thrashing
the database like this. What did you guys end up moving more towards?
Again, trying to balance this whole
aspect of, is my data the most fresh it can be right now?
Or can I wait?
Yeah, so we made a couple of changes
to the actual application itself,
the query exporter application,
in that it won't run
if there are already two processes of it running,
which would have been a nice thing to have from the beginning,
but you live and learn,
and it's certainly going to be something
we put into things in future. And then, yeah, we went back to the team that specifically looks
after the payment side of things. And we had a conversation with them about how fresh do you need
this data? Because after a busy week, we did some calculations with the database team.
And we worked out that on a really busy week, this query could take upwards of two minutes to return all the data.
So that now runs every half an hour.
And that was the trade-off, like you say, between the freshness of data and the stability of the database.
But really, it could now run probably every minute, because now we've got this safeguard in place of it's not going to run if there's already one running.
We could make it more frequent, but it's not on anyone's roadmap to make it more frequent, you know, just in case.
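The exporter itself is closed source, so this is only a sketch of the safeguard described, done in-process with a bounded semaphore: refuse to start another copy of the heavy query when the limit is already in flight, and let the scrape go without a fresh value instead of queueing.

```python
# Hypothetical guard around the expensive query (not the vendor's code).
import threading

MAX_IN_FLIGHT = 2  # the limit mentioned above
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)


def guarded_query(run_query):
    """Run the expensive query only if fewer than MAX_IN_FLIGHT copies are active."""
    if not _slots.acquire(blocking=False):
        return None  # caller serves no/stale data rather than stacking queries
    try:
        return run_query()
    finally:
        _slots.release()
```

For separate worker processes rather than threads, the same idea usually lands on a lock file or an advisory database lock, but the principle is the one described: the query never competes with itself.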
Right.
I don't think anyone's going to be arguing for that at this time.
I think everyone remembers and like, yeah, let's step away from that a little bit.
It's still a bit raw.
Yeah.
So after all is said and done, it sounded like there was definitely going to be a lot of eyes on this. What were some of the big learnings that your team,
or maybe even other teams, got out of this incident?
So our team, we took a lot of
learnings from it, sort of procedurally, about handing off work to other people. And if you pick
up a piece of work that has been dormant for a while, you
really need to put the effort in up front to understand exactly what the state of things is.
And if you don't feel that you're knowledgeable enough to pick that specific bit up, then the onus is on you to either seek out that extra information from the person who worked on it previously, or from the internet,
because I googled Prometheus Query Exporter
and it said, oh yeah, the best practice is to run the command
every single time it's hit.
If I'd have done that at the start,
then we wouldn't have been in this situation.
The other main big learning that had a lot of focus
from higher-ups in the company was
the fact that it wasn't me as the engineer owning that system that put the check live.
It was the automated firewall rule that ran at some point in the evening that put that live.
When I noticed that the check was failing because it couldn't talk to the endpoint,
at that point I should have removed the check or disabled it,
sorted the firewall access out, and then re-enabled it.
But that's where the whole 'it's just a monitoring change'
misnomer comes in. It's like, how much harm can it do, really?
Right, right.
Just letting that sit there and wait for the firewall to let it through.
And then there's some little things about the application itself that we had to think about.
So I mentioned the fact that we now have a safeguard to only allow two instances of it
to be running. We have a real-time backup of the data in that database.
There's no reason why we shouldn't be querying that backup
instead of the live database, like query the replica instead.
So it's just things that should be best practice
but maybe weren't thought about at the time.
But yeah, it's been a really interesting learning experience for sure.
Awesome. And I guess kind of stepping back, you mentioned that a lot of it was
more of the process sort of thing. Were there any large organizational
practices that you're also looking forward to for the future, with third-party applications? I
mean, I think we've probably also gotten bitten by this.
I'm not personally aware of it at LinkedIn,
but we also use other third-party,
like, you know, we have a license with them
and, you know, we're kind of subject
to whatever client that they've provided to us.
So, a lot of times it works.
And I think that's really
the part
where it's tough, where, you know, 95% of the time, 99% of the time, the software they give us, from
multiple vendors potentially, just works out of the box. So it's like, what's wrong with
just, you know, one more, right? So yeah, I'm just curious on that side.
Yeah, so I think it's difficult to say in this instance, because
it wasn't really a failure of the third party. It was more of a misunderstanding.
Of, like, yeah, what they thought you guys were going to use it for, and, yeah, how you guys were actually using it.
Yeah, so they thought we were going to use it in a different way; we thought it did something
completely different. I don't know yet that there's been any specific organization-wide policies put in place to
do with that.
But I know that our team specifically now go through things with a lot more of a fine-tooth comb when we're
picking up things from third parties.
Fair enough.
Makes sense. I want to take a step back.
You mentioned that this database
was also processing a lot of other tasks.
And you mentioned when there was this full outage,
people weren't able to log in.
So in terms of just categorizing the issues,
like if we had three categories,
say major, medium, minor, for instance, this would be
counted as major, I assume?
Yeah, this is the top priority. This is everyone
gets paged, even if you don't know what the thing is about. You get paged because it might affect
your system.
Oh, interesting. So when this happened,
and you mentioned the blog as well,
that the banners on your website would go out saying,
hey, we know our systems are affected and we'll be working on it to fix it.
What does that incident management process look like?
Like what happened after that?
So after we started seeing the problem, you mean?
Yeah, yeah.
Once you see the problem, you know there is an issue
and people aren't able to log in.
How do you go about just fixing the system then?
So we're pretty slick at incident management throughout the company, not just within core.
So the kind of process of this was: the banners go up, we see, okay, lots of different services are all having
problems talking to the database, let's get the database people to look at this. They instantly
see, I mean, I'm talking within minutes, that this is the query that is running
loads of times, I don't know what this is, I've not seen this before, this is something brand new. At which point someone in the payments team in Core says, that looks
like a query for all unfulfilled payments in the last week. And then you've
got enough people there to kind of inject the context of, well, I know that that query has just gone
live on these servers. Let's stop these servers from doing anything, let's firewall them off
and get the database in a healthy state. Which, yeah, like I'm saying, it's probably 10 to 15 minutes before we're in a situation where we can say, okay,
we've identified the cause of this problem, we've mitigated it by putting banners up, we've
actually fixed the problem by getting rid of the query being made from these servers,
we've tested it from behind banners to check that everything is working now as expected, and we can now go and remove the banners and allow people back onto site.
There's a lot of really quick moving scenarios with our incident management,
purely because of the fact we want to get people back on site as soon as possible,
because it's very costly if people are not able to get on site,
especially in certain sporting events.
If an outage during the afternoon is bearable,
an outage in the evening when there's a big sport event on is terrible.
And yeah, there's a lot of pressure to get things back up as soon as possible.
Yeah, it certainly reflects a more reliable system, and it establishes trust with the users of the system as well.
And what you describe is a really quick recovery, like as soon as things started going south your team was paged
or multiple teams were paged who came together and were able to recover the system really quickly
So talking about incident response, I know you have mentioned, well, on some of the other blogs
on the website, you do something which both Austin and I and many other folks in this domain are also interested in. Some people like to call it chaos engineering or, recently, resilience engineering.
You refer to this with the phrase fire drills, like you simulate failures in your system, again,
not in production, of course, but in a controlled environment, so that everyone who is on the on-call
rotation kind of gets used to how the system works,
can resolve the issue, and so that you can recover the systems fast when they actually go down.
Can you tell us a little bit about how this process of fire drill started
and how it has evolved over the last few years?
Yeah, so fire drills for us are a way to run chaos engineering experiments on our systems, computer systems, to see how they respond when we pull the rug from underneath them, like disk or network.
But also we use them as a really effective tool for chaos engineering experiments on our people systems, like the on-call team.
Very important.
Very, very important.
Because I think it was Dave Rensin from Google says,
employees are buggy microservices.
Which is so true.
It is, it is.
So they need as much, if not more,
attention than your computer systems.
Oh, yeah, for sure. I mean, having sound processes in place is just as important as having sound systems.
Exactly. So we started doing fire drills just within core a few years ago now.
And it was every Thursday morning we would break something and the actual people that were on call would get paged out.
Over the years, up till recently, we did sort of just have that same pattern every Thursday morning.
Primarily the platform squad would break something.
But we noticed that it was getting a bit stale.
So it was nearly always platform that were breaking something.
And so the scenarios were getting a bit samey,
a bit, oh, the disk is broken again,
purely because we didn't have the knowledge,
the in-depth knowledge that the engineers
building the systems themselves have of those systems.
So we made a pledge that we were going to rotate
around all the different squads on a weekly basis,
and each of them would run a scenario on their own systems.
And that's been in place for maybe six, seven months now, maybe longer.
And it's been really effective because not only are these scenarios more realistic and more
engaging, but the owners of these systems that are breaking them are doing it in a way that they can try and
understand what happens when their systems break. So by trying to catch their colleagues out with
an interesting problem, they're inadvertently sort of resilience engineering experimenting on
their own systems. So yeah, it's been really, really successful, this change.
Makes sense. I mean, having the teams who understand the system more deeply create these scenarios, because I would imagine, as the platform group itself, after a while, it's hard to come up with new ideas on breaking your own systems.
And having SMEs do that for you would result in more engaging outcomes. Can you describe one of the last
fire drills that either your team or one of your other teams simulated, if that's okay to share on
this platform?
Yeah, so I did one yesterday.
Oh, nice.
It's fresh in my mind. And this was
good because this was a cross-tribe fire drill.
So it involved us as core and also the bet tribe.
And what we did was we made a change to one of the core systems,
removed some API keys,
which meant that putting a selection onto the bet slip
to actually place a bet would fail
and give a bet placement unavailable error.
So this kind of ran where I made the change
and then I was slowly restarting Kubernetes pods instead of doing it big bang.
So it was sort of like a slow degradation of service.
I see.
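This is not the tooling used in the drill, just a sketch of the "slow degradation" idea with the official Kubernetes Python client: after the breaking change is in place, restart matching pods one at a time with a pause, rather than bouncing everything at once. The namespace, label selector, and pause below are made up for illustration.

```python
# Hypothetical staggered restart so a failure creeps in rather than landing big bang.
import time

from kubernetes import client, config


def staggered_restart(namespace="core", label_selector="app=bet-slip", pause_seconds=60):
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    for pod in pods.items:
        # The Deployment/ReplicaSet recreates each pod with the new (broken) config.
        v1.delete_namespaced_pod(pod.metadata.name, namespace)
        time.sleep(pause_seconds)


if __name__ == "__main__":
    staggered_restart()
```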
And then the engineer was paged and saw the errors, thought, oh, this looks like something to do with core.
Let's call core out.
And everyone's happy.
Everyone enjoys a good investigative scenario, don't they?
Oh yes.
But what we spent time doing is focusing a lot on the realism and the
immersion of the fire drills. So we've got this Slack bot where you, as
the exercise coordinator, can type in what you want to say, but also who you
want to say it as.
Oh, nice. Interesting. Very interesting.
Yeah. So you can say, like,
tech desk says, we're seeing a lot of calls coming through from the contact center to say that
customers are unable to place bets. And it's just another one of those things
that helps keep people in the moment
and treat it like it is real.
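Their Slack bot is internal, but the general mechanism is available in the public Slack Web API: chat.postMessage can post under a different display name when the app has the chat:write.customize scope. The token, channel, persona, and wording below are placeholders.

```python
# Hypothetical sketch of posting a drill injection "as" another persona.
import os

import requests


def post_as(persona, text, channel="#fire-drill"):
    resp = requests.post(
        "https://slack.com/api/chat.postMessage",
        headers={"Authorization": f"Bearer {os.environ['SLACK_BOT_TOKEN']}"},
        json={"channel": channel, "username": persona, "text": text},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


# e.g. post_as("Tech Desk", "Contact centre reports customers unable to place bets")
```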
Because it's all too easy to just,
you know, I ain't got time for this.
I've got more important work to be doing.
I'll leave other people to deal with that problem.
Whereas if it's actually engaging and entertaining,
then it's a lot more interesting,
a lot easier to get people involved in it.
Oh yeah, sure thing. I mean, it's more of a cultural shape, or it's more of a culture that people buy into. You mentioned, so first of all, how long do some of
these fire drills go on for?
So we book out the morning, but it doesn't take that full time. So we allocate one hour purely because we want to put a window on it
so that if somebody needs to do something in the environment
in which we're running the drill,
we're not blocking them from doing what they need to do.
Because while we don't use customer-facing production,
we do use our production disaster recovery environments
so that we can have a truly representative environment
to do the testing in,
in terms of application scale and everything like that.
So we timebox that to an hour,
and then what we were doing previously
is having a retrospective as if it was
um a post-incident review of a real incident um and then raising any actions and sending them off
to to the to the relevant squad to deal with what we do now is we have a specific uh hour after the
end of the fire drill where we have the retrospective straight away,
tick everything up. And then if it's small bits like documentation changes,
then we just do them then and there instead of necessarily passing them off to someone else to do.
So it's been really good and it's helped get a lot of low hanging fruit, whereas otherwise it'd go
and sit on someone's backlog for X number of years before it actually becomes important enough to do.
Oh, yeah.
Doing the retrospective right away sounds like a good idea, because the incident is so fresh in your mind.
And you know exactly the improvements to make.
Can you tell us a little bit about what the anatomy of the fire drill looks like before you actually start?
So let's say you mentioned you do it every
week. So I'm assuming you or other team members would be thinking of certain scenarios beforehand.
You don't think of what you're going to break that day itself. And the scenario that you create
would also be something, this is just again an assumption, you might be sharing it with your
team members for learning it at a later point.
So what does that look like?
How do you structure this in docs?
When do you prepare for these things?
Do you have a list of scenarios that you want to cycle through?
So for platforms specifically,
now that we don't own every fire drill,
we no longer have visibility
of what the other squads are planning, unfortunately.
Or fortunately, because it makes it more realistic.
But there's two main sources of where we pull our scenarios from.
One is past incidents.
Nice.
Because we're using the fire drills not just as experimenting
on the computer systems
it's the people systems as well. We say
that process kind of broke down
the last time we had this incident.
Let's run it again and see how
people respond this time.
And the other
source is
just people's brains
and figuring what's the worst that could
happen or what would happen if X.
And we as platform have a list of potential scenarios to run.
And if you want to simulate this happening, run this command on this server.
Here's what you should see.
Here is where you'll see the evidence
that it's having the desired effect. Here is how you back it out quickly. And here is how people
would probably go about fixing it. I see. Nice. Makes sense. So you mentioned now that the other
tribes are also doing this. You don't always have visibility into what will be happening,
which in a way is good.
It's more realistic.
So say, for instance, one of your on-call team members gets paged.
How do they differentiate between a real page
versus a page from a fire drill?
I'm afraid we're a bit of a cop-out.
So when we raise the pages, we prefix it with Fire Drill.
Okay, that makes sense.
I know.
Yeah, it's good, I would say.
In an ideal world, we'd not only be not doing that,
but we'd be doing it in production as well,
like in customer-facing environments.
Oh, that's risky.
It's hard to get right.
It's very hard to get right, but we can all dream.
Oh, yes, yes.
I'm curious, you mentioned you don't necessarily do this on production systems, which makes sense.
Have any of the fire drills gone sideways where someone tried to simulate a failure,
but it got worse than what they planned for?
I can't think of any that have gone worse.
I can think of lots where they've gone not at all how we expected.
Oh, okay. I would love to hear the scenario if you can share it.
So we had one where we thought, right, what we're going to do,
we're going to take this database down
and this is going to break everything for everyone.
Non-production, of course.
So we ran what we thought would happen.
And the systems just seemed to handle it and just not be bothered at all.
System is pretty good.
Yeah.
So we're here waiting to page all these people and say,
top priority, priority one, incident, everybody, all hands on deck,
and nothing's broken at all.
How rarely does that happen?
Very rare.
I wish it happened more often.
Yeah, nice.
So you also touched a little bit on this: you've been doing it every week, which is a pretty good frequency, in my opinion. And there is a trade-off between spending time on a fire drill versus, like you mentioned, doing other things like project work, because everyone's planning for new features and new things they want to get out. How do you, as an organization, balance this trade-off and justify the cost of
doing fire drills every week as it relates to the amount of time you invest in the project work that
needs to happen? This is something I feel very strongly about, and this is a horn I blow a lot
to get people to listen to. And it is something that the company accepts, thankfully,
but I can imagine in other organizations,
it may not be the case,
and you may need to do a lot of bargaining.
The way I see it, and the way I put it to people,
is that if you have a team that is focusing solely
on features and new shiny things in your application, that's fine.
But there comes a point where it doesn't matter how many new features you add, if you suddenly
have an outage and every system crashes because there's been no thought put into the resiliency
of that system. It doesn't matter how fancy your application is
if no one can get to it
because you've not thought about how it handles failure.
People have no loyalty, right?
Yeah.
As soon as that happens,
they're going to go to the competitor who,
yeah, their website may not be using
the latest and greatest JavaScript framework
for its webpage, but...
It works.
As long as it works, yeah.
I can place a bet.
That is really well put.
That is really well put.
So do you have any advice or thoughts for organizations who are thinking about chaos
engineering or resiliency engineering and just getting started?
This is not something that they have done, but they are thinking about starting it.
Yeah, the first thing I think you need to know and you need to have in place before you can even start thinking about breaking your system is having the observability nailed. So if you're
going to expend the effort to have your engineers breaking the systems,
If they haven't got the ability to deep dive into exactly what the application is doing when it's being broken, then it's wasted effort.
The first thing you need to do before you even think about breaking stuff
is ensure that you have a total knowledge of what's going on in your platform.
It doesn't necessarily have to be like,
you know, distributed tracing level down to that,
you know, down that deep,
but you do have to be able to see when your system and services are misbehaving.
And then in terms of actually getting started on stuff,
there is a temptation, if you like, to go with
the easy, obvious things to break, like the network goes away. That's
going to happen, sure, but that's not very exciting. You're not going to get your
engagement up. The best thing, and we learned this too late, this is why our
fire drills went stale, the easiest way and the best way to get a buy-in from
people in the business is to involve people in the business and get them
thinking how their own systems can break. Instead of, you know, the
platform team coming in and saying, we're going to break your system and tell you
what's wrong with it and how you need to fix it.
Instead of doing that, it's about, right, let's, as a team, as a collective, look at your system and see how it could break. Have you thought about this? You don't know what happens
if this goes away. Well, let's take this downstream dependency away and see how your application
behaves. Yeah, these have been great discussions. I think even like all the talk about
the fire drills, I think this would be a wonderful onboarding tool for even new engineers. I would
think this is something that happens in many organizations, many companies. New engineers
come in, they don't know what the lay of the land is. But with these fire drills, I think it's a very real way
to kind of immerse them into this environment
so that they can quickly figure out like,
oh, my application talks to these other applications
and those sorts of things where without that,
unfortunately, it's kind of learned on call,
which I think is what a lot of companies kind of do.
And it's fair for the on-call engineers to go in
and be like, I'm terrified.
I'm like, yeah, it's going to take some time.
But with these, I think it's probably less stressful for them,
but I think it's a wonderful experience for new engineers to come in and be like,
I can do this in a safe environment.
And when I do go on-call for real, it's not as scary, which is a great feeling.
It's throwing people in at the deep end, but you've given them a rubber ring.
They've got flotation devices all over them.
They're not going to sink.
They might feel scared for the first 10 seconds or so, but actually they're going to realize that it's safe to do.
And by the time they get rid of the flotation devices and they're actually on call,
it's like the deep end, that's fine.
As part of going on call and onto our on-call rotation,
you have to have gone through a number of fire drill experiences
before you can actually go on call.
That's perfect.
Cool.
So I'd like to kind of,
this is a question that we ask all of the folks that come on to our podcast.
So given that you have a huge breadth,
given that you've kind of like put together
these fire drills,
you've probably worked with a lot of tools
at this point in the DevOps space
or in other places.
So what was kind of maybe the last tool that you discovered
and that you just really enjoyed using or really liked?
It might seem kind of a cop out,
because it's not what you might think.
There is no wrong answer here. There's no right answer. So I recently went back from Bash to ZSH.
And I found this theme called Powerlevel10k.
And what it is, you know if you have loads of plugins in ZSH, it kind of slows your prompt down and you press enter and you just get the gaps on your terminal.
This, I don't know how it does it, it's magic.
It sort of lazy loads your plugins but gives you a prompt straight away.
And then it fills your prompt with all these super low latency utilities.
So it does like your Git or subversion or whatever,
version control in your prompt.
It gives you a clock that actually counts up the seconds in your prompt
instead of being the time that you last press enter,
which I think is amazing.
I think every prompt should come with that.
So, yeah, I don't know if a ZSH theme is going to be the most exciting tool
that you're going to get on this segment ever,
but it amazed me purely because of how it manages to take something
that would take literal seconds to load up your prompt
and just makes it like 10
milliseconds before you have a prompt. I just found it amazing to see it.
Yeah, no, that's huge. I
mean, I think for anyone who's working in this space, probably one of the most frustrating things
is you're trying to run something and you're like, oh, I have to wait. Just even like three
seconds is enough to make any of us go a little bit crazy.
So that's really neat.
What is the theme name again?
It's Powerlevel10k.
Okay, Powerlevel10k.
Nice.
Yeah.
Well, and so where can people find you on the internet and learn more about what you're up to these days?
I tweet occasionally at @heyitsols, all one word. I sometimes mess about on the Fediverse, but I'm getting a bit bored of that, so
maybe not. My website is ols.wtf, which I sometimes write blog posts on, sometimes don't.
But if I'm going to be active, it's on there, basically.
It's on there or Twitter.
Awesome.
And is there anything else that you would like to share with our listeners today?
No.
Oh, well, actually, yeah, go and break stuff.
Because you don't know how things work until you've broken them.
That's true.
Plus one to that.
Well, it's been a blast having you on our podcast.
So thank you so much for coming on to the show.
Yeah, cheers.
It's been brilliant.
Hey, thank you so much for listening to the show.
You can subscribe wherever you get your podcasts
and learn more about us at softwaremisadventures.com.
You can also write to us at hello at softwaremisadventures.com.
We would love to hear from you. Until next time, take care.