Software Misadventures - Oliver Leaver-Smith - On how "just a monitoring change" took down the entire site and resilience engineering - #5
Episode Date: February 19, 2021
Oliver Leaver-Smith, better known as Ols, is a Senior DevOps Engineer at Sky Betting and Gaming. In this episode, we discuss how a seemingly simple monitoring change ended up taking down the entire site. We also talk about chaos and resilience engineering. We discuss how the team at Sky Betting and Gaming conducts fire drills (chaos engineering exercises) where they not only test the resiliency of their software systems but also their people systems. We walk through a recent example of a fire drill, how they have evolved over the past few years and the lessons learned in the process.
Transcript
If you have a team that is focusing solely on features and new shiny things in your application,
that's fine. But there comes a point where it doesn't matter how many new features you add,
if you suddenly have an outage and every system crashes because there's been no thought put into
the resiliency of that system. It doesn't matter how fancy your application is, if no one can get to it because
you've not thought about how it handles failure.
Welcome to the Software Misadventures podcast, where we sit down with software and DevOps
experts to hear their stories from the trenches about how software breaks in production.
We are your hosts, Ronak, Austin, and Guang.
We've seen firsthand how stressful it is when something breaks in production,
but it's the best opportunity to learn about a system more deeply.
When most of us started in this field, we didn't really know what to expect,
and wish there were more resources on how veteran engineers overcame the daunting task
of debugging complex systems. In these conversations, we discuss the principles and practical tips to build resilient software,
as well as advice to grow as technical leaders.
Hey everyone, this is Ronak here.
In this episode, Austin and I speak with Oliver Leaver-Smith, better known as Ols.
Ols is a Senior DevOps Engineer at Sky Betting and Gaming.
He has been interested in technology,
specifically how it breaks, from a very young age. This interest of his aligns very much with ours,
and we had a lot of fun speaking with him. We discussed how a seemingly simple monitoring
change ended up taking down the entire site. We also talked about chaos and resilience engineering,
a topic Ols deeply cares about.
We discuss how his team at Sky Betting and Gaming conducts fire drills, in other words, chaos engineering exercises,
where they not only test the resiliency of their software systems, but also their people systems.
We walk through a recent example of a fire drill that Ols lit himself.
We talk about how these fire drills have evolved over the last few years and the lessons learned in the process. Please enjoy this fun conversation with Ols.
Ols, welcome to the show. We are super excited to have you here.
Thank you very much. It's good to be here.
So we were researching for this episode and I was reading about you on the internet.
There was one bit which I found which kind of stood out.
At least I was fascinated by it and I want to know more about this.
I found a bio which said that back in 2003, you were learning more about setting up Red Hat
and you unintentionally upgraded your dad's Windows XP machine to Red Hat 9.
I'm very curious about how that happened.
Can you share that story with us?
Yeah, so I use the term upgraded.
He used the term ruined.
I think I upgraded him.
So basically, he had this book of Sam's Teach Yourself Red Hat on his bookshelf.
And he had a CD in it to run like a live instance
of Red Hat 9.
So I put that in his computer
because I didn't have my own at the time.
And I clicked around the live CD a bit
and thought, this is quite interesting.
I'm going to install it.
I thought it was just like, you know,
when you install a game or anything,
there's a little install button on the desktop. So I thought, I'll click that.
I went through the install steps, and it said it's now safe to restart your computer.
So I thought, okay, I don't know why I need to restart, but fine. And when I went to restart, all I had was GRUB and the option for Red Hat 9. And I couldn't work out what I'd done, because I was at the stage where I was dangerous enough to know how to do things, but not why I was doing them and what the actual effect would be.
This did actually result in getting my own computer to tinker on, though, so I see it as a positive, really.
Oh yeah, there is a bright side to it. But I can't imagine, was your dad pissed off?
Pretty much, yeah. Oh, I got a terrible computer out of it. It was like a reject from work or something that nobody wanted.
Well, at least you got your own computer to play with.
Exactly, yeah.
So can you tell us a little bit about your background?
I know a lot of listeners would want to know how you started off.
I saw your LinkedIn profile and it said that you started off as a network engineer and
now you're more in the DevOps space.
So we would love to hear from you. Yeah, so I started off on like a help desk type thing
at an ISP.
So the natural progression there for me
was to go into networking as a discipline.
So I went up sort of through the ranks in the help desk
and then started being a real network engineer.
And then I moved to another ISP
and got more in the weeds of networking.
And then I branched out from ISPs
and started in the gambling sector
as a network engineer still.
But I found the environment, the fast pace,
the ridiculously short down times that
you were permitted, all that sort of thing, I found that really, really interesting. And
I kind of saw what these DevOps engineers were doing, and all these infrastructure engineers,
and I saw how they were not automating themselves out of a job, but doing more with the time they had by doing less actual work and toil, and spending more time working out how they could automate their jobs.
So I did quite a bit to automate the boring stuff that we had to do as network engineers.
So like device config audits and all that sort of stuff.
And that really like piqued my interest in the automation side of things.
So I saw a job advert for a DevOps engineer.
And I always had Linux in my back pocket and that sort of stuff.
So I thought, you know what, I'll make the jump.
I know this DevOps thing from a networking perspective.
I know a bit about Linux, so why not?
And it's gone from there.
But it's good to have a specialism that isn't necessarily just DevOps
because if you want to get that full view of the whole stack as a team,
you really need people that have got the T-shaped developer,
if you like, or the T-shaped engineer.
Oh, absolutely.
I've got the specialism.
So it's worked out all right for me.
Yeah.
I mean, I know a lot of DevOps engineers
who come from many different backgrounds.
I mean, including Austin and myself,
we come from, I don't know if you would call it unconventional
if everyone is coming from unconventional backgrounds. But yeah, having expertise in one domain certainly helps.
So now that you're a DevOps engineer at Sky Betting and Gaming, can you tell us a little bit about
your team, your role, like what you do day to day and what does your team structure look like?
Yeah, so Sky Betting and Gaming itself uses the tribal model that Spotify invented and
made famous. So the tribe I am in is called Core. And how the tribes work, it's kind of like they're
all individual companies that take resources off each other as if they are individual businesses.
So in Core Tribe, where I am, what we focus on is a lot of sort of key account functionality.
So we don't really talk, we don't really deal with the betting or the casino side of things. We're primarily user registration and identity verification,
payments, like taking payments from customers and sending withdrawals out.
So we're sort of like the beating heart, if you like, that a lot of other tribes within the
company utilize.
The team I'm in is a specific platform team.
So there are feature squads that have different areas of expertise in their different domains,
different applications.
But the platform squad that I'm in kind of sits underneath all that and supports the
development and the rollout of new features and new products.
So we do it in kind of a few different rather interesting ways.
So we'll sometimes get parachuted into a team to be like some SWAT-style platform resource
that just needs to spin up a database cluster or something quickly to allow a team to start developing something. But other times we'll be kind of pulled
into this, the phrase we use is a pop-up squad, which is like a single-use squad from
different domains that can all come together and do good things. A most recent example of this is we had some GDPR work to do,
which is, for those that don't know,
is the EU data privacy regulation stuff.
So this needed some developers to make changes on some of their systems.
It required platform to ensure that database backups were being kept
for the right amount of time, and all this sort of thing.
So it's playing quite hard and loose with the definition of a feature squad.
But like I say, it works for us,
and it's good to get the different exposure to different areas of the business that you wouldn't necessarily if you were just being a platform engineer working on just platforms.
That makes sense.
And it's actually very interesting, like the tribe structure that you mentioned.
I actually want to dig in a little bit into that, if you don't mind.
So you mentioned you're part of the core tribe team
So does a tribe have multiple teams within it? And the other teams you work with, are they
part of different tribes, or would they be part of the same tribe?
So the teams that are in the Core tribe, I'll try not to leave any out in case anyone from work is listening, because that'd
be terrible.
So obviously there's the most important one, which is platform.
And then there's a squad that is focused on account as a service. So that includes the actual account bit you see when you log in, so like changing your details and
your credentials and everything, and also things like any exclusions you want to put on
your account if you feel that you're spending too much money on site. All the tools that
we have there to help you manage that as a customer are all part of that
team. There's also then the payments squad, which solely look after taking all the money and giving
it back. And then another squad is the onboarding squad that handle getting customers through the door in a responsible way, and also ensuring
that we can verify they are who they say they are, you know, whether that be using third-party
identity providers or manually verifying documentation that the customer will provide.
I think that's it.
That's it, yeah.
There's a lot of principal engineers that kind of float around different squads
depending on where the resource is needed,
but they're the main squads.
And that pattern of a tribe made up of multiple squads
that have a specific domain to look after, that is what is replicated across the business in different tribes.
I see. Makes sense. It's a fascinating concept. And how many people are on the core tribe in general? How many engineers total?
Engineers, I would guess probably around 60 to 80, including all disciplines like test and software dev and platform.
I see. Pretty good, though. And for some of our listeners who might not be fully aware, can you tell us a little bit about what Sky Betting and Gaming as a company does?
Yeah, so we are, I think, the biggest online bookmaker in the UK. So we do your traditional sportsbook betting, so betting on football, soccer, and horse racing and things like that. And then we also have online gaming platforms, so your traditional sort of slot machines online, live casino with croupiers spinning roulette wheels and things like that. And then we also have a lot of products that are free to play. So we have things like a prize machine where it's free to spin and you win money or free spins elsewhere.
We have things like where you can put a free guess on the outcome of a few
different football matches.
And if that matches the
actual results, then you win money. We're lucky in that we were closely
affiliated with Sky, the company, which is quite a good brand, and it's
very much the brand you think of when you think about sports, at least in
the UK,
because they've been sort of the home of Premier League football for a long time.
Makes sense.
So considering there is a lot of payments involved,
people are betting,
so I would imagine performance and reliability
would be of paramount importance
for all the systems that you're working with.
And the requirements would be extremely tight.
Yeah, so we're unfortunate, I guess you could say,
in that everyone in the business relies on us
and our availability.
So if one of the other tribes that,
say if the BET tribe has a problem with their website,
the gaming tribe, they can continue to run their products.
Whereas if our services go down,
then every single consumer of our services
is having the same problem.
So we are, rightly so, we are held to a very high standard
in terms of our system performance anyway.
Nice. So, I know you published a blog post recently on your website about
how a seemingly benign monitoring change resulted in an outage, making your system
grind to a halt. I want to dig more into that.
And Austin here is on the monitoring infrastructure team at LinkedIn.
So I'm going to let him drive this part because he is extremely excited to talk to you about this.
Yeah, so I'm on the monitoring infrastructure team.
We provide a monitoring platform pretty much
for the wide variety of applications at LinkedIn.
And we expect it to run smoothly all the time; it shouldn't affect the applications in most circumstances.
So this is really interesting for me.
Can you give us a little bit of background on the kind of the systems that you were monitoring
for this particular incident? Yeah, so I'm fine talking about this because I was the one that did
it. So I don't mind throwing the engineer that did it under the bus at all because that engineer was
me. It's healthy to talk about your failures, right? So it's good to talk about it.
So the specific application that we were wanting to monitor in this instance was part of the voodoo backend,
the very legacy backend that talks directly to the database
sort of systems, rather than anything further up the stack.
And we're in the situation where this particular application that talks directly to the database
is one that is provided to us by a third party, and it's closed source. We have a route into
them for bug fixes and feature releases and that sort of thing, but that's on a consultancy basis.
So something that we requested from them was a metrics endpoint that we could scrape to tell us
how many unfulfilled payments were in the queue of payments waiting to be fulfilled.
So how our payment fulfillment works, not just at Sky Betting and Gaming,
anywhere, is there'll be an initial sort of hold on the bank account that says,
is this money available? The bank says, yes, that's fine. And then at a later date,
the actual fulfillment, taking the money from the bank, will happen. So this queue is the payments that are in that state between yes, the money's there,
and actually having taken the money. So we can see from this, if this queue grows and it doesn't seem
to be coming down, that maybe there's a problem with actually taking the money that
customers have asked us to take from their account. And it's easily rectified. We just need to talk to, you know,
whoever owns that service, get them to maybe restart it and everything's happy again.
So that's what we wanted to monitor. And we asked the third party that managed that application for us to provide that metrics endpoint and they
did and it worked. Yeah, there's a metrics endpoint, there's some metrics on it, cool,
we'll come to that in a bit when we've got some more time to actually implement a proper
monitoring check around it. And then yeah, it kind of sat for a good few months.
The people that were working on it initially moved on to a different thing, different projects.
And then that's when I came into the picture and started actually looking at it again.
Interesting.
And that's when the fun started.
Yeah.
So it's interesting that you mentioned the third-party application, just like a third-party providing the metrics endpoint for this particular use case.
You mentioned that there's this backend, I guess, legacy database.
Was your team unable to access the database directly,
and this was just something that the third party had kind of like the sole access to.
What I'm trying to get at is I'm really interested about like the trade-off of,
you know, asking the third party to provide, you know, a solution to you guys,
or was it something that you guys could also build yourself,
but it's one of those things where it's like, you know, it's just not worth our time.
They're the subject matter experts on this.
Let's let them do it. Yeah, so we have access to the databases if we need them. And as Skybetting and Gaming, not necessarily my team, because we don't have a reason to get into that database
because of the information that's contained within it.
So separation of privileges and all that sort of thing.
So us as a company have access to those databases,
but my team specifically don't.
We had built something that did kind of this sort of monitoring in the past
based on the information that we had to hand,
which was basically tailing the log
files and checking for any errors there.
And that gave us an indication that there were failures to fulfil the payments, but
what it didn't tell us is if payment fulfilment was just not running for a particular reason.
So that slight nuance is why we needed to actually get into the application
to get that further detail.
And like I say, it's not an open source application that's run,
so the only route we had was via the third party.
Got it. All right. That's really interesting.
And I kind of want to take a step back a little bit.
I know Ronik and myself are very familiar with this.
We've worked with Prometheus as a monitoring solution.
And so you mentioned like this query exporter from this third party.
Can you kind of briefly explain, to the audience that may not be familiar with this,
what a Prometheus query exporter is and what they may not be aware of?
So I'm also new to Prometheus
and the world of query exporters,
which is where a lot of the failure came from in this.
So my understanding, at least,
a query exporter is something
that's built into the application,
which will provide metrics
on how an application is behaving or not behaving
so that your Prometheus server,
which is a time series database, will be able to scrape that endpoint and pull in those metrics
and observe what the application is doing. So my understanding, this is where it all fell down,
really. My understanding was a Prometheus exporter, a query exporter,
will just present a static metrics page that is updated by the application.
However, what I've since learned after looking into this is that best practices from Prometheus
actually dictate that when you hit that metrics endpoint, it then
does the work to generate the metrics. And that's, yeah, that's a bit that I wasn't aware of
and is what made this so entertaining. Yeah, that's super interesting because I
intuitively would have thought exactly like what you were talking about is the process itself is
responsible for updating it.
And, you know, that kind of has a nice separation of concerns. It looks like it's an interesting
trade-off that I guess Prometheus made on like how fresh the data is. So that's super interesting.
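To make that concrete, here is a minimal sketch of the pattern being described, using the Python prometheus_client library. It is not the vendor's closed-source exporter; the metric name and the stand-in query are hypothetical. The point is that a custom collector's collect() method runs on every scrape of /metrics, so whatever work it does is repeated at the scrape interval.

```python
# Minimal sketch of a "query on every scrape" exporter (hypothetical names).
import time

from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY


def count_unfulfilled_payments():
    # Stand-in for the real database query; in the incident this scanned a
    # week of payment records and took tens of seconds to return.
    time.sleep(16)
    return 1_000_000


class UnfulfilledPaymentsCollector:
    def collect(self):
        # Called on EVERY scrape of /metrics: with a 30-second scrape
        # interval, the expensive query fires twice a minute, whether or not
        # Prometheus is still waiting for the answer.
        yield GaugeMetricFamily(
            "unfulfilled_payments_total",
            "Payments authorised but not yet fulfilled (hypothetical metric)",
            value=count_unfulfilled_payments(),
        )


if __name__ == "__main__":
    REGISTRY.register(UnfulfilledPaymentsCollector())
    start_http_server(9100)  # serves /metrics on port 9100
    while True:
        time.sleep(60)
```

The alternative Ols expected, a static page that the application refreshes in the background on its own schedule, keeps the cost of a scrape constant, at the price of slightly staler data.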
And when the third party provided this query exporter to you,
I'm curious, was this going to be something that just ran on one machine or was this something that you would have to roll out to probably multiple VMs at this point?
So how did that rollout process work, given that it's a third party?
So the query exporter application itself was going to run on all the machines that were responsible
for doing that fulfillment process. So we have multiple machines that do it,
like on a round-robin queue basis. And yeah, this metrics endpoint was going to
run on all of them, but not necessarily...
Because it was looking at what was left in the database,
the number of items that were left in the database,
it didn't matter which of the servers
was running the fulfillment process at that moment in time
because any of them could be hit on the metrics endpoint
and still get the same data.
Got it.
And so you mentioned you had rolled this out
and it sat there for several months.
And I recall reading from the blog,
there were some firewall things
that you guys were trying to work through
and all sorts of things which probably added to the delay.
So kind of like fast forwarding
to maybe the exciting part,
once that firewall kind of said, cool, yeah, you guys are good to go,
can you kind of talk a little bit about the events that
unfolded after that?
Yeah, sure. So I got everything running as far as Prometheus was
concerned. It was attempting to scrape the endpoint with our default settings,
which was to have a timeout of 10 seconds and scrape every 30 seconds.
And then when you look on the Prometheus list of targets
that it is scraping on the web UI,
you'll see that it says whether the target is healthy or not.
And the query exporter that we were looking at said
connection reset or something along those lines.
So we think, oh, firewall, right.
Put the firewall request in, I'm going home.
See you tomorrow, sort of thing.
And then the firewall request,
how our firewall requests work is they're largely automated, in terms of working out which firewall it needs to go on, which interfaces, which groups of IP addresses, etc.
And also the actual implementation is automated as well.
So this went through the automation process
and the firewall rule was put in place. And at that point, Prometheus says, right,
let me at it. And it starts polling the metrics endpoint. Now, here is the interesting bit
in that, as I mentioned earlier, it's not a static metrics page that is populated by
the application; it's something that runs every time the metrics endpoint is hit. And the request
that was being made is quite a big one, because we are looking at the total number of unfulfilled
payments in the past week, which is a big number. It's bringing back millions and
millions of records every time this request is made to the database. So that starts to slow down
the database a little, because it's doing quite a lot of work. And it's taking probably 16 seconds to return the data.
We're timing out after 10 seconds.
We don't really care.
And the query exporter doesn't care that we're timing out from Prometheus
because it's run the query
and it's waiting for the response
regardless of what Prometheus thinks.
So it starts to take a little longer than 16, 20 seconds.
It starts to creep up, creep up a little longer.
And then we're at the stage where it's taking longer to run the query
than the interval of the query itself.
So we've got multiple queries,
ten copies of this query,
very much queued up.
And all of a sudden, the database that contains these payments records, which also contains things like user credentials, is not able to be read anymore because it's just too busy. That results in logins failing for a start, and issues with people being able to place bets if they are already logged in. You know, this is a total outage, essentially,
because this query is just running itself into the ground.
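Rough arithmetic for the pile-up being described: Prometheus was scraping every 30 seconds with a 10-second timeout, but the timeout only abandons the HTTP request; the database query keeps running. Once the query takes longer than the scrape interval, copies overlap, roughly ceil(duration / interval) of them at steady state (a simplification that ignores the query slowing down further as load grows).

```python
# Back-of-the-envelope for overlapping copies of the heavy query.
import math

SCRAPE_INTERVAL = 30  # seconds, the default mentioned above
for query_duration in (16, 31, 60, 120):
    overlapping = math.ceil(query_duration / SCRAPE_INTERVAL)
    print(f"{query_duration:>3}s query -> about {overlapping} copies in flight at once")
```

Each extra copy slows the database further, which lengthens the query, which admits more copies, which is how a monitoring check turns into a feedback loop.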
Interesting.
Yeah, and you mentioned in the blog about the breakdown of communication
and understanding what the query exporter application was doing.
But even beyond that, too, of just like, you know,
not everyone's familiar with query exporters,
probably just learning, figuring this stuff out.
But also from the third-party team,
were they able to provide any sort of documentation
about this thing that they had just shipped to you guys?
Or is this also maybe just something new to them too?
So there wasn't any documentation that I saw.
It was just like a handover from one team member to another.
But when they found out what we were doing with that query,
they were very shocked that that's how we decided to do things.
Oh, interesting.
So we were not following their best practices
of how to get that data.
I see.
They said, yeah, that's a pretty heavy query
to be running every 30 seconds.
You should be doing that every 20 minutes maybe.
If you're trending how many payments have failed
to be fulfilled over the past week, that's not really data that needs to be renewed every 30
seconds. You can have a half-an-hour to an hour delay on that data.
Got it. So I guess moving forward, now that you guys were able to root cause
that, like, okay, yeah, this query pattern is going to generally be expensive. We can't afford to keep thrashing
the database like this. What did you guys end up moving more towards?
Again, trying to balance this whole
aspect of, is my data the most fresh it can be right now?
Or can I wait?
Yeah, so we made a couple of changes
to the actual application itself,
the query exporter application,
in that it won't run
if there are already two processes of it running,
which would have been a nice thing to have from the beginning,
but you live and learn,
and it's certainly going to be something
we put into things in future. And then, yeah, we went back to the team that specifically looks
after the payment side of things. And we had a conversation with them about how fresh do you need
this data? Because after a busy week, we did some calculations with the database team.
And we worked out that on a really busy week, this query could take upwards of two minutes to return all the data.
So that now runs every half an hour.
And that was the trade-off, like you say, between the freshness of data and the stability of the database.
But really, it could now run probably every minute, because now we've got this safeguard in place of it's not going to run if there's already one running.
We could make it more frequent, but it's not on anyone's roadmap to make it more frequent, you know, just in case.
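The exporter itself is closed source, so this is only a sketch of the safeguard described, done in-process with a bounded semaphore: refuse to start another copy of the heavy query when the limit is already in flight, and let the scrape go without a fresh value instead of queueing.

```python
# Hypothetical guard around the expensive query (not the vendor's code).
import threading

MAX_IN_FLIGHT = 2  # the limit mentioned above
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)


def guarded_query(run_query):
    """Run the expensive query only if fewer than MAX_IN_FLIGHT copies are active."""
    if not _slots.acquire(blocking=False):
        return None  # caller serves no/stale data rather than stacking queries
    try:
        return run_query()
    finally:
        _slots.release()
```

For separate worker processes rather than threads, the same idea usually lands on a lock file or an advisory database lock, but the principle is the one described: the query never competes with itself.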
Right.
I don't think anyone's going to be arguing for that at this time.
I think everyone remembers and like, yeah, let's step away from that a little bit.
It's still a bit raw.
Yeah.
So after all is said and done, it sounded like there was definitely going to be a lot of eyes on this. What were some of the big learnings that your team,
or maybe even other teams, got out of this incident?
So our team, we took a lot of
learnings from it, sort of procedurally, about handing off work to other people. And if you pick
up a piece of work that has been dormant for a while, you
really need to put the effort in up front to understand exactly what the state of things is.
And if you don't feel that you're knowledgeable enough to pick that specific bit up, then the onus is on you to either seek out that extra information from the person who worked on it previously, or from the internet,
because I googled Prometheus Query Exporter
and it said, oh yeah, the best practice is to run the command
every single time it's hit.
If I'd have done that at the start,
then we wouldn't have been in this situation.
The other main big learning that had a lot of focus
from higher-ups in the company was
the fact that it wasn't me as the engineer owning that system that put the check live.
It was the automated firewall rule that ran at some point in the evening that put that live.
When I noticed that the check was failing because it couldn't talk to the endpoint,
at that point I should have removed the check or disabled it,
sorted the firewall access out, and then re-enabled it.
But that's where the whole 'it's just a monitoring change'
misnomer comes in. It's like, how much harm can it do, really?
Right, right.
Just letting that sit there and wait for the firewall to let it through.
And then there's some little things about the application itself that we had to think about.
So I mentioned the fact that we now have a safeguard to only allow two instances of it
to be running. We have a real-time backup of the data in that database.
There's no reason why we shouldn't be querying that backup
instead of the live database, like query the replica instead.
So it's just things that should be best practice
but maybe weren't thought about at the time.
But yeah, it's been a really interesting learning experience for sure.
Awesome. And I guess kind of stepping back, you mentioned that a lot of it was
more of the process sort of thing. Were there any large organizational
practices that you're also looking forward to for the future, with third-party applications? I
mean, I think we've probably also gotten bitten by this.
I'm not personally aware of it at LinkedIn,
but we also use other third-party,
like, you know, we have a license with them
and, you know, we're kind of subject
to whatever client that they've provided to us.
So, a lot of times it works.
And I think that's really
the part
where it's tough, where, you know, 95% of the time, 99% of the time, the software they give us, from
multiple vendors potentially, just works out of the box. So it's like, what's wrong with
just, you know, one more, right? So yeah, I'm just curious on that side.
Yeah, so I think it's difficult to say in this instance, because
it wasn't really a failure of the third party. It was more of a misunderstanding.
Of, like, yeah, what they thought you guys were going to use it for, and, yeah, how you guys were actually using it.
Yeah, so they thought we were going to use it in a different way; we thought it did something
completely different. I don't know yet that there's been any specific organization-wide policies put in place to
do with that.
But I know that our team specifically now go through things with a lot more of a fine-tooth comb when we're
picking up things from third parties.
Fair enough.
Makes sense. I want to take a step back.
You mentioned that this database
was also processing a lot of other tasks.
And you mentioned when there was this full outage,
people weren't able to log in.
So in terms of just categorizing the issues,
like if we had three categories,
say major, medium, minor, for instance, this would be
counted as major, I assume?
Yeah, this is the top priority. This is everyone
gets paged, even if you don't know what the thing is about. You get paged because it might affect
your system.
Oh, interesting. So when this happened,
and you mentioned the blog as well,
that the banners on your website would go out saying,
hey, we know our systems are affected and we'll be working on it to fix it.
What does that incident management process look like?
Like what happened after that?
So after we started seeing the problem, you mean?
Yeah, yeah.
Once you see the problem, you know there is an issue
and people aren't able to log in.
How do you go about just fixing the system then?
So we're pretty slick at incident management throughout the company, not just within core.
So the kind of process of this was: the banners go up, we see, okay, lots of different services are all having
problems talking to the database, let's get the database people to look at this. They instantly
see, I mean, I'm talking within minutes, that this is the query that is running
loads of times, I don't know what this is, I've not seen this before, this is something brand new. At which point someone in the payments team in Core says, that looks
like a query for all unfulfilled payments in the last week. And then you've
got enough people there to kind of inject the context of, well, I know that that query has just gone
live on these servers. Let's stop these servers from doing anything, let's firewall them off
and get the database in a healthy state. Which, yeah, like I'm saying, it's probably 10 to 15 minutes before we're in a situation where we can say, okay,
we've identified the cause of this problem, we've mitigated it by putting banners up, we've
actually fixed the problem by getting rid of the query being made from these servers,
we've tested it from behind banners to check that everything is working now as expected, and we can now go and remove the banners and allow people back onto site.
There's a lot of really quick moving scenarios with our incident management,
purely because of the fact we want to get people back on site as soon as possible,
because it's very costly if people are not able to get on site,
especially in certain sporting events.
If an outage during the afternoon is bearable,
an outage in the evening when there's a big sport event on is terrible.
And yeah, there's a lot of pressure to get things back up as soon as possible.
Yeah, it certainly reflects a more reliable system, and it establishes trust with the users of the system as well.
And what you describe is a really quick recovery, like as soon as things started going south your team was paged
or multiple teams were paged who came together and were able to recover the system really quickly
So talking about incident response, I know you have mentioned, well, on some of the other blogs
on the website, you do something which both Austin and I and many other folks in this domain are also interested in. Some people like to call it chaos engineering or, recently, resilience engineering.
You refer to this with the phrase fire drills, like you simulate failures in your system, again,
not in production, of course, but in a controlled environment, so that everyone who is on the on-call
rotation kind of gets used to how the system works,
can resolve the issue, and so that you can recover the systems fast when they actually go down.
Can you tell us a little bit about how this process of fire drill started
and how it has evolved over the last few years?
Yeah, so fire drills for us are a way to run chaos engineering experiments on our systems, computer systems, to see how they respond when we pull the rug from underneath them, like disk or network.
But also we use them as a really effective tool for chaos engineering experiments on our people systems, like the on-call team.
Very important.
Very, very important.
Because I think it was Dave Rensin from Google says,
employees are buggy microservices.
Which is so true.
It is, it is.
So they need as much, if not more,
attention than your computer systems.
Oh, yeah, for sure. I mean, having sound processes in place is just as important as having sound systems.
Exactly. So we started doing fire drills just within core a few years ago now.
And it was every Thursday morning we would break something and the actual people that were on call would get paged out.
Over the years, up till recently, we did sort of just have that same pattern every Thursday morning.
Primarily the platform squad would break something.
But we noticed that it was getting a bit stale.
So it was nearly always platform that were breaking something.
And so the scenarios were getting a bit samey,
a bit, oh, the disk is broken again,
purely because we didn't have the knowledge,
the in-depth knowledge that the engineers
building the systems themselves have of those systems.
So we made a pledge that we were going to rotate
around all the different squads on a weekly basis,
and each of them would run a scenario on their own systems.
And that's been in place for maybe six, seven months now, maybe longer.
And it's been really effective because not only are these scenarios more realistic and more
engaging, but the owners of these systems that are breaking them are doing it in a way that they can try and
understand what happens when their systems break. So by trying to catch their colleagues out with
an interesting problem, they're inadvertently sort of resilience engineering experimenting on
their own systems. So yeah, it's been really, really successful, this change.
Makes sense. I mean, having the teams who understand the system more deeply create these scenarios, because I would imagine, as the platform group itself, after a while, it's hard to come up with new ideas on breaking your own systems.
And having SMEs do that for you would result in more engaging outcomes. Can you describe one of the last
fire drills that either your team or one of your other teams simulated, if that's okay to share on
this platform?
Yeah, so I did one yesterday.
Oh, nice.
It's fresh in my mind. And this was
good because this was a cross-tribe fire drill.
So it involved us as core and also the bet tribe.
And what we did was we made a change to one of the core systems,
removed some API keys,
which meant that putting a selection onto the bet slip
to actually place a bet would fail
and give a bet placement unavailable error.
So this kind of ran where I made the change
and then I was slowly restarting Kubernetes pods instead of doing it big bang.
So it was sort of like a slow degradation of service.
I see.
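This is not the tooling used in the drill, just a sketch of the "slow degradation" idea with the official Kubernetes Python client: after the breaking change is in place, restart matching pods one at a time with a pause, rather than bouncing everything at once. The namespace, label selector, and pause below are made up for illustration.

```python
# Hypothetical staggered restart so a failure creeps in rather than landing big bang.
import time

from kubernetes import client, config


def staggered_restart(namespace="core", label_selector="app=bet-slip", pause_seconds=60):
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    for pod in pods.items:
        # The Deployment/ReplicaSet recreates each pod with the new (broken) config.
        v1.delete_namespaced_pod(pod.metadata.name, namespace)
        time.sleep(pause_seconds)


if __name__ == "__main__":
    staggered_restart()
```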
And then the engineer was paged and saw the errors, thought, oh, this looks like something to do with core.
Let's call core out.
And everyone's happy.
Everyone enjoys a good investigative scenario, don't they?
Oh yes.
But what we spent time doing is focusing a lot on the realism and the
immersion of the fire drills. So we've got this Slack bot where you, as
the exercise coordinator, can type in what you want to say, but also who you
want to say it as.
Oh, nice. Interesting. Very interesting.
Yeah. So you can say, like,
tech desk says, we're seeing a lot of calls coming through from the contact center to say that
customers are unable to place bets. And it's just another one of those things
that helps keep people in the moment
and treat it like it is real.
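Their Slack bot is internal, but the general mechanism is available in the public Slack Web API: chat.postMessage can post under a different display name when the app has the chat:write.customize scope. The token, channel, persona, and wording below are placeholders.

```python
# Hypothetical sketch of posting a drill injection "as" another persona.
import os

import requests


def post_as(persona, text, channel="#fire-drill"):
    resp = requests.post(
        "https://slack.com/api/chat.postMessage",
        headers={"Authorization": f"Bearer {os.environ['SLACK_BOT_TOKEN']}"},
        json={"channel": channel, "username": persona, "text": text},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


# e.g. post_as("Tech Desk", "Contact centre reports customers unable to place bets")
```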
Because it's all too easy to just,
you know, I ain't got time for this.
I've got more important work to be doing.
I'll leave other people to deal with that problem.
Whereas if it's actually engaging and entertaining,
then it's a lot more interesting,
a lot easier to get people involved in it.
Oh yeah, sure thing. I mean, it's more of a cultural shape, or it's more of a culture that people buy into. You mentioned, so first of all, how long do some of
these fire drills go on for?
So we book out the morning, but it doesn't take that full time. So we allocate one hour purely because we want to put a window on it
so that if somebody needs to do something in the environment
in which we're running the drill,
we're not blocking them from doing what they need to do.
Because while we don't use customer-facing production,
we do use our production disaster recovery environments
so that we can have a truly representative environment
to do the testing in,
in terms of application scale and everything like that.
So we timebox that to an hour,
and then what we were doing previously
is having a retrospective as if it was
um a post-incident review of a real incident um and then raising any actions and sending them off
to to the to the relevant squad to deal with what we do now is we have a specific uh hour after the
end of the fire drill where we have the retrospective straight away,
tick everything up. And then if it's small bits like documentation changes,
then we just do them then and there instead of necessarily passing them off to someone else to do.
So it's been really good and it's helped get a lot of low hanging fruit, whereas otherwise it'd go
and sit on someone's backlog for X number of years before it actually becomes important enough to do.
Oh, yeah.
Doing the retrospective right away sounds like a good idea, because the incident is so fresh in your mind.
And you know exactly the improvements to make.
Can you tell us a little bit about what the anatomy of the fire drill looks like before you actually start?
So let's say you mentioned you do it every
week. So I'm assuming you or other team members would be thinking of certain scenarios beforehand.
You don't think of what you're going to break that day itself. And the scenario that you create
would also be something, this is just again an assumption, you might be sharing it with your
team members for learning it at a later point.
So what does that look like?
How do you structure this in docs?
When do you prepare for these things?
Do you have a list of scenarios that you want to cycle through?
So for platforms specifically,
now that we don't own every fire drill,
we no longer have visibility
of what the other squads are planning, unfortunately.
Or fortunately, because it makes it more realistic.
But there's two main sources of where we pull our scenarios from.
One is past incidents.
Nice.
Because we're using the fire drills not just as experimenting
on the computer systems
it's the people systems as well. We say
that process kind of broke down
the last time we had this incident.
Let's run it again and see how
people respond this time.
And the other
source is
just people's brains
and figuring what's the worst that could
happen or what would happen if X.
And we as platform have a list of potential scenarios to run.
And if you want to simulate this happening, run this command on this server.
Here's what you should see.
Here is where you'll see the evidence
that it's having the desired effect. Here is how you back it out quickly. And here is how people
would probably go about fixing it. I see. Nice. Makes sense. So you mentioned now that the other
tribes are also doing this. You don't always have visibility into what will be happening,
which in a way is good.
It's more realistic.
So say, for instance, one of your on-call team members gets paged.
How do they differentiate between a real page
versus a page from a fire drill?
I'm afraid we're a bit of a cop-out.
So when we raise the pages, we prefix it with Fire Drill.
Okay, that makes sense.
I know.
Yeah, it's good, I would say.
In an ideal world, we'd not only be not doing that,
but we'd be doing it in production as well,
like in customer-facing environments.
Oh, that's risky.
It's hard to get right.
It's very hard to get right, but we can all dream.
Oh, yes, yes.
I'm curious, you mentioned you don't necessarily do this on production systems, which makes sense.
Have any of the fire drills gone sideways where someone tried to simulate a failure,
but it got worse than what they planned for?
I can't think of any that have gone worse.
I can think of lots where they've gone not at all how we expected.
Oh, okay. I would love to hear the scenario if you can share it.
So we had one where we thought, right, what we're going to do,
we're going to take this database down
and this is going to break everything for everyone.
Non-production, of course.
So we ran what we thought would happen.
And the systems just seemed to handle it and just not be bothered at all.
System is pretty good.
Yeah.
So we're here waiting to page all these people and say,
top priority, priority one, incident, everybody, all hands on deck,
and nothing's broken at all.
How rarely does that happen?
Very rare.
I wish it happened more often.
Yeah, nice.
So you also touched a little bit on this: you've been doing it every week, which is a pretty good frequency, in my opinion. And there is a trade-off between spending time on a fire drill versus, like you mentioned, doing other things like project work, because everyone's planning for new features and new things they want to get out. How do you, as an organization, balance this trade-off and justify the cost of
doing fire drills every week as it relates to the amount of time you invest in the project work that
needs to happen? This is something I feel very strongly about, and this is a horn I blow a lot
to get people to listen to. And it is something that the company accepts, thankfully,
but I can imagine in other organizations,
it may not be the case,
and you may need to do a lot of bargaining.
The way I see it, and the way I put it to people,
is that if you have a team that is focusing solely
on features and new shiny things in your application, that's fine.
But there comes a point where it doesn't matter how many new features you add, if you suddenly
have an outage and every system crashes because there's been no thought put into the resiliency
of that system. It doesn't matter how fancy your application is
if no one can get to it
because you've not thought about how it handles failure.
People have no loyalty, right?
Yeah.
As soon as that happens,
they're going to go to the competitor who,
yeah, their website may not be using
the latest and greatest JavaScript framework
for its webpage, but...
It works.
As long as it works, yeah.
I can place a bet.
That is really well put.
That is really well put.
So do you have any advice or thoughts for organizations who are thinking about chaos
engineering or resiliency engineering and just getting started?
This is not something that they have done, but they are thinking about starting it.
Yeah, the first thing I think you need to know and you need to have in place before you can even start thinking about breaking your system is having the observability nailed. So if you're
going to expend the effort to have your engineers breaking the systems,
If they haven't got the ability to deep dive into exactly what the application is doing when it's being broken, then it's wasted effort.
The first thing you need to do before you even think about breaking stuff
is ensure that you have a total knowledge of what's going on in your platform.
It doesn't necessarily have to be like,
you know, distributed tracing level down to that,
you know, down that deep,
but you do have to be able to see when your system and services are misbehaving.
And then in terms of actually getting started on stuff,
there is a temptation, if you like, to go with
the easy, obvious things to break, like the network goes away. That's
going to happen, sure, but that's not very exciting. You're not going to get your
engagement up. The best thing, and we learned this too late, this is why our
fire drills went stale, the easiest way and the best way to get a buy-in from
people in the business is to involve people in the business and get them
thinking how their own systems can break. Instead of, you know, the
platform team coming in and saying, we're going to break your system and tell you
what's wrong with it and how you need to fix it.
Instead of doing that, it's about, right, let's, as a team, as a collective, look at your system and see how it could break. Have you thought about this? You don't know what happens
if this goes away. Well, let's take this downstream dependency away and see how your application
behaves. Yeah, these have been great discussions. I think even like all the talk about
the fire drills, I think this would be a wonderful onboarding tool for even new engineers. I would
think this is something that happens in many organizations, many companies. New engineers
come in, they don't know what the lay of the land is. But with these fire drills, I think it's a very real way
to kind of immerse them into this environment
so that they can quickly figure out like,
oh, my application talks to these other applications
and those sorts of things where without that,
unfortunately, it's kind of learned on call,
which I think is what a lot of companies kind of do.
And it's fair for the on-call engineers to go in
and be like, I'm terrified.
I'm like, yeah, it's going to take some time.
But with these, I think it's probably less stressful for them,
but I think it's a wonderful experience for new engineers to come in and be like,
I can do this in a safe environment.
And when I do go on-call for real, it's not as scary, which is a great feeling.
It's throwing people in at the deep end, but you've given them a rubber ring.
They've got flotation devices all over them.
They're not going to sink.
They might feel scared for the first 10 seconds or so, but actually they're going to realize that it's safe to do.
And by the time they get rid of the flotation devices and they're actually on call,
it's like the deep end, that's fine.
As part of going on call and onto our on-call rotation,
you have to have gone through a number of fire drill experiences
before you can actually go on call.
That's perfect.
Cool.
So I'd like to kind of,
this is a question that we ask all of the folks that come on to our podcast.
So given that you have a huge breadth,
given that you've kind of like put together
these fire drills,
you've probably worked with a lot of tools
at this point in the DevOps space
or in other places.
So what was kind of maybe the last tool that you discovered
and that you just really enjoyed using or really liked?
It might seem kind of a cop out,
because it's not what you might think.
There is no wrong answer here. There's no right answer. So I recently went back from Bash to ZSH.
And I found this theme called Powerlevel10k.
And what it is, you know if you have loads of plugins in ZSH, it kind of slows your prompt down and you press enter and you just get the gaps on your terminal.
This, I don't know how it does it, it's magic.
It sort of lazy loads your plugins but gives you a prompt straight away.
And then it fills your prompt with all these super low latency utilities.
So it does like your Git or subversion or whatever,
version control in your prompt.
It gives you a clock that actually counts up the seconds in your prompt
instead of being the time that you last press enter,
which I think is amazing.
I think every prompt should come with that.
So, yeah, I don't know if a ZSH theme is going to be the most exciting tool
that you're going to get on this segment ever,
but it amazed me purely because of how it manages to take something
that would take literal seconds to load up your prompt
and just makes it like 10
milliseconds before you have a prompt. I just found it amazing to see it.
Yeah, no, that's huge. I
mean, I think for anyone who's working in this space, probably one of the most frustrating things
is you're trying to run something and you're like, oh, I have to wait. Just even like three
seconds is enough to make any of us go a little bit crazy.
So that's really neat.
What is the theme name again?
It's Powerlevel10k.
Okay, Powerlevel10k.
Nice.
Yeah.
Well, and so where can people find you on the internet and learn more about what you're up to these days?
I tweet occasionally at @heyitsols, all one word. I sometimes mess about on the Fediverse, but I'm getting a bit bored of that, so
maybe not. My website is ols.wtf, which I sometimes write blog posts on, sometimes don't.
But if I'm going to be active, it's on there, basically.
It's on there or Twitter.
Awesome.
And is there anything else that you would like to share with our listeners today?
No.
Oh, well, actually, yeah, go and break stuff.
Because you don't know how things work until you've broken them.
That's true.
Plus one to that.
Well, it's been a blast having you on our podcast.
So thank you so much for coming on to the show.
Yeah, cheers.
It's been brilliant.
Hey, thank you so much for listening to the show.
You can subscribe wherever you get your podcasts
and learn more about us at softwaremisadventures.com.
You can also write to us at hello at softwaremisadventures.com.
We would love to hear from you. Until next time, take care.