PurePerformance - Achieving Reliability through Chaos Engineering with Tammy Bütow

Episode Date: April 13, 2020

Starting your new job as an Infrastructure Engineer in a large bank, just as your boss-to-be and his key architects are leaving, feels like chaos! Maybe that’s why Tammy Butow has made a career in Chaos and Site Reliability Engineering. In this episode, Tammy shares her experiences of bringing reliability into highly complex systems at NAB, DigitalOcean, Dropbox, and now Gremlin through chaos engineering. You will learn about the importance of knowing and baselining your metrics, of defining your SLIs and SLOs, and of continuously running fire drills to ensure your system is as reliable as it has to be. If you want to learn more, check out Tammy’s presentations on Speaker Deck and make sure to join the chaosengineering Slack channel.
https://www.linkedin.com/in/tammybutow/
https://speakerdeck.com/tammybutow
https://slofile.com/slack/chaosengineering

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always with me is my wonderful, lovely co-host, Andy Grabner. Andy Grabner, how are you doing today? Good, how do you know I'm still here actually, if I haven't said anything? Because we just talked before we started. Oh, damn it, you started it. It's the whole pulling the
Starting point is 00:00:45 curtain up behind the... Yeah, yeah, I have, I have an inside knowledge of what goes on before I hit record. What if I, what if I would stay quiet and didn't say anything? Wouldn't that bring a little chaos into your world? Oh, I see where you're going. Yeah. No, so, you know, introducing that, right, so we might as well say the topic of today's episode, because I wanted to mention something with this. So today's topic is more chaos engineering, where we have a second episode on chaos engineering. And, you know, I feel I'm a broken record, I'm always excited with every episode, right, Andy? It's just hard not to. I love getting, I love the knowledge that I pick up, right? A lot of our listeners might even think, oh, well, you're one of the hosts.
Starting point is 00:01:26 You know all this stuff. No, I learned so much from the show that I love it. But to me, going back to the last episode we had on chaos engineering with Adrian Hornsby, going back to the previous episode we just had about security testing, and thinking about it coming from both your world and mine of performance engineering, these are all really, really cool topics around the idea of testing and verifying things that other people built. You have the developers and the architects and the infrastructure putting everything together. And then whether it's security, whether it's chaos, or whether it's performance, there's this whole other realm of job roles that have really exciting and increasingly more fun things to do in confirming these.
Starting point is 00:02:14 And I've really taken a shine to chaos engineering after our last episode. So that's why I'm excited to have another chaos engineering episode. But I'll shut up now, Andy. Yeah, that's okay. No, let's get everyone pumping. Yeah, exactly. So I'm also really happy to have Tammy, and I call her name, the last name,
Starting point is 00:02:35 I speak it out the way I would read it when I look at her Speaker Deck page. It says Tammy Bütow with a ü. And fortunately, with my Austrian-German background, I should be able to pronounce this correctly. Tammy is a principal SRE at Gremlin. And if I look at her Speaker Deck page, which is also speakerdeck.com slash Tammy Butow, or B-U-T-O-W. With the umlaut over the U?
Starting point is 00:03:00 Yeah, with the umlaut, yeah. But the thing is, so if I look at the last postings she did: Chaos Engineering: When the Network Breaks, Chaos Engineering: Breaking Things on Purpose, Using Chaos to Build Resilient Systems, and the list goes on and on. A lot of great talks and a lot of great content out there. So, Tammy, now the floor is yours. Thanks for being on the show.
Starting point is 00:03:22 And please, first of all, give us a little background on what you actually do as a principal SRE at Gremlin. And then also let's dive into the topic about chaos engineering. What is this all about? What do people need to know? What's important for people to know? Sure. Hi, it's great to be here today with you both. Yeah, you said my name well, better than I could say it. I actually have a, you know, very Australian, thick Australian accent. And it is a European name from, you know, Poland, Germany, sort of. It's a town where I live, what my last name is from. And my family immigrated to Australia.
Starting point is 00:03:56 So, yeah, chaos engineering. To me, the biggest thing that I, the most important thing that I think we can achieve through chaos engineering is reliability. So I like to really say like, you can achieve reliability through using chaos engineering. And the way that I know that, the reason I know that is because that's what I did. And I've been doing that for 10 years. And obviously like, if you've done something for 10 years, you become like pretty good at it. I've obviously spent a lot of hours doing it, have a lot of expertise, but I also sort of got to the point where I was like, yeah, like I'd love to be able to share the wins that I've got. I'd love to be able to package
Starting point is 00:04:32 that up and tell some stories, especially after doing it for so long in both finance. I worked at the National Australia Bank as an engineer for six years across mortgage broking, foreign exchange trading, internet banking, in a number of different roles, including like, you know, starting out full stack, where I was looking after databases and hardware in the data center, but then also writing CSS. And the reason that I went further and further back into the stack, and like all the way down to actually like racking and stacking in the data center, was because, just like performance, to me it really annoyed me when I would build some sort of front-end feature, like just trying to quickly whip something up that customers needed, and it was fast
Starting point is 00:05:16 for me to do it, but then the performance was always horrible, because, like, you know, the cache is bad, and then the database queries are bad, and then the database is bad, and then the server's not tuned, and there was just all these different points of pain. And because I was, you know, across everything, I could actually see it. So that was like a unique opportunity. Um, and then, yeah, like I decided to move to America, and I was really interested in chaos engineering because we used Chaos Monkey from Netflix at the National Australia Bank. We used it when we were migrating to AWS. So I think it's obviously like very good practice to get involved in when you're doing a cloud migration, especially if your company is going from fully on-prem to the cloud, you can learn a lot of habits. And then after that, I worked at DigitalOcean, which is like, you know, building the cloud for people to be able to use. Super cool. Like, you know, over 10 data centers all over the world,
Starting point is 00:06:07 lots of on-prem machines that we were looking after, you know, had to do a lot of incident management there as well, making sure that everyone was able to access their cloud instances at all times. And, you know, I really came from that perspective that I do really care about customers and I care about what we're providing to customers. And that's something that definitely motivates me. And it's also because, you know, when I was working in mortgage broking, one of the biggest things that happened was very early on, I think I'd been working for only two months after graduating from university. And I got this massive fine from the government,
Starting point is 00:06:42 from the regulators, because someone wasn't able to purchase their home. They'd had their mortgage. It was in processing with us. My system was down. I was on call for it. And, you know, at the time, like I was very, very new to the industry, but I got this like huge fine. And, you know, they actually have to put your name as the engineer that was responsible
Starting point is 00:07:01 for the system at that time on the fine. So I was like, oh, this is like really serious. Like first off, this person couldn't get their dream home. Like they obviously went to, you know, they went in, they put in an offer, they've been searching for ages. Now someone else is going to get their dream home. They're obviously really mad about it. So they complained about it. And yeah, I just like from then on, I was like, all right, I do really want to help people be able to use the internet. And obviously, like the internet back then, 10 years ago, you know, over 10 years ago, I knew it was going to be really popular and even more important. So it made sense to focus on reliability from back then.
Starting point is 00:07:37 Did you actually have to pay the fine or did your company, were they on the hook? Yeah, so it's actually interesting. So, yeah, you know, it's a big company, like tens of thousands of people. Um, we have like over a million customers in Australia, it's one of Australia's biggest banks, and I was a new grad. So the CTO like came over to my desk and was like, you just got me a fine that I'm gonna have to pay, like how are you gonna make up for it? I was like, whoa, this is so serious. Um, and then I was like, oh, like, how can I make up for it? All right. Like, I'll try and figure it out. And it was interesting, a very interesting team
Starting point is 00:08:09 that I was on there because I applied for their graduate programmer job. So, I came in as a software engineer, but the week before I started, the boss of the team quit and he took the three senior software engineers with him. So, we were a team of three junior engineers and it was an acquisition company, but it was making tons of money. Like it was processing a lot of mortgages. So it was a very important business to the company, to the customers. But internally it just wasn't staffed. So yeah, I just like work really long hours. Like I was working seven to seven from the office plus overtime at night, plus every weekend. I went and worked actually like, and sat with a number of different teams and shadowed them to understand all of the pain points of the system. And then we made it way better within,
Starting point is 00:08:56 you know, a year and a half. It had just like dramatically reduced the amount of incidents. I ran a lot of initiatives to make it much more reliable, but I had to do a lot of disaster recovery work, a lot of failure injection. And that's what really taught me about the value of chaos engineering, because when you are in a situation where a lot of things are wrong, the most important thing that you need to do is figure out what are the most important things I need to fix. And to me, that's what chaos engineering helps you do. It helps you prioritize critical fixes that you need to make. And yeah, that's what I've started to do as well over the last few years is really collect up those things. So, you know, when the regulator comes to you and says, this happens, you know, when you're about to IPO as well, I've had to do
Starting point is 00:09:40 that work too. So they say, you need to prove that you can do a region failover and you have to actually demonstrate it live. You need to prove that you can run a backup. You need to prove that you can do a data restore. So there's certain things that are very critical that you have to be able to do. And you only know this if you've had to do compliance and regulatory work. But, you know, I think I always thought coming from banking
Starting point is 00:10:04 and finance, every company should be doing that. Like if you don't, you know, if you don't think about what's going to happen, if you lose your backup, if you can't restore a backup, you've never even tried, you're going to lose all your customer data. Like, you know, the person responsible for that, if customer data is lost, that goes to the engineering team. And it's not a security engineering responsibility. Like security engineers care more about if you got popped, if customer data was stolen and like shared wide on the internet.
Starting point is 00:10:33 But SREs care about reliability, availability, durability, making sure you don't lose data. Like if someone needs to own that responsibility, and it's a big responsibility, but that's what SREs do own. And I think that's a really important thing to own. And that's why I think it's a great career. So I thought it was fascinating when you said when you started that job, basically the old boss quit. So basically this was the biggest chaos that could probably be.
Starting point is 00:11:02 Yeah. You were part of the chaos experiment, right? Yeah, totally. Yeah, it was exactly like that. Like, it's a big thing. You know, one of the, um, things that we think about with chaos engineering too is, so for me, for example, I live in Florida. Uh, I'm Australian, but yeah, I live here, it's the most Australian place I could find in America. Um, it's very true, you know, there's gators and like wild weather. So, you know, in Australia we have bushfires, and often they can impact if you can go to work or not, and it really impacts your quality of life. But here in Florida, we have hurricanes. So,
Starting point is 00:11:37 for example, what happens if I can't work, you know, for 10 days because I don't have internet access and have no electricity because there's a hurricane? Like, that's something I have to be prepared for. Um, and so yeah, like if there's no electricity I have a generator, but if there's no internet, like, that makes it a lot harder. Then obviously my team has to be ready to be able to, um, handle that when that happens, and I have to be able to notify them. Um, so yeah, there's a lot of interesting things that you need to think about. I like to also say another way that you can do chaos engineering to enable your team to get better in those situations is actually on-call training or fire drills. And that's something that, yeah, I've learned a lot about doing that over the years. I've done it at every company I've worked at. We've run fire drills and, you know, there was never really like a way to do that that was a great way. So I,
Starting point is 00:12:29 yeah, I worked to package up all my knowledge and right now I'm helping to build that into Gremlin, which is pretty cool. So it's like taking everything I've learned over the last 10 years, putting it together and making it that you can just like press go, like let's run that fire drill. Let's actually inject failure and make it happen. And I think I was like one of the first people in the world to actually go, let's do on-call training. I started to do that when I was at Dropbox. And the reason that I needed to do that was because, you know, in the past, like whenever I've gone and worked at a company, yeah, they're like, you're on call, good luck. Like you don't even have a boss, like good luck. Like that's actually what happened to me on my first time I was on call.
Starting point is 00:13:08 Like there was no training and there was no one that I even reported to except the CTO that was coming to visit sometimes. And, you know, well, actually a lot, like most days. And, um, and I got to know him really well, Brian, but, um, you know, that's probably not the ideal situation. So then after probably about, you know, eight years or something like that of working, I was like, maybe there's a better way to do this, especially because I really wanted to try and help students who've come from alternative education schools like boot camps, like, you know, alternative schools. There's a great school that I love that I'm a mentor at called Holberton School. I wanted to try and hire some of those students and get them excited
Starting point is 00:13:51 about SRE and, you know, they were interested. They really wanted to learn how to be an SRE. And the cool thing there was I worked with them to be able to say, hey, like I want to put your grads on call when they come and work at Dropbox. And we did that and they were really great at it. And this is like years ago now and they still work there. The grads that we hired from that school on like an SRE apprenticeship program and school actually puts its students on call for their projects, which I think is amazing. It's like you build it, you own it. Like they build some software, then they get put on call and then they actually get paged and they have to respond to the
Starting point is 00:14:28 pages and fix the issue and they inject failure. So it's like real chaos engineering that the school is doing too. I think that's great. And the students are learning amazing skills, you know, because often what happens when you're on call is, yeah, you might get paged at 3am, 4am and you need to fix the issue because otherwise everything's going to be down and your customers will be impacted, especially if you have a global business. And a lot of the time we don't do those fire drills. So you can see a lot of problems with, like, mean time to detection being several hours just because paging systems
Starting point is 00:14:58 aren't set up correctly or monitoring and alerting is not set up correctly. So, yeah, there's tons that I've learned that enable me to then go and say, hey, like, maybe we can do this better. So yeah, that's a big thing that I've loved doing as well. So speaking of that, and all the steps going through that, I was curious for your take on, you know, if an organization is going to start going into chaos engineering and chaos testing and all this, what would you say are some of the prerequisites? Obviously, there are things you're going to build out. You're going to build out better monitoring everything.
Starting point is 00:15:30 But before you get started, what do you think you, what would you say has to be in place before you go down that road? You just don't want to stumble into it, right? Yeah. So one thing, I've actually helped a lot of companies do this now, which has been pretty cool. I've had that a lot of companies do this now, which has been pretty cool. Um, I've had that opportunity, you know, um, at Dropbox, I did, you know, chaos engineering,
Starting point is 00:15:50 uh, three times a week to be able to get a 10 X reduction in incidents in three months and then have no more high severity, like catastrophic incidents. We call them sev zeros for, you know, the next 12 months after that. And then I went to look after other teams. But I think the main thing that I've learned is that you really need to baseline your metrics. That is the most important thing. And I think that's actually something that we often don't do because we don't make the time to do it or we're not sure how to get started. So, and by that, I mean, like you need to know what you're trying to improve. And the other thing that I like to say is like, pick your five critical systems to start with. So if you're a team and you own multiple systems, or maybe, you know, you believe your system is critical and that's like systems that are in the critical path. So you kind of need to map that out.
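To make the "baseline your metrics first" advice concrete, here is a minimal Python sketch (all metric names, sample numbers, and tolerances are invented, not from the episode): capture the steady state for a couple of key signals before a game day, then check whether the same signals stay within an agreed tolerance while the chaos attack is running.

```python
"""Minimal sketch: baseline key metrics before a chaos experiment,
then compare the same metrics measured during the experiment.
All numbers and metric names are made up for illustration."""

from statistics import quantiles


def p95(samples):
    """95th percentile of a list of samples (needs a reasonable sample size)."""
    return quantiles(samples, n=100)[94]


def snapshot(latencies_ms, errors, requests):
    """Capture the state of two golden signals: p95 latency and error rate."""
    return {
        "p95_latency_ms": p95(latencies_ms),
        "error_rate": errors / requests,
    }


def within_tolerance(base, during, latency_slack=1.2, error_slack=0.01):
    """True if the experiment stayed inside the agreed tolerance:
    p95 latency may grow at most 20%, error rate at most +1 percentage point."""
    ok_latency = during["p95_latency_ms"] <= base["p95_latency_ms"] * latency_slack
    ok_errors = during["error_rate"] <= base["error_rate"] + error_slack
    return ok_latency and ok_errors


if __name__ == "__main__":
    # Steady-state window, e.g. the hour before the game day.
    base = snapshot(latencies_ms=[40, 42, 45, 48, 51, 55, 60, 90, 120, 130] * 20,
                    errors=3, requests=10_000)
    # Window measured while the chaos attack (e.g. added latency) is running.
    during = snapshot(latencies_ms=[50, 55, 60, 70, 80, 95, 110, 150, 200, 240] * 20,
                      errors=25, requests=10_000)
    print("baseline:", base)
    print("during attack:", during)
    print("within tolerance:", within_tolerance(base, during))
```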
Starting point is 00:16:42 And often I'm, when I meet people, I say, Hey, do you know what your top five critical systems are at your company? And they're like, actually, no, no, no, I'm not sure. Like, obviously one is going to be the database. Another one's probably like your traffic layer, like maybe if you use NGINX. So there are certain systems that, yeah, they're critical. Like maybe if you process payments, then your payment service, that's going to be critical. But then you can have other systems that are in the critical path that you maybe want to remove from the critical path. So you can also use chaos engineering to do that. But that's why it's very good to just actually get into a room and say, let's run a game day. So that's something we do a lot of at Gremlin. We'll run a game day and that could be like, you know, an hour and a half to three hours where we actually fill
Starting point is 00:17:25 out a workbook up front a game day workbook and we say these are the things that we want to do we actually have a spot to put in an architecture diagram so we put that in before we run the game day we make sure that everyone's clear about what we're trying to figure out what our hypothesis is what experiments or chaos engineering attacks we're going to run. And then we actually say like what metric we're going to track to see how the actual chaos engineering attack impacts our system. And we'll be able to also figure out through that, like what dashboard should we look at? What alerts should fire? How do we determine if this was successful or not? Was the experience what we expected or not? Did we have a graceful degradation if it
Starting point is 00:18:05 was related to like, you know, some sort of UI component? So there's a lot of things that we look into, but we package it all up into this game day workbook, which enables you to quickly and efficiently run a game day. So yeah, that's how we've gone about it. But that's like a lot of years of learning and then packaging that into that. And like, you know, the main thing I think is cool that people are starting to talk more about now are SLOs and SLIs. So, you know, I came from a world, yeah, there was tons of SLAs because in finance, you have those a lot. You have SLAs between every team, like, you know, you need to be up and running 99.99% of the time. That's the SLA. So that's actually what it is like in finance. In startups, they don't really have that as much. But if you don't meet your SLA when you're in a bank,
Starting point is 00:18:50 and say you could have an SLA between like different companies that rely on your data or a different company that you need to use, you'll have an SLA for them. If one of you don't meet the SLA, then you can fine the company. So then it actually ends up being very expensive. Like SLA fines can be quite big. But a big thing that Google started to talk about was not just having SLAs, which are more about the business side of things and fining. And it's a lot of like, you know, heavy stick, like more punishment, not really a carrot sort of thing.
Starting point is 00:19:22 And so, and it's not really about, I don't know, no, like I was very good at monitoring it because I never wanted to get fined again, you know, because once you get fined, once you don't want to get fined again. But a lot of people weren't good at monitoring it over time and they didn't have it in their dashboards. So over time, like, yeah, Google started to talk about SLOs, service level objectives and service level indicators, SLIs. And I think like that really resonates with a lot of people. It totally does with me. And I think like that really resonates with a lot of people. It totally does with me.
Starting point is 00:19:47 And I think like then you can actually go, what is the SLO and SLI for our specific service that we have? And I think that's a better way to do it. So, for example, if you look at like a critical service, like something that stores data, then you're going to have to have an SLO and an SLI for data backups and an SLO and an SLI for data restore. And then you would actually want to be monitoring that and alerting on it if you breach your SLO. And so, for example,
Starting point is 00:20:18 for an SLO, it might be: we want to be able to ensure that backups are completed within 60 minutes, 99.999% of the time. And then your SLI is the time it takes to process a backup, in minutes. For backup restore, data restore, it could be the time it takes to restore from a backup. And you also want to verify data correctness, so then you might need to do a data verification check as well. So there are like two indicators for that. But that's the sort of level of detail that you need to get to if you want to make sure that you're able to do a good job. But the thing is, right, you only have to do that once.
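A minimal sketch of what tracking that backup SLO might look like, assuming the 60-minute / 99.999% numbers from the example; the sample durations and the alerting line are invented for illustration.

```python
"""Minimal sketch of the backup SLO/SLI described above:
SLO: backups complete within 60 minutes, 99.999% of the time.
SLI: minutes taken per backup run. All data here is invented."""

SLO_MINUTES = 60
SLO_TARGET = 0.99999  # five nines


def slo_compliance(backup_durations_min):
    """Fraction of backup runs that finished within the SLO threshold."""
    if not backup_durations_min:
        return 1.0
    good = sum(1 for d in backup_durations_min if d <= SLO_MINUTES)
    return good / len(backup_durations_min)


def error_budget_left(compliance, target=SLO_TARGET):
    """How much error budget remains (1.0 = untouched, <= 0.0 = burned)."""
    allowed_bad = 1.0 - target
    actual_bad = 1.0 - compliance
    return 1.0 - (actual_bad / allowed_bad)


if __name__ == "__main__":
    # Minutes per backup run (fake data). Note: with only a handful of runs,
    # a single slow backup blows a five-nines budget completely.
    durations = [32, 41, 38, 55, 61, 44, 39]
    c = slo_compliance(durations)
    print(f"SLI compliance: {c:.5f}")
    print(f"error budget remaining: {error_budget_left(c):.2%}")
    if c < SLO_TARGET:
        print("ALERT: backup SLO breached, page the on-call")
```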
Starting point is 00:21:01 Like it's not something where you have to go, I need to continuously set SLOs and SLIs for my critical services. Like, no, you don't. You just set it once, you do as good a job as you can the first time, and you can tweak it and make it better. But there are like a set of SLOs that every company should definitely have. Like if you store data, then yeah, you need like a specific set. If you process payments, you need a specific set. Like, you know, there's like certain things where it just totally makes sense. So yeah, I think that's cool that that's where we're heading. And then obviously like chaos engineering enables you to verify your SLOs and SLIs, because how do you know that you're
Starting point is 00:21:40 even tracking it correctly? How do you know that backups and restores are able to be processed and meet your SLOs if you don't actually inject failure? One of the things I like to do is kill a backup process mid-process, kill a restore mid-process, and make sure that I still meet the SLO because that can happen. Processes can just stop working, stop running, or a machine can go down or something can go wrong. But if you notice that there's some type of issue where, you know, you're not meeting your SLO, like it could be that there's a networking problem and the network slowing down all of your data traffic for the backup and restore. Like that's happened to me in the past. That's why I know that example. I was being throttled by the network engineering team.
Starting point is 00:22:24 So I had to like collect my information and go, hey, usually it takes me this long to process a backup. Since this date, it's actually taken me like double the time. Why? Like I didn't change anything. What did you change? And they're like, oops, yep, that was us. Sorry about that.
Starting point is 00:22:38 Like we'll fix it right now. So they fix it straight away. Yeah. But if you don't have that evidence, like I like to think of myself as a little detective, you know, it's just way easier when you have evidence and data and you say, this is my problem. This is the data that shows that it's a problem. This is what I think is going on. Like, can you verify it from your end? Yep. All right, cool. Like, let's just fix it and then move on
Starting point is 00:22:59 and keep going on with our lives. So yeah, that's how I like to work. It's a very Australian approach, very direct. Well, it also ties very, the reason why I find this so interesting too, is it ties very much into the pure scientific method, right? Hypothesis, your criteria, all this, you put it to the test and evaluate how it came out and adjust from there. So it's, that I think is the really cool thing about it because it takes it without getting into the lab coat, you know, safety goggles kind of science. You're doing real science there, you know. So it's real fun stuff.
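Tammy's "kill a backup mid-process and make sure you still meet the SLO" experiment can be sketched roughly like this. The backup command here is a placeholder sleep, and the single retry stands in for whatever recovery logic a real backup pipeline would have; none of these names come from Gremlin's product.

```python
"""Minimal sketch of one experiment mentioned above: kill a backup process
mid-run, let the retry kick in, and verify the whole run still fits the SLO.
The 'backup' is a stand-in sleep command; swap in your real backup job."""

import subprocess
import time

SLO_SECONDS = 60 * 60           # 60-minute backup SLO
BACKUP_CMD = ["sleep", "5"]     # placeholder for the real backup command


def run_backup():
    return subprocess.Popen(BACKUP_CMD)


def experiment():
    start = time.monotonic()

    # First attempt: start the backup, then kill it partway through (the chaos).
    attempt = run_backup()
    time.sleep(2)       # let it get partway
    attempt.kill()      # inject the failure
    attempt.wait()
    print("injected failure: backup killed mid-run")

    # Whatever retry/cleanup logic you rely on in production goes here;
    # this sketch simply restarts the job once.
    retry = run_backup()
    retry.wait()

    elapsed = time.monotonic() - start
    ok = retry.returncode == 0 and elapsed <= SLO_SECONDS
    print(f"backup finished in {elapsed:.1f}s, SLO met: {ok}")
    return ok


if __name__ == "__main__":
    experiment()
```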
Starting point is 00:23:31 Exactly. Yeah, it's cool. Like we have like a, you know, a system that's continuously running backups and restores over and over every day to make sure that, you know, because what you can notice when you really get into the detail of this, when you get into the weeds, you do notice interesting things. On Mondays, for example, a lot of companies will have peak traffic periods Monday mornings. It can take obviously way longer to do a backup or a restore. So if you have an outage on a Monday and you have a critical problem related to data, and what if it takes you, instead of 60 minutes to restore, it takes you five hours or something like that? It can be really bad.
Starting point is 00:24:07 Whereas on a weekend or a Friday night, it could be way faster than your SLO. But yeah, it's the stuff that you really only learn about once you do the work. And it depends on... Every business is different because every business has different types of customers in different countries that are doing different types of work. The other thing is you can't assume that you know how your customers use your product. You know, you might think that, you know, one thing I thought at Dropbox was maybe we wouldn't have much traffic on the weekends, but actually we had a lot of traffic because there's
Starting point is 00:24:39 a lot of university students that were using Dropbox. So that's not something that I thought, you know, would be happening. But, yeah, that totally happens. And you can see that when you investigate the graphs and the monitoring and actually dive into it and just go in with more of an open mind. I think that's a good way to do it. Hey, it's amazing.
Starting point is 00:24:59 I keep, you know, taking notes and notes, and I think by now I can almost start writing a book on the information that you tell us about. I got a couple of questions. And so I think the first thing is you always talk about these game days and about running experiments. Are we talking about purely production? And if this is the case, isn't that freaking out people
Starting point is 00:25:19 when you approach them with chaos engineering and say you bring chaos in production? Isn't this something you first address in a pre-prod environment, or how does this really work? Where do you first start with bringing chaos in? Yeah, it's a great question. So like when I first started doing chaos engineering, I actually did do it in production at a bank, which, you know, a lot of people are surprised by. They're like, oh, banks do that? I'm like, yeah, because you have to. Like, um, that's actually something you need to do to show the regulators. In Australia the regulators are called APRA, and you need to be able to demonstrate that every single quarter. But the thing too is you can't just show up. Like it's actually a very
Starting point is 00:25:54 big exercise. You need to go to a totally different unmarked building, that's like you're failing over to this data center that's outside of the nuclear blast zone. Like it's a very serious thing. And you do that. Everyone volunteers to do it because they want to do it. So I volunteered to do it a ton of times. And you spend the whole weekend there and they fail over every single service. And then somebody actually comes around with a checklist and they actually make notes about your service, how it handled the failover, anything that didn't work. And then you have to come back and make sure that you don't fail anything again the next quarter. And obviously like, you know, yeah. So nobody just shows up on that day and goes, let's see how it goes. Like it's really serious. So you're preparing for months, like for the whole quarter, you're preparing to make sure that you pass. And I remember I once
Starting point is 00:26:42 failed one small component of like my mortgage broking system at the disaster recovery test, and it was like super serious. So then I went back immediately and was able to fix that issue with my team. But yeah, we ended up having to demonstrate it to a number of different teams to be able to show within, you know, two weeks that we'd fixed it. Because it's one of those things where, yeah, you just don't want things to be not working in production, right? Like that's bad. And if you notice something's not working, then you need to try and fix it fast. Like I always say, fix it within a month for sure. Like if you have a post-mortem report that you write up, well, what I like to say is write up, you know, three to five critical things that you need to fix as action items. There's
Starting point is 00:27:24 always something that needs to be fixed for production incidents. And then go and do those items, make sure that you have someone making sure that they get done, but then also run a chaos engineering game day that actually reproduces the incident and then verifies that your fixes work. Because that's a big part that we miss in our industry. We like write up the post-mortem, we spend ages talking about it. Maybe some companies do the action items, maybe some don't, but they never like try and reproduce the incident. And actually like I've had a lot of incidents that have been the same incident over and over and over. For example, you know, the same batch job running every Tuesday night at 8pm, but it was hard to like dig in, like where is this
Starting point is 00:28:01 batch job coming from? What is the batch job doing? Like, who even wrote this? Is it important? Can we shut it down? Like, and it can become quite complicated. That's why you need to actually follow up and figure it out so you can fix it. So, yeah, I don't know. There's a lot of interesting things that I've learned through doing that, the production side. But what I also know is, you know, I've worked at other places where, yeah, they didn't want to do production chaos engineering work and that's fine.
Starting point is 00:28:31 Like I like to start all the way from very early in the beginning. Like when, you know, a cool thing that I get to do at Gremlin is I'm able to review the PRDs. So, you know, before I joined Gremlin, didn't even know what a PRD was. Like I'm an SRE, I've never read one. And it's like a product requirements was. Like I'm an SRE. I've never read one. And it's like a product requirements doc. And it's what the product team usually writes when, before they even give it to the engineers and the product engineers. And I've always worked on infrastructure engineering.
Starting point is 00:28:56 So I never really did like product development with a product manager. I worked in a bank where I worked on product features, but we didn't have a product manager doing that sort of work. You know, this is many, many years ago. That role didn't exist at the bank then. And so, yeah, I actually like look at the PRD and try and think like, what game day will I run for this feature before it goes into production? So, and I'm thinking about that, like how are we going to make sure
Starting point is 00:29:22 that this is reliable through using chaos engineering and actually verifying it and proving that it is reliable before it's in production. And so then I also have thought like, hey, from the PRD, we're figuring out what chaos engineering attacks we want to run, what game days we want to run. Then after that, I'm also thinking, hey, like maybe there's this whole different world in the future. Like if you imagine, you know, five, 10 years down the road, like what does our development environment look like right now? Like, you know, it's kind of different for a few different people. Some people have like a remote instance on AWS and that's their development environment. That's pretty cool. I think like you can blow it away when it's not good anymore. It's like super easy. The only thing that you need to have though is internet,
Starting point is 00:30:08 but like most of the time we have internet everywhere now, but that's the only real downside I think. And then, or other people have like local development where they're like doing it on their computer, they can do whatever they want. It's like totally self-managed, but also like that means it can come with a lot of problems too with different version issues and like you have no tooling support maybe you don't have a dev tools company to help you out or maybe it's like you're doing local development and you have a dev tools company and they say hey you do everything with docker like locally but one cool thing that we've been thinking about at gremlin is what happens if in the future like yeah everyone's doing um remote development on their own like development um instance and you can actually
Starting point is 00:30:51 have chaos engineering attacks running in your development environment so while you're working on new features as a feature developer or as a you know infrastructure engineer um it's a product engineer infra engineer. You can actually have these experiments running and like, you can make them run in an automated way as well. So you can actually see like, how does my system handle packet loss? How does it handle latency? Um, how does it handle if this other service isn't working? What does it look like? Like, so for example, if you're building an e-commerce store, what happens if all of a sudden the recommendation service doesn't work? Like, is that a clean graceful degradation or does it break your entire
Starting point is 00:31:31 e-commerce site? Like that's something you can figure out with chaos engineering by blackholing a service. And so that's what we do a lot of at the moment. So I kind of think that's like a very cool thing to do. And then from there, you can also say, hey, like let's run these chaos engineering attacks in our CICD pipeline. So like to actually run them in an automated way and make sure that we pass these before we move the code, move the build into production. So yeah, like, you know,
Starting point is 00:32:00 shifting all the way left to the PRDs and then all the way right into production. Like I think there's a lot of value that you can get across all of the different developer loops that we have. And yeah, to me, that's really exciting. Like, I think it depends on the maturity of your company. Obviously, if you have a lot of production incidents, then you need to focus on production. Like you obviously, that's where you got to go. If you don't have many production incidents, then focus more on like helping developers improve velocity and help them build better customer experiences. Yeah.
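As a concrete illustration of the e-commerce example, here is a minimal sketch of graceful degradation when the recommendation service is blackholed. The service URL is hypothetical, and a blackhole attack is approximated here as a request that simply times out; the same check could also be run as an automated test in a CI/CD pipeline.

```python
"""Minimal sketch of the 'blackhole the recommendation service' example:
the product page should degrade gracefully (empty recommendations),
not fail entirely. The URL is hypothetical; a blackhole attack behaves
roughly like a request that never gets an answer, i.e. a timeout."""

import json
import urllib.error
import urllib.request

RECS_URL = "http://recommendations.internal/api/v1/recs?user=42"  # hypothetical


def get_recommendations(url=RECS_URL, timeout_s=0.5):
    """Call the recommendation service, but never let it take the page down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return json.load(resp)
    except (urllib.error.URLError, TimeoutError, ValueError):
        # Graceful degradation: show the page without recommendations.
        return []


def render_product_page(product):
    recs = get_recommendations()
    return {"product": product, "recommendations": recs, "degraded": recs == []}


if __name__ == "__main__":
    # During a blackhole attack the call above times out and we still render.
    page = render_product_page({"id": "sku-123", "name": "stopwatch"})
    print(page)
```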
Starting point is 00:32:31 Yeah. So when you talked about the developer environment chaos, I actually first thought you are trying to bring chaos into the tools developers are using and depending on. Let's say you're bringing chaos into GitLab, you're bringing chaos into Jira, and you're bringing chaos into all these tools and then see how the chaos actually impacts the development processes
Starting point is 00:32:55 and how the organization reacts to that. And maybe that's what you meant, but this is... Yeah, so I meant more like, say, when you pull down your code. So say if you have like um yeah like a github repo maybe you're in a mono repo maybe you have lots of you know different repos and then you're working on a new service but what happens if your service can't communicate with another service like does it break the whole experience like that's a really interesting case for chaos engineering because a lot of the time like people who are building features have
Starting point is 00:33:24 never been like on call for production systems like it's a weird thing like in our industry but they've just never been given the opportunity maybe um so i think it's a cool way to learn about failure in production and learn about how systems work and learn more about distributed systems um and it's something that i want to help people do right because yeah people come to me and ask me all the time like how do i make sure that my service is reliable? I recently went to mentor at a hackathon and all of the college students were like, hey, like, tell me about this chaos engineering. Sounds so cool. Like, of course, like when I'm writing code, I want it to be reliable. Like, how do I do that? You know, they want to know, like they want to learn and they want to
Starting point is 00:34:01 be empowered. Like they want tools to be able to help them learn how to do that. And that's like a really different mindset. But they come from a world where like, you know, they were born on the internet. They get annoyed when they like put items in their shopping cart and they disappear, you know, like stuff like that. Like what is this? And so I kind of come from that world too, you know,
Starting point is 00:34:19 because I grew up in Australia. I had the internet since I was really young, since I was about 10. My mom got me a computer when I was like less than five years old. She thought that, you know, computers and the internet were the future. And her best friend worked at Microsoft and still does. And yeah, so she really got me to think more about that. But the problem in Australia is the internet quality is really bad. So the latency is super high. So if you're trying to watch a video, you'll just have like buffering for ages. You'll have to wait for so long for your video to buffer and then you can finally play it. But you know, the internet is just really hard.
Starting point is 00:34:55 Even now, if I try and go home to Australia and work remotely, I can't video conference, it doesn't work. I can't use Google Hangouts or Zoom. And obviously, that's what the world is right now. But in 10 years, like it's going to be much better and the internet will be even more important. But it does like make me think like, you know, if you say like, yeah, what if my dev developer tools aren't accessible? And that happens.
Starting point is 00:35:17 Like recently there was a GitHub outage. Like, yeah, what happens when you can't access GitHub? Like, you know, are you down for the whole day? Like what other kind of work can you do? Do you have like a full back plan or redundancy in place for that occurring? So yeah, like just as we go more and more, you know, full throttle into the future,
Starting point is 00:35:37 we're all connected and we're using the internet. Like I don't think that's going to slow down anytime soon. We need to be prepared for failure because like it's going to happen and that anytime soon. We need to be prepared for failure because it's going to happen. And that's just an important thing to be ready for. Hey, the whole, I mean, Brian and I, we had several, many probably, many talks about shifting left and integrating checks into the delivery pipeline, starting with automated testing.
Starting point is 00:36:02 And then after the test, you want to validate not only functional requirements, but especially non-functional requirements like what you just explained earlier with SLIs and SLOs. This is also stuff that we've been, at Dynatrace, been promoting through our open source project, Captain, where we also agree that, you know, we want developers to think about what are their SLIs,
Starting point is 00:36:26 what are the metrics that are important for them, what are the indicators, and then what are their objectives, and then let this be validated automatically in the pipeline after tests are executed. And I think it would be awesome to then, and I think this is the best practice, right, that you also try to get there,
Starting point is 00:36:41 is while a pipeline runs and tests are executed and evaluated your SLIs and SLOs, how does this get impacted if you also, while the tests are running, enforcing chaos and seeing what the behavior then is? Yeah, totally. Like it's a great way to catch, you know, regressions, like networking problems or like service regressions.
Starting point is 00:37:03 The challenge that I've seen, and, you know, we've been promoting, you know, sit down and define your SLIs and SLOs, but then we sometimes get challenged by people who say, well, so does this mean we need to sit down and think about what our SLIs are for my individual service? That's a lot of work. And I think, as you said earlier, yes, it is, but it's only a one-time effort in the beginning. Exactly. That's the thing, you only have to do it once. Like maybe what you do is you say, let's have a three-day off-site where we actually just sit down together as a, you know, engineering team and we determine all of our SLOs and SLIs. Like you can 100% get that done in three days for your critical services. Like, you know, you have to then go, hey, what are our critical services?
Starting point is 00:37:46 Let's agree on that. Now let's determine SLOs and SLIs. There's like tons of resources that you can read to be able to figure out how to measure those. And I also think it's like, you know, baby steps, like start somewhere. Don't have no SLOs and SLIs. Like you should definitely have some. But the thing too, that's most important where I've seen people really struggle is they didn't have, they didn't focus on critical service SLOs and SLIs,
Starting point is 00:38:11 the most important systems, and they didn't focus on the customer. So, you know, you really want to focus on that, like what matters to the end user. And from that, I like to think about, you know, if this system were to fail, like, would we still be around as a company? You know, and that's a pretty scary question to ask yourself. But yeah, like if we lost all of our customer data today, like, would we still be around as a company? And then the next question is which systems are responsible for making sure that we don't lose all of our customer data? And then you sort of go from there. Okay, we've figured out there's these 10 systems that have to be working for us to not lose our customer data. Okay. Like let's have SLOs and SLIs for that. To me, that's important. Then you can look at like, say revenue, like when you make sure that,
Starting point is 00:38:53 yeah, you're always able to maintain data durability and you've set goals for that, then you can go, Hey, this is how we make money. This is how we're able to service our customers. If it's like e-commerce payments or if it's airlines, like booking reservations, you know, then you focus on the systems that handle that. So it's not so much work. Like, I mean, I could go and visit any company and work with them to do it in three days.
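A small sketch of the kind of artifact such a workshop could produce: an inventory of critical services with their SLOs and SLIs. Every service name and number below is invented for illustration.

```python
"""Minimal sketch of a critical-service SLO inventory, the kind of output a
three-day SLO workshop might produce. All names and numbers are invented."""

CRITICAL_SERVICES = {
    "payments-api":   {"slo": "99.99% of charges succeed",          "sli": "successful charges / total charges"},
    "primary-db":     {"slo": "backups < 60 min, 99.999% of runs",  "sli": "minutes per backup run"},
    "edge-proxy":     {"slo": "99.95% of requests under 300 ms",    "sli": "p99.5 request latency"},
    "auth-service":   {"slo": "99.99% login success",               "sli": "successful logins / attempts"},
    "object-storage": {"slo": "99.999999% object durability",       "sli": "objects lost per billion per year"},
}

if __name__ == "__main__":
    for name, spec in CRITICAL_SERVICES.items():
        print(f"{name:15} SLO: {spec['slo']:35} SLI: {spec['sli']}")
```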
Starting point is 00:39:18 You know what I mean? Like, it's like, you can do it. Yeah. One thing we've been doing with some of our customers in order to help them find the right SLIs, we basically look at production monitoring data, right? If you monitor the production systems and then we say, okay, let's figure out when in our case Dynatrace detected an issue that had an impact on your business. And then what we do with Dynatrace, when we find a problem, then we look at all the metrics of every component that is depending on that, let's say, application or service, and we do change point detection.
Starting point is 00:39:48 So we basically say there's an impact of 50% of your user base, and we could see that the five metrics that actually told us that something is wrong is this metric here, this metric here, this metric here. So some of our customers are now going into figuring and looking at the top problems that they had in, let's say, the past month. And then based on the production monitoring data, look at what are the key metrics that actually told us the monitoring system about the problem. And then take this as an input for their SLIs so that they can then validate that every time they make a code change or a new configuration change, they make sure that these metrics are not going out of norm because obviously we've used these metrics earlier
Starting point is 00:40:30 to detect the root cause of an issue. So that's something we've been trying to do and help with. I think that's a really smart approach to it. And to me, that's what engineers are great at, right? You're basically reverse engineering it. You're going, hey, these are your top incidents. This is how we detected them. This know, you know, these, this is the point that this issue occurred. These are all the metrics that fired. These are the systems that are linked to those metrics,
Starting point is 00:40:52 like totally a smart approach. And it's going to enable you to figure out those SLOs and SLIs really fast. Like if you're a company that has a lot of big incidents, like I would totally go for that approach. Like that is fast. Like that you can probably figure it out definitely within a day. And then you're going to be able to get much better. You know, the other thing we need to think then is, okay, now we know which systems are critical. We've set our SLOs and SLIs. Then as an engineering leadership team, the leadership team needs to go, where do we put our people? You know, where do we put our engineers? And actually look at that, like say, we've got maybe not enough engineers on the database team, but we've got too many engineers on the Kafka team. You know, maybe we don't have enough people who understand monitoring and alerting. We need to train up some people to understand that so
Starting point is 00:41:40 they can make sure that all of the systems actually have better monitoring and alerting. So like, I think that's really an important thing for leadership to do. And also to think too about like, you know, your spread of senior engineers and junior engineers, folks who are new into industry, you know, do you have a good enough spread across the teams? And then think about like rotating people through so they can transfer knowledge across the company. Like, you know, these are all really good things to do. And I think, you know, one of the things I really like to advocate for too is have your junior engineers, you know, your new grads get involved in these conversations around setting SLOs and SLIs, like let them come to the workshop, do it as an offsite, you know, don't do it in a
Starting point is 00:42:49 silo. I think that just helps people understand why we're setting these metrics, why they're important and how they can use them day to day. Like, that's obviously how to really get buy-in across your entire engineering team. And for everyone to feel like they were able to contribute, like, I think that's important. Yeah. Hey, at the end, I want to bring up one question that just came up. So our CTO, Bernd Greifeneder, he just wrote an article, which was published on enterprisersproject.com,
Starting point is 00:43:13 and he kind of explained our approach to software engineering. He explained our approach of what we call no-ops. So no-ops for us is a mentality where we say, what can potentially go wrong and how can we prevent this through automation? And what I thought was interesting in his article is he is comparing our approach also to SRE, and I want to get your
Starting point is 00:43:53 point on it. So the way he put it, he said, the way he sees it, SRE is: you have an existing system, and then you bring SREs in that are trying to figure out how can we make an existing system better, more reliable, more stable by running all these experiments. Where what he is explaining, the way we try to build new software from scratch, is how can we, from the get-go, architect software so that we always have this thing in mind: what can potentially go wrong and how can we fix this through automation or through the right architecture, through the right failover mechanism? How can we build it right from the get-go to be resilient? So I was wondering what you see.
Starting point is 00:43:53 Is this a good explanation of, let's say, quote-unquote SRE or traditional SRE versus to where we need to get to when we're starting up building new software? Yeah. I mean, I think he's spot on. So it's interesting too. I have a lot of experience in that. So yeah, when you're thinking about building new software from scratch, like, yeah, it's a different world. And also if you have the skills and ability
Starting point is 00:44:16 to focus on automation, awesome. Like a lot of companies don't have that, but I've worked at companies where we did, you know, at Dropbox, we were looking after tens of thousands of database servers with a small team of four people on call, including me. So obviously there's like a ton of automation because those servers are going to fail in so many different ways every single day. But the thing there that I realized was, you know, automation fails. So actually we had our automation, which was our primary automation.
Starting point is 00:44:43 Then we had secondary automation, which is a backup for when your primary automation fails. And then we had a third level of automation actually for like when the secondary fails. So it's like automation on automation on automation. Like, you know, and the thing is, if that third one fails, then you go, okay, now we have to manually do it and actually manually doing the work of a database promotion properly, like by an expert who's been doing it for like, you know, 10 plus years takes hours. So it's way faster to do it in an automated way, especially if you have like thousands and thousands of servers. And obviously like, so we had a fleet where it was like a primary and two secondaries and then you have backups and clones and restores happening all the time. But then,
Starting point is 00:45:24 yeah, what if you need to do a promotion? Like, yeah, automation is way faster. It goes super fast and it's awesome. You can make sure it checks all of your automated checklist for data verification and everything there. But, yeah, if that automation fails, you need to make sure that you know that it failed. And then if that second level automation fails,
Starting point is 00:45:43 like the one that kicks in, then you need to be able to handle that too so you know that's what i would say with a heavily automated world like automation fails too and that's actually what my a lot of my chaos engineering work at dropbox was focused on making automation more resilient too like not just um yeah yeah because it's like lots of automation. Yeah. Yeah. I like that. So also bring chaos into the automation because what if that process is impacted and that's a similar tool. Yeah. Perfect.
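A minimal sketch of the layered-automation idea described here: try the primary automation, fall back to the secondary, and page a human if both fail. The promotion functions are stand-ins, and the first one deliberately fails so the escalation path is visible when you run it.

```python
"""Minimal sketch of layered automation with escalation: primary automation,
then secondary automation, then a page to the on-call. The remediation
functions below are stand-ins for real database promotion logic."""

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("promotion")


def primary_promote(replica):
    raise RuntimeError("primary automation failed")   # simulate a failure


def secondary_promote(replica):
    log.info("secondary automation promoted %s", replica)
    return True


def page_oncall(reason):
    log.error("PAGING ON-CALL: %s", reason)


def promote_with_fallback(replica):
    for name, step in (("primary", primary_promote), ("secondary", secondary_promote)):
        try:
            if step(replica):
                log.info("%s automation succeeded for %s", name, replica)
                return True
        except Exception as exc:
            # The key point: know, loudly, that the automation itself failed.
            log.warning("%s automation failed for %s: %s", name, replica, exc)
    page_oncall(f"all automation failed promoting {replica}; manual promotion needed")
    return False


if __name__ == "__main__":
    promote_with_fallback("db-replica-17")
```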
Starting point is 00:46:13 Yeah. Pretty cool. Yeah. Brian, is there anything else that we should cover before we kind of wrap it up? Because also considering. Yeah. Time.
Starting point is 00:46:23 Time. Yeah. No, I think we covered quite a, quite a lot and some really great stuff so uh unless tammy was there anything um that you wanted to make sure you got in that we didn't cover um just one thing is um we we also have a chaos engineering community slack so anybody can join that there's over 4 000 members and i've like met so many awesome people through this community. I would totally recommend joining. You just go to gremlin.com slash slack.
Starting point is 00:46:50 And there's engineers from, you know, colleagues that I work with at Dropbox and folks from Facebook, Google, Twitter, you know, everyone who's interested in chaos engineering. Yeah. Target, Under Armour, Walmart, like tons of places, finance companies. And yeah, I think that's a cool thing to do. Like get involved, come along, join the community, you know, join the learning channel. If you want to run your first ever chaos engineering attack, everyone will help you figure out how to do it. Like I recommend going for a networking one. Like one of my faves is packet loss. So yeah, just join. And thank you so much, Brian and Andy, for having me too.
Starting point is 00:47:26 Thank you. Thank you. And, you know, we will put a lot of these links that you just mentioned also to the proceedings of this podcast so people can also see it when they browse through it and then click on it to get all the important pieces. That's awesome. Awesome.
Starting point is 00:47:42 That's great. Thanks. I had so much fun chatting with you both. Yeah, thank you for all the wisdom. As I said, I think we can write multiple books here now. If you want to, that'd be great. I'd love that.
Starting point is 00:47:54 You'll have to give Tammy a percentage of the proceeds there, though. Yeah, it's okay. She's like, no, I'm going to take it from you. All right. Well, thank you very much. We'll talk to you all soon. And Tammy, thanks for coming on.
Starting point is 00:48:09 We'd love to have you on. If you ever have any really great stories you want to share with us, we'd love to have you back on. So hopefully you'll hear from me in the future. Awesome. Thank you. Thank you. Thank you.
