Software at Scale - Software at Scale 28 - Tammy Butow: Principal SRE, Gremlin

Episode Date: July 27, 2021

Tammy Butow is a Principal SRE at Gremlin, an enterprise Chaos Engineering platform that makes it easy to build more reliable applications in order to prevent outages, innovate faster, and earn customer trust. She’s also the co-founder of Girl Geek Academy, an organization to encourage women to learn technology skills. She previously held IC and management roles in SRE at Dropbox and DigitalOcean.

Apple Podcasts | Spotify | Google Podcasts

In this episode, we talk about reliability engineering and Chaos Engineering. We talk about the growing trend of outages across the internet and their underlying reasons. We explore common themes in outages, like marketing events and lack of budgets/planning, the impact of such outages on businesses like online retailers, and how tools and methodologies from Chaos Engineering and SRE can help.

Highlights

01:00 - Starting as the seventh employee at Gremlin
04:00 - An analysis of recent outages and their root causes
09:00 - A mindset shift on software reliability
14:00 - If you’re suddenly in charge of the reliability of thousands of MySQL databases, what do you do? How do you measure your own success?
25:00 - Why is it important to know exactly how many nodes your service requires to run reliably?
30:00 - What attracts customers to Chaos Engineering? Do prospects get concerned when they hear “chaos” or “failure as a service”?
43:00 - Regression testing failure in CI/CD
51:00 - Trends of interest in Chaos Engineering over time

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

Transcript
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Hey, welcome to another episode of the Software at Scale podcast. Joining me today is Tammy Butow, who is a Principal SRE at Gremlin. Thank you for joining me. Thanks so much for having me. It's great to be here. Yeah, so maybe you could tell listeners a little bit about what Gremlin does. And I would love to hear your story on why you started as the seventh employee at Gremlin. Yeah, sure. So at Gremlin, our mission is to build a more reliable internet
Starting point is 00:00:47 So something that you probably would notice, like everyone listening, is that over the last few years there's been a lot more outages reported in the news and some of them are absolutely huge, like massive outages. Um, for example, Robinhood, they actually had like a sequence, a series of outages, and then they ended up getting a $70 million fine from the regulatory board here in the US. And so that's a really interesting issue that's happening, I think, more and more. And my background, like I started working at the National Australia Bank many years ago now, about 12 years ago. And I was working on critical systems, mortgage broking, foreign exchange, internet banking. And whenever there would be an outage, then you definitely could get a fine from... In Australia, it's called APRA, the Australian Prudential Regulation Authority, which regulates banks and financial companies. And I remember getting my first
Starting point is 00:01:46 fine and it was this really big deal. And it was really bad. The CTO came to my desk and told me, like, oh, you're the one that's responsible for this fine. You better make sure we don't get another one. And I was just straight out of, you know, university, out of college. I'd studied computer science and this was my first job. And I'd only been there for a few weeks and had this big outage. And so I really understood that it's not just that we build these systems and we make them as awesome as we can and hopefully people love them. It's that they really impact people's lives day to day, their livelihood. When you're working at a bank, if your banking systems don't work like internet banking, ATMs, anything like that, then people maybe can't eat that night because they can't get money out to feed their family.
Starting point is 00:02:30 And I would often see people writing in onto Twitter saying exactly that. So it made me really understand that it's important what we do as engineers. And at Gremlin, we built a platform which actually allows you to do chaos engineering or what folks call value injection. And there's a number of different out-of-the-box attack types. For example, latency, packet loss. One of my favorites is black hole where you can make something unavailable like an internal or external dependency.
Starting point is 00:02:59 And so a lot of our customers are in the finance space, healthcare space, retail. Some of our customers include Target, JPMC, like really big names that you've heard of because they care too about making sure that everything is always up and running for their customers and working as expected. So that's what Gremlin does. And the reason I decided to join Gremlin almost four years ago now really early um was i was working at dropbox at the time i was a site reliability engineering manager and i was doing a lot of chaos engineering as part of my role leading the databases team and also magic pocket block storage and as part of that work we realized that we could reduce the number of critical high severity incidents by injecting failure to learn more about our systems.
Starting point is 00:03:46 And some interesting examples there were, we did a lot of work on something called SQL proxy because we wanted to understand like how it failed in different ways. And if we could actually make it more reliable, and we did, we were able to figure that out. But a lot of it is really having this like very scientific mindset of,
Starting point is 00:04:03 this is my hypothesis. This is what I think is going to happen when I inject this failure and then you actually do it measure the results and then make some fixes afterwards and then test again and just so many different use cases came out of that work which we can dig into as well but that's really what got me excited in the first place is like you know let's actually just make the internet a lot better more and more people are getting online we need to make sure that it actually works when everyone arrives yeah i think that's super interesting and just digging into the first point we just spoke about you know outages are just increasing why do you think that is it just that there's more services that we depend on are we just building software more unreliably like why do you think the trend is towards things getting worse yeah i recently did some analysis of outages so there's a few github repos where you can see publicly
Starting point is 00:04:56 reported outages and um when you look at different types of outages there's definitely different results so as a whole if you look at like out as a whole, what is the most common reason for outages? A lot of the time, it'll be some type of configuration-related change or issue caused an outage. And that's really complex to solve because config can change in so many different ways. And it could be like something just to do with spinning up machines. It could be a specific type of configuration for a specific type of software, like for example, Kafka or MySQL. Could be some type of managed service that you're using that you haven't configured correctly. So that's a really, really hard one to solve. And that's
Starting point is 00:05:40 the majority of the outages there. But then if you dig into specific technologies like Kubernetes and you look at why do those outages commonly occur, it's actually really different because each of the cloud providers has their own managed Kubernetes service or you can roll your own Kubernetes. But then digging into it is really interesting. Like a lot of outages are actually related to CPU, but it is maybe know maybe not what
Starting point is 00:06:06 you would expect it's like cpu spiking cpu throttling um downstream impact caused by that as well or also configuration not being set up correctly for auto scaling based on cpu so it's like really um complicated i think actually to understand all of these fine fine you know, very detailed elements of your systems when you're creating them and spinning them up. Because just using the cloud, it's not so simple. It's different. And I think like the thing is, back in the day, when we would build our own software, and you did everything in house, you knew what all of those little details were, because you did it all yourself and you like had
Starting point is 00:06:45 to go through it all and you understood it and you memorized the code and you like knew what the configuration was it's like so different now you just pick something up off the shelf and you try and make it work with what you already have which a lot of it is really more like lego or plumbing things making them work together and like you know plumbing has leaks. That's actually what happens, right? And if you're taking things from different places and trying to get them to work, it's actually really, really hard. And I think the role of an engineer that's focused on reliability, preventing data loss, improving uptime, it's really different to what it was you know 10 12 years ago for sure like i think the other thing too is like the speed of change um companies are rolling out more products they're trying out new technologies
Starting point is 00:07:32 faster and faster to be competitive i remember when i first started out we had like you know two products and we were like yeah we'll build another product but we're going to give ourselves two to four years now it's like you hear folks, I want to ship a new product in six months or three months and just like really, really fast. That means you need to learn a lot faster and that can introduce a lot of failure much faster. So it's an exciting time. But to me, that totally makes sense
Starting point is 00:07:56 as to why we do have more outages and why there's reliability issues. Yeah. Yeah, that makes sense. I still remember like when the covid vaccine tracker thing came out and we had to book our appointments on like the california system that site used to go down all the time and makes me wonder as an engineer like this doesn't seem so hard but i can totally imagine like there was pressure to ship this out super quickly and there's like
Starting point is 00:08:22 millions of people trying to log in at the same time. Yeah, exactly. And then there's also like the budget constraints around that, right? Because say if you're trying to build something that works reliably, but also doesn't blow your budget and your costs, because often like you can't just throw more servers at it, because that's not a cost effective way to handle that. So then you have to think about how can I do this with code or with adding like, you know, some sort of queuing system to there. But that's the tough thing, right? That also takes time to be able to figure that all out and make it work technically. So I think like, whenever I see big outages like that, yeah, I would do exactly what you did think through, I wonder how long they actually had to build this? Like they have two weeks did they have a month like how long was it for real some okay so now that makes sense in terms of outages so how would you think about
Starting point is 00:09:12 building software from day one maybe when that's actually reliable or can withstand like you know like a load spike or like a cpu spike or some throttling like what is like a mindset shift that you've seen that's been like super effective for people yeah i think um over the last few years like you know i started out my career working on on-prem systems and then i moved to working on the cloud like using aws azure gcp um and i think like actually those cloud providers have created really great tools to help folks that are using the cloud like like AWS's well-architected framework. I think that's a really good tool to be able to look at and, you know, try and utilize whenever you're building something new. That's got a lot of great tips.
Starting point is 00:09:57 The other thing, too, though, that I think about is, you know, I like Google's work. It's interesting. Like I think each of the cloud providers has done really interesting things, like AWS as well, architecture framework. With Google, I like their focus on SLOs and SLIs and error budgets, but planning those out from the beginning, not once you've already shipped everything and it's already running in production. But if you can think through your SLIs and your SLOs
Starting point is 00:10:21 during the design phase, when you're actually planning out your new system. I think that's a great thing to do. And obviously like that takes more time and it takes maybe someone with, that has an interest and an understanding of reliability and how to meet those certain SLOs that you set. But it also allows you to have this really great conversation,
Starting point is 00:10:40 which is to me like, what are the top five to 10 critical pieces of this system? Like if you're building a whole new product, what are the top five to 10 critical pieces of this system? Like if you're building a whole new product, what are the top five most critical pieces that you need to have SLOs and SLIs for? Because, you know, maybe you don't want to create them for every single thing that you build, right? It could be some tiny little piece of a system that's not that critical. If it's not available, it's okay. You can go through something else, or there's like a failover mode that works well so i would focus more on that and that enables you to prioritize and i think like as i've gone through my career obviously like that's something that you become better and better
Starting point is 00:11:15 at all the time and it's a great skill as an engineer to learn how to prioritize what you're doing because you have limited time in the day so then then it's like, what are the top things that I want to get done that are going to help me prevent failure down the road? And then so during that design phase, that's what I would say is like, think about those two things. And then once you've moved on from that, I like to think about like, how can I codify this? That's like a really interesting part.
Starting point is 00:11:41 Sometimes folks call it like shifting left and figuring out how you can do this work more in your like cicd pipelines working with your build team to do more proactive testing but i also really like this approach um that jpmc has been doing and it's all about like using chaos engineering the approach of chaos engineering but to create patterns to inject failure proactively every time you're going to like ship a new product or a new feature or service to production they'll run like a gauntlet or a suite of these different failure injection experiments and then they make sure that it passes those and i think that's awesome like it's
Starting point is 00:12:22 interesting right it sounds like you know maybe sense. Of course we should have been doing that years ago, but like, there's just a lot of stuff that people haven't been doing that, you know, maybe it was hard to build those different types of things into our system, or we weren't sure what the pattern should look like. We, you know, but also we didn't have a lot of stuff. We didn't have auto scaling from AWS. That's only been around for a few years now. So it's also creating these new types of patterns or like gauntlets of experiments that you can run in an automated way that's codified. And then teams can feel confident when they ship their code that it's going to work when it gets to production because they've already run this series of tests. And I just love that so much.
Starting point is 00:13:02 It's like a really empowering, scalable way to give folks the tools to feel confident in what they're building rather than it getting to production. And then everyone's like, hey, your software doesn't work. Like, here's why it doesn't work. Like, that's also like a, it's a hard thing, right? If you spent all this time building something and you're really excited about it, and then it's got some like huge major flaws and maybe you have to do a big roll back. And that's very difficult if you've some like huge major flaws and maybe you have to do a big roll back. And that's very difficult if you've done like a press release and a whole big launch around something. So yeah, I really like that idea because it's going to help engineers feel more
Starting point is 00:13:36 confident in what they're building. And it enables us to just do a lot more proactive testing. We can meet the needs of our other teams that are relying on us like the product teams or the marketing teams the business teams the ceo of your company like will be a lot happier so i i'm very excited about that cool and you were talking about some of your previous roles and you mentioned that you were a site reliability engineering manager and that's not like a job title you hear too often often maybe you can walk us through you know like what does that role really mean and then also we can if you can talk publicly like about you know what were you doing as like the manager at of a databases team at a company
Starting point is 00:14:15 that you know manages a thousand database nodes like how do you make sure to keep your backup safe and like make sure that you know your availability your availability stays up. That'll be pretty interesting. Yeah, for sure. So yeah, it was really interesting being in that role as a site reliability engineering manager. I think, you know, the reason I was really excited to take that role too is because I just worked on a lot of really critical systems and I really believe that reliability is like core and it's so important and you know to me like reliability is feature number one because if your product doesn't work like it's not up
Starting point is 00:14:51 and running then no one can use it so it doesn't matter what features you've built it's just like not even available to people and I saw a lot of those issues where we didn't focus enough on reliability and then you know maybe your very senior sales executives would be doing a demo of your product. And it just wasn't even up for them to be able to show these customers that are like VIP customers. And that's a really bad situation. And so I like the idea of being able to just focus on reliability. I thought that was really awesome. And so when Dropbox reached out about that role, I thought that was great. And I've always loved databases because I just love data.
Starting point is 00:15:28 I think that it's just really cool, actually. Like I'm definitely a data nerd. I like the idea of being able to store all of our data and making it available on the internet, like the data that we choose to share with others. The idea of you can basically read any book now because it's on the internet. I think that's so cool.
Starting point is 00:15:46 And like that you can watch movies, you know, coming from Australia, it was hard to get books sometimes because like when I was young, like, what am I going to do? Buy it from America and it would take like months to get there. It's just like really like coming from a small island, it's like very far away and very remote from everywhere else. The internet is like a lifeline that keeps you connected to knowledge and to other people. And I was a big fan of Dropbox.
Starting point is 00:16:10 I've been a Dropbox customer since I was in university and I used to use it with the other folks that were studying computer science when we were doing our projects work. We would share things like in our Dropbox folder. And so I thought that was pretty amazing. And then I started to use it too at the National Australia Bank. We were also using Dropbox there.
Starting point is 00:16:31 And all my friends and family use it as well. And I know like a lot of people use it for really interesting use cases. And that's always what matters to me. It was like, I know that like really famous bands make their music on Dropbox. Like that's how they share the files around. I know that like lawyers use it when they're doing huge court cases and so being able to be like okay like when that huge court case is going to trial like dropbox
Starting point is 00:16:56 will be up and running and they'll be able to access all of their data because my team's helping to make sure that that happens like that made me very motivated as an individual. And also then meeting the team, like it's like a superstar team of folks who come from a lot of folks from YouTube, from Pocona, Booking.com, like a lot of amazing MySQL experts who had just been working with databases
Starting point is 00:17:22 for years and years. And they were like amazing at all sorts of things like performance tuning, Linux, the Linux kernel, being able to do backups, restores, like building automated systems to test restores of backups, which was just happening all the time. Building like web UI interfaces to be able to manage backups and see which ones were working
Starting point is 00:17:45 which ones failed like this is all stuff that i've just never seen anywhere else when i'd gone to visit companies or talk to friends you ever seen anything like this they were like no that's so cool and so i just love this idea of like basically the dropbox databases team was building startups at dropbox but specifically for like dev tools for databases, which was like awesome. I'm like, wow, this is amazing. And so there's still a lot of things that, you know, the Dropbox SREs and database engineers, block storage engineers, a lot of things that they built that do not exist anywhere else, except maybe like at companies like Google or something like that. But they're not things that you can just use, you know, they're not products. And so it was really cool to be able to see that and see what everyone had
Starting point is 00:18:29 built. And the reason too, why they had to build that was super small team. It's like, you know, I think when I joined, we had 200 million customers and there was only four database engineers. And then when I left, we had 500 million customers and like five database engineers. So we just had to do a lot of automation and a lot of, you know, large scale, like looking after systems, not with adding extra bodies, but by trying to be smart and intelligent and building systems. And that was like also very motivating for me. I love that as well.
Starting point is 00:19:04 Yes. And what was the role once you ended when once you started working there like how do you measure that you're successful like is it just like oh if the site is up like three nines I'm doing my job or like how do you go deeper than that yeah yeah that's a great question so I think these days everyone measures things like pretty differently um but back then when I started doing that, it was like, you know, maybe seven years ago now or something like that, six years ago now. When I very first came onto the job, I mean, even during the interviews, I asked that question. I think that's a great question to ask if you're an SRE during the interview.
Starting point is 00:19:41 How will my success be measured? Because if you want to like have an amazing career career a great journey if you're on a mission to do really great work then it's a good thing to ask and so like i said that like hey like what are the big problems that you would want me to help you solve like that's like an interview question i always ask and um they were like well we actually have like pretty high amount of on-call pages that are happening. And we're not sure if they're like actual problems or if it's like noisy pages, if there's like automation that we could add in, if it's toil. We want to be able to like dig into that and then reduce it. And I was like, yeah, sure.
Starting point is 00:20:20 Like that's a really great first project because I'm going to learn so much from doing that. Right. When you get assigned that project, you're like, yeah, I'm going to learn all about the systems, all about the different failure modes. I'm going to try and actually decrease the number of incidents that are happening so that we aren't getting paged at 2am in the morning anymore. Because who wants that? That's annoying. And it's also really bad for customers too and for the business. So I started on the team and I asked folks like, yeah, do you have any idea why it's also really bad for customers too and for the business. So I started on the team and I asked folks like, yeah, like, you know, do you have any idea why it's so high?
Starting point is 00:20:49 Do you think we'd be able to reduce it? And they were just like pretty, I'd say like, you know, it was tough at that time because maybe they were getting like a lot of pages through the night and it's hard to like step out of that sort of like mindset when you're constantly being bombarded with pages and you're just like getting hit with them all the time it's hard to like step back and think like how can i stop this from happening because you're just trying to keep everything up and running and they're like an amazing team so they built all these great tools and all this great software
Starting point is 00:21:18 and that's an awesome thing about having a new team member join the team right like they're able to come in and look at it just from a different approach, different angle, like fresh set of eyes. And I always love when you add a new team member to your team and they do this. And so, yeah, just ask a few questions and then started to, I just pulled all of the data because I love data, like totally a data nerd. I pulled all of the data for, I think, six months of incidents that had happened, like every single
Starting point is 00:21:45 page, and then analyzed all those pages by like just crunching that data and being able to pull out patterns and trends. And then that gave me like more interesting questions that I could ask the team, you know? So I came back that next week, we had like a weekly encore session on Wednesday mornings, like an encore handover. And I was like, i noticed that you know 80 of our pages are related to this one page like and this one specific database system like why like and i was like oh that's really interesting like maybe we can prevent that from happening like that's interesting let's dig into that we can do a project around that and so that got everyone really excited about it. It was the file journal. So then we worked with that other team, which owned the file journal, to be able to collaborate with them to do some interesting failure injection experiments,
Starting point is 00:22:35 to understand how and why this system failed with the database as the backend for all of the data there. And we had a really good understanding of like what we needed to fix and also what we needed to prevent going forwards. And we could create some good like patterns for how we work together as a team. And that was awesome. And also much better reporting, much better understanding of that.
Starting point is 00:22:56 So that was like one of the key things that we did. And fixing those issues, injecting that failure, doing that chaos engineering work, that ended up getting an incident reduction and then we did more like i mentioned um with another system there's another system called sql proxy and that one also was causing a lot of issues but the code had been around been around for a long time no one really understood it well so injecting failure is just a way quicker
Starting point is 00:23:20 and easier way to understand it and by doing things like process killing shutting down nodes understanding like how many do you need to have like how many proxies running what is the sweet spot do you have too little do you have too many like what does it need to be and um just being like a real scientist which i like this approach now too with chaos engineering that it's you know we study computer science so it's like let's bring the science into it and experiment more and learn more and like i don't know i feel like that's like why it's so exciting to be an engineer the fact that you do get to experiment and dig into things and analyze the data and then be like i think if i do this this and this then i'll be able to make an improvement of you know 20 or 100 improvement 10 improvement, like whatever it is. And I'm also
Starting point is 00:24:08 a bit of a gambling woman, I'd say. Like I do like to, I like to play pool. If anyone ever wants to play pool, I'm always down. But I always like to say like, I'm going to hit my ball from here to that, like, you know, to the right corner. And then I'm going to hit that other ball. And then we're going to go into the pocket or like, I'm going to hit that other ball and then we're going to go into the pocket or like I'm going to jump that ball and then that other ball is going to hit there, that corner, and then we're going to go into the pocket. So I don't know. I just kind of think it's more fun to like call out what you're going to do. And then it's also more impressive. Like it's not like it was a fluke, right? Because you said it, you're like, I'm going to do this, this, this, and then you do it and you have done it. Like you've demonstrated that you could do it.
Starting point is 00:24:46 And I definitely learned that playing pool for like many, many hours when I was in university, which is pretty funny. But it's a great thing, great skill. And I think it helps you as an engineer as well. And so that's a thing too I think of when you're doing this work, it's really important to communicate what you're doing. And I think often as engineers we like don't focus on like the communication of what we're doing but it's like
Starting point is 00:25:09 tell everyone what you're gonna do then do it then tell everyone what you did and what the improvement was like it's like it's really basic but that's my framework for it as well cool and i have a lot of questions about chaos engineering, but first I need you to elaborate on one thing, which I think is sound super basic, but I think is important for to just understand, like, why is it knowledge to somebody else and they've left the company and now you have the service and you have no idea whether you're over provision under provision like why is it important to actually know that yeah oh that's a great question i love that and i i think like you know so that specifically is like so you can serve the traffic that you have like that's like the basic answer is um you know and it's difficult like if you have fluctuating traffic on different days, different hours of the day, like sometimes you can have massive traffic spikes where you need
Starting point is 00:26:10 more nodes. And that could be like, you know, maybe Monday morning for some types of services and then say like Sundays are really nothing much. So you could have way less nodes in your fleet for a specific service because you just don't have as much traffic that requires those nodes. So that's like basically what it is. And i like to think of it as like a fleet of nodes and make and but knowing the right amount is difficult because like i said it can fluctuate um but the thing too is yeah if you join a team and you you just are not going to know like is this the right amount is this too little is this too high? Until you dig into the data and understand it and look at the patterns that have been happening. And then also, the other important thing there to learn too is those patterns can change at an
Starting point is 00:26:54 instant. Say, for example, what happens if your marketing team does a huge campaign? This happened while we're at Dropbox, where Dropbox did a massive campaign for Dropbox business. That was all over the news. And that's huge, right? So that's going to change all of your patterns. The other campaign there was one, which was an integration with Samsung mobile phones. So every time someone's channeling their mobile phone, it would call out to Dropbox. So that's also a lot of traffic, new traffic that you'll get.
Starting point is 00:27:23 And then that made me realize as an SRE, it's super important to be actually talking and watching and seeing like what your marketing team is doing actually, which like I never, ever thought of doing that in the early days as an engineer, because I feel like being an engineer is so far away from what marketing is. But actually like, if you know, okay,
Starting point is 00:27:44 we're going to do this huge like marketing campaign, there's going to know, okay, we're going to do this huge, like, marketing campaign. There's going to be billboards. There's going to be TV ads. There's going to be, like, a big push on social media. We estimate we're going to get, you know, a million, 100 million, whatever it is, new users. Then you can know, like, to prepare for that.
Starting point is 00:28:01 And then also you want to understand, like, what are the usage patterns going to be? Like, what will those people be doing? Like, how will they be using our API, for example? Like, what are going to be the common calls that they'll be making? Are we trying to get users to do something totally different than they were doing before? So like, those are all the questions that I ask now. And that's like, definitely not something that I knew coming out of university at all. You know, this idea of trying to predict what different things would be like once a new, completely new product didn't exist.
Starting point is 00:28:30 And also, I just don't think we have even meetings like that where marketing and engineering sit down together and go, okay, like how do we prepare for this to make sure it's reliable, which we should do. So I've definitely been encouraging folks to do that. It's like the important to have reliability in the design phase of a new product but also like when you do launch because you want to know like is marketing putting like millions of dollars behind this launch because that's going to change things if it's a soft launch then that's totally different as well you know it doesn't matter as much but yeah yeah yeah i've heard like anecdotally that, you know, Uber Eats just provisions a lot of servers for Super Bowl Day.
Starting point is 00:29:07 Yeah. And I've also heard like Prime Day is like a six month event at Amazon to make sure that everything is correct and ready for prime time, I guess. Yeah, that's exactly right. Like, you know, when some of those peak days are like I've also heard with with Uber, obviously New Year's Eve is a huge day for Uber. So they make sure to have enough nodes, enough machines provisioned then as well. So there's like some things you can kind of guess. Like I think I need to be ready for this,
Starting point is 00:29:36 but better to even just have like, I don't think this ever existed yet, but me and you, we can like riff on it and come up with ideas. It's like, what about like a reliability calendar or something? we just know like these are the points that it matters for our business and like in your first week when you join a new company you could say hey like what are our most important days of the year like that happen when we get loads of traffic like i want to be ready for those days and make sure that we always like crush it and do an amazing job like i
Starting point is 00:30:02 think that's a great question to ask too. Yeah. Yeah. We don't want an embarrassing moment on like the day we spent millions of dollars. Exactly. Like the ball. Yeah. So then let me ask you a little bit about chaos engineering.
Starting point is 00:30:16 We've spoken about the problem, right? There's outages, there's like on-call toil, which is like a really important thing for SREs to solve. And I think these are like approachable problems. Like people are generally aware, like these are real problems and we need to solve them. How does chaos engineering help, I guess, is the first piece. But I think what I'm interested in is
Starting point is 00:30:37 how do you productize a solution? A lot of these solutions are, you know, something that I would think about the company has to implement internally. What was the initial idea, if you can talk about you know grumblin and like how do you sell a solution to customers is something that i'm super curious about yeah for sure so i guess you know to think about what is chaos engineering and how does it help you for example reduce outages um one of the things is you can think about this like i i personally choose chaos engineering and i have for the past you know 12 years as my favorite way to um be able
Starting point is 00:31:13 to make sure that systems are more reliable because there's a number of different things that you can do but i feel like to me i've just i've picked chaos engineering because i feel like it's the thing that gives me the biggest reward in the shortest amount of time. And it gives me the best long-lasting understanding of my systems that I'm working on and the best knowledge of the systems and also the customers, the product, just everything. And I'm always looking for what's the most efficient but also impactful way to learn about something. And so the reason that I say that is I've done lots of different things. So say, for example, you join a new team, you have this service that you pick up and you're told,
Starting point is 00:31:53 hey, this service is not reliable. I'd like you to try and improve it. It currently has, you know, 500 pages a week. Everyone's too scared to make code changes because it's really like old piece of software. We can't deprecate it yet because we're not sure actually how it even works. We're not sure what would happen if we did deprecate it.
Starting point is 00:32:14 We're not sure like what it even connects to, what the dependencies are, like upstream, downstream, what cascading failures we might have. Like you just like think through all of those things of like what this system, the damage could be if something went wrong. Say if you did a code change and then suddenly it actually
Starting point is 00:32:30 made things way worse. That would totally happen. And it's kind of like also from building things and getting a bit burned. You realize you have to be a bit more careful. So it sounds interesting, but actually chaos engineering is a more careful way to understand systems and how they fail
Starting point is 00:32:48 than like just making code change to see like what happens now, like pushing code into production. That's like, to me, too dangerous of an approach. And it feels like you're going in blind, like just doing a code change. And so the idea there is like, okay, if I'm to think like a scientist,
Starting point is 00:33:03 I'm not just going to randomly change stuff. I want to do an experiment. And so if I want to understand this system, I want to understand like, how does this system impact other systems within my architecture? So I can do little tiny experiments that allow me to inject failure. So one really good example, like say if you've got this, maybe it's say an ad service within an e-commerce store. That could be our system. And we've got a lot of problems with this ad service,
Starting point is 00:33:30 but we want it to work. But let's just fail it. But specifically, we could do something called a black hole attack, which means you can make this service unavailable for 60 seconds. You don't have to do anything else.
Starting point is 00:33:42 That's like a Gremlin specific type of attack. And we're just going to make it unavailable for 60 seconds and see what happens to other services around it. And you don't even have to do this in production, right? You can do this in dev. You can do this in your pre-prod environments. And you can see when the ad service doesn't work, is everything else still functional? Can I make um checkout items can i purchase items can i add things to my cart can i look at the catalog all that sort of stuff am i getting any other pages from any other systems that are trying to call the ad service and it's not there and then that's
Starting point is 00:34:15 causing problems for those services like those are the things that i would do and then from there you can go okay like you either learn that this is like really badly like hard coded and there's a lot of different issues, but you actually would know like all the systems that has issues with like what hard coded dependencies are there on the ad service? What do you need to then prioritize fixing? Or you're like, boom, in 60 seconds, I learned that this service does not have any issues if we just take it away. That's awesome. Like how fast is that? Like rather than if you just imagine any other way to learn awesome. How fast is that? Rather than... If you just imagine any other way to learn, how else can you learn something in 60 seconds?
Starting point is 00:34:54 And so that's why I love to nerd out about it. It's just such a fast way to be able to learn. And it wasn't always like that. To be able to do a black hole attack in 60 seconds, that's something that we built at Gremlin and built into the product. And it works for everybody. Like everyone can use that on Windows, on Linux. We even have like a serverless feature for Alfie application level failure injection where you're able to do something like that as well. If you write in Java, that's a beta product right now and you just integrate it with your code. But like we created
Starting point is 00:35:25 that because i remember in the past when i would try and do activities like that a failover activity is what you would call it right it was just wow what a nightmare doing them in the national australia bank 12 years ago it was something that we had to plan for for probably three months to be able to do an experiment like that. We would have to book out the weekend. We'd have to go to a separate office. We'd have to make sure that all the other teams that might get paged that might see an issue were in the room at the same time. We'd need to all be sitting there together live.
Starting point is 00:35:56 Like we didn't have tools like Slack. Like we couldn't just communicate with each other. We didn't have like page duty where we could quickly pull up the reports of the pages or, you know, software like Datadog and new relic for really awesome monitoring it was like a lot of logging and you know splunk didn't exist back then either so it's just like now that we have all these tools we can learn amazing things in 60 seconds um if we just inject failure and learn quickly and then you just turn it off and everything's good as gold again so it like goes back to like your state that you were at previously so yeah that's why i really like it
Starting point is 00:36:29 do you feel like customers get scared when they hear about the concept of like chaos engineering the first time and how do you like help them go over that like initial barrier yeah i think a lot of the people do get scared mostly because they think that they have to do chaos engineering in production first which is like true. Obviously you don't have to start in production. It's very powerful to start in other like environments, definitely. So I would say that is really like, as soon as everyone hears that, they're like, oh, that makes sense. I'm like, yeah, it's a journey. Like I never started in production. I started when doing it at dropbox in the um staging environments in like our our dev environments for databases because we had these like staging databases that
Starting point is 00:37:13 we could do all of our experimental work on that were in a totally different safe environment so you could just test things and try things out and it was much better like you don't have to be worried and then once you're ready then we could do it in production but i always say that it's a journey to get to production it could take years and that's totally okay sometimes it takes folks two to three years to be able to get to that point and maybe some folks might never get to production and that's also okay like you can learn so much from doing it in your environments like before production um so that's probably the main reason they get worried. And then I think also sometimes the name just scares them
Starting point is 00:37:48 like chaos engineering because it sounds like very chaotic. But I think like, you know, for me, I practice chaos engineering as a reliability engineer. I always bring it back to that. Like our goal is to make systems more reliable. We're going to actually create chaos in the system, but it's going to actually help us uncover issues and make our system reliable. So that's really what it's all about.
Starting point is 00:38:11 And it's like controlled chaos. I like to say that too. You know, we're not just going to, I don't like the idea of randomly injecting failure. I love this like experimental approach. Let's be a scientist and let's learn that way yeah yeah that makes a lot of sense and it's like super similar i think to the idea of a chaos monkey that like netflix released like a few years ago is one of the founders of a gremlin somebody who wrote chaos monkey a long time back yeah so chaos monkey that was released like I think in 2010 or 2011, something like that. So it's like, wow, 10 years ago now. And Colton, our CEO, like he worked at Netflix.
Starting point is 00:38:52 He created something which was very similar to actually what we have called Alfie. He created something called Fish, which is the failure injection framework for Netflix. So, yeah, he worked on that. He also built something called Gremlin at AWS, which is a lot like Gremlin that you can use. And he did that before working at Netflix. So they were also doing chaos engineering there, but they called it failure injection. And so yeah,
Starting point is 00:39:16 he's been doing this work for such a long time for, you know, maybe 20 years, something like that, been doing chaos engineering, injection and yeah um what is the product really uh where i'm i'm curious about you know where are some like complexities of where the product comes in so like the mental product the mental model of my of gremlin in my head right now is just um do you have a system which lets you run these tests against, you know, like maybe like a set of services or like a set of instances and lets you decide, you know, run these like chaos tests.
Starting point is 00:39:53 But then what are all of the knobs and stuff that you have to tune and like how do those help the customer is something I'm pretty interested in. Yeah, totally. Yeah, so it's definitely changed a lot over the years. Gremlin, what it looks like right now, if you're to log in and there is a free version, if you go to gremlin.com slash buttons, you can try it out. Buttons is my nickname. So that's what that is. But so if you go there, you'll actually be able to see there's a few key features. So one of them, ones that we just released is service discovery
Starting point is 00:40:25 so once you have our agent running um you just install our agent you know the daemon you have it running on your machines wherever you'd like it to run or you can run it as a helm chart if you're using kubernetes open shift something like that and so then what you can do is you can automatically see all of your services within that you can see how many nodes does your service live on, like how many hosts. You can see like how many pods does it have. That allows you to understand like what you would be attacking. Say, for example, if you go, I want to attack just one of the three hosts that this service is on, or I want to attack 50%, something like that, we'll actually be able to then pull that data for you
Starting point is 00:41:05 and allow you to then inject the failure. And there's also a visualization tool. So it actually shows you a map of like all of your different nodes and your pods if you're using Kubernetes, for example, and where they fit, and it will highlight them when you're creating your experiment. And then what we want to do there is actually think through, okay, I understand like the specific service I pick,
Starting point is 00:41:27 like ad service, for example, I understand that that lives across two hosts, there's two pods. And then the next thing that you want to think is what kind of failure do I want to inject? And we have 11 different types of failures, which are just out of the box. And so a lot of the time what folks do is
Starting point is 00:41:44 they'll either start with just one type, like maybe packet loss, latency, process killer, could be like something like CPU, IO, memory, disk, spiking. And they're thinking through like a specific use case. Like for example, what if you want to test auto scaling? So you can inject CPU to spike CPU using using that attack and what you might do is chain three or four cpu attacks together with a little bit of a delay in between so say like let's inject cpu here um now let's have a little delay inject more inject more until suddenly your
Starting point is 00:42:17 cpu is spiked then it should kick in auto scaling and it should work um and then your cpu is also going to go back down so then auto scaling should go back to the situation you're in before with like you should release the extra nodes that you created and that's like something that you want to definitely test before you're like suddenly getting a ton of traffic and need to use auto scaling and it doesn't work and a lot of outages have been caused by incorrectly configured auto scaling just as one simple use case. But that's often what I see folks do is, and linking back to, you know, this idea of creating patterns and codifying your work
Starting point is 00:42:52 and, you know, being able to think through, if I was to do this work in a CICD pipeline, you know, not just manually creating these tests, but having them run over and over, then you're thinking through like, I want to have auto-scaling as a pattern that I test and i want to just create a grumlin experiment to do that codify everything and then just make sure every time you ship something new a new feature a new addition new piece of code that adds to that service you're just going to run this again
Starting point is 00:43:18 and make sure that everything works correctly and we also have another feature called status checks which checks your monitoring before and after so So that's really cool, right? You hook it in to say like, let's check that the service is up and running. Yep. Now let's run the attacks. Yep. Now let's check. Actually, our monitoring still says everything's good. We didn't suddenly have an outage or like the system crashed or the service is no longer available or the service, you know, SLI went down. We're no longer meeting our SLO. So that's what I see a lot of customers doing is they start there by thinking of what are their specific use cases
Starting point is 00:43:51 that they want to test for. Then they go and make those into experiments in Gremlin. We call them scenarios. And then they'll look at how can I automate this within my CICD pipelines. So that's like a really cool thing I see with the integration of status checks too. Yeah. That is so interesting. So it's basically like you're adding regression tests in a sense from that's what the example sounds like, right?
Starting point is 00:44:15 Check if my auto scaling is working at this scenario. And only if the status check says you're approved, should you move on to the next step of, you know, pushing to all of production or something like that. Exactly. Yep. That's exactly it. And I think a lot about regression testing when thinking through these types of experiments. You know, like I remember there was a huge regression testing project that one of the engineers on my team, the databases team was working on.
Starting point is 00:44:43 And she was looking at making all of the pages on Dropbox.com much faster to run. And like they went through and identified all of the pages, how fast they ran right now, which ones were the slowest, looked at making improvements. But the thing that you always got to remember is like, dang, like someone can come in tomorrow
Starting point is 00:44:57 and ruin all this great work that we did. So you've got to have regression testing in place. Like we all know, like someone can write one bad SQL query and then everything's blown, you know, with have regression testing in place. Like we all know, like someone can write one bad SQL query and then everything's blown, you know, with your metrics for Perth. And I think like that's an interesting thing too. Like chaos engineering really appeals to folks that are SREs, but also performance engineers,
Starting point is 00:45:16 because we have attacks that you can run that injects latency. So you can actually say like, what happens if I add latency to this service? Like, would I know, like, what happens if I add latency to this service? Like, would I know, like, how does it impact my service? Like what's going to happen to other dependencies on it that work with this service? And then also a lot of like QA engineers who are looking to do more automation and shifting left, like integrating with CICD, they also are interested in chaos engineering because they're like, wow, this is cool. Like, I can prevent these issues before they get to production in like a really nice way
Starting point is 00:45:48 and build out a super scalable system that just tests all of these services. So yeah, that's cool, too. Yeah. And I guess if you integrate as part of the CI CD pipeline, you also don't have to worry about is anybody actually going to fix the bugs caused by like, so one thing that I always thought about was we would do these like monthly DRTs at my previous job at Dropbox. And we'd always have to prioritize whatever we found out. And sometimes you just don't do those action items. So what's the point? But if you again, shift left,
Starting point is 00:46:23 you need to make sure these things are fixed before you roll out a new version which is pretty interesting yeah i really like that approach too because a lot of folks say that exact thing like say if you're you know you're doing a drt exercise in production or you know a different environment you go through you identify those issues you know some teams are good about it they'll they'll be like i'm gonna get this done but then sometimes you can't because maybe you have some other team pushing you to deliver something else, maybe related to their items that came out of their DRT. So it can be really hard. Like which team gets priority for you to help them, especially, you know, the team that
Starting point is 00:46:57 you were in, you're like helping every team across the whole company. So it's really hard to prioritize at that point. And so I really like this idea of doing it within the cicd pipeline and also then everyone has the metrics everyone has the data everyone knows what's passing and what's failing it's just like a way better more visible approach and then that helps um push back to the management and the leadership and say hey like you know whenever we're trying to ship new features to this service, it doesn't work well. I think we need more headcount on that service.
Starting point is 00:47:27 Like, you know, we need to put engineering team, like, you know, resources. We need to put folks on there that can help because it's kind of hard to prove that you need more people on your team sometimes. And I feel like that's always like the constant battle in engineering. And this is like a real way to do it with data as well.
Starting point is 00:47:44 So all of this makes sense to me. When you were interviewing with Gremlin, like a few years ago, you must have had like a certain idea of, you know, what the product is and how it can help customers. And you decided to join. What is something like unexpected you've learned, like on the way, like on the journey, like how are customers using the product differently from how you thought about it when, you know know you design product or like gremlin engineers and like epd was thinking about things like what is something different that you've learned about it yeah i mean i definitely like when i first saw the product i was just like wow like i was blown away and this is you know back when it was seed round um pre-series a and atbox, like I had built a lot of the tooling to do the chaos
Starting point is 00:48:27 engineering at Dropbox, and a lot of it was just not as advanced, you know, as what Gremlin is. It just wasn't. Gremlin, when I first saw it, was this amazing UI that you logged into, it was super easy to deploy the agent, and you then had access to all these different attack types. And I had just never thought of some of the attack types that exist; that's what I thought was really cool. So for example, at Dropbox we'd done a lot of process killer attacks, which I thought was pretty awesome, that's much more advanced and a lot of people don't do that, and then shutdown attacks as well. But things like, hey, let's inject latency, let's inject packet loss, we had never done networking-related
Starting point is 00:49:06 types of attacks, like injecting networking failure. We'd done networking-related experiments where we were trying to understand, why is this being throttled by the network? And we were able to figure it out, but it took a long time to get to that point. If you can actually just debug the network by injecting failure into the network, that helps you prove it. So I love that, because there's always this saying of, it's never the network.
Starting point is 00:49:32 And sometimes it is, like I've just proved that it is. But it takes you, I feel like, say, a day to prove a basic, normal type of issue is happening as an SRE. For a networking one, I feel like it was always weeks to be able to prove that it was the network, and to get the network engineering team to back you up and work with you to resolve it, because they've also got a lot of work that they have to do. So to get them to listen to you and work with you is hard, but this helps you get that data.
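As a sketch of that "prove it's the network" idea, you could inject packet loss on the egress interface with tc/netem and compare request timings before and after. The interface name and URL below are placeholders, this assumes a Linux host with root access, and it is only an illustration of the technique, not any particular product's attack.

    # Add 10% packet loss, then compare request timings against a baseline.
    import subprocess, time
    import urllib.request

    DEVICE = "eth0"                              # assumption: egress interface
    URL = "http://internal-service.local/ping"   # assumption: dependency to test

    def timed_requests(n=20):
        times = []
        for _ in range(n):
            start = time.monotonic()
            try:
                urllib.request.urlopen(URL, timeout=5).read()
            except OSError:
                times.append(float("inf"))       # count failures as worst case
                continue
            times.append(time.monotonic() - start)
        return times

    baseline = timed_requests()
    subprocess.run(["tc", "qdisc", "add", "dev", DEVICE, "root", "netem",
                    "loss", "10%"], check=True)
    try:
        degraded = timed_requests()
    finally:
        subprocess.run(["tc", "qdisc", "del", "dev", DEVICE, "root", "netem"],
                       check=True)

    print("median baseline:", sorted(baseline)[len(baseline) // 2])
    print("median with 10% loss:", sorted(degraded)[len(degraded) // 2])

A before-and-after comparison like this is the kind of data that makes it much easier to get a network team to engage.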
Starting point is 00:49:59 But the thing, the experiment or attack type that I was most impressed with, was definitely blackhole. And that's because I had always seen failover, say region failover, service failover, being able to switch to a backup, like hot-hot, let's just shut down one region and make sure that the other one works. I'd always seen it done as something other than what we have as blackhole. A lot of the time it was, let's shut down completely, let's tear down this data center, we're just going to shut down everything, and then we're going to see what happens. And that's a very destructive approach, actually,
Starting point is 00:50:39 because you're shutting everything down, like doing a power outage, and then you're going to have to bring everything back up. And just the act of taking it down and bringing it back up can introduce a lot of extra failure that you don't really want to introduce, and it also just takes a lot of time because of that whole process of bringing everything back up. So the idea of a blackhole, that you could do a failover exercise in 60 seconds without having to turn anything off, just making it kind of invisible for a period of time, like, wow, that's cool. That's just a great pattern, it's so
Starting point is 00:51:10 much safer. And I like Gremlin's approach of safety: that was one of the biggest focuses, along with security and simplicity; those were the three values since I joined. The other really cool thing, too, is the Halt All button. So there's this button in the UI at the top right, and it says Halt All, and you can just stop all experiments that are running at any time, and the agent will just stop running them. There's a dead man's switch, which we built into it,
Starting point is 00:51:37 which I love as well. So it's all these really cool safety and security features, which I just nerd out about as well. I'm like, that's such a great idea. So yeah, those are still my favorites.
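As a rough illustration of the blackhole pattern described above (and not Gremlin's actual implementation), the core idea can be as small as dropping packets to one dependency for a fixed window and then removing the rule: nothing is shut down, so there is nothing to bring back up. The host, port, and 60-second window below are placeholders, and it assumes a Linux host with root access.

    # "Blackhole" one dependency for 60 seconds by dropping traffic to it,
    # then restore. Illustrative sketch only.
    import subprocess, time

    DEP_HOST = "10.0.0.25"   # assumption: IP of the dependency to blackhole
    DEP_PORT = "3306"        # assumption: e.g. a database replica
    RULE = ["OUTPUT", "-d", DEP_HOST, "-p", "tcp", "--dport", DEP_PORT, "-j", "DROP"]

    def blackhole(seconds=60):
        subprocess.run(["iptables", "-I"] + RULE, check=True)   # start dropping traffic
        try:
            time.sleep(seconds)  # observe: did we fail over? did alerts fire?
        finally:
            subprocess.run(["iptables", "-D"] + RULE, check=True)  # always restore

    if __name__ == "__main__":
        blackhole(60)

The try/finally cleanup here is a crude stand-in for the Halt All and dead man's switch safeguards mentioned above: whatever happens during the experiment, the drop rule is removed.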
Starting point is 00:52:02 And have you generally seen, I think you were mentioning customers earlier, have you generally seen customers that care a lot about uptime? What is the trend of customer interest? Have you seen more people finding out about the term chaos engineering? What have you seen over the last few years? Yeah, it's definitely changed. Back four years ago when I joined Gremlin, chaos engineering was popular, but it was really just starting to take off. I think a lot of people had heard about it, they'd heard of Chaos Monkey, but I would say most people hadn't done chaos engineering, they'd just heard about it. And then I was having fun speaking at a few conferences, but I didn't want to just talk, I wanted to help people do chaos engineering. I'm like, this is a fun thing to do and you're going to learn a lot. So I was teaching this workshop while I was at Dropbox, just a chaos engineering bootcamp. And it was really fun. It
Starting point is 00:52:44 was like, I got everyone to spin up a Kubernetes cluster and then inject failure into the cluster. So at the start, I'd be like, hey, who here has done chaos engineering, put your hand up? And like two people would put their hand up. And then at the end, it's like 300 people put their hand up out of the workshops. That was a fun way for me to help everyone across the industry actually get to try it out. And a lot of people, when they did it, they're like, oh my gosh, I love that so much. I would get them to inject packet loss and they'd be like, I can see it visually that everything's running slower, that's really cool that I'm able to do that. It's a very visceral feeling. You're trying to
Starting point is 00:53:16 use your service and you just can't, because it's no longer working as expected. And I think it was rare, like it's still rare, to see what your service, your application, looks like during a failure mode, when there is an issue. And coming from Australia, there were so many networking problems, like latency and packet loss. I'm like, you know, it's just a thing; other people don't have to experience that. I'm like, oh, I've been buffering videos for what feels like a lifetime, coming from Australia. And in America, you never have to buffer anything. So, you know, that's a really funny thing.
Starting point is 00:53:50 But I would say it's really shifted a lot, too, in that now I see engineers measuring not just SLOs and SLIs and uptime and how many nines, but also dollar value. So I can ask customers, hey, how much money does it cost you if your company is down for a minute? And they're able to give the dollar value of that: a one-minute outage costs us this much as a company. And once you've got that number, you know,
Starting point is 00:54:17 and you might need to talk to a few different people, the finance team, the product team, to be able to calculate that. But we have some very well-known customers that a lot of people use every single day, and they're able to figure out that dollar value. And that's really powerful, because then as an engineering team you're able to say, every one-minute outage that we are preventing by doing this work is saving us this much money. And it's easy, too, to be able to go, well, last year we had this many hours or minutes of outages, and this year we did all this preventative chaos engineering work and we've reduced it by this much. You know, that's a great way to be able to show that value back.
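The arithmetic behind that argument is simple enough to sketch; every figure below is invented purely for illustration.

    # Back-of-the-envelope version of the dollar-value argument above.
    # All numbers are made up for illustration.
    cost_per_minute = 10_000          # assumption: what one minute of downtime costs
    outage_minutes_last_year = 180    # assumption: last year's total
    outage_minutes_this_year = 60     # assumption: after the reliability work

    avoided = outage_minutes_last_year - outage_minutes_this_year
    savings = avoided * cost_per_minute
    print(f"Downtime avoided: {avoided} minutes")
    print(f"Estimated value of the reliability work: ${savings:,}")   # -> $1,200,000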
Starting point is 00:54:46 Yeah. Yeah, that makes sense. And then over time, as more engineers and managers see the value of these things, that's how something like a new term, similar to observability, just expands over time. Maybe a final question just to wrap up: if I'm a new
Starting point is 00:55:12 SRE and I'm starting off in a role where the site's going down too much, or on-call toil is just too high, and I want to start with chaos engineering, but I'm not going to just pitch, let's buy this product, on week one. What are some baby steps that I can take? How do I inject some ideas, or how do I show the organization that this is an important thing to do? And how do I start with that? How do I prove to myself?
Starting point is 00:55:40 How do I prove to my team that, you know, I should start with a little bit of chaos engineering? Yeah, I think that's a great question. So one of the things that I think is really powerful is for engineers to learn about software and techniques like this, practices like this, and then do your own demo internally within your organization. And so that's like, you know, you can use Gremlin for free, you go to gremlin.com slash buttons, but then you can spin up your own demo environment yourself where you actually inject failure and learn about it. And it doesn't have to be your work product, right? When you're first learning about chaos engineering, it could be a demo
Starting point is 00:56:14 environment. Google has a really cool one that runs on Kubernetes called Bank of Anthos. And I feel like that's a really good demo because it's a bank: it does deposits, it does withdrawals, and people can see how serious that is. But you can inject all sorts of different failures and see how it impacts it. For example, if you blackhole, I think it's one of the services, the balance reader, then the balance will show up as zero. And so that's interesting, people think they have no money in their account. But that's a really good demo to show people internally: instead of telling them what chaos engineering is, you can visually help them experience it. And I think that's great, especially for this world during COVID where you're remote a lot, so you can
Starting point is 00:56:59 do an actually interesting, fun demo. Or it could be something like a tech demo, a lunch and learn. If you do those, you can be like, hey, let's just chat about chaos engineering, but I want to share this demo that I created. I think that's a great way to do it. I'd recommend that first.
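If you want to try the Bank of Anthos demo described above, one minimal way to "blackhole" the balance-reading service is a deny-all ingress NetworkPolicy applied for a minute and then removed. This is only a sketch, not the Gremlin attack itself; it assumes a throwaway cluster whose network plugin enforces NetworkPolicy and that the deployment keeps the upstream app: balancereader label.

    # Deny all inbound traffic to the balance-reading pods for 60 seconds,
    # then remove the policy. Run against a demo cluster only.
    import subprocess, time

    POLICY = """
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: blackhole-balancereader
    spec:
      podSelector:
        matchLabels:
          app: balancereader    # assumption: upstream Bank of Anthos label
      policyTypes:
      - Ingress
      ingress: []               # empty list = deny all inbound traffic
    """

    def kubectl(args, stdin=None):
        subprocess.run(["kubectl"] + args, input=stdin, text=True, check=True)

    if __name__ == "__main__":
        kubectl(["apply", "-f", "-"], stdin=POLICY)   # start the blackhole
        try:
            time.sleep(60)   # load the frontend now: the balance should read as zero
        finally:
            kubectl(["delete", "networkpolicy", "blackhole-balancereader"])

While the policy is in place, loading the frontend should show the zero balance mentioned above; deleting the policy restores normal behaviour without restarting anything.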
Starting point is 00:57:19 And then the other thing, too, is to just read about it. It's great that folks are listening to this podcast, so this is a cool way to learn more about it as well. If you Google chaos engineering, read through some of the articles there, watch some videos on YouTube. I created a series of videos called Chaos Engineering in 60 Seconds, like Gone in 60 Seconds. And those are really short videos
Starting point is 00:57:39 which show all the different attack types. So you can see, what does a blackhole attack actually look like? Just look up Chaos Engineering in 60 Seconds, blackhole, and you'll find it. So yeah, that's what I was saying: have fun with it. This is a cool thing that's going to help you with your career for many, many years to come, so it's a great practice to invest your time in. I do not regret it at all, learning chaos engineering. Cool, well, thank you so much for being a guest. I feel
Starting point is 00:58:05 like I learned a lot. I need to look into the Bank of Anthos, that's not something I knew about, so I'm going to take a look at that. Thank you again. Thanks so much for having me, I really enjoyed it.
