Software Misadventures - Tammy Bryant Butow - On failure injection, chaos engineering, extreme sports and being curious - #6

Episode Date: March 7, 2021

Tammy Bryant Butow is a Principal SRE at Gremlin where she works on Chaos Engineering. In this episode, we discuss how her curiosity led her to the world of infrastructure engineering, an outage from her early days where a core switch took down half the data center, her experience running a disaster recovery test and how it taught her about the importance of injecting failures into a system to make it more resilient. We also touch on advanced failure injection techniques, how chaos engineering is evolving and how extreme sports help Tammy keep calm under pressure. Lastly, Tammy has some great advice for teams looking to get started with chaos engineering.

Transcript
Starting point is 00:00:00 So something that I think is good, yeah, there's like three areas to focus on. The first one to me is helping educate folks on what is chaos engineering and sometimes what is reliability engineering, like what even is SRE. And you've got to do that first, so that people understand why we're doing this work. Then the next step to me is trying to figure out how we move towards a culture of reliability where we're injecting failure. And a lot of the time for that, I would say you want to think about: are we going to do this as a centralized team that's doing all of this chaos engineering work, or do we want to do it more self-service, where we just make it available to everybody? Welcome to the Software Misadventures podcast, where we sit down with software and DevOps
Starting point is 00:00:46 experts to hear their stories from the trenches about how software breaks in production. We are your hosts, Ronak, Austin, and Guang. We've seen firsthand how stressful it is when something breaks in production, but it's the best opportunity to learn about a system more deeply. When most of us started in this field, we didn't really know what to expect and wished there were more resources on how veteran engineers overcame the daunting task of debugging complex systems. In these conversations, we discuss the principles and practical tips to build resilient software, as well as advice to grow as technical leaders.
Starting point is 00:01:21 Hello everyone, this is Ronak here. In this episode, Guang and I speak with Tammy Bryant-Butow. Tammy is a principal SRE at Gremlin, where she works on chaos engineering. She previously led SRE teams at Dropbox, responsible for databases and storage systems used by over 500 million customers. Prior to this, Tammy worked at DigitalOcean and one of Australia's largest banks in security, product, and infrastructure engineering. We had a great time speaking with Tammy. We discussed how her curiosity led her to the world of infrastructure engineering, an outage from her early days where a core switch took down half the data center, her experience running a disaster recovery test,
Starting point is 00:01:59 and how it taught her about the importance of injecting failures into a system to make it more resilient. We also touch on advanced failure injection techniques, how chaos engineering is evolving, and how extreme sports help Tammy keep calm under pressure. Lastly, Tammy has some great advice for teams looking to get started with chaos engineering. Please enjoy this super educational and highly entertaining conversation with Tammy Bryant Butow. Hi, Tammy. I'm super excited to have you with us today. Welcome to the show. Thanks so much for having me. Great to be here. I noticed that you were doing full-stack software engineering before doubling down on SRE.
Starting point is 00:02:37 So what made you kind of decide to join the dark side and focus on infrastructure engineering? Yeah, it's actually a really fun story. I always loved computers, like, since I was super little, you know, had a computer, had the internet since I was like 11. But the big thing that happened to me was, you know, I finished school, went to university, studied computing, you know, computer science, did programming, loved it. I really loved everything, though, so I was trying to figure out what I would like to focus on. And then I got my first job out of university at the National Australia Bank, in their graduate programmer rotation that they have. And so what they do is they put you in like three different teams. And that was pretty cool for me because I thought then I'll get to try a bunch
Starting point is 00:03:25 of different things out and you know see what I really wanted to double down on and my first team I went on was mortgage broking um that was like super cool because it was very critical systems but I realized every single time they were like can you build some new features on the front end can you also do some business logic work can you fix some issues the database? Can you fix some issues with the load balancer? Can you be on call? It was like do everything. And I realized that like every time I tried to build anything on the front end, I had to go back a layer.
Starting point is 00:03:54 And I was like, okay, like this isn't working. It's like super slow. Why is it not working? Oh, like the middle tier is really bad. Wait, actually, it's like the SQL queries. Can I fix those? No, the database structure is like really bad. And then there was issues with load balancing too.
Starting point is 00:04:09 So I just like would go back and back and back. And then I was like going all the way to the hardware. Like what kind of hardware is this running on? And that's just the kind of person that I am. I'm like, I want to like, I just can't stop going all the way back to the very like bottom level. Like, you know, how is this data center powered? Like, and I went and visited the data center to learn more about it but that's just me like i'm curious and i realized that it's like really hard to make things amazing on the front if they're really
Starting point is 00:04:34 not very great on the back, so you kind of have to fix the back end first a lot of the time. Yeah, that's why. And now you're a Principal SRE at Gremlin. Yeah. What does a day in your life look like as a Principal SRE? Yeah, it's really varied. I've been at Gremlin for three years now, so I've done a lot of different things. I joined as the ninth employee, and, you know, I've done a lot of things over the years, normal things like you would imagine. We're primarily on AWS. We've got a really nice layout. In terms of architecture, we also use a lot of new services, so when AWS releases a new service we try and be at the forefront. That was one of the reasons I liked joining Gremlin, because we weren't like, we're just going to stick with old stuff that we know well.
Starting point is 00:05:21 Gremlin's written in Rust. We use SQS, SNS. We've used Lambdas for some stuff. So that's been really fun. I wanted to join Gremlin, and one of the reasons was I'd always actually worked on-prem, because Dropbox is on-prem, DigitalOcean's on-prem, National Australia Bank's on-prem. I worked on a little AWS project,
Starting point is 00:05:42 but it was always like building the cloud or cloud sort of related products for other people. And Gremlin was founded by two engineers that worked at AWS, like building AWS. And I was like, that's awesome. If I want to learn about AWS, I'll go work with them. And so, yeah, like I've learned so much from them about how to build like reliable, scalable systems on AWS, which is like a whole new area for me. So actually, like, even though I'm a, you know, principal SRE, like I'd say like the last three years, I've just learned so much about building reliable systems on AWS. And especially by doing a lot of chaos engineering, failure injection. Obviously, I've been on call over the years,
Starting point is 00:06:22 but we really don't, like, pretty much never got paged. I think in my first year I got paged like once, like it was nothing. And that's very different. In the past, I used to get paged hundreds of times a week, you know, until we would fix issues. And also, you got the advantage of joining a startup when it's small, being the ninth employee. It's like super early on, right? Gremlin just existed. We had just gotten a web UI, because before it was a command line tool for the agent. It's all really new. We did a big migration to React, and that was something that happened over the years. But yeah, lots of cool stuff. I'd say my day to day, a lot of it's now actually helping other people understand how to build reliable systems, how to do chaos engineering, how to do failure injection, and also thinking through strategy and future work, which is something I'm really excited about.
Starting point is 00:07:14 Like, you know, where is SRE heading? Where is chaos engineering going? Like what are the new platforms that are coming out in the future? What are some easier ways that we can get big reliability wins like instead of just doing it like the same way that we've always done it because i have a few tricks up my sleeve but i like to always think of new ways too so yeah that's me um that's really cool kind of moving away from doing on-prem to cloud what was your biggest pleasant or unpleasant surprise um during that move i mean it yeah it was really different than i expected like i honestly thought it was going to be a lot more complicated and
Starting point is 00:07:52 difficult because like you know no offense to aws but like i've like it's like there's a lot of services and everything's really different and there's it feel it felt like before i started to do it there was just like a lot like it was like, there's like a lot of services and a lot of things to learn. And like, it didn't seem like the pieces all connected really well together, right, to me. But now it doesn't feel like that at all because I've done it for three years. So now I just feel like, oh yeah, just grab this bit, put it over here, grab that, do that. You know, it's all actually been very easy to do. And then injecting failure enables you to see like what happens when systems don't work well together. Right. Like, and I like a lot of the features, but I've learned like the ins and outs of how they don't work. Like, you know,
Starting point is 00:08:36 auto scaling, that's a really cool product that AWS built. But like, I've also seen a lot of outages related to auto scaling because like of configuration issues and like throttling problems there um and so just you know I'm at that level where I'm going into that next layer of like how can you actually make sure that this is reliable but like in a really really like tiny little details with on-prem it's different. Like I was focused on such different stuff, like hardware, performance tuning, firmware upgrades, kernel versions. I mean, like I'm just not looking at, you know, kernel versions for EC2.
Starting point is 00:09:14 That's like, there's just different things that I just don't even do anymore. Or picking hardware, you know, I would be part of like buying hardware decisions. What hardware do we want to buy for our databases? Let's look at all the options. Let's have like hardware vendors come in and demo to us and like pick the best hardware and do capacity planning it's like totally different you still have to do it but um i don't have to do it like
Starting point is 00:09:35 you know oh we have to buy this hardware and get it shipped to the us and then put it in our data centers buy more data center space like just a lot of projects that you don't do. But I honestly love on-prem work. Like I really love it. It's super fun and cool. Like I know you do it at LinkedIn and then you also have like Azure as well with Microsoft. So it's cool. You get to do both like, and I think it's good. Yeah. Yeah. LinkedIn is actually in the process of migrating to Azure and some of the challenges that you're describing is like, there's a set of problems that you don't have to think about anymore. But then there is a whole new set of problems. I know. That's exactly what happened
Starting point is 00:10:14 to me. Like, yeah. And then you're like, oh, wow, these are really different problems than I saw before. And I'd never solved those problems, so it's like learning totally from scratch. Yeah. And Gremlin is a chaos engineering company. How did you get into chaos engineering? So I started back at the National Australia Bank, and one of the things that they told me I had to do when I first started, they were like, all righty, so we're volunteering you to run our disaster recovery testing, DRTs they call them. And basically they were like, so for mortgage broking, you have to make sure that it fails over from one data center to another.
Starting point is 00:10:53 And you've got to go to this like secret location on the weekend. Someone's going to come around and like ask you to fail over your system. And then they're going to check that it actually worked. And if it doesn't work, then they're going to like mark it down on a piece of paper and then you'll have to fix everything and then test it again in a quarter. I was like, wow, this is like super serious. Like, and I, and I just graduated too. So I was like, do I even know how to do this work? Oh my goodness. Like, but my boss was like, you know, I think you're going to be good at this. So he just like volunteered me for it, which is really cool. And, um, yeah, he was a a great boss i was lucky to have an awesome boss
Starting point is 00:11:25 coming out of university, who gave me a lot of good opportunities. And yeah, I went to this secret unmarked building, did the failover exercise, passed the first time, because you do a lot of work to prep, a lot of failure injection on purpose, proactively, making sure that you're injecting that chaos so that you'll pass that region failover. And the big thing too is we have to do it for compliance reasons, right, because you're a bank, so you have to pass these big massive exercises. And I did so many over the next six years. Sometimes I passed, sometimes I failed, but it was on big systems: mortgage broking, internet banking, foreign exchange trading. So I love that, working on critical systems. And then I moved to America and worked at
Starting point is 00:12:12 DigitalOcean, and I was like, oh, I think the easiest way for us to learn about really complex systems is to inject failure. I still believe that. And DigitalOcean was like 14 data centers, so it's massive scale for on-prem, and, you know, things go wrong, failure happens, you need to be ready to handle it. It's such a cool scale, having that many data centers. It was really fun and we did a lot of cool work. First we started with just drills, you know, kind of like tabletop exercises, and then you just think through what else you can do. You obviously want to try and inject failure, that's the best way to do it. And then at Dropbox I went there and did it straight away. Like, in my first three months I reduced incidents by
Starting point is 00:12:55 10x with my team, the databases team, by injecting failure on purpose to identify those areas of weakness that we needed to fix. And then we just did a reliability sprint, so two weeks dedicated just to reliability. Even though we're SREs and obviously you do reliability, we were like, nobody can book meetings with us, we're focused on this, we want to fix all these things, we're just going to work really, really fast and hard. And we came out of it and it was awesome. We just never were as bad. For the next 12 months we never had a high severity incident. So yeah, that was really cool. But that's how I got started, and it's just been such a fun journey. I've learned a lot over the years. Nice, nice. I remember my first exposure to the concept, I think, was watching a talk at AWS re:Invent where, yeah, the Netflix people were
Starting point is 00:13:43 talking about Chaos Gorilla and Chaos Monkey, and just being sort of mesmerized when they showed the graphic of doing a live DR event and redirecting. I'm curious, how does cloud, or the more
Starting point is 00:13:59 massive adoption of cloud play into this? On one hand, I can definitely see because these cloud providers kind of abstract into this on one hand i can definitely see right like because these cloud providers kind of abstracts more and more of these things away such that you don't have to worry about it but then on the other hand because of um more abstraction you can move faster so there's more errors and there's more room for failure so there's more of a need to kind of preemptively test it i'm curious to kind of give your thoughts yeah that's a great question like i get asked that a lot as well and um like there's two things that come to mind one
Starting point is 00:14:30 is you know moving to the cloud a lot of folks think like yeah this is going to be easy i won't have to worry about reliability it'll happen out of the box but like that totally doesn't happen so don't think that um definitely it's not going to be as easy as you think like and i think as sres we know that going into it you like you totally know that right but um a lot of folks don't have that reliability background and so a lot of what i've been doing actually is helping people create like a an understanding or culture of reliability too because that's really important like why is reliability important to us you know how do we show that we care about it by demonstrating like that and doing actual work to you know be proactive and focus on
Starting point is 00:15:11 reliability. And specifically, if we look at, you know, Kubernetes, there's a lot of outages reported just for Kubernetes on the cloud. Lots. There's a whole GitHub repo. Yeah, k8s.af. If you check that out, there's tons of outages. And I did some research to analyze those outages. That's something I do for fun, I'm a super nerd. And I created this diagram, and it was like, wow, you know, 25% of those outages were related to just CPU issues, like spiking CPU or CPU throttling, which is like, you know, you're like, whoa, as an SRE, that's crazy. And then the other thing was about just clusters being unavailable. So that's just, you know, shutdown or unavailable machines,
Starting point is 00:16:05 like an unavailable node usually, maybe, not even at the pod level, which is more complicated. Yeah. But, yeah, looking at that, you just go, wow, we still have a lot of basic things to fix. Like, you know, we can't handle CPU spikes or, like, shutdowns of nodes with Kubernetes, which is supposed to be reliable. That's kind of where we are right now as an industry.
Starting point is 00:16:25 And, like, obviously in 10 years we'll be somewhere way better, but it's still just the beginning phases. That's what I really think. Yeah, it's interesting. What were the other kinds of failures you saw, like patterns with Kubernetes failures in general in the cloud? Yeah, it's really interesting. So that was, you know, half of it: CPU, and hosts or nodes going away.
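To make those two failure classes concrete, here is a minimal sketch of a node-unavailability experiment driven from Python with kubectl. It is a generic illustration, not Gremlin's tooling; the node name and health-check URL are assumptions you would replace with your own, and you would only run it against a cluster you are allowed to break.

```python
import subprocess
import time
import urllib.request

NODE = "worker-node-1"  # hypothetical node name, e.g. from `kubectl get nodes`
HEALTH_URL = "http://my-service.example.com/healthz"  # hypothetical service endpoint
OBSERVE_SECONDS = 120

def kubectl(*args):
    """Run a kubectl command and fail loudly if it errors."""
    subprocess.run(["kubectl", *args], check=True)

def service_healthy():
    """Return True if the service answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False

# Hypothesis: with one node cordoned and drained, pods reschedule onto the
# remaining nodes and the service keeps answering its health check.
kubectl("cordon", NODE)
kubectl("drain", NODE, "--ignore-daemonsets", "--force")
start = time.time()
try:
    while time.time() - start < OBSERVE_SECONDS:
        print(f"{time.time() - start:6.1f}s healthy={service_healthy()}")
        time.sleep(5)
finally:
    # Always return the node to service, even if the experiment is interrupted.
    kubectl("uncordon", NODE)
```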
Starting point is 00:16:48 And a lot of that has to do with like how you can set up your clusters. Like, cause there's limitations there around like AZs and regions and that can cause outages. Yeah. It's like those fine little details are what you need to look into. And there have been outages of like two regions at the same time, you know? So just knowing that is something you've got to prepare for and that has caused people to lose a lot of money because
Starting point is 00:17:10 like, if folks don't know, at some enterprise companies, if you have an outage you might have to pay a fine to your customers, like an SLA downtime fine. It can be millions of dollars in some cases. So, you know, that's why SRE teams are so important in this work. You save lots of money for a company as well from just that perspective. And then the other side of the issues were actually mostly related to networking and resources. But when you think about networking, DNS is a big one, you know, that's always a big one that has failures. Yeah. If you don't know what it is, it's probably DNS. Exactly. It caused a lot of issues. And you can do redundant DNS, or have more reliable DNS infrastructure by having a backup and stuff
Starting point is 00:17:56 like that, but a lot of folks don't do that. And then the other area for networking issues is usually latency or packet loss, and that's kind of, you know, just being able to say, what happens if my system experiences latency? The first thing I always think is, would we know? And then, do we have good tools to be able to identify that? Can we pinpoint where the actual problem is? And then can we remove that problem from the path? And I think it's a really good superpower for an SRE to understand networking. It's very handy. Whenever I've worked anywhere, I was always super good friends with the network engineering team, and they're awesome. You know, I'm like,
Starting point is 00:18:34 what tools do you have? Can I have access to your tools? And they're like, yeah, sure. They'd let me log into ThousandEyes and I would be like, wow, this is a dream. Like I can see the network diagram. And then they taught me all about peering and how we would, you know, make the network a lot better. And yeah, that's a really cool thing to focus on, because you can improve your system a lot, but you need to be able to work with other teams to do that, right? Like that's a big thing. Yeah.
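As a concrete version of the latency and packet-loss injection described above, here is a minimal sketch using the Linux tc/netem tool as a generic stand-in for purpose-built tooling. The interface name and the numbers are assumptions; it needs root and should only be run on a host you are allowed to degrade.

```python
import subprocess
import time

IFACE = "eth0"      # assumed interface name; check with `ip link`
DELAY = "100ms"     # added latency
LOSS = "1%"         # packet loss
DURATION_SECONDS = 60

def tc(*args):
    subprocess.run(["tc", *args], check=True)

# Hypothesis: with 100ms of extra latency and 1% loss on this host, callers
# stay within their timeout budgets and the degradation shows up clearly
# in our dashboards (i.e. "would we know?").
tc("qdisc", "add", "dev", IFACE, "root", "netem", "delay", DELAY, "loss", LOSS)
try:
    time.sleep(DURATION_SECONDS)  # observe dashboards and alerts while impaired
finally:
    # Always remove the impairment, even if the script is interrupted.
    tc("qdisc", "del", "dev", IFACE, "root", "netem")
```

Running a sketch like this while watching your graphs answers the "would we know?" question directly, before a real network problem asks it for you.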
Starting point is 00:19:02 Yeah. Yeah, absolutely. I think the networking part gets more complex now with overlay networks and containers in there, where every Linux machine becomes a router of sorts. Exactly. So it's like so many layers of abstractions you have to understand.
Starting point is 00:19:19 Yeah, you're so right, Ronak. And then you can have, yeah, latency between all those layers, and, you know, total packet loss or a little bit of packet loss. Yeah, job creation, right? So at Dropbox you were an engineering manager. Do you feel your experiences, I guess maybe the question is kind of based on this post I read from Charity Majors, I think a while back, talking about sort of straddling between the IC track and the management track. It's very hard to do both well at the same time, but you also don't want to, you know, get too far away from one another
Starting point is 00:19:57 so then does do you feel like your experience managing a team kind of helps you in this like more like a leadership role as an ic yeah Yeah, definitely. Like I've, I've read that post too. And I think it's really good. Like I've definitely flipped between the two roles over time. So, you know, like, yeah, obviously most folks probably start out as an IC, like I sure did. And then, you know, you get your first opportunity. My first opportunity was to lead like an intern, which was really cool. And they were like, can you help lead this intern and guide them for their internship and then after that and I count that as leadership experience right like if you're told to lead an intern then you are a manager for that intern and then after that I was leading a new grad and so I had like one person and then gradually
Starting point is 00:20:40 led more and more folks. But then I decided I wanted to go back to IC work. And this is still, like, at the National Australia Bank. So I went back to IC work. But the way to describe it is, when you first become a manager, managing people, you look behind this red curtain. You open it up and you're like, whoa, this is what they care about. Like, I did not know how I was being measured for performance, or what people thought was important, you know,
Starting point is 00:21:05 for me to do until like I you know I didn't know what a roundtable discussion was for performance conversations like there's all this stuff that like you can't imagine what it looks like until you see it and the only way to see it is to become a manager and like it's hard to describe it to people but when I would go I went in I remember my first roundtable after a performance review sort of cycle so well, like people saying, I think this engineer should get promoted because they deliver this and this actually, no, I think this one should get promoted more and first and like have a bigger pay rise because they delivered this other project, which was much more important and
Starting point is 00:21:38 helped all these other teams. And like, I just didn't know before that, cause no one had explained to me like how we were being assessed. And so you're kind of just like trying to do your best work based on what your manager tells you. But then going into that room, I was like, oh, now I know how to, you know, really do well at my job. And it definitely helps. So like I tell everyone, like if you can have a stint as a manager, just do it. So you can like see behind the curtain and be like, okay, now I get it. At least you would know how the system works then.
Starting point is 00:22:07 Yeah, totally. And as engineers, we love to know how the system works, right? You're curious and you want to understand it. And just even doing it for three months, it's really good. I wish there was an easier way to just understand it and see. But yeah, it's a good thing to just do it. I recommend it for everyone. way to just like understand it and see but yeah it's a good thing to just do it i'll recommend it for everyone so uh when you when you became a manager uh did you miss some of the ic work that
Starting point is 00:22:31 you did, because a lot of your time would then go into people management? No, because I started really small with leadership, and I recommend that as well. I started with leading one person, one engineer, and he was a new grad and he was super smart and switched on. I would just help him; it was very much more, I would say, like being a tech lead manager, you know. So having him and, you know, mentoring him, doing one-on-ones, doing his performance reviews, figuring out what work he could deliver for the company, and assigning him to projects. You kind of have to be like, my engineer can work on all these things, like shopping around for projects and stuff. But that's how it works in a bank, it's really
Starting point is 00:23:08 different. And they have internal charge codes for projects. It's really different. But yeah, when we did that, most of my work was actually with him: code review, architecture discussions. It's very technical, because he was new to industry and he needed to learn a lot of that. And then after that I managed, you know, two people, so I still had a lot of time to do IC work, and I had IC projects assigned to me as well. And then, actually, when I was at DigitalOcean I managed like 33 people at my highest point, so that was, that's a lot. Yeah, that was a big team, and I was having 15-minute one-on-ones every two weeks with my team. Like I didn't even have the ability to meet all of them enough, but then we ended up scaling that out. But that's like
Starting point is 00:23:58 startup life you know you grow so fast and eventually hire managers um but i always knew that that was way too high at at um dropbox my largest team was 14 people across like database and block storage magic pocket at the same time which was like fine that was really cool like and i think um i know i was just involved in a lot of technical conversations and i did do some IC work. So I suppose I'm like someone that can never totally be away from what we're doing. Cause I feel like you wouldn't be very good at your job. You know, like I was involved in picking like what firmware version we should use in the kernel, like discussions and hardware. And, you know, I was the one doing the chaos engineering experiments, even though I was managing the teams, like I was doing a lot of the failure injection, but more as like a validation for my team. All right, I'm going to
Starting point is 00:24:48 like fail the system. Is it ready, guys? Yeah, it's ready. Let's do it. Okay, cool. It survived. Like we did great. Or like, oh no, we identified some issues, let's fix them. But yeah, I'm lucky to have worked with so many awesome engineers. I think that's like, as a manager, when you have an amazing team, you're just like, wow, I get to work with all these folks every day. That's super inspiring. Yeah, a lot of great people I've worked with. Awesome.
Starting point is 00:25:16 Like when you have a manager who's also getting into the technical details, some of the managers might hate me for saying this, but they get a lot of street cred from their entire team, because like, hey, I know my manager understands all these details. It's not just about the eventual goal, but also what it takes to get there. Yeah. And that's what I think is really important. Say you've got a project, this is just a really simple example, but we had to do this database migration project, and I looked at all of the tables, understood the schema, interviewed every engineer that had worked on all the different services that touch that database, because I was like, we need to migrate from MySQL to our new
Starting point is 00:25:56 distributed data store and i was like i feel like this is going to be a big project like i can't just be like hey guys i want you to deliver this in six months you know like without looking into it that'd be crazy like and so then i did all these interviews and looked at the code and understood it looked into what data was even there tried to think through could we just like get rid of these tables or do we actually need to migrate them over like talk to people about that so like really looking into the details and then you know when we looked at it like i was like i think this is going to be like over a hundred engineers and it's going to take like over 12 months. And that's what I put forward. And that's what it ended up being like, you know, but it was like, that's the thing too. I think a great manager has to talk to their team and like their engineers and ask, Hey,
Starting point is 00:26:39 this is a big project. Like, what do you think is going to be the hardest things for us? What's going to, you know, take us time? Um, you know, do you think we is going to be the hardest things for us? What's going to, you know, take us time? You know, do you think we're going to come across any issues here? Like, let's chat about it. Like, let's get into a room and draw out a whiteboard diagram of the architecture. But I think it's also, it's like, yeah, the engineering manager has to be interested in that. But it's like be passionate, have purpose, but also be, like, curious.
Starting point is 00:27:04 And you can't fake that, right? You can't fake it. We could not agree more with that. So we've been talking about some of the outages in the public cloud, and I actually want to touch that. But before we get there, you were managing the storage. Some of the storage teams are Dropbox and managing storage systems, in my opinion, are hard. Anyone who does that would attest to it.
Starting point is 00:27:33 So we love discussing some of the production outages and lessons learned on this show. Are there any such outages that you could share with us today? Yeah, totally. So so yeah definitely agree that you know managing anything that is related to data is hard and scary right like yeah it just has to be because it's critical data and then you always think through like what if we lost our data then it would be gone forever and you can't just get it back like if you don't even have backups or something or there's no like extra layers there so that that's why it's so scary and also just um around consistency issues like oh yeah that's yeah that's the other thing so you know they i always thought this like engineers who are who are gonna work on storage
Starting point is 00:28:15 databases, it's like, well, you've got to be brave for that work. It's a big thing. It is, it's really like that. And things change so fast, and, you know, a lot of teams depend on you, because if you're at the data layer, pretty much every team needs that data. So you have a lot of internal customers and you need to make it available, always reliable and accurate. So yeah, that's a big thing. But in terms of outages, yeah, everywhere that I've worked over the years, pretty much except Gremlin, like, knock on wood, has had, like, a data-related outage. And, yeah, like, right before I joined Dropbox,
Starting point is 00:28:53 there was, like, a big outage that had happened. And if you look up, like, you know, Dropbox outage, I think it was, when was it, 2014, something like that. There was a three-day outage related to the databases. And so, yeah, that's a big outage. That's a long one. For three days. Yeah.
Starting point is 00:29:13 And it took a long time to get everything back up and running. And it was just from like human error. Like, you know, somebody did something that they didn't mean to do and there wasn't enough guardrails in place. And that's why that ended up happening. So it was just like a thing where, yeah, of course, like after that, you put in the guardrails in place and, you know, it's just like they should have been there in the first place
Starting point is 00:29:34 and they weren't there. But a big part of what I did when I joined was to try and think through, like, how can we make this better? What can we do to, you know, not be in that situation again? Work with the team to put all these different guardrails in place, lock down the ability to do certain actions. That's really important. But I think, you know, even before that, that's what I'd say with databases: it's always a thing where you go, you only want some people to have access, those that really are careful.
Starting point is 00:30:03 And, you know, they're like typing like this, like, oh, my goodness. It's like a big thing when you're doing something. So most people don't want access if they don't know what's going to happen or something like that. You really want to limit it. And I think that's better rather than just having it available to everyone. Yeah. I remember one of my colleagues, we were running a maintenance on our database,
Starting point is 00:30:26 and one of my colleagues was like, I want you to look over my shoulder to see every command I type before I enter. Let's double check the exact thing. And then we'll hit enter on this production system. Like, yeah, I can relate to that. I think that's a good way to do it. That's a great way though,
Starting point is 00:30:40 having that person peer review you running those really important commands. Because that's the thing, like you check with others, but it's like it's that moment when you run it. Like that's like the moment that counts. And like, yeah, that's a great tip to just it's normal. I would do that. I would be like, hey, can you check that this is right? Like I'd usually send it in Slack.
Starting point is 00:30:56 I'm going to run this. Does this look good? Yeah, that looks good. Or like, oh, maybe you could improve it by changing it to be this. All right, cool. Now I'm going to run it. I've definitely said, hey, does this look good? Look good? Okay, yeah, that changing it to be this. All right, cool. Now I'm going to run it. I've definitely said, hey, does this look good? Look good?
Starting point is 00:31:06 Okay, yeah, that's good. Do it. All right, cool. And then you get, you know, more confident, but you don't want to be too confident that you make mistakes. So it's like always good to like actually just take that time to check it. And then, so that's like data-related outages. And there's plenty of others.
Starting point is 00:31:21 like from working on-prem, just a lot of different types of failures, caused by, you know, one was like core switches that took down half a data center. Do you remember details about this one where a core switch took down half the data center? Yeah, so, I mean, there were a lot of problems that I've worked on in the past where it could be just a configuration problem or an upgrade issue. Or, you know, I've worked on outages where there was a power failure within the data center, that happens too, right? And then, yeah, a lot of outages related to firmware as well, at that next level up. So I think like, you know,
Starting point is 00:32:06 the way that I always think about it now is like everything's going to break in all these different ways. And then you just have to really try and build in your failover mechanisms and think through like what would happen if this failed in this way. And it is really different like on-prem versus the cloud.
Starting point is 00:32:22 So at least you don't have to think through like, you know, core-prem versus the cloud. So at least you don't have to think through core switches with the cloud. That is definitely one big advantage of moving to the cloud. So I want to talk about some of the cloud stuff. Before we do that, you mentioned when you went to Dropbox, one of the first responsibilities was to kind of harden these systems. Now, one of the ways, as you mentioned, to do that is you identify these failures or you inject failures and do those or do these experiments in a more controlled
Starting point is 00:32:51 environment. Personally, I feel like when it comes to orchestrating stateless systems, we have gotten much farther as an industry. But when it comes to stateful, everything is so specific. So even when you're doing these like chaos engineering experiments on stateful systems to identify these failures, what kind of challenges did you see in the past or what kind of challenges do you see today to kind of run those on stateful systems? Yeah. So definitely agree there. Like one way I'm thinking about what to run as our first, you know, chaos engineering experiments for the stateful systems so for us it was like you know thousands of my sql machines and then also um some proxy machines too some like host running a proxy for the database cache like memcache as well
Starting point is 00:33:37 that was like the main area we started we just like i think a good thing to do is to get into a room and brainstorm like different types of experiments you could run like and using the scientific method like what is our hypothesis what kind of failure do we want to inject what do we expect is going to happen and then like we ran it on staging first like we didn't start in production we gradually worked towards production it didn't take us that long but like it was good to start in staging for sure and um the first ones that we decided to run were actually like process killer which is like actually pretty advanced like type of attack like not a lot of people use those um like now that i work at gremlin i see tons of people using you
Starting point is 00:34:16 know, practicing chaos engineering all over the world. And process killer, for us, we were like, we really want to make sure that, and it's very much based on our specific circumstances, so that's why it's important to get in a room and talk it through. So we were like, we want to make sure that if mysqld dies, then the entire machine is reaped away. It's taken away and we get a fresh one. We don't want anything else to happen. That's exactly what we want to happen.
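A rough sketch of that hypothesis as a runnable experiment, with the discussion of the expected behaviour continuing below. The host name, the pkill-over-SSH step, and the shard health endpoint are all illustrative assumptions, not Dropbox's actual tooling.

```python
import subprocess
import time
import urllib.request

TARGET_HOST = "db-replica-07.example.com"  # hypothetical victim replica
# Hypothetical internal endpoint that reports whether the shard has its full
# set of healthy replicas; replace with whatever your own tooling exposes.
SHARD_HEALTH_URL = "http://db-admin.example.com/shards/users-7/healthy"
MAX_WAIT_SECONDS = 1800

def shard_fully_healthy():
    try:
        with urllib.request.urlopen(SHARD_HEALTH_URL, timeout=2) as resp:
            return resp.read().strip() == b"true"
    except Exception:
        return False

# Hypothesis: if mysqld dies, the machine is reaped and a pre-built replacement
# from the free pool joins the shard, restoring full health well within our
# target, and the time should not depend on which day of the week we run this.
subprocess.run(["ssh", TARGET_HOST, "sudo", "pkill", "-9", "mysqld"], check=True)
start = time.time()

# First wait for the health check to notice the dead replica ...
while shard_fully_healthy():
    time.sleep(5)
print(f"Shard degraded after {time.time() - start:.0f}s")

# ... then time how long until the replacement restores full health.
while not shard_fully_healthy():
    if time.time() - start > MAX_WAIT_SECONDS:
        raise SystemExit("Replacement took too long, halt the experiment and investigate")
    time.sleep(10)
print(f"Shard back to full health after {time.time() - start:.0f}s")
```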
Starting point is 00:34:41 Like give us a fresh new machine. It should already be in the free pool of machines. So it should already be pre-built and then it should just go into the cluster and everything should be great. Like, that's what we want to test. And then we were like, we've sort of, we sort of have a feeling that sometimes that process doesn't happen as fast on certain days of the week due to like networking load. So we were like, let's do it on Monday morning versus like Friday night and then like see if it changes and it did change we like saw it sometimes it would be super fast other times really long so you want to like know all of that right then we
Starting point is 00:35:14 started to take um much more detailed metrics of like how long does it take for us to replace machines if the process dies and like you know this was like a really detailed project that went on for like several months and then we just kept getting better. But that whole time we were like injecting the failure to learn from it and then making improvements, going to talk to the networking team, sharing our data and results with them, figuring out what we could do to make it faster. We realized at one point we were being throttled by the networking team you know there's like qos you can like pick who gets what traffic and like just some tiny portion of um throttling was happening but not at a large scale so we like resolve that so yeah there's
Starting point is 00:35:56 just a it's like you got to be in that level of detail like i'm gonna be you know comfortable climbing into that detail and figuring things out. Yeah, yeah, totally. So one thing that engineers like is numbers and metrics. Yeah. Once you did all these experiments, of course, it would change over time, but over like the first span of the six months to a year, were you able to identify some of the low-hanging fruits
Starting point is 00:36:23 to say, we caught these in staging or in production and yeah these were just ticking bombs that would have taken down the system yeah so it's actually really interesting i love that topic so yeah i'm really big fan of metrics like everyone is and um one thing that i really loved when i joined dropbox was that we had these automated um metric emails that got sent And so I think you do it at LinkedIn is all I've heard that, that like every day there's like automated emails that get sent out with top metrics for the systems. Yeah. I think that's awesome. Like I hadn't seen that anywhere else that I'd worked, but it's also for us, it was color coded. It would be like red
Starting point is 00:36:57 if it was below expected and green, if it was like above expected. And it's like very easy for everyone to just sign up and get those emails for every system that's critical. And then that's cool because everyone sees your system getting better. Like your system starts to be like super green and it used to be really red and like, hey, you guys like fixing things over there? What's going on? So that's a really cool way to do it. It's like really nice. And the other thing was just like, you know, I'm actually big on creating presentations, like based on your data and trying to tell a story around it. Like, hey, we identify these problems. This is the data set that we first looked at.
Starting point is 00:37:32 You know, we identified that these would be key areas to fix. And then this is what we did to fix it. And these are our results like afterwards, like telling that story. And that really helped us a lot to get like buy in. And other teams then were like can you show us how to do this and then we started to help them and then we built tools for them to do it themselves so it was more self-service like we built a a dashboard called scout that was like an internal tool so any engineer across the company could add their like pager duty service id and
Starting point is 00:38:01 then see the metrics for their incidents. But I'm actually, like, a little bit different, in that the way that I like to pick what problems to solve is, I'm like, let's go after the big problems, the big fish, the ones where it's like, if we fix this problem then it's going to knock out like 80% of our issues, you know, thinking about it from the Pareto principle, that sort of 80/20 rule. And that's always scary to folks a lot of the time, but I think it's more fun. And I'm into extreme sports too. So it's like, let's go for the big one. This one's really bad, let's fix that. And they're like, oh. And it's also, you know, that system that people are scared of that no one
Starting point is 00:38:39 wants to write code for because no one's written code for it for like 10 years. And they're like, oh my goodness. But yeah, I'm like, let's do yeah and then just we would do that we would be like let's go after that system or let's decommission that like flaky thing get rid of it and it feels so awesome when you do it you know yeah absolutely i mean once you get past all of that you do it's like a threaded background of your hair has just gone away now all this mental capacity has been freed up yep totally you don't have to be thinking about it at all you're like we just removed this really bad part of the system and it's gone like it feels so much better like i imagine it i think about it a lot like say you live on this great street and there's one house it's like super ugly and like
Starting point is 00:39:19 real smelly and bad and no one wants to go in there it's like that's like some of our systems what uh extreme sports do you do you're australians i feel like that already sets the bar kind of yeah yeah when you say it's extreme sports so i i like lots of stuff like yeah definitely you know everyone's into surfing that's not so extreme but definitely love skateboarding and i used to go in like skateboarding competitions and got like sponsored as well so yeah i love that wow yeah it's really fun i've been doing that since i was little and snowboarding love that too um dirt bikes mountain bikes like bmx like i love like actually bmx jumping that's like super fun but also like very painful when you fall it's like the most painful but also so fun to fly through the air
Starting point is 00:40:12 so it's like a trade-off i would argue that surfing is quite extreme especially when you got sharks in the water so yeah that's so funny i i've never had a shark issue like i've been in the water when there was sharks but i've never like you know and i've just gotten out it was okay like my worst um surfing injury was like one time i was riding a wave and then got like right to the end of it and got you know sort of toppled over and my board went flying up in the air and like came right down my foot and it was like so bad like it like almost broke my foot. That's painful. It was so painful, but it sounds so funny.
Starting point is 00:40:48 Cause like, you don't think that sort of thing happened to me. Well, you're very brave, Tammy, is what I would say. Go ahead, go ahead. I was going to say, I always invite everyone to come along too. So yeah, if you ever want to go surfing, snowboarding, skateboarding, we can give it a go. I'll need some pep talk before we actually start doing that. I love it. I imagine some of your kind of interest in the extreme sports would also help with uh kind of preparing you mentally for
Starting point is 00:41:25 dealing with production systems and things which are scary yeah that's something a lot of people don't understand and they think that it seems weird when i tell them like visualize your system like i'm like visualize it before you go on call like just think through what could happen but it's exactly right um you know because yeah like they a lot of athletes do that right like if you read yeah like basketball athletes and stuff like that professional folks they'll visualize themselves like getting you know the ball in the hoop and like being like bam like i did it before they do it and there's like really good research that shows that helps you do it and definitely you're going to visualize yourself doing a skateboard trick before you do it
Starting point is 00:42:04 like you spend a lot of time we call it like you know you're going to visualize yourself doing a skateboard trick before you do it. Like you spend a lot of time, we call it like, you know, you're like amping yourself up, getting ready for the trick and thinking through all the details. Like, where will I put my foot? How fast will I like move my foot? What direction? Like, you know, all these things. What's the wind like? Like everything. And so if you do that with systems, it makes it a lot easier as well. And it doesn't take that long, right? It's like, like you said, even with the example of what am I going to type when I'm running this command, like just taking the time to be patient and think it through. That's way better than like, just like, you know, running off on your skateboard off the edge of the
Starting point is 00:42:39 steps and be like, good luck, hope it works out. Like, it's probably not going to work out. Well, hope is not a strategy, right? No, exactly. No, not at all. So you mentioned running chaos engineering experiments like science experiments of sorts where you have this hypothesis, you kind of think about the failure you want to inject,
Starting point is 00:42:58 and then you have kind of an expectation of what should happen. Yep. Have there been instances where your hypothesis kind of an expectation of what should happen. Yep. Have there been instances where your hypothesis kind of went sideways where you're like, I think this is what should happen, but a completely unrelated different thing
Starting point is 00:43:14 happened to the system? Yeah, like that actually, it didn't used to happen to me until more recently. So yeah, like when I was at Dropbox, it was always pretty much like, sometimes we would learn something new, but it was more like the detail of how we could fix that problem in particular like related to proxy chaos engineering or like had different types of failure modes if you did like a hard shutdown or a really slow shutdown like non-graceful um like hanging threads stuff like that but when i
Starting point is 00:43:41 was when i've been doing it recently like over over the last few years, especially, you know, with Kubernetes and on the cloud, definitely like seeing unexpected things. And a lot of it I think is around, you know, one is dependency analysis, like you said, because there's a lot more complexity. So it's like, well, I have containers, how many containers inside each pod, and then I have the pods, and then I have like all the orchestration on top of it. And then I've got multiple nodes and even just doing something simple, like saying, if I fail this service, I expect that this other service will fail, but I think everything else should be good.
Starting point is 00:44:13 That's like a really hard to guess now. Like, you know, you're really like guessing a lot of the time because it's a super complex system. And I used to like do a thing where I would print out the code to like read the code of older systems I worked on that are more like monolith and you know you're learning about a specific area and it was easier to get how things connected but now I think with distributed systems and you know containerization it's like way harder so pretty much every time something unexpected
Starting point is 00:44:42 happens, like, why is that failing? Oh my goodness. Okay, let's look at the code and see why this is a hard-coded dependency, or why is there a problem between these two services? Yeah. Yeah, like microservices just makes this entire graph super hard. I know. It's like a really complicated graph. That's, if you Google microservices death star, it's like that. I haven't Googled it, but I'm definitely going to do that after this chat.
Starting point is 00:45:14 Yeah, that's exactly what it feels like. When you look at that diagram, you're going to be like, oh my goodness. It's just like trying to draw the architecture diagram for microservices is like horrible. Oh yeah. Well, it's a work of art at the end right that's a nice way to say it i like that especially if you picked cool colors or something yeah exactly uh so do you remember any of the weird things uh any of the latest weird things you discovered in one of these experiments yeah so one um actually has been around so just like definitely around dependency analysis and i think like there's
Starting point is 00:45:46 something that we do a lot of work on when we're doing tests so we're always trying to think through like what are new types of chaos engineering attacks that we should create or run and i have like um a list of like 60 plus that are like on my list of things that would be cool to have available um and so then with that i'm always trying to think of like on my list of things that would be cool to have available. And so then with that, I'm always trying to think of like, what different types of failure will I inject? But I think like, lately, the ones that are most interesting to me is like, sometimes I'll get to work on systems that are payments related. And that to me is really interesting when you see different failure modes there. But yeah, like that's going to be really different depending on different people, what they're doing and how they're processing
Starting point is 00:46:29 their payments. But I feel like lately that's like just an interesting area, but it's probably because I worked in banking too. But if you think about it too, like you go, okay, what's going to happen with payments, how that could fail. There's all these different things. It's like the shopping cart process, the payment processing, sending the information back. Then it depends too if you're doing this via like credit card processing or something like PayPal or Stripe. And if you've ever worked on like PayPal-related like checkout reliability issues, like PayPal has so many error codes.
Starting point is 00:47:00 Like this is like a thing. Like when you've worked on that, you're like, whoa. I haven't. Yeah, you've got to that you're like whoa like i haven't yeah you gotta look up like paypal error codes this is like a huge like encyclopedia of error codes that you need to handle oh wow and they all mean different things like person has not enough balance or person's paypal isn't working right now or person's paypal has been locked or all these different stuff and so then your system might not let people process anything and then you could have like yeah just different outages related to payment providers so i think to me that's but that's like
Starting point is 00:47:30 an area that I'm interested in in particular. Yeah, just all the payments failures. Makes sense. And with different payment providers, I assume it gets more complex because the error codes are different. Yeah, exactly. And then you have to go out to a lot of third parties related to processing different types of payments and transactions, so you have to learn about all those different companies and what kind of information they give back. So yeah, that's a whole interesting world that you learn about. And all those systems have to work crazy fast, right? That's why I think it's fun. It's like, wow, you're trying to process this so the user doesn't realize. And now it's like, whoa, sending that information everywhere. Yeah.
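To illustrate why those provider error catalogues matter for checkout reliability, here is a small sketch of classifying error codes into retryable, user-actionable, and hard failures. The code strings are made-up placeholders, not real PayPal or Stripe codes.

```python
from enum import Enum

class Outcome(Enum):
    RETRY = "retry later with backoff"
    ASK_USER = "show the customer a message they can act on"
    FAIL = "fail the payment and try a fallback provider if one exists"

# Made-up provider error codes. Real providers publish long catalogues of
# these, and each one needs an explicit decision.
ERROR_POLICY = {
    "PROVIDER_UNAVAILABLE": Outcome.RETRY,
    "RATE_LIMITED": Outcome.RETRY,
    "INSUFFICIENT_FUNDS": Outcome.ASK_USER,
    "ACCOUNT_LOCKED": Outcome.ASK_USER,
    "CURRENCY_NOT_SUPPORTED": Outcome.FAIL,
}

def handle_payment_error(code: str) -> Outcome:
    # Unknown codes default to a hard failure so they get noticed and triaged
    # instead of being retried forever.
    return ERROR_POLICY.get(code, Outcome.FAIL)

if __name__ == "__main__":
    for code in ("RATE_LIMITED", "INSUFFICIENT_FUNDS", "SOMETHING_NEW"):
        print(code, "->", handle_payment_error(code).value)
```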
Starting point is 00:48:09 Yeah. So we talked about Kubernetes a little bit. What are some of the patterns that you're seeing or some of the ways to inject failure in Kubernetes? I've seen some of your blogs around like doing site reliability engineering for Kubernetes or some of the common failures that you referred to before. What are the recent patterns that you're seeing that people could use to inject failures? Yeah, lately I've actually seen a lot of folks are using AKS.
Starting point is 00:48:37 So this is more and more popular now. I've seen a huge spike in people using that. A lot of our customers use AKS. So I think that's really cool, and a lot of the time folks will start on things that maybe sound simple but really aren't, like region failover: making sure that if one region goes away, you know what happens. My biggest tip there when folks are starting to do it, though, is that a lot of folks will go, well, I'll just shut down my cluster and see if it fails over
Starting point is 00:49:05 to the other one. I'll just shut down my nodes. But what I like to do is say, well, at Gremlin we have an attack called a black hole attack, and what that does is make something unavailable. It's a networking attack. Instead of tearing down a whole cluster or a whole region or a ton of machines and then having to build them back up again, which is really time consuming and can also cause a lot of unnecessary issues, because taking something down and bringing it back up is adding more opportunities for failure to happen, you just do a black hole and say, oh, let's make this not available for a period of time, whatever it is: the pod, the node, the whole cluster, and then turn it back on.
Starting point is 00:49:45 And you can do it for, like, 60 seconds or 30 seconds, and it's just gone, and then it's back. So I like to recommend that. I think at the moment we're still at a point where what folks have to focus on is building out really good architecture with your clusters, and thinking about that too: how many regions am I in? What happens if one fails? And just the configuration of how you've got it all set up as well. Yeah, I'd say that's my main tip. Once we're more advanced, then I'll come back and be like, okay, now let's get ready for this.
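For anyone who wants to try the "make it unavailable, then bring it back" idea without a dedicated tool, here is a rough sketch that approximates a black hole on a set of pods by applying a deny-all NetworkPolicy for a short window and then removing it. This is not how Gremlin's black hole attack works under the hood; the namespace and labels are placeholders, and it assumes the cluster's network plugin actually enforces NetworkPolicy.

```python
# Sketch: briefly "black hole" a set of pods by applying a deny-all NetworkPolicy,
# then removing it. Namespace and label selector are placeholders.
# Requires: pip install kubernetes, and a CNI that enforces NetworkPolicy.

import time
from kubernetes import client, config


def black_hole(namespace: str, match_labels: dict, seconds: int = 60) -> None:
    config.load_kube_config()  # or config.load_incluster_config() when running in a pod
    api = client.NetworkingV1Api()

    policy = client.V1NetworkPolicy(
        metadata=client.V1ObjectMeta(name="chaos-black-hole"),
        spec=client.V1NetworkPolicySpec(
            pod_selector=client.V1LabelSelector(match_labels=match_labels),
            policy_types=["Ingress", "Egress"],
            ingress=[],  # no ingress rules: all inbound traffic to the selected pods is denied
            egress=[],   # no egress rules: all outbound traffic from them is denied too
        ),
    )

    api.create_namespaced_network_policy(namespace, policy)
    try:
        time.sleep(seconds)  # the pods stay unreachable for the attack window
    finally:
        # Always clean up, even if interrupted, so the "outage" ends.
        api.delete_namespaced_network_policy("chaos-black-hole", namespace)


if __name__ == "__main__":
    black_hole("staging", {"app": "checkout"}, seconds=60)
```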
Starting point is 00:50:16 Nice. So, I mean, my introduction to chaos engineering was very similar to Guang's actually, when I saw Chaos Monkey. And I saw that CPU exhaustion is like the hello world of chaos engineering experiments. But as you see teams maturing in their chaos engineering practices, what are some of the sophisticated practices that you've seen in the industry? Yeah. So I'd say one of the interesting things is where folks get started with chaos engineering and thinking through the use cases. The first one that a lot of folks start with is validating monitoring and alerting,
Starting point is 00:50:54 which makes a lot of sense, right? Kind of like a smoke test. So do a CPU attack, check that it fires, that you actually can see it in your monitoring, say in your dashboards, and then also validate monitoring and alerting for when an alert needs to fire because you've breached an SLO. And I'm seeing a lot of people, which I think is cool, do aspirational SLOs
Starting point is 00:51:17 for new services that they're building. I think that's a great thing to do. And then being like, let's set up the alerts for them, let's validate that that works, that we actually get a page based on that. I think that's really different to what people were talking about in the past, but it's really connecting the dots between a lot of things that folks are focusing on, like SLOs and SLIs. And I like the idea of doing it all really
Starting point is 00:51:40 automated. So at Gremlin we built something called Status Checks, which enables you to first actually see, okay, what is my monitoring at now? What is the current level for my system, for whatever you're looking at, if it's a specific resource limit or something like that; then run an attack in an automated way if that first check is okay; and then check back again and see if an alert fired or everything's still good, and if it's great, progress even more. So that's really cool. That is pretty cool, actually. Yeah, and it's different, right? It's like, let's automate this, let's tie the monitoring and alerting into the automation. So it's check first, run it, check again, yep, all good, run it again, check.
Starting point is 00:52:26 And then you can just have that running on a cycle, and if it fails, then you know, hey, something broke, something unusual happened that we weren't expecting. You're trying to see how far you can actually push the system? Exactly, yeah, really push it. That's a good way to think about it, like stress testing. Sometimes people will think about chaos engineering like that, and I'm like, yeah, that makes sense, you are trying to stress test your system.
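As a rough illustration of that check, attack, check loop (a sketch, not Gremlin's actual Status Checks feature), the snippet below queries Prometheus for currently firing alerts, only runs the attack if nothing is firing, and checks again afterwards. The Prometheus URL is a placeholder and run_cpu_attack() is a stub standing in for whatever failure-injection tool you use.

```python
# Sketch of a check -> attack -> check loop, assuming a reachable Prometheus server.
# pip install requests

import time
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # placeholder


def query(promql: str) -> float:
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0


def firing_alerts() -> float:
    # ALERTS{alertstate="firing"} is the standard Prometheus metric for active alerts.
    return query('sum(ALERTS{alertstate="firing"})')


def run_cpu_attack() -> None:
    # Placeholder: call your chaos tool of choice here (CLI or API).
    print("(pretend a CPU attack is running here)")


def check_attack_check() -> bool:
    if firing_alerts() > 0:
        print("Alerts already firing - skipping the attack.")
        return False
    run_cpu_attack()
    time.sleep(120)  # give the monitoring system time to evaluate its alert rules
    if firing_alerts() > 0:
        print("Something fired during the attack - stop and investigate.")
        return False
    print("Baseline healthy, attack ran, still healthy - safe to escalate.")
    return True


if __name__ == "__main__":
    check_attack_check()
```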
Starting point is 00:53:05 The other thing I like to see lately is a lot of integrating chaos engineering into the CI/CD pipelines. Lots of folks are still using Jenkins, that's still really popular, and I'm seeing them run these attacks: you deploy your code to staging, then automatically run something like a chaos gauntlet, or sometimes folks call it a reliability blueprint, which is a set of scenarios that every piece of code, like a new service, needs to pass. And it's nice. It's super automated. I pass my reliability blueprint, now I'm good and I can go to production. And yeah, I think that's cool.
Starting point is 00:53:24 And if you didn't pass, then you know why, right? It's not a mystery to you. You can just go and fix it and then be like, yeah, cool, now I should pass; do it again, pass, great, off we go. Yeah, that's very interesting. Can you share some of the attributes of what this reliability gauntlet would look like?
Starting point is 00:53:39 Yeah, sure. So something that we focused a lot on over probably the last year was this idea: okay, we have the idea of an attack, so say an attack is a process killer, or spiking CPU, or shutting down a node or a pod. Then what you want to do is think through the scientific method: what is my hypothesis, what is the attack or the failure I'm going to inject, what do I expect to see happen after? So what we did at Gremlin is build something called Scenarios, where you can have one or more attacks, and one or more attack types, in a single scenario. So it can get pretty complicated, and some of
Starting point is 00:54:15 our customers have built it out so that there are over a hundred scenarios that have to pass, because they're really complicated systems where they have to meet a ton of compliance requirements. Yeah, that is significant. Yeah, if it's a bank or a finance company, they also have to prove that their code and their services pass those scenarios; now it's a check that they have to get through. So that's probably the most advanced, complex style of what I've seen. And then what I also think is great is when folks are getting started, before they get to that point, they're doing something like, say, let's figure out 10 scenarios that we want to create, that we should be able to pass, and that we want to make sure we can run with everything going well.
Starting point is 00:55:01 And ideally everything just passes all of those and it's just a check that's in place, and it's just running. But a lot of the time it doesn't pass, right? And then you're like, good, we caught this already. Yeah. And it could be, let's black hole a service, that's an example, right? Let's make this service unavailable, a third-party dependency. Does our whole system crash? Yeah, a lot of the time it does. It does. You want to catch that before you go to prod.
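Here is one way a small gauntlet like that could be wired into a pipeline step. It's a hypothetical sketch, not Gremlin's Scenarios feature or any particular vendor's API: each scenario is just an inject-failure callable plus a health check against staging, and the process exits non-zero if anything fails, which is what lets a Jenkins (or any CI) job block the deploy.

```python
# Hypothetical "reliability gauntlet" runner for a CI step.
# The inject/check/cleanup functions are placeholders; the exit code drives pass/fail in CI.

import sys
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scenario:
    name: str
    inject: Callable[[], None]   # start the failure (black hole a dependency, kill a process, ...)
    check: Callable[[], bool]    # does the service still behave acceptably?
    cleanup: Callable[[], None]  # undo the failure


def run_gauntlet(scenarios: List[Scenario]) -> bool:
    all_passed = True
    for s in scenarios:
        s.inject()
        try:
            passed = s.check()
        finally:
            s.cleanup()  # always restore, even if the check blew up
        print(f"{'PASS' if passed else 'FAIL'}: {s.name}")
        all_passed = all_passed and passed
    return all_passed


if __name__ == "__main__":
    # Placeholder scenario: "payments provider unavailable, checkout should degrade gracefully".
    demo = Scenario(
        name="payments provider unavailable",
        inject=lambda: print("(pretend the payments dependency is black-holed)"),
        check=lambda: True,  # replace with a real probe against staging
        cleanup=lambda: print("(dependency restored)"),
    )
    sys.exit(0 if run_gauntlet([demo]) else 1)
```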
Starting point is 00:55:15 Yeah, absolutely. So for teams who want to adopt or start doing chaos engineering,
Starting point is 00:55:31 apart from just building good tooling and trying to make the systems more resilient, it also requires cultural buy-in. Teams need to buy into the entire idea of breaking things on purpose to try and make them more resilient. So for teams who are early in their journey of testing their systems like this, do you have any words of advice? Yeah, definitely. There are three areas to focus on. The first one to me is helping educate folks on what chaos engineering is, and sometimes what reliability engineering is, like what even is SRE. And you've got to do that first
Starting point is 00:56:12 so that people understand why we're doing this work. Then the next step to me is trying to figure out how we move towards a culture of reliability where we're injecting failure. And a lot of the time for that, I would say you want to think about: are we going to do this as a centralized team that's doing all of this chaos engineering work, or do we want to do it more self-service, where we just make it available to everybody? Those are two really different strategies. If you pick centralized, you've got more control over it; you can help guide folks and you can do the work. But if you're going with self-service, then I think what you want to do is really think through
Starting point is 00:56:49 other things: you need to build a wiki, and you need to make a lot of things much more accessible and available to folks, like little videos or tutorials to get your team members ramped up and started. Yeah, that's my main tip there, I'd say. Have you seen differences in making it more self-service versus kind of running it centrally? Yeah, definitely. I've seen lately a lot of folks moving to more of a self-service practice of chaos engineering, especially when they're integrating it into CI/CD. So it's like, we want you to pass these scenarios; if you don't pass, here's how you can rerun it yourself on your service and make sure that you understand why your service isn't passing and what's not working well. Because, you know, I worked on a build team for a while.
Starting point is 00:57:34 And a lot of the time people were like, why didn't my code pass? I'm so mad that it didn't pass. It's like, okay, well, we've got to figure out why; that's diving into it. But sometimes it's not clear, because the tests aren't clear. What does this test even test for? Who wrote this test? I always think through: should this test even be here? Is this a really old test? Is it relevant anymore?
Starting point is 00:57:53 There are just a lot of issues with that. So with this, you want to make it as easy as possible for people to reproduce the test that you asked them to pass, and then to understand how to fix that issue. That's what this whole idea of self-service is: get out of the way, give people tools, give them education, and enable them to uplift and build more reliable services themselves. But you have to continuously be guiding them, so there has to be that team that's doing the work to help figure out: what are the scenarios we need everyone to pass?
Starting point is 00:58:26 How have they changed? Like what new types of failure do we want to inject? Like what new systems are we going to be using in the next few months? Doing more like high level strategy work too. Yeah. Yeah, that makes sense. We're getting close to the end of the chat, but this is something that's super cool that I wanted to make sure we touch on. So we saw that you were the co-founder
Starting point is 00:58:48 of Girl Geek Academy, where the goal, I think, is to teach 1 million girls technical skills by 2025. We would love to learn more about it. How did all this get started? So yeah, this is a really fun thing. I started off doing this work in Australia a really long time ago, while I was in university studying computing.
Starting point is 00:59:12 My lecturer, who was really cool, the head of the computer science faculty, Ruth, said, hey Tammy, do you want to help more girls study technology at university? I was like, yeah, that sounds fun. And so then she gave me a project to run kind of a day at university for high school students. I just thought it was really cool. I'd never thought of doing that before, but it was really fun. It felt great to help them all learn, and they had a great day. And so then when I moved to Melbourne, I started to go to some meetups, like there's one group called Girl Geek Dinners. And I liked that it was fun to meet other women that were in tech, but I'm super nerdy. So I was like, I want to do hands-on stuff. I want to build things. I want to learn new languages or new technologies or new platforms, whatever it is.
Starting point is 00:59:59 And then I asked some friends, like, what do you reckon? Should we build our own group and make it more about workshops and hands-on stuff, and do hackathons, and do whatever we want to do? My friend's like, yeah, I want to do 3D printing. I'm like, sweet, let's do it. And so we've done some super cool stuff. We had one weekend that was a maker sort of weekend, and we had a 3D scanning machine that scans your whole body, and then you could print yourself out. We were doing really fun stuff like that. But yeah, we were just like,
Starting point is 01:00:29 let's do whatever we want to do, why not, there are no rules here. So that's why we created it, and we've helped so many women and girls over the last few years, all over the world as well, so it's been really cool. And we've worked with a lot of great companies too. Microsoft has been super supportive of our work; we do a lot of workshops and classes with them. If you go to Girl Geek Academy, there are some Microsoft partner classes that are coming up really soon, actually. And so, yeah, it's just a ton of fun. I've met so many awesome people through that. That's awesome.
Starting point is 01:01:04 Do you find it difficult to balance? Is there a lot of work to balance it with the full-time job, or...? No. So yeah, Girl Geek Academy has a full-time CEO, Sarah, my friend. We asked her, hey Sarah, will you be the CEO? She was like, yep. And we got funding from the Australian government to run it; our first grant was a million dollars. So it's really cool. The Australian government really cares about this and was happy to back it with money.
Starting point is 01:01:31 So, yep, she's been able to do that for the last six years. It's a full-time job, and then I help out, but I don't really have to do that much work. So it's good too, you know. It's kind of like my tip: if you have some things you want to do, think of what your ideal dream life looks like, visualize it first, like we talked about, and then just make it happen. Yeah, that's it. And then you do it. That's awesome. So the fun question that we'd like to end on is: what was the last tool that you discovered and
Starting point is 01:02:00 really liked? Oh, that's cool. So lately, the main thing I've been looking at actually is load testing tools. I'm currently looking at a lot of different tools, like Gatling, Neotys NeoLoad, Locust, and JMeter, kind of comparing things. Gatling is very popular and a lot of people like that. And it's interesting too, because the reason to look at load testing is that you obviously want to make sure you can simulate load on your other environments, like staging, when you're doing your chaos engineering work. So yeah, I'd check out those different tools. JMeter has been around for a really long time, but then there are newer things, like NeoLoad, that are becoming more and more popular. So yeah, that's what I'd say: check those out.
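Of the tools mentioned, Locust is the easiest to show in a few lines, since the load tests are plain Python. A minimal locustfile might look like this; the endpoints are placeholders, and the idea is to point it at a staging environment to generate background load while a chaos experiment runs.

```python
# Minimal Locust load test (pip install locust), run with:
#   locust -f locustfile.py --host https://staging.example.com
# The endpoints below are placeholders for whatever your service exposes.

from locust import HttpUser, task, between


class CheckoutUser(HttpUser):
    wait_time = between(1, 3)  # each simulated user pauses 1-3 seconds between tasks

    @task(3)
    def browse(self):
        self.client.get("/products")

    @task(1)
    def checkout(self):
        self.client.post("/checkout", json={"item_id": 123, "quantity": 1})
```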
Starting point is 01:02:49 Awesome. And anything else you'd like to share with our listeners? Thanks so much for listening, and yeah, if you're interested in SRE or chaos engineering, you can find me on Twitter; I'm always happy to answer questions. My Twitter handle is TammyXBrian. So yeah, that's me. Awesome. Thank you so much, Tammy, for taking the time. Really appreciate it. Thanks for having me.
Starting point is 01:03:12 It was fun. Hey, thank you so much for listening to the show. You can subscribe wherever you get your podcasts and learn more about us at softwaremisadventures.com. You can also write to us at hello at softwaremisadventures.com. We would love to hear from you. Until next time, take care.
