PurePerformance - Chaos Engineering Stories that could have prevented a global pandemic

Episode Date: January 25, 2021

Nobody foresaw the global pandemic that threw so much chaos into all our lives recently. Let's just hope we learn from 2020 to better prepare for whatever comes next. The same preparation and learning also goes for chaos in the distributed systems that power our digital lives. To learn from those stories and better prepare for common resiliency issues we brought back Ana Medina (@ana_m_medina), Chaos Engineer at Gremlin. As a follow-up to our previous podcast with Ana, she shares several stories from her chaos engineering engagements across different industries such as finance, eCommerce, and travel. Definitely worth listening in, as chaos engineering was also put into the Top 5 technologies to look into in 2021 by the CNCF.
https://twitter.com/Ana_M_Medina
https://www.spreaker.com/user/pureperformance/why-you-should-look-into-chaos-engineeri
https://twitter.com/CloudNativeFdn/status/1329863326428499971

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have with me my harbinger of good and bad things and mediocre things and anything just period to come. Andy Grabner, did you know that you're the generic all-purpose harbinger of the year 2021? I have no idea what a harbinger is. What is a harbinger? Great, now put me on the spot. I mean, I know the word, now I've got to define it formally. A harbinger, like, it's usually someone who brings things on, like a bringer of bad news. I mean, this is where everyone listening is going to be like, what's wrong with this guy? You know, not quite the same
Starting point is 00:01:06 as an omen, but a harbinger is a person or an event or something, right? Okay. That would indicate what is going to be happening. Okay.
Starting point is 00:01:15 Sort of. So, like a moderator, like a bringer or whatever. Yeah, it's okay. Maybe. Well, then say bringer and don't use these strange words
Starting point is 00:01:23 that non-native speakers have no clue about. But I also learned something new. I don't think that's an English word, though. It's probably not even an English word. How do I know? You know what? Let's skip the subject of the entire episode and just discuss the etymology of…
Starting point is 00:01:38 Yeah, let's bring somebody in, right? If we have a bringer, if I'm the bringer, I'll bring somebody in. But that would throw a monkey wrench into my plans that would that's not what i've prepared for i don't know what will happen if we bring somebody in andy what do you mean like you feel like we're in a chaotic situation now yeah suddenly what are we gonna do i don't know we'll we probably need to learn how to react to chaotic situations and in case they come up again in the future we don't freak out, but do everything gracefully and that everything stays stable and healthy.
Starting point is 00:02:10 And now we start. You're such a harbinger of good ideas. I see. All right. Let's cut to the chase. We obviously talk about chaotic stories and situations. It was a chaotic year, 2020, that we left behind. Obviously, at the time of the recording, we still have two weeks to go,
Starting point is 00:02:29 so we don't know what's going to happen. Assuming everything is fine, it is 2021. It's going to be a bright future for us. But we had a chaotic year behind us, and hopefully we all learn from chaotic situations how we dealt with it so that we make it better. And therefore, we have the person that we had on a recent podcast with us again. Anna, hola, como estas?
Starting point is 00:02:57 Hola, really excited to be back. Thanks for bringing me back. Now I'm utterly scared for the next two weeks of 2020. Let me just say now I'm like, where can I knock on wood to make sure that my next two weeks go really well and my year doesn't get more chaotic than everything we've seen so far well you know what you should do in 2020 you just self-quarantine and don't go out of the out of your flat so everything will be good you have to also unplug the internet i like that idea i really like being in my house disconnected from the world for two more weeks and make sure that i don't blow anything
Starting point is 00:03:32 up and that my life doesn't get more chaotic just don't take that approach by going to some cabin in the woods and cutting yourself off because usually when you when you do that nothing good comes of that if movies are to teach us anything. It's funny you say that because I totally looked at Airbnb places for like two weeks. I am going on holiday break soon. And I was like, this is too expensive. I can't justify dropping this much money. But yeah, scary movies teach me not to do that because then things end up not well.
Starting point is 00:04:03 And I want to see this this podcast episode launch great so last time we had you on so anna for the go on eddie yep so um i think the last time we had you on sorry to interrupt you here uh we had a podcast it was called why you should look into chaos engineering and i remember that you know in the end, I think we said, well, we would have so many more things to talk about, especially stories that you have collected over the years. Because, I mean, you have an amazing history as a site reliability engineer, as a chaos engineer.
Starting point is 00:04:37 You worked at Uber and other companies and helped them build reliable systems. And then you said, hey, Andy, Brian, I have a lot of stories. So now I want to kind of play story roulette. So we'll pick a story or we'll pick an industry, and then we'll see what story you have in store for us. And then we want to learn from these chaotic situations that you have experienced.
Starting point is 00:04:59 And yeah, actually to learn what we can learn from these stories. That's perfect. Yeah, it sounds perfect. It makes me think that I should change my title from chaos engineer to chaos storyteller. So I'm liking this. Yeah. Now, I know you gave us a couple of, let's say, hints on industry and whether this chaotic situation was related to, let's say, the monitoring
Starting point is 00:05:26 solution not working or whether you were actually using chaos engineering and load testing, which is a topic that Brian and I really love. And I think I would actually like to start with this one. So you mentioned something about chaos engineering and load testing. And just fill us in on the story can as much background as you can give us and then we'll see what brian and i can learn and obviously the listeners can learn from that yes um perfect so this is a finance company it's a customer of ours i can't name drop sadly but this company has been working on launching a new product and they basically have to have like those five, nine,
Starting point is 00:06:06 six lines of uptime. And it's almost like their leadership team has said they do need that a hundred percent uptime. And what they do in order to protect, what they do in order to prepare for a product launch is that they do go ahead and they do a lot of load testing. They do chaos experiments. And that's kind of nice because that's kind of where we did touch on last episode, where doing these two things together is the best way for you to plan on capacity and for chaotic situations. So they wanted to basically go through their entire stack. They wanted to validate their monitoring was working. They wanted to make sure that their nodes were auto-scaling properly, that they can handle some latency happening from
Starting point is 00:06:52 their service to another service. They are using the cloud. So of course, making sure that if the S3 buckets went offline, that they were resilient to that. And lastly, they wanted to make sure they were resilient for region failovers and making sure that if an availability zone went down, their application was still usable. So the best thing about this is that they're not one of those companies or products that does it after they've launched. They actually do load testing and chaos engineering experiments while they're developing the product. And those are like my favorite type of customers because it's really making sure that you bring reliability to one of the four focuses of your launch. And that allows for you to be successful
Starting point is 00:07:37 and not scrambling the last sprint before launching. And it means that you're developing the product, ensuring that it's going to be reliable from launch date to the next six, one year of the product being around. And it was interesting because they found a lot of stuff. It allowed for them to really iron out some of the latencies that they were seeing between service to service, but they also had a mental model of the dependencies that they had in the service. And they didn't realize that when you go three to four levels deep, there was more dependencies around. So being able to just even make sure that you're mapping out that architecture diagrams as early as possible. Because when you think about a finance company that has five, nine, six nines of uptime, having that fourth dependency down is going to really affect that uptime. So it really means
Starting point is 00:08:38 that they have to sit down with other service teams and preach to them what the reliability of their new product is, but also make sure that those teams are ready to be reliable. So that means maybe that they went down to that team and it's like, wait, what chaos engineering experiments are y'all running? What type of load testing are you doing for this launch? And I really like that type of culture because one of those things is that chaos engineering is not just you're going to break things to make things better. You're changing the culture. You're embracing failure.
Starting point is 00:09:10 You're celebrating failure. And you're coming together as an organization to put reliability in the forefront. So now you're preparing for this launch maybe in four months, but you're bringing more teams together and saying, we're launching together. You're part of this. You're one of the dependencies to this launch. And I think in current times, that means a lot to people. You get to feel part of something, part of the organization that you're with. And especially since we're virtually, it's the little things that make work nice right
Starting point is 00:09:41 now. Hey, I got a question for you because this is fascinating because the whole thing with the number of dependencies and how deep in the dependency tree you go, you figure out things that you didn't know. But I have a question to you because it was brought up to me at a recent conference. I spoke at ESSER Recon, and I talked about top performance problems in distributed architectures. And I basically made that statement that basically said, the more dependencies you have, the more complexity you have in your architecture,
Starting point is 00:10:17 and therefore, you need to prepare better for it. It's going to be harder to build a resilient system. And then one of the questions that I got, i got i got challenged on this and he said well this is not true if on every service you are applying all of the best practices around reliability engineering like proper failover you know being resilient to if a depending service is actually no longer available and stuff like that and i agreed with that person i don't remember who it was but in the end i think would you just confirm and if i heard you correctly we always have a picture in mind of our architecture but the reality is often different that means there are new dependencies that we have not thought of somebody brought in a dependency
Starting point is 00:11:05 willingly or unwillingly, or let's say knowingly or unknowingly. And maybe they, if you don't know that you brought something in, you most likely have also forgotten to think about resiliency. What do you do when the service
Starting point is 00:11:18 is acting weird or is not available? And I think this is probably, this is a great exercise, what you're saying, right? Find your true dependency tree and then on every service level, educate people on how you build resiliency
Starting point is 00:11:31 into that service and all of its depending neighbors. Yeah, totally. And I'm really glad, like really awesome to hear that you spoke at SREcon. That's one of my favorite conferences around reliability.
Starting point is 00:11:44 And I don't know, like I feel like the more dependencies you have, the more complexity you are adding due to mathematics, like just in terms of doing the math of the amount of services that you're going to have or the amount of possible failures that you can have in terms of like distributed systems. That's the thing is like the systems are so complex that when you are trying to aggregate all the possible scenarios, like computers can't really compute that. So that's interesting. Like that is an interesting discussion topic of, wait, is it really more complex or is it just you're thinking that it's more complex, but not really, but it is that you need to make sure that every single dependency is following reliability guidelines. And it's funny that we kind of come
Starting point is 00:12:31 back to this topic because this is one of the works Uber got to do is that they were doing production ready checklist of you are a tier zero service that we need in order to take a trip. You need to have gone through this checklist. You need to make sure that you have all the things to be reliable from run books to monitoring tools, making sure on-call is set up properly. And I think it's interesting because sometimes you think that those things are done, but if the tool that you're using didn't build the right verification you also might be just seeing check marks and that's that nice thing of chaos engineering because you get to
Starting point is 00:13:11 bring it all together and truly verify what would happen if this would happen to my production system right now and you know what i want to interject here just because as you're saying this i can't get this thought out of my mind and it's going to be a little bit of a silly side note here but every as i'm hearing these stories as i'm hearing you know people saying you know you bring the unknowns in uh you have all these different systems with these dependencies that you have to make sure everything can handle when some of those dependencies break down. It was something the way Andy was describing it. It made me go back and think of the movie Alien.
Starting point is 00:13:50 And I'm not sure if you've all seen the movie recently or ever, but it's just, you know, the basic idea was these people got infected to a possible external organism and then they broke quarantine protocol and then everything else fell apart because they didn't have any backup protocols. But if you think of any movie, right right i think chaos theory or the lack of following chaos engineering is the reason why everything bad happens in every movie because someone makes a mistake and they have no other way to recover from the mistake of that you know you think of any typical horror movies like no you don't go and don't split up duh you know like
Starting point is 00:14:22 all those stupid things i just bring it up because it amuses me. It just popped in my head. I'm like, this is totally the same thing. And the other reason why I think it's relevant is just when I first learned about the idea of DevOps and, you know, reading the Phoenix project and the Unicorn project and the whole Toyota factory idea
Starting point is 00:14:40 of how to bring these fix faster, fail faster cycles to DevOps. My mind just started going into all these places of how that can apply to so many more things besides factories and IT. It can apply almost everywhere, you know? And I think the same thing for chaos. And that was just my sort of stupid little way of saying that the ideas of chaos apply outside of IT, outside of computers. But you could really start seeing it everywhere when you're aware of the concept, which is really kind of cool.
Starting point is 00:15:13 Brian, are you trying to say we could have prepared for 2020? To some extent. Well, you know, again, without going into the reality of what happened, I think there was some preparation for 2020 that was thrown out. But we will not go into that side of things. But, you know, to an extent, yeah, it all depends. And I think this is the challenge that all these organizations that we're hearing about, you know, the stories from today face, right? How much money and time do we want to put into preparing for these possibilities that what are the chances? I'm going through this right now.
Starting point is 00:15:49 I'm looking at my finances and possible things, planning for future, old age and all those kind of things. And we're looking at different insurance options and long-term care and life insurance and all these things. And you start questioning, okay, all these people want my money for these insurances for maybe a really tiny risk, right? And, but that decision process that I'm going through is the same thing that all these organizations are going through of how much time do we want to spend setting up and preparing
Starting point is 00:16:18 and maybe even onboarding a team to do this testing for something that with luck will never happen. But then again, when it does, you know, it's, it is the insurance game in a way, but with a lot more devastating effects.
Starting point is 00:16:30 And as we see more and more, uh, what was it? Azure the other day just went down. Um, we had, Google went down. Oh,
Starting point is 00:16:38 Google. Gmail. Gmail. And then we had the, the security, like it's happened. It's happening. You know,
Starting point is 00:16:44 as it's, it's not a matter of if it's a matter of when, and then when is it security. It's happening. It's not a matter of if, it's a matter of when. And then when is it going to impact you? But are you going to be prepared for some form of it? Yeah. Brings up a lot. Yeah, there's two things that kind of come up. It was nice because KubeCon North America just happened and one of the things that they tweeted out
Starting point is 00:17:04 after the conference is that they see five trends happening in the cloud native world in 2021. And chaos engineering is one of them, which was like, yeah, we are seeing that things are getting more complex and people are realizing. So that was really neat. And the second thing I wanted to chime in on is that I actually have a story. if we want to talk about what the investment in chaos engineering is versus doing it otherwise go ahead uh so actually we're back to a finance company um but we do know we have a few other examples they're doing um chaos engineering in qa so we do talk about the end goal is to run chaos engineering and production, but we want to build our confidence slowly and incrementally. So this is one of our companies that is doing chaos engineering in QA. They have a Kafka cluster and they want to start thinking
Starting point is 00:17:57 of what happens when it becomes partially unavailable. So after they ran all these experiments, they were able to say the chaos engineering experiments only took two hours to implement. But if they would have used regular engineering to do this, this was something that was going to take them four or five months to replicate. And this is just in the QA environment. So it could have actually been a little bit longer for production and stuff. But what they wanted to see is what failures would happen when the brokers of Kafka would actually be unavailable. So this allowed for them to see that they actually hadn't set up monitoring properly.
Starting point is 00:18:37 The monitoring tools weren't alerting the team when the nodes were lost. And they started to see that by running basically black hole attacks that stopped the connection to the brokers on just half of those brokers meant that those brokers crashed. They didn't have alerts and an engineer had to manually come and run a command for these nodes to come back up. So if you would have to do that every single time you see a failure in your Kafka cluster, that's first of all, you might get pager fatigue and totally burn out. So they didn't really get to see why this was happening until they broke it. And when you do lose one of the nodes of the Kafka cluster, this meant that you actually need to manually configure it again for it to reboot. So the amount of time that kind of was happening for them to manually intervene the few times that they ran the
Starting point is 00:19:45 experiments, they realized that they needed to fix a lot more things in their cluster and the environment and the configuration. And that's kind of where they were like, we wouldn't have really seen this failure right now. And if we would have discovered it via an incident, they would have taken a lot more time to engineer for that failure versus proactively jumping on that failure. And obviously, if this would have happened later on in production, as you said, it would have taken them a long, long time.
Starting point is 00:20:16 It would have meant a huge downtime, huge penalty payments, especially if it's finance, right? I mean, that's probably, I think, hard to measure actually how much money they would have lost. And therefore the upfront investment makes a lot of sense and pays off immediately. Yeah.
Starting point is 00:20:35 And finance is interesting because like you mentioned, it's like you're going to get charged on fines depending on the countries that you're operating on. Like that cost of downtime bill could get incrementally really, really high for every minute that you're done. Hey, you mentioned monitoring and that they figured out that maybe not the right people were alerted
Starting point is 00:20:56 or got alerted, which reminds me of a thing that we, as an industry, I think in general, try to promote a lot is everything is code and everything is configuration. So we also talk about, you know, making sure that your delivery pipelines can, well, let's say that all of the tools
Starting point is 00:21:15 that you use from dev to ops can be automatically configured through configuration as code and that you then also apply it accordingly. And I think this is a great way also for for you to test this for your monitoring solution meaning if you are right now only using monitoring and production and you've everything manually configured because you've never needed it to replicate it in a pre-production environment well then it's about time that you
Starting point is 00:21:41 figure out how you can configure your monitoring fully automatically and then also use that configuration and propagate it from dev to QA into prod so that you always get the same alerts and the same thresholds are used and all that. So I think that's especially for our listeners that are using maybe the monitoring tool that Brian and I try to bring to the world every day. And it's funny you mentioned that
Starting point is 00:22:10 because it's weird to, like we saw this a lot with prospects and customers. They didn't have monitoring set up. And it was like, how are you operating at scale? And then maybe they would only have monitoring and production, but then they're not doing this in any pre-prod environment. And it's like, wait, but how are you making sure that you're not building in more failures
Starting point is 00:22:31 as you're deploying through your environments? But then that also allows for me to touch base. At Gremlin, we use Gremlin for chaos engineering on Gremlin. And that was actually one of the wins that we saw. We were able to run a game day in staging and we realized that our dashboards needed a little bit more granularity and have a little bit a better way
Starting point is 00:22:57 to understand what the system was doing. But we were able to implement that in staging monitoring. And then you can take that exact same win and put it into your production dashboard. And it's like I never had to run it on production for me to make that improvement. Yeah.
Starting point is 00:23:17 So another lesson learned for everyone. Whatever tools you use, and if it is, for instance, monitoring, you need to test. You can test the monitoring in your lower-level environments while you run your chaos engine experiments, and then if you can automatically propagate these configurations to the upstream environments like production, you automatically get the same data that you know you will need in case chaos strikes
Starting point is 00:23:46 and the data system doesn't heal itself because that's obviously the next thing we can also do with chaos engineering, that we not only enforce chaos and find the alerting and get better in fixing things manually, but also fixing things, or letting things fix automatically through auto-remediations.
Starting point is 00:24:05 But Andy, an episode or two ago, you were just advocating the idea that the less environments you have, the more mature you are. How do we test in lower environments if we're trying to get rid of them? Yeah, well, if you have listened closely to that episode, then just having, let's say,
Starting point is 00:24:26 only production doesn't mean that you cannot safely deploy into an environment in production where you can run your experiments up front before you release this to the whole world, right?
Starting point is 00:24:35 You still have a separation between your regular production traffic and let's say your canary traffic or whatever you want to call it. But you're right, obviously.
Starting point is 00:24:43 And that also means, right, if you're doing production only, if you really are that mature, that you only have one single environment and you're deploying something new, you want to make sure that you have the chance to automatically test, configure, and run experiments on these new components before you maybe release it to the whole world.
Starting point is 00:25:03 And that's why I think canary deployments and feature flags are so quick. And I think that's one thing people have to think about when people say things like, oh, you only need one environment. It's like, yes, you need one environment, but that environment has to have those, let's call them subdivide. It's not really truly one environment. It's one set of everything, but you have little pieces that you turn on and off and you have the ability to treat it as if it's partially separate environment so that you can
Starting point is 00:25:31 control a controlled single environment it's a good way to think of it i think it would just you know if people are not fully paying attention or fully thinking it through they just might be panicking be like what we just put everything 100 to prod but yeah anyway there's some great podcasts by um who who who does those andy is that um pure performance who's talked about uh some canary releases and baby testing stuff in the past be good to check out maybe our maybe our friends from pure performance who knows yeah anyway i want to get back to some stories sorry uh yeah i do want to say that the movement of progressive delivery and kiosk engineering, when you bring both of those together, it's so awesome. Like, I really want to see more companies just really using both of them a lot more and being able to talk about it because it is it comes back to being customer focused and putting a focus on experimentation. And I think when we're looking at how distributed things are in the cloud native world, we need to be doing that. Like it's to me,
Starting point is 00:26:30 it should be mandatory. Like when we do have a talk on cloud native and adopting this stuff, it's like, well, you might as well take all the new movements going on and really make sure that you're getting ahead of those failures, but you're also not stopping innovation and being able to slowly release it's like it's perfect like i want to see more of that yeah hey anna um as we started in the beginning with uh reflecting the chaotic year of 2020 and obviously COVID has been one of the main reasons for this chaotic year. Do you have any stories maybe related to COVID? Anything that you've learned from organizations that may have seen different traffic patterns
Starting point is 00:27:20 or different things due to COVID? Yes. So there's multiple companies that we were able to hear about from the increase in capacity that they had, from all of a sudden their systems are seeing a huge increase in people reaching their services and they didn't do load testing or they didn't capacity plan well enough. So it's really interesting because some
Starting point is 00:27:45 of these companies were not expecting it because they're not really the companies that you would use because things went to virtual or online. It's not Zoom, it's not Slack, and it's not your favorite grocery stores or anything like that. But the one example that I am able to talk about today is an airline company that had seen some interesting changes due to COVID. And a lot of this was based because the only way that they were able to talk to their customers was their mobile application. And in their mobile application, they were using this to broadcast messages to the user base. This meant that they had around 60,000 people getting messages every single time they sent a push notification. So when you get 60,000 people to get a push notification, this meant a lot of those
Starting point is 00:28:33 folks were actually going into their mobile application. This also meant the increase in calls to the APIs once you open the application. And the backend that was the RDS was the biggest thing that was impacted. And they were just like, I don't know what's going on. Like their mobile app kept on crashing. And this really meant that they needed a way to prepare their system for database connectivity, latency, increase capacity,
Starting point is 00:29:02 or just that the RDS system went down. So they ended up doing a lot of latency experiments. They were also black holing all traffic from their mobile app to their internal service and black holing RDS specifically. And some of the findings that they had is that their login screen, they weren't able to like exit out of that. You basically had to log in every single time you wanted to open this application. You also saw that the login would kind of like just crash and go all black. And then they also had the mobile app had a spinner that you just couldn't get out of that page. And all of this was due to latency or RDS being just at max capacity on reads that they're able to have. And it's really interesting because it's like, yeah, I wouldn't
Starting point is 00:29:52 expect that due to COVID people are more likely to open a push notification because they have the downtime or this is an airline that they're using to go visit their family. And when we're talking about COVID, it's like every single month has been very different for every single country. And that mobile app is the only way that sometimes they're knowing what the COVID regulations are for travel between countries or their own cities. It's funny because that's like a marketing team's dream, right? We send a push notification and, you know, especially we see during games like the Superbowl or Olympics or whatever it might be that, Oh, there's this terrible celebration when someone's website crashes, you know,
Starting point is 00:30:34 by the own company, like, Hey, we crashed our website. No, that's terrible. But it's, this ties right into it, right? You send a notification. I mean, I think, I think this is a great story because this could just be turned into that what if besides even you know take the covet factor out which you know you shouldn't but taking the covet factor out what happens if your customers respond as marketing would dream that they do you know because it's always like what maybe 10 right that really actually come in but what happens if, suddenly something is so successful? You know, back when I worked at,
Starting point is 00:31:07 um, WebMD, there was a scenario that was similar. This is, I think way before the idea of chaos theory was even like maybe in principle things, the idea was there, but it didn't have a name.
Starting point is 00:31:19 Um, this was when, uh, Obama was the president and Michelle Obama was the first lady and she was taking up the health initiative. And the idea was like, if she decides to publish a health blog on our site, we're suddenly going to get slammed
Starting point is 00:31:31 with more traffic than we could have ever anticipated. Right? So are we prepared for that? And that was one of the questions. So it kind of goes back to that whole bit. And I think the good point you make there is that with COVID, who would have thought it would actually go that high but that's the question what what can you handle
Starting point is 00:31:48 that's great yeah and and we actually have been having a lot of those conversations with finance companies like just in the industry those were the companies that also weren't necessarily expect expecting an increase in usage of their internet services, but all their branches are shutting down. So online banking has to be as reliable as ever from transfers to credit card payments, viewing your statements, or even being able to call the call center and cancel your credit card payments because you've been laid off. So it is one of those things that you have to be ready
Starting point is 00:32:26 for those moments that you can't expect. Yeah. I mean, there's two thoughts here that I have. The first one is on this push message example. In the end, it comes back to dependency mapping, even though it might not be a dependency between a front-end service to a back-end service to a database, but it's a dependency from one feature to another feature.
Starting point is 00:32:52 And that might be implemented by completely independent teams. But you still need to know that dependency and then you need to do the proper load or I don't know what the right terminology for that is. How many people are potentially jumping from that feature to the other feature and therefore causing a lot of load. So for me, this is just another example of dependency mapping between features and not just between services in the backend.
Starting point is 00:33:23 The other thought that I had though, and this comes back to what we said earlier, if you would go back to 12 months in the past, like December, January last year, and if somebody says, hey, I think we need to prepare for a global pandemic. And then maybe the product managers say, sure, add it to the list of the backlog, but probably it's not going to be highly prioritized
Starting point is 00:33:50 because it's very unlikely. So in the end, it comes back to how you prioritize these things and somebody needs to make an assessment on what is the chance of this really happening. And obviously with COVID, we just hit the checkbox this year um that probably nobody really nobody really saw that happening this year and i think i think you do iron that out where it's like this is a big case of dependency mapping and i think the other word is capacity planning where you don't know how many calls you can actually sustain in your systems
Starting point is 00:34:25 until you're seeing all the traffic come in and it's like oh no our servers are on fire we're about to crash and it's like getting to that point sucks but it's true when you have that conversations with your product teams your managers how do you tell them i expect an increase in traffic during this time and having capacity planning conversations like i know in the last episode i talked about what that was like at uber you have to do planning for three to six months when you're working in a bare metal data center it's like six to twelve months and when you're on the cloud, you also might need six to 12 months, depending on the type of like what you pay in your cloud provider. And you can't really just pull the plug on having more capacity because you
Starting point is 00:35:12 now need to implement it and that bill. And then you have to make sure everything works. And that's that part where it's like, yeah, I think I pressed the button and things are all going to auto scale. But until you test it, you don't know. And I had a question about the dependency side. I think where the two of you were going with this is, at least for me, a new concept in dependency. So if I'm thinking about this the right way with this push notification one, right?
Starting point is 00:35:41 If we take a step back and using some sort of APM tool as an example, I don't know, maybe Dynatrace, where when you look at your dependency map of your application, right, that's learned by basically all the dependencies of the code execution going through your system. Now, when we take that in the push notification, that's one system, right? Messages get queued up, they get sent out and pushed to the end users. Within that message is a link, right? But that link has nothing to do with the backend system that pushed the notification. That link goes through another set of applications and another maybe set of infrastructure or clusters or whatever it might be so you have two systems with a soft dependency that only occurs with that link so they're meaning there'd be no way to really trace that through with any sort of tooling or monitoring because there's a there's basically that air gap of having to touch the link and hit it so how does how do you plan for that sort of dependency?
Starting point is 00:36:46 Or how do you make people aware of thinking of the push notification part of that dependency is the system that happens if a person engages with that? It's a proper communication between the different, whoever is the owner of these individual features, right? I mean, it's proper. But do people think about that? I mean, is that, it's a brand new concept to me. I mean, it's proper company. But do people think about that? I mean, is that...
Starting point is 00:37:05 It's a brand new concept to me. I mean, that's something I never would have even thought of. But is that something that people actually are thinking of? Do we see that happening? Or is that something that people need to really start opening their eyes to? I think we're seeing the conversations happen, but we need to see them more. And it goes back to bringing reliability to every single product team.
Starting point is 00:37:25 I think sometimes they don't even think about it. They just launch without thinking the largest impact that their product can have. And it's funny because that actually has been architecture questions in software engineering interviews that I've been part of. But even though you get those questions in interviews, doesn't mean that every single team out of companies is asking that. And I think a lot of people just don't know what their dependencies are. You ask a staff engineer, maybe they might know senior engineers, they might know who to ask or what architecture diagram to look at. But then when was the last time that architecture diagram or mental model got verified?
Starting point is 00:38:14 Like you don't necessarily ever know how distributed your system is until you see the state, like the trace down of what that call was. So I think it's interesting because like observability would allow for you maybe not to like really pin down what some of those things are. But then there's a portion that with chaos engineering, by just closing out one of those dependencies, you'll see if your system failed. So it's like, are you going to go and maybe look at logs and really try to figure out all those dependencies? Or do you just go and close a connection and try to figure out where those break-ins happen. And I guess there's a way to automate some of the sites from the front end, again, using that push notification as an example, or whether it be a web page or any bit, it's if you're at least somehow scanning whatever that user interface is to understand what all the endpoints that that can connect to are.
Starting point is 00:39:01 You can then at least identify those and say, all right, this is connecting. We know 90% of them are going to the main site that we expect, but hey, look, there's these three links here that go to a completely different system that we have to make sure if we're driving more traffic, those people need to know. And you bring in interesting, like that use case itself,
Starting point is 00:39:19 because I would then argue that you can just build with failure in mind and you can assume that loading that page could fail and you just load a static page that says be right back or something or a main link to log in. So I think a lot of it is just going back to that whiteboard and being like, okay, what is everything that can fail and let's start building around that. Like one of the examples that always gets talked about in chaos engineering are Netflix examples. They're known as one of the pioneers in the space. And when you listen to their front end engineers talk about chaos engineering, one of the use cases that has always stayed in my head is that you log into Netflix, their main, like their main kpi is just like seconds to stream but when you log look into the the first page there is a continual watching contain like division and it's like all the shows or movies
Starting point is 00:40:13 that you've been watching but if that service was to go down you don't see empty boxes you end up just seeing like top movies or tv shows in the United States. But that's building with failure in mind of like, wait, no, this service, if that times out at 300 milliseconds, the website just won't load it. And it's really, really nice to have experiences like that. It's incredible what some of these sites do. When you think about it, we take for granted that you go on Netflix or you on amazon or anything and within you know a heartbeat you're doing what you want to do i mean don't even think about what's all behind there but these stories really give us a great appreciation for for understanding everything that went into that it also shows us that chaos engineering and let's say that way designing with resiliency in mind goes far beyond what you
Starting point is 00:41:07 would maybe normally think of like a you know when the database is gone okay this is one problem but if a widget on the website doesn't load then i want to make sure that the website is still functional and also doesn't look like everything is broken. So I think that's also very interesting, right, that we teach and we make engineers from the front end to the back end aware of resiliency engineering and that you can make, that you can, that you have to, let's say that you have to think about resiliency in every layer of your architecture, whether it's front end, the mobile app, the browser, or the backend. Hey, Anna, you have one more example that we should talk about?
Starting point is 00:41:59 Yes. So funny enough, we were just talking about that capacity planning. And the last example I had was an e-commerce company that wanted to verify they set up auto scaling properly, but they were using Kubernetes and they just didn't have any regularly, like just regularly nodes auto scaling. They went ahead and they implemented horizontal pod auto scaling. So they usually traditionally use load testing to verify any sort of auto scaling, but due to the way that horizontal pod autoscaling works with Kubernetes is based on research consumption. So they ended up running CPU attacks on every single service for them to really isolate every single application to a single container. And that's how they were able to realize that HPA was set up properly and their Kubernetes clusters were actually auto-scaling appropriately, which is kind of nice because one, more people need to do Kubernetes auto-scaling, whether you're doing it from the cloud provider on the node standpoint, or you're actually setting up HPA, which I highly recommend.
Starting point is 00:43:01 Sometimes you really hope that you set it up properly, but unless you test it, you don't know. So when I was like looking for these stories, I was like, no, I'm really excited to see these e-commerce company going and making sure to run CPU experiments that if their resources actually pass the research consumption that they have allocated for, they are seeing things actually auto-scale and the customers won't get impacted. Because when we look at the outages of Kubernetes, sometimes they are just based on capacity planning.
Starting point is 00:43:34 And it's always like, oh, yeah, people assume that things are going to auto-scale just because it's cloud native. But wait, did you forget that you had to configure it? And you have to architecture for it as well. Just, I mean, because you have an app or a service in a container doesn't mean it is ready for auto-scaling, right? There's a lot of things that have to be considered and then those affected into the architecture.
Starting point is 00:44:02 So, yeah, very important. Very cool. Now it's the beginning of 2021, at least when people listen to this, any predictions for 2021? Or I think you said earlier at KubeCon, they mentioned that chaos engineering is going to be one of the hot topics of 2021, which I completely agree with.
Starting point is 00:44:29 Is there anything else maybe to kind of conclude this session today? Anything else that people should look into that you would like to put under their virtual Christmas tree, which I know is happening for us in the future, for them in the past, but is there anything that you would wish people would also do in 2021
Starting point is 00:44:49 when it comes to perform chaos engineering? I think it's like, it's for sure. That number one is go perform chaos engineering. But the one that I will continue pushing forward is do it to onboard your engineers. So on call, I just continue hearing of instances of engineers like getting put on call, they're thrown into pagers
Starting point is 00:45:09 and those engineers never learned what they had to do in a safe space. They never got to really explore what the systems that they were doing. So when I talk about onboarding engineers with Chaos Engineering, it's about running an experiment, having them open up that runbook, having them get page, having them acknowledge the page, communicating with the team that the incident was started. Whether you're using a tool for incident command and really jotting down everything, like going through the entire process as if things were actually on fire in production would really be helpful because sometimes psychologically you get paged in the middle
Starting point is 00:45:49 of the night. This is not the time that you want to be awake and you just kind of forget maybe some of the steps. So I really want to push that one forward for the purpose of pager fatigue going down, making sure runbooks and links to dashboards are up to date, and just making sure that we are trying to build a healthy, reliability, engineering, mental health culture. Cool. Well, this hopefully now sparks an idea for tomorrow when people finish listening to this pod and they go back to the office tomorrow and say hey i want to voluntarily sign up for being on call and i'll make sure that's cool thank you so
Starting point is 00:46:33 much cool um brian anything from else from your end yeah i got one idea or one wish for 2021 with in terms of chaos but it's also a setup for anna in case it does exist already then we can say well then great you'll have to come back on the show and discuss it with us but what i would think would be cool to to have in the world of chaos is categories of chaos right because if we think about this idea that we bring up quite a lot with you on COVID, right? This world pandemic that were people prepared? Did they think of people responding, right? The overwhelm, the, I guess the way that might overwhelm someone's mind would be thinking of like, well, if I think of everything that can go wrong in the world that we might have to prepare for,
Starting point is 00:47:23 it's going to be too big, right? But I was just thinking about that. I'm like, well, no, it's not necessarily the individual characteristics of the event. It's the type of events. So if we think about besides all of the, what if our database goes down, what if AWS has an issue or whatever, we think about what if there's a global event? What if there's a local event?
Starting point is 00:47:41 What if there are different kinds of events and situations? So I'm curious to see if there's a local event? What if there are different kinds of events and situations? So I'm curious to see if there are categories of types of events that can help guide people into creating and thinking of ways to test different types of chaos from server level, host level, all the way to global community level. I'm not sure if that's something that exists in the Chaos world yet, if that's something that would be useful in the Chaos world, but it'd be interesting to see at least what it might look like. Yeah, so I'll definitely do the plug that I do work for a vendor, and one of the products that Gremlin does offer is scenarios, and in scenarios is that you do work for a vendor and one of the products that Gremlin does offer is scenarios and in scenarios is that you get to choose a technology that you're using and we give you
Starting point is 00:48:30 recommended scenarios to run on your infrastructure and like for example with databases it's like prepare for the lack of memory in your MySQL make sure that your database cache is set up properly so you can block that traffic what is a timeout of your DynamoDB? So we do try that. And then I know the open source community, like with Litmus Chaos, they have been able to also do something similar where like the experiments that they offer
Starting point is 00:48:57 is based on recommendations of failures in Kubernetes. And the other plug for it is that I just recommend people to read incident postmortems. Go ahead and look up your favorite technologies, replicate those conditions, and know that your organization is resilient if that was to happen. So go read the GitHub outage of last year.
Starting point is 00:49:19 Go read the Google Cloud that happened like a few weeks ago. Gmail just went down. Go read them and understand what happened to lead few weeks ago gmail just went down go read them and understand what happened to lead to that failure and then replicate it i think that brings up a good point i was just thinking as you were talking there it's not necessarily what event happens right if we take again global pandemic it's not necessarily thinking about something global happens it's about some system got overloaded in some way right so if you're if
Starting point is 00:49:45 you're reading those stories and reading those postmortems about what systems got overloaded or got destroyed it's more about thinking of your systems again you don't have to necessarily think from the outside world of what happens let's say if vampires come to life tomorrow and you know well what what and if you do want to think about that way you think about okay well how the vampires use the internet that might suddenly we might suddenly have to deal with a lot more nighttime traffic can we handle nighttime traffic if we're doing our maintenance windows right um but it's more about i guess the systems and where i was just going there with my previous question was thinking again kind of in the wrong way of the actual event
Starting point is 00:50:22 not the system the system is your customer in a way in chaos, right? We shouldn't say your customer, your subject maybe is the better way to think of it. All right, cool. And I would say, right,
Starting point is 00:50:37 as you've said, you know, do this and this, make sure you listen to the podcast from our friends at Pew Performance. It is called, Why You Should Look Into Chaos Engineering with Anna Medina. Yes.
Starting point is 00:50:49 Those people are wonderful. Yeah. All right. That's all for me. Andy, anything else from you? No, just, you know, we all have it in our hands to make 2021 a better year
Starting point is 00:51:03 than the last one. It shouldn't be too hard, but it's all in our hands to make 2021 a better year than the last one. It shouldn't be too hard, but it's all in our hands. First of all, start washing your hands more often and then do something useful with it. No, but it's great to learn from people like you, Anna. It's really a pleasure. It opens up our minds. It teaches us something new. And in the end, we will all just get better in engineering.
Starting point is 00:51:30 And that's why I really want to say thank you so much for coming on the show. Thanks for having me. Very much open to talk about reliability and just making this world a better place and more reliable as humans too. So feel free to reach out. I'm on social media as Anna underscore M underscore Medina. Awesome. Thank you so much. We'll put that link in show notes as well.
Starting point is 00:51:57 And if anybody has any questions, comments, please feel free to tweet us at Pure underscore DT on Twitter, or you can send us an old fashioned email at pure performance at dietrace.com. Love any questions, comments, feedback, or ideas. And happy 2021.
Starting point is 00:52:15 Here's looking to the future, everyone. Thanks. Bye.
