Screaming in the Cloud - Episode 22: The Chaos Engineering experiment that is us-east-1
Episode Date: August 8, 2018

Trying to convince a company to embrace the theory and idea of Chaos Engineering is an uphill battle. When a site keeps breaking, Gremlin's plan involves breaking things intentionally. How do you introduce chaos as a step toward making things better? Today, we're talking to Ho Ming Li, lead solutions architect at Gremlin. He takes a strategic approach to deliver holistic solutions, often diving into the intersection of people, process, business, and technology. His goal is to enable everyone to build more resilient software by means of Chaos Engineering practices.

Some of the highlights of the show include:
- Ho Ming Li previously worked as a technical account manager (TAM) at Amazon Web Services (AWS) to offer guidance on architectural and operational best practices
- The difference between, and his transition from, the solutions architect role to the TAM role at AWS
- The role of the TAM as the voice and face of AWS for customers
- The ultimate goal is to bring services back up and make sure customers are happy
- Amazon Leadership Principles: it's mutually beneficial to have the customer get what they want, be happy with the service, and achieve success together
- Chaos Engineering isn't about breaking things to prove a point
- Chaos Engineering takes a scientific approach
- Other than during carefully staged DR exercises, DR plans usually don't work
- Availability Theater: a passive data center is not enough; exercise the DR plan
- Chaos Engineering brings it down to a level where you exercise it regularly to build resiliency
- Start small when dealing with availability
- Chaos Engineering is a journey of verifying, validating, and catching surprises in a safe environment
- Get started with Chaos Engineering by asking: what could go wrong?
- Embrace failure and prepare for it; business process resilience
- Gremlin's GameDay and Chaos Conf allow people to share experiences

Links:
- Ho Ming Li on Twitter
- Gremlin
- Gremlin on Twitter
- Gremlin on Facebook
- Gremlin on Instagram
- Gremlin: It's GameDay
- Chaos Engineering Slack
- Chaos Conf
- Amazon Leadership Principles
- Adrian Cockcroft and Availability Theater
- DigitalOcean
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This week's episode of Screaming in the Cloud is generously sponsored
by DigitalOcean. I would argue that every cloud platform out there biases for different things.
Some bias for having every feature you could possibly want offered as a managed service
at varying degrees of maturity. Others bias for, hey, we heard there's some money to be made in
the cloud space. Can you give us some of it?
DigitalOcean biases for neither. To me, they optimize for simplicity. I polled some friends of mine who are avid DigitalOcean supporters about why they're using it for various things,
and they all said more or less the same thing. Other offerings have a bunch of shenanigans
around root access and IP addresses.
DigitalOcean makes it all simple.
In 60 seconds, you have root access to a Linux box with an IP.
That's a direct quote, albeit with profanity about other providers taken out.
DigitalOcean also offers fixed price offerings.
You always know what you're going to wind up paying this month,
so you don't wind up having a minor heart issue when the bill comes in.
Their services are also understandable without spending three months going to cloud school.
You don't have to worry about going very deep to understand what you're doing. It's click a button or make an API call, and you receive a cloud resource.
They also include very understandable monitoring and alerting.
And lastly, they're not
exactly what I would call small time. Over 150,000 businesses are using them today. So go ahead and
give them a try. Visit do.co slash screaming, and they'll give you a free $100 credit to try it out.
That's do.co slash screaming. Thanks again to DigitalOcean for their support of Screaming in the Cloud.
Hello and welcome to Screaming in the Cloud.
I'm Corey Quinn.
This week, I am on location at the Gremlin offices in San Francisco.
I'm joined by Ho Ming Li, who, well, we'll talk about Gremlin in a minute.
First, I want to talk a little bit about what you've been doing historically.
First, welcome to the show.
Sure, absolutely. Thank you for having me and happy to be here on the show.
Nope, always great to wind up talking with new and interesting people.
So before you went to Gremlin, you were a TAM at AWS, or Technical Account Manager for
people who don't eat, sleep, and breathe cloud computing acronyms.
Yep.
So without breaking any agreements or saying anything that's going to cause AWS to come after me with knives in the night again,
what was it like to work inside of AWS?
Yeah, I actually want to step back a little bit further.
And when I joined AWS, this was probably four years ago, I joined as a solutions architect over there. I am one of the only, if not the only person
that actually went from the solutions architect role
into the technical account manager role.
Interesting.
Out of curiosity, what is the difference
between a solutions architect and a technical account manager?
I get that a lot, for sure.
And from a technical perspective, both roles are very technical.
The expectation is that you're technical and you're able to help the customers.
I would say for solutions architects, you are working a little bit more with the architecture,
whereas with the technical account management role, you're a lot more on the operations side,
so you're a lot more on the ground.
When you say that you transitioned
from being a solutions architect
into a technical account manager,
and that was a rare transition,
is that because transitions themselves are rare,
or is that particular direction of transfer the rarity?
That's a great question.
Particular direction of transition.
And there is a tendency for people to start with a support role.
Technical account manager has an enterprise support element to it.
Now, as technical account managers,
instead of being reactive on a lot of support cases,
you're actually proactively thinking about
how we can make their operations a lot smoother
and best practices around that.
So I'll give you an example.
Every year, there's usually very critical events
for a lot of businesses.
If you think of Amazon, the natural thing to think of
is Black Fridays and Cyber Mondays.
A lot of these.
Prime Day as well recently.
Oh, absolutely.
Happy Prime Day, belatedly.
Exactly.
And so for these events, there's actually a lot of planning, a lot of thinking ahead.
And so as technical account managers, you work with the customers to make sure that they are ready for the event.
Now, technical account manager is an interesting role in that you are in between the customers as well as AWS services.
And so in that sense, you're helping services, you're helping the service team educate the
customer to let them know how to properly use a service.
Because a lot of people use services in an unintended way,
which makes things very interesting, to say the least.
Oh, absolutely.
I found that Secrets Manager, for example,
makes a terrific database.
And every time I talk to someone
who knows what a database is supposed to do,
they just stare at me, shake their head sadly,
and then try to take my computer away.
I have to say there are interesting misuses, though,
because there are interesting use cases
that the AWS service team may not have thought of.
And so the other part is actually getting some of this feedback
from our customers and bringing it back to the service team
so that the service team enhances features.
So it's a really interesting role in that you're between both the customers and the service team.
What's interesting to me about the TAM role historically has been how, I want to say maligned it is.
In that when you speak to large enterprise clients, fairly frequently something I will hear from engineers on the ground is,
oh, the TAMs are terrible.
They have no idea what they're doing and it's awful.
And okay, I suspend disbelief.
I wind up getting engaged from time to time
in those environments and speaking with the TAMs myself.
And I come away every time very impressed
by the people that I've been speaking with
on the AWS side. So I understand
that it's a role where you are effectively the voice and face of AWS to your customers.
And that means that you're effectively the person in the room who gets blamed for any
slight real or imagined that AWS winds up committing. You will get blamed for all of their sins.
It's a maligned role.
What do you wish that people understood more about the TAM position?
I think some part of that is really just about
what people are hearing anecdotally and what word gets out.
This is like reviews.
You generally tend to see more bad reviews than good reviews.
But in my experience as a TAM, working with a lot of our customers, I've actually worked
with a lot of good customers and I'm thankful and lucky to be in that position where a lot
of our customers actually come to us with very reasonable requests and very reasonable
incident management.
So working with them in finding out what the root cause is,
and it could be something on the AWS side,
it also can be something on the customer side.
Now, I do understand where the sentiment comes from
because there are definitely certain customers
that like to just initially blame AWS.
And sometimes when they don't have visibility into it,
it's easy for them to blame AWS.
And that might be where some of the sentiments come from.
But for the most part, the customers I work with have been really good in that it is usually a joint investigation
to find out what's wrong, as the ultimate
goal is really to kind of bring services back up and make sure our customers are healthy
and happy going forward.
It always amuses me to talk to large enterprise customers who are grousing about enterprise
support and why they don't want to pay it.
They're debating turning it off.
The few times I've seen companies actually
do that, it lasts at most a month and then they turn it right back on and, oops, that was a
mistake. We're really, really sorry about that. Not because you need enterprise support in order
to get things done, but rather that you need it in order to get visibility into problems that only
really crop up at significant scale. That's not, incidentally,
a condemnation of AWS in the least. That's the nature of dealing with large, complex platforms.
And the one thing that has always surprised me about speaking with TAMs, even off the record,
after I've poured an embarrassing number of drinks into some of them, is that I don't ever get any of them to break character and say,
oh, that customer is the worst. There's a genuine bias for caring about the customer and having a
level of empathy that I don't think I've ever encountered in another support person for any
company. Is that just because there are electric shock collars hidden on people, or implants? Or
is this because it's something that they bias for in the hiring process?
Yeah, totally.
So kind of dialing back a little bit on what you mentioned
with how enterprise customers are getting value with enterprise support.
I think there's an element of embracing enterprise support.
You have to really embrace it and work with the AWS staff to really get
value out of it.
And so the more you embrace it, the more value you'll get out of it.
Now, to the point of why all the staff at AWS, not just TAMs,
are hardwired to really help customers: it's part of the Amazon leadership
principles.
I really think that's something Amazon has gotten right in its culture:
the hiring process requires people to read up on the leadership principles,
embrace them, and really present them as you go through your hiring journey,
and then keep living them as an Amazonian,
which is what they call the staff at Amazon.
Yay, cutesy names for employees.
Every tech company has them.
Exactly.
But the leadership principles speak really well,
and one of them is customer obsession.
And so every staff member within Amazon is customer obsessed.
And so at the end of the day,
it should be mutually beneficial to have the customer
get what they want, be very happy with the service,
as well as the TAM achieving success with the customer
to make sure that everybody's happy in the equation.
Absolutely, and I think that's a very fair point.
So I think that's probably enough time dwelling on the past.
Let's talk about what you're doing now.
You left AWS around the beginning of this year,
and then you came here to Gremlin,
which is an awesomely named company,
especially once you delve a little bit into what they do,
which is chaos engineering.
So effectively, from a naive perspective on my part,
where I haven't ever participated in the chaos engineering exercises,
it looks to me from the outside like what you do
is you've productized or servicized, if that's a thing,
turning off things randomly in other people's applications.
How do you expect to find a market for this
when Amazon already does this for free
in US East 1 all the time?
Well, there's a lot going on in US East 1
and a lot of new shiny toys happen to be there too.
US East 1 is definitely an interesting region.
That said, there's definitely a lot of misconceptions
in chaos engineering.
We have heard where some people would go
and break other people's things
and just break things to prove a point.
As a company, we're definitely not advocating for that.
That's not the direction that we expect our customers to take.
Really, chaos engineering is about planned, thoughtful experiments to reveal weaknesses in your systems.
You look at companies that are out there that are doing similar things like Netflix, like Amazon themselves.
The intent really is to build resilience.
With microservices, and you hear this word a ton
in the industry for sure,
it's very difficult to understand all the interactions
and to really have a good grasp,
even as an architect.
I've been an architect and I'm here as a solutions architect as well.
Even as an architect, it's impossible to know
all the interactions and ensure that your systems are resilient.
And so one good way is to be thoughtful
and plan out these experiments to reveal weaknesses and then to build resilience
over time. One of the things that I appreciate when I speak to chaos engineers is that they
always seem to take a much more scientific approach to this. It's a mindset shift where
other people, sorry, other departments call themselves site reliability engineering.
There's very little engineering.
There's very little scientific rigor that goes into that.
It's more or less, in many shops, an upskilled version of a systems administrator.
Whereas every time I go in depth with chaos engineers, there's always a discussion about the process, about the methodology that ties into it.
So let's delve into this a little bit.
What is chaos engineering?
Because it's easy to interpret this, I think, as just having a DR plan that is just better
implemented and imagined than a dusty old binder that assumes that one data center completely
died, but everything else is okay and just fine.
What is chaos engineering?
It's interesting you bring up the DR plans.
Definitely most companies have a business continuity plan.
They, like you said, have a DR plan so that if their primary data center fails, they
would fail over to a secondary data center.
Spoiler, other than during carefully staged DR exercises, they almost never work.
That's exactly what I want to bring up. There's actually a term recently
used by Adrian Cockcroft for this: availability theater. Just having it in the binder, just having a passive data center
is not enough.
And so first of all is to exercise it.
You have to actually exercise your DR plan.
And if you go out and ask people,
I honestly don't know how many
properly exercise their DR plan.
Now the other thing is,
I can understand why so many people
are reluctant to exercise the DR plan
because the huge blast radius is very dangerous to do
and it just takes a lot of planning,
a lot of effort to do.
So if you take that effort
and shrink it down to
a very small outage or
a very small issue in your architecture
and ask the question,
what could go wrong in this environment?
It could be as simple as a network link going down.
And that's much easier to do, and that's much easier to practice.
So the idea behind chaos engineering really is just bringing it down to a level where
you exercise it regularly so that you can build resiliency.
But it has to be thoughtful, it still has to be planned.
So we like to think of it as an experiment where you ask the question, in this environment,
what could go wrong? And then once you start understanding
what parts can go wrong,
you want to experiment against it.
You have a hypothesis that if I disconnect
my application from the database,
I'm able to pull data from cache.
I'm able to still show this information to my end users. That's the hypothesis. Now,
as much as you think that that's going to happen, you don't really know for sure that that's
actually going to happen. So you would want to actually test it and experience it. And so the
key here is actually injecting the fault, actually disconnecting the link to your database, and seeing whether it is actually showing you cached data, or maybe in some cases it fails horribly, which is actually still a learning.
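To make that concrete, here is a minimal sketch of the kind of experiment described here: hypothesize that the service keeps serving cached data when its database is unreachable, inject that single fault, observe, and always clean up. The endpoint and the two fault-injection helpers are illustrative assumptions for the sketch, not Gremlin's actual API or any specific product's interface.

```python
import time
import requests  # third-party HTTP client, used only to probe the service under test

APP_URL = "http://localhost:8080/products/42"  # hypothetical endpoint of the service under test


def block_database_traffic():
    """Stand-in for the real fault injection (a firewall rule, a network
    blackhole, etc.); how you actually do this depends on your tooling."""
    pass


def restore_database_traffic():
    """Undo the fault so the experiment always ends with a clean environment."""
    pass


def run_experiment():
    # Hypothesis: with the database unreachable, the service still returns a
    # successful response served from cache, rather than an error.
    baseline = requests.get(APP_URL, timeout=2)
    if not baseline.ok:
        raise SystemExit("Abort: the system is unhealthy before the experiment even starts.")

    block_database_traffic()
    try:
        time.sleep(5)  # give the fault a moment to take effect
        try:
            during_fault = requests.get(APP_URL, timeout=2)
        except requests.RequestException as exc:
            print(f"Hypothesis failed hard ({exc}); still a learning, not a wasted experiment.")
        else:
            if during_fault.ok:
                print("Hypothesis held: cached data was served while the database was unreachable.")
            else:
                print(f"Hypothesis failed: HTTP {during_fault.status_code} during the fault.")
    finally:
        restore_database_traffic()  # shrink the blast radius back to zero, no matter what


if __name__ == "__main__":
    run_experiment()
```

Note that the sketch starts with a health check and ends in a finally block: the thoughtful, planned part is baked into the harness rather than left to memory.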
And so ultimately it comes down to learning about your systems and building resiliency over time. One of the aspects that I think appeals to me the most is that it doesn't need to be
a world-changing disaster that can be modeled.
A lot of times it's something on the order of you add 100 milliseconds of latency to
every database call.
What kind of degradation does that cause?
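That 100 milliseconds is the sort of fault you can inject with very ordinary tools. As one hedged example, on a Linux host with root access, the kernel's netem queueing discipline can add a fixed delay to an interface; this sketch simply shells out to tc, which is not how Gremlin implements it, and the interface name and delay are assumptions you would adapt.

```python
import subprocess


def add_latency(interface: str = "eth0", delay_ms: int = 100) -> None:
    """Add a fixed delay to all traffic leaving the interface using tc/netem.
    Requires Linux and root privileges; adjust the interface for your host."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", interface, "root", "netem", "delay", f"{delay_ms}ms"],
        check=True,
    )


def remove_latency(interface: str = "eth0") -> None:
    """Remove the netem qdisc, restoring normal latency on the interface."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)


if __name__ == "__main__":
    add_latency()  # every call over eth0 now takes roughly 100ms longer
    try:
        input("Latency injected; run your measurements, then press Enter to restore...")
    finally:
        remove_latency()  # always undo the fault, even if interrupted
```

One caveat worth noting: this delays all traffic on the interface, which is a larger blast radius than targeting only database calls, so it belongs in a test environment first.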
And that's a fascinating idea to me just because so many DR plans that I've seen are built
on ridiculous fantasies of assume the city lies in ruins,
but magically our employees care more about their jobs than they do about their families.
And they're still valiantly trying to get to work. It never works out that way. There's also a
general complete disregard for network effects. For example, how many DR plans have you ever seen that say,
what if all of US East 1 goes down? Okay. And we're going to just automatically try failing
over to US West 2 in Oregon, for example. Okay. But somehow we're going to magically pretend that
we are the only company in the world that has thought ahead to do this. There's no plan
put in place for things like half the internet is doing exactly what you're talking about. Perhaps
provisioning calls will be particularly latent, so you may want to have instances already spun up.
What if there are weird issues that wind up clogging network links where, oh, we want to
shove a bunch of data there. Oh, well, first, we can't get to that data in its original case,
and oops, there was a failure in planning.
Or instead, it winds up being way too long since this was tested,
and there are entire services that this was never thought about.
So it feels to me, on some level,
like chaos engineering is, in some respects,
an end run around the crappy history
of disaster recovery and business continuity planning?
It just allows people to be a lot more granular,
in my opinion.
Like I said, the DR efforts need to be there still.
I'm not saying that they don't serve any purpose and they don't have a place.
You should think about that.
But it allows you to think more granularly:
in the AWS world,
what happens if an availability zone goes down?
It's not always about an entire region going down.
From a chaos engineering practice perspective,
we actually advocate for starting small.
You want to start asking questions around,
what if one link fails?
What if one host fails?
What if just a small component fails?
Are you able to handle that?
Then you start dialing up this blast radius
and think about what happens if a wider array of things fails.
What happens if your entire fleet of servers goes down?
Eventually, gradually, you definitely do want to get to the point
like what Netflix can do.
They can fail over regions.
But to want to get there on day one
is an extraordinary amount of work.
So you can start small.
So what happens is a lot of people want to do that
and they just find it too difficult
and throw their hands up in the air.
I agree.
I think there's also a challenge
where you see people who tried something,
it was hard, and they back away from it
and don't want to do it again.
I'm fascinated by stories of failure in an infrastructure context, not because I enjoy
pointing out the failures of others, but because this is something that we can all learn from.
Only a complete jerk watches their competitors' websites struggle and fail under load and is happy
about this, because it could be you tomorrow.
There's the idea in the operations world of hug ops.
If someone else works at a company that is a direct competitor to yours, you still hope they get past their outages reasonably quickly.
I don't think as an industry comprised of operational professionals and engineers that we are particularly competitive in that sense. We want to have the best technology,
but not at the expense of our compatriots
disappointing their customers.
Yeah, absolutely.
We all feel the pain, that's definitely for sure.
Whether it's AWS, whether it's other cloud providers,
whether it's Gremlin,
our approach is to really embrace failure,
but really is to learn from this failure to then build resilience.
Absolutely. And that's not something that ever comes out of a box. I mean, I remember a few
years back, a particular monitoring company on Twitter saw that a company was having an outage
and chimed in with, if you were using our service, you'd probably be back up now. And they were
roundly condemned by most of the internet
for that. It turned out that a marketing intern had gone a little too far in how do we wind up
being relevant to what's going on on the internet right now without the contextual awareness to
understand what was going on. And that was something that became very, I guess, heartwarming
in a sense in that we're all in this together and we don't use it
as an opportunity to capitalize on the misfortune of others. Maybe that's something that runs
counter to what they teach at Harvard Business School. But fundamentally, it speaks to an
empathetic world in which I prefer to exist. There's totally a human side to this.
For sure. So at this point, it seems to me that trying to convince a company to embrace the
theory and the idea of chaos engineering has got to be a bit of an uphill battle in some cases.
So our problem is that our site keeps breaking and falling over when things stop working. And
your genius plan is to go ahead and start breaking things intentionally. Why do we pay you again? Seems like one of those natural
objections to the theory. In fact, that was not me mocking someone else. That's what I said the
first time that someone floated it past me. How do you introduce chaos as a step towards making
things better and not get laughed out of the room? So I think there's another misconception here where a lot of people might think about chaos
engineering as something you just go in and break things on purpose.
We are breaking things, but an important part of thinking about chaos engineering is that
thoughtful and planned nature of it, where you are thinking about the experiments,
you are planning ahead of time,
you're communicating both to upper management
as well as to other services
on what you're trying to achieve.
The goal is resilience.
The goal is not things breaking.
The hard part, of course, is getting there.
Are there some companies that are frankly too immature
for chaos engineering to be a viable strategy?
If a company's struggling to keep its website up,
it feels like introducing failure that early in the game
may not be the best path.
Maybe that's wrong.
I think it's a journey.
Chaos engineering is a journey.
You're not jumping into the deep end right away.
So we don't advocate for you not knowing what you're doing
and just running a bunch of things that break production.
You're not going to just say,
let's shut down all our servers in production
and see what happens.
That's not the intent of chaos engineering.
I like to dial it back to the thoughtful and planned approach all the time
because you are trying to do things that are in a very controlled manner.
You almost want to know what the outcome is
and you're just verifying, validating it
and also catching some surprises in a safe environment.
Well, I do understand the answer to this question may very well be,
hey, Gremlin, but I'd like a slightly more nuanced answer to:
how do you get started with the idea of chaos engineering?
For people who are listening to this right now and saying,
that's fantastic, I'm going to implement that right now.
And because people are listening while driving,
they ram a bridge abutment to see what happens.
Don't try that.
How do people get started once they safely get to the office?
Just pay Gremlin.
Serious note, as I mentioned,
I think it's a little bit of a mindset
as well as using the right tools.
You can simply look at your service,
look at your architecture, look at your
architecture diagram and ask the same questions.
What could go wrong?
Are there some hard dependencies?
Are there some soft dependencies?
Even if you know ins and outs of your code and your architecture, there's always some
learning by experiencing some of these failures.
And so to really get started, you want to ask yourself: what could go wrong?
Now, in terms of resources,
we actually have a Chaos Engineering Slack
that is not just about Gremlin
but about the overall chaos engineering practice,
and you're welcome to join the Chaos Engineering Slack.
Oh, wonderful. I'll throw a link to that in the show notes.
One question I do have,
and this might be a little on the sensitive side,
so if it is, I apologize.
But it feels like in many cases,
there are some companies,
Netflix, sorry, something stuck in my nose there,
that like to go on stage
and talk about things that they are in fact running.
And they talk about the cases in which it works,
but they don't talk about
edge cases and things where that doesn't wind up applying. A classic example of this might be,
for example, a company gets on stage and talks about how everything they do is cattle instead
of pets. And they're happy and thrilled with this. And then you go and look at their environment and
well, okay, yeah, their web tier is completely comprised of cattle. There are no pets, but their payroll is running on an AS400 somewhere. And the databases that handle
transactions are absolutely bespoke unicorns. To that end, how much of chaos engineering as
implemented today in the outside world is done holistically throughout the system or is it mostly focused on one particular area
and then broadened out over time?
I like that question.
That's a pretty interesting question.
To put it into perspective,
even a company the size of Netflix, as you mentioned,
has different teams,
different organizations,
pretty widespread.
They use different types of technologies.
The important thing about chaos engineering
is to embrace failure and prepare for failure.
So whether it is Chaos Monkey
that is shutting down hosts
or just ensuring that if my tool's not working, I can still use a
spreadsheet to track something. The important things with chaos engineering really is about
having that failure mindset and making sure that you're prepared for failure.
You just said something that flipped a bit in my mind. The idea of using a spreadsheet when a tool isn't available,
is that something that you talk about as an aspect of chaos engineering?
Most discussions I've seen have been purely focused on technical failover
and technical resilience.
You're talking about resilience of business process.
Absolutely, because when we talk about game days,
that's actually an element that we haven't discussed deeply.
So let me very quickly just talk a little bit about game days.
A game day is a time when you can bring in different people and collaboratively run experiments to reveal weaknesses, as well as just to learn about a service or a system.
And so as you're doing the game day, your experiments are not only looking at the technical
aspects, whether an application is able to handle a failure by doing retries or timing
out or some graceful degradation.
You're actually also validating your observability, whether you can see what's going on in the system,
whether you're getting the proper alerts
when certain thresholds are crossed,
and then all the way to the point where your on-call person,
when they get the alert,
they have enough information to take action.
That's all part of the experiments
and your learning as a whole.
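As a rough illustration of that last point, a game day experiment can assert on the monitoring itself, not just the application: after the fault is injected, poll the alerting system and confirm that the alert the on-call engineer depends on actually fires. The endpoint, alert name, and response shape below are invented for the sketch; substitute your own monitoring API.

```python
import time
import requests

ALERTS_API = "http://monitoring.internal/api/v1/alerts"  # hypothetical alerting endpoint
EXPECTED_ALERT = "checkout-latency-high"                 # hypothetical alert the on-call relies on


def alert_is_firing(name: str) -> bool:
    """Ask the (assumed) monitoring API whether the named alert is currently firing."""
    alerts = requests.get(ALERTS_API, timeout=5).json()
    return any(a.get("name") == name and a.get("state") == "firing" for a in alerts)


def verify_alert_fires(timeout_s: int = 300, poll_s: int = 15) -> bool:
    """After the fault is injected during the game day, confirm the alert fires
    within the window the team agreed on. A failure here is an observability gap."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if alert_is_firing(EXPECTED_ALERT):
            print("Alert fired: the observability part of the hypothesis held.")
            return True
        time.sleep(poll_s)
    print("No alert within the window: the gap is in monitoring or paging, not the application.")
    return False
```

A check like this turns "are we getting the proper alerts?" into something the game day can answer with a yes or no, rather than a hunch.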
This is actually an interesting aspect to chaos engineering.
A lot of people experience on-call by just being given a pager
and here's a runbook.
Here, catch.
Exactly, go ahead.
And the real learning comes from your first incident,
when you're very nervous about it and you don't know what to do.
So by thoughtfully planning and executing these experiments,
you're also allowing your on-call person to get ahead
and know what's coming so that they're more calm
as they execute it.
Even with runbooks, on-call engineers have made mistakes before
because they're just in this very nervous state.
And so if you calm them down in training,
helping them out to understand what the flow looks like
so that they know what to expect,
I'm sure they will feel better when a real incident comes in.
Which makes an awful lot of sense.
So there's also a conference coming up, if I'm not mistaken,
where you talk about the joy, glory, and pain
that is chaos engineering. And I'll throw a link to that in the show notes. What do you find,
or what are you hoping that comes out of a community gathering to talk about the principles
of chaos? Yep, absolutely. There's going to be a Chaos Conf that's happening in September in San
Francisco. And we have a great speaker lineup that is going to talk about their experiences in chaos engineering,
talking about failure scenarios that they've experienced and how they handle the situation.
As well as just overall a good gathering of like-minded people to talk about failures, talk about how to embrace it,
how to prepare for failures,
as well as how to handle certain situations.
It sounds like it'll be a lot of fun.
I know I'm looking forward to it.
Thank you once again for being so generous with your time.
I definitely appreciate it.
And I have to say, it's nice being here in the office
and recording with you.
And not once during this entire session
did the lights go out or a wall fall over.
So living with chaos engineering
does not mean that every minute is a disaster.
Thank you for having me.
Always a pleasure.
This has been Ho Ming Li with Gremlin,
and I'm Corey Quinn.
This is Screaming in the Cloud.
This has been this week's episode of Screaming in the Cloud.
You can also find more Corey at
screaminginthecloud.com or wherever
fine snark is sold.