Screaming in the Cloud - Cloud Resilience Strategies with Seth Eliot

Episode Date: October 17, 2024

Seth Eliot, Principal Resilience Architect at Arpio, and former Global Reliability Lead at AWS, joins Corey to discuss cloud resilience. He emphasizes that Multi-AZ setups are typically sufficient, with multi-region configurations only necessary for specific risks. Seth highlights the importance of balancing cost and resilience based on business needs, while cautioning against making resilience a mere checkbox exercise. Together, they explore disaster recovery challenges, noting that many companies fail to account for real-world complexities during testing. Seth also stresses the importance of avoiding control plane dependencies and warns that poorly designed multi-cloud setups can introduce additional risks.

Show Highlights
(0:00) Intro
(1:12) Backblaze sponsor read
(1:40) Seth's involvement in the Well-Architected sphere of AWS
(4:43) Well-Architected as a maturity model
(6:46) Cost vs. resilience
(10:37) The tension between resiliency and the cost pillar
(13:26) Legitimate reasons to go multi-region
(18:31) Mistakes people make when trying to avoid an AWS outage
(24:07) The challenges of control planes
(25:04) What people are getting wrong about the resiliency landscape in 2024
(26:31) Where you can find more from Seth

About Seth Eliot
Currently Principal Resilience Architect at Arpio, and ex-Amazon, ex-AWS, ex-Microsoft, Seth has spent years knee-deep in the tech trenches, figuring out how to design, implement, and launch software that's not just fast but also bulletproof. He thrives on helping teams tackle those "make or break" technical, process, or culture challenges, then partners up to solve them. As the Global Reliability Lead for AWS Well-Architected, Seth didn't just work with customers; he scaled his insights via workshops, presentations, and blog posts, benefiting thousands. Before that, as one of the rare AWS-dedicated Principal Solutions Architects at Amazon.com (yep, not AWS, but the mothership itself), he rolled up his sleeves with engineers to fine-tune the AWS magic powering Amazon.com's immense stack. Earlier? He led as Principal Engineer for Amazon Fresh and International Tech, and before that, helped bring Prime Video into homes everywhere.

Links
Personal site: https://linktr.ee/setheliot
LinkedIn: https://www.linkedin.com/in/setheliot/
Twitter: https://twitter.com/setheliot

Sponsor
Backblaze: https://www.backblaze.com/

Transcript
Starting point is 00:00:00 What risks does your company need to be protected against? If it's the risk of a fire in a data center or a minor electrical outage, guess what? Multi-AZ is fine. You don't need to be multi-region. Because again, it's multiple data centers when you're multi-AZ. Welcome to Screaming in the Cloud.
Starting point is 00:00:21 I'm Corey Quinn. You may know my guest today from the re:Invent stage or from various YouTube videos or, you know, blogs if you're more of the reading type. Seth Eliot was a principal solutions architect at AWS and is currently between roles. Seth, you're free. What's it like to suddenly not have to go to work at an enterprise-style company, which I think is where you spent your entire career? No, instead I'm working full-time doing interviews. It's another job to have.
Starting point is 00:00:51 It's a lot less process, a lot fewer stakeholders have to sign off on these things most of the time. Be your own boss, work from home. What's that like? But hey, after hearing this conversation, anyone out there thinks they could use my skills as a cloud architect with a focus on resilience.
Starting point is 00:01:06 I'm on LinkedIn. That's Eliot with one L and one T. And we will, of course, toss that into the show notes. Backblaze and leading CDN and compute providers like Fastly, Cloudflare, Vultr, and CoreWeave. Visit backblaze.com to learn more. Backblaze, cloud storage built better. You've been talking for a while now about something that I've always had a strange relationship with, specifically the well-architected framework, tool, team, process, etc. at AWS. What is your involvement with that whole well-architected sphere? So actually, I had a really interesting job, which we could talk about in another question, as an AWS Solutions
Starting point is 00:02:00 Architect embedded in Amazon.com, not working for AWS, but working for Amazon. But my first job in AWS was as global reliability lead for AWS Well-Architected. So there were five pillars at the time. There are six now. Let's see if I can remember them. Operational excellence, performance, security, reliability, cost, and the sixth is sustainability. Cost might be of some interest to you, Corey. I'm not sure. But I joined that team as reliability lead. There is. Computers tend to do interesting things in that sense, Corey. I'm not sure. But I joined that team as reliability lead. There is. Computers tend to do interesting things in that sense, but we'll roll with it. A question that I always had when I first heard about the well-architected framework coming out was, oh, great, I might be about to learn a whole bunch of things.
Starting point is 00:02:37 And I've got to be direct. I was slightly disappointed when I first saw it because it's, well, this is just stuff. This is how you're supposed to build on the cloud. Who's worked in operations and doesn't know these things? And then I started talking to a whole bunch of AWS customers and it turns out lots of them, lots of them have not thought about these things because not everyone is an old grumpy sysadmin like I am. And where you sort of learn exactly why through painful experience, you do things the way that you do. Yeah, exactly. You know, certainly the performance pillar would play a role here in terms of this recording and the lag issue. But yeah, all the pillars are important and they're a really great resource.
Starting point is 00:03:15 Basically, they're just a set of best practices that everyone should be knowledgeable about in building the cloud. And if you take a conscientious decision to not do one of those best practices, we want you to make sure it's a conscientious decision and not just a decision of omission. What I found, though, about it, and you mentioned this when you talked about the well-architected tool, is that people tend to think of well-architected as the well-architected tool. And it's not. Well-architected is the best practices, and the tool is a great resource by which you could sit down with six of your friends that you work with from all different parts of the company and do a multi-hour, sometimes multi-day
Starting point is 00:03:50 deep dive review, going through each of the best practices and checking the boxes. And that actually is useful if you're actually the right people there and you have the right conversations. But what I was finding was people were not doing that. They were just doing this as a checkbox exercise, taking it off in an hour. And that's not useful. So when I joined, I wanted to bring the best practices to the people, democratize them. So a couple of things that I did, I led the effort to rewrite my pillar and all my pillar leads followed suit so that it's now on the web. It's not a PDF, not an 80-page PDF, but it's on the web with hyperlinks to each section and hyperlinks to each best practice.
Starting point is 00:04:25 So I want to know about just disaster recovery. I could zip right in there. And the other thing I did is, like you said, I got on stage, I wrote blogs. I wanted to get the best practices into people's hands, how they wanted it, whether they want to do a review or not. And reviews are useful, don't get me wrong, but whether they want to do a review or not, I want to make sure they had those best practices. It always felt like something of a maturity model where it's easy to misinterpret the, okay, what is your maturity along a variety of axes and different arenas? The goal is not to get a perfect score on everything. It's to understand where you are with these, along these various pillars, to use your term,
Starting point is 00:04:59 and figure out where you need to be on those pillars. Not everything needs to excel across the entire universe of well-architected. If you're making that decision, it should be a conscious decision rather than, well, apparently cost is not something we care about here after the fact. You'll care about it when you get your bill, right? Exactly. So that's why I presented the best practices that I did. That's why I would talk one-on-one with customers. I mean, basically, a well-reviewed review only works well if you're having a conversation. If you're just going to a checkbox exercise, then don't even bother.
Starting point is 00:05:35 It's not going to really tell you anything. So I decided there's too many ways to have conversations. One is with the tool, and there's another at a whiteboard, and there's another on a stage, and there's all kinds of ways to have conversations. And I'm talking about conversations with the engineers and there's another at a whiteboard and there's another on a stage. And there's all kinds of ways to have conversations. And I'm talking about conversations with the engineers and the business, right? Like if we're going to like disaster recovery objectives, guess what? That's a business decision. That's not a technical decision.
Starting point is 00:05:56 It's a technical implementation. But your business better, you know, you better work with them to understand what those disaster recovery objectives are. In the time before cloud, I wound up being assigned to build out a mail server for a company that I worked at. That's where I come from, is relatively large-scale email systems. And my question was, okay, great, what is the acceptable downtime for this thing? And their answer was, no downtime is acceptable. It's, okay, so we'll start with five nines and figure it out from there, which you won't get in a single facility. So we'll have that in multiple data centers because it was before
Starting point is 00:06:28 cloud. That'll start at around $20 million. When can I tap the first portion of the budget? Which led to a negotiation. It turned out what they meant was we kind of like it to mostly be available during business hours with an added bonus outside of that. Okay, that dramatically lowered the cost, but it was a negotiation with the business. There is kind of this lever that, and we talk about pillars being in intention with each other. There's this lever, cost versus resilience, right? And it's not always that.
Starting point is 00:06:55 You can definitely add resilience without additional cost by doing certain smart things and optimizations. But very often it's that cost resilience lever. And I try to talk to customers about it and say, you have to decide where you want that lever to be. There's no magic formula that you get lowest cost and highest resilience at the same time. You were embedded in an Amazon team before you wound up moving over to AWS. Were you working on the equivalent of the responsibility model or the actual responsibility model? I confess I don't know much about the internal workings of the Amazon behemoth.
Starting point is 00:07:27 I was embedded in several teams. I actually started at Amazon in 2005 when it was this relatively small company in Seattle. I remember in 2005, when I moved here and I'd go to a gym, they'd say, oh, we have discount for Microsoft members. Are you a Microsoft member? I said, no, I work for Amazon. They're like, nah, we don't have a discount for them. They're too small. So I started out on the catalog team deep into the inner workings. I later worked on the team that launched Prime Video. It was called Amazon Unbox at the time, if anybody out there remembers that. I then took a break for
Starting point is 00:08:03 a little period of time. If you want to call working at Microsoft a break. It was actually quite rewarding. Yeah, for a brief half decade or so. Yeah, it worked out. I worked at Amazon Fresh, and I worked in international technology. And then the role I was telling you about before was the most interesting one, where I was embedded as one of only two AWS-focused solution architects in all of Amazon.com, not counting AWS, but in Amazon.com, working across Amazon teams in, obviously in Seattle, but in Tokyo and Beijing and Luxembourg and Germany on their adoption and modernization on AWS. And that was a really cool and interesting job. And I got to see how teams build with AWS, a wide variety. And the number one thing, interestingly enough, is that the cool thing about internally at Amazon is that cost is a first class metric. You'd be happy to know that everybody I talk to when I'm talking about system design and architecture, they ask about cost. Should we use Lambda? Lambda seems cool. I don't want to maintain servers. Okay, how much is it going to cost? And we'd have to work through those numbers and make a decision. Is it worth the cost? So cost is a first class metric inside Amazon.com. You folks, I don't remember the exact timing on this. You may have been at AWS by then, but they came out with a shared responsibility model for resilience as opposed to the one for security,
Starting point is 00:09:29 which I always somewhat cynically tended to view as, well, you have to put in the shared responsibility model because when someone gets breached, you need to drag that out for 45 minutes. You can't just say, you left it misconfigured. It's your fault. We wash our hands of you. Well, Corey, you're looking at one of the co-authors of that shared responsibility model.
Starting point is 00:09:47 So there was already a shared responsibility model for security. And Alex Livingston and myself wrote a white paper on disaster recovery, which became very, very popular in late 2020 and early 2021. And we could talk about why, but we wrote a white paper on disaster recovery. And in there, we put the shared responsibility model for resiliency. And that since has been backported into the reliability pillar too. And it's important actually to say, hey, look, you're talking about myths, right? There's a myth on some people's part that says to make it more resilient, we put it in the cloud. We put it in the cloud, it's now more resilient. That is not necessarily so. You put it in the cloud and you take these steps, those steps being
Starting point is 00:10:29 the best practices in the reliability pillar and the operational excellence pillar, and now it's more resilient. You can't just count on the cloud to do everything for you. One of the inherent tensions behind the way to approach resilience is it cuts against the grain of the idea of the cost pillar, where you get to make an investment decision of, do you invest in saving money, or do you spend the money to wind up build boosting your resilience? And it's rarely an all or nothing approach. But it's always been tricky to message that from the position of AWS, because it sounds an awful lot like, we would like to see more data transfer and maybe twice as many instances in different regions or availability zones. That would be, well, that's the right way to do it. It rings a little hollow, though. I have no doubt that was in no way, shape or form
Starting point is 00:11:14 your intention at the time. As I said, it's sometimes a lever. I mean, but it's sometimes not like if you're running on EC2 instances and you switch over to like Kubernetes and put yourself on pods, you could be using a lot less EC2 instances and have a lot more pods in place, and you make yourself more resilient. So it's not always more cost. But if you're talking about disaster recovery, a question I used to get a lot is, do I need to be multi-region? Okay, I'm multi-AZ, multi-availability zone. And for folks that might not know, that means I'm in multiple physical data centers, but I'm still within a single region. That makes me highly resilient if I've taken, or highly available, I should say,
Starting point is 00:11:48 if I've taken the right steps to architect it. But do I need to be multi-region? And when I go multi-region, yeah, I'm going to be setting up infrastructure in another region, and that's probably going to increase my costs. So when answering that question, I always ask, what risks are you trying to protect yourself against? Again, it's a that question, I always ask, what risks are you trying to protect yourself against? Again, it's a business question, right? What risks does your company need to be protected against? If it's the risk of a fire in a data center or a minor electrical outage, guess what? Multi-AZ is fine.
Starting point is 00:12:20 You don't need to be multi-region because, again, it's multiple data centers when you're multi-AZ. So then people say, why do I need to be multi-region? And the first thing that comes to mind is, oh, a comet, you know, wiping out the eastern seaboard, or a nuclear bomb, or, you know, something like that. And guess what? That's never happened yet. I mean, yesterday's performance is no guarantee of tomorrow's returns, but no comets, no nuclear bombs. Right. There's also the idea that, okay, assume that you do lose the entire U.S. East Coast somehow. How much are people going to care about your specific web app? I guess not that much in that extreme example. There's a question of what level disaster are you bracing for?
Starting point is 00:12:57 Yeah, we talk about that, too. Your disaster recovery strategy is part of your business continuity plan. Do you have a plan for getting workers in front of keyboards? Do you have a plan for your supply chain if the eastern seaboard is wiped out? If not, don't worry about the tech right now. It's not important. But I was standing in front of a crowd once teaching a course, and I gave the whole, you don't have to worry about nuclear bombs for your service, probably. And then I realized I was literally in Washington, D.C., talking to public sector. I'm like, well, maybe some of you do. Maybe that is on your business continuity plan. One thing that I'll see a lot when people try to go with multiple regions is that they'll very often, ah, we don't want a single point of failure, so we're going to use
Starting point is 00:13:33 multiple regions with a second region. And then after a whole lot of work and expense and time, they built a second single point of failure. That's funny. But there is a reason. There is a legitimate reason to go multi-region, and you know what it is. Do you care to venture a guess on what that is? In many cases, there's the idea of a few things. You can separate out different phases of what's going on in your environment at different times. There's a hard control plane separation. So if there's a region-wide service outage, theoretically, you can wind up avoiding that by being in multiple regions. And of course, there's always the question of getting closer to customers too. Oh yeah, well, that's not a resilience issue. That's a performance issue,
Starting point is 00:14:14 and that's a legitimate reason to go multi-region. But yeah, the number one thing that the actual risk we have seen, and across all cloud providers, this is not a slam on AWS, all cloud providers is an event, a nice service, or a cloud provider-owned network. And if you have a hard dependency on that service, or unfortunately, if you have a hard dependency on other services that depend on that service, then you need a plan to be able to go to another region. So I mentioned how our disaster recovery white paper became really popular in 2021. That's because in December 2020, there was a Kinesis event in US East 1. And I don't know how many people use Kinesis,
Starting point is 00:14:54 but a lot of AWS services apparently at the time used Kinesis. So like I think CloudWatch, I don't know. I could be wrong. Double check me on that. But so people were affected. And so if you need to protect yourself against that, and again, it's all cloud providers, and actually, AWS likes to show data that is objectively true that they have the least number of these events of all the major cloud providers. But if you need to protect yourself okay, great. Well, Amazon themselves is a single point of failure. The credit card payment instrument that I have on file for my AWS account is in fact a single point of failure to some extent. And I'll see companies in some cases storing, rehydrate the business level backups in another provider where they're
Starting point is 00:15:39 certainly not going to be active active, but they don't have to shut down as a going concern in the event that something catastrophic happens AWS-wide. Yeah, again, it's about risk assessment. If you're afraid of your cloud provider going belly up in some reason, I think that's a pretty, pretty low risk. But you're right, taking the step of just doing backups of your data and infrastructure to another cloud provider without any of the operational ability to bring it up quickly. You're just, this is the, oh crap, recovery scenario. I know it's going to take a long time. My recovery time is going to be extended, but this is like, I don't care that it's a long RTO because it's protecting myself against a risk that's probably never going to happen. That seems
Starting point is 00:16:17 legitimate. But yeah, I was just simply trying to say, not protect yourself against your cloud provider or protect yourself against an event in a region of your cloud provider. I know of at least one company that winds up having to rehydrate the backups level of infrastructure and other providers specifically so they don't have to call it out as strongly as a risk factor in their quarterly filings. In some cases, it's just easier to do it and stop trying to explain this to the auditor every quarter and just smile, nod, and do the easier thing. It comes down to being, again, a business decision.
Starting point is 00:16:49 And that could be a fairly low effort to implement. It's going to cost you. Data transfer is going to cost you. Data storage on the other provider is going to cost you. But yeah, unlike a full... So when I talk about disaster recovery, I adopted the model that pre-existed before me at AWS of four strategies, backup and recovery, which is the one we're talking about now. Very easy to do, but very long recovery times and
Starting point is 00:17:11 longer recovery points, although not always. And then moving towards shorter recovery times, you could have a pilot light, or you could have a warm standby, or you could have an active-active, which I consider to be both a high availability and disaster recovery strategy. Some of my colleagues at AWS consider it not to be a disaster recovery strategy, but that's an argument. I don't know if it's worth getting into. One thing that I found when I was doing my own analysis for the stuff that I built, I have an overly complicated system that does my newsletter publication every week, and it's exclusively in US West too, in AWS
Starting point is 00:17:45 and Oregon. And the services that I use are all, as it turns out, multi-AZ. So there's really no reason for me to focus on resilience in any meaningful sense. Because if Oregon is unreachable as a region for multiple days, well, that week I can write the newsletter by hand because I think I'm going to have a bigger story to talk about than they released yet another CloudFront edge pop in Dallas. No, it's part of the plan, Corey. They're bringing you down so you can't write about the outage. But honestly, you are resilient. You're highly available, but you don't have a disaster recovery strategy because you don't need one.
Starting point is 00:18:20 So just to be clear, resilience has that high availability piece. I'm in multiple availability zones. I could tolerate component level failures versus disaster recovery where I need to stand myself up somewhere else. so in the event that AWS has an outage, we're not exposed to it. In practice, what they do is they expose themselves to everyone's vulnerability unless they're extremely careful. And if they're, I don't know, using Stripe to process payments. Stripe is all in on AWS. So great, we're now living on Azure,
Starting point is 00:18:58 but our payment vendor has the dependency on AWS. So when there's an actual serious outage, you wind up with dependency issues that are several layers removed. Some cases, your vendors know about it, and many more they don't. So when we see things like that giant S3 issue about seven years ago,
Starting point is 00:19:15 well, that's one of those things where everyone's learning an awful lot about the various interchain dependencies as you go down this path. Though on the plus side for most of us, on that day, just the internet is having a bad day. So we don't have to spend a lot of time explaining why we alone are down. There's safety in numbers. First of all, you're still talking about S3 from over a decade ago. I need to educate you on more recent events so you have more
Starting point is 00:19:38 recent stuff to talk about. But as for your point of dependencies, that's so true because we have customers that will look at, we used to publish, well, we publish SLAs, but those are not guaranteed numbers. Those are just a cost agreement of what any provider is going to credit you. And we used to publish the design for reliability. They since took them away. It used to be part of the reliability white paper. And you'd see the number of nines designed for EC2 and S3. And people would like to try to take those numbers and do the availability math. Availability math says if I have redundant ones,
Starting point is 00:20:09 I could like, you know, if I have two things that are four nines and I put them redundantly, then I now have eight nines. But if I have two things that are four nines and they're in parallel with each other, I have to multiply the errors times each other and I have less than four nines.
Starting point is 00:20:22 But, and that availability math is well known. And, but you try to do that is the way to madness, right? Because yeah, you've done that times each other, and I have less than four nines. And that availability math is well known. But you try to do that is the way to madness, right? Because, yeah, you've done that with all the components in AWS, but how about those third-party providers? How about DNS? How about the internet having problems? You're not accounting for all those other things. So I think you're not getting a real number when you do that. And even when you're doing DR testing, you have these scenarios where, okay, we've tested our ability to fail between regions. But what you haven't done is tested your ability to fail between regions when everyone else in that region is doing something similar, because it's not just you doing this at three o'clock on a Tuesday afternoon. Suddenly there is a service-wide
Starting point is 00:21:04 outage. We'll avoid picking on S3 further. But when everyone is starting to evacuate, you often see, like even an older issue, we saw that with EBS volume failures in US East 1, I want to say in 2012, where suddenly there was the herd of elephants problem that we all learned a lot from. So thundering herd is more of an issue if you're in an availability zone in a region and that availability zone is having issues and there's only two more availability zones to go. So everybody's going to those two availability zones.
Starting point is 00:21:30 That's a real thundering herd issue, especially if you're looking for EC2 availability, instance availability. The type you want, if you're not flexible, might be gone by the time you get over there. When you're talking about multi-region, it's less so because, especially if you're in the U.S., you have multiple is for commercial regions. So A, there's no guarantee everybody's even going to the same region. But B, most people aren't even failing over, right? Not everybody has a multi-region strategy. So we actually haven't seen Thundering Herd happen with multi-region fail. What we have
Starting point is 00:22:01 seen happen is control plane dependency. So I actually added this into the reliability pillar pretty late. It was like the last two best practices I added to the reliability pillar, which was don't take hard dependencies on the control plane if you could help it. Because the way this works is for every service you use, if it's a regional service like S3 or EC2, there's a data plane and there's a control plane in that region. Data plane is basically the stuff running all the time to service actual requests. Control plane of the CRUD operations, create, modify, update, delete. The gotcha is if you're using a global service like Route 53, at the time when I last was at AWS, had a single control plane in US East 1. And so what happened was, I think this was a 2021 outage event or 2022. We saw an outage in US East 1. It was a network outage and it brought
Starting point is 00:22:53 down the control plane for Route 53 so that people couldn't modify their Route 53 records, which was how they planned to do a failover. So they couldn't fail over. Now there are solutions to this. The solution is choose a data plane strategy instead. Since then, AWS has come out with application recovery controller, which I want to hear from you, Corey, what you think of the cost benefit of that is. It's a little spendy, but you could also roll your own application recovery controller by doing something like creating a CloudWatch alarm, connecting that to a Route 53 health check, and having that CloudWatch alarm not check for health, because
Starting point is 00:23:30 that's not reliable, but literally check, is there an object with this name in S3? If not, alarm. And then you could delete that object, data plane operation, the alarm will go off, data plane operation, the Route 53 health check will go off, data plane operation, and it'll swap over. It's very helpful. I do like the application recovery controller. The challenge is it starts at two grand a month, which means for small scale experiments that that gets a little pricey just to kick the tires on and really get a lot of hands-on experience with it. But for the large scale sites that use it, it's who cares money. They're thrilled to be able to have something like that. So it's just a question of who the product is actually for. On the topic of control planes, one of the challenges
Starting point is 00:24:09 I've run into in the past is it's not just is the control plane available, but is it latent? At some point when you have a bunch of folks spinning up EC2 instances, yeah, the SLA on the data plane of those instances is still there, but it might take 45 minutes to get enough capacity to spin up just by the time that your request gets actioned. Yeah, and that's taking a dependency on a control plane. Even if you're multi-AZ, if your plan is, I need to use auto-scaling to spin up EC2 instances in the two remaining healthy availability zones, that's control plane. If you want to avoid that, you need to be statically stable and have capacity. If your costs are three availability zones, then having full capacity in two of them means you're 50% over-provisioned in a given availability zone.
Starting point is 00:24:52 Math works, right? So that's something you have to be willing to pay for. Or take the dependency on the control plane, and it'll probably work, but you're taking on more risk. And this, again, is driven by business need. If you were to take a look at the entire resiliency landscape, as my last question, this is something I'm deeply curious about.
Starting point is 00:25:09 What do you see people getting wrong the most that you wish they wouldn't in 2024? I think in general, what we're looking at is people not understanding how the cloud is different, I think, when they're moving from on-prem. I'm not talking about your mature folks in the cloud, but folks looking to adopt cloud for the first time. It needs to be explained to them that an availability zone is not only a data center,
Starting point is 00:25:32 it's multiple data centers. And the other availability zone, guess what? That's a completely separate set of data centers. So like if your on-prem strategy is to be in two data centers, woo two, that are like 400 miles apart, and that's a really far distance. So there's no chance those two are going to be affected by each other, even a thousand miles apart. Guess what? When you move to AWS, or at least with AWS's availability zone model, if your two availability zones are not 400 miles apart, they're only between 10 and 30 miles apart. But AWS has put in a lot of effort, and I've seen some of these reports. I've seen the reports include the geological survey and the floodplain analysis, so that these availability zones are not sharing the same floodplain, and that if any disaster happens, it's unlikely
Starting point is 00:26:14 that it should affect more than one availability zone. So guess what? You don't have to be 1,000 miles apart. You don't have to be 400 miles apart. Being 30 miles apart is giving you almost that same benefit. Now, let's talk to your regulator, your auditor, and convince them of that too, so you don't have to set up in another region a thousand miles away. I really want to thank you for taking the time to speak with me about all this. Given that you're currently on the market, if people want to learn more or potentially realize that, huh, we could potentially use a cloud architect with a resiliency emphasis where we work, where's the best place for them to find you these days? Well, I mean, just search for me on your search engine of choice, Seth Elliott, E-L-I-O-T, 1L1T. Throw AWS on the end of that, and you'll probably find stuff related to me, especially my LinkedIn account.
Starting point is 00:26:58 That's a good way to reach me. Awesome. And we will, of course, put links to that in the show notes. Thank you so much for taking the time to speak with me. I really appreciate it. Oh, thank you. It's been a pleasure. Seth Elliott, Principal Solutions Architect, currently between roles. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you hated this podcast, please leave a five-star review on your podcast
Starting point is 00:27:22 platform of choice, along with an angry, insulting comment because despite being in four different regions, you didn't take all of the control plane access away from Dewey, who pushed a bad configuration change and brought you down anyway.
