Screaming in the Cloud - Episode 2: Shoving a SAN into us-east-1

Episode Date: March 21, 2018

When companies migrate to the cloud, they are literally changing how they do everything in their IT department. If lots of customers exclusively rely on a single service or region, like us-east-1, then they are directly impacted by its outages. There is safety in the herd and in numbers, because everybody sits there, down and out, together. But that is no excuse not to engineer your application to avoid single points of failure: relying on a sole backing service for something is a bad idea, and it's unacceptable from a business perspective. Today, we're talking to Chris Short from the cloud and DevOps space. Recently, he was recognized for his DevOps'ish newsletter and won the Opensource.com People's Choice Award for his DevOps writing. He's been blogging for years, writing about the things he does every day, such as tutorials, code, and methods. Chris, along with Jason Hibbets, runs the DevOps team for Opensource.com.

Some of the highlights of the show include:

- Chris' writing makes difficult topics understandable. He is frank and provides broad information, and he admits when he is not sure about something.
- SJ Technologies aims to help companies embrace a DevOps philosophy while adapting their operations to a cloud-native world. Companies want to take advantage of the philosophies and tooling around being cloud native.
- Many companies consider a cloud migration because they've got data centers across the globe. Often it's an active-passive setup with two data centers that are treated very differently and cannot be switched between easily.
- Some companies do a cloud migration to refactor and save money. A cloud migration can also mean trying to shove your SAN into us-east-1, and it can become a hybrid workflow.
- Lift and shift is often considered the first legitimate step toward moving to the cloud. However, know as much as you can about your applications and their RAM and CPU allocations.
- Look at density when you're lifting and shifting. Know how your applications work and how they work together.
- Simplify a migration by knowing what instance sizes to use and what monitoring to have in place.
- Some still do not support moving to the cloud due to a lack of understanding of business practices and how they apply, but most are no longer skeptical about the move. Instead of "why cloud," the question becomes "why not."
- Don't jump without looking. Planning phases are important, but there will be unknowns that you will have to face.
- Downtime does cost money, but customers will go to other sites; they can find what they want and need somewhere else. There's no longer a sole source of anything.
- The DevOps journey is never finished, and you're never done migrating. Embrace change yourself to help organizations change.

Links:

- Chris Short on Twitter
- DevOps'ish
- SJ Technologies
- Amazon Web Services
- Cloud Native Infrastructure
- Oracle
- OpenShift
- Puppet
- Kubernetes
- Simon Wardley
- Rackspace
- The Mythical Man-Month
- Atlassian
- BuzzFeed

Quotes by Chris:

- "Let's not say that they're going whole hog cloud native or whole hog cloud for that matter, but they wanna utilize some things."
- "They can never switch from one to the other very easily, but they want to be able to do that in the cloud, and you end up biting off a lot more than you can chew..."
- "Create them in AWS. Go. They gladly slurp in all your VMware instances. You can create a mapping of this sized thing to that sized thing and off you go. But it's a good strategy to just get there."
- "We have to get better as technologists in making changes and helping people embrace change."

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This episode of Screaming in the Cloud is sponsored by my friends at GorillaStack. GorillaStack's a unique automation solution for cloud cost optimization, which of course is something near and dear to my heart. By day, I'm a consultant who fixes exactly one
Starting point is 00:00:38 problem, which is the horrifying AWS bill. Every organization eventually hits a point where they start to really, really care about their cloud spend, either in terms of caring about the actual dollars and cents that they're spending, or in understanding what teams or projects are costing money and starting to build predictive analytics around that. And it turns out that early on in my consulting work, I spent an awful lot of time talking with some of my clients about a capability that GorillaStack has already built. There's a laundry list of analytics offerings in this space that tell you what you're spending and where it goes, and then they stop. Or worse, they slap a beta label on that side of it for remediation and
Starting point is 00:01:23 then say that they're not responsible for anything or everything that their system winds up doing. So some folks try and go in a direction of doing things to write their own code, such as spinning down developer environments out of hours, bolting together a bunch of different services to handle snapshot aging, having a custom Slack bot that you build that alerts you when your budget's hitting a red line. And this is all generic stuff. It's the undifferentiated heavy lifting that's not terribly specific to your own specific environment. So why build it when you can buy it? GorillaStack does all of this. Think of it more or less like if this, then that, IFTTT for AWS. It can manage resources, it can alert folks when things are about to turn off, it keeps people appraised of what's going on, more or less the works. Go
Starting point is 00:02:13 check them out. They're at GorillaStack.com spelled exactly like it sounds. Gorilla like the animal, stack as in a pile of things. Use the discount code screaming for 15% off the first year. Thanks again for your support Gorillastack. Appreciate it. Welcome to Screaming in the Cloud. I'm Corey Quinn, and I am joined today by Chris Short, who has been doing a number of interesting things in the cloud and DevOps space, but most recently has been recognized for his newsletter, DevOps-ish. He also won 2018 OpenSource.com People's Choice Award for his writing in the DevOps space. Chris, thanks for joining me today.
Starting point is 00:02:55 Can you tell us a little bit about your writing and what it is that brought you to the people's attention? Thanks for having me, Corey. The writing, I've been kind of blogging on and off for years, and it hit me maybe a year orource.com, which don't hark at the DevOps team moniker. It's actually a team of talks going forward for DevOps type things, which is interesting because he's Jason is more of a community manager writer and I'm more of a technical hands on person. So we have an interesting dynamic as far as like lessons learned and then how they apply not just to technical fields, but also other ones. Right. And I've been a fan of your newsletter for a little while, almost since before I launched last week in AWS. I tended to be very focused and very snarky on one area. You tended to be a lot broader and frankly, a lot kinder to the things that you write about as a general rule. Wow. So what I appreciate is your ability to dissemble intelligently about such a wide variety of
Starting point is 00:04:27 different topics and also admit when you're unclear on something, when you're not entirely sure on how something fits in. I've seen you call that out on a number of occasions. And to my mind, that's always been the mark of mastery, where you can take a look at something and understand that, oh, there's something I'm not seeing, so I'm going to call it out rather than hand-waving my way past it and faking it. So just always been a fan of that type of approach. I appreciate that. And don't get me wrong, last week in AWS gives me as many lives as it does knowledge about what to do and what not to do in
Starting point is 00:05:00 AWS. So I appreciate that very much. So most recently, you've been working at SJ Technologies with John Willis. Yes. And you've been focusing on helping companies embrace DevOps philosophy while adapting their operations to a cloud native world, as you put it. What does that look like on a, I guess, a day to day basis for you? So let's look at it as two things, the cloud native world, you know, cloud native means a lot of things to a lot of people. A friend, Justin Garrison and Chris Nova wrote a great book, Cloud Native Infrastructure. I highly recommend it. Go read it. But a cloud native world does not necessarily mean that these organizations are like moving to AWS or Google Cloud or Azure. It just means that they want to take advantage of some of the
Starting point is 00:05:43 philosophies and tooling around being cloud native. Now, that's not to say they're going whole hog cloud native or whole hog cloud for that matter, but they want to utilize some things. So one of the clients I'm working with is actually using an Oracle tool and OpenShift on bare metal in their own data center. But they want to really lean forward and embrace moving faster. And one of the key takeaways for them is they want to be able to develop their mobile app and make it better, faster. So it all comes back to consumer outcome focus type things, just like the DevOps work we do, where we worry about what the outcome is for various steps in the software development lifecycle. And coming back to culture and as well
Starting point is 00:06:32 as, you know, some tooling here and there, we're not so much focused on tooling, but more so on process. If you want to use Puppet for configuration management and Ansible for, you know, some kind of deployment pipeline and Kubernetes for something else, all within your one little world, fine. We're more concerned about the process itself, creating that left to right flow, shifting the security things closer to the left so you detect them earlier,
Starting point is 00:06:59 as well as adapting some of the value stream mapping type processes that you see in Lean and other disciplines in the DevOps world. I hear echoes of Simon Wardley's mapping starting to creep into the conversation there. I'm detecting a recurring theme in conversations I have around this sort of thing.
Starting point is 00:07:16 But rather than going into those particular weeds today, something that you've talked about on and off for a while is the concept of doing migrations. When you think about doing a migration, a cloud migration specifically, in your mind, is that generally coming from on-prem? Is it coming from a different cloud provider into something else? Or are you viewing it as something else altogether? You know, I've seen it kind of go two ways.
Starting point is 00:07:41 You've got all these data centers across the globe where you're doing colo, you have your own data center in your own facilities and you're moving to the cloud. And that's great. The one thing I've noticed the most is people just say, I have this cage in locations A and B and I want to put all that in AWS. And most of the time they're like, yeah, it's active, passive backup. And really it's like they have two data centers and they treat them very differently. They can never switch from one to the other very easily. But they'd want to be able to do that in the cloud. And you just you end up buying an awful lot more than you can choose.
Starting point is 00:08:13 What I've seen for the most part, regardless of which way you are coming from and then to. But I've seen very few AWS to like Googletype migrations between cloud providers, aside from going from Rackspace to AWS. Those kinds of things are pretty common, but also a lot easier than you would think. What I've seen emerge as a recurring theme with a cloud migration is they'll start the process with the idea of we're going to take everything from our on-prem data center and shove it into, let's say, AWS. And it goes well until they encounter workloads that weren't in the initial pilot that it turns out are really hard to move. Your Amazon account rep gets very difficult to work with when one of your requirements is to shove your SAN into US East
Starting point is 00:09:06 One. I'm not saying they won't let you do it, but it costs more than you're probably willing to pay. So at that point, it's, ooh, that one workload is either going to need to be significantly refactored. But what often happens is the customer says, oh, we're not going to move that one workload. We're going to leave that one in our data center. Now we're going to call our environment hybrid, plant a flag and declare victory. That's more or less where it dies in some cases. Is that a pattern that you've seen or are companies starting to get over that hump and start moving their mainframes into the cloud? So I think you're right. Like for the time being, you're going to see that kind of hybrid workflow of we have this thing and, you know, data center X and it's not moving anywhere until we greatly refactor it.
Starting point is 00:09:55 I've seen that and I've heard of that happening. And that's some of our potential clients that we're in the work of talking to at SJ are in those scenarios. But what I've also seen is some people have made some assumptions about the cloud is definitely going to save us money because we don't have to run our own infrastructure and we're just going to let them shift everything and call it a day without any kind of real planning or refactoring before they decide to make that actual leap and start moving workloads. The fact that this even occurs is fascinating to me. And I'm sure we're going to dive more into that. But the sense of hybrid and victory is claimed is very much a real world thing. And I think that's systemic of how fragmented a lot of the
Starting point is 00:10:39 IT organizations are, some of these organizations. One term that keeps coming up in these conversations that I think means something slightly different to various people who hear it is lift and shift. What does that term mean to you? Because I can think of at least two ways that plays out, and that's just because I haven't talked about it with too many people yet. So people make this assumption that lift and shift is a legitimate first step towards moving to the cloud. AWS will tell you, yes, go ahead, lift and shift is a legitimate first step towards moving to the cloud. AWS will tell you, yes, go ahead, lift and shift everything. Take server A in your data center and make it server A in AWS.
Starting point is 00:11:13 You'll be fine. But what they don't tell you is that in VMware, you can allocate 16 gigs of RAM and 48 CPUs. Well, now what is that going to be in AWS, right? Like there's not a great correlation because, you know, VMware is very flexible. AWS has these instances that are all various shape sizes and for different workloads. So you have to know a lot about what your applications are before you even get started. And a lot of people, when it comes to lift and shift, they don't care. They're just like, try to make it close and go, right?
Starting point is 00:11:42 And you don't make the assessment of how do I optimize for reserved instances or anything else. So you just take all your servers and VMs and everything else and you just say, create them in AWS, go. And they gladly slurp in all your VMware instances. You can create a mapping of this size thing to that size thing and off you go. But I mean, it's a good strategy to just get there, right? Like one of my previous employers, we had a contract renewal with the data center coming and we didn't want to renew. We could go month to month at a very exorbitant cost. And we really, really pushed very hard, very quickly to move a decade's worth of presence in a colo into AWS. And it got very
Starting point is 00:12:20 expensive very quick because we weren't sitting there doing optimization of resources as we migrated. We weren't sitting there looking at costs. We were more concerned about meeting the deadline and less concerned about money. And it got, you know, at the rate we were going, it was going to be like twice as much to move to AWS unless we optimize as opposed to staying in the actual data center. So you need to look at density just like you would in a data center in your cloud environment as well, which a lot of people don't realize when you're lifting and shifting. I understand where you're coming from. I mean, my day job is as a consultant, where the one problem I fix is optimizing AWS bills for the business side of organizations.
Starting point is 00:12:59 And what I often see is what you describe, and everyone's upset that the migration is costing more than they thought it would. But the other side of that coin is if you go ahead and refactor your applications during the migration to take advantage of cloud primitives, to make them quote unquote cloud native, to embrace auto scaling groups, to be able to flex and scale out as workload conditions change, what you'll very often see is indeterminate errors. And it's not clear initially whether it's with the platform, whether it's with the application, you wind up with a bunch of finger pointing. So the successful path that I've seen play out multiple times has been to do a lift and
Starting point is 00:13:42 shift first. In other words, you take everything exactly as it is and shove it into a cloud provider. And yes, it runs on money in that context, but that's okay. The second phase then becomes refactor things, start addressing first off the idea of scaling things in reasonably when they're not in use or going down the reserved instance path, or even refactoring applications to take advantage of, for example, serverless primitives, things that you don't generally have in an on-prem environment. That historically, in my experience, has been an approach that works reasonably well. Is that something that you've seen as well, or are you more of an advocate for doing the
Starting point is 00:14:20 transformation in flight? I don't recommend doing the actual refactoring as you're migrating. What, what ends up usually happening is people end up running into these deadlines and they can't, you know, move to AWS without refactoring and they can't refactor without moving to AWS. And it's just, it becomes this, you know, circular kind of finger pointing, like you mentioned. And what I recommend is, you know, build your application on AWS or lift and shift it to AWS and see how that goes, right? Like, you know, have your prototype kind of thing going. And if you can easily move it without refactoring it and you feel sufficient that you can determine, you know, like if you know your applications well enough, you know what size and statistic use,
Starting point is 00:15:01 you know, what monitoring you're going to need to have in place, you know, all those basic things where this, you know, if I see this log entry, something's gone wrong and act accordingly. Chances are, if you know all these things, your migration is going to be very simple. But if you have very little knowledge of how your applications work, especially how they work together, you're going to have a really hard time of things. And making the move is, you know, one thing, doing the actual refactoring, it becomes another thing altogether. And, you know, it's, you have to choose one or the other, I feel like. I don't feel like you can say, okay, we're going to lift and shift very slowly. We're going to refactor in flight, or we're going to build a whole new stack in one side and have our other stack running, then spin up all of our applications in AWS, have the data center as a
Starting point is 00:15:52 backup for a little while and shift traffic over. That works, but that's as much lift and shift as it is refactoring in my mind. The premise I think people need to realize is that you're literally changing how you do everything in your IT department when you go to the cloud. You're not dealing with your CapEx anymore. You're dealing with somebody else's CapEx, and it's become your OpEx. So you have to optimize for that, and you definitely need to make sure you understand how your applications work as you shift them over. It seems that a lot of that is sometimes bounded by lack of alignment on the part of the company that's doing the migration. Many moons ago, I had a client that was adamant about moving from AWS into a physical data center.
Starting point is 00:16:41 And they were planning, and there were regulatory reasons for this at the time that no longer hold true. I'm not sure I would suggest that they would do this today, but this was years ago. Sure. Yeah. And one of the big obstacles they had is they're sitting there trying to figure out the best way to replicate SQS, simple queuing service, in their data center. And that was a big problem. And they had engineers looking at it up one side and down the other. And the insight that I had at that time was, well, let's check the bill. Okay, you're spending $60 a month on SQS and no regulated data is passing through that thing. Why don't we just ignore that for right now? And down the road, once you have everything else around it migrated over, then you can come back and take a look at this. And down the road, once you have everything else around it migrated over,
Starting point is 00:17:31 then you can come back and take a look at this. And that seemed to be a better approach for what their constraints were at the time. But they were, I guess, too far in the rabbit hole of, we have a mandate to move everything out of AWS and that's it, full stop, without really understanding the business drivers behind it. So it's a communications problem and a problem of alignment in many cases. How do you tackle that? So at a previous organization, we had a lot of on-prem data center, and we had very much cost controls over that. And we were paying very little for bandwidth. We were paying very little for hardware.
Starting point is 00:18:03 We really optimized the on-prem scenario, but we had workloads that were not quite serverless ready because they were too long running, but were very similar in nature of just need some CPU, need to run some Java and bounce some messages between SQS queues
Starting point is 00:18:20 and off you go, cleaning up S3 buckets. So, I mean, aside from the long running part, it was perfect for serverless. And we would totally use AWS for that stuff. And we would use AWS for storage in a lot of cases because it made sense.
Starting point is 00:18:33 The premise of, you know, a lack of understanding of business practices and how it applies is something I see often, right? Like legal departments say, oh my gosh, there's this new regulatory thing
Starting point is 00:18:44 and now we got to do this. You just got to get out of cloud. That's the only way to see often, right? Like legal departments say, oh my gosh, there's this new regulatory thing. And now we got to do this. You just got to get out of cloud. That's the only way to be safe, right? Like legal always tries to play it as safe as humanly possible. They don't want to, you know, they want a black or a white. They don't want a gray.
Starting point is 00:18:56 So to address that, you have to do a lot like what you said was, you know, you have these regulatory needs. We completely understand those. You know, as long as you say, these are buckets of regulated data or services or whatever, being able to control that's very easy. Otherwise, you kind of have to go back to the stakeholders and say, listen, for you to just sit there and say, we're moving from AWS to on-prem or on-prem to AWS, flip a big switch and you're just going to move everything over. You kind of have to understand what the impact of all that is. And you have to plan accordingly. That's the biggest thing in all these migrations is people just jump without looking, I feel like a lot. So the planning phases are super important and you have to understand that
Starting point is 00:19:39 there's going to be like those Don Rumsfeld unknown unknowns that are out there for sure. And you have to be ready for those and sit there and say like, well, what is this going to impact by moving this workload or that workload? You have to know where your boundaries are as far as legal is concerned, as far as contractual obligations are concerned, way before you get started. And you have to realize that if you say it's going to take six months, it'll probably take 18, right? It depends on how long and how old some of your processes are. Right. And then you wind up with Mythical Man-Month Territory Things, which is a great book.
Starting point is 00:20:17 And if you haven't read it, let me know and I'll send you three copies so you can read it faster. Yes, because I am a multi-threaded book reader. I have two eyes. Why can't I read the same book twice differently? Exactly. It's all a question of perspective. No. And increasingly, we're seeing a lot of focus on migrating from on-prem to cloud. We're seeing focus on migrating between different providers. But at this point, it feels like even in larger, shall we say, more traditional blue chip enterprises, there's no longer the sense of skepticism and humor around the idea of moving to a third party cloud provider. It instead becomes a question of not, instead of why cloud, it becomes why not. And as the cloud's ability to sustain regulated workloads continues to grow, And as you see various conversations with different stakeholders who bring up points that are increasingly being knocked down by various feature enhancements and improvements to these providers, it becomes a very real thing.
Starting point is 00:21:18 And there are benefits directly back to the business that are very clear. Even if, for example, a company still wants to treat everything as capital expenditure rather than OPEX, there are ways to do that. If your accountant and auditors sign off on it, you can classify portions of your cloud spend as CAPEX. That is something that's not commonly done, but it can be. You also start to see smoothing of various spend points. For example, with storage, as you store more data, the increase in what it costs to do that remains linear. You don't have these almost step function type graphs where we added one more terabyte and now it's time to run out
Starting point is 00:21:58 and buy a new shelf for the NetApp. It just continues to grow in a very reasonably well-understood way based upon what you're storing. And in time, you start aging data out or transitioning into different storage tiers. And the economics generally continue to make sense until you're into the ludicrous scale point of data storage.
Starting point is 00:22:20 Yes, past a certain point when you're in a position where your budget is running in the hundreds of millions for storage, yes, there are a number of options that compliance that you get, quote unquote, for free with the large scale cloud providers who have to do this for everyone. So where do you stand in the perspective of should a company undertake a migration project as they're looking around their decaying physical data center? Was that a rat we just saw? Good Lord, these fans are loud. It's always the same fluorescent lights. And I think I'm going slowly deaf. How do you, how do you wind up approaching the, should we migrate conversation? So I think the, I like to use an example. The, the one thing that I remember from my first job after getting out of the air force was there was always this metric, right? Like a minute of downtime was worth like 50 grand or something like that. And, you know, when you
Starting point is 00:23:30 take that metric and you say, okay, fine. And then you're on call and all of a sudden your data center gets struck by lightning. And while it has the facilities to handle that, something, you know, inevitably goes wrong. And that data center is now 120 degrees because, you know, some HVAC switches didn't reset accordingly when the generator flipped over because that generator got struck by lightning. Then all of a sudden, after about five minutes, your stuff just stops responding because it's just too hot and starts shutting itself off. What then? You know, you have to go in and literally turn off everything and turn it all back on after the disaster recovery at the data center is completed so how many hours were lost because of one millisecond of event you know it took three hours for us to get everything back to normal
Starting point is 00:24:18 you know 50 grand a minute you do the math that's a lot of money so i look at it as like you're paying for an extra nine you know you're you're when you do these things you that's a lot of money. So I look at it as like you're paying for an extra nine. You know, when you do these things, you don't have to worry about sending somebody to the data center to start swapping out disks in your NetApp. You don't have to worry about, hey, what server chassis should we put this workload in? Where do we have the most capacity? You don't have to think about those things. You're kind of buying yourself an extra nine because inevitably, wherever you are, there will be downtime in that facility for one reason or another. Just look at like BGP. How many times have BGP mistakes occurred in the global internet and caused some kind of weird outage in some location? Do you really want to be a part of that? Or would you rather
Starting point is 00:25:00 to design your workloads to be more effective and resilient? And with some of the largest consumers of network stacks on the internet, the idea of co-locating your things with big, big, big companies that control wide swaths of the internet is highly, highly effective at keeping resilience in your systems. One thing you mentioned that I wanted to call a little bit of attention to, and I'm not talking specifically about the Air Force. Please don't feel you need to respond on their behalf. I'd prefer not to see you renditioned, although if it was, it would be extraordinary, I'm sure. Yes. But I'm curious as to the metric of a minute of
Starting point is 00:25:42 downtime costs us $50,000. I mean, that's an argument that's easy to make depending upon what your company does. But to give you an example, back a few years ago, I tried to buy something on Amazon. I think it was probably a pair of socks because that's the level of excitement that drives my life. And it threw a 500 error. And that was amazing. I'd never seen this from Amazon before. That's quite impressive. So I smiled, I laughed, I tried it again, same thing.
Starting point is 00:26:12 And of course, Twitter, or however long ago this was, maybe FARC, if that was the year of what social media looked like back then. So I shrugged, and I went and did something else. And an hour later, I went and I bought my socks. So the idea of did they lose any money from that outage? In my case, the answer was no. Because the decision point on my side was not, well, I guess I'm never going to buy socks again, the end. And now I've been down one pair ever since. Instead, it's I'll just do this later. Now, there is the counter argument that if one time out of three that I try to make any given purchase on Amazon, it didn't work, I'd probably
Starting point is 00:26:50 be going doing something really sad, like buying from Target instead. But when it's a one-off and it's rare and hasn't eroded customer confidence, there may not be the same level of economic impact that people think there is. As a counterpoint, if you're an ad network, every second you're down, you're not displaying something. No one is going to go back and read a news article a second time so they can get that display ad presented to them. So in that case, it's true. But I guess to that point, there is the question of what downtime really costs. Do you have anything to say on that? Yeah. So let's take your timeline or point in time of trying to buy these socks, right?
Starting point is 00:27:32 If FARC was the social network of choice, chances are those socks were only on Amazon. When you look at the landscape now and how it's changed, right? You don't get fail wheels on Twitter anymore for a reason. You can buy the same pair of socks on Amazon that you can buy on Target and vice versa. And guess what? They price match each other to an extent. If Amazon can't sell you that pair of socks, you're going to go to a different site and buy them there because those socks are going to be in more than one place if they're on Amazon. Now, with that being said, you bring up a good point. The company I was with where it was a minute of downtime cost 50 grand was very much ad-driven business. If you were buying something or you're consuming content and ad revenue is being generated that way, people will find another place on the internet nowadays to go
Starting point is 00:28:32 get what they were looking for. That's just the nature of things, right? Like there's no sole source of anything anymore. And you can't, you can't compete thinking like that nowadays. 10 years ago, you most definitely could, right? Like you were only going to get that thing from Amazon because there was no way you were going to go get it locally, let alone from some other website. And that's very fair. There's also the argument to be made in favor of cloud migrations from my perspective, where if you go back to a year or so ago when Amazon had their first notable S3 outage in the entire lifetime of the service. And it was unavailable for, I believe, six hours or something like that.
Starting point is 00:29:11 There was a knee-jerk reaction in the SRE DevOps space of, well, now we're going to replicate to another provider and we're going to go ahead and have multiple buckets in multiple regions. And these things spike the cost rather significantly to avoid a once every seven years style outage of a few hours. And when you look at how that outage was reported, notice I'm talking to you about the Amazon S3 outage. I'm not saying, oh, the American Airlines outage or the Instagram outage or Twitter for pets was down during this time. Because it became, today the internet is broken. And individual companies that were impacted by this weren't held to account in the same way
Starting point is 00:29:57 as if it had been just that one company with their own internal outage. Because frankly, I struggle to accept that you're going to be able to deliver a level of uptime and service availability that exceeds that of AWS. They have an incredibly large army of incredibly smart people working on these specific problems all day, every day.
Starting point is 00:30:19 But I feel like there's also some safety in being part of the herd, if you'll pardon the term. When US East 1 has a bad day, we all have a bad day. And it feels like there's safety in numbers. Is that valid? That's very valid, right? It's amazing to me to think about how the world has changed because of services like AWS and large-scale applications on the web like Twitter and Facebook and Google, people have a greater
Starting point is 00:30:46 understanding of backing services that drive these things. It's also surprising to me how many people rely on US East 1 exclusively. And that outage you were speaking of, I was directly impacted by it. I don't know anybody who wasn't, but we had a significant amount of data in US East 1 and we decided, you know what, US East 1 is kind of a dumpster fire from what we're hearing. Let's move it to US East 2. And literally, it was just a regex. You just do a search, fine, replace 1 with 2, move everything, done, off you go, cool. But the idea of, yes, there is safety in the herd, right, because everybody could sit there and be like, well, listen, we know, we bought this thing and it's down, we're sorry. I feel like
Starting point is 00:31:27 that's kind of a BS excuse, right? Like, you didn't engineer your application to be, you know, less single point of failure. If US East 1 goes down and you're calling something that uses US East 1 as its sole backing service for something, that's a really bad idea in my opinion, right? So a lot of the things that were down, like look at Atlassian. Not to pick on any one company, but a lot of their stuff was down. And I remember that very distinctly because I couldn't get to my Jira instance. I couldn't get to my documentation. I couldn't get to a lot of things
Starting point is 00:32:01 because they use somebody that utilized us east one rather heavily for things and they got bit by that that i think is completely unacceptable from like a business perspective you have to know like where your single points of failure are and at least be aware of them you don't necessarily have to address them because i do agree with you good luck getting better reliability than aw, you know, the past five years. But I do remember a time when S3 was not like as good as it was, just like any brand new Amazon service is never going to be the best it is at release time. The idea of people saying, you know, hey, we can just blame AWS for everything, I think is a very, very, very tenuous
Starting point is 00:32:44 situation to put yourself in. Because if your customers are going to sit there and rely on you for something, they don't care. If your SLA says 99.999% uptime and you don't deliver on that, that's your penalty. You can't pass it on to AWS. You think they're going to foot the bill for that? No. It's like the cable company at that point, right? Like, oh, we're sorry you don't have service. We're not going to give you a credit. Or if there is an SLA credit, it's minor compared to the impact it potentially had on your business as well. And again, this is not to bag on Amazon specifically. The fact that we can talk about the single issue and everyone knows what I'm talking about is testament to how rock solid,
Starting point is 00:33:23 at least some of their services have become. Right. Like I remember reading the BuzzFeed newsletter the day after that outage, and it talked very intimately about US East One. I mean, we're talking, this is BuzzFeed here, all right? This is the people that made listicles a thing, right? But they know what US East One is. It's amazing to me. Yes. Service number seven will blow your feet off. Yes. Well, thank you very much for joining me today, Chris. Are there any parting comments, observations, or things you'd like to show before we call it an episode? I think the biggest thing for me is, well, the two biggest things, right?
Starting point is 00:33:57 Like a DevOps journey is never finished, right? You're never done migrating. You're never done doing DevOps. You're always doing DevOps, right? That's thing one. Thing two is something I've realized maybe this year is as technologists like you, myself, and probably a lot of your listeners are, we have to be more embracing of changes in our own work, life, everythings so that we can help our organizations change. When we make change simple for ourselves, for everything, I'm talking
Starting point is 00:34:25 like the way you drive to work, the way you log into your systems, the shell you're using for that matter. When you can make change seamless and not painful for yourselves, it exudes a sense of confidence when you're trying to make larger changes throughout the organization. We have to get better as technologists in making changes and helping people embrace change. Very well put. Thank you once again for joining me here on Screaming in the Cloud. This has been Chris Short of DevOps-ish, and I'm Corey Quinn. I'll talk to you next week. This has been this week's episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com or wherever fine snark is sold.
