Screaming in the Cloud - Episode 2: Shoving a SAN into us-east-1
Episode Date: March 21, 2018

When companies migrate to the cloud, they are literally changing how they do everything in their IT department. If lots of customers rely exclusively on a service, like us-east-1, then they are directly impacted by its outages. There is safety in the herd and in numbers, because everybody is down together. But that's no excuse for not engineering your application to avoid single points of failure. Relying on a sole backing service for something is a bad idea, and it's unacceptable from a business perspective. Today, we're talking to Chris Short from the cloud and DevOps space. Recently, he was recognized for his DevOps'ish newsletter and won the Opensource.com People's Choice Award for his DevOps writing. He's been blogging for years and writing about things that he does every day, such as tutorials, code, and methods. Chris, along with Jason Hibbets, runs the DevOps team for Opensource.com.

Some of the highlights of the show include:

Chris' writing makes difficult topics understandable. He is frank and provides broad information, and he admits when he is not sure about something.
SJ Technologies aims to help companies embrace a DevOps philosophy while adapting their operations to a cloud-native world. Companies want to take advantage of the philosophies and tooling around being cloud native.
Many companies consider a cloud migration because they've got data centers across the globe. Often it's an active-passive setup with two data centers that are treated differently and can't easily switch from one to the other.
Some companies do a cloud migration to refactor and save money. A cloud migration can also mean trying to shove your SAN into us-east-1, and it can end up as a hybrid workflow.
Lift and shift is often considered the first legitimate step toward moving to the cloud. However, know as much as you can about your applications and their RAM and CPU allocations.
Look at density when you're lifting and shifting. Know how your applications work and work together.
Simplify a migration by knowing what size of instances to use and what monitoring to have in place.
Some people do not support being on the cloud due to a lack of understanding of business practices and how they apply. But most are no longer skeptical about moving to the cloud. Instead of "why cloud," the question becomes "why not."
Don't jump without looking. Planning phases are important, but there will be unknowns that you will have to face.
Downtime does cost money. Customers will go to other sites; they can find what they want and need somewhere else. There's no longer a sole source of anything.
The DevOps journey is never finished, and you're never done migrating. Embrace change yourself to help organizations change.

Links:

Chris Short on Twitter
DevOps'ish
SJ Technologies
Amazon Web Services
Cloud Native Infrastructure
Oracle
OpenShift
Puppet
Kubernetes
Simon Wardley
Rackspace
The Mythical Man-Month
Atlassian
BuzzFeed

Quotes by Chris:

"Let's not say that they're going whole hog cloud native or whole hog cloud for that matter, but they wanna utilize some things."
"They can never switch from one to the other very easily, but they want to be able to do that in the cloud, and you end up biting off a lot more than you can chew."
"Create them in AWS. Go. They gladly slurp in all your VMware instances. You can create a mapping of this sized thing to that sized thing and off you go. But it's a good strategy to just get there."
"We have to get better as technologists in making changes and helping people embrace change."
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This episode of Screaming in the Cloud is sponsored by my friends at
GorillaStack. GorillaStack's a unique automation solution for cloud cost optimization, which of
course is something near and dear to my heart. By day, I'm a consultant who fixes exactly one
problem, which is the horrifying AWS bill. Every organization eventually hits a point where they start to really, really
care about their cloud spend, either in terms of caring about the actual dollars and cents that
they're spending, or in understanding what teams or projects are costing money and starting to
build predictive analytics around that. And it turns out that early on in my consulting work,
I spent an awful lot of time talking with some of
my clients about a capability that GorillaStack has already built. There's a laundry list of
analytics offerings in this space that tell you what you're spending and where it goes,
and then they stop. Or worse, they slap a beta label on that side of it for remediation and
then say that they're not responsible for anything and everything that their system winds up doing. So some folks try to go in the direction of writing their own code to do things such as spinning down developer environments
out of hours, bolting together a bunch of different services to handle snapshot aging,
having a custom Slack bot that you build that alerts you when your budget's
hitting a red line. And this is all generic stuff. It's the undifferentiated heavy lifting that's not
terribly specific to your own environment. So why build it when you can buy it?
GorillaStack does all of this. Think of it more or less like if this, then that, IFTTT for AWS. It can manage resources, it can alert folks when things
are about to turn off, it keeps people apprised of what's going on, more or less the works. Go
check them out. They're at GorillaStack.com spelled exactly like it sounds. Gorilla like the animal,
stack as in a pile of things. Use the discount code screaming for 15% off the first year.
Thanks again for your support, GorillaStack. Appreciate it.
Welcome to Screaming in the Cloud. I'm Corey Quinn, and I am joined today by Chris Short,
who has been doing a number of interesting things in the cloud and DevOps space,
but most recently has been recognized for his newsletter, DevOps-ish.
He also won the 2018 Opensource.com People's Choice Award for his writing in the DevOps space.
Chris, thanks for joining me today.
Can you tell us a little bit about your writing and what it is that brought you to the people's attention?
Thanks for having me, Corey.
The writing, I've been kind of blogging on and off for years, and it hit me maybe a year or so ago. I also help run the DevOps team at Opensource.com with Jason Hibbets, and don't balk at the DevOps team moniker. It's a small team handling talks and DevOps-type things going forward, which is interesting because Jason is more of a community manager and writer and I'm more of a technical, hands-on person.
So we have an interesting dynamic as far as like lessons learned and then how they apply not just to technical fields, but also other ones.
Right. And I've been a fan of your newsletter for a little while, almost since before I launched
Last Week in AWS. I tended to be very focused and very snarky on one area. You tended to be
a lot broader and frankly, a lot kinder to the things that you write about as a general rule.
Wow. So what I appreciate is your ability to dissemble intelligently about such a wide variety of
different topics and also admit when you're unclear on something, when you're not entirely
sure on how something fits in.
I've seen you call that out on a number of occasions.
And to my mind, that's always been the mark of mastery, where you can take a look at something
and understand that, oh, there's something I'm
not seeing, so I'm going to call it out rather than hand-waving my way past it and faking it.
So just always been a fan of that type of approach. I appreciate that. And don't get me wrong,
Last Week in AWS gives me as many laughs as it does knowledge about what to do and what not to do in AWS. So I appreciate that very much. So most recently, you've been working at SJ
Technologies with John Willis. Yes. And you've been focusing on helping companies embrace DevOps
philosophy while adapting their operations to a cloud native world, as you put it. What does that
look like on a, I guess, a day to day basis for you? So let's look at it as two things, the cloud
native world, you know, cloud native means a lot of things to a lot of people. Friends of mine, Justin Garrison and Kris Nova, wrote a great book, Cloud Native Infrastructure. I highly recommend it. Go read it.
But a cloud native world does not necessarily mean that these organizations are like moving to AWS or
Google Cloud or Azure. It just means that they want to take advantage of some of the
philosophies and tooling around being cloud native.
Now, that's not to say they're going whole hog cloud native or whole hog cloud for that matter, but they want to utilize some things.
So one of the clients I'm working with is actually using an Oracle tool and OpenShift on bare metal in their own data center.
But they want to really lean forward and embrace moving
faster. And one of the key takeaways for them is they want to be able to develop their mobile app
and make it better, faster. So it all comes back to consumer outcome focus type things,
just like the DevOps work we do, where we worry about what the outcome is for
various steps in the software development lifecycle. And coming back to culture and as well
as, you know, some tooling here and there, we're not so much focused on tooling, but more so on
process. If you want to use Puppet for configuration management and Ansible for, you know, some kind
of deployment pipeline and Kubernetes for something else,
all within your one little world, fine.
We're more concerned about the process itself,
creating that left to right flow,
shifting the security things closer to the left
so you detect them earlier,
as well as adapting some of the value stream mapping
type processes that you see in Lean
and other disciplines
in the DevOps world.
I hear echoes of Simon Wardley's mapping
starting to creep into the conversation there.
I'm detecting a recurring theme in conversations I have
around this sort of thing.
But rather than going into those particular weeds today,
something that you've talked about on and off for a while
is the concept of doing migrations.
When you think about doing a migration, a cloud migration specifically, in your mind,
is that generally coming from on-prem?
Is it coming from a different cloud provider into something else?
Or are you viewing it as something else altogether?
You know, I've seen it kind of go two ways.
You've got all these data centers across the globe where you're doing colo, you have your own data center in your own facilities and you're moving to the cloud.
And that's great.
The one thing I've noticed the most is people just say, I have this cage in locations A and B and I want to put all that in AWS.
And most of the time they're like, yeah, it's active-passive backup.
And really it's like they have two data centers and they treat them very differently.
They can never switch from one to the other very easily.
But they'd want to be able to do that in the cloud.
And you just end up biting off an awful lot more than you can chew.
That's what I've seen for the most part, regardless of which way you're coming from and going to.
But I've seen very few AWS-to-Google-type migrations between cloud providers, aside from going from Rackspace
to AWS.
Those kinds of things are pretty common, but also a lot easier than you would think.
What I've seen emerge as a recurring theme with a cloud migration is they'll start the
process with the idea of we're going to take everything from our on-prem data center and shove it into, let's say, AWS.
And it goes well until they encounter workloads that weren't in the initial pilot that it turns out are really hard to move.
Your Amazon account rep gets very difficult to work with when one of your requirements is to shove your SAN into US East
One. I'm not saying they won't let you do it, but it costs more than you're probably willing to pay.
So at that point, it's, ooh, that one workload is either going to need to be significantly
refactored. But what often happens is the customer says, oh, we're not going to move that one workload. We're going
to leave that one in our data center. Now we're going to call our environment hybrid, plant a
flag and declare victory. That's more or less where it dies in some cases. Is that a pattern
that you've seen or are companies starting to get over that hump and start moving their mainframes
into the cloud? So I think you're right.
Like for the time being, you're going to see that kind of hybrid workflow of we have this thing in, you know, data center X and it's not moving anywhere until we greatly refactor it.
I've seen that and I've heard of that happening.
And some of the potential clients that we're in the process of talking to at SJ are in those scenarios. But what I've also seen is some people have made the assumption that the cloud is definitely
going to save us money because we don't have to run our own infrastructure, and we're just
going to lift and shift everything and call it a day without any kind of real planning
or refactoring before they decide to make that actual leap and start moving workloads.
The fact that this even occurs is fascinating to me.
And I'm sure we're going to dive more into that. But the sense of hybrid and victory is claimed
is very much a real-world thing. And I think that's symptomatic of how fragmented the
IT organizations are at some of these companies. One term that keeps coming up in these conversations that I
think means something slightly different to various people who hear it is lift and shift.
What does that term mean to you? Because I can think of at least two ways that plays out,
and that's just because I haven't talked about it with too many people yet.
So people make this assumption that lift and shift is a legitimate first step towards moving
to the cloud.
AWS will tell you, yes, go ahead, lift and shift everything.
Take server A in your data center and make it server A in AWS.
You'll be fine.
But what they don't tell you is that in VMware, you can allocate 16 gigs of RAM and 48 CPUs.
Well, now what is that going to be in AWS, right?
Like there's not a great correlation because, you know, VMware is very flexible.
AWS has these instances that are all various shapes and sizes and for different workloads.
So you have to know a lot about what your applications are before you even get started.
And a lot of people, when it comes to lift and shift, they don't care.
They're just like, try to make it close and go, right?
And you don't make the assessment of how do I optimize for reserved instances or anything else.
So you just take all your servers and VMs and everything else and you just say, create them in AWS, go.
And they gladly slurp in all your VMware instances.
You can create a mapping of this size thing to that size thing and off you go.
But I mean, it's a good strategy to just get there, right? Like one of
my previous employers, we had a contract renewal with the data center coming and we didn't want to
renew. We could go month to month at a very exorbitant cost. And we really, really pushed
very hard, very quickly to move a decade's worth of presence in a colo into AWS. And it got very
expensive very quick because we weren't sitting there doing optimization of resources as we migrated.
We weren't sitting there looking at costs.
We were more concerned about meeting the deadline and less concerned about money.
And it got, you know, at the rate we were going, it was going to be like twice as much to move to AWS unless we optimize as opposed to staying in the actual data center.
So you need to look at density just like you would in a data center in your cloud
environment as well, which a lot of people don't realize when you're lifting and shifting.
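As a rough illustration of the size mapping Chris describes, the simplest version is to pick the smallest instance type that covers a VM's CPU and memory. This is a minimal sketch in Python; the candidate list and specs are a small, assumed subset used only for the example, not a complete or current catalogue, and a real mapping would also weigh cost, workload type, and reserved instance plans.

```python
# Pick the smallest EC2 instance type that covers a VM's vCPU and memory needs.
# The candidate list below is a tiny, illustrative subset of instance types.
INSTANCE_TYPES = [
    # (name, vCPUs, memory in GiB)
    ("m5.xlarge", 4, 16),
    ("m5.2xlarge", 8, 32),
    ("m5.4xlarge", 16, 64),
    ("m5.12xlarge", 48, 192),
]

def closest_instance(vcpus: int, memory_gib: int) -> str:
    """Return the first candidate that satisfies both dimensions."""
    for name, cpu, mem in INSTANCE_TYPES:
        if cpu >= vcpus and mem >= memory_gib:
            return name
    raise ValueError("No candidate instance type is large enough")

# The 48-CPU, 16 GiB VMware allocation mentioned above ends up heavily
# oversized on memory, because EC2 couples CPU and RAM into fixed shapes.
print(closest_instance(48, 16))  # m5.12xlarge
```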
I understand where you're coming from. I mean, my day job is as a consultant,
where the one problem I fix is optimizing AWS bills for the business side of organizations.
And what I often see is what you describe, and everyone's upset that the migration is costing
more than they thought it would. But the other side of that coin is if you go ahead and refactor
your applications during the migration to take advantage of cloud primitives, to make them
quote unquote cloud native, to embrace auto scaling groups, to be able to flex and scale out as workload conditions change, what you'll very
often see is indeterminate errors.
And it's not clear initially whether it's with the platform, whether it's with the application,
you wind up with a bunch of finger pointing.
So the successful path that I've seen play out multiple times has been to do a lift and
shift first.
In other words, you take everything exactly as it is and shove it into a cloud provider. And yes, it runs on money in that
context, but that's okay. The second phase then becomes refactor things, start addressing first
off the idea of scaling things in reasonably when they're not in use or going down the reserved
instance path, or even refactoring applications to take advantage
of, for example, serverless primitives, things that you don't generally have in an on-prem
environment. That historically, in my experience, has been an approach that works reasonably well.
Is that something that you've seen as well, or are you more of an advocate for doing the
transformation in flight? I don't recommend doing the actual refactoring as you're migrating.
What, what ends up usually happening is people end up running into these deadlines and they can't,
you know, move to AWS without refactoring and they can't refactor without moving to AWS.
And it's just, it becomes this, you know, circular kind of finger pointing, like you mentioned. And
what I recommend is, you know, build your application on AWS or lift and shift
it to AWS and see how that goes, right? Like, you know, have your prototype kind of thing going.
And if you can easily move it without refactoring it and you feel confident that you can determine,
you know, like if you know your applications well enough, you know what size of instances to use,
you know, what monitoring you're going to need to have in place,
you know, all those basic things where, you know, if I see this log entry, something's gone wrong and I act accordingly. Chances are, if you know all these things, your migration is going
to be very simple. But if you have very little knowledge of how your applications work, especially
how they work together, you're going to have a really hard time of things. And making the move is, you know, one thing, doing the actual refactoring,
it becomes another thing altogether. And, you know, it's, you have to choose one or the other,
I feel like. I don't feel like you can say, okay, we're going to lift and shift very slowly. We're
going to refactor in flight, or we're going to build a whole new stack in one side and have our
other stack running, then spin up all of our applications in AWS, have the data center as a
backup for a little while and shift traffic over. That works, but that's as much lift and shift as
it is refactoring in my mind. The premise I think people need to realize is that you're literally
changing how you do everything in your IT department when you go to the cloud.
You're not dealing with your CapEx anymore.
You're dealing with somebody else's CapEx, and it's become your OpEx.
So you have to optimize for that, and you definitely need to make sure you understand how your applications work as you shift them over.
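One concrete flavor of the "scale things in when they're not in use" idea Corey raised earlier is a scheduled action on an Auto Scaling group. A minimal sketch with boto3, where the group name and the schedules are made-up values for illustration and credentials and region are assumed to be configured already:

```python
import boto3

# Scale a (hypothetical) development Auto Scaling group to zero every weekday
# evening and back up every weekday morning. Recurrence is cron syntax in UTC.
autoscaling = boto3.client("autoscaling")

autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="dev-environment",       # assumed group name
    ScheduledActionName="scale-in-after-hours",
    Recurrence="0 23 * * 1-5",
    MinSize=0,
    MaxSize=0,
    DesiredCapacity=0,
)

autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="dev-environment",
    ScheduledActionName="scale-out-work-hours",
    Recurrence="0 12 * * 1-5",
    MinSize=2,
    MaxSize=4,
    DesiredCapacity=2,
)
```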
It seems that a lot of that is sometimes bounded by lack of alignment on the part of the company that's doing the migration.
Many moons ago, I had a client that was adamant about moving from AWS into a physical data center.
And they were planning, and there were regulatory reasons for this at the time that no longer hold true. I'm not sure I would suggest that they would do this today,
but this was years ago. Sure. Yeah.
And one of the big obstacles they had is they're sitting there trying to figure out the best way
to replicate SQS, Simple Queue Service, in their data center. And that was a big problem.
And they had engineers looking at it up one side and down the other. And the insight that I had at that time was, well, let's check the bill.
Okay, you're spending $60 a month on SQS and no regulated data is passing through that thing.
Why don't we just ignore that for right now? And down the road, once you have everything else
around it migrated over, then you can come back and take a look at this. And that seemed to be a better approach for what their constraints were at the time. But they were, I guess, too far in the rabbit hole of,
we have a mandate to move everything out of AWS and that's it, full stop, without really
understanding the business drivers behind it. So it's a communications problem and a problem of alignment in many cases.
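The "let's check the bill" step Corey describes is a single Cost Explorer call. A minimal sketch, assuming Cost Explorer is enabled on the account and that the service's billing name matches the filter string in your own data:

```python
import boto3

# Ask Cost Explorer what a single service actually cost for one month,
# before anyone burns engineering time replicating it elsewhere.
ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2018-02-01", "End": "2018-03-01"},  # example month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={
        "Dimensions": {
            "Key": "SERVICE",
            # Assumed billing name for SQS; check your account's own dimensions.
            "Values": ["Amazon Simple Queue Service"],
        }
    },
)

for period in response["ResultsByTime"]:
    amount = float(period["Total"]["UnblendedCost"]["Amount"])
    print(f"SQS spend for {period['TimePeriod']['Start']}: ${amount:.2f}")
```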
How do you tackle that?
So at a previous organization, we had a lot of on-prem data center space, and we had a lot of
cost controls over that.
And we were paying very little for bandwidth.
We were paying very little for hardware.
We really optimized the on-prem
scenario, but we had workloads
that were not quite serverless
ready because they were too long running,
but were very similar in nature of
just need some CPU, need to run
some Java and bounce
some messages between SQS queues
and off you go, cleaning up S3 buckets.
So, I mean, aside from the long
running part, it was perfect for serverless.
And we would totally use AWS
for that stuff.
And we would use AWS
for storage in a lot of cases
because it made sense.
The premise of, you know,
a lack of understanding
of business practices
and how it applies
is something I see often, right?
Like legal departments say,
oh my gosh,
there's this new regulatory thing,
and now we got to do this.
You just got to get out of cloud.
That's the only way to be safe, right?
Like legal always tries to play it as safe as humanly possible.
They don't want to, you know,
they want a black or a white.
They don't want a gray.
So to address that, you have to do a lot
like what you said was, you know,
you have these regulatory needs.
We completely understand those.
You know, as long as you say, these are buckets of regulated data or services or whatever, being able to control that's very easy. Otherwise, you kind of have to go back to the stakeholders and say, listen, for you to just sit there and say, we're moving from AWS to on-prem or on-prem to AWS, flip a big switch and you're just going to move everything
over. You kind of have to understand what the impact of all that is. And you have to plan
accordingly. That's the biggest thing in all these migrations is people just jump without looking,
I feel like a lot. So the planning phases are super important and you have to understand that
there's going to be like those Don Rumsfeld unknown unknowns that are out there for sure.
And you have to be ready for those and sit there and say like, well, what is this going to impact
by moving this workload or that workload? You have to know where your boundaries are as far
as legal is concerned, as far as contractual obligations are concerned, way before you get
started. And you have to realize that if you say it's going to take six months, it'll probably take
18, right?
It depends on how long and how old some of your processes are.
Right. And then you wind up in Mythical Man-Month territory, which is a great book.
And if you haven't read it, let me know and I'll send you three copies so you can read it faster.
Yes, because I am a multi-threaded book reader.
I have two eyes. Why can't I read the same book twice differently?
Exactly. It's all a question of perspective. No. And increasingly, we're seeing a lot of focus on migrating from on-prem to cloud. We're seeing focus on migrating between different providers.
But at this point, it feels like even in larger, shall we say, more traditional blue chip enterprises, there's no longer the sense
of skepticism and humor around the idea of moving to a third-party cloud provider. Instead of "why cloud," the question becomes "why not." And as the cloud's ability to
sustain regulated workloads continues to grow, and as conversations with different stakeholders bring up points that are increasingly being knocked down by feature enhancements and improvements to these providers, it becomes a very real thing.
And there are benefits directly back to the business that are very clear. Even if, for example, a company still wants to treat
everything as capital expenditure rather than OpEx, there are ways to do that. If your accountant and
auditors sign off on it, you can classify portions of your cloud spend as CapEx. That is something
that's not commonly done, but it can be. You also start to see smoothing of various spend points. For example, with storage, as you store more data,
the increase in what it costs to do that remains linear.
You don't have these almost step function type graphs
where we added one more terabyte
and now it's time to run out
and buy a new shelf for the NetApp.
It just continues to grow
in a very reasonably well-understood way
based upon what you're
storing.
And in time, you start aging data out or transitioning into different storage tiers.
And the economics generally continue to make sense until you're into the ludicrous scale
point of data storage.
Yes, past a certain point, when you're in a position where your budget is running in the hundreds of millions for storage, yes, there are a number of other options. But there's also the compliance that you get, quote unquote, for free with the large-scale cloud providers who have to
do this for everyone. So where do you stand in the perspective of should a company undertake
a migration project as they're looking around their decaying physical data center? Was that a
rat we just saw? Good Lord, these fans are loud. It's
always the same fluorescent lights. And I think I'm going slowly deaf. How do you wind up approaching the "should we migrate" conversation? So I like to use
an example. The one thing that I remember from my first job after getting out of the Air Force was there was always this metric, right? Like a minute of downtime was worth like 50 grand or something like that. And, you know, when you
take that metric and you say, okay, fine. And then you're on call and all of a sudden your data center
gets struck by lightning. And while it has the facilities to handle that, something, you know,
inevitably goes wrong. And that data center is now 120 degrees because, you know, some HVAC switches didn't reset accordingly when the generator flipped over because that generator got struck by lightning.
Then all of a sudden, after about five minutes, your stuff just stops responding because it's just too hot and starts shutting itself off.
What then?
You know, you have to go in and literally turn off everything and turn it all back on
after the disaster recovery at the data center is completed. So how many hours were lost because of
one millisecond of an event? You know, it took three hours for us to get everything back to normal.
You know, 50 grand a minute, you do the math, that's a lot of money. So I look at it as you're paying for an extra nine.
You know, when you do these things, you don't have to worry about sending somebody to the data center to start swapping out disks in your NetApp. You don't have to worry about, hey, what server
chassis should we put this workload in? Where do we have the most capacity? You don't have to think
about those things. You're kind of buying yourself an extra nine because inevitably, wherever you are,
there will be downtime in that facility for one reason or another. Just look at like BGP.
How many times have BGP mistakes occurred in the global internet and caused some kind of weird
outage in some location? Do you really want to be a part of that? Or would you rather
design your workloads to be more effective and resilient?
And with some of the largest consumers of network stacks on the internet, the idea of
co-locating your things with big, big, big companies that control wide swaths of the
internet is highly, highly effective at keeping resilience in your systems.
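The downtime math Chris gestures at is worth spelling out. A quick back-of-the-envelope calculation, using the $50,000-per-minute figure and the three-hour recovery from his story:

```python
# Rough cost of the outage described above: a three-hour recovery at an
# assumed $50,000 per minute of downtime.
cost_per_minute = 50_000      # dollars, the figure quoted in the story
recovery_minutes = 3 * 60     # "it took three hours to get everything back to normal"

total_cost = cost_per_minute * recovery_minutes
print(f"Estimated cost of the outage: ${total_cost:,}")  # $9,000,000
```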
One thing you mentioned that I wanted to call a
little bit of attention to, and I'm not talking specifically about the Air Force. Please don't
feel you need to respond on their behalf. I'd prefer not to see you renditioned, although if
it was, it would be extraordinary, I'm sure. Yes. But I'm curious as to the metric of a minute of
downtime costs us $50,000. I mean, that's an argument that's easy to make depending upon what your company does.
But to give you an example, back a few years ago, I tried to buy something on Amazon.
I think it was probably a pair of socks because that's the level of excitement that drives my life.
And it threw a 500 error.
And that was amazing.
I'd never seen this from Amazon before.
That's quite impressive.
So I smiled, I laughed, I tried it again, same thing.
And of course, Twitter, or however long ago this was,
maybe Fark, if that was the year of what social media looked like back then.
So I shrugged, and I went and did something else.
And an hour later, I went and I bought my socks. So the idea of did they lose any money from that outage? In my case, the answer was no.
Because the decision point on my side was not, well, I guess I'm never going to buy socks again,
the end. And now I've been down one pair ever since. Instead, it's I'll just do this later.
Now, there is the counter argument that if one
time out of three that I tried to make any given purchase on Amazon, it didn't work, I'd probably
be doing something really sad, like buying from Target instead. But when it's a one-off and
it's rare and hasn't eroded customer confidence, there may not be the same level of economic impact
that people think there is. As a counterpoint, if you're an
ad network, every second you're down, you're not displaying something. No one is going to go back
and read a news article a second time so they can get that display ad presented to them. So in that
case, it's true. But I guess to that point, there is the question of what downtime really costs.
Do you have anything to say on that?
Yeah. So let's take your timeline or point in time of trying to buy these socks, right?
If Fark was the social network of choice, chances are those socks were only on Amazon.
When you look at the landscape now and how it's changed, right? You don't get fail whales on
Twitter anymore for a reason. You can buy the same pair
of socks on Amazon that you can buy on Target and vice versa. And guess what? They price match each
other to an extent. If Amazon can't sell you that pair of socks, you're going to go to a different
site and buy them there because those socks are going to be in more than one place if they're
on Amazon. Now, with that being said, you bring up a good point. The company I was with, where a minute of downtime cost 50 grand, was very much an ad-driven business. If you were buying something or you're consuming content and ad
revenue is being generated that way, people will find another place on the internet nowadays to go
get what they were looking for. That's just the nature of things, right? Like there's no sole
source of anything anymore. And you can't, you can't compete thinking like that nowadays.
10 years ago, you most definitely could, right?
Like you were only going to get that thing from Amazon because there was no way you were going
to go get it locally, let alone from some other website. And that's very fair. There's also the
argument to be made in favor of cloud migrations from my perspective, where if you go back to a
year or so ago when Amazon had their first notable S3 outage in the entire lifetime
of the service. And it was unavailable for, I believe, six hours or something like that.
There was a knee-jerk reaction in the SRE DevOps space of, well, now we're going to replicate to
another provider and we're going to go ahead and have multiple buckets in multiple regions. And
these things spike the cost rather significantly
to avoid a once every seven years style outage of a few hours. And when you look at how that
outage was reported, notice I'm talking to you about the Amazon S3 outage. I'm not saying,
oh, the American Airlines outage or the Instagram outage or Twitter for pets was down during this time.
Because it became, today the internet is broken.
And individual companies that were impacted by this weren't held to account in the same way
as if it had been just that one company with their own internal outage.
Because frankly, I struggle to accept
that you're going to be able to deliver
a level of uptime and service availability
that exceeds that of AWS.
They have an incredibly large army
of incredibly smart people
working on these specific problems all day, every day.
But I feel like there's also some safety
in being part of the herd, if you'll pardon the term.
When US East 1 has a bad day, we all have a bad day.
And it feels like there's safety in numbers.
Is that valid?
That's very valid, right?
It's amazing to me to think about how the world has changed. Because of services like
AWS and large-scale applications on the web like Twitter and Facebook and Google, people have a greater
understanding of the backing services that drive these things. It's also surprising to me how
many people rely on US East 1 exclusively. And that outage you were speaking of, I was directly
impacted by it. I don't know anybody who wasn't, but we had a significant amount of data in US East
1 and we decided, you know what,
US East 1 is kind of a dumpster fire from what we're hearing. Let's move it to US East 2. And
literally, it was just a regex. You just do a search, find, replace 1 with 2, move everything,
done, off you go, cool. But the idea of, yes, there is safety in the herd, right, because everybody
could sit there and be like, well, listen, we know, we bought this thing and it's down, we're sorry. I feel like
that's kind of a BS excuse, right? Like, you didn't engineer your application
to, you know, have fewer single points of failure.
If US East 1 goes down and you're calling something that
uses US East 1 as its sole backing service for something, that's a really
bad idea in my opinion,
right? So a lot of the things that were down, like look at Atlassian. Not to pick on any one
company, but a lot of their stuff was down. And I remember that very distinctly because I couldn't
get to my Jira instance. I couldn't get to my documentation. I couldn't get to a lot of things
because they use somebody that utilized US East 1 rather heavily for things, and they got bit by that.
That, I think, is completely unacceptable from, like, a business perspective.
You have to know where your single points of failure are and at least be aware of them. You
don't necessarily have to address them, because I do agree with you, good luck getting better
reliability than AWS has had, you know,
over the past five years. But I do remember a time when S3 was not, like, as good as it is now, just like any
brand new Amazon service is never going to be the best it is at release time. The idea of people
saying, you know, hey, we can just blame AWS for everything, I think is a very, very, very tenuous
situation to put yourself in. Because
if your customers are going to sit there and rely on you for something, they don't care.
If your SLA says 99.999% uptime and you don't deliver on that, that's your penalty. You can't
pass it on to AWS. You think they're going to foot the bill for that? No. It's like the cable
company at that point, right? Like, oh, we're sorry you don't have service. We're not going to give you a credit.
Or if there is an SLA credit, it's minor compared to the impact it potentially had on your business
as well. And again, this is not to bag on Amazon specifically. The fact that we can talk about
the single issue and everyone knows what I'm talking about is testament to how rock solid,
at least some of their services have become. Right. Like I remember reading the BuzzFeed newsletter the day after that outage,
and it talked very intimately about US East One. I mean, we're talking, this is BuzzFeed here,
all right? This is the people that made listicles a thing, right? But they know what US East One is.
It's amazing to me. Yes. Service number seven will blow your feet off. Yes.
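The "just a regex" region move Chris describes a bit earlier really can be that small on the configuration side, though the data itself still has to be copied and validated first. A minimal sketch, with the config directory and file pattern made up for illustration:

```python
import re
from pathlib import Path

# Rewrite region references across a (hypothetical) tree of config files:
# the "search, find, replace 1 with 2" step of moving off us-east-1.
for config in Path("config").rglob("*.yml"):
    text = config.read_text()
    updated = re.sub(r"us-east-1", "us-east-2", text)
    if updated != text:
        config.write_text(updated)
        print(f"Updated region references in {config}")
```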
Well, thank you very much for joining me today, Chris.
Are there any parting comments, observations, or things you'd like to show before we call it an episode?
I think the biggest thing for me is,
well, the two biggest things, right?
Like a DevOps journey is never finished, right?
You're never done migrating.
You're never done doing DevOps.
You're always doing DevOps, right?
That's thing one. Thing two is something I've realized maybe this year: as technologists, like you, myself,
and probably a lot of your listeners, we have to be more embracing of changes in our own work,
life, everything, so that we can help our organizations change. When we make change
simple for ourselves, for everything, I'm talking
like the way you drive to work, the way you log into your systems, the shell you're using for that
matter. When you can make change seamless and not painful for yourselves, it exudes a sense of
confidence when you're trying to make larger changes throughout the organization. We have to
get better as technologists in making changes and helping people embrace change. Very well put. Thank you once again for joining me here on Screaming in
the Cloud. This has been Chris Short of DevOps-ish, and I'm Corey Quinn. I'll talk to you next week.
This has been this week's episode of Screaming in the Cloud. You can also find more Corey
at screaminginthecloud.com
or wherever fine snark is sold.