Screaming in the Cloud - Episode 36: I'm Not Here to Correct Your English, Just Cloud Bills
Episode Date: November 14, 2018

Do you enjoy watching sports? Wear your favorite team or player's jersey? Are you a fan who has shopped at Fanatics in the cloud? Today, we're talking to Johnny Sheeley, Director of Cloud Engineering at Fanatics, a sports eCommerce business that manufactures and sells sports apparel. Fanatics runs cloud engineering to provide a robust and reliable set of services for building and deploying applications on top of the AWS platform.

Some of the highlights of the show include:
- If you compete with Amazon, be ready for it to come after you; some companies avoid its cloud entirely or go multi-cloud (a paranoia-based movement)
- Focus on your ability to make your business function smoothly
- Transition, migration, and abstraction may be painful, but should not stop work; paying for cloud-agnostic technology may not be worth it
- Challenges of governing the use of cloud resources to prevent mistakes and problems related to Fanatics' security and budget
- Data collected focuses on what's trending up or down to help select instance types and calculate costs; remain flexible and be aware of what you pay
- The natural instinct is to blame people; mistakes are made, especially when a human factor is introduced to an automated system
- Creating a mindset that balances feature work with detail-oriented cleanup is challenging
- A possible cottage industry around analyzing code bases running in big data and other expensive realms
- As a product continues to evolve and grow, governance comes along for the ride and AWS bills are streamlined
- Will serverless, Lambda, and RDS change how Amazon charges in the future?
- The state of scale of AWS and developing a more palatable method for releases, because people can't keep up with them and stop paying attention
- Two-pizza team: Amazon's management philosophy that any team that works on a service should be able to be fed with two pizzas
- Such small teams work quickly and have the freedom to fail, but Amazon is reliable about the longevity of its different services

Links:
- Johnny Sheeley's Email
- Johnny Sheeley on Twitter
- Rands Leadership Slack
- Hangops.slack.com
- Fanatics
- Kubernetes
- Azure
- Lambda
- RDS
- Getafix: How Facebook Tools Learn to Fix Bugs Automatically
- Accidentally Quadratic Blog
- re:Invent
- Jeff Barr's AWS News Blog
- Amazon SimpleDB
- Lots of Amazon's projects have failed...and that's ok, says Amazon's Andy Jassy
- DigitalOcean
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This week's episode of Screaming in the Cloud is generously sponsored
by DigitalOcean. I would argue that every cloud platform out there biases for different things.
Some bias for having every feature you could possibly want offered as a managed service at
varying degrees of maturity. Others bias for, hey, we heard there's some money to be made in the cloud space. Can you give us some of it?
DigitalOcean biases for neither. To me, they optimize for simplicity. I polled some friends of mine who are avid DigitalOcean supporters about why they're using it for various things,
and they all said more or less the same thing. Other offerings have a bunch of shenanigans
around root access and IP addresses.
DigitalOcean makes it all simple.
In 60 seconds, you have root access to a Linux box with an IP.
That's a direct quote, albeit with profanity about other providers taken out.
DigitalOcean also offers fixed price offerings. You always know what you're going to wind up paying this month,
so you don't wind up having a minor heart issue when the bill comes in.
Their services are also understandable without spending three months going to cloud school.
You don't have to worry about going very deep to understand what you're doing.
It's click button or make an API call and you receive a cloud resource.
They also include very understandable monitoring and alerting.
And lastly, they're not
exactly what I would call small time. Over 150,000 businesses are using them today. So go ahead and
give them a try. Visit do.co slash screaming, and they'll give you a free $100 credit to try it out.
That's do.co slash screaming. Thanks again to DigitalOcean for their support of Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
I'm joined today by Johnny Sheeley, who, in addition to being a fantastic dresser,
is the Director of Cloud Engineering at Fanatics.
Welcome to the show, Johnny.
Hi, thanks for having me.
That's a really wonderful wardrobe
compliment. I don't know if it's founded, though. That's the beautiful thing. You have this very
cultured voice, so whenever people listen to you, they assume you're well-dressed.
Or that I'm dressed at all, which is phenomenal. Yeah, don't make the audio folks bleep out
too much of this. Good lord. I was trying to make Kendall happy. As it works. So explain to us from the
beginning, what does Fanatics do? So Fanatics is a sports e-commerce business. And we do everything
from manufacturing sports apparel to selling it on our own sites, to running major league sites, to running sites for international
teams like Manchester City. So if you were wearing a Titans jersey or some sort of soccer jersey or
anything like that, you probably wound up interacting with us at some part of that.
That's very reasonable. It effectively is sportswear sold through e-commerce. Got it.
Yep.
And you run cloud engineering there. What does that look like, I guess from an outsider's view? What is cloud engineering at Fanatics, in whatever level of depth you're comfortable sharing publicly?
Yeah, that's actually a really difficult question. Internally, we've been working on defining that. I'm, I believe, the fourth or fifth iteration of management in this area, and I've got my own specific bent. What it means to me is that we provide a robust and reliable set of services that allow our engineers an easy experience of building and deploying applications on top of the AWS platform.
Historically, there have been efforts to provide operational support and do a bunch of architecture.
And over time, we found that that's just really difficult to scale. And the challenges that each
individual team winds up having are really theirs to own and theirs to
solve. And so in some ways, we've become more of a conduit between those teams and TAMs on the AWS
side. And internally, we're focusing a lot more on productivity tools and providing a solid platform,
both from the sort of service discovery, secrets management, and your favorite, Kubernetes.
Absolutely.
So if you can, I guess, address a somewhat common theme that has sort of come up,
not just in this show, but in loud, heated arguments I have with people at conferences, usually over drinks.
There's this idea of if you're in a market that potentially competes with Amazon,
that you don't want to wind up using their cloud perspective. Or if you do, you want to at least be
able to go multi-cloud and at a moment's notice, be able to pivot to a different provider.
I mean, you mentioned in your description of what Fanatics does that you are an e-commerce company.
An awful lot of folks in that position try and actively avoid Amazon.
Was that ever something that was on your radar?
You know, I think that at the end of the day,
everybody has to have some sort of perspective
on what will happen when Amazon comes for me
because they're coming for you.
And it doesn't seem to matter what business you're in
or what city you live in.
They've got some sort of idea of how they're going to take that and do something with it.
The overall thing that I think is important to us is to really focus on our ability to make our
business function smoothly. And if we have in the back of our minds some thoughts on what if Amazon were to make moves in a direction that would be harmful for us, then we will have a way to get out of that.
That's the sort of thinking that I believe we've really focused on.
So really, in other words, we're not going multi-cloud right off the bat. There are specific use cases where we see a
stellar set of tools where there could be something where we run a Microsoft program
on-premise and they have disaster recovery for it that's plug and play in Azure. And cool,
that's an easy thing to adopt. Or, you know, Google's got Spanner and Dataflow,
and those are really interesting technologies to take a look at. But they're not necessarily
the sort of, I don't know if I want to call it paranoia-based movement, but real specific use
cases where we gain a significant benefit from moving in that direction rather than providing abstractions everywhere
so that you don't care about what cloud provider you're on.
I generally tend to agree with the perspective.
The other piece of it, of course, winds up being somewhere that is,
I guess, trying to figure out,
well, what if this thing happens in three to five years?
What if we need to be able to embrace that
in a reasonably quick response time window?
And I'm not convinced that's necessarily as viable of a concern as people like to pretend it is.
I'm a fan of building things that could at least theoretically be transitioned out.
For example, if you're requiring Google Cloud Spanner as a core tenet of your architecture for your software application,
maybe that's not the best move.
There's no equivalent anywhere else,
and you're redesigning everything from scratch.
If you're running a traditional CRUD app,
then as long as you're effectively building something
that doesn't require a tremendous number of tweaks architecturally
to move somewhere else,
then it's still going to be painful,
but it's not going to be an all-work-stops-for-18-months-while-we-do-a-migration story.
Absolutely.
And I think even the level of abstraction that you can find yourself
getting into with a single provider can begin to open up those thousand cuts.
There are a number of different service discovery tools. There are things like Kubernetes. There
are all these different ways that you could be implementing your own platform. Because, I don't think that, well, you're the expert, so I'll defer to you: we're not necessarily happy with just using DNS for service discovery. So we'll use Consul, or we'll use something that's based on ZooKeeper,
or these other areas where you do wind up investing in a technology
that is cloud agnostic, but you're then paying rent on that.
You're continuing to have to update it,
keep it running appropriately as you scale out.
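To make the "paying rent" point concrete, here is a minimal sketch of what registering and discovering a service against Consul's HTTP API might look like; the service name, address, and port are invented for illustration, and the "rent" is everything around a snippet like this: running, upgrading, and securing the Consul cluster itself.

```python
# Minimal sketch: register a service with a local Consul agent and look it up.
# Assumes a Consul agent listening on localhost:8500; names/ports are illustrative.
import requests

CONSUL = "http://127.0.0.1:8500"

def register(name: str, address: str, port: int) -> None:
    """Register a service instance with the local Consul agent."""
    payload = {"Name": name, "Address": address, "Port": port}
    resp = requests.put(f"{CONSUL}/v1/agent/service/register", json=payload)
    resp.raise_for_status()

def discover(name: str) -> list[tuple[str, int]]:
    """Return (address, port) pairs for healthy instances of a service."""
    resp = requests.get(f"{CONSUL}/v1/health/service/{name}", params={"passing": "true"})
    resp.raise_for_status()
    return [(e["Service"]["Address"], e["Service"]["Port"]) for e in resp.json()]

if __name__ == "__main__":
    register("checkout", "10.0.1.23", 8080)
    print(discover("checkout"))
```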
What's the impact there? So I think
that there's a little tax that we all pay, but I agree with your assessment that really trying to
implement now what you'll need in five years is a really difficult story. And you'll probably wind
up building something that doesn't have anything to do with what you really need in that
next time frame. I would say that you're probably right. The challenges generally don't tend to come
from vendor lock-in so much as they do, to some extent, I guess, a governance model that doesn't
map appropriately to what the company is trying to achieve. I mean, you're sort of a case study
in that, I would imagine, in that you describe a centralized cloud engineering
group that can be loaned out to other product and feature teams. How do you effectively govern
the use of cloud resources to, for example, keep people from blowing the budget, to keep people
from making hilariously awful security mistakes, from effectively just going off in a bunch of
different governance directions and
causing problems for the organization, either financial or risk-based?
Gosh, we have some really interesting challenges there. And there are different models that I've
seen out there where on one end of the spectrum, it seems like there's something along the lines
of Netflix where you can go and just build whatever you need to. And you can also expect that another team may come along and kill your stuff.
And it needs to be resilient or you need some sort of remediation to be there to expect your services to live.
And then there are things like former employers where I was familiar with a very specific sort of blessed method of handing from
team to team your jar, your deliverable that moves into a new environment, gets load and
performance tested. And there's a lot of manual stuff. And some of the challenges that we have
here at Fanatics are that we don't have a homogenous group of people that all have the
same desires as far as management of their infrastructure and applications. And there
are people who want to be able to hand things off and have the security model and deployment and
operation of it all handled for them. And then there are people who want to get deep down into
determining what sort
of instance type makes the most sense for them. And, you know, what, what level of network ops
and, um, any sort of disk IO, things like that, where you wind up having a really nebulous problem
if, if you're governing that. So because of the different levels of maturity amongst teams,
and their different focuses, we've got a pretty wide variety in how we actually engage with
different teams. And the primary focus for us right now is security, and the secondary is budget. So as far as security, we have a really awesome team
that is able to go out
and actually very proactively find issues
with whether it's an OS bug
or some sort of software package
that we're leveraging
and be able to work with each individual team
so that depending on the level of exposure
of their application,
they can identify like, hey, we need to remediate this immediately. Or maybe this is an internal
tool that is actually locked down in a number of other ways. So it's okay that, you know,
they've got some sort of SQL injection issue, but keep that in the back of your pocket. And
at some point, you probably want to fix that. On the other end of the spectrum, we've got this budget thing, and we've got a number of teams that
are asked by our business to deliver tremendous amounts of data processing in a very narrow time
window. We want, especially as we're approaching Black Friday, Cyber Monday, and some of the
different hot markets that we serve, an ability for our users to, our internal users to be able to see, hey, I need to go and order
10,000 more of these jerseys or another 30,000 hats because the team that's looking like they're
going to win will wind up really causing us to sell out of what we've got. Or we need to be able to near real-time process
a lot of events as the World Series ends
or some other major event is ending
so we can actually have that real-time view of,
hey, this is what sales are doing.
Maybe something's going on with this part of the system.
And it becomes a really interesting challenge
because all of that data is funneled through my team
and winds up being essentially shared out to other teams.
And we give some sort of a bit of feedback on,
hey, you're trending up, you're trending down.
This is great.
It looks like you may be adopting different things
or maybe you should be looking at different
instance types. And we've actually got a principal engineer here who focuses a lot on whether people
are using the right instance types, if we've got the right reservations. But the model that
we're aiming to get to is really being able to calculate based on a declarative model,
what sort of costs you're going to be incurring and where your service is actually exposed so that we can do static analysis of what our entire cloud architecture looks like.
And be able to predict, hey, this commit that you just checked in to provide more Cassandra servers, that's actually going to cost like $100,000 more a month. Maybe
we should reel that in and take a look at what's going on with your team. Alternatively, you know
what your team needs to provide. So maybe that budget is actually something that is sensible.
And that's a real area where I'm very interested in seeing continued evolution within the industry as far as how that
information is shared and then governed and the way that people allocate resources, especially
across teams as we move more towards a shared model. Which makes an awful lot of sense. The
counterpoint, of course, to that always becomes one of where is the right organizational
balance per company, I suppose. You wind up very quickly walking into a world where you
see certain companies try to wind up mapping forward a governance model from the on-prem
days of where everything was done as CapEx and planning ahead was something that you
had to do. So they think nothing of, well, it used to be six weeks to provision a server.
Now we're going to make instance provisioning take a week. And it feels like
it's the right move. But in practice, when people go through that, they never, ever
turn things off because they very quickly turn into a scenario where, well,
it takes a week to get this spun back up, so I'm just going to leave it there.
And you wind up effectively with a policy that works against itself.
Absolutely. Yeah. And I mean, I think that it's even fair to say that as humans,
we don't necessarily do a good job of prioritizing cleaning things up. I keep a mess at my desk on a
regular basis, and it takes some level of a jarring sensation that there's dirtiness around
for me to actually want to change that. And, you know, particularly when something is digital and
not in your tangible world, it's really easy to spin up a gigantic instance that is very expensive
or a cluster and walk away from that and not really
be aware. And that's something that totally has happened. And to your point, I don't think that
we're an organization right now that optimizes for locking down every single thing. We have a
lot of flexibility for our engineers and we enable them to go and use their own authority to say,
hey, this may be a gigantic expenditure if it were to stay on for a year,
but it'll get something done today that I wouldn't be able to accomplish in a number of weeks if I weren't to use this or I want to experiment.
And that's definitely a spot where I don't want to be preventing anyone
from being able to actually accomplish what they're setting out to do.
It's a rather concerning thing.
As you're talking about looking back towards the on-premise days where you kind of had to depend on a specific team or person to push your application live.
And that just doesn't sound like fun for the person that's that bottleneck, right?
I don't want to be there. I don't want to be saying, well, this is going to cost too much,
so don't do it. So that's a really interesting area for us to need to remain flexible, but also
have some semblance of guardrails so people aren't necessarily shooting themselves in the foot if
they really step into it accidentally.
And let's also not escape the fact that a lot of times this is not due to any sort of
bad actor sort of scenario. This instead turns into a scenario pretty rapidly where you're seeing
people making honest mistakes. I mean, my entire life is built around my consultancy
of optimizing every AWS bill that comes in front of me, which means that, yes,
I spend time optimizing my currently roughly $30 bill. And that's a complete waste of my time.
But I take a look recently when the last bill came in, and I had a $20 spike because I'd forgotten
that VPC endpoints in a test account had been left running, and those incur a per-hour charge to the tune of $20,
which is nothing as far as my business goes. But as a percentage of my bill, it was something like over 50% of what my existing bill was, but then added on top of it. That's terrible. That winds
up just being the sort of thing that happens. And while it's frustrating, at scale, something like that leads to people
getting yelled at. It leads to gatekeepers being put in. It leads to people being unable to spin
up resources without going through vast swaths of approval. And that model doesn't seem to work
either. Oh, absolutely. I think I shared with you my new backup solution that I implemented very poorly.
And I think it something like quintupled my AWS bill just because it was querying S3.
It wasn't actually even writing any additional data to S3.
It's very easy to make a mistake with cloud APIs and interacting with them.
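Both anecdotes reduce to the same back-of-the-envelope arithmetic: a tiny unit price multiplied by something that runs all month. A rough sketch, using approximate published rates (around $0.01 per interface endpoint per hour and around $0.005 per 1,000 S3 LIST requests; treat the exact numbers as illustrative):

```python
# Rough arithmetic for "small" charges left running all month. Rates are approximate.
HOURS_PER_MONTH = 730

# Forgotten VPC interface endpoints at ~$0.01 per endpoint-hour (per AZ)
endpoints, endpoint_hourly = 3, 0.01
print(f"VPC endpoints: ~${endpoints * endpoint_hourly * HOURS_PER_MONTH:.2f}/month")
# roughly $21.90/month -- about the size of the $20 surprise described above

# A backup job that lists an S3 prefix every second, all month long
list_calls = 30 * 24 * 3600           # ~2.6 million requests
price_per_1000_lists = 0.005          # approximate
print(f"S3 LIST calls: ~${list_calls / 1000 * price_per_1000_lists:.2f}/month")
# roughly $13/month per prefix -- easily a multiple of a small personal bill
```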
Oh, absolutely. And none of this stuff is intuitive, and none of this stuff is one of
those intrinsically obvious things. It all comes back to the fact that this is complex,
this is hard to do, and no one really has a great answer as far as how to get to sanity.
Absolutely.
I wish I did. Believe me, I'd sell it to people.
But unfortunately, I kind of don't have that luxury.
Well, yeah.
And the best part is it's often not even just a human.
We've got a system that is built in-house that is similar to Fugue or sort of a constantly running Terraform where it sees a model of what the infrastructure should look like.
It queries AWS APIs to find the delta, and then it remediates.
And there have been times where it's killed things that are critical by accident, thankfully in dev environments.
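The shape of the system being described is a desired-state reconciliation loop: read a model, query the provider for the actual state, and remediate the difference. A heavily simplified sketch follows, with a made-up spec format and boto3 for the AWS calls; it only prints what it would do, since the remediation step is exactly where the failures described here come from.

```python
# Simplified desired-state reconciliation loop (illustrative only).
# `desired` maps a Name tag -> the instance type we expect to exist.
import time
import boto3

ec2 = boto3.client("ec2")

def actual_state() -> dict[str, str]:
    """Return {Name tag: instance type} for running instances."""
    state = {}
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
                if "Name" in tags:
                    state[tags["Name"]] = inst["InstanceType"]
    return state

def reconcile(desired: dict[str, str]) -> None:
    actual = actual_state()
    missing = desired.keys() - actual.keys()
    extra = actual.keys() - desired.keys()
    # Real remediation would launch `missing` and terminate `extra`; this is
    # where you want dry runs, approvals, and rate limits.
    print("would create:", missing, "| would terminate:", extra)

while True:
    reconcile({"checkout-api": "m5.xlarge", "orders-worker": "c5.large"})
    time.sleep(300)
```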
And there have been times where it's accidentally spun up things that,
you know,
a human would do and a machine can do it much more faster.
That's good.
I'm not here to correct your English,
just cloud bills.
Yeah.
Well,
if,
if my English were worse, then my cloud bill would probably be better.
Yeah.
Like we,
when you've got an automated system
that is going out and interacting with cloud providers
or anything that can be spinning up resources
that are expensive,
then adding a human factor to that,
whether it's the human implementer of that system
or the human variables saying,
oh, we need to scale this cluster up, you can very
quickly cost yourself a lot of money accidentally. Oh, yeah, absolutely. I see that constantly. And
it's one of those areas where the natural instinct is to blame people for what's gone on, either the
people who didn't budget appropriately or people who spun resources up or try to prevent this
terrible thing from ever happening again. I mean, and people have taken different technological approaches that
sort of result in mixed bags. The idea of mandating tags, of shooting down infrastructure after it's
been alive for a certain period of time, of having a provisioning system that nags you every week,
that you're running X dollars in your development account. But by and large, it mostly has to do
with a mindset shift.
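Each of those approaches boils down to sweeping what is actually running against a policy. As a hedged sketch of the tag-and-TTL flavor, with made-up tag names ("owner", "expires"), something like this run on a schedule produces the weekly nag report:

```python
# Sketch of a scheduled governance sweep: flag running instances that have no
# owner tag or that are past their expiry. Tag names are illustrative.
from datetime import datetime, timezone
import boto3

ec2 = boto3.client("ec2")

def sweep() -> list[str]:
    findings = []
    now = datetime.now(timezone.utc)
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
                if "owner" not in tags:
                    findings.append(f"{inst['InstanceId']}: missing owner tag")
                elif "expires" in tags:
                    expiry = datetime.fromisoformat(tags["expires"])
                    if expiry.tzinfo is None:
                        expiry = expiry.replace(tzinfo=timezone.utc)
                    if expiry < now:
                        findings.append(f"{inst['InstanceId']}: expired {tags['expires']}")
    return findings

for finding in sweep():
    print(finding)  # in practice this feeds the weekly email or chat nag
```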
And I'm not convinced for most companies until they hit a reasonable point of scale,
that training the engineers who can provision resources on the nuances of cloud costing is
necessarily the right answer. Yeah, I think you're hitting the nail on the head there.
And I would actually be curious when you actually reach that point. It seems like there's almost
always a dividing line between the folks who are focusing on feature work and those who are
really coming back to do some of the more detail-oriented, hey, what can we be doing
more efficiently? There are a few people throughout my career that I've met where it does feel like
they're able to spread across both of those realms, but it's a really challenging mindset
that I think you won't find in a lot of people where, oh, I want to go out and create this great
art, but then I want to leave the studio spotless when I'm done. I don't know. Is that something
that you've encountered out there? Or do you
typically find that, hey, management has reached its budget threshold and they really are concerned
about what's going on? What I tend to see is that there's very few hard and fast rules that map to
everything. You're going to see some companies where coming in very early and structuring out
a costing program makes sense. You see other companies that are riding a rocket ship,
and while they're spending tens of millions of dollars a year on cloud spend,
that's a tiny molehill next to the mountain of revenue that they're seeing,
or VC money that's pouring in, or potential upside.
It's one of those stories where when you're all hands on deck in a hyper growth company, optimizing to save a few
bucks here and there is absolutely not material to your business. There does come a time where
that changes. Conversely, I'm a bootstrapped consultancy of one where when my cloud bill
starts spiraling away from me, if I wake up to a $20,000 bill tomorrow, I should probably fix that
before I do almost anything else. Because
it doesn't take too many of those before my business starts winding up in trouble.
It comes down to a number of different levels of maturity. That's why I've never been a fan of
the models for cloud governance that tend to equate everyone to being similar.
That's always going to be disparate based upon who you are and what your constraints look
like.
Yeah, totally.
And, you know, I was just reading a really interesting article on Facebook's new or newly
public bug remediation and automation of suggested changes in their code bases that sort of makes me think that that might
be an interesting area for us to head. And similar to, are you familiar with the blog
Accidentally Quadratic? I am not. That sounds like math. I was told there would be no math.
So it's all these really, really wonderful code snippets where people have found that it's just an inefficient algorithm being used.
And they share a little bit of the context around what the code base is, what the intention behind implementing it this way probably was, and how they went and made it better.
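For anyone who hasn't read it, the pattern that blog catalogues usually looks something like this: a membership check against a list inside a loop quietly turns a linear job into a quadratic one, and the fix is a one-line data structure change.

```python
# Accidentally quadratic: `in` on a list is a linear scan, so this loop is O(n^2).
def dedupe_slow(items):
    seen = []
    out = []
    for item in items:
        if item not in seen:   # O(n) scan per element
            seen.append(item)
            out.append(item)
    return out

# The usual fix: a set makes the membership check O(1), the whole thing O(n).
def dedupe_fast(items):
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out
```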
You see companies like HashiCorp coming with some new features to help predict costs. You see a lot of
AWS Trusted Advisor and other things like CloudHealth moving in different directions of
helping to at least say reactively, hey, you spent too much, you need to solve this.
I won't be terribly surprised if this is a new sort of cottage industry of
ML or something where you're actually looking at the code bases that are running,
particularly in the big data realm or the other truly expensive as far as compute and data
transfer areas go, where you're not just saying, oh, let's reserve instances,
but let's actually take a look into your code and double check. Are you using the
current version of the framework? Are you using the minimal amount of data that you could be?
And sort of removing that from the responsibility of those creative types who are more responsible for going out and building
new features for the business. I don't intend to say unfortunate things about a lot of the
vendors in this space, but every time I've seen something like this today that purports to use
machine learning to determine whether your resource usage is sensible, whether things
should be turned up or turned down or not, they either tend to focus on a very small portion of the overall picture, or they tend to have
unfortunately naive assumptions baked in. A quick and easy example. There's no way programmatically
to distinguish between an instance that is oversized and sitting idle and should be
downscaled or turned off, and an active DR site that's going to have about three seconds of warning before it gets slammed
with traffic. In one of those, you want to turn those things off. In the other scenario,
you absolutely don't. And that's a business process problem. That's not something that
I've ever seen any realistic chance of solving via writing code. The same story with,
to be frank, a lot of these businesses pricing models, where it's a percentage of your bill in
order to sit there and do analysis. Well, okay, that's fine, I guess, but no one likes the model.
I mean, when I've tried that in my very early days of a consultancy, I got laughed out of the room.
Now I charge fixed fee with guarantees that I do it and I wind up not having to fight that particular battle the same
way. Yeah. And I'll totally admit that I am a total nerd and optimist. And I believe that there
are a bajillion areas that in the next 20 to 40 to whatever years, we'll see some really astonishing changes.
I totally agree that right now, the industry has paid little attention. And it's, as you're saying,
not a high value proposition to come into the next unicorn and say, hey, as you're making that
billion dollars, I can save you
20K every month. That's not really worth their time. No, and it's really not. It becomes a better
narrative around the idea of helping establish good practices, good governance, demonstrating
they're being responsible stewards of the money entrusted to them. But it's not the big win in
this space. For a second there, I thought you were going to say that,
oh, the code is going to get better.
In the future, the cloud bills will self-optimize,
at which point I'm obligated to ask you,
will we pay for them in Bitcoin?
But I'm sorry, I'm not one of those people
with stars in their eyes and everything is terrible
up until this point, but the future is better.
And we see evolutions of these things.
I think to some extent,
the providers are going to have to come up
with some form of simplification pass over their bill.
They'd have to.
The level of increase in complexity over time
is not something that's going to be sustained.
The other side of that, though,
is how do we get better than we are today?
If we don't have a perfect solution,
okay, we don't need it to be.
But how do we get better than we have now?
Yeah, and isn't that what you do?
From my perspective, but there's only one of me.
There should also, to be very blunt with you,
I shouldn't have a business.
This shouldn't be as complex of a problem as it has become.
You shouldn't need to bring in a consultant
to solve these things.
And until companies are spending
at least a certain baseline threshold on their cloud bill, I can't help them because there's no ROI for
retaining me. Yes, I'll come in and look at your bill and you'll hit break-even on my services in
only a couple decades. That's not a compelling sales pitch. So it's not something that's ever
going to work. And you shouldn't have to be spending a king's ransom in order to make those numbers make sense. It should be something that as the product continues to evolve and grow that you're building, that governance just sort of comes along for the ride, that your bill streamlines itself. And I think that we're doing, do you have other industries that you see similar consulting
where there are either retailers or some people dealing in physical goods where it's a similar
problem where they need to optimize? I mean, I could imagine that there are industries where
paying the right amount for raw goods, that's critical.
But do you have any analogs that you've really used to help guide yourself as you've embarked
down this road? Not exactly in the way that you mean it, but there's nothing new about my business
model. We saw this in the 70s and 80s where companies would come in to large enterprises
and say, hi, I'm a consultant. I'm going to just sit
in a room quietly and tear apart your telephone bill. Because back then, telephone bills were
complex, they were massive. And they would say, we'll find errors that the phone company made
when they calculated these things out. And when we save you money, we'll take a percentage of it.
And that was a brilliant business model that I don't think we can quite get back to.
But the beauty of that was first, it's money that the company is never going to recoup.
Secondly, it requires zero investment on the company side other than,
here's the bills, now go away and tell us what you find.
It doesn't require a team of engineers to sit there with someone and explain architecture.
It doesn't require a team of people to sit there and go back and forth with vendors and negotiation team. It became very simple and
very streamlined. I don't think that there's quite a direct equivalent to that, but I did
take inspiration from that philosophically. So do you think that there are similar evolutions
that are coming in cloud computing? Because I mean, you look at our phone
bills today, and I pay a flat rate every month. And when I go to Europe, it doubles. And that's
fine, because I know it's also just going to be another flat rate. Do you think that we could
get somewhere like that with especially all of the serverless, you know, not just talking about
Lambda, but moving into the RDS realm, it seems
like at some point Amazon could be charging me per cycle or per request or conversion or
something that's a little bit different than just this dollars and cents to resource reservation
time. I hesitate to try and predict the future. It always seems like that's either one of those
things that winds up leading very quickly to, yeah, you were right, no one cares, or you were
wrong, now we're going to laugh at you for eternity. There's no real upside to that.
I will say that the current pace that AWS seems to be on in several fronts is unsustainable.
For example, right now the market is always talking about percentage growth. Well, if you make boats and you sell them for a million
bucks a piece, and last year you sold one boat and you were independent, now you've hired an
assistant and you sell two boats this year, you've demonstrated 100% year-over-year growth.
Back when you had a $20 million cloud business, saying we made $40 million
this year on it. That's easy. The growth numbers are fantastic. They have eclipsed, I think,
$25 billion a year now as a run rate, according to their last published numbers.
That is a much larger number to have to double and try to onboard rapidly. People generally don't tend to spend
that much that quickly in a new platform except by accident. And accidentally charging people a
few billion dollars is not great customer service. Counterpoint, it only has to work once.
Yeah. Where do I apply for that?
Absolutely. So you also see this now on the other fronts where at reInvent, for example,
they get on stage and they trot out their slides showing year over year number of feature releases
and enhancements. Okay, that is good to know that you're not resting on your laurels and you're
innovating rapidly, but that line can't continue up forever. We're already at a point where there
are services out there that solve problems that I've had and I didn't know they existed.
And I spent a fair bit of time tracking this down.
Instead, I have to go down this entire merry-go-round when I get confused or caught out by something new and exciting that
launched. But eventually, you're going to see a world where the official Amazon blog that Jeff
Barr writes just doesn't have enough space to wind up publishing these things. He collapses
due to exhaustion from writing 85 posts a week.
And at some point, for people working on these things, we all have jobs to do that don't include analyzing new service releases or feature enhancements. So we stop paying attention,
even to the things we really should be paying attention to. Things can't go up and to the right
forever. And what that leveling off or normalization starts to look like, I have no clue.
There are smarter people than I am at Amazon who work on these things as a full-time job.
I'm just sitting here in the cheap seats throwing peanuts at people and sometimes rattling the cage and screaming.
Well, you hide that part very well.
Oh, yes.
The things we say in public and things we scream in the middle of the night while working on articles.
I like your approach.
The time makes a lot of sense.
Oh, yes.
Nothing good ever happens after 3 a.m.
Whenever I'm writing blog posts, then nothing good. So when you're describing these sort of granular services and the solutions to problems that are not well publicized,
do you think that that's just the state of scale of AWS specifically?
Or do you think that it's their approach and folks like Google or Azure or AliCloud
or whoever out there might be taking different approaches that
would actually be able to condense those solutions into something that's more palatable,
more meaningful, and easier to adopt? I don't know. That's a great question.
But even now, you wind up not just with competition from third parties, but,
for example, let's say that I have a string that I want to send from me to you.
And I want to do that programmatically via APIs.
Within AWS, there are no fewer than 15 different services I can use to store that string and have it go to you.
And that number is not getting smaller.
And incidentally, I'm not talking about terribly abusing services, either.
Well, technically I could spin up Amazon Chime
and message you.
No, that's not what I'm talking about.
Or, well, theoretically I can spin up an EC2 instance
and store that string in a tag.
No, none of that.
We're talking using services as generally intended.
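As one concrete instance of "send a string from me to you" with a service used as intended, here is the queueing-service version via SQS and boto3; the queue name is made up, and any of the other dozen-plus options would have a similarly small happy path.

```python
# Sending and receiving a string with SQS, used as intended. Queue name is illustrative.
import boto3

sqs = boto3.resource("sqs")
queue = sqs.create_queue(QueueName="corey-to-johnny")  # idempotent for matching attributes

queue.send_message(MessageBody="hello from the cheap seats")

for msg in queue.receive_messages(WaitTimeSeconds=10, MaxNumberOfMessages=1):
    print(msg.body)
    msg.delete()  # acknowledge so the message isn't redelivered
```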
And the varying differentiators between these services are getting harder and
harder to discern. Back when there was one queuing-style service, it was easy. You use that
one and complain about it. Now that there are 15 of them, you pick one, are convinced you do the
wrong thing, complain about it, switch to something else, trip over a constraint you didn't know
existed, and the cycle repeats until you eventually give up and go raise goats on a mountainside
somewhere. I like goats.
That's because you never raised them.
This is true.
So you've got way more of a close relationship
with Amazon than I do, for example.
Much to their everlasting chagrin.
You don't know that.
That's just what they say to your face.
You should see what they say when they think I'm not listening.
So as you're talking about this evolution of growing nearly the same service over and over
again, have you experienced anything that you could share around why that happens? I
completely understand the concept of not invented here,
but is it something that they can find another two-pizza team
that is so dissatisfied with the service
that they really just have to reinvent it?
Sort of. It's a great question.
And this is sort of the Achilles heel, from my perspective,
of the entire Amazon model.
For those who aren't aware, the term two-pizza team is an Amazon management philosophy. They believe that any team that works
on a service should be able to be fed with two pizzas. My take on that is you're not allowed on
the team unless you can eat two entire pizzas yourself. History will say which was better.
But as they're building these things out in the small teams, they get ideas, they do
internal style of bake-offs, to my understanding. And that's why you wind up with services that
wind up competing with one another. They move very quickly, they have the freedom to fail,
which is incredibly valuable. And by the time something launches, it's generally already got
customers lined up to use it. They aren't building things and hoping that people use these things one day. They have customers who are asking for the specific things that they build.
The counterpoint and the pain that many of us experience is that anything that depends upon
a shared service for all of those is very difficult. Take a look at the console, for example.
You have to unify all of those services and present them in the same way. That's really hard. You take a look at other shared services
like the bill. Every different service team has a different billing model and the numbers of
dimensions and metrics that wind up influencing that bill. The billing system alone is an
incredible service that most people don't understand as far as the sheer volume
of data that it has to process and what it has to do to get those bills out to people on time.
But people's only interaction with that is at the end with the output where first, it's a bill. No
one's thrilled to get one of those. Secondly, it's super complex. No one likes that either because
here's what we're charging you and here's why and you look at that and you feel dumb is a crappy customer experience.
How do they fix that?
Couldn't tell you.
Yeah.
But if you just, but you're going to feel dumb because you feel dumb because you're
dumb.
I mean, there's, there's some basic expectation there.
I tend to not be a big fan of blaming people who are confused or annoyed over the bill itself.
I mean, it's in anything in this space. There is no simple problem in anything that touches the
cloud. If your answer to a problem is, oh, you should just stop speaking there because you're
already wrong. Yeah. And that was more a comment on me feeling perpetually dumb, which is just something
that I'm dealing with personally. One of the things that you mentioned in there that I think
is really interesting is you called out the freedom to fail. And I've also seen you talk
about the reliability that Amazon has as far as the longevity of the different services.
So what does that mean when you say that they've got the freedom to fail? Is that something that's
just internal? The project may not make it to production? Or have you seen instances where,
you know, just there aren't enough people using this thing, so we're actually going to
be sunsetting it and have some potential
significant impact on users. I've never seen them sunset a product. I've seen them
deprecate things a couple of times in strange ways. The first is reduced redundancy storage.
It no longer participates in price cuts. It's an S3 storage class, and it now costs more than the
good storage. It's still there if you want to use it.
But the one that I find more interesting is SimpleDB.
You don't see it in the console.
It's not advertised.
And relevant to this conversation, Andy Jassy, the CEO of AWS, publicly referred to it as a failed service, which is fascinating to me. The value of being able to say something like that publicly,
even though it still has active users on it,
there's still a service team maintaining it,
incidentally, that feels like the saddest job in the world.
But it's not something that they're ever going to turn off completely
because they made a commitment to customers
that you can build a business on this.
And until that last customer gets off
of using that product or service,
Amazon's going to continue to honor that,
as best I can tell.
Now, I'd be surprised at this point
if they don't have teams of people
actively working with some customers
to migrate them off
so they can finally turn it off.
But to date, that hasn't happened.
I'm not particularly worried about trusting Amazon
with my production infrastructure.
That's fair.
As opposed to other cloud companies
who turn things off for kicks.
I believe that could fall under some form of chaos engineering.
It's just the branding that's missing.
Absolutely.
We've decided to turn off the database
that you're building everything on top of.
Have a good day.
Yeah, no one's having a good day when that happens.
Business chaos.
Exactly.
If people want to talk to you more,
where should they find you on this wide internet of ours?
Probably the place that I'm interacting the most
is on the Rands Leadership Slack.
I'm on the Gopher Slack and the hangops Slack as well.
But there's Twitter
or they can just email me
at sheely at ag.org.
I'm out there.
Perfect. I will throw links to those things
in the show notes.
Thank you so much for taking the time to speak with me today.
It's appreciated.
Yeah, thanks, Corey. This was awesome.
It really has been.
Johnny Sheeley, Director of Cloud Engineering at Fanatics. I'm Corey Quinn, and this is Screaming in the
Cloud. This has been this week's episode of Screaming in the Cloud. You can also find more
Corey at screaminginthecloud.com or wherever fine snark is sold.