Screaming in the Cloud - Episode 36: I'm Not Here to Correct Your English, Just Cloud Bills
Episode Date: November 14, 2018

Do you enjoy watching sports? Wear your favorite team or player's jersey? Are you a fan who has shopped at Fanatics in the cloud? Today, we're talking to Johnny Sheeley, Director of Cloud Engineering at Fanatics, a sports eCommerce business that manufactures and sells sports apparel. Fanatics runs cloud engineering to provide a robust and reliable set of services for building and deploying applications on top of the AWS platform.

Some of the highlights of the show include:
- If you compete with Amazon, be ready for it to come after you; some companies avoid its cloud entirely or go multi-cloud (a paranoia-based movement)
- Focus on your ability to make your business function smoothly
- Transition, migration, and abstraction may be painful, but should not stop work; paying for cloud-agnostic technology may not be worth it
- Challenges of governing the use of cloud resources to prevent mistakes and problems related to Fanatics' security and budget
- Data collected focuses on what's trending up or down to help select instance types and calculate costs; remain flexible and be aware of what you pay
- The natural instinct is to blame people; mistakes are made, especially when a human factor is introduced to an automated system
- Creating a mindset that balances feature work with detail-oriented cleanup is challenging
- A possible cottage industry around analyzing code bases running in big data and other expensive realms
- As a product continues to evolve and grow, governance comes along for the ride and AWS bills are streamlined
- Will serverless, Lambda, and RDS change how Amazon charges in the future?
- The state of scale of AWS and developing a more palatable method for releases, because people can't keep up with them and stop paying attention
- Two-pizza team: Amazon's management philosophy that any team that works on a service should be able to be fed with two pizzas
- Such small teams work quickly and have the freedom to fail, but Amazon is reliable about the longevity of its different services

Links:
- Johnny Sheeley's Email
- Johnny Sheeley on Twitter
- Rands Leadership Slack
- Hangops.slack.com
- Fanatics
- Kubernetes
- Azure
- Lambda
- RDS
- Getafix: How Facebook Tools Learn to Fix Bugs Automatically
- Accidentally Quadratic Blog
- re:Invent
- Jeff Barr's AWS News Blog
- Amazon SimpleDB
- Lots of Amazon's projects have failed...and that's ok, says Amazon's Andy Jassy
- DigitalOcean
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This week's episode of Screaming in the Cloud is generously sponsored
by DigitalOcean. I would argue that every cloud platform out there biases for different things.
Some bias for having every feature you could possibly want offered as a managed service at
varying degrees of maturity. Others bias for, hey, we heard there's some money to be made in the cloud space. Can you give us some of it?
DigitalOcean biases for neither. To me, they optimize for simplicity. I polled some friends of mine who are avid DigitalOcean supporters about why they're using it for various things,
and they all said more or less the same thing. Other offerings have a bunch of shenanigans
around root access and IP addresses.
DigitalOcean makes it all simple.
In 60 seconds, you have root access to a Linux box with an IP.
That's a direct quote, albeit with profanity about other providers taken out.
DigitalOcean also offers fixed price offerings. You always know what you're going to wind up paying this month,
so you don't wind up having a minor heart issue when the bill comes in.
Their services are also understandable without spending three months going to cloud school.
You don't have to worry about going very deep to understand what you're doing.
It's click button or make an API call and you receive a cloud resource.
They also include very understandable monitoring and alerting.
And lastly, they're not
exactly what I would call small time. Over 150,000 businesses are using them today. So go ahead and
give them a try. Visit do.co slash screaming, and they'll give you a free $100 credit to try it out.
That's do.co slash screaming. Thanks again to DigitalOcean for their support of Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
I'm joined today by Johnny Sheeley, who, in addition to being a fantastic dresser,
is the Director of Cloud Engineering at Fanatics.
Welcome to the show, Johnny.
Hi, thanks for having me.
That's a really wonderful wardrobe
compliment. I don't know if it's founded, though. That's the beautiful thing. You have this very
cultured voice, so whenever people listen to you, they assume you're well-dressed.
Or that I'm dressed at all, which is phenomenal. Yeah, don't make the audio folks bleep out
too much of this. Good lord. I was trying to make Kendall happy. As it works. So explain to us from the
beginning, what does Fanatics do? So Fanatics is a sports e-commerce business. And we do everything
from manufacturing sports apparel to selling it on our own sites, to running major league sites, to running sites for international
teams like Manchester City. So if you were wearing a Titans jersey or some sort of soccer jersey or
anything like that, you probably wound up interacting with us at some part of that.
That's very reasonable. It effectively is sportswear sold through e-commerce. Got it.
Yep.
And you run cloud engineering there. What does that look like, I guess from an outsider's view? What is cloud engineering at Fanatics, in whatever level of depth you're comfortable sharing publicly?
Yeah, that's actually a really difficult question. Internally, we've been working on defining that. I'm, I believe, the fourth or fifth iteration of management in this area, and I've got my own specific bent. What it means to me is that we provide a robust and reliable set of services that allow our engineers an easy experience of building and deploying applications on top of the AWS platform.
Historically, there have been efforts to provide operational support and do a bunch of architecture.
And over time, we found that that's just really difficult to scale. And the challenges that each
individual team winds up having are really theirs to own and theirs to
solve. And so in some ways, we've become more of a conduit between those teams and TAMs on the AWS
side. And internally, we're focusing a lot more on productivity tools and providing a solid platform,
both from the sort of service discovery, secrets management, and your favorite, Kubernetes.
Absolutely.
So if you can, I guess, address a somewhat common theme that has sort of come up,
not just in this show, but in loud, heated arguments I have with people at conferences, usually over drinks.
There's this idea of if you're in a market that potentially competes with Amazon,
that you don't want to wind up using their cloud perspective. Or if you do, you want to at least be
able to go multi-cloud and at a moment's notice, be able to pivot to a different provider.
I mean, you mentioned in your description of what Fanatics does that you are an e-commerce company.
An awful lot of folks in that position try and actively avoid Amazon.
Was that ever something that was on your radar?
You know, I think that at the end of the day,
everybody has to have some sort of perspective
on what will happen when Amazon comes for me
because they're coming for you.
And it doesn't seem to matter what business you're in
or what city you live in.
They've got some sort of idea of how they're going to take that and do something with it.
The overall thing that I think is important to us is to really focus on our ability to make our
business function smoothly. And if we have in the back of our minds some thoughts on what if Amazon were to make moves in a direction that would be harmful for us, then we will have a way to get out of that.
That's the sort of thinking that I believe we've really focused on.
So really, in other words, we're not going multi-cloud right off the bat. There are specific use cases where we see a
stellar set of tools where there could be something where we run a Microsoft program
on-premise and they have disaster recovery for it that's plug and play in Azure. And cool,
that's an easy thing to adopt. Or, you know, Google's got Spanner and Dataflow,
and those are really interesting technologies to take a look at. But they're not necessarily
the sort of, I don't know if I want to call it paranoia-based movement, but real specific use
cases where we gain a significant benefit from moving in that direction rather than providing abstractions everywhere
so that you don't care about what cloud provider you're on.
I generally tend to agree with the perspective.
The other piece of it, of course, winds up being somewhere that is,
I guess, trying to figure out,
well, what if this thing happens in three to five years?
What if we need to be able to embrace that
in a reasonably quick response time window?
And I'm not convinced that's necessarily as viable of a concern as people like to pretend it is.
I'm a fan of building things that could at least theoretically be transitioned out.
For example, if you're requiring Google Cloud Spanner as a core tenet of your architecture for your software application,
maybe that's not the best move.
There's no equivalent anywhere else,
and you're redesigning everything from scratch.
If you're running a traditional CRUD app,
then as long as you're effectively building something
that doesn't require a tremendous number of tweaks architecturally
to move somewhere else,
then it's still going to be painful,
but it's not going to be an all-work-stops-for-18-months-while-we-do-a-migration story.
Absolutely.
And I think even the level of abstraction that you can find yourself
getting into with a single provider can begin to open up those thousand cuts.
There are a number of different service discovery tools. There are things like Kubernetes. There
are all these different ways that you could be implementing your own platform. Because, I don't think that, well, you're the expert, so I'll defer to you: we're not necessarily happy with just using DNS for service discovery. So we'll use Consul, or we'll use something that's based on ZooKeeper,
or these other areas where you do wind up investing in a technology
that is cloud agnostic, but you're then paying rent on that.
You're continuing to have to update it,
keep it running appropriately as you scale out.
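To make the "paying rent" point concrete, here is a minimal sketch of what registering and discovering a service against Consul's HTTP API might look like; the service name, address, and port are invented for illustration, and the "rent" is everything around a snippet like this: running, upgrading, and securing the Consul cluster itself.

```python
# Minimal sketch: register a service with a local Consul agent and look it up.
# Assumes a Consul agent listening on localhost:8500; names/ports are illustrative.
import requests

CONSUL = "http://127.0.0.1:8500"

def register(name: str, address: str, port: int) -> None:
    """Register a service instance with the local Consul agent."""
    payload = {"Name": name, "Address": address, "Port": port}
    resp = requests.put(f"{CONSUL}/v1/agent/service/register", json=payload)
    resp.raise_for_status()

def discover(name: str) -> list[tuple[str, int]]:
    """Return (address, port) pairs for healthy instances of a service."""
    resp = requests.get(f"{CONSUL}/v1/health/service/{name}", params={"passing": "true"})
    resp.raise_for_status()
    return [(e["Service"]["Address"], e["Service"]["Port"]) for e in resp.json()]

if __name__ == "__main__":
    register("checkout", "10.0.1.23", 8080)
    print(discover("checkout"))
```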
What's the impact there? So I think
that there's a little tax that we all pay, but I agree with your assessment that really trying to
implement now what you'll need in five years is a really difficult story. And you'll probably wind
up building something that doesn't have anything to do with what you really need in that
next time frame. I would say that you're probably right. The challenges generally don't tend to come
from vendor lock-in so much as they do, to some extent, I guess, a governance model that doesn't
map appropriately to what the company is trying to achieve. I mean, you're sort of a case study
in that, I would imagine, in that you describe a centralized cloud engineering
group that can be loaned out to other product and feature teams. How do you effectively govern
the use of cloud resources to, for example, keep people from blowing the budget, to keep people
from making hilariously awful security mistakes, from effectively just going off in a bunch of
different governance directions and
causing problems for the organization, either financial or risk-based?
Gosh, we have some really interesting challenges there. And there are different models that I've
seen out there where on one end of the spectrum, it seems like there's something along the lines
of Netflix where you can go and just build whatever you need to. And you can also expect that another team may come along and kill your stuff.
And it needs to be resilient or you need some sort of remediation to be there to expect your services to live.
And then there are things like former employers where I was familiar with a very specific sort of blessed method of handing from
team to team your jar, your deliverable that moves into a new environment, gets load and
performance tested. And there's a lot of manual stuff. And some of the challenges that we have
here at Fanatics are that we don't have a homogenous group of people that all have the
same desires as far as management of their infrastructure and applications. And there
are people who want to be able to hand things off and have the security model and deployment and
operation of it all handled for them. And then there are people who want to get deep down into
determining what sort
of instance type makes the most sense for them. And, you know, what, what level of network ops
and, um, any sort of disk IO, things like that, where you wind up having a really nebulous problem
if, if you're governing that. So because of the different levels of maturity amongst teams,
and their different focuses, we've got a pretty wide variety in how we actually engage with
different teams. And the primary focus for us right now is security, and the secondary is budget. So as far as security, we have a really awesome team
that is able to go out
and actually very proactively find issues
with whether it's an OS bug
or some sort of software package
that we're leveraging
and be able to work with each individual team
so that depending on the level of exposure
of their application,
they can identify like, hey, we need to remediate this immediately. Or maybe this is an internal
tool that is actually locked down in a number of other ways. So it's okay that, you know,
they've got some sort of SQL injection issue, but keep that in the back of your pocket. And
at some point, you probably want to fix that. On the other end of the spectrum, we've got this budget thing, and we've got a number of teams that
are asked by our business to deliver tremendous amounts of data processing in a very narrow time
window. We want, especially as we're approaching Black Friday, Cyber Monday, and some of the
different hot markets that we serve, an ability for our users to, our internal users to be able to see, hey, I need to go and order
10,000 more of these jerseys or another 30,000 hats because the team that's looking like they're
going to win will wind up really causing us to sell out of what we've got. Or we need to be able to near real-time process
a lot of events as the World Series ends
or some other major event is ending
so we can actually have that real-time view of,
hey, this is what sales are doing.
Maybe something's going on with this part of the system.
And it becomes a really interesting challenge
because all of that data is funneled through my team
and winds up being essentially shared out to other teams.
And we give some sort of a bit of feedback on,
hey, you're trending up, you're trending down.
This is great.
It looks like you may be adopting different things
or maybe you should be looking at different
instance types. And we've actually got a principal engineer here who focuses a lot on whether people
are using the right instance types, if we've got the right reservations. But the model that
we're aiming to get to is really being able to calculate based on a declarative model,
what sort of costs you're going to be incurring and where your service is actually exposed so that we can do static analysis of what our entire cloud architecture looks like.
And be able to predict, hey, this commit that you just checked in to provide more Cassandra servers, that's actually going to cost like $100,000 more a month. Maybe
we should reel that in and take a look at what's going on with your team. Alternatively, you know
what your team needs to provide. So maybe that budget is actually something that is sensible.
And that's a real area where I'm very interested in seeing continued evolution within the industry as far as how that
information is shared and then governed and the way that people allocate resources, especially
across teams as we move more towards a shared model. Which makes an awful lot of sense. The
counterpoint, of course, to that always becomes one of where is the right organizational
balance per company, I suppose. You wind up very quickly walking into a world where you
see certain companies try to wind up mapping forward a governance model from the on-prem
days of where everything was done as CapEx and planning ahead was something that you
had to do. So they think nothing of, well, it used to be six weeks to provision a server.
Now we're going to make instance provisioning take a week. And it feels like
it's the right move. But in practice, when people go through that, they never, ever
turn things off because they very quickly turn into a scenario where, well,
it takes a week to get this spun back up, so I'm just going to leave it there.
And you wind up effectively with a policy that works against itself.
Absolutely. Yeah. And I mean, I think that it's even fair to say that as humans,
we don't necessarily do a good job of prioritizing cleaning things up. I keep a mess at my desk on a
regular basis, and it takes some level of a jarring sensation that there's dirtiness around
for me to actually want to change that. And, you know, particularly when something is digital and
not in your tangible world, it's really easy to spin up a gigantic instance that is very expensive
or a cluster and walk away from that and not really
be aware. And that's something that totally has happened. And to your point, I don't think that
we're an organization right now that optimizes for locking down every single thing. We have a
lot of flexibility for our engineers and we enable them to go and use their own authority to say,
hey, this may be a gigantic expenditure if it were to stay on for a year,
but it'll get something done today that I wouldn't be able to accomplish in a number of weeks if I weren't to use this or I want to experiment.
And that's definitely a spot where I don't want to be preventing anyone
from being able to actually accomplish what they're setting out to do.
It's a rather concerning thing.
As you're talking about looking back towards the on-premise days where you kind of had to depend on a specific team or person to push your application live.
And that just doesn't sound like fun for the person that's that bottleneck, right?
I don't want to be there. I don't want to be saying, well, this is going to cost too much,
so don't do it. So that's a really interesting area for us to need to remain flexible, but also
have some semblance of guardrails so people aren't necessarily shooting themselves in the foot if
they really step into it accidentally.
And let's also not escape the fact that a lot of times this is not due to any sort of
bad actor sort of scenario. This instead turns into a scenario pretty rapidly where you're seeing
people making honest mistakes. I mean, my entire life is built around my consultancy
of optimizing every AWS bill that comes in front of me, which means that, yes,
I spend time optimizing my currently roughly $30 bill. And that's a complete waste of my time.
But I take a look recently when the last bill came in, and I had a $20 spike because I'd forgotten
that VPC endpoints in a test account had been left running, and those incur a per-hour charge to the tune of $20,
which is nothing as far as my business goes. But as a percentage of my bill, it was something like over 50% of what my existing bill was, but then added on top of it. That's terrible. That winds
up just being the sort of thing that happens. And while it's frustrating, at scale, something like that leads to people
getting yelled at. It leads to gatekeepers being put in. It leads to people being unable to spin
up resources without going through vast swaths of approval. And that model doesn't seem to work
either. Oh, absolutely. I think I shared with you my new backup solution that I implemented very poorly.
And I think it something like quintupled my AWS bill just because it was querying S3.
It wasn't actually even writing any additional data to S3.
It's very easy to make a mistake with cloud APIs and interacting with them.
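Both anecdotes reduce to the same back-of-the-envelope arithmetic: a tiny unit price multiplied by something that runs all month. A rough sketch, using approximate published rates (around $0.01 per interface endpoint per hour and around $0.005 per 1,000 S3 LIST requests; treat the exact numbers as illustrative):

```python
# Rough arithmetic for "small" charges left running all month. Rates are approximate.
HOURS_PER_MONTH = 730

# Forgotten VPC interface endpoints at ~$0.01 per endpoint-hour (per AZ)
endpoints, endpoint_hourly = 3, 0.01
print(f"VPC endpoints: ~${endpoints * endpoint_hourly * HOURS_PER_MONTH:.2f}/month")
# roughly $21.90/month -- about the size of the $20 surprise described above

# A backup job that lists an S3 prefix every second, all month long
list_calls = 30 * 24 * 3600           # ~2.6 million requests
price_per_1000_lists = 0.005          # approximate
print(f"S3 LIST calls: ~${list_calls / 1000 * price_per_1000_lists:.2f}/month")
# roughly $13/month per prefix -- easily a multiple of a small personal bill
```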
Oh, absolutely. And none of this stuff is intuitive, and none of this stuff is one of
those intrinsically obvious things. It all comes back to the fact that this is complex,
this is hard to do, and no one really has a great answer as far as how to get to sanity.
Absolutely.
I wish I did. Believe me, I'd sell it to people.
But unfortunately, I kind of don't have that luxury.
Well, yeah.
And the best part is it's often not even just a human.
We've got a system that is built in-house that is similar to Fugue or sort of a constantly running Terraform where it sees a model of what the infrastructure should look like.
It queries AWS APIs to find the delta, and then it remediates.
And there have been times where it's killed things that are critical by accident, thankfully in dev environments.
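The shape of the system being described is a desired-state reconciliation loop: read a model, query the provider for the actual state, and remediate the difference. A heavily simplified sketch follows, with a made-up spec format and boto3 for the AWS calls; it only prints what it would do, since the remediation step is exactly where the failures described here come from.

```python
# Simplified desired-state reconciliation loop (illustrative only).
# `desired` maps a Name tag -> the instance type we expect to exist.
import time
import boto3

ec2 = boto3.client("ec2")

def actual_state() -> dict[str, str]:
    """Return {Name tag: instance type} for running instances."""
    state = {}
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
                if "Name" in tags:
                    state[tags["Name"]] = inst["InstanceType"]
    return state

def reconcile(desired: dict[str, str]) -> None:
    actual = actual_state()
    missing = desired.keys() - actual.keys()
    extra = actual.keys() - desired.keys()
    # Real remediation would launch `missing` and terminate `extra`; this is
    # where you want dry runs, approvals, and rate limits.
    print("would create:", missing, "| would terminate:", extra)

while True:
    reconcile({"checkout-api": "m5.xlarge", "orders-worker": "c5.large"})
    time.sleep(300)
```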
And there have been times where it's accidentally spun up things that,
you know,
a human would do and a machine can do it much more faster.
That's good.
I'm not here to correct your English,
just cloud bills.
Yeah.
Well,
if,
if my English were worse, then my cloud bill would probably be better.
Yeah.
Like we,
when you've got an automated system
that is going out and interacting with cloud providers
or anything that can be spinning up resources
that are expensive,
then adding a human factor to that,
whether it's the human implementer of that system
or the human variables saying,
oh, we need to scale this cluster up, you can very
quickly cost yourself a lot of money accidentally. Oh, yeah, absolutely. I see that constantly. And
it's one of those areas where the natural instinct is to blame people for what's gone on, either the
people who didn't budget appropriately or people who spun resources up or try to prevent this
terrible thing from ever happening again. I mean, and people have taken different technological approaches that
sort of result in mixed bags. The idea of mandating tags, of shooting down infrastructure after it's
been alive for a certain period of time, of having a provisioning system that nags you every week,
that you're running X dollars in your development account. But by and large, it mostly has to do
with a mindset shift.
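Each of those approaches boils down to sweeping what is actually running against a policy. As a hedged sketch of the tag-and-TTL flavor, with made-up tag names ("owner", "expires"), something like this run on a schedule produces the weekly nag report:

```python
# Sketch of a scheduled governance sweep: flag running instances that have no
# owner tag or that are past their expiry. Tag names are illustrative.
from datetime import datetime, timezone
import boto3

ec2 = boto3.client("ec2")

def sweep() -> list[str]:
    findings = []
    now = datetime.now(timezone.utc)
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
                if "owner" not in tags:
                    findings.append(f"{inst['InstanceId']}: missing owner tag")
                elif "expires" in tags:
                    expiry = datetime.fromisoformat(tags["expires"])
                    if expiry.tzinfo is None:
                        expiry = expiry.replace(tzinfo=timezone.utc)
                    if expiry < now:
                        findings.append(f"{inst['InstanceId']}: expired {tags['expires']}")
    return findings

for finding in sweep():
    print(finding)  # in practice this feeds the weekly email or chat nag
```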
And I'm not convinced for most companies until they hit a reasonable point of scale,
that training the engineers who can provision resources on the nuances of cloud costing is
necessarily the right answer. Yeah, I think you're hitting the nail on the head there.
And I would actually be curious when you actually reach that point. It seems like there's almost
always a dividing line between the folks who are focusing on feature work and those who are
really coming back to do some of the more detail-oriented, hey, what can we be doing
more efficiently? There are a few people throughout my career that I've met where it does feel like
they're able to spread across both of those realms, but it's a really challenging mindset
that I think you won't find in a lot of people where, oh, I want to go out and create this great
art, but then I want to leave the studio spotless when I'm done. I don't know. Is that something
that you've encountered out there? Or do you
typically find that, hey, management has reached its budget threshold and they really are concerned
about what's going on? What I tend to see is that there's very few hard and fast rules that map to
everything. You're going to see some companies where coming in very early and structuring out
a costing program makes sense. You see other companies that are riding a rocket ship,
and while they're spending tens of millions of dollars a year on cloud spend,
that's a tiny molehill next to the mountain of revenue that they're seeing,
or VC money that's pouring in, or potential upside.
It's one of those stories where when you're all hands on deck in a hyper growth company, optimizing to save a few
bucks here and there is absolutely not material to your business. There does come a time where
that changes. Conversely, I'm a bootstrapped consultancy of one where when my cloud bill
starts spiraling away from me, if I wake up to a $20,000 bill tomorrow, I should probably fix that
before I do almost anything else. Because
it doesn't take too many of those before my business starts winding up in trouble.
It comes down to a number of different levels of maturity. That's why I've never been a fan of
the models for cloud governance that tend to equate everyone to being similar.
That's always going to be disparate based upon who you are and what your constraints look
like.
Yeah, totally.
And, you know, I was just reading a really interesting article on Facebook's new or newly
public bug remediation and automation of suggested changes in their code bases that sort of makes me think that that might
be an interesting area for us to head. And similar to, are you familiar with the blog
Accidentally Quadratic? I am not. That sounds like math. I was told there would be no math.
So it's all these really, really wonderful code snippets where people have found that it's just an inefficient algorithm being used.
And they share a little bit of the context around what the code base is, what the intention behind implementing it this way probably was, and how they went and made it better.
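For anyone who hasn't read it, the pattern that blog catalogues usually looks something like this: a membership check against a list inside a loop quietly turns a linear job into a quadratic one, and the fix is a one-line data structure change.

```python
# Accidentally quadratic: `in` on a list is a linear scan, so this loop is O(n^2).
def dedupe_slow(items):
    seen = []
    out = []
    for item in items:
        if item not in seen:   # O(n) scan per element
            seen.append(item)
            out.append(item)
    return out

# The usual fix: a set makes the membership check O(1), the whole thing O(n).
def dedupe_fast(items):
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out
```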
You see companies like HashiCorp coming with some new features to help predict costs. You see a lot of
AWS Trusted Advisor and other things like CloudHealth moving in different directions of
helping to at least say reactively, hey, you spent too much, you need to solve this.
I won't be terribly surprised if this is a new sort of cottage industry of
ML or something where you're actually looking at the code bases that are running,
particularly in the big data realm or the other truly expensive as far as compute and data
transfer areas go, where you're not just saying, oh, let's reserve instances,
but let's actually take a look into your code and double check. Are you using the
current version of the framework? Are you using the minimal amount of data that you could be?
And sort of removing that from the responsibility of those creative types who are more responsible for going out and building
new features for the business. I don't intend to say unfortunate things about a lot of the
vendors in this space, but every time I've seen something like this today that purports to use
machine learning to determine whether your resource usage is sensible, whether things
should be turned up or turned down or not, they either tend to focus on a very small portion of the overall picture, or they tend to have
unfortunately naive assumptions baked in. A quick and easy example. There's no way programmatically
to distinguish between an instance that is oversized and sitting idle and should be
downscaled or turned off, and an active DR site that's going to have about three seconds of warning before it gets slammed
with traffic. In one of those, you want to turn those things off. In the other scenario,
you absolutely don't. And that's a business process problem. That's not something that
I've ever seen any realistic chance of solving via writing code. The same story with,
to be frank, a lot of these businesses pricing models, where it's a percentage of your bill in
order to sit there and do analysis. Well, okay, that's fine, I guess, but no one likes the model.
I mean, when I've tried that in my very early days of a consultancy, I got laughed out of the room.
Now I charge fixed fee with guarantees that I do it and I wind up not having to fight that particular battle the same
way. Yeah. And I'll totally admit that I am a total nerd and optimist. And I believe that there
are a bajillion areas that in the next 20 to 40 to whatever years, we'll see some really astonishing changes.
I totally agree that right now, the industry has paid little attention. And it's, as you're saying,
not a high value proposition to come into the next unicorn and say, hey, as you're making that
billion dollars, I can save you
20K every month. That's not really worth their time. No, and it's really not. It becomes a better
narrative around the idea of helping establish good practices, good governance, demonstrating
they're being responsible stewards of the money entrusted to them. But it's not the big win in
this space. For a second there, I thought you were going to say that,
oh, the code is going to get better.
In the future, the cloud bills will self-optimize,
at which point I'm obligated to ask you,
will we pay for them in Bitcoin?
But I'm sorry, I'm not one of those people
with stars in their eyes and everything is terrible
up until this point, but the future is better.
And we see evolutions of these things.
I think to some extent,
the providers are going to have to come up
with some form of simplification pass over their bill.
They'd have to.
The level of increase in complexity over time
is not something that's going to be sustained.
The other side of that, though,
is how do we get better than we are today?
If we don't have a perfect solution,
okay, we don't need it to be.
But how do we get better than we have now?
Yeah, and isn't that what you do?
From my perspective, but there's only one of me.
There should also, to be very blunt with you,
I shouldn't have a business.
This shouldn't be as complex of a problem as it has become.
You shouldn't need to bring in a consultant
to solve these things.
And until companies are spending
at least a certain baseline threshold on their cloud bill, I can't help them because there's no ROI for
retaining me. Yes, I'll come in and look at your bill and you'll hit break-even on my services in
only a couple decades. That's not a compelling sales pitch. So it's not something that's ever
going to work. And you shouldn't have to be spending a king's ransom in order to make those numbers make sense. It should be something that as the product continues to evolve and grow that you're building, that governance just sort of comes along for the ride, that your bill streamlines itself. And I think that we're doing, do you have other industries that you see similar consulting
where there are either retailers or some people dealing in physical goods where it's a similar
problem where they need to optimize? I mean, I could imagine that there are industries where
paying the right amount for raw goods, that's critical.
But do you have any analogs that you've really used to help guide yourself as you've embarked
down this road? Not exactly in the way that you mean it, but there's nothing new about my business
model. We saw this in the 70s and 80s where companies would come in to large enterprises
and say, hi, I'm a consultant. I'm going to just sit
in a room quietly and tear apart your telephone bill. Because back then, telephone bills were
complex, they were massive. And they would say, we'll find errors that the phone company made
when they calculated these things out. And when we save you money, we'll take a percentage of it.
And that was a brilliant business model that I don't think we can quite get back to.
But the beauty of that was first, it's money that the company is never going to recoup.
Secondly, it requires zero investment on the company side other than,
here's the bills, now go away and tell us what you find.
It doesn't require a team of engineers to sit there with someone and explain architecture.
It doesn't require a team of people to sit there and go back and forth with vendors and negotiation team. It became very simple and
very streamlined. I don't think that there's quite a direct equivalent to that, but I did
take inspiration from that philosophically. So do you think that there are similar evolutions
that are coming in cloud computing? Because I mean, you look at our phone
bills today, and I pay a flat rate every month. And when I go to Europe, it doubles. And that's
fine, because I know it's also just going to be another flat rate. Do you think that we could
get somewhere like that with especially all of the serverless, you know, not just talking about
Lambda, but moving into the RDS realm, it seems
like at some point Amazon could be charging me per cycle or per request or conversion or
something that's a little bit different than just this dollars and cents to resource reservation
time. I hesitate to try and predict the future. It always seems like that's either one of those
things that winds up leading very quickly to, yeah, you were right, no one cares, or you were
wrong, now we're going to laugh at you for eternity. There's no real upside to that.
I will say that the current pace that AWS seems to be on in several fronts is unsustainable.
For example, right now the market is always talking about percentage growth. Well, if you make boats and you sell them for a million
bucks a piece, and last year you sold one boat and you were independent, now you've hired an
assistant and you sell two boats this year, you've demonstrated 100% year-over-year growth.
Back when you had a $20 million cloud business, saying we made $40 million
this year on it. That's easy. The growth numbers are fantastic. They have eclipsed, I think,
$25 billion a year now as a run rate, according to their last published numbers.
That is a much larger number to have to double and try to onboard rapidly. People generally don't tend to spend
that much that quickly in a new platform except by accident. And accidentally charging people a
few billion dollars is not great customer service. Counterpoint, it only has to work once.
Yeah. Where do I apply for that?
Absolutely. So you also see this now on the other fronts where at reInvent, for example,
they get on stage and they trot out their slides showing year over year number of feature releases
and enhancements. Okay, that is good to know that you're not resting on your laurels and you're
innovating rapidly, but that line can't continue up forever. We're already at a point where there
are services out there that solve problems that I've had and I didn't know they existed.
And I spent a fair bit of time tracking this down.
Instead, I have to go down this entire merry-go-round when I get confused or caught out by something new and exciting that
launched. But eventually, you're going to see a world where the official Amazon blog that Jeff
Barr writes just doesn't have enough space to wind up publishing these things. He collapses
due to exhaustion from writing 85 posts a week.
And at some point, for people working on these things, we all have jobs to do that don't include analyzing new service releases or feature enhancements. So we stop paying attention,
even to the things we really should be paying attention to. Things can't go up and to the right
forever. And what that leveling off or normalization starts to look like, I have no clue.
There are smarter people than I am at Amazon who work on these things as a full-time job.
I'm just sitting here in the cheap seats throwing peanuts at people and sometimes rattling the cage and screaming.
Well, you hide that part very well.
Oh, yes.
The things we say in public and things we scream in the middle of the night while working on articles.
I like your approach.
The time makes a lot of sense.
Oh, yes.
Nothing good ever happens after 3 a.m.
Whenever I'm writing blog posts, then nothing good. So when you're describing these sort of granular services and the solutions to problems that are not well publicized,
do you think that that's just the state of scale of AWS specifically?
Or do you think that it's their approach and folks like Google or Azure or AliCloud
or whoever out there might be taking different approaches that
would actually be able to condense those solutions into something that's more palatable,
more meaningful, and easier to adopt? I don't know. That's a great question.
But even now, you wind up not just with competition from third parties, but,
for example, let's say that I have a string that I want to send from me to you.
And I want to do that programmatically via APIs.
Within AWS, there are no fewer than 15 different services I can use to store that string and have it go to you.
And that number is not getting smaller.
And incidentally, I'm not talking about terribly abusing services, either.
Well, technically I could spin up Amazon Chime
and message you.
No, that's not what I'm talking about.
Or, well, theoretically I can spin up an EC2 instance
and store that string in a tag.
No, none of that.
We're talking using services as generally intended.
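As one concrete instance of "send a string from me to you" with a service used as intended, here is the queueing-service version via SQS and boto3; the queue name is made up, and any of the other dozen-plus options would have a similarly small happy path.

```python
# Sending and receiving a string with SQS, used as intended. Queue name is illustrative.
import boto3

sqs = boto3.resource("sqs")
queue = sqs.create_queue(QueueName="corey-to-johnny")  # idempotent for matching attributes

queue.send_message(MessageBody="hello from the cheap seats")

for msg in queue.receive_messages(WaitTimeSeconds=10, MaxNumberOfMessages=1):
    print(msg.body)
    msg.delete()  # acknowledge so the message isn't redelivered
```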
And the varying differentiators between these services are getting harder and
harder to discern. Back when there was one queuing-style service, it was easy. You use that
one and complain about it. Now that there are 15 of them, you pick one, are convinced you do the
wrong thing, complain about it, switch to something else, trip over a constraint you didn't know
existed, and the cycle repeats until you eventually give up and go raise goats on a mountainside
somewhere. I like goats.
That's because you never raised them.
This is true.
So you've got way more of a close relationship
with Amazon than I do, for example.
Much to their everlasting chagrin.
You don't know that.
That's just what they say to your face.
You should see what they say when they think I'm not listening.
So as you're talking about this evolution of growing nearly the same service over and over
again, have you experienced anything that you could share around why that happens? I
completely understand the concept of not invented here,
but is it something that they can find another two-pizza team
that is so dissatisfied with the service
that they really just have to reinvent it?
Sort of. It's a great question.
And this is sort of the Achilles heel, from my perspective,
of the entire Amazon model.
For those who aren't aware, the term two-pizza team is an Amazon management philosophy. They believe that any team that works
on a service should be able to be fed with two pizzas. My take on that is you're not allowed on
the team unless you can eat two entire pizzas yourself. History will say which was better.
But as they're building these things out in the small teams, they get ideas, they do
internal style of bake-offs, to my understanding. And that's why you wind up with services that
wind up competing with one another. They move very quickly, they have the freedom to fail,
which is incredibly valuable. And by the time something launches, it's generally already got
customers lined up to use it. They aren't building things and hoping that people use these things one day. They have customers who are asking for the specific things that they build.
The counterpoint and the pain that many of us experience is that anything that depends upon
a shared service for all of those is very difficult. Take a look at the console, for example.
You have to unify all of those services and present them in the same way. That's really hard. You take a look at other shared services
like the bill. Every different service team has a different billing model and the numbers of
dimensions and metrics that wind up influencing that bill. The billing system alone is an
incredible service that most people don't understand as far as the sheer volume
of data that it has to process and what it has to do to get those bills out to people on time.
But people's only interaction with that is at the end with the output where first, it's a bill. No
one's thrilled to get one of those. Secondly, it's super complex. No one likes that either because
here's what we're charging you and here's why and you look at that and you feel dumb is a crappy customer experience.
How do they fix that?
Couldn't tell you.
Yeah.
But if you just, but you're going to feel dumb because you feel dumb because you're
dumb.
I mean, there's, there's some basic expectation there.
I tend to not be a big fan of blaming people who are confused or annoyed over the bill itself.
I mean, it's in anything in this space. There is no simple problem in anything that touches the
cloud. If your answer to a problem is, oh, you should just stop speaking there because you're
already wrong. Yeah. And that was more a comment on me feeling perpetually dumb, which is just something
that I'm dealing with personally. One of the things that you mentioned in there that I think
is really interesting is you called out the freedom to fail. And I've also seen you talk
about the reliability that Amazon has as far as the longevity of the different services.
So what does that mean when you say that they've got the freedom to fail? Is that something that's
just internal? The project may not make it to production? Or have you seen instances where,
you know, just there aren't enough people using this thing, so we're actually going to
be sunsetting it and have some potential
significant impact on users. I've never seen them sunset a product. I've seen them
deprecate things a couple of times in strange ways. The first is reduced redundancy storage.
It no longer participates in price cuts. It's an S3 storage class, and it now costs more than the
good storage. It's still there if you want to use it.
But the one that I find more interesting is SimpleDB.
You don't see it in the console.
It's not advertised.
And relevant to this conversation, Andy Jassy, the CEO of AWS, publicly referred to it as a failed service, which is fascinating to me. The value of being able to say something like that publicly,
even though it still has active users on it,
there's still a service team maintaining it,
incidentally, that feels like the saddest job in the world.
But it's not something that they're ever going to turn off completely
because they made a commitment to customers
that you can build a business on this.
And until that last customer gets off
of using that product or service,
Amazon's going to continue to honor that,
as best I can tell.
Now, I'd be surprised at this point
if they don't have teams of people
actively working with some customers
to migrate them off
so they can finally turn it off.
But to date, that hasn't happened.
I'm not particularly worried about trusting Amazon
with my production infrastructure.
That's fair.
As opposed to other cloud companies
who turn things off for kicks.
I believe that could fall under some form of chaos engineering.
It's just the branding that's missing.
Absolutely.
We've decided to turn off the database
that you're building everything on top of.
Have a good day.
Yeah, no one's having a good day when that happens.
Business chaos.
Exactly.
If people want to talk to you more,
where should they find you on this wide internet of ours?
Probably the place that I'm interacting the most
is on the Rands Leadership Slack.
I'm on the Gopher Slack and the hangops Slack as well.
But there's Twitter
or they can just email me
at sheely at ag.org.
I'm out there.
Perfect. I will throw links to those things
in the show notes.
Thank you so much for taking the time to speak with me today.
It's appreciated.
Yeah, thanks, Corey. This was awesome.
It really has been.
Johnny Sheeley, Director of Cloud Engineering at Fanatics. I'm Corey Quinn, and this is Screaming in the
Cloud. This has been this week's episode of Screaming in the Cloud. You can also find more
Corey at screaminginthecloud.com or wherever fine snark is sold.