Screaming in the Cloud - The Complexities of AWS Cost Optimization with Rick Ochs
Episode Date: December 1, 2022

About Rick
Rick is the Product Leader of the AWS Optimization team. He previously led the cloud optimization product organization at Turbonomic, and before that was the Microsoft Azure Resource... Optimization program owner.

Links Referenced:
AWS: https://console.aws.amazon.com
LinkedIn: https://www.linkedin.com/in/rick-ochs-06469833/
Twitter: https://twitter.com/rickyo1138
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.

...vendor due to proprietary data collection, querying, and visualization. Modern-day containerized environments require a new kind of observability technology that accounts for the massive increase in scale and attendant cost of data. With Chronosphere, choose where and how your data is routed and stored, query it easily, and get better context and control. 100% open-source compatibility means that no matter what your setup is, they can help. Learn how Chronosphere provides complete and real-time insight to ECS, EKS, and your microservices, wherever they may be, at snark.cloud slash chronosphere. That's snark.cloud slash chronosphere.

This episode is brought to you in part by our friends at Veeam.
Do you care about backups?
Of course you don't. Nobody cares about backups. Stop lying to yourselves. You care about restores,
usually right after you didn't care enough about backups. If you're tired of the vulnerabilities,
costs, and slow recoveries when using snapshots to restore your data, assuming that you even have them at all, living in AWS
land, there's an alternative for you. Check out Veeam. That's V-E-E-A-M for secure, zero-fuss
AWS backup that won't leave you high and dry when it's time to restore. Stop taking chances
with your data. Talk to Veeam. My thanks to them for sponsoring this ridiculous
podcast. Welcome to Screaming in the Cloud. I'm Corey Quinn. For those of you who've been
listening to this show for a while, a theme has probably emerged, and that is that one of the
key values of this show is to give the guest a chance to tell their story. It doesn't beat the guest up about how they
approach things. It doesn't call them out for being completely wrong on things. Because honestly,
I'm pretty good at choosing guests and I don't bring people on that are, you know,
walking trash fires. And that is certainly not a concern for this episode. But this might devolve
into a screaming loud argument despite my best effort. Today,
I'm joined by Rick Ochs, Principal Product Manager at AWS. Rick, thank you for coming back on the
show. The last time we spoke, you were not here. You were at, I believe it was Turbonomic.
Yeah, that's right. Thanks for having me on the show, Corey. I'm really excited to talk to you
about optimization and my current role and what we're doing.
Well, let's start at the beginning. Principal Product Manager. It sounds like one of those
corporate titles that can mean a different thing in every company or every team that you're talking
to. What is your area of responsibility? Where do you start and where do you stop?
Awesome. So I am the product manager lead for the AWS optimization team.
So I lead the product team that includes several other product managers that focus in on Compute
Optimizer, Cost Explorer, right-sizing recommendations, as well as reservation and savings plan purchase
recommendations. In other words, you are the person who effectively oversees all of the
AWS cost optimization tooling and approaches to same? Yeah. Give or take. I mean, you could argue
that, oh, every team winds up focusing on helping customers save money. I could fight that argument
just as effectively, but you effectively start and stop with respect to helping customers save money or
understand where the money is going on their AWS bill. I think that's a fair statement. And I also
agree with your comment that I think a lot of service teams do think through those use cases
and provide capabilities. You know, there's like S3 Storage Lens, you know, there's all sorts of
other products that do offer optimization capabilities as well. But as far as the unified
purpose of my team, it is unilaterally focused on how do we help customers safely reduce their spend
and not hurt their business at the same time. Safely being the key word. For those who are
unaware of my day job, I am a partial owner of the Duckbill Group, a consultancy where we fix
exactly one problem,
the horrifying AWS bill. This is all that I've been doing for the last six years.
So I have some opinions on AWS bill reduction as well. So this is going to be a fun episode for the two of us to wind up more or less smacking each other around, but politely,
because we are both professionals. So let's start at the very high level. How does
AWS think about AWS bills from a customer perspective? You talk about optimizing it,
but what does that mean to you? Yeah. So, I mean, there's a lot of ways to think about it,
especially depending on who I'm talking to, where they sit in an organization.
I would say I think about optimization in four major themes. The first is how do you scale correctly, whether that's
right sizing or architecting things to scale in and out. The second thing I would say is
how do you do pricing and discounting, whether that's reservation management,
savings plan management, coverage, how do you handle the
expenditures of prepayments and things like that. Then I would say suspension. What that means is
turn the lights off when you leave the room. We have a lot of customers that do this, and I think
there's a lot of opportunity for more. Turning EC2 instances off when they're not needed, if
they're non-production workloads or other sort of stateful services that charge by the hour. I think there's a lot of opportunity there. And then the last of
the four methods is cleanup. And I think it's maybe one of the lowest hanging fruit, but essentially,
are you done using this thing? Delete it. And there's a whole opportunity of cleaning up,
you know, IP addresses, unattached EBS volumes, sort of these resources that hang around in AWS
accounts that sort of get lost and forgotten as well. So those are the four kind of major thematic strategies
for how to optimize a cloud environment that we think about and spend a lot of time working on.
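As a rough sketch of that cleanup pillar, the following boto3 snippet lists unattached EBS volumes and unassociated Elastic IPs in one region; the region choice and the report-only behavior are assumptions for illustration, not anything prescribed in the episode.

```python
import boto3

# Minimal sketch: list likely cleanup candidates in one region (report only).
# The region and the decision to print rather than delete are assumptions.
ec2 = boto3.client("ec2", region_name="us-east-1")

# EBS volumes with status "available" are not attached to any instance.
volumes = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]
for vol in volumes:
    print(f"Unattached EBS volume: {vol['VolumeId']} ({vol['Size']} GiB)")

# Elastic IPs without an AssociationId are allocated but unused (and billed).
for addr in ec2.describe_addresses()["Addresses"]:
    if "AssociationId" not in addr:
        print(f"Unassociated Elastic IP: {addr['PublicIp']}")
```

Whether anything on that list is actually safe to delete is exactly the trust problem the two get into later in the conversation.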
I feel like there's, at least the way that I approach these things, that there are a number
of different levels you can look at AWS billing constructs on.
The way that I tend to structure most of my engagements when I'm working with clients is
we come in and step one, cool. Why do you care about the AWS bill? It's a weird question to ask
because most of the engineering folks look at me like I've just grown a second head. Like,
so why do you care about your AWS bill? They're like, what, why do you?
You run a company doing this.
It's, no, no, no.
It's not that I'm being rhetorical
and I don't, or I'm trying to be clever somehow
and pretend that I don't understand
all of the nuances around this,
but why does your business care
about lowering the AWS bill?
Because very often the answer is they kind of don't.
What they care about from a business perspective is being able to accurately attribute costs for the service or good that they provide, being able to predict what that spend is going to be. And also, yes, a sense of being good stewards of the money that has been entrusted to them by investors, public markets, or the budget allocation process of their companies,
and make sure that they're not doing foolish things with it. And that makes an awful lot of
sense. It is rare at the corporate level that the stated number one concern is make the bill lower.
Because at that point, well, easy enough. Let's just turn off everything you're running in
production. You'll save a lot of money on your AWS bill. You won't be in business anymore,
but you'll be saving a lot of money on the AWS bill. The answer is always deceptively nuanced and complicated.
At least, that's how I see it. Let's also be clear that I talk with a relatively narrow subset of the
AWS customer totality. The things that I do are very much intentionally things that do not scale.
Definitionally, everything that you do has to scale. How do you wind up approaching this in ways that will work for customers spending billions versus independent learners who are
paying for this out of their own personal pocket? It's not easy. Let me just preface that. The team
we have is incredible. And we spend so much time thinking about scale and the different personas that engage with our products and what their experience is when they interact with a bill or AWS platform at large.
There's also a couple of different personas here, right? There's the finance side that owns the cloud bill, whether that's an organization that has created a FinOps team or a cloud center of excellence, versus an engineering team that maybe has started to go
towards decentralized IT and has some accountability for the spend that they attribute to their AWS
bill. And so these different personas interact with us in really different ways, whether that's Cost Explorer
or, you know, downloading the CUR and taking a look at
the bill. And one thing that I always kind of imagine is somebody putting a headlamp on and
going into the caves in the depths of their AWS bill and kind of, like, spelunking through their
bill sometimes. Right. And so you have these FinOps folks and billing people that are
deeply interested in making sure that the spend they do have meets their business goals.
Meaning this is providing high value to our company. It's providing high value to our customers.
We're spending on the right things. We're spending the right amount on the right things.
Versus the engineering organization that's like, hey, how do we configure these resources? What types of instances should we be focused on using? What services should we be building on top of that maybe are more flexible
for our business needs? And so there's really like two major personas that I spend a lot of time,
our organization spends a lot of time wrapping our heads around because they're really different.
We have very different approaches to how we think about cost because you're right. If you just
wanted to lower your AWS bill, it's really easy. Just size everything to a T2 nano and you're done. Move on, right?
T3 or T4g nano, depending upon whether regional availability is going to save you less. I'm still better at this. Let's not kid ourselves. I kid.
For sure. So T4g nano, absolutely. T4g, remember, now the way forward is everything has this
explicit letter designator to define which processor company made the CPU that underpins the instance itself.
Because that's a level of abstraction we certainly wouldn't want the cloud provider to take away from us, honestly.
Absolutely. And actually, the performance differences of those different processor models can be pretty incredible.
So there's huge decisions behind all of that as well.
Oh, yeah. There's so many factors
that factor into all these things. It's gotten to a point of, you see this usually with lawyers and
very senior engineers, but the answer to almost everything is it depends. There are always going
to be edge cases. Easy example of if you check a box and enable an S3 gateway endpoint inside of
a private subnet.
Suddenly, you're not passing traffic through a 4.5 cent per gigabyte managed NAT gateway.
It's being sent over that endpoint for no additional cost whatsoever.
Check the box, save a bunch of money.
But there are scenarios where you don't want to do it. Always double-checking and talking to customers about this is critically important.
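For the S3 gateway endpoint example above, "checking the box" can be sketched roughly as follows; the VPC ID, route table ID, and region are hypothetical placeholders, and, as noted, there are constraints where you would not want to do this.

```python
import boto3

# Minimal sketch: add an S3 gateway endpoint to a VPC so S3-bound traffic from
# private subnets no longer traverses the managed NAT gateway's per-GB processing.
# The VPC ID, route table ID, and region are hypothetical placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],  # the private subnets' route table(s)
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```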
Just because the first time you make a recommendation that does not work for their constraints, you lose trust.
Make a few of those, and it looks like you're more or less just making naive recommendations
that don't add any value, and they learn to ignore you. So down the road, when you make a really high
value, great recommendation for them, they stop paying attention. Absolutely. And we have that really high bar for recommendation accuracy, especially with
right sizing. That's such a key one. Although I guess savings plan purchase recommendations can
be critical as well. If a customer over commits on the amount of savings plan purchase they need
to make, right, that's a really big problem for them. So recommendation accuracy must be above
reproach. Essentially, if a customer takes
a recommendation and it breaks an application, they're probably never going to take another
right-sizing recommendation again. And so this bar of trust must be exceptionally high.
That's also why out of the box, the compute optimizer recommendations can be a little bit
mild. They're a little tame. Because the first order of business is do no harm, focus on the performance requirement of the application first, because we have to make sure that the reason you
build these workloads in AWS is served. Now, ideally, we do that without overspending and
without over provisioning the capacity of these workloads, right? And so, for example, like if we
make these right sizing recommendations from Compute Optimizer, we're taking a look at the utilization of CPU, memory, disk, network throughput, and IOPS, and we're vending these recommendations to customers.
And when you take that recommendation, you must still have great application performance for your business to be served.
It's such a crucial part of how we optimize and run long term, because optimization
is not a one-time band-aid; it's an ongoing behavior. So it's really critical for
that accuracy to be exceptionally high, so we can build business process on top of it as well.
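As a hedged illustration of pulling those Compute Optimizer right-sizing recommendations programmatically, a minimal sketch might look like the following, assuming Compute Optimizer is already opted in for the account.

```python
import boto3

# Minimal sketch: fetch EC2 right-sizing recommendations from Compute Optimizer.
# Assumes Compute Optimizer is already opted in for this account and region.
co = boto3.client("compute-optimizer", region_name="us-east-1")

resp = co.get_ec2_instance_recommendations(maxResults=25)
for rec in resp["instanceRecommendations"]:
    current = rec["currentInstanceType"]
    finding = rec["finding"]  # e.g. OVER_PROVISIONED, UNDER_PROVISIONED, OPTIMIZED
    # Options are ranked; rank 1 is the top recommendation.
    top = sorted(rec["recommendationOptions"], key=lambda o: o["rank"])[0]
    print(f"{rec['instanceArn']}: {finding}, {current} -> {top['instanceType']}")
```

Taking the top-ranked option blindly is precisely what the trust discussion warns against; the output is a starting point for a human review, not an instruction.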
Let me ask you this: how do you contextualize what the right approach to optimization is? What is your entire...
There are certain tools that you have,
you, I mean, of course, as an organization,
have repeatedly gone back to in different approaches
that don't seem to deviate all that much
from year to year and customer to customer.
How do you think about the general things
that apply universally?
So we know that EC2 is a very popular service for us. We know
that sizing EC2 is difficult. We think about that optimization pillar of scaling. It's an obvious
area for us to help customers. We run into this sort of industry-wide experience where whenever
somebody picks the size of a resource, they're going to pick one generally larger than they need. It's almost like asking a new employee to your company, hey, pick your laptop. We have a
16 gig model or a 32 gig model. Which one do you want? That person making the decision on capacity,
hardware capacity, they're always going to pick the 32 gig model laptop, right? And so we have
this sort of human nature in IT of, we don't want to get
called at two in the morning for performance issues. We don't want our apps to fall over.
We want them to run really well. So we're going to size things very conservatively,
and we're going to oversize things. So we can help customers by providing those recommendations to
say, you can size things in a different way using math and analytics based on the utilization
patterns.
And we can provide and pick different instance types.
There's hundreds and hundreds of instance types in all of these regions across the globe.
How do you know which is the right one for every single resource you have?
It's a very, very hard problem to solve.
And it's not something that is lucrative to solve one by one. If you have 100
EC2 instances, trying to pick the correct size for each and every one can take hours and hours of
IT engineering resources to look at utilization graphs, look at all of the instance types
available, look at what is the performance difference between processor models and
providers of those processors? Are there application compatibility constraints that I have to consider?
The complexity is astronomical. And then not only that, as soon as you make that sizing decision,
one week later, it's out of date and you need a different size. So you didn't really solve the
problem. So we have to programmatically use data science and math to say, based on these
utilization values, these are the sizes that would
make sense for your business that would have the lowest cost and the highest performance together
at the same time. And it's super important that we provide this capability from a technology
standpoint, because it would cost so much money to try to solve that problem that the savings you
would achieve might not be meaningful. Then at the same time, you know, that's really
from an engineering perspective. But when we talk to the FinOps, the finance folks,
the conversations are more about reservations and savings plans. How do we correctly apply
savings plans and reservations across a high percentage of our portfolio to reduce the costs
on those workloads, but not so much that dynamic capacity levels in our organization mean we
all of a sudden have a bunch of unused reservations or savings plans.
And so a lot of organizations that engage with us and we have conversations with, we
start with the reservation and savings plan conversation because it's much easier to click
a few buttons and buy a savings plan than to go institute an entire right-sizing campaign
across multiple engineering teams.
That can be very difficult, a much higher bar.
So some companies are ready to dive
into the engineering task of sizing.
Some are not there yet.
And they're maybe a little earlier
in their FinOps journey
or the building optimization technology stacks
or achieving higher value out of their cloud environment.
So starting with kind of the low-hanging fruit, it can vary depending on the company,
size of company, technical aptitude, skill sets, all sorts of things like that. And so those finance
focused teams are definitely spending more time looking at and studying what are the best practices
for purchasing savings plans, covering my environment, getting the most out of my dollar that way. Then they don't have to engage the engineering teams. They can kind of
take a nice chunk off the top of their bill and sort of have something to show for that amount
of effort. So there's a lot of different approaches to start in on optimization.
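The savings plan purchase recommendations described here are also available through the Cost Explorer API; a minimal sketch, with the term, payment option, and lookback window as purely illustrative choices, could look like this.

```python
import boto3

# Minimal sketch: pull a Compute Savings Plans purchase recommendation.
# The one-year / no-upfront / 30-day-lookback parameters are illustrative choices.
ce = boto3.client("ce")  # Cost Explorer API

resp = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",
)
summary = resp["SavingsPlansPurchaseRecommendation"][
    "SavingsPlansPurchaseRecommendationSummary"
]
print("Hourly commitment to purchase:", summary["HourlyCommitmentToPurchase"])
print("Estimated monthly savings:", summary["EstimatedMonthlySavingsAmount"])
```

Note the output is an hourly commitment, which connects directly to the staggered, incremental purchase strategy discussed later in the episode.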
My philosophy runs somewhat counter to this because everything you're saying does work globally. It's
safe. It's non-threatening. And it also really, on some level, feels like it is
an approach that can be driven forward by finance or business. Whereas my worldview is that cost and
architecture in cloud are one and the same. And there are architectural consequences of cost
decisions and vice versa that can be adjusted and addressed.
Like one of my favorite party tricks, although I admit it's a weird party,
is I can look at the exploded PDF view of a customer's AWS bill and describe their architecture
to them. And people have questioned that a few times. And now I have a testimonial on my client
website that mentions it was weird how he was able to do this. Yeah, it's real. I can do it. And it's not a skill I would recommend cultivating for most people.
But it does also mean that I think I'm onto something here
where there's always context that needs to be applied.
It feels like there's an entire ecosystem of product companies out there
trying to build what amounts to a better Cost Explorer
that also is not free the way that
Cost Explorer is. So the challenge I see there is they all tend to look more or less the same.
There is very little differentiation in that space. And in the fullness of time,
Cost Explorer does ideally get better. How do you think about it?
Absolutely. And if you're looking at ways to understand your bill, there's obviously Cost Explorer,
the CUR.
A very common approach is to take the CUR and put a BI front end on top of it.
That's a common experience.
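As a sketch of that CUR-plus-BI pattern, assuming the Cost and Usage Report is already delivered to S3 and registered as an Athena table (the database, table, and results bucket names below are hypothetical), a cost-by-service breakdown is one query away.

```python
import boto3

# Minimal sketch: a cost-by-service breakdown straight from the CUR via Athena.
# Database "cur", table "cost_and_usage", and the results bucket are hypothetical;
# the column names follow the standard CUR-in-Athena naming convention.
athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT line_item_product_code,
       ROUND(SUM(line_item_unblended_cost), 2) AS unblended_cost
FROM cost_and_usage
WHERE line_item_usage_start_date >= current_timestamp - interval '30' day
GROUP BY line_item_product_code
ORDER BY unblended_cost DESC
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "cur"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
```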
A lot of companies that have chops in that space will do that themselves instead of purchasing
a third party product that does do bill breakdown and dissemination. There's also
the cross-charge, showback, organizational breakdown and boundaries because you have
these super large organizations that have fiefdoms. You have HR IT and sales IT and product IT. You
have all these different IT departments that are fiefdoms within your AWS bill and construct,
whether they have different AWS accounts or
say different AWS organizations sometimes, right?
It can get extremely complicated.
And some organizations require the ability to break down their bill based on those organizational
boundaries.
Maybe tagging works, maybe it doesn't.
Maybe they do that by using a third-party product that lets them set custom scopes on
their resources based on organizational boundaries.
That's a common approach as well.
We do also have our first-party solutions that can do that, like the CUDOS dashboard
as well.
It's something that's really popular and highly used across our customer base.
It allows you to have a dashboard and customizable view of your AWS costs and kind of split it up based on tag,
organizational value, account name, things like that as well. So you mentioned you feel like the
architectural and cost problem is the same problem. I really don't disagree with that at all. I think
what it comes down to is some organizations are prepared to tackle the architectural element of cost and some are not.
And it really comes down to how does the customer view their bill?
Is it somebody in the finance organization looking at the bill?
Is it somebody in the engineering organization looking at the bill?
Ideally, it would be both. Ideally, you would have some of those skill sets that overlap, or you would have an organization that does focus in on FinOps or cloud operations
as it relates to cost. But then at the same time, there are organizations that are like,
hey, we need to go to cloud. Our CIO told us, go to cloud. We don't want to
pay the lease renewal on this building. There's a lot of reasons why customers move to cloud.
A lot of great reasons, right?
Three major reasons you move to cloud.
Several terrible ones.
Yeah, and some not so great ones too.
So there's so many different dynamics
that get exposed when customers engage with us
that they might or might not be ready to engage
on the architectural element
of how to build hyperscale systems.
So many of these customers
are bringing legacy workloads and applications to the cloud and something like a re-architecture to
use stateless resources or something like spot that's just not possible for them. So how can
they take 20% off the top of their bill? Savings plans or reservations are kind of that easy,
low-hanging fruit answer to just say,
we know these are fairly static environments that don't change a whole lot, that are going to exist
for some amount of time. They're legacy; you know, we can't turn them off, and it doesn't make sense to
rewrite these applications because they just don't change, they don't have high business value, or
something like that. And so the architecture part of that conversation doesn't always come into play.
Should it? Yes. The long-term maturity and approach for cloud optimization does absolutely
account for architecture, thinking strategically about how you do scaling, what services you're
using. Are you going down the Kubernetes path, which I know you're going to laugh about, but
how do you take these applications and componentize them?
What services are you using to do that?
How do you get that long-term scale and manageability out of those environments?
Like you said at the beginning, the complexity is staggering and there's no one unified answer.
That's why there's so many different entrance paths into how do I optimize my AWS bill?
There's no one answer.
And every customer I talk to has a different comfort level and appetite.
And some of them have tried suspension.
Some of them have gone heavy down savings plans.
Some of them want to dabble in rightsizing.
So every customer is different.
We want to provide those capabilities for all of those different customers that have
different appetites or comfort levels with each
of these approaches. This episode is sponsored in part by our friends at Redis, the company behind
the incredibly popular open source database. If you're tired of managing open source Redis on your
own, or if you're looking to go beyond just caching and unlocking your data's full potential,
these folks have you covered. Redis Enterprise is the go-to managed Redis service that allows you to reimagine how your geo-distributed applications process,
deliver, and store data. To learn more from the experts in Redis how to be real-time,
right now, from anywhere, visit snark.cloud slash redis. That's snark.cloud slash R-E-D-I-S. And I think that's very fair.
I think that it is not necessarily a bad thing
that you wind up presenting a lot of these options to customers.
But there are some rough edges.
An example of this is something I encountered myself somewhat recently
and put on Twitter because I have those kinds of problems.
Where originally I remember this,
that you were able to buy hourly savings plans,
which again,
savings plans are great.
No knock there.
I would wish that they applied to more services there rather than,
Oh,
SageMaker is going to do its own savings plan.
No,
stop keeping me from going from something where I have to manage myself on
EC2 to something you manage for me, and making that cost money.
You've nailed it with Fargate.
You've nailed it with Lambda.
Please just have one unified savings plan thing.
I digress.
But you had a limit once upon a time
of $1,000 per hour.
Now it's $5,000 per hour,
which I believe in a three-year all up front
means you will cheerfully add a $130 million purchase
to your shopping cart.
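A quick back-of-the-envelope check of that figure (the rounding is mine):

```python
# $5,000 per hour of commitment, over a three-year all-upfront term:
hourly_commitment = 5_000
hours_in_three_years = 24 * 365 * 3
total = hourly_commitment * hours_in_three_years
print(total)  # 131_400_000, i.e., roughly $130 million per cart item
```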
And I kept adding a bunch of them and then had a little over a billion dollars, a single button click away
from being charged to my account. Let me begin with what's up with that. Thank you for the tweet,
by the way, Corey. I always sort of ruin your month, Rick. You know that. Yeah, fantastic. We took that
tweet, you know, it was tongue in cheek, but also it was a serious opportunity for us to ask the question of what does happen. And it's something we did ask internally
and have some fun conversations about. I can tell you that if you click purchase, it would have been
declined. So you would have not been... American Express would have had a problem with that. But
the question is, would you have attempted to charge American Express or would something
internally have gone, this has a few too many commas for us to wind up presenting it to the card issuer with a straight face.
Right. So it wouldn't have gone through. And I can tell you that if your account was on a PO-based
configuration, it would have gone to the account team and it would have gone through our standard
process for having a conversation with our customer there. That being said, it's an awesome
opportunity for us to examine what is that shopping cart experience.
We did increase the limit. You're right.
And we increased the limit for a lot of reasons that we sat down and worked through.
But at the same time, there's always an opportunity for improvement of our product and experience.
We want to make sure that it's really easy and lightweight to use our products, especially purchasing savings plans.
Savings plans are already kind of fraught with mental concern and risk of purchasing something so expensive and large that has a big impact on your AWS bill. So we don't really want
to add any more friction necessarily to the process, but we do want to build an awareness
and make sure customers understand, hey, you're purchasing this. This has a pretty big impact.
And so we're also looking at other ways we can kind of improve the savings plans shopping cart experience to ensure customers don't put themselves in a
position where they have to unwind or make phone calls and say oops, right? We
want to avoid those sorts of situations for our customers. So we are looking at
quite a few additional improvements to that experience as well that I'm really
excited about, that I probably can't share here, but stay tuned. I am looking forward to it. I will say the counterpoint to that is having
worked with customers who do make large eight-figure purchases at once, there's a psychology
element that plays into it. Everyone is very scared to click the button on the buy it now thing or the approve it. So what I've often found is at
that scale, one, you can reduce what you're buying, buy half of it, and then see how that treats you,
and then continue to iterate forward rather than doing it all at once. Or reach out to your account
team and have them orchestrate the buy. In previous engagements, I had a customer do this
religiously, and at one point, the concierge team bought the wrong thing in the wrong region. And from my perspective,
I would much rather have AWS apologize for that and fix it on their end than for us having to go
over the customer side of, oh crap, oh crap, please be nice to us. Not that I doubt you would
do it, but that's not the nervous conversation I want to have in quite the same way. It just
seems odd to me that someone would want to make that scale of purchase without
ever talking to a human.
I mean, I get it.
I'm as antisocial as they come some days.
But for that kind of money, I kind of just want another human being to validate that
I'm not making a giant mistake.
We love that.
That's such a tremendous opportunity for us to engage and discuss with an organization
that's going to make a large commitment that here's the impact, here's how we can help,
how does it align to our strategy? We also do recommend from a strategic perspective,
those more incremental purchases. I think it creates a better experience long-term when
you don't have a single savings plan that's going to expire on a specific day that all of a sudden increases your entire bill by a significant percentage.
So making staggered monthly purchases makes a lot of sense.
And it also works better for incremental growth.
Right. If your organization is growing five percent month over month or year over year or something like that, you can purchase those incremental savings plans that sort of stack up on top of each other. And then you don't have that risk of a cliff one day where one super large SP expires and boom,
you have to scramble and repurchase within minutes because every minute that goes by is an additional
expense, right? That's not a great experience. And so that's really a large part of why those
staggered purchase experiences make a lot of sense. That being said, a lot of companies do
their math and their finance in different ways. And single large purchases make sense to go
through their process and their rigor as well. So we try to support both types of purchasing patterns.
I think that that is an underappreciated aspect of cloud cost savings and cloud
cost optimization, where it is much more about humans than it is about math. I see this most notably when I'm helping customers negotiate
their AWS contracts with AWS, where there are often perspectives such as, well, we feel like we really
got screwed over last time, so we want to stick it to them and make them give us a bigger percentage
discount on something. And it's like, look, you can do that,
but I would much rather, if it were me,
go for something that moves the needle
on your actual business
and empowers you to move faster, more effectively,
and lead to an outcome that is a positive for everyone
versus the, well, we're just going to be difficult
in this one point
because they were difficult on something last time.
But ego's a thing. Human psychology is never going to have an API for it. And again,
customers get to decide their own destiny in some cases. I completely agree. I've actually
experienced that. So this is the third company I've been working at on cloud optimization. I
spent several years at Microsoft running the optimization program. I went to Turbonomic for several years, building out the right sizing and savings plan reservation purchase capabilities there.
And now here at AWS, through all of these journeys and experiences working with companies to help optimize their cloud spend, moving the needle is significantly harder than the technology stack of sizing something
correctly or deleting something that's unused. We can solve the technology part. We can build
great products that identify opportunities to save money. There's still this psychological
component: IT for the last several decades has gone through this maturity curve of, if it's not broken,
don't touch it.
Five-nines, Six Sigma, all of these methods of IT sort of rationalizing do no harm, don't
touch anything, everything must be up.
And it even kind of goes back several decades back when if you rebooted a physical server,
the motherboard capacitors would pop, right?
So there's even this stigma against even rebooting servers sometimes.
And the cloud really does away with a lot of that stuff because we have live migration
and we have all of these sort of stateless designs and capabilities, but we still carry
along with us this mentality of don't touch it.
It might fall over and we have to really get past that. And
that goes back to the trust conversation, where we talk about how the
recommendations must be incredibly accurate. You're risking your job in some cases. If you
are a DevOps engineer and your commitments on your yearly goals are uptime, latency, response time,
load time, these sorts of things, these operational metrics,
KPIs that you use, you don't want to take a downsized recommendation.
It has a severe risk of harming your job and your bonus.
These instances are idle.
Turn them off.
It's like, yeah, these instances are the backup site or the DR environment or something that
takes very bursty but occasional traffic.
And yeah, I know it costs
us some money, but here's the revenue figures for having that thing available. Like, oh yeah,
maybe we should shut up and not make dumb recommendations around things is the human
response. But computers don't have that context. Absolutely. And so the accuracy and trust
component has to be the highest bar we meet for any optimization activity or behavior.
We have to circumvent or supersede the human aversion, the risk aversion that IT is built on,
right? Oh, absolutely. And let's be clear, we see this all the time where I'm talking to customers
and they have been burned before because we tried to save money. And then we took a production outage as a side effect of a change that we made.
And now we're not allowed to try to save money anymore.
And there's a hidden truth in there, which is auto-scaling is something that a lot of
customers talk about, but very few have instrumented true auto-scaling.
Because they interpret it as, we can scale up to meet demand.
Because yeah, if you don't do that, you're dropping customers on the floor.
Well, what about scaling back down again?
And the answer there is like,
yeah, that's not really a priority
because it's just money.
We're not disappointing customers,
causing brand reputation,
and we're still able to take people's money
when that happens.
It's only money, we can fix it later.
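Scaling back down is usually just a matter of leaving scale-in enabled on the policy; a minimal target-tracking sketch, where the Auto Scaling group name and the 50% CPU target are hypothetical, looks like this.

```python
import boto3

# Minimal sketch: a target-tracking policy that scales out AND back in.
# The Auto Scaling group name and the 50% CPU target are hypothetical values.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
        # The part many teams quietly skip: leaving scale-in enabled.
        "DisableScaleIn": False,
    },
)
```

Target tracking adds and removes capacity around the target; the failure mode described here is effectively setting DisableScaleIn to true, or never wiring up scale-in at all.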
COVID was shining a real light on a lot of this stuff,
just because there are customers that we've spoken to
whose user traffic dropped off a cliff while
infrastructure spend remained constant day over day. And yeah, they genuinely believed they were
auto-scaling. The most interesting lies are the ones that customers tell to themselves, but
the bill speaks. So getting a lot of modernization traction from things like that was really neat to
watch. But customers, I don't think necessarily intuitively understand
most aspects of their bill because it is a multidisciplinary problem. It's engineering,
it's finance, it's accounting, which is not the same thing as finance. And you need all three of
those constituencies to be able to communicate effectively using a shared and common language.
It feels like we're marriage counseling between engineering and finance most weeks.
Absolutely, we are. And it's important we get it right, that the data is
accurate, that the recommendations we provide are trustworthy. If the finance team gets their hands
on the savings potential they see out of right-sizing, takes it to engineering, and then
engineering comes back and says, no, no, no, we can't actually do that. We can't actually size
those, right? We have problems.
And they're cultural, they're transformational.
Organizations' appetite for these things varies greatly.
And so it's important we address that problem from all of those angles.
And it's not easy to do.
How big do you find the optimization problem is when you talk to customers?
How focused are they on it?
I have my answers, but that's the scale of anecdata.
I want to hear your actual answer.
Yeah.
So we talk with a lot of customers that are very interested in optimization, and we're
very interested in helping them on the journey towards having an optimal estate.
There are so many nuances and barriers, most of them psychological,
like we already talked about. I think there's this opportunity for us to go do better exposing
the potential of what an optimal AWS estate would look like from a dollar and savings perspective.
And so I think it's kind of not well understood. I think one of the biggest barriers to companies really attacking the optimization problem with more vigor is this: if they knew that the potential savings they could achieve out of their AWS environment would really align their spend much more closely with the business value they get,
I think everybody would go bonkers.
And so I'm really excited about us making progress on exposing that capability; surfacing the total savings potential and amount is something we're looking into doing in a much more obvious way.
And we're really excited about customers doing that on AWS where they know they can trust AWS to get the best value for their cloud spend, that it's a long-term good bet because their resources
that they're using on AWS are all focused on giving business value. And that's the whole key.
How can we align the dollars to the business value, right? And I think optimization is that
connection between those two concepts. Companies are generally not going to greenlight a project whose sole job is
to save money unless there's something very urgent going on. What will happen is as they iterate
forward in the next generation of services or migration of a service from one thing to another,
they will make design decisions that benefit those optimizations. There's low-hanging fruit we can
find, usually
of the form, turn that thing off or configure this thing slightly differently. That doesn't
take a lot of engineering effort in place, but on some level, it is not worth the engineering
effort it takes to do an optimization project. We've all met those engineers, speaking as one
of them myself, who, left to our own devices, will spend two months just knocking a few hundred bucks a month
off of our AWS developer environment.
We steal more than that in office supplies.
I'm not entirely sure what the business value of doing that is in most cases.
For me, yes, okay: things that work in small environments
work very well in large environments, generally speaking.
So I learn how to save 80 cents here,
and that's a few million bucks a month somewhere else. Most folks don't have that benefit happening. So it's
a question of meeting them where they are. Absolutely. And I think the skill component
is huge, which you just touched on. When you're talking about 100 EC2 instances versus 1000,
optimization becomes kind of a different component of how you
manage that AWS environment. And while for single recommendations to scale an individual
server the dollar amount might be different, the percentages are just about the same. When you look
at what is it to be sized correctly, what is it to be configured correctly? And so it really does come
down to priority. And so it's really important to really support all of those companies of all
different sizes and industries, because they will have different experiences on AWS. And some will
have more sensitivity to cost than others,
but all of them want to get great business value out of their AWS spend.
And so as long as we're meeting that need and we're supporting our customers to
make sure they understand the commitment we have to ensuring that their AWS
spend is valuable, it is meaningful, right?
They're not spending money on things that are not adding value.
That's really important to us.
I do want to have as a last topic of discussion here,
how AWS views optimization,
where there have been a number of repeated statements
where helping customers optimize their cloud spend
is extremely important to us.
And I'm trying to figure out where that falls
on the spectrum from,
it's a thing we say because they make us say it, but no, we're here to milk them like cows
all the way on over to, no, no, we passionately believe in this at every level, top to bottom in
every company. We're just bad at it. So I'm trying to understand how that winds up being expressed
from your lived experience, having solved this problem first outside and then
inside? Yeah. So it's kind of like part of my personal story. It's the main reason I joined
AWS. And when you go through the interview loops and you talk to the leaders of an organization
you're thinking about joining, they always stop at the end of the interview and ask,
do you have any questions for us? And I asked that question to pretty much every single person I interviewed with, like, what is AWS's appetite
for helping customers save money? Because like, from a business perspective, it kind of is a
little bit wonky, right? But the answers were varied, and all of them were customer obsessed
and passionate. And I got this sense that my personal passion for helping companies have better efficiency of their IT
resources was an absolute primary goal of AWS and a big element of Amazon's leadership principle,
be customer obsessed. Now, I'm not a spokesperson, so we'll see. But we are deeply interested in
making sure our customers have a great long-term experience and a high trust relationship.
And so when I ask these questions in these interviews, the answers were all about we have to do the right thing for the customer.
It's imperative.
It's also in our DNA.
It's one of the most important leadership principles we have to be customer obsessed. And it is the primary reason why I joined, because of that answer to that question.
Because it's so important that we achieve a better efficiency for our IT resources,
not just for like AWS, but for our planet.
If we can reduce consumption patterns and usage across the planet for how we use data
centers and all the power
that goes into them, we can talk about meaningful reductions of greenhouse gas emissions, the cost
and energy needed to run IT business applications. And not only that, but most all new technology
that's developed in the world seems to come out of a data center these days. We have a real
opportunity to make a material impact on how much resource we use to build and use these things. And I think we owe it to the
planet, to humanity. And I think Amazon takes that really seriously. And I'm really excited
to be here because of that. As I recall, and feel free to make sure that this comment never
sees the light of day, you asked me before interviewing for the role and then deciding to accept it,
what I thought about you working there and whether I would recommend it, whether I wouldn't.
And I think my answer was fairly nuanced.
And you're working there now and we still are on speaking terms.
So people can probably guess what my comments took the shape of, generally speaking.
But I have to ask now,
it's been what, a year since you joined? Almost. I think it's been about eight months.
Time during a pandemic is always strange, but I have to ask, did I steer you wrong?
No, definitely not. I'm very happy to be here. The opportunity to help such a broad
range of companies get more value out of technology.
And it's not just cost, right?
Like we talked about, it's actually not about the dollar number going down on a bill.
It's about getting more value and moving the needle on
how do we efficiently use technology to solve business needs.
And that's been my career goal for a really long time. I've been working on optimization for like seven or eight, I don't know, maybe even
nine years now. And it's like this strange passion for me, this combination of my dad taught me how
to be a really good steward of money and a great budget manager, and then my passion for technology.
So it's this really cool combination of like childhood life skills that really came together for me to create a career that I'm really passionate about.
And this move to AWS has been such a tremendous way to supercharge my ability to scale my personal mission and really align it to AWS's broader mission of helping companies achieve more with cloud platforms.
Right.
And so it's been a really nice eight months.
It's been wild.
Learning the AWS culture has been wild.
It's a sharp, divergent culture from where I have been in the past.
But it's also really cool to experience the leadership principles in action.
They're not just things we put on a website.
They're actually things people talk about every day.
And so that journey has been
humbling and a great learning opportunity as well. If people want to learn more,
where's the best place to find you? Oh yeah. Contact me on LinkedIn or Twitter.
My Twitter account is @rickyo1138. Let me know if you get the 1138 reference. That's a fun one.
THX1138, who doesn't?
Yeah, there you go.
And it's hidden in almost every single
George Lucas movie as well.
You can contact me on any of those
social media platforms
and I'd be happy to engage
with anybody that's interested in optimization,
cloud technology,
bill, anything like that.
Or even not, even anything else either.
Thank you so much for being so generous with your time. I really appreciate it.
My pleasure, Corey. It was wonderful talking to you.
Rick Ochs, Principal Product Manager at AWS. I'm cloud economist Corey Quinn,
and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star
review on your podcast platform of choice. Whereas if you hated this podcast, please leave a five-star review on your podcast
platform of choice, along with an angry comment rightly pointing out that while AWS is great and
all, Azure is far more cost-effective for your workloads because given their lax security,
it is trivially easy to just run your workloads in someone else's account.
If your AWS bill keeps rising and your blood pressure is doing the same,
then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business and we get to the point.
Visit duckbillgroup.com to get started.
This has been a HumblePod production. Stay humble.