Screaming in the Cloud - Working on the Whiteboard from the Start with Tim Banks

Episode Date: October 13, 2021

About Tim

Tim's tech career spans over 20 years through various sectors. Tim's initial journey into tech started as a US Marine. Later, he left government contracting for the private sector, working both in large corporate environments and in small startups. While working in the private sector, he honed his skills in systems administration and operations for large Unix-based datastores.

Today, Tim leverages his years in operations, DevOps, and Site Reliability Engineering to advise and consult with clients in his current role. Tim is also a father of five children, as well as a competitive Brazilian Jiu-Jitsu practitioner. Currently, he is the reigning American National and 3-time Pan American Brazilian Jiu-Jitsu champion in his division.

Links:
Twitter: https://twitter.com/elchefe
The Duckbill Group: https://duckbillgroup.com

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This episode is sponsored in part by Honeycomb. When production is running slow, it's hard to know where problems originate.
Starting point is 00:00:36 Is it your application code, users, or the underlying systems? I've got five bucks on DNS, personally. Why scroll through endless dashboards while dealing with alert floods, going from tool to tool to tool that you employ, guessing at which puzzle pieces matter? Context switching and tool sprawl are slowly killing both your team and your business. You should care more about one of those than the other. Which one is up to you? Drop the separate pillars and enter a world of getting one unified understanding of the one thing driving your business, production. With Honeycomb, you guess
Starting point is 00:01:12 less and know more. Try it for free at honeycomb.io slash screaming in the cloud. Observability, it's more than just hipster monitoring. light colored Excel tables into your deck or sift through endless spreadsheets looking for just the right data set, have you ever wondered why is it that sales and marketing get all this shiny, awesome analytics and insight tools, whereas engineering basically gets left with the dregs? Well, the founders of Jellyfish certainly did. That's why they created the Jellyfish Engineering Management Platform, but don't you dare call it JEMP. Designed to make it simple to analyze your engineering organization, Jellyfish ingests signals from your tech stack,
Starting point is 00:02:15 including JIRA, Git, and collaborative tools. Yes, depressing to think of those things as your tech stack, but this is 2021. And they use that to create a model that accurately reflects just how the breakdown of engineering work aligns with your wider business objectives. In other words, it translates from code into spreadsheet. When you have to explain what you're doing from an engineering perspective to people whose primary IDE is Microsoft PowerPoint, consider Jellyfish. That's jellyfish.co and tell them Corey sent you. Watch for the wince. That's my favorite part. Welcome to Screaming in the Cloud. I'm Corey Quinn. Periodically, I have a
Starting point is 00:02:52 whole bunch of guests come on a second time. Now, it's easy to take the naive approach of assuming that it's because it's easier for me to find a guest if I know them and don't have to reach out to brand new people all the time. This is absolutely correct. I'm exceedingly lazy. But I don't have too many folks on a third time, but that changes today. My guest is Tim Banks. I've had Tim on the show twice before. Both times, it led to really interesting conversations around a wide variety of things. Since those episodes, Tim has taken a job as a principal cloud economist here at the Duckbill Group. Yes, that is probably the strangest interview process you can imagine, but here we are. Tim, thank you so much for joining me, both on the show and in the business.
Starting point is 00:03:38 My pleasure, Corey. It was definitely an interesting interview process, but I was glad to be here. So I'm happy to be here a third time. I don't know if you get a jacket like you do in Saturday Night Live if you host like a fifth time, but we'll see. Maybe it's a vest. A cool vest would be nice. We can come up with something. Effectively, it can be like reverse hangman where you wind up getting a vest and every time you come on after that, you get like a sleeve.
Starting point is 00:04:00 Then you get a second sleeve and then you get a collar and we can do all kinds of neat stuff. I actually like that idea a lot. So I'm super excited to be able to have this conversation with you because I don't normally talk a lot on this show about what cloud economics is because my guest usually is not as deep into the space as I am. And that's fine. People should never be as deep into the space as I am in the general sense, unless they work here. Awesome. But I do guest on other shows and people ask me all kinds of questions about AWS billing and cloud economics, and that's fine. It's great, but they don't ask the questions about the space in the same way that I would and the way that I think about it. So it's hard for me to interview myself. Now, I'm not saying I won't try it someday, but it's challenging. But today I get to take the easy path out and talk to you about it. So Tim,
Starting point is 00:04:47 what the hell is a principal cloud economist? So a principal cloud economist is a cloud computing expert, both in architecture and practice, who looks at cloud cost in the same way that a lot of folks look at cloud security or cloud resilience or cloud performance. So the same engineering concerns you have about making sure that your API stays up all the time or to make sure that you don't have people that are able to escape containers or to make sure that you can have super, super low response times is the same engineering fundamentals that I look at when I'm trying to find a way to reduce your AWS bill. Okay, when we say cloud cost and cloud economics, the natural picture that leaps to mind is, oh, I get it. You're an Excel jockey. And sometimes, yeah, we all kind of play those roles. But what you're talking about is
Starting point is 00:05:40 something else entirely. You're talking about engineering expertise. And sure enough, if you look at the job postings we have for roles on the team from time to time, we have not yet hired anyone who does not have an engineering and architecture background. That seems odd to folks who do not spend a lot of time thinking about the AWS bill. I'm told those people are what is known as happy, but here we are. Why do we care about the engineering aspect of any of this? Well, I think first and foremost, because what we're doing in essence is still engineering, right? People aren't putting construction paper up on AWS. Sometimes they do put recipes up in there, but it still involves working on a computer and writing code and deploying it somewhere. So to have that basic
Starting point is 00:06:26 understanding of what it is that folks are doing on the platform, you have to have some engineering experience, first and foremost. Secondly, the fact of the matter is that most cost optimization, in my opinion, can be done on the whiteboard before anything else. And really, I think, should be done on the whiteboard before anything else. And so, I think, should be done on the whiteboard before anything else. And so the Excel aspect of it is always reactive. We have now spent this much. How much was it?
Starting point is 00:06:53 Where did it go? And now we have to figure out where it went. I like to figure out and get a ballpark on how much something's going to cost before I write the first line of code. I want to know,
Starting point is 00:07:03 hey, we have a tier here. We're using this kind of storage. It's going to take this kind of instance types. Okay, well, I've got an idea of how much it's going to cost. And I was like, you know, that's going to be expensive. Before we do anything, is there a way that we can reduce cost there? And so I'm reverse engineering that on already deployed workloads. Or when customers want to say, hey, we were thinking about doing this,
Starting point is 00:07:22 and this is our proposed architecture, I'm going to look at it and say, well, if you do this and this and this and this, you can save money. So it sounds like you and I have a bit of a philosophical disagreement in some ways. One of my recurring talking points has always been that, oh, by and large, application developers don't need to think overly much about cloud cost. What they need to know generally fits on an index card. It's, okay, big things cost more than small things. If you turn something on, it will never get turned off and will bill you in perpetuity. Data transfer has some weird stuff. And if you store data, you pay for data. Like that level of baseline understanding. When I'm trying to build something out, my immediate thought is, great,
Starting point is 00:07:59 is this thing possible? Because A, I don't always know that it is, and B, I'm super bad at computers, so for me, it may absolutely not be. Whereas you're talking about baking cost assessments into the architecture as a day one type of approach, even when sketching ideas out on the whiteboard. I'm curious as to how we diverge there. Can you talk more about your philosophy? Sure. And the reason I do that is because, as most folks that have an engineering background in cloud infrastructure will tell you, you want to build resilience on the whiteboard. You certainly want to build performance on the whiteboard, right? And security folks will tell you you want to do security on the whiteboard because those things are hard to fix after they're deployed. As soon as they're deployed without that, you now have technical debt. If you don't consider cost optimization and cost efficiency on the whiteboard, and then you try and do it after it's deployed, you not only have technical debt,
Starting point is 00:08:55 you may have actual real debt. One of the comments I tend to give a lot is that architecture and cost are the same thing in the world of cloud. And I think that we might be in violent agreement, as Liz Fong-Jones is fond of framing it, where I am acutely aware of aspects of cost, and that does factor into how I build things on the whiteboard. Let's also be very clear, most of the things that I build are very small scale. The largest cost by a landslide is the time I spend building it. In practice, that's an awful lot of environments. People are always more expensive than the AWS environment they're working on. But instead, it's about baking in the assumptions and making sure you're not
Starting point is 00:09:34 coming up with something that is going to just be wasteful and horrible out of the gate. And I guess part of that also is the fact that I am at a level of billing understanding that I've sort of absorbed these concepts intrinsically. Because to me, there is no difference between cost and architecture in an environment like this. You're right that there's always an inherent tradeoff between cost and durability. On the one hand, I don't like that. On the other, it feels like it's been true forever. I don't see a way out of it. It is inescapable. And it's interesting because you talk about, you know, the level like of an application developer or something like that. On the other, it feels like it's been true forever. I don't see a way out of it. It is inescapable. And it's interesting because you talk about the level of an application
Starting point is 00:10:08 developer or something like that. What is your level of concern? But retroactively, we'll go in for cost optimization analysis. And I've done this as far back as when I was working at AWS as a TAM. And I'll ask the question to an application developer or a database administrator, why do you do this? Why do you have a string value for something that could be a Boolean? And you're asked, well, what difference does that make? Well, it makes a big difference when you're talking about cycles for CPU. You can reduce your CPU consumption on a database instance by changing a string to a Boolean.
Starting point is 00:10:42 Now you need fewer instances, or you need a less powerful instance, or you need less memory. And now you can run a less expensive instance for your database architecture. Well, maybe for one node, that's not that big a difference, but if you're talking about something that's multi-AZ and multi-node,
Starting point is 00:10:57 I mean, that can be a significant amount of savings just by making one simple change. And that might be the difference right there. I didn't realize that offhand. It makes sense if you think about it, just realizing that I've made that mistake on one of my DynamoDB tables. It costs something like seven cents a month right now, so it's not something I'm rushing to optimize, but you're right. Expand that out by a factor of a million or so, and we're talking serious money. And then that sort of optimization makes an awful lot of
Starting point is 00:11:20 sense. I think that my position on it is that when you're building out something small scales, a demo or a proof of concept, spending time on optimizations like this is not the best use of anyone's time or brain sweat, for lack of a better term. How do you wind up deciding when it's time to focus on stuff like that? Well, first I will say that I dare say that somewhere in the 80% of production workloads are just, we're the POC. Because they, well, it worked for this to get funding, let's run it, right? Let they who does not have a DynamoDB table in production with the word test or dev in it, cast the first stone. This is certainly not me. So I understand how some of those decisions get made. And that's why I think it's better to think about it early. Because as I mentioned before,
Starting point is 00:12:04 when you start something and say, hey, this works for now, and you don't give consideration to that in the future or consideration for what it's going to be like in the future, you know, when you start doing it, you'll paint yourself into corners. That's how you get something like static values put in somewhere. Or that's how you get something like, well, we have to run this instance type because we didn't build on the ability to be more microservice-based or stateless or anything like that. You've seen people that say, hey, we could save you a lot of money if you can move this thing off to a different tier. It's like, well, that would be
Starting point is 00:12:33 an extensive rewrite of code. That'd be very expensive. I dare say that's the main reason why most AS400s are still being used right now is because it's too expensive to rewrite the code. Yeah, and there's no AWS 400 that they can migrate to. Yet, reInvent is nigh. So I think that's why, even at the very beginning, even if you were saying, well, this is something
Starting point is 00:12:54 we will do later, don't make it impossible for you to do later in your code. Don't make it impossible for you to do later in your architecture. Make things as modular as possible so that way you can say, hey, later on down the road, oh, we can switch to this instance. Oh, or here's a new managed service that we can maybe save money on doing this. And you allow yourself to switch things out or turn different knobs or change the way you do things and give yourself more options in the future, whether those options are for resilience or those options are for security or those options are for performance or they're for cost optimizations. If you make binding decisions earlier on, you're going to have debt that's going to build up at some point in the future, and then you're going to have to pay
Starting point is 00:13:33 the piper. Sometimes that piper is going to be AWS. One thing that I think gets lost in a lot of conversations about cloud economics, because I know that it happened to me when I first started this place, where I am planning to basically go out and be the world leading expert in AWS cost analyses and understanding and optimization. Great. Then I went out into the world and started doing some of my first engagements, and they looked a lot less like far future cost attribution projections, and a lot more like, what's a reserved instance? And we haven't bought any of those in 18 months. And oh, yeah, we shut down an entire project six months ago. We should probably delete all the resources. Huh. The stuff that I was preparing for at the high end of the maturity curve are great and useful and terrific to have conversations
Starting point is 00:14:25 about in some very nuanced depth. But very often there's a walk before you can run style of conversation where, okay, let's do the easy stuff first before we start writing a whole bunch of bespoke internal stuff that maps your business needs to the AWS bill. How do you, I guess, reconcile those things where you're on the one hand, you see the easy stuff and on the other, you see some of the, just the absolutely challenging, very hard five years of engineering effort style problems on the other. Well, it's interesting because I've seen one customer very recently who has brilliant
Starting point is 00:15:02 analyses as to their cost, just, you know, well charted, well tagged, well documented. Well, you know, everything is diagrammed quite nice and everything like that. And they're very, very aware of their costs, but they leave test instances running all weekend. You know, and their associated volumes and things like that. And that's a very easy thing to fix. That is a very, very low hanging fruit. And so sometimes you just have to look at where they're spending their efforts. Or sometimes they do spend so much time chasing those hard to do things, because they are hard to do, and they're exciting in an engineering aspect. And then something as simple as like, hey, how about we delete these old volumes it just isn't there or how about we switch your s3 bucket storage type those are easy low-hanging fruits and
Starting point is 00:15:50 you would be surprised and how sometimes they just don't get that but at the same time sometimes customers have like hey we knock this thing out we knock this thing out because it's in trusted advisor every ai cost optimization recommendation you can get can will tell you these five things to do no matter who you are or where you are but they don't do the very you know the conceptual things like understanding some of the principles behind cost optimization and cost optimization architecture and proactive cost optimization versus reactive cost optimizations so you're doing very kind of conceptual education and conversations with the folks rather than the do these five things. And I've not often found a customer that you have to do both on. It's usually one or the other.
Starting point is 00:16:37 It's funny that you made that specific reference to that example. One of my very first projects, not naming names. Generally, when it comes to things like this, you can tell stories or you can name names. I buy us for stories. I was talking to a company who was convinced that their developer environments were incredibly overwrought, expensive, et cetera, and burning money. Okay, great. So I talked about the idea of turning those things off at night or between test runs, deleting volumes to snapshot and restore them on a schedule when people come in in the morning because all your developers sit in the same building in the same time zones. Great. They were super on board with the idea. It was going to be
Starting point is 00:17:12 a little bit of work, but all right. This was in the days before the EC2 instance scheduler, for example. But first, let's go ahead and do some analysis. This is one of those early engagements that really reinforced my idea of, yeah, before we start going too far down the rabbit hole, let's double check what's going on in the account, because periodically you encounter things that surprise people. Like, what's up with those Australia instances? Oh, we don't have anything in that region. I believe you are being sincere when you say this. However, the API generally doesn't tell lies. So that becomes a, oh, security incident time. But looking at
Starting point is 00:17:46 this, they were right. They had some fairly sizable developer instances that were running all the time. But doing some analysis, their developer environment was 3% of their bill at the time, and they hadn't bought RIs in a year and a half. And looking at what they were doing, there was so much easier stuff that they could do to generate significant savings without running the potential of turning a developer environment off at night in the middle of an incident or something like that. The risk factor and effort were easier just to do the easy stuff, then do another pass and look at the deep stuff. And to be clear, they weren't lying to me. They weren't wrong. Back when they started building this stuff out, their developer environments were significantly large and were a significant portion of their spend.
Starting point is 00:18:29 And then they hit product market fit. And suddenly their production environment had to scale significantly in a short period of time, which, yay, cloud, it's good at that. Then it just became such a small portion that developer environments weren't really a thing. But the narrative internally doesn't get updated very often because once people learn something, they don't go back to relearn whether or not it's still true. It's a constant mistake. I make it myself frequently. I think it's interesting. There are things that we really need to kind of like put into buckets as far as what's an engineering effort and what's an administrative effort. And when I say an administrative effort, I mean, if I can save money with a stroke of a pen, well, that's going to be pretty easy.
Starting point is 00:19:10 And that's usually going to be RIs. That's going to be EDPs or PPAs or something like that, that don't require engineering effort. It just requires administrative effort. I think RIs being the simplest one, it's like, oh, all I have to do is go in here and click these things four times and I'm going to save money. Well, let's do that. And it's surprising how often people don't do that. But you know, you still have to understand that whether it's RIs or whether it's a savings plan, it's still a commitment of some kind. But if you are willing to make that commitment, you can save money with no engineering effort whatsoever. That's almost free money. So much of what we do here comes down to psychology in many ways, more than it does math.
Starting point is 00:19:50 And a lot of times, you're right. Everything you say is right. But in a large-scale environment, go ahead and click that button to buy the savings plan or the reserved instance. And that's a $20 million purchase. And companies will stall for months trying to run a different series of analyses on this. And what if this happens? What if that happens? And I get it because, yeah, I'm going to click this button that's going to cost more money than I'll make in my lifetime. That's a scary thing to do. I get it. But you're going to spend the money one way or the other with the
Starting point is 00:20:18 provider. And if you believe that that number is too high, I get it. I am right there with you. Buy half of them right now. And then you can talk about the rest until you get to a point of being comfortable with it and do it incrementally. It's not all or nothing. You have one shot to make the buy. Take pieces out of it that make sense. You know you're probably not going to turn off your database cluster that handles all of production in the next year. So go ahead and go for it. It saves some money. Do the thing that makes sense. And that doesn't require deep dive analytics. That requires, on some level, someone who's seen a lot of these before who gets what customers
Starting point is 00:20:52 are going through. And honestly, it's empathy in many respects that becomes one of those powerful things that we can apply to our customer accounts. Absolutely. I mean, people don't understand that decision paralysis about making those commitments costs you money. You can spend months doing analysis, but those months doing analysis, you're going to spend 30, 40, 50, 60, 70% more
Starting point is 00:21:11 on your EC2 instances or other compute than you would otherwise. And that can be quite significant. It's one of those cases where we talk about the psychology around, you know, perfect being the enemy of good, right? You don't have to make the perfect purchase of RIs or savings plans to have that so tuned perfectly that you're going to get 100% utilization and zero, like you don't, you don't have to do that. Just do something, do a little bit, like I said, buy half, buy anything, just something, right? And you're going to save money. And then you can run analysis later on while you're saving money, you know, and get a little better and tune it up a little more and get more analysis on it and maybe fine tune it. But you don't actually ever need to have it down to like the penny. Like it never has to be that good. At some point, one of the value propositions we have for
Starting point is 00:22:00 our customers has always been that we tell you when to stop focusing on saving money because there's a theoretical cap of a hundred percent of the cloud bill that you can save. But you can make so much more than that by launching the right feature to the right market a little sooner. Focus on that. Be responsible stewards of the money that's invested with you, but by and large, as a general piece of guidance, at some point, stop cutting and go back to doing the thing that makes your company work. It's not all about saving money at all costs for almost all of us. It is for us, but we're sort of a special case. Well, it's a conversation I often have.
Starting point is 00:22:29 It's like, all right, are you trying to save money on AWS? Are you trying to save money overall? So if you're going to spend $400,000 worth of engineering effort to save $10,000 on your AWS bill, that doesn't make no sense. Right, there has to be a strategic reason to do things like that. And make sure you understand the value of what you're getting for this. One reason that
Starting point is 00:22:49 we wind up charging the way that we do, and we've gotten questions on this for a while, has been that we charge fixed fee for what we do on engagements. And similarly, people have asked this, but haven't tied the two things together. You talk about cost optimization, but never cost cutting. Why is that? Is that just a negative term? And the answer has been no. They're aligned. What we do focuses on what is best for the customer. Once that fixed fee is decided upon, every single thing that we say is what we would do if we were in the customer's position. There are times where we'll look at what they have going on and say, ah, you really should spend more money here for resiliency or durability or, okay, that is critical data that's not being backed up. You should consider doing that. It's why we don't take percentages of things, because at that point, we're not just going with the useful stuff. It's, well, we're going to basically throw the entire kitchen sink at you. We had an early customer, and I was talking to their AWS account manager about what we were going to be doing. And their comment was, oh, saving money on AWS bills is great. Make sure you check the EBS snapshots. Yeah, I did that.
Starting point is 00:23:54 They were spending 150 bucks a month on EBS snapshots, which is basically nothing. It's one of those stories where if in the course of an hour-long meeting, I can pay for that entire service by putting a quarter on the table, I'm probably not going to talk about it, barring some extenuating circumstance. It's focus on the big things, not the things that worked in a different environment with a different account and different constraints. It's hard to context switch like that, but it gets a lot easier when it is basically the entirety of what we do all day. The difference I draw between cost optimization and cost cutting is that cost optimization is ensuring that you're not spending money unnecessarily or that you're maximizing your dollar. And so sometimes we get called in there and we're just validation for the measures they've already done. Like your team is doing this exactly right. You're doing the things you should be doing.
Starting point is 00:24:46 We can nitpick if you want to. If I'm going to save you $7 a year, but who cares about that, right? But y'all are doing what you should be doing. This is great. Going forward, you want to look for these things and look for these things and look for these things. We're going to give you some more concepts
Starting point is 00:24:59 so that you are cost optimized in the future. But it doesn't necessarily mean that we have to cut your bill because if you're already spending efficiently, you don't optimized in the future. But it doesn't necessarily mean that we have to cut your bill because if you're already spending efficiently, you don't need your bill cut. You're already cost optimized. Oh, we're not gonna nitpick on that. You're mostly optimized there.
Starting point is 00:25:13 It's like, yeah, that workload's $140 million a year and rising, please pick nits. That which point, okay, great. That's the strategic reason to focus on something. But by and large, it comes down to understanding what the goals of clients are. I think that is widely misunderstood about what we do and how we do it. The first question I always ask when someone does outreach of, hey, we'd like to talk about coming here and doing a consulting engagement with us. It's great. I always like to ask the
Starting point is 00:25:40 quote-unquote foolish question of, why do you care about the AWS bill? And occasionally I'll get people who look at me like I have two heads of why wouldn't I care about the AWS bill? Because there are more important things to care about for the business almost certainly. One of the things I try and do, especially when we're talking about cost optimization, especially trying to do something for the right now so they can do things going forward, is like, you know, all right, so if we cut this much from your bill, if you just do nothing else but do reserved instances or buy a savings plan, right, you're going to save enough money to hire four engineers. Think about what four engineers would
Starting point is 00:26:18 do for your overall business. And that's how I want you to frame it. I want you to look at what cost optimization is going to allow you to do in the future without costing you any more money. Or maybe you save a little more money and you can shift it. Instead of paying for your AWS bill, maybe you can train your developers. Maybe you can get more developers. Maybe you can get some pro serve. Maybe you can do whatever, buy newer computers for your people so they can do whatever it
Starting point is 00:26:40 is, right? We're not saying that you no longer have to spend this money, but you can use this money to do something other than give it to Jeff Bezos. This episode is sponsored in part by Liquibase. If you're anything like me, you've screwed up the database part of a deployment so severely that you've been banned from ever touching anything that remotely sounds like SQL at at least three different companies. We've mostly got code deployment solved for, but when it comes to databases, we basically rely on desperate hope
Starting point is 00:27:11 with a rollback plan of keeping our resumes up to date. It doesn't have to be that way. Meet Liquibase. It's both an open-source project and a commercial offering. Liquibase lets you track, modify, and automate database schema changes across almost any database, with guardrails that ensure you'll still have a company left after you deploy the change.
Starting point is 00:27:35 No matter where your database lives, Liquibase can help you solve your database deployment issues. Check them out today at Liquibase.com. Offer does not apply to Route 53. There was an article recently, as of the time of this recording, where Pinterest discussed what they had disclosed in one of their regulatory filings, which was over the next eight years, they have committed to pay AWS $3.2 billion. And in this article, they have the head of engineering talking to the
Starting point is 00:28:05 reporter about how they're thinking about these things, how they're looking at things that are relevant to their business. And they're talking about having a dedicated team that winds up doing a whole bunch of data analysis and running some analytics on all of these things from piece to piece to piece. And that's great. And I worry on some level that other companies will say, oh, Pinterest is doing that, we should too. Yeah, for the course of this commitment, a 1% improvement is $32 million.
Starting point is 00:28:36 So yeah, at that scale, I'm gonna hire a team of data scientists too to look at these things. Your bill is $50,000 a month. Perhaps that's not worth the effort that you're going to put into it, barring other things that contribute to it. It's interesting because, you know, we will get folks that will approach us
Starting point is 00:28:53 that have small accounts, very small spend, and like, hey, can you come and talk to us about this, whatever? And we can say very honestly, like, look, we could, but the amount of money we're going to charge you is not going to be worth your while right now. You could probably get by on the automated recommendations, on the things that are already out there on the internet that everybody can do to optimize their bill. And then when you grow to a point where now saving 10% is somebody's salary, that's when it kind of becomes more critical. And it's hard to say what point that is in anyone's business, but I can say sometimes, hey, you know what,
Starting point is 00:29:28 that's not really what you need to focus on, right? If you need to save $100 a month on your AWS bill and that's critical, you've got other concerns that are not in your AWS bill. So back when you were interviewing to work here, one of the areas of focus that you kept bringing up was the concept of observability. And my response to this was, oh, hell, another one. Because let's be clear, Mike Julian, my business partner and our CEO, has written a book called Practical Monitoring. And apparently what we learned from this is as soon as you finish writing a book on a topic, you never want to talk about that topic ever again, which, yeah, in hindsight, makes
Starting point is 00:30:03 sense. Why do you care about observability when you're here to look at cloud cost? Because cloud cost is another metric, just like you would use for performance or resilience or security, right? You do real-time monitoring to see if somebody has compromised the system. You do real-time monitoring to see if you have bad performance, if response times are too slow. You do real-time monitoring to know if something has gone down and then you need to make adjustments or that the automated responses you have in response to that downtime are working. But cloud cost, you send somebody a report at the end of the month.
Starting point is 00:30:40 Can you imagine, if you will, just for a second, if you got a downtime report at the end of the month and then you can react to something that's gone down? Or if you get a security report at the end of the month and then you can react to the fact that somebody has your root keys? Or if you get a report at the end of the month that says, hey, the CPU on this one was pegged, you should probably scale up. That's outrageous to anybody in this industry right now. But why do we accept that for cloud cost? It's worse than that. There are a number of startups that talk about, oh, real-time cloud cost monitoring. Okay, the only way you're going to achieve such a thing is if you build an API shim that interprets everything that you're telling your cloud control plane to do, taking
Starting point is 00:31:21 cost metrics out of it, and then passing it on to the actual cloud control plane. Otherwise, you're talking about it showing up in the billing record in ideally eight hours in practice, several days, or you're talking about looking at cloud trail events, which is not holistic, but gives you some rough idea, but is also in some cases, five to 20 minutes delayed. There's no real-time way to do this without significant disruption to what's going on in your environment. So when I hear about, oh, we do real-time bill analysis, yeah, it feels, to be very direct, like you don't know enough about the problem space you're working within to speak intelligently about it, because anyone who's played in this space for a while knows exactly how hard it is to get there. Now, I've talked to companies that have built
Starting point is 00:32:03 real-time-ish systems that take that shim approach and act sort of as a metadata sidecar or SaaS billing system that tracks all of this so they can wind up intercepting potentially very expensive configuration mistakes. And that's great. That's also a bit beyond for a lot of folks today, but it's where the industry is going. But there is no way to get there today short of effectively intercepting all of those calls in a way that is cohesive and makes sense. How do you square that circle, given the complete lack of effective tooling?
Starting point is 00:32:34 Honestly, I'm going to point that right back at the cloud provider, because they know how much you're spending real time. They know exactly how much you're spending real time. They've figured it out, right? They have the buckets, they have the APIs for it internally. Sure they do. Like, it would make no sense for them not to. Without giving anything away, I know that when I was at AWS, I knew how much you were spending almost real time. That's impressive. I wish that existed.
Starting point is 00:32:59 My never-having-worked-at-AWS perspective on it is that they, of course, have the raw data effective immediately or damn close to it. But the challenge for the billing system is distilling and summarizing and attributing all of that in a reasonable timeframe. It is an exabyte scale problem. I've talked to folks there who have indicated it is comfortably north of a petabyte in raw data per day. And that was a couple of years ago. So one can only imagine as the footprint has increased, so has all of this. I mean, the billing system is fundamentally magic from the outside. I'm not saying it's good magic, but it is magic. And it's something that is unappreciated that
Starting point is 00:33:34 every customer uses and is one of those areas that doesn't get the attention it deserves. Because let's be clear here, we talk about observability. The bill is still the only thing that AWS offers that gives you a holistic overview of everything running in your account in one place. What I think is interesting is that you talk about the scale of the problem and that it makes it difficult to solve. At the same time, I can have a conversation with my partner about kitty litter. And then all of a sudden, I'm going to start getting ads about kitty litter within minutes. So I feel like it's possible to emit cost as a metric, like you would CPU or disk. And if I'm going to look at who's going to do that, I'm going to look right back at AWS. The fun part about that is, though, is I know
Starting point is 00:34:35 from AWS's business model that if that's something they were to emit, it would also cost you like 25 cents per call. And then you would actually like triple your cloud cost just trying to figure out how much it costs you. Only with 16 other billing dimensions, because of course it would. And again, I'm talking about stuff because of how I operate and how I think about this stuff that is inherently corner case or vertice case in many cases. But for the vast majority of folks,
Starting point is 00:34:59 it's not the, ooh, you have this really weird data transfer paradigm between these two resources, which, yeah, that's a problem and needs to be addressed in an awful lot of cases because data transfer pricing is bonkers. But instead, it's the, huh, you just spun up a big cluster that's going to cost $20,000 a month. You probably don't need to wait a full day to flag that. And you also can't put this on the customer in the sense of, oh, just set some budget alarms. That's great. That's the first thing you should do in a new AWS account. Well, Jack Hole, I've done an awful lot of first things I'm supposed to do in an AWS account in my dedicated
Starting point is 00:35:37 test account for these sorts of things. It's been four months. I'm not done yet with all of those first things I'm supposed to do. It's incredibly secure, increasingly expensive, and so far all it runs is a single EC2 instance that is mostly there just so that everything else doesn't error out trying to divide by zero. There are some things that are built in, right? If I stand up an EC2 instance and it goes down, I'm going to get an alert that this instance terminated for some reason. It's just going to show up informationally. In the console, you're not going to get called
Starting point is 00:36:09 about it or paged about it unless you have something else in the business that will, like a boss that screams at you at two o'clock in the morning. This is why we have very little production facing here. But if I know that that alert exists somewhere in the console, that's easy for me to write a trap for, right? That's easy for me to write, say, hey, I'm going to respond to that because this call is going to come out somewhere. It's going to get emitted somewhere.
Starting point is 00:36:31 I can now, as an engineer, write a very easy trap that says, hey, pop this into Slack, send alerts and a page, right? So if I could emit a cost metric and I can say, wow, somebody has spun up this thing that's going to cost
Starting point is 00:36:45 X amount of money. Someone should get paged about this because if they don't page about this and we wait eight hours, that's my month's salary. And you would do that if your database server went down. You would do that if someone rooted that database server. You would do that if the database server was bogging you to scale up another one, right? So why can't you do that if someone rooted that database server. You would do that if the database server was bogging you to scale up another one, right? So why can't you do that if that database server is all of a sudden costing you way more than you had calculated? And there's a lot of nuance here, because what you're talking about makes perfect sense for smaller scale accounts. But even in some of the very large accounts, or we're talking hundreds of millions a year in spend,
Starting point is 00:37:22 you can set compromised keys up on GitHub, put them in PaySpin, whatever, and then people start spinning up Bitcoin miners everywhere. Great. It takes a long time to materially move the needle on that level of spend. It gets lost in the background noise. I lose my mind when I wind up leaving a managed NAT gateway running, and it costs me 70 bucks a month in my $5 a month test account. Yeah, but you realize that you could basically buy an island, and it gets lost in the AWS bill at some of the high watermarks for some of these larger accounts. Oh, someone spun up a cluster that's going to cost $400,000 a year. Yeah, do I need to re-explain to you what a data science team does? They light money on fire in return for questionable returns as a general rule.
Starting point is 00:38:06 You knew that when you hire them, leave them alone. Whereas someone in their developer account does this, yeah, you kind of want to flag that immediately. It always comes down to rules and context, but I'd love to have some templates ready to go of, I'm a starving student. Please alert me anytime it looks like I might possibly exceed the free tier, or better yet, don't let me, and if I do, it's on you and you eat the cost. Conversely, it's, yeah, this is a Netflix sub account or whatnot. Maybe don't bother me for anything whatsoever because freedom and responsibility is how we roll. I imagine that that's what they do internally on a lot of their cloud costing stuff because freedom and responsibility is ingrained in their culture.
Starting point is 00:38:42 It's great. It's the freedom for me to think about cloud bills and the responsibility for paying it of the cloud bill. Yeah, we will get internally alerted if things are up too long, and then we will actually get paged, and then our manager would get paged, and it would go up the line if you leave something that's running too expensive too long. So there is a system there for it. Oh yeah. The internal AWS systems for employees are probably my least favorite AWS service, full stop. And I've seen things posted about it. I believe it's called Isengard for spinning up internal accounts and the rest. There's a separate one I think called Conduit, but I digress. That you spin something up and apparently if it doesn't wind up, and I don't need you to comment on this because you work there and confidentiality is super important.
Starting point is 00:39:26 But to my understanding, it's great. It has a whole bunch of formalized stuff like that. And it solves for a whole lot of nifty features that buy us for the way that AWS focuses on accounts and how they view security and the rest. And, oh, well, we couldn't possibly ship this to customers because it's not how they operate. And that's great.
Starting point is 00:39:42 My problem with this internal provisioning system is it isolates and insulates AWS employees from the real pain of working with multiple accounts as a customer. You don't have to deal with the provisioning process of control tower or whatnot. You have your own internal thing. Eat your own dog food, gargle your own champagne, whatever it takes to wind up getting exposure to the pain that hits customers. And suddenly you'll see those things improve. I find that the best way to improve a product is to make the people building it live with the painful parts. I think it's interesting that the stance as well, like this is not how the customers operate and we wouldn't want the
Starting point is 00:40:17 customers to have to deal with this. But at the same time, you have to open up like a hundred accounts if you need more than a certain number of S3 buckets. So they are very comfortable with burdening the customer with a lot of constraints and they say, well, constraints drive innovation. Certainly this is a constraint that you could at least offer and let the customers innovate around that. And at least define who the customer is. Because yeah, I'm a Netflix sub-account is one story. I'm a regulated bank is another story. And I'm a student in my dorm room trying to learn how this whole cloud thing works is another story. From risk tolerance, from a data protection story, from a billing surprise story, from a, I'm trying to learn what the hell this is, and all these other service offerings
Starting point is 00:41:05 who keep talking to me about confuse the hell out of me, please streamline the experience. There's a whole universe of options and opportunity that isn't being addressed here. Well, I will say it very simply like this. We're talking about a multi-trillion dollar company versus someone who, if their AWS bill is too high, they don't pay rent. Maybe they don't eat. Maybe they have other issues. Medical bill doesn't get paid.
Starting point is 00:41:31 Childcare doesn't get paid. And if you're going to tell me that this multi-trillion dollar company can't solve for that, so that that doesn't happen to that person, and tells them, well, you know, if you come in afterwards after your bill gets there, maybe we can do something about it. But in the meantime, suffer through this. That's not ethical. Full stop. There are a lot of things that AWS gets right.
Starting point is 00:41:54 And I want to be clear that I'm not sitting here trying to cast blame and say that everything they're doing is terrible. I feel like every time I talk about billing in any depth, I have to throw this disclaimer in. 90 to 95% of what they do is awesome. It's just the missing piece that is incredibly painful for customers. And that's what I spend most of my time focusing on. It should not be interpreted to think that I hate the company. I just want them to do better than they are. And what they're doing now is pretty decent in most respects. I just want to fix the painful parts. Tim, thank you for joining me for a third time here. I'm certain I'll have you back in the somewhat
Starting point is 00:42:25 near future to talk about more aspects of this, but until then, where can people find you slash retain your services? Well, you can find me on Twitter at El Chefe. And if you want to retain my services, for which you would be very, very happy to have, you can go to duckbillgroup.com and fill out a little questionnaire, and I will magically appear after an exchange of goods and services. Make sure to reference Tim by name, just so that we can make our sales team facepalm, because they know what's coming next. Tim, thank you so much for your time. It's appreciated. Thank you so much, Corey. I loved it. Principal Cloud Economist here at the Duckbill Group, Tim Banks.
Starting point is 00:43:06 I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice. Wait at least eight hours, possibly as many as 48 to 72, and then leave a comment explaining what you didn't like. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duck Bill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duck Bill Group works for you, not AWS. We tailor recommendations to your business,
Starting point is 00:43:49 and we get to the point. Visit duckbillgroup.com to get started. This has been a humble pod production stay humble
