Screaming in the Cloud - A Cloud Economist is Born - The AlterNAT Origin Story

Episode Date: November 9, 2022

About Ben
Ben Whaley is a staff software engineer at Chime. Ben is co-author of the UNIX and Linux System Administration Handbook, the de facto standard text on Linux administration, and is the author of two educational videos: Linux Web Operations and Linux System Administration. He has been an AWS Community Hero since 2014. Ben has held Red Hat Certified Engineer (RHCE) and Certified Information Systems Security Professional (CISSP) certifications. He earned a B.S. in Computer Science from the University of Colorado, Boulder.

Links Referenced:
Chime Financial: https://www.chime.com/
alternat.cloud: https://alternat.cloud
Twitter: https://twitter.com/iamthewhaley
LinkedIn: https://www.linkedin.com/in/benwhaley/

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Forget everything you know about SSH and try Tailscale. Imagine if you didn't need to manage PKI or rotate SSH keys every time someone leaves.
Starting point is 00:00:42 That'd be pretty sweet, wouldn't it? With Tailscale SSH, you can do exactly that. Tailscale gives each server and user device a node key to connect to its VPN, and it uses the same node key to authorize and authenticate SSH. Basically, you're SSHing the same way you manage access to your app. What's the benefit here? Built-in key rotation. Permissions as code.
Starting point is 00:01:05 Connectivity. Between any two devices. Reduced latency. And there's a lot more, but there's a time limit here. You can also ask users to re-authenticate for that extra bit of security. Sounds expensive? Nope. I wish it were. Tailscale is completely free for personal use on up to 20 devices.
Starting point is 00:01:23 To learn more, visit snark.cloud slash tailscale. Again, that's snark.cloud slash tailscale. Welcome to Screaming in the Cloud. I'm Corey Quinn, and this is an episode unlike any other that has yet been released on this august podcast. Let's begin by introducing my first-time guest somehow, because apparently an invitation got lost in the mail somewhere. Ben Whaley is a staff software engineer at Chime Financial and has been an AWS community hero since Andy Jassy was basically in diapers, to my level of understanding. Ben, welcome to the show. Corey, so good to be here. Thanks for having me on.
Starting point is 00:02:09 I'm embarrassed that you haven't been on the show before. You're one of those people that slipped through the cracks, and somehow I was very bad at following up slash hounding you into finally agreeing to be here. But you certainly waited until you had something auspicious to talk about. Well, you know, I'm the one that really should be embarrassed here. You did extend the invitation, and I guess I just didn't feel like I had something to drop. But I think today we have something that will interest most of the listeners without a doubt. So folks who have listened to this podcast before, or read my newsletter, or follow me on Twitter,
Starting point is 00:02:45 or have shared an elevator with me, or at any point have passed me on the street have heard me complain about the managed NAT gateway and its egregious data processing fee of four and a half cents per gigabyte. And I have complained about this for small customers because they're in the free tier. Why is this thing charging them 32 bucks a month? And I have complained about this on behalf of large customers who are paying the GDP of the nation of Belize in data processing fees as they wind up shoving very large workloads to and fro,
Starting point is 00:03:15 which is, I think, part of the prerequisite requirements for having a data warehouse. And you are no different than the rest of these people who have those challenges, with the singular exception that you have done something about it. And what you have done is so, in retrospect, blindingly obvious that I am embarrassed the rest of us never thought of it. It's interesting because when you are doing engineering, it's often the simplest solution that is the best. I've seen this repeatedly.
Starting point is 00:03:48 And it's a little surprising that it didn't come up before, but I think it's in some way just a matter of timing. But what we came up with, and is this the right time to get into it? Do you want to just kind of name the solution here? Oh, by all means. I'm not going to steal your thunder. Please tell us what you have wrought. We're calling it Alternat.
Starting point is 00:04:02 And it's an alternative high-availability NAT solution. As everybody knows, NAT Gateway is sort of the default choice. It certainly is what AWS pushes everybody towards. But there is in fact a legacy solution, NAT instances. These were around long before NAT Gateway made an appearance. And like I said, they're considered legacy. But with the help of lots of modern AWS innovations and technologies like Lambdas, auto-scaling groups with max instance lifetimes, and the latest generation of enhanced-networking instances, it turns out that we can maybe not quite get as effective as a NAT gateway, but we can save a lot of money and skip those data processing charges entirely by having a NAT instance solution with a failover NAT gateway, which I think is kind of the key point behind this solution. So are you interested in diving into the technical details? That is very much the missing piece right there. You're right.
Starting point is 00:05:03 What we used to use was NAT instances. That was the thing that we used because we didn't really have another option. And they had an interface in the public subnet where they lived and an interface hanging out in the private subnet, and they had to be configured to wind up passing traffic to and fro. Well, okay, that's great and all, but isn't that kind of brittle and dangerous? I basically have a single instance as a single point of failure, and these are the days early on when individual instances did not have
Starting point is 00:05:31 the level of availability and durability they do now. Yeah, it's kind of awful, but here you go. I mean, the most galling part of the Managed NAT Gateway service is not that it's expensive. It's that it's expensive, but also incredibly good at what it does. You don't have to think about this whole problem anymore.
Starting point is 00:05:50 And as of recently, it also supports IPv4 to IPv6 translation as well. It's not that the service is bad. It's that the service is stonkingly expensive, particularly at scale. And everything that we've seen before is either, oh, run your own NAT instances or bend your knee and pays your money. And a number of folks have come up with different options where this is ridiculous. Just go ahead and run your own NAT instances.
Starting point is 00:06:17 Yeah, but what happens when I have to take it down for maintenance or replace it? It's like, well, I guess you're not going to the internet today. This has the, in hindsight, obvious solution: well, we run the managed NAT gateway because the 32 bucks a month in instance hour charges don't actually matter at any point of scale when you're doing this, but you wind up using that for day in, day out traffic. And the failover mode is simply,
Starting point is 00:06:41 you'll use the expensive managed NAT gateway until the instance is healthy again, and then automatically change the route table back and forth. Yep, that's exactly it. So the auto-scaling NAT instance solution has been around for a long time, well before even NAT gateway was released. You could have NAT instances in an auto-scaling group where the size of the group was one, and if the NAT instance failed, it would just replace itself. But this left a period in which you'd have no internet
Starting point is 00:07:10 connectivity during that, you know, when the NAT instance was swapped out. So the solution here is that when auto-scaling terminates an instance, it fails over the route table to a standby NAT gateway, rerouting the traffic. So there's never a point at which there's no internet connectivity, right? The NAT instance is running, processing traffic, gets terminated after a certain period of time, configurable 14 days, 30 days, whatever makes sense for your security strategy. Could be never, right? You could choose that you want to have your own maintenance window in which to do it. Let's face it, this thing is more or less sitting there as a network traffic
Starting point is 00:07:47 router, for lack of a better term. There is no need to ever log into the thing and make changes to it until and unless there's a vulnerability that you can exploit via somehow just talking to the TCP stack when nothing's actually listening on the host. You know, you can run your own AMI that has been pared down to almost nothing. And that instance doesn't do much. It's using just the Linux kernel to sit on two networks and pass traffic back and forth. It has a translation table that kind of keeps track of the state of connections. And so you don't need to have any service running. To manage the system,
Starting point is 00:08:25 we have SSM. So you can use Session Manager to log in. But frankly, you can just disable that. You almost never even need to get a shell. And that is, in fact, an option we have in the solution: to disable SSM entirely. One of the things I love about this approach is that it is turnkey. You throw this thing in there, and it's good to go. And in the event that the instance becomes unhealthy, great, it fails traffic over to the managed NAT gateway while it terminates the old node and replaces it with a healthy one, and then fails traffic back. Now, I do need to ask, what is the story of network connections during that failover and failback scenario? Right. That's the primary drawback,
Starting point is 00:09:05 I would say, of the solution is that any established TCP connections that are on the NAT instance at the time of a route change will be lost. So say you have... TCP now terminates on the floor. Pretty much. The connections are dropped. If you have an open SSH connection from a host in the private network to a host on the internet and the instance fails over to the NAT gateway, the NAT gateway doesn't have the translation table that the NAT instance had. And not to mention the public IP address also changes because you have an Elastic IP assigned to the NAT instance, a different Elastic IP assigned to the NAT gateway. And so because that upstream IP is different, the remote host is tracking the wrong IP.
Starting point is 00:09:45 So those connections, they're going to be lost. So there are some use cases where this may not be suitable. We do have some ideas on how you might mitigate that, for example, with the use of a maintenance window to schedule the replacement. Replace less often so it doesn't have to affect your workflow as much. But frankly, for many use cases, my belief is that it's actually fine. In our use case at Chime, we found that it's completely fine and we didn't actually experience any errors or failures. But there might be some use cases that are more sensitive or less resilient to failure in the first place. I would also point out that a lot of how software
Starting point is 00:10:25 is going to behave is going to be a reflection of the era in which it was moved to cloud. Back in the early days of EC2, you had no real sense of reliability around any individual instance. So everything was written
Starting point is 00:10:38 in a very defensive manner. These days, with instances automatically being able to flow among different hardware, we don't get instance interrupt notifications the way we once did on a semi-constant basis, and it more or less has become what presents as bulletproof. So a lot of people are writing software that's a bit more brittle. But it's always been a best practice that when a connection fails, okay, what happens at failure? Do you just give up and throw your hands in the air and shriek for help? Or do you attempt to retry a few times,
Starting point is 00:11:06 ideally backing off exponentially? In this scenario, those retries will work. So it's a question of how well have you built your software? Okay, let's say that you've made the worst decisions imaginable. And okay, if that connection dies, the entire workload dies. Okay, you have the option to refactor it to be a little bit
Starting point is 00:11:26 better behaved, or alternately, you can keep paying the managed NAT gateway tax of four and a half cents per gigabyte in perpetuity forever. I'm not going to tell you what decision to make, but I know which one I'm making. Yeah, exactly. The cost savings potential of it far outweighs the potential maintenance troubles, I guess, that you could encounter. But the fact is, if you're relying on Managed NAT Gateway and paying the price for doing so, it's not as if there's no chance for connection failure. NAT Gateway could also fail.
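The retry-with-backoff behavior Corey describes is the standard way to ride out a dropped connection during a route change. Here is a minimal sketch of the pattern in Python, not anything pulled from the alternat codebase; the callable and the exception types are stand-ins for whatever your client library actually raises.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a flaky network call with exponential backoff and jitter.

    `operation` is any zero-argument callable that raises on failure.
    If the NAT path flips from the instance to the standby gateway
    mid-connection, the first attempt may fail; a later attempt goes
    out over the new path and succeeds.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):  # stand-in exception types
            if attempt == max_attempts:
                raise  # out of retries; surface the failure to the caller
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```

That applies just as much to the rare bad day of a NAT gateway as it does to a NAT instance being rotated out.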
Starting point is 00:12:01 I will admit that I think it's an extremely robust and resilient solution. I've been really impressed with it, especially so after having worked on this project. But it doesn't mean it can't fail. And beyond that, upstream of the NAT gateway, something could in fact go wrong. Internet connections are unreliable, kind of by design. So if your system is not resilient to connection failures, there's a problem to solve there anyway. You're kind of relying on hope. So it's a kind of a forcing function in some ways to build architectural best practices, in my view.
Starting point is 00:12:34 I can't stress enough that I have zero problem with the capabilities and the stability of the Managed NAT Gateway solution. My complaints about it start and stop entirely with the price. Back when you first showed me the blog post that is releasing at the same time as this podcast, and you can visit that at alternat.cloud, you sent me an early draft of this. And what I loved the most was that your math was off because of a not complete understanding of the gloriousness that is just how egregious the NAT gateway charges are. Your initial analysis said, all right, if you're throwing half a petabyte out to the Internet, this has the potential of cutting the bill by, I think it was $10,000 or something like that. It's, oh no, no.
Starting point is 00:13:27 It has the potential to cut the bill by an entire $22,500 because this processing fee does not replace any egress fees whatsoever. It's purely additive. If you forget to have a free S3 gateway endpoint in a private subnet, every time you put something into or take something out of S3,
Starting point is 00:13:45 you're paying four and a half cents per gigabyte on that. Despite the fact there's no internet transitory work, it's not crossing availability zones, it is simply a four and a half cent fee to retrieve something that only costs you, at most, 2.3 cents per month to store in the first place. Flip that switch, that becomes completely free. Yeah, I'm not embarrassed at all to talk about the lack of education I had around this topic. The fact is, I'm an engineer primarily, and I came across the cost stuff because it kind of seemed like a problem that needed to be solved within my organization. And if you don't mind, I might just linger on this point and kind of think back a few months. I looked at the AWS bill and I saw
Starting point is 00:14:31 this egregious EC2 other category. It was taking up the majority of our bill. Like the single biggest line item was EC2 other. And I was like, what could this be? I want to wind up flagging that just because that bears repeating, because I often get people pushing back of, well, how bad is it? It's one managed NAT gateway. How much could it possibly cost? Ten dollars? No, it is the majority of your monthly bill. I cannot stress that enough.
Starting point is 00:14:58 And that's not because the people who work there are doing anything that they should not be doing or didn't understand all the nuances of this. It's because for the security posture that is required for what you do (you are at Chime Financial, let's be clear here), putting everything in public subnets was not really a possibility for you folks. Yeah, not only that, but there are plenty of services that have to be on private subnets. For example, AWS Glue services must run in private VPC subnets if you want them to be able to talk to other systems in your VPC. They cannot live in public subnets. So you're essentially, if you want to talk to the internet from those jobs, you're forced into some kind of NAT solution.
Starting point is 00:15:40 So I dug into the EC2 other category and I started trying to figure out what was going on there. There's no way, natively, to look at what traffic is transiting the NAT gateway. There's not an interface that shows you what's going on, what's the biggest talkers over that network. Instead, you have to have flow logs enabled, and you have to parse those flow logs. So I dug into that. Well, you're missing a step first, because in a lot of environments, people have more than one of these things. So you get to first do the scavenger hunt of,
Starting point is 00:16:11 okay, I have a whole bunch of managed NAT gateways. And first I need to go diving into CloudWatch metrics and figure out which are the heavy talkers. It's usually one or two, followed by a whole bunch of small stuff, but not always. So figuring out which VPC you're even talking about is a necessary prerequisite. Yeah, exactly. The data around
Starting point is 00:16:31 it is almost missing entirely. Once you come to the conclusion that it is a particular NAT gateway, that's a set of problems to solve on its own. But first, you have to go to the flow logs. You have to figure out what are the biggest upstream IPs that it's talking to. Once you have the IP, it still isn't apparent what that host is. In our case, we had all sorts of outside parties that we were talking to a lot. And it's a matter of sorting by volume and figuring out, well, this IP, what is the reverse IP? Who is potentially the host there? I actually had some wrong answers at first. I set up VPC endpoints to S3 and DynamoDB and SQS because those were some top talkers. And that was a nice way to gain some security and some resilience and save some money.
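The "flip the switch" fix for the S3 and DynamoDB traffic is a gateway VPC endpoint, which carries no charge of its own. A hedged sketch with boto3; the VPC and route table IDs are placeholders, and note that SQS is an interface endpoint rather than a gateway endpoint, so it is priced differently.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholders: your VPC, and the route tables used by the private subnets
# whose S3 traffic is currently going through the NAT gateway.
VPC_ID = "vpc-0123456789abcdef0"
PRIVATE_ROUTE_TABLE_IDS = ["rtb-0123456789abcdef0"]

# A gateway endpoint keeps S3 traffic off the NAT path entirely; DynamoDB
# supports the same Gateway endpoint type with its own service name.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId=VPC_ID,
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=PRIVATE_ROUTE_TABLE_IDS,
)
print("Created", response["VpcEndpoint"]["VpcEndpointId"])
```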
Starting point is 00:17:19 And then I found, well, Datadog, that's another top talker for us. So I ended up creating a nice private link to Datadog, which they offer for free, by the way, which is more than I can say for some other vendors. But then I found some outside parties where there wasn't a nice private link solution available to us. And yet it was by far the largest volume. So that's what kind of started me down this track: analyzing the NAT gateway myself
Starting point is 00:17:47 by looking at VPC flow logs. Like it's shocking that there isn't a better way to find that traffic. It's worse than that because VPC flow logs tell you where the traffic is going and in what volumes, sure, on an IP address and port basis. But okay, now you have a Kubernetes cluster
Starting point is 00:18:03 that spans two availability zones. Okay, great. What is actually passing through that? So you have one big application that just seems awfully chatty. You have multiple workloads running on the thing. What's the expensive thing talking back and forth? The only way that you can reliably get the answer to that, that I found, is to talk to people about what those workloads are actually doing and, failing that, you're going code spelunking. Yep, you're exactly right about that. In our case, it ended up being apparent because we have a set of subnets where only one particular project runs. And when I saw the source IP, I could immediately figure that part out. But if it's a Kubernetes cluster in the private subnets,
Starting point is 00:18:47 yeah, how are you going to find it out? You're going to have to ask everybody that has workloads running there. And we're talking about, in some cases, millions of dollars a month. Yeah, it starts to feel a little bit predatory as far as how it's priced and the amount of work you have to put in to track this stuff down. I've done this a handful of times myself, and it's always painful unless you discover something pretty early on, like, oh, it's talking to S3, because that's pretty obvious when you see that. It's, yeah, flip this switch, and this entire engagement just paid for itself a hundred times over. Now, let's see what else we can discover. That is always one of those fun moments, because first, customers are super grateful to learn that.
Starting point is 00:19:29 Oh, my God, I flipped that switch and I'm saving a whole bunch of money because it starts with gratitude. Thank you so much. This is great. And it doesn't take a whole lot of time for that to alchemize into anger of, wait, you mean I've been being ridden like a pony for this long and no one bothered to mention that if I click a button, this whole thing just goes away. And when you mention this to your AWS account team, like they're solicitous, but they either have to present as I didn't know that existed either, which is not a good look. Or, yeah, you caught us, which is worse. There's no positive story on this. It just feels like a tax on not knowing trivia about AWS. I think that's what really winds me up about it so much. Yeah, I think you're right on about that as well.
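For anyone facing the same spelunking exercise Ben describes a few exchanges back, one hedged way to find the top talkers is a CloudWatch Logs Insights query run from Python. It assumes the VPC flow logs are delivered to a CloudWatch Logs group (the group name below is a placeholder); if yours land in S3 instead, the same aggregation is an Athena query over the same fields.

```python
import time
import boto3

logs = boto3.client("logs")

# Placeholder: wherever your VPC flow logs are delivered.
LOG_GROUP = "/vpc/flow-logs/my-vpc"

# Top destinations by byte count over the last 24 hours. dstAddr and bytes
# are standard VPC flow log fields; add a filter on private CIDR ranges if
# you only care about traffic headed out through the NAT path.
QUERY = """
stats sum(bytes) as totalBytes by dstAddr
| sort totalBytes desc
| limit 20
"""

now = int(time.time())
query_id = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=now - 24 * 3600,
    endTime=now,
    queryString=QUERY,
)["queryId"]

# Logs Insights queries run asynchronously; poll until the query settles.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(2)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```

From there, a reverse DNS lookup on the biggest dstAddr values is usually enough to put a vendor name next to the volume, which is exactly the "who is potentially the host there" step Ben walks through.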
Starting point is 00:20:10 My misunderstanding about the NAT pricing was data processing is additive to data transfer. I expected when I replaced NAT gateway with NAT instance that I would be substituting data transfer costs for NAT gateway costs, NAT gateway data processing costs. But in fact, NAT gateway incurs both data processing and data transfer. NAT instances only incur data transfer costs. And so this is a big difference between the two solutions. Not only that, but if you're in the same region, if you're egressing out of your, say, US East 1 region and talking to another hosted service also within US East 1, never leaving the AWS network, you don't actually even incur data transfer costs. So if you're using a NAT gateway, you're paying data processing. To be clear, you do, but it is cross AZ in most cases,
Starting point is 00:21:09 billed at one penny egressing. And on the other side, that hosted service generally pays one penny ingressing as well. Don't feel bad about that one. That was extraordinarily unclear. And the only reason I know the answer to that is that I got tired of getting stonewalled by people that, it later turned out, didn't know the answer. So I ran a series of experiments and you're paying between 13.5 cents and 9.5 cents for every
Starting point is 00:21:47 gigabyte egressed. And this is a phenomenal cost. And at any kind of volume, if you're doing terabytes to petabytes, this becomes a significant portion of your bill. And this is why people hate the NAT gateway so much. I am going to short circuit an angry comment I can already see coming on this, where people are going to say, well, yes, but at the multi-petabyte scale, nobody's paying on-demand retail price. And they're right. Most people who are transiting that kind of data have a specific discount rate applied to what they're doing; it varies depending upon usage and use case. Sure, great. But I'm more concerned with the people who are sitting around dreaming up ideas for a company where, I want to wind up doing some sort of streaming service. I talked to
Starting point is 00:22:38 one of those companies very early on in my tenure as a consultant around the billing piece and they wanted me to check their napkin math because they thought that at their numbers, when they wound up scaling up, if their projections were right, that they were going to be spending $65,000 a minute. And what did they not understand? And the answer was, well, you didn't understand this other thing, so it's going to be more than that. But no, you're directionally correct. So that idea that started off on a napkin, of course they didn't build it on top of AWS. They went elsewhere.
Starting point is 00:23:09 And last time I checked, they'd raised well over a quarter billion dollars in funding. So that's a business that AWS would love to have on a variety of different levels, but they're never going to even be considered because by the time someone is at scale, they either have built this somewhere else or they went broke trying.
Starting point is 00:23:29 Yep, absolutely. And we might just make the point there that while you can get discounts on data transfer, you really can't, or it's very rare to get discounts on data processing for the NAT gateway. So any kind of savings you can get on data transfer would apply to a NAT instance solution, saving you four and a half cents per gigabyte
Starting point is 00:23:52 inbound and outbound over the NAT gateway equivalent solution. So you're paying a lot for the benefit of a fully managed service there. Very robust, nicely engineered, fully managed service, as we've already acknowledged, but an extremely expensive solution
Starting point is 00:24:08 for what it is, which is really just a proxy in the end. It doesn't add any value to you. The only way to make that more expensive would be to route it through something like Splunk or whatnot. And Splunk does an awful lot for what they charge per gigabyte,
Starting point is 00:24:22 but it just feels like it's rent-seeking in some of the worst ways possible. And what I love about this is that you've solved the problem in a way that is open source. You have already released it in Terraform code. I think one of the first to-dos on this for someone is going to be, okay, now also make it CloudFormation and also make it CDK so you can drop it in however you want.
Starting point is 00:24:42 And anyone can use this. I think the biggest mistake people might make in glancing at this is, well, I'm looking at the hourly charge for the NAT gateways, and that's 32 and a half bucks a month. And the instances that you recommend are hundreds of dollars a month
Starting point is 00:24:56 for the big network optimized stuff. Yeah, if you care about the hourly rate of either of those two things, this is not for you. That is not the problem that it solves. If you're an independent learner annoyed about the $30 charge you got for a managed NAT gateway, don't do this. This will only add to your billing concerns.
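If you want to sanity-check where that threshold falls for your own traffic, the break-even arithmetic is short. The NAT gateway hourly rate below is the us-east-1 list price; the instance hourly rate is a stand-in for whatever NAT instance type you would actually pick, and discounts are ignored, so read the output as a rough order of magnitude rather than a commitment.

```python
# Where does swapping day-to-day traffic onto NAT instances pay for itself?
# Monthly, on-demand, deliberately rough.
HOURS_PER_MONTH = 730
ZONES = 4                        # one NAT instance + one standby gateway per AZ

NAT_GW_HOURLY = 0.045            # us-east-1 list price per gateway-hour
NAT_INSTANCE_HOURLY = 0.10       # stand-in; depends on the instance type chosen
NAT_GW_PROCESSING_PER_GB = 0.045

# Versus plain NAT gateways you keep paying the gateway hourly either way
# (they stay provisioned as the failover target), so the added fixed cost
# is just the instances, and the saving is the per-GB processing fee.
added_fixed_monthly = ZONES * HOURS_PER_MONTH * NAT_INSTANCE_HOURLY
breakeven_gb = added_fixed_monthly / NAT_GW_PROCESSING_PER_GB

total_baseline = ZONES * HOURS_PER_MONTH * (NAT_GW_HOURLY + NAT_INSTANCE_HOURLY)

print(f"All-in baseline cost of the setup: ~${total_baseline:,.0f}/month")
print(f"Break-even NAT data processing:    ~{breakeven_gb / 1000:.1f} TB/month")
```

With those placeholder numbers the break-even lands in the mid-single-digit terabytes a month, which lines up with the figures discussed next.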
Starting point is 00:25:18 Where it really shines is once you're at, I would say, probably about 10 terabytes a month, give or take, in managed NAT gateway data processing. That's where it starts to make sense to consider this. The breakeven is around six or so, but there is value to not having to think about things. Once you get to that level of spend, though, it's worth devoting a little bit of infrastructure time to something like this. Yeah, that's effectively correct. The total cost of running the solution, like all in, there's eight elastic IPs,
Starting point is 00:25:52 four NAT gateways, if you're, say you're in four zones, could be less if you're in fewer zones, like N NAT gateways, N NAT instances, depending on how many zones you're in. And I think that's about it. And I said right in the documentation, if any of those baseline fees are a material number for your use case,
Starting point is 00:26:12 then this is probably not the right solution. Because we're talking about saving thousands of dollars. Any of these small numbers for NAT gateway hourly costs, NAT instance hourly costs, that shouldn't be a factor, basically. Yeah, it's like when I used to worry about costing my customers a few tens of dollars in Cost Explorer or CloudWatch or request fees against S3 for their cost and usage reports, it's, yeah, that does actually have a cost. There's no real way around it. But look at the savings they're realizing by going through that. Yeah, they're not going to come back and complain about their five-figure
Starting point is 00:26:49 consulting engagement costing an additional $25 in AWS charges and then lowering it by a third. So there's definitely a difference as far as how those things tend to be perceived. But it's easy to miss the big stuff when chasing after the little stuff like that. This is part of the problem I have with an awful lot of cost tooling out there. They completely ignore cost components like this and focus only on the things that are easy to query via API. Of, oh, we're going to cost optimize
Starting point is 00:27:16 your Kubernetes cluster when they think about compute and RAM. And okay, that's great, but you're completely ignoring all of data transfer because there's still no great way to get at that programmatically. And it really is missing the forest for the trees. I think this is key to any cost reduction project or program that you're undertaking. When you look at a bill, look for the biggest spend items first and work your way down from there just because of the impact you can have. And that's exactly what I did in this project.
Starting point is 00:27:48 I saw that EC2 other slash NAT gateway was the big item and I started brainstorming ways that we could go about addressing that. Now I have my next targets in mind. Now that we've reduced this cost to effectively nothing, extremely low compared to what it was, we have other new line items on our build that we can start optimizing. But in any cost project, start with the big things. You have other new line items on our bill that we can start optimizing. But in any cost
Starting point is 00:28:05 project, start with the big things. You have come the long way around to answer a question I get asked a lot, which is, how do I become a cloud economist? And my answer is, you don't. It's something that happens to you. And it appears to be happening to you, too. My favorite part about this solution that you built, incidentally, is that it is being released under the auspices of your employer, Chime Financial, which is immune to being acquired by Amazon just to kill this thing and shut it up because Amazon already has something shitty called Chime. They don't need to wind up launching something else or acquiring something else and ruining it because they have a slack competitor of sorts called Amazon Chime. There's no way they could acquire you. Everyone would get lost in the hallways. Well, I have confidence that Chime will be a good steward of the project. Chime's goal and mission as a company is to help everyone
Starting point is 00:28:56 achieve financial peace of mind. And we take that really seriously. We even apply it to ourselves. And that was kind of the impetus behind developing this in the first place. You mentioned earlier we have Terraform support already. And you're exactly right. I'd love to have CDK, CloudFormation, Pulumi support, and other kinds of contributions are more than welcome from the community. So if anybody feels like participating, if they see a feature that's missing, let's make this project the best that it can be. I suspect we can save many companies hundreds of thousands or millions of dollars. And this really feels like the right direction to go. And this is easily a multi-billion dollar savings opportunity globally. That's huge. I would be flabbergasted if that was the outcome
Starting point is 00:29:42 of this. The hardest part is reaching these people and getting them on board with the idea of handling this. And again, I think there's a lot of opportunity for the project to evolve in the sense of different settings depending upon risk tolerance. I can easily see a scenario where in the event of a disruption to the NAT instance, it fails over to the managed NAT gateway, but fail back becomes manual. So you don't have a flapping route table back and forth or a hold down timer or something like that. Because again, in that scenario, the failure mode is just, well, you're paying four and
Starting point is 00:30:12 a half cents per gigabyte for a while until you wind up figuring out what's going on, as opposed to the failure mode of you wind up disrupting connections on an ongoing basis. And for some workloads, that's not tenable. This is absolutely, for the common case, the right path forward. Absolutely. I think it's an enterprise-grade solution, and the more knobs and dials that we add to tweak to make it more robust or adaptable to different kinds of use cases, the better. The best outcome here would actually be that the entire solution becomes irrelevant because AWS fixes
Starting point is 00:30:44 the NAT gateway pricing. If that happens, I will consider the project a great success. I will be doing backflips like you wouldn't believe. I would sing their praises day in, day out. I'm not saying reduce it to nothing even. I'm not saying it adds no value. I would change the way that it's priced because honestly, the fact that I can run an EC2 instance and be charged $0 on a per gigabyte basis, yeah, I would pay a premium on an hourly charge based upon traffic volumes, but don't meter it per gigabyte.
Starting point is 00:31:13 That's where it breaks down. Absolutely. And why is it additive to data transfer also? Like, I remember first starting to use VPC when it was launched and reading about the NAT instance requirement and thinking, wait a minute, I have to pay this extra management and hourly fee just so my private host could reach the internet? That seems kind of janky. And Amazon established a norm here because Azure and GCP both have their own equivalent of this now.
Starting point is 00:31:43 This is a business choice. This is not a technical choice. They could just run this under the hood and not charge anybody for it or build in the cost. And it wouldn't be this thing we have to think about. I almost hate to say it, but Oracle Cloud does, for free. Do they? It can be done. This is a business decision. It is not a technical capability issue where, well, it does incur costs to run these things. I understand that. And I'm not asking for things for free. I very rarely say that this is overpriced when I'm talking about AWS billing issues. I'm talking about it being unpredictable. I'm talking about it being impossible to see in advance. But the
Starting point is 00:32:20 fact that it costs too much money is rarely my complaint. In this case, it costs too much money. Make it cost less. If I'm not mistaken, GCP's equivalent solution is the exact same price. It's also four and a half cents per gigabyte. So that shows you that there's business games being played here. Like, Amazon could get ahead and do right by the customer by dropping this to a much more reasonable price.
Starting point is 00:32:47 I really want to thank you both for taking the time to speak with me and building this glorious, glorious thing. Where can we find it? And where can we find you? Alternat.cloud is going to be the place to visit. It's on Chime's GitHub, which will be released by the time this podcast comes out. As for me, if you want to connect, I'm on Twitter. @iamthewhaley is my handle.
Starting point is 00:33:11 And of course, I'm on LinkedIn. Links to all of that will be in the podcast notes. Ben, thank you so much for your time and your hard work. This was fun. Thanks, Corey. Ben Whaley, staff software engineer at Chime Financial and AWS community hero. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice.
Starting point is 00:33:39 Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry rant of a comment that I will charge you not only four and a half cents per word to read, but four and a half cents to reply, because I am experimenting, myself, with being a rent-seeking schmuck. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Starting point is 00:34:38 This has been a HumblePod production. Stay humble.
