Screaming in the Cloud - The Hidden Costs of Cloud Computing with Jack Ellis

Episode Date: February 27, 2024

On this week's episode of Screaming in the Cloud, Corey Quinn is joined by Jack Ellis, the technical co-founder of Fathom Analytics, a privacy-first alternative to Google Analytics. Corey and Jack talk in depth about a wide variety of AWS services, which ones have a habit of subtly hiking the monthly bill, and why Jack has moved toward working with consultants instead of hiring a costly DevOps team. This episode is truly a deep dive into everything AWS and billing-related, led by one of the best in the industry. Tune in.

Show Highlights

(00:00) - Introduction and Background
(00:31) - The Birth of Fathom Analytics
(03:35) - The Surprising Cost Drivers: Lambda and CloudWatch
(05:27) - The New Infrastructure Plan: CloudFront and WAF Logs
(08:10) - The Unexpected Costs of CloudWatch and NAT Gateways
(10:37) - The Importance of Efficient Data Movement
(12:54) - The Hidden Costs of S3 Versioning
(14:33) - The Benefits of AWS Compute Optimizer
(17:38) - The Implications of AWS's New IPv4 Address Charges
(18:57) - Considering On-Premise Data Centers
(21:05) - The Economics of Cloud vs On-Premise
(24:05) - The Role of Consultants in Cloud Management
(31:05) - The Future of Cloud Management
(33:20) - Closing Thoughts and Contact Information

About Jack Ellis

Technical co-founder of Fathom Analytics, the simple, privacy-first alternative to Google Analytics.

Links:
Twitter: @JackEllis
Website: https://usefathom.com/
Blog Post: An alterNAT Future: We Now Have a NAT Gateway Replacement

Sponsor: Oso - osohq.com

Transcript
Starting point is 00:00:00 Yeah, we had old logs. We absolutely did. But that was not the big cost driver. People assumed it was, though. The big cost driver was that ingest thing you talk about. Welcome to Screaming in the Cloud. I'm Corey Quinn. I've been paying attention to the world of web traffic analytics for a little while now, because it seems that we've basically ceded the entire space to, well, the idea that the only way to know what people are doing on your website is to send all the information to Google. A while back, I heard about a company called Fathom that was launching something in the space that actually treated your data with respect and dignity. It was kind of wild.
Starting point is 00:00:42 I recently re-encountered the company when they had a whole Twitter thread series on things that they had done to save money on their AWS bill, which is basically like, in my case, like taunting a tiger by waving raw meat in front of it. Here today to talk about some of those things, and I'm sure much more, is Jack Ellis, the co-founder and CTO of Fathom Analytics. Jack, thank you for agreeing to suffer through my nonsensical questions. Thanks for having me, my friend. Oso makes it easy for developers to build authorization into their applications. With Oso, you can model, extend, and enforce your authorization as your applications scale.
Starting point is 00:01:21 Organizations like Intercom, Headway, Productboard, and PagerDuty have migrated to Oso to build fine-grained authorization backed by a highly available and performant service. Check out Oso today at osohq.com. That's O-S-O-H-Q dot com. So I want to start at the beginning here, which is when you're building something to do analytics for a small website that does not get hits, Apache logs tend to basically be sufficient. And in time, the complexity grows, and then people have different problems that they want to wind up addressing, and one thing leads to another. I did not expect to find a company that was relatively early in its journey already caring about the AWS bill. So I have to ask, was there a tipping point that made you say, ah,
Starting point is 00:02:13 we should definitely dive into this and fix it? Did it cross some threshold? Was it just, feels like it's time for a good citizen effort or something else? So the leading motive, I think a lot of companies have a lot of money to burn through and there's lots of venture capital involved. Our company is fully bootstrapped. So that cash is cash, it's profits, it's employee raises and things like that. So we have to ask ourselves, as our business grows, what's going to hurt our profit margin or available cash to spend elsewhere. And AWS on a per page view level was becoming concerning and the spending was wasteful. And so it didn't really matter that sure it was only a hundred thousand this year based on our growth, we were going to see it
Starting point is 00:02:56 become 200, 300, 400. And then before you know it, we'd be reaching out to you saying, look at my AWS bill. It's gone. It's gone crazy, which is what most people do. But we try to get it ahead of time. Most of my customers these days tend to be enterprise scale and not to cast aspersions on them at all, but at enterprise scale, the bills get a lot less interesting in most cases where you have this giant conglomerate with, okay, they're spending hundreds of millions a year, but the biggest workload's a couple million bucks, and there's just a very long tail of those things. It becomes more about central planning. And you don't see the same fun level of misconfigurations because, yeah, if you're running a managed NAT gateway,
Starting point is 00:03:38 and that's driving 20 grand a month in spend, like that's, okay, that winds up being a fifth of your bill in some cases, people feel foolish and they fix it. You don't let that grow until, oh, what is that $30 million a year charge we're getting? Someone notices when the numbers get big enough. So it starts to normalize toward a certain spend. Accounts at your scale are a lot more fun because you get to see things that catch folks who are paying attention to it by surprise. What surprised you the most? Lambda surprised me, but more than anything, CloudWatch really surprised me. We're spending a significant amount
Starting point is 00:04:11 and we weren't getting any value from it. We come up with this approach once upon a time and we were completely fine with it. And I was surprised to see that the Lambda was so high because we were effectively doing double requests, the HTTP into the SQS and then triggering a Lambda. And it just, I don't know, I was supposedly surprised. My own incompetence surprised me. I just wasn't happy about what I was seeing, right? And it was just this inefficient use of money, things that we just didn't have to
Starting point is 00:04:35 do now that things had changed. Once upon a time, SQS had some relevance, but why are we now spending this money when it's not actually delivering any value in our particular use case? And it's driving up the Lambda bill because we're seeing those documented. I believe it can go into the hundreds of milliseconds for latency. We are seeing that. And everyone says, that's crazy, but it is documented. And they say that can happen. So that surprised me.
Starting point is 00:04:58 The SQS time, actually, now that we talk about being surprised. Yeah. Cost and architecture and cloud are the same thing. It's very odd seeing the drivers of your cost. It definitely leads to a better understanding of your own architecture. Once you start seeing it in black and white in the bills that show up, that starts to resemble telephone numbers. I completely agree. And so, yeah, we just, we had to attack it. We had to do something because, and I'll talk about this. It hasn't happened yet. We are looking at workloads that are going to double the volume.
Starting point is 00:05:25 And so my mind goes, okay, that's going to be nearly double the bill. If Lambda's already inefficient and SQS is in heavy use, the bill is going to double. And we weren't doing clever things like batching SQS to Lambda, right? So it was one page view in, Lambda, SQS, Lambda. You can see that's not an efficient infrastructure to have as you're growing to scale. So we had to fix it. Yeah, at scale, everything,
Starting point is 00:05:50 small inefficiencies start to add up. What are you looking at doing instead? Fewer Lambdas, not using Lambdas at all? Right, so I have not talked about the finalized infrastructure for this. So this is an exclusive for you. We are doing CloudFront and WAF. WAF logs, and we're still prototyping this, but WAF logs into Kinesia. It's now called
Starting point is 00:06:13 DataFireHose. It's just been renamed. Into DataFireHose, transform batched through Lambda, and that's to anonymize the data before it hits S3. And there are some privacy law reasons I don't want to get into but we're doing that. It gets into S3 and then we have single store our database running a pipeline which is a massively scalable, like millions per second that it can handle, I think probably more, pulls the data from S3 using their own special proprietary stuff and loads that data into our database this is more equivalent to the big companies and how they handle what do they call it click point or you know those kinds of analytical workloads so we're now looking this is the cool part we're now looking at the cloud front 250,000
Starting point is 00:06:57 requests per second limit we know that WAF batches through to data firehose so we can fit within their default limit might have to increase it a little bit. But we're getting that without any provisioned servers and we've got no Lambda burst concerns because the Lambda team wouldn't increase our burst concurrency. And I appreciate the scaling work they've done, the new every 10 seconds it's an extra thousand invocations or whatever it is. That's fantastic. We're keeping our dashboard and our API going to work great for us. When we have so many customers who can go bursty at any minute, you know, the initial, is it 5,000? I don't know how many years, I think it's a thousand now. Some people are seeing 10 in new accounts and getting declined on increasing them.
Starting point is 00:07:36 And that's a problem, right? So we're feeling like we're being forced by AWS into something that was purpose-made for this use case. And I'm really proud that we pushed Lambda this far for our use case and using Laravel and PHP on the ingest. And that's been a story for years, but we are at the point where we have to go into different directions and I'm happy and sad about that. Lambda is one of those things that can do an awful lot, but at some point it feels like you're trying to stretch it in a direction it wasn't intended for and you start to feel the sharp edges coming apart under you. That's exactly it. And with AWS not giving us limit increases, it's not even an option because we were thinking,
Starting point is 00:08:13 you know, we can bring in Redis and keep it really efficient with the latency to external services, private link, BPC peering, get the execution time down. But even if we do that, we're still not getting that burst that we need and that just versus 250,000 requests per second on CloudFront, which is obviously a much more, you know, CloudFront is CloudFront. It's made for scale. That just was going to work better for us.
Starting point is 00:08:38 And that's default, by the way. You also mentioned that CloudWatch was a fun challenge for you. I imagine you went through the same thing most of us do where it's okay cloud watch that's a lot cloud watch covers a lot of surface area what are the expensive parts of it and sometimes people wind up going down the wrong path of oh i'm storing too many old logs yeah ingest is 50 cents a gigabyte but now they have a 25 cent option and storing it0.03 a gigabyte per month,
Starting point is 00:09:08 old logs are not really the cost driver in most environments. You're absolutely right. So yeah, we had old logs. We absolutely did. But that was not the big cost driver. People assumed it was, though. The big cost driver was that ingest thing you talk about. And it was pointless logs.
Starting point is 00:09:23 At Laravel Vapor, the runtime they had, this is a PHP Laravel runtime, they were writing pointless logs once upon a time, starting up, injecting secrets into the runtime, but they fixed that. So I'm thinking, oh good, I can disable those pointless logs. I'm going to be great. And yet I was still seeing these logs and it's Lambda writing the execution time and things like that. And I'm not using this. and it just blew my mind and infuriated me and then i said to myself okay cool we will we will go without everything all logs from lambda can just go including those beautiful logs where you can see the throttles and the concurrency and the but all of that we'll just get rid of it and we will live well it turns out those graphs are
Starting point is 00:10:01 actually included in the price of lambda so we still see the concurrency and the requests and everything. We just don't have the pointless logging to CloudWatch that we never wanted in the first place. Even the function started, function ended, here's a report, three lines of logs on every invocation. You wound up documenting the advice we've given people and after extensive testing to make sure it doesn't destroy things of the only way to turn this off is to remove the ability to put CloudWatch logs in from the execution role the
Starting point is 00:10:28 Lambda's running in, which is insane. Yeah. And like I said, I said to you before the call, I don't know if they've improved on that, but that was the way that I found to do things. I don't know if the JSON logging changes things. People are suggesting it does, but I haven't checked that out. So yeah, the thing we had to do was crazy. You also wound up getting bitten by my personal favorite obnoxious bugbear, the managed net gateway charges. And there's always two ways that hits. One is in you have a lot of them and very little traffic's going through. So the hourly cost is through the roof, or you have relatively few and the traffic through them is enormous. Which one was you? So we were the enormous traffic. And, you know, that is, I call
Starting point is 00:11:06 it incompetence. I'm being harsh on myself, but just learning the right way to move data around and to move it around efficiently. You know, you can have good practice, but it is possible. And I know Heroku have done this for years. You can have your database traffic go over the internet. Now, if you say that to me now, I say, of course you wouldn't do that but it's not unusual for people to do that not everyone has these these vpcs locked down i know your clients 100 surely do but smaller companies are not always doing that they're rarely doing in fact that you know that the system i'm talking about and so we were hit i thought to myself you know temporarily all that the database can go over here and it's going to be fine. I didn't realize how much traffic was going back and forth. So it wasn't competence on my part, but it was so easy to fall into that trap. And so our NAT gateway spend is literally,
Starting point is 00:11:55 and we've got private link and VPC peering set up now. So our NAT gateway spend is pennies, is cents, it's next to nothing each day. One of the projects that came out of Chime Financial was Alternat, where it runs its own NAT instance, and then as a failback of the managed NAT gateway. So you can maintain uptime in the event the instance has a problem or whatnot, but you are stuffing things through it at significant scale. Save them something like 30 grand a month, I think. They had a whole blog post about it two years ago. That's incredible. And I know you can self-host
Starting point is 00:12:26 and AT gateways and things like that. I just, I don't want to be hands-on with anything. They've probably got a team of, they definitely have a team of DevOps if they're spending that much money. We don't have a team for DevOps. So we have to think about managed, managed, managed. You also had some fun things that make sense,
Starting point is 00:12:43 like the old school sysadmin approach. Used to be for load purposes, but now it seems as a financial one too. You save 2,500 bucks a year on Route 53 just by increasing TTLs for some records. How do you figure out which ones were too short? So we know which ones we are seldom going to change. And if we're going to change something, we'll know weeks in advance. And so I haven't gone ahead and increased it to something ridiculously high, but we had it at, Corey, it was so low. It was, we're talking maybe 60 on some of them and they're not changing at all. And I love this one because people that read this article, this isn't a groundbreaking change for people, but they
Starting point is 00:13:19 hadn't necessarily thought about the TTLs and the impact they have at scale. And they really do. You also knocked almost six grand off your S3 bill just by fixing versioning being turned on on a particular bucket. But I like that all of AWS's recommendations and the default config and guard duty and the rest demand you turn it on for every bucket. It's like you're a little self-interested there, buddy, aren't you? AWS, their S3 stuff drives me wild. Even how the new config doesn't want you to put anything public. It's aggressively just, no, nothing's going public. It feels very hard to use now.
Starting point is 00:13:49 But with the versioning, I had to tick this toggle to show that I was versioning. I'd forgotten about it, right? And I hadn't seen this tiny, tiny thing in the UI. And again, incentives, are they incentivized to make that more obvious? No, they're not. And so I spot this thing, I click it, and I go, oh, no. And that is what was contributing towards our AWS S3 bill. Sorry, I'd forgotten about that one.
Starting point is 00:14:11 You're bringing it back to my memory. I am. Yeah, I don't have this off the top of my head. I pulled up the blog post, and I'll throw a link to it in the show notes. No one is excited by the prospect of building permissions except for the people at Oso. With Oso's authorization as a service, you have building blocks for basic permissions patterns like RBAC, REBAC, ABAC, and the ability
Starting point is 00:14:31 to extend to more fine-grained authorization as your applications evolve. Build a centralized authorization service that helps your developers build and deploy new features quickly. Check out Oso today at osohq.com. Again, that's O-S-O-H-Q dot com. But what I'm curious about, too, is not, I mean, the stuff that you wound up putting in here, I did not notice that you got anything wrong, which is something of a rarity in posts like this. People often like to get ahead of their skis and they'll get some trivial thing wrong. And I try not to be like the, aha, you missed this thing. It's like, I don't want to be the person that shows up to an effort like this and starts chipping away at the validity of what folks
Starting point is 00:15:12 have done. But what I'm curious is that the stuff that you didn't put in here, for example, like you talk about saving money on S3 by turning off versioning. I would wonder if, again, it's all going to be based on what the service drivers are, but taking S3 as an example, did you do any analysis of your data access patterns and figure out if there were lifecycle changes or intelligent tiering that would potentially make sense for you to implement? No, this is more, no. And there probably could have been something we could have done there. That's a very valid point. And that was an example. There are a bunch of things you could go down the path on. It sounds like you took the same approach that I believe in taking, which is
Starting point is 00:15:48 it's this ancient secret of cloud economics where you start by with the biggest numbers rather than alphabetically and understand the items contributing to that and then work your way down. Well, why didn't you optimize your dollar 50 charge for, I don't know, KMS. Because no one cares, buddy. Go back to work. I completely agree. I completely agree. And there's things for sure we could, I mean, you even told me something. We spoke on Twitter DMs and you told me to explore this. I think there was one tweak that came out of that. That was something to do with the compute optimizer. The ingest itself was already good, but there was something on the dashboard that we actually went off and changed that they were recommending.
Starting point is 00:16:25 So thank you for that, by the way. The AWS Compute Optimizer, which should be part of the billing console, but it's not because of internal, I don't know, feudal warlords fighting, whatever it is. But when it launched, it was pretty crap. And it has gotten disturbingly good.
Starting point is 00:16:42 It corrected me on the optimization of one of my Lambda functions. And I just want to know the answer to this for just for my own purposes, because I need to understand how this all works. So it was right and it saved me a penny a month. You'll forgive me if I'm not falling all over myself with excitement at the cost savings, but it has gotten good enough that I have deprecated some of the analytical tooling that I've had used to use for a number of things around right sizing. It sees so many workloads and it knows what it's looking at and reinvent. They launched the ability for you to start customizing how it works, like what headroom should be built
Starting point is 00:17:14 in, how conservative do you want it to be? And its defaults are pretty sensible too. Easy to actually take action based on what they were giving us. And you're right. The cost savings at the current scale for that, it won that, I don't think they're going to be huge, but I still like optimising things. Not overly optimising them, but if it's a case of me tweaking a little value here, and it will add up over time, I absolutely will do that. And I
Starting point is 00:17:36 also felt happy to know that Ingest was moving away from it, but to be validated that it was in a good place with the provisioned memory. And you know what? I also think I need to go back to it after having made these changes and see if anything's changed there, because that would be interesting to see. The problem, too, is that if you spin up a resource,
Starting point is 00:17:54 it's not just what the resource charges you among an ever-increasing array of dimensions. It's, okay, so now it's causing log events and config rule evaluations and snapshots and whatnot. And then those things, in in turn have downstream effects. And it's turtles all the way down. No, for sure. And the data transfer stuff is interesting.
Starting point is 00:18:11 We're actually, as part of this process, we're spinning up EU isolation, EU data processing stuff within AWS. Even the Kinesis writing through to the S3 in the US from the EU is interesting. And these things you have to know about to price out to make sure that it's economical and for the S3 in the US from the EU is interesting. And these things you have to know about to price out to make sure that it's economical and for what the business is doing. Two cents per gigabyte, we can absorb that, it's fine. But knowing to know that I find is a challenge. And that's AWS a lot of the time. Knowing to know this is hard. We're recording this conversation in the middle of February. And starting back on February 1st, AWS started charging half a penny per hour per provisioned public IPv4 address. Most people don't read. So I'm expecting my phone to basically explode right around March 3rd workloads we're concerned about. You know, if we're talking enterprise customers or even slightly bigger
Starting point is 00:19:08 businesses, it's going to be crazy, isn't it? All the EC2s they've got, all the things tied together. I think you're going to have a fun time this year. Between three and 10% is what I'm seeing in various sample customer environments across the board. Mine is almost 10, but I have a weird architecture. But again, we're talking 50 bucks. So, okay. People are going to be unpleasantly surprised by this. Because they want you to move to IPv6 and they're trying to push you. Well, I'd like them to move to IPv6 first. So many of the things I want to run will not work full stop in a pure IPv6 environment, but internally on AWS services. Back when they announced this last summer, it sounded like, oh, great, they're going to have these things ready to catch customers.
Starting point is 00:19:48 They didn't. Yeah. You've got to love them. You really have. Oh, yeah. You've been an AWS customer for a while, and you've been doing a lot of interesting things with them. And as an AWS customer, you have your fair share of frustrations around a lot of the things that they do and how they operate. Are you planning to lead to follow all of the think pieces that are getting written and repatriate all of your workloads to an on-premise data center? How do you think about this? Yeah, so we keep getting asked this and I find it funny. It's a funny question.
Starting point is 00:20:21 I'm sure some people are trolling and I appreciate the trolling. Yeah, there was an element of sarcasm in my question because at your scale, I'd be very hard pressed to build an economical case for you to do that. But I've been, I can be surprised. I am curious. The question is in good faith, even if I'm 90% certain I know where it's going. No, absolutely. Okay. So it's funny because I try and get in the head of someone who's thinking about doing this and I think, okay okay I've already got the DevOps team they're managing cloud I can have them manage on premise so I haven't got to worry about the extra salaries required for that and benefits and everything else I've got my team let's say it's five to ten people I have no idea when it's so funny to think of okay if you've got the team already sure go ahead and do it but then
Starting point is 00:21:06 someone comes back to me when i've said that and they challenge me and they say okay but people leave the company and then you've got to worry about these staff that are managing this it's not just popping someone into place and to replace them like there's training required there they're not just replacing hard drives no exactly i i can't can't see it. I mean, for me, this is never going to happen. I would always prefer, unless we're spending, if we're spending, how much would it have to be? I mean, senior devops, the best of the best in devops, these are big salaries that we're talking about. So the bill would have to be so substantial that it was causing so much pain that I'd do it by. I think it's motivated,
Starting point is 00:21:45 if we're talking about a specific situation here, I think it's motivated by them being bootstrapped and wanting more cash out of the company for themselves, which I understand, but I think it's good for them and their beliefs, whoever we may be talking about. It's just not for me. I don't know, Corey. I just think the whole thing just breaks my brain. Even thinking about it, my brain just goes a bit all over the place. And remember, at points of scale, starting at a million bucks a year in spend, in return for committed spend on all the major cloud providers, you get discounting. These people spending $50 million a year are not paying retail prices. I also saw some of the workloads and some of the databases used for various things. I don't know if it was Elasticsearch. I'm just thinking I wouldn't have chosen that for that problem. And I appreciate
Starting point is 00:22:29 I'm in the poor seats here, but is there really nothing else you can do to reduce that cloud cost? Even kind of hardballing with Amazon Web Services about what they're charging. Once you get to a certain spend, I'm sure you've got... No, you do. Doesn't your company do negotiations on behalf of people? It's about half of our consulting. Oh, yes. Okay, so that's what I mean. So there must be a way. Well, there is a way.
Starting point is 00:22:51 You just told me there's a way. It just going on premise feels like such a big jump. And it's almost like a marketing stunt. But I appreciate there's a real business there. There are a bunch of analyst reports saying that everyone's doing it on some level. I don't see it. What I see is companies who already have data centers moving some workloads around. Cool. I don't see people shrinking their cloud footprint. I see steady state workloads, in many cases, things that do not work well in a cloud environment, not moving in for obvious reasons. but I've never yet found a company of any scale where the AWS bill was larger than payroll. People are expensive and people lose sight of the fact that they are expensive. So they just look at raw hardware costs and maybe some of the forward looking ones look at the power costs too.
Starting point is 00:23:37 There's a lot more to it. And I don't want to dunk by saying this, but everyone knows what we're talking about, but the move to on-premise and bad-mouthing the cloud and everything else, and then a DDoS attack happened, and the first thing they did was spin up Cloudflare. I'm not dunking on them for being DDoS, that's horrible, but the cloud has its place
Starting point is 00:23:58 even if you think you're exiting the cloud. The cloud size, I mean, AWS Shield Advanced, WAF, these things are amazing. CloudFront scalability. I mean, AWS Shield Advanced, WAF, these things are amazing. Cloud front scalability. I just can't imagine having to try and, I guess if your business isn't growing, maybe that's okay, but still you've got the management. It goes around in my head in circles and I just can't imagine doing that ever. We will never do that. Let's just say that. I spent the last month or so building myself a Kubernetes for an upcoming conference talk. I'm
Starting point is 00:24:24 giving at scale a terrible ideas in Kubernetes and I'm doing it out of Raspberry Pis. And the problem I keep running into is, oh yeah, I'd forgotten this aspect of it. Waiting on parts to show up. Some of the parts don't seem to work right. Inconsistencies in a batch of cables, getting the power hooked up. And I'm not even putting my time into this. It's fun. But oh yeah, right. I should be doing this in the cloud. Now, because it's a small home lab environment, I'm one of the best in the world at AWS billing. But I still would not be confident based on what I've seen so far that I wouldn't get a giant surprise bill if I did this on EKS. So, of course, I'm doing it at home. But that doesn't mean I'm moving the production things
Starting point is 00:25:02 that make money and hold client data into my spare room too. That'd be ridiculous. Yeah. And I honestly think time will tell. I think people have got to watch it and be critical of people doing this and people make their choices. Realize that there's some marketing going on, but just watch the outcomes and then make your own decisions.
Starting point is 00:25:19 We've made our decision and we're never going on premise. The stress of knowing that our infrastructure is in some data center and that we're, as a company, responsible. Our team could walk out. Our team could say, no, we're not doing this anymore. Or they could be sick. I can't even imagine the stress. I'd much rather have Amazon's engineers dealing with it. They've got plenty of engineers, I'd imagine. They can lose racks, facilities in some cases, and you barely notice if at all. They have some of the best in the world engineering these problems out. You are worse at replacing failed hard drives than they are, guaranteed. I might have to write about this because when we
Starting point is 00:25:53 talk about this, when my brain does this, it's trying to get these crazy ideas all together, and it's hard. I'd love to see you write about it. Have you written about it? I have a talk coming up on the economics of on-prem versus data centers, of on-prem versus cloud on economic slap fight. It's a keynote at SREcon in San Francisco next month, a month from now. So March, I should probably write the talk at this point. I'm creeping in and doing the speaker procrastination thing. But yeah, it's time for me to go in some depth on this one.
Starting point is 00:26:23 I love it. I want to hear it. I think you know about cloud and you know about cost. That's what's interesting to me. You have that insight. Someone like me, I've not seen the negotiations. I have no idea what you guys are pulling off when you have these negotiations. So I just, I want to know more because there's another side. I can't have these debates without knowing what goes on there. You actually know. And I'd love to hear it from you. It's time for us to be a lot more public
Starting point is 00:26:48 about what we're seeing and how it works. So there's more of that coming out this year too. It's time. It's always custom and you don't want to tell any particular company stories or that will enrage the beast. But it's the open secrets in the industry
Starting point is 00:27:00 that everyone at a certain scale knows exist. But if you don't know that's there, these companies look on sound for like the economics for that don't make sense. Well, there are service specific discounts. So if someone's doing an awful lot of S3, for example, and as a certain use case, yeah, you can get very compelling discount options that mean your cost for whatever metric you care about, MAU, transaction, et cetera, down to a very reasonable place. I like it. I like that a lot. Something else you mentioned, even at your scale, you have committed to never having
Starting point is 00:27:30 a DevOps team, which having used, I used to be a DevOps and yeah, those people are miserable, but why, why do you not want one as opposed to, you know, why those people are miserable? We can guess. It's, I think it's not that we would never hire, you know, a couple of people to help with things. It's just the idea of having this big team to manage quite basic infrastructure doesn't feel right when I can set it up with consultants and then it's effectively hands-off. We are paying a premium for this,
Starting point is 00:27:59 and we are using multi-AZ and everything else services that AWS is managing, even Lambda and things like that. The idea is just to be hands-off. So we'll do the upfront spend with consultants to put things in place so that we don't need a DevOps team to do it. And I appreciate not everyone can do that, but that's just how we are at the moment. I want to see how far we can really push this managed services all in on that. That's really where we're going. But you pay more for managed services. Like there's a 10 to 20% high premium
Starting point is 00:28:28 of using RDS over EC2. Yeah, but when it works, you don't have to have any database expertise internally the way you would if you were running this at scale yourself with open source MySQL or PostgreSQL or whatever it is you choose to use. That's just it. I think when you have
Starting point is 00:28:45 good partners that care about your success, we've got great partners at SingleStore. AWS, no one really talks about this much. AWS, they really want to help you, give you credits and invest in your use cases and help you to grow so you'll spend more with them. It's incentivized, sure. I have a hard time viewing, here's some store credit as an investment. I never like that turn of phrase. All right, fine. Fine, that's how they phrase it. The free sample from the drug dealer is not them investing in your future.
Starting point is 00:29:13 Let's put it that way. All right, fine. But they'll bring on experts and everything else and they'll help you with things. And I've just been blown away by that. And there isn't the salesy part. I think I like the Elastic, what was it? They just released Elastic Cash Serverless servers i had that team reaching out to me and telling me the
Starting point is 00:29:29 limitations that bigger companies were facing because i said to them what are the limitations that your bigger customers are saying and it's to do with the total size which i think is like 90 terabytes or something stupid and the bigger companies are saying that's not big enough which that blows my mind altogether i like that the teams are very involved in customer relations. I've had the same thing with AWS Shield, Jeffrey Leon, one of the guys that used to work there, emailing me and talking about things. I really like that. He's great. Oh, you know that? Okay. Yeah. The dangerous part that sucks about Shield is it costs $3,000 a month. And there is to us a non-deterministic and internally,
Starting point is 00:30:05 I'm sure it's deterministic way, but what looks to all the world, like the charge gets allocated to a random AWS account in your org every month. So there've been a couple of times where devs had minor heart attacks when they're, you know, $20 a month dev environment suddenly got a $3,000 charge slapped onto it. All right. That's fair. Do you see AWS Shield as an insurance policy? Because that's how we've been thinking about it internally. Because they'll absorb the actual, the WAF cost per request is my understanding. So we now see it as insurance.
Starting point is 00:30:35 I think about it slightly differently. I view it as getting you a hotline to their DDoS folks when you need it. It is insurance, but it's talking to some of the best in the world at these problems in that moment without having to sit through a sales pitch and sign over a credit card. Jeffrey Leon being a terrific example back when he worked there before he left to go sell, what was it, cryptocurrency? I think Robinhood is where he went, so kind of.
Starting point is 00:30:58 They had the bat signal, you just run this Lambda and it's sort of the same thing. It's probably changed now, but I know that was our experience when we had it. I don't know if you've read my DDoS attack article, this is from years ago now. They were great. And I definitely do enjoy that service, but yes, it is expensive, especially for smaller businesses. Oh yeah. And the problem is just so many of their services are clearly designed for enterprises, but they don't mention that upfront. The only way to really figure it out is the pricing. Kendra's a good example. It's like, this sounds awesome. Oh, and it's 7,500 bucks a month. So it is not for me. Cool. Like that is hire someone whose full-time job is basically like the archivist of everything I care about and ask them as a human to go and get me the thing I care about.
Starting point is 00:31:38 The types are interesting. You know, everyone's throw up a capture. We're analytics and it happens in the background. We can't throw up a capture to make sure it's a legitimate person coming in. And, you know, Cloudflare can do this. No, Cloudflare cannot do layer. Layer 7 is really hard and you can't throw up a capture. No customer in the universe is going to fill out a capture for the freaking analytics on your web page. So you've got to make people understand that when they try and give you advice. But that was a fun experience and we're returning to be using that as the expectation.
Starting point is 00:32:08 So I have to ask, now that you've successfully knocked $100,000 a year off of your bill, are you done? Are you going to keep going? What does done look like? We feel done. We feel good. Our employees got raises and people have been laid off and we were able to do that. And that feels really good.
Starting point is 00:32:25 We're done. We're good. No, we're good. Honestly, we're good. I think that the main thing now moving forward is we are bringing in consultants when we're building things to make sure we're really squeezing this, you know, I'm friends with Alex debris. I talk to him about things and get his thoughts on.
Starting point is 00:32:38 We have hired him for a number of projects ourselves. When it comes to the deep dynamo stuff, hard to find anyone better. Okay. Him in serverless though. He just, he knows so much. So talking to him, bringing in other consultants, making sure we're doing it right in the first place versus Jack's going to make a guess at doing something right that's going to hurt us down the road is a, is a balance there. You know, and now we're bringing in consultants because we can
Starting point is 00:32:59 afford to hire consultants. We couldn't at the beginning, you know, things we couldn't afford these, these luxuries. So things have changed. Now we've optimized our spend. We're great. But as we do new things, we're going to bring in consultants to make sure we're not going to have these huge, you know, amounts we have to cut off down the road. That's why I'm always interested to talk to people who reach out like, okay, your, your bill is 50 bucks a month. Why are we having this conversation? And I'll very often, it's what we're about to scale this and want you to check our napkin math
Starting point is 00:33:25 before we have to raise a round to pay the bill. When I was doing consulting, I had people reach out and my job at the time, PHP and serverless stuff, my role was to make sure that it would scale for their use case. They hadn't even reached this scale, but they wanted to make sure they could. So I get this preventative thinking
Starting point is 00:33:41 and I always understand why people would come to you for that because people can blow up like that. And bills you know aws bills for everything so they've really got to make sure they go itself yeah i'm curious to see how this winds up unfolding in the future the real trick is at some point once you've reached equilibrium keep an eye on it but you don't necessarily need to go into super deep weeds every month. Look for spikes. Look for trends. Yeah.
Starting point is 00:34:09 Set up notifications in the billing and all of that jazz. The alerts are great if you remember to check them. Sometimes they wind up in the founder's Gmail inbox, which they still have from their personal nonsense years ago and getting lost among everything else. I really want to thank you for taking the time to speak with me. If people want to learn more, taking the time to speak with me. If people want to learn more, either about you, what you've done, the company, anything,
Starting point is 00:34:30 where's the best place for them to go? Usefathom.com is the best place. And follow me on Twitter. I'm Jack Ellis. And that's pretty much it. And we'll put links to all of that and his blog post in the show notes. Thank you so much for taking the time to talk to me about this. I really appreciate it.
Starting point is 00:34:46 Thanks, man. Jack Ellis is the CTO and co-founder of Fathom Analytics. I'm cloud economist, Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you hated this podcast, please leave a five-star review
Starting point is 00:35:01 on your podcast platform of choice, along with an insulting comment that inadvertently will cost that platform $6 because they have no idea how their architecture works in relation to the AWS bill.
