Screaming in the Cloud - The Hidden Costs of Cloud Computing with Jack Ellis
Episode Date: February 27, 2024

On this week's episode of Screaming in the Cloud, Corey Quinn is joined by Jack Ellis. He is the technical co-founder of Fathom Analytics, a privacy-first alternative to Google Analytics. Corey and Jack talk in-depth about a wide variety of AWS services, which ones have a habit of subtly hiking the monthly bill, and why Jack has moved towards working with consultants instead of hiring a costly DevOps team. This episode is truly a deep dive into everything AWS and billing-related, led by one of the best in the industry. Tune in.

Show Highlights
(00:00) Introduction and Background
(00:31) The Birth of Fathom Analytics
(03:35) The Surprising Cost Drivers: Lambda and CloudWatch
(05:27) The New Infrastructure Plan: CloudFront and WAF Logs
(08:10) The Unexpected Costs of CloudWatch and NAT Gateways
(10:37) The Importance of Efficient Data Movement
(12:54) The Hidden Costs of S3 Versioning
(14:33) The Benefits of AWS Compute Optimizer
(17:38) The Implications of AWS's New IPv4 Address Charges
(18:57) Considering On-Premise Data Centers
(21:05) The Economics of Cloud vs On-Premise
(24:05) The Role of Consultants in Cloud Management
(31:05) The Future of Cloud Management
(33:20) Closing Thoughts and Contact Information

About Jack Ellis
Technical co-founder of Fathom Analytics, the simple, privacy-first alternative to Google Analytics.

Links:
Twitter: @JackEllis
Website: https://usefathom.com/
Blog Post: An alterNAT Future: We Now Have a NAT Gateway Replacement
Sponsor: Oso - osohq.com
Transcript
Yeah, we had old logs. We absolutely did. But that was not the big cost driver.
People assumed it was, though. The big cost driver was that ingest thing you talk about.
Welcome to Screaming in the Cloud. I'm Corey Quinn.
I've been paying attention to the world of web traffic analytics for a little while now,
because it seems that we've basically ceded the
entire space to, well, the only way to know what people are doing on your website is to send all
the information to Google. A while back, I heard about a company called Fathom that was launching
something in the space that actually treated your data with respect and dignity. It was kind of wild.
I recently re-encountered the company when they had a whole
Twitter thread series on things that they had done to save money on their AWS bill, which is
basically like, in my case, like taunting a tiger by waving raw meat in front of it.
Here today to talk about some of those things, and I'm sure much more, is Jack Ellis, the co-founder
and CTO of Fathom Analytics. Jack, thank you for agreeing to suffer through my nonsensical questions.
Thanks for having me, my friend.
Oso makes it easy for developers to build authorization into their applications.
With Oso, you can model, extend, and enforce your authorization as your applications scale.
Organizations like Intercom, Headway, Productboard, and PagerDuty have migrated
to Oso to build fine-grained authorization backed by a highly available and performant service.
Check out Oso today at osohq.com. That's O-S-O-H-Q dot com. So I want to start at the beginning here,
which is when you're building something to do analytics for a small website that does not get hits, Apache logs tend to basically be sufficient.
And in time, the complexity grows and then people have different problems that they want to wind up addressing.
And one thing leads to another.
I did not expect to find a company that was relatively early in its journey already caring
about the AWS bill. So I have to ask, was there a tipping point that made you say, ah,
we should definitely dive into this and fix it? Did it cross some threshold? Was it just,
feels like it's time for a good citizen effort or something else?
So the leading motive, I think a lot of companies have a lot of
money to burn through and there's lots of venture capital involved. Our company is fully bootstrapped.
So that cash is cash, it's profits, it's employee raises and things like that. So we have to ask
ourselves, as our business grows, what's going to hurt our profit margin or available cash to spend elsewhere? And AWS on a per page view level
was becoming concerning and the spending was wasteful. And so it didn't really matter that
sure it was only a hundred thousand this year based on our growth, we were going to see it
become 200, 300, 400. And then before you know it, we'd be reaching out to you saying, look at my
AWS bill. It's gone. It's gone crazy, which is what most people do.
But we try to get it ahead of time. Most of my customers these days tend to be enterprise scale
and not to cast aspersions on them at all, but at enterprise scale, the bills get a lot less
interesting in most cases where you have this giant conglomerate with, okay, they're spending
hundreds of millions a year, but the biggest workload's a couple million bucks, and there's
just a very long tail of those things. It becomes more about central planning. And you don't see the
same fun level of misconfigurations because, yeah, if you're running a managed NAT gateway,
and that's driving 20 grand a month in spend, like that's, okay, that winds up being a fifth
of your bill in some cases, people feel foolish and they fix it. You don't let that grow until, oh, what is that $30 million
a year charge we're getting? Someone notices when the numbers get big enough. So it starts to
normalize toward a certain spend. Accounts at your scale are a lot more fun because you get to see
things that catch folks who are paying attention to it by surprise. What surprised you the most?
Lambda surprised me, but more than anything,
CloudWatch really surprised me.
We're spending a significant amount
and we weren't getting any value from it.
We come up with this approach once upon a time
and we were completely fine with it.
And I was surprised to see that the Lambda was so high
because we were effectively doing double requests,
the HTTP into the SQS and then triggering a Lambda.
And it just, I don't know, I suppose I was surprised. My own incompetence surprised me. I just wasn't happy about what I
was seeing, right? And it was just this inefficient use of money, things that we just didn't have to
do now that things had changed. Once upon a time, SQS had some relevance, but why are we now spending
this money when it's not actually delivering any value in our particular use case?
And it's driving up the Lambda bill because we're seeing those documented SQS latencies.
I believe it can go into the hundreds of milliseconds.
We are seeing that.
And everyone says, that's crazy, but it is documented.
And they say that can happen.
So that surprised me.
The SQS time, actually, now that we talk about being surprised.
Yeah.
Cost and architecture in cloud are the same thing.
It's very odd seeing the drivers of your cost. It definitely leads to a better understanding of
your own architecture. Once you start seeing it in black and white in the bills that show up,
that starts to resemble telephone numbers. I completely agree. And so, yeah, we just,
we had to attack it. We had to do something because, and I'll talk about this. It hasn't
happened yet. We are looking at workloads that are going to double the volume.
And so my mind goes, okay, that's going to be nearly double the bill.
If Lambda's already inefficient and SQS is in heavy use,
the bill is going to double.
And we weren't doing clever things like batching SQS to Lambda, right?
So it was one page view in, Lambda, SQS, Lambda.
You can see that's not an efficient infrastructure to have as you're growing to scale.
So we had to fix it.
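The inefficiency Jack describes, one page view per SQS message per Lambda invocation, can be put in rough numbers. A sketch, assuming illustrative list prices (Lambda at $0.20 per million requests, SQS at $0.40 per million requests) and ignoring duration charges entirely:

```python
# Rough request-cost model for "one page view -> SQS -> Lambda" vs. batching.
# Prices are illustrative assumptions, not quotes; duration charges are omitted.
LAMBDA_PER_M = 0.20  # USD per million Lambda invocations (request charge only)
SQS_PER_M = 0.40     # USD per million SQS requests

def monthly_request_cost(pageviews: int, batch_size: int) -> float:
    """Request-level cost of pushing each page view through SQS into Lambda."""
    sqs_requests = pageviews * 2                 # one SendMessage + one receive (simplified)
    lambda_invocations = pageviews / batch_size  # batching cuts invocations linearly
    return (sqs_requests * SQS_PER_M + lambda_invocations * LAMBDA_PER_M) / 1_000_000

unbatched = monthly_request_cost(1_000_000_000, batch_size=1)   # 1B page views, no batching
batched = monthly_request_cost(1_000_000_000, batch_size=10)    # SQS -> Lambda batches of 10
print(f"unbatched ${unbatched:,.2f}/mo vs batched ${batched:,.2f}/mo")
```

Batching only trims the Lambda request line here; the real savings in a setup like this come from amortizing duration and cold-start overhead across a batch, which this toy model deliberately leaves out.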
Yeah, at scale, everything,
small inefficiencies start to add up.
What are you looking at doing instead?
Fewer Lambdas, not using Lambdas at all?
Right, so I have not talked about
the finalized infrastructure for this.
So this is an exclusive for you.
We are doing CloudFront and WAF.
WAF logs, and we're still prototyping this, but WAF logs into Kinesis. It's now called
Amazon Data Firehose. It's just been renamed. Into Data Firehose, transform batched through Lambda,
and that's to anonymize the data before it hits S3. And there are some privacy law
reasons I don't want to get into but we're doing that. It gets into S3 and then we have single
store our database running a pipeline which is a massively scalable, like millions per second that
it can handle, I think probably more, pulls the data from S3 using their own special proprietary
stuff, and loads that data into our database. This is more equivalent to the big
companies and how they handle, what do they call it, clickstream, you know, those kinds of analytical
workloads. So we're now looking, this is the cool part, we're now looking at the CloudFront 250,000
requests per second limit. We know that WAF batches through to Data Firehose, so we can fit within
their default limit; might have to increase it a little bit. But we're getting that without any provisioned servers and we've got no
Lambda burst concerns because the Lambda team wouldn't increase our burst concurrency. And I
appreciate the scaling work they've done, the new every 10 seconds it's an extra thousand
invocations or whatever it is. That's fantastic. We're keeping our dashboard and our API, that's going to
work great for us. When we have so many customers who can go bursty at any minute, you know,
the initial, is it 5,000? I don't know how many it is, I think it's a thousand now.
Some people are seeing 10 in new accounts and getting declined on increasing them.
And that's a problem, right? So we're feeling like we're being forced by AWS
into something that was purpose-made for this use case. And I'm really proud that we
pushed Lambda this far for our use case and using Laravel and PHP on the ingest. And that's been a
story for years, but we are at the point where we have to go into different directions and
I'm happy and sad about that. Lambda is one of those things that can do an awful lot,
but at some point it feels like you're trying to stretch it in a direction it wasn't intended for
and you start to feel the sharp edges coming apart under you. That's exactly it.
And with AWS not giving us limit increases, it's not even an option because we were thinking,
you know, we can bring in Redis and keep it really efficient with the latency to external services,
PrivateLink, VPC peering, get the execution time down. But even if we do that, we're still not getting
that burst that we need and that just
versus 250,000 requests per second
on CloudFront, which is
obviously a much more, you know, CloudFront
is CloudFront. It's made for scale.
That just was going to work better for us.
And that's default, by the way.
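The anonymize-before-S3 step in the pipeline above maps onto Firehose's Lambda transformation hook. A minimal sketch, where the `ip` field name and the salt scheme are hypothetical assumptions (the `recordId`/`result`/`data` envelope is Firehose's documented transform format):

```python
import base64
import hashlib
import json

def handler(event, context):
    """Data Firehose transformation Lambda: anonymize the client IP in each
    record before Firehose delivers the batch to S3. The 'ip' field and the
    salt scheme are illustrative assumptions, not Fathom's actual design."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Replace the raw IP with a salted hash so it can't be recovered downstream.
        raw_ip = payload.pop("ip", "")
        payload["ip"] = hashlib.sha256(("daily-salt:" + raw_ip).encode()).hexdigest()
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # tells Firehose the record transformed successfully
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}

# Local smoke test with a fake Firehose transformation event
event = {"records": [{"recordId": "1", "data": base64.b64encode(
    json.dumps({"ip": "203.0.113.7", "path": "/pricing"}).encode()).decode()}]}
result = handler(event, None)
out = json.loads(base64.b64decode(result["records"][0]["data"]))
print(out["path"], out["ip"][:8])
```

Because Firehose already buffers and batches before invoking the function, one invocation handles many page views, which is exactly the batching property the per-event SQS-to-Lambda path lacked.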
You also mentioned that CloudWatch
was a fun challenge for you.
I imagine you went
through the same thing most of us do, where it's, okay, CloudWatch, that's a lot. CloudWatch covers
a lot of surface area. What are the expensive parts of it? And sometimes people wind up going down the
wrong path of, oh, I'm storing too many old logs. Yeah, ingest is 50 cents a gigabyte, but now they
have a 25 cent option, and storing it is $0.03 a gigabyte per month;
old logs are not really the cost driver in most environments.
You're absolutely right.
So yeah, we had old logs.
We absolutely did.
But that was not the big cost driver.
People assumed it was, though.
The big cost driver was that ingest thing you talk about.
And it was pointless logs.
At Laravel Vapor, the runtime they had, this is a PHP Laravel runtime,
they were writing pointless logs once upon a time, starting up, injecting secrets into the runtime, but they fixed that.
So I'm thinking, oh good, I can disable those pointless logs. I'm going to be great.
And yet I was still seeing these logs and it's Lambda writing the execution time and things
like that. And I'm not using this. And it just blew my mind and infuriated
me, and then I said to myself, okay, cool, we will go without everything. All logs from Lambda
can just go, including those beautiful logs where you can see the throttles and the concurrency and
all of that. We'll just get rid of it and we will live. Well, it turns out those graphs are
actually included in the price of Lambda, so we still see the concurrency and the requests and everything.
We just don't have the pointless logging to CloudWatch
that we never wanted in the first place.
Even the function started, function ended, here's a report,
three lines of logs on every invocation.
You wound up documenting the advice we've given people
and after extensive testing to make sure it doesn't destroy things
of the only way to turn this off is to remove the ability to put CloudWatch logs in from the execution role the
Lambda's running in, which is insane. Yeah. And like I said, I said to you before the call,
I don't know if they've improved on that, but that was the way that I found to do things. I don't
know if the JSON logging changes things. People are suggesting it does, but I haven't checked
that out. So yeah, the thing we had to do was crazy. You also wound up getting bitten by my personal favorite obnoxious bugbear, the managed
NAT gateway charges. And there's always two ways that hits. One is you have a lot of them and
very little traffic's going through. So the hourly cost is through the roof, or you have relatively
few and the traffic through them is enormous. Which one was you? So we were the enormous traffic.
And, you know, that is, I call
it incompetence. I'm being harsh on myself, but just learning the right way to move data around
and to move it around efficiently. You know, you can have good practice, but it is possible. And I
know Heroku have done this for years. You can have your database traffic go over the internet.
Now, if you say that to me now, I say, of course you wouldn't do that. But it's not unusual for people to do that; not everyone has these VPCs locked down. I know
your clients 100% surely do, but smaller companies are not always doing that. They're rarely doing it, in
fact, you know, in the system I'm talking about. And so we were hit. I thought to myself, you
know, temporarily all the database traffic can go over here and it's going to be fine. I didn't realize how much traffic was going back and forth. So it was incompetence
on my part, but it was so easy to fall into that trap. And so our NAT gateway spend is literally,
and we've got private link and VPC peering set up now. So our NAT gateway spend is pennies,
is cents, it's next to nothing each day. One of the projects that came out of Chime
Financial was alterNAT, where it runs its own NAT instance, and then fails back to the managed
NAT gateway. So you can maintain uptime in the event the instance has a problem or whatnot,
but you are stuffing things through it at significant scale. Saved them something like
30 grand a month, I think. They had a whole blog post about it two years ago.
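To see why high-volume database traffic through a managed NAT gateway gets expensive, a back-of-envelope model, assuming us-east-1-style list prices ($0.045 per hour and $0.045 per GB processed; treat both as assumptions):

```python
# Back-of-envelope managed NAT gateway cost. The hourly and per-GB figures
# are assumed us-east-1-style list prices, not quotes.
HOURLY = 0.045          # USD per NAT gateway per hour
PER_GB = 0.045          # USD per GB of data processed
HOURS_PER_MONTH = 730

def nat_monthly_cost(gateways: int, gb_processed: float) -> float:
    """Hourly charge for the fleet plus the data-processing charge."""
    return gateways * HOURLY * HOURS_PER_MONTH + gb_processed * PER_GB

# Database chatter routed over the internet through NAT adds up fast:
heavy = nat_monthly_cost(gateways=3, gb_processed=200_000)  # ~200 TB/month (invented)
light = nat_monthly_cost(gateways=3, gb_processed=50)       # after PrivateLink/peering
print(f"${heavy:,.2f}/mo vs ${light:,.2f}/mo")
```

The hourly charge is pocket change at this scale; the per-gigabyte processing charge is the whole story, which is why moving the traffic onto PrivateLink and VPC peering collapses the bill to "pennies a day."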
That's incredible.
And I know you can self-host
NAT gateways and things like that.
I just, I don't want to be hands-on with anything.
They've probably got a team of,
they definitely have a team of DevOps
if they're spending that much money.
We don't have a team for DevOps.
So we have to think about managed, managed, managed.
You also had some fun things that make sense,
like the old school sysadmin approach.
Used to be for
load purposes, but now it serves as a financial one too. You saved 2,500 bucks a year on Route 53
just by increasing TTLs for some records. How do you figure out which ones were too short?
So we know which ones we are seldom going to change. And if we're going to change something,
we'll know weeks in advance. And so I haven't gone ahead and increased it to something ridiculously high, but we had it at, Corey, it was so low. It was,
we're talking maybe 60 on some of them and they're not changing at all. And I love this one because
people that read this article, this isn't a groundbreaking change for people, but they
hadn't necessarily thought about the TTLs and the impact they have at scale. And they really do.
You also knocked almost six grand off your S3 bill just by fixing versioning being turned on
on a particular bucket. But I like that all of AWS's recommendations and the default config
and guard duty and the rest demand you turn it on for every bucket. It's like you're a little
self-interested there, buddy, aren't you? AWS, their S3 stuff drives me wild. Even how the new
config doesn't want you to put anything public.
It's aggressively just, no, nothing's going public.
It feels very hard to use now.
But with the versioning, I had to tick this toggle to show that I was versioning.
I'd forgotten about it, right?
And I hadn't seen this tiny, tiny thing in the UI.
And again, incentives, are they incentivized to make that more obvious?
No, they're not.
And so I spot this thing, I click it, and I go, oh, no.
And that is what was contributing towards our AWS S3 bill.
Sorry, I'd forgotten about that one.
You're bringing it back to my memory.
I am.
Yeah, I don't have this off the top of my head.
I pulled up the blog post, and I'll throw a link to it in the show notes.
No one is excited by the prospect of building permissions
except for the people at Oso.
With Oso's authorization as a service,
you have building blocks for basic permissions patterns like RBAC, REBAC, ABAC, and the ability
to extend to more fine-grained authorization as your applications evolve. Build a centralized
authorization service that helps your developers build and deploy new features quickly. Check out
Oso today at osohq.com. Again, that's O-S-O-H-Q dot com.
But what I'm curious about, too, is not, I mean, the stuff that you wound up putting in here,
I did not notice that you got anything wrong, which is something of a rarity in posts like this.
People often like to get ahead of their skis and they'll get some trivial thing wrong.
And I try not to be like the, aha, you missed this thing. It's like, I don't want to be the
person that shows up to an effort like this and starts chipping away at the validity of what folks
have done. But what I'm curious is that the stuff that you didn't put in here, for example, like you
talk about saving money on S3 by turning off versioning. I would wonder if, again, it's all
going to be based on what the service drivers are, but taking S3 as an example, did you do any analysis of your data access
patterns and figure out if there were lifecycle changes or intelligent tiering that would
potentially make sense for you to implement? No, this is more, no. And there probably could
have been something we could have done there. That's a very valid point.
And that was an example. There are a bunch of things you could go down the path on.
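If versioning genuinely needs to stay on, a lifecycle rule can cap how long noncurrent versions linger instead of deleting them by hand. A sketch of the configuration shape boto3's `put_bucket_lifecycle_configuration` accepts; the bucket-wide scope and 30-day window are assumptions:

```python
# Lifecycle configuration that expires old object versions instead of keeping
# them forever. This is the dict shape boto3's put_bucket_lifecycle_configuration
# expects; the 30-day and 7-day windows are illustrative assumptions.
lifecycle = {
    "Rules": [
        {
            "ID": "expire-noncurrent-versions",
            "Status": "Enabled",
            "Filter": {},  # empty filter = apply to the whole bucket
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            # Also reap the delete markers that quietly accumulate on
            # versioned buckets, and half-finished multipart uploads:
            "Expiration": {"ExpiredObjectDeleteMarker": True},
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}

# With boto3 this would be applied roughly as (left commented to keep the sketch inert):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=lifecycle)
print(lifecycle["Rules"][0]["ID"])
```

That way the recoverability argument for versioning survives while the storage line item stops compounding month over month.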
It sounds like you took the same approach that I believe in taking, which is
it's this ancient secret of cloud economics where you start by with the biggest numbers rather than
alphabetically and understand the items contributing to that and then work your way down.
Well, why didn't you optimize your $1.50 charge for, I don't know, KMS? Because no one cares, buddy. Go back to work.
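That "biggest numbers first, not alphabetical" discipline is trivial to mechanize against any per-service cost export. The figures below are invented:

```python
# Triage a bill by spend, not alphabetically. Service names follow the
# Cost-Explorer-style keys; the dollar figures are invented for illustration.
bill = {
    "AmazonCloudWatch": 18_400.00,
    "AWSLambda": 22_900.00,
    "AmazonEC2": 9_100.00,
    "AmazonS3": 5_800.00,
    "awskms": 1.50,   # the $1.50 line item nobody should spend an afternoon on
}

for service, cost in sorted(bill.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{service:20s} ${cost:>10,.2f}")

top = max(bill, key=bill.get)  # start the investigation here
```

The point isn't the sort; it's that attention is the scarce resource, and the ordering decides where a week of optimization effort actually pays for itself.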
I completely agree. I completely agree. And there's things for sure we could,
I mean, you even told me something. We spoke on Twitter DMs and you told me to explore this.
I think there was one tweak that came out of that. That was something to do with the compute
optimizer. The ingest itself was already good, but there was something on the dashboard that
we actually went off and changed that they were recommending.
So thank you for that, by the way.
The AWS Compute Optimizer,
which should be part of the billing console,
but it's not because of internal,
I don't know,
feudal warlords fighting, whatever it is.
But when it launched, it was pretty crap.
And it has gotten disturbingly good.
It corrected me on the optimization
of one of my Lambda functions.
And I just want to know the answer to this for just for my own purposes, because I need to
understand how this all works. So it was right and it saved me a penny a month. You'll forgive
me if I'm not falling all over myself with excitement at the cost savings, but it has
gotten good enough that I have deprecated some of the analytical tooling that I used to use
for a number of things around right sizing. It sees so many workloads and it knows what it's looking at. And at re:Invent, they
launched the ability for you to start customizing how it works, like what headroom should be built
in, how conservative do you want it to be? And its defaults are pretty sensible too.
Easy to actually take action based on what they were giving us. And you're right. The cost savings
at the current scale for that, I don't think they're going
to be huge, but I still like optimising
things. Not overly optimising
them, but if it's a case of me tweaking a little
value here, and it will add up over
time, I absolutely will do that. And I
also felt happy to know that Ingest was
moving away from it, but
to be validated that it was in a good place with
the provisioned memory. And you know what?
I also think I need to go back to it after having made these changes
and see if anything's changed there,
because that would be interesting to see.
The problem, too, is that if you spin up a resource,
it's not just what the resource charges you
among an ever-increasing array of dimensions.
It's, okay, so now it's causing log events
and config rule evaluations and snapshots and whatnot.
And then those things, in turn, have downstream effects.
And it's turtles all the way down.
No, for sure.
And the data transfer stuff is interesting.
We're actually, as part of this process, we're spinning up EU isolation, EU data processing
stuff within AWS.
Even the Kinesis writing through to the S3 in the US from the EU is interesting. And these things you have to
know about to price out to make sure that it's economical for what the business is doing.
Two cents per gigabyte, we can absorb that, it's fine. But knowing to know that I find is a
challenge. And that's AWS a lot of the time. Knowing to know this is hard.
We're recording this conversation in the middle of February. And starting back on February 1st, AWS started charging half a penny per hour per provisioned public IPv4 address. Most people don't read. So I'm expecting my phone to basically explode right around March 3rd. It's the workloads we're concerned about. You know, if we're talking enterprise customers or even slightly bigger
businesses, it's going to be crazy, isn't it? All the EC2s they've got, all the things tied together.
I think you're going to have a fun time this year. Between three and 10% is what I'm seeing
in various sample customer environments across the board. Mine is almost 10, but I have a weird
architecture. But again, we're talking 50
bucks. So, okay. People are going to be unpleasantly surprised by this. Because they want you to move
to IPv6 and they're trying to push you. Well, I'd like them to move to IPv6 first. So many of the
things I want to run will not work, full stop, in a pure IPv6 environment, not externally, but internally on AWS
services. Back when they announced this last summer, it sounded like, oh, great, they're going to have these things ready to catch customers.
They didn't. Yeah. You've got to love them. You really have.
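The new charge is easy to put in concrete terms: at half a penny per address-hour, each public IPv4 address runs about $3.65 a month. A quick sketch, where the 2,000-address fleet is a made-up example:

```python
# The public IPv4 charge discussed above: $0.005 per address per hour.
# The fleet size is invented; the per-hour rate is from AWS's announcement.
PER_HOUR = 0.005
HOURS_PER_MONTH = 730

def ipv4_monthly_cost(addresses: int) -> float:
    return addresses * PER_HOUR * HOURS_PER_MONTH

one = ipv4_monthly_cost(1)        # per-address monthly cost
fleet = ipv4_monthly_cost(2_000)  # a hypothetical enterprise EIP/ENI fleet
print(f"${one:.2f}/mo each, ${fleet:,.2f}/mo for 2,000 addresses")
```

Per address it's trivial; multiplied across every EC2 instance, NAT gateway, and load balancer node holding a public address, it lands in the three-to-ten-percent range Corey describes.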
Oh, yeah. You've been an AWS customer for a while, and you've been doing a lot of
interesting things with them. And as an AWS customer, you have your fair share of frustrations
around a lot of the things that they do and how they operate.
Are you planning to follow all of the think pieces that are getting written and repatriate all of your workloads to an on-premise data center?
How do you think about this?
Yeah, so we keep getting asked this and I find it funny.
It's a funny question.
I'm sure some people are trolling and I appreciate the trolling. Yeah, there was an element of sarcasm in my question because at your scale, I'd be very
hard pressed to build an economical case for you to do that. But I've been, I can be surprised.
I am curious. The question is in good faith, even if I'm 90% certain I know where it's going.
No, absolutely. Okay. So it's funny because I try and get in the head of someone who's
thinking about doing this, and I think, okay, I've already got the DevOps team, they're managing cloud, I can have them manage on-premise,
so I haven't got to worry about the extra salaries required for that and benefits and everything else.
I've got my team, let's say it's five to ten people, I have no idea. It's so funny to think of: okay,
if you've got the team already, sure, go ahead and do it. But then
someone comes back to me when I've said that, and they challenge me, and they say, okay, but people
leave the company, and then you've got to worry about these staff that are managing this. It's not
just popping someone into place to replace them; there's training required. They're
not just replacing hard drives. No, exactly. I can't see it. I mean, for me,
this is never going to happen. I would always prefer, unless we're spending, if we're spending,
how much would it have to be? I mean, senior devops, the best of the best in devops,
these are big salaries that we're talking about. So the bill would have to be so substantial that
it was causing so much pain that I'd do it. I think it's motivated,
if we're talking about a specific situation here, I think it's motivated by them being bootstrapped
and wanting more cash out of the company for themselves, which I understand, but I think it's
good for them and their beliefs, whoever we may be talking about. It's just not for me. I don't
know, Corey. I just think the whole thing just breaks my brain. Even thinking about it, my brain just goes a bit all over the place. And remember, at points of scale,
starting at a million bucks a year in spend, in return for committed spend on all the major cloud
providers, you get discounting. These people spending $50 million a year are not paying
retail prices. I also saw some of the workloads and some of the databases used for various things.
I don't know if it was Elasticsearch. I'm just thinking I wouldn't have chosen that for that problem. And I appreciate
I'm in the cheap seats here, but is there really nothing else you can do to reduce that cloud cost?
Even kind of hardballing with Amazon Web Services about what they're charging. Once you get to a
certain spend, I'm sure you've got... No, you do. Doesn't your company do negotiations on behalf of people?
It's about half of our consulting.
Oh, yes.
Okay, so that's what I mean.
So there must be a way.
Well, there is a way.
You just told me there's a way.
It's just, going on-premise feels like such a big jump.
And it's almost like a marketing stunt.
But I appreciate there's a real business there.
There are a bunch of analyst reports saying that everyone's doing it on some level.
I don't see it.
What I see is companies who already have data centers moving some workloads around. Cool. I don't see people shrinking their cloud footprint. I see steady state workloads, in many cases, things that do not work well in a cloud environment, not moving in for obvious reasons. But I've never yet found a company of any scale where the AWS bill was larger than payroll.
People are expensive and people lose sight of the fact that they are expensive. So they just look at raw hardware costs and maybe some of the forward looking ones look at the power costs too.
There's a lot more to it. And I don't want to dunk by saying this, but
everyone knows what we're talking about, but the move to on-premise
and bad-mouthing the cloud and everything
else, and then a DDoS attack happened,
and the first thing they did was spin up Cloudflare.
I'm not
dunking on them for being DDoSed, that's horrible,
but the cloud has its place
even if you think you're exiting the cloud.
The cloud side, I mean,
AWS Shield Advanced, WAF, these
things are amazing. CloudFront scalability.
I just can't imagine having to try and, I guess if your business isn't growing,
maybe that's okay, but still you've got the management. It goes around in my head in circles
and I just can't imagine doing that ever. We will never do that. Let's just say that.
I spent the last month or so building myself a Kubernetes cluster for an upcoming conference talk I'm
giving at SCaLE, Terrible Ideas in Kubernetes, and I'm doing it out of Raspberry Pis. And the
problem I keep running into is, oh yeah, I'd forgotten this aspect of it. Waiting on parts
to show up. Some of the parts don't seem to work right. Inconsistencies in a batch of cables,
getting the power hooked up. And I'm not even putting my time into this. It's fun.
But oh yeah, right. I should be doing this in the cloud. Now, because it's a small home lab
environment, I'm one of the best in the world at AWS billing. But I still would not be confident
based on what I've seen so far that I wouldn't get a giant surprise bill if I did this on EKS.
So, of course, I'm doing it at home. But that doesn't mean I'm moving the production things
that make money and hold client data into my spare room too.
That'd be ridiculous.
Yeah.
And I honestly think time will tell.
I think people have got to watch it and be critical of people doing this and people make
their choices.
Realize that there's some marketing going on, but just watch the outcomes and then make
your own decisions.
We've made our decision and we're never going on premise.
The stress of knowing that our infrastructure is in
some data center and that we're, as a company, responsible. Our team could walk out. Our team
could say, no, we're not doing this anymore. Or they could be sick. I can't even imagine the
stress. I'd much rather have Amazon's engineers dealing with it. They've got plenty of engineers,
I'd imagine. They can lose racks, facilities in some cases, and you barely notice if at all.
They have some of the best in the world engineering these problems out. You are worse at replacing
failed hard drives than they are, guaranteed. I might have to write about this because when we
talk about this, when my brain does this, it's trying to get these crazy ideas all together,
and it's hard. I'd love to see you write about it. Have you written about it?
I have a talk coming up on the economics of on-prem versus data centers,
of on-prem versus cloud, an economic slap fight.
It's a keynote at SREcon in San Francisco next month, a month from now.
So March, I should probably write the talk at this point.
I'm creeping in and doing the speaker procrastination thing.
But yeah, it's time for me to go in some depth on this one.
I love it.
I want to hear it.
I think you know about cloud and you know about cost. That's what's interesting to me. You have that insight. Someone like me, I've not seen the negotiations.
I have no idea what you guys are pulling off when you have these negotiations. So I just,
I want to know more because there's another side. I can't have these debates without knowing what
goes on there. You actually know.
And I'd love to hear it from you.
It's time for us to be a lot more public
about what we're seeing and how it works.
So there's more of that coming out this year too.
It's time.
It's always custom
and you don't want to tell
any particular company stories
or that will enrage the beast.
But it's the open secrets in the industry
that everyone at a certain scale knows exist.
But if you don't know that's there,
these companies look unsound, like the economics for that don't make sense. Well,
there are service specific discounts. So if someone's doing an awful lot of S3, for example,
and as a certain use case, yeah, you can get very compelling discount options that mean your cost
for whatever metric you care about, MAU, transaction, et cetera, down to a very reasonable
place. I like it. I like that a lot.
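Purely as an illustration of that back-of-the-napkin math, here's a sketch of how a service-specific discount moves cost per MAU; every number in it (spend, discount rate, user count) is hypothetical, not anything from the episode:

```python
# Illustrative only: how a negotiated, service-specific discount changes the
# effective cost per monthly active user. All figures below are made up.
def cost_per_mau(monthly_spend: float, discount: float, mau: int) -> float:
    """Effective cost per MAU after applying a negotiated discount rate."""
    return monthly_spend * (1 - discount) / mau

list_price = cost_per_mau(120_000, 0.00, 2_000_000)   # no discount
negotiated = cost_per_mau(120_000, 0.35, 2_000_000)   # hypothetical 35% discount

print(f"list: ${list_price:.4f}/MAU, negotiated: ${negotiated:.4f}/MAU")
```

The real negotiations are, as Corey says, always custom; this only shows why a per-service discount can change the unit economics enough to matter.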
Something else you mentioned: even at your scale, you have committed to never having a DevOps team. Which, having been in DevOps myself, yeah, those people are miserable. But why do you not want one, as opposed to, you know, why those people are miserable?
We can guess.
It's, I think it's not that we would never hire, you know, a couple of people to help
with things. It's just the idea of having this big team to manage quite basic infrastructure
doesn't feel right when I can set it up with consultants
and then it's effectively hands-off.
We are paying a premium for this,
and we are using multi-AZ and everything else services
that AWS is managing, even Lambda and things like that.
The idea is just to be hands-off. So we'll do the upfront spend with consultants to put things in
place so that we don't need a DevOps team to do it. And I appreciate not everyone can do that,
but that's just how we are at the moment. I want to see how far we can really push this: managed services, all in on that. That's really where we're going.
But you pay more for managed services. Like, there's a 10 to 20% premium for using RDS over EC2.
Yeah, but when it works,
you don't have to have any database expertise internally
the way you would if you were running this at scale
yourself with open source MySQL
or PostgreSQL or whatever it is you choose to use.
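To make that trade-off concrete, here's a rough sketch of the 10 to 20% managed-service premium; the hourly rate is a made-up illustrative figure, not an actual AWS price:

```python
# Illustrative only: RDS modeled as the equivalent EC2 instance plus a
# managed-service markup in the 10-20% range mentioned above.
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(hourly_rate: float) -> float:
    return hourly_rate * HOURS_PER_MONTH

ec2_hourly = 0.192   # hypothetical on-demand rate, not a real price
rds_premium = 0.15   # assume a 15% markup for the managed service

ec2_monthly = monthly_cost(ec2_hourly)
rds_monthly = monthly_cost(ec2_hourly * (1 + rds_premium))

print(f"EC2: ${ec2_monthly:.2f}/mo, RDS: ${rds_monthly:.2f}/mo, "
      f"premium: ${rds_monthly - ec2_monthly:.2f}/mo")
```

Whether the markup is worth it is Corey's point: for most shops it comes out cheaper than carrying in-house database expertise.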
That's just it.
I think when you have good partners that care about your success... We've got great partners at SingleStore. And AWS, no one really talks about this much, but AWS, they really want to help you, give you credits, and invest in your use cases and help you to grow so you'll spend more with them. It's incentivized,
sure. I have a hard time viewing "here's some store credit" as an investment. I never liked that turn of phrase.
All right, fine.
Fine, that's how they phrase it.
The free sample from the drug dealer
is not them investing in your future.
Let's put it that way.
All right, fine.
But they'll bring on experts and everything else
and they'll help you with things.
And I've just been blown away by that.
And there isn't the salesy part.
I think I like the Elastic, what was it? They just released ElastiCache Serverless. I had that team reaching out to me and telling me the limitations that bigger companies were facing, because I said to them, what are the limitations that your bigger customers are seeing? And it's to do with the total size, which I think is like 90 terabytes or something stupid. And the bigger companies are saying that's not big enough, which blows my mind altogether.
I like that the teams are very involved in customer relations. I've had the same thing
with AWS Shield, Jeffrey Leon, one of the guys that used to work there, emailing me and talking
about things. I really like that. He's great. Oh, you know that? Okay. Yeah. The dangerous part
that sucks about Shield is it costs $3,000 a month. And there is, to us, a non-deterministic, and internally I'm sure it's deterministic, way that, for all the world, looks like the charge gets allocated to a random AWS account in your org every month. So there've been a couple of times where devs had minor heart attacks when their, you know, $20-a-month dev environment suddenly got a $3,000 charge slapped onto it.
All right. That's fair. Do you see AWS Shield as an insurance policy?
Because that's how we've been thinking about it internally.
Because they'll absorb the actual WAF cost per request, is my understanding.
So we now see it as insurance.
I think about it slightly differently.
I view it as getting you a hotline
to their DDoS folks when you need it.
It is insurance,
but it's talking to some of the best in the world
at these problems in that moment without having to sit through a sales pitch and sign over a
credit card. Jeffrey Leon being a terrific example back when he worked there before he left to go
sell, what was it, cryptocurrency? I think Robinhood is where he went, so kind of.
They had the bat signal, you just run this Lambda and it's sort of the same thing. It's
probably changed now, but I know that was our experience when we had it. I don't know if you've read my DDoS attack article, this is from
years ago now. They were great. And I definitely do enjoy that service, but yes, it is expensive,
especially for smaller businesses. Oh yeah. And the problem is just so many of their services
are clearly designed for enterprises, but they don't mention that upfront. The only way to really
figure it out is the pricing. Kendra's a good example. It's like, this sounds awesome. Oh, and it's 7,500 bucks a month. So it is not for me. Cool. Like, that is "hire someone whose full-time job is basically the archivist of everything I care about, and ask them as a human to go and get me the thing I care about."
The types are interesting. You know, everyone says throw up a CAPTCHA. We're analytics, and it happens in the background. We can't throw up a CAPTCHA to make sure it's a legitimate person coming in. And, you know, "Cloudflare can do this." No, Cloudflare cannot do layer 7. Layer 7 is really hard, and you can't throw up a CAPTCHA. No customer in the universe is going to fill out a CAPTCHA for the freaking analytics on your web page. So you've got to make people understand that when they try and give you advice. But that was a fun experience, and we're returning to using that as the expectation.
So I have to ask, now that you've successfully knocked $100,000 a year off of your bill,
are you done?
Are you going to keep going?
What does done look like?
We feel done.
We feel good.
Our employees got raises while people elsewhere have been laid off, and we were able to do that.
And that feels really good.
We're done.
We're good.
No, we're good.
Honestly, we're good.
I think that the main thing now, moving forward, is we are bringing in consultants when we're building things to make sure we're really squeezing this. You know, I'm friends with Alex DeBrie. I talk to him about things and get his thoughts.
We have hired him for a number of projects ourselves. When it comes to the deep Dynamo stuff, hard to find anyone better.
Okay. Him and serverless, though. He just knows so much. So talking to him, bringing in other consultants, making sure we're doing it right in the first place, versus Jack's going to make a guess at doing something that's going to hurt us down the road, there's a balance there. You know, and now we're bringing in consultants because we can afford to hire consultants. We couldn't at the beginning, you know; we couldn't afford these luxuries.
So things have changed. Now we've optimized our spend. We're great. But as we do new things, we're going to bring in consultants to make sure we're not going to have these huge, you know, amounts we have to cut off down the road.
That's why I'm always interested to talk to people who reach out. Like, okay, your bill is 50 bucks a month. Why are we having this conversation? And very often, it's "we're about to scale this and want you to check our napkin math before we have to raise a round to pay the bill."
When I was doing consulting, I had people reach out
and my job at the time, PHP and serverless stuff,
my role was to make sure that it would scale
for their use case.
They hadn't even reached this scale,
but they wanted to make sure they could.
So I get this preventative thinking
and I always understand why people would come to you for that
because people can blow up like that. And AWS bills for everything, so they've really got to make sure they're covered.
Yeah. I'm curious to see how this winds up unfolding in the future. The real trick is, at some point, once you've reached equilibrium, keep an eye on it, but you don't necessarily need to go into super deep weeds every month.
Look for spikes.
Look for trends.
Yeah.
Set up notifications in the billing and all of that jazz.
The alerts are great if you remember to check them. Sometimes they wind up in the founder's Gmail inbox, which they still have from their personal nonsense years ago, getting lost among everything else.
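In the spirit of "look for spikes, look for trends," here's a toy sketch of the kind of check an alert could run; the daily spend series is fabricated, and real data would come from something like Cost Explorer:

```python
# Toy spend-spike check: flag any day more than two standard deviations
# above the mean of the series. The numbers below are made up.
from statistics import mean, stdev

def find_spikes(daily_spend: list[float], threshold: float = 2.0) -> list[int]:
    """Return indices of days whose spend exceeds mean + threshold * stdev."""
    mu, sigma = mean(daily_spend), stdev(daily_spend)
    return [i for i, x in enumerate(daily_spend) if x > mu + threshold * sigma]

spend = [102, 98, 105, 99, 101, 103, 100, 430, 97, 104]  # day 7 is the spike
print(find_spikes(spend))  # → [7]
```

The mechanics matter less than the habit: equilibrium means glancing at trends, not re-auditing every line item monthly.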
I really want to thank you for taking the time to speak with me.
If people want to learn more, either about you, what you've done, the company, anything,
where's the best place for them to go?
Usefathom.com is the best place.
And follow me on Twitter.
I'm Jack Ellis.
And that's pretty much it.
And we'll put links to all of that and his blog post in the show notes.
Thank you so much for taking the time to talk to me about this.
I really appreciate it.
Thanks, man.
Jack Ellis is the CTO and co-founder of Fathom Analytics.
I'm cloud economist, Corey Quinn,
and this is Screaming in the Cloud.
If you've enjoyed this podcast,
please leave a five-star review on your podcast platform of choice.
Whereas if you hated this podcast,
please leave a five-star review
on your podcast platform of choice,
along with an insulting comment
that inadvertently will cost that platform $6
because they have no idea how their architecture works
in relation to the AWS bill.