Postgres FM - Infra cost optimization
Episode Date: January 27, 2023

Here are links to a few things we mentioned:

- The Cost of Cloud, a Trillion Dollar Paradox (blog post from Andreessen Horowitz)
- OVH
- Hetzner
- postgresql_cluster
- Why we're leaving the cloud (blog post by DHH from Basecamp)
- Managed services vs. DIY episode
- ec2instances.info
- Vantage
- Postgres TV episode with Everett Berry from Vantage
- Migrating to Aurora: easy except the bill (blog post by Kimberly Nicholls from Gridium)
- Database Lab Engine
- Postgres.ai consulting
- Netdata
- sysbench
- fio
- Query macro analysis episode
- Top queries by buffers (Gist from Michael)

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:

- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:

- Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard, and this is my co-host Nikolai, founder of Postgres.ai.
Hey Nikolai, what are we talking about today?
Let's talk about optimizing budgets.
First of all, infrastructure costs. I mean, if you run a database and it grows,
obviously it increases your bill: usually a cloud bill, sometimes not cloud, of course.
And there's a question which becomes more and more acute: how to pay less for it while continuing to grow as a company. How to reduce the costs.
And I think it's the topic that can be interesting for both infrastructure and engineering teams.
And it's becoming more and more important
because we have a crisis already.
A lot of people are being laid off, unfortunately.
And this is one of the questions.
If you can optimize infrastructure costs,
probably it's a good way to optimize.
Maybe you can save some jobs, right?
Yeah, it feels like we've gone through a boom period where a lot of people were throwing money at problems, maybe doing a lot
of things and incurring a lot of costs that they don't necessarily need longer term. So hopefully
in some bigger environments there's some big room for improvement. But also, I think for smaller companies, there are often ways of reducing costs as well. So yeah, I'm looking forward
to this one.
Right. So let's start with clouds, maybe. There is a good article, it's actually maybe two years old already, from Andreessen Horowitz. It's called The Cost of Cloud, a Trillion Dollar Paradox. And obviously clouds are a super interesting,
super revolutionary concept which changed engineering drastically, right? But the observation
is that with all those additional services, when companies start it's great, because it helps them
to focus on product development and finding product-market
fit, and so on, faster. But then, obviously, the bill grows significantly. Of course, the companies
like AWS, Google Cloud, Microsoft Azure and others, they win a lot, and the
market is growing, growing, growing. But if you right now do a simple exercise, as I did
recently, and take some, I don't know, medium-size server, for example an Intel or
AMD server with 16 cores and 128 gigabytes of memory, and compare the cost of EC2 instances
or GCP instances versus the price of the server
if you buy it, for example, right?
Of course, it's an extreme option right now.
Everyone wants right now
to just run an API call or run a Terraform task
and provision instances.
It's so good.
Nobody wants to deal with ordering servers and then replacing
disks; it's like a nightmare from the past. But if you compare those costs, or also there are
some low-cost providers like OVH or Hetzner, where you can rent servers and they will be replacing failed disks, CPUs and so on,
you will be wondering why the cloud bill is so high.
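As a rough sketch of that exercise, here is some back-of-envelope arithmetic. All the prices below are made-up placeholders, not real quotes; plug in current numbers from your cloud provider and hardware vendor.

```python
# Back-of-envelope: monthly cost of an on-demand cloud instance vs. a
# purchased server amortized over its useful life. All numbers below
# are illustrative assumptions, not real prices.

HOURS_PER_MONTH = 730

def cloud_monthly(hourly_rate):
    """On-demand cost for one instance running 24/7."""
    return hourly_rate * HOURS_PER_MONTH

def owned_monthly(purchase_price, lifetime_years, colo_and_power):
    """Purchase price spread over the server's lifetime, plus
    recurring colocation / power / maintenance costs."""
    return purchase_price / (lifetime_years * 12) + colo_and_power

# Hypothetical 16-core / 128 GB box:
cloud = cloud_monthly(0.80)          # assume ~$0.80/hr on demand
owned = owned_monthly(6000, 4, 150)  # assume $6k server, 4 years, $150/mo opex
print(f"cloud: ${cloud:.0f}/mo, owned: ${owned:.0f}/mo")
```

Rented bare metal from providers like OVH or Hetzner usually lands somewhere between these two extremes, since the rental fee already includes hardware replacement.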
So my first question when we talk about database infrastructure cost optimization, where do
we run it?
And the extreme optimization is to go with rented servers, or with your own data centers if you're a big company.
And you should compare the costs of maintenance, of course,
because if you use clouds, maintenance costs, ownership costs, and so on are included.
But it might make total sense, for example,
to go with rented servers or with your own data center:
for smaller teams, rented servers;
for larger teams, building your own data center.
But, of course, there are many, many questions here.
It's not a database-only topic, definitely.
But I would like to mention one important point here.
With Kubernetes and all the automation we have,
for example, we can run Postgres with Kubernetes,
or we can run it with Ansible easily.
We have bundles of Postgres, many options.
For example, postgresql_cluster:
one person from my team, Vitaly, maintains it.
It's a great product, open source.
You can run it from Ansible, and it has backups,
Patroni for auto-failover, everything. So you can manage Postgres yourself, or with Kubernetes. It's
full-fledged automation, and you can achieve a lot of good results at low cost here.
Did you see the recent blog post by the team at Basecamp, or at least by DHH?
Yeah, I think they have a nice setup. They were talking about exactly this. And it's
insane how big the bill is right now. If you grow, you pay a lot.
Of course, at this scale you should already negotiate large discounts with the cloud provider, obviously. And large ones, I mean like 70%. It's not 20 or 30; 70, 80% discounts, you should do it. If you pay more than a million, I think you already do.
100%, that's the kind of thing people are already doing. But it does feel like there are people starting to question: are we getting the benefits that the cloud promised? Are we still having to employ
people to do a lot of the infrastructure things, having to worry about the lower level things?
If you are, then are we really happy paying those premiums? But equally, I think that's a very,
very high effort way of potentially saving a lot of money.
And it feels like probably in the rest of the list we have, we're probably going to cover some things that are lower effort.
Maybe they don't save quite as much money on an individual item basis, but you might be able to do them sooner.
Right. I wanted to emphasize that this should be an exercise you do regularly, from time to time.
I mean, if you're a CTO or something, or the
database lead in your company. I don't say go bare metal with Kubernetes, but it should be
considered and analyzed when you do capacity planning. And I also must say, clouds and managed services
like RDS are awesome. All those clouds, every cloud is awesome: Aiven, Timescale Cloud, Supabase, Hasura, Neon,
all of them are great.
All of them provide a lot of value.
But you need to deal with pricing.
You need to analyze costs and consider alternatives.
And we helped several of our customers to make decisions.
And it's never an easy decision. So on one hand: to
go to RDS or not, to go to Aurora, or to go with Kubernetes; there are options there
as well, maybe another topic, right? So it's always a quite hard decision, because you think, okay, I
already have a table comparing all aspects of pricing,
all in my budgets, but maybe I forgot something, right? They do provide value, they have
good metrics, I mean good characteristics, in terms of uptime and durability. Will our infrastructure
team be able to handle it ourselves? If it's small, maybe no. If it's large, okay, maybe now it can, maybe later it can't.
So it's a good question.
But still my advice, if you are too small or too big, consider bare metal, definitely.
Or rented servers.
I mean, if you're too small, consider rented servers.
It can save you a lot, like a lot.
For example, instead of paying $1,000 for infra, you can easily pay $100.
But it will require work, yes, the use of some tools.
Yes, it may require a subscription to some services for support.
But overall, okay, you'll be paying $200 instead of $100.
But before, you paid $1,000 to the cloud, right?
So RDS is great.
And without RDS, for example, I think I wouldn't have come to the idea of creating thin cloning, Database Lab Engine. Because first I worked with RDS clones, and I understood how convenient it is to provision temporary machines to experiment, to develop better, and then delete them. And then, as an extreme case, thin clones were implemented. And I'm sure many will remain on cloud providers, and it will continue
growing. But in a crisis, it's one of the biggest questions: should you go self-managed Postgres, or
Postgres managed by some other company?
Yeah, and we do have a whole episode on this as well, right?
One of the earliest ones. If any new listeners are making this decision at the moment, that might be worth a listen.
But yeah, it feels like a huge decision.
Lots of work if you're having to migrate.
And maybe you don't have to go all or nothing, right?
Like, there are some services that are clearly running on very thin margins, like the storage services on the clouds. But the databases do feel like they have quite a high margin added to them, versus the S3-like
simple storage solutions, right?
Right. Also, if you go with cloud, there are many optimization
techniques. There are many; maybe we could discuss only this during a whole episode.
And "in cloud" also means self-managed versus a managed service like RDS.
Also, everywhere there are trade-offs and decisions to make.
So if you run self-managed in cloud,
of course you can benefit from various techniques to optimize costs.
For example, it's an extreme technique,
but some people use spot instances even to run databases.
And in AWS it's better than in GCP, for example.
I'm not sure about Azure;
I don't have enough experience with Azure.
Somehow our clients are mostly not on Azure.
Also, if you work on self-managed, there are a little bit more options to cut costs, to cut the bill.
If you go with reserved instances, for example, you can buy a contract for bigger instance types, and also convertible instances, for, say, a couple of years.
So you can then dynamically split large instances into pieces when you need smaller instances.
It's quite an interesting technique.
Also, I learned recently there is a secondary market.
You can sell those contracts and so on.
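As an illustration of the reservation math, here is a small sketch. The 40% discount and the $2/hr rate are hypothetical assumptions, since real reserved-instance terms (upfront vs. no-upfront, standard vs. convertible) vary.

```python
# Sketch of reserved-instance economics: a reservation is paid whether
# the instance runs or not, so the discount also defines the break-even
# utilization vs. on-demand. Discount and rate are assumed values.

HOURS_PER_MONTH = 730

def monthly_cost(hourly_on_demand, reserved_discount=0.0):
    """Monthly cost at a given on-demand rate, optionally discounted."""
    return hourly_on_demand * (1 - reserved_discount) * HOURS_PER_MONTH

def break_even_utilization(reserved_discount):
    """Fraction of the month you must actually run the instance for a
    reservation to beat paying on-demand for only the hours used."""
    return 1 - reserved_discount

on_demand = monthly_cost(2.0)                         # hypothetical $2/hr
reserved = monthly_cost(2.0, reserved_discount=0.40)  # assumed 40% discount
print(f"on-demand ${on_demand:.0f}/mo, reserved ${reserved:.0f}/mo, "
      f"break-even at {break_even_utilization(0.40):.0%} utilization")
```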
There are companies who focus on this cost optimization.
For example, I use the service ec2instances.info.
I don't know why AWS
doesn't provide a normal table with all prices on one page,
but these guys do a very good job, and the company
behind it right now is Vantage, vantage.sh. They provide an automated
service to optimize cloud costs for all big three providers. It's quite interesting. I hope we will
have a Postgres TV episode with them soon; we're already discussing it. So there are many options to
optimize there. Also, in terms of both compute and storage, I wanted to highlight one thing. There is
a serverless approach, which is very interesting, but you should very carefully analyze its pros and
cons. Because in general, it's great. I mean, you pay only for what you need.
But you need to understand your workload pattern.
If it's spiky, of course, it's great.
But if it's not spiky, there is some level at which, going serverless, the overhead you pay for the service itself is already too high.
So it's like also there are trade-offs.
You need to do a lot of work filling cells in some table,
with a calculator maybe, sourcing information from various places, and maybe some testing, right?
Sometimes paying a little bit to test some things is much better than...
I saw a blog post not that long ago. I can't remember who it was, so I'm going to find it and put it in the notes: from a company who did a switch to,
I think it was Aurora in this case,
from regular RDS,
and were really shocked at the increase in pricing
when they were expecting a decrease.
So it's...
Aurora. We're making a loop through topics here.
I wanted to discuss first:
clouds versus not clouds, and in clouds,
various types of contracts:
spot instances, reserved, convertible, many things to consider.
For example, I don't know why RDS doesn't have spot instances.
Well, okay, I understand.
It's stateful, but people do it, actually.
And if you lose compute, storage is still there, and Postgres will recover.
For non-production it's good, but they don't provide it.
But then, talking about Aurora versus not Aurora:
on Aurora, one interesting aspect is
the topic of query optimization.
When we prepared for this episode,
you raised query optimization naturally
as the first topic,
and I said, no, no, no.
Based on what I saw in larger companies,
it goes maybe last. But with Aurora,
maybe not last. Because on Aurora, you need to pay for I/O. And if your database doesn't fit
the buffer pool, shared buffers, and your queries are not optimized, they do a lot of work with disk.
In this case, your unoptimized queries hit your bill. So if you optimize them, you reduce your spending,
which is good. So query optimization on Aurora directly leads to cost
optimization, which is an interesting case, because of how the pricing is organized there.
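As a back-of-envelope illustration of that effect: with per-request I/O billing, sustained disk reads translate directly into dollars. The $0.20 per million I/O requests figure is the commonly cited Aurora standard rate, but verify it against current pricing for your region; the IOPS numbers here are invented.

```python
# Rough estimate of the monthly I/O bill under per-request pricing.
# PRICE_PER_MILLION_IO is the commonly cited Aurora standard rate;
# check current regional pricing before relying on it.

PRICE_PER_MILLION_IO = 0.20
SECONDS_PER_MONTH = 730 * 3600

def monthly_io_cost(billed_read_iops):
    """Monthly charge for a sustained rate of billed read I/O requests."""
    ios = billed_read_iops * SECONDS_PER_MONTH
    return ios / 1_000_000 * PRICE_PER_MILLION_IO

before = monthly_io_cost(5000)  # unoptimized queries: 5,000 billed IOPS
after = monthly_io_cost(500)    # after tuning: 10x fewer reads miss the cache
print(f"before: ${before:.0f}/mo, after: ${after:.0f}/mo")
```

So a 10x reduction in buffer-pool misses is roughly a 10x reduction in this line of the bill.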
Yeah. And actually, we probably should go back to some of your bigger-ticket items. You talked about some of these bigger companies.
You go in, and if you suggest query optimization, they come back and tell you that's a very fine-grained solution.
The bigger picture is that they've got loads of, I think you mentioned, even unused instances.
Of course, that's going to be a huge saving if you can just make sure you're catching those early.
Do you have any tips and tricks there?
Normally it's a lot of unused instances left after some experimenting, so you need to control those. If your organization is growing and you have, for example,
hundreds or thousands of nodes, of course, if some team provisions one and then forgets to shut it down, it's
not good. There are services which control cloud resources
But I wanted to mention that in non-production,
it's better to use thin cloning and branching.
So Database Lab Engine here,
like I need to advertise my service once again.
By the way, since we are entering a crisis,
I understood that there is a real need
for infrastructure cost optimization.
So the consulting wing of the Postgres.ai team
is also working right now on structuring
all the aspects of database cost optimization,
our topic today,
so we can help with this as well.
But in terms of tooling,
Database Lab Engine provides you,
with one machine, with
dozens of independent clones. And this can keep your bill basically constant: while you need to
run many clones, you keep your bill constant in terms of non-production databases. And for
benchmarking, it's slightly different. If it's heavy benchmarking, you need to run heavy tests,
utilizing all cores and so on.
Of course, then you need to provision full-size clones in the cloud.
And it's a good question whether you can do it temporarily and then shut it down.
But, big point, you must not forget to collect all artifacts:
logs, monitoring.
There are approaches that can be used.
For example, I can recommend Netdata,
because you can export the dashboard there and then you can remove the machine.
So you need to have centralized logs accumulating everything.
And you also need your monitoring to be able to remember metrics for instances which already don't exist,
and keep them longer. And Netdata with its export capability, unfortunately, it's manual for now,
because it's a front-end feature. It's great; automation is great, and so on. So it can be a
big cost saver if you start with non-production first,
But as for production, several things.
First, it's interesting that my opinion right now is that AMD is better than Intel these days
if you need a beefy server. For smaller servers, maybe it doesn't matter that much, but if you need 100-200 vCPUs,
if you need many hundreds
of gigabytes of RAM,
AMD is better at handling
OLTP workloads, because it provides
many more vCPUs
for less money.
If you compare
benchmarks, you will see that
EPYC Rome and Milan are beating many Xeon processors.
And of course, a big cost saver is ARM.
Graviton2 instances on AWS are good for smaller servers, but they're limited in the number of processors.
So for really large, heavily loaded databases, it might not be enough to have 64 cores maximum, right?
I don't remember off the top of my head, but it's limited.
It's not like the 96 or 244 vCPUs others already provide.
And if you need a lot of vCPUs, a lot of workload to handle, for example, 100,000 TPS on one node, AMD EPYC is a
good way to go, and it's cheaper in terms of power versus money. But if you have smaller
requirements, maybe you should consider ARM. And even on RDS, ARM is good, right? It's already
been there for a couple of years, maybe.
How much do you see people saving going from one to the other?
It can be dozens of percent.
Of course, sometimes it's hard to compare.
For example, if you run micro benchmarks,
for example, Sysbench is good for micro benchmarks to check CPU, to check RAM, and then FIO for I.O.
So Sysbench and FIO is my two tools to use for micro benchmarking.
You will make conclusion probably that AMD is good if you need a lot of cores.
But if you go to database level benchmarking, the question is how, right?
Because if you have like, this is different question, how to benchmark, maybe another
episode, right?
But in general, your benchmark should be good enough to make conclusions.
Because if you just run synthetic PGBench, it's not enough.
It will be not representative for, it will be very far from your real workload.
So you need to take care of proper benchmarking.
And as for ARM, there is also a certain amount of saving. I don't remember
the numbers. We did it when those instances appeared, quite long ago, for one customer. And they started to use them,
but for those instances which need smaller workloads, not like 100,000 TPS.
Yeah, you say smaller workloads. I say normal workloads.
Lower, lower.
For the 90%?
Yeah, 99%.
Like less than 1,000 TPS, it's normal.
Okay.
Maybe it's good.
Maybe you have 1,000 microservices and you shrink everything,
cut everything into small pieces,
and this system is more resilient, right? If you cut it into pieces.
Well, that's a good point.
But if you have a huge monolith, it can be hard.
You mentioned microservices.
And one thing I didn't have in mind, but I wonder if you see it,
is whether, with the push to microservices and having lots of databases,
maybe a database per application, there's room for savings
if people combine some of those.
Well, in general, the rule applies here: if you have some strict boundaries between some
resources, if these boundaries are not elastic and cannot shift one way or the other,
you will start paying more. It's like with disks. If you need to have several volumes on one disk,
you need to control several free-disk-space numbers, right?
And already that's some overhead in terms of managing it.
And it will probably be less efficient,
because somewhere it can be 40% free space,
somewhere 70%. But if,
for example, it can be elastically united, like with ZFS, if you use datasets, you
have one single free-disk-space number, and you just control that it stays above 20, 30, 40 percent always,
and all datasets use what they need to use.
That's it, right?
In this case, the same applies to splitting into microservices.
Of course, if you put each database on a separate instance
and provision resources for it,
some will be underutilized,
some will need to grow.
Of course, this is how the ideas of serverless and autoscaling
appear. It's great that these ideas exist. But they also have some extra overhead,
and they won't work in your own data center, if you follow my extreme advice to consider a data center,
right? So this is an interesting topic. Of course, if your microservices are small,
and in the future you might consider splitting them to separate instances, at the beginning
you might combine them in a single Postgres cluster as logical databases, right? In this case,
they will share a single instance, single resources: RAM and CPU as well. Right now,
one of them is using more CPU, it's okay. Then
another is using more CPU, it's okay. On average we have this number; we control only one metric,
and in terms of SLA and so on, we control only one metric here. It's good, it's elasticity, right?
In terms of CPU, and in terms of RAM as well: only one shared_buffers, one page cache number, and so on. And a spike
in one of them adjusts, it's okay. But if you split them into physically separate instances,
you have more metrics to control, and probably cost efficiency drops here, of course.
Yeah, makes sense. Are there any other big-ticket items you had?
Of course, the biggest-ticket item from any DBA or database expert:
query optimization. It's probably at the top of the list, right?
This is a great topic, and sometimes it saves a lot.
My advice, if you have many nodes: first of all, control
pg_stat_statements, as well as, of course, regular metrics for CPU, RAM, disk,
everything, to see how far from saturation we are. But if you take, for example, pg_stat_statements,
it can already provide you an interesting metric. Each second, how many seconds are spent on query
processing, right? This is quite
simple but interesting
advice. You compare this metric, seconds
per second, to, for example,
say, two, right? It means that
roughly we need two cores, very roughly.
Of course, there are context switches,
other nuances, but
we need roughly two cores.
But in this case, we would already be kind of saturated, right?
If we give it four cores, we are like 50% utilized.
And of course, you don't want to be 50% utilized unless it's peak hours.
Yeah.
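The seconds-per-second idea can be sketched like this: take two snapshots of the summed total_exec_time from pg_stat_statements and divide the delta by the interval. The snapshot values below are made-up sample numbers.

```python
# "Query seconds per wall-clock second": the delta of
# sum(total_exec_time) from pg_stat_statements across two snapshots,
# normalized by the snapshot interval. A value of ~2 means the server
# is doing roughly two cores' worth of query processing.

def query_seconds_per_second(total_ms_t0, total_ms_t1, interval_s):
    """total_exec_time is reported in milliseconds, hence the /1000."""
    return (total_ms_t1 - total_ms_t0) / 1000.0 / interval_s

# Two snapshots of sum(total_exec_time), taken 60 seconds apart:
load = query_seconds_per_second(9_000_000.0, 9_120_000.0, 60.0)
print(f"~{load:.1f} query-seconds per second, roughly {load:.0f} cores busy")
# With this load, a 4-core instance would sit around 50% utilized.
```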
So this is top-level query optimization.
Then you dive into details, understand which queries are consuming most of the time,
and try to optimize them.
And in this case, if your optimization is targeted at resources,
to help save resources, you need to order by total time, in my opinion.
And there are many details inside.
We had an episode about query analysis,
macro analysis from top to bottom.
And then tools like pgMustard
can help to explain some queries
and understand how to optimize them.
Yeah, I was actually looking at the...
So total time, I completely agree.
I haven't put it in a blog post yet,
but I've written a query now
to do it for total buffers as well. So if you're on Aurora, you might prefer to start by looking
at total buffers instead of total time. I thought that was an interesting different angle you
could take. But yeah, that makes sense. And I think something people don't even consider, and
my tool doesn't help with, is: do you even need to be doing that query? Like, from an application level,
how often do you need to be running these things? Do you have things that could
be materialized, or things that could be done less frequently? So there's a whole
host of optimization at the macro level that you can do.
Well, if, moving by total time, you
find some query which has high frequency, so calls is very high, but timing is very low,
in this case, it's better to apply not a query optimization technique, but an application optimization technique: maybe call it less often, have a cache maybe, and so on.
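That triage can be sketched like this: rank pg_stat_statements rows by total time for resource-focused tuning, and separately flag high-frequency, individually fast queries as application-level candidates (caching, calling less often). The sample rows and the 1 ms threshold are arbitrary examples.

```python
# Macro-level triage of pg_stat_statements rows: order by total time,
# then pick out queries whose time comes from sheer call frequency,
# since those respond better to app-side fixes than query tuning.

rows = [  # sample data shaped like (query, calls, total_exec_time in ms)
    {"query": "SELECT * FROM orders WHERE id = $1", "calls": 9_000_000, "total_ms": 450_000.0},
    {"query": "SELECT ... big reporting join ...", "calls": 1_200, "total_ms": 600_000.0},
    {"query": "UPDATE sessions SET ...", "calls": 3_000_000, "total_ms": 90_000.0},
]

def rank_by_total_time(rows):
    """Resource-focused ordering: biggest total time first."""
    return sorted(rows, key=lambda r: r["total_ms"], reverse=True)

def app_level_candidates(rows, max_mean_ms=1.0):
    """Individually fast but very frequent: candidates for caching or
    simply calling less often, rather than per-query optimization."""
    return [r for r in rows if r["total_ms"] / r["calls"] < max_mean_ms]

for r in rank_by_total_time(rows):
    print(f'{r["total_ms"]:>9.0f} ms  {r["calls"]:>9} calls  {r["query"]}')
print("cache candidates:", [r["query"] for r in app_level_candidates(rows)])
```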
And I think this is most relevant for folks who are getting close to a boundary. Let's say you're on a cloud provider, say RDS, and you're getting close to the point where you think you might need to
upsize the instance. I've seen people avoid doing so by doing a whole host of query optimization.
But I haven't seen many folks go down an instance size, and it feels like that should
be possible, right? If you're underutilized, why can't you? I just haven't
seen it much in the last few years.
For example, there is some issue with performance.
We come help optimize queries, and then they go down in terms of instance size.
This is normal.
Nice.
Yeah, great.
That feels like it should be possible.
And so if in the past few years you've thrown money at a performance issue.
Or, for example, even more, they remove a couple of standby nodes
because they are not needed anymore because queries are also well optimized.
Nice. So reducing like read replicas, that kind of thing.
Yeah, because read replicas mirror everything.
It's like redundant storage.
It's good for HA, but in terms of performance,
probably it's not good, because you cannot access the other host's memory, right?
So you keep the same memory.
It's good for scaling read-only queries,
but in general, in terms of resource usage,
it's not a very optimal way to scale, actually.
Yeah, super interesting.
If people are still listening from really small companies,
I think it's also worth mentioning that
a lot of people don't even realize quite how many credits are available for startups.
Huge amounts of cloud credits, things
that could be thousands or tens of thousands,
that you haven't used or haven't even applied for.
But yeah, I didn't have much else. Anything else on your list?
I could continue, but I need to drop off to another call, I'm sorry. Let's wrap up here.
Well, thank you everybody for listening. Thank you,
and see you next week. See you. Bye-bye. Thank you. Bye.