Postgres FM - Infra cost optimization

Episode Date: January 27, 2023

Here are links to a few things we mentioned:

- The Cost of Cloud, a Trillion Dollar Paradox (blog post from Andreessen Horowitz)
- OVH
- Hetzner
- postgresql_cluster
- Why we're leaving the cloud (blog post by DHH from Basecamp)
- Managed services vs. DIY episode
- ec2instances.info
- Vantage
- Postgres TV episode with Everett Berry from Vantage
- Migrating to Aurora: easy except the bill (blog post by Kimberly Nicholls from Gridium)
- Database Lab Engine
- Postgres.ai consulting
- Netdata
- sysbench
- fio
- Query macro analysis episode
- Top queries by buffers (Gist from Michael)

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc. If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:

- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:

- Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai. Hey Nikolay, what are we talking about today? Let's talk about optimization of budget. First of all, for infrastructure costs. I mean, if you run a database and it grows, obviously it increases your bill, usually a cloud bill, sometimes not cloud, of course. And there's a question which becomes more and more acute: how to pay less for it while continuing to grow in terms of your company's growth, how to reduce the costs. And I think it's a topic that can be interesting for both infrastructure and engineering teams.
Starting point is 00:00:42 And it's becoming more and more important because we have a crisis already. A lot of people, unfortunately, are being fired. And this is one of the questions: if you can optimize infrastructure costs, probably it's a good way to optimize. Maybe you can save some jobs, right? Yeah, it feels like we've gone through a boom period where a lot of people were throwing money at problems, maybe upgrading, maybe doing a lot
Starting point is 00:01:14 of things, and having a lot of costs that they don't necessarily need longer term. So hopefully in some bigger environments there's some big room for improvement. But also, I think for smaller companies, there are often ways of reducing costs there as well. So yeah, I'm looking forward to this one. Right. So let's start with clouds, maybe. There is a good article, it's actually maybe two years old already, from Andreessen Horowitz. It's called The Cost of Cloud, a Trillion Dollar Paradox. And obviously clouds are a super interesting, super revolutionary concept which changed engineering drastically, right? But the observation is that, with all those additional services, when companies start it's great, because it helps them to focus on product development and finding product-market fit and so on faster. But then obviously the bill grows significantly. Of course, the companies
Starting point is 00:02:13 like AWS, Google Cloud, Microsoft Azure and others, they win a lot. And if we consider the market, it's growing, growing, growing. But if you right now do a simple exercise, I did it recently: take some medium-size server, for example an Intel or AMD server with 16 cores and 128 gigabytes of memory, and compare the costs of EC2 instances or GCP instances versus the price of the server if you buy it, for example, right? Of course, it's the extreme right now. Everyone wants right now
Starting point is 00:02:56 to run just an API call or run a Terraform task and provision instances. It's so good. Nobody wants to deal with ordering servers and then replacing disks. It's a nightmare from the past. But if you compare those costs, or, also, there are some low-cost providers like OVH or Hetzner where you can rent servers and they will be replacing failed disks, CPUs and so on, you will be wondering why it's so high. So my first question, when we talk about database infrastructure cost optimization, is where do
Starting point is 00:03:37 we run it? And the extreme optimization is to go with rented servers, or with your own data center if you're a big company. And compare the costs of maintenance, of course, because if you use clouds there are maintenance costs, ownership costs, and so on. But it might make total sense, for example, to go with rented servers or with your own data center. For smaller teams, rented servers. For larger teams, your own data center to build it.
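Nikolay's pricing exercise can be sketched as a quick back-of-the-envelope calculation. Every number below is a hypothetical placeholder, not a quote from any provider; substitute current prices from your own vendors before drawing conclusions.

```python
# A rough sketch of the monthly-cost comparison described above:
# an always-on cloud instance vs. buying comparable hardware and
# amortizing it. All figures are ASSUMED for illustration only.

def monthly_costs(on_demand_hourly, server_price, amortize_months, hosting_monthly):
    """Return (cloud, owned) monthly cost for the same hardware class."""
    cloud = on_demand_hourly * 24 * 30                        # on-demand instance, always on
    owned = server_price / amortize_months + hosting_monthly  # amortized purchase + colo/power
    return cloud, owned

# Assumed figures for a 16-core / 128 GB box:
cloud, owned = monthly_costs(
    on_demand_hourly=1.0,   # cloud instance of comparable size
    server_price=6000,      # buying the hardware outright
    amortize_months=36,     # 3-year amortization
    hosting_monthly=150,    # power, rack space, remote hands
)
print(f"cloud: ${cloud:.0f}/mo, owned: ${owned:.0f}/mo")
```

Even with generous allowances for hosting and support, this kind of table usually shows a large gap, which is the "you will be wondering why it's so high" moment.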
Starting point is 00:04:10 But, of course, there are many, many questions here. It's not a database-only topic, definitely. But I would like to mention one important key here. With Kubernetes and all the automation we have, we can run Postgres with Kubernetes, or we can run it with Ansible easily. We have a bundle of Postgres, many options. For example, postgresql_cluster,
Starting point is 00:04:36 one person from my team, Vitaly, maintains it. It's a great product, open source. You can run it from Ansible, and it has backups, Patroni for auto-failover, everything. So you can manage Postgres yourself, or with Kubernetes. It's full-fledged automation, and you can achieve a lot of good results with low cost here. Did you see the recent blog post by the team at Basecamp, or at least DHH? I think they have a nice setup. Yeah, they were talking about exactly this. And it's insane how big the bill is right now. If you grow, you pay a lot.
Starting point is 00:05:20 Of course, at this scale, you already should negotiate large discounts with the cloud provider, obviously. 100%. And large ones. I mean, like 70%. It's not like 20 or 30: a 70-80% discount, you should do it. If you pay more than $1 million, I think you already do it. That's the kind of thing people are already doing. But it does feel like there are people starting to question: are we getting the benefits that the cloud promised? Are we still having to employ people to do a lot of the infrastructure things, having to worry about the lower-level things? If we are, then are we really happy paying those premiums? But equally, I think that's a very, very high-effort way of potentially saving a lot of money. And it feels like, in the rest of the list we have, we're probably going to cover some things that are lower effort. Maybe they don't save quite as much money on an individual-item basis, but you might be able to do them sooner. Right. I wanted to emphasize that this should be an exercise you do constantly, from time to time. I mean, if you're a CTO or a database
Starting point is 00:06:26 lead in your company. I don't say go bare metal with Kubernetes, but it should be considered and analyzed when you do capacity planning. And I also must say, clouds and managed services like RDS are awesome. All those clouds, every cloud is awesome. Aiven, Timescale Cloud, Supabase, Hasura, Neon, all of them are great. All of them provide a lot of value. But you need to deal with pricing. You need to analyze costs and consider alternatives. And we helped several of our customers to make decisions.
Starting point is 00:07:02 And it's never an easy decision. To go to RDS or not to go to RDS, to go to Aurora, or to go with Kubernetes, there are options there as well, maybe another topic, right? So it's always quite a hard decision, because you think: okay, I already have a table comparing all aspects of pricing, all in my budget, but maybe I forgot something, right? They do provide value. They have good characteristics in terms of uptime and durability. Will our infrastructure team be able to handle it? If it's small, maybe no. If it's large, okay, maybe now it can, maybe later it can't. So it's a good question.
Starting point is 00:07:48 But still, my advice: if you are too small or too big, consider bare metal, definitely. Or rented servers. I mean, if you're too small, consider rented servers. It can save you a lot, like, a lot. For example, instead of paying $1,000 for infra, you can pay $100 easily. But yes, it will require the use of some tools. Yes, it will require maybe some subscription to some services for support. But overall, okay, you'll be paying $200 instead of $100.
Starting point is 00:08:16 But before, you paid $1,000 to the cloud, right? So RDS is great. And without RDS, for example, I think I wouldn't have come to the idea of creating the thin-cloning Database Lab Engine. Because first I worked with RDS clones, and I understood how convenient it is to provision temporary machines to experiment, to develop better, and then to delete them. And then, as an extreme case, thin clones were implemented. And I'm sure many will remain on cloud providers, and it will continue growing. But in a crisis, it's one of the biggest questions: should you go self-managed Postgres, or Postgres managed by some other company? Yeah, and we do have a whole episode on this as well, right? One of the earliest ones. If you've got new listeners making this decision at the moment, that might be worth a listen. But yeah, it feels like a huge decision.
Starting point is 00:09:10 Lots of work if you're having to migrate. And maybe you don't have to go all or nothing, right? Like, there are some services that are clearly running on very thin margins, like the storage services on the clouds. But the databases do feel like they have quite a high margin added to them, versus the, like, S3-style simple storage services, right? Also, if you go with cloud, there are many optimization techniques; maybe we could discuss only this during a whole episode. Also, being in the cloud can mean self-managed versus a managed service like RDS. Everywhere there are trade-offs and decisions to make. So if you run self-managed in the cloud,
Starting point is 00:09:54 of course you can benefit from various techniques to optimize costs. For example, it's an extreme technique, but some people use spot instances even to run databases. And in AWS it's better than in GCP, for example. I'm not sure about Azure, I don't have enough experience with Azure; somehow our clients are mostly not on Azure. Also, if you work self-managed, there are a few more options to cut costs, to cut the bill.
Starting point is 00:10:26 If you go with reserved instances, for example, you can buy a contract for bigger instance types, and there are also convertible instances, with commitments of a couple of years. So you can then dynamically split large instances into pieces when you need smaller instances. It's quite an interesting technique. Also, I learned recently there is a secondary market: you can sell those contracts and so on. There are companies who focus on this cost optimization. For example, I use the service ec2instances.info. I don't know why AWS
Starting point is 00:10:59 doesn't provide a normal table with all the prices on one page, but these guys do a very good job. And the company behind it right now is vantage.sh. So they provide an automated service to optimize cloud costs for all three big providers. It's quite interesting. I hope we will have a Postgres TV episode with them soon; we're already discussing it. So there are many options to optimize there. Also, in terms of compute and storage, both, I wanted to highlight one thing. There is a serverless approach, which is very interesting, but you should very carefully analyze its pros and cons. Because in general, it's great. I mean, you pay only for what you need.
Starting point is 00:11:47 But you need to understand your workload pattern. If it's spiky, of course, it's great. But if it's not spiky, there is some level at which, going serverless, the overhead you pay for the service itself is already too high. So there are trade-offs there as well. You need to do a lot of work filling cells in some table, with a calculator maybe, sourcing information from various places, and maybe some testing, right? Sometimes paying a little bit to test some things is much better. I saw a blog post not that long ago, I can't remember who wrote it, so I'm going to find it and put it in the notes, from a company who did a switch to,
Starting point is 00:12:27 I think it was Aurora in this case, from regular RDS, and were really shocked at the increase in pricing when they were expecting a decrease. So it's... Aurora. We're making some loops with topics. I wanted to discuss first clouds versus not clouds, and, in clouds,
Starting point is 00:12:43 various types of contracts: spot instances, reserved, convertible, many things there. For example, I don't know why RDS doesn't have spot instances. Well, okay, I understand, it's stateful. But people do it, actually. And if you lose compute, the storage is there, and Postgres will recover. For non-production, it's good, but they don't provide it. But then, talking about Aurora or not Aurora.
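Going back to the serverless trade-off mentioned a moment ago, it can be sketched as a break-even calculation: serverless bills per unit of actual work, but at a premium, so it only wins below some average utilization. All rates below are invented for illustration, not real provider prices.

```python
# Serverless vs. provisioned: a flat monthly price vs. pay-per-use at a
# premium. Below the break-even utilization, serverless is cheaper;
# above it, the per-unit premium dominates. Invented rates, not quotes.

def breakeven_utilization(provisioned_monthly, serverless_monthly_at_full_load):
    """Average utilization at which both bills are equal."""
    return provisioned_monthly / serverless_monthly_at_full_load

# e.g. flat $500/mo provisioned vs. $1500/mo if serverless ran at 100%:
u = breakeven_utilization(500, 1500)
print(f"serverless wins below {u:.0%} average utilization")
```

With a spiky workload idling most of the day this easily favors serverless; with a steady workload near full utilization, the flat rate wins.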
Starting point is 00:13:06 On Aurora, one interesting aspect is the topic of query optimization. When we prepared for this episode, you raised query optimization naturally as the first topic, and I said no, no, no, no, no. Based on what I saw in larger companies,
Starting point is 00:13:23 it goes maybe last. But with Aurora, maybe not last. Because on Aurora, you need to pay for I/O. And if your database doesn't fit into the buffer pool, shared buffers, and your queries are not optimized, they do a lot of work with disk. In this case, your unoptimized queries hit your bill. So if you optimize them, you reduce your spending, which is good. So query optimization on Aurora directly leads to cost optimization, which is an interesting case, because this is how pricing is organized there. Yeah. And actually, we probably should go back to some of your bigger-ticket items. You talked about some of these bigger companies. You go in and if you suggest query optimization, they come back and tell you that's a very fine-grained solution.
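Nikolay's Aurora point can be made concrete with a small estimate. The $0.20 per million I/O requests figure below matches Aurora's commonly cited I/O rate around the time of recording, but treat it as an assumption and check current pricing; the IOPS numbers are made up.

```python
# Why query optimization cuts an Aurora-style bill directly: reads that
# miss the buffer pool are billed per million I/O requests.

def monthly_io_cost(read_iops_avg, price_per_million=0.20):
    """Estimated monthly I/O charge for a sustained average read rate."""
    ios_per_month = read_iops_avg * 3600 * 24 * 30
    return ios_per_month / 1_000_000 * price_per_million

# An unoptimized workload averaging 5,000 read IOPS around the clock,
# vs. the same workload after indexing / query fixes bring it to 500:
print(f"${monthly_io_cost(5000):.0f}/mo -> ${monthly_io_cost(500):.0f}/mo")
```

An index that turns sequential scans into a few buffer hits shrinks this line of the bill directly, which is the unusual part compared with flat instance-based pricing.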
Starting point is 00:14:13 The bigger picture is that they've got loads of, I think you mentioned, even unused instances. Of course, that's going to be a huge saving if you can just make sure you're catching those early. Do you have any tips and tricks there? Normally it's usually a lot of unused instances left after some experimenting, so you need to control those. If your organization is growing and you have, for example, hundreds or thousands of nodes, and some team provisioned one and then forgot to shut it down, it's not good. There are services which control cloud resources and help you find those which are not used enough. But I wanted to mention that in non-production, it's better to use thin cloning and branching.
Starting point is 00:14:54 So, Database Lab Engine here. I need to advertise my service once again. By the way, since we are entering a crisis, I understood that there is a real need for infrastructure cost optimization. So the consulting wing of the Postgres.ai team is also working right now on structuring all the aspects of database cost optimization,
Starting point is 00:15:17 our topic today, so we can help with this as well. But in terms of tooling: Database Lab Engine provides you with one machine on which you have dozens of independent clones. And this can keep your bill basically constant. While you need to run many clones, you keep your bill constant in terms of non-production databases. And for benchmarking, it's slightly different. If it's heavy benchmarking, you need to run heavy tests,
Starting point is 00:15:47 like utilizing all cores and so on. Of course, you need to provision full-size clones in the cloud. And it's a good question whether you can do it temporarily and then shut it down. But, big point: you need to not forget to collect all the artifacts, logs, monitoring. There are approaches that can be used. For example, I can recommend Netdata, because you can export the dashboard there and then you can remove the machine.
Starting point is 00:16:15 So you need to have centralized logs accumulating everything. And you also need your monitoring to be able to remember metrics for instances which already don't exist, and to keep them longer. And Netdata with its export capability, unfortunately it's manual for now, because it's a front-end feature, but it's great, and automation is great, and so on. So it can be a big cost saver if you start with non-production first, to keep the bill sane there. But as for production, several things. First, it's interesting that my opinion right now is that AMD is better than Intel these days.
Starting point is 00:16:58 If you need a beefy server. For smaller servers, maybe it doesn't matter that much, but if you need 100-200 vCPUs, if you need many hundreds of gigabytes of RAM, AMD is better at handling OLTP workloads, because it provides many more vCPUs for less money. If you compare
Starting point is 00:17:21 benchmarks, you will see that AMD EPYC Milan is beating many Xeon processors. And of course, a big cost saver is ARM. Graviton2 instances in AWS are good for smaller servers, but they are limited in the number of processors. So for really large, heavily loaded databases, it might not be enough to have 64 cores maximum, right? I don't remember off the top of my head, but it's limited. It's not like the 96 or 244 vCPUs others already provide. And with AMD, if you need a lot of vCPUs, a lot of workload to handle, for example a lot of TPS on one node, AMD EPYC is a
Starting point is 00:18:07 good way to go. And it's cheaper in terms of compute power per dollar. But if you have smaller requirements, maybe you should consider ARM. And even on RDS, ARM is good, right? So it's already been there for a couple of years, maybe. How much do you see people saving going from one to the other? It can be dozens of percent. Of course, sometimes it's hard to compare. For example, if you run micro-benchmarks: sysbench is good for micro-benchmarks to check CPU and to check RAM, and then fio for I/O.
Starting point is 00:18:44 So sysbench and fio are my two tools for micro-benchmarking. You will probably conclude that AMD is good if you need a lot of cores. But if you go to database-level benchmarking, the question is how, right? That's a different question, how to benchmark, maybe another episode, right? But in general, your benchmark should be good enough to make conclusions. Because if you just run synthetic pgbench, it's not enough. It will be very far from your real workload.
Starting point is 00:19:18 So you need to take care of proper benchmarking. And as for ARM, there is also a certain amount of saving. I don't remember the numbers. We did it when those instances appeared, quite long ago, for one customer. And they started to use them, but for those instances which need smaller workloads, not like 100,000 TPS. Yeah, you say smaller workloads. I say normal workloads. Lower, lower. For the 90%? Normal, okay. Yeah, 99%.
Starting point is 00:19:46 Like, less than 1,000 TPS is normal. Okay. Maybe it's good. Maybe you have 1,000 microservices and you shrink everything, cut everything into small pieces, and this system is more resilient, right? If you cut it into pieces. Well, that's a good point. But if you have a huge monolith, it can be hard.
Starting point is 00:20:09 You mentioned microservices. And one thing I didn't have in mind, but I wonder if you see it, is whether the push to microservices and having lots of databases, maybe a database per application, whether there's room for savings if people combine some of those. Well, in general, the rule applies here: if you have some strict boundaries between some resources, and these boundaries are not elastic, they cannot shift one way or the other, you will start paying more. It's like with disks:
Starting point is 00:20:48 if you have the need to have several volumes on one disk, you need to control several free-disk-space numbers, right? And already that's some overhead in terms of managing it. And it will probably also be less efficient, because somewhere it can be 40%, somewhere it's 70% free disk space. But if, for example, it can be elastically united, like, for example, with ZFS, if you use datasets, you have one single free-disk-space number, and you just control it to stay above 20-30-40% always, and all datasets are using what they need to use.
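The elasticity argument is easy to see with numbers. The sketch below uses made-up usage samples for three databases whose spikes don't coincide: sized separately, each volume must fit its own peak plus headroom, while a shared pool (as with ZFS datasets) only needs headroom on the peak of the total.

```python
# Made-up disk usage samples (GB) at three points in time; the spikes
# of the three databases do not line up.
db_usage = [
    [100, 100, 300],   # db1 spikes at t2
    [200, 400, 200],   # db2 spikes at t1
    [150, 150, 150],   # db3 is flat
]
reserve = 0.30  # keep 30% free space as headroom

# Separate volumes: each must be sized for its own peak, plus headroom.
separate = sum(max(series) for series in db_usage) / (1 - reserve)
# One shared pool: only the peak of the *total* needs headroom.
pooled = max(sum(t) for t in zip(*db_usage)) / (1 - reserve)
print(f"separate volumes: {separate:.0f} GB, shared pool: {pooled:.0f} GB")
```

The less the peaks coincide, the bigger the gap between the two totals; the same logic carries over to CPU and RAM when combining small databases into one cluster.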
Starting point is 00:21:26 That's it, right? In this case, the same applies to splitting into microservices. Of course, if you put each database on a separate instance and provision resources for it, some will be underutilized and some will need to grow. Of course, this is how the ideas of serverless and autoscaling appear. It's great that these ideas exist. But they also have some extra overhead,
Starting point is 00:21:54 and they won't work in your own data center, if you follow my extreme advice to consider a data center, right? So this is an interesting topic. Of course, if your microservices are small, then in the future you might consider splitting them onto separate instances. But to begin with, you might combine them in a single Postgres cluster as logical databases, right? In this case, they will share a single instance, a single pool of resources, for example RAM and CPU. Right now one of them is using more CPU, it's okay; then another is using more CPU, it's okay. On average we have this number, and we control only one metric; in terms of SLAs and so on, we control only one metric here. It's good. It's elasticity, right?
Starting point is 00:22:38 In terms of CPU, and in terms of RAM as well: only one shared_buffers setting, one page cache number, and so on. And a spike in one of them, it's okay. But if you split them into physically separate instances, you have more metrics to control, and cost efficiency probably drops, of course. Yeah, makes sense. Are there any other big-ticket items you had? Of course, the biggest-ticket item from any DBA or database expert is query optimization. It's probably our forte, right? This is a great topic, and sometimes it saves a lot.
Starting point is 00:23:21 My advice, if you have many nodes, is first of all to control pg_stat_statements, in addition, of course, to regular metrics for CPU, RAM, disk, everything, to see how far from saturation we are. But if you take pg_stat_statements, it can provide you an interesting metric already: each second, how many seconds are spent on query processing, right? This is quite simple but interesting advice. You compare this metric, seconds
Starting point is 00:23:53 per second, to, for example, say, two, right? It means that roughly we need two cores, very roughly. Of course, there are context switches and other nuances, but roughly we need two cores. But in this case, we would already be kind of saturated, right? If we give it four cores, we are like 50% utilized.
Starting point is 00:24:13 And of course, you don't want to be 50% utilized unless it's during peak hours. Yeah. So this is top-level query optimization. Then you dive into the details and understand which queries are consuming most of the time, and you try to optimize them. And in this case, if your optimization is targeted at resources, at helping to save resources, you need to order by total time, in my opinion.
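The "seconds per second" check and the order-by-total-time advice can be sketched like this. The column names assume pg_stat_statements on Postgres 13 or newer (total_exec_time, reported in milliseconds); on older versions the column is total_time. The sample numbers are made up.

```python
# Sample this twice, N seconds apart, with any Postgres driver:
TOTALS_SQL = "SELECT coalesce(sum(total_exec_time), 0) FROM pg_stat_statements"

# Then see which queries to attack first, ordered by total time:
TOP_SQL = """
SELECT queryid, calls, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
"""

def seconds_per_second(total_ms_before, total_ms_after, interval_s):
    """How many seconds of query execution happen per wall-clock second,
    i.e. roughly how many cores the workload keeps busy."""
    return (total_ms_after - total_ms_before) / 1000.0 / interval_s

# Two made-up samples taken 10 s apart: 20,000 ms of execution in 10 s
# of wall clock means roughly two cores are kept busy; on a 4-vCPU
# instance that is already ~50% utilized.
load = seconds_per_second(1_000_000, 1_020_000, 10)
print(f"~{load:.1f} cores busy")
```

This is only a rough proxy (context switches, parallel workers, and waits all blur the picture), but it is a cheap first pass before diving into individual query plans.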
Starting point is 00:24:44 And there are many details inside. We had an episode about query analysis, macro analysis, from top to bottom. And then tools like pgMustard can help to explain some queries and understand how to optimize them. Yeah, I was actually looking at the... So, total time: I completely agree,
Starting point is 00:25:01 but, while I haven't put it in a blog post yet, I've now written a query to do it for total buffers as well. So if you're on Aurora, you might prefer to start by looking at total buffers instead of total time. I thought that was an interesting different angle you could take. But yeah, that makes sense. And I think something people don't even consider, and my tool doesn't help with, is: do you even need to be doing that query? Like, from an application level, how often do you need to be running these things? Do you have things that could be materialized, or things that could be done less frequently? So there's a whole
Starting point is 00:25:34 host of optimization at the macro level. Well, if you're ordering by total time and you find some query with very high frequency, so calls is very high but timing is very low, in this case it's better to apply not a query optimization technique but an application optimization technique: maybe call it less often, maybe have a cache, and so on. And I think this is most relevant for folks who are getting close to thinking they might need, let's say you're on a cloud provider, let's say you're on RDS, and you're getting close to a boundary where you think you might need to upsize the instance. I've seen people avoid doing so through a whole host of query optimization, but I haven't seen many folks go down an instance size. It feels like that should be possible, right? If you're underutilized, why can't you? I just haven't seen it much in the last few years. I saw it many times.
Starting point is 00:26:26 For example, there is some issue with performance. We come help optimize queries, and then they go down in terms of instance size. This is normal. Nice. Yeah, great. That feels like it should be possible. And so if in the past few years you've thrown money at a performance issue. Or, for example, even more, they remove a couple of standby nodes
Starting point is 00:26:51 Or, for example, even more: they remove a couple of standby nodes because they are not needed anymore, because the queries are well optimized. Nice. So reducing read replicas, that kind of thing. Yeah, because read replicas mirror everything, and it's redundant storage. It's good for HA, but in terms of performance it's probably not good, because you cannot access the other host's memory, right? So you keep the same memory. It's good for scaling read-only queries,
Starting point is 00:27:16 but in general, in terms of the use of resources, it's not very optimal, actually, as a way to scale. Yeah, super interesting. For any smaller... if people are still listening from really small companies, I think it's also worth mentioning things like: a lot of people don't even realize quite how many credits are available for startups. Huge amounts of cloud credits and things, it could be thousands or tens of thousands
Starting point is 00:27:41 that you haven't used or haven't even applied for. But yeah, I didn't have much else. Anything else on your list? I could continue, but I need to drop off to another call, I'm sorry. Let's wrap up here. Well, thank you everybody for listening. Thank you, and see you next week. See you, bye-bye. Thank you, bye.
