Postgres FM - 102 Query optimization
Episode Date: October 7, 2022

Here are links to a few things we mentioned:
pg_plan_advsr
How partial, covering, and multicolumn indexes may slow down UPDATEs (blog post by Nikolay)
Why Uber Switched from Postgres to MySQL (blog post)
pganalyze index advisor
Nancy bot (project is not active)
pgreplay
pgreplay-go
Real Application Testing on YugabyteDB with pgreplay (blog post by Franck Pachot)
pg_query
Database Lab thin clones
Migrating to Aurora: easy except the bill (blog post by Kimberley Nicholls)

------------------------

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @PostgresFM or by commenting on our topic ideas Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Hello, let's talk about query optimization, like the second step, or maybe third step, right? The first step is that you found some queries which don't behave well.
The second step is that you found some optimization idea.
And what's next?
Yeah, exactly.
How do we know whether it'd be safe to make the change that we're thinking about,
whether it will make a difference, that kind of thing.
Yep.
And this is one of the topics our listeners requested.
Yeah, exactly.
I think they phrased it quite nicely.
It was like a 102, I think they said, like the American class in terms of university courses, right? Is that how it works? 101 is the intro and then 102 is the next step.
I'm not the first person who can explain this, but yeah, I saw it. 101 is everywhere, right?
Yeah.
So where to start? Let's first take a step back and think about how we verify our optimization idea. As usual, I have a strong opinion here
because I see a usual approach is either to rely on experience
or to test it right on production.
Because the problem is that non-production environments
have different data, different size of database,
everything is different, sometimes different hardware,
but actually hardware is less important.
Much less important.
We can tune our Postgres on weak hardware to behave, in terms of plan choices, exactly like on production.
It's possible.
But the data, statistics,
and the planner settings, of course,
these things are the most important.
So the question is how you verify your ideas.
How do you do it?
And the answer, great.
So what do you think?
Well, I'm interested.
Yeah, well, so I think you've probably gone really advanced there
already talking about how hardware is less important.
I think the first lesson is that testing in as similar a setup as possible
is really useful.
And maybe there's some exceptions to that,
but by similar, I think the first thing I always try to encourage people to do is make sure the
quantity of data is similar. And I guess we're testing two things, right? The first thing we want
to test is for the single query that I was trying to optimize or for the single problem I was trying
to solve, does this solve that problem? And then there's a secondary level,
does it introduce other problems or does it work well amongst the whole load of the system?
Right, so we want our non-production Postgres to ideally behave in exactly the same manner as production would. Actually, production can change. Yesterday
it chose one plan,
today there is no guarantee it chooses the same plan.
And this is quite well known in Postgres.
And unfortunately, Postgres doesn't provide a good way to solve this problem.
I mean, to freeze the plan choice
and avoid sudden plan flips for critical queries.
But it's another topic.
So this is another topic I mentioned just to give the feeling that the planner behavior is not stable.
So achieving the same planner behavior in a non-production environment is quite a challenge.
But there are specific tricks that can help us.
For example, of course, ideally, we want the same data.
Because the same data will mean the pg_statistic content is the same, so we are fine. There are tricks to export and import pg_statistic content even if you don't have the data. There are some not very popular but interesting extensions, I think they came from Japan, I don't remember the names, but we will provide links. So it's possible to export and import statistics. This approach has limitations.
It's similar to hypothetical indexes, partitions.
Very interesting approach, and sometimes it's good.
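For reference, hypothetical indexes of the kind mentioned here are available through the hypopg extension; a minimal sketch, with a made-up table and column, might look like this:

```sql
-- Hypothetical index: the planner can consider it in EXPLAIN without
-- actually building it (hypopg extension; table/column names are made up).
CREATE EXTENSION IF NOT EXISTS hypopg;
SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (customer_id)');

-- Plain EXPLAIN (no ANALYZE) now shows whether the planner would use it:
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
```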
But what if we want true behavior?
So executor really executes our plans,
and we can see the behavior of Postgres for our query.
In this case, we want the same data.
Of course, there are several levels of how close we can get to the same. We can just take it as is at the physical level. This is an absolutely good approach. We just grab the same PGDATA. If we have managed Postgres, we just clone it. Most managed services allow you to clone. It takes time, it costs money, but it will give us the same data. And so we can see behavior there. But sometimes we cannot copy at the physical level, so we can only copy at the logical level. But as you said, the number of rows, for example, will be the same. So we're
good. A problem, not really in the topic of performance, but a problem I see quite a few
customers coming across with that is the kind of access to data.
So giving developers
access to production data
can be problematic
from a privacy point of view.
It's very problematic.
And in my opinion,
it's not about...
There are two problems that giving an arbitrary developer in a company access to production can bring us.
First problem is security.
And second problem, I think it's more important, even more important.
Security, we can ask them to sign something,
to do security trainings, a lot of things.
But second problem is more interesting.
Developers want to develop, so they add a lot of instability. If we allow developers to go to production, it can introduce challenges in terms of stability, because we want to have an established process of development, testing, and deployment. We don't want developers to go and check, for example: oh, I have a great index idea, I will create it right on the production primary node and check. That's not right. But security is also a concern. And even if we don't talk about security,
which is a very big concern,
if we need to use a logical copy of production,
for example,
the question is,
will the plan be the same?
Row counts are the same, right?
Statistics should be the same.
And, for example, we took the same, even the same hardware, and we put the same Postgres settings into the configuration. The question is: does this guarantee that the planner will behave in exactly the same manner, or not?
So the place my head's going to is ANALYZE and its sampling. Even when you run ANALYZE, you get samples of tables, right? So maybe that's not the only way it could be different, but that's one reason it could be.
I agree.
Yeah, some instability there.
But also, if you have a lot of bloat, for example, relpages is different. On your new node, on your logical clone, relpages will be smaller. And this directly affects the planner behavior, so it can choose a different plan. So we can see different access nodes in the plan, not access nodes, execution nodes in the plan, when you compare production and this logical replica.
Unfortunately, I don't know a solution to this problem, but it's quite an interesting problem.
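To get a concrete feel for the drift being described, relpages and reltuples live in pg_class; comparing them between production and a logical clone (the table name here is hypothetical) shows how different the planner's inputs can be:

```sql
-- relpages and reltuples feed directly into planner cost estimates;
-- run this on production and on the clone and compare (table name is hypothetical).
SELECT relname,
       relpages,
       reltuples,
       pg_size_pretty(pg_relation_size(oid)) AS on_disk_size
FROM pg_class
WHERE relname = 'orders';
```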
And actually, not just bloat,
also, I guess, insert order, right?
Like correlation of data.
Physical distribution.
Yeah.
Well, you mean the buffer numbers will be different.
This I understand.
For example, we need to read 1,000 rows.
We had 1,000 buffer reads and hits on our source, but we have far fewer hits and reads on our replica because of more compact storage of these rows, for example.
This is possible, right?
But the question is, will the structure of the plan be the same?
And unfortunately, we cannot guarantee it. Usually it's very close, but unfortunately relpages is different, and it directly affects the planner behavior.
But we're talking about edge cases, right?
Not the majority of the time if we're talking about simple OLTP type queries.
Does it use the index?
Does it not?
The kind of things people are probably worried most about, they're likely to be easily replicated
once you have a similar sample of data.
Right, right.
I agree.
And sometimes, for example, we might have multiple production
environments, and they are slightly different. So things can be very interesting in various cases,
but at least we can be as close as possible. If we aim to do it, we say: okay, we have a billion rows in the table, so let's have a billion rows in the table in our non-production environments. And then we can see the plans.
Now back to the topic. If we have a query and a plan, and we want to ensure that it's a good idea, we need to approve it. Of course, a direct comparison before and after is already good, right? So we have one query, and we see that our change, for example we create an index or we do something else, we change the query somehow, and we directly see that the I/O numbers improved. So the conclusion is: let's do it. But as you mentioned in the very beginning, of course, there are some cases when our change can affect other queries. And the question is how to find that holistically, how to test it holistically. And this is an unsolved problem. First of all, let's consider one case
when it can be possible. I had it and I wrote an article about it; we will provide a link. It's an interesting case where an optimization of an index led to all updates becoming worse in terms of execution. Why? The idea was simple. For example, we have some SELECT. It's quite slow, maybe not slow, but we see that it uses some index which we can reduce in size significantly by adding a WHERE clause. We add a WHERE clause, everything is fine. This SELECT has better performance because the index size is smaller and so on.
Because now it's partial, smaller, right?
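A minimal sketch of the kind of change being discussed, with hypothetical table, column, and index names:

```sql
-- Replace a full index with a much smaller partial one to speed up a SELECT
-- that only ever looks at pending orders (all names are hypothetical).
CREATE INDEX CONCURRENTLY orders_pending_created_at_idx
    ON orders (created_at)
    WHERE status = 'pending';
-- The catch described next: if UPDATEs modify "status", this index now
-- references that column, which can prevent HOT updates.
```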
But then we deploy it, and suddenly you see that the latency of all updates on average got worse.
Why?
Because we lost HOT updates. Adding some column to an index definition, even if it's only in the WHERE clause, leads to different behavior of updates, because HOT updates, heap-only tuple updates, are possible only if no indexes mention the column we are changing in their definitions.
Yes.
Right? So if we change some column, a usual case, for example an updated_at timestamp, we usually tend to change it when we do updates, right? Because that's the semantics of this column. But if we have an index on it, no such updates can be heap-only tuple. Heap-only tuple updates are much faster than regular updates, because during these updates Postgres doesn't touch all indexes; it doesn't need to change all indexes. But with regular updates, Postgres needs to update each index. This is the index write amplification issue, one of those which Uber highlighted in their article. So losing HOT updates may be painful.
Especially on a heavily loaded system, for sure.
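One way to spot this regression after a deploy is the HOT update ratio in pg_stat_user_tables; a query along these lines:

```sql
-- Share of updates that were HOT (heap-only tuple); a sudden drop after
-- adding or changing an index is the symptom described above.
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_update_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 20;
```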
The place I'm starting to see some progress in terms of tooling is people even being aware that some of these things are a problem. So pganalyze added an index advisor in the last year or two, and one of the things it points out is an estimate of write performance overhead. So not only can indexes be a problem in terms of preventing HOT updates, they also add write overhead for new rows, for example.
Oh, yeah.
Any index, if you create one more index, you definitely add overhead. Definitely. And we can actually estimate it, we can measure it. And this is interesting. Here it's definitely possible to improve the tooling. Definitely. I don't see the best tools here yet.
I made a couple of attempts to create tools. One of the ideas was to create a benchmarking tool which would be used for any change related to the database. It was called Nancy bot. The idea was that we would replay the workload before and after our change, collect all metrics, for example from pg_stat_statements, from logs, auto_explain, everything, and then provide a direct comparison before and after.
And we see exactly which queries degraded
and which queries improved.
And this was a great idea.
We even had a good interface and so on.
But then I realized that practically it's not possible
to run full-fledged benchmarks
every time developers do any change.
Because it just doesn't work economically.
It's too expensive.
You need a separate machine or a couple of machines, sometimes several in a batch. Sometimes we ran several experiments, like: what if we do this, this, this? We had 10 options and we executed them. Then questions arise. We executed them in parallel, not sequentially, because sequentially is too slow; if each experiment takes 10 minutes, you want to execute them in parallel.
But then questions arise: okay, but are all virtual machines the same? Or are there some deviations? For example, one of them got a worse disk or CPU or something.
Or noisy neighbor.
Right, right, right. So we added a lot of logic,
like micro benchmarks
to ensure that some baseline is met
and so on and so on and so on.
And complexity grew,
but the main problem was
you just need to pay a lot
to do such experiments.
Some companies can afford it.
Also, one of the biggest problems was how to replay workload. There was pgreplay and pgreplay-go. Just the other day, maybe yesterday, I saw the article from Franck Pachot at Yugabyte about how to run pgreplay, or pgreplay-go actually, for YugabyteDB. But collecting queries is also a challenge, and I did it several times. So the bottom line: it's quite complex. It's possible to do it, some companies actually implemented something, but it's very hard to create a universal tool here. But still, there is a need here; we want to cover all changes with some testing, right? But the good news here is that we don't need full-fledged clones.
We can do it in a shared environment. We can focus on I/O metrics and run many experiments on just one machine. This is what we do with Database Lab, thin clones, and database branching. And it's possible. So we can have thin clones. Of course, they have different timing because of a different file system, for example ZFS or anything else; Btrfs is also an option. And then you run multiple experiments right on one machine, and you don't care about timing, you care about plan structures and I/O numbers, and you can verify many things.
So the next question is what to verify. Okay, we have our query we want to optimize. We found some index idea. We verified that the query is improved. How to check other queries? Well, some new tooling needs to be built here,
right? And I think it consists of two areas. The first area is a reactive approach.
We want to grab the content of pg_stat_statements periodically from production and have some regression testing. We can reduce the scope if we can parse the queries, already normalized queries, using the pg_query library from Lukas Fittl, right?
To identify which tables are involved.
And if we consider optimization
for a particular table,
we can find only those queries
from top 100 or top 500
by total time, by calls.
Somebody needs to decide what is the most important metric
and which queries are most critical for our application.
But we can find those queries which potentially might be affected.
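As a rough illustration of that shortlisting step (a crude text filter stands in here for proper parsing with pg_query; the table name is hypothetical, and the column names assume Postgres 13 or newer, where total_time was split into total_exec_time):

```sql
-- Candidate queries for regression testing: top statements by total time
-- that mention the table we are about to change.
SELECT queryid,
       calls,
       total_exec_time,
       mean_exec_time,
       shared_blks_hit + shared_blks_read AS shared_blks
FROM pg_stat_statements
WHERE query ILIKE '%orders%'   -- crude filter; pg_query parsing is more precise
ORDER BY total_exec_time DESC
LIMIT 100;
```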
Of course, we can also think about triggers and various dependencies, foreign keys and so on.
So this is tricky.
But anyway, we can find it automatically.
And using thin clone, using single connection,
sequentially we can test many queries.
But the problem will be how to find parameters,
which parameters to use for those queries.
And this is an unsolved problem.
I already mentioned that this is a big problem.
Which parameters to test?
In reality, what do you see?
In reality, what I see people doing, and I think it's probably at a smaller scale than what you see,
is that they reason about the potential effects, discuss it amongst the team during the PR process,
during the discussions. Then they deploy to production and monitor.
Maybe they provision a little bit extra.
A purely reactive approach, right?
Exactly.
So we deploy and see.
Yeah, this is possible.
First of all, of course, code review helps.
And having a bunch of very experienced Postgres experts helps a lot, right?
They can say, oh, you know, this is a good idea, but what will happen?
What's overhead from it?
Let's think about it.
Of course, we can manually analyze and try to predict.
And if you have 10-plus years of experience, it will work quite well.
Unless you're very tired, very busy, have days off and so on.
So I still think about a fully automatic approach.
But it's possible and we can check metrics and have some alerts.
For example, if updates are very important for us,
if the ratio of hot updates dropped after our release,
it can be somehow flagged, alerted, and so on.
This is a good approach, yes.
But it's purely reactive, right?
And actually, most projects don't even have this.
So I agree, this could be a good step towards it.
But back to the fully automated approach: which queries to check?
We need to check queries from production for regression testing, and that's possible.
And actually, in a couple of cases, we already have something like that with our Database Lab Engine. Everything is done in CI/CD,
and the users see the difference
and potential degradation before we deploy it.
It's very good.
But additionally, we need to think about
absolutely new queries.
For example, if it's a new feature,
new feature branch,
and it introduces some new queries
or has a rewrite of existing queries, production doesn't have these queries. So the second part of it,
it's even more challenging. We need to test absolutely new queries here. And I'm just
explaining our thoughts. We moved in this direction with Database Lab, and I'm excited that in the future
we will have fully automated
testing of all
things related to databases
even for those who are not experts
at all. And when experts
are involved, they have a lot of
artifacts, metrics to
analyze and understand much quicker
what to do, how to fix it.
So everyone can move faster and
with better quality and so on. So right now, without this tooling, the idea is: we need to have code review, and for every database schema change we need to test the change itself, but also think about how it will affect the workload.
Yeah.
And I guess first step,
some people don't even check the locks involved, right?
That's like, I guess that maybe that's 101 rather than 102.
But making sure that it's not a blocking change.
I know this has got a lot better in more recent Postgres versions.
Right, but I don't even discuss that, considering it too trivial, but right. So, like, CREATE INDEX should use CONCURRENTLY, and updates should be done in batches.
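For completeness, a sketch of those two basics, with hypothetical table, column, and index names:

```sql
-- Build the index without blocking writes.
CREATE INDEX CONCURRENTLY IF NOT EXISTS orders_customer_id_idx
    ON orders (customer_id);

-- Backfill in small batches instead of one huge UPDATE;
-- repeat until it reports 0 rows updated.
UPDATE orders
SET status = 'archived'
WHERE id IN (
    SELECT id
    FROM orders
    WHERE created_at < now() - interval '2 years'
      AND status <> 'archived'
    LIMIT 1000
);
```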
Yeah, and this is actually something we solved with Database Lab. We have fully automated verification that no exclusive locks last more than a specified number of seconds, for example, one second. This is possible to fully automate.
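In plain SQL, a common way to bound how long a schema change may wait for or hold locks is a session-level lock_timeout; this is only a sketch of the idea, not necessarily how Database Lab implements the check, and the table change is hypothetical:

```sql
-- Fail fast instead of queueing behind long transactions and blocking traffic.
SET lock_timeout = '1s';
ALTER TABLE orders ADD COLUMN note text;  -- hypothetical change; retry later if it times out
```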
For the change itself, we spent some time understanding how to test it, and we have pretty much solved it; we can put it into your CI/CD. But now the question, and I think it's a bigger question, is this: for example, some developer created something but forgot an index. How do we see that it will be bad if we deploy it? And how do we see it without involving very expensive and very experienced DBAs every time, right? We should involve them only for hard questions, not simple ones like a forgotten LIMIT or an index which degrades our other queries and so on.
Yeah, and you mentioned pgreplay-go. That's from the old company I used to work at, GoCardless, and I think that's how I remember it being used most: it was big changes, it was risky things, it was things that people already knew were potentially problematic.
For example, big functionality, we deploy big functionality, or we had some
refactoring during several months, right? Large migration, that kind of thing.
Well, in my opinion, benchmarks with multiple sessions should be applied very rarely, when something big changes: either a big new release happens, or a migration to a newer major Postgres version, or a change of infrastructure, or a change of operating system.
It happens a couple of times per quarter maximum usually, right?
Yeah, much less normally.
Much less normally.
It doesn't depend on the size of company actually,
because anyway, it's a very expensive change, usually.
And it's good that we need benchmarks, full-fledged benchmarks, not every day.
But application changes might happen every day, sometimes several times per day.
And they might change some small part of the database.
Good news, only small part of the database, right?
We don't perform huge refactoring every day. So if we change only one table, or two tables, we should be able to test it in a lighter manner. This is why I'm such a big fan of thin cloning. And I'm so excited to see others also thinking about database branching. Neon, they talk about it; they say you can test a lot of things in CI/CD and it will be cheap.
Actually, as we discussed, Aurora has thin clones, but it's not cheap.
You need to pay for each clone
for compute power separately.
I hope Neon
will be different. Running on
one machine and you pay for a single machine
but run many tests
at the same time on one machine.
Otherwise it doesn't scale because you sometimes need dozens of tests happening in parallel if you
have a big organization. It can be very costly. So you will turn it off and you will be without
testing, returning to bad times again. But the advice right now is: yes, have database reviews, try to automate some things, learn your pg_stat_user_tables, right? Because it shows you how many updates and HOT updates you have, tuple statistics. It's very good to understand it. I don't agree with the Datadog approach: if you go to the database dashboard in Datadog, they have a lot of improvements lately, over the last year or two, but if you check what they name it, they present tuple statistics as throughput. No, no. Throughput is TPS, QPS; that is throughput. Tuple statistics for our data is not. Tuple statistics are important, but they should be shown in detail.
But it's a long topic. Anyway, when you change something for a table, understanding the tuple statistics from pg_stat_user_tables is a good thing to have. And you can see: oh, I lost HOT updates.
Oops.
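For reference, the tuple statistics being referred to, for a table you are about to change (the table name is hypothetical):

```sql
-- Tuple-level activity for one table: inserts, updates (and how many were HOT),
-- deletes, dead tuples, and how it is being scanned.
SELECT n_tup_ins,
       n_tup_upd,
       n_tup_hot_upd,
       n_tup_del,
       n_live_tup,
       n_dead_tup,
       seq_scan,
       idx_scan
FROM pg_stat_user_tables
WHERE relname = 'orders';
```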
Yeah.
How would your advice change if we're talking about smaller teams?
If we're talking, let's say, like your six-person startup,
you've got a decent amount of data, but not necessarily, you know, billions of rows, and maybe the team is currently prioritizing moving fast, and if they have a few problems that they have to fix, that's okay. Are there any changes in this? Because this feels like a very safe approach; the kind of tooling we've discussed seems very much optimized for trying to avoid a problem at almost any cost.
Well, yeah, good question.
First of all, I would anyway change the mindset to use buffers instead of timing, even if you're small, because it will save you so much. Of course, there is such a thing as premature optimization, when you try to optimize too early, too much. But still, if you know that some table will be used with a queue pattern (insert, update, delete), it will be bloated a lot. You need some experiment anyway. This experiment, by the way,
can still be a single-user, single-session
experiment in a shared environment.
Because you don't actually need to use pgbench and utilize all CPUs. You don't need some kind of background workload, unless this workload also deals with your table. What you do need is to think about the future: what will happen in one or two years with this table, how big it will be. Run some benchmark with one session to fill it, then run one more session to update it heavily and so on, and see what the bloat will be and whether you have good indexes here or not. If you focus on buffers, you will quickly... This mindset change is very important, because you will start comparing. Like in your tool, pgMustard, you compare, for example, rows versus buffer numbers. To get one row, we deal with 1,000 buffers. It doesn't feel good at all. And if in one year it will be even worse, probably we should prepare for growth right here in this area. So we need to test it. We need to test our queries and see what will happen if we have more data, for example.
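The rows-versus-buffers comparison being described comes from EXPLAIN with the BUFFERS option; a sketch with hypothetical names:

```sql
-- Focus on the buffer counts, not just the timing: fetching one row
-- through thousands of buffers is the smell to look for.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 1;
```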
Yeah, the only other thing I would add is that I see a couple of extreme failure cases here.
I see the failure case where people have not really started adding many indexes at all.
They almost have no indexes.
And then they start to scale and they have every problem.
Or index every column, right?
Yeah, exactly.
Or the other extreme where they've added an index every time they've seen a problem.
Or, yeah, I've not seen an every column case.
I've heard about them.
But definitely, you know, every time they see a potential for a multi-column index, they add one. So a single table, even a table of six columns, could have 20 indexes on it, in different orders. So I've seen both of those failure cases, and I guess if you know you're in one of those, then it kind of changes your approach quite a lot. If you're in the case of already having too many indexes, maybe you need to be a bit aware of that and be looking at reducing them.
If you know you're in the case of we don't have enough,
maybe you've got a little bit more leeway
to not have to worry about as much impact of adding extra.
You need to understand your workload to know what to test.
And like, this is true,
but the change of mindset to buffers
and just experiment a lot.
This is a good thing.
Like experiment, experiment, experiment.
And you will see and then you will probably start.
If everything is based on human decisions, like based on gut, right?
In this case, at some point you will start skipping some checks because you already know that it's okay.
But sometimes you will be wrong anyway.
This is a bad thing.
Yeah.
In fact, the buffers thing came up just the other day.
I saw a really interesting blog post from a company
that moved from RDS for Postgres to Aurora for Postgres.
Okay.
And they, in my opinion, must have not had enough indexes.
They must have been doing too much I/O, because their costs maybe tripled or something according to the graphs. It was really interesting to see that, and to see that they thought this simple migration would be straightforward, but actually, because they hadn't optimized for buffers, they hadn't been efficient on the I/O front, and that hit them with a big bill and some headaches.
On Aurora, optimization is not only about performance,
it's about money as well, right?
And I/O-centric optimization is the best thing there, because you focus on reduction of I/O. I don't remember, is it about reads or writes? I remember something specific was there; this detail I've already forgotten. But either reads or writes, if you have a lot of them, you need to pay for I/O on Aurora.
So you definitely want to reduce it, and you want your buffer pool to have a better efficiency ratio in terms of hits versus reads. And you can just check your queries and optimize them and so on.
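A rough way to check that hits-versus-reads ratio at the database level (exact Aurora billing details aside):

```sql
-- Buffer cache efficiency: hits served from shared_buffers vs. blocks read in.
SELECT datname,
       blks_hit,
       blks_read,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_database
WHERE datname = current_database();
```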
And if, for example, your new index broke all your HOT updates, so the HOT updates ratio became zero, and this is what happened to me a couple of times, in this case you will start paying more. So you optimize one query, but not only do you degrade other queries and get worse performance for them, you're also paying more. It's not good. So my advice
is know your workload.
Learn it. From pg_stat_statements, for example, focus on buffers, not timing. When optimizing, timing is our final goal, but it should not be used during the process of optimization. It should only be checked at the beginning and at the end. Inside the process of optimization, we should be focused on buffers.
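In pg_stat_statements terms, that means ranking by block counts rather than by time, for example (again assuming Postgres 13+ column names):

```sql
-- Which statements touch the most data, regardless of how fast they happen to run.
SELECT queryid,
       calls,
       shared_blks_hit + shared_blks_read AS shared_blks_total,
       shared_blks_read,
       total_exec_time
FROM pg_stat_statements
ORDER BY shared_blks_hit + shared_blks_read DESC
LIMIT 20;
```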
And finally, experiment a lot.
And to experiment a lot,
you should just be ready
to clone and reset a lot,
and so on. That's why thin cloning is good. And as I've said, Aurora has thin cloning,
so you can run additional
load and test it there and stop it.
I guess they have also like per minute or even per second billing
as AWS in general has.
EC2 nodes, they are per second, so it can be very efficient.
You will need to wait some time, like minutes, to provision a node.
But at least you can quickly stop it, right?
But pure thin clones, which I call local thin clones:
So we have one VM and many clones inside it.
So we pay only for one VM.
It's the best for experimenting.
So having such a platform for experimenting
is a good thing.
Anyway, general advice is experiment a lot.
Yeah.
Sounds great.
Anything else you wanted to add?
No, as usual, subscribe, like, share,
provide feedback.
Feedback is very important.
Share is also very important
in your social networks, working groups,
Slack, Discord, Telegram,
what you use, please share.
And bring us more ideas.
Actually, we have a line, we have a queue of ideas anyway, but we want more.
Yeah, exactly.
It's not just always a request comes in and we do the exact topic.
Sometimes it gives us an idea for a slightly different one,
or we put two together and that makes a good episode, that kind of thing.
So we'll see.
Exactly.
Thank you so much, everybody.
Take care.
Cheers, Nikolay.
Bye.