Postgres FM - Custom vs generic plan

Episode Date: May 10, 2024

Nikolay and Michael discuss custom and generic planning in prepared statements — how it works, how issues can present themselves, some ways to view the generic plan, and some benefits of avoiding planning (not just time).

Here are some links to things they mentioned:

PREPARE https://www.postgresql.org/docs/current/sql-prepare.html
track_activity_query_size https://www.postgresql.org/docs/current/runtime-config-statistics.html#GUC-TRACK-ACTIVITY-QUERY-SIZE
plan_cache_mode https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-PLAN-CACHE-MODE
EXPLAIN (GENERIC_PLAN) https://www.postgresql.org/docs/current/sql-explain.html#id-1.9.3.148.8
EXPLAIN (GENERIC_PLAN) in PostgreSQL 16 (blog post by Laurenz from Cybertec) https://www.cybertec-postgresql.com/en/explain-generic-plan-postgresql-16/
Running EXPLAIN on any query, even with $1 parameters (blog post and video by Lukas Fittl of pganalyze) https://pganalyze.com/blog/5mins-postgres-explain-generic-plan
EXPLAIN from pg_stat_statements, how to get the generic plan (blog post by Franck Pachot of Yugabyte) https://dev.to/yugabyte/explain-from-pgstatstatements-normalized-queries-how-to-always-get-the-generic-plan-in--5cfi
Rework query relation permission checking (commit by Amit Langote) https://git.postgresql.org/gitweb/postgres.git?p=postgresql.git;a=commit;h=a61b1f74823c9c4f79c95226a461f1e7a367764b
Partition pruning, prepared statements and generic vs custom query plans (a follow-up blog post and video by Lukas) https://pganalyze.com/blog/5mins-postgres-partition-pruning-prepared-statements-generic-vs-custom-query-plans
Our episode on over-indexing (inc. LWLock discussion) https://postgres.fm/episodes/over-indexing
"The year of the lock manager's revenge" (from blog post by Jeremy Schneider) https://ardentperf.com/2024/03/03/postgres-indexes-partitioning-and-lwlocklockmanager-scalability/

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello, hello, this is Postgres FM, episode number 96. My name is Nikolay, founder of Postgres.ai, and as usual, my co-host is Michael from pgMustard. Hi, Michael. Hello, Nikolay. So pgMustard, as I mentioned last time, is focusing on explain plans and helping optimize Postgres queries. And the topic I chose this time is very, very close to that topic, right? I mean, it's a subtopic. I know many people discuss a lot of very in-depth materials around this topic,
Starting point is 00:00:37 but let's just focus on the basics, just for awareness. And this topic is how plan caching works, and generic versus custom plans. Because at a very high level we don't know about them at all, right? We know that EXPLAIN shows us a plan, we know there is plan caching, we know PgBouncer recently started supporting prepared statements, so you can avoid planning time, because sometimes it's very costly — for example, if you have a lot of partitions and a lot of indexes, lock manager contention can happen. So it's good, especially for queries which have very high frequency — for example, a thousand times per second — to just perform planning once per session and then rely on the cached plan. But it's not really once per
session, right? There are more nuances in this area, so let's uncover them and discuss. Yeah, nice. At a kind of high level, I see people that are using ORMs and frameworks using prepared statements sometimes without realizing it, or sometimes they know, but they don't really know what it means. Do you see a lot of people, like, is that the same in your experience? A lot of people using this without realizing, or are you seeing quite a lot of conscious choice, like conscious use of prepared statements, even for people that aren't using it via an ORM or similar? Well, you know, my point of view on this world,
Starting point is 00:02:07 on one hand, I think people are quite smart in general, but they are smart only when they are very motivated. And we know a lot of engineers don't distinguish binary tree and B3 at all
Starting point is 00:02:20 because they are not motivated enough to understand differences. Similar here, a lot of people, in my opinion, don't understand that one normalized query in PGSA statements might have different plans depending on parameters. This simple truth is something that should be understood, but it can be understood only when you are motivated, when you have some incident, for example, and you start off thinking, oh, okay,
Starting point is 00:02:49 this query can behave differently depending on parameters. So this is basic knowledge. And honestly, I didn't realize it for quite long in my career as well. I thought, okay, this is so, but why is it important? Well, it's very important because looking at Pidzis' statements and understanding heavy queries, it's not easy to jump to optimizing these queries, especially when we talk about mixed situations, when it's also quite high frequency query and also it has issues with like not perfect indexes chosen not perfect plan like not only just index only scan that's it so it means that we need to understand okay we depend on parameters but which parameters
Starting point is 00:03:39 should we choose for for optimization when you say for optimization do you mean like we get to pgstat statements and it has these parameterized queries which ones do we put in in order to run explain analyze buffers right right right this this is unfortunately big gap between macro and micro levels of query analysis which still exists of course with addition of query ID to logs and to PGS activity, we have this bridge between PGS statements and individual queries, which we can find in logs or PGS activity. But still, like I remember one lesson I learned in practice when I spent one hour or so optimizing some query,
Starting point is 00:04:26 it was some parameter which was true or false. And I was trying to optimize for a value which was present in 90% of rows, majority of rows. And it was, of course, difficult because selectivity is low, cardinality is high, right? But then i understood all this hour spent was wasted because it never it's never so never never never ever application runs this value uses this value in query right so the idea is that we need to understand real cases and also not super rare cases or
Starting point is 00:05:09 non-existing at all cases, but we need to understand the cases which hurt user experience, for example, and focus on them. In this case, genetic plan is something really weird in this picture I just painted, right? So what do you think? Like, generic and custom. Maybe let's talk about how Planner uses them and then discuss what should we do about it at all. So this five times rule, what do you think? Yeah, I think that catches people out sometimes so
Starting point is 00:05:45 the way prepared statements works or at least my understanding is that you can prepare a query in advance and that would do like the pause stage and it will do like several stages up to but not including the execution of that query then you can ask it to execute with parameters. Now, on the subsequent executions, you don't have to re-pause, like do the set, like post-credits, do several of those. Well, planning at first does have to be done again, right? So it's the pausing and like the rewriting stages. Yes, on a per session basis, we don't have to do several of
Starting point is 00:06:28 those steps but there are there is also this weird kind of initial phase so if you've if you've done prepare if you've prepared the statement and then you execute it the first five times postgres will plan each of those five executions separately based on the parameters that are given so if those are in the case you gave if you used true was it that was 90 of the time you've used true every single time for those five times it's going to plan it the same way each time and on the sixth execution it will then look at the cost of planning. And if that cost is less than a generic, or if it's more than a generic plan that it's calculated, or I think there's some subtlety around exactly how precise that is,
Starting point is 00:07:15 then it will choose its generic plan thereafter. It will flip to not planning each new execution. So sixth time, seventh time seventh time eighth time ninth time forever onwards it during that session it won't replan so now you're not only saving on the parsing on the rewriting stages you're also saving on the planning stage which you mentioned already but that's not true if for example uh you did three or four executions with true and then a couple of executions with false in your example, then you're going to get a different kind of average cost for those five executions. Then that will get compared to the generic plan and then a different decision will
Starting point is 00:07:56 get made. So depending on those first five, so those first five do actually determine quite a lot what happens thereafter, which is interesting. But yeah, the one time I've seen that the five become important is this can be really confusing for folks. If you're trying to debug something and it's fast for a while and then suddenly it goes slow, this is one of those telltale signs once you've been around a while. If you notice that it's fast for five executions, then slow on the sixth, you should look into this issue, custom plan versus generic plan. It's kind of a telltale sign, like a smoking gun,
Starting point is 00:08:34 or like a, you know, it's a telltale sign that this could be what you're hitting. But aside from that, I haven't really thought too much about it. I don't think it's configurable, that five number. I think it's hard code. Yeah, it's hard. But by the way, why five? Yeah, good question.
Starting point is 00:08:50 Some decision, right? That's it. But what is a generic plan? Generic plan is just we don't take into account parameters at all. It means that we don't take into account this distribution of values or we take yeah good question did you look into how it's calculated in the first place no but it's interesting question right and i would have i could be wrong but i would have guessed that it would be based on some statistics at the time of right like if that would be if you were implementing this feature you would think if if you've got like a got a data distribution there maybe you'd factor that in but it's difficult
Starting point is 00:09:31 because if it's skewed if it's like completely uniform it's quite easy like if we're talking about primary key lookups it's primary key lookup query and every single one's unique and every single one's only going to return one row. It's only scanning one table, that kind of thing. Generic plan is really easy because every single parameter you could provide to that is going to be planned the same way even in the custom plan mode. So generic plan in that case is quite easy. But in the case where it's a skewed distribution,
Starting point is 00:09:58 some values have more entries than others, it becomes tricky, I think, to know how you would plan that in a generic way. So I don't actually know. For me, generic plan is a good thing that we don't have a lot of time to spend finding particular examples and logs or in PGS activity. In PGS activity, we have a different the how it's called track activity size
Starting point is 00:10:26 so this parameter which says by default is 1024 means that really long queries are truncated and we don't see the full query parameters usually are in the end right like where select all columns
Starting point is 00:10:42 our favorite right from table and like so so many column names, so I don't see parameter. Something equals truncated. I saw it not once. So when we don't want to spend time, we want something really fast, this trick prepare statement.
Starting point is 00:11:01 We prepare statement, and then we use just nulls as all values this gives us generic plan but also we can say set plan cache mode to force generic plan which exists in as we just checked before we started recording this episode
Starting point is 00:11:19 we surprisingly plan cache mode configuration setting. It exists since many, many years ago. So in all current versions, it's supported, right? Yes, since version 12. Exactly. So 12 is the oldest currently supported version.
Starting point is 00:11:39 A few months left till the end of it. Yeah, yeah. So we just, this trick allows us to looking at PG statements and just take some normalized query which has this $1, $2,
Starting point is 00:11:58 doesn't have parameters at all. They are removed. And just to use this prepared statements with nulls trick and get some genetic plan and think already how bad situation in general, not in particular cases, but like overall. But this trick, as you absolutely correctly mentioned before this episode recording, it doesn't work if we have, for example, primary key there because primary key equals null or is null
Starting point is 00:12:29 won't work properly because there is not null constraint. So the planner knows that nothing will be returned. It doesn't show us the proper plan at all. Yeah, if you don't specify, like if you don't set plan cache mode to force generic plan, then you'll get a custom plan and the planner will shortcut it because there's a really, yeah. But if you do force generic plan on version 12 onwards, then you will get the generic plan you're hoping for.
Starting point is 00:13:01 Well, that's great. I didn't realize that. So, yeah good good but uh if we are lucky enough to have already postgres 16 all these tricks are not needed because yeah we can just say explain generic plan is there is a new option and explain command and this is very handy right we just ask a generic plan and explain. Don't care about... And interesting that we can even use $1, $2, this query text. Right from PGSR statements, this is super convenient. We can say, explain generic plan and that text.
Starting point is 00:13:36 We can even probably automate this, right? And see, for example, if we have sequential scans in generic plans and like put warnings somewhere and so on but of course this won't work if we want to execute and we want to explain analyze buffers this won't work i mean in this case we need to replace dollar one dollar two with some values yeah and it makes sense right like we can't do the execution unless we specify something badly. Otherwise, what are you executing? So it makes sense. And kudos to Lawrence from CyberTech who added this and has blogged about it.
Starting point is 00:14:15 So I'll share the blog post about that. I remember actually this wave from Lawrence and also Lucas Fittel who journalized blogging about generic plans and also Lucas Fittel pageranalyze blogging about generic plans and also getting excited. Also Frank Pachot, I think, joined at some point. I actually don't know who started this, but I remember waves of thinking, multiple blog authors blog posted
Starting point is 00:14:39 about this generic plans a few years ago. It was great. I don't think it's super surprising that this has come out of a consultancy, like a Postgres consultancy, and that Lucas is excited about it. So I think it's one of those things that really helps people help others. Because if you really know what you're doing, like these people all do, you can now set up Postgres in a way where you can find parameter values in the vast majority of cases, or you can reconfigure things
Starting point is 00:15:07 so that it does log the query ID, or you can get your slowest queries logged with parameters. So there are these ways of doing it now, but if people haven't set that up in advance and you're helping them with a production issue, this is a really helpful tool to give you a clue as to what's going on now explain will always there will only ever be a clue not because you don't have the execution system you don't know for sure why it's slow but those clues are often
Starting point is 00:15:35 enough for experienced people to get a theory get a hypothesis as to what's going wrong and then gives them something to be getting on with instead of being able to say, I have no idea why this is. It's kind of like extra information to make the guess a little bit more educated rather than an actual answer. Yeah, and we can have a holistic approach now, like checking everything in an automated fashion, not guessing and so on. It's great. So, yeah.
Starting point is 00:16:06 What else worth mentioning in this area? Maybe partitioning, which we like. It's funny that you found this commit in Postgres 15, right? Which improved the situation. My general impression, if you use a lot of partitioning, you should perform major upgrades
Starting point is 00:16:28 promptly, not lagging at all because every year a lot of things are improved. So this is a quite in-depth thing that was improved in Postgres 15, saying that partition Partition pruning, so exclusion for irrelevant partitions based on... If a user asks to return some data in query, Postgres can remove partitions which are for sure irrelevant from consideration. For example, we have partitioning by time, range by time, and by date. And then the user for sure wants fresh data. We definitely didn't need to look at very old partitions, partitions with very old data. But this might happen both at planning time and execution time, which can be a big surprise. And you want this to happen at planning time, basically, better.
Starting point is 00:17:29 Because this will mean that we can rely later on cached plan, or if we use prepared statements, we can rely more on cached plans, and then we have a very good situation when partition pruning was already done at planning time we just execute that's it but it's tricky when we talk about generic plan because generic plan doesn't take into account the values so something was improved in postgres 15 in this area right or yeah we tried we tried to look into it before the call, didn't we? Yeah, I couldn't reproduce the problem somehow. Yeah, so we tried on an old version before these.
Starting point is 00:18:11 So it would be interesting to hear from people that have reproduced this. I didn't see a blog post about it. I saw it referenced a couple of times. Yeah, Lukas Fittl, in this blog post — the pganalyze blog post — mentioned that before, we had a problem with generic plans: partition pruning didn't work at planning time, it worked only at execution time. I couldn't reproduce it, but if it's so, it means that, for example, if you have a lot of partitions with a lot of indexes, at execution time you involve all of them. In general, if you have a lot of partitions, just checking them, not returning everything, is a big overhead —
a huge overhead. But at least at execution time Postgres doesn't put locks — share locks — on all indexes, because at planning time it does. And if at planning time we don't have partition pruning working, it's a nightmare in this case.
Starting point is 00:19:16 right well i think that would happen yeah so so a question for you because we do now have this plan cache mode that so plan cache mode got added in 12 that's one of the things we can use to force generic plan force custom plan went while we're debugging while we're while we're trying to um like pre pre version 16 releases before we had this explained generic plan we could use that that for debugging. But I don't think I've ever seen anybody change so PlanCashMode got added. You have three options. Auto, Force Custom Plan
Starting point is 00:19:53 and Force Generic Plan. Auto is the default. So Auto acts as it did as you described. Yes. So the first five extensions. Five times Custom, then Decision is made. Yeah. What to first five extensions. Yeah, five times custom, then decision is made. Yeah. What to choose?
Starting point is 00:20:08 Well, I was going to ask, because I think I got the impression that some of these RRMs, like ActiveRecord, for example, uses prepared statements. I got the impression the main reason they do so is to protect against escrow injection attacks, not for performance reasons. That might be not quite accurate, but there is this other benefit that seems to be mentioned quite a lot around prepared statements. Now, I wondered if you'd ever seen anybody change, like if you only were using prepared statements
Starting point is 00:20:40 for escrow injection protection and you weren't as bothered about the performance benefits in some cases or pitfalls in other cases, you might want to allow Postgres to replan every single time. And I could imagine some people considering turning force custom plan on. I can imagine the upside of doing so. i can't see quite the same upside for doing forced generic plan because all you're doing is saving those five well i think you'd only save those five executions and then you flip to the generic plan anyway maybe there's another case for that but yeah have you seen anybody changing that setting uh in any cases I see everyone doesn't know about it, almost everyone. It's very deep.
Starting point is 00:21:28 It's not super, super. It feels like some kind of fine-tuning. I see people have fears, for example, to rely on prepared statements globally because of the idea, what if a global plan will work not well, right? So if generic plan is bad for a specific parameter. If generic plan is chosen, it's cached, and then it's bad for some cases,
Starting point is 00:21:55 and we have performance degradation for particular parameters. But also, I think this idea to use prepared statements, I don't think these days it already matters as protection. I think the main reason is planning time is usually fast, but not always fast, as we know. And the idea that during planning time Postgres locks all indexes, it's a super big surprise. Many people, including myself, had not long ago.
Starting point is 00:22:31 And we had benchmarks showing this small overhead adding with each index for select, not for update, for select. And at some point when total number of relations, both tables and indexes, reaches 16, which is also hard-coded constant. If query is very frequent, 1,000 or 2,000 per second, primary QLOOKUPs are usually are so. We have lock manager spike and partitioning increases the probability of it. And that's why getting rid of planning is good. And when you get rid of planning, it's worth understanding this behavior of how caching works and generic versus custom.
Starting point is 00:23:16 But just changing this globally, I'm not sure this is a popular decision. But, of course, my experience is very limited. Let us know if you have changed it. It'd be interesting to hear from you. Yeah, it's an interesting topic anyway. But in general, I think it's worth spending efforts to get rid of planning for high-frequent queries,
Starting point is 00:23:40 especially for partition tables. I have one more related question. So now PgBouncer supports prepared statements. especially for partition tables. I have one more related question. So now pgBouncer supports prepared statements. I saw they have a max prepared statement. So it's off by default, which I didn't realize. They have a max prepared statements parameter. Do you have any opinions on tuning that? I saw that they said a default of 100 might make sense.
Starting point is 00:24:03 Yeah, some high value is good. I think it depends on the situation, of course. But in general, it's not an issue to just increase it. Yeah, cool. But again, I see the tendency to be afraid of turning prepared segments globally in already existing heavily loaded projects. If you start from scratch,
Starting point is 00:24:26 it's a very good decision to make to start using prepare statements from the very beginning. And then you just grow and see the problems, fix them as usual, like you would do anyway. And then there's confidence level of using already there,
Starting point is 00:24:43 but switching for a big project from zero to 100% coverage, it's a very hard decision to make. Hard as in scary? Just people are not sure and that's the problem. And there is no good path.
Starting point is 00:25:00 Maybe there is a good path to it. But in a couple of big companies I observed recently, people decided not to it. But in a couple of big companies I observed recently, people decided not to proceed with this after a lot of consideration. It's kind of an interesting topic. Instead, it feels
Starting point is 00:25:16 safer to start using prepared statements only for specific queries. It's a slow way in. Maybe for new features. Not new features. not at all. Again, the problem is usually this log manager overhead when we have 1,000, 2,000 QPS, we see very simple queries select start from
Starting point is 00:25:40 or select something from table where ID equals something, primary QLOOKUP. And if we have planning time every time, select itself is fast. Planning itself is also fast. But if you have a lot of backends competing, access share log, it's not a problem at all, right? But log manager has overhead because of fast path is false. And this is what multiple companies experienced recently.
Starting point is 00:26:08 And Jeremy Schneider said lock manager strikes back or bites back or something like that. Last year was lock manager year in terms of performance issues people experienced. So you're saying, yeah, so I'll share the episode where we discussed that in more detail. So you're saying they choose to turn prepared statements on only for those queries because it reduces the issue there but not globally.
Starting point is 00:26:35 Makes sense. Just to remove these hotspots. Only for these primary QLOOKUPs. I mean, so if you're talking about these super fast queries, even though both execution and planning time are fast, it could be that planning time is more than execution time. Like it's quite common. It happens. So even if it wasn't for the issues you're describing, you could maybe increase
Starting point is 00:26:59 throughput or like there's a lot of overhead to be gained by not having to plan those queries. Yeah, shave off planning time is a good idea in many cases. In general. For example, partitioning. If partitioning is there, planning time can be high
Starting point is 00:27:18 if you have a lot. It can depend on the number of partitions. This is the problem as well. In this case, of course, getting rid of it, it's good. Nice. Cool. Anything else you wanted to add for this one?
Starting point is 00:27:32 No. So it's an interesting thing to keep in mind when dealing with plan optimization. So use explain genetic plan in Postgres 16 and in future Postgres versions. Yeah, if you can't, as a starting point, or if you can't get parameters.
Starting point is 00:27:52 If you don't want to go to logs to find examples, or maybe log min duration statement was not configured somehow. Exactly. If it wasn't configured for the time where you're having issues, or if you can't turn it on easily, like if you've got a lot of process at your company before changing some of these parameters you know so there might be there might be reasons why you can't get real values but of course i think i the
Starting point is 00:28:15 reason i don't know as much as maybe i should on this is if you can get the real values performance optimization work becomes so much easier like of course. Less guesswork, yeah. This generic plan should be something like partial solution in terms of analysis. It's a partial analysis. It always should be with remark, you know, this is not, maybe not it. Maybe, but maybe not. Yeah, like a stepping stone on the way to the solution, but it's not a solution in itself.
Starting point is 00:28:45 Right. On the other hand but it's not a solution in itself. Right. On the other hand, it might give very quick insights what to do and how to speed up. Yeah, good point. Yeah, good point. Good? Yeah. Nice one, Nikolai. Thank you so much.
Starting point is 00:28:56 Thank you. Have a good day. Bye. You too. Bye.
