Postgres FM - Custom vs generic plan
Episode Date: May 10, 2024

Nikolay and Michael discuss custom and generic planning in prepared statements — how it works, how issues can present themselves, some ways to view the generic plan, and some benefits of avoiding planning (not just time).

Here are some links to things they mentioned:
PREPARE https://www.postgresql.org/docs/current/sql-prepare.html
track_activity_query_size https://www.postgresql.org/docs/current/runtime-config-statistics.html#GUC-TRACK-ACTIVITY-QUERY-SIZE
plan_cache_mode https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-PLAN-CACHE-MODE
EXPLAIN (GENERIC_PLAN) https://www.postgresql.org/docs/current/sql-explain.html#id-1.9.3.148.8
EXPLAIN (GENERIC_PLAN) in PostgreSQL 16 (blog post by Laurenz from Cybertec) https://www.cybertec-postgresql.com/en/explain-generic-plan-postgresql-16/
Running EXPLAIN on any query, even with $1 parameters (blog post and video by Lukas Fittl of pganalyze) https://pganalyze.com/blog/5mins-postgres-explain-generic-plan
EXPLAIN from pg_stat_statements, how to get the generic plan (blog post by Franck Pachot of Yugabyte) https://dev.to/yugabyte/explain-from-pgstatstatements-normalized-queries-how-to-always-get-the-generic-plan-in--5cfi
Rework query relation permission checking (commit by Amit Langote) https://git.postgresql.org/gitweb/postgres.git?p=postgresql.git;a=commit;h=a61b1f74823c9c4f79c95226a461f1e7a367764b
Partition pruning, prepared statements and generic vs custom query plans (a follow-up blog post and video by Lukas) https://pganalyze.com/blog/5mins-postgres-partition-pruning-prepared-statements-generic-vs-custom-query-plans
Our episode on over-indexing (inc LWLock discussion) https://postgres.fm/episodes/over-indexing
“The year of the lock manager’s revenge” (from blog post by Jeremy Schneider) https://ardentperf.com/2024/03/03/postgres-indexes-partitioning-and-lwlocklockmanager-scalability/

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello, hello, this is PostgresFM episode number 96.
My name is Nikolay, founder of Postgres.ai, and as usual, my co-host is Michael from pgMustard.
Hi, Michael.
Hello, Nikolai.
So pgMustard, as I mentioned last time, is focusing on explain plans and helping optimize Postgres queries.
And the topic I chose this time is very, very close to that topic, right?
I mean, it's a subtopic.
I know many people discuss a lot of very in-depth materials around this topic,
but let's just focus on basics just for awareness.
And this topic is how plan caching works, and generic versus custom plans, because, like, at a
very high level we don't know about them at all, right? We know that EXPLAIN shows us a plan, we
know there is plan caching, we know PgBouncer recently started supporting prepared statements, so
you can avoid planning time, because sometimes it's very costly; for example, if
you have a lot of partitions, a lot of indexes, lock manager contention can happen. So it's good,
especially for queries which have a very high frequency, for example a thousand times per second,
it's good to just perform planning once per session and then rely on a cached plan. But it's not really once per
session, right? There are more nuances in this area, so let's uncover them and discuss.
Yeah, nice. At a kind of high level, I see people that are using ORMs and frameworks using prepared
statements sometimes without realizing it,
or sometimes they know, but they don't really know what it means. Do you see a lot of people,
like, is that the same in your experience? A lot of people using this without realizing,
or are you seeing quite a lot of conscious choice, like conscious use of prepared statements,
even from people that aren't using it via an ORM or similar?
Well, you know, my point of view on this world,
on one hand,
I think people are quite smart
in general,
but they are smart only
when they are very motivated.
And we know a lot of engineers
don't distinguish a binary tree
from a B-tree at all,
because they are not motivated enough
to understand the differences.
Similar here, a lot of people, in my opinion,
don't understand that one normalized query in pg_stat_statements
might have different plans depending on parameters.
This simple truth is something that should be understood,
but it can be understood only when you are motivated,
when you have some incident, for example, and you start off thinking, oh, okay,
this query can behave differently depending on parameters. So this is basic knowledge.
And honestly, I didn't realize it for quite long in my career as well. I thought, okay,
this is so, but why is it
important? Well, it's very important, because looking at pg_stat_statements and understanding
heavy queries, it's not easy to jump to optimizing these queries, especially when we talk about
mixed situations, when it's also a quite high-frequency query and it also has issues with,
like, not perfect indexes chosen, not a perfect plan, not just an index-only scan, that's it.
So it means that we need to understand: okay, we depend on parameters, but which parameters
should we choose for optimization?
When you say for optimization, do you mean, like, we go to pg_stat_statements and it has these parameterized queries; which ones do we put in, in order to run EXPLAIN
(ANALYZE, BUFFERS)?
Right, right, right. This is, unfortunately, a big gap between the macro and micro
levels of query analysis, which still exists. Of course, with the addition of query ID to logs and to pg_stat_activity,
we have this bridge between pg_stat_statements and individual queries, which we
can find in logs or pg_stat_activity.
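As a rough sketch of that bridge (assuming PostgreSQL 14 or newer, where query IDs are exposed, and pg_stat_statements installed; the queryid value below is hypothetical):

```sql
-- 1. Find a heavy normalized query and its queryid in pg_stat_statements.
SELECT queryid, calls, mean_exec_time, query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;

-- 2. Match it to live sessions; pg_stat_activity.query_id exists since PG 14
--    (compute_query_id = auto is enough when pg_stat_statements is loaded).
SELECT pid, query_id, state, now() - query_start AS running_for, left(query, 60)
FROM pg_stat_activity
WHERE query_id = 1234567890123456789;  -- hypothetical queryid from step 1

-- 3. In postgresql.conf, %Q in log_line_prefix (PG 14+) stamps log entries
--    with the same query ID, e.g.:
--    log_line_prefix = '%m [%p] query_id=%Q '
```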
But still, like I remember one lesson I learned in practice when I spent one hour or so optimizing
some query,
it was some parameter which was true or false.
And I was trying to optimize for a value which was present in 90% of rows,
majority of rows.
And it was, of course, difficult because selectivity is low,
cardinality is high, right?
But then I understood
all this hour spent was wasted, because it's never so: never, never ever does the application
use this value in the query, right? So the idea is that we need to understand real cases, not super rare cases or
non-existent cases; we need to understand the cases which hurt user experience,
for example, and focus on them. In this case, the generic plan is something really weird in this picture I just painted, right?
So what do you think?
Like, generic and custom.
Maybe let's talk about how the planner uses them
and then discuss what should we do about it at all.
So this five times rule, what do you think?
Yeah, I think that catches people out sometimes. So
the way prepared statements work, or at least my understanding, is that you can prepare a query in
advance, and that would do, like, the parse stage; it will do several stages up to, but not
including, the execution of that query. Then you can ask it to execute with parameters.
Now, on the subsequent executions, you don't have to re-parse,
like, Postgres doesn't have to do several of those steps again.
Well, planning at first does have to be done again, right?
So it's the parsing and, like, the rewriting stages that are skipped.
Yes, on a per-session basis, we don't have to do several of
those steps. But there is also this weird kind of initial phase. So if you've
prepared the statement and then you execute it, the first five times Postgres will
plan each of those five executions separately, based on the parameters that are given. So,
in the case you gave, if you used true (was it true that was 90% of the time?), if you've used true
every single time for those five times, it's going to plan it the same way each time. And on the sixth
execution it will then look at the average cost of those plans, including the cost of planning,
and compare it to a generic plan that it calculates,
or, I think there's some subtlety around exactly how precise that is,
and then it may choose its generic plan thereafter.
It will flip to not planning each new execution.
So sixth time, seventh time, eighth time, ninth time, forever
onwards during that session, it won't replan. So now you're not only saving on the
parsing and the rewriting stages, you're also saving on the planning stage, which you mentioned already.
But that's not true if, for example, you did three or four executions with true and then a couple of executions with
false, in your example; then you're going to get a different kind of average cost for those five
executions. Then that will get compared to the generic plan, and then a different decision will
get made. So those first five do actually determine quite a
lot of what happens thereafter, which is interesting.
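A minimal way to watch that behavior yourself (the table, index, and values here are made up, and the exact plan output will vary):

```sql
-- Hypothetical table with a skewed boolean column.
CREATE TABLE items (id bigint PRIMARY KEY, archived boolean, payload text);
CREATE INDEX ON items (archived);

PREPARE get_items(boolean) AS
  SELECT * FROM items WHERE archived = $1;

-- Executions 1..5: each one is planned with the actual parameter (custom plans).
EXPLAIN (ANALYZE) EXECUTE get_items(true);
EXPLAIN (ANALYZE) EXECUTE get_items(true);
EXPLAIN (ANALYZE) EXECUTE get_items(true);
EXPLAIN (ANALYZE) EXECUTE get_items(true);
EXPLAIN (ANALYZE) EXECUTE get_items(true);

-- Execution 6: the planner compares the average custom-plan cost (plus planning
-- cost) to a generic plan; if the generic plan doesn't look more expensive, it is
-- cached, and the plan shows "Filter: (archived = $1)" instead of a literal value.
EXPLAIN (ANALYZE) EXECUTE get_items(true);
```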
But yeah, the one way I've seen that the five becomes important is that this can be really confusing for folks.
If you're trying to debug something and it's fast for a while and then suddenly it goes slow, this is one of those telltale signs once you've been around a while.
If you notice that it's fast for five executions, then slow on the sixth,
you should look into this issue,
custom plan versus generic plan.
It's kind of a telltale sign, like a smoking gun,
or like a, you know, it's a telltale sign
that this could be what you're hitting.
But aside from that,
I haven't really thought too much about it.
I don't think it's configurable, that five number.
I think it's hard-coded.
Yeah, it's hard-coded. But by the way, why five?
Yeah, good question.
Some decision, right?
That's it.
But what is a generic plan?
A generic plan is just: we don't take the parameters into account at all.
It means that we don't take into account this distribution of values, or we take... yeah, good question.
Did you look into how it's calculated in the first place?
No, but it's an interesting question, right?
And I could be wrong, but I would have guessed that it would be based on
the statistics at the time, right? Like, if you were implementing this feature, you would think, if you've got a data distribution there, maybe you'd factor that in. But it's difficult,
because if it's skewed... If it's, like, completely uniform, it's quite easy. Like, if we're talking
about primary key lookups, it's a primary key lookup query, and every single one's unique, and every
single one's only going to return one row. It's only scanning one table, that kind of thing.
Generic plan is really easy because every single parameter you could provide to that
is going to be planned the same way
even in the custom plan mode.
So generic plan in that case is quite easy.
But in the case where it's a skewed distribution,
some values have more entries than others,
it becomes tricky, I think,
to know how you would plan that in
a generic way.
So I don't actually know.
For me, the generic plan is a good thing when we don't have a lot of time to spend finding
particular examples in logs or in pg_stat_activity.
In pg_stat_activity, we have a different... how is it called, track_activity_query_size,
this parameter, which
by default is 1024,
meaning that really long queries
are truncated and we don't see
the full query.
Parameters usually are at the end,
right? Like WHERE... after SELECT
all columns,
our favorite, right,
FROM table, and, like, so many column names
that I don't see the parameter;
"something equals" is truncated.
I saw it more than once.
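For reference, the truncation Nikolay is describing is controlled by this setting; changing it requires a restart, so it has to be set up in advance:

```sql
-- Default is 1024 bytes; long column lists or IN-lists get cut off in
-- pg_stat_activity.query, often before the parameter values appear.
SHOW track_activity_query_size;

-- Raise it (requires a server restart to take effect):
ALTER SYSTEM SET track_activity_query_size = 32768;
```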
So when we don't want to spend time,
when we want something really fast,
there is this trick with a prepared statement.
We prepare a statement, and then we use just NULLs as all the values.
This gives us a generic plan.
But also we can say SET plan_cache_mode TO force_generic_plan,
which exists, as we just
checked before we started recording
this episode...
surprisingly, the plan_cache_mode
configuration setting
has existed for many, many years.
So in all currently supported versions it's supported, right?
Yes, since version 12.
Exactly.
So 12 is the oldest currently supported version.
A few months left till the end of its support.
Yeah, yeah.
So this trick
allows us, when looking at pg_stat_statements,
to just
take some normalized query,
which has this $1, $2,
which doesn't have parameters at all,
they are removed,
and just use this prepared statement
with NULLs trick
and get some generic plan, and already think about how bad the situation is in general,
not in particular cases, but overall.
But this trick, as you absolutely correctly mentioned before this episode recording,
doesn't work if we have, for example, a primary key there, because primary key equals NULL, or IS NULL,
won't work properly, because there is a NOT NULL constraint.
So the planner knows that nothing will be returned.
It doesn't show us the proper plan at all.
Yeah, if you don't specify, like if you don't set plan_cache_mode
to force_generic_plan, then you'll get a custom plan
and the planner will shortcut it, because there's a really... yeah.
But if you do force the generic plan, on version 12 onwards,
then you will get the generic plan you're hoping for.
Well, that's great.
I didn't realize that.
So, yeah, good, good.
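For reference, the NULL trick plus force_generic_plan looks roughly like this (the table and query here are made up):

```sql
-- Take a normalized query from pg_stat_statements and prepare it as-is.
PREPARE q(int, timestamptz) AS
  SELECT * FROM orders WHERE customer_id = $1 AND created_at > $2;

-- Without this, EXECUTE with NULLs gives a custom plan, and a NULL compared to
-- a NOT NULL / primary key column collapses to "nothing will be returned".
SET plan_cache_mode = force_generic_plan;

-- Now the plan is built without looking at the parameter values at all.
EXPLAIN EXECUTE q(NULL, NULL);

-- Back to default behavior and clean up.
RESET plan_cache_mode;
DEALLOCATE q;
```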
But if we are lucky enough to already have Postgres 16, all these tricks are not needed,
because we can just say EXPLAIN (GENERIC_PLAN); it's a new option in the EXPLAIN command,
and this is very handy, right? We just ask for a generic plan and EXPLAIN. Don't care about... And it's interesting that we can even use $1, $2, this query text,
right from pg_stat_statements, this is super convenient.
We can say EXPLAIN (GENERIC_PLAN) and that text.
We can even probably automate this, right?
And see, for example, if we have sequential scans in generic plans, and, like, put warnings somewhere
and so on. But of course this won't work if we want to execute, if we want EXPLAIN (ANALYZE, BUFFERS);
this won't work, I mean, in this case we need to replace $1, $2 with some values.
Yeah, and it makes sense, right? Like, we can't do the execution unless we specify something.
Otherwise, what are you executing?
So it makes sense.
And kudos to Laurenz from Cybertec who added this and has blogged about it.
So I'll share the blog post about that.
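On Postgres 16 and later, the same idea works without PREPARE, taking the $1/$2 text straight from pg_stat_statements (the query itself is hypothetical):

```sql
-- EXPLAIN (GENERIC_PLAN) accepts parameter placeholders directly.
EXPLAIN (GENERIC_PLAN)
SELECT * FROM orders WHERE customer_id = $1 AND created_at > $2;

-- Note: it cannot be combined with ANALYZE; for execution you still need
-- real values in place of $1, $2.
```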
I remember actually this wave from Laurenz, and also Lukas Fittl
of pganalyze blogging about generic plans
and also getting excited.
Also Franck Pachot, I think, joined at some point.
I actually don't know who started this,
but I remember waves of thinking,
multiple blog authors posted
about these generic plans a few years ago.
It was great.
I don't think it's super surprising that this has come out of a consultancy,
like a Postgres consultancy, and that Lukas is excited about it.
So I think it's one of those things that really helps people help others.
Because if you really know what you're doing, like these people all do,
you can now set up Postgres in a way where you can find parameter values
in the vast majority of cases, or you can reconfigure things
so that it does log the query ID,
or you can get your slowest queries logged with parameters.
So there are these ways of doing it now,
but if people haven't set that up in advance
and you're helping them with a production issue,
this is a really helpful tool to give you a clue
as to what's going on. Now, with EXPLAIN alone there will only ever be a clue, because you
don't have the execution statistics, you don't know for sure why it's slow, but those clues are often
enough for experienced people to get a theory, get a hypothesis as to what's going wrong, and then it
gives them something to be getting on with, instead of only being able to say, I have no idea why this is.
It's kind of like extra information to make the guess a little bit more educated
rather than an actual answer.
Yeah, and we can have a holistic approach now,
like checking everything in an automated fashion, not guessing and so on.
It's great.
So, yeah.
What else worth mentioning in this area?
Maybe partitioning,
which we like.
It's funny that you found this commit
in Postgres 15, right?
Which improved the situation.
My general impression,
if you use a lot of partitioning, you should perform major upgrades
promptly, not lagging at all because every year a lot of things are improved.
So this is a quite in-depth thing that was improved in Postgres 15. Partition pruning, so exclusion of irrelevant partitions based on...
If a user asks to return some data in a query, Postgres can remove partitions which are for sure irrelevant from consideration.
For example, we have partitioning by time, range partitioning by time, by date.
And then the user for sure wants fresh data.
We definitely don't need to look at very old partitions, partitions with very old data.
But this might happen both at planning time and at execution time, which can be a big surprise.
And you want this to happen at planning time, basically; that's better.
Because this will mean that we can rely later on a cached plan,
or if we use prepared statements, we can rely more on cached plans,
and then we have a very good situation when partition pruning
was already done at planning
time; we just execute, that's it. But it's tricky when we talk about a generic plan, because a generic
plan doesn't take the values into account. So something was improved in Postgres 15 in this
area, right? Or...
Yeah, we tried to look into it before the call, didn't we?
Yeah, I couldn't reproduce the problem somehow.
Yeah, so we tried on an old version before this change.
So it would be interesting to hear from people that have reproduced this.
I didn't see a blog post about it.
I saw it referenced a couple of times.
Yeah, Lukas Fittl, in his blog post, the generic plans one,
mentioned that before, we had a problem with generic plans: partition pruning didn't work at planning time, it worked only at execution time. I couldn't reproduce it, but if it's so, it means that, for example, if you have a lot of partitions with a lot of indexes, at execution time you involve all of them. In general, if you have a lot of partitions,
just checking them,
not returning everything,
it's big overhead,
huge overhead.
But at least at execution time
Postgres doesn't
put locks,
share locks, on all the indexes,
because at planning time it does.
And if at planning
time we don't have partition pruning working, it's a nightmare in this case.
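One way to see where pruning happened, as a sketch (hypothetical range-partitioned table; exact behavior varies by version, which is what we couldn't pin down here):

```sql
-- Hypothetical: events partitioned by range on created_at.
PREPARE recent(timestamptz) AS
  SELECT count(*) FROM events WHERE created_at >= $1;

SET plan_cache_mode = force_generic_plan;

-- Plan-time pruning: irrelevant partitions simply don't appear in the plan.
-- Executor-time pruning: they appear, but EXPLAIN (ANALYZE) shows
-- "Subplans Removed: N" and "(never executed)" on the skipped partitions.
EXPLAIN (ANALYZE, BUFFERS) EXECUTE recent(now() - interval '1 day');
```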
Right.
Well, I think that would happen, yeah. So, a question for you, because we do now have this plan_cache_mode. So plan_cache_mode
got added in 12; that's one of the things we can use to force a generic plan or force a custom plan
while we're debugging, while we're trying to, um... like, pre version 16,
before we had this EXPLAIN (GENERIC_PLAN), we could use that for debugging. But I don't think I've ever seen anybody change it.
So plan_cache_mode got
added. You have
three options: auto,
force_custom_plan,
and force_generic_plan. Auto
is the default. So auto acts as
you described. Yes.
So the first five executions.
Yeah, five times custom, then a decision is made what to choose.
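If I remember correctly, since Postgres 14 you can also watch which way the planner went for each prepared statement:

```sql
-- generic_plans / custom_plans counters were added to this view
-- (PostgreSQL 14+, if I recall correctly).
SELECT name, generic_plans, custom_plans
FROM pg_prepared_statements;
```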
Well, I was going to ask, because I think I got the impression that some of these ORMs, like ActiveRecord, for example, use prepared statements.
I got the impression the main reason they do so is to protect against SQL injection attacks, not for performance reasons.
That might be not quite accurate,
but there is this other benefit
that seems to be mentioned quite a lot
around prepared statements.
Now, I wondered if you'd ever seen anybody change it;
like, if you were only using prepared statements
for SQL injection protection
and you weren't as bothered about the
performance benefits in some cases, or pitfalls in other cases, you might want to allow Postgres to
replan every single time. And I could imagine some people considering turning force_custom_plan on.
I can imagine the upside of doing so. I can't see quite the same upside for
doing force_generic_plan, because all you're doing is saving those five... well, I think you'd only save
those five executions and then you'd flip to the generic plan anyway. Maybe there's another case
for that. But yeah, have you seen anybody changing that setting in any cases?
I see that almost everyone doesn't know about it. It's very deep.
It's not super, super.
It feels like some kind of fine-tuning.
I see people have fears, for example,
to rely on prepared statements globally
because of the idea:
what if the generic plan doesn't work well, right?
So if generic plan is bad for a specific parameter.
If generic plan is chosen, it's cached, and then it's bad for some cases,
and we have performance degradation for particular parameters.
But also, this idea to use prepared statements,
I don't think these days it matters that much as protection.
I think the main reason is planning time, which is usually fast,
but not always fast, as we know.
And the idea that during planning time Postgres locks all the indexes,
it's a super big surprise,
which many people, including myself, had not long ago.
And we had benchmarks showing this small overhead added with each index, for selects, not for updates, for selects.
And at some point, the total number of relations, both tables and indexes, exceeds 16, which is also a hard-coded constant.
If the query is very frequent, 1,000 or 2,000 per second, and primary key lookups usually are,
we have a lock manager spike, and partitioning increases the probability of it.
And that's why getting rid of planning is good.
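A rough way to see this from another session during such a spike (purely illustrative):

```sql
-- Relation locks that did not fit the per-backend fast-path slots show
-- fastpath = false; lots of these on hot, frequently planned queries is
-- the warning sign discussed here.
SELECT pid, locktype, relation::regclass, mode, fastpath
FROM pg_locks
WHERE NOT fastpath
  AND locktype = 'relation';
```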
And when you get rid of planning,
it's worth understanding this behavior of how caching works
and generic versus custom.
But just changing this globally,
I'm not sure this is a popular decision.
But, of course, my experience is very limited.
Let us know if you have changed it.
It'd be interesting to hear from you.
Yeah, it's an interesting topic anyway.
But in general, I think it's worth spending efforts
to get rid of planning for high-frequent queries,
especially for partition tables.
I have one more related question.
So now PgBouncer supports prepared statements.
I saw they have a max_prepared_statements parameter. It's off by default, which I didn't realize.
Do you have any opinions on tuning that?
I saw that they said a default of 100 might make sense.
Yeah, some high value is good.
I think it depends on the situation, of course.
But in general, it's not an issue to just increase it.
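For reference, in pgbouncer.ini this looks roughly like the following (the feature arrived in PgBouncer 1.21; the values here are illustrative):

```ini
[pgbouncer]
pool_mode = transaction
; 0 disables prepared statement support (the default at the time of this
; episode); the PgBouncer docs suggest something like 100 as a starting point.
max_prepared_statements = 100
```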
Yeah, cool.
But again, I see the tendency to be afraid
of turning prepared statements globally
in already existing heavily loaded projects.
If you start from scratch,
it's a very good decision to make
to start using prepared statements
from the very beginning.
And then you just grow and see the problems,
fix them as usual,
like you would do anyway.
And then the confidence level
of using them is already there,
but switching for a big project from
zero to 100%
coverage, it's a
very hard decision to make.
Hard as in scary?
Just people are not sure
and that's the problem.
And there is no good path.
Maybe there is a good path to it.
But in a couple of big companies I observed recently, people decided not to proceed with this after a lot of consideration.
It's kind of an interesting topic.
Instead, it feels
safer to start using prepared
statements only for specific queries.
It's a slow way in.
Maybe for new features.
No, not new features, not at all.
Again, the problem is usually this lock manager overhead:
when we have 1,000 or 2,000 QPS,
we see very simple queries, SELECT * FROM...
or SELECT something FROM table WHERE id equals something,
a primary key lookup.
And if we have planning time every time, the select itself is fast.
Planning itself is also fast.
But if you have a lot of backends competing, an access share lock is not a problem at all,
right?
But the lock manager has overhead, because fast path is false.
And this is what multiple companies experienced recently.
And Jeremy Schneider called it the lock manager's revenge, or something like that.
Last year was the lock manager's year in terms of performance issues people experienced.
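When this is happening, sampling pg_stat_activity tends to show it; a minimal sketch:

```sql
-- Sample active sessions and see how many are waiting on the lock manager LWLock.
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2
ORDER BY count(*) DESC;
-- A pile-up of wait_event_type = 'LWLock', wait_event = 'LockManager'
-- ('lock_manager' before Postgres 13) is the pattern discussed here.
```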
So you're saying, yeah, so I'll share the episode where we discussed that in more detail.
So you're saying
they choose to turn prepared statements
on only for those queries
because it reduces the issue there
but not globally.
Makes sense.
Just to remove these hotspots.
Only for
these primary key lookups.
I mean, so if you're talking about these super fast queries, even though
both execution and planning time are fast, it could be that planning time is more
than execution time. Like it's quite common. It happens. So even if
it wasn't for the issues you're describing, you could maybe increase
throughput, or, like, there's a lot of overhead to be saved
by not having to plan those queries.
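That is easy to check for a specific query; the two summary lines at the bottom of the output are the ones to compare (hypothetical lookup):

```sql
-- For a simple primary-key lookup, "Planning Time" often rivals or exceeds
-- "Execution Time" in the summary at the end of the plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM items WHERE id = 42;
```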
Yeah, shaving off planning time is a good idea
in many cases.
In general.
For example, partitioning.
If partitioning is there,
planning time can be high
if you have a lot. It can depend on
the number of partitions. This is
the problem as well.
In this case, of course,
getting rid of it, it's good.
Nice.
Cool.
Anything else you wanted to add for this one?
No.
So it's an interesting thing to keep in mind
when dealing with plan optimization.
So use EXPLAIN (GENERIC_PLAN) in Postgres 16
and in future Postgres versions.
Yeah, if you can't, as a
starting point, or if you can't
get parameters.
If you don't want to go to logs
to find examples, or maybe
log_min_duration_statement was not configured somehow.
Exactly. If it wasn't
configured for the time where you're having issues,
or if you can't turn it on
easily, like if you've got a lot of process at your company before changing some of these parameters, you know.
So there might be reasons why you can't get real values. But of course, I think the
reason I don't know as much as maybe I should on this is that if you can get the real values, performance
optimization work becomes so much easier.
Of course. Less guesswork, yeah. This generic plan should be something like a partial solution
in terms of analysis.
It's a partial analysis.
It should always come with a remark, you know: maybe this is it, but maybe not.
Yeah, like a stepping stone on the way to the solution,
but it's not a solution in itself.
Right.
On the other hand, it might give very quick insights into what to do and how to speed things up.
Yeah, good point.
Good?
Yeah.
Nice one, Nikolai.
Thank you so much.
Thank you.
Have a good day.
Bye.
You too.
Bye.