Postgres FM - Intro to query optimization
Episode Date: September 2, 2022

Here are links to a few things we mentioned:
- Using EXPLAIN (PostgreSQL documentation)
- explain.depesz.com
- explain.dalibo.com
- pgMustard
- EverSQL
- pganalyze
- pg_stat_monitor (extension by Percona)
- Recent thread on hackers mailing list about plan_id in pg_stat_activity
- auto_explain
- EXPLAIN observer effect (OnGres blog post by Álvaro Hernández)
- auto_explain overhead (blog post by Michael)
- pg_test_timing
- Database Lab Engine (for thin clones)
- Our previous episode on BUFFERS
- EXPLAIN Explained (talk by Josh Berkus)
- A beginner's guide to EXPLAIN (talk by Michael)
- A deeper dive into EXPLAIN (talk by Michael)
- EXPLAIN glossary (pgMustard docs)
- Topic suggestions document

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov and @michristofides or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:
- Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard.
This is Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Hello, hello.
Let's talk about query optimization.
I think this is maybe the most interesting topic in the area of Postgres in general,
but I've found that not everyone is interested in it.
But let's talk about it anyway.
Yeah, it's also, I guess, a topic quite close to both of our hearts. We've spent many years looking at this, so hopefully we have some interesting things to add.
But let's set some boundaries. Let's distinguish analysis of the workload as a whole, attempts to find the worst parts, the best candidates for optimization, versus single-query optimization. Let's talk about the second subtopic.
Yes. I've heard you differentiate between macro performance
analysis and micro performance analysis in the past. So macro being system level, and I guess
we're not talking about that today. And we're going to look more at micro. Once you've worked out there is a problematic query, you know which one it is.
How do you go from that to what can I do about it?
Right.
How to understand whether it's good or bad in terms of execution, and how to read the query plan,
using the command explain, which is the main tool here, right?
Let's talk about this.
Is there anything else before we dive into explain?
Are there any other parts of it that we might need to cover as well?
Well, there are things that explain doesn't cover.
For example, it won't tell you CPU utilization,
user CPU, system CPU.
It won't tell you physical disk I/O, how many operations happened at the disk level,
because Postgres doesn't see them directly; Postgres works only with
the file system cache. So the I/O operations, the reads, that Postgres reports are not necessarily
from disk; they may be from the page cache.
These things you can only get without using explain.
It would be good to have these things inside explain somehow,
like pg_stat_kcache extends pg_stat_statements.
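For reference, a very rough sketch of what pg_stat_kcache adds on top of pg_stat_statements; this is not from the episode, and the function and column names are from memory, so treat them as assumptions and check the version you have installed:

```sql
-- Rough sketch: per-query CPU time and physical disk I/O from pg_stat_kcache,
-- joined with pg_stat_statements for the query text.
-- Assumes the extension is installed; column names vary between versions
-- (the exec_* prefixes below are from recent pg_stat_kcache releases).
SELECT s.query,
       k.exec_user_time,    -- user CPU time spent executing
       k.exec_system_time,  -- system CPU time spent executing
       k.exec_reads,        -- bytes actually read from disk
       k.exec_writes        -- bytes actually written to disk
FROM pg_stat_statements AS s
JOIN pg_stat_kcache()    AS k USING (userid, dbid, queryid)
ORDER BY k.exec_reads DESC
LIMIT 10;
```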
It would be good to have something similar that extends explain,
but I'm not aware of any such thing existing.
Nor me.
I'm not aware either,
but I think we get clues about them in explain, don't we?
We see some timing that can't be explained otherwise.
I.O. timing.
So, you know, I mean, let's say you mentioned CPU performance.
If we have an operation that's not doing much I.O., but it is taking a long time, that's a clue that there might be something else going on.
Right, right, right.
So in general, if we run even one query,
theoretically, it might make sense to use things like perf
and flame graphs for a single query execution,
and that would augment the information
that you can extract from explain analyze buffers.
But let's just discuss the basics, maybe: where to start.
Sounds good. So possibly even explain versus explain analyze is a good place to start. With explain we get the query plan, and that normally returns really quickly, roughly in the planning time of the query. With explain analyze, well, it runs the query and returns performance data, so we can see how much time was spent and, if we ask for buffers, how much I/O was done, and all sorts of other things as well. It also allows us to compare things: explain might tell us how many rows were expected to be returned at each stage, and with explain analyze we get the actual number of rows returned at each stage, and comparing the two can be really useful.
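For anyone following along, a minimal sketch of the two forms; the orders table and the query are made up for illustration:

```sql
-- Plan only: does not execute the query.
EXPLAIN
SELECT * FROM orders WHERE customer_id = 123;

-- Executes the query and reports actual timings, row counts and, with
-- BUFFERS, how much I/O was done at each node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 123;
```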
Right, right. Yeah, absolutely. And also, we discussed some time ago that explain shows
only one plan. Sometimes you want to see multiple plans, like the second-best candidate and so on. Otherwise, you have to do some tricks to try to guess
what the planner had on the plate when choosing.
That's a really good point, and maybe even a good place to start
in terms of using Explain.
So the first thing you'll probably notice when you're looking at explain
for the first time is a lot of cost numbers.
These are in an arbitrary unit that gives you an idea of how expensive things are. It's a kind of estimate: as the cost numbers go up, Postgres thinks it will take longer to execute, but they're not an estimate of milliseconds; they're not in any real unit. But you can then use a couple of parameters, like enable_seqscan, to affect those costs. So you could maybe try and get the second-best plan by making the current plan very expensive. So if your query is currently doing a sequential scan and you want to see if it could use an index, and it's just choosing not to, you can disable seq scans.
Well, it doesn't actually disable seq scans.
It just makes them incredibly expensive.
Right, that's some penalty.
Exactly.
So you might still see that in the costs.
This trick is very helpful when two plans are very close to each other
in terms of cost, overall cost.
And if you disable index scans and see a different plan, and you see the cost is very close, it gives you the idea that we are on the edge, right?
We either crossed it recently, because data is changing, or we are about to cross it, and it's kind of dangerous.
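A quick sketch of the trick being described; the query here is hypothetical:

```sql
-- This does not truly disable sequential scans; it only adds a huge cost
-- penalty, so the planner avoids them unless there is no alternative.
SET enable_seqscan = off;

EXPLAIN
SELECT * FROM orders WHERE created_at > now() - interval '1 day';

-- Compare the total cost of this plan with the original one, then restore:
RESET enable_seqscan;
```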
But first of all, I would like to mention that a plan is a tree, right? Cycles are not possible,
which is important, because they could be possible, actually.
Well, I mean, I think in the simplest cases, yes, but with CTEs you can get some strange things; you can refer to the same CTE more than once, for example.
But yeah, in the simplest cases, it's...
But when it's already executed, it's still like a tree, right?
Oh, interesting, by the way, yes.
Anyway, roughly speaking, it's a tree.
And when it's printed, it's a tree.
But loops are possible inside, of course, right?
And an important thing to understand is that the regular metrics, such as cost, rows, and timing, are
shown for one iteration inside a loop.
But buffers are a sum over everything.
And it can be confusing sometimes, right?
Yeah.
Well, let's go back to the tree.
I think that's really important.
And a few things that aren't obvious when you're first looking at it: logically,
it's happening almost backwards, so the first node that you see in the tree is the last one to be executed.
Whereas it goes from leaves to root.
Exactly. So kind of outside in, like kind of right to left a little bit. And there are also some really important statistics, especially when you use explain analyze, right at the bottom: some summary metrics like execution time, planning time.
Oh, they're printed separately from the tree, right?
Right. And things like trigger time and just-in-time compilation. Each of these things can be dominant sometimes; they can be where all of the time's going. And if you have a really long tree, the general recommendation is to start kind of right to left, but I'd also say check that bottom section, because you might not have to look through the entire tree if you find out that your query is spending 90 percent of its time there.
And a single additional point here is that planning time can sometimes be very, very big, as I had cases when inspection of a merge join path
led to a huge scan during planning time.
And disabling merge join helped,
but it was not so obvious at all in the beginning,
because if you don't notice that planning time is in seconds,
so insanely huge, it's a surprise for you. So checking planning time is also useful.
Yeah, same for time spent in triggers and time spent in just-in-time compilation as well. For some analytical queries, you might consider it a relatively simple query, or it should be quite fast, but if the costs are overestimated a lot,
sometimes just-in-time compilation kicks in,
spends several seconds thinking it's saving you time,
but then the overall query is only a few milliseconds.
That's suboptimal.
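As an illustration of that scenario, here is one way to spot and rule out JIT for a session; the query is hypothetical:

```sql
-- If the JIT summary at the bottom of EXPLAIN ANALYZE shows seconds of
-- compilation for a query that otherwise takes milliseconds, try it again
-- with JIT switched off for the session.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM orders WHERE total > 100;

SET jit = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM orders WHERE total > 100;
```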
Right. And also, since Postgres 13,
if you include the buffers option,
it will show you buffers used for planning as well.
Right, yes. And actually, one other thing on planning time before we move on from that
is that auto_explain doesn't include planning time, so there you can't spot planning time issues in auto_explain, other than... I think we discussed this once.
Yeah, I think we discussed it ages ago, because it's probably the only use case for logging the query time plus the auto_explain time: you could then diff the two, and it's probably planning time that's the difference. But yeah, it's a limitation. As you say, it's not super common that planning time is the dominant issue, but when it is, it can be 90-plus percent easily.
Right, it can be unexpected. This is the danger of it, right?
Yeah. Right, right. So: right to left, kind of inside out, start at the bottom, check the main statistics.
You mentioned briefly that some of the statistics are per loop.
So loops are, I think, quite a confusing topic when you're first getting used to it.
And especially if there are, you know, 10,000 loops, you could easily miss that one of the statistics looks quite small.
But once you multiply it by 10,000, it can be really big.
So examples are, of course, the costs and the timing, but also things like rows removed
by filter.
Sometimes people look out for those numbers.
If it says one, that's a per loop average.
And actually 10,000 of those is suddenly not insignificant.
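To make the per-loop averaging concrete, here is a small made-up illustration; the tables, index name, and numbers are invented:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.status = 'paid';

-- Hypothetical output fragment for the inner side of a nested loop:
--   -> Index Scan using orders_customer_id_idx on orders o
--        (actual time=0.030..0.030 rows=1 loops=10000)
--        Rows Removed by Filter: 1
-- rows=1 and "Rows Removed by Filter: 1" are averages per loop; with
-- loops=10000 the scan really returned about 10,000 rows and filtered out
-- about another 10,000. Buffers, in contrast, are summed across all loops.
```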
Right.
And those averages can be rough.
There's also some error that can be present there.
Well, especially around 0 and 1,
because some of the numbers are integers.
Yeah, so for anybody that's wondering: exactly, it's rounded to the nearest integer,
and if it's less than 0.5 it gets rounded to 0,
which doesn't necessarily mean that there are zero, which can be problematic.
Also, you know what?
We're discussing things for which there are probably many talks and many articles that are useful.
I think this podcast is not going to replace them.
We are trying to highlight problems which can be tricky in the beginning.
And one more thing: I think we should mention tools as well.
First of all, explain.depesz.com is the oldest one and still very, very popular, maybe still
the most popular.
Then explain.dalibo.com, which is PEV2, greatly improved, very good.
And of course, pgMustard, which is commercial, which you develop.
Right.
Worth checking all of them.
They are good.
Pros and cons for all of them.
But what I think is important to understand is something meta.
When we talk about single-query analysis, we should say: okay, this is our plan. Better if it was collected with execution, and
best if it's execution collected with buffers. And we will discuss overhead a little bit later, by the way, right? But when you ask someone to help with optimization, of course, the first question
will be: show me the query itself, right? Sometimes: show me two
plans also, if there was some change and we want to understand why this change influenced the plan.
So we have two plans, but we also need to have the query; that's a must-have.
But additionally, I think it's very important to get the Postgres planner settings, like enable_seqscan or random_page_cost, seq_page_cost, all the costs.
work_mem as well.
Even though work_mem is not inside the planner group.
Postgres settings are grouped: you can select star from pg_settings and see the group names.
And there is a whole group of planner settings,
and work_mem is not there. But work_mem influences planner decisions.
If you change work_mem, the plan can be different.
Right.
So right now my rule is: let's take planner settings plus work_mem.
Maybe something else.
I'm interested to see if something else should be there as well.
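A minimal sketch of collecting those settings; the category names are as I recall them from pg_settings, so double-check them on your version:

```sql
-- Planner-related GUCs live under the "Query Tuning" category in pg_settings;
-- work_mem sits under "Resource Usage" but influences plan choice too.
SELECT name, setting, unit
FROM pg_settings
WHERE category LIKE 'Query Tuning%'
   OR name = 'work_mem'
ORDER BY category, name;
```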
So when we ask someone to help, we need to present a plan or two, the query, and the planner settings. And I also believe the schema is important to present: what kind of schema we had, tables, indexes, and probably
statistics as well. This is the whole picture to analyze what we had. Imagine if all these tools
collected these things automatically, for example. How great it would be to jump in, like, I want to help someone with optimization,
and I see the whole picture.
I see the query, plan, settings, the schema.
Not the whole schema, only the part which this query deals with.
And also statistics. Maybe statistics is kind of tricky, but
this is the basis for planner decisions; this is what defines which plan we get. And of course the Postgres
version as well, because on a different version the plan can be different; different nodes
can be present in the plan depending on the version. So what do you think about this big whole picture?
I understand that none of the tools collect this information, and none of the tools require users to present this information,
but it would be great to store it in history, for example.
Yeah, there are some really interesting tools
that do some of that, but not all of it.
So there's a tool that started as a MySQL tool,
but they've added Postgres support, called EverSQL.
That asks for things like the query and the schema
and things like that,
and then does some static analysis,
which is super interesting.
There are tools like pganalyze, which is a monitoring tool,
and it's been doing, for at least a couple of years now, more
ad hoc query analysis via explain visualizations, and it has access to
a lot of that information already by the nature of being a monitoring tool. But I think
there's also this natural trade-off between the macro analysis, all of the information you can gather and the overhead of
doing so, versus the amount of information you're willing to gather for a single query. Once you know
a certain query is a problem, you're willing to pay a higher overhead, because you
only need to gather that once, whereas if you want to do this all the time for every query,
I think there's a higher overhead. So I think there's some tension.
We could do some hashing, for example, to track that statistics didn't change and Postgres parameters didn't change; we could just check it automatically with a hash.
Yeah, I think there are some super cool things here, but there are also a couple of different environments,
right? So in production, one key thing is that, let's say, statistics is a
great example: production is not the same as staging. We can make all of the
data the same, but it depends; production might not be the same as staging, and
therefore a statistics problem may not show up. And if you're, let's say, doing some development work, it's really
tricky to reproduce all of those things. And I think I'd also push back that for
some of the most obvious things that can be a problem, maybe even the schema
doesn't matter, maybe even the query doesn't matter. If you see that somebody's doing a sequential scan of 10 million rows,
probably in parallel, with a filter,
and it returns just one of those rows...
Without the query, we can tell that...
Enough to suggest an index, right?
Exactly.
So there's a bunch of cases where you can give some pretty sensible advice
without any of that
extra information, but definitely, as it gets more complex, I think more of those things can be
useful. But even in the index case, you know, I think you still need context from
the customer in terms of trade-offs. If this is a super high-write table, you
might be less inclined to add an index than if it's not, or the customer might
have certain requirements. There's always an "it depends", right? I think whenever you're giving this kind of advice, you have to be careful that for different
customers different things would be sensible. I guess, except for extreme...
Well, I understand that what I just described, a collection of all these pieces, requires a lot of effort.
So that's why it should be automated.
But imagine if everything was collected automatically inside some tool,
used in an organization and stored historically.
You understand: okay, we optimized this query before,
and now we're trying to optimize this query again,
and we know the whole context.
When we ask for help, we have all the pieces.
And if an expert comes to help us, all the pieces are present.
It would be much easier to help, right?
Yeah, and I think there are some interesting projects in this area.
I think, if you've come across the one by Percona,
they're doing a kind of replacement for pg_stat_statements.
Well, it's about macro analysis as well.
pg_stat_monitor, right?
pg_stat_monitor, yes.
Do they do the same things as pg_stat_statements, or no?
Yes, but with additions. Like, I think they let you track
query plans per query, so I think you could, for example, see if a plan has changed. So with
that kind of thing, with relatively low overhead, I think you start to get a bit more of that
information. So when an expert comes along,
hopefully this is something already installed
and already present.
This is an old, big discussion.
There is a current ongoing discussion
in the pgsql-hackers mailing list.
Someone from AWS proposed adding
a plan ID to pg_stat_statements,
triggering the discussion one more time, and this would
be great. Of course, we know that
each query registered in pg_stat_statements might
have multiple plans
depending on the parameters used.
So when you optimize a query,
a very important thing, I
missed it in my list: the parameters
you used, right? Because different parameters
may trigger the planner to choose a different plan, right?
So it's very, very important.
When we optimize a query, we cannot just say we optimized a query.
We must say we optimized a query for some parameters.
And we need to think about variations that we should expect on production and check them too.
Not just a single case, right? And this is tricky, by the way.
Yeah, so if anybody's wondering, the simplest example of this is: let's say you have a column where 99% of the data is a single value,
and the other one percent is millions of unique values. If you search for one of the unique values, you might get an
index scan; if you search for the value that 99% of the table is, then you should get a sequential
scan. That would be the optimal plan. So that's the simplest example.
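A small sketch of that skew scenario; the table, values, and row counts are made up:

```sql
CREATE TABLE events (status text);
INSERT INTO events
SELECT CASE WHEN g % 100 = 0 THEN 'rare_' || g ELSE 'common' END
FROM generate_series(1, 1000000) AS g;
CREATE INDEX ON events (status);
ANALYZE events;

-- A rare value: the planner will most likely choose the index.
EXPLAIN SELECT * FROM events WHERE status = 'rare_100';

-- The value covering ~99% of rows: a sequential scan is usually optimal.
EXPLAIN SELECT * FROM events WHERE status = 'common';
```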
This is a classic example, and even setting enable_seqscan to off might not help to avoid a seq scan in some cases. And also, a couple of times
in my optimization activities, I had the case when somebody provided me a query without parameters,
and I checked the table and thought, okay, what's the worst case? And I started to optimize for
the worst case and made bad decisions, because this worst case was never used in production at all.
So it's a very, very interesting topic. I definitely want to find some approach for when
we don't know which parameters we have, but we can guess somehow. For example, some additional
tool would analyze statistics and say: oh, this set of parameters,
this and this, is the most typical case, and this is some kind of worst case,
and try to optimize for them. This would be great.
So yes, I agree. And I think, for example, auto_explain,
a very old tool, but one thing it does really well is it spits out the exact query that caused that slow
plan. And that's one way of getting at least the extreme versions of the parameters.
Or just the slow log. If you have log_min_duration_statement set to 500 or 100 milliseconds, which is good,
or at least a second or two, which is not so good but also fine, you have examples of parameters which trigger slow execution,
but you don't see the good parameter sets,
which are not registered in the slow log.
When I say slow log, I mean a part of the single Postgres log,
because Postgres has just one log.
It's a different discussion, maybe.
And if log_min_duration_statement is enabled,
you see the examples with their duration.
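For reference, a minimal sketch of enabling that; the 500 ms threshold is just an example:

```sql
-- Log every statement that runs longer than 500 ms.
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();
```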
But auto_explain is even better, because you get the plan itself.
Yeah. Is this a good time to talk about overhead?
about overhead yeah let's talk about overhead so when you run explain analyze versus you run query
without any observability tooling which explain explain allows is observability tooling, which explainalize is an observability tooling.
It adds a lot of details about query execution and planner decision, right?
But you can just run the query and see some timing, but then run explainalize and see
different timing, right?
You had a blog post on this topic, but about auto-explain.
Auto_explain is also like a related question here.
Yeah, OnGres had a really good blog post on the observer effect. So I think there are a few cases. Explain analyze can be very accurate, so it's roughly the same amount of time as running the query through your client. It can be too high, where it's adding overhead. And it can be too low, for the case where lots of data is being transmitted: explain analyze doesn't transmit that data, so it can even be faster than a query that would return the data. So there are kind of three cases, two of which are bad. They're quite rare in my experience, and, especially on modern hardware, they don't show up that often. And also, they're not that problematic when you're actually looking for the problem: if there's a relatively universal overhead added and you're still looking for what's the slowest part, it's still probably the same place. But yeah, let's explain why it happens: in order to measure timing, there is some overhead.
And I would split it into three parts. Okay, sorry for interrupting, I will split it into three parts. First, when we say explain, we just see the planner decision; we don't execute the query. Nothing to discuss in terms of overhead here, right?
Well, there's the cost of the planning work.
But it's not overhead; it's... anyway, we need it. But when we add analyze, there
is overhead: we really execute the query, but we need to measure things and see how many rows
were collected in each node, everything like that, and timing as well. But we also can say buffers,
and this is additional overhead. And we also can say track_io_timing, which is a Postgres setting;
you can set it dynamically, I guess, right? And you can see I/O timing additionally printed
by explain analyze.
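A minimal sketch of turning those on for a session; the query is hypothetical, and changing track_io_timing may require superuser or an explicit grant:

```sql
SET track_io_timing = on;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 123;
-- With track_io_timing on, nodes that read from or write to the filesystem
-- show an extra "I/O Timings: read=..." line next to the Buffers counters.
```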
And these are like three pieces of overhead, right? What do you think about each of them?
Yes. Well, as you mentioned, I did do a blog post on this, because I saw quite a few places where people would really warn against auto_explain with timing on.
There's a really strong warning against it in the Postgres docs.
There are multiple monitoring tools that tell you: if you have auto_explain on, make sure you have timing off.
At the same time, I observe very heavily loaded systems, serving a lot of transactions, like a hundred thousand transactions per second, very loaded systems where it's enabled.
Same. I was coming across customers that had it on.
Doesn't it depend on the hardware, on the CPU we have?
Yes. So there's a tool in Postgres that lets you
check, right? I've forgotten what it's called. Is it pg_test_timing? Something like that.
It's in the binary directory, in the standard package of Postgres, right?
I'll find it and link to it. But yeah, basically my understanding is: if you have pretty fast system clock lookups, the overhead can be hard to measure, but if you have a slow system
clock, then it can be extremely easy to measure. And the OnGres blog post, I think,
deliberately picks a system with a slow system clock in order to show that it can add hundreds
of percent of overhead. But when I was looking at it on an OLTP workload,
a very, very simple pgbench OLTP workload,
I was basically unable to measure it.
I think I got a 2% overhead from adding all of the parameters,
and that was basically all of it.
You know, this is my old idea,
and a couple of times we implemented it.
In any company, when we deal with many Postgres hosts,
it would be good, when we set up a host, to have a set of micro-benchmarks
checking disk I/O limits and CPU.
We can use sysbench for that, or fio for disk.
In a past life, we used Bonnie++, I remember.
And a micro-benchmark checking timing overhead
would also be great there.
And sometimes, in the cloud,
we might have two virtual machines
of the same class, same type,
but they behave differently.
So it would be good to check it all the time
we set up Postgres on some machine, right?
Yeah.
Well, yeah, this is your age-old thing; this is what you're dedicating your professional life to: experiment.
You know, if you're intrigued as to what it would be on your system, it might be different for you for some reason,
maybe for hardware reasons, maybe for workload reasons.
There might be some specific way that it's bad for you.
It's very difficult to provide general advice, and
the advice you read online will generally be cautious, especially in the Postgres documentation.
They're going to be cautious by default, because they don't want to give advice that
one person is going to find horrifically awful, even if the majority would find it okay.
Back to these three classes of overhead. I guess the first class is from
the analyze part of explain analyze, the second is track_io_timing, and the third is buffers. Let's postpone buffers
a little bit. The first two are both related to the overhead of how working with the clock is organized, but
the difference is that inside explain analyze, they are both at work.
But track_io_timing is also at work if you have pg_stat_statements, because it's registered there as well.
So regular query execution, without running explain, will also include it.
So if working with the clock is slow, track_io_timing can add some penalty when you use pg_stat_statements, right?
Same with auto-explain. Auto-explain runs on every... there is a parameter where you
can...
There is sampling, and it has existed for a long time; I didn't realize it exists. For the slow
log there are sampling capabilities since Postgres 13, I guess, but for auto_explain, you
told me, right? It has existed for many years already.
It's great.
So you can auto_explain only, like, 1% of everything.
It's great.
To be cautious at first, yeah, you can sample a really small percentage.
But naturally, yeah, for OLTP it's probably fine
because you're probably running the same queries over and over
and you don't need loads of examples of them to optimize.
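A hedged sketch of what that can look like; the thresholds are only examples, and auto_explain has to be loaded already, for instance via shared_preload_libraries:

```sql
-- Explain roughly 1% of queries slower than 500 ms, with timing and buffers.
ALTER SYSTEM SET auto_explain.log_min_duration = '500ms';
ALTER SYSTEM SET auto_explain.sample_rate = 0.01;
ALTER SYSTEM SET auto_explain.log_analyze = on;
ALTER SYSTEM SET auto_explain.log_buffers = on;
SELECT pg_reload_conf();
```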
There is also a possible observer effect from just logging.
If writing to logs is slow, for example, the disk where you write the log is not very fast.
So it also can be a problem.
But it's a slightly different topic.
The third part, buffers.
What do you think about overhead from buffers?
Yeah, well, you were the first to tell me that it's worth looking into, but I wasn't able to measure it.
Yeah, I wasn't able to measure it either. There definitely should be a difference:
if you just run explain analyze many times, everything is cached, and then with explain analyze
buffers a difference should be there, I'm sure. But it's still 100% worth having buffers inside explain analyze,
as we discussed separately for a whole half hour, right?
Yes, the previous episode.
I actually think it might have... I don't know if
this is to do with you,
but I think explain.depesz.com
deserves some praise,
because I noticed today or yesterday
that it now asks for explain
analyze buffers. So that's quite cool.
Well, in my opinion, when we analyze a query, we
should not do it on production, as you know; we should do it on a special environment which
should be a clone of production. And the best way to have a clone is using Database Lab Engine, which we develop. And then, of course, you're in a slightly different situation: maybe the
hardware is different, maybe you have less memory, for example, a different
state of caches, and maybe a different file system, as in the case of Database Lab Engine, because it
uses ZFS by default. And there you should focus on buffers. Our
final goal is timing, but inside the process we focus on buffers and on reducing I/O numbers. Not just buffers; numbers like
rows, which are logical, are also an important metric to keep in mind. And if you reduce I/O, you will
reduce timing. This is the secret of optimization everyone should understand, in my opinion, right?
Yes, couldn't agree more. And if anybody disagrees, we can refer you to episode, I'm guessing, two.
It was quite an early one.
Right.
Right.
Good.
So what else should we discuss in terms of starting to work with explain?
Yeah.
Well, we might be close to time, you know, I wonder if we should save it for another time.
Is there anything else that we have to mention?
Well, we didn't discuss particular nodes, like various types of join and so on. Of course, it requires time to learn, and of
course there's documentation. There are many talks; I see different people presenting a talk named Explaining
Explain, this is like the default name for such talks, so more than one person has presented it. So I think all of those talks are
useful, worth checking, right?
Yeah, my favorite is the one by Josh Berkus. I'll make sure to link it up.
Right, he did a really old one.
Yeah, old, but still. I listened to it again last year and it's
still perfectly relevant. There have been some new parameters, and sure, it doesn't cover everything.
New nodes, parallelization were added since then.
Yeah. But equally, JIT compilation, which should be disabled for OLTP.
Yeah. I've also done two talks, to try and cover... I think there's some beginner stuff
that he doesn't cover at the beginning, and there's some more advanced stuff that he doesn't get to,
because in an hour you can only do so much.
I have done two talks trying to cover either side of that, not doing the explaining explain part.
So maybe I'll link those up as well.
Oh, and also I have a glossary.
Oh, glossary is great.
Yes, yes.
So it's a good thing.
Right.
Good.
So I hope that was helpful for some folks.
Let's wrap it up.
Right.
Yeah, I hope so too.
Fingers crossed.
And also feel free to reach out.
I think this is the kind of topic that we love and find very interesting; I'm definitely
very, very happy to help people with this kind of thing.
We are asking for topics, we can ask right here once again: we are open. We have a list of dozens of ideas, but we react to feedback. If
someone asks for a topic, we will prioritize it in our list, definitely, and we
will try to discuss it soon. And as usual, thank you everyone who is providing feedback. It's very, very
important; we receive it quite often, like at least once every couple of days. It's a great
feeling, I would say. And also please, as usual, subscribe everywhere you can, like everywhere you can, and please share in your social networks and working groups.
Absolutely.
Thank you so much.
Thanks, everyone.
And thanks, Nikolai.
Thank you.
Bye-bye.