Postgres FM - EXPLAIN
Episode Date: January 5, 2024

With Nikolay taking a well-earned break, Michael covers the basics of EXPLAIN — what it is, the parameters available, and a process he recommends for using it.

Here are some links to the things mentioned:

EXPLAIN (docs): https://www.postgresql.org/docs/current/sql-explain.html
Using EXPLAIN (docs): https://www.postgresql.org/docs/current/using-explain.html
EXPLAIN Explained (talk by Josh Berkus): https://www.youtube.com/watch?app=desktop&v=mCwwFAl1pBU
Beyond joins and indexes (slides from talk by Bruce Momjian): https://momjian.us/main/writings/pgsql/beyond.pdf
Depesz: https://explain.depesz.com
Dalibo: https://explain.dalibo.com
pgMustard: https://www.pgmustard.com/
A beginners guide to EXPLAIN (talk by Michael): https://www.youtube.com/watch?v=31EmOKBP1PY&t=24s
EXPLAIN beyond the basics (talk by Michael): https://www.youtube.com/watch?v=sGkSOvuaPs4
auto_explain episode: https://postgres.fm/episodes/auto_explain
Row estimates episode: https://postgres.fm/episodes/row-estimates
Slow queries and slow transactions episode: https://postgres.fm/episodes/slow-queries-and-slow-transactions
Benchmarking episode: https://postgres.fm/episodes/benchmarking
EXPLAIN glossary by Michael (forgot to mention, oops!): https://www.pgmustard.com/docs/explain

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael,
founder of pgMustard, and today I am not joined by Nikolay. He is finally taking a well-earned
break, so this week you've just got me. I am in the process of getting a couple of guests as well,
which I'm excited about, so hopefully for the next couple of weeks we'll be able to
give you those conversations. In the meantime, I'm going to record an episode on pretty much the only topic I feel comfortable sharing knowledge on on my own, which is EXPLAIN. Due to the nature of the product I've been working on for the past few years, I have spent a lot of time looking at EXPLAIN plans and helping folks with those. So yeah, I've given a couple of talks on this in the past, but I realized we hadn't actually done a podcast episode on it.
We've done a few related ones.
So long term listeners will be familiar with some of the things we've talked about previously.
But we haven't done an intro-level one, and I have a process I'd like to recommend that we haven't discussed yet. So I'm hoping to give you a couple of those things.
But on the topic of guests before I move on,
if you have any recommendations or requests on either the guest front or the topic front,
those are always welcome.
Please send those to us, either privately or publicly.
They would be very, very welcome.
So, EXPLAIN.
First, what is EXPLAIN and what does it do? EXPLAIN is a statement that you put in front of your query, and Postgres will return the query execution plan. By default, you just get the plan: what Postgres would do, what choices it would make to execute that query. It doesn't execute the query if you just use EXPLAIN, and it provides back the estimated costs involved.
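As a minimal sketch, assuming a hypothetical users table with an email column, that looks like this:

    EXPLAIN
    SELECT * FROM users WHERE email = 'test@example.com';

Postgres prints the plan it would choose, with estimated costs, without actually running the query.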
So you might have heard that Postgres
has a cost-based optimizer.
This gives you some insight into the costs
that it's calculating and why it's making certain choices.
So why it's choosing a certain scan type over
another certain scan type or the join order it's chosen or the join type it's doing for a specific
join. So all of those are based on what's possible and also what it thinks will be fastest. Now the
costs are in an arbitrary unit, but the idea is the lower
the cost, the faster the optimizer thinks it would be. So it is correlated to time, but they're not
in a unit of time. So EXPLAIN on its own can be useful to quickly see, for example, if you have a query that's timing out or taking absolutely ages, why it might be slow: you can quickly have a look at the query plan and check. But if we want to know for sure why a query is slow, and it's executing in a reasonable amount of time, chances are we're going to want to use some EXPLAIN parameters to not just get the plan back, but also run the query behind the scenes and get execution statistics. The most common of those parameters that you might have heard alongside EXPLAIN is ANALYZE. This is different to the standalone ANALYZE command for gathering table statistics; this is EXPLAIN ANALYZE specifically.
But there are loads more parameters as well. In fact, we did a whole episode on why we believe, or Nikolay and I believe, that BUFFERS should be on by default with ANALYZE. BUFFERS is another parameter that gives you extra runtime or execution statistics alongside the timings that you get with ANALYZE.
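Putting those two together, again on the hypothetical users table, a sketch looks like:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM users WHERE email = 'test@example.com';

Bear in mind this actually executes the query, so for statements that modify data, wrapping them in BEGIN ... ROLLBACK is a common way to avoid side effects.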
So with BUFFERS, we also get information about the blocks read and written: the amount of work being done by the query, not just the timings. That can really help. But there are a bunch of other parameters as well, and I like to recommend people use as many of them as they can. Most of them don't have that much additional overhead, but ANALYZE can. In fact, it's worth bearing in mind that there are a couple of caveats when it comes to EXPLAIN ANALYZE: it can both over-report and under-report the amount of time that query would take in the wild. For example, there's what's been referred to as the observer effect: if we have a slow system clock, the overhead of checking the timings can add significantly to the total execution time when we're running EXPLAIN with ANALYZE on, and specifically with timings on. That's the easiest way it can over-report on the execution time.
But it can also under-report because, for example, when we're running EXPLAIN ANALYZE, Postgres doesn't have to transmit the result data to us.
So if we're requesting thousands of rows
and we're not close to the database,
so perhaps it's a user doing a big report
from a different continent than the database is on,
your monitoring might be showing that that query
is taking a few seconds to execute,
whereas maybe explain analyze would report
only taking a few hundred milliseconds
because it doesn't have to transmit that data
across the network.
So it can over-report and under-report, but generally that's pretty uncommon.
And also directionally, you'll still see where the slowest parts of that query plan are,
even if the numbers aren't precise.
So yeah, so those are the caveats.
In terms of the process I recommend: number one, I think I've mentioned it before on the podcast, but it's so important it's worth repeating, you need to be running on a realistic-size data set to get realistic performance data. Now, it doesn't have to be the exact same data. The important part is the number of rows.
So, for example, if you were testing out a new feature
and your local data set only has a couple of dozen rows,
but production has a few million,
chances are the Postgres planner is going to be making different choices
on join order or join algorithm, even
scan type. So it could even choose to do a sequential scan if your data is small enough,
even if there is a good index available. So that can really surprise people, but it's really
important. As the data sets get bigger, this becomes slightly less important. So if you're testing on tens of millions of rows instead of hundreds of millions, there's less of a chance of an issue, but still a chance of getting a different plan. So it's very much worth bearing that in mind: if you can test on a realistic-size data set, please do. There's lots of tools available for doing that, and for masking data, these days. In fact, I think we've done an episode on that, the benchmarking one, so I'll link that one up.
So that's tip number one or step one in the process.
Make sure you are, or the person that you're helping is, testing on a realistic-size data set.
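If you need to fake that row count locally, one rough sketch, using a purely hypothetical items table and generate_series, is:

    CREATE TABLE items (id bigserial PRIMARY KEY, name text);
    INSERT INTO items (name)
    SELECT 'item ' || g
    FROM generate_series(1, 10000000) AS g;
    ANALYZE items;  -- refresh the planner's statistics

The content doesn't need to be realistic; it's the row count that changes the planner's choices.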
Use as many parameters as possible.
I've hinted at that already and mentioned ANALYZE and BUFFERS, but there is also a SETTINGS parameter, which tells you about any non-default planner-related settings that you have. For example, if you've changed work_mem or random_page_cost, these are common ones to have changed, those will show up. But also, if you've been testing a few settings locally and forgotten that you've still got them set, they'll show up in your EXPLAIN plan as well, which can be helpful.
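For example, with non-default values like those, the plan gains an extra line at the end that looks roughly like this (values made up):

    Settings: random_page_cost = '1.1', work_mem = '256MB'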
Another important one, well, another one I find really useful, is VERBOSE, especially when helping other people.
You get to see information about the schema, for example, you get fully qualified object names.
You also get the output of each operation in the query execution plan, what data it's sending to its parent operation, which can be useful for advising certain optimizations.
A couple of others: a newer one is WAL, for write-ahead logging information, or I say new, I think it's a few versions old now. You can also specify different formats. JSON format is particularly popular these days amongst tools, but text format is generally still the most concise and the most readable, especially for people that are familiar with it. They'll often ask to see the text format plan because it's smaller and more familiar to them.
Now there are a couple of other parameters as well, less important because people generally leave them on or leave the defaults in place. But TIMING can be really useful if you want to, for example, turn per-node timing information off but have ANALYZE on. You still get the overall execution time statistics, but you don't measure the per-node timing, which is where the main overhead is. I've seen some people and some tool vendors recommend turning that off with auto_explain, but it's worth measuring whether it actually has any overhead on your system before doing so, because timing information can be really useful when looking at speeding things up.
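A rough sketch of that combination, again on the hypothetical users table:

    EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
    SELECT * FROM users WHERE email = 'test@example.com';

Per-node times are omitted, but row counts, loops, buffer statistics, and the total execution time at the end are still reported.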
COSTS you can turn off as well. The only time I've seen that done, and the only time I've done it myself, is when writing blog posts or sharing query plans where I just want to make them slightly less intimidating or draw attention to the more important parts. So those are on by default.
And the other one that's on by default, at least when you use EXPLAIN ANALYZE, is SUMMARY, which puts information at the end, like planning time and execution time, which are very useful. Generally people don't turn that off, but it means you can turn it on, for example, with plain EXPLAIN, which gives you the planning time. Because remember, when we just do EXPLAIN, we only plan the query; it doesn't execute it. So you can get the planning time from EXPLAIN alone if you use the SUMMARY parameter.
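A minimal sketch of that, still on the hypothetical users table:

    EXPLAIN (SUMMARY)
    SELECT * FROM users WHERE email = 'test@example.com';

The query isn't executed, but you get a Planning Time line after the plan.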
But generally, I recommend putting as many of those on as you can, especially if you're using a tool to parse it.
Or if you are getting help from others, the more information, the better for them.
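So, as a sketch of a fairly maximal invocation on a recent Postgres version (WAL and SETTINGS are the newer parameters here, and availability varies slightly by version):

    EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS, WAL, FORMAT JSON)
    SELECT * FROM users WHERE email = 'test@example.com';

JSON is the format most visualization tools expect; drop FORMAT JSON to keep the more compact text output.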
Cool. So that's the second step. The third one is: glance at the end. Query plans, especially once your queries get a bit more complex, can get quite long and feel a bit like a wall of text. But a few of the, to be honest, probably rarer, but still fairly common problems are highlighted very much at the end. Things like: if your planning time is 10x your execution time, there's very little point looking through the rest of the query plan for optimization potential, because the vast majority of the total time was in planning. That's only reported at the end, and the rest of the query plan won't help you. The other two things that are reported at the end, and that also report a total time, are triggers. Trigger executions can be the performance issue: if you have triggers involved, there's going to be some overhead there, and they report a total time. If that's a high percentage of your total execution time, you've found your problem already without looking through the whole execution plan. The same is true for just-in-time compilation. That can, in certain cases, be 99% of execution time,
which, again, makes looking through the query plan not as useful.
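To give a feel for where these show up, the end of an EXPLAIN ANALYZE text plan looks roughly like this (all names and numbers made up; here the trigger total would be the thing to notice, at over 90% of the execution time):

    Planning Time: 0.512 ms
    Trigger for constraint orders_account_id_fkey: time=2954.887 calls=10000
    JIT:
      Functions: 12
      Options: Inlining true, Optimization true, Expressions true, Deforming true
      Timing: Generation 2.145 ms, Inlining 15.437 ms, Optimization 48.885 ms, Emission 31.014 ms, Total 97.481 ms
    Execution Time: 3180.432 ms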
Cool.
So that's the third one.
Fourth is then, once you've looked at those, to figure out: okay, we do need to look at the rest of the query plan, so where is the time going? Which parts of it are the most expensive? Or, if you have BUFFERS on, which of these scans is doing the most reads? The reason I mention that as step four is because I think it's very easy at this point to jump to looking for other issues, and I've been guilty of this in the past myself. Experienced people, I think, are maybe even more susceptible to this than less experienced people. If you notice a sequential scan, your eye gets drawn to it immediately and you think, oh, maybe there's an issue there. Or you notice a large filter, you know, a few thousand rows being filtered, or a lot of loops, or something. If your eye gets drawn to these problems that you're used to looking for, you can get tripped up or end up going down a bit of a rabbit hole. So my tip is to really check where the timings go, which of these is really taking up the most time, and then focus your attention there. Only once you've worked that out, focus on that part of the query, or that part of the query plan: is there a bottleneck we can solve? Maybe we'll come back to the other issue we spotted later, but the query might also be fast enough by that point and we can move on. So that's the fourth and fifth: work out what the most expensive parts of the query plan are, and then, and only then, what
you can do about it. On the timings front, there are a few things to be aware of. An operation's execution time is reported inclusive of all of its children, so you'd think a little bit of subtraction might be enough to work out which of the operations is the most expensive on its own, but you have to take into account a bunch of other things as well. Those execution times are a per-loop average. So, for example, one of our operations might be being looped over multiple times, which is especially common in join operations, but parallel execution also uses loops in how it reports. So there are a few cases where we have to multiply by the number of loops before subtracting from the parent operation. These things, you can see how they can start to get a little bit complex.
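As a hypothetical worked example: if an index scan inside a nested loop reports an actual time of 0.050 ms per loop with loops=10000, it accounts for roughly

    0.050 ms × 10,000 loops = 500 ms

in total, and that 500 ms is what you'd subtract from its parent's inclusive time.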
In addition, Postgres reports these timings in milliseconds and doesn't include thousand separators, so I know I personally can start to struggle with the arithmetic of all that once it starts getting into the multiple seconds. As such, there's a ton of tools that can help you with this. I work on one, but there are some really good free, open source ones available, lots of editors include them by default, and pretty much all of them make at least some effort to get those timings right. Some popular ones: explain.depesz.com, a really nice thing about that is that it preserves the text format, so it's really expert-friendly. If you want to share an EXPLAIN plan with the Postgres mailing lists, the one they'll be most familiar and comfortable with will be the explain.depesz.com one. But explain.dalibo.com is really popular these days, much more visual, includes more charts on, for example, where the timings are going, where the buffers are, like which operations do the most buffer hits, that kind of thing. Much more beginner-friendly, as I would probably class my tool, pgMustard, as well: visual, but it also helps a bit more with that next step, what you can possibly do about things.
But yeah, so the main reason I want to mention those is they can really help on the timings front.
So you can paste your query plan in and quickly see where the timings are going.
I would say that a lot of them don't do that step three, so a lot of them don't immediately draw your attention to triggers taking up a lot of time, or just-in-time compilation taking up a lot of time, or even planning time. So it is still worth doing that step first, and then checking out a tool or two that you're familiar with for the next steps.
Yeah, that's pretty much it for a process. Obviously that fifth step, working out what to do next, is a hugely deep topic, and it's actually where a lot of the normal talks on EXPLAIN, if you go to a conference and see a talk on EXPLAIN, that's normally where they'll pick up. The best one I've seen is a bit old now, it's from probably about 2016, but Josh Berkus gave it a few times, and I'll share one of the recordings that I liked a lot. He spent about an hour going through a lot of the different join types, scan types, types of issues you can see, things to be aware of. And it's a lot, and even then it probably only covers the first 10% of what you could know. There are more recent talks, one by Bruce Momjian actually, who covered some of the rarer scan types, so I'll link that one up as well. But this is a deep, deep topic that you can dive into. I'm a few years in and I don't consider myself an expert yet.
So that gives you an idea.
But yeah, it's an extremely powerful tool for query optimization in general.
Naturally, to speed things up, it really helps to know which part's slow.
So please use it.
Let me know how you get on.
It'd be cool to hear.
We've done a bunch of related episodes.
So a couple of those to check out: there's an extension called auto_explain. Now, that will let you run your queries as normal, let's say, maybe even on production, and the users get the results of the query back, which we don't when we run EXPLAIN.
I don't know if I mentioned that up top.
But the people using your database will continue to run their queries
and get their results as usual.
But in the background, the execution plans will be logged: any that are beyond a certain timing threshold that you set will get logged to your Postgres logs. We've done a whole episode on that; it can be extremely useful, and I'll link it up.
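As a rough sketch, the relevant settings look something like this in postgresql.conf (the threshold is just an example):

    # load the module (it can also be loaded per session)
    shared_preload_libraries = 'auto_explain'
    # log plans for statements slower than this
    auto_explain.log_min_duration = '500ms'
    # include execution statistics and buffers, like EXPLAIN (ANALYZE, BUFFERS)
    auto_explain.log_analyze = on
    auto_explain.log_buffers = on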
We've also done one on row estimates, so that's another common issue. With EXPLAIN, we get to see how many rows Postgres expected to be returned at each stage.
And when it's wrong, when it's wrong by a lot especially,
that can lead to all sorts of issues with inefficient plans being chosen.
So we did a whole episode on that.
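In the plan itself, that's the gap between the estimated and actual row counts on a node, something like this (made-up example):

    Index Scan using users_email_idx on users
      (cost=0.43..8.45 rows=1 width=72)
      (actual time=0.031..45.210 rows=48000 loops=1)

Here Postgres expected 1 row but got 48,000, which is the kind of misestimate that can lead to a bad plan.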
In fact, our first ever episode was on slow queries and slow transactions. So that's worth checking out if you haven't already.
And we did a more advanced one on query optimization, which, again, still didn't cover some of the basics of EXPLAIN. So I'm glad to have done this one now.
Welcome any questions.
And please do remember to send your guest and topic suggestions through.
Happy holidays, hope you all had a good one, and catch you next time, hopefully with a guest. All right, bye!