Postgres FM - Intro to query optimization
Episode Date: September 2, 2022

Here are links to a few things we mentioned:
- Using EXPLAIN (PostgreSQL documentation)
- explain.depesz.com
- explain.dalibo.com
- pgMustard
- EverSQL
- pganalyze
- pg_stat_monitor (extension by Percona)
- Recent thread on hackers mailing list about plan_id in pg_stat_activity
- auto_explain
- EXPLAIN observer effect (OnGres blog post by Álvaro Hernández)
- auto_explain overhead (blog post by Michael)
- pg_test_timing
- Database Lab Engine (for thin clones)
- Our previous episode on BUFFERS
- EXPLAIN Explained (talk by Josh Berkus)
- A beginner's guide to EXPLAIN (talk by Michael)
- A deeper dive into EXPLAIN (talk by Michael)
- EXPLAIN glossary (pgMustard docs)
- Topic suggestions document

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov and @michristofides or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:
- Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard.
This is Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Hello, hello.
Let's talk about query optimization.
I think this is maybe the most interesting topic in the area of Postgres in general,
but I've found that not everyone is interested in it.
But let's talk about it anyway.
Yeah, it's also, I guess, a topic quite close to both of our hearts. We've spent many years looking at this, so hopefully we have some interesting things to add.
But let's set some boundaries. Let's distinguish analysis of the workload as a whole, attempts to find the worst parts, the best candidates for optimization, versus single-query optimization. Let's talk about the second subtopic.
Yes. I've heard you differentiate between macro performance
analysis and micro performance analysis in the past. So macro being system level, and I guess
we're not talking about that today. And we're going to look more at micro. Once you've worked out there is a problematic query, you know which one it is.
How do you go from that to what can I do about it?
Right.
How to understand whether it's good or bad in terms of execution, and how to read the query plan,
using the command explain, which is the main tool here, right?
Let's talk about this.
Is there anything else before we dive into explain?
Are there any other parts of it that we might need to cover as well?
Well, there are things that explain doesn't cover.
For example, it won't tell you CPU utilization,
user CPU, system CPU.
It won't tell you physical disk I/O, how many operations happened at the disk level,
because Postgres doesn't see them directly; Postgres works only with
the file system cache. So the I/O operations, the reads, that Postgres reports are not necessarily
from disk; they may be from the page cache.
These things you can only get without using explain.
It would be good to have these things inside explain somehow,
like pg_stat_kcache extends pg_stat_statements.
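For reference, a very rough sketch of what pg_stat_kcache adds on top of pg_stat_statements; this is not from the episode, and the function and column names are from memory, so treat them as assumptions and check the version you have installed:

```sql
-- Rough sketch: per-query CPU time and physical disk I/O from pg_stat_kcache,
-- joined with pg_stat_statements for the query text.
-- Assumes the extension is installed; column names vary between versions
-- (the exec_* prefixes below are from recent pg_stat_kcache releases).
SELECT s.query,
       k.exec_user_time,    -- user CPU time spent executing
       k.exec_system_time,  -- system CPU time spent executing
       k.exec_reads,        -- bytes actually read from disk
       k.exec_writes        -- bytes actually written to disk
FROM pg_stat_statements AS s
JOIN pg_stat_kcache()    AS k USING (userid, dbid, queryid)
ORDER BY k.exec_reads DESC
LIMIT 10;
```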
It would be good to have something similar that extends explain,
but I'm not aware of any such thing existing.
Nor me.
I'm not aware either,
but I think we get clues about them in explain, don't we?
We see some timing that can't be explained otherwise.
I.O. timing.
So, you know, I mean, let's say you mentioned CPU performance.
If we have an operation that's not doing much I.O., but it is taking a long time, that's a clue that there might be something else going on.
Right, right, right.
So in general, if we run even one query,
theoretically, it might make sense to use things like perf
and flame graphs for a single query execution,
and that would augment the information
that you can extract from explain analyze buffers.
But let's just discuss the basics, maybe: where to start.
Sounds good. So possibly even explain versus explain analyze is a good place to start. With explain we get the query plan, and that normally returns really quickly, roughly in the planning time of the query. With explain analyze, well, it runs the query and returns performance data, so we can see how much time was spent and, if we ask for buffers, how much I/O was done, and all sorts of other things as well. It also allows us to compare things: explain might tell us how many rows were expected to be returned at each stage, and with explain analyze we get the actual number of rows returned at each stage, and comparing the two can be really useful.
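For anyone following along, a minimal sketch of the two forms; the orders table and the query are made up for illustration:

```sql
-- Plan only: does not execute the query.
EXPLAIN
SELECT * FROM orders WHERE customer_id = 123;

-- Executes the query and reports actual timings, row counts and, with
-- BUFFERS, how much I/O was done at each node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 123;
```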
Right, right. Yeah, absolutely. And also, we discussed some time ago that explain shows
only one plan. Sometimes you want to see multiple plans, like the second-best candidate and so on. Otherwise, you have to do some tricks to try to guess
what the planner had on the plate when choosing.
That's a really good point, and maybe even a good place to start
in terms of using Explain.
So the first thing you'll probably notice when you're looking at explain
for the first time is a lot of cost numbers.
These are in an arbitrary unit that gives you an idea of how expensive things are. It's a kind of estimate: as the cost numbers go up, Postgres thinks it will take longer to execute, but they're not an estimate of milliseconds; they're not in any real unit. But you can then use a couple of parameters, like enable_seqscan, to affect those costs. So you could maybe try and get the second-best plan by making the current plan very expensive. So if your query is currently doing a sequential scan and you want to see if it could use an index, and it's just choosing not to, you can disable seq scans.
Well, it doesn't actually disable seq scans.
It just makes them incredibly expensive.
Right, that's some penalty.
Exactly.
So you might still see that in the costs.
This trick is very helpful when two plans are very close to each other
in terms of cost, overall cost.
And if you disable index scans and see a different plan, and you see the cost is very close, it gives you the idea that we are on the edge, right?
We either crossed it recently, because data is changing, or we are about to cross it, and it's kind of dangerous.
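A quick sketch of the trick being described; the query here is hypothetical:

```sql
-- This does not truly disable sequential scans; it only adds a huge cost
-- penalty, so the planner avoids them unless there is no alternative.
SET enable_seqscan = off;

EXPLAIN
SELECT * FROM orders WHERE created_at > now() - interval '1 day';

-- Compare the total cost of this plan with the original one, then restore:
RESET enable_seqscan;
```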
But first of all, I would like to mention that a plan is a tree, right? Cycles are not possible,
which is important, because they could be possible, actually.
Well, I mean, I think in the simplest cases, yes, but with CTEs you can get some strange things; you can refer to the same CTE more than once, for example.
But yeah, in the simplest cases, it's...
But when it's already executed, it's still like a tree, right?
Oh, interesting, by the way, yes.
Anyway, roughly speaking, it's a tree.
And when it's printed, it's a tree.
But loops are possible inside, of course, right?
And an important thing to understand is that the regular metrics, such as cost, rows, and timing, are
shown for one iteration inside a loop.
But buffers are a sum over everything.
And it can be confusing sometimes, right?
Yeah.
Well, let's go back to the tree.
I think that's really important.
And a few things that aren't obvious when you're first looking at it: logically,
it's happening almost backwards, so the first node that you see in the tree is the last one to be executed.
Whereas it goes from leaves to root.
Exactly. So kind of outside in, like kind of right to left a little bit. And there are also some really important statistics, especially when you use explain analyze, right at the bottom: some summary metrics like execution time, planning time.
Oh, they're printed separately from the tree, right?
Right. And things like trigger time and just-in-time compilation. Each of these things can be dominant sometimes; they can be where all of the time's going. And if you have a really long tree, the general recommendation is to start kind of right to left, but I'd also say check that bottom section, because you might not have to look through the entire tree if you find out that your query is spending 90 percent of its time there.
And a single additional point here is that planning time can sometimes be very, very big, as I had cases when inspection of a merge join path
led to a huge scan during planning time.
And disabling merge join helped,
but it was not so obvious at all in the beginning,
because if you don't notice that planning time is in seconds,
so insanely huge, it's a surprise for you. So checking planning time is also useful.
Yeah, same for time spent in triggers and time spent in just-in-time compilation as well. For some analytical queries, you might consider it a relatively simple query, or it should be quite fast, but if the costs are overestimated a lot,
sometimes just-in-time compilation kicks in,
spends several seconds thinking it's saving you time,
but then the overall query is only a few milliseconds.
That's suboptimal.
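As an illustration of that scenario, here is one way to spot and rule out JIT for a session; the query is hypothetical:

```sql
-- If the JIT summary at the bottom of EXPLAIN ANALYZE shows seconds of
-- compilation for a query that otherwise takes milliseconds, try it again
-- with JIT switched off for the session.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM orders WHERE total > 100;

SET jit = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM orders WHERE total > 100;
```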
Right. And also, since Postgres 13,
if you include the buffers option,
it will show you buffers used for planning as well.
Right, yes. And actually, one other thing on planning time before we move on from that
is that auto_explain doesn't include planning time, so there you can't spot planning time issues in auto_explain, other than... I think we discussed this once.
Yeah, I think we discussed it ages ago, because it's probably the only use case for logging the query time plus the auto_explain time: you could then diff the two, and it's probably planning time that's the difference. But yeah, it's a limitation. As you say, it's not super common that planning time is the dominant issue, but when it is, it can be 90-plus percent easily.
Right, it can be unexpected. This is the danger of it, right?
Yeah. Right, right. So: right to left, kind of inside out, start at the bottom, check the main statistics.
You mentioned briefly that some of the statistics are per loop.
So loops are, I think, quite a confusing topic when you're first getting used to it.
And especially if there are, you know, 10,000 loops, you could easily miss that one of the statistics looks quite small.
But once you multiply it by 10,000, it can be really big.
So examples are, of course, the costs and the timing, but also things like rows removed
by filter.
Sometimes people look out for those numbers.
If it says one, that's a per loop average.
And actually 10,000 of those is suddenly not insignificant.
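To make the per-loop averaging concrete, here is a small made-up illustration; the tables, index name, and numbers are invented:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.status = 'paid';

-- Hypothetical output fragment for the inner side of a nested loop:
--   -> Index Scan using orders_customer_id_idx on orders o
--        (actual time=0.030..0.030 rows=1 loops=10000)
--        Rows Removed by Filter: 1
-- rows=1 and "Rows Removed by Filter: 1" are averages per loop; with
-- loops=10000 the scan really returned about 10,000 rows and filtered out
-- about another 10,000. Buffers, in contrast, are summed across all loops.
```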
Right.
And those averages can be rough.
There's also some error that can be present there.
Well, especially around 0 and 1,
because some of the numbers are integers.
Yeah, so for anybody that's wondering: exactly, it's rounded to the nearest integer,
and if it's less than 0.5 it gets rounded to 0,
which doesn't necessarily mean that there are zero, which can be problematic.
Also, you know what?
We're discussing things for which there are probably many talks and many articles that are useful.
I think this podcast is not going to replace them.
We are trying to highlight problems which can be tricky in the beginning.
And one more thing: I think we should mention tools as well.
First of all, explain.depesz.com is the oldest one and still very, very popular, maybe still
the most popular.
Then explain.dalibo.com, which is PEV2, greatly improved, very good.
And of course, pgMustard, which is commercial, which you develop.
Right.
Worth checking all of them.
They are good.
Pros and cons for all of them.
But what I think is important to understand is something meta.
When we talk about single-query analysis, we should say: okay, this is our plan. Better if it was collected with execution, and
best if it's execution collected with buffers. And we will discuss overhead a little bit later, by the way, right? But when you ask someone to help with optimization, of course, the first question
will be: show me the query itself, right? Sometimes: show me two
plans also, if there was some change and we want to understand why this change influenced the plan.
So we have two plans, but we also need to have the query; that's a must-have.
But additionally, I think it's very important to get the Postgres planner settings, like enable_seqscan or random_page_cost, seq_page_cost, all the costs.
work_mem as well.
Even though work_mem is not inside the planner group.
Postgres settings are grouped: you can select star from pg_settings and see the group names.
And there is a whole group of planner settings,
and work_mem is not there. But work_mem influences planner decisions.
If you change work_mem, the plan can be different.
Right.
So right now my rule is: let's take planner settings plus work_mem.
Maybe something else.
I'm interested to see if something else should be there as well.
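A minimal sketch of collecting those settings; the category names are as I recall them from pg_settings, so double-check them on your version:

```sql
-- Planner-related GUCs live under the "Query Tuning" category in pg_settings;
-- work_mem sits under "Resource Usage" but influences plan choice too.
SELECT name, setting, unit
FROM pg_settings
WHERE category LIKE 'Query Tuning%'
   OR name = 'work_mem'
ORDER BY category, name;
```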
So when we ask someone to help, we need to present a plan or two, the query, and the planner settings. And I also believe the schema is important to present: what kind of schema we had, tables, indexes, and probably
statistics as well. This is the whole picture to analyze what we had. Imagine if all these tools
collected these things automatically, for example. How great it would be to jump in, like, I want to help someone with optimization,
and I see the whole picture.
I see the query, plan, settings, the schema.
Not the whole schema, only the part which this query deals with.
And also statistics. Maybe statistics is kind of tricky, but
this is the basis for planner decisions; this is what defines which plan we get. And of course the Postgres
version as well, because on a different version the plan can be different; different nodes
can be present in the plan depending on the version. So what do you think about this big whole picture?
I understand that none of the tools collect this information, and none of the tools require users to present this information,
but it would be great to store it in history, for example.
Yeah, there are some really interesting tools
that do some of that, but not all of it.
So there's a tool that started as a MySQL tool,
but they've added Postgres support, called EverSQL.
That asks for things like the query and the schema
and things like that,
and then does some static analysis,
which is super interesting.
There are tools like pganalyze, which is a monitoring tool,
and it's been doing, for at least a couple of years now, more
ad hoc query analysis via explain visualizations, and it has access to
a lot of that information already by the nature of being a monitoring tool. But I think
there's also this natural trade-off between the macro analysis, all of the information you can gather and the overhead of
doing so, versus the amount of information you're willing to gather for a single query. Once you know
a certain query is a problem, you're willing to pay a higher overhead, because you
only need to gather that once, whereas if you want to do this all the time for every query,
I think there's a higher overhead. So I think there's some tension.
We could do some hashing, for example, to track that statistics didn't change and Postgres parameters didn't change; we could just check it automatically with a hash.
Yeah, I think there are some super cool things here, but there are also a couple of different environments,
right? So in production, one key thing is that, let's say, statistics is a
great example: production is not the same as staging. We can make all of the
data the same, but it depends; production might not be the same as staging, and
therefore a statistics problem may not show up. And if you're, let's say, doing some development work, it's really
tricky to reproduce all of those things. And I think I'd also push back that for
some of the most obvious things that can be a problem, maybe even the schema
doesn't matter, maybe even the query doesn't matter. If you see that somebody's doing a sequential scan of 10 million rows,
probably in parallel, with a filter,
and it returns just one of those rows...
Without the query, we can tell that...
Enough to suggest an index, right?
Exactly.
So there's a bunch of cases where you can give some pretty sensible advice
without any of that
extra information, but definitely, as it gets more complex, I think more of those things can be
useful. But even in the index case, you know, I think you still need context from
the customer in terms of trade-offs. If this is a super high-write table, you
might be less inclined to add an index than if it's not, or the customer might
have certain requirements. There's always an "it depends", right? I think whenever you're giving this kind of advice, you have to be careful that for different
customers different things would be sensible. I guess, except for extreme...
Well, I understand that what I just described, a collection of all these pieces, requires a lot of effort.
So that's why it should be automated.
But imagine if everything was collected automatically inside some tool,
used in an organization and stored historically.
You understand: okay, we optimized this query before,
and now we're trying to optimize this query again,
and we know the whole context.
When we ask for help, we have all the pieces.
And if an expert comes to help us, all the pieces are present.
It would be much easier to help, right?
Yeah, and I think there are some interesting projects in this area.
I think, if you've come across the one by Percona,
they're doing a kind of replacement for pg_stat_statements.
Well, it's about macro analysis as well.
pg_stat_monitor, right?
pg_stat_monitor, yes.
Do they do the same things as pg_stat_statements, or no?
Yes, but with additions. Like, I think they let you track
query plans per query, so I think you could, for example, see if a plan has changed. So with
that kind of thing, with relatively low overhead, I think you start to get a bit more of that
information. So when an expert comes along,
hopefully this is something already installed
and already present.
This is an old, big discussion.
There is a current ongoing discussion
in the pgsql-hackers mailing list.
Someone from AWS proposed adding
a plan ID to pg_stat_statements,
triggering the discussion one more time, and this would
be great. Of course, we know that
each query registered in pg_stat_statements might
have multiple plans
depending on the parameters used.
So when you optimize a query,
a very important thing, I
missed it in my list: the parameters
you used, right? Because different parameters
may trigger the planner to choose a different plan, right?
So it's very, very important.
When we optimize a query, we cannot just say we optimized a query.
We must say we optimized a query for some parameters.
And we need to think about variations that we should expect on production and check them too.
Not just a single case, right? And this is tricky, by the way.
Yeah, so if anybody's wondering, the simplest example of this is: let's say you have a column where 99% of the data is a single value,
and the other one percent is millions of unique values. If you search for one of the unique values, you might get an
index scan; if you search for the value that 99% of the table is, then you should get a sequential
scan. That would be the optimal plan. So that's the simplest example.
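A small sketch of that skew scenario; the table, values, and row counts are made up:

```sql
CREATE TABLE events (status text);
INSERT INTO events
SELECT CASE WHEN g % 100 = 0 THEN 'rare_' || g ELSE 'common' END
FROM generate_series(1, 1000000) AS g;
CREATE INDEX ON events (status);
ANALYZE events;

-- A rare value: the planner will most likely choose the index.
EXPLAIN SELECT * FROM events WHERE status = 'rare_100';

-- The value covering ~99% of rows: a sequential scan is usually optimal.
EXPLAIN SELECT * FROM events WHERE status = 'common';
```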
This is a classic example, and even setting enable_seqscan to off might not help to avoid a seq scan in some cases. And also, a couple of times
in my optimization activities, I had the case when somebody provided me a query without parameters,
and I checked the table and thought, okay, what's the worst case? And I started to optimize for
the worst case and made bad decisions, because this worst case was never used in production at all.
So it's a very, very interesting topic. I definitely want to find some approach for when
we don't know which parameters we have, but we can guess somehow. For example, some additional
tool would analyze statistics and say: oh, this set of parameters,
this and this, is the most typical case, and this is some kind of worst case,
and try to optimize for them. This would be great.
So yes, I agree. And I think, for example, auto_explain,
a very old tool, but one thing it does really well is it spits out the exact query that caused that slow
plan. And that's one way of getting at least the extreme versions of the parameters.
Or just the slow log. If you have log_min_duration_statement set to 500 or 100 milliseconds, which is good,
or at least a second or two, which is not so good but also fine, you have examples of parameters which trigger slow execution,
but you don't see the good parameter sets,
which are not registered in the slow log.
When I say slow log, I mean a part of the single Postgres log,
because Postgres has just one log.
It's a different discussion, maybe.
And if log_min_duration_statement is enabled,
you see the examples with their duration.
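For reference, a minimal sketch of enabling that; the 500 ms threshold is just an example:

```sql
-- Log every statement that runs longer than 500 ms.
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();
```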
But auto_explain is even better, because you get the plan itself.
Yeah. Is this a good time to talk about overhead?
about overhead yeah let's talk about overhead so when you run explain analyze versus you run query
without any observability tooling which explain explain allows is observability tooling, which explainalize is an observability tooling.
It adds a lot of details about query execution and planner decision, right?
But you can just run the query and see some timing, but then run explainalize and see
different timing, right?
You had a blog post on this topic, but about auto-explain.
Auto_explain is also like a related question here.
Yeah, OnGres had a really good blog post on the observer effect. So I think there are a few cases. Explain analyze can be very accurate, so it's roughly the same amount of time as running the query through your client. It can be too high, where it's adding overhead. And it can be too low, for the case where lots of data is being transmitted: explain analyze doesn't transmit that data, so it can even be faster than a query that would return the data. So there are kind of three cases, two of which are bad. They're quite rare in my experience, and, especially on modern hardware, they don't show up that often. And also, they're not that problematic when you're actually looking for the problem: if there's a relatively universal overhead added and you're still looking for what's the slowest part, it's still probably the same place. But yeah, let's explain why it happens: in order to measure timing, there is some overhead.
And I would split it into three parts. Okay, sorry for interrupting, I will split it into three parts. First, when we say explain, we just see the planner decision; we don't execute the query. Nothing to discuss in terms of overhead here, right?
Well, there's the cost of the planning work.
But it's not overhead; it's... anyway, we need it. But when we add analyze, there
is overhead: we really execute the query, but we need to measure things and see how many rows
were collected in each node, everything like that, and timing as well. But we also can say buffers,
and this is additional overhead. And we also can say track_io_timing, which is a Postgres setting;
you can set it dynamically, I guess, right? And you can see I/O timing additionally printed
by explain analyze.
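A minimal sketch of turning those on for a session; the query is hypothetical, and changing track_io_timing may require superuser or an explicit grant:

```sql
SET track_io_timing = on;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 123;
-- With track_io_timing on, nodes that read from or write to the filesystem
-- show an extra "I/O Timings: read=..." line next to the Buffers counters.
```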
And these are like three pieces of overhead, right? What do you think about each of them?
Yes. Well, as you mentioned, I did do a blog post on this, because I saw quite a few places where people would really warn against auto_explain with timing on.
There's a really strong warning against it in the Postgres docs.
There are multiple monitoring tools that tell you: if you have auto_explain on, make sure you have timing off.
At the same time, I observe very heavily loaded systems, serving a lot of transactions, like a hundred thousand transactions per second, very loaded systems where it's enabled.
Same. I was coming across customers that had it on.
Doesn't it depend on the hardware, on the CPU we have?
Yes. So there's a tool in Postgres that lets you
check, right? I've forgotten what it's called. Is it pg_test_timing? Something like that.
It's in the binary directory, in the standard package of Postgres, right?
I'll find it and link to it. But yeah, basically my understanding is: if you have pretty fast system clock lookups, the overhead can be hard to measure, but if you have a slow system
clock, then it can be extremely easy to measure. And the OnGres blog post, I think,
deliberately picks a system with a slow system clock in order to show that it can add hundreds
of percent of overhead. But when I was looking at it on an OLTP workload,
a very, very simple pgbench OLTP workload,
I was basically unable to measure it.
I think I got a 2% overhead from adding all of the parameters,
and that was basically all of it.
You know, this is my old idea,
and a couple of times we implemented it.
In any company, when we deal with many Postgres hosts,
it would be good, when we set up a host, to have a set of micro-benchmarks
checking disk I/O limits and CPU.
We can use sysbench for that, or fio for disk.
In a past life, we used Bonnie++, I remember.
And a micro-benchmark checking timing overhead
would also be great there.
And sometimes, in the cloud,
we might have two virtual machines
of the same class, same type,
but they behave differently.
So it would be good to check it all the time
we set up Postgres on some machine, right?
Yeah.
Well, yeah, this is your age-old thing; this is what you're dedicating your professional life to: experiment.
You know, if you're intrigued as to what it would be on your system, it might be different for you for some reason,
maybe for hardware reasons, maybe for workload reasons.
There might be some specific way that it's bad for you.
It's very difficult to provide general advice, and
the advice you read online will generally be cautious, especially in the Postgres documentation.
They're going to be cautious by default, because they don't want to give advice that
one person is going to find horrifically awful, even if the majority would find it okay.
Back to these three classes of overhead. I guess the first class is from
the analyze part of explain analyze, the second is track_io_timing, and the third is buffers. Let's postpone buffers
a little bit. The first two are both related to the overhead of how working with the clock is organized, but
the difference is that inside explain analyze, they are both at work.
But track_io_timing is also at work if you have pg_stat_statements, because it's registered there as well.
So regular query execution, without running explain, will also include it.
So if working with the clock is slow, track_io_timing can add some penalty when you use pg_stat_statements, right?
Same with auto-explain. Auto-explain runs on every... there is a parameter where you
can...
There is sampling, and it has existed for a long time; I didn't realize it exists. For the slow
log there are sampling capabilities since Postgres 13, I guess, but for auto_explain, you
told me, right? It has existed for many years already.
It's great.
So you can auto_explain only, like, 1% of everything.
It's great.
To be cautious at first, yeah, you can sample a really small percentage.
But naturally, yeah, for OLTP it's probably fine
because you're probably running the same queries over and over
and you don't need loads of examples of them to optimize.
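A hedged sketch of what that can look like; the thresholds are only examples, and auto_explain has to be loaded already, for instance via shared_preload_libraries:

```sql
-- Explain roughly 1% of queries slower than 500 ms, with timing and buffers.
ALTER SYSTEM SET auto_explain.log_min_duration = '500ms';
ALTER SYSTEM SET auto_explain.sample_rate = 0.01;
ALTER SYSTEM SET auto_explain.log_analyze = on;
ALTER SYSTEM SET auto_explain.log_buffers = on;
SELECT pg_reload_conf();
```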
There is also a possible observer effect from just logging.
If writing to logs is slow, for example, the disk where you write the log is not very fast.
So it also can be a problem.
But it's a slightly different topic.
The third part, buffers.
What do you think about overhead from buffers?
Yeah, well, you were the first to tell me that it's worth looking into, but I wasn't able to measure it.
Yeah, I wasn't able to measure it either. There definitely should be a difference:
if you just run explain analyze many times, everything is cached, and then with explain analyze
buffers a difference should be there, I'm sure. But it's still 100% worth having buffers inside explain analyze,
as we discussed separately for a whole half hour, right?
Yes, the previous episode.
I actually think it might have... I don't know if
this is to do with you,
but I think explain.depesz.com
deserves some praise,
because I noticed today or yesterday
that it now asks for explain
analyze buffers. So that's quite cool.
Well, in my opinion, when we analyze a query, we
should not do it on production, as you know; we should do it on a special environment which
should be a clone of production. And the best way to have a clone is using Database Lab Engine, which we develop. And then, of course, you're in a slightly different situation: maybe the
hardware is different, maybe you have less memory, for example, a different
state of caches, and maybe a different file system, as in the case of Database Lab Engine, because it
uses ZFS by default. And there you should focus on buffers. Our
final goal is timing, but inside the process we focus on buffers and on reducing I/O numbers. Not just buffers; numbers like
rows, which are logical, are also an important metric to keep in mind. And if you reduce I/O, you will
reduce timing. This is the secret of optimization everyone should understand, in my opinion, right?
Yes, couldn't agree more. And if anybody disagrees, we can refer you to episode, I'm guessing, two.
It was quite an early one.
Right.
Right.
Good.
So what else should we discuss in terms of starting to work with explain?
Yeah.
Well, we might be close to time, you know, I wonder if we should save it for another time.
Is there anything else that we have to mention?
Well, we didn't discuss particular nodes, like various types of join and so on. Of course, it requires time to learn, and of
course there's documentation. There are many talks; I see different people presenting a talk named Explaining
Explain, this is like the default name for such talks, so more than one person has presented it. So I think all of those talks are
useful, worth checking, right?
Yeah, my favorite is the one by Josh Berkus. I'll make sure to link it up.
Right, he did a really old one.
Yeah, old, but still. I listened to it again last year and it's
still perfectly relevant. There have been some new parameters, and sure, it doesn't cover everything.
New nodes, parallelization were added since then.
Yeah. But equally, JIT compilation, which should be disabled for OLTP.
Yeah. I've also done two talks, to try and cover... I think there's some beginner stuff
that he doesn't cover at the beginning, and there's some more advanced stuff that he doesn't get to,
because in an hour you can only do so much.
I have done two talks trying to cover either side of that, not doing the explaining explain part.
So maybe I'll link those up as well.
Oh, and also I have a glossary.
Oh, glossary is great.
Yes, yes.
So it's a good thing.
Right.
Good.
So I hope that was helpful for some folks.
Let's wrap it up.
Right.
Yeah, I hope so too.
Fingers crossed.
And also feel free to reach out.
I think this is the kind of topic that we love and find very interesting; I'm definitely
very, very happy to help people with this kind of thing.
We are asking for topics, we can ask right here once again: we are open. We have a list of dozens of ideas, but we react to feedback. If
someone asks for a topic, we will prioritize it in our list, definitely, and we
will try to discuss it soon. And as usual, thank you everyone who is providing feedback. It's very, very
important; we receive it quite often, like at least once every couple of days. It's a great
feeling, I would say. And also please, as usual, subscribe everywhere you can, like everywhere you can, and please share in your social networks and working groups.
Absolutely.
Thank you so much.
Thanks, everyone.
And thanks, Nikolai.
Thank you.
Bye-bye.