Postgres FM - Row estimates

Episode Date: December 23, 2022

Here are links to a few things we mentioned:
ANALYZE (docs)
Autovacuum config (docs)
Statistics used by the planner (docs)
CREATE STATISTICS (docs)
Row count estimates (pgMustard blog post)
pg_hint_plan
Optimizer methodology (talk by Robert Haas)
Tomáš Vondra on statistics and hints (an excellent interview we forgot to mention, sorry!)

------------------------

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I'm Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres.ai. Hey Nikolay, what are we talking about today? Hi Michael, it's your turn. Tell us, please. Oh, flip it on me. Yes, I picked this one. I think we're going to call it Row Estimates. This is about the planner, the query planner, and it's again based on a listener suggestion or request. They didn't ask about this specifically, but they did ask about nested loops
Starting point is 00:00:33 causing performance issues. Looking at any one particular performance issue would be quite a dull podcast, I think (maybe a better video), but with a lot of these, especially when people think the nested loop is the problem, it's normally due to a bad or inaccurate row estimate at some point, and a bad planner choice based on that row estimate. So I wanted to dive into how Postgres is making those decisions, some things that we can do as users to help it get better statistics and to help guide it in the right direction, and also things that we maybe don't have at our disposal
Starting point is 00:01:14 at the moment that other databases do. So that's the kind of thing I was hoping to talk about. Sounds good, but what's wrong with nested loops? They're good. Yeah, and this is the reason I didn't want to call this nested loops causing issues. It's mostly... So yeah, what's wrong with them is a good point. I think it's about trade-offs, right?
Starting point is 00:01:36 So the reason we have multiple... A lot of the time, the planner has multiple choices for the join algorithm it chooses, and even for the scans that it does. So it's not even just about joins; it has a lot of choices and different options, and because SQL is declarative, it can make its mind up about the fastest or most efficient way to do that. Nested loops have some real advantages: they're very quick to get started, they're very flexible, they can be used in a lot of different cases. But when the two relations that are being joined get large,
Starting point is 00:02:12 there can be more efficient or quicker ways of joining those. In Postgres, we have hash joins and we have merge joins. I think that's it for the join algorithms. Yeah, for the algorithms. Yeah. So nested loops are good when the relation on one side is smaller, quite small, right? A low number of rows on one side. Yeah, exactly. Both sides, or one side. And even if one side is large, if it's indexed, a nested loop can still be really
Starting point is 00:02:44 good, right? And if they are both quite large, a nested loop is probably not a good idea. Yeah, exactly. So we'll probably be better off with a hash join or, in some cases, if they're both sorted, a merge join, I guess. But sometimes we don't have those options, right? Sometimes the only
Starting point is 00:03:05 if we if we're not doing a quality operation i think there so it's but equally when people are looking into this if there's a good plan and a bad plan if that's often when people think it's the nested loop that's the problem it there was another option right there was a better plan a better path that could have been chosen but wasn't because it was being scored higher cost, the costs were estimated to be higher. A big input for those costs is how many, I'm going to say rows, I guess, this might be another argument for tuples are being returned at each or tuples. Sorry, back to that old one. Yeah, your version is tuples. I proposed tuples, but you switched. Okay, so if you see an asset loop and we think it should be different, we suspect some issues
Starting point is 00:03:55 with statistics, right? And the first thing to check is to try to analyze the table to recollect statistics, or no? What do you think? This is, I think, where we get into... It's probably worth us explaining up top: ANALYZE, which I guess is what you're talking about there, is the process Postgres has for gathering statistics. So you can analyze a relation, and I learned today, actually, looking into it, that you can analyze just a single column. And there's a lot of flexibility there. Or an expression.
Starting point is 00:04:27 Nice. Okay, cool. And also, it happens automatically as part of autovacuum, or auto-analyze, I guess, is the part of that. Well, yeah, autovacuum has three tasks: vacuuming, preventing transaction ID wraparound (or freezing), and also, the third option is auto-analyzing, like recollecting statistics. And yes, this is cool. Yeah, maybe. But wait, what was fourth? Does the visibility map count as separate? Well, yeah, but it can be considered as part of vacuuming. Okay, cool.
Starting point is 00:05:10 Maintaining the free space map and visibility map. But analyzing may happen together with autovacuuming, or it may also happen separately, as auto-analyze. So everything happens inside Postgres, right? Many options. But, right, not every time autovacuum runs to process some table does it recollect statistics. It's controlled by several settings, a threshold and a scale factor, related to the analyze part.
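A rough sketch of the commands and settings under discussion, with a made-up table name and example numbers only:

    -- manual statistics collection: the whole table, or a single column
    ANALYZE orders;
    ANALYZE orders (customer_id);

    -- auto-analyze fires roughly when rows modified since the last analyze exceed
    -- autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples
    -- (defaults 50 and 0.1, i.e. about 10% of the table); both can be overridden per table:
    ALTER TABLE orders SET (
        autovacuum_analyze_scale_factor = 0.01,   -- re-analyze after ~1% of rows change
        autovacuum_analyze_threshold    = 1000
    );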
Starting point is 00:05:37 Yeah, and I think it's fairly recent. I didn't actually check, but one of the newer versions lets us set an analyze scale factor even for insert-only workloads. Right, there was a problem for append-only use cases, when you just add a lot of data and that's it. Vacuum didn't process it. Well, vacuum is probably not needed, because you don't produce dead tuples if you just insert, right? But analyzing, recollecting statistics,
Starting point is 00:06:10 is definitely what I would like to have. And actually many examples, educational examples and so on, they show: okay, let's load something into a table. And before, you had to run analyze yourself to have proper plans. But now you can just wait a little bit, right? And it will do the job, and then you can play with it. But of course, if you want control, probably you just
Starting point is 00:06:38 need to say ANALYZE, table name, and that's it. Yeah, so actually, I want to say 11 or 12, when it was like a couple of versions ago, maybe three versions ago. Wow, that's getting quite a long time ago. I remember when that was new, even me. I remember Darafei, I cannot pronounce the last name, sorry, Darafei participated in this discussion, and he was active in the Russian-speaking community as well. So I remember this case, and it was quite an obvious improvement. It was one of those changes where you have the feeling: how come it didn't happen before, right?
Starting point is 00:07:19 Yep. But that's the nice thing. It does keep getting better. And the reason I wanted to be specific about the analyze you were talking about there is that there's a few places people might see that word. And there's analyze itself as a keyword that gathers statistics. You might see vacuum analyze as like a kind of parameter to vacuum. That is the same thing. So that's doing both vacuum and analyze. But you might also see explain analyze, which I guess we're going to talk about in a moment, which is not related at all, just totally different thing. So naming things is difficult.
Starting point is 00:07:55 And sadly, that's just one we need to, like... once you know what it does, you realize it's unrelated, but it can be confusing for beginners. Right. Terms, words are overloaded everywhere. The statistics term is also overloaded. We can talk about statistics in several senses: database statistics and user statistics, and this statistics, which keeps the column and expression stats. And this is what is recalculated when you run analyze. And also in an explain plan we see some
Starting point is 00:08:26 statistics, right? We can also call it that. So the terms are overloaded indeed; there are too few words in language, unfortunately, in any language. But okay, back to our topic. If we see a nested loop and we say we would like to have something else, like a hash join, probably we just need to check it with analyze? Like, what will you do, just run analyze and double-check the plan, or what? Yeah, so I think this goes back a step: how are you even spotting that problem? I would suggest that you should be running explain analyze and trying to work through it. Or rather, explain analyze buffers. Yeah, ideally, maybe even verbose, format JSON. Maybe... on Twitter, there was a new saying, like, in buffers we trust.
Starting point is 00:09:18 I like it. Right. I think I saw you had a t-shirt like that as well. So that's fun. Yep. So, explain analyze buffers, to check it before you run analyze to recollect statistics. Yeah. But specifically, explain alone won't help us here, because explain alone only gives us the estimated number of rows being returned at each stage of the execution plan. But with explain analyze buffers... well, explain analyze is the
Starting point is 00:09:46 important part for this topic because that will give us not only the estimated number of rows but also the actual number of rows being returned and it's mismatches there that can really help us diagnose where this problem's arising where the bad estimates are so yeah that would be let me let me disagree with you here at least at least partially you're talking about like straightforward examples when we can run exponentialized buffers so we see all details both for planning and execution and we understand something is wrong we don't want to have a led loop here. But what if our execution takes 100 years? Yeah, of course. So if you say, like, just explain is not helpful, it's still helpful.
Starting point is 00:10:34 It will show a nested loop. It will give us an understanding of how the planner thinks about our data. And we know our data, right? We know how many rows are there and there. So we say, like, no, if I were the planner, I would do it differently. So just explain, in some cases, is also kind of helpful, right? In some extreme cases, when execution is long.
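To make the difference concrete, a hedged sketch with an invented query: plain EXPLAIN only shows the planner's estimates and never executes anything, while EXPLAIN (ANALYZE, BUFFERS) runs the query and also reports actual row counts and buffer numbers:

    -- estimates only; safe even when execution would take hours
    EXPLAIN
    SELECT *
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.created_at > now() - interval '1 day';

    -- executes the query; compare estimated rows vs actual rows at each node
    EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT JSON)
    SELECT *
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.created_at > now() - interval '1 day';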
Starting point is 00:11:00 Very, very good point. Yep, completely agree. But then I agree with you about the second port of call. It's often, if you can see the relations involved, and you think maybe analyze hasn't run recently on those, you can just run it; it's not blocking, or at least it's very low risk. And that might be enough to solve your problem. It might be that the statistics are out of date. Yeah, we can check. We can check when analyze ran. In the logs, if autovacuum logging is enabled, and in recent versions the threshold is 10 minutes,
Starting point is 00:11:45 it should be zero, always zero, so that all log entries related to autovacuum are there, in my opinion. But we also have pg_stat_all_tables and pg_stat_user_tables, and we can check there. It has four columns, two pairs, all of them timestamps. I'm not talking about the count columns; a count is quite a silly metric, right? Okay, five. Now five. Well, zero means something, but five, ten, when? Right, so two pairs of timestamps: one pair for vacuum, where one is for autovacuum and one is for manual vacuum, and two timestamps for analyze, where one is for auto-analyze, meaning the analyze job of autovacuum, and the second is for manual analyze. And we can see when it happened, right? It might still be happening.
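For example, a quick way to see those four timestamps (a sketch; ordering and filtering to taste):

    SELECT relname,
           last_vacuum, last_autovacuum,
           last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
    ORDER BY last_autoanalyze DESC NULLS LAST;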
Starting point is 00:12:42 activity. And you will see that there auto vacuum colon analyze or vacuum analyze so you can yeah good point it could be have some right now in just activity the query column it has it it has details for auto vacuum jobs as well so it's it's seen as regular session. What's the, like, on huge databases, how long do you tend to see how long it's taking? Hours. Well, it depends on how detailed you want your statistics to be. By default, we have default statistics target 100, meaning 100 buckets. But many people increase it, for example, to 1000,
Starting point is 00:13:26 sometimes more. And the bigger default_statistics_target is, the longer analyze will be. And it will produce noticeable disk I/O, of course, reading data if it's not cached in the buffer pool or the page cache. And also, individual columns can be tuned. So default_statistics_target is a cluster-wide setting, but some columns can be adjusted additionally with ALTER TABLE, I don't remember, SET STATISTICS, something like that. You can say: I want a thousand buckets here, but 100 buckets there, and only 10 there, something like this. It's like fine-tuning.
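The command being half-remembered there is ALTER TABLE ... ALTER COLUMN ... SET STATISTICS; a sketch with invented names:

    -- cluster-wide default (100 unless changed)
    ALTER SYSTEM SET default_statistics_target = 500;
    SELECT pg_reload_conf();

    -- per-column fine-tuning for a skewed column
    ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
    ANALYZE orders (status);   -- takes effect at the next analyze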
Starting point is 00:14:05 usually i recommend like avoid it unless you're 100 percent do no in my in my experience i saw cases when people tuned it but they didn't we're not very clear in their intentions you know it was like kind of experimenting and then it was abandoned and then you just have some leftover of past experiments that's it so systematic approach should be like you know what you're doing and here we definitely want more detailed statistics so analyze can it depends on on table size and on number of buckets you want to collect for each expression or column. Yeah. So that's, in fact, going on to the statistics target, I think that's a useful next.
Starting point is 00:14:52 So if analyze doesn't fix your problem, there's something interesting. If you're aware now of where the problem is, and if you know the distribution of that data, if you've got a very skewed distribution, that's where I see increasing that target being useful. Well, I don't know if I have the same opinion as you. I personally don't think it's sensible to change the global statistics target, increasing it from 100 to 1,000. But I've seen some people have success
Starting point is 00:15:23 with doing it with specific columns that they know are distributed a certain way. Before I started to work with thin cloning, I worked on a holistic approach to this problem. So for example, you say: okay, I have my database, I have many clones of it, and I have my workload. I can reproduce it reliably, definitely can do it, at least part of it, right? And then we were able to run, for example, 10 VMs in parallel. We can do it sequentially, of course, but if we can do it in parallel, why not? By the way, there will be a question: do we have the same hardware everywhere? So we also added some micro-benchmarks to compare the baseline: same CPU, same disk,
Starting point is 00:16:12 same and so on but the parallelization is of benchmarks is quite interesting topic itself so and then you just compare you take 10 values for your knob, in this case, default statistics target, and then you compare many, many things. For example, starting from regular latencies and throughput, you can also use PGSat statements to analyze particular query groups and see deviations. And it's very, this a very good holistic approach. By the way, it's also possible to do it using thin clones, but you need to focus, of course, on buffers, not on timing, because timing will be quite volatile metric, but buffers will be reliable.
Starting point is 00:16:59 So it's possible to do something there and see how the amount of work in terms of IOs or buffers will change depending on default statistics target. And in this approach, the goal was, okay, we want to improve queries. And if we change global value, which queries will be improved and which queries maybe will have some penalty, right? Of course, you need also to compare and analyze in this experiment. And this is holistic approach quite interesting, actually, if I like this approach, it's like, I consider such approaches as enterprise ready, because you answer all questions, cover all topics here, and then you can make decision with a lot of data. But
Starting point is 00:17:43 requires efforts, of course. Yeah, right. And I haven't seen as many cases where other queries are slowed down in these cases, definitely in other performance cases. But I hadn't considered a really good point you made around the speed of analyze being very dependent on this and and then well that's that's crucial for things like major version upgrades because if we when we do a major
Starting point is 00:18:13 version plans changes plans changes well we we don't have statistics do we we don't get the statistics automatically so we have to run analyze before we can like as part of the downtime but there's no i don't see any way around that so if we've now increased our statistics target to a point where it's going to take half an hour to run analyze across our entire database then that's another half an hour of downtime every time we do a major version upgrade yeah that's that's not good no exactly so it's it's considering these other i hadn't thought so much about that but that makes me think even more that the kind of going after each only increasing on a kind of
Starting point is 00:18:52 case by case basis might be more sensible right and and analyzing stay in stages i saw cases when it didn't help so like to sit in some intermediate state, it was not good at all. So conclusion was let's do the last step, like maximum statistics and just wait more. And that's it. It would be great if postgres upgrade, pg upgrade would just export and import pg statistic, understanding changes if they are happen between versions. But by way, I remember I wanted to interrupt you in one more place, but I didn't. So you mentioned that if we have issues, probably we need to change this. We started to discuss this default statistics target knob.
Starting point is 00:19:43 By the way, I remember I wanted to interrupt you in one more place, but I didn't. You mentioned that if we have issues, we probably need to change this; we started to discuss this default_statistics_target knob. But maybe there is another mitigation action. For example, we can tune autovacuum to recollect statistics more often. Yep. So I would start from there, maybe. Not from changing the number of buckets. Yep.
Starting point is 00:20:02 Well, and it depends a little bit on the problem, right? It depends on whether it's a distribution issue, or a data-being-stale issue, or a correlation issue. So there's another... Before we go there, before we go... Oh, go on. Right, so the question is: what is it? And to answer the question, we need to experiment. That's why I started with: will you do a manual analyze to check it, right? Okay, we see that vacuum did it yesterday, and we inserted a lot, for example, so obviously we should run a manual
Starting point is 00:20:35 analyze, right? And double-check the plan, compare before and after, right? I mean, I feel like I want to say yes, but equally I feel like it's a trap. You feel it because you know me already, right? It is a trap, it is a trap. Well, 99.9% of DBAs will do it, and I personally will do it as well, but I don't like it, because later we want to understand what's happening, right? And this is a one-way ticket, always. You can't undo it, you mean. Exactly. But we do then know what the problem was.
Starting point is 00:21:10 At least then we know the problem was in the stats. But how can you be 100% sure that you won't have any questions for the previous version of the statistics? Maybe you would like to check other plans as well, right? You run analyze and that's it. Door's closed. Bye-bye. It's not good, right? That's why, guess what?
Starting point is 00:21:33 Thin clones. Well, just to question that slightly: if we have point-in-time recovery, for example, and we want to, let's say, restore to a previous time before we ran analyze, can we get the old statistics back? Exactly, yes. The only problem with this approach is that it's, like, a terabyte per hour. Yeah, yeah, okay, cool. That's it. So if you have 10 terabytes, wait 10 hours to try again, and then you run analyze. That's why, I think, thin clones and Database Lab Engine. So I think my answer to your question is still yes, if it's like a
Starting point is 00:22:10 if this is like a production issue we don't like we need to heal us up yeah exactly I would but equally really good point and worth thinking worth taking a split second to think about the consequences before
Starting point is 00:22:26 hitting execute oh by the way is analyze transactional I don't know in this case you can detach one of clones right make it analyze rollback well yes exactly
Starting point is 00:22:41 I think it is because it's just a pg statistics table it should be I don't pretty statistics table. It should be. I don't remember 100%, but it should be transactional. So you just begin, analyze, check your plan, roll back. Why not? It should be so. Cool. Very well.
Starting point is 00:22:59 Maybe I'm terribly wrong, but... It's a great idea. Someone should check it. Well, one of us will check it. In this case, you can iterate: basically reset the statistics once again and check, using regular, traditional Postgres, not thin clones, which I like so much. So okay, this covers a bit of how we can play with many different query plans and see what's happening if we... Now let's talk about the correlation you wanted to talk about, right?
Starting point is 00:23:35 Well, yeah, I guess we've only covered a couple of the different ways that this could be a problem. And another famous one, I think we have discussed it on here before, is let's say we're looking at two columns that are highly correlated, and by default Postgres is going to assume... I'll go on: each time I try to invent some example, I'm becoming either racist or sexist or something, I cannot... I've got a good one for you: car makes and models. For example, if we say WHERE car make is Toyota and car model is Prius, those are going to be extremely highly correlated; I'm not aware of any Priuses that aren't Toyotas. But by default, Postgres is going to assume that those are independent variables and is going to look at
Starting point is 00:24:21 the proportion of rows that are Toyotas, if they're in the most common values, and the proportion that are Priuses, and work out a much lower estimate of the total than is actually the case. So that's probably the best, well, currently the best use of extended statistics. So we can now, thanks to people like Tomáš Vondra, I think there's a few more people involved, as of a few versions ago, we can now specify, like, we can educate Postgres about these. Yes, there are three options. And in this case, I don't remember the exact words, but something about distinct values, most common values, and the third is about this, like, one depends on
Starting point is 00:25:07 another, right? So functionally dependent or so. In this case, I would choose that third option, because obviously, Prius means Toyota. So probably, and if we don't do that, Postgres just multiplies, okay, we have like 20% of Toyotas. Okay, we have 1% of Prius, like multiplication, a very low number. This is a bad idea, obviously. And also, if our query includes, for example, I don't know, like Nissan and Prius, like, obviously, we should have zero, right? But Postgres will tell multiplication will give some non-zero number also wrong. In this case, definitely.
Starting point is 00:25:49 But the question is how to, like, how to, okay, I usually try to find some holistic approach, right? How to build some framework. You're right, by the way, that obviously, we just click statistics or we increase number of buckets, none of queries should slow down. It should only help, right? I think so, but I'm not 100% sure. Right. But the framework, why I talked about it, the framework also asks question, are there any parts of our workload which struggle from our change? This is the idea.
Starting point is 00:26:28 We should check everything and ensure that we improved something, but we have everything else at least the same. Not improved, but at least the same. And here I also ask a holistic question. How to understand that we need it? We need to guess that this column depends on that or we can oh well based on the query plan we can often see that that's where a misestimate is happening right like that's that i mean i i'm definitely somebody who's guilty of having a hammer and then everything looks like a nail so that's my like that's typically the way i've seen it but be easier to to spot so yeah that that would be my point but where where is that
Starting point is 00:27:11 the place you would look as well i haven't just the question no answers okay um because this is a common issue in query plans and it's it's not always clear from the query plan where the issue is we we have a tip in pg mustard for particularly bad row estimates especially when well only when there's it they're part of a slow part of the query plan a slow subtree and we link out to a blog post that we've written going through kind of like helping people try this then try that or you know this going through basically the things we've discussed here, so I will link it up. But I think I need to update the blog post based on a few things we've mentioned here. It's a few years old.
Starting point is 00:27:53 Yeah, so I like that you also started to advertise your product, because we invest a lot of effort to try to help people and to improve things, tooling and so on. I think this statistics area is very, very underrated, because you need to make a lot of effort to get into it. So if, for example, explain analyze would suggest, or if pgMustard suggests more explicitly what to do, more people would start using it, right? So we need some helpers that would do it. Actually, we don't have helpers for indexes. So maybe, and still, there are some works in progress trying to improve this. But very interesting questions. Yeah.
Starting point is 00:28:34 But very interesting questions. Yeah. Another, well, another thing that, the last thing I wanted to make sure we covered as part of this, it feels odd not to, is that sometimes people get to the end of their tether with this. And, you know, maybe create statistics doesn't yet support their use case.
Starting point is 00:28:51 Like we've talked before about two correlated columns, but from different tables. I think you mentioned quite a good idea using materialized views for that. But as a bridge to have an index on columns from different tables. Yes. Yes. Yeah. It's like quite also silly thought, but it's what we have. Can we do create statistics on materialized views? I didn't even think about that.
Starting point is 00:29:16 Great. So that makes sense then. And even on foreign data wrapper tables, so foreign tables. Yeah. Makes sense. Well, the thing i wanted to mention though is that sometimes people as a last port of call or because they're used to using them in previous databases also would like hints for this so i know i know people often talk about index don't mix with
Starting point is 00:29:36 hints hints you mean to to to command the the executor what to do the planner what to not necessarily to command them or he says it's time to do that, to create it. No, I meant... Regular notifications. Okay, like normal. Yeah, I did mean regular ones because I think there's PG Hint Plan that I think one of the things they let you do is give an idea of the number of rows you're expecting from certain things.
Starting point is 00:30:03 So I think there's an interesting area there. And I did see a talk by Robert Haas that when somebody asked about hints, he seemed very open to them specifically for row estimates, not for other things. So that was super interesting to me. Yeah, and the common approach to hints
Starting point is 00:30:22 from the core hackers is we don't want them. Yeah. But still people need it sometimes. So I don't have 100% strong opinion here. But let me tell you some small story. We added hints to Database Lab. So Database Lab is non-production and we the idea was to allow people to experiment more and see what would like what if approach what if plan would be different and so but
Starting point is 00:30:54 sometime later sometime past people started to use it this comment okay people started to ask okay it's good i found good better plan optimized now why we don't have it on production right yeah and i realized that you if you add something to lab you also need to think if like some there will be some people that will want the same on production so and they're already like okay should we add it to production? Maybe no. Maybe yes. There are different opinions here. Maybe we should do a whole episode on that, actually.
Starting point is 00:31:32 That feels like a good one. Wonderful. Was there anything else you wanted to make sure we covered? No. Analyze more often. Yeah, absolutely. Well, thanks, everybody, for listening. Thank you, Nikolai.
Starting point is 00:31:47 And see you next week. Or have a good Christmas for everybody who's celebrating. Thank you. Likewise.
