Postgres FM - Extended Statistics
Episode Date: February 28, 2025

Nikolay and Michael discuss the CREATE STATISTICS feature in Postgres: what it's for, how often it's used, and how to spot cases where it would help.

Here are some links to things they mentioned:

CREATE STATISTICS https://www.postgresql.org/docs/current/sql-createstatistics.html
citext https://www.postgresql.org/docs/current/citext.html
Statistics Used by the Planner https://www.postgresql.org/docs/current/planner-stats.html
default_statistics_target https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-DEFAULT-STATISTICS-TARGET
Tomáš Vondra on Postgres TV https://www.youtube.com/watch?v=8la-OWfD3VI
Recent commit to Postgres 18 for pg_upgrade https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1fd1bd871012732e3c6c482667d2f2c56f1a9395
Multivariate Statistics Examples https://www.postgresql.org/docs/current/multivariate-statistics-examples.html
Extended statistics (README) https://github.com/postgres/postgres/blob/master/src/backend/statistics/README
How we used Postgres extended statistics to achieve a 3000x speedup (blog post by Jared Rulison) https://build.affinity.co/how-we-used-postgres-extended-statistics-to-achieve-a-3000x-speedup-ea93d3dcdc61

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello, hello. This is Postgres FM. My name is Nikolay, founder of Postgres AI. And as usual, my co-host is Michael, pgMustard. Hi, Michael.
Hello, Nikolay.
How are you doing? How is your service doing? I saw a newsletter.
Oh, yes. All good.
Good.
Ticking along.
Improvements.
How about you?
Yeah, also like bug fixes, you know, and improvements as well. We are preparing some big update, probably next month. So I
wanted to... actually, you wanted to discuss this. It's your topic.
Yeah, very much my topic, it sounds like.
Yeah, and I think it's a useful topic, but as you know, I want to be honest
and discuss how often we can use this feature of Postgres and what's
wrong.
Not wrong, maybe, but let's just say extended statistics is the topic.
So it's what you can achieve by running the SQL
command CREATE STATISTICS, giving it a name. It's very similar to index creation, right?
Similar, right? You create something with it. It's defined for one table only. You cannot do it for
multiple tables. You choose columns, and there are several parameters you can specify.
And it's very similar to index creation, but instead of building a whole additional structure
to speed up searches for values, for entries, we just tell Postgres that it needs to maintain
more statistics than usual, because by default it maintains
statistics for every column individually.
Not only every column individually: with CREATE STATISTICS it will also maintain multi-column statistics
and also expression statistics.
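As a rough sketch of the command being described (the statistics name, table, and columns here are hypothetical, not from the episode):

CREATE STATISTICS my_table_ab_stats (ndistinct, dependencies, mcv)
    ON column_a, column_b FROM my_table;
ANALYZE my_table;  -- the statistics object is only populated when the table is analyzed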
And it already maintains expression statistics for expression indexes, like functional indexes, automatically. For example, a very simple
common case is when we have a column named email and we use text, not citext. Nobody uses citext.
This extension is not used at all. I don't know why it exists. It has existed for ages. citext is
an extension, actually a contrib module, so it's always available in any Postgres setup.
And it's case-insensitive text.
Because by default, if you create an index on the column named email, you...
A unique index. I forgot to say an important thing, a unique index.
It won't guarantee uniqueness in a case-insensitive manner.
So uppercase email and lowercase email will be considered different values.
And this is not what we want, because in reality emails are case-insensitive.
Case doesn't matter.
So you need to use the citext contrib module, but nobody does it.
I never see it in reality.
Oh, it's so exotic.
I used it last time 10 years ago.
So we usually end up having an index on lower(email).
And what I wanted to say is that by default, implicitly,
Postgres will maintain statistics for this expression,
lower(email).
Right?
Yeah.
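A minimal sketch of that pattern, assuming a hypothetical users table:

CREATE UNIQUE INDEX users_email_lower_uidx ON users (lower(email));
-- Once this expression index exists, ANALYZE also gathers statistics
-- on lower(email), as if it were an extra column.
ANALYZE users;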
So expression statistics come automatically in that case, but CREATE STATISTICS allows you to maintain
additional statistics, like additional expressions or multi-column statistics,
without creating an index.
Yes. Good feature. Although, until reading a lot about it earlier today, I'd only ever
really thought of the multi-column use cases.
The individual column case is interesting if you want to be able to use statistics
on an expression but don't want the overhead of maintaining an index on it. Which is interesting, but I've not seen that case before.
Yeah. So for example, we have some expression on a column, a single column.
We want statistics for it, but we don't want an index. Can you
imagine such a case?
I haven't worked out one yet, but it made it into the documentation,
so I imagine somebody came up with that case once.
Yeah. I want an example.
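For what such a case could look like, here is a hedged sketch: statistics on an expression without paying for index maintenance (the table and column are hypothetical, and the parenthesized-expression form needs a reasonably recent Postgres version):

CREATE STATISTICS users_email_lower_stats ON (lower(email)) FROM users;
ANALYZE users;
-- Now the planner can better estimate e.g. WHERE lower(email) = '...'
-- without the write overhead of an expression index.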
But I do think it's worth us going back to basics
a little bit, like why do we want or need these statistics?
I think it's really important.
And I was gonna just say-
Actually, this is what-
Oh, go on.
This is what you should see working
with individual query plans in your...
In pgMustard, you obviously have cases.
And as we discussed before recording,
you mentioned that you have some logic
in pgMustard developed to diagnose cases
when something is off and probably you need to
consider creating statistics.
Well, yeah, this is the tricky thing.
It's not super intelligent though.
We look out for bad row estimates, and they can be a real problem for query plans, because
the beauty of the Postgres Optimizer is that it's choosing
between various plans and it's doing that work for you.
You don't have to specify how it should get the data.
You just ask your query and it works out, hopefully, the fastest way to do that.
But it's making a bunch of decisions about join order.
Maybe that's a bad example for this topic, but a scan type, different
join algorithms, all of these decisions are based largely on the number of rows being
returned at each stage and therefore the selectivity of various things. And calculating those selectivities
is tricky, like it's a hard problem.
And it's the reason why Postgres maintains statistics
in the first place.
It does that on a per column basis.
The reason it doesn't do it on a combination of column basis
by default is because just the number of those
explodes so quickly.
Like, as soon as you've got a few columns in one table,
the number of combinations of two, is it n choose two, it just explodes.
So it would be super expensive to do it for everything.
But in certain extreme cases, which are actually quite common in real world applications, some
columns are heavily correlated to other
columns and popular ones and important ones often get queried together.
So an example I see really often in software-as-a-service applications is
things like organization ID and team ID or something like that.
You know, a team ID is always going to be associated with an organization ID, always, always, always.
Or country and language, another example.
Country and language is a common example, as is car and manufacturer, and
things like that.
But the one reason I mentioned org ID and team ID is because often the data is then
partitioned based on that, or the primary key is some combination of
those things. So they're really often being queried together. And Postgres's
default, which is a really sensible default, is to assume that different conditions are independent
of one another. In a WHERE condition, that means it's really going to underestimate how
many rows are going to come back, because it just multiplies the estimates, which can lead to bad things like nested loops.
But it also has the opposite problem in GROUP BY: if you group by multiple things and they're
actually very related, you're going to get a massive overestimate instead of an underestimate.
So it can have both.
It can have issues in
both directions.
Both directions of mistake.
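As a hedged sketch of both failure modes, on a hypothetical memberships(org_id, team_id) table where team_id determines org_id:

-- WHERE org_id = ... AND team_id = ...  ->  underestimate (selectivities get multiplied)
-- GROUP BY org_id, team_id              ->  overestimate (per-column ndistinct gets multiplied)
CREATE STATISTICS memberships_org_team_stats (dependencies, ndistinct)
    ON org_id, team_id FROM memberships;
ANALYZE memberships;  -- dependencies helps the WHERE case, ndistinct helps the GROUP BY case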
Yeah. So as the maker of that software, you probably have a good
understanding of which of your columns are going to be, I think the technical language
is, functionally dependent on others.
How?
Well, like org ID and team ID, you know that...
I don't like this answer. I want the answer which could be written
for humans, and maybe not only humans, some step-by-step how-to recipe, an algorithm
or something, right? Look here, check this, and then decide. This is what I
think is missing here, because as with indexes, CREATE STATISTICS is something
that is not working by default. It requires an effort from the engineer.
And I think, as I told you before recording, I never used it for real in
production to solve any problems, like zero cases. It was
added in Postgres 10, seven years ago. During these seven years, several times I considered it. And you know,
when we see, okay, the row estimate is completely off, like a thousand times,
a million times, completely off. What to do? Of course, analyze, recollect
the stats, see if it helps. If it doesn't help, what to do? Then the old-school approach: okay,
probably 100 buckets is not enough. default_statistics_target is a global setting,
but you can control it on an individual column level. And you say, okay, I want 1000 buckets
here. And actually, I remember Tomáš Vondra was a guest on Postgres TV,
we discussed this, and Tomáš's opinion was that 100 buckets is quite a lot, so it's not
a good idea to always fix it by increasing this number. But I see it sometimes.
Yeah, I think that might also sometimes help, because it not only increases the number
of buckets but also increases the length of the MCV list, the most common values.
But if we don't use standard statistics?
With the regular statistics target on a column, you also get a bump in the number of MCVs
tracked.
Ah, okay, I understand.
Which, I think, might also sometimes be helpful; it might sometimes solve the problem.
So I can understand why throwing more sampling at the problem could be helpful there.
And I'm okay to spend more disk space and memory for this. The problem usually, if you raise
default_statistics_target to 1000 for example, is that analyze takes longer.
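The per-column knob they mention looks roughly like this (the table and column names are hypothetical):

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;  -- the global default is 100
ANALYZE orders;  -- more histogram buckets and a longer MCV list, at the cost of a slower analyze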
Yeah. Which is acceptable in many cases. I wanted to mention very big news
which occurred last week. You know it, you name it.
Yeah, well, the commit for Postgres 18 that
unfortunately won't include what we're talking about today, extended statistics,
but it's still huge and really positive news, although I've seen that there are
some performance issues. So in Postgres 18, it is expected that during pg_upgrade... We all know that analyze is on the user's shoulders,
basically on DBA or DBRE shoulders.
pg_upgrade doesn't run analyze.
Even if it's an in-place, brief upgrade, a very fast upgrade taking only a couple of minutes, it's pg_upgrade --link, or -k.
So hard links are used, it's very fast, and it also dumps and restores the schema, because the
new version has different features, so the schema needs to be dumped explicitly and
restored.
This is automated.
But then it says, okay, now it's time to run analyze.
And there is analyze in stages, which I don't like.
We talked about it a few times.
But basically, it's outside of the scope of pg_upgrade to run analyze.
And I criticized managed Postgres providers like RDS, Cloud SQL, Azure, because they don't automate it either and leave it on the shoulders of DBAs and DBREs.
People keep forgetting it and after upgrade,
they don't have statistics at all.
So big news is in Postgres 18,
it was just committed and hopefully won't be reverted.
Statistics will be dumped and restored.
Yes.
And in my opinion,
that's even better than analyze being automated.
Of course, it's better.
This is the answer I got when I raised this on Twitter,
on X, many months ago, saying that it's not all right
that many service providers don't do it.
Somebody said, let's wait, and it was the right idea. So if now statistics are dumped and restored,
we have the old statistics, and that's good. Yeah, and my understanding is this is a pg_dump feature, therefore it should work.
Let's say it does end up in 18.
It should work for the next upgrade you do
no matter what version you're coming from.
Yeah, I also thought about it.
It's interesting.
I'm not 100% sure, but of course definitely you can run,
you can use the new pg_dump against some old Postgres server.
And hopefully this feature will work.
It means that we will be able to upgrade all servers to a new version without the need to run analyze
explicitly.
Yeah.
And it means less downtime.
I consider the time you're running analyze to effectively still
be downtime, even if you try to restore connections, just because
queries could be much, much slower than they should be. So yeah, I think for me this could
drastically reduce effective downtime for a lot of projects during upgrades.
Yeah, I agree. Because in my opinion, analyze in stages is like,
it's a wrong feature, completely wrong feature.
And I prefer keeping analyze inside the maintenance window.
And of course, if you have default statistics target
elevated to some high numbers, thousands, in this case,
it takes longer,
especially if, like some people,
you have a lot of database objects, a lot.
I just recently had the pleasure to deal a little bit
with a case with 200,000 tables
and more than one million indexes.
And if you decided to raise default statistics target there, analyze will take ages.
And even if it's parallelized, like, yeah, so keeping statistics is good.
Let's get back to the topic, though. You might still want to run analyze after a major
version upgrade.
There are a couple of caveats, even in the 18 commit, as to things that won't get pulled
across, and one of them is extended statistics.
So if you are using extended statistics, and they're important for your query performance,
you will still need to run analyze to actually get them back.
I always like running analyze. For example, you know, for a partitioned table, autovacuum doesn't
maintain statistics on the parent table somehow. If you run analyze explicitly, you have your statistics.
So I just like running analyze.
So I think there's even an improvement.
Maybe it was in 17 that you can run analyze only on the parents.
So you don't even have to analyze all of the partitions, which is quite cool.
Yeah.
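If the ONLY form Michael recalls is available on your version (the speakers aren't sure which release added it), it looks roughly like this for a hypothetical partitioned table:

ANALYZE measurements;        -- samples the partitioned table and recurses into its partitions
ANALYZE ONLY measurements;   -- gathers statistics for the parent only, skipping the partitions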
So statistics on partitioned tables is probably another topic, and it's interesting in itself.
But back to extended statistics. So my point was, since this is a feature which requires
you to run some things explicitly and make a decision, there should be recipes.
For indexes, we have a recipe. Okay, we have a sequential scan,
we see some filter or ORDER BY, okay, we know an index should help.
We verify it, ideally on full-size clones.
We see it helps.
We deploy to production.
People who don't have fast and cheap clones, they check it out on production and see it
works.
Okay.
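For comparison, a hedged sketch of that index recipe (the query, table, and index names are hypothetical):

-- 1. The plan shows a Seq Scan with a Filter and the query is slow:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;
-- 2. Try the obvious index (on a clone if you have one, CONCURRENTLY in production):
CREATE INDEX CONCURRENTLY orders_customer_id_idx ON orders (customer_id);
-- 3. Re-check the plan and timing, then keep or drop the index:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;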
So that's it. As for extended
statistics, which was added, as we discussed, in Postgres 10, seven years ago: during these
seven years, several times I said, okay, maybe we need to create statistics here, let's
try. And I remember zero cases when this was the solution. We always ended up having a different solution, like creation of an index, or maybe redesigning the
schema sometimes, to the point of the previous discussion with Frank, right?
I mean, redesigning the schema, denormalization and so on.
So I don't remember a conclusion: okay, in this case,
CREATE STATISTICS is our solution to the problem. And that bothers me. Honestly,
it builds some weird feeling that I'm missing something, that everyone is using something I'm not using.
Well, interestingly, one of the main reasons I wanted
to bring this up is that I think extended statistics might be underused globally. I
think a lot of people don't know that it exists.
There are no recipes, that's
my point. We need recipes, very concrete recipes, for how to decide it's
worth using it right here.
Yeah, interesting. But I do think there's a chance...
Give me a schema-only dump. I will give this dump to some LLM, a good one, and ask it to find,
just based on column names, some ideas of what can have correlation. Then we build statistics
and hope it will help someday. But it's a weak solution.
I also think it's flawed, because I think if it's
not actually causing query performance issues, then you're paying that penalty
for analyze being slower with, again, no benefit, and that's a
problem.
Yeah, I didn't finish my thought. I'm okay, okay, I'm okay to pay some
penalty to analyze more, keep more statistics.
It's not a problem unless we have hundreds of thousands or millions of objects.
In this case, I will be very careful.
But if it's only a few thousand objects, I mean indexes and tables, it takes one minute
or three minutes to analyze this database.
It can be many, many, many terabytes.
One or three minutes, I don't care.
And in terms of storage, it's nothing.
And memory, it's not a lot.
Maybe I'm wrong, but I think so.
I think you're right.
And I think that's important for people to realize that it's not dependent on your data size
because we're sampling at this point.
Like, because it's a sampling thing, it doesn't scale linearly with how much data
you have.
Also, an important disclaimer: I have a mindset heavily targeted at startups, because I'm building
my fourth startup, and our clients, consulting clients and clients for products, are mostly
startups. And that means rapid development, like rapid, really rapid, sometimes a couple of deployments
per day or more.
It means that maybe the statistics we decided to keep are not needed today, but
maybe they will be needed tomorrow, because we build some new features and so on.
And maybe it's better to have them, because we know there is a correlation between these columns,
so let's have extended statistics over all these columns.
But we lack recipes, this is the point.
So the lack of recipes leads to underuse of this feature.
I think they might also be...
I think historically you've also mentioned having a bias towards OLTP systems in production. And I think there's also a chance that because often you're hyper-optimizing queries, and
that's a good thing, going for index only scans and very, very limited data retrieval,
very small number of rows, very, very precise lookups, you're probably not hitting these bad plans in
a lot of those cases, that you're skipping straight to a really good plan,
that's really hard for the planner to get wrong, because you're
giving it the ideal index to use for the important access patterns, and it's
possible that this comes up a little bit more when people are having to serve
more user
defined queries, maybe analytical ones that have any combination of parameters.
Or real queries and so on, right?
Exactly. And I think in those cases, maybe sometimes just not knowing what's going to
come, you can't create indexes for every case. You do.
And indexes slow down writes, unlike statistics. This is a super big difference.
Yes. So I think there's a chance that the bias towards OLTP, and also being able to
often add the indexes you need for a smaller number of access patterns,
or a lot of queries per second, might mean that this just isn't as
important in those cases. But
in terms of recipes, I think the big one is bad row estimates on important and slow queries.
So if a query is important and a lot slower than you need it to be, and there's a really
bad row estimate in an important part of the plan, so not in a part of the plan that is fast, but in a part of the plan that is slow. The really famous case is a nested loop where the planner
estimates it's going to return very, very few rows per loop, so maybe like zero or one.
So therefore it's thinking, well, I only need to do this a relatively small number of times,
therefore it's quickest and easiest to get started with a nested loop,
and I'll be done really quickly. But in reality it will return maybe a thousand rows or more per iteration.
Actually, in those cases a hash join or merge join would have been so much faster.
So those cases, I think, can benefit.
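A hedged sketch of how that shows up in a plan (all table, column, and value names are hypothetical):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders o
JOIN order_items i ON i.order_id = o.id
WHERE o.country = 'DE' AND o.language = 'de';
-- If a node under a Nested Loop shows an estimate like rows=1 but an actual
-- rows=1000 (per loop), the join choice was based on a bad estimate.
-- Correcting the estimate, for example with extended statistics on (country, language),
-- can let the planner pick a hash or merge join instead.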
But actually, there's a blog post I was thinking of, where a team
blogged about a time when extended statistics helped them a lot. I put the plan through
our tool and noticed that, yeah, a bad row estimate was scored as highly as it could be, but so
was an index efficiency issue. And I think, looking back, they could have added a better index and solved their performance issue just as well. So they solved it
with CREATE STATISTICS, but they could have solved it with a better index, and
probably the query plan they got in the end could still be improved further by a
better index.
So I have an idea for myself, and whoever is listening is welcome
to join. Next month, I will have a challenge for myself.
So every time I deal with some query optimization, I will put consideration of extended statistics
in one of the first places.
I will think, can it be helpful here?
And so on. Just prioritize it and think about it more.
And after one month of this, 30 days or a month, February is a shorter month, right? So,
30 days. After it, I will make my conclusion: should I keep this as one of the primary tools for query optimization
or as a secondary tool? How does that sound?
I'd love to hear that from anybody else that does try it. One thing I'd encourage doing
is also benchmarking the performance hit on analyze of that table. Like, if you do add it
to a table, how long does
analyze take before, and how long does analyze take after? And I think it will depend on which
of the few types of extended statistics you add. You can add all of them, or for
expressions... There are three parameters you can give it: ndistinct, dependencies, and mcv.
And I think mcv will be by far the most expensive
in terms of analyze performance.
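A hedged way to measure that, using psql's timing on a hypothetical table:

\timing on
ANALYZE big_table;                          -- baseline analyze time
CREATE STATISTICS big_table_ext (ndistinct, dependencies, mcv)
    ON col_a, col_b FROM big_table;
ANALYZE big_table;                          -- compare; per the discussion, mcv is likely the costly kind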
Oh, by the way, another thing that the docs mentioned
can be penalized is query planning time
and maybe query planning buffers for that matter.
If you have a lot of extended statistics created by us,
the planner can take longer?
Yeah, well it makes sense, right?
It's considering more information upfront.
Right, makes sense, yeah.
I don't know how, but I'd be interested in the benchmarks of how much longer, how much more, like...
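One hedged way to look at that (the query here is a placeholder):

EXPLAIN (SUMMARY) SELECT * FROM orders WHERE customer_id = 42;          -- plans only, prints Planning Time
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42; -- recent versions also print Planning: Buffers
-- Run it more than once: as they discuss later, the first plan in a session also pays cache warm-up costs.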
Yeah, but in this case, two things.
First is, well, both benchmarks sound to me
like single session benchmarks.
So it's not like we run a lot of sessions,
we can take one session and check just the speed,
maybe multiple times for good statistics, right?
So first thing sounds to me like checking analyze speed.
And I think we need to remember that if we talk about a single table, it's OK.
I mean, no problem.
But if we talk about many tables, it's worth remembering that it can be parallelized.
And the recommendation is to use vacuumdb with the --analyze option
and specifying the number of jobs (--jobs), the number of processes, backends basically, which will execute
this. And if we are alone on the server, we can take, for example, the number of vCPUs we
have on the virtual machine, right?
Yep.
This is one thing. But if it's a single table, it cannot be parallelized.
It will be a single thread anyway.
No.
But there's actually one more thing that I think is worth trying, just before we wrap up:
if it is a huge penalty but looks valuable to query performance, which I'm not sure you're
going to come across, but we'll see.
I've wondered about reducing the statistics target for those columns. So you
could reduce it and then add multi-column statistics. It's just an interesting idea for getting
back some of that analyze time. I know a very good SQL optimization expert who recommended
reducing default_statistics_target, and I'm going to communicate with him very soon. But this battle in my mind was settled like 10 years
ago already: reducing default_statistics_target, or the per-column parameter, was considered
not a good idea in my mind, compared to increasing it somehow. I don't know, this is
from various cases, but I'm going to use this as well as a part of my
challenge, just to think about this as well, because I will talk to him soon.
I've got an idea for you.
Let me comment on this in a second. I already started
forgetting what you said. What was the second one? First was, let's benchmark.
You said benchmark.
I don't, I'm very careful.
Analyze?
Yeah.
And also query planning time or query planning buffers.
Query planning time, this is good.
Yeah, buffers as well.
That's great, that's great.
So increasing default statistics target globally
definitely should affect this as well, right?
Globally.
Yeah. Here, if somebody is going to do it, don't make the mistake my team and I made
a few months ago when we published an article. When you just connect to Postgres and you
do something, planning time can be much bigger because of a lack of relcache, and it gets loaded, right?
The second time you run this, it's very different.
So the first planning time and the second planning time are very different.
If you have a connection pooler, and most people do,
it means that it's not a good idea to make conclusions from observations of the first run.
Yeah. Great point.
So even if extended statistics affect planning time, and I think they should,
the question is, is it only for the first run, or the second run as well?
Who knows, right? Because if it's only the first run, it means
relcache or... Well, it's not relcache here, right? Well, it's an interesting point just to check,
right? Yeah, based on the notes in the documentation, I got the impression it would be for every
plan. Because it's statistics, it's not relcache. So anyway, I would check multiple times.
One more idea for you,
especially talking to that person you mentioned,
I wonder if you could come up with a recipe
for which columns to reduce it on.
I reckon there's a whole bunch of really boring columns
like unique IDs, you know, primary key columns type thing.
Do we really need high statistics targets on those?
Or could we get away with much, much lower?
Do you see what I mean?
Like ones that we know are unique, and if we're keeping statistics on the
most common values...
Well, I agree with you, but what's the point of optimizing this area?
What are our savings?
In my mind, reducing the time analyze takes.
As I said, I'm not really bothered by the longer time.
It's three minutes against one.
So what?
That is quite a big deal, isn't it, in terms of downtime?
I don't know.
We have a recipe for zero downtime.
We can afford a couple of additional minutes being spent, and nobody notices it, because
it's a target
cluster running on logical replication, and we can do a very long analyze there.
So even one hour, nobody notices; it doesn't affect downtime at all.
So I don't know, and we have export in Postgres 18 now, so export and import of
statistics via pg_dump and restore.
Yeah, pg_dump and restore.
So I don't know, I can imagine this will
save something. We can find many places like that, right? For example, in some
cases we can start tuning autovacuum at the table level. And so many times I saw various teams go down this path.
And then we pulled them back saying, let's stick to defaults because table level settings
often cause more harm than good because they add complexity, a lot of complexity. It's hard to
maintain. Then you go microservices, you have a lot of clusters,
and these table-level settings, well, they're so hard to maintain.
And you're talking about column-level settings.
All I meant is if you can find a really simple rule.
Yeah, exactly.
If it's a unique column, reduce the statistics to one.
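A hedged sketch of that rule for a hypothetical unique ID column:

ALTER TABLE orders ALTER COLUMN id SET STATISTICS 1;     -- minimal per-column target for a unique key
ANALYZE orders;
-- ALTER TABLE orders ALTER COLUMN id SET STATISTICS -1; -- revert to the default later if needed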
This is good if it goes to hackers and into the core itself, an implementation which doesn't
require effort from the user. This I agree with. But if it's...
Cool.
Again, my main concern, my main comment about extended statistics is it requires effort.
It's not clear. There are no clear recipes for how to apply
it. So it ends up not being actively used. Unfortunately. Maybe in the next 30 days we
will have some recipes, right? Sounds good. I'm looking forward to it. Sounds good. Okay.
Thank you so much. Thanks so much, Nikolay. Take care.