Postgres FM - Extended Statistics
Episode Date: February 28, 2025

Nikolay and Michael discuss the CREATE STATISTICS feature in Postgres: what it's for, how often it's used, and how to spot cases where it would help.

Here are some links to things they mentioned:

CREATE STATISTICS https://www.postgresql.org/docs/current/sql-createstatistics.html
citext https://www.postgresql.org/docs/current/citext.html
Statistics Used by the Planner https://www.postgresql.org/docs/current/planner-stats.html
default_statistics_target https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-DEFAULT-STATISTICS-TARGET
Tomáš Vondra on Postgres TV https://www.youtube.com/watch?v=8la-OWfD3VI
Recent commit to Postgres 18 for pg_upgrade https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1fd1bd871012732e3c6c482667d2f2c56f1a9395
Multivariate Statistics Examples https://www.postgresql.org/docs/current/multivariate-statistics-examples.html
Extended statistics (README) https://github.com/postgres/postgres/blob/master/src/backend/statistics/README
How we used Postgres extended statistics to achieve a 3000x speedup (blog post by Jared Rulison) https://build.affinity.co/how-we-used-postgres-extended-statistics-to-achieve-a-3000x-speedup-ea93d3dcdc61

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello, hello. This is Postgres FM. My name is Nikolay, founder of Postgres AI. And as usual, my co-host is Michael, pgMustard. Hi, Michael.
Hello, Nikolay.
How are you doing? How is your service doing? I saw a newsletter.
Oh, yes. All good.
Good.
Ticking along.
Improvements.
How about you?
Yeah, also like bug fixes, you know, and improvements as well. We are preparing some big update, probably next month. So I
wanted to... actually, you wanted to discuss this. It's your topic.
Yeah, very much my topic, it sounds like.
Yeah, and I think it's a useful topic, but as you know, I want to be honest
and discuss how often we can use this feature of Postgres and what's
wrong.
Not wrong, maybe, but let's just say extended statistics is the topic.
So it's what you can achieve by running the SQL
command CREATE STATISTICS, giving it a name. It's very similar to index creation, right?
Similar, right? You create something with it. It's defined for one table only. You cannot do it for
multiple tables. You choose columns, and there are several parameters you can specify.
And it's very similar to index creation, but instead of building a whole additional structure
to speed up searches for values, for entries, we just tell Postgres that it needs to maintain
more statistics than usual, because by default it maintains
statistics for every column individually.
Not only every column individually: with CREATE STATISTICS it will also maintain multi-column statistics
and also expression statistics.
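As a rough sketch of the command being described (the statistics name, table, and columns here are hypothetical, not from the episode):

CREATE STATISTICS my_table_ab_stats (ndistinct, dependencies, mcv)
    ON column_a, column_b FROM my_table;
ANALYZE my_table;  -- the statistics object is only populated when the table is analyzed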
And it already maintains expression statistics for expression indexes, like functional indexes, automatically. For example, a very simple
common case is when we have a column named email and we use text, not citext. Nobody uses citext.
This extension is not used at all. I don't know why it exists. It has existed for ages. citext is
an extension, actually a contrib module, so it's always available in any Postgres setup.
And it's case-insensitive text.
Because by default, if you create an index on the column named email, you...
A unique index. I forgot to say an important thing, a unique index.
It won't guarantee uniqueness in a case-insensitive manner.
So uppercase email and lowercase email will be considered different values.
And this is not what we want, because in reality emails are case-insensitive.
Case doesn't matter.
So you need to use the citext contrib module, but nobody does it.
I never see it in reality.
Oh, it's so exotic.
I used it last time 10 years ago.
So we usually end up having an index on lower(email).
And what I wanted to say is that by default, implicitly,
Postgres will maintain statistics for this expression,
lower(email).
Right?
Yeah.
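A minimal sketch of that pattern, assuming a hypothetical users table:

CREATE UNIQUE INDEX users_email_lower_uidx ON users (lower(email));
-- Once this expression index exists, ANALYZE also gathers statistics
-- on lower(email), as if it were an extra column.
ANALYZE users;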
So expression statistics come automatically in that case, but CREATE STATISTICS allows you to maintain
additional statistics, like additional expressions or multi-column statistics,
without creating an index.
Yes. Good feature. Although, until reading a lot about it earlier today, I'd only ever
really thought of the multi-column use cases.
The individual column case is interesting if you want to be able to use statistics
on an expression but don't want the overhead of maintaining an index on it. Which is interesting, but I've not seen that case before.
Yeah. So for example, we have some expression on a column, a single column.
We want statistics for it, but we don't want an index. Can you
imagine such a case?
I haven't worked out one yet, but it made it into the documentation,
so I imagine somebody came up with that case once.
Yeah. I want an example.
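For what such a case could look like, here is a hedged sketch: statistics on an expression without paying for index maintenance (the table and column are hypothetical, and the parenthesized-expression form needs a reasonably recent Postgres version):

CREATE STATISTICS users_email_lower_stats ON (lower(email)) FROM users;
ANALYZE users;
-- Now the planner can better estimate e.g. WHERE lower(email) = '...'
-- without the write overhead of an expression index.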
But I do think it's worth us going back to basics
a little bit, like why do we want or need these statistics?
I think it's really important.
And I was gonna just say-
Actually, this is what-
Oh, go on.
This is what you should see working
with individual query plans in your...
In pgMustard, you obviously have cases.
And as we discussed before recording,
you mentioned that you have some logic
in pgMustard developed to diagnose cases
when something is off and probably you need to
consider creating statistics.
Well, yeah, this is the tricky thing.
It's not super intelligent though.
We look out for bad row estimates, and they can be a real problem for query plans, because
the beauty of the Postgres Optimizer is that it's choosing
between various plans and it's doing that work for you.
You don't have to specify how it should get the data.
You just ask your query and it works out, hopefully, the fastest way to do that.
But it's making a bunch of decisions about join order.
Maybe that's a bad example for this topic, but a scan type, different
join algorithms, all of these decisions are based largely on the number of rows being
returned at each stage and therefore the selectivity of various things. And calculating those selectivities
is tricky, like it's a hard problem.
And it's the reason why Postgres maintains statistics
in the first place.
It does that on a per column basis.
The reason it doesn't do it on a combination of column basis
by default is because just the number of those
explodes so quickly.
Like, as soon as you've got a few columns in one table,
the number of combinations of two, is it n choose two, it just explodes.
So it would be super expensive to do it for everything.
But in certain extreme cases, which are actually quite common in real world applications, some
columns are heavily correlated to other
columns and popular ones and important ones often get queried together.
So an example I see really often in software-as-a-service applications is
things like organization ID and team ID or something like that.
You know, a team ID is always going to be associated with an organization ID, always, always, always.
Or country and language, another example.
Country and language is a common example, as is car and manufacturer, and
things like that.
But the one reason I mentioned org ID and team ID is because often the data is then
partitioned based on that, or the primary key is some combination of
those things. So they're really often being queried together. And Postgres's
default, which is a really sensible default, is to assume that different conditions are independent
of one another. In a WHERE condition, that means it's really going to underestimate how
many rows are going to come back, because it just multiplies the estimates, which can lead to bad things like nested loops.
But it also has the opposite problem in GROUP BY: if you group by multiple things and they're
actually very related, you're going to get a massive overestimate instead of an underestimate.
So it can have both.
It can have issues in
both directions.
Both directions of mistake.
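As a hedged sketch of both failure modes, on a hypothetical memberships(org_id, team_id) table where team_id determines org_id:

-- WHERE org_id = ... AND team_id = ...  ->  underestimate (selectivities get multiplied)
-- GROUP BY org_id, team_id              ->  overestimate (per-column ndistinct gets multiplied)
CREATE STATISTICS memberships_org_team_stats (dependencies, ndistinct)
    ON org_id, team_id FROM memberships;
ANALYZE memberships;  -- dependencies helps the WHERE case, ndistinct helps the GROUP BY case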
Yeah. So as the maker of that software, you probably have a good
understanding of which of your columns are going to be, I think the technical language
is, functionally dependent on others.
How?
Well, like org ID and team ID, you know that...
I don't like this answer. I want the answer which could be written
for humans, and maybe not only humans, some step-by-step how-to recipe, an algorithm
or something, right? Look here, check this, and then decide. This is what I
think is missing here, because as with indexes, CREATE STATISTICS is something
that is not working by default. It requires an effort from the engineer.
And I think, as I told you before recording, I never used it for real in
production to solve any problems, like zero cases. It was
added in Postgres 10, seven years ago. During these seven years, several times I considered it. And you know,
when we see, okay, the row estimate is completely off, like a thousand times,
a million times, completely off. What to do? Of course, analyze, recollect
the stats, see if it helps. If it doesn't help, what to do? Then the old-school approach: okay,
probably 100 buckets is not enough. default_statistics_target is a global setting,
but you can control it on an individual column level. And you say, okay, I want 1000 buckets
here. And actually, I remember Tomáš Vondra was a guest on Postgres TV,
we discussed this, and Tomáš's opinion was that 100 buckets is quite a lot, so it's not
a good idea to always fix it by increasing this number. But I see it sometimes.
Yeah, I think that might also sometimes help, because it not only increases the number
of buckets but also increases the length of the MCV list, the most common values.
But if we don't use standard statistics?
With the regular statistics target on a column, you also get a bump in the number of MCVs
tracked.
Ah, okay, I understand.
Which, I think, might also sometimes be helpful; it might sometimes solve the problem.
So I can understand why throwing more sampling at the problem could be helpful there.
And I'm okay to spend more disk space and memory for this. The problem usually, if you raise
default_statistics_target to 1000 for example, is that analyze takes longer.
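The per-column knob they mention looks roughly like this (the table and column names are hypothetical):

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;  -- the global default is 100
ANALYZE orders;  -- more histogram buckets and a longer MCV list, at the cost of a slower analyze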
Yeah. Which is acceptable in many cases. I wanted to mention very big news
which occurred last week. You know it, you name it.
Yeah, well, the commit for Postgres 18 that
unfortunately won't include what we're talking about today, extended statistics,
but it's still huge and really positive news, although I've seen that there are
some performance issues. So in Postgres 18, it is expected that during pg_upgrade... We all know that analyze is on the user's shoulders,
basically on DBA or DBRE shoulders.
pg_upgrade doesn't run analyze.
Even if it's an in-place, brief upgrade, a very fast upgrade taking only a couple of minutes, it's pg_upgrade --link, or -k.
So hard links are used, it's very fast, and it also dumps and restores the schema, because the
new version has different features, so the schema needs to be dumped explicitly and
restored.
This is automated.
But then it says, okay, now it's time to run analyze.
And there is analyze in stages, which I don't like.
We talked about it a few times.
But basically, it's outside of the scope of pg_upgrade to run analyze.
And I criticized managed Postgres providers like RDS, Cloud SQL, Azure, because they don't automate it either and leave it on the shoulders of DBAs and DBREs.
People keep forgetting it and after upgrade,
they don't have statistics at all.
So big news is in Postgres 18,
it was just committed and hopefully won't be reverted.
Statistics will be dumped and restored.
Yes.
And in my opinion,
that's even better than analyze being automated.
Of course, it's better.
This is the answer I got when I raised this on Twitter,
on X, many months ago, saying that it's not all right
that many service providers don't do it.
Somebody said, let's wait, and it was the right idea. So if now statistics are dumped and restored,
we have the old statistics, and that's good. Yeah, and my understanding is this is a pg_dump feature, therefore it should work.
Let's say it does end up in 18.
It should work for the next upgrade you do
no matter what version you're coming from.
Yeah, I also thought about it.
It's interesting.
I'm not 100% sure, but of course definitely you can run,
you can use the new pg_dump against some old Postgres server.
And hopefully this feature will work.
It means that we will be able to upgrade all servers to a new version without the need to run analyze
explicitly.
Yeah.
And it means less downtime.
I consider the time you're running analyze to effectively still
be downtime, even if you try to restore connections, just because
queries could be much, much slower than they should be. So yeah, I think for me this could
drastically reduce effective downtime for a lot of projects during upgrades.
Yeah, I agree. Because in my opinion, analyze in stages is like,
it's a wrong feature, completely wrong feature.
And I prefer keeping analyze inside the maintenance window.
And of course, if you have default statistics target
elevated to some high numbers, thousands, in this case,
it takes longer,
especially if, like some people,
you have a lot of database objects, a lot.
I just recently had the pleasure to deal a little bit
with a case with 200,000 tables
and more than one million indexes.
And if you decided to raise default statistics target there, analyze will take ages.
And even if it's parallelized, like, yeah, so keeping statistics is good.
Let's get back to the topic, though. You might still want to run analyze after a major
version upgrade.
There are a couple of caveats, even in the 18 commit, as to things that won't get pulled
across, and one of them is extended statistics.
So if you are using extended statistics, and they're important for your query performance,
you will still need to run analyze to actually get them back.
I always like running analyze. For example, you know, for a partitioned table, autovacuum doesn't
maintain statistics on the parent table somehow. If you run analyze explicitly, you have your statistics.
So I just like running analyze.
So I think there's even an improvement.
Maybe it was in 17 that you can run analyze only on the parents.
So you don't even have to analyze all of the partitions, which is quite cool.
Yeah.
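If the ONLY form Michael recalls is available on your version (the speakers aren't sure which release added it), it looks roughly like this for a hypothetical partitioned table:

ANALYZE measurements;        -- samples the partitioned table and recurses into its partitions
ANALYZE ONLY measurements;   -- gathers statistics for the parent only, skipping the partitions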
So statistics on partitioned tables is probably another topic, and it's interesting in itself.
But back to extended statistics. So my point was, since this is a feature which requires
you to run some things explicitly and make a decision, there should be recipes.
For indexes, we have a recipe. Okay, we have a sequential scan,
we see some filter or ORDER BY, okay, we know an index should help.
We verify it, ideally on full-size clones.
We see it helps.
We deploy to production.
People who don't have fast and cheap clones, they check it out on production and see it
works.
Okay.
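For comparison, a hedged sketch of that index recipe (the query, table, and index names are hypothetical):

-- 1. The plan shows a Seq Scan with a Filter and the query is slow:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;
-- 2. Try the obvious index (on a clone if you have one, CONCURRENTLY in production):
CREATE INDEX CONCURRENTLY orders_customer_id_idx ON orders (customer_id);
-- 3. Re-check the plan and timing, then keep or drop the index:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;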
So that's it. As for extended
statistics, which was added, as we discussed, in Postgres 10, seven years ago: during these
seven years, several times I said, okay, maybe we need to create statistics here, let's
try. And I remember zero cases when this was the solution. We always ended up having a different solution, like creation of an index, or maybe redesigning the
schema sometimes, to the point of the previous discussion with Frank, right?
I mean, redesigning the schema, denormalization and so on.
So I don't remember a conclusion: okay, in this case,
CREATE STATISTICS is our solution to the problem. And that bothers me. Honestly,
it builds some weird feeling that I'm missing something, that everyone is using something I'm not using.
Well, interestingly, one of the main reasons I wanted
to bring this up is that I think extended statistics might be underused globally. I
think a lot of people don't know that it exists.
There are no recipes, that's
my point. We need recipes, very concrete recipes, for how to decide it's
worth using it right here.
Yeah, interesting. But I do think there's a chance...
Give me a schema-only dump. I will give this dump to some LLM, a good one, and ask it to find,
just based on column names, some ideas of what can have correlation. Then we build statistics
and hope it will help someday. But it's a weak solution.
I also think it's flawed, because I think if it's
not actually causing query performance issues, then you're paying that penalty
for analyze being slower with, again, no benefit, and that's a
problem.
Yeah, I didn't finish my thought. I'm okay, okay, I'm okay to pay some
penalty to analyze more, keep more statistics.
It's not a problem unless we have hundreds of thousands or millions of objects.
In this case, I will be very careful.
But if it's only a few thousand objects, I mean indexes and tables, it takes one minute
or three minutes to analyze this database.
It can be many, many, many terabytes.
One or three minutes, I don't care.
And in terms of storage, it's nothing.
And memory, it's not a lot.
Maybe I'm wrong, but I think so.
I think you're right.
And I think that's important for people to realize that it's not dependent on your data size
because we're sampling at this point.
Like, because it's a sampling thing, it doesn't scale linearly with how much data
you have.
Also, an important disclaimer: I have a mindset heavily targeted at startups, because I'm building
my fourth startup, and our clients, consulting clients and clients for products, are mostly
startups. And that means rapid development, like rapid, really rapid, sometimes a couple of deployments
per day or more.
It means that maybe the statistics we decided to keep are not needed today, but
maybe they will be needed tomorrow, because we build some new features and so on.
And maybe it's better to have them, because we know there is a correlation between these columns,
so let's have extended statistics over all these columns.
But we lack recipes, this is the point.
So the lack of recipes leads to underuse of this feature.
I think they might also be...
I think historically you've also mentioned having a bias towards OLTP systems in production. And I think there's also a chance that because often you're hyper-optimizing queries, and
that's a good thing, going for index only scans and very, very limited data retrieval,
very small number of rows, very, very precise lookups, you're probably not hitting these bad plans in
a lot of those cases, that you're skipping straight to a really good plan,
that's really hard for the planner to get wrong, because you're
giving it the ideal index to use for the important access patterns, and it's
possible that this comes up a little bit more when people are having to serve
more user
defined queries, maybe analytical ones that have any combination of parameters.
Or real queries and so on, right?
Exactly. And I think in those cases, maybe sometimes just not knowing what's going to
come, you can't create indexes for every case. You do.
And indexes slow down writes, unlike statistics. This is a super big difference.
Yes. So I think there's a chance that the bias towards OLTP, and also being able to
often add the indexes you need for a smaller number of access patterns,
or a lot of queries per second, might mean that this just isn't as
important in those cases. But
in terms of recipes, I think the big one is bad row estimates on important and slow queries.
So if a query is important and a lot slower than you need it to be, and there's a really
bad row estimate in an important part of the plan, so not in a part of the plan that is fast, but in a part of the plan that is slow. The really famous case is a nested loop where the planner
estimates it's going to return very, very few rows per loop, so maybe like zero or one.
So therefore it's thinking, well, I only need to do this a relatively small number of times,
therefore it's quickest and easiest to get started with a nested loop,
and I'll be done really quickly. But in reality it will return maybe a thousand rows or more per iteration.
Actually, in those cases a hash join or merge join would have been so much faster.
So those cases, I think, can benefit.
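A hedged sketch of how that shows up in a plan (all table, column, and value names are hypothetical):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders o
JOIN order_items i ON i.order_id = o.id
WHERE o.country = 'DE' AND o.language = 'de';
-- If a node under a Nested Loop shows an estimate like rows=1 but an actual
-- rows=1000 (per loop), the join choice was based on a bad estimate.
-- Correcting the estimate, for example with extended statistics on (country, language),
-- can let the planner pick a hash or merge join instead.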
But actually, there's a blog post I was thinking of, where a team
blogged about a time when extended statistics helped them a lot. I put the plan through
our tool and noticed that, yeah, a bad row estimate was scored as highly as it could be, but so
was an index efficiency issue. And I think, looking back, they could have added a better index and solved their performance issue just as well. So they solved it
with CREATE STATISTICS, but they could have solved it with a better index, and
probably the query plan they got in the end could still be improved further by a
better index.
So I have an idea for myself, and whoever is listening is welcome
to join. Next month, I will have a challenge for myself.
So every time I deal with some query optimization, I will put consideration of extended statistics
in one of the first places.
I will think, can it be helpful here?
And so on. Just prioritize it and think about it more.
And after one month of this, 30 days or a month, February is a shorter month, right? So,
30 days. After it, I will make my conclusion: should I keep this as one of the primary tools for query optimization
or as a secondary tool? How does that sound?
I'd love to hear that from anybody else that does try it. One thing I'd encourage doing
is also benchmarking the performance hit on analyze of that table. Like, if you do add it
to a table, how long does
analyze take before, and how long does analyze take after? And I think it will depend on which
of the few types of extended statistics you add. You can add all of them, or for
expressions... There are three parameters you can give it: ndistinct, dependencies, and mcv.
And I think mcv will be by far the most expensive
in terms of analyze performance.
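A hedged way to measure that, using psql's timing on a hypothetical table:

\timing on
ANALYZE big_table;                          -- baseline analyze time
CREATE STATISTICS big_table_ext (ndistinct, dependencies, mcv)
    ON col_a, col_b FROM big_table;
ANALYZE big_table;                          -- compare; per the discussion, mcv is likely the costly kind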
Oh, by the way, another thing that the docs mentioned
can be penalized is query planning time
and maybe query planning buffers for that matter.
If you have a lot of extended statistics created by us,
the planner can take longer?
Yeah, well it makes sense, right?
It's considering more information upfront.
Right, makes sense, yeah.
I don't know how, but I'd be interested in the benchmarks of how much longer, how much more, like...
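One hedged way to look at that (the query here is a placeholder):

EXPLAIN (SUMMARY) SELECT * FROM orders WHERE customer_id = 42;          -- plans only, prints Planning Time
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42; -- recent versions also print Planning: Buffers
-- Run it more than once: as they discuss later, the first plan in a session also pays cache warm-up costs.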
Yeah, but in this case, two things.
First is, well, both benchmarks sound to me
like single session benchmarks.
So it's not like we run a lot of sessions,
we can take one session and check just the speed,
maybe multiple times for good statistics, right?
So first thing sounds to me like checking analyze speed.
And I think we need to remember that if we talk about a single table, it's OK.
I mean, no problem.
But if we talk about many tables, it's worth remembering that it can be parallelized.
And the recommendation is to use vacuumdb with the --analyze option
and specifying the number of jobs (--jobs), the number of processes, backends basically, which will execute
this. And if we are alone on the server, we can take, for example, the number of vCPUs we
have on the virtual machine, right?
Yep.
This is one thing. But if it's a single table, it cannot be parallelized.
It will be a single thread anyway.
No.
But there's actually one more thing that I think is worth trying, just before we wrap up:
if it is a huge penalty but looks valuable to query performance, which I'm not sure you're
going to come across, but we'll see.
I've wondered about reducing the statistics target for those columns. So you
could reduce it and then add multi-column statistics. It's just an interesting idea for getting
back some of that analyze time. I know a very good SQL optimization expert who recommended
reducing default_statistics_target, and I'm going to communicate with him very soon. But this battle in my mind was settled like 10 years
ago already: reducing default_statistics_target, or the per-column parameter, was considered
not a good idea in my mind, compared to increasing it somehow. I don't know, this is
from various cases, but I'm going to use this as well as a part of my
challenge, just to think about this as well, because I will talk to him soon.
I've got an idea for you.
Let me comment on this in a second. I already started
forgetting what you said. What was the second one? First was, let's benchmark.
You said benchmark.
I don't, I'm very careful.
Analyze?
Yeah.
And also query planning time or query planning buffers.
Query planning time, this is good.
Yeah, buffers as well.
That's great, that's great.
So increasing default statistics target globally
definitely should affect this as well, right?
Globally.
Yeah. Here, if somebody is going to do it, don't make the mistake my team and I made
a few months ago when we published an article. When you just connect to Postgres and you
do something, planning time can be much bigger because of a lack of relcache, and it gets loaded, right?
The second time you run this, it's very different.
So the first planning time and the second planning time are very different.
If you have a connection pooler, and most people do,
it means that it's not a good idea to make conclusions from observations of the first run.
Yeah. Great point.
So even if extended statistics affect planning time, and I think they should,
the question is, is it only for the first run, or the second run as well?
Who knows, right? Because if it's only the first run, it means
relcache or... Well, it's not relcache here, right? Well, it's an interesting point just to check,
right? Yeah, based on the notes in the documentation, I got the impression it would be for every
plan. Because it's statistics, it's not relcache. So anyway, I would check multiple times.
One more idea for you,
especially talking to that person you mentioned,
I wonder if you could come up with a recipe
for which columns to reduce it on.
I reckon there's a whole bunch of really boring columns
like unique IDs, you know, primary key columns type thing.
Do we really need high statistics targets on those?
Or could we get away with much, much lower?
Do you see what I mean?
Like ones that we know are unique, and if we're keeping statistics on the
most common values...
Well, I agree with you, but what's the point of optimizing this area?
What are our savings?
In my mind, reducing the time analyze takes.
As I said, I'm not really bothered by the longer time.
It's three minutes against one.
So what?
That is quite a big deal, isn't it, in terms of downtime?
I don't know.
We have a recipe for zero downtime.
We can afford a couple of additional minutes being spent, and nobody notices it, because
it's a target
cluster running on logical replication, and we can do a very long analyze there.
So even one hour, nobody notices; it doesn't affect downtime at all.
So I don't know, and we have export in Postgres 18 now, so export and import of
statistics via pg_dump and restore.
Yeah, pg_dump and restore.
So I don't know, I can imagine this will
save something. We can find many places like that, right? For example, in some
cases we can start tuning autovacuum at the table level. And so many times I saw various teams go down this path.
And then we pulled them back saying, let's stick to defaults because table level settings
often cause more harm than good because they add complexity, a lot of complexity. It's hard to
maintain. Then you go microservices, you have a lot of clusters,
and these table-level settings, well, they're so hard to maintain.
And you're talking about column-level settings.
All I meant is if you can find a really simple rule.
Yeah, exactly.
If it's a unique column, reduce the statistics to one.
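A hedged sketch of that rule for a hypothetical unique ID column:

ALTER TABLE orders ALTER COLUMN id SET STATISTICS 1;     -- minimal per-column target for a unique key
ANALYZE orders;
-- ALTER TABLE orders ALTER COLUMN id SET STATISTICS -1; -- revert to the default later if needed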
This is good if it goes to hackers and into the core itself, an implementation which doesn't
require effort from the user. This I agree with. But if it's...
Cool.
Again, my main concern, my main comment about extended statistics is it requires effort.
It's not clear. There are no clear recipes for how to apply
it. So it ends up not being actively used. Unfortunately. Maybe in the next 30 days we
will have some recipes, right? Sounds good. I'm looking forward to it. Sounds good. Okay.
Thank you so much. Thanks so much, Nikolay. Take care.