Postgres FM - Partitioning
Episode date: April 7, 2023

Nikolay and Michael discuss table partitioning — what it is, why and when it's helpful, and some considerations for your partition key.

Here are links to a few things we mentioned: Partitioning docs, pg_partman, Index maintenance episode, Timescale partitioning, pg_cron, Xtreme PostgreSQL (talk by Christophe Pettus), Database Antipatterns (also by Christophe, slides 46-49), Understanding an outage (blog post by Duffel).

What did you like or not like? What should we discuss next time? Let us know on social media, or by commenting on our Google doc.

Postgres FM is brought to you by Nikolay Samokhvalov, founder of Postgres.ai, and Michael Christofides, founder of pgMustard, with special thanks to Jessie Draws for the amazing artwork.
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I'm Michael,
founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Hi Michael, let's talk about partitioning.
Absolutely. Table partitioning in Postgres. We've talked about it in a few episodes,
but it's great to do an episode completely on this. We've had requests for it and I'm
excited to get to it. Where do you want to start — perhaps with what partitioning is?
Of course. So it's about splitting a table into multiple tables,
and the beauty of it is that that's transparent at the usual client level.
Yes, exactly — because we can split it ourselves; we can say, like, we have a table name and we can do table_name_1, table_name_2, but it would not be interesting to rewrite our application.
And partitioning does the same, but behind the scene, right?
Yeah, exactly.
With lots of nice cool features and an ever improving list of features to go with that.
Did you look into the history of it at all?
I mean, obviously you've lived the history of it.
Yeah, I lived through the history. I remember I was using inheritance-based partitioning and helped clients implement it.
And this actually was my first real job in terms of Postgres consulting in Silicon Valley:
a startup which I actually helped move from Heroku to RDS.
And also, they had issues with large tables.
And obviously I said, like, let's partition it.
And it was before Postgres 10,
which implemented declarative partitioning quite long ago.
And we did it with inheritance.
A lot of PL/pgSQL code was written.
And at that time, actually, the interesting part is that this was exactly the task
where I realized that regular RDS clones are very good for testing.
So I had many, many cycles of testing to get a fully automated procedure:
move old rows from the non-partitioned table to the partitioned table in batches,
have a view on top of it to provide full transparency
for the application. Everything is possible, but it took a lot of effort, because inheritance-based
partitioning — maybe it provides you more flexibility in some places, but it lacks
automation; you need to write a lot of code. And of course, if you have clones, it's good — you can test it — but it's still a lot of work.
And by that time, I think pg_partman already existed.
I'm not sure RDS had it, but somehow I decided not to use it.
Anyway, around 2015, as we also discussed a couple of times,
a lot of former Oracle users or current Oracle users
came to Postgres community and ecosystem,
and they obviously raised the question, we cannot live without declarative partitioning anymore.
It's not good. Let's do it.
And declarative partitioning was implemented.
I think it was in Postgres 10, right?
First version appeared.
But then in each version, a lot of new features related to partitioning were added. It means that if you run quite an old Postgres and you are considering partitioning, or considering using it more often,
obviously the first thing I would do is consider a major upgrade to the latest version possible,
because it would provide many more features and benefits in this area.
Yeah.
In fact, you mentioned it right at the beginning of that,
but I think it's worth going back to. The benefit here is that large tables can get difficult to manage.
So there's certain things that become very…
Sorry for interrupting.
I just realized, since you asked about history,
inheritance-based partitioning existed for a long time.
And inheritance is a good concept.
It's from the era when object-relational database concept was popular, late 90s, early 2000s.
But declarative partitioning is much easier to use.
And now, of course, I would not use inheritance-based partitioning at all.
Except for some rare cases.
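For reference, a minimal sketch of what declarative partitioning looks like — table and column names here are invented purely for illustration:

  -- Parent table, partitioned by range on a timestamp column.
  CREATE TABLE measurements (
      id         bigint NOT NULL,
      created_at timestamptz NOT NULL,
      value      double precision
  ) PARTITION BY RANGE (created_at);

  -- Each partition is a regular table attached to the parent.
  CREATE TABLE measurements_2023_04 PARTITION OF measurements
      FOR VALUES FROM ('2023-04-01') TO ('2023-05-01');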
But let's talk about why we need it at all. What do you think?
Yeah, so, I mean, I bump into different reasons that people want to do this,
but the main one I see is maintenance.
So easier and more parallelizable maintenance, ability to, so that includes things like indexing,
re-indexing, vacuum, analyze,
probably a bunch of things I'm forgetting there.
But in addition to that, if you have data that needs to be deleted — so old data, or a certain client — you might want to partition based on how you want to delete.
Yes. So with partitions we have the ability to detach a partition and drop the entire table,
and not have to worry about the deletes and the impact of all of those.
There's also just truncating a particular partition, and that's it, because truncate is much faster than
delete plus vacuuming: delete doesn't physically remove anything, it just marks tuples as deleted,
and then it's a heavy operation if you have millions of rows.
Yeah, exactly. Though I think you still detach first, right, before truncating?
Well, nobody prevents you from truncating a particular partition. Why not?
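An illustrative sketch of both options — the partition names are hypothetical, and detach-then-drop and truncate-in-place are alternatives, not steps of one procedure:

  -- Detach the old partition, then drop it as an ordinary table...
  ALTER TABLE measurements DETACH PARTITION measurements_2022_01;
  DROP TABLE measurements_2022_01;

  -- ...or simply truncate the partition in place.
  TRUNCATE TABLE measurements_2022_01;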
It's a good overview, but I would like to emphasize that many people — not most, I don't have statistics in hand — but usually, from my perspective, people think partitioning
is needed for query performance. It's so, but not so. There are direct and indirect benefits for query performance if you
use partitioning. The direct benefit is, of course, that indexes are smaller and faster. Good, but we all
know that B-tree height grows very slowly. So jumping from 1 million rows to 1 billion rows is just a few extra hops
in terms of index tree traversal. But the indirect benefits are much more interesting, and let's
talk about them. You mentioned manageability, maintenance of partitions. Exactly. Like, if you
need to vacuum a very large table, it will take hours — sometimes half a day, sometimes almost a day,
I saw it; it very much depends on your system, on the power of your disks, and so on. It can take quite a
long time, and it cannot be parallelized in terms of vacuuming the heap and the TOAST table, right? It's a
single-threaded process. And you cannot speed it up, even if you adjust the cost limit and cost delay for autovacuum workers.
So they go full speed, you don't care about negative effects.
It's still slow because it's single-threaded.
If you have a partition table, if you have, for example, 10 partitions, 100 partitions, you can benefit from many, many workers to do much more work faster, right?
Then it's a question about your disk I.O. limits, saturation risks, and so on.
But you have flexibility.
You can tune it already.
So for indexes, it's better.
But as I remember, autovacuum, even for one table, processes all indexes using a single worker.
Only manual vacuum can process multiple indexes —
in newer Postgres versions, that can be parallelized.
So partitioning is much better for vacuuming.
It's straightforward because of parallelization.
And due to the defaults, we usually have only three workers. It's so sad to see three workers on
systems with 96 cores — I see it all the time. Somehow people think autovacuum defaults are good. If you
have such a big system, so many vCPUs or cores, consider adjusting the number of workers earlier,
because it requires a restart, right? Especially if you're thinking about partitioning — these things come together. If you are planning to use partitioning, you
definitely need to reconsider your autovacuum tuning settings and have more workers. I usually
recommend setting the max number of autovacuum workers to up to 30 percent of vCPUs,
and also increasing other settings.
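A sketch of the kind of tuning described here — the values are illustrative only, not a recommendation for any particular system:

  -- e.g. on a many-core machine, far more than the default of 3 workers
  ALTER SYSTEM SET autovacuum_max_workers = 16;           -- changing this requires a restart
  ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;   -- let workers move faster
  ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
  SELECT pg_reload_conf();  -- cost settings apply on reload; max_workers only after restart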
So this is the number one thing. But close to it, you also mentioned index maintenance. Again,
building an index on a very large table might take an hour or more, sometimes many hours —
eight hours, I saw it too. It's not fun at all. And unlike regular vacuum,
an index build or re-indexing — concurrently or not concurrently,
it doesn't matter —
holds the xmin horizon.
We discussed it a couple of times.
There was an attempt to fix it in Postgres 14.0.
It was reverted in 14.4.
All versions 14.0 through 14.3 are considered dangerous because they can lead
to index corruption, so if you're a Postgres 14 user, you should be at least on Postgres 14.4.
It was a great optimization — not holding the xmin horizon when you build or rebuild an index — but
unfortunately it was reverted. I don't know the plans for when it will be back.
But what does it mean, holding Xmin Horizon?
It means that while you are building or rebuilding your index,
there is Xmin Horizon.
It means that there is some number in terms of transaction ID —
an XID, right? —
corresponding to the moment when you started to build the index.
And all tuples which became dead after it cannot be deleted by auto vacuum or regular
vacuum because xmin horizon is in the past.
So while you are building index on very large table, vacuuming is less efficient and it
cannot delete many tuples.
If you enabled autovacuum logging,
you can see that some dead tuples are found but cannot be removed.
This is exactly because Xmin Horizon is in the past for them.
And this is a problem leading to bloat, right?
Higher bloat than we could have.
Higher bloat that can lead —
You mentioned second-order effects or indirect implications.
So we have bloat causing slower query performance.
But also vacuum getting behind can lead to things like the visibility map not getting updated.
And then that can lead to plans no longer being able to do true index-only scans.
So that can impact query performance as well.
So there are these other effects as well. So being able to
parallelize vacuum and have more workers doing vacuum, one per partition,
can be a benefit from that perspective too. Right. So one rule some people follow:
don't allow tables to grow above 100 gigabytes. If you reach 100 gigabytes, you need partitioning already.
So, right, if we have smaller tables, our autovacuum workers are feeling much better,
removing garbage from those tables much faster. Garbage collection works more efficiently.
And this affects query performance definitely. It's very hard and tricky to measure.
Like sometimes people say, well, okay, how bad is that?
Well, we can take some experiments
and show degradation of some query performance,
but it very much depends on the nature of the data and workload.
But I already feel it would be possible
to develop some metric.
For example, understanding the XID
growth rate — like, your writing transactions, how many per second you have, your XID growth
rate is such-and-such — and you could have some threshold, an SLO probably,
saying we don't want to allow our xmin horizon to be this number in terms of XIDs, or this number of minutes or seconds, behind the current value.
So we should not allow the xmin horizon to lag that much.
From this, you can go to the requirements of how big your partitions should be.
Because there is some math here.
It's possible to develop for a particular system.
It's much harder to develop some methodology
for arbitrary system.
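One rough way to watch that lag, sketched here for illustration — in practice you would also look at replication slots and prepared transactions, which can hold the horizon back too:

  -- Sessions holding back the xmin horizon, oldest first (age is measured in transaction IDs).
  SELECT pid, state, now() - xact_start AS xact_duration,
         age(backend_xmin) AS xmin_age_xids
  FROM pg_stat_activity
  WHERE backend_xmin IS NOT NULL
  ORDER BY age(backend_xmin) DESC;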
Yeah, good point.
In fact, that probably leads us nicely.
That's why 100 gigabytes.
Just 100 gigabytes.
Good rule, right?
But we can do better in terms of explaining
why this number.
Yeah, it's a nice memorable number for sure.
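A quick, rough way to see which tables are approaching that threshold — a simple sketch, limited to regular (non-partitioned) tables:

  -- Largest regular tables by total size (including indexes and TOAST).
  SELECT c.relname, pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
  FROM pg_class c
  WHERE c.relkind = 'r'
  ORDER BY pg_total_relation_size(c.oid) DESC
  LIMIT 20;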
While we're on the topic of when, I guess there's also the question of how.
And I think...
One second, one second.
We didn't finish with why.
Okay, continue.
Very big reason also is data locality.
Yeah, okay, yeah.
It's a huge reason, especially for data
like time series data or time-decay data,
where, like:
these are new rows,
these are fresh
and accessed more frequently,
and these rows over here are quite old —
maybe sometime soon
they will not be needed at all,
and we will consider pruning them.
It can be regular time series,
like measurements of temperature or something like metrics,
or it can also be something like social media comments
and posts and so on.
Fresh data tends to have a usage pattern
of being more frequently accessed.
And we would probably like to have fresh records
in our shared buffers
and operating system cache and so on.
And old records — only a few of them are accessed —
so we could probably allow a little bit higher latency for old data.
It depends, of course.
But if you think about a very large table,
we have everything mixed.
Any given page can contain new and old records.
It's kind of like a loose cluster.
We don't have index-organized tables in Postgres, but it's almost like — it's not enforced clustering
in the strict sense, but it's kind of loose clustering. If you think about the time series
example, things aren't guaranteed to be in order within a partition, but each partition will be
kept separate.
Yeah, yeah. So yeah,
actually, if you, for example, consider indexes: you have an index on a table, on
something like a date, a creation date, and you think, okay, it would be good to partition by
days or months or weeks, and all your queries filter on it somehow.
Then, instead of partitioning the table, you could just split the index into 10 or 100 or 1,000 partial indexes.
Like partitioning at the index level.
And it would work well if the only reason for partitioning was query performance — the direct benefit from smaller index size.
But as we just discussed, there are bigger reasons
than just the smaller index size.
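A sketch of what that "partitioning at the index level" could look like, with hypothetical names — as discussed, it only addresses index size, not the maintenance and locality benefits:

  -- One partial index per month instead of partitioning the table itself.
  CREATE INDEX events_2023_03_created_at_idx
      ON events (created_at)
      WHERE created_at >= '2023-03-01' AND created_at < '2023-04-01';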
Yeah, so you need to split tables, definitely.
Maybe one more reason — data locality is very important,
because consider, like, you have, I don't know,
a lot of heap-only tuple (HOT) updates inside pages and so on.
But it's not good to mix fresh and old rows in one page all the time,
like the mixing in a huge table,
because you need to allocate more memory to hold it all.
For example, okay, I want all my fresh data to be in the buffer pool,
for example, closer to application.
In this case,
you need all pages of this table. But if you start splitting by date somehow, like, okay,
fresh partitions have fresh data, old partitions have old data. For fresh rows, you have much fewer
pages. I'm talking about cache efficiency here, right? And in this case,
also, you mentioned all visible and all frozen, these bits for each page. It's a visibility map.
It's very good to have old data separated into old partitions, because updates or deletes happen very infrequently there,
and inserts also.
And so autovacuum can skip many more pages, keeping them all-frozen, all-visible all the time, right?
It's also about vacuum efficiency, but a different vacuum,
not deleting the tuples, but keeping this visibility map up to date.
And this is super beneficial for performance, query performance,
because with the visibility map up to date, we see heap fetches in the plan less often for index-only scans.
Heap fetches mean vacuum is lagging: the index-only scan worked, but heap fetches
mean we need to consult the heap because a page is not marked all-visible, so we need to check
it. But if it's all partitioned, most of its pages are all-visible or all-frozen — great, so the index-only
scan is better. How to measure it is also an interesting question. For a particular system,
again, it's not trivial but doable; but for an arbitrary system, it's hard to find some metric for when you need partitioning.
So: you need partitioning at 100 gigabytes.
At 100 gigabytes, go with partitioning and more workers for autovacuum.
I've been thinking about it for many years, and these reasons are not well discussed, or discussed in a very dry manner.
My main reason is indexing, and the second reason is data locality.
Only then a smaller index size, and so on.
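Circling back to the heap fetches point: the easiest place to see it is in a plan — a rough sketch, with a hypothetical table name and the output omitted:

  -- Look at "Heap Fetches" in the Index Only Scan node: high numbers mean many pages were
  -- not marked all-visible, so the scan had to consult the heap anyway.
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT count(*) FROM events WHERE created_at >= '2023-03-01';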
Nice, I agree.
But I also echo what you said,
and I see it brought up most commonly
when it comes to query performance,
and I think that's one of the least good reasons for doing it.
But yeah, there's lots of good reasons,
and I'm glad they have net
positives there as well. There are also, of course, some downsides in terms of query performance — like,
we see, for example, that the planner has to do some partition pruning, right? And, yeah, with excessive
numbers of partitions — well, it's getting better in later versions of Postgres for sure, but at
the thousands-of-partitions level you can start to see some issues; planning
time increases a lot. Let me — again, sorry — echo is a good word, I have an echo in my mind: I need
to finalize my thought about why I put indexing and re-indexing in first place
among the reasons for partitioning. In heavily loaded systems, we need index maintenance always.
Like it should be planned,
properly planned.
Without index maintenance,
you will have degradation over time,
100%.
It's not only about
how Postgres MVCC is organized.
It's for other systems,
it's also so.
Indexes need to be rebuilt quite often.
Well, since Postgres 14,
less often, due to
a lot of
B-tree deduplication in 13 and
14. But still, index
maintenance is a must-have, like
day 2 DBA operation.
And this means if you need
to rebuild indexes and you have huge tables,
you will have bloat.
Just because you are
doing this quite slow,
holding XminHorizon. So index maintenance, we had an episode just because you are doing this quite slow, holding a spin horizon.
So index maintenance, we had an episode about it, right?
Means also like do index maintenance,
but also have fast auto-vacuum processing.
It can be done with partitioning.
These like index maintenance, vacuum and partitioning, these three components should play together.
This is why I put it in first place. I think
it's an underestimated reason, not
well discussed in
the Postgres ecosystem, in blogs and so on.
I think something we probably have to
get to is how
people might be wondering how to pick a partition
key. And I think that comes, like, that's
an... Well, it's interesting, right? Like, in terms
of... Have you seen it done badly?
Or, like, what's your... what do you generally see people doing?
There are some pains in Postgres partitioning.
One of the pains is that the primary key and all unique keys — unique indexes — must include the partitioning key.
I remember, I already mentioned this time when I implemented partitioning using inheritance on some RDS system, also migrated from Heroku.
I remember the moment when I said,
oh, we need to index this.
We had some partition key, I even don't remember,
was it like a range or list or hash?
I think not hash, but probably range, I don't remember.
And I said, we need to index it as well.
And if we index it exactly like described in partitions,
like with this expression, for example, date, right, from timestamp.
And I was told, are you kidding?
It's already partitioned by this.
We don't need it, right?
Postgres planner already knows how to find proper partition
and indexing on this value, it's the same.
So partitioning works like some index as well.
Like to locate proper partition in the plan,
we just use these conditions and that's it.
So you don't need to have index on it.
But at the same time,
primary key and unique keys must have it.
And this is annoying.
For example, if you have a surrogate key like id — some integer, or UUID,
ULID, anything, version five, six, I don't know — and now you consider your huge table to
be partitioned by range using a timestamp, for example created_at. Right, now you need to redefine your primary key to have both the id column
and the created_at column. And this is super annoying, because now you think: oh, what about foreign keys
which reference this primary key? They also need to be rebuilt when you redefine the primary key.
It's also not an easy job — you need to think about whether you need to do it with zero downtime.
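What the redefined key ends up looking like — a sketch with hypothetical table and column names:

  -- The primary key (and any unique key) has to include the partition key,
  -- so a surrogate id alone is no longer enough.
  CREATE TABLE orders (
      id          bigint      NOT NULL,   -- e.g. filled from a sequence
      created_at  timestamptz NOT NULL,
      customer_id bigint      NOT NULL,
      PRIMARY KEY (id, created_at)        -- must include the partition key
  ) PARTITION BY RANGE (created_at);
  -- Foreign keys pointing here now have to reference (id, created_at) as well.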
Also, funny thing,
I had recent experience
with partitioning
and I realized
if you have multi-terabyte table
and your management tells you
that you can allocate
up to one hour of downtime,
it's interesting situation
because you tell them back,
I need either 12 hours
of downtime or zero.
If we have 12 hours downtime,
I will design my changes in blocking manner,
not online manner.
So alter table, create index,
because create index is like roughly two times faster
than creating index concurrently.
But if you tell me I have only one hour,
it's nothing for me.
I will go with online operations and I just need more time to develop and test it so this is funny thing like you cannot
be in the middle between huge chunk of downtime and zero one hour is not good okay zero downtime
means we need to do some preparations and one of preparations is to redefine primary key and unique key. All unique keys. You might have multiple ones.
To include partitioning key.
How to find...
You asked actually a different thing.
How to find proper partitioning key.
It's tricky, very tricky question.
And also granularity.
How come we discuss time-based partitioning
and didn't mention timescale?
Yeah.
Because if...
Well, we haven't discussed how you create partitioning and didn't mention timescale. Yeah. Because if... Well, we haven't discussed how you create partitioning.
I think one of the main benefits of the way timescale do things is...
Automation.
Exactly.
Everything's done for you.
So we've already talked about partitioning being transparent to the user,
but timescale will almost make it transparent to you as the developer as well.
It's so good.
Easier to use.
Yeah, exactly.
And also compressing old partitions
with like 50x compression.
It's super cool.
This was what regular Postgres lacks.
But if you're on RDS,
you cannot use Timescale
because AWS and the Timescale company
don't have a contract for it.
Well, but pg_partman is like —
there are... Okay, but sorry,
it doesn't provide compression, and it doesn't automate things in many cases. For example —
I'm not sure about RDS, but on Google Cloud SQL you can read that, okay, we have pg_partman, but
we don't have the scheduler, this background worker or something. So automation should be
handled separately — like, you need to use pg_cron
or some other cron jobs or something.
So it's not full automation.
It's just some functions.
And I can write myself these functions with my logic.
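A sketch of how these pieces are commonly wired together when both extensions are available — the schedule and schema/function names follow the usual pg_partman/pg_cron pattern, but check your provider's docs:

  -- Run pg_partman's maintenance (create future partitions, detach/drop old ones)
  -- every hour via pg_cron.
  SELECT cron.schedule('partman-maintenance', '0 * * * *',
                       $$SELECT partman.run_maintenance()$$);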
But this is a big thing
that can go wrong with partitioning, right?
Like when I've seen blog posts
that mention big issues
and they mention partitioning,
it tends to be that a new
partition failed to be created. And that naturally becomes a downtime issue: let's say it's time-based,
and you get to the new month or whatever you're
partitioning by, and that partition doesn't exist — you can't accept inserts all of a sudden.
It's better to create partitions a little bit in advance, right, and to have some
code that checks everything, and some monitoring in place. So definitely — declarative
partitioning hides a lot of details from the user and automates many things, but still a lot of work
needs to be done. For example, exactly this: partition creation, or detaching and destroying —
you need to do it yourself, or use Timescale if you can afford it, if you have such a preference.
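Creating the next period's partition ahead of time is just another DDL statement, whether you script it yourself or let tooling do it — hypothetical names again:

  -- Created well before May begins, so inserts never hit a missing partition.
  CREATE TABLE measurements_2023_05 PARTITION OF measurements
      FOR VALUES FROM ('2023-05-01') TO ('2023-06-01');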
I was going to raise something else. I was speaking to somebody recently who's considering
partitioning — they're getting to the point where they think it could be useful — and I thought
of a different way of thinking about your partition key: it might be how you would like to delete data eventually. The time-based
stuff's really good, because a lot of people will want to delete old data eventually, but that might
not be the way you want to delete data in the future. For example, for this person, customer-based
might make sense, because when a customer cancels, they want to delete all of that data.
And by that point, it might make sense to think about it by a customer instead of by time.
So it's really interesting in terms of locality as well.
If you're running a SaaS app, chances are you're going to have, even if you're doing larger queries, they're probably
one account at a time. Nobody's looking at data across two accounts. It's all within a single
account. So depending on how you want to access your data, but I think crucially, depending on
how you might want to delete the data might be a really interesting way of thinking about that
partition key. Yeah. Also I know cases, but it was from sharding experience, but it's very close.
Partitioning can be considered like a local sharding to some extent. I know the cases when
people chose the sharding/partitioning key and implemented the mechanism in such a way that —
it was a huge system, first of all, a very big one —
they chose a mechanism to move accounts which are not actively used to more archive-like partitions, and to let them be evicted from the caches and so on.
But those which were quite active were located in hot partitions and present in the working set of the database. It's also possible — it's the same as you described, but not
at the individual customer level, rather with subsets of customers
and accounts: all the data associated with an account is moved to some inactive partition. Once it
becomes active again, it transparently, automatically moves to a different partition.
It's quite interesting approach.
And if it's done in non-blocking manner in background, it's interesting.
Like, I'm not fully understanding those guys because they live at much bigger scale.
It was the Yandex mail system. But it was about sharding. I remember
they lived like a year or so with Oracle plus Postgres, both sharded. I was asking,
are you already migrated to Postgres? And they said, no, we use both. Let me check — what's your
email? We will check. And
they told me: oh, you're still in Oracle, let me press a button. They pressed a button, and in one minute I was
already in Postgres. If something's wrong, it's tested, we can move back. And I love this approach because —
for example, if you think about partitioning — you can implement it, it requires some time, but
it's so cool if you can migrate transparently, in batches, and move
back and forth, and have real testing: new schema, old schema. You moved 10% of your data, you
continue testing it, and if something is wrong, you know that within one hour you can move back
and everything is on the old schema. This is a very solid, mature approach, right? But it requires engineering, of course.
And yeah.
Thinking about the IO overhead of moving people backwards and forwards and backwards again.
But of course, for the one direction, it makes a lot of sense.
But imagine it's like partitions have many clients, but you can move one single client.
It's interesting.
You can do it in batch.
It's flexible.
Something interesting was
engineered there, and they tested
it during many months,
checking all parameters,
a lot of nodes involved. But in
partitioning it's also possible.
When I implement partitioning, I
consider this like transparent move in
batches, of course. Everything should
be online. But back to the key,
it's a hard question, things to
consider, first of all, data locality, definitely. Sometimes people, even understanding
that they have time-decay, time series data, still choose list-based partitioning.
They create, for example, table partitions with IDs and list-based partitioning to have
better management of what is happening with partitions and so on.
Even though they could do it
with like range partitioning using time.
So it depends.
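For contrast, the list-based flavour mentioned here — a minimal sketch with hypothetical names, e.g. partitioning by account rather than by time:

  CREATE TABLE account_events (
      account_id bigint NOT NULL,
      created_at timestamptz NOT NULL,
      payload    jsonb
  ) PARTITION BY LIST (account_id);

  CREATE TABLE account_events_1 PARTITION OF account_events FOR VALUES IN (1);
  CREATE TABLE account_events_2 PARTITION OF account_events FOR VALUES IN (2);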
Does that help in terms of balancing,
like keeping the partitions roughly the same?
So like, let's say you're a growing startup.
Each month you have more data.
Yeah.
So you can, if you do-
But also like number one question
should be your workload.
You need to analyze your workload and understand, like,
if some query doesn't have partitioning key in where clause, for example,
it means that probably it will need to scan all partitions.
In some cases, it's fine, but you should test and understand it, right?
Well, if it's some slow query anyway,
if it benefits from some indexes, from an index-only scan,
and we can accept that it runs like 10 seconds, maybe it's fine — but what will happen in a couple of
years? So anyway, with partitioning you most often need to adjust the application. We say it's
transparent, but to benefit from it you probably need to adjust many queries so they
can definitely benefit from it. For example, if you have a partitioning key, it should be present in
most queries, right, and you should check the plans and so on. Yeah — if not all queries, but yeah, that's cool.
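The easy way to verify this is to look at the plan: if the partition key is in the WHERE clause, only the matching partitions should appear; if it is missing, every partition gets scanned. A rough sketch, names hypothetical:

  -- With the partition key in the filter, the planner can prune partitions.
  EXPLAIN SELECT * FROM measurements
  WHERE created_at >= '2023-04-01' AND created_at < '2023-04-08';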
And then there's granularity, right? So day partitions or week partitions, month partitions. If you check
Timescale, in many cases we have small chunks — we call them chunks — and many of them. But
of course, it depends on the Postgres version. You need to check what will happen if you have a
thousand partitions. If you go with, for example, day partitioning, in three years you will have
roughly a thousand partitions, right? So what will
happen with planning time? But if you prune the old ones, maybe you will not reach a thousand
partitions, right? But in some cases, on modern Postgres — I'm actually not sure, all production systems
are lagging behind, so I don't have a good example of a large system running on Postgres 15 that I can
touch, with partitioning. So I just learned from the
release notes that many improvements happened, but haven't tested many of them yet. But planning time
is maybe a concern, definitely. I've seen slides and a couple of talks recently from, I think,
Christophe Pettus about running Postgres at scale. And I think the numbers he mentioned were in the region of — in the past, it was around
a thousand partitions that would really cause you issues.
But nowadays, 25,000 seems to be like a threshold that has been mentioned.
But yeah, again, I don't have direct experience with that.
We have a trade-off here.
We want to be below the magic number of 100 gigabytes for each partition, right?
At the same time, we don't want the number of partitions to be so high that it affects planning time badly.
So we need to test our system and decide where our golden middle ground is.
And this is where the next step is, right?
You talked in the past about 100 gigabytes partition.
And how big before you should be thinking about sharding?
Terabyte?
Yeah.
Okay.
Well, we can live with a 10-terabyte table partitioned into 100 — well, probably 1,000 — partitions
and run it on one node, on a quite modern machine.
That should be very, very good.
It's doable, definitely.
10 terabytes is not a problem for modern Postgres and modern hardware.
But if you have 50 terabytes,
you should split in one way or another,
but that's another topic,
not partitioning anymore.
Well, we didn't touch many, many
topics,
but I hope we delivered some interesting stuff today.
So my number one reason is indexing —
index maintenance — and vacuuming as well,
and data locality is the second one.
But number one advice is, as usual,
experiment on clones, check the plans,
experiment, experiment, experiment.
Wonderful. Thank you, Nikolai.
Thank you, everybody.
And see you next week.
Thank you, Michael. Bye-bye.
Bye.