Postgres FM - pg_squeeze
Episode Date: January 10, 2025

Michael and Nikolay are joined by Antonín Houska to discuss pg_squeeze — what it is, how it started, some of its features, and hopes of getting the functionality into core.

Here are some links to things they mentioned:

Antonín Houska https://postgres.fm/people/antonin-houska
pg_squeeze https://github.com/cybertec-postgresql/pg_squeeze
Cybertec https://www.cybertec-postgresql.com
pg_repack https://github.com/reorg/pg_repack
Introducing pg_squeeze (by Kaarel Moppel) https://www.cybertec-postgresql.com/en/introducing-pg_squeeze-a-postgresql-extension-to-auto-rebuild-bloated-tables
pg_squeeze is included in the PostgreSQL APT packages https://wiki.postgresql.org/wiki/Apt
pg_squeeze is included in the PostgreSQL Yum packages https://yum.postgresql.org
pg_rewrite https://github.com/cybertec-postgresql/pg_rewrite
VACUUM FULL / CLUSTER CONCURRENTLY patch https://commitfest.postgresql.org/51/5117

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and as usual, I'm joined by Nikolay, founder of Postgres.AI.
Hey, Nikolai.
Hi, Michael.
And today we have a special guest, Tony, from Cybertec, a Postgres developer there,
who is responsible for, among other things, the pg_squeeze extension.
So, welcome, Tony.
Hello, thanks for the invitation.
It is our pleasure. So today we are going to be
talking all things pg_squeeze. Perhaps Tony could give us an overview of what it is.
So pg_squeeze is a Postgres extension that does similar things to the CLUSTER command or
VACUUM with the FULL option, except that it does not require
exclusive locking of the table. Well, in fact, it needs the exclusive lock only for a very short
time at the end of the processing. But it's actually a tool that lets users get rid of
table bloat.
Yeah, it's hugely powerful, isn't it?
So online bloat reduction of tables.
Yes, basically that.
Yeah, big, big problem.
And it's a really powerful tool.
And I think a lot of our listeners will be familiar with pg_repack,
which I think originated from a...
Was it...
Nikolay, you can probably tell me what it was originally called,
a reorg or something like that?
Yeah, something like that. It already doesn't matter, it's an ancient name.
Yeah, but pg_squeeze, you also have a bunch of other features, and it's done in a more modern way.
Maybe you could tell us a little bit about how it's implemented and how long you've been working on it?
Yes, the motivation was actually... well, it was an assignment.
It was a task that I got from my employer.
And the idea was to adjust pg_repack, the existing pg_repack extension,
so it can be scheduled, so it can run according to schedule automatically.
But when I investigated the pg_repack code,
I realized that it's implemented in two parts. Part of the logic is on the server and the other part is on the client.
And I realized that the client part is not really simple. It does some tricky things
with connections. It even terminates connections at some point. So I realized that if it should be executed
from a background worker, it would be quite strange because
the background worker would have to use the client library
to connect to server. So the server would have to connect
to itself over the libpq library.
So the fact that the client is not trivial
made it almost impossible to implement the functionality in the background worker.
So that was one reason I thought it's better to start from scratch. And once I was considering
doing everything from scratch, I got opportunities to do some things differently.
Yeah, just instead of replacing files, you decided to use logical replication, right?
Because by that time, it already became quite mature.
Unlike when pg_reorg and pg_repack were created originally, when there was no logical
replication at all yet.
Yes.
Right?
Is this right?
Yes, pg_repack uses triggers.
Well, maybe I should explain shortly how both extensions work.
The common part is that they copy the useful data from
the table into a new table and then swap the files. Each table's
contents are stored in a file, so the extensions
copy the useful data and then switch the links to the files.
Right. Also accumulating changes in some delta
table and applying them,
this can be
a headache itself.
This is where pg_repack
uses triggers
and around the time
I was considering writing
such an extension
from scratch, there was
a new feature in Postgres
and it was logical decoding. I think the
logical replication was not yet implemented at the moment but I thought it could be useful for this
purpose so I spent quite some time studying how it works and then I decided to use the logical
decoding. I considered it more, I will say, compact.
I didn't like the idea of constructing...
well, pg_repack somehow has to create the triggers,
so it plays with the DDL statements.
It has to construct the DDL statements
to create the temporary table as well as the triggers and I thought this is quite complex and
error prone
So that's why I prefer the logical decoding
I think logical decoding should be lower impact as well in terms of overhead
So it feels like a lighter weight way of doing it as well, to me. Is that fair?
Yes, I think the fact that it works at a lower level in the stack means it should probably be even
more efficient.
Yeah, because in the case of pg_repack, during this massive copy (I think it uses CREATE TABLE AS SELECT),
any write to this table also goes to the delta table using triggers, so basically
we write it twice and it goes twice to WAL. While if we take logical decoding, we just extract it
from WAL and it's written once, right? This is already a big difference.
Well, I think with pg_squeeze, the insertions into the new table are actually
logged too. I think there's no way to... Well, you could avoid the logging if the table was
unlogged, but then it's hard to switch. I didn't go that far in this direction, but...
I mean, in our case, in pg_squeeze, we just accumulate changes in the slot.
And we don't write twice to two tables, right?
This is the big difference in terms of how this delta is accumulated.
Yes, I see.
Yes, that's right.
So this is obviously a difference.
And I didn't realize that the original idea was to be able to schedule it.
Scheduling pg_repack means you do it on the client using, I don't know, regular cron or something.
In this case, you brought all the interface to SQL, and we can just schedule it using pg_cron or something, right? That was the idea.
Or you can schedule it right within pg_squeeze.
I didn't know that.
Okay, Michael is shaking his head.
I understand.
Okay.
No, I'm saying you can.
I'm saying you can schedule it within pg_squeeze. It uses the same syntax as crontab.
Is that right?
Yeah.
I didn't know this. I've only used it a couple of times,
manually, somewhere, very long ago. So yeah, that's great. And in general, being able to run
everything at the SQL level, not the CLI level but the SQL level, is good. It gives freedom sometimes,
right? You can do it in the application or anywhere, psql or something.
Yes, and it's good for development and troubleshooting.
Yeah, yeah, yeah.
That's great.
That's great.
Good.
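For reference, a rough sketch of the SQL-level interface being described, based on the pg_squeeze README. The exact function signatures and the shape of the schedule composite type vary between versions, and the table name and schedule values here are only illustrative:

```sql
-- Illustrative only: table name, schedule values, and signatures are examples.
CREATE EXTENSION pg_squeeze;

-- One-off, manual squeeze of a single table. In some versions the function
-- also takes a clustering index and target tablespaces as extra arguments.
SELECT squeeze.squeeze_table('public', 'pgbench_accounts', null, null, null);

-- Register the table for the scheduler: per the README the schedule composite
-- is cron-like (minutes, hours, days of month, months, days of week),
-- e.g. "at 02:30 every day" below.
INSERT INTO squeeze.tables (tabschema, tabname, schedule)
VALUES ('public', 'pgbench_accounts', ('{30}', '{2}', NULL, NULL, NULL));

-- Start the background worker that performs the scheduled checks.
SELECT squeeze.start_worker();
```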
Tony, I was reading,
I think you have a super fan,
a previous Cybertec employee, Kaarel.
He wrote a really good introduction post.
I couldn't believe it was 2016.
I mean, also it had lots of these features at the beginning. How has it developed over time
and how much have you had to keep working on it? Well, I think there was not much functionality
added since the initial version.
I did some optimizations to avoid the restriction it puts on vacuum, because one problem is that
the xmin horizon, you probably know what it means,
must be prevented from advancing
for some time while the data is being copied.
During that time the horizon must stay constant,
it must not advance. And the other problem is that...
Sorry, once again: what exactly did you do?
If we create a slot, it already
freezes our state in terms of the horizon until we
start consuming changes from the slot, right?
Yes, and I think the initial version did block this horizon for the entire processing,
but at some point I realized that it's only necessary for the initial load,
for the copying of the data.
Then it can be allowed to advance. Yeah, when it's done.
After the load, after the copying is done, you only need to process the changes that were applied
concurrently, so you no longer need to block the xmin horizon for this stage.
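For context, the horizon being discussed here is visible from SQL: each replication slot advertises the xmin / catalog_xmin it is holding back, which is exactly what keeps vacuum from removing tuples the copy may still need:

```sql
-- Which horizons are replication slots holding back?
SELECT slot_name, slot_type, active, xmin, catalog_xmin, restart_lsn
FROM pg_replication_slots;

-- Compare with per-backend horizons; long transactions show up here too.
SELECT pid, state, backend_xmin, age(backend_xmin) AS xmin_age, query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;
```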
This is a huge problem for huge tables, right?
If we have a one terabyte table,
it means that this horizon is blocked for long
and it affects the whole cluster
because autovacuum cannot delete freshly dead tuples,
as we discussed many times on this podcast.
In this area, have you considered the idea of speeding up initialization?
If, for example, we have regular integer 8 or, I don't know,
like some numeric primary key, we could split the table into some batches
or areas, and then we could use multiple insert processes,
which would work in parallel, speeding up the initial load.
Did you consider this? Or is this, like... maybe?
No, not really. I didn't have this idea. Do you mean processing the table in multiple iterations?
No, I didn't get that far. So you say that with pg_squeeze the xmin horizon is a problem
with the processing, or in general?
In general when working with logical replication, decoding, it's a problem. Initialization is
a huge problem. Sometimes we're dealing with huge databases. We avoid initialization completely.
For example, if we want all tables,
sometimes we have a couple of recipes
how to convert physical replica to logical replica
just because initialization is a headache.
It can take sometimes days.
And during this time, the xmin horizon is not progressing
and auto-vacuum is blocked.
So I know some tools.
I don't remember.
Maybe PureDB was doing this,
basically virtual partitioning of the table.
Not real partitioning, but just kind of partitioning,
splitting it to pieces and then loading.
If we have very fast disks and a lot of CPUs,
a lot of cores, we can speed this process up.
And this means that the xmin horizon is blocked for a shorter period.
I think, yeah, we could consider in this product,
we could consider this, I think, this approach.
And pg_repack doesn't do it.
It just does CREATE TABLE AS SELECT.
So it's a single process
and it can be slow. While here
we have an opportunity to speed it up, I think.
This sounds like one of the cons we may
discuss later, but from this perspective
pg_squeeze behaves like logical decoding.
Yeah, and if we consider this approach in the future, if we think about how to improve this,
it's worth remembering that we need to synchronize all these sessions which perform inserts of
individual partitions, virtual partitions. We need to synchronize
them using a single snapshot, right, so they deal with the same data. How is it called, I don't
remember: when you start a transaction you can specify a snapshot, and there is a function
which exports the snapshot name, so we can... pg_dump does this as well.
Yes, pg_squeeze also uses the kind of snapshot that is called a historic snapshot.
It takes a snapshot of the database as it was at some point in time. But I'm not sure...
well, for the snapshot to work, you need to make sure that the xmin horizon does not advance, because
otherwise you must
not lose data: you should copy
even the deleted tuples
that are still visible to the snapshot,
because
without that you would not be able to
apply updates. If a tuple gets updated,
you need to copy even the old version.
Right, but if the slot
is created already,
it should guarantee that the snapshot is not lost, right?
I think so.
Yes, the slot does guarantee it,
but I'm just explaining why we cannot advance it more than what is correct.
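The building block Nikolay is describing is snapshot export and import, the same mechanism parallel pg_dump relies on: one session exports its snapshot, the other sessions adopt it, and every chunk is then copied from the exact same point in time. A minimal sketch (the table and key ranges are made up):

```sql
-- Session 1: open a repeatable-read transaction and export its snapshot.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();  -- returns a snapshot id, e.g. '00000004-0000006E-1'
-- keep this transaction open while the workers run

-- Sessions 2..N: adopt that snapshot, then copy one key range each.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000004-0000006E-1';  -- value from session 1
INSERT INTO new_table
SELECT * FROM old_table
WHERE id >= 0 AND id < 10000000;  -- this worker's slice of the primary key
COMMIT;
```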
Also, Nikolai, I don't think this is as big a problem
because I think you're talking about the slow initialization of a huge database, like the entire database, whereas...
Just one table.
Well, yeah, but just one table.
And we can be talking about partitions here, though, right?
Like, you're talking about a terabyte table. If we could get it down into 100-gigabyte partitions,
pg_squeeze can work on a table-by-table basis
and only block it partition by partition.
Well, right, but if it's a non-partitioned table,
this can be a problem, right?
But on the other side, if we're dealing with huge bloat,
usually a terabyte table means a 100-gigabyte table, right?
Because while we're copying it, it already
squeezes, right? That's what's happening. Yeah, maybe this explains why this is
not a big concern. And you're right, people should partition huge tables
anyway. Yeah. Anything you've had to do in particular around partitioning, Tony? Or I guess it just works like any normal tables?
No, I think it's just normal.
I only check at the beginning; the kind of table must be checked.
I don't think there's any special treatment.
Yeah, unless it's TimescaleDB.
I remember there's an issue with logical...
There are hypertables, which are kind of like partitioned tables.
Yeah, they couldn't change.
Right, and with regular partitioning, logical decoding works at the partition level, that's it.
Well, not "that's it", but that's the basics. And I remember there's some
issue with TimescaleDB, so I wonder if TimescaleDB is compatible with pg_squeeze, maybe there are
issues there. Actually, do you know, Tony?
I don't know, sorry.
No, it's okay. I just recently learned
there are issues with logical. We had a database with TimescaleDB.
We needed to move it using logical
and also remove bloat on the fly, right?
But it didn't work easily.
So, yeah.
Good to know.
I have another question, in general.
So this is an extension, right?
This is an extension like pg_repack. Why an extension?
It could be not an extension, right? It could be just some bunch of... I don't know. Is there anything
that requires this project to be exposed as an extension? Because if it's not an extension,
you can use it anywhere, even... like, there are managed Postgres systems which don't even have pg_repack.
They don't have it. And people suffer from bloat. I mean, not people, databases
suffer from bloat there, and I see it. And if they don't have pg_repack
and they don't have pg_squeeze, we cannot use this approach, or we need
to implement something similar ourselves using logical decoding, right?
Well, I consider an extension just a means of packaging.
What is important is that this code is a shared library,
because it's C code running on the server.
So it must be a shared library,
and the easiest way to get a shared library to the user is to make it an
extension. Does that answer your question?
Yeah, yeah, it answers it. I just think, if for example there is a managed service provider, what should they add?
pg_repack and pg_squeeze, or just pg_squeeze? Like, what's our recommendation to them?
Because I know some managed Postgres service providers
which have become quite popular,
and they lack this functionality,
and they also don't tune autovacuum by default,
which means growing projects accumulate a lot of bloat,
and they need some tools.
So what do you think?
Should they just consider pg_squeeze as an extension,
because it's convenient, because it has a SQL interface, right?
You mean pg_squeeze versus pg_repack?
Well, yeah, yeah, yeah.
So to me, a SQL-level interface and the ability to schedule is an obvious
advantage, right? Yes. And, yeah, so I just said... maybe it's a rhetorical question, so no need to
answer. Yes, if you ask me, I recommend pg_squeeze over pg_repack. If someone likes pg_repack, let them use it.
Okay, I have another question.
We touched a little bit on this initialization,
and you mentioned pg_repack constructs some DDL and so on.
When pg_squeeze performs the initial load, is it possible,
like, theoretically, not right now, I understand maybe right now it's not possible, but
theoretically, because it's logical decoding and Postgres
should obviously support this, is it possible to change some column type, for example? Imagine we had a primary key integer 4 and we want to migrate to integer
8. It's already not a use case of bloat removal,
but it's kind of like
a mutation of the schema. Not possible, right?
No, it cannot. But I remember this feature was suggested
to me,
but I think it would belong
to a different extension.
And one thing that makes
pg_squeeze a little bit
simpler than it might be
is that it assumes
no catalog changes. So it checks
whether anything changed
in the catalog, and if it
did, then the transaction is
aborted and the processing has no effect. So if anything
changes in the catalog during the processing, then the processing ends up
with an error and everything is rolled back. So if it should change the catalog
in addition to copying the data,
it would be more complex.
So the main reason is simplicity, right?
For the sake of simplicity.
Yes, and I think actually it's a different feature,
so I would put it into another extension.
Actually, I wrote something similar.
It's called pg_rewrite,
and it tries to use the same logic, or similar logic,
as pg_squeeze, to partition a table,
to turn a non-partitioned table into partitions.
I'm not sure how much it is used, but it's another use case.
Yeah, I didn't know about this use case.
No, I didn't know about that either.
I was just going to give you some context.
This came up in a discussion we were having
in an episode that we called
Column Tetris, because that's what we've heard it called
by lots of people: the idea of not just changing a data type
but keeping the existing
data, keeping all the existing
columns but changing their order
for alignment purposes, to minimize dead space in between columns. So putting all of the 16-byte
columns together, then the eight, then the four, then the one, or, you know...
So I'm guessing it's the same, but you could use the exact same mechanism.
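A small illustration of the Column Tetris idea mentioned here (the table itself is made up): on a typical 64-bit build, reordering the same columns from widest to narrowest removes the alignment padding between them.

```sql
-- Padded layout: 4 bytes of padding after "id" so "created_at" can start on
-- an 8-byte boundary, and 7 bytes of padding after "is_done" before "user_id".
CREATE TABLE events_padded (
    id         int4,
    created_at timestamptz,  -- 8 bytes, 8-byte alignment
    is_done    boolean,      -- 1 byte
    user_id    int8          -- 8 bytes, 8-byte alignment
);

-- Same columns ordered 8-byte -> 4-byte -> 1-byte: no padding between columns.
CREATE TABLE events_packed (
    created_at timestamptz,
    user_id    int8,
    id         int4,
    is_done    boolean
);
```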
Yes, it's another use case for this mechanism, but it's not implemented in
pg_squeeze.
Yeah.
Is it implemented? Like, can we use pg_rewrite for this, or is it only to make
a table partitioned?
No, the pg_rewrite extension so far only
handles the partitioning.
But potentially it could be used for this table rewrite, for some schema reorganization?
Yes.
Interesting. And also pg_rewrite, while it's doing this,
it eliminates bloat, right?
Well, it basically creates new tables, so yes.
So bloat is also removed. Doesn't this mean that pg_rewrite performs the same job as pg_squeeze?
Well, but it always needs one non-partitioned table, and it always creates partitions. So not all users
that need to get rid of bloat also need partitioning.
Yeah, I see. I see. I'm just noticing that this project also does the job of that project.
Doesn't that mean that maybe it could be a single project? This is my idea.
I don't know.
I just see the need in this.
Sometimes we need to convert primary key data type,
not only integer 4 to integer 8.
I have tables right now where, for example,
people (sometimes it was myself) decided to use the data type text to store UUID values. And this is not good, not a good decision, also because of disk space.
And it would be great to convert it to the UUID data type easily, without headache,
taking care of all the things that are required to
achieve real zero downtime. For example, sometimes I see people create quite smart migrations,
but then they forget about autovacuum running in transaction ID wraparound prevention mode,
for example, right? And it blocks the deployment, for example.
There are many things to remember, right?
In such a project, to be able to rewrite or to reorganize the order of columns
and get rid of bloat,
maybe sometimes clustering as well.
I know pg_repack supports clustering.
I'm not sure about pg_squeeze,
it also supports that, right?
Yes, you can specify index.
So, I mean, there are many
things, but they are united
by the single idea that tables
should be basically
rewritten, right?
And partitioning is another use case.
Maybe it should be
a single...
I'm just thinking out loud.
Yes, the problem is that this would end up in, say, many extensions, and someone must maintain them. So
I think the ideal case is that this finds its way into the core.
Oh yes, let's discuss this. Why not? For example, pg_squeeze:
why can't we propose something like this as an alternative to the VACUUM FULL and CLUSTER
commands, maybe just an option to them, like ONLINE, right?
Concurrently.
CONCURRENTLY, it doesn't matter which word. And then everyone just gets this. What are the
obstacles on this path?
The obstacle is
a review. I already submitted
a patch which
modifies the cluster command
and the vacuum with the full option.
Let's see what
happens. So the patch is
already in the commit fest.
Can you tell more about this patch?
Well, so far it does not contain the scheduling feature. That's something rather easy to add, but
otherwise the functionality is basically identical to
the functionality of pg_squeeze, except that it does not change the visibility
of the data.
You know, pg_squeeze, when it
writes the data into the
new table, uses a
new transaction ID.
So the data kind of
moves to the future.
Well, this is
similar to what happens with the ALTER TABLE commands:
if ALTER TABLE rewrites the table, it also uses a new transaction. So this is a
documented anomaly of MVCC.
I'm looking at this commitfest entry, which is
called VACUUM FULL / CLUSTER CONCURRENTLY, and this is great, actually. This shows how
poorly I'm prepared for this recording. I'm sorry, apologies for that. But this is huge.
I promise to look at this and advertise it on my Twitter and LinkedIn. This is great.
I think so many people would be happy to have this functionality in core. This is huge work.
So basically, yeah, pg_squeeze in core.
That's great.
I would be happy too,
because I would not have to maintain the code anymore.
And as I say, if it was incorporated into the core, then this visibility problem would also be gone.
The extension is kind of limited.
Some things happen lower in the stack, so the extension needs to...
It must use the new transaction to insert the data.
But if it were incorporated in the core,
then it would act just like cluster and vacuum full.
These commands do not change visibility.
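If the patch eventually lands, usage would presumably look like the existing commands with an extra option. The snippet below is purely hypothetical, based only on the patch's title; the final syntax is whatever gets committed:

```sql
-- Hypothetical syntax, not committed Postgres: today VACUUM FULL and CLUSTER
-- still hold an ACCESS EXCLUSIVE lock on the table for the whole rebuild.
VACUUM (FULL, CONCURRENTLY) public.orders;
CLUSTER (CONCURRENTLY) public.orders USING orders_created_at_idx;
```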
And what is the current status? It waits for review, I see, right?
Yes.
So version five, version six already, right? So there's already some progress, some reviews happened, and it waits for another round of reviews, right?
Yes, yes, that seems so, yes.
Yeah, that's great. Well, I'm excited to see this. I don't know. Okay.
Yeah, for sure. I mean, I did a couple of hours of research and
didn't find that, Nikolai, so don't beat yourself up.
Tony, though, I've got bad news for you. I think you probably
guessed it: we're going to need to maintain pg_squeeze for a while,
because the scheduling stuff looks really useful. And I imagine, looking at the list of features, you might have got the wrong impression listening
to this, because you value simplicity, that it might not be that powerful and might not have that many
features. But there are so many features to pg_squeeze, things that I wasn't expecting to see. Like,
and Nikolai will be happy about this, it does ANALYZE on the table afterwards by default.
You can skip the ANALYZE, I saw that was a feature, but by default it runs ANALYZE on it.
And you can even switch the tablespace when rebuilding. I know not many people use
tablespaces these days, but that's a feature. And I learned something new: I didn't realize the heap
of a table could be in a different tablespace to the
indexes.
That blew my mind.
So I learned quite a lot reading your very thorough README, which was great.
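For anyone else who hadn't realized this: in plain Postgres the heap and each index are separate relations, and each can live in its own tablespace, which is the knob pg_squeeze exposes when rebuilding (the tablespace names and paths below are made up):

```sql
-- Tablespaces point at directories; the paths here are illustrative.
CREATE TABLESPACE cheap_storage LOCATION '/mnt/slow_disk/pgdata';
CREATE TABLESPACE fast_storage  LOCATION '/mnt/nvme/pgdata';

-- The heap goes to one tablespace...
CREATE TABLE measurements (
    id   bigint GENERATED ALWAYS AS IDENTITY,
    ts   timestamptz NOT NULL,
    data jsonb
) TABLESPACE cheap_storage;

-- ...while an index on the same table can live in another.
CREATE INDEX measurements_ts_idx ON measurements (ts) TABLESPACE fast_storage;
```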
Well, I'm not sure how many people use these features, but when I was writing it, I tried
to do it in a general way.
Especially the analyze: it's necessary because it's needed to evaluate
whether bloat exists.
So that's basically why it analyzes the table.
Maybe not, I'm not sure.
I think it's analyzed after.
Yeah, okay. Then sorry, I was wrong. I don't remember why exactly.
You don't remember why you added such a good feature. I like it.
Speaking of tablespaces, I think this is maybe not huge, but there is some potential here...
This is old stuff.
We used tablespaces before the cloud, basically,
because some disks were fast and expensive,
and some disks were slow and cheap.
So we tried to organize the storage.
What I'm thinking right now is that in the cloud era,
the use of tablespaces has almost disappeared.
But think about tiered storage, which might gain popularity quite soon.
For example, if we store a huge heap, a huge table, on object storage, on S3, just because it's much cheaper,
we might decide, okay, we want
indexes,
at least some indexes,
on a regular disk,
an EBS volume, for example.
I don't know. It's just my fantasy.
But it might
have some chances
to happen,
to be popular in some cases, I don't know.
So, yeah.
Another different extension.
Tony, there's one more cool feature of pg_squeeze that I liked.
There's various monitoring tables.
There's a log of the jobs that have run and an errors table,
which I thought was extremely good.
And it can do retries as well.
I think it doesn't by default, but you can set up several retries. That all seems to me really
powerful in terms of, you know, as you say, if you're unlucky and a DDL command
happens to be run on the table just as it was kicking off one of these squeeze jobs and it
gets cancelled,
you might want it to be able to retry again quite soon.
So these features all seem cool.
And the idea of an error log as well,
in terms of having that as a table
and having a SQL interface to it all seemed really cool.
I think that was all from the beginning.
So do you remember much about implementing those
and what you were thinking at the time?
The monitoring, for me, was even important for development and testing. And I think,
yes, I wanted... yes, I think one reason was to make sure that those, say, concurrent changes, I mean, do
actually happen during the test. So when the data is being
copied to the new file, some applications may make changes. I call
these concurrent changes, and I wanted to make sure that during testing these changes
really happen. That was
one reason I did the monitoring.
But of course it should be useful
for the users:
if, for example, it takes
too long, it's good to
know how many rows
were already copied and
what the progress was.
Yeah, I loved it. The only thing I
was surprised you didn't include (I don't think people necessarily need it, but it shows
off how much work you're doing) is that it didn't show me a before and after in terms of table size.
I thought there was an opportunity there to show off how much success you're bringing. But yeah, it shows the
number of tuples on the initialization, or maybe number of rows, and then the number of
concurrent inserts, updates and deletes. I think there are like four columns.
Yeah, that's it.
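A rough sketch of querying those monitoring tables, based on the pg_squeeze README; the exact table and column names (an initial-load row count plus counters for concurrently decoded inserts, updates and deletes) may differ slightly between versions:

```sql
-- Completed jobs: per the README, squeeze.log records when each squeeze ran
-- and how many rows were copied initially vs. applied from concurrent changes.
SELECT tabschema, tabname, started, finished,
       ins_initial, ins, upd, del
FROM squeeze.log
ORDER BY finished DESC
LIMIT 10;

-- Failed attempts (e.g. a concurrent DDL aborted the squeeze) are kept here.
SELECT * FROM squeeze.errors;
```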
Cool. Well, what's the future of pg_squeeze? I guess the core stuff makes sense.
I don't really think that far, but it will be maintained for some time, and the next step might be to submit a patch for
the scheduling. But I'm not sure if the scheduling is appropriate for the Postgres core.
I think if this was merged into the core, then people would
bring up many new ideas. So I didn't really think
that far. I think there are not many features to be added.
Well, I don't have many ideas about what else should be added.
And at this point, I'm not too eager to add new features
because it works.
And as I hear sometimes that some important customer is using it,
then I tend not to touch it anymore.
I did the coding the best way I could do but
with the positive feedback I tend to not add new features. I'm fine if I'm only supposed
to maintain it.
Yeah, like for new major versions, I'm guessing there's some work?
Yes, it gets updated
for major Postgres versions.
Nice.
Nikolai, anything else you wanted to ask?
No, I think we're good.
Thank you so much for coming
and explaining
how it was created
and how it's going.
And I'm super excited
about this surprise,
that there is a proposal
to have VACUUM FULL CONCURRENTLY.
It's really great.
Going to look at it closer.
Okay, thanks for the feedback.
It's like,
if it succeeds,
it's a logical step
for this project, I think, right?
Which can influence every Postgres instance.
For sure.
Yeah.
Yeah.
Well, best of luck.
Thank you so much for joining us,
and I hope you have a good week.
Yes, thanks for the invitation, and have a nice day.
Thank you. Bye-bye.
You too. Bye.