Postgres FM - pg_squeeze

Episode Date: January 10, 2025

Michael and Nikolay are joined by Antonín Houska to discuss pg_squeeze — what it is, how it started, some of its features, and hopes of getting the functionality into core.

Here are some links to things they mentioned:

Antonín Houska https://postgres.fm/people/antonin-houska
pg_squeeze https://github.com/cybertec-postgresql/pg_squeeze
Cybertec https://www.cybertec-postgresql.com
pg_repack https://github.com/reorg/pg_repack
Introducing pg_squeeze (by Kaarel Moppel) https://www.cybertec-postgresql.com/en/introducing-pg_squeeze-a-postgresql-extension-to-auto-rebuild-bloated-tables
pg_squeeze is included in the PostgreSQL APT packages https://wiki.postgresql.org/wiki/Apt
pg_squeeze is included in the PostgreSQL Yum packages https://yum.postgresql.org
pg_rewrite https://github.com/cybertec-postgresql/pg_rewrite
VACUUM FULL / CLUSTER CONCURRENTLY patch https://commitfest.postgresql.org/51/5117

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and as usual, I'm joined by Nikolay, founder of PostgresAI. Hey, Nikolay. Hi, Michael. And today we have a special guest, Tony, from Cybertec, a Postgres developer there, who is responsible for, among other things, the pg_squeeze extension. So, welcome, Tony. Hello, thanks for the invitation.
Starting point is 00:00:24 It is our pleasure. So today we are going to be talking all things pg_squeeze. Perhaps Tony could give us an overview of what it is. So pg_squeeze is a Postgres extension that does similar things to the CLUSTER command or VACUUM with the FULL option, except that it does not require exclusive locking of the table. Well, in fact, it needs the exclusive lock only for a very short time at the end of the processing. But it's actually a tool that lets users get rid of table bloat. Yeah, it's hugely powerful, isn't it? So online bloat reduction of tables.
Starting point is 00:01:09 Yes, basically that. Yeah, big, big problem. And it's a really powerful tool. And I think a lot of our listeners will be familiar with PgRepack, which I think originated from a... Was it... Nikolaj, you can probably tell me what it was originally called, a reorg or something like that. Yeah, something like's already already doesn't matter it's ancient name yeah but
Starting point is 00:01:31 pg squeeze uh you also have like a bunch of other features and it's it's done in a more modern way in terms of how maybe you could tell us a little bit about how it's how it's implemented and how long you've been working on it yes the motivation the motivation was actually to use, well, it was assignment. It was a task that I got from my employer. And the idea was to adjust pg-repec, the existing pg-repec extension, so it can be scheduled, so it can run according to schedule automatically. But when I investigated the pg-repec code, I realized that it's implemented in two parts. Part of the logic is on the server and the other part is on the client.
Starting point is 00:02:14 And I realized that the client part is not really simple. It does some tricky things with connections. It even terminates connections at some point. So I realized that if it should be executed from a background worker, it would be quite strange because the background worker would have to use the client library to connect to server. So the server would have to connect to itself over the Lipiku library. So the fact that the client is not trivial made it almost impossible to implement the functionality in the background worker.
Starting point is 00:02:54 So that was one reason I thought it's better to start from scratch. And once I was considering doing everything from scratch, I got opportunities to do some things differently. Yeah, just instead of replacing files, you decided to use logical replication, right? Because by that time, it already became quite mature. Unlike the moment when PGRework and PgRepack was created originally, there was no logical replication yet at all. Yes. Right?
Starting point is 00:03:29 Is this right? Yes, PgRepack uses triggers. Well, maybe I should tell shortly how both extensions work. The common part is that they copy the useful say useful data from the table into a new table and then swap the files. Each table is table contents is stored in a file. So, so the extensions copy the useful data and then switch the links to the files. All right. Also accumulating changes in some Delta
Starting point is 00:04:04 table and applying them. Yeah, also accumulating changes in some delta table and applying them, this can be a headache itself. This is where pgReplic uses triggers and around the time I was considering writing such an extension
Starting point is 00:04:19 from scratch, there was a new feature in Postgres and it was logical decoding. I think the logical replication was not yet implemented at the moment but I thought it could be useful for this purpose so I spent quite some time studying how it works and then I decided to use the logical decoding. I considered it more, I will say, compact. I didn't like the idea to construct, well, the pgRepec somehow has to create the triggers
Starting point is 00:04:53 so it plays with the DDL statements. It has to construct the DDL statements to create the temporary table as well as the triggers and I thought this is quite complex and error prone So that's why I prefer the logical decoding I think logical decoding should be lower impact as well in terms of overhead So it feels like a lighter weight way of doing it as well, to me. Is that fair? Yes, I think the fact that it works at a lower level in the stack should even be probably
Starting point is 00:05:34 more efficient. Yeah, because in case of PGRPack, any write during initialization of during this massive copy using I think it uses create table as select when we do it all rights this table also go to this delta table using triggers and basically we write it twice and it goes twice to wall while if we take logical recording we just extract it from wall and it's written once right this? This is already a big difference. Well, I think with pgSqueeze, the insertions into the new table are actually locked too. I think there's no way to... Well, you could avoid the logging if the table was unlocked, but then it's hard to switch. I didn't go that much far in this direction, but... I mean, in our case, in PgSquiz, we just accumulate changes in the slot.
Starting point is 00:06:31 And we don't write twice to two tables, right? This is the big difference in terms of how this delta is accumulated. Yes, I see. Yes, that's right. So this is obviously a difference. And I didn't realize that the original idea was to be able to schedule it. And scheduling PGRPAC means you do it on client using, I don't know, like regular cron or something. In this case, you brought all the interface to SQL, and we just can schedule it using PGRPAC or something, right?
Starting point is 00:07:05 This is the idea. I didn't realize it. Or you can schedule it using pg-chron or something, right? This is the idea. I didn't realize it. Or you can schedule it right with pg-squeeze. I don't know this. Okay, Michael is shaking head. I understand. Okay. No, I'm saying you can.
Starting point is 00:07:17 I'm saying you can schedule it within pg-squeeze. It uses the same syntax as crontab. Is that right? Yeah. I didn't know this. I use it only a couple of times manually somewhere very long ago so yeah this that's great and and in general be able to run everything at sql level it it's good I mean not cli level but sql level it gives freedom sometimes right you can do it in application or anywhere, PSQL or something.
Starting point is 00:07:45 Yes, and it's good for development and troubleshooting. Yeah, yeah, yeah. That's great. That's great. Good. Tony, I was reading, I think you have a super fan, previous CyberTech employee, Karel.
Starting point is 00:08:01 He wrote a really good introduction post. I couldn't believe it was 2016. I mean, also it had lots of these features at the beginning. How has it developed over time and how much have you had to keep working on it? Well, I think there was not much functionality added since the initial version. I did some optimizations to avoid the restriction it puts on vacuum, because one problem is that the X-Men horizon, you probably know what it means, must be prevented from advancing
Starting point is 00:08:40 for some time when the data is being copied. Then this horizon must stay constant it must not advance and the other problem is that. So sorry once again so what exactly you did like if we create a slot it already it already freezes our state in terms of horizon until we start consuming changes from the slot, right? Yes, and I think the initial version did block this horizon for the entire processing,
Starting point is 00:09:13 but at some point I realized that it's only necessary for the initial load, for the copying of the data. Then it can be allowed to advance. Yeah, when it's done. After the load, after the copying is done, you only need to process the changes that were applied concurrently, so you no longer need to block the Xmin horizon for this stage. This is a huge problem for huge tables, right? If we have a one terabyte table, it means that this horizon is blocked for long
Starting point is 00:09:54 and it affects the whole cluster because autovacuum cannot delete freshly that tuples, as we discussed many times on this podcast. In this area, have you considered the idea of speeding up initialization? If, for example, we have regular integer 8 or, I don't know, like some numeric primary key, we could split the table into some batches or areas, and then we could use multiple insert processes, which would work in parallel, speeding up the initial initial load.
Starting point is 00:10:27 Did you consider this or? Or this is like, maybe? No, not really. I didn't have this idea. Do you mean processing the table in multiple iterations? No, I didn't get that far. So you say that with pg-sql the X-Men horizon is a problem with the processing or in general? In general when working with logical replication, decoding, it's a problem. Initialization is a huge problem. Sometimes we're dealing with huge databases. We avoid initialization completely. For example, if we want all tables, sometimes we have a couple of recipes
Starting point is 00:11:09 how to convert physical replica to logical replica just because initialization is a headache. It can take sometimes days. And during this time, Xmin Horizon is not progressing and auto-vacuum is blocked. So I know some tools. I don't remember. Maybe PureDB was doing this,
Starting point is 00:11:31 basically virtual partitioning of the table. Not real partitioning, but just kind of partitioning, splitting it to pieces and then loading. If we have very fast disks and a lot of CPUs, a lot of cores, we can speed this process up. And this means that Xmin Horizon is blocked for a shorter period. I think, yeah, we could consider in this product, we could consider this, I think, this approach.
Starting point is 00:12:01 And PGRPack doesn't do it. It just create table as select. So it's a single process and it can be slow. While here we have an opportunity to speed it up, I think. This sounds like one of the cons we may discuss later, but from this perspective the pgSqueiz behaves like logical recording.
Starting point is 00:12:28 Yeah, and if we consider this approach in the future, if we think how to improve this, it's worth remembering that we need to synchronize all these sessions which perform inserts of individual partitions, virtual procedures we need to synchronize them using single snapshot right and so they deal with the same data how it's called i don't remember when you start transaction you can specify a snapshot and there is a function which exports the snapshot name so we we can we can ititchdown, Pitchdown does this as well. If, yeah. Yes, the Pitchdown also uses the kind of snapshot that is called historic snapshot. It takes snapshot of the database as it was at some point in time, but I'm not sure this,
Starting point is 00:13:18 well, you need for the snapshot to work, you need to make sure that the Xmin horizon does not advance because otherwise you must not lose, you should copy even the deleted tuples that are still visible to the snapshot because without that you would not be able to apply updates. If the tuple gets updated
Starting point is 00:13:39 you need to copy even the old version. Right, but if the slot is created already, it should guarantee that the snapshot is not lost, right? I think so. Yes, the slot does guarantee it, but I'm just explaining why we cannot advance it more than what is correct. Also, Nikolai, I don't think this is as big a problem
Starting point is 00:14:02 because I think you're talking about the slow initialization of a huge database, like the entire database, whereas... Just one table. Well, yeah, but just one table. And we can be talking about partitions here, though, right? Like if we're talking about, you're talking about a terabyte table. If we could get it down into 100 gigabyte partitions, pgSqueeze can work on a table-by-table basis and only blocking it partition-by-partition. Well, right, but if it's on partition table,
Starting point is 00:14:35 this can be a problem, right? But on the other side, if we're dealing with huge bloat, usually a terabyte table means 100 gigabytes table, right? Because while we're copying it, it already squeezes, right? That's, that's, that's what's happening. Yeah, maybe this explains why this is not a big concern. And, and people should you're right, people should partition the huge tables anyway. Yeah. Anything you've had to do in particular around partitioning, Tony? Or I guess it just works like any normal tables? No, I think it's just normal.
Starting point is 00:15:10 I only checked at the beginning the kind of table must be checked. I don't think there's any special treatment. Yeah, unless it's timescale DB. I remember there's an issue with logical. There are hyper tables, which are kind of like partition tables. Yeah, they couldn't change. Right, and with regular partitioning, logical decoding works at partition level, that's it.
Starting point is 00:15:42 Not that's it, but basics is this and i remember there's some issue with timescale db so i wonder if timescale db is compatible with pg squeeze maybe there are issues there actually do you know tiny i don't know sorry no it's okay so i just recently learned there are issues with logical we had had a database with TimescaleDB. We needed to move it using logical and also on the fly removing bloat, right? But it didn't work easy. So, yeah.
Starting point is 00:16:17 Good to know. I have another question in general. So this is extension, right? This is an extension like pg repack why extension it could be not extension right it could be just some bunch of i don't know like is there anything that requires this project to be an exposed extension because if it's not extension that you can use it anywhere even like there are managed Postgres systems which don't have even pgRepack. They don't have it. And people suffer from bloat. I mean, not people, databases
Starting point is 00:16:52 suffer from bloat there, and I see it. And if they don't have pgRepack, they don't have pgsqueeze, we cannot use this approach, or we need to implement something similar ourselves. We're using logical decoding, right? Well, extension is, I consider extension just a means of packaging, but what is important is that this code is a shared library because it's C code running on server. So it must be shared library, and the easiest way to get shared library to the user is to make it an
Starting point is 00:17:26 extension. Does it answer your question? Yeah, yeah, it answers. Yeah, I just think if for example, there is a managed service provider, what should they add? pgrepack, pgsquiz, or just pgsqu? Like, what's our recommendation to them? Because I know some managed Postgres service providers which become quite popular, and they lack this functionality, and they also don't tune to vacuum by default, which means growing projects accumulate a lot of load,
Starting point is 00:18:05 and they need some tools. So what do you think? They should just consider PGSqueeze as an extension because it's convenient, because it has SQL interface, right? You mean PGSqueeze versus PGRepack? Well, yeah, yeah, yeah. So to me, aql level interface is an ability to schedule it's an obvious advantage right yes and uh yeah so i just said maybe a rhetoric question so no no no need to
Starting point is 00:18:38 answer yes if you ask me i recommend pg squeeze butrepack. If someone likes pg-repack, let him use it. Okay, I have another question. We touched a little bit on this initialization, and you mentioned pg-repack constructs some DDL and so on. When pg-repack performs with initial load, is it possible, like theoretically, not right now, I understand like maybe right now it's not possible, but theoretically, is it possible, for example, to, because it's logical decoding, Postgres should obviously support this, is it possible to change some column type, for example. Imagine we had primary key integer 4 and we want to migrate to integer
Starting point is 00:19:28 8. It's already not a use case of bloat removal but it's kind of like mutation of schema. Not possible, right? No, it cannot but I remember I was suggested this feature it was suggested to me but I think it would belong to a different extension and one thing that makes
Starting point is 00:19:51 the pgSqueeze a little bit simpler than it might be is that it assumes no catalog changes so it checks whether anything changed in the catalog and if did then the transaction is aborted and the processing is not effective. So everything if anything
Starting point is 00:20:14 changes in the catalog during the processing then the processing ends up with error and everything is a rollback. So if it should change the catalog in addition to copying the data, it will be more complex. So the main reason is simplicity, right? For the sake of simplicity. Yes, and I think actually it's a different feature, so I would put it into another extension.
Starting point is 00:20:44 Actually, I wrote something similar. It's called pgRewrite, and it tries to use the same logic or similar logic, like pgsqueeze, to partition table, to turn non-partition table into partitions. I'm not sure how much it is used, but it's another use case. Yeah, I didn't know about this use case yeah I didn't know about this no I didn't know about that either
Starting point is 00:21:07 I was just going to give you some context this came up in a discussion we were having in an episode that we called column Tetris because that's what we've heard it called by lots of people the idea of not just changing a data type but keeping the existing data, like keeping all the existing
Starting point is 00:21:24 columns but changing their order for alignment purposes to minimize dead space in between between columns so putting all of the 16 byte columns it together then the eight then the four then the one or you know um so i'm guessing it's the same, but you could use the exact same mechanism. Yes, the mechanism is another use case for this mechanism, but not implemented in the pg-squeezing. Yeah. Is it implemented, like, can we use pg-reverite for this, or it's only to make
Starting point is 00:22:02 a table partitioned? No, that pgRewrite extension so far only handles the partitioning. But potentially it could be used for this table rewrite for some reorganization schema. Yes. Interesting. And also pgRewrite, while it's doing this, it eliminates bloat, it eliminates blow, right? Well, it basically creates new tables, so yes. So blow to the remove. Doesn't this mean that PgRewrite performs the same job as PgSqueeze?
Starting point is 00:22:37 Well, but it always needs one non-partition table, and it always creates partitions so not all users that need to get rid of both also need partitioning. Yeah, I see. I see. I'm just noticing that this project also does the job of that project. Doesn't that mean that maybe it could be a single project? This is my idea. I don't know. I just see the need in this. Sometimes we need to convert primary key data type, not only integer 4 to integer 8.
Starting point is 00:23:16 I have tables right now where, for example, people decided where people, sometimes it was myself, decided to use data type text to store UUID values. And this is not good, not a good decision. Also, because of disk space. And it would be great to convert it to UUID data type easily without headache, taking care of all the things that are required to achieve real zero downtime. For example, sometimes I see people create quite smart migrations, but then they forget about auto vacuum running in transaction ID wrap around prevention mode, for example, right? And it blocks deployment, for example.
Starting point is 00:24:07 It's like there are many things to remember, right? In such project, to be able to rewrite or to reorganize the order of columns and get rid of bloat, maybe sometimes clustering as well. I know PgRepack supports clustering. I'm not sure about PgSqueeze. To reorganize also supports, right? Yes, you can specify index.
Starting point is 00:24:28 So, I mean, there are many things, but they are united by the single idea that tables should be basically rewritten, right? And partitioning is another use case. Maybe it should be a single...
Starting point is 00:24:43 I'm just thinking. I'm just like thinking, Yes, the problem is that this would end up in say many extensions, someone must maintain it. So I think ideal case is that this finds its way into the core. Oh, yes. Let's discuss this. What? Why not not? Let's discuss this. For example, Pidget Squeeze. Why we cannot propose something like as an alternative to vacuum full and cluster comments, maybe just an options to them like online, right? Concurrently. Concurrently. It doesn't matter which words. And everyone already receives this. What are the
Starting point is 00:25:26 obstacles on this path? The obstacle is a review. I already submitted a patch which modifies the cluster command and the vacuum with the full option. Let's see what happens. So the patch is
Starting point is 00:25:42 already in the commit fest. Can you tell more about this patch? Well so far it does not contain the scheduling feature. That's something rather easy to add but otherwise it's almost identical to the functionality is basically identical to the functionality of pg squeeze except that it does not change the visibility of the data. You know, the pgSqueeze, when it
Starting point is 00:26:11 writes the data into the new table, it uses new transaction ID. So the data kind of moves to the future. Well, this is similar to what happens the alter table commands if if alter table, revise table it also uses new transaction so this is
Starting point is 00:26:32 documented anomaly of the MVCC. I'm looking at this commit fest which is called vacuum full cluster concurrently and this is this is great actually this shows how poorly i'm prepared for this recording i'm sorry apologies for that for this but this is huge i promise to look at this and advertise it on my twitter and linkedin this is great this is great like i think so many people would be happy to have this functionality in core. This is huge work. So basically, yeah, PGS-Q is in core. That's great. I would be happy too,
Starting point is 00:27:17 because I would not have to maintain the code anymore. And as I say, if it was incorporated into the core, then this visibility problem would also be gone. The extension is kind of limited. Some things happen lower in the stack, so the extension needs to... It must use the new transaction to insert the data. But if it were incorporated in the core, then it would act just like cluster and vacuum full. These commands do not change visibility.
Starting point is 00:28:01 And what is the current status that waits for review, I see, right? Yes. So version 5 five version six already right so there's already some progress uh some reviews happened and it waits for another round of reviews right yes yes that seems so yes yeah that's great that's great well, I'm excited to see this. I don't know. Okay. Yeah, for sure. I mean, I did a couple of hours of research and didn't find that Nikolai sorry, don't don't beat yourself up. Tony, though, I've got bad news for you. I think you probably asked it we're gonna need to maintain PG squeeze for a while.
Starting point is 00:28:42 Because the scheduling stuff looks really useful and i imagine i'm looking at the list of features and you might have got the wrong impression listening to this that you value simplicity that it might not be that powerful and might not have that many features but there's so many features to pg squeeze things that i wasn't expecting to see like it does it nicole i'll be happy about this it does analyze on the table afterwards by default you can you can skip the analyze i saw that that was a feature but by default it runs analyze on it and you can even switch the table space so you can when rebuilding i know not many people use table spaces these days but that's a feature and i learned something new i didn't realize the heap
Starting point is 00:29:22 of a table could be in a different table space to the indexes. That blew my mind. So I learned quite a lot reading your very thorough readme, which was great. Well, I'm not sure how many people use these features, but when I was writing it, I tried to do it in a general way. Especially the analyze is necessary because it is needed to evaluate if the bloat exists.
Starting point is 00:29:57 So that's basically why it analyzes the table. Maybe not, I'm not sure. I think it's analyzed after. Yeah, okay. So sorry that I was wrong. I don't remember why exactly. You don't remember why you added such a good feature. I like it. Speaking of tablespaces, I think this maybe not huge, but there is some potential that this... This is old stuff. We used tablespaces before cloud, basically.
Starting point is 00:30:28 Because some disks were fast and expensive, some disks were slow and cheap. So we tried to organize the storage. What I'm thinking right now in cloud era, it looks like tablespaces, use of them almost disappeared. But if we think about tiered storage, which might get popularity over time quite soon. For example, if we store huge heap, huge table on object storage on S3, just because it's much cheaper.
Starting point is 00:31:05 We might decide, okay, we want indexes on at least some indexes on a regular disk. EBS volume, for example. I don't know. It's just my fantasy. But it might have some chances
Starting point is 00:31:21 to happen, to be popular in some cases, I don't know. So, yeah. Another different extension. Tony, there's one more cool feature of PG Squeeze that I liked. There's various monitoring tables. There's a log of the jobs that have run and an errors table, which I thought was extremely good.
Starting point is 00:31:44 And it can do retries as well. I think it doesn't by default, but you can set up several retries. That all seems to me really powerful in terms of, you know, as you say, if things, if you're unlucky and a DDL command happens to be one on the table, just as it was kicking off one of these squeeze jobs and it gets cancelled, you might want it to be able to retry again quite soon. So these features all seem cool. And the idea of an error log as well,
Starting point is 00:32:12 in terms of having that as a table and having a SQL interface to it all seemed really cool. I think that was all from the beginning. So do you remember much about implementing those and what you were thinking at the time? The monitoring, for me, it was even important for the development and testing. And I think it's, yes, I wanted to, yes, I think one reason was to make sure that those say concurrent changes, I mean that do actually happen during the test. So when the data is being
Starting point is 00:32:53 copied to the news file, some applications may do changes, I call these concurrent changes and I wanted to make sure that during testing these changes really happen so that was one reason I did the monitoring but of course it should be useful for the users if for example it takes too long time it's good to
Starting point is 00:33:18 know how many rows were already copied and what was the progress yeah I loved it the only thing i was surprised that you didn't include i don't think people necessarily need it but it shows off how much work you're doing is it didn't show me a before and after in terms of like table size i thought there was an opportunity there to show off and show how much success you're bringing but yeah it shows the
Starting point is 00:33:46 number of tuples it on the initialization or number maybe number of rows and then number of concurrent inserts updates and deletes maybe i think there's like four columns yeah that's it cool well what are your like what's what's the future of pgSqueeze? I guess the core stuff makes sense. I don't really think much, I don't really think that far, but then it will be maintained for some time and the next step might be to submit a patch for the scheduling, but I'm not sure if the scheduling is appropriate for the Postgres core. I think then if this was merged into the core, then I think people would bring up many new ideas. So I didn't really think that far. I think there is not much, not many features to be added.
Starting point is 00:34:45 Well, I don't have many ideas what else should be added. And at this point, I'm not too eager to add new features because it works. And as I hear sometimes that some important customer is using it, then I tend not to touch it anymore. I did the coding the best way I could do but with the positive feedback I tend to not add new features. I'm fine if I'm only supposed to maintain it.
Starting point is 00:35:20 Yeah, like for major like new major versions, I'm guessing there's some work yes, updated for major Postgres versions nice Nikolai, anything else you wanted to ask? no, I think we're good thank you so much for coming and explaining
Starting point is 00:35:39 how it was created and how it's going and I'm super excited with this surprise that there is a proposal to have vacuum full concurrently. It's really great. Going to look at it closer.
Starting point is 00:35:54 Okay, thanks for the feedback. This is like, if it succeeded, it's a logical step for this project, I think, right? Which can influence every Postgres instance. For sure. Yeah.
Starting point is 00:36:12 Yeah. Well, best of luck. Thank you so much for joining us, and I hope you have a good week. Yes, thanks for the invitation, and have a nice day. Thank you. Bye-bye. You too. Bye.
