Postgres FM - Archiving
Episode Date: December 19, 2025
Nik and Michael discuss a listener question about archiving a database. Here are some links to things they mentioned:
Listener request to talk about archiving: https://www.youtube.com/watch?...v=KFRK8PiIvTg&lc=UgyiFrO37gEgUaVhRgN4AaABAg
Our episode on "Is pg_dump a backup tool?": https://postgres.fm/episodes/is-pg_dump-a-backup-tool
What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!
Postgres FM is produced by: Michael Christofides, founder of pgMustard, and Nikolay Samokhvalov, founder of Postgres.ai
With credit to: Jessie Draws for the elephant artwork
Transcript
Hello, hello. This is Postgres FM. My name is Nik, as usual, Postgres.AI. And as usual, with me, Michael, pgMustard. Hi, Michael.
Hello, Nik. So we have a question today from one of the listeners, and it's about archiving. There are so many options for what that can mean. Where to start?
Yeah, well, they specifically asked whether it was possible for us to do a video on archiving, and they said they want to archive a database that is about 5 terabytes in size.
So what do you assume that means when you first hear it?
Well, if you are familiar with Postgres internals, and it's not even internals, this is about Postgres architecture, archiving is simple. You just set up archive_command, that's it, right? But it's a joke. I think this is not what was meant here.
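For reference, the built-in WAL archiving Nik is joking about is configured roughly like this. A minimal sketch, assuming WAL-G is installed and configured; any command that safely stores the segment can be used instead:

```bash
# Built-in WAL archiving: every completed WAL segment is handed to archive_command.
psql -c "ALTER SYSTEM SET archive_mode = 'on';"
psql -c "ALTER SYSTEM SET archive_command = 'wal-g wal-push %p';"
# archive_mode only takes effect after a restart
pg_ctl restart -D "$PGDATA"
```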
Though, yeah. I think 5 terabytes these days is an average-size database. Who doesn't agree? Well, I see bigger databases, and 5 terabytes is not a big database anymore. It was big 10 years ago; these days big means more than 10 terabytes, that's just it, right? So 5 terabytes is an average-size database. And archiving it means that we want to keep it longer term, right? That's how I read it, I think. So, longer term, and/or we probably also want to clean up sometimes.

Oh, interesting. Yeah, to clean up.

Well, yeah, because, for example, imagine it's e-commerce: we accumulated order history for many, many years, it's not used, but it's so much data. So why not archive it and then clean up? Straightforward cleanup in Postgres with DELETE, as we discussed many times, is not an easy operation in large databases, because of MVCC, vacuum and all these complexities. But if it's partitioned, if the bigger tables are partitioned, then cleanup is easy, right? We just detach and drop old partitions, that's it. And before that, obviously, archiving makes a lot of sense. But even if we don't have a task to clean up, archiving might make sense for backup purposes, right? We want to keep an archive for long.
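For reference, the partition-based cleanup Nik describes might look like this (table names are hypothetical; DETACH ... CONCURRENTLY needs Postgres 14+):

```bash
# After archiving the old partition somewhere safe, detach it and drop it.
psql -c "ALTER TABLE orders DETACH PARTITION orders_2019 CONCURRENTLY;"
psql -c "DROP TABLE orders_2019;"
```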
Yeah, or maybe there are requirements, right? Yeah. I think there's an argument that it's slightly different to backups, but I'm not actually certain that the way I've internalized that is helpful in any practical sense, because if you've got data archived, it's kind of a backup, right? Like, it is another source of old data. But I do think they're different, in that you wouldn't expect an archive to have up-to-date data, whereas you would expect a backup to have up-to-date data. So I do think there are subtle differences. And also, I think you back up and you archive for different reasons, and therefore there are different trade-offs. So if we're backing up in order to be able to get back online quickly if we have a disaster, you might want to put it somewhere different, in a different format, than if we archive it because we need access to it, but not necessarily performantly.
For me, in real life, take for example our DBLab, or experiments. Experiments should be done on clones, right? On thin clones, with branching, if it's DBLab, for example, or on full-fledged clones on separate virtual machines if it's self-managed, right? And we say: let's restore either from a snapshot, a cloud disk snapshot, or let's restore from backups, or from archive. These are purely synonyms for me, like "let's restore from archive". And that's why archive_command is a part of backups, right? It's archiving WALs. And we say we have many, like, versions of our database... not versions, many exemplars of our database. And archive, or backups: backups, archives, it's the same for me in this case.
And of course we want it to be quickly restored. But these days, if it's cloud, it's easy to configure. So you basically archive it, back it up using pgBackRest or WAL-G. It goes to S3 or other object storage, like GCS, or, I'm not an expert in Azure, Blob Storage, they call it, right? And there you can configure it so that, for example, a couple of weeks later the backup automatically moves to colder storage. On S3 it's called Glacier. Well, all of them are archives, one is just hotter and another is colder, right? It's Glacier, or in GCS, in Google Cloud, I think it's called Coldline, I don't remember the name exactly. But the idea is it's cheaper, yeah, but slower to restore.
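Such a policy is typically configured on the bucket. A minimal sketch with a hypothetical bucket name and prefix (the prefix depends on how your backup tool lays out objects):

```bash
# Transition base backups to Glacier 14 days after they are written.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "backups-to-glacier",
      "Status": "Enabled",
      "Filter": { "Prefix": "basebackups_005/" },
      "Transitions": [ { "Days": 14, "StorageClass": "GLACIER" } ]
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-pg-backups --lifecycle-configuration file://lifecycle.json
```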
Yes. And this still causes a bug which has been known for many years in WAL-G. I recently coded a fix, because the transition changes the modification time, and LATEST becomes broken. If you want to restore using the uppercase LATEST with WAL-G, wal-g backup-fetch LATEST, in this case it might fetch the wrong backup, because it's based on modification time. And when a backup is moved automatically to colder storage, the modification time of that old backup changes, so LATEST might be broken. I recently sent a pull request to Andrey to fix it in WAL-G, so it should be fixed, right? Yeah. But I mean, modification time cannot be trusted here, obviously.
But it's a good idea, and it's great that clouds support it. It's super easy to do: you just configure some policy, how long to keep them, when to move to colder storage. So it allows you, for the same money, to keep archives for longer, right?

Or, for example, if you're doing this for compliance reasons and the length of time is predetermined, how long you have to keep it for, you can keep it for that long for cheaper. That's how it feels to me, right?

Yeah, yeah. So obviously it's not SSD anymore, it's HDD, but it's still super reliable. And as I understand, they promise a lot of nines, an amazing number of nines.

Although here we're talking about durability, not availability, right? Yeah.
Exactly. So obviously, data won't be lost. That's the key. And it can be stored for years and years and years, if you want, right? And that's great. The only risk that remains: if somebody gets hold of your cloud account access, they might destroy everything. To protect from that, there is a lock option, right? You can lock it and basically say: you cannot delete, nobody can delete.

Or you can copy it to a different account, or even a different cloud.
Yeah, with different permissions.
Right.
Right.
And if it's, again, if it's backups using pgBackRest or WAL-G: WAL-G has a special command for this, for this copy. So you can copy from one backup location, one archive location, to a different archive location.
Yeah.
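The same idea can be sketched with a plain object-level copy under separate credentials (bucket names and the profile are hypothetical; WAL-G's own copy command, mentioned above, is another option):

```bash
# Keep a second copy of the backup bucket in another account/bucket with different permissions.
aws s3 sync s3://prod-pg-backups s3://offsite-pg-archive --profile archive-account
```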
But if you are on managed Postgres, in this case physical backups are not available, unless it's Crunchy Bridge, right, which is great in that they allow access to physical backups and WALs. And in this case the only option that remains is to create a logical backup, which I prefer to call a dump. And we had an episode on that: is pg_dump a backup tool? Is a logical backup a backup?
My answer is still no, because... but okay.

Is it an archive, though? Because I think archiving almost always isn't incremental, right? You do it in batches, right? You take a whole year and archive it, so in this case it's not changing. So a logical dump kind of is an archive.

Well, for me, again, I don't feel this difference. I hear you, you feel some difference. But for me, all backups are also an archive. They're just continuous, thanks to the WAL archive. They consist of two parts: full copies, or deltas (incremental, differential, there are many names there), but basically a full copy of the data directory plus WALs, right? And WALs make it continuous. This archive becomes continuous.
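In tool terms, that combination of a full copy plus WALs looks roughly like this (a sketch assuming WAL-G is configured through its environment variables):

```bash
# Full copy of the data directory, pushed to object storage...
wal-g backup-push "$PGDATA"
# ...while archive_command = 'wal-g wal-push %p' keeps shipping WAL segments,
# which is what makes the archive continuous (point-in-time recovery).
```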
This is where I think we have a difference,
because I think these...
Let's go back to the question, right?
They want to archive a database that's five terabytes in size.
That's all the information we got.
Yeah.
We don't know. Like, there's a chance that it's an old product that's no longer served and they just need to keep the data for a certain amount of time. It could be that they don't need access to it anymore. It could be all manner of things.

They might want to drop the database and just keep it for history, right? So, to answer this question directly, the first thing we need to understand: is it a managed service, or do we have access to the physical level? If it's a managed service, the only option for us is pg_dump.
No, I disagree.
Why?
Well, because I think you could, depending on the parameters, right, export it in a different format, right? There's a lot of... a lot of work there.

Okay, a logical backup is the only option here.
Yes.
Although, technically speaking, we can say: okay, let's create a snapshot, consider it the archive, and rely on RDS to keep it for longer.
Why not?
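That path is essentially a manual snapshot kept around; a sketch with hypothetical identifiers (manual RDS snapshots are retained until you delete them):

```bash
# A manual RDS snapshot kept as a long-term "archive".
aws rds create-db-snapshot \
  --db-instance-identifier appdb-prod \
  --db-snapshot-identifier appdb-archive-2025-12
```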
Yeah, but here's what I'm thinking, though. And five terabytes is not very large, right, as you were saying at the beginning. But in the general case, where we care about cost, and it seems like cost actually matters here, it's one of the main reasons for doing this in the first place... If we care a lot about cost, it feels to me like some of the newer formats that compress extremely well... And maybe "newer" is not the right term, but it's new to me in terms of coming across it from a Postgres perspective. Formats like Iceberg and Parquet, that compress down extremely well, feel to me like they would be the cheapest way of archiving things: data that's static and is not going to be changing, or you're going to get a new batch of it once per year, like the archiving use cases that I'm thinking of. If we exported in those formats, they'd be tiny. Like, the five terabytes might go down 95% or so.
Oh, I doubt it. 20 times, I doubt. Well, maybe. Okay, what have you seen?

Yeah, I've seen people reporting that kind of number, even on the Timescale compression side, I've seen that reported. And that wasn't even...

It depends on the data, again.

Yeah, it depends on the data. But if you've got a lot of numerical columns, a lot of columns that have a lot of similar data in them, which I think is pretty common, especially if we're thinking it's healthcare-related or financial, transactional data, a lot of that is numerical stuff.
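One way to produce such columnar archives from a live Postgres database is sketched below, using DuckDB's postgres extension; the connection string, schema, table, column and file names are all hypothetical:

```bash
# Read old rows straight from Postgres and write them as zstd-compressed Parquet.
duckdb -c "
  INSTALL postgres; LOAD postgres;
  ATTACH 'dbname=appdb host=localhost' AS pg (TYPE postgres);
  COPY (SELECT * FROM pg.public.orders WHERE created_at < DATE '2020-01-01')
    TO 'orders_pre_2020.parquet' (FORMAT parquet, COMPRESSION zstd);
"
```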
Well, again, okay: Parquet is interesting, I think it's an option as well. And in this case it also depends on access patterns later, right? The first thing we need to answer: do we have access to physical, or only logical? Logical we obviously have access to, because this is our database, right? So, physical versus logical, a thing we discussed many times. But physical also can be compressed, right? Both WAL-G and pgBackRest compress the data directory and WALs. So why not? The only thing, the super important thing to remember, is that you cannot just take the data directory and that's it: you need to make sure WALs are also archived. If it's only logical, okay, there is the Parquet option, but there is also the pg_dump option, which also has compression.
Yeah, but we're not getting the... this is compression at the page level, right? Or kind of the row level. And I think column-level compression is hard to compete with.

Well, is it page level or file level? What's the difference?

There is a difference.

Huh, how?
Well, from pg_dump I would expect 10x compression sometimes, in some cases. There are options in pg_dump, right? Zstd I would check, definitely. Do you remember if pg_dump allows you to control the compression level? Can you set compression level 19? It will be terribly slow, right? But if you're okay with that...

That's the nice thing about this use case: we might be okay with really slow retrieval, really slow restores, because the main point is to keep it for many years.

Right, right.
Anyway, pg_dump has supported this for multiple years already. It has the options, and you can control the compression level there. You cannot control the compression level in wal_compression, which compresses full-page writes, full-page images, inside WAL, right? There was my proposal there; I need to finish it. There you can control the algorithm, but not the level of compression. I'm not sure everyone needs it, but still, for completeness, we should support it in Postgres, right?
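For reference, a minimal sketch of that setting: wal_compression takes an algorithm name but, as of current releases, no level:

```bash
# Choose the algorithm for compressing full-page images in WAL: 'pglz', 'lz4', 'zstd', 'on', or 'off'.
psql -c "ALTER SYSTEM SET wal_compression = 'zstd';"
psql -c "SELECT pg_reload_conf();"
```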
But for pg_dump, definitely, you can control the level. So if you are okay to wait and to load the CPU a lot... If you try zstd at level 19, I think it should be super slow, but maybe beneficial in terms of size. And of course restore will also be slow, right? Yeah.
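A minimal sketch of that pg_dump invocation, with a hypothetical database name and path (pg_dump 16+ accepts an explicit zstd level):

```bash
# Custom-format dump compressed with zstd at level 19: very CPU-hungry, but small on disk.
pg_dump -Fc --compress=zstd:19 -f /archive/appdb.dump appdb
```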
Yeah. I think that is something to factor in if people are deciding which way to go, though: what kind of questions are you going to get about this old data? Like, are there going to be auditors that want to come along and be able to query it? If you're going to get advance warning that they're coming, does it matter that it takes a day to restore? Are they going to be analytical queries? Are people going to want aggregations on it? Because if it's going to be mostly aggregates, maybe you do want it in some columnar format where you can get those done quickly.

Well, you confused me, but obviously when you compress a dump, it's file level, so the whole thing is compressed. It's not page level.

That makes sense, yeah.

It should be good. Page level is when, for example, you store something on ZFS; there it's page level. Here, no, the whole thing is compressed. It can be super slow, of course, but quite well compressed.
I'm not sure how it will compete with Parquet, I don't have a lot of experience with it. In production we have a customer who has a self-managed database and archives it from time to time in Parquet format.

Yeah, and it's great?

Yeah, it works. It's an interesting setup. But anyway, I think there is potential here: if somebody wants to write an article or do some research, there are many options to discover and benchmark here. We could build some, you know, decision tree: physical or logical, how to decide. I would stick with physical always if recovery time matters and if we want continuous archiving, and if we have access, of course, because if there's no access, this path is blocked. Logical: okay, dump or Parquet format.
For a dump, we have compression options. And I would definitely not choose a single file. Well, again, it depends, right? But normally we choose the custom format because we want compression and parallelization, right? Because if you want a single-file dump, then you cannot use parallelization, and you will be limited by a single vCPU worth of work when dumping and restoring. When you use the custom format, you can say how many jobs, how many vCPUs, to utilize when dumping, and separately when restoring; it can be a different number. And this matters a lot if we want to dump faster and restore faster, right?
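A sketch of that, with hypothetical names. One detail worth noting: parallel dump (-j) requires the directory format (-Fd), while both directory- and custom-format archives can be restored in parallel with pg_restore -j:

```bash
# Dump with 8 parallel workers, restore with 16; the two numbers are independent.
pg_dump -Fd -j 8 --compress=zstd -f /archive/appdb_dir appdb
pg_restore -j 16 -d appdb_restored /archive/appdb_dir
```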
And then, finally, so this is a decision-making tree, right, finally: where to store it. An archive definitely shouldn't be stored on expensive SSD disks, right? Object storage is great if it's cloud. If it's not cloud, some cheap disks which we don't use a lot, so they serve as the archive. Or even... is tape still an option? I think in many organizations, older organizations, it still is. Well, it has a huge lifespan, right? So tape, why not. Anyway, it should be something which we know is reliable and will live very long, right?

But we also need to take the restoration timing into account, right? Maybe. Because Glacier is super cheap, S3 Glacier is like the cheapest, but also the slowest to recover from, right?

Yeah, to fetch data from there. So it's interesting, right? And physical backups, physical archives, can also be placed into various locations. So the location question, I think, is a separate decision.

Well, yeah. I think there are a couple of other things to factor in, because if you do store it in Parquet, for example, you can query it without restoring. You don't even have to restore to get the data.

Yeah, that's a great option.
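For example, an auditor-style aggregate can run directly over the archived files. A sketch with hypothetical file and column names; DuckDB (or a similar engine) can read them from local disk or from S3:

```bash
# Query the Parquet archive in place, no Postgres restore needed.
duckdb -c "
  SELECT date_trunc('year', created_at) AS year, count(*) AS orders, sum(total) AS revenue
  FROM 'archive/orders_*.parquet'
  GROUP BY 1 ORDER BY 1;
"
```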
Also, the logical dump path is great because it allows you a partial archive, if you want only specific tables. And also, of course, you need to take into account that you only archive data, no helper data; helper data being indexes or materialized views, derived things. Which means that recovery will take longer, right? So I would say, if you are going to drop some database and keep it for historical purposes, even if you have physical access, I would probably still do a dump. A five-terabyte dump, on a powerful machine, definitely within one hour, you know. But yeah, we talked about that recently, if it's done with many threads. Well, by the way, dumps are always interesting.
I think it doesn't matter, Parquet or native pg_dump: if you have an unpartitioned case, for example five terabytes, but four of them is a single table. We see it a lot, actually: one huge table, usually called logs, or history, or something like historical events.

Yeah, sometimes queue-like workloads.

Yeah, something accumulated, and then people think about how to get rid of it. Partitioning is needed there, right? Because without partitioning, dumping will be single-threaded as well. Although you can create your own tool, and I think pg_dump could actually support dumping a single table using ID ranges with multiple parallel threads; theoretically it's possible, and you can write this. But out of the box, pg_dump will limit you, and dumping 5 terabytes will probably look like this: you say, I want to utilize all my 16 cores and use 16 threads to send it to S3; the first 10 minutes all 16 threads are utilized, and then for a couple of hours just a single thread is working, because of that one huge table. That's why partitioning would make archiving much faster and more flexible to tune, like, let's move faster. Because, again, if it's a live database, every time we archive logically we have a long-running transaction, and we affect vacuum behavior even if it's on replicas which have hot_standby_feedback on, right? So this can lead to higher bloat and various negative consequences.

I don't see it as much anymore, but I used to get quite frustrated seeing partitioning advice that mostly focused on the read performance benefits of partitioning. The more I've learned about it over the years, the more I think the main reason to partition is for these maintenance tasks and for the health of the database overall. Time-based in particular: if you've got an archiving requirement, if you're ever going to need to remove data based on time, time-based partitioning is just a dream.
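For reference, a minimal sketch of time-based range partitioning (table and column names are hypothetical). Each year lands in its own partition, so old years can be archived, detached, and dropped cleanly:

```bash
psql <<'SQL'
CREATE TABLE events (
  id         bigserial,
  created_at timestamptz NOT NULL,
  payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024 PARTITION OF events
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE events_2025 PARTITION OF events
  FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
SQL
```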
Yeah, you're absolutely right.

As some LLMs say.

Okay, yeah. Right, so exactly: archiving, restoring, dump, restore, how vacuum works, how index creation works. Even physical backups benefit from partitioning, because updates and deletes become localized to specific files. Postgres keeps tables and indexes in files of up to one gigabyte, so if it's five terabytes, four of which is a single table, it will be chunked into one-gigabyte files. And imagine we have a single unpartitioned table: writes can go to any of these files all the time, any blocks, any pages, all the time. It's not localized, right? In this case, if you use snapshots, RDS snapshots, they have incremental support, right? So the first snapshot is full, and the next snapshot is just a diff at the block level. If you use WAL-G or pgBackRest, they also have delta backups, or incremental backups, or differential backups, where also only block-level changes are backed up, right? So if you have a huge table that you constantly write to everywhere, your delta will be bigger, compared to the case when you write only to a specific file, like our latest file, and only occasionally update old records. In that case the deltas will be much more localized, and this is also beneficial, right? So even physical backups will benefit from partitioning. So, general advice: yes, our general advice was always, if one table exceeds a hundred gigabytes, it's time to partition. People ask: do we count it with indexes, with TOAST? Well, without. It's a very rough number, so recently we started saying: okay, if it's already exceeding 50 gigabytes, let's think about partitioning, right?

Yeah, without indexes, maybe. Because the point is: don't worry about it if your whole database is one gigabyte; but also, if your largest table is already a terabyte, maybe it's a good time to... it's late, you know what I mean? The reason for giving a rule of thumb is to give people an idea of where they stand, roughly, and if you're close to that number, to be thinking about it.
We have funny stories when AI companies come to us for help, because they grow too fast and they need help, they need database expertise. We have a bunch of AI companies, startups, and it was an interesting story: some company already reached between 10 and 20 terabytes, and they have a table exceeding a terabyte, and someone from our team said they should have partitioned a year ago. But I said: they just launched a year ago, right? So the pace is very, very high. And I think Postgres needs improvements in the area of partitioning, to simplify the developer experience, and I hope we will work in this area next year with our tooling.

It has got a lot better at native partitioning. It feels like maybe not 18 as much, but most releases before then, since it was introduced, had significant improvements, right?

There were, in terms of various things: DETACH PARTITION CONCURRENTLY, support for more operations, locking. But still it's super painful, you know.

You mean, basically, you need to allocate an engineer for a month or two to implement it?

And we have great guidelines which we give our clients, but still, lack of automation.

You mean, like, to migrate from a non-partitioned setup to a partitioned setup?

Yeah, yeah. The hardest part here is that everyone is using various ORMs, and they want this tooling to be closer to their language, right? There is, of course, pg_partman, but still.

Oh, okay, so it's about the creation of new... okay, interesting.

Yeah, it's about creating new partitions and then maintaining them. Because when you already have a partitioned schema, several risks get amplified. One of them is very well known: the lock manager, right? I had a series of blog posts recently, studying Postgres 18 deeper, also in the context of prepared statements, and it was interesting. And another thing is that any DDL, any schema change, might be really painful, right? Even simple index creation, or dropping an index, can be painful, or foreign keys, right? So yeah, every time we see people implement some nuances themselves, and it feels like Postgres needs to solve it.
Fair.
All right.
Anything else on the archiving front?
Final thing.
If you archived something but haven't tested it... Right?
It's like backups, right?
Yeah, Schrödinger.
When we say tested, you mean: can it be restored? Can it be queried? And also, is it the same as the data was originally? It's verifying as well, right? Do some basic checks.
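A sketch of such a basic verification pass (database, path, table and checks are hypothetical): restore into a scratch database and compare a few numbers recorded at archive time:

```bash
createdb archive_check
pg_restore -j 8 -d archive_check /archive/appdb.dump
psql -d archive_check -c "SELECT count(*) FROM orders;"          # compare with the recorded row count
psql -d archive_check -c "SELECT max(created_at) FROM orders;"   # should match the archive cut-off
```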
Exactly. So unverified, unrestored backups and archives are, like, undercooked.

Yeah, great point.

Just simply undercooked, and it should be considered a huge, critical mistake. That's it. A huge risk, right? Yeah. So if you did something but never ran it... If you write code but never run it and ship it, how does it feel? The same with backups: you created a backup, but you haven't tested it. Archives too, yeah. So it's a critical mistake. That's it.
Nice, good addition.
Yeah, this addition was almost overlooked.
Yeah. Okay, nice.
Well, thanks again, and catch you next time.
Bye-bye.
