Postgres FM - Archiving
Episode Date: December 19, 2025
Nik and Michael discuss a listener question about archiving a database. Here are some links to things they mentioned:
Listener request to talk about archiving: https://www.youtube.com/watch?...v=KFRK8PiIvTg&lc=UgyiFrO37gEgUaVhRgN4AaABAg
Our episode on "Is pg_dump a backup tool?": https://postgres.fm/episodes/is-pg_dump-a-backup-tool
What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!
Postgres FM is produced by: Michael Christofides, founder of pgMustard, and Nikolay Samokhvalov, founder of Postgres.ai
With credit to: Jessie Draws for the elephant artwork
Transcript
Hello, hello. This is Postgres FM. My name is Nik, as usual, Postgres.AI. And as usual, with me, Michael, pgMustard. Hi, Michael.
Hello, Nik. So we have a question today from one of the listeners, and it's about archiving. There are so many options for what that can mean. Where to start?
Yeah, well, they specifically asked whether it was possible for us to do a video on archiving, and they said they want to archive a database that is about 5 terabytes in size.
So what do you assume that means when you first hear it?
Well, if you are familiar with Postgres internals, and it's not even internals, this is about Postgres architecture, archiving is simple. You just set up archive_command, that's it, right? But it's a joke. I think this is not what was meant here.
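For reference, the built-in WAL archiving Nik is joking about is configured roughly like this. A minimal sketch, assuming WAL-G is installed and configured; any command that safely stores the segment can be used instead:

```bash
# Built-in WAL archiving: every completed WAL segment is handed to archive_command.
psql -c "ALTER SYSTEM SET archive_mode = 'on';"
psql -c "ALTER SYSTEM SET archive_command = 'wal-g wal-push %p';"
# archive_mode only takes effect after a restart
pg_ctl restart -D "$PGDATA"
```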
Though, yeah. I think 5 terabytes these days is an average-size database. Who doesn't agree? Well, I see bigger databases, and 5 terabytes is not a big database anymore. It was big 10 years ago; these days big means more than 10 terabytes, that's just it, right? So 5 terabytes is an average-size database. And archiving it means that we want to keep it longer term, right? That's how I read it, I think. So, longer term, and/or we probably also want to clean up sometimes.

Oh, interesting. Yeah, to clean up.

Well, yeah, because, for example, imagine it's e-commerce: we accumulated order history for many, many years, it's not used, but it's so much data. So why not archive it and then clean up? Straightforward cleanup in Postgres with DELETE, as we discussed many times, is not an easy operation in large databases, because of MVCC, vacuum and all these complexities. But if it's partitioned, if the bigger tables are partitioned, then cleanup is easy, right? We just detach and drop old partitions, that's it. And before that, obviously, archiving makes a lot of sense. But even if we don't have a task to clean up, archiving might make sense for backup purposes, right? We want to keep an archive for long.
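For reference, the partition-based cleanup Nik describes might look like this (table names are hypothetical; DETACH ... CONCURRENTLY needs Postgres 14+):

```bash
# After archiving the old partition somewhere safe, detach it and drop it.
psql -c "ALTER TABLE orders DETACH PARTITION orders_2019 CONCURRENTLY;"
psql -c "DROP TABLE orders_2019;"
```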
Yeah, or maybe there are requirements, right? Yeah. I think there's an argument that it's slightly different to backups, but I'm not actually certain that the way I've internalized that is helpful in any practical sense, because if you've got data archived, it's kind of a backup, right? Like, it is another source of old data. But I do think they're different, in that you wouldn't expect an archive to have up-to-date data, whereas you would expect a backup to have up-to-date data. So I do think there are subtle differences. And also, I think you back up and you archive for different reasons, and therefore there are different trade-offs. So if we're backing up in order to be able to get back online quickly if we have a disaster, you might want to put it somewhere different, in a different format, than if we archive it because we need access to it, but not necessarily performantly.
For me, in real life, take for example our DBLab, or experiments. Experiments should be done on clones, right? On thin clones, with branching, if it's DBLab, for example, or on full-fledged clones on separate virtual machines if it's self-managed, right? And we say: let's restore either from a snapshot, a cloud disk snapshot, or let's restore from backups, or from archive. These are purely synonyms for me, like "let's restore from archive". And that's why archive_command is a part of backups, right? It's archiving WALs. And we say we have many, like, versions of our database... not versions, many exemplars of our database. And archive, or backups: backups, archives, it's the same for me in this case.
And of course we want it to be quickly restored. But these days, if it's cloud, it's easy to configure. So you basically archive it, back it up using pgBackRest or WAL-G. It goes to S3 or other object storage, like GCS, or, I'm not an expert in Azure, Blob Storage, they call it, right? And there you can configure it so that, for example, a couple of weeks later the backup automatically moves to colder storage. On S3 it's called Glacier. Well, all of them are archives, one is just hotter and another is colder, right? It's Glacier, or in GCS, in Google Cloud, I think it's called Coldline, I don't remember the name exactly. But the idea is it's cheaper, yeah, but slower to restore.
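Such a policy is typically configured on the bucket. A minimal sketch with a hypothetical bucket name and prefix (the prefix depends on how your backup tool lays out objects):

```bash
# Transition base backups to Glacier 14 days after they are written.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "backups-to-glacier",
      "Status": "Enabled",
      "Filter": { "Prefix": "basebackups_005/" },
      "Transitions": [ { "Days": 14, "StorageClass": "GLACIER" } ]
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-pg-backups --lifecycle-configuration file://lifecycle.json
```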
Yes. And this still causes a bug which has been known for many years in WAL-G. I recently coded a fix, because the transition changes the modification time, and LATEST becomes broken. If you want to restore using the uppercase LATEST with WAL-G, wal-g backup-fetch LATEST, in this case it might fetch the wrong backup, because it's based on modification time. And when a backup is moved automatically to colder storage, the modification time of that old backup changes, so LATEST might be broken. I recently sent a pull request to Andrey to fix it in WAL-G, so it should be fixed, right? Yeah. But I mean, modification time cannot be trusted here, obviously.
But it's a good idea, and it's great that clouds support it. It's super easy to do: you just configure some policy, how long to keep them, when to move to colder storage. So it allows you, for the same money, to keep archives for longer, right?

Or, for example, if you're doing this for compliance reasons and the length of time is predetermined, how long you have to keep it for, you can keep it for that long for cheaper. That's how it feels to me, right?

Yeah, yeah. So obviously it's not SSD anymore, it's HDD, but it's still super reliable. And as I understand, they promise a lot of nines, an amazing number of nines.

Although here we're talking about durability, not availability, right? Yeah.
Exactly. So obviously, data won't be lost. That's the key. And it can be stored for years and years and years, if you want, right? And that's great. The only risk that remains: if somebody gets hold of your cloud account access, they might destroy everything. To protect from that, there is a lock option, right? You can lock it and basically say: you cannot delete, nobody can delete.

Or you can copy it to a different account, or even a different cloud.
Yeah, with different permissions.
Right.
Right.
And if it's, again, if it's backups using pgBackRest or WAL-G: WAL-G has a special command for this, for this copy. So you can copy from one backup location, one archive location, to a different archive location.
Yeah.
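The same idea can be sketched with a plain object-level copy under separate credentials (bucket names and the profile are hypothetical; WAL-G's own copy command, mentioned above, is another option):

```bash
# Keep a second copy of the backup bucket in another account/bucket with different permissions.
aws s3 sync s3://prod-pg-backups s3://offsite-pg-archive --profile archive-account
```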
But if you are on managed Postgres, in this case physical backups are not available, unless it's Crunchy Bridge, right, which is great in that they allow access to physical backups and WALs. And in this case the only option that remains is to create a logical backup, which I prefer to call a dump. And we had an episode on that: is pg_dump a backup tool? Is a logical backup a backup?
My answer is still no, because... but okay.

Is it an archive, though? Because I think archiving almost always isn't incremental, right? You do it in batches, right? You take a whole year and archive it, so in this case it's not changing. So a logical dump kind of is an archive.

Well, for me, again, I don't feel this difference. I hear you, you feel some difference. But for me, all backups are also an archive. They're just continuous, thanks to the WAL archive. They consist of two parts: full copies, or deltas (incremental, differential, there are many names there), but basically a full copy of the data directory plus WALs, right? And WALs make it continuous. This archive becomes continuous.
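In tool terms, that combination of a full copy plus WALs looks roughly like this (a sketch assuming WAL-G is configured through its environment variables):

```bash
# Full copy of the data directory, pushed to object storage...
wal-g backup-push "$PGDATA"
# ...while archive_command = 'wal-g wal-push %p' keeps shipping WAL segments,
# which is what makes the archive continuous (point-in-time recovery).
```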
This is where I think we have a difference,
because I think these...
Let's go back to the question, right?
They want to archive a database that's five terabytes in size.
That's all the information we got.
Yeah.
We don't know. Like, there's a chance that it's an old product that's no longer served and they just need to keep the data for a certain amount of time. It could be that they don't need access to it anymore. It could be all manner of things.

They might want to drop the database and just keep it for history, right? So, to answer this question directly, the first thing we need to understand: is it a managed service, or do we have access to the physical level? If it's a managed service, the only option for us is pg_dump.
No, I disagree.
Why?
Well, because I think you could, depending on the parameters, right, export it in a different format, right? There's a lot of... a lot of work there.

Okay, a logical backup is the only option here.
Yes.
Although, technically speaking, we can say: okay, let's create a snapshot, consider it the archive, and rely on RDS to keep it for longer.
Why not?
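That path is essentially a manual snapshot kept around; a sketch with hypothetical identifiers (manual RDS snapshots are retained until you delete them):

```bash
# A manual RDS snapshot kept as a long-term "archive".
aws rds create-db-snapshot \
  --db-instance-identifier appdb-prod \
  --db-snapshot-identifier appdb-archive-2025-12
```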
Yeah, but here's what I'm thinking, though. And five terabytes is not very large, right, as you were saying at the beginning. But in the general case, where we care about cost, and it seems like cost actually matters here, it's one of the main reasons for doing this in the first place... If we care a lot about cost, it feels to me like some of the newer formats that compress extremely well... And maybe "newer" is not the right term, but it's new to me in terms of coming across it from a Postgres perspective. Formats like Iceberg and Parquet, that compress down extremely well, feel to me like they would be the cheapest way of archiving things: data that's static and is not going to be changing, or you're going to get a new batch of it once per year, like the archiving use cases that I'm thinking of. If we exported in those formats, they'd be tiny. Like, the five terabytes might go down 95% or so.
Oh, I doubt it. 20 times, I doubt. Well, maybe. Okay, what have you seen?

Yeah, I've seen people reporting that kind of number, even on the Timescale compression side, I've seen that reported. And that wasn't even...

It depends on the data, again.

Yeah, it depends on the data. But if you've got a lot of numerical columns, a lot of columns that have a lot of similar data in them, which I think is pretty common, especially if we're thinking it's healthcare-related or financial, transactional data, a lot of that is numerical stuff.
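One way to produce such columnar archives from a live Postgres database is sketched below, using DuckDB's postgres extension; the connection string, schema, table, column and file names are all hypothetical:

```bash
# Read old rows straight from Postgres and write them as zstd-compressed Parquet.
duckdb -c "
  INSTALL postgres; LOAD postgres;
  ATTACH 'dbname=appdb host=localhost' AS pg (TYPE postgres);
  COPY (SELECT * FROM pg.public.orders WHERE created_at < DATE '2020-01-01')
    TO 'orders_pre_2020.parquet' (FORMAT parquet, COMPRESSION zstd);
"
```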
Well, again, okay: Parquet is interesting, I think it's an option as well. And in this case it also depends on access patterns later, right? The first thing we need to answer: do we have access to physical, or only logical? Logical we obviously have access to, because this is our database, right? So, physical versus logical, a thing we discussed many times. But physical also can be compressed, right? Both WAL-G and pgBackRest compress the data directory and WALs. So why not? The only thing, the super important thing to remember, is that you cannot just take the data directory and that's it: you need to make sure WALs are also archived. If it's only logical, okay, there is the Parquet option, but there is also the pg_dump option, which also has compression.
Yeah, but we're not getting the... this is compression at the page level, right? Or kind of the row level. And I think column-level compression is hard to compete with.

Well, is it page level or file level? What's the difference?

There is a difference.

Huh, how?
Well, from pg_dump I would expect 10x compression sometimes, in some cases. There are options in pg_dump, right? Zstd I would check, definitely. Do you remember if pg_dump allows you to control the compression level? Can you set compression level 19? It will be terribly slow, right? But if you're okay with that...

That's the nice thing about this use case: we might be okay with really slow retrieval, really slow restores, because the main point is to keep it for many years.

Right, right.
Anyway, pg_dump has supported this for multiple years already. It has the options, and you can control the compression level there. You cannot control the compression level in wal_compression, which compresses full-page writes, full-page images, inside WAL, right? There was my proposal there; I need to finish it. There you can control the algorithm, but not the level of compression. I'm not sure everyone needs it, but still, for completeness, we should support it in Postgres, right?
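For reference, a minimal sketch of that setting: wal_compression takes an algorithm name but, as of current releases, no level:

```bash
# Choose the algorithm for compressing full-page images in WAL: 'pglz', 'lz4', 'zstd', 'on', or 'off'.
psql -c "ALTER SYSTEM SET wal_compression = 'zstd';"
psql -c "SELECT pg_reload_conf();"
```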
But for pg_dump, definitely, you can control the level. So if you are okay to wait and to load the CPU a lot... If you try zstd at level 19, I think it should be super slow, but maybe beneficial in terms of size. And of course restore will also be slow, right? Yeah.
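A minimal sketch of that pg_dump invocation, with a hypothetical database name and path (pg_dump 16+ accepts an explicit zstd level):

```bash
# Custom-format dump compressed with zstd at level 19: very CPU-hungry, but small on disk.
pg_dump -Fc --compress=zstd:19 -f /archive/appdb.dump appdb
```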
Yeah. I think that is something to factor in if people are deciding which way to go, though: what kind of questions are you going to get about this old data? Like, are there going to be auditors that want to come along and be able to query it? If you're going to get advance warning that they're coming, does it matter that it takes a day to restore? Are they going to be analytical queries? Are people going to want aggregations on it? Because if it's going to be mostly aggregates, maybe you do want it in some columnar format where you can get those done quickly.

Well, you confused me, but obviously when you compress a dump, it's file level, so the whole thing is compressed. It's not page level.

That makes sense, yeah.

It should be good. Page level is when, for example, you store something on ZFS; there it's page level. Here, no, the whole thing is compressed. It can be super slow, of course, but quite well compressed.
I'm not sure how it will compete with Parquet, I don't have a lot of experience with it. In production we have a customer who has a self-managed database and archives it from time to time in Parquet format.

Yeah, and it's great?

Yeah, it works. It's an interesting setup. But anyway, I think there is potential here: if somebody wants to write an article or do some research, there are many options to discover and benchmark here. We could build some, you know, decision tree: physical or logical, how to decide. I would stick with physical always if recovery time matters and if we want continuous archiving, and if we have access, of course, because if there's no access, this path is blocked. Logical: okay, dump or Parquet format.
For a dump, we have compression options. And I would definitely not choose a single file. Well, again, it depends, right? But normally we choose the custom format because we want compression and parallelization, right? Because if you want a single-file dump, then you cannot use parallelization, and you will be limited by a single vCPU worth of work when dumping and restoring. When you use the custom format, you can say how many jobs, how many vCPUs, to utilize when dumping, and separately when restoring; it can be a different number. And this matters a lot if we want to dump faster and restore faster, right?
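A sketch of that, with hypothetical names. One detail worth noting: parallel dump (-j) requires the directory format (-Fd), while both directory- and custom-format archives can be restored in parallel with pg_restore -j:

```bash
# Dump with 8 parallel workers, restore with 16; the two numbers are independent.
pg_dump -Fd -j 8 --compress=zstd -f /archive/appdb_dir appdb
pg_restore -j 16 -d appdb_restored /archive/appdb_dir
```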
And then, finally, so this is a decision-making tree, right, finally: where to store it. An archive definitely shouldn't be stored on expensive SSD disks, right? Object storage is great if it's cloud. If it's not cloud, some cheap disks which we don't use a lot, so they serve as the archive. Or even... is tape still an option? I think in many organizations, older organizations, it still is. Well, it has a huge lifespan, right? So tape, why not. Anyway, it should be something which we know is reliable and will live very long, right?

But we also need to take the restoration timing into account, right? Maybe. Because Glacier is super cheap, S3 Glacier is like the cheapest, but also the slowest to recover from, right?

Yeah, to fetch data from there. So it's interesting, right? And physical backups, physical archives, can also be placed into various locations. So the location question, I think, is a separate decision.

Well, yeah. I think there are a couple of other things to factor in, because if you do store it in Parquet, for example, you can query it without restoring. You don't even have to restore to get the data.

Yeah, that's a great option.
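For example, an auditor-style aggregate can run directly over the archived files. A sketch with hypothetical file and column names; DuckDB (or a similar engine) can read them from local disk or from S3:

```bash
# Query the Parquet archive in place, no Postgres restore needed.
duckdb -c "
  SELECT date_trunc('year', created_at) AS year, count(*) AS orders, sum(total) AS revenue
  FROM 'archive/orders_*.parquet'
  GROUP BY 1 ORDER BY 1;
"
```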
Also, the logical dump path is great because it allows you a partial archive, if you want only specific tables. And also, of course, you need to take into account that you only archive data, no helper data; helper data being indexes or materialized views, derived things. Which means that recovery will take longer, right? So I would say, if you are going to drop some database and keep it for historical purposes, even if you have physical access, I would probably still do a dump. A five-terabyte dump, on a powerful machine, definitely within one hour, you know. But yeah, we talked about that recently, if it's done with many threads. Well, by the way, dumps are always interesting.
I think it doesn't matter, Parquet or native pg_dump: if you have an unpartitioned case, for example five terabytes, but four of them is a single table. We see it a lot, actually: one huge table, usually called logs, or history, or something like historical events.

Yeah, sometimes queue-like workloads.

Yeah, something accumulated, and then people think about how to get rid of it. Partitioning is needed there, right? Because without partitioning, dumping will be single-threaded as well. Although you can create your own tool, and I think pg_dump could actually support dumping a single table using ID ranges with multiple parallel threads; theoretically it's possible, and you can write this. But out of the box, pg_dump will limit you, and dumping 5 terabytes will probably look like this: you say, I want to utilize all my 16 cores and use 16 threads to send it to S3; the first 10 minutes all 16 threads are utilized, and then for a couple of hours just a single thread is working, because of that one huge table. That's why partitioning would make archiving much faster and more flexible to tune, like, let's move faster. Because, again, if it's a live database, every time we archive logically we have a long-running transaction, and we affect vacuum behavior even if it's on replicas which have hot_standby_feedback on, right? So this can lead to higher bloat and various negative consequences.

I don't see it as much anymore, but I used to get quite frustrated seeing partitioning advice that mostly focused on the read performance benefits of partitioning. The more I've learned about it over the years, the more I think the main reason to partition is for these maintenance tasks and for the health of the database overall. Time-based in particular: if you've got an archiving requirement, if you're ever going to need to remove data based on time, time-based partitioning is just a dream.
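For reference, a minimal sketch of time-based range partitioning (table and column names are hypothetical). Each year lands in its own partition, so old years can be archived, detached, and dropped cleanly:

```bash
psql <<'SQL'
CREATE TABLE events (
  id         bigserial,
  created_at timestamptz NOT NULL,
  payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024 PARTITION OF events
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE events_2025 PARTITION OF events
  FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
SQL
```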
Yeah, you're absolutely right.

As some LLMs say.

Okay, yeah. Right, so exactly: archiving, restoring, dump, restore, how vacuum works, how index creation works. Even physical backups benefit from partitioning, because updates and deletes become localized to specific files. Postgres keeps tables and indexes in files of up to one gigabyte, so if it's five terabytes, four of which is a single table, it will be chunked into one-gigabyte files. And imagine we have a single unpartitioned table: writes can go to any of these files all the time, any blocks, any pages, all the time. It's not localized, right? In this case, if you use snapshots, RDS snapshots, they have incremental support, right? So the first snapshot is full, and the next snapshot is just a diff at the block level. If you use WAL-G or pgBackRest, they also have delta backups, or incremental backups, or differential backups, where also only block-level changes are backed up, right? So if you have a huge table that you constantly write to everywhere, your delta will be bigger, compared to the case when you write only to a specific file, like our latest file, and only occasionally update old records. In that case the deltas will be much more localized, and this is also beneficial, right? So even physical backups will benefit from partitioning. So, general advice: yes, our general advice was always, if one table exceeds a hundred gigabytes, it's time to partition. People ask: do we count it with indexes, with TOAST? Well, without. It's a very rough number, so recently we started saying: okay, if it's already exceeding 50 gigabytes, let's think about partitioning, right?

Yeah, without indexes, maybe. Because the point is: don't worry about it if your whole database is one gigabyte; but also, if your largest table is already a terabyte, maybe it's a good time to... it's late, you know what I mean? The reason for giving a rule of thumb is to give people an idea of where they stand, roughly, and if you're close to that number, to be thinking about it.
We have funny stories when AI companies come to us for help, because they grow too fast and they need help, they need database expertise. We have a bunch of AI companies, startups, and it was an interesting story: some company already reached between 10 and 20 terabytes, and they have a table exceeding a terabyte, and someone from our team said they should have partitioned a year ago. But I said: they just launched a year ago, right? So the pace is very, very high. And I think Postgres needs improvements in the area of partitioning, to simplify the developer experience, and I hope we will work in this area next year with our tooling.

It has got a lot better at native partitioning. It feels like maybe not 18 as much, but most releases before then, since it was introduced, had significant improvements, right?

There were, in terms of various things: DETACH PARTITION CONCURRENTLY, support for more operations, locking. But still it's super painful, you know.

You mean, basically, you need to allocate an engineer for a month or two to implement it?

And we have great guidelines which we give our clients, but still, lack of automation.

You mean, like, to migrate from a non-partitioned setup to a partitioned setup?

Yeah, yeah. The hardest part here is that everyone is using various ORMs, and they want this tooling to be closer to their language, right? There is, of course, pg_partman, but still.

Oh, okay, so it's about the creation of new... okay, interesting.

Yeah, it's about creating new partitions and then maintaining them. Because when you already have a partitioned schema, several risks get amplified. One of them is very well known: the lock manager, right? I had a series of blog posts recently, studying Postgres 18 deeper, also in the context of prepared statements, and it was interesting. And another thing is that any DDL, any schema change, might be really painful, right? Even simple index creation, or dropping an index, can be painful, or foreign keys, right? So yeah, every time we see people implement some nuances themselves, and it feels like Postgres needs to solve it.
Fair.
All right.
Anything else on the archiving front?
Final thing.
If you archived something but haven't tested it... Right?
It's like backups, right?
Yeah, Schrödinger.
When we say tested, you mean: can it be restored? Can it be queried? And also, is it the same as the data was originally? It's verifying as well, right? Do some basic checks.
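A sketch of such a basic verification pass (database, path, table and checks are hypothetical): restore into a scratch database and compare a few numbers recorded at archive time:

```bash
createdb archive_check
pg_restore -j 8 -d archive_check /archive/appdb.dump
psql -d archive_check -c "SELECT count(*) FROM orders;"          # compare with the recorded row count
psql -d archive_check -c "SELECT max(created_at) FROM orders;"   # should match the archive cut-off
```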
Exactly. So unverified, unrestored backups and archives are, like, undercooked.

Yeah, great point.

Just simply undercooked, and it should be considered a huge, critical mistake. That's it. A huge risk, right? Yeah. So if you did something but never ran it... If you write code but never run it and ship it, how does it feel? The same with backups: you created a backup, but you haven't tested it. Archives too, yeah. So it's a critical mistake. That's it.
Nice, good addition.
Yeah, this addition was almost overlooked.
Yeah. Okay, nice.
Well, thanks again, and catch you next time.
Bye-bye.
