Postgres FM - pg_upgrade: the tricky and dangerous parts
Episode Date: June 30, 2023

Nikolay (alone, again) reveals some issues that might hit those who perform major PostgreSQL upgrades with minimal downtime.

Links:
- "Upgrades" – PostgresFM episode 037: https://postgres.fm/...episodes/upgrades
- recovery_target_lsn: https://postgresqlco.nf/doc/en/param/recovery_target_lsn/
- recovery_target_action: https://postgresqlco.nf/doc/en/param/recovery_target_action/
- pg_easy_replicate: https://github.com/shayonj/pg_easy_replicate/
- HN discussion of this tool: https://news.ycombinator.com/item?id=36405761
- Waiting for Postgres 16: Logical decoding on standbys: https://pganalyze.com/blog/5mins-postgres-16-logical-decoding
- pg_upgrade and logical replication (discussion in pgsql-hackers): https://www.postgresql.org/message-id/flat/20230217075433.u5mjly4d5cr4hcfe%40jrouhaud
- allow upgrading publisher node (proposal to natively support running pg_upgrade on publisher, WIP): https://commitfest.postgresql.org/43/4273/
- ❓🤔 pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption? (discussion in pgsql-hackers): https://www.postgresql.org/message-id/flat/CAM527d8heqkjG5VrvjU3Xjsqxg41ufUyabD9QZccdAxnpbRH-Q%40mail.gmail.com

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

If you would like to share this episode, here's a good link (and thank you!)

~~~

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:
- Jessie Draws for the amazing artwork
Transcript
Hello, hello, this is episode number 52. I screwed up a little bit in the beginning because I forgot to unmute myself in Zoom. Thank you, live audience on YouTube, who helped me to unmute myself. But unfortunately, I didn't notice it in the beginning. But okay, let's proceed. So we have pg_upgrade, which is kind of a very good tool to upgrade.
Like in ancient times, we needed to dump/restore our database to perform a major upgrade. But now we have pg_upgrade, great. But actually, we discussed it with Michael some time ago. We had an episode about minor and major upgrades. We discussed the high-level problems that we usually have and so on. But let me discuss particular problems when you use pg_upgrade to run major upgrades.
So, problem number one. If you have a huge database and some significant traffic to it, and you want to avoid downtime, there is a well-known idea: let's use logical replication to avoid downtime. Of course, if you use pg_upgrade as is, it will take significant time. And there is another recipe described in the documentation: to use the -k or --link option, which significantly speeds up the process because we switch to using hard links. Most of the files in PGDATA are not going to change during the upgrade, so hard links help. But some people say, okay, we don't want even some minutes. It will usually take up to 10 or 15 minutes, depending on cluster size and additional steps you perform. They say, we want to do it faster,
let's involve logical replication. And here we have a couple of dangers. First of all, if the cluster is really huge, it's very challenging to initialize a logical replica. When you initialize it in the traditional way, it's fully automated: if you just create a subscription to a new, empty cluster — already running the new Postgres version — great, it will be done automatically. But you will find it very challenging for really big databases, for example 5, 10, 15 terabytes under significant load, because at the logical level it basically involves a kind of dump/restore process, which takes long. And it also affects the health of the source cluster. By the way, in Postgres 16 — which is not yet released; yesterday we had only beta 2, and also the branch REL_16_STABLE was stamped, meaning that we're already close to release — logical decoding on standbys will be possible. Right now it's not possible, so it's possible only from the primary,
but even if it were from a standby, the problem would remain. You affect how autovacuum workers work: they cannot remove freshly dead tuples, because you're still copying data and you need that snapshot. So after the moment you started, if some tuples became dead, autovacuum cannot remove those dead tuples, and you will see in the logs that some number of tuples are dead but cannot yet be removed. Why? Because we still hold a snapshot, because we are still initializing our replica. And sometimes it's not even possible to finish. And as for load — you produce a lot of load on the primary. So it's kind of difficult. I actually never did it myself, I just know it doesn't feel like the right approach, but my colleagues tried, and they failed, even on smaller databases of about one terabyte. So how to do it better? Here is the recipe.
We call it physical-to-logical, and it's quite straightforward. First of all, we can have a secondary cluster. I'm going to use Patroni terminology here. So we have a secondary cluster with its own standby leader. They are all standbys, but one of them is the leader, because we have cascaded replication. So the standby leader is replicating using streaming replication from the main primary, and then it has its own standby nodes — cascaded replication. And Patroni can do that. You can say: I have a secondary cluster here, so it shouldn't have a real primary, but it has its own standby leader. This streaming replication is asynchronous — normal streaming replication.
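If you use Patroni, a rough sketch of declaring such a standby cluster could look like this (the host, port, file paths, and the exact way you apply the change are placeholders and depend on your setup):

```bash
# Hedged sketch: declare the secondary cluster as a Patroni "standby cluster",
# so its leader (the "standby leader") streams from the main cluster's primary.
cat > standby-cluster-fragment.yml <<'EOF'
standby_cluster:
  host: main-primary.example.com   # the main cluster's current primary (placeholder)
  port: 5432
EOF
# Apply it to the secondary cluster's DCS configuration, for example with
# patronictl (exact invocation depends on your Patroni version and setup):
#   patronictl -c /etc/patroni/patroni.yml edit-config --apply standby-cluster-fragment.yml
```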
And then we create a slot, and we create a publication on the source — on the primary, as I said. And once we do that, WALs start to accumulate on the primary more and more.
If we keep this slot unused, we know we will be out of disk space on the main primary, which we are using right now.
So we should move fast or we should have a lot of disk space. And of course, it's not good to spend a lot of time accumulating this lag basically, right?
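A quick way to keep an eye on how much WAL an idle slot is retaining on the main primary (a sketch; the host and credentials are placeholders):

```bash
# Show how much WAL is being retained because of (still unused) replication slots.
psql -h main-primary.example.com -U postgres -Atc "
  SELECT slot_name,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots;"
```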
If no one is using this slot, we have an LSN position. We know the LSN position when we create the slot using the SQL approach. You can do it in two ways, but the SQL approach is fine here. So you call pg_create_logical_replication_slot(), and it returns you the LSN. So the slot is at this position. Good. Now we have a trick. On our future primary, which is currently the standby leader, we just adjust the configuration. We say: instead of replicating everything using physical replication, we have a setting which is called recovery_target_lsn. And we put the LSN from the slot position right there, restart the server, and let it catch up and pause.
There is another setting, I don't remember the name,
which says which action to perform when
something like this is achieved. Like, okay, we achieved this LSN and we pause. Great.
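As a rough sketch (names, hosts and the example LSN are placeholders; the other setting mentioned here is recovery_target_action, which is in the show notes), the trick could look like this:

```bash
# 1) On the source primary: create the logical slot and note the LSN it returns.
psql -h main-primary.example.com -U postgres -Atc "
  SELECT lsn FROM pg_create_logical_replication_slot('upgrade_slot', 'pgoutput');"
# -> e.g. 5A/228E36A8  (placeholder value)

# 2) On the standby leader of the secondary cluster: stop physical recovery
#    exactly at that LSN and pause there. (With Patroni you would normally set
#    these through Patroni's configuration rather than ALTER SYSTEM.)
psql -c "ALTER SYSTEM SET recovery_target_lsn = '5A/228E36A8';"
psql -c "ALTER SYSTEM SET recovery_target_action = 'pause';"
# ...then restart the standby leader and let it replay up to that LSN and pause.
```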
And it does not replicate anything additional at the physical level. By the way, we found that if you use WAL shipping here, sometimes it replays slightly more than needed. So we probably need to explore this. But if you use streaming replication, this works really well. So at this point, we can shut it down. And here we have a trick — and here we have problem number two. But first of all, this already allows us to switch from physical replication to logical.
And this is how we can very quickly initialize a logical replica based on a physical replica, in cases when you need all of the database, or the majority of tables, to be replicated.
Of course, with logical, you can replicate only
some tables. And this recipe is not good in this case, if you need only like, for example,
10% of all your tables. But if you need everything, and in the case of a major upgrade,
we need everything. In this case, this physical-to-logical recipe is really good.
But now we have a second problem here. The first problem we already solved: instead of regular initialization, we use a physical replica which we convert from physical to logical. By the way, when we create the subscription, we need to say explicitly that we already have everything.
So we just need CDC.
Logical replication consists of two steps. First is the full initialization — bringing over a snapshot of the data. And second is CDC, change data capture — replaying changes from the slot.
So in case of physical to logical conversion, we don't need the first one because physical
already brought us everything.
And how we initialized the physical replica doesn't matter. It could be pg_basebackup, it could be pgBackRest or WAL-G fetching the whole database from backups.
It could be cloud snapshots or anything else.
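A hedged sketch of what that looks like (names and connection details are placeholders; in this recipe the subscription itself is created only after pg_upgrade, as discussed below):

```bash
# On the source primary: the publication (the slot was already created earlier).
psql -h main-primary.example.com -U postgres -c \
  "CREATE PUBLICATION upgrade_pub FOR ALL TABLES;"

# Later, on the new (already upgraded) node: attach to the existing slot and
# skip the initial copy, because physical replication already brought the data.
psql -c "
  CREATE SUBSCRIPTION upgrade_sub
  CONNECTION 'host=main-primary.example.com dbname=mydb user=replicator'
  PUBLICATION upgrade_pub
  WITH (copy_data = false, create_slot = false, slot_name = 'upgrade_slot');"
```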
So now we have a second problem. This problem is quite interesting.
It feels like this. If you first initialize logical replication and then shut down your
logical replica, run pg_upgrade on it and start again, there are risks of data corruption. This is quite interesting, and it was discussed on Hacker News — sorry, in the pgsql-hackers mailing list — a few months ago. There is a discussion; I'm going to provide a link in the description.
So it's quite kind of interesting.
So when we run pg_upgrade, it doesn't play well with logical replication. Instead, let's just swap the order of our actions. We know that our future primary is at the position of the slot.
But instead of starting to use logical replication right away,
let's postpone it.
And first, let's run pg_upgrade. We run pg_upgrade. We start our new cluster, which has all the data. And we know it was frozen at the position we need. It's synchronized with the slot.
When we start it, it has timeline 1 or 2, so it's kind of like a fresh database. Then, after that, we can start using logical replication — create a subscription.
Of course, during pg_upgrade the slot will accumulate some lag, because the LSN position is frozen, it's behind. So of course pg_upgrade should be done as fast as possible, for example involving hard links, the -k option. In this case it will take a few minutes. We know that in a new cluster after pg_upgrade we don't have statistics, so we need to run ANALYZE.
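By the way, a hedged sketch of the pg_upgrade step itself with hard links (versions and paths are placeholders):

```bash
# Run pg_upgrade with hard links (--link is the same as -k); running it first
# with --check and the same arguments is a cheap way to catch problems early.
/usr/lib/postgresql/16/bin/pg_upgrade \
  --old-bindir  /usr/lib/postgresql/12/bin \
  --new-bindir  /usr/lib/postgresql/16/bin \
  --old-datadir /var/lib/postgresql/12/main \
  --new-datadir /var/lib/postgresql/16/main \
  --link
```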
I would say it's not a good idea to run ANALYZE right away. We can start logical replication first — create the subscription, start the CDC process, the change data capture process — and run ANALYZE in parallel using many workers, for example vacuumdb --analyze with -j set to maybe even the number of cores we have. If the disks are very good — SSDs and so on — it should be fine. And during that, logical replication is already working, it's catching up. We don't have this risk of data corruption discussed in the pgsql-hackers mailing list. And for logical replication, of course, all tables should have a primary key or REPLICA IDENTITY FULL. If we have a primary key, we don't actually need statistics to work with the primary key. So all inserts, updates, and deletes will work even before we collect statistics. So we can run ANALYZE in parallel, or slightly later, and so on. It will be fine. We just shouldn't run regular user queries on this node yet, until we collect statistics.
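For example, something along these lines (a sketch; --analyze-only skips the actual vacuuming, and the job count is just an example):

```bash
# Rebuild planner statistics on the new cluster in parallel while logical
# replication is catching up. --analyze-in-stages is another option that
# pg_upgrade itself suggests.
vacuumdb --all --analyze-only --jobs="$(nproc)"
```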
So in this case, this is a good recipe. But there is another problem, and this problem is not only
related to this recipe involving logical replication. And by the way, I forgot to mention that a couple of weeks ago a new tool was published (pg_easy_replicate, written in Ruby), which allows you to automate the process involving logical replication and run Postgres major upgrades using logical. It was discussed on Hacker News, it was discussed on Twitter, and at least five people sent me this tool because they know I'm interested in Postgres upgrades. Thank you all. But I think it's not... This tool is okay, I think. I briefly checked it, and I didn't see this physical-to-logical conversion. So it means that if you have a really big database, this tool probably won't work well. And second, I'm not sure about the order of actions. So you should be careful, because if the order of actions is logical first, then upgrade, then continue logical, there are risks of corruption. So I would encourage you to check this, explore this, and communicate with the author. Maybe I will do it as well. These are very tricky and dangerous parts.
That's why I call this episode "pg_upgrade: the tricky and dangerous parts". So finally, let's discuss the final, maybe the most dangerous part, because it's not only about logical replication. It's about anyone who uses pg_upgrade. Most people use hard links with the -k or --link option right now, but it's not even about that.
The problem is related to how we upgrade standby nodes. Postgres documentation describes
a recipe involving rsync. It says: okay, once you've upgraded the primary, to upgrade the standbys, which already have some data, just run rsync to eliminate the deviation accumulated while running pg_upgrade. And that's it, right? It's quite obvious. But the problem is, if you really want to minimize downtime — in the case without logical replication, downtime is usually a few minutes; we want to achieve just a few minutes even with huge clusters, many, many terabytes — if you want to minimize it, you cannot use rsync in its best mode. The best mode for rsync would be --checksum. So rsync --checksum would check the content of all files. But it would take a long time. Maybe it's easier to recreate a replica than to wait until rsync --checksum finishes.
Unfortunately, rsync is a single-threaded thing.
So if you want to benefit from the fact
that you have fast disks and a lot of cores,
you need to parallelize it somehow yourself. Maybe with GNU parallel or something
like that. So it's like kind of a task. And documentation doesn't describe it. But if you
just follow the documentation, it says rsync. Okay, there's another option. So if you don't
specify checksum, you just use rsync, it will compare
two things, size of the file and modification time.
It's much better, but still it's a question to me, will it be reliable?
But the least reliable approach is to use the option size only.
So we will compare only size.
And in this case, guess what documentation describes?
Documentation describes size only.
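Roughly, the shape of the documented command versus the slow-but-thorough alternative (paths and hostnames are placeholders; check the documentation for the exact invocation for your directory layout):

```bash
# The documented approach uses --size-only -- exactly the risky part discussed here.
rsync --archive --delete --hard-links --size-only --no-inc-recursive \
      /var/lib/postgresql/12 /var/lib/postgresql/16 \
      standby.example.com:/var/lib/postgresql

# The thorough (and much slower, single-threaded) alternative compares file contents:
rsync --archive --delete --hard-links --checksum --no-inc-recursive \
      /var/lib/postgresql/12 /var/lib/postgresql/16 \
      standby.example.com:/var/lib/postgresql
```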
And imagine some table, which is quite big, say 100 gigabytes.
Postgres stores tables and indexes in files — one-gigabyte segment files, many, many of them. You can find them using the OID of the table or index, and you will see OID.1, OID.2 — many files, all of them one gigabyte. So if some writes happened during your upgrade which were not propagated to the standbys, and then you follow the Postgres documentation recipe — the official recipe, which is currently official — and just run rsync --size-only as it describes, rsync --size-only will see: okay, we have this blah-blah-blah.2 file, one gigabyte on the source, one gigabyte on the target, no difference — so it skips the file, even though its content differs, and the standby is left with stale pages.
This is super interesting, because I suspect many clusters upgraded recently — or maybe not that recently — have corrupted standbys. And we had it as well, with some quite big database, last week. We upgraded using our logical recipe, this kind of magic and so on. And fortunately, we saw the corruption on the new replicas when we moved read-only user traffic there, and we started seeing corruption errors in the logs. Corruption errors are very important to monitor, and Postgres has three error codes you should always monitor and have alerts for: XX000, XX001, XX002.
These are at the very bottom of the error codes page — you can find them in the documentation.
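For a manual spot check (assuming csvlog is enabled; the log path is a placeholder), even a simple grep for these three SQLSTATEs goes a long way:

```bash
# Look for the three "internal error" SQLSTATEs that typically indicate corruption.
grep -E '(XX000|XX001|XX002)' /var/lib/postgresql/16/main/log/*.csv
```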
And the last three lines, this is the most, like, this should be,
you should have alerts on this definitely but a funny thing during
major operations such as major upgrade you probably silence all alerts so it's you need to
have manual checks as well of logs and this is how we found out like accidentally we we check
we were checking logs logs and saw a couple of lines and we saw this,
oh wow, it looks like corruption. And then we found out that all standbys, we had like maybe
five standbys, have for some files, one gigabyte files, have slightly different content rather than
new future primary. We rolled back. So fortunately, no data loss, nothing like that
happened. So we had only a few errors, like 30 maybe errors or so for user traffic, and that's
it, which is great. But then I started to think, this is official recipe. And the documentation,
this recipe, how to perform upgrades, it mentions that you should check
with PG control data, you should check the last checkpoint position when you stop standby and
primary on the future cluster, new cluster. But it says something, they might have differences if, for example, you stopped
standbys first or standbys are still running, something like that.
But it doesn't say that you shouldn't proceed if you see difference, first of all.
And it doesn't mention... like, right now the documentation is very, very uncertain about the order in which we should stop the servers and how we should proceed with this dangerous rsync --size-only. So that's why I'm thinking many clusters were upgraded like that, and if you don't have monitoring for corruption, you probably still have corruption on standby nodes. It sounds very dangerous. I wrote to the pgsql-hackers mailing list yesterday, and Robert Haas already admitted that this recipe doesn't look right and should be improved.
But let's discuss how to do it better. Thinking about how to do it better, talking to people — to Andrei Borodin as well — and checking some interesting slides, the idea is this. Take our new primary — the standby leader, the node we run pg_upgrade on: before the upgrade, we keep it running, and we also keep all of its standbys running. Then we shut down this primary, but we don't shut down its standbys — meaning that we need to disable autofailover, for example by putting Patroni into maintenance mode. We shouldn't allow an autofailover to happen in this case.
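With Patroni, maintenance mode is a sketch like this (the cluster name and config path are placeholders):

```bash
# Pause Patroni's automatic management so no failover happens while the primary
# is deliberately stopped; resume once the upgrade dance is finished.
patronictl -c /etc/patroni/patroni.yml pause my-cluster
# ... shut down the primary, upgrade, rsync, etc. ...
patronictl -c /etc/patroni/patroni.yml resume my-cluster
```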
But the idea is: we keep the standbys running and allow them to fully catch up on all the lag. When you shut down a server — the primary — the WAL sender has some logic to attempt to send everything to the standbys, all the changes, before shutting down. And then we can also involve pg_controldata. So the idea is: let's keep the standbys running. We also check the checkpoint position with pg_controldata, which is good. Then we upgrade the primary, and then probably we still keep the standbys running.
And then, probably one by one, we shut each one down and do rsync — or probably we shut all of them down and rsync all nodes in parallel. I actually saw this recipe in a Yandex.Mail presentation, and I learned yesterday that this recipe — with some of the work related to the use of hard links and these rsync snippets and so on — came from the guys at Yandex.Mail in the past. But I think it's super important to ensure that the standbys received all changes from the primary. And if they didn't, you will probably have corruption.
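A minimal sketch of the pg_controldata check (binary and data-directory paths are placeholders): after the old primary is cleanly shut down and the standbys have replayed everything, the reported positions should line up; if they don't, don't proceed with rsync.

```bash
# On the (stopped) old primary:
/usr/lib/postgresql/12/bin/pg_controldata /var/lib/postgresql/12/main \
  | grep 'Latest checkpoint location'
# Run the same command on every standby and compare the values -- they should match.
```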
Actually, that's it.
I hope it was interesting, dangerous enough.
By dangerous, I mean: the lesson here is to definitely have monitoring and alerting for corruption errors — XX000, XX001, and XX002, these three error codes — and probably to involve some additional checks, for example using amcheck.
We also — my colleagues learned that amcheck also found the corruption on the replicas when you run it there, because there was a deviation between the index files and the heap (table) files.
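One possible way to run such a check (a sketch; hosts and the index name are placeholders):

```bash
# Create the extension on the (new) primary -- it replicates to standbys -- then
# verify B-tree indexes against the heap on a replica; this is a read-only check.
psql -h new-primary.example.com -c "CREATE EXTENSION IF NOT EXISTS amcheck;"
psql -h new-standby.example.com -Atc "SELECT bt_index_check('some_index'::regclass, true);"
# On PostgreSQL 14+ there is also a CLI wrapper that can scan everything in parallel:
#   pg_amcheck --all --jobs=8 --heapallindexed
```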
Also, when you run pg_upgrade, the safest approach, I think, is just to re-initialize the replicas from scratch, if you can afford it, avoiding all this. But if you cannot afford it, at least be very careful with this risk of deviation. Or maybe run rsync without the --size-only option, because this --size-only option definitely doesn't look safe. And I cannot guarantee that with the default behavior of rsync, when it checks size and modification time, you will be fine — I cannot say yet. So I think some work should be done. And if you have ideas, please join the discussion on the pgsql-hackers mailing list, which I started yesterday. So yeah,
let me check a couple of questions I have in chat because it's a live session.
Are we expecting a modification of the official documentation to ensure safer upgrades in all respects? Well, Postgres is open source.
You can propose your own modification.
I'm still thinking.
I have questions here, more than answers.
Once I have some answer, I'm definitely going to propose something if I have it — some idea of how to improve this. But I guess the documentation should be changed for all Postgres versions, yes.
So, okay. Thank you so much. It was a somewhat strange episode again. And next week, I hope, Michael is returning and we will have the one-year anniversary of our podcast. Thank you so much. Subscribe, share
in your social networks and working groups where you work with your colleagues
who are also interested in Postgres.
And send us comments with your feedback and ideas for the future.
Thank you. Bye.