Postgres FM - pg_upgrade: the tricky and dangerous parts

Episode Date: June 30, 2023

Nikolay (alone, again) reveals some issues that might hit those who perform major PostgreSQL upgrades with minimal downtime.

Links:
- "Upgrades" – PostgresFM episode 037: https://postgres.fm/...episodes/upgrades
- recovery_target_lsn: https://postgresqlco.nf/doc/en/param/recovery_target_lsn/
- recovery_target_action: https://postgresqlco.nf/doc/en/param/recovery_target_action/
- pg_easy_replicate: https://github.com/shayonj/pg_easy_replicate/
- HN discussion of this tool: https://news.ycombinator.com/item?id=36405761
- Waiting for Postgres 16: Logical decoding on standbys: https://pganalyze.com/blog/5mins-postgres-16-logical-decoding
- pg_upgrade and logical replication (discussion in pgsql-hackers): https://www.postgresql.org/message-id/flat/20230217075433.u5mjly4d5cr4hcfe%40jrouhaud
- Allow upgrading publisher node (proposal to natively support running pg_upgrade on publisher, WIP): https://commitfest.postgresql.org/43/4273/
- ❓🤔 pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption? (discussion in pgsql-hackers): https://www.postgresql.org/message-id/flat/CAM527d8heqkjG5VrvjU3Xjsqxg41ufUyabD9QZccdAxnpbRH-Q%40mail.gmail.com

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

If you would like to share this episode, here's a good link (and thank you!)

~~~

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:
- Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello, hello, this is episode number 52. I screwed up a little bit in the beginning because I forgot to unmute myself in Zoom. Thank you to the live audience on YouTube who helped me unmute myself. Unfortunately, I didn't notice it at first. But okay, let's proceed. So we have pg_upgrade, which is a very good tool for upgrades. In ancient times we needed to dump and restore our database to perform a major upgrade, but now we have pg_upgrade, great. Actually, we discussed it with Michael some time ago, in an episode about minor and major upgrades.
Starting point is 00:00:41 We discussed the high-level problems that we usually have and so on. But let me discuss particular problems when you use pg_upgrade to run major upgrades. So, problem number one: if you have a huge database with significant traffic to it, and you want to avoid downtime, there is a well-known idea: let's use logical replication to avoid downtime. Of course, if you use pg_upgrade as is, it will take significant time.
Starting point is 00:01:18 And there is another recipe described in documentation to use dash K or dash dash link option, which significantly speeds up the process because we switched to using hard links. Most of files in PgData is not going to change during upgrade, so hard links help. But some people say, okay, we don't want even some minutes. It will take usually up to a few 10 minutes or 15 minutes, depending on cluster size and additional steps you perform. They say, we want to do it faster, let's involve logical replication. And here we have a couple of dangers. First of all,
Starting point is 00:02:00 if cluster is really huge, it's very challenging to initialize a logical replica because when you initialize it in a traditional way, it's fully automated. If you just create a replication subscription and start a subscription to a new empty cluster, if it's already a new Postgres, great, it will be done automatically, but you will find it very challenging for really big databases, like for example, 5, 10, 15 terabytes under some significant load, like at least level involves basically a kind of dump restore process, which takes long. And also it affects the health of the source cluster, which is not yet released, yesterday we had only beta 2, and also the branch rel 16 stable was stamped, meaning that we're already close to Right now it's not possible. So it's possible only from primary, but even if it's from standby, the problem will remain. So you affect how autowacom workers work and they cannot remove freshly dead tuples because you're still taking data and you need that snapshot. So after the moment you started, if some tuples became dead, AutoACM cannot remove the dead tuples and you will see in logs that some number
Starting point is 00:03:53 of tuples are dead but cannot yet be removed. Why? Because we still hold the snapshot, because we are still initializing our replica. And sometimes it's not even possible to finish. And you produce a lot of load on the primary. So it's kind of difficult. I actually never did it this way myself; I just know it doesn't feel like the right approach, but my colleagues tried, and they failed even on smaller databases of about one terabyte. So how to do it better? Here is the recipe. We call it physical-to-logical, and it's quite straightforward. First of all, we have a secondary cluster; I'm going to use Patroni terminology here. So we have a secondary cluster with its own standby leader. They are all standbys, but one of them is the leader
Starting point is 00:04:51 because we have cascading replication. So the standby leader replicates using streaming replication from the main primary, and then it has its own standby nodes: cascading replication. And Patroni can do that: you can say "this is a secondary cluster", so it doesn't have a real primary, but it has its own standby leader. This streaming replication is asynchronous, normal streaming replication.
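To make the topology concrete, here is a rough sketch using Patroni's "standby cluster" feature. The hostname, port, and file path are hypothetical, and where exactly this fragment lives depends on how the new cluster is bootstrapped and managed; treat it as a sketch, not the definitive configuration.

```bash
# Hypothetical fragment of the new (secondary) cluster's Patroni configuration:
# it tells Patroni there is no real primary here, only a standby leader that
# streams from the old cluster's primary.
cat <<'YAML' > /tmp/standby_cluster_fragment.yml   # file path is illustrative only
standby_cluster:
  host: old-primary.internal   # hypothetical address of the current primary
  port: 5432
YAML
```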
Starting point is 00:05:20 And then we create a slot, and we create a publication on the source, on the primary, as I said. And once we do that, WALs start to accumulate on the primary more and more: if we keep this slot unused, we know we will run out of disk space on the main primary, which we are still using right now. So we should move fast, or we should have a lot of disk space. And of course it's not good to spend a lot of time accumulating this lag, basically, right, if no one is using this slot. We have an LSN position; we know the LSN position when we create the slot using the SQL approach. You can do it in a couple of ways, but the SQL approach is fine here: you run SELECT pg_create_logical_replication_slot(...), and it returns the LSN. So the slot is at this position. Good. Now we have a trick. On our future primary, which is currently the standby leader, we just adjust the configuration. We say
Starting point is 00:06:21 instead of replicating everything using physical replication, we have a setting which is called recovery_target_lsn. And we put this LSN from the slot position right there, restart the server, and let it catch up and pause. There is another setting, I don't remember the name right now, which says which action to perform when this target is reached. Like, okay, we reached this LSN and we pause. Great. And we don't replicate anything additional at the physical level. By the way, we found that if you use
Starting point is 00:07:01 wall shipping here, sometimes it replaces slightly more than needed. So we need to explore this probably. But if you use streaming replication, this works really good. So, and we can, at this point, we can shut down. And here we have a trick. Here we have problem number two. So first of all, this already allows us to switch from physical replication to logical. And this is how we can very quickly initialize logical replica based on physical replica in cases when you need all of the database or majority of tables to be replicated. Of course, with logical, you can replicate only some tables. And this recipe is not good in this case, if you need only like, for example,
Starting point is 00:07:51 10% of all your tables. But if you need everything, and in the case of a major upgrade we need everything, this physical-to-logical recipe is really good. But now we have a second problem here. The first problem we already solved: instead of regular initialization, we use a physical replica converted from physical to logical. By the way, when we create the subscription, we need to say explicitly that we already have everything, so we just need CDC. Logical replication consists of two steps.
Starting point is 00:08:34 The first is full initialization, bringing over a snapshot of the data. And the second is CDC, change data capture, replaying changes from the slot. In the case of physical-to-logical conversion, we don't need the first one, because physical replication already brought us everything. And how we initialized the physical replica doesn't matter: it could be pg_basebackup, it could be pgBackRest or WAL-G fetching the whole database from backups, it could be cloud snapshots or anything else.
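Here is a minimal sketch of this first half of the recipe, assuming a single database called appdb. The host names, publication and slot names, the LSN value, and the config file path are all hypothetical, and recovery_target_action is the setting from the show notes whose name wasn't recalled above.

```bash
# On the OLD primary: create the publication and the logical slot, and note
# the LSN the slot starts at.
psql -h old-primary -U postgres -d appdb -c \
  "CREATE PUBLICATION upgrade_pub FOR ALL TABLES;"
psql -h old-primary -U postgres -d appdb -c \
  "SELECT lsn FROM pg_create_logical_replication_slot('upgrade_slot', 'pgoutput');"
# -> remember the returned LSN, for example 0/AB12CD34

# On the NEW cluster's standby leader (still on the old major version):
# stop physical replay exactly at that LSN.
cat <<'EOF' >> /var/lib/postgresql/15/main/postgresql.conf   # path is hypothetical
recovery_target_lsn    = '0/AB12CD34'
recovery_target_action = 'pause'
EOF
# Restart the standby leader (via Patroni or systemd, depending on the setup)
# and wait until it reports that replay has paused at the target LSN.

# While the slot sits unused, WAL is retained on the old primary - watch it:
psql -h old-primary -U postgres -c \
  "SELECT slot_name,
          pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
     FROM pg_replication_slots;"
```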
Starting point is 00:09:06 So now we have the second problem. This problem is quite interesting. It goes like this: if you first initialize logical replication, and then shut down your logical replica, run pg_upgrade on it and start it again, there are risks of data corruption. This was discussed on Hacker News... sorry, in the pgsql-hackers mailing list, a few months ago. There is a discussion; I'm going to provide the link in the description. So it's quite interesting.
Starting point is 00:09:47 So pg_upgrade doesn't play well with logical replication. Instead, let's just swap the order of our actions. We know that our future primary is at the position of the slot. But instead of starting to use logical replication right away, let's postpone it and first run pg_upgrade. We run pg_upgrade, and we start our new cluster, which has all the data.
Starting point is 00:10:19 And we know it was frozen at the position we need; it's synchronized with the slot. We start it, it has timeline one or two, so it's kind of like a fresh database. Then, after that, we can start using logical replication and create a subscription. Of course, during pg_upgrade the slot will accumulate some lag, because the LSN position is frozen, it's behind. So pg_upgrade should be done as fast as possible, for example using hard links, the -k option; in this case it will take a few minutes. We also know that in a new cluster after pg_upgrade we don't have statistics, so we need to run ANALYZE.
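Continuing the sketch above: binaries, paths, versions, host names, and job counts are all hypothetical; the subscription options are the important part.

```bash
# 1. Cleanly stop the paused standby leader, then run pg_upgrade with hard
#    links so it finishes in minutes:
pg_upgrade \
  --old-bindir=/usr/lib/postgresql/15/bin \
  --new-bindir=/usr/lib/postgresql/16/bin \
  --old-datadir=/var/lib/postgresql/15/main \
  --new-datadir=/var/lib/postgresql/16/main \
  --link --jobs=8

# 2. Start the upgraded cluster and attach it to the slot created earlier.
#    copy_data = false and create_slot = false are the key points: the data is
#    already here (it came via physical replication), we only need the CDC stream.
psql -h new-primary -U postgres -d appdb -c \
  "CREATE SUBSCRIPTION upgrade_sub
     CONNECTION 'host=old-primary dbname=appdb user=replication_user'
     PUBLICATION upgrade_pub
     WITH (copy_data = false, create_slot = false, slot_name = 'upgrade_slot');"

# 3. While the subscription catches up, rebuild optimizer statistics in
#    parallel (pg_upgrade does not carry them over); --analyze-only skips the
#    vacuum part:
vacuumdb -h new-primary -U postgres --analyze-only --jobs=8 appdb
```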
Starting point is 00:11:12 I would say it's not a good idea to run ANALYZE right away. We can start logical replication first, create the subscription, start the CDC, change data capture, process, and run ANALYZE in parallel using many workers, for example vacuumdb --analyze -j, maybe even with as many jobs as we have cores. If the disks are very good, SSDs and so on, it should be fine. And during that, logical replication is already working, it's catching up, and we don't have the risk of data corruption discussed in the hackers mailing list. For logical replication, of course, all tables should have a primary key or REPLICA IDENTITY FULL. If we have primary keys, we don't actually need statistics to work with them, so all inserts, updates, and deletes will work
Starting point is 00:12:11 even before we collect statistics. So we can run ANALYZE in parallel, or slightly later, and so on; it will be fine. We just shouldn't run regular user queries on this node until we've collected statistics. So in this case, this is a good recipe. But there is another problem, and this problem is not only related to this recipe involving logical replication. By the way, I forgot to mention that a couple of weeks ago a new tool was published (pg_easy_replicate, linked in the show notes), which is written in Ruby and automates the process of running Postgres major upgrades using logical replication. It was discussed on Hacker News, it was discussed on Twitter, and at least five people sent me this tool because they know I'm interested in Postgres upgrades. Thank you all. But I think... This tool is okay, I think. I briefly checked it, and I didn't see this physical-to-logical conversion, which means that if you have a really big database, this tool probably won't work well. And second, I'm not sure about the order of actions. So you should be careful,
Starting point is 00:13:35 because if the order of actions is logical first, then upgrade, then continue logical, there are risks of corruption. So I would encourage you to check this, explore it, and communicate with the author. Maybe I will do it as well. These are very tricky and dangerous parts; that's why I called this episode "pg_upgrade: the tricky and dangerous parts". So finally, let's discuss the last, maybe the most dangerous part, because it's not only about logical replication: it concerns anyone who uses pg_upgrade. Most people use hard links with the -k or --link option these days, but it's not even about that.
Starting point is 00:14:27 The problem is related to how we upgrade standby nodes. The Postgres documentation describes a recipe involving rsync. It says: okay, once you've upgraded the primary, to upgrade the standbys, which already have some data, just run rsync to eliminate the deviation accumulated while pg_upgrade was running. And that's it, right? It seems quite obvious. But the problem is, if you really want to minimize downtime (in the case without logical replication, downtime is usually a few minutes, and we want to achieve a few minutes even with huge clusters, many, many terabytes), you cannot use rsync in its best mode. The best mode for rsync would be --checksum: rsync --checksum would check the content of all files. But it would take very long; maybe it's easier to recreate a replica
Starting point is 00:15:27 than just wait until rsync checksum is working. Unfortunately, rsync is a single-threaded thing. So if you want to benefit from the fact that you have fast disks and a lot of cores, you need to parallelize it somehow yourself. Maybe with GNU parallel or something like that. So it's like kind of a task. And documentation doesn't describe it. But if you just follow the documentation, it says rsync. Okay, there's another option. So if you don't specify checksum, you just use rsync, it will compare
Starting point is 00:16:06 two things: the size of the file and the modification time. That's much better, but it's still a question to me whether it will be reliable. The least reliable approach, though, is to use the --size-only option, so only the size is compared. And in this case, guess what the documentation describes? The documentation describes --size-only. Now imagine some table which is quite big, say 100 gigabytes. Postgres stores tables and indexes in segment files of one gigabyte each: many, many one-gigabyte files.
Starting point is 00:16:47 You can find them using OID of table or index and you can see OID.1, OID.2, so many files, all of them are one gigabyte. So if some writes happened during your upgrade, which you are not propagated to standbys, and then you follow Postgres documentation recipe, official recipe, which is currently official, and just run rsync size only, as it describes. Rsync size only will see, okay, we have blah blah blah dot two file, one gigabyte on source, one gigabyte on target, no difference. This is super interesting because I suspect many of clusters upgraded recently or maybe not that recently have corrupted standbys.
Starting point is 00:17:50 And we hit this ourselves with a quite big database last week. We upgraded using our logical recipe, this kind of magic, and so on. And fortunately we noticed corruption on the replicas, on the new replicas, when we moved read-only user traffic there and started seeing corruption errors in the logs. Corruption errors are very important to monitor: Postgres has three error codes you should always monitor and have alerts on, XX000, XX001, XX002. They're at the very bottom of the error codes page in the documentation, the last three lines. This is the most important thing:
Starting point is 00:18:43 you should have alerts on this definitely but a funny thing during major operations such as major upgrade you probably silence all alerts so it's you need to have manual checks as well of logs and this is how we found out like accidentally we we check we were checking logs logs and saw a couple of lines and we saw this, oh wow, it looks like corruption. And then we found out that all standbys, we had like maybe five standbys, have for some files, one gigabyte files, have slightly different content rather than new future primary. We rolled back. So fortunately, no data loss, nothing like that happened. So we had only a few errors, like 30 maybe errors or so for user traffic, and that's
Starting point is 00:19:33 it, which is great. But then I started to think: this is the official recipe. And the documentation, in this recipe for how to perform upgrades, mentions that you should check with pg_controldata, you should check the last checkpoint position when you stop the standby and the primary of the future cluster, the new cluster. It says they might differ if, for example, you stopped the standbys first, or the standbys are still running, something like that. But first of all, it doesn't say that you shouldn't proceed if you see a difference. And it doesn't really explain; right now the documentation is very, very vague about the order in which we should stop servers and how we
Starting point is 00:20:27 should proceed with this dangerous rsync size only. So that's why I'm thinking many clusters were upgraded like that and if you don't have monitoring of corruption probably you still have corruption on standby nodes. It sounds very dangerous. I wrote to hackers yesterday to the mailing list and Robert House already admitted that this recipe looks not right, it should be improved. But let's discuss how to do it better. Thinking how to do it better, talking to guys and and to Andrei Borodin as well, checking some interesting slides. Standby leader, our new primary, which was already upgraded, before upgrade, if we keep it running, and when we keep running also all standbys,
Starting point is 00:21:23 and then we shut down the primary, but don't shut down its standbys, meaning we need to disable auto-failover, for example by putting Patroni into maintenance mode; we shouldn't allow a failover to happen here. The idea is that we keep the standbys running and allow them to fully catch up on any lag. When you shut down a primary, the WAL sender has logic to attempt to send everything to the standbys, all changes, before shutting down. And then we can also involve pg_controldata. So the idea is: let's keep the standbys running, and we also check the checkpoint position with pg_controldata, which is good. Then we upgrade the primary, and probably we still keep the standbys running.
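A sketch of that order, with the verification step and the parallel rsync made explicit. Hosts, paths, and the exact sequencing are hypothetical (the sequencing is precisely the part the documentation is vague about); GNU parallel is just one way to fan the rsync out.

```bash
# 1. Pause auto-failover (for example, Patroni maintenance mode).
# 2. Cleanly stop the old primary while the standbys are still running, so the
#    WAL sender pushes the final changes out to them.
# 3. Before going further, verify the standbys really received everything:
pg_controldata /var/lib/postgresql/15/main | grep 'Latest checkpoint location'
#    Run the same command against each (cleanly stopped) standby's data
#    directory and do NOT proceed if the values differ.
# 4. Run pg_upgrade --link on the primary, as sketched earlier.
# 5. Push the upgraded data directory to all standbys in parallel, one rsync
#    per host, for example with GNU parallel:
parallel -j 3 \
  rsync --archive --delete --hard-links --no-inc-recursive \
    /var/lib/postgresql/15/main /var/lib/postgresql/16/main {}:/var/lib/postgresql \
  ::: standby1.internal standby2.internal standby3.internal
# --size-only is deliberately omitted here; see the discussion above.
```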
Starting point is 00:22:13 And then, probably one by one, we shut one down and do the rsync, or we shut all of them down and rsync to all nodes in parallel. I actually saw this recipe in a Yandex.Mail presentation, and I learned yesterday that this recipe, with some of the work related to using hard links and these snippets and so on, came from the Yandex.Mail folks in the past. But I think it's super important to ensure that the standbys received all changes from the primary; and if they didn't, you will probably have corruption. Actually, that's it. I hope it was interesting, dangerous enough.
Starting point is 00:23:01 Dangerous, I mean: the lesson here is to definitely have monitoring and alerting for the corruption error codes XX000, XX001, and XX002, and probably to involve some additional checks, for example using amcheck. My colleagues also learned that amcheck finds this corruption on replicas if you run it,
Starting point is 00:23:30 because there was a deviation between files in index and files in data. Then also, when you run pgupgrade, the safest, I think, is just to initialize the replicas. If you can afford it, just re-initialize it from scratch, avoiding this. But if you cannot afford it, at least be very careful with this risk of deviation. Or maybe run rsync without data size only option, because this size only options definitely doesn't. I cannot guarantee that without this option with default behavior of rsync when it checks size and modification time, you will be fine. I cannot say yet. So I think some work should be done. And if you have ideas, please join the discussion on the hackers mailing list, which I started yesterday. So yeah,
Starting point is 00:24:32 So, let me check a couple of questions I have in the chat, because this is a live session. "Are we expecting modifications to the official documentation to ensure safer upgrades?" Well, Postgres is open source: you can propose your own modification. I'm still thinking; I have more questions here than answers. Once I have some answer, some idea how to improve it, I'm definitely going to propose it.
Starting point is 00:24:59 But I guess the documentation should be changed for all Postgres versions, yes. So, okay. Thank you so much. It was a somewhat strange episode again. Next week, I hope, Michael is returning, and we will have the one-year anniversary of our podcast. Thank you so much. Subscribe, share it in your social networks and in the working groups where you work with colleagues who are also interested in Postgres. And send us comments with your feedback
Starting point is 00:25:28 and ideas for future. Thank you. Bye.
