Postgres FM - Stop and start Postgres faster
Episode Date: October 7, 2023

In this episode (recorded live on YouTube), Nikolay discusses Postgres shutdown and startup times – how to troubleshoot them and, when needed, optimize. Some extra things mentioned in this episode:

#PostgresMarathon series – every day, Nikolay posts a new howto-style article: https://twitter.com/hashtag/PostgresMarathon
Day 2: Postgres shutdown and restart attempts: https://twitter.com/samokhvalov/status/1707147450044297673
Day 3: How to troubleshoot long Postgres startup: https://twitter.com/samokhvalov/status/1707466169245171773
GitLab repo with #PostgresMarathon posts (markdown): https://gitlab.com/postgres-ai/postgresql-consulting/postgres-howtos
Episode 010 "WAL and checkpoint tuning": https://postgres.fm/episodes/wal-and-checkpoint-tuning

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
So hello, hello. This is PostgresFM episode number 66 or 67. I don't remember.
My name is Nikolai and I'm alone today because Michael couldn't make it.
So I don't like to be alone, actually.
And I don't like to record myself alone in advance.
So that's why it's live. But feel free to ask questions in the chat.
My co-host today will be ChatGPT.
Hello, ChatGPT.
Hello, Nikolay.
It's great to be here co-hosting Postgres FM with you. To our listeners, I'm eager to engage with your questions as we talk about how to start and stop Postgres,
or stop and start Postgres faster, and how to troubleshoot if it takes long. I recently
started a new series of small how-to articles. I post them on Twitter initially, and the idea that came to my head is that they should be practical, with examples,
you know, like SQL queries or some snippets you can run in the console, like bash snippets,
or something.
And for example, yesterday I talked about how to create flame graphs.
If you have issues with some Postgres running on your Linux machine,
of course, it wouldn't work on RDS or other managed services.
But if you have, for example, some issue on production,
you can reproduce it on standalone Postgres on a Linux machine.
And then perf or eBPF tooling gives you very powerful capabilities to see
what's going on under the hood.
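For example, a minimal sketch of that workflow, assuming perf and Brendan Gregg's FlameGraph scripts (stackcollapse-perf.pl, flamegraph.pl) are installed, and <pid> is a placeholder for the backend you want to profile:

```bash
# A sketch: sample a Postgres backend's stacks for 30 seconds, then build a flame graph.
# Assumes the FlameGraph scripts are on PATH; <pid> is hypothetical.
perf record -F 99 -g -p <pid> -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > postgres-flamegraph.svg
```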
So flame graphs are a very good tool to understand which functions in the source code take too
much CPU time. But today we are going to discuss stop and start duration of Postgres.
And this, of course, can be applied to any Postgres, any version of it, including, of course,
RDS or any other managed services. So let's start with stop attempts. For example, you want to stop Postgres and see that it takes long.
What kinds of reasons can we have?
That sounds like a valuable series of how-to articles you've been creating, Nikolai.
Understanding the nuances of Postgres and its operations is crucial for many developers and admins.
Regarding your question on the duration of stopping Postgres, there are several reasons it might take longer than expected.
One, active transactions.
If there are active transactions when you attempt to stop the service,
Postgres will wait for them to complete before shutting down.
Two, write-ahead log (WAL).
Postgres could be flushing the WAL to disk,
which can sometimes delay the shutdown process.
Three, background processes.
Postgres runs several background processes, like auto-vacuum.
These processes might delay the shutdown if
they're in the middle of their operations. Four, connection draining. If there are active
connections, Postgres will wait for them to disconnect or reach a safe state to shut down.
For speeding up the process, one might consider issuing a fast shutdown command,
which skips waiting for active transactions to complete.
Okay, here I can interrupt without hesitating.
Unlike when I'm with Michael,
I'm always hesitating to interrupt Michael,
but do it all the time, unfortunately.
So I see, like, honestly,
I didn't prepare this thing beforehand,
so I didn't provide any ideas.
So we see, obviously, the first couple of reasons
ChatGPT mentions are really correct.
I mean, they are the same as I list in my how-to article.
I explain three reasons and how to deal with them.
And the very first is indeed ongoing long-running transactions.
So usually if we try to stop using the most default way,
it's the so-called smart way.
Postgres is going to wait until ongoing work is done, and this is probably not what you
want, because, of course, maybe you need to
stop faster, so you need to use so-called fast
shutdown. It's pg_ctl stop -m fast.
Default is called smart.
And in this case, it will issue SIGINT instead of SIGTERM signal
to the Postgres process.
And it means that Postgres will interrupt all ongoing transactions
and proceed with the shutdown process.
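To make this concrete, here is a small sketch of the pg_ctl shutdown modes and the signals they send to the postmaster, assuming a self-managed setup with PGDATA set:

```bash
# A sketch of pg_ctl shutdown modes (self-managed Postgres, PGDATA set):
pg_ctl -D "$PGDATA" stop -m smart      # SIGTERM: wait for sessions to finish their work
pg_ctl -D "$PGDATA" stop -m fast       # SIGINT: terminate sessions, then run the shutdown checkpoint
pg_ctl -D "$PGDATA" stop -m immediate  # SIGQUIT: skip the shutdown checkpoint, crash recovery on next start
```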
The shutdown process is quite complicated. If you want to study the
source code, I provide links in my article. I will provide them here in the show notes as well.
This is simple. You just use fast shutdown. You understand that ongoing transactions will be
interrupted. But two other reasons are much more interesting. And let's start with
WAL-related and checkpoint-related reasons.
I don't know. Is it working? Are you working?
Certainly, Nikolay. Okay. The interaction between the write-ahead log (WAL) and checkpoints plays a crucial role in PostgreSQL's shutdown process.
When you initiate a shutdown, if there's a pending checkpoint, Postgres will attempt to complete that checkpoint before shutting down.
Checkpoints are mostly...
No, no, no, no, no. When we try to shut down Postgres, first of all,
there is a thing which is called a shutdown checkpoint.
So it's like always so.
We try to shut down, and Postgres wants to perform shutdown checkpoint.
That means that all dirty buffers in the buffer pool
need to be flushed to disk.
So this is a clean shutdown.
If you issue SIGKILL, or kill -9,
which kind of crashes your Postgres, or just switch off the power of the whole machine, in this case,
of course, it won't do it because
it doesn't have a
chance to do it.
But in other cases, it
tries to perform shutdown checkpoint
and the trick with shutdown
checkpoint is while it's doing it,
it cannot accept new connections.
And this is bad because if we
accumulated a lot of dirty buffers in the buffer pool, it means the shutdown checkpoint will have a lot of work to do. You know, I like Michael better.
This co-hosting is not really working well. Let's do it alone completely. So, if we accumulated a lot of dirty buffers and you issue
your shutdown, the shutdown checkpoint can be huge.
Of course, if your settings are default, it won't be huge, because max_wal_size is one gigabyte.
It means that only roughly half a gigabyte of WAL data
can be accumulated since the latest checkpoint.
And also, the checkpoint_timeout default
is also small, I think it's five minutes, right?
In this case, we don't accumulate a lot of dirty buffers.
And it means that shutdown checkpoint will be fast,
but still it's not really pleasant
because while Postgres is flushing dirty buffers to disk,
it doesn't accept connections. So the idea
is we can run
checkpoint explicitly
right before we
shut down. It means that
we issue a fast checkpoint
using the SQL command
CHECKPOINT, flushing all dirty
buffers. Dirty means changed
buffers in the buffer
pool, but not yet synchronized to disk.
We issue a checkpoint, Postgres flushes all of them on disk, and during this checkpoint,
we're still working, still accepting connections, so nobody actually notices it.
And then when we try to shutdown immediately after our checkpoint, the shutdown checkpoint is going to
be there still, but it will be small because we just performed another checkpoint. This is great.
Sometimes you want to do two checkpoints actually. I never did two checkpoints,
but recently we discussed it in my team, and it turned out that there is an opinion that two
checkpoints are better, because the first one can be really long.
And during that,
you again accumulate a lot of dirty buffers.
So probably two checkpoints
and then shut down.
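A minimal sketch of that routine, assuming psql connects to the primary you want to stop and PGDATA is set:

```bash
# A sketch: keep the shutdown checkpoint small by checkpointing explicitly first.
psql -c 'checkpoint;'              # may take a while, but connections are still accepted
psql -c 'checkpoint;'              # optional second one: flushes whatever got dirty during the first
pg_ctl -D "$PGDATA" stop -m fast   # the shutdown checkpoint now has very little left to do
```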
And actually, why can we accumulate,
and why might we want to accumulate,
a lot of dirty buffers?
This is interesting,
and this will be related to our startup timing troubleshooting. So if you have default settings, max_wal_size is just
one gigabyte; current workloads are huge, and one gigabyte is not a lot actually, right? So you
probably want to increase the distance in time between your checkpoints, because
if you do checkpoints too frequently, you have two kinds of overhead. The first kind of overhead,
we actually discussed it, I think we had a special episode about it, but just in case
let me repeat. The first kind of overhead is that you just flushed your dirty buffer,
it became clean, and then immediately after it,
or slightly later before next checkpoint,
some backend again changed the content of this buffer.
And you need to flush it again, because it became dirty once again.
And if you have longer distance between two checkpoints,
a few writes might happen in the same buffer before you flush it.
So instead of flushing it multiple times, you could flush it just once, right?
In this case, overhead becomes smaller.
And of course, if you have frequent checkpoints,
you have a higher chance
of dealing with the same buffer multiple times across checkpoints. And what kinds of workloads
experience this kind of overhead more often? Of course, random writes. So if you are,
for example, performing a COPY, it will be quite a sequential load on your system. So it's not
going to experience this kind of problem a lot. But if you are, for example, trying to update some
records using some index scan, index scan can give you tuples in very random places,
random buffers. And in this case, if it's a massive random update (not random,
I mean, not absolutely random, but the location of tuples is not predictable and
tuples are randomly, sparsely stored, for example, and so on), there are good chances
that you will visit and revisit the same buffer during this massive update multiple times.
It might not be a single update.
It can be some background migration happening where you update in batches.
But still, you revisit the same buffers, probably doing multiple writes and just one
synchronization with disk when the checkpoint happens. But it's not the only problem.
The second problem is full-page writes. By default, full-page writes are enabled because
the block size on disk, on the file system, ext4, is 4 kilobytes, while our buffer size is
8 kilobytes. So to avoid partial writes, Postgres needs to write the whole buffer
to WAL if it's the very first change of the buffer after the latest checkpoint. So if right after our checkpoint we change something in the buffer,
for example just, I don't know, a single tuple, we change a single tuple,
so a new version is created.
In this case, Postgres needs to write the whole buffer to WAL,
and if wal_compression is on, it will compress it, which is probably
good in many cases, maybe not always, but sometimes it's good.
Hello chat, by the way, thank you for joining.
And subsequent changes of the same buffer before the next checkpoint are going to be
much lighter.
Only the tuple change itself will be written,
not the whole buffer. It means that if you have frequent checkpoints, Postgres needs to write
more full-page writes. It increases the size of WAL significantly. And this overhead is propagated to replication and to archiving, to the backup system,
obviously. And also, of course, it takes more disk. Compression helps to some extent, but
of course, it's much better if you have infrequent checkpoints and just write less. But infrequent checkpoints lead to the problem we are discussing here.
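By the way, if you want to check what you currently have for the settings mentioned here, a tiny sketch:

```bash
# A sketch: show the WAL- and checkpoint-related settings discussed above.
psql -c "select name, setting, unit from pg_settings
         where name in ('max_wal_size', 'checkpoint_timeout', 'full_page_writes', 'wal_compression');"
```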
So we can accumulate a lot of dirty pages,
and the shutdown attempt will be long unless you issue an explicit checkpoint or two.
And we can actually see the number of dirty pages in the buffer pool
just by installing a standard extension.
It's a contrib module shipped with Postgres itself.
It's called pg_buffercache.
And in this buffer cache view, there is a flag.
It's isdirty, I think.
It's a Boolean flag.
And you can just see how many buffers are currently dirty. You can multiply them by 8 KiB and understand the volume of data that needs to be synchronized with disk during the next checkpoint or shutdown checkpoint.
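For example, something like this, a sketch assuming you can create the extension in the database you connect to:

```bash
# A sketch: count dirty buffers and estimate how much the next (or shutdown) checkpoint will flush.
psql -c "create extension if not exists pg_buffercache;"
psql -c "select count(*) as dirty_buffers,
                pg_size_pretty(count(*) * 8192) as dirty_size
         from pg_buffercache
         where isdirty;"
```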
So the idea of speeding up is simple.
You cannot avoid the shutdown checkpoint unless you crash your Postgres, and don't crash your Postgres.
The idea is just to make its work smaller, make its life simpler,
just issuing one or two checkpoints right before shutdown.
This helps a lot.
And this should be considered as a standard approach for fast restarts or fast shutdown attempts
when we need, for example, to perform a minor upgrade,
or some other work like other kinds of upgrades, or a switchover, or something.
If you want to stop the current primary, just do this.
And the final, third reason for a longer stop I discussed in my article is lagging archiving.
So, lagging archiving of WALs.
It's related to backups.
archive_command can be failing or just slow. And we accumulated
some WALs which are not yet archived, pending WALs. You can see, for example, you can see
failures there, last_failed_wal and so on, in pg_stat_archiver.
This is good to monitor. Anyway, you want to monitor errors there, and errors should be investigated.
Because if you accumulated a lot of WALs which are not in the backups yet,
it means that you might lose this data in case of disaster.
You lost all the nodes.
You need to recover from backups, but backups are lacking some WALs, right?
It's not good.
I mean, this is a serious situation.
You want to monitor the lag of archiving and to ensure that all WALs are archived in time.
But here we deal with a slightly different problem.
And one day I discovered it on production. I discovered that Patroni failed to perform failover
because it was waiting.
It was trying to stop Postgres and it was waiting on archive_command,
because a lot of WALs had accumulated unnoticed.
So one problem, an issue with the backup system
and WAL archiving, led to another problem: we cannot stop
Postgres and perform failover. Patroni has already fixed this,
I think it was fixed in version 2.1.2. Thanks to Kukushkin,
the maintainer of Patroni, kudos to him for the very speedy
reaction and bug fix.
so right now
Patroni will
failover and take
care of WAL archiving separately,
but the
idea is that
you just need to
monitor and make sure that
WAL archiving is not lagging.
You can introduce the concept of lag, of how many WALs are pending to be archived.
And, for example, say five is too many already.
So we don't want it. We consider this an incident, we have alerts, we investigate. Or you can measure the lag in bytes, understanding that each WAL file is 16 MB.
Or it can be adjusted, actually, and RDS adjusts it.
It has 64 MB there.
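A sketch of what such monitoring could look at, assuming the default 16 MB segment size; the current WAL file name is included just to compare against the last archived one:

```bash
# A sketch: archiver health and a rough idea of the archiving lag.
psql -x -c "select archived_count,
                   last_archived_wal,
                   failed_count,
                   last_failed_wal,
                   pg_walfile_name(pg_current_wal_lsn()) as current_wal_file
            from pg_stat_archiver;"
```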
So that's it for the stopping attempt.
Let's talk about startup.
By the way, an interesting point: sometimes, from my experience, people don't understand what's happening with their ongoing attempt to stop
or start.
They just try to redo it or to be more aggressive, for example kill -9, SIGKILL, which
is not good at all.
Don't do that.
Just try to understand what's happening and where we are currently in the process.
So I think that's it with stop action.
What about start action?
If Postgres was stopped without issues
and shutdown checkpoint finished,
everything should be fine.
And if it's primary, it should start quite fast.
But if it crashed and needs to replay all WALs since the latest checkpoint,
this is the whole idea of checkpoints and WAL.
After a crash, we need to replay WALs after the latest successful checkpoint.
And again, of course, if checkpoint tuning was applied
and distance between checkpoints is bigger, then this startup
redo phase will take longer. Or if you restore from backups, and this is a very common situation,
people restore from backups and don't understand why it's not starting, why it's not accepting connections, and there are a lot of emotions, new attempts, and so on.
They see "FATAL: the database system is not yet accepting connections".
I think everyone who works with Postgres long enough saw this error.
Interesting that it's fatal, but it's not fatal.
It's normal.
We just need to wait.
And the problem here is that it's not easy to understand how long to wait.
I think if you're interested in hacking Postgres,
I think it's a good idea to improve this area
and try to deliver some good knowledge to users about how long to wait.
For example, we can see in the logs that the redo process has started,
but we have no idea from which position it started and which position will be considered as
consistent and ready for work.
Sorry.
So if we ended up in this situation, and we see this kind of error
and understand that Postgres is performing redo,
we can find out a few things,
especially if it's
self-managed Postgres and everything
is under our control.
First of all, you can
inspect
the process list and see
the
Postgres process which is
performing redo.
So,
just with the ps command, you can grep for "startup recovering" and see it; it
will give you an LSN, basically. LSN is the log sequence number, a position inside the WAL,
a sequential position. And this is great. So you understand, you can check it multiple
times and see it's progressing. It's already a huge relief for most people
to understand that we are waiting for something which is happening. Because the worst thing
is that nothing is happening and we are waiting for 10 minutes. It's not progressing. What
to do? But usually, we see it's progressing, meaning that redo is happening, we are moving forward.
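On a self-managed box, that looks roughly like this; a sketch, and the exact process title can vary slightly between versions:

```bash
# A sketch: the startup process title shows the WAL file currently being replayed.
ps aux | grep '[s]tartup recovering'
# Run it a few times; the WAL file name moving forward means redo is progressing.
```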
And next question, how long else to wait? In this
case, you need pg_controldata. pg_controldata is
a standard program shipped with Postgres, and you need
to specify -D (capital D)
with the data directory, PGDATA, and it will show you a lot of metadata about your
Postgres cluster. "Cluster" here is an overloaded term; here we mean
our data directory, basically. It will tell you a lot of interesting stuff, but also it will tell you the position at which Postgres will consider itself at a consistency point.
So you can see the REDO location, the checkpoint location, and also the minimum recovery ending location.
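A sketch of that, assuming PGDATA points at the data directory and using the field names as pg_controldata prints them:

```bash
# A sketch: pull the redo / checkpoint / minimum recovery positions out of pg_controldata.
pg_controldata -D "$PGDATA" \
  | grep -E 'REDO location|checkpoint location|Minimum recovery ending location'
```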
Actually, to be precise, this minimum recovery ending location is only related to cases when we recover from backups, restore Postgres from backups.
If we deal with post-crash recovery, this unfortunately will be 0/0.
So we don't have this information.
We don't know when Postgres will consider itself as successful.
But in this case, we can inspect the pg_wal directory and understand,
just check the latest, the highest value.
We can understand the LSN from the file name.
I have another how-to explaining this.
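A sketch of that second way, assuming access to the data directory; WAL segment file names are 24 hex characters, and the highest one roughly marks how far redo has to go:

```bash
# A sketch: the highest-numbered WAL segment in pg_wal hints at the redo target.
ls "$PGDATA/pg_wal" | grep -E '^[0-9A-F]{24}$' | sort | tail -1
```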
So we have two ways of understanding here.
And unfortunately, if it's a managed Postgres service like RDS,
they don't provide you the capability to use pg_controldata;
they provide you only SQL,
and currently we cannot use SQL
because Postgres is not accepting connections.
So in this case, I don't know what to do.
Ask your vendor of managed Postgres; maybe they need to improve and provide, for example,
some interface, some API or CLI interface to pg_controldata and the process list.
Because when Postgres takes long to start, because we tuned checkpoints,
because we wanted that, right? And when it's starting long, we want to understand where we are and how long
to wait. Again, Postgres could report this in logs. And it is a good idea to think about
implementing this. I think it's not super hard. Unfortunately, I'm not a hacker and I don't have a lot of time. I could probably do it, but I'm not a hacker.
But I have this idea and if you have interest in this, I think it would be great to see
in logs LSN position.
I know that recently, in Postgres 16, checkpoints now report LSN positions, and the checkpoint position
is reported to the logs. It's a new feature of
Postgres 16. So next
logical development
in this area
could be that during recovery, redo
should report details
about where we started and
where we are. For example, each
30 seconds it could report
and so we could see the progress
right in the logs. And finally,
how long to wait, where we will stop. Obviously, if you have these data points in terms of LSN,
and you observe the progress, you can predict, you can have some forecasts, it's easy,
you understand the speed. As a reminder, you can take two LSNs and, in a working Postgres,
you can just subtract one from another, and the difference will be in bytes. So it's good.
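As a sketch, on any running Postgres (the two LSN values here are just made-up examples):

```bash
# A sketch: difference between two LSNs, in bytes.
psql -c "select pg_wal_lsn_diff('5D/F2000060', '5D/E1000028') as bytes;"
```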
And if we see the difference in bytes, we can understand the speed. If we have timestamps
of current position, we understand, for example, each minute we replay I don't know, like
5 gigabytes, for example, or
2 gigabytes. Depends on your
hardware
resources.
In this case, we understand the speed
and assuming the speed is
constant, basically that we are moving
at the same speed all the time. Of course, it's not
exactly so, but we can assume it.
In this case,
knowing the final point,
we can predict when we will reach it.
And I did it many times
because I have systems
with huge max_wal_size
and checkpoint_timeout,
for example,
100 gigabytes max_wal_size
and checkpoint_timeout of
30 or 60 minutes. And of course, startup
when recovering from backups or after a crash takes quite long there. And without understanding,
I was myself in that situation. And I saw so many times that people become nervous and restart
the process. And this is not an efficient way to deal with this problem at all.
Okay, I hope this is helpful.
Let's see if we have questions in chat.
No.
Another hi.
Hi again.
I hope this was helpful.
And if you have questions, just leave them in comments.
And again, thank you everyone.
We do it
already like 66
weeks in a row. Sometimes
I'm alone or Michael is alone, but
we continue. And now
I have daily commitment
to have this Postgres marathon,
Postgres how-tos.
I want to reach
365 days
in a row. It will be challenging, but
so far,
today is day 11.
So far, it's working. I hope it
will work. So check out my
Twitter, LinkedIn, or
the GitLab repository
where these articles
are stored.
Okay, thank you so much.
Until next one.
Bye.