Postgres FM - Stop and start Postgres faster
Episode Date: October 7, 2023

In this episode (recorded live on YouTube), Nikolay discusses Postgres shutdown and startup times – how to troubleshoot them and, when needed, optimize. Some extra things mentioned in this episode:

#PostgresMarathon series – every day, Nikolay posts a new howto-style article: https://twitter.com/hashtag/PostgresMarathon
Day 2: Postgres shutdown and restart attempts: https://twitter.com/samokhvalov/status/1707147450044297673
Day 3: How to troubleshoot long Postgres startup: https://twitter.com/samokhvalov/status/1707466169245171773
GitLab repo with #PostgresMarathon posts (markdown): https://gitlab.com/postgres-ai/postgresql-consulting/postgres-howtos
Episode 010 "WAL and checkpoint tuning": https://postgres.fm/episodes/wal-and-checkpoint-tuning

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
So hello, hello. This is PostgresFM episode number 66 or 67. I don't remember.
My name is Nikolai and I'm alone today because Michael couldn't make it.
So I don't like to be alone, actually.
And I don't like to record myself alone in advance.
So that's why it's live. But feel free to ask questions in the chat.
My co-host today will be ChatGPT.
Hello, ChatGPT.
Hello, Nikolay.
It's great to be here co-hosting Postgres FM with you. To our listeners, I'm eager to engage with your questions as we talk about how to start and stop Postgres,
or stop and start Postgres faster, and how to troubleshoot if it takes long. I recently
started a new series of small how-to articles. I post them on Twitter initially, and the idea that came to my head is that they should be practical, with examples,
you know, like SQL queries or some snippets you can run in the console, like bash snippets,
or something.
And for example, yesterday I talked about how to create flame graphs.
If you have issues with some Postgres running on your Linux machine,
of course, it wouldn't work on RDS or other managed services.
But if you have, for example, some issue on production,
you can reproduce it on standalone Postgres on a Linux machine.
And then perf or eBPF tooling gives you very powerful capabilities to see
what's going on under the hood.
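For example, a minimal sketch of that workflow, assuming perf and Brendan Gregg's FlameGraph scripts (stackcollapse-perf.pl, flamegraph.pl) are installed, and <pid> is a placeholder for the backend you want to profile:

```bash
# A sketch: sample a Postgres backend's stacks for 30 seconds, then build a flame graph.
# Assumes the FlameGraph scripts are on PATH; <pid> is hypothetical.
perf record -F 99 -g -p <pid> -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > postgres-flamegraph.svg
```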
So flame graphs are a very good tool to understand which functions in the source code take too
much CPU time. But today we are going to discuss stop and start duration of Postgres.
And this, of course, can be applied to any Postgres, any version of it, including, of course,
RDS or any other managed services. So let's start with stop attempts. For example, you want to stop Postgres and see that it takes long.
What kinds of reasons can we have?
That sounds like a valuable series of how-to articles you've been creating, Nikolai.
Understanding the nuances of Postgres and its operations is crucial for many developers and admins.
Regarding your question on the duration of stopping Postgres, there are several reasons it might take longer than expected.
One, active transactions.
If there are active transactions when you attempt to stop the service,
Postgres will wait for them to complete before shutting down.
Two, write-ahead log (WAL).
Postgres could be flushing the WAL to disk,
which can sometimes delay the shutdown process.
Three, background processes.
Postgres runs several background processes, like auto-vacuum.
These processes might delay the shutdown if
they're in the middle of their operations. Four, connection draining. If there are active
connections, Postgres will wait for them to disconnect or reach a safe state to shut down.
For speeding up the process, one might consider issuing a fast shutdown command,
which skips waiting for active transactions to complete.
Okay, here I can interrupt without hesitating.
Unlike when I'm with Michael,
I'm always hesitating to interrupt Michael,
but do it all the time, unfortunately.
So I see, like, honestly,
I didn't prepare this thing beforehand,
so I didn't provide any ideas.
So we see, obviously, the first couple of reasons
ChatGPT mentions are really correct.
I mean, they are the same as I list in my how-to article.
I explain three reasons and how to deal with them.
And the very first is indeed ongoing long-running transactions.
So usually if we try to stop using the most default way,
it's the so-called smart way.
Postgres is going to wait until ongoing work is done, and this is probably not what you
want, because, of course, maybe you need to
stop faster, so you need to use so-called fast
shutdown. It's pg_ctl stop -m fast.
Default is called smart.
And in this case, it will issue SIGINT instead of SIGTERM signal
to the Postgres process.
And it means that Postgres will interrupt all ongoing transactions
and proceed with the shutdown process.
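To make this concrete, here is a small sketch of the pg_ctl shutdown modes and the signals they send to the postmaster, assuming a self-managed setup with PGDATA set:

```bash
# A sketch of pg_ctl shutdown modes (self-managed Postgres, PGDATA set):
pg_ctl -D "$PGDATA" stop -m smart      # SIGTERM: wait for sessions to finish their work
pg_ctl -D "$PGDATA" stop -m fast       # SIGINT: terminate sessions, then run the shutdown checkpoint
pg_ctl -D "$PGDATA" stop -m immediate  # SIGQUIT: skip the shutdown checkpoint, crash recovery on next start
```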
The shutdown process is quite complicated. If you want to study the
source code, I provide links in my article. I will provide them here in the show notes as well.
This is simple. You just use fast shutdown. You understand that ongoing transactions will be
interrupted. But two other reasons are much more interesting. And let's start with
WAL-related and checkpoint-related reasons.
I don't know. Is it working? Are you working?
Certainly, Nikolay. Okay. The interaction between the write-ahead log (WAL) and checkpoints plays a crucial role in PostgreSQL's shutdown process.
When you initiate a shutdown, if there's a pending checkpoint, Postgres will attempt to complete that checkpoint before shutting down.
Checkpoints are mostly...
No, no, no, no, no. When we try to shut down Postgres, first of all,
there is a thing which is called a shutdown checkpoint.
So it's like always so.
We try to shut down, and Postgres wants to perform shutdown checkpoint.
That means that all dirty buffers in the buffer pool
need to be flushed to disk.
So this is a clean shutdown.
If you issue SIGKILL, or kill -9,
which kind of crashes your Postgres, or just switch off the power of the whole machine, in this case,
of course, it won't do it because
it doesn't have a
chance to do it.
But in other cases, it
tries to perform shutdown checkpoint
and the trick with shutdown
checkpoint is while it's doing it,
it cannot accept new connections.
And this is bad because if we
accumulated a lot of dirty buffers in the buffer pool, it means the shutdown checkpoint will have a lot of work to do. You know, I like Michael better.
This co-hosting is not really working well. Let's do it alone completely. So, if we accumulated a lot of dirty buffers and you issue
your shutdown, the shutdown checkpoint can be huge.
Of course, if your settings are default, it won't be huge, because max_wal_size is one gigabyte.
It means that only roughly half a gigabyte of WAL data
can be accumulated since the latest checkpoint.
And also, the checkpoint_timeout default
is also small, I think it's five minutes, right?
In this case, we don't accumulate a lot of dirty buffers.
And it means that shutdown checkpoint will be fast,
but still it's not really pleasant
because while Postgres is flushing dirty buffers to disk,
it doesn't accept connections. So the idea
is we can run
checkpoint explicitly
right before we
shut down. It means that
we issue a fast checkpoint
using the SQL command
CHECKPOINT, flushing all dirty
buffers. Dirty means changed
buffers in the buffer
pool, but not yet synchronized to disk.
We issue a checkpoint, Postgres flushes all of them on disk, and during this checkpoint,
we're still working, still accepting connections, so nobody actually notices it.
And then when we try to shutdown immediately after our checkpoint, the shutdown checkpoint is going to
be there still, but it will be small because we just performed another checkpoint. This is great.
Sometimes you want to do two checkpoints actually. I never did two checkpoints,
but recently we discussed it in my team, and it turned out that there is an opinion that two
checkpoints are better, because the first one can be really long.
And during that,
you again accumulate a lot of dirty buffers.
So probably two checkpoints
and then shut down.
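A minimal sketch of that routine, assuming psql connects to the primary you want to stop and PGDATA is set:

```bash
# A sketch: keep the shutdown checkpoint small by checkpointing explicitly first.
psql -c 'checkpoint;'              # may take a while, but connections are still accepted
psql -c 'checkpoint;'              # optional second one: flushes whatever got dirty during the first
pg_ctl -D "$PGDATA" stop -m fast   # the shutdown checkpoint now has very little left to do
```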
And actually, why can we accumulate,
and why might we want to accumulate,
a lot of dirty buffers?
This is interesting,
and this will be related to our startup timing troubleshooting. So if you have default settings, max_wal_size is just
one gigabyte; current workloads are huge, and one gigabyte is not a lot actually, right? So you
probably want to increase the distance in time between your checkpoints, because
if you do checkpoints too frequently, you have two kinds of overhead. The first kind of overhead,
we actually discussed it, I think we had a special episode about it, but just in case
let me repeat. The first kind of overhead is that you just flushed your dirty buffer,
it became clean, and then immediately after it,
or slightly later before next checkpoint,
some backend again changed the content of this buffer.
And you need to flush it again, because it became dirty once again.
And if you have longer distance between two checkpoints,
a few writes might happen in the same buffer before you flush it.
So instead of flushing it multiple times, you could flush it just once, right?
In this case, overhead becomes smaller.
And of course, if you have frequent checkpoints,
you have a higher chance
of dealing with the same buffer multiple times across checkpoints. And what kinds of workloads
experience this kind of overhead more often? Of course, random writes. So if you are,
for example, performing a COPY, it will be quite a sequential load on your system. So it's not
going to experience this kind of problem a lot. But if you are, for example, trying to update some
records using some index scan, index scan can give you tuples in very random places,
random buffers. And in this case, if it's a massive random update (not random,
I mean, not absolutely random, but the location of tuples is not predictable and
tuples are randomly, sparsely stored, for example, and so on), there are good chances
that you will visit and revisit the same buffer during this massive update multiple times.
It might not be a single update.
It can be some background migration happening where you update in batches.
But still, you revisit the same buffers, probably doing multiple writes and just one
synchronization with disk when the checkpoint happens. But it's not the only problem.
The second problem is full-page writes. By default, full-page writes are enabled because
the block size on disk, on the file system, ext4, is 4 kilobytes, while our buffer size is
8 kilobytes. So to avoid partial writes, Postgres needs to write the whole buffer
to WAL if it's the very first change of the buffer after the latest checkpoint. So if right after our checkpoint we change something in the buffer,
for example just, I don't know, a single tuple, we change a single tuple,
so a new version is created.
In this case, Postgres needs to write the whole buffer to WAL,
and if wal_compression is on, it will compress it, which is probably
good in many cases, maybe not always, but sometimes it's good.
Hello chat, by the way, thank you for joining.
And subsequent changes of the same buffer before the next checkpoint are going to be
much lighter.
Only the tuple change itself will be written,
not the whole buffer. It means that if you have frequent checkpoints, Postgres needs to write
more full-page writes. It increases the size of WAL significantly. And this overhead is propagated to replication and to archiving, to the backup system,
obviously. And also, of course, it takes more disk. Compression helps to some extent, but
of course, it's much better if you have infrequent checkpoints and just write less. But infrequent checkpoints lead to the problem we are discussing here.
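By the way, if you want to check what you currently have for the settings mentioned here, a tiny sketch:

```bash
# A sketch: show the WAL- and checkpoint-related settings discussed above.
psql -c "select name, setting, unit from pg_settings
         where name in ('max_wal_size', 'checkpoint_timeout', 'full_page_writes', 'wal_compression');"
```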
So we can accumulate a lot of dirty pages,
and the shutdown attempt will be long unless you issue an explicit checkpoint or two.
And we can actually see the number of dirty pages in the buffer pool
just by installing a standard extension.
It's a contrib module shipped with Postgres itself.
It's called pg_buffercache.
And in this buffer cache view, there is a flag.
It's isdirty, I think.
It's a Boolean flag.
And you can just see how many buffers are currently dirty. You can multiply them by 8 KiB and understand the volume of data that needs to be synchronized with disk during the next checkpoint or shutdown checkpoint.
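For example, something like this, a sketch assuming you can create the extension in the database you connect to:

```bash
# A sketch: count dirty buffers and estimate how much the next (or shutdown) checkpoint will flush.
psql -c "create extension if not exists pg_buffercache;"
psql -c "select count(*) as dirty_buffers,
                pg_size_pretty(count(*) * 8192) as dirty_size
         from pg_buffercache
         where isdirty;"
```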
So the idea of speeding up is simple.
You cannot avoid the shutdown checkpoint unless you crash your Postgres, and don't crash your Postgres.
The idea is just to make its work smaller, make its life simpler,
just issuing one or two checkpoints right before shutdown.
This helps a lot.
And this should be considered as a standard approach for fast restarts or fast shutdown attempts
when we need, for example, to perform a minor upgrade,
or some other work like other kinds of upgrades, or a switchover, or something.
If you want to stop the current primary, just do this.
And the final, third reason for a longer stop I discussed in my article is lagging archiving.
So, lagging archiving of WALs.
It's related to backups.
archive_command can be failing or just slow. And we accumulated
some WALs which are not yet archived, pending WALs. You can see, for example, you can see
failures there, last_failed_wal and so on, in pg_stat_archiver.
This is good to monitor. Anyway, you want to monitor errors there, and errors should be investigated.
Because if you accumulated a lot of WALs which are not in the backups yet,
it means that you might lose this data in case of disaster.
You lost all the nodes.
You need to recover from backups, but backups are lacking some WALs, right?
It's not good.
I mean, this is a serious situation.
You want to monitor the lag of archiving and to ensure that all WALs are archived in time.
But here we deal with a slightly different problem.
And one day I discovered it on production. I discovered that Patroni failed to perform failover
because it was waiting.
It was trying to stop Postgres and it was waiting on archive_command,
because a lot of WALs had accumulated unnoticed.
So one problem, an issue with the backup system
and WAL archiving, led to another problem: we cannot stop
Postgres and perform failover. Patroni has already fixed this,
I think it was fixed in version 2.1.2. Thanks to Kukushkin,
the maintainer of Patroni, kudos to him for the very speedy
reaction and bug fix.
so right now
Patroni will
failover and take
care of WAL archiving separately,
but the
idea is that
you just need to
monitor and make sure that
WAL archiving is not lagging.
You can introduce the concept of lag, of how many WALs are pending to be archived.
And, for example, say five is too many already.
So we don't want it. We consider this an incident, we have alerts, we investigate. Or you can measure the lag in bytes, understanding that each WAL file is 16 MB.
Or it can be adjusted, actually, and RDS adjusts it.
It has 64 MB there.
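A sketch of what such monitoring could look at, assuming the default 16 MB segment size; the current WAL file name is included just to compare against the last archived one:

```bash
# A sketch: archiver health and a rough idea of the archiving lag.
psql -x -c "select archived_count,
                   last_archived_wal,
                   failed_count,
                   last_failed_wal,
                   pg_walfile_name(pg_current_wal_lsn()) as current_wal_file
            from pg_stat_archiver;"
```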
So that's it for the stopping attempt.
Let's talk about startup.
By the way, an interesting point: sometimes, from my experience, people don't understand what's happening with their ongoing attempt to stop
or start.
They just try to redo it or to be more aggressive, for example kill -9, SIGKILL, which
is not good at all.
Don't do that.
Just try to understand what's happening and where we are currently in the process.
So I think that's it with stop action.
What about start action?
If Postgres was stopped without issues
and shutdown checkpoint finished,
everything should be fine.
And if it's primary, it should start quite fast.
But if it crashed and needs to replay all WALs since the latest checkpoint,
this is the whole idea of checkpoints and WAL.
After a crash, we need to replay WALs after the latest successful checkpoint.
And again, of course, if checkpoint tuning was applied
and distance between checkpoints is bigger, then this startup
redo phase will take longer. Or if you restore from backups, and this is a very common situation,
people restore from backups and don't understand why it's not starting, why it's not accepting connections, and there are a lot of emotions, new attempts, and so on.
They see "FATAL: the database system is not yet accepting connections".
I think everyone who works with Postgres long enough saw this error.
Interesting that it's fatal, but it's not fatal.
It's normal.
We just need to wait.
And the problem here is that it's not easy to understand how long to wait.
I think if you're interested in hacking Postgres,
I think it's a good idea to improve this area
and try to deliver some good knowledge to users about how long to wait.
For example, we can see in the logs that the redo process has started,
but we have no idea from which position it started and which position will be considered as
consistent and ready for work.
Sorry.
So if we ended up in this situation, and we see this kind of error
and understand that Postgres is performing redo,
we can find out a few things,
especially if it's
self-managed Postgres and everything
is under our control.
First of all, you can
inspect
the process list and see
the
Postgres process which is
performing redo.
So,
just with the ps command, you can grep for "startup recovering" and see it; it
will give you an LSN, basically. LSN is the log sequence number, a position inside the WAL,
a sequential position. And this is great. So you understand, you can check it multiple
times and see it's progressing. It's already a huge relief for most people
to understand that we are waiting for something which is happening. Because the worst thing
is that nothing is happening and we are waiting for 10 minutes. It's not progressing. What
to do? But usually, we see it's progressing, meaning that redo is happening, we are moving forward.
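On a self-managed box, that looks roughly like this; a sketch, and the exact process title can vary slightly between versions:

```bash
# A sketch: the startup process title shows the WAL file currently being replayed.
ps aux | grep '[s]tartup recovering'
# Run it a few times; the WAL file name moving forward means redo is progressing.
```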
And next question, how long else to wait? In this
case, you need pg_controldata. pg_controldata is
a standard program shipped with Postgres, and you need
to specify -D (capital D)
with the data directory, PGDATA, and it will show you a lot of metadata about your
Postgres cluster. "Cluster" here is an overloaded term; here we mean
our data directory, basically. It will tell you a lot of interesting stuff, but also it will tell you the position at which Postgres will consider itself at a consistency point.
So you can see the REDO location, the checkpoint location, and also the minimum recovery ending location.
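A sketch of that, assuming PGDATA points at the data directory and using the field names as pg_controldata prints them:

```bash
# A sketch: pull the redo / checkpoint / minimum recovery positions out of pg_controldata.
pg_controldata -D "$PGDATA" \
  | grep -E 'REDO location|checkpoint location|Minimum recovery ending location'
```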
Actually, to be precise, this minimum recovery ending location is only related to cases when we recover from backups, restore Postgres from backups.
If we deal with post-crash recovery, this unfortunately will be 0/0.
So we don't have this information.
We don't know when Postgres will consider itself as successful.
But in this case, we can inspect the pg_wal directory and understand,
just check the latest, the highest value.
We can understand the LSN from the file name.
I have another how-to explaining this.
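A sketch of that second way, assuming access to the data directory; WAL segment file names are 24 hex characters, and the highest one roughly marks how far redo has to go:

```bash
# A sketch: the highest-numbered WAL segment in pg_wal hints at the redo target.
ls "$PGDATA/pg_wal" | grep -E '^[0-9A-F]{24}$' | sort | tail -1
```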
So we have two ways of understanding here.
And unfortunately, if it's a managed Postgres service like RDS,
they don't provide you the capability to use pg_controldata;
they provide you only SQL,
and currently we cannot use SQL
because Postgres is not accepting connections.
So in this case, I don't know what to do.
Ask your vendor of managed Postgres; maybe they need to improve and provide, for example,
some interface, some API or CLI interface to pg_controldata and the process list.
Because when Postgres takes long to start, because we tuned checkpoints,
because we wanted that, right? And when it's starting long, we want to understand where we are and how long
to wait. Again, Postgres could report this in logs. And it is a good idea to think about
implementing this. I think it's not super hard. Unfortunately, I'm not a hacker and I don't have a lot of time. I could probably do it, but I'm not a hacker.
But I have this idea and if you have interest in this, I think it would be great to see
in logs LSN position.
I know that recently, in Postgres 16, checkpoints now report LSN positions, and the checkpoint position
is reported to the logs. It's a new feature of
Postgres 16. So next
logical development
in this area
could be that during recovery, redo
should report details
about where we started and
where we are. For example, each
30 seconds it could report
and so we could see the progress
right in the logs. And finally,
how long to wait, where we will stop. Obviously, if you have these data points in terms of LSN,
and you observe the progress, you can predict, you can have some forecasts, it's easy,
you understand the speed. As a reminder, you can take two LSNs and, in a working Postgres,
you can just subtract one from another, and the difference will be in bytes. So it's good.
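As a sketch, on any running Postgres (the two LSN values here are just made-up examples):

```bash
# A sketch: difference between two LSNs, in bytes.
psql -c "select pg_wal_lsn_diff('5D/F2000060', '5D/E1000028') as bytes;"
```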
And if we see the difference in bytes, we can understand the speed. If we have timestamps
of current position, we understand, for example, each minute we replay I don't know, like
5 gigabytes, for example, or
2 gigabytes. Depends on your
hardware
resources.
In this case, we understand the speed
and assuming the speed is
constant, basically that we are moving
at the same speed all the time. Of course, it's not
exactly so, but we can assume it.
In this case,
knowing the final point,
we can predict when we will reach it.
And I did it many times
because I have systems
with huge max_wal_size
and checkpoint_timeout,
for example,
100 gigabytes max_wal_size
and checkpoint_timeout of
30 or 60 minutes. And of course, startup
when recovering from backups or after a crash takes quite long there. And without understanding,
I was myself in that situation. And I saw so many times that people become nervous and restart
the process. And this is not an efficient way to deal with this problem at all.
Okay, I hope this is helpful.
Let's see if we have questions in chat.
No.
Another hi.
Hi again.
I hope this was helpful.
And if you have questions, just leave them in comments.
And again, thank you everyone.
We do it
already like 66
weeks in a row. Sometimes
I'm alone or Michael is alone, but
we continue. And now
I have daily commitment
to have this Postgres marathon,
Postgres how-tos.
I want to reach
365 days
in a row. It will be challenging, but
so far,
today is day 11.
So far, it's working. I hope it
will work. So check out my
Twitter, LinkedIn, or
the GitLab repository
where these articles
are stored.
Okay, thank you so much.
Until next one.
Bye.