Postgres FM - WAL and checkpoint tuning
Episode Date: September 9, 2022

Here are links to a few things we mentioned:
LSN (log sequence number)
On the impact of full-page writes (blog post by Tomas Vondra)
Deep dive on Aurora Postgres (talk by Grant McAlister)
Netdata monitoring
PostgresqlCO.NF
pgPedia
------------------------
What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov and @michristofides or by commenting on our Google doc.
If you would like to share this episode, here's a good link (and thank you!)
Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard
With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we going to talk about today?
Hi Michael, let's talk about checkpoint tuning.
Yes, write-ahead log in general, right? WAL configuration.
We had a really good request for this topic from one of our listeners.
Right, so not only checkpoint tuning, but checkpoint tuning is a part of it.
Yes, absolutely. Thank you, Chelsea, for this one. Uh-huh, thank you. Very interconnected
topics; we should think about them both, WAL configuration and checkpoint tuning. They come
together. Yeah, should we go through them one at a time? In terms of, well, what
are checkpoints? Should we start there? Or what is WAL? Right.
So what are checkpoints?
Checkpoints.
To talk about checkpoints, we need to talk about WAL first.
Write-ahead log.
And the write-ahead log is, like, the basic, absolutely fundamental concept of any database system,
not only relational ones. It says that first, data is recorded to a different place,
to some additional kind of binary log, the write-ahead log.
And only then is it changed in memory, in data pages.
And this allows us to build recovery system.
This allows us to survive unexpected restarts, failures, various bugs, and so on. And in Postgres, this
also allows us to have physical replication because it's built on top of recovery subsystem.
So if you change some row in a table, first this change is recorded into the
write-ahead log. And only when we know that this change is recorded reliably in WAL, that it's already on disk,
only then can you see it committed.
The user can see the commit.
And the actual change in the data can still be only in memory, not flushed to disk yet.
Right?
So basically we record twice: we record the change to the write-ahead log,
and we change the data in the place where it's stored permanently. But this flushing to disk,
in the second case, doesn't happen synchronously. So we can see the commit, but the
data files, the table and index files, are not yet in their actual state on disk. And the checkpoint is the process of
writing so-called dirty blocks, dirty buffers, to disk. Dirty in this context means changed
but not yet saved on disk. So if it's already saved, it's called a clean buffer. If it's not
yet saved, it's called a dirty buffer. And when we have dirty
buffers, it means that we changed a lot. We already reflected these changes in WAL, that's
rule number one of the write-ahead log, but they are not yet saved in the data files. And when a checkpoint happens,
all dirty buffers are saved. And it means that next time we need to start, for example,
after some crash, power loss, anything.
We don't care about previous changes because they are already
reflected on disk, right?
Yeah.
So we only have to worry about things since the last checkpoint.
Right.
And I guess that'll come up.
Right, right, right.
Yeah.
And just before we move on from the basics, because I had to read up a bit
about this, the main reason we need this is to prevent data loss on recovery.
So it's the idea of the dual system,
the idea of having this,
is the D in ACID, I believe,
the durability side,
so preventing data loss.
Right, ACID is like the core concept
of a database system.
If it says data is committed,
it can never lose it.
Otherwise, it's a bad system.
So the write-ahead log is exactly what allows us to have that.
And without checkpoint, we would need to keep a lot of changes and replay a lot of changes
and startup time after any crash would be very long.
So checkpoints happen all the time.
They happen kind of on schedule.
They also can happen during, for example, when we need to shut down the server or restart
it.
There is a so-called mandatory shutdown checkpoint.
So Postgres doesn't report that shutdown completed until this shutdown checkpoint finishes.
So it's also important to understand. And that's why sometimes
we can see that shutdown takes some significant time because we have a lot of dirty buffers and
we need to save them first. Yeah. On this topic, I saw a good recommendation. I think it was from
you actually around Postgres upgrades and the idea of taking a checkpoint, triggering a checkpoint
manually to reduce the time needed. Right, because during the shutdown
checkpoint, Postgres doesn't respond to new queries anymore.
It says shutting down, so come later, right?
But if we run manual checkpoint,
the SQL command CHECKPOINT, it can run in parallel.
So we save a lot of dirty buffers ourselves,
just by running that SQL in a terminal, in psql, for example.
And then when shutdown checkpoint happens,
it's already fast because there's a very low number
of dirty buffers, right?
So it's always recommended, when we need to restart the server,
for example for a minor upgrade,
or we need, for example, to perform a switchover,
this manual, well, not necessarily manual, it can be automated, of course,
but this additional
explicit checkpoint should be there, because after it
the shutdown checkpoint will happen, so we need to help it be shorter, faster.
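As a rough sketch of that pre-restart routine in psql (assuming a role allowed to run CHECKPOINT, i.e. a superuser, or a member of pg_checkpoint on Postgres 15+):

    -- Flush most dirty buffers ahead of time, so the mandatory
    -- shutdown checkpoint has little left to write.
    CHECKPOINT;
    -- Then trigger the restart or switchover; the shutdown checkpoint
    -- that follows should now complete quickly.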
Yeah. Right. So you already mentioned a couple of uses for the write-ahead log. One is recovery on crash.
One is replication.
I believe point-in-time recovery is another one.
So tools like pgBackRest make use of it, I believe.
Right, right, right.
Well, it's a different topic.
But of course, right, if we store full copies of PGDATA,
by the way, they are never in a consistent state.
If you copy PGDATA on a live server, it's not consistent.
So you need a bunch of WAL to be able to reach a consistent point, always.
But you can also store the whole stream of WAL additionally.
And this allows you to take a so-called base backup of PGDATA
corresponding to some point in time,
and then you can replay additional WAL and reach a new point in time,
and it can be arbitrary.
So if you store a continuous stream of WAL in your archive,
you can choose any time you want.
The only problem usually is, if your database is quite big,
the initial copying of PGDATA takes time. And roughly, a very, very rough
rule is one terabyte per hour; it can be slower or faster, but it's a very rough rule. So if you have
a 10-terabyte database, be prepared for 5, 10, or 15 hours of initial copy. And I wonder
why people don't use cloud snapshots more often for that, to speed it up, right? It would make complete sense.
But I know sometimes they are not reliable,
cloud snapshots in Amazon and Google.
There are issues with them sometimes.
Also, making them takes time, but we could do it on a replica.
But this takes us into a different area.
Let's postpone the discussion of backups and other
things. But you're right, point-in-time recovery is another area of application for WAL. But the
primary goal is to allow us to recover from an unexpected shutdown or power loss or something, I think.
So it's like in games, right?
I lost you briefly then. Yeah, I see, I see. Let me listen to you
then. So, well, yeah, I think it's getting better. So we have a lot of parameters for tuning
these things, a lot of parameters for controlling how fast these things happen,
how much has to happen before they kick in. There are a few settings that it seems like we really shouldn't touch
pretty much ever as well. But it also seems that, especially on high-write workloads, there are some
really big wins we can get by changing some of the defaults.
The defaults are usually not good enough for you, right.
Okay, is that generally true? Or if I have a read-heavy application and not a ton
of writes, am I likely to... How would I know if I'm
hitting problems? What are the first telltale signs?
Well, it's a good question. Usually
it's good to evaluate the distance between two daily backups, for example. You can do this even in
SQL: you can take two LSNs, which are usually known when a backup is taken. LSN, log sequence number, is like a position in the WAL.
It's always sequential.
It has some specific structure.
There are a few articles;
we will attach articles explaining how to read an LSN and how to understand its structure.
So if you take two LSNs and convert them to the pg_lsn data type in Postgres, you can subtract one from another.
And the difference will be in bytes.
With this difference, you can then run pg_size_pretty,
and you can see the difference in megabytes or gigabytes; gibibytes, actually,
but Postgres doesn't use this notation.
So you can understand how much WAL you generate during a day, right? If this size is
quite small, like a gigabyte or 10 gigabytes, it's quite small; probably you don't need specific
checkpoint and WAL tuning at all. But if you have one terabyte generated per day, it's quite a lot.
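A minimal sketch of that arithmetic, with made-up LSN values standing in for the ones recorded by two consecutive daily backups:

    -- Subtract two pg_lsn values; the result is a byte count,
    -- which pg_size_pretty turns into a human-readable size.
    SELECT pg_size_pretty(
             '5A/2C000000'::pg_lsn - '59/8A000000'::pg_lsn
           ) AS wal_generated_between_backups;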
And I'm sure you need to go away from defaults and you will have better performance, better everything.
For example, WAL compression is not enabled by default, right?
I missed that one. That's great.
Yeah, but I'm going to check if it's enabled in the recent version because Postgres defaults are improving, but still they are lagging.
If you have heavily loaded systems, you definitely want to tune it.
So if you have Postgres 13, for example,
WAL compression is disabled by default. Let's talk about what is written to WAL.
Oh, first of all, to simplify: the explanation people usually use about checkpoints is that it's like
in games, you want to save your progress, right? And if something bad happens, you will repeat fewer steps. So
it's a very simple analogy. Yeah, and in Postgres 14, WAL compression is still not enabled.
Yeah, in Postgres 15, it's still not enabled. And this, I think, should be enabled in most
setups. I'm almost sure on RDS it's enabled. So, if you, for example,
do checkpoints very rarely,
once per week, it's insanely low.
But in this case,
there are high chances that
if a crash happens,
you will need to wait a long time
while Postgres replays a lot of WAL, right?
A lot to do in terms of redo.
And during this period,
you will be down.
Your system is down.
So not good. That's
why, like logically, I would say it's good to have checkpoints more often, right?
It seems like a Goldilocks problem, right? Too often, and you have a lot of overhead,
but too infrequent, and it will take a long time to recover. So it feels like there's a balance.
There is a trade-off here. And there are two kinds
of overhead. We will talk about them in a second. But to understand where overhead comes from,
besides dirty buffers, let's talk about what is written to WAL. By default, full-page writes
are enabled, right? And what is a full-page write? If you change anything in some table, in some row, it may be a very small change,
but Postgres writes the whole page to the write-ahead log. Why? Because there is a difference:
a buffer is usually 8 kibibytes in size, but the file system probably uses a 4-kibibyte block size,
and you don't want a partial write, where the write to disk is reported as success but only one half of the page was actually written.
So that's why full-page writes are needed.
And by the way, there are deep-dive talks about Aurora Postgres from Grant McAlister, if I'm not mistaken,
where this is very well explained.
We will find links to YouTube and probably a slide deck. They explain very well
this problem with full-page writes and this big overhead. So when the first change to a page occurs,
the first time it's written after a checkpoint, it's a full-page write. If you change it once again,
only a small delta is written. So it's not a full page, right? But only until the
next checkpoint. If a checkpoint happened, all changes are initially full-page writes again.
If checkpoints are very frequent, we have a lot of full-page writes. If checkpoints are not frequent,
very often we have repetitive changes to the same page. So we changed it;
for example, we inserted a row.
We insert once again into the same page.
It's a new change.
Again and again, we update something.
We have HOT updates, heap-only tuple updates.
So we write to the same page.
And this means we touch the same page.
We write to it multiple times. In this case, I'm saying, not only does the number of writes matter; the nature of the writes matters as well.
If you have HOT updates touching the same page many, many times, you will benefit a lot from rare checkpoints.
Because only one full-page write will happen after a checkpoint, and then
you benefit from writing only a little bit to the write-ahead log for every subsequent
change until the next checkpoint, of course. Yeah, that's super interesting, and I think it also explains why
people... so there are some people I've seen, and I suspect this is a very bad idea, turning off full-page writes in order to increase throughput.
But it feels like a very risky thing to do,
and I feel like we're going to cover quite a few better ways.
Some systems can afford it.
You need to understand what your file system is,
your disks, and what settings they use.
And in some cases it's possible, but it's quite dangerous.
You should understand all risks and be 100% sure that it's possible in your system.
But usually we don't go this way.
Usually we use ext4 with a 4K block size and so on.
And we want full-page writes enabled.
So back to compression.
Compression is applied only to full-page writes.
Interesting.
Right. So we don't compress, as I understand, Postgres doesn't compress these small changes.
It compresses only this first time we change something in a page, we record this page fully,
and we can compress it. And compression is not enabled by default.
And if you enable it, you can see a huge benefit in terms of how much WAL you write. Why do we care
about volume here? Because if we have a write-heavy system, of course, writing a lot of additional data to
WAL is additional overhead on disk. And if you have 10 replicas, sometimes people have that,
all replicas need to receive this data.
Physical replication, and logical as well,
works through WAL, through the write-ahead log.
So if we write a lot, we need to send a lot over the network.
So we want WAL compression enabled to compress all full-page writes.
And we want checkpoints to happen rarely, to have fewer full-page writes as well.
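If it helps, a small sketch of checking and enabling that setting; wal_compression is reloadable, so no restart should be needed, and on Postgres 15+ you can also choose lz4 or zstd if the server was built with them:

    SHOW wal_compression;                      -- 'off' by default up to at least Postgres 15
    ALTER SYSTEM SET wal_compression = on;     -- compresses full-page images written to WAL
    SELECT pg_reload_conf();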
So I would tune max_wal_size and checkpoint_timeout to have very, very infrequent checkpoints.
But in this case, if they are not frequent, again, startup time after a crash, and also failover, for example,
the timing of these procedures will be very, very bad.
Long. Minutes.
I see various engineers struggling to understand why, for example,
shutdown takes so long, why restart takes so long.
And they become nervous.
And in extreme cases, they use kill -9.
So, SIGKILL.
Postgres survives because we have the write-ahead log.
And we just redo.
But the redo also takes a long time.
They kill it.
It's not acceptable.
Only in rare cases, we should do it.
It's like last resort.
We should not do it.
But after the redo, Postgres starts, and the startup also takes many minutes.
They're still nervous.
It's not a good situation. That's why people need to understand how much WAL needs to be written, and the distance between checkpoints.
Yeah, let's go back to a couple of those ones you mentioned.
So my understanding is that checkpoint_timeout is the maximum time between checkpoints.
And its default is quite low: five minutes.
Very low.
So what would be a sensible starting point
for most people in terms of?
Yeah.
So usually the main metric here
is how long you can afford being down
in the bad case,
in the case of an incident.
This is the main number you need to understand.
You need to talk with your business people and
find some number, like, okay, we can be down for up to two minutes, for example, right? From there you start
thinking, okay, we have this requirement, or SLO, service level objective, if we are SREs, right?
So if we have two minutes, let's think: during two minutes, how much can we
replay? We can measure it
with an experiment. We can, for example,
set checkpoint_timeout
and max_wal_size to insanely
big numbers. Then we can
have a lot of
writes happening, with pgbench, for example.
Then we can wait until
one checkpoint
finishes and another is about to finish, and then we kill -9 our Postgres, crash it on purpose.
And then we observe the recovery and just measure the speed of recovery: how many bytes of WAL we can replay per second, per minute.
And this gives us an understanding of how much WAL we can afford to replay so as not to exceed two minutes of downtime, for example.
Yeah, perfect.
From this, we can start thinking. This is very important to understand:
recovery speed in terms of bytes per second, bytes per minute, or gigabytes per minute, anything like that.
From there, we can understand how many bytes of WAL we produce when everything is normal.
During quite busy hours, usually at night, for example, we have lower activity.
At daytime, on working days, probably we have more activity, right?
But usually we say, okay, we produce, for example, one WAL file per second.
It's quite a good speed. Each WAL here means a file. There is also
some confusion in terminology, because as I remember, the documentation says a WAL file is some abstract
thing, it's like two gigabytes, and a WAL segment is 16 megabytes. But if you go to the pg_wal directory,
you will see each file is 16 megabytes, usually. As I remember, RDS tunes it and they have 64
megabytes per WAL file. So, as I said, each WAL file is usually 16 megabytes. So one WAL file per second
during normal, quite busy hours means we produce 16 megabytes per second of WAL data,
right? So, okay. And it means, okay: what is our replay speed? What is our production speed?
And from there, we can understand over what time we generate the amount of WAL data
that will give us two minutes of recovery time, right?
Quite complex, I understand.
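For the production-speed side, a rough psql sketch; the LSN shown is hypothetical, substitute the value you actually get back:

    -- On the primary: note the current WAL position, wait a minute, then diff.
    SELECT pg_current_wal_lsn();               -- e.g. returns 5A/2C000000 (hypothetical)
    SELECT pg_sleep(60);
    SELECT pg_size_pretty(
             pg_wal_lsn_diff(pg_current_wal_lsn(), '5A/2C000000')
           ) AS wal_produced_in_60_seconds;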
Well, it feels like luckily we've got that second parameter, max_wal_size.
No, no, no, let's pause. Yes, I'm talking about how to
understand, in my opinion, what's the best, what's a normal checkpoint_timeout for
you, right? Yes. So in this case we understand: okay, recovery speed is this, production
speed is this. So we can conclude that, to have no more than two minutes of recovery time,
we need to produce no more than this amount of WAL data. So we will set checkpoint_timeout to probably
half an hour, or maybe 15 minutes, 30 minutes. It depends, of course.
Observing a concrete system, we can draw some conclusion. Okay, we want 30 minutes, for example.
Requested checkpoints is like Postgres has two logics.
One logic is, okay, on schedule.
When time comes, time to have checkpoint.
Every 30 minutes, for example.
By default, it's five minutes.
I think it's too often, right?
But then if there is another parameter called max wall size,
and I think it's a very, very important parameter to understand.
It's our protection for the cases when we have elevated activity
and we want to have more frequent checkpoints,
because we want to be protected, again,
from writing too much data to WAL and having, again, a longer wait.
Again, the same logic: if we understand how much we produce, the speed of production, we can say, okay,
max_wal_size also roughly corresponds to that. So checkpoint_timeout and max_wal_size, their tuning
can be correlated here, right?
Yeah. So my understanding is, it sounds like we should rely on
checkpoint_timeout for the majority of the time. That should be the thing that kicks off
checkpoints. But if more than that amount of WAL is generated, more than the amount we expected,
we could set an amount in...
We should set it so that, like... The default is very small. A one-gigabyte default is insanely small for modern workloads. Usually I recommend going up, sometimes up to 100 gigabytes, but we need to understand this recovery trade-off, right? So we need to measure recovery and guarantee our business that we will not be down for more than, for example, two minutes or five minutes.
But max_wal_size protects us from the cases when we have more writes.
And Postgres can decide to perform a requested checkpoint.
We see it in logs.
By the way, we always need to enable logging of checkpoints.
As far as I remember, recently the default was changed and logging is now enabled.
I remember the discussion in the hackers mailing list.
So log_checkpoints should be enabled, always.
I'm 100% sure.
This is what you want, to understand checkpoints.
The default is disabled in Postgres 12, disabled in Postgres 14, but enabled in Postgres 15, which will be released very soon.
So this is a new change.
In Postgres 15, log_checkpoints is enabled, and I recommend enabling it for any Postgres.
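Put together, the kind of settings discussed here might look like the sketch below; the values are purely illustrative and should come out of the recovery-time measurement, not be copied blindly. All of these can be reloaded without a restart:

    ALTER SYSTEM SET checkpoint_timeout = '30min';  -- default is 5min
    ALTER SYSTEM SET max_wal_size = '32GB';         -- default is 1GB
    ALTER SYSTEM SET log_checkpoints = on;          -- default only from Postgres 15
    SELECT pg_reload_conf();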
I also saw some DBAs notice that, like, 90% of all their checkpoints are requested.
They occur according to max_wal_size.
This is a problem.
No, it's not a problem.
It's not a problem.
Because requested checkpoint and timed checkpoint, like planned on schedule, they are the same, actually.
Like, no big difference.
But, of course, you want to be in order.
Everything should be in order.
Of course, you want, like, it's just a sign that probably you need to reconsider settings,
but it's not an emergency situation, right?
Yeah.
Sounds good.
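A small sketch of checking that ratio from the cumulative statistics (these counters live in pg_stat_bgwriter):

    -- Timer-driven vs. requested checkpoints since the stats were last reset.
    SELECT checkpoints_timed,
           checkpoints_req,
           round(100.0 * checkpoints_req
                 / nullif(checkpoints_timed + checkpoints_req, 0), 1)
             AS requested_pct
    FROM pg_stat_bgwriter;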
There is another parameter, checkpoint_completion_target, which we didn't mention.
And by default, it's 0.7, or...?
No, this changed.
Yeah, I looked this up.
Until very recently, it was 0.5.
Oh, 0.5 is terrible, I would say. It's not what you want.
Yeah, but in Postgres 14, it was increased to 0.9.
Great. This is a good number. So what is it? When you run a manual checkpoint,
an explicit checkpoint, it goes at full speed. So it writes dirty buffers to disk as fast as possible, and it produces some stress on the disks.
It's okay stress, but normally you want to be more gentle with your disk system, right?
So that's why we spread it over time.
And a checkpoint_completion_target of 0.9 means that between two checkpoints, 90% of the time we want to spend writing, and 10% of the time we are resting.
Maybe you want even more, 99%.
I don't know, like 99% of time.
And this is important because it's hard to understand the distance between checkpoints.
It's quite a tricky question.
Logging will report something.
But you can think of it as from when a checkpoint starts;
this is like the beginning. So 10 minutes between them, or
15 minutes between them, or 30, it's fine. But what I wanted to
convey is that this is very tricky. It bothered me for a few years,
actually. And only in Rogov's book, we already mentioned
it, this PostgreSQL Internals. I read it in Russian earlier.
Now it's published.
Both parts are published in English.
It's very good.
It explains everything in detail with links to source code.
And finally, I understood why, if we set max_wal_size to one gigabyte, the distance measured
in bytes can be like 300-something megabytes.
So it's like three times smaller.
Why is that?
So explanation is interesting.
I'm looking at it right now.
So I knew it from practice.
I just, like, when I saw,
oh, you have the default one gigabyte,
I knew it means that the actual distance
measured in bytes will be 300 megabytes.
It's a tiny distance.
It means that checkpoints
will disturb your system constantly.
And I even saw the case of a very large company where people had some cleanup job running in the background.
And then, before a big marketing event, they disabled this job.
And then a couple of months later, they realized the job was disabled, and some engineer, a very experienced one, but not a Postgres expert,
said, okay, this job was not painful at all.
It was working for many years.
So he went ahead and tried to delete 10 million rows using one DELETE, and put the system down
for 10 minutes, because they didn't have checkpoint tuning in place.
So max_wal_size was the default 1 gigabyte. The actual
distance was 300 megabytes. I will explain why. It means that when you produce a lot,
you have checkpoints happening all the time. Boom, boom, boom, boom. And a lot of full-page
writes. Boom, boom, boom. It's not compressed. And the disks were quite good, like enterprise disks,
but not NVMe, unfortunately. And
saturation happened, and they went down
for 10 minutes. From just one DELETE.
I even gave a talk,
I did it in Russia some time ago,
just about this case,
how a DELETE can put your...
One line of DELETE can put your Postgres
down, even if before that it worked
very well, even a critical
system. So checkpoint tuning is an important thing to have. So, if you have one gigabyte:
until Postgres 11, if you have checkpoint_completion_target close to one, you should divide
by three; since Postgres 11, you should divide by two. So if you have a one-gigabyte max_wal_size,
the actual distance will be half a gigabyte, roughly,
if your checkpoint_completion_target is 0.9.
Because Postgres needs to keep everything since the last checkpoint,
and also everything between the previous one and the latest one, the latest successful one.
Oh, interesting.
And before Postgres 11, an additional cycle was needed.
So two successful cycles and a tail, right?
Not a tail... actually, this tail is not behind us, it's in front of us, right?
So if checkpoint_completion_target is 0.9, roughly three intervals
are needed.
Yeah, interesting. That's why you need to raise max_wal_size anyway.
Yeah, that seems like almost everybody would want to increase that one.
I've read, like, there's some other interesting ones
that I'd be keen on your view on.
And also, actually, I guess, are people on cloud providers,
you mentioned RDS a couple of times,
are they generally more protected from this because they've been tuned already?
Tuning here means increasing max_wal_size, but do it not
blindly; understand the recovery time.
Yeah.
The other ones... min_wal_size, for example. I've read that it can increase
write performance if you increase that number.
From my practice, I cannot say anything here. I
didn't dive deeply enough to discuss this. But max_wal_size is my favorite topic. Yeah, if you
haven't had to worry about min_wal_size, then I can't imagine it's that important. So, yeah,
good to know. And the only other one I wanted to ask about, while we're on the topic... Sorry, it's so important.
I just want to emphasize it.
Yes.
So if we have a very short distance in terms of max_wal_size,
forced checkpoints, and we have unexpected,
or maybe expected, someone decided to do it,
a lot of write activity, we can measure it with experiments.
And what I found is that, you know, thin clones are good for iterating, but we
cannot use thin clones here, because we need our disk and file system to behave exactly the same as on
production. So I found a good recipe for a workload that will not touch our physical layout of data. It's a transaction with a massive delete.
Like delete 10 or
100 million rows, but cancel it
in the beginning, rollback.
A delete will write to
xmax. We discussed it
a couple of months ago, probably.
It will write the transaction ID
which deleted the tuple.
But if the transaction got canceled,
this is virtually zero.
Zero means this tuple is still alive.
So we produce a lot of WAL,
produce big stress on the system,
but then we say nothing changed.
And we can do another experiment on the same system.
It's a perfect workload for a lab, right?
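A minimal sketch of that lab workload, with a hypothetical table name and bound:

    -- Generates a lot of WAL and dirty pages, but the deleting transaction
    -- is rolled back, so the rows stay visible and the data is logically unchanged.
    BEGIN;
    DELETE FROM big_table WHERE id <= 100000000;   -- hypothetical table and bound
    ROLLBACK;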
So we can have a sequence of experiments
with different max_wal_size values,
and see, using iotop or iostat,
we can see... we can use monitoring.
I recommend using Netdata,
because it has an export button.
You can export all the graphs
and you can see how disk I/O behaves.
And usually, if you have a one-gigabyte max_wal_size
and the disks are not very
powerful, you will see a plateau, because it's saturated, right? Then you double your max_wal_size,
plateau again; double max_wal_size again. At some point you will see your system under such stress
that it's already not a plateau. And iotop shows, when max_wal_size is small, if you order by write throughput, the checkpointer will be number one.
It writes like 200, 300, I don't know, 500 megs per second.
Like it writes, writes, writes.
Also, I promised to explain the two sources of overhead.
One we already discussed: full-page writes.
If we just finished our checkpoint and we need to start another, because max_wal_size
commands us to have them very frequently, it's like insane checkpointing, like
the checkpointer went mad, right? So, for example, I saw it every 15 seconds: boom, boom, boom, boom,
because with these deletes, 300 megs is reached quite fast. Again, again, again.
Default settings.
So full-page writes are one type of overhead.
So basically... oh, also, make your deletes not sequential.
For example, you can have an index on some random column,
and you can say: let's delete the first 10 million rows ordered by this column.
But it's random,
so the first tuple is in the first page,
the second tuple is in page number 1000. So we jump between various pages. And this is the worst
type, because we could benefit... like, if they are sequential, probably all changes in one page will
happen inside one checkpointing cycle. But if we jump between pages, we constantly produce a lot
of full-page writes, and we need
to produce them once again because a checkpoint just finished, right? This is the worst situation,
and this happens. This is exactly what put that system down, what I explained.
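A variant of the earlier sketch for this worst case, again with hypothetical table and column names; the subquery orders by an indexed column that is uncorrelated with the physical row order, so the delete hops between heap pages:

    BEGIN;
    DELETE FROM big_table
    WHERE id IN (
        SELECT id
        FROM big_table
        ORDER BY some_random_column   -- indexed, uncorrelated with heap order
        LIMIT 10000000
    );
    ROLLBACK;   -- same trick: lots of WAL and full-page writes, no logical change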
And the second one, I didn't realize at first, but it's quite obvious. The second overhead is quite obvious:
if our page was dirty, it was checkpointed. Now it's clean. Checkpoint, okay. We visit it
once again.
It becomes dirty again.
We need to write it again.
If the two writes were inside one checkpointing cycle,
we would produce only one write.
But if our two visits happened in different checkpoint cycles,
we need to perform two disk writes.
So it's more I/O.
Right, right.
So sequential delete is not that bad.
Random deletes, according to some index, can be very, very bad.
Right. Well, that's a good point. So as well as all the...
Sorry, I like this topic so much. I spent some months exploring it, and we made a lot of very good, I would say enterprise-scale,
experiments. And I can take any system and show exactly how recovery will behave, how exactly you
need to tune it. I can show graphs. It's quite expensive in terms of time, and
probably money, this research of a system, but I think big systems need it. They need to understand their
workload, their system, and understand what will happen. So this random delete, I named it "double unfortunate".
You can be unfortunate because you crashed
and you're unfortunate twice
because you crashed during some random intensive writes.
In this case, you definitely want to understand
your max_wal_size and checkpoint_timeout,
and you want your disk I/O graph not to have a plateau,
but to be, like, spiky.
That is a sign that you have some room for other writes, right?
This is like... our research shows, okay, at 16 gigabytes or 32 gigabytes we already don't have a plateau.
So this is our desired setting for max_wal_size, maybe even 100 gigabytes, divided by two,
we need to remember, since Postgres 11.
And then we say, okay, but
in this case recovery time, at normal times, will be this; at bad times, when somebody
is writing randomly a lot, it can be this, like 10 minutes. Can you afford it, or is it not good here? So
you see how much I have dealt with max_wal_size in the past, especially.
Yeah, right.
This is great.
And I hope people are encouraged to go and... you can easily check this, by the way, can't you?
Just like with all Postgres parameters, you can just use SHOW.
SHOW max_wal_size.
If you get one gigabyte back, maybe it's time to have a look at that.
Same with checkpoint_timeout,
so SHOW checkpoint_timeout.
Check whether
it comes back as five minutes, or it might say 300 seconds. Again, another one to look at.
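A quick checklist version of those SHOW commands:

    SHOW max_wal_size;                  -- 1GB default is usually too small for busy systems
    SHOW checkpoint_timeout;            -- 5min default; may display as 300s
    SHOW checkpoint_completion_target;  -- 0.9 since Postgres 14
    SHOW log_checkpoints;
    SHOW wal_compression;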
Anything else?
There are other things, but let's stop at this point, because we are definitely out of time
here. I apologize for too many details in this case.
I don't think... if anybody thinks we did go into too much detail, let me know, but I don't think that's going to be the
feedback. And again, I want to, like, advertise Netdata here, because if you do some experiments,
you take, like, the same virtual machine, the same settings, everything as production, you do
this very unfortunate massive delete, rolled back, again and again, delete, rolled back,
and you check with various settings.
Do install Netdata and export the whole dashboard,
with all the disk I/O and everything, to a file,
and then you can compare.
You can open several files in the browser.
Snapshots.
Right, right.
And see exactly the difference in behavior
for different settings.
It's so convenient.
Yeah.
And you can store those artifacts long-term.
Yeah, I really enjoyed when you showed me that.
I also wanted to advertise a few things.
There's a couple of great websites for checking out parameters
if you want to see what they mean.
Obviously, the PostgreSQL documentation is great,
but there's also postgresqlco.nf by OnGres
that I'll link up, and pgPedia as well,
which I find great for this kind of thing. They have a section on this that I found useful, so I'll share
those as well.
Right. But, like, I suppose if you have a heavily loaded OLTP system, you probably will
set checkpoint_timeout to 15 or 30 minutes, and max_wal_size to something like 32 gigabytes at least, maybe more. But
it's better to conduct full-fledged research and make decisions based on your requirements
from the business.
Right.
Good.
Thanks so much, Nikolai.
Thanks everyone.
Thank you everyone.
Yes.
Share, like, share, share. Sharing is important.
Most important probably.
And, by the way, I have a
special request to our listeners today: if you have an iOS device, please go to Apple Podcasts
and like us, please, and write a review. We would appreciate it a lot. I don't know if you saw, but we
got a nice one already, so thank you to that person. Good. Thank you, Michael. Cheers, everyone. Bye now.