Postgres FM - WAL and checkpoint tuning

Episode Date: September 9, 2022

Here are links to a few things we mentioned:

- LSN (log sequence number)
- On the impact of full-page writes (blog post by Tomas Vondra)
- Deep dive on Aurora Postgres (talk by Grant McAlister)
- Netdata monitoring
- PostgresqlCO.NF
- pgPedia

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov and @michristofides or by commenting on our Google doc. If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:
- Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai. Hey Nikolay, what are we going to talk about today? Hi Michael, let's talk about checkpoint tuning. Yes, and the write-ahead log in general, right? WAL configuration. We had a really good request for this topic from one of our listeners. Right, so not only checkpoint tuning, but checkpoint tuning is a part of it. Yes, absolutely. Thank you for this one. Very interconnected
Starting point is 00:00:29 topics; we should think about them both, WAL configuration and checkpoint tuning. They come together. Yeah, should we go through them one at a time? So, what are checkpoints, should we start there? Or what is WAL? Right. So what are checkpoints? To talk about checkpoints, we need to talk about WAL first, the write-ahead log. And the write-ahead log is the basic, absolutely fundamental concept of any database system,
Starting point is 00:00:58 not only relational ones. It says that data is first recorded to a different place, to some additional kind of binary log, the write-ahead log, and only then is it changed in memory, in the data pages. This allows us to build a recovery system. It allows us to survive unexpected restarts, failures, various bugs, and so on. And in Postgres, this also allows us to have physical replication, because it's built on top of the recovery subsystem. So when you change some row in a table, first this change is recorded into the write-ahead log, and only when we know that this change is recorded reliably in WAL, that it's already on disk,
Starting point is 00:01:48 only then can the user see the commit. And the actual change to the data can still be only in memory, not yet flushed to disk, right? So basically we record twice: we record the change to the write-ahead log, and we change the data in the place where it's stored permanently. But that flushing to disk, in the second case, doesn't happen synchronously. So we can see the commit, but the data files, the table and index files, are not yet in their actual state on disk. And a checkpoint is the process of
Starting point is 00:02:27 writing so-called dirty blocks, dirty buffers, to disk. Dirty in this context means changed but not yet saved to disk. If a buffer is already saved, it's called a clean buffer; if not yet saved, a dirty buffer. When we have dirty buffers, it means we changed something: we already reflected these changes in WAL, that's rule number one of the write-ahead log, but they are not yet saved to disk. When a checkpoint happens, all dirty buffers are saved. And it means that the next time we need to start, for example after some crash, power loss, anything, we don't care about previous changes, because they are already
Starting point is 00:03:08 reflected on disk, right? Yeah. So we only have to worry about things since the last checkpoint. Right. And I guess that'll come up. Right, right, right. Yeah. And just before we move on from the basics, because I had to read up a bit
Starting point is 00:03:21 about this, the main reason we need this is to prevent data loss on recovery. So the idea of this dual system is, I believe, the D in ACID, the durability side: preventing data loss. Right, ACID is the core concept of a database system.
Starting point is 00:03:41 If it says data is committed, it can never lose it; otherwise, it's a bad system. The write-ahead log is exactly what allows us to have that. And without checkpoints, we would need to keep a lot of changes and replay a lot of changes, and startup time after any crash would be very long. So checkpoints happen all the time. They happen kind of on schedule.
Starting point is 00:04:06 They can also happen, for example, when we need to shut down the server or restart it. There is a so-called mandatory shutdown checkpoint, and Postgres doesn't report that shutdown has completed until this shutdown checkpoint finishes. That's also important to understand, and it's why we can sometimes see that shutdown takes significant time: we have a lot of dirty buffers and we need to save them first. Yeah. On this topic, I saw a good recommendation, I think it was from you actually, around Postgres upgrades and the idea of triggering a checkpoint
Starting point is 00:04:40 manually to reduce the time needed. Right, because during the shutdown checkpoint, Postgres doesn't respond to new queries anymore; it says it's shutting down, come back later, right? But if we run a manual checkpoint, the SQL command CHECKPOINT, it can run while the server is still serving traffic. So we save a lot of the dirty buffers ourselves, just by running it in a terminal in psql, for example. And then when the shutdown checkpoint happens,
Starting point is 00:05:08 it's already fast, because there's a very low number of dirty buffers, right? So it's always recommended, when we need to restart the server, for example for a minor upgrade, or to perform a switchover, that this manual, well, not necessarily manual, it can be automated of course, but this additional explicit checkpoint should be there, because after it the shutdown checkpoint will happen, and we help that one to be shorter and faster.
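In practice that can be as simple as this sketch, run from any session shortly before the restart or switchover:

```sql
-- Flush most dirty buffers ahead of time, while the server is still
-- accepting queries; the later shutdown checkpoint then has little left to do.
CHECKPOINT;
```

Some people run it a second time right before issuing the restart, so the mandatory shutdown checkpoint has even less to write.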
Starting point is 00:05:33 Yeah. Right, so you already mentioned a couple of uses for the write-ahead log. One is recovery on crash, one is replication. I believe point-in-time recovery is another one, so tools like pgBackRest make use of it, I believe. Right, right, right. Well, it's a different topic. But of course, if we store full copies of PGDATA,
Starting point is 00:06:00 by the way, they are never in a consistent state. If you copy PGDATA on a live server, it's not consistent, so you always need a bunch of WAL files to be able to reach a consistency point. But you can also store the whole stream of WAL additionally. This allows you to take a so-called base backup, a copy of PGDATA corresponding to some point in time, and then replay additional WAL and reach a new point in time, and it can be arbitrary.
Starting point is 00:06:31 So if you store a continuous stream of WAL in your archive, you can choose any time you want. The only problem, usually, is that if your database is quite big, the initial copying of PGDATA takes time. A very, very rough rule is one terabyte per hour; it can be slower or faster, but it's a rough rule. So if you have a 10 terabyte database, be prepared for five or 10 or 15 hours of initial copy. And I wonder why people don't use cloud snapshots for that more often, to speed it up, right? It would make complete sense. But I know they are sometimes not reliable,
Starting point is 00:07:09 cloud snapshots in Amazon and Google; there are issues with them sometimes. Also, making them takes time, but we could do that on a replica. Anyway, this takes us to a different field. Let's postpone the discussion of backups and other things, but you're right, point-in-time recovery is another area where WAL is applied. The primary goal, though, is to allow us to recover from an unexpected shutdown, power loss, or something like that. I think... it's like in games, right? I lost you briefly then. Yeah, I see, I see, let me listen to you
Starting point is 00:07:47 then. So, well, yeah, I think it's getting better. So we have a lot of parameters for tuning these things, a lot of parameters for controlling how fast these things happen and how much has to happen before they kick in. There are also a few settings that it seems like we really shouldn't touch pretty much ever. But it also seems that, especially on high-write workloads, there are some really big wins we can get by changing some of the defaults. The defaults are not good enough for you, usually, right. Okay, is that generally true? Or, like, if I have a read-heavy application and not a ton of writes, am I likely to... how would I know if I'm hitting problems, what are the first telltale signs? Well, it's a good question. Usually
Starting point is 00:08:30 it's good to evaluate the distance between two daily backups, for example. You can even do it in SQL: you can take two LSNs; usually when a backup is taken, you know its LSN, the log sequence number. It's like a position in the WAL. It's always sequential, and it has some specific structure; there are a few articles, and we will attach articles explaining how to read an LSN and how to understand its structure. So if you take two LSNs and convert them to the pg_lsn data type in Postgres, you can subtract one from the other, and the difference will be in bytes.
Starting point is 00:09:07 Then you can run pg_size_pretty on this difference and see it in megabytes or gigabytes (strictly speaking, mebibytes and gibibytes, but Postgres doesn't use that notation). So you can understand how much WAL you generate during a day, right? If this size is quite small, a few gigabytes or 10 gigabytes, you probably don't need specific checkpoint and WAL tuning at all. But if you generate one terabyte per day, that's quite a lot, and I'm sure you need to move away from the defaults; you will get better performance, better everything.
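As a sketch of that arithmetic (the LSN values below are placeholders; take yours from your backups):

```sql
-- Difference between two LSNs, in bytes, then pretty-printed.
-- pg_wal_lsn_diff() is available on all supported versions; recent versions
-- also allow subtracting pg_lsn values directly with the "-" operator.
SELECT pg_size_pretty(
         pg_wal_lsn_diff('2B/8E000000'::pg_lsn, '2A/10000000'::pg_lsn)
       ) AS wal_generated;
```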
Starting point is 00:09:50 For example, wal_compression is not enabled by default, right? I missed that one. That's great. Yeah, but I'm going to check whether it's enabled in the most recent version, because Postgres defaults are improving, but they are still lagging; if you have heavily loaded systems, you definitely want to tune them. So in Postgres 13, for example, wal_compression is disabled by default. Let's talk about what is written to WAL. Oh, first of all, to simplify: the explanation people usually use for checkpoints is that it's like in games, you want to save your progress, right? And if something bad happens, you will repeat fewer steps. It's a very simple analogy. Yeah, and in Postgres 14 wal_compression is still not enabled.
Starting point is 00:10:35 Yeah, in Postgres 15 it's still not enabled. And I think it should be enabled in most setups; I'm almost sure it's enabled on RDS.
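Checking and enabling it looks roughly like this (on Postgres 15 and newer you can also pick a specific method such as lz4, if the server was built with it):

```sql
SHOW wal_compression;                   -- 'off' out of the box

ALTER SYSTEM SET wal_compression = on;  -- compresses full-page images in WAL
SELECT pg_reload_conf();                -- picked up without a restart
```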
Starting point is 00:10:56 So if you do checkpoints very rarely, once per week for example, which is insanely infrequent, there's a high chance that if a crash happens you will need to wait a long time while Postgres replays many WAL files, right? A lot to do in terms of redo. And during this period you are down; your system is down. Not good. That's why, logically, I would say it's good to have checkpoints more often, right? It seems like a Goldilocks problem, right? Too often, and you have a lot of overhead,
Starting point is 00:11:17 but too infrequent, and it will take a long time to recover. So it feels like there's a balance. There is a trade-off here, and there are two kinds of overhead; we will talk about them in a second. But to understand where the overhead comes from, apart from the dirty buffers themselves, let's talk about what is written to WAL. By default, full_page_writes is enabled, right? And what is a full-page write? If you change anything in some table, in some row, it may be a very small change, but Postgres writes the whole page to the write-ahead log. Why? Because there is a difference in block sizes: a buffer is usually 8 KiB, but the file system probably uses a 4 KiB block size, and you don't want a partial write to be reported as a success when in fact only one half of the page made it to disk.
Starting point is 00:12:09 So that's why full-page writes are needed. By the way, there are talks about Aurora Postgres by Grant McAlister, if I'm not mistaken, where this is explained very well; we will find links to YouTube and probably a slide deck. They explain this problem of full-page writes and their big overhead very well. So when the first change to a page occurs, the first time it's written after a checkpoint, it's a full-page write. If you change it once again, only a small delta is written, so it's not a full page, but only until the next checkpoint. Once a checkpoint has happened, the first change to each page again becomes a full-page write.
Starting point is 00:12:53 If checkpoints are very frequent, we have a lot of full-page writes. If checkpoints are infrequent, we very often have repetitive changes to the same page. So we changed it, for example a row was inserted; then we insert once again into the same page, which is a new change. Again and again we update something; we have a HOT update, a heap-only tuple update, so the change goes to the same page.
Starting point is 00:13:20 And this means we touch the same page, we write to it multiple times. So I'm saying not only the number of writes matters; the nature of the writes matters too. If you have HOT updates touching the same page many, many times, you will benefit a lot from rare checkpoints, because only one full-page write will happen after a checkpoint, and then every subsequent change writes just a little bit to the write-ahead log, until the next checkpoint of course. Yeah, that's super interesting, and I think it also explains why some people, and I suspect this is a very bad idea, turn off full-page writes in order to increase throughput. It feels like a very risky thing to do,
Starting point is 00:14:10 and I feel like we're going to cover quite a few better ways. Some systems can afford it. You need to understand what your file system is, what your disks are, and what settings they use. In some cases it's possible, but it's quite dangerous: you should understand all the risks and be 100% sure it's safe on your system. But usually we don't go this way. Usually we use ext4 with a 4K block size and so on,
Starting point is 00:14:37 and we want full-page writes enabled. So, back to compression. Compression is applied only to full-page writes. Interesting. Right, so as I understand it, Postgres doesn't compress these small changes; it compresses only the full page image we record the first time we change something in a page after a checkpoint. And compression is not enabled by default. If you enable it, you can see a huge benefit in terms of how much WAL you write. Why do we care
Starting point is 00:15:13 about volume here? Because if we have a write-heavy system, writing a lot of additional data to WAL is additional overhead on disk. And if you have 10 replicas, and sometimes people have that, all replicas need to receive this data; both physical and logical replication work through WAL, the write-ahead log. So if we write a lot, we need to send a lot over the network. So we want wal_compression enabled, to compress all the full-page writes, and we want checkpoints to happen rarely, to have fewer full-page writes as well.
Starting point is 00:15:49 So I would tune max_wal_size and checkpoint_timeout to have very infrequent checkpoints. But in this case, if they are infrequent, startup time after a crash, and also failover, for example, will suffer; the timing of these procedures will be very bad. Long. Minutes. I see various engineers struggling to understand why, for example, shutdown takes so long, why restart takes so long. They become nervous, and in extreme cases they use kill -9,
Starting point is 00:16:24 so, SIGKILL. Postgres survives it, because we have the write-ahead log and we just redo. But the redo also takes a long time. They kill it; it's not acceptable, only in rare cases should we do it, it's a last resort.
Starting point is 00:16:38 We should not do it. But after that, Postgres starts, and startup also takes many minutes, and they're still nervous. It's not a good situation. That's why people need to understand how much WAL needs to be written and the distance between checkpoints. Yeah, let's go back to a couple of the settings you mentioned. So my understanding is that checkpoint_timeout is the maximum time between checkpoints, and its default is quite low: five minutes. Very low.
Starting point is 00:17:06 So what would be a sensible starting point for most people in terms of? Yeah. So usually the main metric here is how long you can afford being down in the bad case, in the case of an incident. This is the main number you need to understand.
Starting point is 00:17:23 You need to talk with your business people and find some number: okay, we can be down for up to two minutes, for example. From there you start thinking: okay, we have this requirement, or SLO, service level objective, if we are SREs, right? So if we have two minutes, let's think about how much we can replay in two minutes. We can measure it with an experiment. We can, for example, set checkpoint_timeout and max_wal_size to insanely
Starting point is 00:17:54 big numbers. Then we can have a lot of writes happening, with pgbench, for example. Then we wait until one checkpoint finishes and another is about to finish, and then we kill -9 our Postgres, crash it on purpose. Then we watch recovery and just measure its speed: how many bytes of WAL we can replay per second, per minute. This gives us an understanding of how much WAL we can afford to replay without exceeding two minutes of downtime, for example.
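A rough sketch of that lab setup; these are lab-only values, not production settings, and it assumes you read the recovery duration afterwards from the "redo starts at" and "redo done at" lines in the server log:

```sql
-- Lab only: push checkpoints far away so the crash test replays a lot of WAL.
ALTER SYSTEM SET checkpoint_timeout = '1d';   -- the maximum allowed value
ALTER SYSTEM SET max_wal_size = '500GB';      -- effectively "never" by size
SELECT pg_reload_conf();

-- Then generate writes (pgbench, for example), kill -9 the postmaster on purpose,
-- and divide the amount of WAL replayed during recovery by the time it took.
```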
Starting point is 00:18:28 Yeah, perfect. From this, we can start thinking; this is very important to understand: recovery speed in terms of bytes per second, bytes per minute, gigabytes per minute, anything here. From there, we can understand how many bytes of WAL we produce when everything is normal. At night, for example, we usually have lower activity; at daytime on working days we probably have more activity, right? But usually we say, okay, we produce, for example, one WAL file per second, which is quite a good pace. Each WAL here means one WAL file. There is also
Starting point is 00:19:07 some confusion in terminology, because as I remember, the documentation says a WAL file is some abstract thing, something like two gigabytes, and a WAL segment is 16 MiB. But if you go to the pg_wal directory, you will see that each file is usually 16 MiB; as I remember, RDS tunes this and uses 64 megabytes per WAL segment. But as I said, each WAL file is usually 16 megabytes, so one WAL file per second during normal, quite busy hours means we produce 16 megabytes per second of WAL data, right? Okay. So, what is our replay speed, and what is our production speed? From there we can understand over what period of time we generate the amount of WAL data that corresponds to two minutes of recovery time, right?
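If you want to poke at this on a live system, a couple of quick checks; a sketch, and note that pg_ls_waldir() requires superuser or the pg_monitor role:

```sql
-- How many WAL segment files currently sit in pg_wal, and their total size:
SELECT count(*)                  AS segments,
       pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();

-- Current WAL write position; sample it twice, some time apart, and feed both
-- values to pg_wal_lsn_diff() to get your WAL generation rate.
SELECT pg_current_wal_lsn();
```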
Starting point is 00:19:55 Quite complex, I understand. Well, it feels like luckily we've got that second parameter in max_wal_size. No, no, no, let's pause. Yes, I'm talking about how, in my opinion, to understand what a normal checkpoint_timeout is for you, right? Yes. So in this case we understand: okay, recovery speed is this, production speed is this, so we can conclude that to have no more than two minutes of recovery time, we need to produce no more than this amount of WAL data. So we will set checkpoint_timeout to probably half an hour, or maybe 15 minutes, 30 minutes; it depends, of course.
Starting point is 00:20:40 Observing a concrete system, we can draw a conclusion: okay, we want 30 minutes, for example. But then we start distinguishing planned checkpoints and requested checkpoints. Postgres has two logics here. One logic is on schedule: when the time comes, it's time to have a checkpoint, every 30 minutes, for example. By default it's five minutes; I think that's too often, right?
Starting point is 00:21:05 But then there is another parameter, called max_wal_size, and I think it's a very, very important parameter to understand. It's our protection for cases when we have elevated activity and want more frequent checkpoints, because we want to be protected, again, from writing too much data to WAL and ending up with a long recovery again. It's the same logic: if we understand how much we produce, the speed of production, we can say that max_wal_size also roughly corresponds to that. So checkpoint_timeout and max_wal_size, their tuning
Starting point is 00:21:39 can be correlated here, right? Yeah. So my understanding is, it sounds like we should rely on checkpoint_timeout for the majority of the time; that should be the thing that kicks off checkpoints. But if more WAL is generated than the amount we expected, we should set a limit there too. Right, and the default is very small: the one gigabyte default is insanely small for modern workloads. I usually recommend going up, sometimes up to 100 gigabytes, but we need to understand the recovery trade-off, right? We need to measure recovery and guarantee to our business that we will not be down for more than, for example, two minutes or five minutes. But max_wal_size protects us in the cases when we have more writes, and Postgres can decide to perform a requested checkpoint; we see it in the logs.
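As a sketch of the kind of change being discussed; the values below are only illustrative, and the right numbers come from the recovery-time measurement described earlier:

```sql
ALTER SYSTEM SET checkpoint_timeout = '30min';  -- instead of the 5min default
ALTER SYSTEM SET max_wal_size = '32GB';         -- instead of the 1GB default
SELECT pg_reload_conf();                        -- both apply without a restart
```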
Starting point is 00:22:40 By the way, we always need to enable logging of checkpoints. As far as I remember, that default was recently changed and logging is now enabled; I remember the discussion on the hackers mailing list. So log_checkpoints should be enabled, for all checkpoints, I'm 100% sure; this is what you want to understand. The default is disabled in Postgres 12, still disabled in Postgres 14, but enabled in Postgres 15, which will be released very soon. So this is a new change: in Postgres 15 log_checkpoints is enabled, and I recommend enabling it for any Postgres.
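Checking and enabling that is cheap; a minimal sketch:

```sql
SHOW log_checkpoints;                   -- 'off' before Postgres 15, 'on' from 15

ALTER SYSTEM SET log_checkpoints = on;  -- each checkpoint then logs its timing,
SELECT pg_reload_conf();                -- buffers written, and WAL distance
```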
Starting point is 00:23:15 I also saw some DBAs say that if, say, 90% of all checkpoints are requested, occurring according to max_wal_size, this is a problem. No, it's not a problem. It's not a problem, because a requested checkpoint and a timed checkpoint, one planned on schedule, are actually the same; no big difference. But, of course, you want things to be in order.
Starting point is 00:23:40 Everything should be in order. Of course; it's just a sign that you probably need to reconsider your settings, but it's not an emergency situation, right? Yeah, sounds good.
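If you want to see that ratio on your own system, a quick sketch; on current versions these counters live in pg_stat_bgwriter, though in newer releases they may move to a dedicated view:

```sql
-- Checkpoints triggered by checkpoint_timeout vs. requested ones
-- (for example because max_wal_size was reached) since the last stats reset:
SELECT checkpoints_timed,
       checkpoints_req,
       stats_reset
FROM pg_stat_bgwriter;
```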
Starting point is 00:24:01 There is another parameter we didn't mention, checkpoint_completion_target. By default it's 0.7, or? No, this changed. Yeah, I looked this up: until very recently it was 0.5. Oh, 0.5 is terrible, I would say; it's not what you want. Yeah, but in Postgres 14 it was increased to 0.9. Great, that is a good number. So what is it? When you run a manual, explicit checkpoint, it goes at full speed: it writes dirty buffers to disk as fast as possible, and that produces some stress on the disks. It's okay stress, but normally you want to be more gentle with your disk system, right? That's why we spread the checkpoint over time.
Starting point is 00:24:36 And a checkpoint_completion_target of 0.9 means that between two checkpoints, we want to spend 90% of the time writing and 10% of the time resting. Maybe you want even more, 99% of the time, I don't know. And this is important, because it's hard to understand the distance between checkpoints; it's quite a tricky question. Logging will report something, but you can think of the moment a checkpoint starts as the beginning. So 10 minutes between them, or
Starting point is 00:25:08 15 minutes between them, or 30, is fine. But what I wanted to get across is that this is very tricky. It bothered me for a few years, actually, and only the book by Rogov, which we already mentioned, PostgreSQL Internals, explained it to me. I read it in Russian earlier; now both parts are published in English. It's very good, and it explains everything in detail with links to the source code.
Starting point is 00:25:32 And finally I understood why, if we set max_wal_size to one gigabyte, the distance measured in bytes can be something like 300 megabytes, roughly three times smaller. Why is that? The explanation is interesting, and I'm looking at it right now. I knew it from practice: when I saw,
Starting point is 00:25:53 oh, you have the default one gigabyte, I knew the actual distance measured in bytes would be around 300 megabytes. That's a tiny distance, and it means that checkpoints will disturb your system constantly. I even saw a case at a very large company where people had some cleanup job running in the background, and before a big marketing event they disabled this job.
Starting point is 00:26:18 Then, a couple of months later, they realized the job was disabled, and some engineer, a very experienced one but not a Postgres expert, said: okay, this job was never painful, it had been working for many years. So he went ahead and tried to delete 10 million rows using one DELETE, and put the system down for 10 minutes, because they didn't have checkpoint tuning in place. max_wal_size was the default 1 gigabyte, so the actual distance was 300 megabytes; I will explain why. It means that when you produce a lot, you have checkpoints happening all the time, boom, boom, boom, and a lot of full-page
Starting point is 00:26:59 writes, boom, boom, boom, not compressed. And the disks were quite good, enterprise disks, though not NVMe, unfortunately. Saturation happened and they went down for 10 minutes, from just one DELETE. I even gave a talk in Russia some time ago about exactly this case,
Starting point is 00:27:18 about how one line of DELETE can put your Postgres down, even a critical system that worked very well before. So checkpoint tuning is an important thing to have. So if you have one gigabyte: until Postgres 11, with checkpoint_completion_target close to one, you should divide by three; since Postgres 11, you should divide by two.
Starting point is 00:27:49 So if you have a one gigabyte max_wal_size, the actual distance will be roughly half a gigabyte if your checkpoint_completion_target is 0.9, because Postgres needs everything since the last checkpoint and also everything written between the previous one and the latest successful one. Oh, interesting. And before Postgres 11, an additional cycle was needed, so two successful cycles plus a tail; well, not exactly a tail, because it's not behind us, it's in front of us, right? So if checkpoint_completion_target is 0.9, roughly three intervals are needed.
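A rough worked example of that division; this treats the divisor as an approximation of the scheduling formula, so take the exact numbers with a grain of salt:

```sql
-- Effective distance between checkpoints for max_wal_size = 1GB
-- and checkpoint_completion_target = 0.9:
SELECT pg_size_pretty(1024 * 1024 * 1024 / (1 + 0.9));  -- Postgres 11 and newer: ~539 MB
SELECT pg_size_pretty(1024 * 1024 * 1024 / (2 + 0.9));  -- Postgres 10 and older: ~353 MB
```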
Starting point is 00:28:23 Yeah, interesting. That's why you need to raise max_wal_size anyway. Yeah, it seems like almost everybody would want to increase that one. I've read about some other interesting ones that I'd be keen on your view on. And also, actually, I guess, are people on cloud providers, you mentioned RDS a couple of times, generally more protected from this because things have been tuned already?
Starting point is 00:28:47 Tuning here means increasing max_wal_size. So yes, increase max_wal_size, but don't do it blindly; understand the recovery time. Yeah. The other one, min_wal_size, for example: I've read that increasing it can improve write performance. From my practice, I cannot say anything here; I didn't dive deeply enough to discuss it. But max_wal_size is my favorite topic. Yeah, if you haven't had to worry about min_wal_size, then I can't imagine it's that important, so yeah,
Starting point is 00:29:18 good to know. And yeah, the only other one I wanted to ask about while we're on the topic... Sorry, this is so important, I just want to emphasize it. Yes. So if we have a very short distance in terms of max_wal_size, forced checkpoints, and a lot of write activity, unexpected or maybe expected because someone decided to do it, we can measure the effect with experiments. And what I found is that, you know, thin clones are good for iterating, but we
Starting point is 00:29:46 cannot use thin clones here, because we need our disk and file system to behave exactly the same as in production. So I found a good recipe for a workload that will not touch our physical layout of data: a transaction with a massive DELETE. Delete 10 or 100 million rows, but cancel it, roll it back. The DELETE will write to xmax; we discussed that a couple of months ago, probably.
Starting point is 00:30:17 It writes the transaction ID of the deleting transaction into each tuple's xmax. But if the transaction gets cancelled, that is effectively zero, meaning the tuple is
Starting point is 00:30:35 still alive. So we produce a lot of WAL and put big stress on the system, but in the end nothing has changed, and we can do another experiment on the same system. It's a perfect workload for a lab, right?
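A minimal sketch of that workload; the table and column names are hypothetical, and the LIMIT should be sized to your system:

```sql
-- Generates lots of WAL and dirty pages, but leaves the data logically unchanged,
-- so the experiment can be repeated with different max_wal_size values.
BEGIN;
DELETE FROM large_table
WHERE id IN (
    SELECT id FROM large_table LIMIT 10000000  -- or ORDER BY an indexed "random"
);                                             -- column to scatter the writes
ROLLBACK;  -- xmax gets set, but the aborted transaction leaves the tuples alive
```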
Starting point is 00:30:56 So we can run a sequence of experiments with different max_wal_size values and watch the behavior using iotop or iostat, or using monitoring; I recommend Netdata because it has an export button. You can export all the graphs and see how disk I/O behaves. And usually, if you have a one gigabyte max_wal_size and the disks are not very powerful, you will see a plateau, because they're saturated, right? Then you double max_wal_size: a plateau again. Double it again, and at some point you will see that your system under this stress no longer plateaus. And iotop shows that when max_wal_size is small, if you order by write throughput, the checkpointer will be number one: it writes 200, 300, I don't know, 500 megabytes per second.
Starting point is 00:31:33 It just writes, writes, writes. Also, I promised to explain the two kinds of overhead. One we already discussed: full-page writes. If we just finished a checkpoint and we need to start another one because max_wal_size commands us to have them very frequently, it's insane checkpointing, the checkpointer gone mad, right? For example, I saw one every 15 seconds, boom, boom, boom, because with these deletes, 300 megabytes fills up quite fast. Again and again. Default settings.
Starting point is 00:32:06 So full-page writes are one type of overhead. Oh, and also: make your deletes not sequential. For example, you can have an index on some effectively random column, and you can say: let's delete the first 10 million rows ordered by this column. But the order is random, so the first tuple is in the first page, the second tuple is in page number 1000, and we jump between various pages. And this is the worst case, because if they were sequential, probably all the changes to one page would
Starting point is 00:32:37 happen inside one checkpointing cycle. But if we jump between pages, we constantly produce a lot of full-page writes, and we have to produce them again and again because a checkpoint just finished, right? This is the worst situation, and it happens; this is exactly what put that system down, the case I described. The second kind of overhead I didn't realize at first, but it's quite obvious: if our page was dirty and it was checkpointed, now it's clean. Then we visit it once again and it becomes dirty again,
Starting point is 00:33:06 and we need to write it again. If the two writes happened inside one checkpointing cycle, we would produce only one disk write; but if the two visits happen in different checkpoint cycles, we need to perform two disk writes. So it's more I/O. Right, right. So a sequential delete is not that bad;
Starting point is 00:33:27 a random delete, following some index, can be very bad, right? Well, that's a good point. So as well as all the... Sorry, I like this topic so much. I spent some months exploring it, and we did a lot of very good, I would say enterprise-scale, experiments, and I can take any system and show exactly how recovery will behave and exactly how you need to tune it; I can show graphs. It's quite expensive research in terms of time and probably money, but I think big systems need it: they need to understand their workload, their system, and what will happen. So this random delete, I named it doubly unfortunate: you can be unfortunate because you crashed, and you're unfortunate twice
Starting point is 00:34:12 because you crashed during some random, intensive writes. In this case, you definitely want to understand your max_wal_size and checkpoint_timeout, and you want your disk I/O graph not to show a plateau but to be spiky, which is a sign that you have some room for other writes, right? Our research shows, for example: okay, at 16 gigabytes or 32 gigabytes we no longer have a plateau, so that is our desired setting for max_wal_size, maybe even 100 gigabytes, divided by two,
Starting point is 00:34:41 keeping in mind the division by two since Postgres 11. And then we say: okay, but in this case recovery time can be this at normal times, and at bad times, when somebody is writing a lot randomly, it can be this, like 10 minutes. Can you afford that, or is it not acceptable? So you see how much history I have with max_wal_size especially. Yeah, right. This is great, and I hope people are encouraged to go and... you can easily check this, by the way, can't you?
Starting point is 00:35:12 Just like with all Postgres parameters, you can simply use SHOW. SHOW max_wal_size: if you get one gigabyte back, it's maybe time to have a look at that. Same with checkpoint_timeout: SHOW checkpoint_timeout, and check whether it comes back as five minutes, or it might say 300 seconds; again, another one to look at.
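A quick sketch of those checks:

```sql
SHOW max_wal_size;                  -- '1GB' is the default and likely worth raising
SHOW checkpoint_timeout;            -- the default is five minutes (may display as 300 seconds)
SHOW checkpoint_completion_target;
SHOW wal_compression;
SHOW log_checkpoints;
```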
Starting point is 00:35:36 Anything else? There are other things, but let's stop at this point, because we are definitely out of time here. I apologize for too many details in this case. I don't think that's the kind of feedback we'll get; if anybody thinks we did go into too much detail, let me know, but I don't think that's going to be the feedback. And again, I want to advertise Netdata here, because if you do some experiments, where you take the same virtual machine, same settings, everything as in production, run this very unfortunate massive delete, rolled back, again and again, delete, rolled back, and check with various settings, do install Netdata and export the whole dashboard
Starting point is 00:36:10 with all disk IO and everything to file, and then you can compare. You can open in browser several files. Snapshots. Right, right. And see exactly the difference in behavior for different settings. It's so convenient.
Starting point is 00:36:24 Yeah. And you can store those artifacts long-term. Yeah, I really enjoyed it when you showed me that. I also wanted to advertise a few things. There are a couple of great websites for checking out parameters if you want to see what they mean. Obviously the PostgreSQL documentation is great, but there's also postgresqlco.nf, by OnGres,
Starting point is 00:36:44 which I'll link up, and pgPedia as well, which I find great for this kind of thing; they have a section on this that I found useful, so I'll share those too. Right. But, like you said, if you have a heavily loaded OLTP system, you will probably set checkpoint_timeout to 15 or 30 minutes and max_wal_size to something like 32 gigabytes at least, maybe more; but it's better to conduct full-fledged research and make decisions based on your requirements from the business. Right. Good.
Starting point is 00:37:15 Thanks so much, Nikolay. Thanks everyone. Thank you everyone. Yes: share, share, share; sharing is important, most important probably. And by the way, I have a special request to our listeners today: if you have an iOS device, please go to Apple Podcasts
Starting point is 00:37:33 and like us, please, and write a review; we would appreciate it a lot. I don't know if you saw, but we got a nice one already, so thank you to that person. Good. Thank you, Michael. Cheers, everyone. Bye now.
