Postgres FM - Massive DELETEs
Episode Date: April 19, 2024

Nikolay and Michael discuss doing massive DELETE operations in Postgres — what can go wrong, how to prevent major issues, and some ideas to minimise their impact.

Here are some links to things they mentioned:

Article based on Nikolay's talk, including batching implementation (translated to English): https://habr-com.translate.goog/en/articles/523536/?_x_tr_sl=ru&_x_tr_tl=en&_x_tr_hist=true
Our episode on WAL and checkpoint tuning: https://postgres.fm/episodes/wal-and-checkpoint-tuning
Egor Rogov's book on Postgres Internals (chapter 10 on WAL): https://edu.postgrespro.com/postgresql_internals-14_en.pdf
full_page_writes: https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-FULL-PAGE-WRITES
TRUNCATE: https://www.postgresql.org/docs/current/sql-truncate.html
Our episode on partitioning: https://postgres.fm/episodes/partitioning
Our episode on bloat: https://postgres.fm/episodes/bloat
Our episode on index maintenance: https://postgres.fm/episodes/index-maintenance

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres AI.
Hello Nikolay, what are we talking about today?
Hi Michael, we wanted to talk about one of my favorite topics,
massive delete, I usually call it, but maybe you can call it somehow differently.
Yeah, I really like this. So when you say massive delete,
do you mean deleting a lot of rows all at once?
Right.
In one SQL statement.
So, for example, 10 million rows,
1 billion rows,
or just 1 million.
It's good already.
Good enough to feel how bad this is for Postgres.
So we can discuss in detail what's happening if you do it. I like this type of thing because it's not only bad, it's also good.
And we can explore that as well because if you do it on production, it's bad. But if you do it on purpose in some testing environment, it can be very helpful.
Oh, I see what you mean.
So it's a useful tool for certain things.
Okay, cool.
Yeah, I get it.
Stress test.
Yeah.
So you mentioned this in a previous episode, that it would be good to cover this in more detail.
And I was really interested. But I think it's important to say what we're not going to cover as well. So, massive deletes: some people might also be thinking of other things that sound like deletes. For example, people often implement soft deletes, which would be implemented via updates. We're not going to talk about that kind of thing; we're talking only about actual Postgres DELETEs at this point.
Some people complain that soft deletes are not already implemented natively in database systems by default. Maybe it's a good idea to have. But yeah, we will skip this topic. Soft deletes, maybe that's another episode, because it has interesting nuances.
Yeah, but let's talk about what's happening if you want to delete many, many rows, how to do it correctly in production so as not to bring your database to its knees, and how to use it in non-production environments for something good, for good reasons.
Right? I like it. Do you want to start?
Maybe with some use cases, like
when are you generally seeing people
do these huge deletes?
I've got a few examples
I've seen, but I'd be interested in yours.
Well, usually people do massive deletes by mistake, not understanding how Postgres works.
Sure.
And MVCC and so on. They expect it's a simple operation. They probably saw some warning that you shouldn't expect an immediate reduction in disk space used. For example, you delete from some table and think: okay, I will free some space. No, you won't free space, because it's only part of the work. Another part is autovacuum, which will need to actually delete those tuples.
But, yeah, what?
And even then, if you're monitoring disk space, you won't see that as free space again.
It depends.
It depends.
Autovacuum can also truncate the relation if the pages at the end are empty, with all tuples deleted and vacuumed, so physically removed already, right? If trailing pages become free, it will truncate them and reclaim disk space.
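As a rough illustration of this (the table name events and the retention interval here are hypothetical), you can watch the table size and dead tuples around a delete:

    SELECT pg_size_pretty(pg_relation_size('events'));    -- heap size before the delete
    DELETE FROM events WHERE created_at < now() - interval '1 year';
    SELECT n_live_tup, n_dead_tup
      FROM pg_stat_user_tables WHERE relname = 'events';  -- deleted rows show up as dead tuples
    VACUUM events;                                         -- removes dead tuples inside the pages
    SELECT pg_size_pretty(pg_relation_size('events'));    -- usually unchanged, unless trailing pages became empty and were truncated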
Yeah, okay. But I feel like those cases are the exception rather than the norm. I see a lot more cases where people are deleting a subset of data from a non-partitioned table and are really surprised when their relation doesn't decrease in size. But just to cover a few of the other
basics, I think this is actually a topic that's coming up more and more, because we have these new privacy laws that give people the right to be forgotten. And deleting data is quite a safe way of not having a data leak: the less data you have, the less data can be leaked, security-wise. So I feel like there are these forcing functions. I think people are also starting to get serious about having data retention policies, that is, how long should we keep each type of data around for. So I think there are some of these use cases that mean this is coming up more. And in the past, at least in the last 10 years, I saw a lot more people kind of just storing everything forever and not really thinking about it.
So I didn't know if this is something you're seeing more often
or it's just an age-old problem.
Well, yes. So I think GDPR and everything, it's not that big, because, you know, a single user delete is usually some small fraction of the whole, right?
But usually a massive delete happens when we need to clean up and we understand there's some old data which is not super useful. We pay for it. And we want to postpone the moment when we need to scale our database instances, machines, and so on.
In this case, like in general, I would like to mention that before COVID I went to the VLDB conference, which is a conference I have known since I was basically a kid, a student learning database theory and so on.
I've heard about this conference.
So I went to it because it was very close in Los Angeles.
And I remember a keynote. Some researcher, she was presenting a talk which was interesting. It was about deletes, basically. She said: this is the exponential growth of the total volume of data in our databases in the world. Because storage becomes cheaper, we produce a lot of data, big data, huge data, and so on. It's some zettabytes, an insane curve, with some forecast. And she said: we spent decades learning how to store data properly, ACID, not losing data, reliable, highly available, and so on. Now the time is coming to learn how to clean up and delete, to understand which data can be safely deleted and how to delete it efficiently, and so on.
And I was super inspired. I also did a talk at a conference, because around the same time we had a production incident. A senior backend engineer, with a good understanding of analytical databases actually, got a task to delete some old data in preparation for a marketing campaign, because the forecast was saying we either needed to invest a lot in upgrades or we needed to clean up before this huge marketing campaign. So he went to production and just performed a DELETE FROM the table WHERE created_at was older than one year. He estimated in advance that it would be 10 million rows. And we got an incident. Downtime, more than 10 minutes, which cost the company a lot of money. I was super impressed how much money. And he almost left the company himself because he was so embarrassed.
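Presumably something like this, issued as one huge statement (table and column names are illustrative):

    DELETE FROM events
    WHERE created_at < now() - interval '1 year';  -- roughly 10 million rows in a single transaction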
And this was around the same time as that VLDB keynote about deletes. I'm saying we need something, right? At least we need to educate people that a massive DELETE should be split into batches.
Of course, if you have good disk, maybe you'll be fine.
But in that case, we had quite slow disks. It was some enterprise-level but old storage system.
It was on-premise.
And also, at the same time, I was advising this company; I was helping as a consultant with Postgres. The company was growing super fast, a huge startup. And I was saying: guys, you do need to increase max_wal_size, one gigabyte is not enough. So with the default max_wal_size, an untuned checkpointer, and quite slow storage, I think the maximum throughput for writes was maybe 600 megabytes per second, maybe 500.
It's not like current modern NVMe, which gives you like two gigabytes per second, for example.
These two factors plus the idea, let's delete 10 million rows, it's not a huge number, right?
So, yeah, the checkpointer went crazy, because of what happens. First of all, we find rows, I mean the executor finds rows. In that case it was also unfortunate that it was not a sequential pattern: the rows we selected by created_at were physically scattered. So it starts, and first Postgres needs to put an xmax value into this hidden system column. It needs to put the ID of the current transaction, the one which is deleting, into the xmax column for the first row, and the page becomes dirty.
Dirty means that in memory we have a different version of the page than on disk, so we need to flush it to disk eventually. The checkpointer needs to do it, basically, right? And then the next row, the next tuple (basically a physical row version), is in a different page, so we dirty a different page. So if you need to update xmax in 100 rows, it's almost 100 pages dirtied already, very inefficient because of this pattern of basically random access. It's not random, but it's not sequential. This is the key, right?
And then the checkpointer sees: okay, I have max_wal_size of 1 gigabyte, and it was Postgres, I think, 9.5 or 9.6. At that time it meant that the real distance between two checkpoints was three times lower, only about 300 megabytes. You can read a very good explanation in Egor Rogov's book, Postgres Internals, mentioning the improvements and why it is not exactly one third anymore. So the checkpointer says: okay, it's time already, we have accumulated too many dirty buffers. We need to flush them to the page cache first of all, and then pdflush will go and flush them to disk. So it starts working, producing a lot of I/O, and it converts to real disk I/O already.
and it converts to real disk IO already.
And then, boom, different row happens to be in the same page which just
flushed. It was just flushed. It was clean, but we update it again. I mean, we update
different tuple in the same page, it becomes dirty again. And CheckPlanter says, okay,
again, a lot of work.
So it's duplicating effort by flushing too excessively.
Is that what you're... yeah.
Yeah, yeah.
So the distance between checkpoints was, I don't know, like 20 seconds only. Also, I learned, if I'm not mistaken (many years have passed already), that checkpoints can sometimes overlap for a couple of seconds. It was still happening, and it's insane.
So the checkpointer produced a lot of I/O, and the disks couldn't handle this I/O. The database became unresponsive, and we had a critical incident: basically downtime, failover, and so on. Everything became insane, and the downtime was huge. And that's not good, right?
But as I remember, there are two effects. Not only do you need to flush the same page multiple times if you have this non-sequential access pattern, but also, once the checkpointer or anything else has made the page clean, we get a different effect.
Full-page images.
Yes, yes. So full-page images, full-page writes. Since full_page_writes is on, it means that after a checkpoint, when our change, our delete, touches this page again for the first time, the page goes in full to the WAL, right?
Whereas if it had managed to sneak in before the next checkpoint, it would have been a much smaller amount of work.
Yeah, yeah.
If we had, for example, a huge max_wal_size, like 100 gigabytes, and we have enough disk space, we can afford it. We understand that if we crash, startup time will be longer. Replica provisioning also takes longer, because recovery takes time, minutes. And then in this case, even if we have an almost random, non-sequential access pattern and we visit the same page multiple times, not only will the checkpointer flush it just once instead of multiple times, but also only the first change will go to the WAL as a full-page image; subsequent changes will be represented just as tuples.
And this means the volume of WAL generated decreases significantly if we do this. So it was helpful to me as a consultant, because I finally found agreement that we needed checkpoint tuning. It was the beginning of my practice in the US, so it was hard for me to find good arguments, but this case showed it. I just did a very good series of experiments; you know I'm a big fan of experiments. You just start experimenting with max_wal_size and run the delete each time: you do it with one gigabyte, and this is the I/O for this massive delete; then two gigabytes, four gigabytes, a logarithmic approach, until you reach 64 or 128 gigabytes, for example. And you can draw a good curve of how the I/O behaves, right?
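The knobs involved in such an experiment would look roughly like this (the values here are illustrative, not recommendations):

    ALTER SYSTEM SET max_wal_size = '16GB';                 -- repeat the delete with 1GB, 2GB, 4GB, ... and compare the I/O
    ALTER SYSTEM SET checkpoint_timeout = '30min';
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;
    SELECT pg_reload_conf();                                -- these settings can be changed without a restart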
And usually, if you also have monitoring, and this delete takes like a minute or two, or five minutes, you can see in monitoring that you have a plateau for I/O, because it's saturated.
Are you taking the total I/O? So let's say you're doing the same delete each time, and you're looking at total I/O across that time and showing that there's less I/O as we increase the...
Yeah, well, I did several things in this experiment, and I think it's
worth maybe a good how-to article or something. Actually, there were two phases, two parts of the whole experiment. The first was to study this I/O behavior and the checkpointer behavior.
So I took snapshots of pg_stat_bgwriter, which, as we know, until a recent version contained not only bgwriter activity but also checkpointer and backend activity for cleaning dirty buffers. And as usual, I converted buffers to gigabytes, because this is how any engineer can start understanding what's happening. If you say buffers, nobody understands except DBAs, right? If you multiply by eight kibibytes, you get gibibytes and mebibytes, and everyone starts understanding. So I just showed that with the default setting,
the checkpointer had a lot of I/O, and also, in pg_stat_wal, much more WAL was generated.
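A sketch of that conversion, using the pre-Postgres-17 pg_stat_bgwriter columns (on 17+ the checkpointer counters moved to pg_stat_checkpointer); the block size is assumed to be the default 8 KiB:

    SELECT pg_size_pretty(buffers_checkpoint * 8192::bigint) AS written_by_checkpointer,
           pg_size_pretty(buffers_clean      * 8192::bigint) AS written_by_bgwriter,
           pg_size_pretty(buffers_backend    * 8192::bigint) AS written_by_backends
      FROM pg_stat_bgwriter;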
But I also made screenshots of monitoring showing that we had a plateau, saturating disk I/O, disk write I/O. But when we already had like 16 or 32 gigabytes, we saw a spiky pattern, up and down, and this is good.
It means we have a loop.
Because of the batches? Why is it spiky?
Yeah, some batching, something like a checkpoint. And of course there is also checkpoint_completion_target, which is close to one, so it should be spread out, but there is batching there, obviously. And this is good. We must avoid the plateau; a plateau is bad, right?
Actually, it's a good point that this helps: tuning the checkpointer helps with massive deletes, regardless of whether we batch or not.
It's helping in the case where we don't batch,
and it's helping in the case where we do batch.
Is that right?
Well, right now I'm talking only about how tuning would have prevented this incident. I think it would convert a P1, priority one or criticality one, incident into a P2. So basically the database is slow, but it's not down.
Yeah. Great.
Because, as we just discussed, much less WAL is produced and the checkpointer has less I/O to perform. Okay.
I understand the spikiness now. We're talking about when the checkpoints are happening.
Yeah, the I/O from the checkpointer.
Great, great, great.
This I/O was spiky, and it's good. It means we are not having a plateau. We are not saturated.
Of course, it's better if you know your numbers, the limit of I/O your storage device can handle, so you can draw this line on the graph and understand how far away you are. Basically, regular SRE practices, starting from USE (utilization, saturation, errors), should be applied here to study this incident and perform root cause analysis. And this was great. It was obvious that if we reached like 16 or 32 gigabytes, we were in much better shape. We just need to have the disk space. And there was also a second phase of the experiment.
I think we had an episode about max_wal_size and checkpoint tuning, so I won't go into too much detail there. But in the second phase, you do need to understand recovery time in the worst situation. And I invented the term like double worst situation, double unlucky. The worst situation is if your database crashes right before a checkpoint completes.
Yeah.
And double unlucky if at the same time you had massive delete or something like that.
In this case, it means a lot of work during recovery.
So, yeah, that's it actually.
This is how a massive delete looks in the wild, and you should avoid it.
Yeah, so how do we then go from
the, I guess we've gone from priority
one incident to priority two. How do we make
this not an incident at all?
Batching. So just split it into batches. I think there might be cases when logically you cannot afford splitting into batches, because you must delete everything in one transaction. But in my practice, I never saw this. I mean, I could always convince people to split into batches and do everything in different transactions. So in this case, we just need to understand what the ideal batch size is for us. Not too small, not too big. Too small means a lot of transaction overhead.
Too big, we just discussed.
I was going to say, I haven't seen such a delete use case. Like, we're deleting stuff, right? We don't need it anymore, almost by definition. So why would you need it to all be gone or none of it to be gone? That doesn't make sense to me.
Potentially there might be a case when you don't want users to see parts of the old data.
Okay, right, but then they want to not see anything, and you'd rather they saw everything than part of it.
Yeah, in this case, go and adjust the application logic so the application cannot reach that data already, right? Even if it's still present in the database, you hide it already, maybe based on timestamps or something. But again, this is just a theoretical discussion. In practice, I didn't see any cases when we couldn't. With Postgres and its MVCC model, the benefits from batched deletes are always much higher than experiencing this pain and these risks. So, yeah, batches, and we know the ideal size of batches. We discussed it many, many times, starting from the very first episode we had almost two years ago, and very recently in the 'don't do this' episode, right?
Yeah, yeah, yeah. So my recipe for the ideal size is: just try to be below one second, but maybe not below 100 or 50 milliseconds. This is the ideal size to me, based on human perception.
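A minimal sketch of such a batched delete (table, column, and batch size are hypothetical); the statement is repeated from a script until it affects zero rows, with the LIMIT tuned so each batch stays well under a second:

    DELETE FROM events
    WHERE id IN (
        SELECT id
          FROM events
         WHERE created_at < now() - interval '1 year'
         LIMIT 5000                      -- batch size found experimentally
    );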
Yeah, you mentioned something in the recent episode
about it degrading over time.
So let's say you get a clone or you have a replica of production
where you're testing batch sizes and you get the sweet spot.
Maybe you find out that you can delete 5,000 rows in just under a second.
And you think, let's go with 5,000 as the batch size.
One hour later, you see the same delete already takes one minute.
What's happening?
So I was going to ask, not only what's causing it,
but do you run the test for longer?
Or, like, how do you think about that stuff?
Well, you just need to understand that when you delete or update,
you produce dead tuples,
and it means that only the first part of two parts of the operation is done.
It's done seamlessly by your SQL query,
but there is a second super important part,
which is vacuuming.
So delete is not complete
when you see a transaction committed.
It's complete only logically,
but physically we have these dead tuples, basically garbage, unless some transactions still need them.
And you need to understand vacuum. Actually, in the same company, to perform the delete properly, of course I split it into batches, and I was a big fan of, oh, I want to create a simple query which is basically stateless. I didn't want to deal with remembering the last ID or last timestamp of the previous batch. Basically, no keyset, how to say, pagination. No pagination. I didn't want to have pagination.
Because it has a similar problem to pagination for SELECTs showing a huge result set split into pages for users. Similar problem here.
Very similar, because it's basically the same problem. So I wanted to do something stateless, and I just relied on Postgres, on some index. I checked which indexes were used, and then, already in production, I saw this degradation.
So the degradation was because of a lot of dead tuples, and autovacuum couldn't catch up with my speed of deletion. I also remember an interesting thing: since then, I'm a big fan of single-threaded maintenance jobs. I implemented multiple threads originally, but then I saw that even a single thread is too fast for autovacuum, and we don't need parallelization here. Of course, parallelization in my case was based on SELECT FOR UPDATE SKIP LOCKED, fancy, super cool; it was maybe five or six years ago, like, let's do fancy stuff, we will have five workers, let's go. But you don't need it here; one thread is enough to clean up. Otherwise you need to speed up vacuum somehow, and vacuum for a single table is always single-threaded. You cannot have multiple workers cleaning up dead tuples from a single physical table.
Non-partitioned, yeah.
If it's partitioned, yes, they can work on multiple partitions at the same time.
But if it's a single table, no.
I want to come back to partitioning.
But in the meantime, a massive delete also has a big impact on, like, I was thinking about index-only scans. Not for the delete itself, but for other concurrent reads. Because let's say we're in a situation where we are deleting a lot of tuples from, not necessarily random, but lots of different pages. There are going to be a lot of pages that have changes, are therefore no longer marked all-visible in the visibility map, and so can't be served by true index-only scans. So I was thinking, for that case, we might actually want to run a vacuum manually every certain number of batches.
Yeah, this is what I did. And I also thought maybe I should... Yeah, I did. I did a manual, well, not manual, I mean my script did it: maybe after N batches, like 100,000 batches, run a vacuum.
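So the script would interleave something like this (table name hypothetical, N found experimentally):

    VACUUM events;   -- run after every N batches: removes dead tuples and their index entries,
                     -- and refreshes the visibility map for index-only scans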
And how did you determine N?
Good question.
Well, experiments.
So if you want to go stateless, you know that a lot of dead tuples will be a problem, because the index you use, on the primary key or on the creation timestamp, created_at, also needs vacuuming, since it will be pointing to dead tuples. And that's why subsequent batches will be slower and slower; degradation will happen over time. But if you vacuum, you get rid of these links to the dead tuples, right? And it becomes good again.
So based on that, we apply the rule that we don't want our transactions to exceed one or two seconds, because users might notice some bad things, and when we have a long transaction we also become a kind of problem for all autovacuum workers globally, for our whole database. So if you see, oh, it degraded to two seconds, this is the right time to run vacuum, right? But I eventually chose not to run vacuum, actually. I had many iterations of my solution.
And finally I gave up and
decided to go stateful.
So I just performed this keyset pagination, based on the last timestamp. I select the next batch based on that. It's super fast even if we have a lot of dead tuples, so I don't care.
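A minimal sketch of such a keyset-style batch, assuming an id primary key and a created_at index; :last_created_at is a placeholder for state kept by the script, taken from the previous batch:

    DELETE FROM events
    WHERE id IN (
        SELECT id
          FROM events
         WHERE created_at < now() - interval '1 year'
           AND created_at > :last_created_at   -- skip everything already processed, including dead tuples
         ORDER BY created_at
         LIMIT 5000
    )
    RETURNING created_at;   -- the script remembers max(created_at) for the next iteration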
Because autovacuum is good at skipping the pages it can skip. It's not like 10 or 15 years ago; it's already quite good.
So, multiple vacuum runs versus just a single big run. This is a good question, by the way, worth exploring. Is it good to run vacuum quite often in this case, versus running it just once? In terms of how much WAL it will produce and how many disk operations in total, this is a good experiment to conduct and compare the numbers.
Well, doesn't it depend a little bit on your workload, right? If you have no other concurrent users, running one vacuum is likely more efficient than running lots of small ones. But if you have other concurrent users and you're forcing heap fetches, then maybe the I/O impact from those reads outweighs the duplicated effort on vacuum. Or even, when I say duplicated effort on vacuum, I guess we're talking about a similar kind of thing.
If we decide to run a single vacuum, not frequent ones, we will live with a lot of buffers which are not in the visibility map, so yeah, index-only scans will degrade. But you add complexity here. I know, I know. In my case, I don't care about other users for now.
I like this approach. First, you study the problem alone, and then you start adding to it. Like, oh, we have other users, but you already understand the problem in a vacuum, so to speak; single-user mode, basically, right? Then you start adding considerations: what about other users? If you know how Postgres behaves, how your workload behaves in a single-user situation, it's already better. You are much more confident when you start looking at others as well.
Yeah, it's a good point.
If we rely on index-only scans a lot,
probably we should vacuum frequently to avoid
heap fetches. You see that in plans at pgMustard quite a lot, right?
Yeah, it's quite a common one.
It's not the biggest issue, but it is, especially for range scans where people are returning a lot of rows or doing aggregates that rely on index-only scans, these kinds of hybrid workloads, somewhat analytical queries;
they can degrade quite quickly in certain environments
if you go from an index-only scan across a few thousand pages
to an index-only scan with a few thousand heap fetches.
Yeah, so your deletes can be a problem for these. So SELECTs might degrade because the visibility map is outdated.
Exactly. So, yeah, more frequent vacuums might be good. It's an interesting discussion, an interesting point as well. But in my case, I liked it so much when I gave up my idea to be stateless; I just remember this. And remember, I also degraded to a single-threaded approach, forgetting about that SELECT FOR UPDATE SKIP LOCKED. So I just need to remember one value at all times. So easy. And then performance became very predictable, reliable, stable. All batches were the same size and the latencies were the same. It's also interesting that when you do it, you start thinking: oh, I basically want a progress bar. I want to predict the ETA, how much time is left. And if it's degrading, or the latency of each delete is not stable, you cannot predict it reliably. But once you use keyset pagination here, you know your batch takes like 300 milliseconds each.
Great.
I integrated this progress bar into my SQL; it reported the percentage, how much is left, and the ETA.
A quick question on the ordering of batches: do you try to do any kind of natural... You mentioned created_at being a little bit random, but I would have expected that, unless the tuples are getting updated a fair amount, created_at might be a relatively natural ordering.
There should be a very good correlation between physical location and created_at. You can check it easily: select ctid, comma, created_at, and order by created_at, with some limit.
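Roughly this kind of check (table name hypothetical); pg_stats also keeps a ready-made correlation estimate:

    SELECT ctid, created_at
      FROM events
     ORDER BY created_at
     LIMIT 100;                -- if the (page, offset) pairs in ctid grow mostly monotonically, correlation is good

    SELECT correlation
      FROM pg_stats
     WHERE tablename = 'events' AND attname = 'created_at';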
You can see that ctid is the physical location, and you can understand how correlated the physical location is with created_at. In my case, the table had had a lot of updates in the past, so the rows were basically everywhere in the table. And okay, that's why the incident happened, actually.
I think it makes sense. Yeah, yeah, yeah. So, yeah.
Oh, no, I ordered by modified_at. And I even considered creating an index on modified_at. But this is an anti-pattern to me. An index on modified_at is a problem because you immediately start losing HOT updates, heap-only tuple updates, because each update, usually via a trigger or something, changes this value. And if you have an index on it, by definition, Postgres cannot perform heap-only tuple updates. It means you need to deal with the index write amplification problem, which is nasty.
So you're optimizing your delete, but you're messing up your updates. Now you are considering other users.
I remember I went deeper. I decided: okay, honestly, I don't want this index to be built. I know it will bite me back because of heap-only tuples; I don't want to lose them. I checked the statistics and I saw we had a good ratio of them: among all updates, we had a lot of HOT updates.
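The kind of check meant here can be done from pg_stat_user_tables (table name hypothetical):

    SELECT n_tup_upd,
           n_tup_hot_upd,
           round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_update_pct
      FROM pg_stat_user_tables
     WHERE relname = 'events';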
So you know what I decided? I decided to rely on the created_at index, but then somehow perform the logic with modified_at based on their partial correlation. It was somewhat crazy. It worked quite well. I don't remember the details, but I did some tricks there, avoiding that index.
At first, I had already coordinated the index creation with the guys. They said, oh yeah, let's do it. But then I said, no, no, no, no. At a different company, like one year before that, I had already made this mistake: I suggested some index, and then degradation of updates happened because we lost HOT updates. Actually, that incident led me to the idea that we need to verify all changes holistically, ideally checking all the queries we have. And this, and the experimentation approach and so on, it's all connected and interesting. But maybe let's skip this. And since we're almost out of time, if you have questions...
Two more. I guess they're not quick, but when we're done with this large delete, are there any kind of maintenance tidy-up tasks that you recommend doing? I was thinking we did a whole episode on index maintenance, I think rebuilding indexes concurrently...
This question is exactly a good bridge to what I wanted to discuss as the last topic: how a massive delete can be useful in non-production.
Ah, okay. Yeah, let's talk about this.
Why is your question a bridge? Because
now I remember this very well, but in my career I forgot about it many, many times. DELETE doesn't touch indexes at all. This is the key. Indexes are not touched at all during a delete. Delete just puts xmax. That's it. And if the transaction is considered committed, then vacuum will remove the tuple, provided no other transactions still need this version, meaning they are already in the future. In that case the tuple will be deleted, and the links to this dead tuple in the indexes will also be deleted asynchronously, by vacuum. But our backend is not dealing with indexes at all. It only needs to use one index to find the scope of work, the tuples we need to touch. Oh, well, also during planning, as we know, all indexes are considered by the planner and an access share lock is acquired. So I don't see any index maintenance here at all.
Well, I was just thinking about them; let's say we deleted 10% of our table.
Well, yeah, you're right. So if we don't do frequent vacuuming...
Even if we do frequent vacuuming, there'll still be like 10% bloat, especially if it's like created_at or something where we're deleting the old data.
Yes, vacuum doesn't rebalance the B-tree, so the B-tree will probably be in worse shape. I would say if we deleted a big fraction of the table's data, we probably need to check bloat in the indexes, and probably we'll need to rebuild them. But I've already got used to relying on automated rebuild jobs, which we suggest implementing on weekends or something.
But this feels like a good time to reconsider it anyway; it felt worth mentioning. I know you want to bridge to that one, but I had one more quick one. More and more I'm thinking about some of these use cases, like a multi-tenant SaaS application that wants to delete a single tenant. If you've partitioned, I was thinking about partitioning and detach: another way of avoiding big deletes is if you can just drop a whole partition.
And then, like, a round-robin approach, truncating when it's possible.
It's much faster.
It's much faster.
It's much better.
No job for many components, and it's good.
Yeah.
Cool.
I agree.
And the indexes are on the partitions. They are physical, and if a partition is truncated, its indexes are also truncated. So great.
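With time-based partitioning, the whole cleanup becomes a metadata-level operation; a sketch, with hypothetical partition names:

    ALTER TABLE events DETACH PARTITION events_2022 CONCURRENTLY;  -- Postgres 14+; avoids a long blocking lock
    DROP TABLE events_2022;                                        -- or keep it around and TRUNCATE it for reuse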
Yeah. And let's touch a little bit on the last topic: why a massive delete can be useful.
Yes.
So, you know, I was already into Database Lab and ZFS, branching, thin cloning, and iteration, starting from the idea of experimentation. Each iteration should start from the same point. You need to compare things, apples versus apples, right? Always apples to apples. So each iteration must start from the same state of the database.
And this is difficult if the table is huge, the database is huge, and you want to test on big hardware. In the case of single-threaded behavior, if you don't care about the checkpointer, vacuuming, and so on, the thin clones that Database Lab Engine provides are usually great if you just study plans and so on. But in this case, we need to consider the same file system, and we need to understand checkpointer behavior, WAL, everything, vacuum. So we need a dedicated clone when we perform this experiment for checkpoint tuning and to study this behavior. And for this tuning, I found something super cool: this kind of workload gives us a very good, interesting tool. So if we perform a massive delete but don't commit,
and instead roll back, the physical layout remains the same. I mentioned it multiple times in previous episodes, and this is a super cool observation. You can just delete, roll back, begin, delete, roll back, massively.
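A sketch of that loop (table name hypothetical); each iteration dirties pages and generates WAL, but the rollback leaves the data in place:

    BEGIN;
    DELETE FROM events WHERE created_at < now() - interval '1 year';
    ROLLBACK;   -- repeat: the physical layout is unchanged, but WAL, the checkpointer, and replication all get real load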
This puts good pressure on your WAL system, right, and on the checkpointer. It's cool.
And it means that, of course, vacuum won't do anything, because the transaction is marked as rolled back. So even though xmax is already updated, the tuples survive; they're still alive. But Postgres generates a lot of WAL for such a workload.
It also puts pressure on the backup system and on the replication system, and the checkpointer also has a lot of work.
So it means you're testing,
like you can stress test
Postgres components
just with this kind of workload.
And I know we mentioned it last week, right, when we had Sai as a guest. For those who haven't watched that episode, I recommend it; it was a super interesting discussion about logical replication with PeerDB founder Sai. And I mentioned that this is exactly how I found that we can easily saturate a single core with the WAL sender. You just delete a million rows and roll back. It spams WAL, and the WAL sender hits 100% of a single vCPU and becomes the bottleneck quite quickly.
Fortunately, in production, as I mentioned, I couldn't find such a situation, but in non-production it's easy to reproduce. So I found this a very handy tool to stress test Postgres, for checkpoint tuning, and so on. That's why a massive delete is not only bad, but also good. But only when it's rolled back.
Isn't it funny and exciting that the physical layout of the table doesn't change? You delete, roll back, but it's the same.
This is one of those things where if you talk to Oracle guys,
you'll get them sweating by this point because of a different implementation.
It's only because
of Postgres' specific MVCC
implementation.
Because in Oracle they have the
undo log,
it's like the opposite trade-off.
So it's so interesting that
the implementation details are so...
It's good to know them in order to
work out how to test things.
Yeah, Postgres is expecting you to perform rollbacks often.
Yeah, it's good to make use of it when it's advantageous.
Right.
So the tuple remains in the same spot in the page; it doesn't shift, unlike in other systems. But still, xmax is updated, so the page becomes dirty and must be flushed to disk.
Yeah, makes sense.
Yeah, I think we explored it quite well, quite deep. Takeaways, key takeaways. First, perform checkpoint tuning; check out our episode about that.
And prohibit massive deletes: perform deletes in batches, each roughly not longer than one or two seconds. Unless you don't have users who deal with your database a lot, in which case maybe you can go with 30 seconds.
Well, at least consider batching and try it, yeah.
I just wanted to have some rule, and for regular OLTP, one second is a good rule.
Yeah, great. I like it.
I think it's a healthy recommendation as well. Even if you wouldn't go down, or even if you wouldn't have degradation that users would notice, it's just an unhealthy thing to do, to go to excess. It's kind of like extremes: if you can keep things more level, you'll probably see better results.
Yeah. And a couple more takeaways: vacuum. Don't forget vacuuming, regular vacuum, and the effects on index-only scans we discussed. Maybe you want frequent and quite aggressive vacuum, to go faster, with more I/O consumed. And also partitioning. Maybe your massive delete is just a TRUNCATE, and in this case all those negative effects go away.
Yeah, yeah. Love it. Good. Thanks so much, Nikolay.
Thank you, Michael. See you next time.