Postgres FM - Massive DELETEs
Episode Date: April 19, 2024

Nikolay and Michael discuss doing massive DELETE operations in Postgres — what can go wrong, how to prevent major issues, and some ideas to minimise their impact.

Here are some links to things they mentioned:

Article based on Nikolay's talk, including batching implementation (translated to English): https://habr-com.translate.goog/en/articles/523536/?_x_tr_sl=ru&_x_tr_tl=en&_x_tr_hist=true
Our episode on WAL and checkpoint tuning: https://postgres.fm/episodes/wal-and-checkpoint-tuning
Egor Rogov's book on Postgres Internals (chapter 10 on WAL): https://edu.postgrespro.com/postgresql_internals-14_en.pdf
full_page_writes: https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-FULL-PAGE-WRITES
TRUNCATE: https://www.postgresql.org/docs/current/sql-truncate.html
Our episode on partitioning: https://postgres.fm/episodes/partitioning
Our episode on bloat: https://postgres.fm/episodes/bloat
Our episode on index maintenance: https://postgres.fm/episodes/index-maintenance

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres AI.
Hello Nikolay, what are we talking about today?
Hi Michael, we wanted to talk about one of my favorite topics,
massive delete, I usually call it, but maybe you can call it somehow differently.
Yeah, I really like this. So when you say massive delete,
do you mean deleting a lot of rows all at once?
Right.
In one SQL statement.
So, for example, 10 million rows,
1 billion rows,
or just 1 million.
It's good already.
Good enough to feel how bad this is for Postgres.
So we can discuss in detail what's happening if you do it. I like this type of thing because it's not only bad, it's also good.
And we can explore that as well because if you do it on production, it's bad. But if you do it on purpose in some testing environment, it can be very helpful.
Oh, I see what you mean.
So it's a useful tool for certain things.
Okay, cool.
Yeah, I get it.
Stress test.
Yeah.
So you mentioned this in a previous episode, that it would be good to cover this in more detail.
And I was really interested. But I think it's important to say what we're not going to cover as well. So, massive deletes: some people might also be thinking of other things that sound like deletes. For example, people often implement soft deletes, which would be implemented via updates. We're not going to talk about that kind of thing; we're talking only about actual Postgres DELETEs at this point.
Some people complain that soft deletes are not already implemented natively in database systems by default. Maybe it's a good idea to have. But yeah, we will skip this topic. Soft deletes, maybe that's another episode, because it has interesting nuances.
Yeah, but let's talk about what's happening if you want to delete many, many rows, how to do it correctly in production so as not to bring your database to its knees, and how to use it in non-production environments for something good, for good reasons.
Right? I like it. Do you want to start?
Maybe with some use cases, like
when are you generally seeing people
do these huge deletes?
I've got a few examples
I've seen, but I'd be interested in yours.
Well, usually people do massive deletes by mistake, not understanding how Postgres works.
Sure.
And MVCC and so on. They expect it's a simple operation. They probably saw some warning that you shouldn't expect an immediate reduction in disk space used. For example, you delete from some table and think: okay, I will free some space. No, you won't free space, because it's only part of the work. Another part is autovacuum, which will need to actually delete those tuples.
But, yeah, what?
And even then, if you're monitoring disk space, you won't see that as free space again.
It depends.
It depends.
Autovacuum can also truncate the relation if the pages at the end are empty, with all tuples deleted and vacuumed, so physically removed already, right? If trailing pages become free, it will truncate them and reclaim disk space.
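As a rough illustration of this (the table name events and the retention interval here are hypothetical), you can watch the table size and dead tuples around a delete:

    SELECT pg_size_pretty(pg_relation_size('events'));    -- heap size before the delete
    DELETE FROM events WHERE created_at < now() - interval '1 year';
    SELECT n_live_tup, n_dead_tup
      FROM pg_stat_user_tables WHERE relname = 'events';  -- deleted rows show up as dead tuples
    VACUUM events;                                         -- removes dead tuples inside the pages
    SELECT pg_size_pretty(pg_relation_size('events'));    -- usually unchanged, unless trailing pages became empty and were truncated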
Yeah, okay. But I feel like those cases are the exception rather than the norm. I see a lot more cases where people are deleting a subset of data from a non-partitioned table and are really surprised when their relation doesn't decrease in size. But just to cover a few of the other
basics, I think this is actually a topic that's coming up more and more, because we have these new privacy laws that give people the right to be forgotten. And deleting data is quite a safe way of not having a data leak: the less data you have, the less data can be leaked, security-wise. So I feel like there are these forcing functions. I think people are also starting to get serious about having data retention policies, that is, how long should we keep each type of data around for. So I think there are some of these use cases that mean this is coming up more. And in the past, at least in the last 10 years, I saw a lot more people kind of just storing everything forever and not really thinking about it.
So I didn't know if this is something you're seeing more often
or it's just an age-old problem.
Well, yes. So I think GDPR and everything, it's not that big, because, you know, a single user delete is usually some small fraction of the whole, right?
But usually a massive delete happens when we need to clean up and we understand there's some old data which is not super useful. We pay for it. And we want to postpone the moment when we need to scale our database instances, machines, and so on.
In this case, like in general, I would like to mention that before COVID I went to the VLDB conference, which is a conference I have known since I was basically a kid, a student learning database theory and so on.
I've heard about this conference.
So I went to it because it was very close in Los Angeles.
And I remember a keynote. Some researcher, she was presenting a talk which was interesting. It was about deletes, basically. She said: this is the exponential growth of the total volume of data in our databases in the world. Because storage becomes cheaper, we produce a lot of data, big data, huge data, and so on. It's some zettabytes, an insane curve, with some forecast. And she said: we spent decades learning how to store data properly, ACID, not losing data, reliable, highly available, and so on. Now the time is coming to learn how to clean up and delete, to understand which data can be safely deleted and how to delete it efficiently, and so on.
And I was super inspired. I also did a talk at a conference, because around the same time we had a production incident. A senior backend engineer, with a good understanding of analytical databases actually, got a task to delete some old data in preparation for a marketing campaign, because the forecast was saying we either needed to invest a lot in upgrades or we needed to clean up before this huge marketing campaign. So he went to production and just performed a DELETE FROM the table WHERE created_at was older than one year. He estimated in advance that it would be 10 million rows. And we got an incident. Downtime, more than 10 minutes, which cost the company a lot of money. I was super impressed how much money. And he almost left the company himself because he was so embarrassed.
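Presumably something like this, issued as one huge statement (table and column names are illustrative):

    DELETE FROM events
    WHERE created_at < now() - interval '1 year';  -- roughly 10 million rows in a single transaction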
And this was around the same time as that VLDB keynote about deletes. I'm saying we need something, right? At least we need to educate people that a massive DELETE should be split into batches.
Of course, if you have good disk, maybe you'll be fine.
But in that case, we had quite slow disks. It was some enterprise-level but old storage system.
It was on-premise.
And also, at the same time, I was advising this company; I was helping as a consultant with Postgres. The company was growing super fast, a huge startup. And I was saying: guys, you do need to increase max_wal_size, one gigabyte is not enough. So with the default max_wal_size, an untuned checkpointer, and quite slow storage, I think the maximum throughput for writes was maybe 600 megabytes per second, maybe 500.
It's not like current modern NVMe, which gives you like two gigabytes per second, for example.
These two factors plus the idea, let's delete 10 million rows, it's not a huge number, right?
So, yeah, the checkpointer went crazy, because of what happens. First of all, we find rows, I mean the executor finds rows. In that case it was also unfortunate that it was not a sequential pattern: the rows we selected by created_at were physically scattered. So it starts, and first Postgres needs to put an xmax value into this hidden system column. It needs to put the ID of the current transaction, the one which is deleting, into the xmax column for the first row, and the page becomes dirty.
Dirty means that in memory we have a different version of the page than on disk, so we need to flush it to disk eventually. The checkpointer needs to do it, basically, right? And then the next row, the next tuple (basically a physical row version), is in a different page, so we dirty a different page. So if you need to update xmax in 100 rows, it's almost 100 pages dirtied already, very inefficient because of this pattern of basically random access. It's not random, but it's not sequential. This is the key, right?
And then the checkpointer sees: okay, I have max_wal_size of 1 gigabyte, and it was Postgres, I think, 9.5 or 9.6. At that time it meant that the real distance between two checkpoints was three times lower, only about 300 megabytes. You can read a very good explanation in Egor Rogov's book, Postgres Internals, mentioning the improvements and why it is not exactly one third anymore. So the checkpointer says: okay, it's time already, we have accumulated too many dirty buffers. We need to flush them to the page cache first of all, and then pdflush will go and flush them to disk. So it starts working, producing a lot of I/O, and it converts to real disk I/O already.
and it converts to real disk IO already.
And then, boom, different row happens to be in the same page which just
flushed. It was just flushed. It was clean, but we update it again. I mean, we update
different tuple in the same page, it becomes dirty again. And CheckPlanter says, okay,
again, a lot of work.
So it's duplicating effort by flushing too excessively.
Is that what you're... yeah.
Yeah, yeah.
So the distance between checkpoints was, I don't know, like 20 seconds only. Also, I learned, if I'm not mistaken (many years have passed already), that checkpoints can sometimes overlap for a couple of seconds. It was still happening, and it's insane.
So the checkpointer produced a lot of I/O, and the disks couldn't handle this I/O. The database became unresponsive, and we had a critical incident: basically downtime, failover, and so on. Everything became insane, and the downtime was huge. And that's not good, right?
But as I remember, there are two effects. Not only do you need to flush the same page multiple times if you have this non-sequential access pattern, but also, once the checkpointer or anything else has made the page clean, we get a different effect.
Full-page images.
Yes, yes. So full-page images, full-page writes. Since full_page_writes is on, it means that after a checkpoint, when our change, our delete, touches this page again for the first time, the page goes in full to the WAL, right?
Whereas if it had managed to sneak in before the next checkpoint, it would have been a much smaller amount of work.
Yeah, yeah.
If we had, for example, a huge max_wal_size, like 100 gigabytes, and we have enough disk space, we can afford it. We understand that if we crash, startup time will be longer. Replica provisioning also takes longer, because recovery takes time, minutes. And then in this case, even if we have an almost random, non-sequential access pattern and we visit the same page multiple times, not only will the checkpointer flush it just once instead of multiple times, but also only the first change will go to the WAL as a full-page image; subsequent changes will be represented just as tuples.
And this means the volume of WAL generated decreases significantly if we do this. So it was helpful to me as a consultant, because I finally found agreement that we needed checkpoint tuning. It was the beginning of my practice in the US, so it was hard for me to find good arguments, but this case showed it. I just did a very good series of experiments; you know I'm a big fan of experiments. You just start experimenting with max_wal_size and run the delete each time: you do it with one gigabyte, and this is the I/O for this massive delete; then two gigabytes, four gigabytes, a logarithmic approach, until you reach 64 or 128 gigabytes, for example. And you can draw a good curve of how the I/O behaves, right?
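The knobs involved in such an experiment would look roughly like this (the values here are illustrative, not recommendations):

    ALTER SYSTEM SET max_wal_size = '16GB';                 -- repeat the delete with 1GB, 2GB, 4GB, ... and compare the I/O
    ALTER SYSTEM SET checkpoint_timeout = '30min';
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;
    SELECT pg_reload_conf();                                -- these settings can be changed without a restart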
And usually, if you also have monitoring, and this delete takes like a minute or two, or five minutes, you can see in monitoring that you have a plateau for I/O, because it's saturated.
Are you taking the total I/O? So let's say you're doing the same delete each time, and you're looking at total I/O across that time and showing that there's less I/O as we increase the...
Yeah, well, I did several things in this experiment, and I think it's
worth maybe a good how-to article or something. Actually, there were two phases, two parts of the whole experiment. The first was to study this I/O behavior and the checkpointer behavior.
So I took snapshots of pg_stat_bgwriter, which, as we know, until a recent version contained not only bgwriter activity but also checkpointer and backend activity for cleaning dirty buffers. And as usual, I converted buffers to gigabytes, because this is how any engineer can start understanding what's happening. If you say buffers, nobody understands except DBAs, right? If you multiply by eight kibibytes, you get gibibytes and mebibytes, and everyone starts understanding. So I just showed that with the default setting,
the checkpointer had a lot of I/O, and also, in pg_stat_wal, much more WAL was generated.
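A sketch of that conversion, using the pre-Postgres-17 pg_stat_bgwriter columns (on 17+ the checkpointer counters moved to pg_stat_checkpointer); the block size is assumed to be the default 8 KiB:

    SELECT pg_size_pretty(buffers_checkpoint * 8192::bigint) AS written_by_checkpointer,
           pg_size_pretty(buffers_clean      * 8192::bigint) AS written_by_bgwriter,
           pg_size_pretty(buffers_backend    * 8192::bigint) AS written_by_backends
      FROM pg_stat_bgwriter;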
But I also made screenshots of monitoring showing that we had a plateau, saturating disk I/O, disk write I/O. But when we already had like 16 or 32 gigabytes, we saw a spiky pattern, up and down, and this is good.
It means we have a loop.
Because of the batches? Why is it spiky?
Yeah, some batching, something like a checkpoint. And of course there is also checkpoint_completion_target, which is close to one, so it should be spread out, but there is batching there, obviously. And this is good. We must avoid the plateau; a plateau is bad, right?
Actually, it's a good point that this helps: tuning the checkpointer helps with massive deletes, regardless of whether we batch or not.
It's helping in the case where we don't batch,
and it's helping in the case where we do batch.
Is that right?
Well, right now I'm talking only about how tuning would have prevented this incident. I think it would convert a P1, priority one or criticality one, incident into a P2. So basically the database is slow, but it's not down.
Yeah. Great.
Because, as we just discussed, much less WAL is produced and the checkpointer has less I/O to perform. Okay.
I understand the spikiness now. We're talking about when the checkpoints are happening.
Yeah, the I/O from the checkpointer.
Great, great, great.
This I/O was spiky, and it's good. It means we are not having a plateau. We are not saturated.
Of course, it's better if you know your numbers, the limit of I/O your storage device can handle, so you can draw this line on the graph and understand how far away you are. Basically, regular SRE practices, starting from USE (utilization, saturation, errors), should be applied here to study this incident and perform root cause analysis. And this was great. It was obvious that if we reached like 16 or 32 gigabytes, we were in much better shape. We just need to have the disk space. And there was also a second phase of the experiment.
I think we had an episode about max_wal_size and checkpoint tuning, so I won't go into too much detail there. But in the second phase, you do need to understand recovery time in the worst situation. And I invented the term like double worst situation, double unlucky. The worst situation is if your database crashes right before a checkpoint completes.
Yeah.
And double unlucky if at the same time you had massive delete or something like that.
In this case, it means a lot of work during recovery.
So, yeah, that's it actually.
This is how a massive delete looks in the wild, and you should avoid it.
Yeah, so how do we then go from
the, I guess we've gone from priority
one incident to priority two. How do we make
this not an incident at all?
Batching. So just split it into batches. I think there might be cases when logically you cannot afford splitting into batches, because you must delete everything in one transaction. But in my practice, I never saw this. I mean, I could always convince people to split into batches and do everything in different transactions. So in this case, we just need to understand what the ideal batch size is for us. Not too small, not too big. Too small means a lot of transaction overhead.
Too big, we just discussed.
I was going to say, I haven't seen such a delete use case. Like, we're deleting stuff, right? We don't need it anymore, almost by definition. So why would you need it to all be gone or none of it to be gone? That doesn't make sense to me.
Potentially there might be a case when you don't want users to see parts of the old data.
Okay, right, but then they want to not see anything, and you'd rather they saw everything than part of it.
Yeah, in this case, go and adjust the application logic so the application cannot reach that data already, right? Even if it's still present in the database, you hide it already, maybe based on timestamps or something. But again, this is just a theoretical discussion. In practice, I didn't see any cases when we couldn't. With Postgres and its MVCC model, the benefits from batched deletes are always much higher than experiencing this pain and these risks. So, yeah, batches, and we know the ideal size of batches. We discussed it many, many times, starting from the very first episode we had almost two years ago, and very recently in the 'don't do this' episode, right?
Yeah, yeah, yeah. So my recipe for the ideal size is: just try to be below one second, but maybe not below 100 or 50 milliseconds. This is the ideal size to me, based on human perception.
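A minimal sketch of such a batched delete (table, column, and batch size are hypothetical); the statement is repeated from a script until it affects zero rows, with the LIMIT tuned so each batch stays well under a second:

    DELETE FROM events
    WHERE id IN (
        SELECT id
          FROM events
         WHERE created_at < now() - interval '1 year'
         LIMIT 5000                      -- batch size found experimentally
    );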
Yeah, you mentioned something in the recent episode
about it degrading over time.
So let's say you get a clone or you have a replica of production
where you're testing batch sizes and you get the sweet spot.
Maybe you find out that you can delete 5,000 rows in just under a second.
And you think, let's go with 5,000 as the batch size.
One hour later, you see the same delete already takes one minute.
What's happening?
So I was going to ask, not only what's causing it,
but do you run the test for longer?
Or, like, how do you think about that stuff?
Well, you just need to understand that when you delete or update,
you produce dead tuples,
and it means that only the first part of two parts of the operation is done.
It's done seamlessly by your SQL query,
but there is a second super important part,
which is vacuuming.
So delete is not complete
when you see a transaction committed.
It's complete only logically,
but physically we have these dead tuples, basically garbage, unless some transactions still need them.
And you need to understand vacuum. Actually, in the same company, to perform the delete properly, of course I split it into batches, and I was a big fan of, oh, I want to create a simple query which is basically stateless. I didn't want to deal with remembering the last ID or last timestamp of the previous batch. Basically, no keyset, how to say, pagination. No pagination. I didn't want to have pagination.
Because it has a similar problem to pagination for SELECTs showing a huge result set split into pages for users. Similar problem here.
Very similar, because it's basically the same problem. So I wanted to do something stateless, and I just relied on Postgres, on some index. I checked which indexes were used, and then, already in production, I saw this degradation.
So the degradation was because of a lot of dead tuples, and autovacuum couldn't catch up with my speed of deletion. I also remember an interesting thing: since then, I'm a big fan of single-threaded maintenance jobs. I implemented multiple threads originally, but then I saw that even a single thread is too fast for autovacuum, and we don't need parallelization here. Of course, parallelization in my case was based on SELECT FOR UPDATE SKIP LOCKED, fancy, super cool; it was maybe five or six years ago, like, let's do fancy stuff, we will have five workers, let's go. But you don't need it here; one thread is enough to clean up. Otherwise you need to speed up vacuum somehow, and vacuum for a single table is always single-threaded. You cannot have multiple workers cleaning up dead tuples from a single physical table.
Non-partitioned, yeah.
If it's partitioned, yes, they can work on multiple partitions at the same time.
But if it's a single table, no.
I want to come back to partitioning.
But in the meantime, a massive delete also has a big impact on, like, I was thinking about index-only scans. Not for the delete itself, but for other concurrent reads. Because let's say we're in a situation where we are deleting a lot of tuples from, not necessarily random, but lots of different pages. There are going to be a lot of pages that have changes, are therefore no longer marked all-visible in the visibility map, and so can't be served by true index-only scans. So I was thinking, for that case, we might actually want to run a vacuum manually every certain number of batches.
Yeah, this is what I did. And I also thought maybe I should... Yeah, I did. I did a manual, well, not manual, I mean my script did it: maybe after N batches, like 100,000 batches, run a vacuum.
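So the script would interleave something like this (table name hypothetical, N found experimentally):

    VACUUM events;   -- run after every N batches: removes dead tuples and their index entries,
                     -- and refreshes the visibility map for index-only scans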
And how did you determine N?
Good question.
Well, experiments.
So if you want to go stateless, you know that a lot of dead tuples will be a problem, because the index you use, on the primary key or on the creation timestamp, created_at, also needs vacuuming, since it will be pointing to dead tuples. And that's why subsequent batches will be slower and slower; degradation will happen over time. But if you vacuum, you get rid of these links to the dead tuples, right? And it becomes good again.
So based on that, we apply the rule that we don't want our transactions to exceed one or two seconds, because users might notice some bad things, and when we have a long transaction we also become a kind of problem for all autovacuum workers globally, for our whole database. So if you see, oh, it degraded to two seconds, this is the right time to run vacuum, right? But I eventually chose not to run vacuum, actually. I had many iterations of my solution.
And finally I gave up and
decided to go stateful.
So I just performed this keyset pagination, based on the last timestamp. I select the next batch based on that. It's super fast even if we have a lot of dead tuples, so I don't care.
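A minimal sketch of such a keyset-style batch, assuming an id primary key and a created_at index; :last_created_at is a placeholder for state kept by the script, taken from the previous batch:

    DELETE FROM events
    WHERE id IN (
        SELECT id
          FROM events
         WHERE created_at < now() - interval '1 year'
           AND created_at > :last_created_at   -- skip everything already processed, including dead tuples
         ORDER BY created_at
         LIMIT 5000
    )
    RETURNING created_at;   -- the script remembers max(created_at) for the next iteration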
Because autovacuum is good at skipping the pages it can skip. It's not like 10 or 15 years ago; it's already quite good.
So, multiple vacuum runs versus just a single big run. This is a good question, by the way, worth exploring. Is it good to run vacuum quite often in this case, versus running it just once? In terms of how much WAL it will produce and how many disk operations in total, this is a good experiment to conduct and compare the numbers.
Well, doesn't it depend a little bit on your workload, right? If you have no other concurrent users, running one vacuum is likely more efficient than running lots of small ones. But if you have other concurrent users and you're forcing heap fetches, then maybe the I/O impact from those reads outweighs the duplicated effort on vacuum. Or even, when I say duplicated effort on vacuum, I guess we're talking about a similar kind of thing.
If we decide to run a single vacuum, not frequent ones, we will live with a lot of buffers which are not in the visibility map, so yeah, index-only scans will degrade. But you add complexity here. I know, I know. In my case, I don't care about other users for now.
I like this approach. First, you study the problem alone, and then you start adding to it. Like, oh, we have other users, but you already understand the problem in a vacuum, so to speak; single-user mode, basically, right? Then you start adding considerations: what about other users? If you know how Postgres behaves, how your workload behaves in a single-user situation, it's already better. You are much more confident when you start looking at others as well.
Yeah, it's a good point.
If we rely on index-only scans a lot,
probably we should vacuum frequently to avoid
heap fetches. You see that in plans at pgMustard quite a lot, right?
Yeah, it's quite a common one.
It's not the biggest issue, but it is, especially for range scans where people are returning a lot of rows or doing aggregates that rely on index-only scans, these kinds of hybrid workloads, somewhat analytical queries;
they can degrade quite quickly in certain environments
if you go from an index-only scan across a few thousand pages
to an index-only scan with a few thousand heap fetches.
Yeah, so your deletes can be a problem for these. So SELECTs might degrade because the visibility map is outdated.
Exactly. So, yeah, more frequent vacuums might be good. It's an interesting discussion, an interesting point as well. But in my case, I liked it so much when I gave up my idea to be stateless; I just remember this. And remember, I also degraded to a single-threaded approach, forgetting about that SELECT FOR UPDATE SKIP LOCKED. So I just need to remember one value at all times. So easy. And then performance became very predictable, reliable, stable. All batches were the same size and the latencies were the same. It's also interesting that when you do it, you start thinking: oh, I basically want a progress bar. I want to predict the ETA, how much time is left. And if it's degrading, or the latency of each delete is not stable, you cannot predict it reliably. But once you use keyset pagination here, you know your batch takes like 300 milliseconds each.
Great.
I integrated this progress bar into my SQL; it reported the percentage, how much is left, and the ETA.
A quick question on the ordering of batches: do you try to do any kind of natural... You mentioned created_at being a little bit random, but I would have expected that, unless the tuples are getting updated a fair amount, created_at might be a relatively natural ordering.
There should be a very good correlation between physical location and created_at. You can check it easily: select ctid, comma, created_at, and order by created_at, with some limit.
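Roughly this kind of check (table name hypothetical); pg_stats also keeps a ready-made correlation estimate:

    SELECT ctid, created_at
      FROM events
     ORDER BY created_at
     LIMIT 100;                -- if the (page, offset) pairs in ctid grow mostly monotonically, correlation is good

    SELECT correlation
      FROM pg_stats
     WHERE tablename = 'events' AND attname = 'created_at';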
You can see that ctid is the physical location, and you can understand how correlated the physical location is with created_at. In my case, the table had had a lot of updates in the past, so the rows were basically everywhere in the table. And okay, that's why the incident happened, actually.
I think it makes sense. Yeah, yeah, yeah. So, yeah.
Oh, no, I ordered by modified_at. And I even considered creating an index on modified_at. But this is an anti-pattern to me. An index on modified_at is a problem because you immediately start losing HOT updates, heap-only tuple updates, because each update, usually via a trigger or something, changes this value. And if you have an index on it, by definition, Postgres cannot perform heap-only tuple updates. It means you need to deal with the index write amplification problem, which is nasty.
So you're optimizing your delete, but you're messing up your updates. Now you are considering other users.
I remember I went deeper. I decided: okay, honestly, I don't want this index to be built. I know it will bite me back because of heap-only tuples; I don't want to lose them. I checked the statistics and I saw we had a good ratio of them: among all updates, we had a lot of HOT updates.
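The kind of check meant here can be done from pg_stat_user_tables (table name hypothetical):

    SELECT n_tup_upd,
           n_tup_hot_upd,
           round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_update_pct
      FROM pg_stat_user_tables
     WHERE relname = 'events';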
So you know what I decided? I decided to rely on the created_at index, but then somehow perform the logic with modified_at based on their partial correlation. It was somewhat crazy. It worked quite well. I don't remember the details, but I did some tricks there, avoiding that index.
At first, I had already coordinated the index creation with the guys. They said, oh yeah, let's do it. But then I said, no, no, no, no. At a different company, like one year before that, I had already made this mistake: I suggested some index, and then degradation of updates happened because we lost HOT updates. Actually, that incident led me to the idea that we need to verify all changes holistically, ideally checking all the queries we have. And this, and the experimentation approach and so on, it's all connected and interesting. But maybe let's skip this. And since we're almost out of time, if you have questions...
Two more. I guess they're not quick, but when we're done with this large delete, are there any kind of maintenance tidy-up tasks that you recommend doing? I was thinking we did a whole episode on index maintenance, I think rebuilding indexes concurrently...
This question is exactly a good bridge to what I wanted to discuss as the last topic: how a massive delete can be useful in non-production.
Ah, okay. Yeah, let's talk about this.
Why is your question a bridge? Because
now I remember this very well, but in my career I forgot about it many, many times. DELETE doesn't touch indexes at all. This is the key. Indexes are not touched at all during a delete. Delete just puts xmax. That's it. And if the transaction is considered committed, then vacuum will remove the tuple, provided no other transactions still need this version, meaning they are already in the future. In that case the tuple will be deleted, and the links to this dead tuple in the indexes will also be deleted asynchronously, by vacuum. But our backend is not dealing with indexes at all. It only needs to use one index to find the scope of work, the tuples we need to touch. Oh, well, also during planning, as we know, all indexes are considered by the planner and an access share lock is acquired. So I don't see any index maintenance here at all.
Well, I was just thinking about them; let's say we deleted 10% of our table.
Well, yeah, you're right. So if we don't do frequent vacuuming...
Even if we do frequent vacuuming, there'll still be like 10% bloat, especially if it's like created_at or something where we're deleting the old data.
Yes, vacuum doesn't rebalance the B-tree, so the B-tree will probably be in worse shape. I would say if we deleted a big fraction of the table's data, we probably need to check bloat in the indexes, and probably we'll need to rebuild them. But I've already got used to relying on automated rebuild jobs, which we suggest implementing on weekends or something.
But this feels like a good time to reconsider it anyway; it felt worth mentioning. I know you want to bridge to that one, but I had one more quick one. More and more I'm thinking about some of these use cases, like a multi-tenant SaaS application that wants to delete a single tenant. If you've partitioned, I was thinking about partitioning and detach: another way of avoiding big deletes is if you can just drop a whole partition.
And then, like, a round-robin approach, truncating when it's possible.
It's much faster.
It's much faster.
It's much better.
No job for many components, and it's good.
Yeah.
Cool.
I agree.
And the indexes are on the partitions. They are physical, and if a partition is truncated, its indexes are also truncated. So great.
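With time-based partitioning, the whole cleanup becomes a metadata-level operation; a sketch, with hypothetical partition names:

    ALTER TABLE events DETACH PARTITION events_2022 CONCURRENTLY;  -- Postgres 14+; avoids a long blocking lock
    DROP TABLE events_2022;                                        -- or keep it around and TRUNCATE it for reuse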
Yeah. And let's touch a little bit on the last topic: why a massive delete can be useful.
Yes.
So, you know, I was already into Database Lab and ZFS, branching, thin cloning, and iteration, starting from the idea of experimentation. Each iteration should start from the same point. You need to compare things, apples versus apples, right? Always apples to apples. So each iteration must start from the same state of the database.
And this is difficult if the table is huge, the database is huge, and you want to test on big hardware. In the case of single-threaded behavior, if you don't care about the checkpointer, vacuuming, and so on, the thin clones that Database Lab Engine provides are usually great if you just study plans and so on. But in this case, we need to consider the same file system, and we need to understand checkpointer behavior, WAL, everything, vacuum. So we need a dedicated clone when we perform this experiment for checkpoint tuning and to study this behavior. And for this tuning, I found something super cool: this kind of workload gives us a very good, interesting tool. So if we perform a massive delete but don't commit,
and instead roll back, the physical layout remains the same. I mentioned it multiple times in previous episodes, and this is a super cool observation. You can just delete, roll back, begin, delete, roll back, massively.
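A sketch of that loop (table name hypothetical); each iteration dirties pages and generates WAL, but the rollback leaves the data in place:

    BEGIN;
    DELETE FROM events WHERE created_at < now() - interval '1 year';
    ROLLBACK;   -- repeat: the physical layout is unchanged, but WAL, the checkpointer, and replication all get real load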
This puts good pressure on your WAL system, right, and on the checkpointer. It's cool.
And it means that, of course, vacuum won't do anything, because the transaction is marked as rolled back. So even though xmax is already updated, the tuples survive; they're still alive. But Postgres generates a lot of WAL for such a workload.
It also puts pressure on the backup system and on the replication system, and the checkpointer also has a lot of work.
So it means you're testing,
like you can stress test
Postgres components
just with this kind of workload.
And I know we mentioned it last week, right, when we had Sai as a guest. For those who haven't watched that episode, I recommend it; it was a super interesting discussion about logical replication with PeerDB founder Sai. And I mentioned that this is exactly how I found that we can easily saturate a single core with the WAL sender. You just delete a million rows and roll back. It spams WAL, and the WAL sender hits 100% of a single vCPU and becomes the bottleneck quite quickly.
Fortunately, in production, as I mentioned, I couldn't find such a situation, but in non-production it's easy to reproduce. So I found this a very handy tool to stress test Postgres, for checkpoint tuning, and so on. That's why a massive delete is not only bad, but also good. But only when it's rolled back.
Isn't it funny and exciting that the physical layout of the table doesn't change? You delete, roll back, but it's the same.
This is one of those things where if you talk to Oracle guys,
you'll get them sweating by this point because of a different implementation.
It's only because
of Postgres' specific MVCC
implementation.
Because in Oracle they have the
undo log,
it's like the opposite trade-off.
So it's so interesting that
the implementation details are so...
It's good to know them in order to
work out how to test things.
Yeah, Postgres is expecting you to perform rollbacks often.
Yeah, it's good to make use of it when it's advantageous.
Right.
So the tuple remains in the same spot in the page; it doesn't shift, unlike in other systems. But still, xmax is updated, so the page becomes dirty and must be flushed to disk.
Yeah, makes sense.
Yeah, I think we explored it quite well, quite deep. Takeaways, key takeaways. First, perform checkpoint tuning; check out our episode about that.
And prohibit massive deletes: perform deletes in batches, each roughly not longer than one or two seconds. Unless you don't have users who deal with your database a lot, in which case maybe you can go with 30 seconds.
Well, at least consider batching and try it, yeah.
I just wanted to have some rule, and for regular OLTP, one second is a good rule.
Yeah, great. I like it.
I think it's a healthy recommendation as well. Even if you wouldn't go down, or even if you wouldn't have degradation that users would notice, it's just an unhealthy thing to do, to go to excess. It's kind of like extremes: if you can keep things more level, you'll probably see better results.
Yeah. And a couple more takeaways: vacuum. Don't forget vacuuming, regular vacuum, and the effects on index-only scans we discussed. Maybe you want frequent and quite aggressive vacuum, to go faster, with more I/O consumed. And also partitioning. Maybe your massive delete is just a TRUNCATE, and in this case all those negative effects go away.
Yeah, yeah. Love it. Good. Thanks so much, Nikolay.
Thank you, Michael. See you next time.