Postgres FM - Transaction ID wraparound
Episode Date: January 6, 2023

Here are links to a few things we mentioned:
- The Internals of PostgreSQL, chapter 5 (book by Hironobu Suzuki)
- PostgreSQL 14 Internals, chapter 7 (book by Egor Rogov)
- Transaction ID Wraparound (blog post from Sentry)
- What We Learned from the Recent Mandrill Outage (blog post from Mailchimp)
- How to simulate transaction ID wraparound (blog post by Prashant Dixit)
- Add 64-bit XIDs into PostgreSQL 16 (commitfest entry)
- Partitioning (docs)
- Consider using VACUUM's INDEX_CLEANUP option (tweet by Peter Geoghegan)
- Add wraparound failsafe to VACUUM (commit)
- Do you vacuum everyday? (talk by Hannu Krosing)
- Multixacts wraparound monitoring (mailing list thread)
- Subtransactions Considered Harmful (blog post by Nikolay)
- Buffer management in PostgreSQL (talk by Alexander Korotkov)
- OrioleDB
- pageinspect
- pg_hexedit
- pg_visibility
- Visibility Map (docs)

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:
- Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Hi Michael, happy new year. Finally it happened.
Yeah, absolutely.
First episode in 2023, and this is the most boring topic I can imagine.
But it was requested, right? And I have something to say about that, so let's talk
about transaction ID wraparound prevention: not when wraparound occurs, but how to prevent it.
Well, we will probably touch on how to deal with it if it does happen. Finally, yeah.
So, transaction ID wraparound, or XID wraparound. Yeah, I'm guilty of choosing this one. We had a couple of...
I think this is the first topic we've ever had two requests for, so hopefully there are
at least two people out there that don't think it's boring.
In my opinion, this is the most boring topic, but it's also the biggest fear of DBAs. And again, let's try to explain it
for backend developers and so on,
because actually oftentimes
the probability of this very bad event to occur
depends not only on DBAs,
but also on how the application is developed,
basically, how fast the consumption of real transaction IDs is happening.
So we should think about it from the developer perspective as well.
Yeah, absolutely. And I guess it's worth saying that we really hope the people listening to this never face this issue in their entire careers, but hopefully we can leave people with a few things to look at,
to monitor, to tune, in order to minimize the chance of it ever happening to them. Right, where
should we start? I think it is worth covering a little bit about what this issue is, for anybody
that's new to it or has never heard about it, before going into how to avoid it.
Do you think that's sensible?
Yeah.
Well, the best description is a very detailed description,
and I can recommend PostgreSQL Internals, two books actually.
One is fully online, another is a PDF, fully translated to English from the Russian author.
One is from a Japanese author, another is from a Russian author: Suzuki and Rogov.
And both are quite good at explaining this.
Usually this is explained as: we have four bytes
to represent each transaction ID.
And of course, with four bytes,
half of the space is roughly 2 billion.
The full space is slightly more than 4 billion,
and of course today that's not that many IDs at all. Roughly, if you have, for
example, one or two thousand transactions per second, depending on how stable your traffic,
your workload is, it will take roughly one month or slightly more to reach the limit.
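To make that concrete, here is a hedged way to check how much of that space your databases have already consumed (standard catalog columns; the 2 billion figure is the practical half of the space mentioned above):

```sql
-- How far each database is from the ~2 billion XID horizon.
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2000000000, 2) AS pct_of_wraparound_budget
FROM pg_database
ORDER BY xid_age DESC;
```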
And what happens if you reach the limit? We discussed a few times that each tuple, so a physical row version, has xmin and xmax, and ctid, a virtual column.
So xmin, this is like the birth date of a tuple.
It holds the transaction ID which created this tuple.
And this is also four bytes only.
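As a quick illustration of those hidden columns (any table works; t is just a placeholder name here):

```sql
-- xmin: XID that created the row version; xmax: XID that deleted or locked it (0 if none);
-- ctid: the physical location (page number, offset) of the row version.
SELECT ctid, xmin, xmax, *
FROM t        -- placeholder table
LIMIT 5;
```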
And of course, it should be eight bytes these days.
And there is work in progress.
There is a big thread in hackers.
Some Postgres variations already implement 8-byte, or 64-bit, transaction IDs.
But of course, it's a big redesign, because currently all tuple headers,
with padding and alignment, are 24 bytes, right?
And if you need more, you need to increase that.
And our page size is also limited.
So it's a big redesign and a big change of behavior.
But of course, we should have 8-byte transaction IDs. But what happens if you fill this space with only 4-byte transaction IDs?
At some point, you need to start from 0 again, basically, from 1.
And, well, there are some special transaction IDs, especially until 9.6, but it doesn't matter.
So you're already overlapping, so a
wraparound happens. And to avoid this, there is a mechanism called freezing, and this is one of the
main tasks of vacuum and autovacuum. And autovacuum does freezing all the time. Sometimes it's
so-called aggressive freezing, like "I need it right now"; sometimes it's regular
freezing. But what it does is it says: okay, this transaction ID is this number, but we consider it in
the past by all means. So even if this number is bigger than the current transaction ID,
we know that it's from, like, a previous epoch.
Epoch, right? So it's definitely from the past. And it's achieved by placing an additional
bit, the frozen bit. And if all tuples in a page are known to be frozen, the whole page is also marked as all-frozen. And there is the visibility map: it holds two bits for
each page, one is all-visible and another is all-frozen. And if a page is all-frozen, it means that
all the tuples in this page are from the past. And that's good, because next time autovacuum
processes this page, it will just skip it, because it's all from the past,
nothing changed. Good. So usually these articles have a picture with a circle: half of the circle
is our past, half of the circle is our future, and we are rotating. And if we freeze a tuple or a
whole page, we just consider it outside of our circle. It's definitely from the past.
So that's actually it in terms of theory. But the problem is that there are several handles,
several knobs that DBAs can tune. And usually, if we have a huge table, this aggressive freezing happens quite early, at 10% of the capacity.
So 200 million transactions, and we already need to freeze it as soon as possible.
And you can see in pg_stat_activity that autovacuum is running in transaction ID wraparound prevention mode, "to prevent wraparound".
And this is quite interesting behavior.
We will discuss it later.
But while discussing all this boring stuff,
a question to our audience:
do you have in your monitoring a transaction ID wraparound prevention graph and alerts?
And by that, do you mean the number of remaining transaction IDs?
For example, right.
Yeah.
For example.
So we should sometimes, under certain circumstances, we will discuss it in a second.
Yeah.
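A hedged sketch of the kind of query such monitoring could run, looking at per-table freeze age via relfrozenxid (the limit of 20 rows is arbitrary):

```sql
-- Tables whose oldest unfrozen XID is getting old; these are the ones
-- an anti-wraparound autovacuum will go after first.
SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age,
       pg_size_pretty(pg_table_size(c.oid)) AS table_size
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')   -- plain tables, materialized views, TOAST tables
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;
```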
Before we move on from the theory side of it, I think there's an interesting point about
the name here.
And I've always called it, I've always seen it referenced as transaction ID wraparound.
But I think I actually prefer another term I came across, which is transaction ID exhaustion, which I think describes the problem a bit better.
Because it's like we have this finite but ever-changing number of IDs left, and it's by exhausting those that the problem occurs, because Postgres will
prevent us from wrapping around: it will stop. So that's really useful
for my understanding, because the failure case here isn't wraparound, which would cause things like, basically, corruption in various forms. Now,
Postgres, it seems, sees that as worse than downtime, I assume, and what it will instead do is
eventually enter read-only mode. And that's really bad for applications that want
to keep sending data, of course, but it's not as bad as the alternative. So transaction ID exhaustion appeals to me more as a term.
Well, yeah, right. I like this term, but wraparound shows that it's rotating, right?
But also, in some cases, I saw the word wraparound used in a positive meaning. For example,
we have wraparound
because we are starting to use the same numbers again.
And if you say "wraparound happened",
it sounds like something negative, but maybe it's positive.
So there is some ambiguity here in terms of meaning.
But yeah, exhausted IDs, that's also a good way to say it.
But like, I forgot to say what...
You're right.
I forgot to say what will happen if we...
For example, vacuum cannot perform freezing.
And obviously, Postgres at some point will say
only a few IDs are left to be used.
And exactly, it won't process transactions anymore.
And you will need to perform freezing, basically.
And it will take many hours, usually.
It may take a lot, maybe days.
The question I have on this, actually,
I think I read that the first warning in the logs
is at 10 million IDs left. Considering all of the other defaults in Postgres, that seems
like the least conservative one I've ever heard. I guess... anyway.
Yes, I don't remember this. But I wanted to say there are several blog posts
to learn from other people's disasters.
For example, from Sentry,
and also from Mandrill, Mailchimp's project,
which also experienced this a few years ago.
And obviously,
this is the biggest fear of everyone who uses Postgres.
So, I thought both were excellent blog posts,
but I don't think either did a great job
of explaining how to avoid getting there.
Yeah, you know, back to your note about logs,
you know why? Because
who reads logs?
And even if you
do read them, you should
have alerts, and this should be, like, a critical
alert.
But I would suggest it be way before you're down to 10 million.
But what would be a sensible threshold?
I would suggest this threshold...
I don't remember.
I always forget knob names.
But the default is, as I said, 200 million transactions.
So this is when aggressive freezing happens.
And I would suggest doubling it: just taking the current value and having an alert when we reach two times this
value. And actually, if you have reliable alerts, I would suggest considering increasing this threshold,
because 200 million is like 10 percent of the capacity. I would use at least 20,
30, maybe 40 percent, if you trust your alerting system and so on. Because of course there
is a big overhead from doing this, right? But, usually, when we discuss this with our
clients, of course there are different opinions, and it takes time to reach consensus, because
of course people feel fear: okay, this default threshold, 10 percent of capacity,
maybe we should not touch it. Well, the default was created assuming that maybe you don't have monitoring, right? If you have reliable monitoring,
you can already postpone the point when aggressive freezing happens.
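For concreteness, the knob being described is autovacuum_freeze_max_age (default 200 million). A hedged sketch of the tuning and the matching alert; the values are illustrative, not recommendations:

```sql
-- Raising the aggressive-freezing threshold (takes effect only after a server restart).
ALTER SYSTEM SET autovacuum_freeze_max_age = 400000000;

-- The alert described above: fire when any database has burned through
-- roughly twice autovacuum_freeze_max_age of the XID space.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
WHERE age(datfrozenxid) > 2 * current_setting('autovacuum_freeze_max_age')::bigint;
```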
Why is aggressive freezing bad?
Because it will block you.
Usually, when autovacuum sees that it blocks someone, it stops.
Like hara-kiri, right?
So, like, stopping itself. But if it's doing this to prevent transaction ID
wraparound, it won't yield, and your DDL will fail, so you cannot deploy your changes anymore.
And even if you kill it manually, which is fine actually in some cases, if you need to deploy
something as soon as possible, you know, you see that the capacity used is only
10%.
So it's very far from the limit.
You decide to kill it in favor of doing the deployment, and then it starts again immediately.
This is the problem.
It ignores naptime and everything.
"I'm starting and I'm not yielding.
I'm starting. I'm not yielding. I'm starting."
That's it.
So what you do: you start your deploy and immediately kill this autovacuum.
And only in this case does your DDL proceed.
But then, if you need more DDLs in sequence, it starts again.
So it's a nightmare, actually.
And there are parameters per table,
so you can adjust this threshold per table, though the global setting cannot be changed without a restart.
So you need to manually fight it back. That's why I think sometimes you need to
adjust your thresholds and try to avoid the times when it happens. Should we cover some of the more basic things?
I've got some ideas here for things that I've seen people doing that would increase their
risks of this happening to them.
Obviously, monitoring and good alerting is a great remedy here.
But things that might start to get you in trouble.
I've seen people turning off auto vacuum, which seems like, you know, that's an easy way of starting to get into this kind of trouble.
Well, as I remember, I may be mistaken, but if you turn off autovacuum, this anti-wraparound freezing will happen anyway.
As I remember, I might be wrong here. I think you're right. But for example, if you then have one of the other mitigating factors,
then you haven't had, you know, I think the freezing also happens at other points.
It has like a min age and a max age.
Is that right?
So you can block autovacuum from doing freezing if you have, for example, a long transaction.
And here I should mention that usually people say,
"we have a long transaction and it's bad",
but over time I reached the conclusion
that a long transaction, by itself,
is only bad in terms of blocking issues.
So it blocks others because it holds some lock.
But when we talk about autovacuum behavior,
freezing and so on,
we should talk not about the long transaction, but about the xmin horizon instead.
It will include all aspects of it.
It's possible to monitor it, by the way, but I bet your monitoring doesn't have it properly,
because somehow people try to reinvent the wheel in different ways.
But this concept, the xmin horizon: if you have a transaction which has been assigned
a regular transaction ID, it will hold this xmin horizon. The xmin horizon is the oldest transaction
which is still needed in the system, so vacuum cannot remove the tuples which were deleted
after that transaction started. And if there is an ongoing transaction with some number,
a regular transaction number, not virtual,
I will explain virtual in a bit,
in this case, vacuum cannot delete those tuples.
But not only this.
If, for example, you have replication slots
and physical replication,
the standby node also reports the transaction ID
that it still needs.
So if hot_standby_feedback is on,
it reports that transaction ID too,
and it also can hold the xmin horizon.
And finally, sometimes, in some cases,
as I remember, ANALYZE can hold it.
Also, index creation holds the xmin horizon. There was an attempt
to improve this, to optimize it, in Postgres 14, but until Postgres 14.4 it was corrupting
indexes, so it was reverted. It would be a very good optimization, because if, for example,
you have a huge table and you need to build an index, obviously you will hold the xmin horizon for a long time,
and it will affect autovacuum behavior.
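A hedged sketch of how you might see who is holding the xmin horizon back: long-running backends, replication slots, and walsenders with hot_standby_feedback all show up in standard views (prepared transactions and the catalog_xmin of logical slots are left out for brevity):

```sql
-- Who might be holding the xmin horizon back.
SELECT 'backend' AS holder, pid::text AS who, backend_xmin, xact_start
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
UNION ALL
SELECT 'replication slot', slot_name::text, xmin, NULL::timestamptz
FROM pg_replication_slots
WHERE xmin IS NOT NULL
UNION ALL
SELECT 'walsender', application_name, backend_xmin, NULL::timestamptz
FROM pg_stat_replication
WHERE backend_xmin IS NOT NULL;
```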
Usually, you need to understand how fast your transaction ID consumption rate is.
If you are a developer and you're designing your system,
it would be good to understand, for example, per second, per hour, per day,
how many transaction IDs we will consume.
The good thing is that if you're just selecting something, you are not consuming a transaction
ID.
There is a concept of a virtual transaction ID, and it comes for free.
It won't increase this transaction ID counter. But if, for example, you're writing something,
or you call txid_current, which is the old name of the function,
there is pg_current_xact_id now, I always forget it, I wrote it down here.
So if you call it, it will either return your transaction's
real transaction ID, if it's already
assigned, or it will assign
a new transaction ID.
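A rough way to measure that consumption rate, assuming Postgres 13 or later for pg_current_xact_id (older versions have txid_current); note that the call itself assigns an XID, so it adds one per sample:

```sql
-- Sample the transaction ID counter twice and compare.
SELECT pg_current_xact_id();   -- e.g. returns 123456789
SELECT pg_sleep(60);
SELECT pg_current_xact_id();   -- e.g. returns 123466789, so roughly 10,000 XIDs per minute
```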
By the way, this is the way to
simulate transaction ID
wraparound: you block
freezing by opening some transaction,
and in another session, you start
running txid_current
very, very quickly
to consume a lot of transaction IDs, right?
And in this case... For example, sometimes I see a design where the application performs a lot of rollbacks.
It's probably a sign that we don't think about our transaction ID consumption rate and we are not lean, right?
So we did something, rolled it back, and a transaction ID is lost here.
I mean, wasted.
And you should be more careful.
Maybe rollbacks should be done less often.
So normally, a social-media-like workload is like 80 or 90% reads.
Those should not increase the XID.
In this case, we are moving forward less quickly.
And if we hold the xmin horizon, we can say: okay, we hold it,
but during a minute we will consume this number of XIDs, right?
And it's fine.
During five minutes, that number of XIDs, it's already not so
good. During 10 minutes, it's already quite bad. And monitoring the xmin horizon, this is a great thing
to start thinking about our own behavior. And the good news is that if we have smaller tables,
meaning partitioning, right, we have more pages which are frozen and touched less often,
because this is, like, kind of old data.
And in this case, we have several good behaviors in our system.
For example, if we create an index, it's much faster, right?
So we don't hold the xmin horizon as long.
And autovacuum is much more efficient,
because it can skip more pages, because
they are frozen. So what I'm trying to say is: by using partitioning, you can radically decrease
the chances of transaction ID wraparound, which is probably quite an underestimated topic. I didn't see
articles discussing this, but it's obvious to me. So yeah, if you have large tables...
Yeah, I think that's an awesome observation. It also
occurred to me that it would help with recovery, reading
through a couple of the blog posts, but it hadn't occurred to
me that it would help with prevention, which is really
cool.
You can also, if you consider a huge table, for example a terabyte, and you cut it into partitions,
parallelize processing, because with autovacuum, one worker can work with only one
table.
But if it's 10 partitions, already 10 workers can be involved.
And I recommend having many workers. By many, I mean: take your number of CPUs and consider having up to, like, 25% of the CPUs as the number of workers, because by default it's only three, which is very conservative and not good for modern hardware.
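A hedged sketch of that tuning; the numbers are examples for a hypothetical 32-vCPU box, not recommendations. Note that autovacuum workers share one cost limit, so raising the worker count alone can make each worker slower:

```sql
ALTER SYSTEM SET autovacuum_max_workers = 8;           -- ~25% of 32 vCPUs; needs a restart
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;  -- shared budget; illustrative value
SELECT pg_reload_conf();  -- the cost limit applies on reload; max_workers only after a restart
```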
Well, and I think this was one of the key causes of the Sentry one in particular: they said they still had the default of three autovacuum workers,
so maintenance...
Yeah. The big idea of this episode, even though many things are discussed: if our listeners take away only one thought, it's this.
Without help from developers, who should implement partitioning
to keep tables under roughly, say,
100 gigabytes in size
(if you exceed 100 gigabytes,
partitioning should be there),
DBAs won't be able...
like, their work will be very tough:
to prevent transaction ID wraparound,
to handle it, and so on.
Because even if you just need to create an index on a huge table, it's already a lot of headache.
So developers, please use partitioning earlier.
And by the way, this empirical rule, 100 gigabytes: we discussed it with several people.
For example, Kukushkin also told me: at 100 gigabytes, partitioning is needed.
At one terabyte, maybe sharding is needed already. Well, it
depends, but operational costs keep increasing. So I spent some time thinking, some time
meaning months: I was thinking, why 100 gigabytes? Why 100 gigabytes? And now I can say it. Of course,
it depends on your system, on how fast you consume transaction IDs.
But if you think about it, and you think about this xmin horizon, and you consider how long it takes to create, for example, an index, and that during creation of the index you hold
the xmin horizon, you already start understanding.
This is for engineers, for developers, actually, to understand these thresholds. And you understand:
okay, to create an index on a 100-gigabyte table takes this many minutes; during this time we will hold
the xmin horizon, and it means that freezing won't be possible, and so on. And we can conclude: okay,
we can afford only this number of minutes. So, I mean, I don't have good math here, but I have a sense that it's possible to create the math here and say: this is acceptable for us.
Like, create SLOs, service level objectives.
It's okay for us, during daytime, when we have this number of writing transactions per second,
to hold the xmin horizon
during this number of minutes.
So this dictates how long index creation may take.
And this introduces some limit on table size, right?
And this is quite interesting,
because this is already, like, an explainable limit
for developers, to understand why they should do it.
Yeah, it's also a really nice round number that's easily... Well, yes, yes. Well, this comes from a lot
of pain and experience, yeah, it's derived from that. But by observing these numbers, understanding how
fast our transaction IDs are being spent, we can have good math and say:
for our system, it's like 50 gigs or 200 gigs.
Yeah.
Because we have very good,
like very, very fast disks, for example.
Yeah, I think that's a really good point.
I want to go back to something you mentioned earlier,
which was around the Postgres 14 issue.
Now, I'm a bit scared, because I thought this might be in
Postgres, and I'm scared it might have been reverted. There was some new logic, I believe, put in.
One of my tips was going to be to upgrade to 14 or above, you know, 14.4 or above,
because I think I saw a feature in there that would now more aggressively prioritize a freezing version of vacuum, whereas in the past
it wouldn't necessarily; you'd get a full vacuum. And reading the blog posts, a lot of people
are running a full version of... not VACUUM FULL, that's a completely different thing, but a vacuum.
And I think I saw a tweet by Peter Geoghegan. I actually shared this in my newsletter back in 2021,
but this was before
the version 14 release,
so around January, February time.
He recommended, if you are
ever in a situation of
recovering from
Postgres shutting down due to this issue,
running VACUUM with the
INDEX_CLEANUP option off,
which just does the freezing.
To turn off processing of indexes.
Yes, it's possible.
Yeah.
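For reference, that emergency-style command looks roughly like this (INDEX_CLEANUP needs Postgres 12 or later; the table name is a placeholder). The show notes also link the Postgres 14 "wraparound failsafe" commit, which makes VACUUM skip index cleanup on its own once vacuum_failsafe_age is exceeded:

```sql
-- Freeze tuples while skipping the index cleanup phase, so the vacuum finishes sooner.
VACUUM (FREEZE, VERBOSE, INDEX_CLEANUP OFF) my_big_table;
```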
Now, I thought... yeah.
Now, there was something interesting: I think it was Rob Treat
who replied to this, saying,
could autovacuum do that automatically?
And I thought he added something in version 14.
I thought Peter might have got something into version 14.
I don't remember this.
Or maybe... I don't know.
But if we discuss recovering from this very bad state,
I'd recommend re-watching, or watching, the Postgres TV episode
with Hannu Krosing from Google Cloud,
because Hannu explained that single-user mode is not needed.
And even more: using single-user mode
to get out of that bad state is even less productive, because
you don't see details. And he developed a recipe for how to escape from the bad situation using regular
mode. So it's quite good, and the episode is called "Do you vacuum every day?"
Yeah, excellent interview. It was a talk, but you had a lot of questions, as usual.
Yeah, silly questions. But the reason I brought this up was, I kind of always want
to bang the upgrade drum, but if you're running a heavily loaded system, there are good features in
higher versions of Postgres that have your back on these kinds of
things. As more companies and people are hitting these kinds of issues, the developers are listening,
and they're adding features to try and prevent it, to try and make these things less likely.
Well, you say "of course", but I see a lot of big companies that are scared of upgrading their
databases, you know, still on out-of-support versions of Postgres, like 10 or 9.6.
If you recommended anyone to upgrade to 14.0 right now, it
would be a mistake, right? Or 14.3. But this was the
recommendation during many months after version 14 was released.
So the fears have some basis, right?
So there are pros and cons,
but of course, if you can,
you should always be on a newer version.
It's obvious, right?
Yeah, yeah.
But I think these things aren't always shouted about. I looked in the release notes, and it was quite far down
the release notes that this kind of thing came up.
But yeah, I agree.
I think 14 was a bit of a blip,
14.0 to 14.3.
I don't remember issues like that from
10, 11, 12,
13, 15.
For index creation
and reindexing not holding
the xmin horizon. But without this optimization,
we can state: partitioning is a must
if you have large tables,
or will have large tables.
You should be prepared for partitioning.
Speaking of transaction IDs,
I mentioned there is the normal transaction ID
and the virtual transaction ID,
but there is also a third type.
And I already said: check your monitoring.
Probably you don't have this or that.
I bet, with like 90% probability, that even if you have a transaction ID wraparound
prevention alert, you don't have a multixact ID wraparound prevention alert.
Because usually people just forget about it.
And even big monitoring systems forget about it.
Of course, the chances are lower.
So multixact ID, or multi-transaction ID,
it's a separate space of transaction IDs.
And it's used when multiple sessions are sharing some lock.
A common example is using foreign keys.
If you have foreign keys, you are a multixact user, and it also can be exhausted.
So it should also be monitored.
And most monitoring systems fail to remember this and implement it.
And the trick there is that there is a function, age(),
that allows you to see
how far behind a true XID is.
But for multixact IDs,
you need to use mxid_age() instead.
This is what I usually fail to remember
when trying to add multixact ID monitoring,
at least manually, or into some Grafana,
Prometheus setup or something.
I usually forget
that the age function is different.
So this is tricky.
But you should have monitoring of two ages,
of two capacities used, of two spaces:
regular XIDs and multixact IDs. Virtual XIDs are not
a problem at all and should not be monitored. So this is actually
maybe it, in terms of recommendations, where to look.
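A hedged monitoring sketch that covers both spaces in one query (age() for regular XIDs, mxid_age() for multixact IDs, both standard functions):

```sql
SELECT datname,
       age(datfrozenxid)    AS xid_age,   -- regular transaction IDs
       mxid_age(datminmxid) AS mxid_age   -- multixact IDs, the ones people forget
FROM pg_database
ORDER BY greatest(age(datfrozenxid), mxid_age(datminmxid)) DESC;
```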
And I think, if you are a Postgres user, you should think about it even if you're not a DBA,
because, as I said, if you are consuming XIDs too fast, you should optimize for it:
you should use regular XIDs less often, and you should also use partitioning, right? Also, by the way,
sub-transactions, they also increment it.
In my article about how harmful sub-transactions are in the current implementation in Postgres,
I had, like, four items, and this was item number one:
the use of sub-transactions leads to regular XID consumption.
So if you have nested transactions, each sub-transaction is plus one to the regular XID.
So in the developer's mind,
definitely before I read that article,
I was thinking of a transaction with sub-transactions.
In my head, it was one transaction.
And therefore, I would have guessed
it would be a single transaction ID.
But that can be very wrong.
It's the same space.
Yeah, super interesting.
In my opinion, in the current form, if you want to be prepared for heavy load,
you shouldn't use sub-transactions at all, like, at all,
because the dangers are numerous.
Well, not numerous: there are four dangers, at least in my article,
but they are so painful.
And one of them is related to multixact IDs as well, in the foreign key case.
But the good news is that it should be improved.
There is work in progress to improve sub-transaction behavior.
And also Alexander Korotkov,
I recently had an interview with him,
open talks on Postgres TV,
and his project, OrioleDB, is actually solving this already.
So sub-transactions will be much better.
Nice.
And also, OrioleDB, as I remember, has 64-bit transaction IDs already.
So all we talked about, the wraparound problem, is gone there.
So, at least for a few thousand years. Yeah, I think this episode won't make sense in
some years, like five years maybe, I hope so. That would be success.
I think maybe earlier, actually. Maybe earlier.
Yeah, fingers crossed.
Well, I also wanted, for developers,
to remind you that there are a couple of good extensions for looking into the internals.
They're quite easy to use.
The first is pageinspect.
And actually, if you're a Linux user,
you can even use a hex editor to look into pages, from Peter Geoghegan.
pg_hexedit, I think it's called.
It's like additional visualization on top of pageinspect.
And also pg_visibility, which inspects the visibility map.
And as a reminder, the visibility map is not only about the all-visible bit; it's also about the all-frozen bit. So
it's useful to inspect and see which pages are marked as all-frozen.
So it's good for understanding what's happening.
Right?
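A small hedged example with the pg_visibility contrib extension (the table name is a placeholder):

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- How many pages of the table are already all-visible / all-frozen.
SELECT * FROM pg_visibility_map_summary('my_table');

-- And the per-page bits, e.g. the first few pages that still need freezing.
SELECT blkno, all_visible, all_frozen
FROM pg_visibility_map('my_table')
WHERE NOT all_frozen
LIMIT 10;
```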
Yeah, absolutely. I like that. And remember to tune autovacuum, people.
Even if you're on a smaller database, there are a lot of benefits.
Right, but cloud vendors already do it a little bit.
Yeah.
Your DBAs probably will do it.
If you hire some good team, like a PostgreSQL consulting team, they also will do it.
But what they won't do instead of you, people, is partitioning. Use
partitioning, please. Yeah, partitioning is a must for large systems. Wonderful. Anything else?
I think that's it. It's already too much, right? It's great. Thanks everybody for listening. Thank you, Nikolay,
and see you next week. Yeah. By the way, some final words: this was a topic, as we said, from our listeners. Again,
thank you so much. Please continue sending ideas. It's great to have ideas, and we just
choose one of them and discuss it. That's great. Thank you so much.
Even if you just review the list of ideas already there and let us know which ones of those you're
most interested in. Right, exactly. Right. That'd be great. Okay, that's it. Thank you. Thank you. Bye.