Postgres FM - Transaction ID wraparound
Episode Date: January 6, 2023

Here are links to a few things we mentioned:
- The Internals of PostgreSQL, chapter 5 (book by Hironobu Suzuki)
- PostgreSQL 14 Internals, chapter 7 (book by Egor Rogov)
- Transaction ID Wraparound (blog post from Sentry)
- What We Learned from the Recent Mandrill Outage (blog post from Mailchimp)
- How to simulate transaction ID wraparound (blog post by Prashant Dixit)
- Add 64-bit XIDs into PostgreSQL 16 (commitfest entry)
- Partitioning (docs)
- Consider using VACUUM's INDEX_CLEANUP option (tweet by Peter Geoghegan)
- Add wraparound failsafe to VACUUM (commit)
- Do you vacuum everyday? (talk by Hannu Krosing)
- Multixacts wraparound monitoring (mailing list thread)
- Subtransactions Considered Harmful (blog post by Nikolay)
- Buffer management in PostgreSQL (talk by Alexander Korotkov)
- OrioleDB
- pageinspect
- pg_hexedit
- pg_visibility
- Visibility Map (docs)

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:
- Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Hi Michael, happy new year. Finally it happened.
Yeah, absolutely.
First episode in 2023, and this is the most boring topic I can imagine.
But it was requested, right? And I have something to say about that, so let's talk
about transaction ID wraparound prevention: not when wraparound occurs, but how to prevent it.
Well, we will probably touch on how to deal with it if it does happen. Finally, yeah.
So, transaction ID wraparound, or XID wraparound. Yeah, I'm guilty of choosing this one. We had a couple of...
I think this is the first topic we've ever had two requests for, so hopefully there are
at least two people out there that don't think it's boring.
In my opinion, this is the most boring topic, but it's also the biggest fear of DBAs. And again, let's try to explain it
for backend developers and so on,
because actually oftentimes
the probability of this very bad event to occur
depends not only on DBAs,
but also on how the application is developed,
basically, how fast the consumption of real transaction IDs is happening.
So we should think about it from the developer perspective as well.
Yeah, absolutely. And I guess it's worth saying that we really hope the people listening to this never face this issue in their entire careers, but hopefully we can leave people with a few things to look at,
to monitor, to tune, in order to minimize the chance of it ever happening to them. Right, where
should we start? I think it is worth covering a little bit about what this issue is, for anybody
that's new to it or has never heard about it, before going into how to avoid it.
Do you think that's sensible?
Yeah.
Well, the best description is a very detailed description,
and I can recommend PostgreSQL Internals, two books actually.
One is fully online, another is a PDF, fully translated to English from the Russian author.
One is from a Japanese author, another is from a Russian author: Suzuki and Rogov.
And both are quite good at explaining this.
Usually this is explained as: we have four bytes
to represent each transaction ID.
And of course, with four bytes,
half of the space is roughly 2 billion.
The full space is slightly more than 4 billion,
and of course today that's not that many IDs at all. Roughly, if you have, for
example, one or two thousand transactions per second, depending on how stable your traffic,
your workload is, it will take roughly one month or slightly more to reach the limit.
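To make that concrete, here is a hedged way to check how much of that space your databases have already consumed (standard catalog columns; the 2 billion figure is the practical half of the space mentioned above):

```sql
-- How far each database is from the ~2 billion XID horizon.
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2000000000, 2) AS pct_of_wraparound_budget
FROM pg_database
ORDER BY xid_age DESC;
```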
And what happens if you reach the limit? We discussed a few times that each tuple, so a physical row version, has xmin and xmax, and ctid, a virtual column.
So xmin, this is like the birth date of a tuple.
It holds the transaction ID which created this tuple.
And this is also four bytes only.
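As a quick illustration of those hidden columns (any table works; t is just a placeholder name here):

```sql
-- xmin: XID that created the row version; xmax: XID that deleted or locked it (0 if none);
-- ctid: the physical location (page number, offset) of the row version.
SELECT ctid, xmin, xmax, *
FROM t        -- placeholder table
LIMIT 5;
```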
And of course, it should be eight bytes these days.
And there is work in progress.
There is a big thread in hackers.
Some Postgres variations already implement 8-byte, or 64-bit, transaction IDs.
But of course, it's a big redesign, because currently all tuple headers,
with padding and alignment, are 24 bytes, right?
And if you need more, you need to increase that.
And our page size is also limited.
So it's a big redesign and a big change of behavior.
But of course, we should have 8-byte transaction IDs. But what happens if you fill this space with only 4-byte transaction IDs?
At some point, you need to start from 0 again, basically, from 1.
And, well, there are some special transaction IDs, especially until 9.6, but it doesn't matter.
So you're already overlapping, so a
wraparound happens. And to avoid this, there is a mechanism called freezing, and this is one of the
main tasks of vacuum and autovacuum. And autovacuum does freezing all the time. Sometimes it's
so-called aggressive freezing, like "I need it right now"; sometimes it's regular
freezing. But what it does is it says: okay, this transaction ID is this number, but we consider it in
the past by all means. So even if this number is bigger than the current transaction ID,
we know that it's from, like, a previous epoch.
Epoch, right? So it's definitely from the past. And it's achieved by placing an additional
bit, the frozen bit. And if all tuples in a page are known to be frozen, the whole page is also marked as all-frozen. And there is the visibility map: it holds two bits for
each page, one is all-visible and another is all-frozen. And if a page is all-frozen, it means that
all the tuples in this page are from the past. And that's good, because next time autovacuum
processes this page, it will just skip it, because it's all from the past,
nothing changed. Good. So usually these articles have a picture with a circle: half of the circle
is our past, half of the circle is our future, and we are rotating. And if we freeze a tuple or a
whole page, we just consider it outside of our circle. It's definitely from the past.
So that's actually it in terms of theory. But the problem is that there are several handles,
several knobs that DBAs can tune. And usually, if we have a huge table, this aggressive freezing happens quite early, at 10% of the capacity.
So 200 million transactions, and we already need to freeze it as soon as possible.
And you can see in pg_stat_activity that autovacuum is running in transaction ID wraparound prevention mode, "to prevent wraparound".
And this is quite interesting behavior.
We will discuss it later.
But while discussing all this boring stuff,
a question to our audience:
do you have in your monitoring a transaction ID wraparound prevention graph and alerts?
And by that, do you mean the number of remaining transaction IDs?
For example, right.
Yeah.
For example.
So we should sometimes, under certain circumstances, we will discuss it in a second.
Yeah.
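A hedged sketch of the kind of query such monitoring could run, looking at per-table freeze age via relfrozenxid (the limit of 20 rows is arbitrary):

```sql
-- Tables whose oldest unfrozen XID is getting old; these are the ones
-- an anti-wraparound autovacuum will go after first.
SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age,
       pg_size_pretty(pg_table_size(c.oid)) AS table_size
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')   -- plain tables, materialized views, TOAST tables
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;
```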
Before we move on from the theory side of it, I think there's an interesting point about
the name here.
And I've always called it, I've always seen it referenced as transaction ID wraparound.
But I think I actually prefer another term I came across, which is transaction ID exhaustion, which I think describes the problem a bit better.
Because it's like we have this finite but ever-changing number of IDs left, and it's by exhausting those that the problem occurs, because Postgres will
prevent us from wrapping around: it will stop. So that's really useful
for my understanding, because the failure case here isn't wraparound, which would cause things like, basically, corruption in various forms. Now,
Postgres, it seems, sees that as worse than downtime, I assume, and what it will instead do is
eventually enter read-only mode. And that's really bad for applications that want
to keep sending data, of course, but it's not as bad as the alternative. So transaction ID exhaustion appeals to me more as a term.
Well, yeah, right. I like this term, but wraparound shows that it's rotating, right?
But also, in some cases, I saw the word wraparound used in a positive meaning. For example,
we have wraparound
because we are starting to use the same numbers again.
And if you say "wraparound happened",
it sounds like something negative, but maybe it's positive.
So there is some ambiguity here in terms of meaning.
But yeah, exhausted IDs, that's also a good way to say it.
But like, I forgot to say what...
You're right.
I forgot to say what will happen if we...
For example, vacuum cannot perform freezing.
And obviously, Postgres at some point will say
only a few IDs are left to be used.
And exactly, it won't process transactions anymore.
And you will need to perform freezing, basically.
And it will take many hours, usually.
It may take a lot, maybe days.
The question I have on this, actually,
I think I read that the first warning in the logs
is at 10 million IDs left. Considering all of the other defaults in Postgres, that seems
like the least conservative one I've ever heard. I guess... anyway.
Yes, I don't remember this. But I wanted to say there are several blog posts
to learn from other people's disasters.
For example, from Sentry,
and also from Mandrill, Mailchimp's project,
which also experienced this a few years ago.
And obviously,
this is the biggest fear of everyone who uses Postgres.
So, I thought both were excellent blog posts,
but I don't think either did a great job
of explaining how to avoid getting there.
Yeah, you know, back to your note about logs,
you know why? Because
who reads logs?
And even if you
do read them, you should
have alerts, and this should be, like, a critical
alert.
But I would suggest it be way before you're down to 10 million.
But what would be a sensible threshold?
I would suggest this threshold...
I don't remember.
I always forget knob names.
But the default is, as I said, 200 million transactions.
So this is when aggressive freezing happens.
And I would suggest doubling it: just taking the current value and having an alert when we reach two times this
value. And actually, if you have reliable alerts, I would suggest considering increasing this threshold,
because 200 million is like 10 percent of the capacity. I would use at least 20,
30, maybe 40 percent, if you trust your alerting system and so on. Because of course there
is a big overhead from doing this, right? But, usually, when we discuss this with our
clients, of course there are different opinions, and it takes time to reach consensus, because
of course people feel fear: okay, this default threshold, 10 percent of capacity,
maybe we should not touch it. Well, the default was created assuming that maybe you don't have monitoring, right? If you have reliable monitoring,
you can already postpone the point when aggressive freezing happens.
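For concreteness, the knob being described is autovacuum_freeze_max_age (default 200 million). A hedged sketch of the tuning and the matching alert; the values are illustrative, not recommendations:

```sql
-- Raising the aggressive-freezing threshold (takes effect only after a server restart).
ALTER SYSTEM SET autovacuum_freeze_max_age = 400000000;

-- The alert described above: fire when any database has burned through
-- roughly twice autovacuum_freeze_max_age of the XID space.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
WHERE age(datfrozenxid) > 2 * current_setting('autovacuum_freeze_max_age')::bigint;
```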
Why is aggressive freezing bad?
Because it will block you.
Usually, when autovacuum sees that it blocks someone, it stops.
Like hara-kiri, right?
So, like, stopping itself. But if it's doing this to prevent transaction ID
wraparound, it won't yield, and your DDL will fail, so you cannot deploy your changes anymore.
And even if you kill it manually, which is fine actually in some cases, if you need to deploy
something as soon as possible, you know, you see that the capacity used is only
10%.
So it's very far from the limit.
You decide to kill it in favor of doing the deployment, and then it starts again immediately.
This is the problem.
It ignores naptime and everything.
"I'm starting and I'm not yielding.
I'm starting. I'm not yielding. I'm starting."
That's it.
So what you do: you start your deploy and immediately kill this autovacuum.
And only in this case does your DDL proceed.
But then, if you need more DDLs in sequence, it starts again.
So it's a nightmare, actually.
And there are parameters per table,
so you can adjust this threshold per table, though the global setting cannot be changed without a restart.
So you need to manually fight it back. That's why I think sometimes you need to
adjust your thresholds and try to avoid the times when it happens. Should we cover some of the more basic things?
I've got some ideas here for things that I've seen people doing that would increase their
risks of this happening to them.
Obviously, monitoring and good alerting is a great remedy here.
But things that might start to get you in trouble.
I've seen people turning off auto vacuum, which seems like, you know, that's an easy way of starting to get into this kind of trouble.
Well, as I remember, I may be mistaken, but if you turn off autovacuum, this anti-wraparound freezing will happen anyway.
As I remember, I might be wrong here. I think you're right. But for example, if you then have one of the other mitigating factors,
then you haven't had, you know, I think the freezing also happens at other points.
It has like a min age and a max age.
Is that right?
So you can block autovacuum from doing freezing if you have, for example, a long transaction.
And here I should mention that usually people say,
"we have a long transaction and it's bad",
but over time I reached the conclusion
that a long transaction, by itself,
is only bad in terms of blocking issues.
So it blocks others because it holds some lock.
But when we talk about autovacuum behavior,
freezing and so on,
we should talk not about the long transaction, but about the xmin horizon instead.
It will include all aspects of it.
It's possible to monitor it, by the way, but I bet your monitoring doesn't have it properly,
because somehow people try to reinvent the wheel in different ways.
But this concept, the xmin horizon: if you have a transaction which has been assigned
a regular transaction ID, it will hold this xmin horizon. The xmin horizon is the oldest transaction
which is still needed in the system, so vacuum cannot remove the tuples which were deleted
after that transaction started. And if there is an ongoing transaction with some number,
a regular transaction number, not virtual,
I will explain virtual in a bit,
in this case, vacuum cannot delete those tuples.
But not only this.
If, for example, you have replication slots
and physical replication,
the standby node also reports the transaction ID
that it still needs.
So if hot_standby_feedback is on,
it reports that transaction ID too,
and it also can hold the xmin horizon.
And finally, sometimes, in some cases,
as I remember, ANALYZE can hold it.
Also, index creation holds the xmin horizon. There was an attempt
to improve this, to optimize it, in Postgres 14, but until Postgres 14.4 it was corrupting
indexes, so it was reverted. It would be a very good optimization, because if, for example,
you have a huge table and you need to build an index, obviously you will hold the xmin horizon for a long time,
and it will affect autovacuum behavior.
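A hedged sketch of how you might see who is holding the xmin horizon back: long-running backends, replication slots, and walsenders with hot_standby_feedback all show up in standard views (prepared transactions and the catalog_xmin of logical slots are left out for brevity):

```sql
-- Who might be holding the xmin horizon back.
SELECT 'backend' AS holder, pid::text AS who, backend_xmin, xact_start
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
UNION ALL
SELECT 'replication slot', slot_name::text, xmin, NULL::timestamptz
FROM pg_replication_slots
WHERE xmin IS NOT NULL
UNION ALL
SELECT 'walsender', application_name, backend_xmin, NULL::timestamptz
FROM pg_stat_replication
WHERE backend_xmin IS NOT NULL;
```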
Usually, you need to understand how fast your transaction ID consumption rate is.
If you are a developer and you're designing your system,
it would be good to understand, for example, per second, per hour, per day,
how many transaction IDs we will consume.
The good thing is that if you're just selecting something, you are not consuming a transaction
ID.
There is a concept of a virtual transaction ID, and it comes for free.
It won't increase this transaction ID counter. But if, for example, you're writing something,
or you call txid_current, which is the old name of the function,
there is pg_current_xact_id now, I always forget it, I wrote it down here.
So if you call it, it will either return your transaction's
real transaction ID, if it's already
assigned, or it will assign
a new transaction ID.
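A rough way to measure that consumption rate, assuming Postgres 13 or later for pg_current_xact_id (older versions have txid_current); note that the call itself assigns an XID, so it adds one per sample:

```sql
-- Sample the transaction ID counter twice and compare.
SELECT pg_current_xact_id();   -- e.g. returns 123456789
SELECT pg_sleep(60);
SELECT pg_current_xact_id();   -- e.g. returns 123466789, so roughly 10,000 XIDs per minute
```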
By the way, this is the way to
simulate transaction ID
wraparound: you block
freezing by opening some transaction,
and in another session, you start
running txid_current
very, very quickly
to consume a lot of transaction IDs, right?
And in this case... For example, sometimes I see a design where the application performs a lot of rollbacks.
It's probably a sign that we don't think about our transaction ID consumption rate and we are not lean, right?
So we did something, rolled it back, and a transaction ID is lost here.
I mean, wasted.
And you should be more careful.
Maybe rollbacks should be done less often.
So normally, a social-media-like workload is like 80 or 90% reads.
Those should not increase the XID.
In this case, we are moving forward less quickly.
And if we hold the xmin horizon, we can say: okay, we hold it,
but during a minute we will consume this number of XIDs, right?
And it's fine.
During five minutes, that number of XIDs, it's already not so
good. During 10 minutes, it's already quite bad. And monitoring the xmin horizon, this is a great thing
to start thinking about our own behavior. And the good news is that if we have smaller tables,
meaning partitioning, right, we have more pages which are frozen and touched less often,
because this is, like, kind of old data.
And in this case, we have several good behaviors in our system.
For example, if we create an index, it's much faster, right?
So we don't hold the xmin horizon as long.
And autovacuum is much more efficient,
because it can skip more pages, because
they are frozen. So what I'm trying to say is: by using partitioning, you can radically decrease
the chances of transaction ID wraparound, which is probably quite an underestimated topic. I didn't see
articles discussing this, but it's obvious to me. So yeah, if you have large tables...
Yeah, I think that's an awesome observation. It also
occurred to me that it would help with recovery, reading
through a couple of the blog posts, but it hadn't occurred to
me that it would help with prevention, which is really
cool.
You can also, if you consider a huge table, for example a terabyte, and you cut it into partitions,
parallelize processing, because with autovacuum, one worker can work with only one
table.
But if it's 10 partitions, already 10 workers can be involved.
And I recommend having many workers. By many, I mean: take your number of CPUs and consider having up to, like, 25% of the CPUs as the number of workers, because by default it's only three, which is very conservative and not good for modern hardware.
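A hedged sketch of that tuning; the numbers are examples for a hypothetical 32-vCPU box, not recommendations. Note that autovacuum workers share one cost limit, so raising the worker count alone can make each worker slower:

```sql
ALTER SYSTEM SET autovacuum_max_workers = 8;           -- ~25% of 32 vCPUs; needs a restart
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;  -- shared budget; illustrative value
SELECT pg_reload_conf();  -- the cost limit applies on reload; max_workers only after a restart
```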
Well, and I think this was one of the key causes of the Sentry one in particular: they said they still had the default of three autovacuum workers,
so maintenance...
Yeah. The big idea of this episode, even though many things are discussed: if our listeners take away only one thought, it's this.
Without help from developers, who should implement partitioning
to keep tables under roughly, say,
100 gigabytes in size
(if you exceed 100 gigabytes,
partitioning should be there),
DBAs won't be able...
like, their work will be very tough:
to prevent transaction ID wraparound,
to handle it, and so on.
Because even if you just need to create an index on a huge table, it's already a lot of headache.
So developers, please use partitioning earlier.
And by the way, this empirical rule, 100 gigabytes: we discussed it with several people.
For example, Kukushkin also told me: at 100 gigabytes, partitioning is needed.
At one terabyte, maybe sharding is needed already. Well, it
depends, but operational costs keep increasing. So I spent some time thinking, some time
meaning months: I was thinking, why 100 gigabytes? Why 100 gigabytes? And now I can say it. Of course,
it depends on your system, on how fast you consume transaction IDs.
But if you think about it, and you think about this xmin horizon, and you consider how long it takes to create, for example, an index, and that during creation of the index you hold
the xmin horizon, you already start understanding.
This is for engineers, for developers, actually, to understand these thresholds. And you understand:
okay, to create an index on a 100-gigabyte table takes this many minutes; during this time we will hold
the xmin horizon, and it means that freezing won't be possible, and so on. And we can conclude: okay,
we can afford only this number of minutes. So, I mean, I don't have good math here, but I have a sense that it's possible to create the math here and say: this is acceptable for us.
Like, create SLOs, service level objectives.
It's okay for us, during daytime, when we have this number of writing transactions per second,
to hold the xmin horizon
during this number of minutes.
So this dictates how long index creation may take.
And this introduces some limit on table size, right?
And this is quite interesting,
because this is already, like, an explainable limit
for developers, to understand why they should do it.
Yeah, it's also a really nice round number that's easily... Well, yes, yes. Well, this comes from a lot
of pain and experience, yeah, it's derived from that. But by observing these numbers, understanding how
fast our transaction IDs are being spent, we can have good math and say:
for our system, it's like 50 gigs or 200 gigs.
Yeah.
Because we have very good,
like very, very fast disks, for example.
Yeah, I think that's a really good point.
I want to go back to something you mentioned earlier,
which was around the Postgres 14 issue.
Now, I'm a bit scared, because I thought this might be in
Postgres, and I'm scared it might have been reverted. There was some new logic, I believe, put in.
One of my tips was going to be to upgrade to 14 or above, you know, 14.4 or above,
because I think I saw a feature in there that would now more aggressively prioritize a freezing version of vacuum, whereas in the past
it wouldn't necessarily; you'd get a full vacuum. And reading the blog posts, a lot of people
are running a full version of... not VACUUM FULL, that's a completely different thing, but a vacuum.
And I think I saw a tweet by Peter Geoghegan. I actually shared this in my newsletter back in 2021,
but this was before
the version 14 release,
so around January, February time.
He recommended, if you are
ever in a situation of
recovering from
Postgres shutting down due to this issue,
running VACUUM with the
INDEX_CLEANUP option off,
which just does the freezing.
To turn off processing of indexes.
Yes, it's possible.
Yeah.
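For reference, that emergency-style command looks roughly like this (INDEX_CLEANUP needs Postgres 12 or later; the table name is a placeholder). The show notes also link the Postgres 14 "wraparound failsafe" commit, which makes VACUUM skip index cleanup on its own once vacuum_failsafe_age is exceeded:

```sql
-- Freeze tuples while skipping the index cleanup phase, so the vacuum finishes sooner.
VACUUM (FREEZE, VERBOSE, INDEX_CLEANUP OFF) my_big_table;
```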
Now, I thought... yeah.
Now, there was something interesting: I think it was Rob Treat
who replied to this, saying,
could autovacuum do that automatically?
And I thought he added something in version 14.
I thought Peter might have got something into version 14.
I don't remember this.
Or maybe... I don't know.
But if we discuss recovering from this very bad state,
I'd recommend re-watching, or watching, the Postgres TV episode
with Hannu Krosing from Google Cloud,
because Hannu explained that single-user mode is not needed.
And even more: using single-user mode
to get out of that bad state is even less productive, because
you don't see details. And he developed a recipe for how to escape from the bad situation using regular
mode. So it's quite good, and the episode is called "Do you vacuum every day?"
Yeah, excellent interview. It was a talk, but you had a lot of questions, as usual.
Yeah, silly questions. But the reason I brought this up was, I kind of always want
to bang the upgrade drum, but if you're running a heavily loaded system, there are good features in
higher versions of Postgres that have your back on these kinds of
things. As more companies and people are hitting these kinds of issues, the developers are listening,
and they're adding features to try and prevent it, to try and make these things less likely.
Well, you say "of course", but I see a lot of big companies that are scared of upgrading their
databases, you know, still on out-of-support versions of Postgres, like 10 or 9.6.
If you recommended anyone to upgrade to 14.0 right now, it
would be a mistake, right? Or 14.3. But this was the
recommendation during many months after version 14 was released.
So the fears have some basis, right?
So there are pros and cons,
but of course, if you can,
you should always be on a newer version.
It's obvious, right?
Yeah, yeah.
But I think these things aren't always shouted about. I looked in the release notes, and it was quite far down
the release notes that this kind of thing came up.
But yeah, I agree.
I think 14 was a bit of a blip,
14.0 to 14.3.
I don't remember issues like that from
10, 11, 12,
13, 15.
For index creation
and reindexing not holding
the xmin horizon. But without this optimization,
we can state: partitioning is a must
if you have large tables,
or will have large tables.
You should be prepared for partitioning.
Speaking of transaction IDs,
I mentioned there is the normal transaction ID
and the virtual transaction ID,
but there is also a third type.
And I already said: check your monitoring.
Probably you don't have this or that.
I bet, with like 90% probability, that even if you have a transaction ID wraparound
prevention alert, you don't have a multixact ID wraparound prevention alert.
Because usually people just forget about it.
And even big monitoring systems forget about it.
Of course, the chances are lower.
So multixact ID, or multi-transaction ID,
it's a separate space of transaction IDs.
And it's used when multiple sessions are sharing some lock.
A common example is using foreign keys.
If you have foreign keys, you are a multixact user, and it also can be exhausted.
So it should also be monitored.
And most monitoring systems fail to remember this and implement it.
And the trick there is that there is a function, age(),
that allows you to see
how far behind a true XID is.
But for multixact IDs,
you need to use mxid_age() instead.
This is what I usually fail to remember
when trying to add multixact ID monitoring,
at least manually, or into some Grafana,
Prometheus setup or something.
I usually forget
that the age function is different.
So this is tricky.
But you should have monitoring of two ages,
of two capacities used, of two spaces:
regular XIDs and multixact IDs. Virtual XIDs are not
a problem at all and should not be monitored. So this is actually
maybe it, in terms of recommendations, where to look.
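A hedged monitoring sketch that covers both spaces in one query (age() for regular XIDs, mxid_age() for multixact IDs, both standard functions):

```sql
SELECT datname,
       age(datfrozenxid)    AS xid_age,   -- regular transaction IDs
       mxid_age(datminmxid) AS mxid_age   -- multixact IDs, the ones people forget
FROM pg_database
ORDER BY greatest(age(datfrozenxid), mxid_age(datminmxid)) DESC;
```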
And I think, if you are a Postgres user, you should think about it even if you're not a DBA,
because, as I said, if you are consuming XIDs too fast, you should optimize for it:
you should use regular XIDs less often, and you should also use partitioning, right? Also, by the way,
sub-transactions, they also increment it.
In my article about how harmful sub-transactions are in the current implementation in Postgres,
I had, like, four items, and this was item number one:
the use of sub-transactions leads to regular XID consumption.
So if you have nested transactions, each sub-transaction is plus one to the regular XID.
So in the developer's mind,
definitely before I read that article,
I was thinking of a transaction with sub-transactions.
In my head, it was one transaction.
And therefore, I would have guessed
it would be a single transaction ID.
But that can be very wrong.
It's the same space.
Yeah, super interesting.
In my opinion, in the current form, if you want to be prepared for heavy load,
you shouldn't use sub-transactions at all, like, at all,
because the dangers are numerous.
Well, not numerous: there are four dangers, at least in my article,
but they are so painful.
And one of them is related to multixact IDs as well, in the foreign key case.
But the good news is that it should be improved.
There is work in progress to improve sub-transaction behavior.
And also Alexander Korotkov,
I recently had an interview with him,
open talks on Postgres TV,
and his project, OrioleDB, is actually solving this already.
So sub-transactions will be much better.
Nice.
And also, OrioleDB, as I remember, has 64-bit transaction IDs already.
So all we talked about, the wraparound problem, is gone there.
So, at least for a few thousand years. Yeah, I think this episode won't make sense in
some years, like five years maybe, I hope so. That would be success.
I think maybe earlier, actually. Maybe earlier.
Yeah, fingers crossed.
Well, I also wanted, for developers,
to remind you that there are a couple of good extensions for looking into the internals.
They're quite easy to use.
The first is pageinspect.
And actually, if you're a Linux user,
you can even use a hex editor to look into pages, from Peter Geoghegan.
pg_hexedit, I think it's called.
It's like additional visualization on top of pageinspect.
And also pg_visibility, which inspects the visibility map.
And as a reminder, the visibility map is not only about the all-visible bit; it's also about the all-frozen bit. So
it's useful to inspect and see which pages are marked as all-frozen.
So it's good for understanding what's happening.
Right?
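A small hedged example with the pg_visibility contrib extension (the table name is a placeholder):

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- How many pages of the table are already all-visible / all-frozen.
SELECT * FROM pg_visibility_map_summary('my_table');

-- And the per-page bits, e.g. the first few pages that still need freezing.
SELECT blkno, all_visible, all_frozen
FROM pg_visibility_map('my_table')
WHERE NOT all_frozen
LIMIT 10;
```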
Yeah, absolutely. I like that. And remember to tune autovacuum, people.
Even if you're on a smaller database, there are a lot of benefits.
Right, but cloud vendors already do it a little bit.
Yeah.
Your DBAs probably will do it.
If you hire some good team, like a PostgreSQL consulting team, they also will do it.
But what they won't do instead of you, people, is partitioning. Use
partitioning, please. Yeah, partitioning is a must for large systems. Wonderful. Anything else?
I think that's it. It's already too much, right? It's great. Thanks everybody for listening. Thank you, Nikolay,
and see you next week. Yeah. By the way, some final words: this was a topic, as we said, from our listeners. Again,
thank you so much. Please continue sending ideas. It's great to have ideas, and we just
choose one of them and discuss it. That's great. Thank you so much.
Even if you just review the list of ideas already there and let us know which ones of those you're
most interested in. Right, exactly. Right. That'd be great. Okay, that's it. Thank you. Thank you. Bye.