Postgres FM - MultiXact member space exhaustion
Episode Date: July 18, 2025

Nikolay and Michael are joined by Andrew Johnson and Nate Brennand from Metronome to discuss MultiXact member space exhaustion — what it is, how they managed to hit it, and some tips to prevent running into it at scale.

Here are some links to things they mentioned:

Nate Brennand: https://postgres.fm/people/nate-brennand
Andrew Johnson: https://postgres.fm/people/andrew-johnson
Metronome: https://metronome.com
Root Cause Analysis: PostgreSQL MultiXact member exhaustion incidents (blog post by Metronome): https://metronome.com/blog/root-cause-analysis-postgresql-multixact-member-exhaustion-incidents-may-2025
Multixacts and Wraparound (docs): https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-MULTIXACT-WRAPAROUND
multixact.c source code: https://github.com/postgres/postgres/blob/master/src/backend/access/transam/multixact.c
Add pg_stat_multixact view for multixact membership usage monitoring (patch proposal by Andrew, needing review!): https://commitfest.postgresql.org/patch/5869/
PostgreSQL subtransactions considered harmful (blog post by Nikolay): https://postgres.ai/blog/20210831-postgresql-subtransactions-considered-harmful
vacuum_multixact_failsafe_age doesn't account for MultiXact member exhaustion (thread started by Peter Geoghegan): https://www.postgresql.org/message-id/flat/CAH2-WzmLPWJk3gbAxy8dHY%2BA-Juz_6uGwfe6DkE8B5-dTDvLcw%40mail.gmail.com
Amazon S3 Vectors: https://aws.amazon.com/blogs/aws/introducing-amazon-s3-vectors-first-cloud-storage-with-native-vector-support-at-scale/
MultiXacts in PostgreSQL: usage, side effects, and monitoring (blog post by Shawn McCoy and Divya Sharma from AWS): https://aws.amazon.com/blogs/database/multixacts-in-postgresql-usage-side-effects-and-monitoring/
Postgres Aurora multixact monitoring queries: https://gist.github.com/natebrennand/0924f723ff61fa897c4106379fc7f3dc

And finally, an apology and a correction: the membership space is ~4B, not ~2B as said by Michael in the episode! Definition here: https://github.com/postgres/postgres/blob/f6ffbeda00e08c4c8ac8cf72173f84157491bfde/src/include/access/multixact.h#L31

And here's the formula discussed for calculating how the member space can grow quadratically by the number of overlapping transactions:

Members can be calculated via: aₙ = 2 + (sum of k from k = 3 to n + 1)
This simplifies to: aₙ = ((n + 1)(n + 2) / 2) - 1

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Discussion (0)
Hello, hello, this is Postgres FM.
This is Nick from Postgres AI,
and as usual, co-host Michael.
Hi, Michael.
Hi, Nick.
Yeah, and we invited a couple of guests today,
Nate and Andrew,
and we invited them because the company Metronome
experienced outages, I think in May, right,
in May of 2025,
which were extremely hard to diagnose, but these folks did a very
good job and discussed all the details in an excellent blog post.
And this blog post looked so great, so we couldn't miss it.
So thank you for coming.
Let's dive into some very rare Postgres problems.
Yeah, thanks for having us and hopefully sharing this out will keep making it rare.
Yeah.
Yes, thank you for having us. It's a pleasure to be here.
Great. So Michael, where should we start?
How about we start with... Could you guys give us a little bit of an overview of what Metronome does,
why you're using Postgres and what you were up to that was hitting the limits this hard?
Yeah, so I can start with the overview and then Andrew and I can pair on
the exact way that we kind of blew our foot off here. Absolutely. Yeah, so Metronome is a usage-based billing platform.
And so when it was founded, I think it's coming up on six years ago now, it was really, at
that point in time, this is just before COVID and lockdowns began, the target was really
there's a lot of infrastructure as a service companies.
Most of them are billing on some kind of heartbeat, you know, per minute or sometimes per second
model.
That gets really tricky to do at scale.
In the simplest cases, yeah, you're just incrementing a counter, but as things ramp up and your
account executives go out and cook up all different ways to finesse how they want to
charge that client, it gets really complicated.
You add in things like committed spend, credits, you want to automate it, you want to cut people
off when they're not paying.
The complexity really compounds immediately.
And one of the founders had worked on the billing at Dropbox and also encountered issues
where, when you build this internally, which is where everyone kind of starts, it gets very easy
to build a brittle system by accident, build to your local maxima, fit your current pricing and
packaging. And then someone comes in with a new product and suddenly they're delayed, like they can't launch their product or they can't charge
for it for weeks or sometimes months. And so you get led into this spot where you're delaying
launches or just like burning compute because you can't actually charge for it, which feels
really silly. And then also, frequently, you just accept a lot of slippage because building these systems to be perfectly
accurate gets really tricky. And, you know,
thankfully there's a lot of foundational pieces such as like, you know,
Kafka exactly once and you know,
strong transactional databases that make this possible to build have confidence
in,
but it's very difficult to do this internally and kind of like prioritize that
degree of rigor. And also if you're building infrastructure company, you know, let's
say you're a database startup, your best engineers are there because they
want to work on building a database.
They're not actually there because they want to work on building billing.
And so we also think that, you know, kind of consolidating talent to work
in this problem yields a better result.
And I think the marriage of the business logic domain, the exact
correctness needs, and the overall
data scale volumes were right place, right time for this AI boom.
It introduces a lot of very interesting challenges and, well, we're about to talk about one of
them.
And you chose Postgres obviously, right?
Yeah, the Postgres decision preceded both of us by ways, but I think we're probably
quite happy with it.
Okay, good, good.
So do you know why Postgres was chosen in the past?
Not precisely.
I mean, like, probably, yeah, we immediately went for using RDS.
I don't know if it was immediately Aurora.
And interesting.
Yeah.
I think it would be an interesting decision as a small startup to anticipate having a
lot of data scale to want to deal with the schema change pain associated with MySQL.
Yeah, I wanted to add a joke that maybe somebody Googled some open source relational database
with good money data type.
But yeah, you know, money data type is what people shouldn't use in Postgres.
Yeah, I think we're mostly on numeric, but at this point we're
kind of cursing that we didn't lock down the precision more tightly, because Postgres will
give you far too much precision and kind of having to interoperate with any other system
becomes difficult and we're starting to figure out how to unwind some of those decisions.
Yeah, yeah. That is an interesting tangent, especially I guess with people billing by
fractions of a cent and things. So yeah, anyway, probably too much of a tangent or maybe we'll
come back to it later.
Yeah, I mean, in short, yeah, it gets very painful translating between the different
systems that do or don't have that precision. Sometimes even your billing provider, you
know, understandably can't bill in fractions of a cent, which means
that how all the different subproducts you paid for and have those round together might
not end up with the same total.
Can you tell us a little bit about the scale?
I remember 30 terabytes on Aurora from the blog post, anything like in this area you
could share to get us the feeling of the scale.
So we have two primary workloads that are running on Postgres.
The one we'll mostly be talking about is our main DB, which is a common startup pattern.
We've got our kitchen sink database that has nearly every data set within it.
And yeah, that's ballooned out to around 30 terabytes now
and growing terabyte plus per month at this point.
And yeah, this wasn't an issue for a while.
And then depending on how our clients leverage our platform,
part of billing is generating invoices
and invoices are usually characterized
with, like, you know, a subtotal and whatnot,
but then you have a line item per product usage. And, you know, some of these companies have a lot of different
products you can pay for. You know, you look at the list of different LLM, you know, model engines
that are available. Sometimes you can use Batch, Async versus Real-Time, or like infrastructure
vendors who are deployed to dozens of different regions. And it's easy for the cardinality here to explode, especially if you want to group it by
a logical application on their platform. And we made, but we now regret, a decision to
basically persist those line items in a separate row and a ginormous table in our Postgres database.
How big is that table?
It's about half the cluster and growing.
Like 15 terabytes un-partitioned, right?
Yep.
Yeah, well, I know this. It's a pain.
There's some scaling problems you get really ahead of and others you defer because
a lot of the technology in the Postgres space is really fantastic now.
Aurora does probably just handle it, caveat, making sure that you're dealing with vacuuming
and whatnot and all the other kind of implicit thresholds.
Actually, I forgot to mention that those who watch Postgres FM regularly
should have watched, already by this point,
the previous episode with Sugu,
where we discussed Multigres and sharding.
And in that very episode, he mentioned your case, right?
So yeah, I wanted to answer,
this is what we are going to do next week,
but it was a secret at that time.
So, and before that, we also had PgDog,
Lev Kokotov from PgDog.
And obviously, sharding currently is a hot topic again.
And with Citus in the past, there is SPQR.
Before that, we had PL/Proxy from Skype very long ago,
like previous era.
So is it right to say that the problem we are going
to discuss today, you wouldn't meet it if you were on a sharded schema?
If your database was sharded, this probably wouldn't happen.
I think what we've seen would suggest that it would be easier to avoid it.
As part of the migration that we're about to mention, we're moving another table that's generated data to a partition scheme. And we've already seen a
lot of positive effects there. And you can run a lot of auto-vacuum workers in parallel.
And so it allows you to much more easily keep up with the rate of change.
This is partitioning, not sharding. And with one unpartitioned table, the problem is
that vacuum is single-threaded, right. Yeah, so with a partitioned schema,
you can raise the number of workers,
autovacuum_max_workers.
You can raise it and benefit from it
if you have a lot of CPUs and disk IO capacity.
But what I'm trying to say, problems like these,
they like can be avoided completely if we have sharding and every shard is smaller,
every single shard is smaller. Yeah, and that's what I was trying to get at. By partitioning it,
we can kind of see the effect: if these were all in logical chunks that were maybe at max 100
gigabytes, it's much easier for the garbage collection system to keep up. Yeah, yeah, yeah. That's interesting.
It would be even easier if we were across many machines, but at the very least, I think
we can infer most of the effect, even if it's all on one.
Yeah. Okay. Maybe it's already time to dive into the problem itself, right?
Yeah. Or maybe briefly, like why we created the problem.
Yeah. Before you go into that space, I was curious, did you have any issues with MultiXacts before
that at all?
Maybe other kinds of issues, like observing in Performance Insights lightweight lock
contention related to MultiXacts, like there are a couple of those wait events.
You never had them before, right?
Not in particular. No, I mean, we run into, you know, emergency autovacuums for sure. Yeah, that's very interesting, because your path is definitely different from what we
observed with other projects, because usually different problems with MultiXacts came up
and contention happened and so on. Your path is quite unique in this case.
I think.
Yeah.
So let's dive in.
Let's dive in.
Yeah.
I really want to describe like, yeah, how we created the problem for ourselves.
And I really want Andrew to explain how we diagnose it.
Because that's the most interesting part here by far.
Yeah.
As we kept seeing these tables grow, we kind of realized, also due to some access patterns that
were introduced, it wasn't really tenable to keep just appending data month after month for every
customer in terms of the line item table. This is also painful for the invoice table, which
there's a lot of write amplification between the two. The average invoice has dozens, sometimes
hundreds of line items, depending on how a client's using our platform.
And so invoices have to be queried a lot of different ways.
We didn't really see a way of moving that outside of a relational store and like
being able to use foreign keys and constraints with it in the rest of our data.
It was like providing too much value to really doing anything different there.
We did wind up partitioning it just to like enable the auto vacuum process to keep up much more easily. But we realized that we could constrain the access pattern
for the line item sufficiently that it didn't actually need to be in Postgres. Because I
guess within Metronome, over the course of a month, you have the notion of an invoice
that's in a draft state where we'll calculate on demand for you. We're not actually persisting
data associated with it. Then only after you go through a finalization, which is typically the first of the month,
because most billing periods are going to be monthly, that we do all the math, do the
calculations, persist that data. Which, you can then infer our workload sounds pretty funky.
So we've got like a steady trickle of writes for configuration changes. And then once a month,
we have to write out several billion
rows, which amounts to like one to two terabytes of data. All on the same day. As close to it as
we can get. Yeah. Very different from social media. Very different. Otherwise, it's a more
typical, we have 99 plus percent reads versus writes. It's a lot of us just compute on demand.
So it's a little funky. I think that because the high spikes in write volume are so periodic is
why we haven't seen multi-exact problems in the past. I think we'd likely narrowly been avoiding
them because it's so hard to observe whether or not you're approaching them.
You know, I feel like this is one of the, like, metastable states of Postgres,
if you're familiar with that term, from the definition of metastable, where
often things are working, working, working, and then they just fall off a
cliff or entirely lock up.
But the basic trigger here being, yeah, we identified we wanted to make
this architecture change, and then we were running the very long migration
process of pushing these line items into S3, which
is what our backing store is going to be, since their access pattern is so simple, but
also rewriting and moving the invoices from one table to another.
And unfortunately, we're doing that broadly one by one, which we'll get to why that's
relevant.
So you mentioned performance cliffs.
This is a very good term when we talk about MultiXacts.
And when we had different cases in the past where, as I mentioned, contention happened.
And interestingly enough, like 20, 15, 10 years ago, we often saw opinions
that foreign keys are dangerous.
Like they have huge overhead, don't use
them because they slow down everything and so on. But it's interesting
that now we can agree with it, but in a very different, unexpected
way, because yes, they slow down writes, but we can afford it, and your
table which has grown to 15 terabytes is a good example, right?
You survived with foreign keys and you paid that price to have good data
consistency, right?
But then, performance cliffs, right?
So, yeah, for that table in particular, like the reason why you're able to move
that was, you know, it only had one foreign key, which was like back to the
invoice row, it was much simpler.
That table is effectively append only.
For that workload,
it just really didn't have to be on Postgres.
I don't think we were taking advantage of the qualities
to justify the cost.
So let's discuss MultiXacts in general,
like why we should think about them.
Like, you know, my own story usually, with MultiXacts:
I take some monitoring, this is my hobby.
Take some Postgres monitoring, and we check
which metrics they have.
And usually they have transaction wrap-around risk
analysis, right?
To control freezing process and to understand how far we are from transaction ID wraparound.
My question number one: where is MultiXact ID wraparound analysis?
Most monitoring systems lack it, but they lack it because it never happens.
Transaction ID wraparound does happen, we know several blog posts and horror stories
from the past, but MultiXact ID
wraparound, nobody experienced it.
So okay, why should we think about MultiXacts?
Tell us more technical details about what happened in your case.
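For reference, the classic wraparound checks that monitoring systems do tend to cover can be expressed per database roughly like this; a minimal sketch:

SELECT datname,
       age(datfrozenxid)    AS xid_age,            -- transaction ID wraparound risk
       mxid_age(datminmxid) AS multixact_id_age    -- compare with autovacuum_multixact_freeze_max_age (default 400 million)
FROM pg_database
ORDER BY mxid_age(datminmxid) DESC;

Neither number says anything about member space consumption, which is the gap discussed later in the episode.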
Well, briefly, I'll respond to the point about foreign keys.
I think the community is right and that you don't always need them.
But I think what can be tricky is that, you know, when you're a very young company making
lots of changes to your product, trying to find that product market fit and get to the point where
scaling becomes the biggest concern. Foreign keys are really useful to like prevent teams from
stepping on each other and breaking your data. And only when those product areas and like the
platform you're building on top of become mature enough, can you maybe afford to put in the pull request level integrity checks that make sure you're
not changing some access pattern or maybe build the asynchronous offline data consistency
check to make sure you're not violating it.
This is a case where we didn't think critically about have we made that leap from one side
to the other.
In retrospect, we have, and it's one of the mitigations we took.
I would defend having a really strict schema early on, and then as you scale the platform,
it just becomes necessary to relax some of those guardrails.
Yeah, to maintain good quality of data, that's absolutely great advice.
But let's discuss how MultiXacts are created,
and so on.
Sure.
Yeah, so I suppose we should start
with what a multi-exact even is and when it's used.
So basically, in Postgres, when you
have two or more transactions trying to lock the same row
at the same time, for instance, when you're using select
for share or select for update, or in fact when you're inserting into a table that has a
foreign key to another table, Postgres uses a system called multi-exact or
multi-transactions and a multi-exact is basically a logical object that
represents a set of transactions locking the same row. Each participating
transaction is called a multi-exact member, and those
are a little distinct, but they are related. And the way that a multi-exact is stored is that it's
represented by a 32-bit integer ID, and that's where your wraparound issue can happen with
multi-exact IDs, but in this scenario, that was actually not the problem. For every multi-exact, Postgres keeps an offset into a globally shared append-only multi-exact
membership space.
These are stored in the PGMultiExact slash members file on disk.
You can think of it as logically like a simple last recently used cache that is contiguous.
So you have to append to it every single time. And so
when you have to store these members, you will have to look up an offset into
this theoretically long array, and each one of your transaction IDs is stored
in one slot in this array, and this array is indexed by a 32-bit integer. This
membership array is what it was exhausted
in our case. So due to the way that
multi-exacts are created and
subsequently reaped, it is easy to exhaust
this membership space when you have a lot
of parallel transactions attempting to
lock the same row, whether it's
through foreign keys or whether you're doing select for share or select for update.
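As a minimal sketch of how a MultiXact appears (using a hypothetical accounts table; pg_get_multixact_members is the undocumented helper function mentioned later in the episode):

-- session 1
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR SHARE;   -- xmax holds a plain transaction ID

-- session 2, concurrently
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR SHARE;   -- two lockers now, so Postgres replaces
                                                 -- xmax with a freshly created MultiXact ID

-- session 3: inspect it
SELECT xmax FROM accounts WHERE id = 1;            -- shows the MultiXact ID
SELECT * FROM pg_get_multixact_members('12345');   -- pass the xmax value seen above;
                                                   -- lists member xids and lock modes

The same KEY SHARE locking happens implicitly on the referenced row when you insert into a table with a foreign key.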
And the reason for that is that since these multi-exact are immutable, suppose that you
have three transactions trying to lock the same row.
You would first create the first MultiXact between the first two transactions, and then
the third transaction would come in and you have
to create yet another multi-exact with the old transaction IDs and then the new
one. But the membership space is not relinquished until a vacuum occurs,
which can happen later or perhaps never in some cases if you are constantly
adding new transactions and you're super busy doing all these updates.
And what's interesting is this.
The multi-transaction membership space cannot be reclaimed in chunks.
It has to be reclaimed contiguously.
So suppose you have a very long-running transaction that is part of a multi-exact that is very
old.
It can hold up the entire vacuuming of subsequent MultiXacts that are finished
but short-lived. And it is not until the oldest multi-transaction is finished that
the vacuum can actually reclaim all the membership space from the oldest MultiXact to the most
recent one. And this scenario creates the ability to exhaust the MultiXact membership
space. And that's what we experienced ourselves. Yeah, I wanted to add one thing. It might sound strange that SELECT FOR UPDATE
also may produce... you also explored this, right? It happens
only when we have subtransactions, I think, right? So if it's a
single transaction, SELECT FOR UPDATE, there is no place for a MultiXact.
But if we start doing save points there,
basically we have different transactions
inside one transaction.
And then multi-exact mechanism starts working again.
And I just remember how many years ago, four years ago,
I explored this when I explored another big performance
cliff related to sub-transactions. So there I noticed this as well.
That is true.
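A minimal sketch of that savepoint case, again with a hypothetical accounts table:

BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;       -- xmax = this transaction's xid
SAVEPOINT s1;                                         -- starts a subtransaction with its own xid
UPDATE accounts SET balance = balance WHERE id = 1;   -- the locking xid and the updating subxact xid
                                                      -- now differ, so xmax becomes a MultiXact ID
ROLLBACK;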
Good introduction. Thank you so much.
So, yeah, it's not it, right?
There's also some problems like quadratic behavior, right?
Yes, I hinted at that before.
So essentially, if you have, let's say, five transactions that are attempting
to lock the same row, you have to create the multi-exact one at a time by essentially cloning
it with the previous transaction IDs and adding that additional transaction ID one at a time.
So now you have a MultiXact with two, then three, then four, then five members. And you add that
all up and that becomes a quadratically growing usage
of MultiXact membership space,
at least until the vacuum can actually clean up
those old multi-exacts.
And so that's where you will find quadratic consumption
of the multi-exact space, and this can happen very quickly
if you have a lot of concurrent transactions
attempting to, say, insert into a table with a foreign key that links to another table that is a low cardinality table.
So they're all trying to reference that exact same row.
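To make the arithmetic concrete, here is a small sketch of that growth, written in terms of t overlapping transactions locking one row with no reclamation in between (this is the same formula as in the show notes, with t = n + 1):

-- member slots consumed: 2 + 3 + ... + t = t(t+1)/2 - 1, i.e. quadratic in t
SELECT t,
       (t::bigint * (t + 1)) / 2 - 1 AS member_slots
FROM (VALUES (3), (10), (1000), (65536)) AS v(t);

-- at t around 65,000 the sum already passes 2^31, in line with the roughly
-- 63,000 overlapping transactions discussed later in the episode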
But it's also applicable when we also have an additional long-running transaction, right?
Because otherwise cleanup will happen, right?
Correct.
So a long running transaction will cause the vacuum to essentially stop reclaiming multi-exact
membership space until it is finished because it can only do it sequentially.
Should this long-running transaction also participate in the MultiXact
mechanism, or can it be any transaction, even a simple select?
A simple select probably won't do it.
It would have to be in a multi-transaction context. So if you, let's say, have a multi-exact
and you have two transactions in that and the first one finishes, that multi-exact is not
cleaned up until that second transaction, part of that multi-exact is also finished. So all
transactions in the multi-exact have to finish before the vacuum can reclaim it.
Yeah, so this highlights pieces of advice like: partition your large tables, avoid long-running
transactions in Postgres OLTP, right?
Like, just yet another reason to avoid long-running transactions.
And, yeah, long-running transactions on replicas reported via hot standby feedback, they
are harmless here.
Right?
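A minimal sketch of the kind of long-running-transaction check being debated here, on the primary, with an illustrative one-minute threshold:

SELECT pid,
       now() - xact_start AS xact_duration,
       state,
       left(query, 60)    AS current_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '1 minute'
ORDER BY xact_start;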
So like again, yeah, yeah. I'm trying, we
are debating right now how to properly monitor long-running transactions because there are
different effects, negative effects, and we need to distinguish them. In this case, yeah,
we should monitor probably long-running transactions which participate. And by long, how long was it in your case? Like one minute is long, or 30 minutes
is long in this case? Actually, I'm not sure, I don't actually have the metrics on that. Nate, do you
know? I think, you know, 99.9, probably 99.99, percent of our queries are, you know, sub-second. And I guess, like,
I previously worked on a database for a company, or for a team, at Stripe and, like, learned some of these lessons
really deeply, or have a little scar tissue, in short. Running a platform like that is difficult.
And so I've been pushing us in that direction and kind of moving anything that we expect
to be longer off of the online workload into our Lakehouse. And so it's very, you know,
occasionally we have queries that go five, 10 seconds,
but I believe we have a like, you know,
global timeout of about 30 seconds.
Just, like, part of my misunderstanding here
is I thought part of these MultiXacts
is that the multi-transaction can be daisy-chained
as you have many concurrent overlapping transactions occur.
I think that's
more what you're running into than, like, a long-running one: we just had this steady
stream of overlapping transactions. And until you break that chain, you're in this, like,
hosed position where you keep, you know, very quickly, at an increasing rate, consuming space.
Yes, I was only merely stating the cleanup
scenario and the challenges there, but you're absolutely correct. When you have this daisy
chaining, you know, constantly creating new MultiXacts by adding new transaction IDs to
it, you will very quickly exhaust your membership space.
You say very quickly and very easily, but I mean, we are still talking about two billion,
right? Like, I think that's a,
there's been a debate in the Postgres world
about 32 bit versus 64 bit for a long time,
but we are still talking about two billion.
So you say very quickly and very easily at your scale, right?
Like this is also due to how many things you're doing
and maybe the pattern as well,
like invoices that have lots of line items, or, you know, it feels like there are
specifics here that make this more likely. I'm not saying it shouldn't be a problem. I'm looking forward to hearing how you diagnosed this and what we can do better. But I feel like
we should say it's not necessarily very easy to hit this limit.
I'm not actually sure; I think I found one other Stack Overflow post of somebody else who hit it.
I don't know if you came across that while you were in the middle of this,
but I couldn't find any other examples of people hitting this.
But it's super interesting and I think we're seeing more and more companies using Postgres
at this scale. Like Nikolay mentioned a few of them that we talked to, but we also had an episode with folks from Notion and Figma and Adyen as well on that same episode, so there are a lot of
high volume companies now on Postgres so I think this stuff's super interesting but two billion
still seems like even with the daisy chaining it still seems like a lot to me. Let me disagree with you. There is no... Like, debate is not like 64 or 32.
Postgres desperately needs 64 transactions.
And, for example, OroidDB already implemented it.
But OroidDB is like alternative storage engine.
And it's right now in hands of SuperBase in terms of development.
And I think right now there is launch week and they publish some benchmarks again.
Like it's great, great project.
So I think many projects dream about the day when we will have 64, not 32.
Right now only several like forks implemented only.
Right.
There are commercial forks which implement 64 transaction IDs and not exact IDs.
Right.
So I didn't know that.
Yeah.
Sadly not Aurora, it sounds like.
Yeah.
Yeah.
So 2 billion is not like, it's not a lot already.
It's not a lot.
Yeah.
It's pretty easy to hit it if you're going at scale and you're hitting it with quadratic
velocity.
Oh, yes.
You get to 2 billion quickly.
It does sound like a big number, but it's not that big.
Michael, sorry for the interruption.
Michael, you can take a chessboard and start putting one grain on the first cell and so on.
Oh yeah, so wait, maybe I misunderstood.
If there are five row changes we're trying to do, do we create two to the five rows in
this table, or is it like one, then two, then three?
Is that only five rows?
Oh, sorry, in the member space?
Yeah, good question.
So basically, it's more of a sequencing thing.
So essentially, you say you have two transactions.
You start with two, you get one multi-exact, taking up two members.
The third transaction comes in,
you create a new multi-exact with three members,
but that old one with the two members
has not been cleaned up yet, so you're now taking up five.
The third one comes in, I'm sorry, excuse me,
the fourth transaction comes in,
you create a multi-exact with four and three and two,
so now you've added an additional four on top of that.
And now the fifth one comes in,
and now you have a multi-exact with five, four, three, two.
And all that added together takes up membership space
until the vacuum can come in and reap those old MultiXacts.
Yeah, the chessboard example is good here, I guess.
We can try to understand the volume when we approach cell
number 64, how big it will be.
Andrew, you've done such a good job describing how this could happen, but I'm guessing this was relatively new knowledge. When you were in the middle of this,
how did you work out what on earth was going on?
Yes that's quite the journey. So I'm actually really new to Metronome.
I just joined in April.
So when this incident was occurring,
it was only my fifth week or so.
And this was occurring on a Saturday
right before a vacation to Miami.
So, wow.
Well, let's rewind a little bit
because I guess the most embarrassing part of this
is like the first, what we now know
was the first occurrence was a full week prior.
Where our migration workload had basically implicitly triggered this.
We hadn't really known what to do.
And unfortunately, we weren't able to get to a full RCA.
And so, you know, we got to our best explanation, which was, like, oh, there's something in transaction
ID wraparound that we're not grokking correctly, or Amazon's hiding something from us. And then we proceeded to turn it back on with a slightly lower concurrency,
you know, five days later. So like, yeah, so we didn't get there immediately. But then,
you know, that was a P one incident, you know, internally, tried to be cautious, but didn't
follow through sufficiently. And so this is like kind of in the second chapter where Andrew
comes in.
Yes, that's right. I was not a participant in the first three incidents. It was only the fourth
one, which I think was on a Saturday in which the call went out from Cosmo to see if additional
engineering resources could come and jump on. So I decided to, but having been so new, I actually
didn't have any like tactical knowledge I could use to address all the specific
fires that were being started by this. So I decided to actually figure out what exactly
was happening and as Nate had said we weren't really sure but we knew it was related to
multi-exact in some way. We just weren't really sure exactly why and we were struggling to
really fully understand because in our initial evaluation, we saw that the
number of multi-exact IDs that was being created was well below the threshold of like 400 million.
And we weren't really sure why.
So what I did is I took that knowledge and I just went ahead and looked into the Postgres
source code and looked at the multi-exact dot C file, which is this enormous file that
contains pretty much all the logic that you'll need to know about MultiXacts.
And in there, we were working with the hypothesis that somehow this threshold of 400 million,
but we were only seeing 200 million before vacuums started occurring, was related in some way.
So I found the function called MultiXactMemberFreezeThreshold,
which calculates the threshold at which you should start,
basically, an autovacuum, based on the number of MultiXacts that are in use and, notably,
the number of MultiXact members that are in use.
And that's what caught my eye.
So from there, I mathematically proved that the amount of members should be extremely
high
given the code path that we are obviously hitting
because we're seeing vacuums occur
at a specific number of multi-exact being used.
And from there I was able to find log messages
related to multi-exact membership exhaustion
and correlate that with Datadog logs
that we had found ourselves in the recent week.
And there we could conclude that
the MultiXact membership space was being exhausted, and we were able to
put together what I would consider kind of a hacky estimate of the number of MultiXact members that
are currently in use, by looking at the files that are written for the members and estimating, by
multiplying by a constant, how many members are currently in use at the moment. And that's how we came across the solution, or the cause, I should say.
I think this is worth diving into a little bit because I definitely didn't realize that there
were at least two exhaustion possibilities with MultiXacts. So we've got the total... well, yeah. So why are there
these two numbers? Like, why do we have this 2 billion, but you were monitoring at 400 million,
which should be super safe, right? Because we have the possibility to go up to 2 billion, but you
want it to be cleaning up much sooner than that. So you monitored at 400 million or so.
Why is there another limit? Yeah, so we have two limits here. We have an ID that we give to MultiXacts. That's
a 32-bit integer. So of course you cannot assign more than two billion to these multi-exact
because we don't assign negative IDs. And then you have the multi-exact membership space. Now that is represented as logically a simple LRU that is append only, and so we can only
address up to 32 bits of membership space.
Each membership slot is addressed by a 32-bit integer, and thus now you have two different
values that you have to worry about.
You have to worry about how many MultiXacts
you're creating and how much membership space
you're consuming with those MultiXacts.
And that's where those two numbers come from.
And one of the great things in your blog post
you mentioned is how do you even monitor that?
Have you worked out a way since?
Is there anything you can do there?
Yes, I think I might have just offhand mentioned it, but we use an estimate where we read the membership files,
because they are written to disk, and then we know that the files have to be a certain size, and we multiply by that constant,
whatever the size of these members are, and we get an estimate for the number of members.
Now, this is not the best solution, which is why I submitted a Postgres patch
to surface this telemetry directly.
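On self-hosted Postgres, a rough sketch of that file-count heuristic could look like the following; the per-segment constant of ~52,352 member slots is derived from the page layout in multixact.c (verify it against your version), pg_ls_dir needs superuser or an explicit grant, and the Aurora-specific recipe is linked in the show notes:

SELECT count(*)                                         AS member_segment_files,
       count(*) * 52352                                 AS approx_members_in_use,  -- ~52k slots per 256kB segment
       round(100.0 * count(*) * 52352 / 4294967296, 2)  AS pct_of_member_space     -- member space is 2^32 (~4B) slots
FROM pg_ls_dir('pg_multixact/members') AS f;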
Nice, thank you.
Yeah, that's what I wanted to mention.
We have several things going on in mailing lists after this blog post.
By the way, before this: why did you decide to post in such great detail? What was your
intention, to help the community? I guess a couple of factors. One, we do want to help the greater
Postgres community. I think there are lots of other situations where we've benefited greatly
by blog posts put out, in particular by companies like Percona, as well as others, you know, maybe led us away from paths that would have wound up being painful. But another part is
that, you know, internally at Metronome, we view ourselves as an infrastructure provider for our
clients. And while, you know, I think it's okay to say that we're not fully recognized as that yet,
and we're still on our journey to solidifying that narrative, it is how we want to act, to be perceived that way.
And we are in the critical path for a lot of our clients and how they operate.
And so for us to have multiple, you know, multi-hour downtimes in the space of, you
know, less than 10 days is, like, pretty bad and honestly not what you want from your infrastructure
provider, if you choose to view us that way.
And so having a really crisp explanation of, hey, we are a little embarrassed by this, but here's what we learned,
so you can learn from it, as well as a little bit of vetting of, like, we didn't run into the easiest
thing. It is novel. There is a reason why it took a while for us to sort it out. And so I think a
lot of our partners felt a lot better. Better could be a strong term, but when you have an honest RCA,
and, like, you know, Cloudflare might be leading the
industry in how they operate in this vein, you feel a
lot better that your provider is actually learning from mistakes
and is going to do a better job of avoiding them than you might feel
internally if you're evaluating, should I just build this in
house instead.
So, great transparency and so on, it pays off over time. Yeah, that's great. So I saw several
mailing list discussions, but before that, I feel like with MultiXacts we can have
different kinds of problems. One is wraparound, right? Okay. As I said, I never saw it. And that's why many monitoring systems don't even monitor it.
And we have several cases of SLRU limits reached, right?
And one of them was explored in the past.
We can maybe provide links to blog posts describing it.
And this is your case.
And your case obviously bumped into a lack of observability, like how much of the capacity
we've reached, how far from the performance cliff we are.
But also, besides that, I saw different topics were discussed in mailing lists.
For example, there is a setting, vacuum_multixact_failsafe_age, which is about wraparound, actually.
And Peter Geoghegan brought to attention that this setting doesn't take into account MultiXact
membership exhaustion.
So there are also interesting side effects from your blog post, maybe fully unexpected. Then I saw, Andrew, I think, your proposal
to have this MultiXact monitoring system view.
So any monitoring could have some graphs and also alerts probably, right?
And this proposal currently at what stage?
I saw it's in the CommitFest, but it requires review, right?
Correct, yes.
The patch is in the CommitFest, and it definitely
still requires review.
Well, I've engaged with another contributor named Naga;
we've actually submitted competing proposals.
His actually directly reads from the memory space,
while mine uses the PG stats infrastructure to surface
this telemetry.
But yes, it is still just in review.
It has not been accepted yet as far as I can tell.
Yeah, so, but it's great that it's registered in the CommitFest.
And I think some people who listen to us, and maybe some people who also watch
our hacking sessions, which we do with Andrey and Kirk sometimes, almost every week.
We missed today because of some issues.
But we are going to be back next week.
So we show how easy it is actually to test.
You take Cursor, you tell it, just compile Postgres,
apply the patch, and let's start using it.
And instead of diving too much into C code,
you think about use cases and how to, for example,
emulate problems, right?
And so on, this is actually fun.
And I encourage our audience to maybe start
with their first review and test this patch.
And this can increase the chances that it will be included
in Postgres 19, right?
Yes, what's also kind of interesting is,
I know you earlier mentioned that there was a debate
about moving to 64-bit integers, and in fact,
part of this commit fest does include a proposal
to change those offsets to 64-bits.
So there's a lot of changes that are coming in 19, it seems,
if they're accepted, of course,
which would help alleviate this
multi-exact membership and multi-exact issue writ large. That's great but it's only next year.
I suppose we can't go too fast. Well, and even longer if it's Aurora, right? They generally have
a bit of additional lag in getting major versions because it's a fork that they have to do a lot of work on.
On that note though, you mentioned kind of trust in an infrastructure provider and how issues,
big issues, can rock that trust and you kind of need to see some maybe root cause analysis or
maybe some action. Has this at any point rocked your confidence in Postgres or less so Aurora, I guess, in
terms of is it the right platform as you scale?
Maybe initially and then since you've worked out exactly what the issue was, what are your
thoughts there?
Yeah, I don't think it actually has really rocked our confidence in Postgres.
I haven't seen anyone run for the hills yet. Um, but, uh, you know, it, it definitely surfaces the fact that there is quite a
bit to learn and that this is a very complex system, but to that end, something
I've learned at least, uh, participating with the Postgres dev community and being
on the mailing list is that there's a lot of very smart individuals who are part of
this community and who are actively contributing and are passionately
debating what should and shouldn't be part of it and how things should be designed.
So in many ways it actually may have built my confidence to see so many people paying
attention and caring about it from all across the world from Fujitsu, Amazon, Google, ourselves,
even just regular folks doing it in their spare time as a hobby.
So in many ways perhaps it's a testament
to how far open source can go
and how much confidence we can have in these systems.
Because like you said, this is a pretty rare instance
and many of the core cases that most people need to use
are covered or at least will be.
That's really good to hear.
And of course, when people hit issues
with proprietary systems like SQL Server and
Oracle, I'm sure when they blog about it, those conversations do happen internally
at Microsoft or Oracle, but you can't see them.
You can't see the actual conversations.
And part of the way you diagnose this issue was diving into the source code.
So there are some really cool factors that wouldn't have been possible if this wasn't open source.
Yeah, and it's not just open source, but open discussions as well, open mailing lists.
So yeah, that's really good to hear, but it would have been very understandable. People have definitely left Postgres for less good reasons.
So, not naming names, Uber. Yeah, I will say, as we said in our blog post, 30 terabytes is not a small amount of data to move.
And so if we decided to make a large decision like that,
I believe we would have to reckon
with the transition itself.
And so in many ways, we probably want
to try to work with what we got.
And thankfully, I think it's served us well
for the most part. Yeah, there's something to be said for the devil you know. There are lots of other systems,
Postgres compatible, that promise you the world, and something gives, typically, whether it's the
price, the actual performance, the reads or the writes, or how maintainable it actually is to run.
I think it's clear that we're going to have to undergo the investment to figure out what's next because like we can only go single writer for so long,
especially with the demands that we put on the database on the first of the month
every, every single month.
They're pretty extreme.
Yeah.
And behind the scenes, we also chatted with Nate about the formula.
So Michael, for the first time, we should include some formula in the show notes.
There's a formula that's saying that roughly 63,000 transactions overlapping is enough
to reach a 2 billion limit.
Yeah, and so, you know, the dangerous thing of doing some math on the fly is that it can be checked by anyone after
the fact, but yeah, as I worked it out, only 63,000 overlapping transactions, which is
not that many.
Relatively easy to do, you know, with a highly concurrent workload
operating one row at a time with a potential overlap between them.
You can consume that really quickly.
It might not also be full transactions.
If you take a savepoint,
that's almost seen as, like, a subtransaction
and will also create a new MultiXact as well.
So if you're doing a lot of savepoints,
you can easily hit that as well.
Are you using sub transactions?
I'm not actually sure myself, but it is a possibility.
Yeah, I don't believe we are, or at least not intentionally.
I think right now in the latest versions of Postgres,
there is ability to change SLRU sizes to avoid SLRU overflow
cases.
But my opinion about subtransactions in projects
like this, at this scale, remains: just avoid them.
Yeah, because they can bring other problems as well, like subtransaction SLRU overflow
and so on.
And they consume transaction IDs.
So yeah.
Good.
That's, I think, quite a good deep dive, unless we forgot anything.
Any last things you wanted to add?
I think we could talk about the mitigations just so that if someone's unfortunate enough
to run into this, they have a few breadcrumbs for how to work around it.
Because it was rather expensive for us to figure out what those were, including Andrew
and others digging through the source to really understand it.
But a lot of what we inferred after the fact was, so we had this migration, which was doing
a bunch of row by row writes, not ideal if you're doing a highly concurrent process here.
And so we did make those batches.
We also did a lot of deep thinking on which of the foreign keys we actually needed on
this new table. Batches, how did you find the ideal batch size?
I don't know that we're at an ideal, but I think we climbed the hill far enough
that it's not a problem. I see.
And so it was kind of a trade-off between what throughput we could obtain and making sure that
we weren't having the type of like over time growth in our like heuristic for the member
space utilization, which is mainly based on how many of the offset files existed.
Yeah, what order of magnitude?
Oh, sorry, go on.
Yeah, I usually say, when choosing batches, choose the biggest possible, but avoid locking rows
for so long that it affects users, because for users, we know, perception is 200 milliseconds.
So one second should be the maximum, two seconds should be the absolute maximum, for
batch size.
Yeah, it's less than that.
So we're at batches of around, like, 200, which was enough to mitigate this concern
and kind of keep all of the operations, you know, short, since there's a very online
workload hitting
this as well.
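For illustration only, with invented table and column names, the shape of that change is roughly one transaction per batch of ~200 rows instead of one per row, so far fewer concurrent transactions hold KEY SHARE locks on the same low-cardinality parent rows:

BEGIN;
INSERT INTO invoice_line_items_new (invoice_id, credit_type_id, amount)
SELECT invoice_id, credit_type_id, amount
FROM invoice_line_items_old
WHERE id >= 1000000 AND id < 1000200;   -- one batch of ~200 rows per transaction
COMMIT;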
This is not users in terms of batch transactions, but users read results, right?
Or no?
Nobody's reading results.
No, no.
So sorry, this is in the context of our greater migration to move the line item construct
out of our database, as well as partitioning the
invoice table, which has also grown to be quite large.
And so, you know, kind of we would read the old version, all the line items in the invoice
row, write those to a new format in S3, which, you know, ultimately keyed by the invoice
ID, and then allow logic to kind of like, we can't just do a deterministic mapping,
because then you get hot partitions in S3. So you have to do a little bit of extra work and
leave yourself a link in the actual database row. And so then we are writing those out to the new
table, which is partitioned one by one. And our diagnosis, and it's a little hard to confirm this,
because again, there is no like observability bit here, that was the big mistake as well as
the single inserts, like very quick, it was the issue that they all reference a handful of other
tables for foreign keys, some of which were very low cardinality. So you know we have like some
enum tables and we, you know, for better or for worse, in this case worse, had chosen to use a
foreign key reference to them. And so if every invoice is, you know, corresponding to the USD credit
type, for example, you're gonna have a lot of overlapping transactions all accessing
that row to verify the integrity of the data. And so we kind of walked through it and we
realized, you know, a lot of these low cardinality cases, we don't really change those code paths.
We're quite confident they're going to be stable.
It's only the higher cardinality cases.
And then the invoice also references its customer, the one that owns that invoice.
But that's a much higher cardinality connection.
We're much less likely to have overlapping transactions referring to the same customer
in a migration like this.
Without foreign keys, did you think about periodic checks of referential integrity,
like, asynchronously?
Yeah, so we're gonna be adding those
to run in our Lakehouse.
So yeah, like we have a syncing process
that brings all the files over into Parquet.
Eventually we're gonna finish the work
on getting to CDC, so it's more incremental,
and hopefully using Iceberg.
And so on those we'll run, yeah, you know, every... I mean, it's much cheaper to run the checks there
for a lot of reasons.
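A minimal sketch of such an asynchronous referential-integrity check, with hypothetical table names (in their setup it would run against the Lakehouse copy rather than against Postgres):

-- line items whose invoice no longer exists; should return zero rows
SELECT li.id
FROM invoice_line_items li
LEFT JOIN invoices i ON i.id = li.invoice_id
WHERE i.id IS NULL
LIMIT 100;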
More and more reasons to move data out of Postgres.
I understand it for analytical purposes,
and keeping it in S3.
Yesterday, S3 released a vector index.
If you haven't heard about it, it's interesting.
So I'm just curious where the place for Postgres
will be in our systems.
But I totally understand for analytical workloads,
definitely.
Yeah.
I mean, everything that's OLTP-shaped
is going to be staying in Postgres for some time.
Can't see a reason to move anywhere else.
Well, yeah.
Great.
In terms of mitigations, did you tune any of the settings
around multi-exact parameters?
Yeah.
What did you do?
Which ones did you increase or decrease?
I don't know if there were actually any specific Postgres
flags that we changed, actually.
I don't think we changed any constants specifically.
But what we did do is we did
introspect the code base and determine the danger levels for membership space, and so we use that as a
threshold for alerting ourselves with Grafana. And so if our membership space consumption
breaches a certain level, what we will do is we will take actions to reduce the concurrency, the amount of concurrent transactions that
are occurring until Postgres can recover that membership space.
So it's a little bit of a manual intervention at the moment, but I don't think there's
any specific knobs that we can tune because it is fundamentally limited by a 32-bit integer. And the telemetry with respect to multi-exact membership numbers
is not used anywhere else except in the detection of when auto-vacuum should occur,
and that is sort of automatic and kind of hard-coded.
And for the current utilization, you use the function pg_get_multixact_members, right?
Or how do you?
No, I don't believe that is it.
I think it's some specialized function where we actually have
to get all the files and read the file sizes themselves.
So it's not something from Postgres I don't believe.
But it's Aurora, you don't have access to files.
There is some function, I can't remember if...
It's really a list or something, right?
So it lists files, it's still SQL, you can list files.
Okay, interesting.
The query originated from an Amazon blog post, but I'll try to pull it up, and you can see if
you want to include it in the show notes.
Let's make sure.
I don't remember if this recipe was shared in your blog post,
how to monitor this.
It was not.
We did not post it there.
We actually intend to have a follow up
with a more engineering-based audience, where
we would detail these steps kind of in the same way
that we detailed it here, but in a more concrete format
for engineers and not for executives and company leaders.
Yeah, and we were a bit more timeline-constrained for the initial RCA. And we're
optimistic that we can link to Andrew's patch for the second, follow-up blog post.
Exactly. That's awesome, because probably we should
add it to monitoring practically everywhere in large systems, right? This should be the main outcome of this discussion, probably, for those who listen and have big systems.
Absolutely. Before your patch, Andrew, gets accepted and everyone upgrades to Postgres 19, which will take some time.
Right. So yes, until then, we definitely have some guidance for monitoring systems.
We admit though that it is not the most ideal, but it is something that people can use to
guide themselves away from disaster.
And I think, well, there's the thread you mentioned from Peter Geoghegan, he's somebody
that has worked in the past on mitigation factors. So not only to be able to monitor transaction ID wraparound,
but he did some things to vacuum so that it's just less likely to get into that state in the first place.
And I think making that parameter he mentioned aware of the member space, or having some different parameter that kicks in and does, like, an anti-wraparound vacuum,
or whatever the equivalent would need to be, earlier, would make this less likely to happen.
And of course, monitoring is a good idea anyway, but prevention seems even better.
So, yeah.
Optimization inside Postgres could be implemented here, because of the quadratic behavior;
at least it could, like in other cases in Postgres, be split into some
pieces, like 128 pieces, something like this. I don't know. I haven't seen discussion of
an actual optimization to get more capacity. Have you seen it? I'm not sure.
No, but I was looking through what parameters we even have
that mention MultiXact,
and one of them is, in version 17,
we got multixact_member_buffers,
which is quite low by default,
only 256 kilobytes or 32 pages,
and we've got multixact_offset_buffers,
also low, 128 kilobytes by default.
So I'm wondering whether increasing those
would help with the member space. Maybe that's the offset buffers one, and then, yeah, we also have vacuum_multixact_failsafe_age,
which you mentioned, version 14, that's the one from Peter Geoghegan, and two much older ones.
I did wonder about vacuum_multixact_freeze_min_age, that's 5 million by default; I wondered if lowering that
would just make sure that the vacuum ran quicker when it did need to happen, you know, that kind of thing. And
then there's vacuum_multixact_freeze_table_age, which is 150 million. So I, again, I'm not sure
if lowering that would mean you just get vacuums more often, but I guess when we're talking
about such a huge table, it's still going to be a problem when the vacuum does run. So until
you get the partitioning in place. Anyway, thank you so much, both of you.
It's been amazing learning about this.
Scary, but great.
And thank you for paying the pain and the cost
so that many other people won't have to in future.
And special thanks for the transparency
and admitting your own mistakes in the past.
I especially appreciate this,
how you talk freely about your own mistakes. This is very valuable and a good example for others
as well. Absolutely. Thanks for having us. It was a pleasure to be here. Thank you. Yeah,
it was a really fun chat. Thank you, and have a great week.