Postgres FM - MultiXact member space exhaustion
Episode Date: July 18, 2025

Nikolay and Michael are joined by Andrew Johnson and Nate Brennand from Metronome to discuss MultiXact member space exhaustion — what it is, how they managed to hit it, and some tips to prevent running into it at scale.

Here are some links to things they mentioned:

Nate Brennand: https://postgres.fm/people/nate-brennand
Andrew Johnson: https://postgres.fm/people/andrew-johnson
Metronome: https://metronome.com
Root Cause Analysis: PostgreSQL MultiXact member exhaustion incidents (blog post by Metronome): https://metronome.com/blog/root-cause-analysis-postgresql-multixact-member-exhaustion-incidents-may-2025
Multixacts and Wraparound (docs): https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-MULTIXACT-WRAPAROUND
multixact.c source code: https://github.com/postgres/postgres/blob/master/src/backend/access/transam/multixact.c
Add pg_stat_multixact view for multixact membership usage monitoring (patch proposal by Andrew, needing review!): https://commitfest.postgresql.org/patch/5869/
PostgreSQL subtransactions considered harmful (blog post by Nikolay): https://postgres.ai/blog/20210831-postgresql-subtransactions-considered-harmful
vacuum_multixact_failsafe_age doesn't account for MultiXact member exhaustion (thread started by Peter Geoghegan): https://www.postgresql.org/message-id/flat/CAH2-WzmLPWJk3gbAxy8dHY%2BA-Juz_6uGwfe6DkE8B5-dTDvLcw%40mail.gmail.com
Amazon S3 Vectors: https://aws.amazon.com/blogs/aws/introducing-amazon-s3-vectors-first-cloud-storage-with-native-vector-support-at-scale/
MultiXacts in PostgreSQL: usage, side effects, and monitoring (blog post by Shawn McCoy and Divya Sharma from AWS): https://aws.amazon.com/blogs/database/multixacts-in-postgresql-usage-side-effects-and-monitoring/
Postgres Aurora multixact monitoring queries: https://gist.github.com/natebrennand/0924f723ff61fa897c4106379fc7f3dc

And finally, an apology and a correction: the membership space is ~4B, not ~2B as said by Michael in the episode! Definition here: https://github.com/postgres/postgres/blob/f6ffbeda00e08c4c8ac8cf72173f84157491bfde/src/include/access/multixact.h#L31

And here's the formula discussed for calculating how the member space can grow quadratically by the number of overlapping transactions:

Members can be calculated via: aₙ = 2 + (sum of k from k = 3 to n + 1)
This simplifies to: aₙ = ((n + 1)(n + 2) / 2) - 1

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Discussion (0)
Hello, hello, this is Postgres FM.
This is Nick from Postgres AI,
and as usual, co-host Michael.
Hi, Michael.
Hi, Nick.
Yeah, and we invited a couple of guests today,
Nate and Andrew,
and we invited them because the company Metronome
experienced outages, I think in May, right,
in May of 2025,
which were extremely hard to diagnose, but these folks did a very
good job and discussed all the details in an excellent blog post.
And this blog post looked so great, so we couldn't miss it.
So thank you for coming.
Let's dive into some very rare Postgres problems.
Yeah, thanks for having us and hopefully sharing this out will keep making it rare.
Yeah.
Yes, thank you for having us. It's a pleasure to be here.
Great. So Michael, where should we start?
How about we start with... Could you guys give us a little bit of an overview of what Metronome does,
why you're using Postgres and what you were up to that was hitting the limits this hard?
Yeah, so I can start with the overview and then Andrew and I can pair on
the exact way that we kind of blew our foot off here. Absolutely. Yeah, so Metronome is a usage-based billing platform.
And so when it was founded, I think it's coming up on six years ago now, it was really, at
that point in time, this is just before COVID and lockdowns began, the target was really
there's a lot of infrastructure as a service companies.
Most of them are billing on some kind of heartbeat, you know, per minute or sometimes per second
model.
That gets really tricky to do at scale.
In the simplest cases, yeah, you're just incrementing a counter, but as things ramp up and your
account executives go out and cook up all different ways to finesse how they want to
charge that client, it gets really complicated.
You add in things like committed spend, credits, you want to automate it, you want to cut people
off when they're not paying.
The complexity really compounds immediately.
And one of the founders had worked on the billing at Dropbox and also encountered issues
where, when you build this internally, which is where everyone kind of starts, it gets very easy
to build a brittle system by accident, build to your local maxima, fit your current pricing and
packaging. And then someone comes in with a new product and suddenly they're delayed, like they can't launch their product or they can't charge
for it for weeks or sometimes months. And so you get led into this spot where you're delaying
launches or just like burning compute because you can't actually charge for it, which feels
really silly. And then also, frequently, you just accept a lot of slippage because building these systems to be perfectly
accurate gets really tricky. And, you know,
thankfully there's a lot of foundational pieces such as like, you know,
Kafka exactly once and you know,
strong transactional databases that make this possible to build have confidence
in,
but it's very difficult to do this internally and kind of like prioritize that
degree of rigor. And also if you're building infrastructure company, you know, let's
say you're a database startup, your best engineers are there because they
want to work on building a database.
They're not actually there because they want to work on building billing.
And so we also think that, you know, kind of consolidating talent to work
in this problem yields a better result.
And I think the marriage of the business logic domain, the exact
correctness needs, and the overall
data scale volumes were right place, right time for this AI boom.
It introduces a lot of very interesting challenges and, well, we're about to talk about one of
them.
And you chose Postgres obviously, right?
Yeah, the Postgres decision preceded both of us by ways, but I think we're probably
quite happy with it.
Okay, good, good.
So do you know why Postgres was chosen in the past?
Not precisely.
I mean, like, probably, yeah, we immediately went for using RDS.
I don't know if it was immediately Aurora.
And interesting.
Yeah.
I think it would be an interesting decision as a small startup to anticipate having a
lot of data scale to want to deal with the schema change pain associated with MySQL.
Yeah, I wanted to add a joke that maybe somebody Googled some open source relational database
with good money data type.
But yeah, you know, money data type is what people shouldn't use in Postgres.
Yeah, I think we're mostly on numeric, but at this point we're
kind of cursing that we didn't lock down the precision more tightly, because Postgres will
give you far too much precision and kind of having to interoperate with any other system
becomes difficult and we're starting to figure out how to unwind some of those decisions.
Yeah, yeah. That is an interesting tangent, especially I guess with people billing by
fractions of a cent and things. So yeah, anyway, probably too much of a tangent or maybe we'll
come back to it later.
Yeah, I mean, in short, yeah, it gets very painful translating between the different
systems that do or don't have that precision. Sometimes even your billing provider, you
know, understandably can't bill in fractions of a cent, which means
that how all the different subproducts you paid for and have those round together might
not end up with the same total.
Can you tell us a little bit about the scale?
I remember 30 terabytes on Aurora from the blog post, anything like in this area you
could share to get us the feeling of the scale.
So we have two primary workloads that are running on Postgres.
The one we'll mostly be talking about is our main DB, which is a common startup pattern.
We've got our kitchen sink database that has nearly every data set within it.
And yeah, that's ballooned out to around 30 terabytes now
and growing terabyte plus per month at this point.
And yeah, this wasn't an issue for a while.
And then depending on how our clients leverage our platform,
part of billing is generating invoices
and invoices are usually characterized
with, like, you know, a subtotal and whatnot,
but then you have a line item per product usage. And, you know, some of these companies have a lot of different
products you can pay for. You know, you look at the list of different LLM, you know, model engines
that are available. Sometimes you can use Batch, Async versus Real-Time, or like infrastructure
vendors who are deployed to dozens of different regions. And it's easy for the cardinality here to explode, especially if you want to group it by
a logical application on their platform. And we made, but we now regret, a decision to
basically persist those line items in a separate row and a ginormous table in our Postgres database.
How big is that table?
It's about half the cluster and growing.
Like 15 terabytes un-partitioned, right?
Yep.
Yeah, well, I know this. It's a pain.
There's some scaling problems you get really ahead of and others you defer because
a lot of the technology in the Postgres space is really fantastic now.
Aurora does probably just handle it, caveat, making sure that you're dealing with vacuuming
and whatnot and all the other kind of implicit thresholds.
Actually, I forgot to mention that those who watch Postgres FM regularly
should have watched, already by this point,
the previous episode with Sugu,
where we discussed Multigres and sharding.
And in that very episode, he mentioned your case, right?
So yeah, I wanted to answer,
this is what we are going to do next week,
but it was a secret at that time.
So, and before that, we also had PgDog,
Lev Kokotov from PgDog.
And obviously, sharding currently is a hot topic again.
And with Citus in the past, there is SPQR.
Before that, we had PL/Proxy from Skype very long ago,
like previous era.
So is it right to say that the problem we are going
to discuss today, you wouldn't meet it if you were on a sharded schema?
If your database was sharded, this probably wouldn't happen.
I think what we've seen would suggest that it would be easier to avoid it.
As part of the migration that we're about to mention, we're moving another table that's generated data to a partition scheme. And we've already seen a
lot of positive effects there. And you can run a lot of auto-vacuum workers in parallel.
And so it allows you to much more easily keep up with the rate of change.
This is partitioning, not sharding. And with one unpartitioned table, the problem is
that vacuum is single-threaded, right. Yeah, so with a partitioned schema,
you can raise the number of workers,
autovacuum_max_workers.
You can raise it and benefit from it
if you have a lot of CPUs and disk IO capacity.
But what I'm trying to say, problems like these,
they like can be avoided completely if we have sharding and every shard is smaller,
every single shard is smaller. Yeah, and that's what I was trying to get at. By partitioning it,
we can kind of see the effect: if these were all in logical chunks that were maybe at max 100
gigabytes, it's much easier for the garbage collection system to keep up. Yeah, yeah, yeah. That's interesting.
It would be even easier if we were across many machines, but at the very least, I think
we can infer most of the effect, even if it's all on one.
Yeah. Okay. Maybe it's already time to dive into the problem itself, right?
Yeah. Or maybe briefly, like why we created the problem.
Yeah. Before you go into that space, I was curious, did you have any issues with MultiXacts before
that at all?
Maybe other kinds of issues, like observing in Performance Insights lightweight lock
contention related to MultiXacts, like there are a couple of those wait events.
You never had them before, right?
Not in particular. No, I mean, we run into, you know, emergency autovacuums for sure. Yeah, that's very interesting, because your path is definitely different from what we
observed with other projects, because usually different problems with MultiXacts came up
and contention happened and so on. Your path is quite unique in this case.
I think.
Yeah.
So let's dive in.
Let's dive in.
Yeah.
I really want to describe like, yeah, how we created the problem for ourselves.
And I really want Andrew to explain how we diagnose it.
Because that's the most interesting part here by far.
Yeah.
As we kept seeing these tables grow, we kind of realized, also due to some access patterns that
were introduced, it wasn't really tenable to keep just appending data month after month for every
customer in terms of the line item table. This is also painful for the invoice table, which
there's a lot of write amplification between the two. The average invoice has dozens, sometimes
hundreds of line items, depending on how a client's using our platform.
And so invoices have to be queried a lot of different ways.
We didn't really see a way of moving that outside of a relational store and like
being able to use foreign keys and constraints with it in the rest of our data.
It was like providing too much value to really doing anything different there.
We did wind up partitioning it just to like enable the auto vacuum process to keep up much more easily. But we realized that we could constrain the access pattern
for the line item sufficiently that it didn't actually need to be in Postgres. Because I
guess within Metronome, over the course of a month, you have the notion of an invoice
that's in a draft state where we'll calculate on demand for you. We're not actually persisting
data associated with it. Then only after you go through a finalization, which is typically the first of the month,
because most billing periods are going to be monthly, that we do all the math, do the
calculations, persist that data. Which, you can then infer our workload sounds pretty funky.
So we've got like a steady trickle of writes for configuration changes. And then once a month,
we have to write out several billion
rows, which amounts to like one to two terabytes of data. All on the same day. As close to it as
we can get. Yeah. Very different from social media. Very different. Otherwise, it's a more
typical, we have 99 plus percent reads versus writes. It's a lot of us just compute on demand.
So it's a little funky. I think that because the high spikes in write volume are so periodic is
why we haven't seen multi-exact problems in the past. I think we'd likely narrowly been avoiding
them because it's so hard to observe whether or not you're approaching them.
You know, I feel like this is one of the, like, metastable states of Postgres,
if you're familiar with that term, from the definition of metastable, where
often things are working, working, working, and then they just fall off a
cliff or entirely lock up.
But the basic trigger here being, yeah, we identified we wanted to make
this architecture change, and then we were running the very long migration
process of pushing these line items into S3, which
is what our backing store is going to be, since their access pattern is so simple, but
also rewriting and moving the invoices from one table to another.
And unfortunately, we're doing that broadly one by one, which we'll get to why that's
relevant.
So you mentioned performance cliffs.
This is a very good term when we talk about MultiXacts.
And when we had different cases in the past where, as I mentioned, contention happened.
And interestingly enough, like 20, 15, 10 years ago, we often saw opinions
that foreign keys are dangerous.
Like they have huge overhead, don't use
them because they slow down everything and so on. But it's interesting
that now we can agree with it, but in a very different, unexpected
way, because yes, they slow down writes, but we can afford it, and your
table which has grown to 15 terabytes is a good example, right?
You survived with foreign keys and you paid that price to have good data
consistency, right?
But then, performance cliffs, right?
So, yeah, for that table in particular, like the reason why you're able to move
that was, you know, it only had one foreign key, which was like back to the
invoice row, it was much simpler.
That table is effectively append only.
For that workload,
it just really didn't have to be on Postgres.
I don't think we were taking advantage of the qualities
to justify the cost.
So let's discuss MultiXacts in general,
like why we should think about them.
Like, you know, my own story usually, with MultiXacts:
I take some monitoring, this is my hobby.
Take some Postgres monitoring, and we check
which metrics they have.
And usually they have transaction wrap-around risk
analysis, right?
To control freezing process and to understand how far we are from transaction ID wraparound.
My question number one: where is MultiXact ID wraparound analysis?
Most monitoring systems lack it, but they lack it because it never happens.
Transaction ID wraparound does happen, we know several blog posts and horror stories
from the past, but MultiXact ID
wraparound, nobody experienced it.
So okay, why should we think about MultiXacts?
Tell us more technical details about what happened in your case.
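For reference, the classic wraparound checks that monitoring systems do tend to cover can be expressed per database roughly like this; a minimal sketch:

SELECT datname,
       age(datfrozenxid)    AS xid_age,            -- transaction ID wraparound risk
       mxid_age(datminmxid) AS multixact_id_age    -- compare with autovacuum_multixact_freeze_max_age (default 400 million)
FROM pg_database
ORDER BY mxid_age(datminmxid) DESC;

Neither number says anything about member space consumption, which is the gap discussed later in the episode.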
Well, briefly, I'll respond to the point about foreign keys.
I think the community is right and that you don't always need them.
But I think what can be tricky is that, you know, when you're a very young company making
lots of changes to your product, trying to find that product market fit and get to the point where
scaling becomes the biggest concern. Foreign keys are really useful to like prevent teams from
stepping on each other and breaking your data. And only when those product areas and like the
platform you're building on top of become mature enough, can you maybe afford to put in the pull request level integrity checks that make sure you're
not changing some access pattern or maybe build the asynchronous offline data consistency
check to make sure you're not violating it.
This is a case where we didn't think critically about have we made that leap from one side
to the other.
In retrospect, we have, and it's one of the mitigations we took.
I would defend having a really strict schema early on, and then as you scale the platform,
it just becomes necessary to relax some of those guardrails.
Yeah, to maintain good quality of data, that's absolutely great advice.
But let's discuss how MultiXacts are created,
and so on.
Sure.
Yeah, so I suppose we should start
with what a multi-exact even is and when it's used.
So basically, in Postgres, when you
have two or more transactions trying to lock the same row
at the same time, for instance, when you're using select
for share or select for update, or in fact when you're inserting into a table that has a
foreign key to another table, Postgres uses a system called multi-exact or
multi-transactions and a multi-exact is basically a logical object that
represents a set of transactions locking the same row. Each participating
transaction is called a multi-exact member, and those
are a little distinct, but they are related. And the way that a multi-exact is stored is that it's
represented by a 32-bit integer ID, and that's where your wraparound issue can happen with
multi-exact IDs, but in this scenario, that was actually not the problem. For every multi-exact, Postgres keeps an offset into a globally shared append-only multi-exact
membership space.
These are stored in the PGMultiExact slash members file on disk.
You can think of it as logically like a simple last recently used cache that is contiguous.
So you have to append to it every single time. And so
when you have to store these members, you will have to look up an offset into
this theoretically long array, and each one of your transaction IDs is stored
in one slot in this array, and this array is indexed by a 32-bit integer. This
membership array is what it was exhausted
in our case. So due to the way that
multi-exacts are created and
subsequently reaped, it is easy to exhaust
this membership space when you have a lot
of parallel transactions attempting to
lock the same row, whether it's
through foreign keys or whether you're doing select for share or select for update.
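As a minimal sketch of how a MultiXact appears (using a hypothetical accounts table; pg_get_multixact_members is the undocumented helper function mentioned later in the episode):

-- session 1
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR SHARE;   -- xmax holds a plain transaction ID

-- session 2, concurrently
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR SHARE;   -- two lockers now, so Postgres replaces
                                                 -- xmax with a freshly created MultiXact ID

-- session 3: inspect it
SELECT xmax FROM accounts WHERE id = 1;            -- shows the MultiXact ID
SELECT * FROM pg_get_multixact_members('12345');   -- pass the xmax value seen above;
                                                   -- lists member xids and lock modes

The same KEY SHARE locking happens implicitly on the referenced row when you insert into a table with a foreign key.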
And the reason for that is that since these multi-exact are immutable, suppose that you
have three transactions trying to lock the same row.
You would first create the first MultiXact between the first two transactions, and then
the third transaction would come in and you have
to create yet another multi-exact with the old transaction IDs and then the new
one. But the membership space is not relinquished until a vacuum occurs,
which can happen later or perhaps never in some cases if you are constantly
adding new transactions and you're super busy doing all these updates.
And what's interesting is this.
The multi-transaction membership space cannot be reclaimed in chunks.
It has to be reclaimed contiguously.
So suppose you have a very long-running transaction that is part of a multi-exact that is very
old.
It can hold up the entire vacuuming of subsequent MultiXacts that are finished
but short-lived. And it is not until the oldest multi-transaction is finished that
the vacuum can actually reclaim all the membership space from the oldest MultiXact to the most
recent one. And this scenario creates the ability to exhaust the MultiXact membership
space. And that's what we experienced ourselves. Yeah, I wanted to add one thing. It might sound strange that SELECT FOR UPDATE
also may produce... you also explored this, right? It happens
only when we have subtransactions, I think, right? So if it's a
single transaction, SELECT FOR UPDATE, there is no place for a MultiXact.
But if we start doing save points there,
basically we have different transactions
inside one transaction.
And then multi-exact mechanism starts working again.
And I just remember how many years ago, four years ago,
I explored this when I explored another big performance
cliff related to sub-transactions. So there I noticed this as well.
That is true.
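A minimal sketch of that savepoint case, again with a hypothetical accounts table:

BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;       -- xmax = this transaction's xid
SAVEPOINT s1;                                         -- starts a subtransaction with its own xid
UPDATE accounts SET balance = balance WHERE id = 1;   -- the locking xid and the updating subxact xid
                                                      -- now differ, so xmax becomes a MultiXact ID
ROLLBACK;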
Good introduction. Thank you so much.
So, yeah, it's not it, right?
There's also some problems like quadratic behavior, right?
Yes, I hinted at that before.
So essentially, if you have, let's say, five transactions that are attempting
to lock the same row, you have to create the multi-exact one at a time by essentially cloning
it with the previous transaction IDs and adding that additional transaction ID one at a time.
So now you have a MultiXact with two, then three, then four, then five members. And you add that
all up and that becomes a quadratically growing usage
of MultiXact membership space,
at least until the vacuum can actually clean up
those old multi-exacts.
And so that's where you will find quadratic consumption
of the multi-exact space, and this can happen very quickly
if you have a lot of concurrent transactions
attempting to, say, insert into a table with a foreign key that links to another table that is a low cardinality table.
So they're all trying to reference that exact same row.
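To make the arithmetic concrete, here is a small sketch of that growth, written in terms of t overlapping transactions locking one row with no reclamation in between (this is the same formula as in the show notes, with t = n + 1):

-- member slots consumed: 2 + 3 + ... + t = t(t+1)/2 - 1, i.e. quadratic in t
SELECT t,
       (t::bigint * (t + 1)) / 2 - 1 AS member_slots
FROM (VALUES (3), (10), (1000), (65536)) AS v(t);

-- at t around 65,000 the sum already passes 2^31, in line with the roughly
-- 63,000 overlapping transactions discussed later in the episode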
But it's also applicable when we also have an additional long-running transaction, right?
Because otherwise cleanup will happen, right?
Correct.
So a long running transaction will cause the vacuum to essentially stop reclaiming multi-exact
membership space until it is finished because it can only do it sequentially.
Should this long-running transaction also participate in the MultiXact
mechanism, or can it be any transaction, even a simple select?
A simple select probably won't do it.
It would have to be in a multi-transaction context. So if you, let's say, have a multi-exact
and you have two transactions in that and the first one finishes, that multi-exact is not
cleaned up until that second transaction, part of that multi-exact is also finished. So all
transactions in the multi-exact have to finish before the vacuum can reclaim it.
Yeah, so this highlights pieces of advice like: partition your large tables, avoid long-running
transactions in Postgres OLTP, right?
Like, just yet another reason to avoid long-running transactions.
And, yeah, long-running transactions on replicas reported via hot standby feedback, they
are harmless here.
Right?
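A minimal sketch of the kind of long-running-transaction check being debated here, on the primary, with an illustrative one-minute threshold:

SELECT pid,
       now() - xact_start AS xact_duration,
       state,
       left(query, 60)    AS current_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '1 minute'
ORDER BY xact_start;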
So like again, yeah, yeah. I'm trying, we
are debating right now how to properly monitor long-running transactions because there are
different effects, negative effects, and we need to distinguish them. In this case, yeah,
we should monitor probably long-running transactions which participate. And by long, how long was it in your case? Like one minute is long, or 30 minutes
is long in this case? Actually, I'm not sure, I don't actually have the metrics on that. Nate, do you
know? I think, you know, 99.9, probably 99.99, percent of our queries are, you know, sub-second. And I guess, like,
I previously worked on a database for a company, or for a team, at Stripe and, like, learned some of these lessons
really deeply, or have a little scar tissue, in short. Running a platform like that is difficult.
And so I've been pushing us in that direction and kind of moving anything that we expect
to be longer off of the online workload into our Lakehouse. And so it's very, you know,
occasionally we have queries that go five, 10 seconds,
but I believe we have a like, you know,
global timeout of about 30 seconds.
Just, like, part of my misunderstanding here
is I thought part of these MultiXacts
is that the multi-transaction can be daisy-chained
as you have many concurrent overlapping transactions occur.
I think that's
more what you're running into than, like, a long-running one: we just had this steady
stream of overlapping transactions. And until you break that chain, you're in this, like,
hosed position where you keep, you know, very quickly, at an increasing rate, consuming space.
Yes, I was only merely stating the cleanup
scenario and the challenges there, but you're absolutely correct. When you have this daisy
chaining, you know, constantly creating new MultiXacts by adding new transaction IDs to
it, you will very quickly exhaust your membership space.
You say very quickly and very easily, but I mean, we are still talking about two billion,
right? Like, I think that's a,
there's been a debate in the Postgres world
about 32 bit versus 64 bit for a long time,
but we are still talking about two billion.
So you say very quickly and very easily at your scale, right?
Like this is also due to how many things you're doing
and maybe the pattern as well,
like invoices that have lots of line items, or, you know, it feels like there are
specifics here that make this more likely. I'm not saying it shouldn't be a problem. I'm looking forward to hearing how you diagnosed this and what we can do better. But I feel like
we should say it's not necessarily very easy to hit this limit.
I'm not actually sure; I think I found one other Stack Overflow post of somebody else who hit it.
I don't know if you came across that while you were in the middle of this,
but I couldn't find any other examples of people hitting this.
But it's super interesting and I think we're seeing more and more companies using Postgres
at this scale. Like Nikolay mentioned a few of them that we talked to, but we also had an episode with folks from Notion and Figma and Adyen as well on that same episode, so there are a lot of
high volume companies now on Postgres so I think this stuff's super interesting but two billion
still seems like even with the daisy chaining it still seems like a lot to me. Let me disagree with you. There is no... Like, debate is not like 64 or 32.
Postgres desperately needs 64 transactions.
And, for example, OroidDB already implemented it.
But OroidDB is like alternative storage engine.
And it's right now in hands of SuperBase in terms of development.
And I think right now there is launch week and they publish some benchmarks again.
Like it's great, great project.
So I think many projects dream about the day when we will have 64, not 32.
Right now only several like forks implemented only.
Right.
There are commercial forks which implement 64 transaction IDs and not exact IDs.
Right.
So I didn't know that.
Yeah.
Sadly not Aurora, it sounds like.
Yeah.
Yeah.
So 2 billion is not like, it's not a lot already.
It's not a lot.
Yeah.
It's pretty easy to hit it if you're going at scale and you're hitting it with quadratic
velocity.
Oh, yes.
You get to 2 billion quickly.
It does sound like a big number, but it's not that big.
Michael, sorry for the interruption.
Michael, you can take a chessboard and start putting one grain on the first cell and so on.
Oh yeah, so wait, maybe I misunderstood.
If there are five row changes we're trying to do, do we create two to the five rows in
this table, or is it like one, then two, then three?
Is that only five rows?
Oh, sorry, in the member space?
Yeah, good question.
So basically, it's more of a sequencing thing.
So essentially, you say you have two transactions.
You start with two, you get one multi-exact, taking up two members.
The third transaction comes in,
you create a new multi-exact with three members,
but that old one with the two members
has not been cleaned up yet, so you're now taking up five.
The third one comes in, I'm sorry, excuse me,
the fourth transaction comes in,
you create a multi-exact with four and three and two,
so now you've added an additional four on top of that.
And now the fifth one comes in,
and now you have a multi-exact with five, four, three, two.
And all that added together takes up membership space
until the vacuum can come in and reap those old MultiXacts.
Yeah, the chessboard example is good here, I guess.
We can try to understand the volume when we approach cell
number 64, how big it will be.
Andrew, you've done such a good job describing how this could happen, but I'm guessing this was relatively new knowledge. When you were in the middle of this,
how did you work out what on earth was going on?
Yes that's quite the journey. So I'm actually really new to Metronome.
I just joined in April.
So when this incident was occurring,
it was only my fifth week or so.
And this was occurring on a Saturday
right before a vacation to Miami.
So, wow.
Well, let's rewind a little bit
because I guess the most embarrassing part of this
is like the first, what we now know
was the first occurrence was a full week prior.
Where our migration workload had basically implicitly triggered this.
We hadn't really known what to do.
And unfortunately, we weren't able to get to a full RCA.
And so, you know, we got to our best explanation, which was, like, oh, there's something in transaction
ID wraparound that we're not grokking correctly, or Amazon's hiding something from us. And then we proceeded to turn it back on with a slightly lower concurrency,
you know, five days later. So like, yeah, so we didn't get there immediately. But then,
you know, that was a P one incident, you know, internally, tried to be cautious, but didn't
follow through sufficiently. And so this is like kind of in the second chapter where Andrew
comes in.
Yes, that's right. I was not a participant in the first three incidents. It was only the fourth
one, which I think was on a Saturday in which the call went out from Cosmo to see if additional
engineering resources could come and jump on. So I decided to, but having been so new, I actually
didn't have any like tactical knowledge I could use to address all the specific
fires that were being started by this. So I decided to actually figure out what exactly
was happening and as Nate had said we weren't really sure but we knew it was related to
multi-exact in some way. We just weren't really sure exactly why and we were struggling to
really fully understand because in our initial evaluation, we saw that the
number of multi-exact IDs that was being created was well below the threshold of like 400 million.
And we weren't really sure why.
So what I did is I took that knowledge and I just went ahead and looked into the Postgres
source code and looked at the multi-exact dot C file, which is this enormous file that
contains pretty much all the logic that you'll need to know about MultiXacts.
And in there, we were working with the hypothesis that somehow this threshold of 400 million,
but we were only seeing 200 million before vacuums started occurring, was related in some way.
So I found the function called MultiXactMemberFreezeThreshold,
which calculates the threshold at which you should start,
basically, an autovacuum, based on the number of MultiXacts that are in use and, notably,
the number of MultiXact members that are in use.
And that's what caught my eye.
So from there, I mathematically proved that the amount of members should be extremely
high
given the code path that we are obviously hitting
because we're seeing vacuums occur
at a specific number of multi-exact being used.
And from there I was able to find log messages
related to multi-exact membership exhaustion
and correlate that with Datadog logs
that we had found ourselves in the recent week.
And there we could conclude that
the MultiXact membership space was being exhausted, and we were able to
put together what I would consider kind of a hacky estimate of the number of MultiXact members that
are currently in use, by looking at the files that are written for the members and estimating, by
multiplying by a constant, how many members are currently in use at the moment. And that's how we came across the solution, or the cause, I should say.
I think this is worth diving into a little bit because I definitely didn't realize that there
were at least two exhaustion possibilities with MultiXacts. So we've got the total... well, yeah. So why are there
these two numbers? Like, why do we have this 2 billion, but you were monitoring at 400 million,
which should be super safe, right? Because we have the possibility to go up to 2 billion, but you
want it to be cleaning up much sooner than that. So you monitored at 400 million or so.
Why is there another limit? Yeah, so we have two limits here. We have an ID that we give to MultiXacts. That's
a 32-bit integer. So of course you cannot assign more than two billion to these multi-exact
because we don't assign negative IDs. And then you have the multi-exact membership space. Now that is represented as logically a simple LRU that is append only, and so we can only
address up to 32 bits of membership space.
Each membership slot is addressed by a 32-bit integer, and thus now you have two different
values that you have to worry about.
You have to worry about how many MultiXacts
you're creating and how much membership space
you're consuming with those MultiXacts.
And that's where those two numbers come from.
And one of the great things in your blog post
you mentioned is how do you even monitor that?
Have you worked out a way since?
Is there anything you can do there?
Yes, I think I might have just offhand mentioned it, but we use an estimate where we read the membership files,
because they are written to disk, and then we know that the files have to be a certain size, and we multiply by that constant,
whatever the size of these members are, and we get an estimate for the number of members.
Now, this is not the best solution, which is why I submitted a Postgres patch
to surface this telemetry directly.
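On self-hosted Postgres, a rough sketch of that file-count heuristic could look like the following; the per-segment constant of ~52,352 member slots is derived from the page layout in multixact.c (verify it against your version), pg_ls_dir needs superuser or an explicit grant, and the Aurora-specific recipe is linked in the show notes:

SELECT count(*)                                         AS member_segment_files,
       count(*) * 52352                                 AS approx_members_in_use,  -- ~52k slots per 256kB segment
       round(100.0 * count(*) * 52352 / 4294967296, 2)  AS pct_of_member_space     -- member space is 2^32 (~4B) slots
FROM pg_ls_dir('pg_multixact/members') AS f;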
Nice, thank you.
Yeah, that's what I wanted to mention.
We have several things going on in mailing lists after this blog post.
By the way, before this: why did you decide to post in such great detail? What was your
intention, to help the community? I guess a couple of factors. One, we do want to help the greater
Postgres community. I think there are lots of other situations where we've benefited greatly
by blog posts put out, in particular by companies like Percona, as well as others, you know, maybe led us away from paths that would have wound up being painful. But another part is
that, you know, internally at Metronome, we view ourselves as an infrastructure provider for our
clients. And while, you know, I think it's okay to say that we're not fully recognized as that yet,
and we're still on our journey to solidifying that narrative, it is how we want to act, to be perceived that way.
And we are in the critical path for a lot of our clients and how they operate.
And so for us to have multiple, you know, multi-hour downtimes in the space of, you
know, less than 10 days is, like, pretty bad and honestly not what you want from your infrastructure
provider, if you choose to view us that way.
And so having a really crisp explanation of, hey, we are a little embarrassed by this, but here's what we learned,
so you can learn from it, as well as a little bit of vetting of, like, we didn't run into the easiest
thing. It is novel. There is a reason why it took a while for us to sort it out. And so I think a
lot of our partners felt a lot better. Better could be a strong term, but when you have an honest RCA,
and, like, you know, Cloudflare might be leading the
industry in how they operate in this vein, you feel a
lot better that your provider is actually learning from mistakes
and is going to do a better job of avoiding them than you might feel
internally if you're evaluating, should I just build this in
house instead.
So, great transparency and so on, it pays off over time. Yeah, that's great. So I saw several
mailing list discussions, but before that, I feel like with MultiXacts we can have
different kinds of problems. One is wraparound, right? Okay. As I said, I never saw it. And that's why many monitoring systems don't even monitor it.
And we have several cases of SLRU limits reached, right?
And one of them was explored in the past.
We can maybe provide links to blog posts describing it.
And this is your case.
And your case obviously bumped into a lack of observability, like how much of the capacity
we've reached, how far from the performance cliff we are.
But also, besides that, I saw different topics were discussed in mailing lists.
For example, there is a setting, vacuum_multixact_failsafe_age, which is about wraparound, actually.
And Peter Geoghegan brought to attention that this setting doesn't take into account MultiXact
membership exhaustion.
So there are also interesting side effects from your blog post, maybe fully unexpected. Then I saw, Andrew, I think, your proposal
to have this MultiXact monitoring system view.
So any monitoring could have some graphs and also alerts probably, right?
And this proposal currently at what stage?
I saw it's in the CommitFest, but it requires review, right?
Correct, yes.
The patch is in the CommitFest, and it definitely
still requires review.
Well, I've engaged with another contributor named Naga;
we've actually submitted competing proposals.
His actually directly reads from the memory space,
while mine uses the PG stats infrastructure to surface
this telemetry.
But yes, it is still just in review.
It has not been accepted yet as far as I can tell.
Yeah, so, but it's great that it's registered in the CommitFest.
And I think some people who listen to us, and maybe some people who also watch
our hacking sessions, which we do with Andrey and Kirk sometimes, almost every week.
We missed today because of some issues.
But we are going to be back next week.
So we show how easy it is actually to test.
You take Cursor, you tell it, just compile Postgres,
apply the patch, and let's start using it.
And instead of diving too much into C code,
you think about use cases and how to, for example,
emulate problems, right?
And so on, this is actually fun.
And I encourage our audience to maybe start
with their first review and test this patch.
And this can increase the chances that it will be included
in Postgres 19, right?
Yes, what's also kind of interesting is,
I know you earlier mentioned that there was a debate
about moving to 64-bit integers, and in fact,
part of this commit fest does include a proposal
to change those offsets to 64-bits.
So there's a lot of changes that are coming in 19, it seems,
if they're accepted, of course,
which would help alleviate this
multi-exact membership and multi-exact issue writ large. That's great but it's only next year.
I suppose we can't go too fast. Well, and even longer if it's Aurora, right? They generally have
a bit of additional lag in getting major versions because it's a fork that they have to do a lot of work on.
On that note though, you mentioned kind of trust in an infrastructure provider and how issues,
big issues, can rock that trust and you kind of need to see some maybe root cause analysis or
maybe some action. Has this at any point rocked your confidence in Postgres or less so Aurora, I guess, in
terms of is it the right platform as you scale?
Maybe initially and then since you've worked out exactly what the issue was, what are your
thoughts there?
Yeah, I don't think it actually has really rocked our confidence in Postgres.
I haven't seen anyone run for the hills yet. Um, but, uh, you know, it, it definitely surfaces the fact that there is quite a
bit to learn and that this is a very complex system, but to that end, something
I've learned at least, uh, participating with the Postgres dev community and being
on the mailing list is that there's a lot of very smart individuals who are part of
this community and who are actively contributing and are passionately
debating what should and shouldn't be part of it and how things should be designed.
So in many ways it actually may have built my confidence to see so many people paying
attention and caring about it from all across the world from Fujitsu, Amazon, Google, ourselves,
even just regular folks doing it in their spare time as a hobby.
So in many ways perhaps it's a testament
to how far open source can go
and how much confidence we can have in these systems.
Because like you said, this is a pretty rare instance
and many of the core cases that most people need to use
are covered or at least will be.
That's really good to hear.
And of course, when people hit issues
with proprietary systems like SQL Server and
Oracle, I'm sure when they blog about it, those conversations do happen internally
at Microsoft or Oracle, but you can't see them.
You can't see the actual conversations.
And part of the way you diagnose this issue was diving into the source code.
So there are some really cool factors that wouldn't have been possible if this wasn't open source.
Yeah, and it's not just open source, but open discussions as well, open mailing lists.
So yeah, that's really good to hear, but it would have been very understandable. People have definitely left Postgres for less good reasons.
So, not naming names, Uber. Yeah, I will say, as we said in our blog post, 30 terabytes is not a small amount of data to move.
And so if we decided to make a large decision like that,
I believe we would have to reckon
with the transition itself.
And so in many ways, we probably want
to try to work with what we got.
And thankfully, I think it's served us well
for the most part. Yeah, there's something to be said for the devil you know. There are lots of other systems,
Postgres compatible, that promise you the world, and something gives, typically, whether it's the
price, the actual performance, the reads or the writes, or how maintainable it actually is to run.
I think it's clear that we're going to have to undergo the investment to figure out what's next because like we can only go single writer for so long,
especially with the demands that we put on the database on the first of the month
every, every single month.
They're pretty extreme.
Yeah.
And behind the scenes, we also chatted with Nate about the formula.
So Michael, for the first time, we should include some formula in the show notes.
There's a formula that's saying that roughly 63,000 transactions overlapping is enough
to reach a 2 billion limit.
Yeah, and so, you know, the dangerous thing of doing some math on the fly is that it can be checked by anyone after
the fact, but yeah, as I worked it out, only 63,000 overlapping transactions, which is
not that many.
Relatively easy to do, you know, with a highly concurrent workload
operating one row at a time with a potential overlap between them.
You can consume that really quickly.
It might not also be full transactions.
If you take a savepoint,
that's almost seen as, like, a subtransaction
and will also create a new MultiXact as well.
So if you're doing a lot of savepoints,
you can easily hit that as well.
Are you using sub transactions?
I'm not actually sure myself, but it is a possibility.
Yeah, I don't believe we are, or at least not intentionally.
I think right now in the latest versions of Postgres,
there is ability to change SLRU sizes to avoid SLRU overflow
cases.
But my opinion about subtransactions in projects
like this, at this scale, remains: just avoid them.
Yeah, because they can bring other problems as well, like subtransaction SLRU overflow
and so on.
And they consume transaction IDs.
So yeah.
Good.
That's, I think, quite a good deep dive, unless we forgot anything.
Any last things you wanted to add?
I think we could talk about the mitigations just so that if someone's unfortunate enough
to run into this, they have a few breadcrumbs for how to work around it.
Because it was rather expensive for us to figure out what those were, including Andrew
and others digging through the source to really understand it.
But a lot of what we inferred after the fact was, so we had this migration, which was doing
a bunch of row by row writes, not ideal if you're doing a highly concurrent process here.
And so we did make those batches.
We also did a lot of deep thinking on which of the foreign keys we actually needed on
this new table. Batches, how did you find the ideal batch size?
I don't know that we're at an ideal, but I think we climbed the hill far enough
that it's not a problem. I see.
And so it was kind of a trade-off between what throughput we could obtain and making sure that
we weren't having the type of like over time growth in our like heuristic for the member
space utilization, which is mainly based on how many of the offset files existed.
Yeah, what order of magnitude?
Oh, sorry, go on.
Yeah, I usually say, when choosing batches, choose the biggest possible, but avoid locking rows
for so long that it affects users, because for users, we know, perception is 200 milliseconds.
So one second should be the maximum, two seconds should be the absolute maximum, for
batch size.
Yeah, it's less than that.
So we're at batches of around, like, 200, which was enough to mitigate this concern
and kind of keep all of the operations, you know, short, since there's a very online
workload hitting
this as well.
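For illustration only, with invented table and column names, the shape of that change is roughly one transaction per batch of ~200 rows instead of one per row, so far fewer concurrent transactions hold KEY SHARE locks on the same low-cardinality parent rows:

BEGIN;
INSERT INTO invoice_line_items_new (invoice_id, credit_type_id, amount)
SELECT invoice_id, credit_type_id, amount
FROM invoice_line_items_old
WHERE id >= 1000000 AND id < 1000200;   -- one batch of ~200 rows per transaction
COMMIT;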
This is not users in terms of batch transactions, but users read results, right?
Or no?
Nobody's reading results.
No, no.
So sorry, this is in the context of our greater migration to move the line item construct
out of our database, as well as partitioning the
invoice table, which has also grown to be quite large.
And so, you know, kind of we would read the old version, all the line items in the invoice
row, write those to a new format in S3, which, you know, ultimately keyed by the invoice
ID, and then allow logic to kind of like, we can't just do a deterministic mapping,
because then you get hot partitions in S3. So you have to do a little bit of extra work and
leave yourself a link in the actual database row. And so then we are writing those out to the new
table, which is partitioned one by one. And our diagnosis, and it's a little hard to confirm this,
because again, there is no like observability bit here, that was the big mistake as well as
the single inserts, like very quick, it was the issue that they all reference a handful of other
tables for foreign keys, some of which were very low cardinality. So you know we have like some
enum tables and we, you know, for better or for worse, in this case worse, had chosen to use a
foreign key reference to them. And so if every invoice is, you know, corresponding to the USD credit
type, for example, you're gonna have a lot of overlapping transactions all accessing
that row to verify the integrity of the data. And so we kind of walked through it and we
realized, you know, a lot of these low cardinality cases, we don't really change those code paths.
We're quite confident they're going to be stable.
It's only the higher cardinality cases.
And then the invoice also references its customer, the one that owns that invoice.
But that's a much higher cardinality connection.
We're much less likely to have overlapping transactions referring to the same customer
in a migration like this.
Without foreign keys, did you think about periodic checks of referential integrity,
like, asynchronously?
Yeah, so we're gonna be adding those
to run in our Lakehouse.
So yeah, like we have a syncing process
that brings all the files over into Parquet.
Eventually we're gonna finish the work
on getting to CDC, so it's more incremental,
and hopefully using Iceberg.
And so on those we'll run, yeah, you know, every... I mean, it's much cheaper to run the checks there
for a lot of reasons.
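A minimal sketch of such an asynchronous referential-integrity check, with hypothetical table names (in their setup it would run against the Lakehouse copy rather than against Postgres):

-- line items whose invoice no longer exists; should return zero rows
SELECT li.id
FROM invoice_line_items li
LEFT JOIN invoices i ON i.id = li.invoice_id
WHERE i.id IS NULL
LIMIT 100;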
More and more reasons to move data out of Postgres.
I understand it for analytical purposes,
and keeping it in S3.
Yesterday, S3 released a vector index.
If you haven't heard about it, it's interesting.
So I'm just curious where the place for Postgres
will be in our systems.
But I totally understand for analytical workloads,
definitely.
Yeah.
I mean, everything that's OLTP-shaped
is going to be staying in Postgres for some time.
Can't see a reason to move anywhere else.
Well, yeah.
Great.
In terms of mitigations, did you tune any of the settings
around multi-exact parameters?
Yeah.
What did you do?
Which ones did you increase or decrease?
I don't know if there were actually any specific Postgres
flags that we changed, actually.
I don't think we changed any constants specifically.
But what we did do is we did
introspect the code base and determine the danger levels for membership space, and so we use that as a
threshold for alerting ourselves with Grafana. And so if our membership space consumption
breaches a certain level, what we will do is we will take actions to reduce the concurrency, the amount of concurrent transactions that
are occurring until Postgres can recover that membership space.
So it's a little bit of a manual intervention at the moment, but I don't think there's
any specific knobs that we can tune because it is fundamentally limited by a 32-bit integer. And the telemetry with respect to multi-exact membership numbers
is not used anywhere else except in the detection of when auto-vacuum should occur,
and that is sort of automatic and kind of hard-coded.
And for the current utilization, you use the function pg_get_multixact_members, right?
Or how do you?
No, I don't believe that is it.
I think it's some specialized function where we actually have
to get all the files and read the file sizes themselves.
So it's not something from Postgres I don't believe.
But it's Aurora, you don't have access to files.
There is some function, I can't remember if...
It's really a list or something, right?
So it lists files, it's still SQL, you can list files.
Okay, interesting.
The query originated from an Amazon blog post, but I'll try to pull it up, and you can see if
you want to include it in the show notes.
Let's make sure.
I don't remember if this recipe was shared in your blog post,
how to monitor this.
It was not.
We did not post it there.
We actually intend to have a follow up
with a more engineering-based audience, where
we would detail these steps kind of in the same way
that we detailed it here, but in a more concrete format
for engineers and not for executives and company leaders.
Yeah, and we were a bit more timeline-constrained for the initial RCA. And we're
optimistic that we can link to Andrew's patch for the second, follow-up blog post.
Exactly. That's awesome, because probably we should
add it to monitoring practically everywhere in large systems, right? This should be the main outcome of this discussion, probably, for those who listen and have big systems.
Absolutely. Before your patch, Andrew, gets accepted and everyone upgrades to Postgres 19, which will take some time.
Right. So yes, until then, we definitely have some guidance for monitoring systems.
We admit though that it is not the most ideal, but it is something that people can use to
guide themselves away from disaster.
And I think, well, there's the thread you mentioned from Peter Geoghegan, he's somebody
that has worked in the past on mitigation factors. So not only to be able to monitor transaction ID wraparound,
but he did some things to vacuum so that it's just less likely to get into that state in the first place.
And I think making that parameter he mentioned aware of the member space, or having some different parameter that kicks in and does, like, an anti-wraparound vacuum,
or whatever the equivalent would need to be, earlier, would make this less likely to happen.
And of course, monitoring is a good idea anyway, but prevention seems even better.
So, yeah.
Optimization inside Postgres could be implemented here, because of the quadratic behavior;
at least it could, like in other cases in Postgres, be split into some
pieces, like 128 pieces, something like this. I don't know. I haven't seen discussion of
an actual optimization to get more capacity. Have you seen it? I'm not sure.
No, but I was looking through what parameters we even have
that mention MultiXact,
and one of them is, in version 17,
we got multixact_member_buffers,
which is quite low by default,
only 256 kilobytes or 32 pages,
and we've got multixact_offset_buffers,
also low, 128 kilobytes by default.
So I'm wondering whether increasing those
would help with the member space. Maybe that's the offset buffers one, and then, yeah, we also have vacuum_multixact_failsafe_age,
which you mentioned, version 14, that's the one from Peter Geoghegan, and two much older ones.
I did wonder about vacuum_multixact_freeze_min_age, that's 5 million by default; I wondered if lowering that
would just make sure that the vacuum ran quicker when it did need to happen, you know, that kind of thing. And
then there's vacuum_multixact_freeze_table_age, which is 150 million. So I, again, I'm not sure
if lowering that would mean you just get vacuums more often, but I guess when we're talking
about such a huge table, it's still going to be a problem when the vacuum does run. So until
you get the partitioning in place. Anyway, thank you so much, both of you.
It's been amazing learning about this.
Scary, but great.
And thank you for paying the pain and the cost
so that many other people won't have to in future.
And special thanks for the transparency
and admitting your own mistakes in the past.
I especially appreciate this,
how you talk freely about your own mistakes. This is very valuable and a good example for others
as well. Absolutely. Thanks for having us. It was a pleasure to be here. Thank you. Yeah,
it was a really fun chat. Thank you, and have a great week.