Postgres FM - synchronous_commit
Episode Date: May 2, 2025

Nikolay and Michael discuss synchronous_commit — what it means on single node setups, for synchronous replication setups, and the pros and cons of the different options for each.

Here are some links to things they mentioned:

synchronous_commit https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-SYNCHRONOUS-COMMIT
synchronous_commit history on pgPedia https://pgpedia.info/s/synchronous_commit.html
Patroni's maximum_lag_on_failover setting https://patroni.readthedocs.io/en/master/replication_modes.html#asynchronous-mode-durability
wal_writer_delay https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-WAL-WRITER-DELAY
Selective asynchronous commits in PostgreSQL - balancing durability and performance (blog post by Shayon Mukherjee) https://www.shayon.dev/post/2025/75/selective-asynchronous-commits-in-postgresql-balancing-durability-and-performance/
Asynchronous Commit https://www.postgresql.org/docs/current/wal-async-commit.html
synchronous_standby_names https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-SYNCHRONOUS-STANDBY-NAMES
Jepsen article about Amazon RDS multi-AZ clusters (by Kyle Kingsbury, aka "Aphyr") https://jepsen.io/analyses/amazon-rds-for-postgresql-17.4

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM,
a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard,
and this is Nikolay, founder of Postgres.ai.
Hey Nikolai, how's it going?
Everything is all right, how are you?
Yeah, good, thank you.
It's been a nice sunny week here in the UK.
Oh, so you have some sun there, okay.
So this week we got a listener request.
In fact, we get quite a lot of listener requests.
Thank you everybody.
This one was from Shayon Mukherjee, apologies for
pronunciation, and they asked us about synchronous commit and whether or when
it might make sense to ever turn it off or use different settings. And I thought
it was an interesting topic full stop, so I wanted to get your thoughts as well on this.
I have many thoughts on this. We need to start somewhere.
So I wondered about starting probably not in the most obvious place but on a single node setup. I know this becomes more interesting multi-node, but on single node,
pure Postgres, there are still some interesting discussion points here, right?
Yeah, and even before that, there is the single node... So the meaning of synchronous commit is overloaded.
I know technically it defines the behaviour of the commit.
There is local behaviour, and if we have synchronous standbys, there is remote behaviour.
But originally it was only about local.
And then the decision was...
It was only on and off, two options.
This is how I remember it for a decade or more, more than a decade of being Postgres
user.
And then there is the overloaded meaning, remote apply, remote...
Why?
Yeah, remote write, remote apply. Two options. So I think my first thought is
it's a huge mistake to mix things here. It causes pain for users like me. All the
time I read it, I read it, I read it, and I cannot remember this. To remember this, you need to deal only with this.
Yeah, or look it up, well, you can't remember it,
you just look it up. I just look it up.
Every time you look it up and try to understand the logic.
It's like, we have many places in GUC system,
Postgres system like this, like for example,
you know, we still do a lot of consulting. We help our
customers, many of them are startup companies who grow rapidly. Very smart people. They
understand things very quickly, but you need to explain what's happening. One of the things
like we deal with bloat all the time. Bloat is one of the major concerns, and we explain, okay, we need autovacuum tuning, and there is
autovacuum_work_mem, which is -1, meaning you need to look at maintenance_work_mem.
Why? Why was this done? This is exactly the same thing. Like, why was
this done to my brain? Somebody made some decision that causes huge pain for ages. It's
a UX issue, a user experience issue, because instead of clarifying and separating things,
we develop some implicit dependencies and you need to be expert. And even if you are
expert, it's really hard to memorize all these things.
And the problem is, at some point we will use more and more AI and so on, but to explain
and teach these things to AI is also difficult, because there are implicit dependencies. So
synchronous commit can be local, but it also can be not local. Local is on and off, and there is also the word local, right? So the meaning of on was rewritten.
Thank you guys, thank you. Thank you who did this. Thank you. If you try to be a consultant and
explain things to others, very smart guys, you will see the pain. It's really hard. And why is this hard?
This is my first thought. Why is this so hard?
As you say, it's because it combines two things, right? And the behavior is... I can see why
they combined it though. Like the behavior...
I also can see, no questions here. And to their credit, the documentation is really nicely written.
Okay, it's nicely written, but it doesn't explain all the
things, so I hope we will touch additional topics not covered by
documentation today. Nice. Yeah, so let's talk about the single node case: on and off, on by default, right?
So why would we want to turn it off?
What's happening here?
Yeah, if we forget about multi-node setups, we have just one node, right?
And if we have write-heavy workloads and we think write latency is not good, and eventually write throughput is not good,
because if latency is not good, throughput won't be good as well. There is
some dependency there which is not straightforward, but there is. And it means
that, well, we know at commit time Postgres needs to make sure information is
written to WAL and it's present,
synced, it's on disk. Before that, Postgres cannot respond to the user
that the commit was successful. And when we say commit, it can be explicit commit or it can be
implicit commit if we have a single statement transaction,
which is a super popular approach in web and mobile apps.
So when we just have single statement transactions all the time, separate inserts, separate
updates and so on.
And all of them have an implicit commit as well.
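As a rough illustration (table and column names here are made up), both of these end with a commit that waits for the WAL flush:

-- explicit commit: a multi-statement transaction
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
UPDATE accounts SET balance = balance + 10 WHERE id = 2;
COMMIT;

-- implicit commit: a single-statement transaction, typical for web and mobile apps
INSERT INTO events (user_id, payload) VALUES (1, '{"type": "click"}');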
So commit can be successful only when WAL is already on disk.
And it means we need to wait for disk. And if disk is not a RAM disk, which is true in 99.99% of cases, right? If it's a regular disk, it can be quite slow. And in this case, we
deal with... if our transactions are super fast, like extremely fast, which is actually
rare, usually they are not so fast, because of various things including, for example, index write amplification.
I'm in the mood to criticize Postgres today, you know.
Well, yeah, so I feel like so far you've described the kind of synchronous commit equals on
case. It's default.
Yes, which is default. And we're waiting for the flush of WAL to disk
before confirming back to the client that the commit has been successful.
But turning it off, we don't wait for the extra step of flushing to disk before confirming
to the client, which sounds risky. Well, the risk is if there happens to be a crash on the node
in the time between the write-ahead log being written and the flush, so between the commit being
confirmed and the flush to disk, we lose any of that data. So the risk here is some data loss on crash.
Yeah, the short tail of WAL is lost. Postgres will still reach consistency, all
good, but we lose some part of data. It's similar to any multi-node setup with an asynchronous node when a failover happens: if
standbys are asynchronous, nobody guarantees that at failover everything will be present there,
right? Because it's asynchronous. There can be lags, but there's consistency, all good,
and we can lose, for example, by default in Patroni settings. As I remember, there's a setting, I always forget the name,
but a setting like maximum lag allowed at failover.
As I remember, by default, it's one megabyte.
So up to one megabyte is okay to lose.
Here it's similar.
There is no threshold, like unlimited,
but technically it cannot be huge.
I looked into it and there is a default setting, wal_writer_delay,
of 200 milliseconds by default. And so I thought maybe only 200 milliseconds worth of
writes could be lost, but in the documentation, it's actually... what, yes, good point, actually,
for implementation detail reasons, it says it's three times that, so by
default it's 600 milliseconds that could be lost. But you could reduce that. This setting only
makes sense to change if you're setting synchronous commit to off for
any of your transactions, but you can reduce it if you want. So if you do decide to go
for some use of synchronous commit equals off, then this is a setting you could reduce to minimize
data loss. But at that point you're still losing some data, you might still be losing some data, so
I'm not sure if it matters that much.
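For reference, a minimal sketch of the knob being discussed (the value here is just an example); it only matters if some transactions run with synchronous_commit = off:

ALTER SYSTEM SET wal_writer_delay = '100ms';  -- default is 200ms; up to roughly 3x this can be lost on a crash
SELECT pg_reload_conf();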
Right, right. So the maximum is three times that. Yeah.
wal_writer_delay, right. And the question is, should we consider this? My answer
is, if writes are super fast, then maybe yes.
There is a trade-off here, right? Obviously. Yeah. I was thinking about it in our last episode about time series
data, but you made a really good point about if you, if you are writing kind of
lots and lots of little writes, often in time series workloads people suggest
actually batching inserts. So by that point,
yeah, so batching makes a load of sense for optimization.
It's a good point.
So choosing between synchronous commit off and batching,
you just choose batching.
Yeah, but then at that point,
your transactions are not that fast
because you're batching maybe a thousand
or more of them together.
And then the benefits of synchronous commit equals off are much smaller. Yeah, exactly.
So I don't think you can get the benefit twice, if that makes sense. You
don't get any benefit of doing both. Single row inserts versus one insert of a
thousand rows. A thousand single-row inserts versus a single insert of a thousand rows. Of course,
you'll have just one commit, it's great and it will be faster. Or updates, for example. But of course,
you should avoid huge batches, because they also have downsides, like longer lasting locks,
or, in case such a batch is rolled back, you lose more, you need to retry more, right? Instead of
losing just one insert you lose all thousand inserts. So as usual with batching, my
rule of thumb is just to choose the batch size so it's roughly one or two seconds maximum, to avoid long lasting
locks and the risk of repeating too much. In this case it's great, in this case it's just one commit per
second or maybe 10 commits per second, which is a very low number. So in this case the overhead of
the need to wait for WAL to be synced to disk is super low, roughly you can think about it as not more than one millisecond per commit.
In this case just forget about it.
Synchronous commit off is not a good idea.
That's it.
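A rough sketch of the batching idea (table and values are made up): one multi-row insert means one commit and one WAL flush instead of a thousand:

INSERT INTO measurements (sensor_id, ts, value)
VALUES
  (1, now(), 20.1),
  (2, now(), 19.8),
  (3, now(), 21.3);
-- extend the VALUES list so the whole statement takes roughly one or two seconds at most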
But usually things are much more complicated
because we just consider the one type of workload.
Any project has many types of workload, right?
Yeah, and that's a good point
that Shayon actually made in a follow-up post
that I think sometimes gets missed in this topic,
which is that it can be set on a per transaction basis or
per session basis or per user basis. So if you're doing a one-off, if you're doing
like a one-off migration or bulk load of data and you can't batch for some reason,
yeah, then maybe it's worth considering turning it off just for that user or for that session
or per transaction.
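A minimal sketch of those scopes (the table and role names are made up):

-- per transaction
BEGIN;
SET LOCAL synchronous_commit TO off;
INSERT INTO audit_log (payload) VALUES ('bulk loaded row');
COMMIT;

-- per session
SET synchronous_commit TO off;

-- per user
ALTER ROLE bulk_loader SET synchronous_commit TO off;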
What kind of situation could it be where you cannot batch?
I was struggling.
I was struggling to think of it.
Maybe some streaming like streaming case.
Yeah, it felt tenuous.
If technically it comes from many, many sources.
Exactly. If technically it comes from many, many sources, instead of saying, okay, we need Kafka
before it comes to Postgres and then we will batch, that's a heavy solution. Maybe indeed
this is exactly the case when it's worth considering off, right? Yeah. To use off. The other thing I was thinking about
is whether this comes up as a thought for people
because of tools like pgbench, where by default
you get a lot of small transactions,
a lot of small, very fast transactions.
So because, like I wondered if it's
becomes interesting to people because they just think, let's quickly do a test with pgbench and,
oh, wow, synchronous commit, what's that? Have you tried
pgbench with synchronous commit turned off? I tried many times because it's an obvious idea, and yeah, I remember I saw a huge advantage,
because the pgbench write transactions by default are not so simple. I remember
it's an update plus two inserts, maybe a delete as well. It's a multi-statement
transaction which is definitely not super lightweight. So, fair enough. I didn't try it. Yeah. Well, I can imagine we can have
situations when the effect is quite noticeable, like dozens of percent. But
in general, from my practice, I stopped looking in this direction. Yeah, fair enough. So I don't think about it often. And of course, data
loss is not a good idea. So I don't know,
like write heavy workloads, maybe yes, but
again, batching and all proper tricks to
reduce the number of actual commits
happening and that's it.
Yeah. And I think people, when people talk
about data loss, they quite often immediately
jump to things like banking
apps, things where it's absolutely critical.
You'd much rather have some downtime than have some data loss.
But I also think user experience-wise, even for just regular CRUD apps, losing data can
be really confusing to people, even if it's only a not-that-important thing. The cases I was
thinking of are much more like the time series ones, where if we've got one sensor reporting every
minute, if we lose a couple of those sensor reports, it's not that
important. We can still do a bunch of aggregates on that data. So I guess like for monitoring and
logging type things, maybe it's less important.
But there can be really important logs too, right? Like, so I personally, I'm not, I don't
see too many use cases where this makes a lot of sense, but there is a whole page on
the Postgres docs about, it's called asynchronous commit. So it must have been like somewhat
of interest to quite a lot of people.
Right, yeah.
Should we switch to the topic?
Like it feels like this gets more interesting
when we start talking about synchronous.
Oh yes.
Let's do it.
Let's move on to the remote modes.
First of all, on is overloaded
when you have synchronous_standby_names non-empty.
Synchronous standby names non-empty, right? We can have purely synchronous
replication or we can have so-called quorum commit and I just had a chat with one of our team members
and I agree. I heard criticism as well, Quorum commit is a very misleading term here, because
usually it's used like this. We have multiple standbys. They are synchronous originally,
by nature. But then we say synchronous commit, for example, remote... And then we configure
Postgres to allow synchronous standbys with ANY. We say any.
And for example we have, say, five replicas, five standbys.
And we say any one of them.
And it means that commit happens on the primary and on any one additionally.
So on two nodes before the user receives success of commit.
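As a sketch, with made-up standby names, that configuration looks roughly like this:

-- commit waits for the primary plus ANY one of the five listed standbys
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (s1, s2, s3, s4, s5)';
SELECT pg_reload_conf();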
It means two out of six.
Five standbys plus the primary, six nodes.
So two out of six is not a quorum, because the definition of quorum means more than
50%.
So that would be four out of six, right?
Usually, usually.
I think we can say this. In this case, the criticism is conditional.
We may say, in our case, the definition of the quorum is rewritten and we need 2 out of 6.
But it sounds a little bit strange. So maybe, how should we call this situation when we have not just one synchronous replica where
every commit must happen on both, but we have multiple replicas and we say one more or two
more?
So how do we call it?
Semisynchronous?
I've heard this term as well.
Semisynchronous.
Yeah, yeah.
I actually, yeah, I kind of like it. It does imply...
But the official doc says it's confusing.
Semisynchronous is confusing enough that it means I have to look it up, which is probably helpful. I don't assume I know what it means.
Right. Right. But yeah, it doesn't fit into either category. Like it's not synchronous,
and it's not asynchronous. Yeah, I don't like it though.
I don't feel comfortable with it,
because what's it saving us?
Like what's the benefit of it on its own?
Like, is the idea that if both crash at the same time...
Yeah, before we talk about it,
let's finish with terminology.
Maybe... I'm looking up right now
the meaning of the word quorum,
and maybe it's not that bad, because
in some cases we can say, okay, 25% is our quorum, right?
We can define these rules.
So the expectation that it should be more than 50%, maybe it's a false expectation.
Okay.
Yeah.
Anyway, but you cannot say a percentage.
You say like any one or two, right?
Configuring Postgres, in synchronous_standby_names, you can say the ANY word and then say...
You can have some kind of...
It's an interesting syntax, you saw it, right?
So...
Yeah, I also saw you could do...
Is it like priority?
Yeah, yeah, yeah.
So yeah, there are some interesting things there.
But anyway, the idea is they work together.
Synchronous standby names and synchronous commit in this case.
And if synchronous_standby_names is not empty, there is this magic happening.
If it's just a hostname, it's purely synchronous
replication. But also for purely synchronous, there are flavors we will discuss, based on the
synchronous commit setting. If there is an expression there, like ANY or FIRST
blah blah blah, then it's already more complicated, but it's interesting, and
the official documentation mentions it, it uses the term quorum commit, right?
I saw it in a Crunchy Data blog post, I didn't actually see it in the
documentation. Okay, okay, doesn't matter, maybe doesn't matter. Let's talk about flavors. Let's consider, for simplicity, we have only
just one host name mentioned in synchronous standby names. No expressions, just one.
And if we don't change the synchronous commit setting, default is on, here the meaning changes,
right? Which, like, this is what I started
with, right? They overloaded it, the meaning changes. And what does it mean?
It means that we need to wait until... what? Let's see. So yeah, I
think it's worth thinking about them in turn.
I like to think of them like, progressively.
I think the order goes, remote apply is the strongest.
So remote apply will wait until
it's not only been written to the write ahead log
and flushed.
Flushed to disk, but also...
available to queries, so read queries on the standby will be
able to see it. And here I think, okay, it's apply here, but why replay there?
In pg_stat_replication it's called the replay LSN, right? Okay. Apply or replay,
apply, replay, we consider them the same thing, right? Yeah, right. So that's the
strongest, I think that's the highest level we can set, but it has
additional latency obviously. I think on is the next strongest. I think we wait for FLUSH, but not for, what did you want to call it, REPLAY.
REPLAY or APPLY.
And then there's one level below ON, which is the WRITE.
There is no remote receive. There is remote write, right. Yeah, but I think receive,
they're saying it's the same as write. All data received, transmitted over the network and received,
but not yet flushed to a file, but received, right? Yeah, write and receive are the same. There is confusion because
in different parts of Postgres here we have different words, different verbs used, right?
But I agree. So there is write, there is flush, and there is apply. There are three steps here. There is no remote flush.
Overloading happens, but it would be good to have remote flush and say that ON is acting
like flush.
But there is no remote flush.
There is only remote write and remote apply.
And instead of flush, we have on.
Saving on number of words, right?
Supported.
Number of settings.
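As a rough summary of the levels being discussed, assuming a synchronous standby is configured, plus where the corresponding positions can be seen on the primary:

-- off          : don't wait for the local WAL flush at commit
-- local        : wait for the local WAL flush only
-- remote_write : local flush, plus the standby has received (written) the WAL
-- on           : local flush, plus the standby has flushed the WAL ("remote flush")
-- remote_apply : local flush, plus the standby has applied (replayed) the WAL
SELECT application_name, write_lsn, flush_lsn, replay_lsn, sync_state
FROM pg_stat_replication;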
But they did add a local, which I think,
according to the documentation,
they're saying is only for completeness.
So it's interesting that they did have one.
It's not only for completeness,
and we will go there in a few minutes.
So ON means remote flush, the medium setting.
Right?
The WAL is written.
Yes.
Like, it's happening on the primary.
On the primary, at commit time, it's flushed to disk.
We know that in the case of a crash, this WAL data is already
on disk.
But the data pages, during recovery, will need to be adjusted.
It's not applied yet.
Good, good, good.
So how do you feel about this? It's like synchronous replication, which
means that if we lose the primary, the new primary will have everything, which is great,
right? It will have everything, and it will need to be applied, so recovery, but we don't lose
anything.
It's great.
Agreed?
Yeah, with a little bit of latency cost on each, right?
A little bit.
Or a lot.
Yeah, this latency depends on the network as well, a lot.
And usually we prefer to move standbys to different availability zones.
And this increases latency, of course.
And this means that if we keep this as is, synchronous commit on and synchronous
standby names has...
the setting has one hostname,
it means that we will have significant latency overhead.
This is why it's hard to use in heavily loaded projects, OLTP-like projects.
But it's quite strict mode, right?
Yeah, and you can have an HA setup that's within the same availability zone, or at least
the same region, I'm guessing, would massively reduce that?
Well, I wouldn't put...
So for me, one availability zone can disappear.
If you think about HA, we need to have a replica in a different availability zone, which of course,
there is a trade-off here.
We need to weigh, right?
Weigh, like, what is here, what is there, and then what's worse, what's best for us.
And for me, HA means the replica should be in a different AZ. If it's in a different AZ,
it means commit in this strict mode, remote flush or on (there's no remote flush),
it means we will need to wait for both synchronization with
disk on the primary and there. And network will add overhead here. And I expect we won't be able
to scale these writes very well. So if it's like thousands of writes per second, it will be noticeable issue.
That's why in most cases I see people go to...
Well, by the way, remote apply is interesting.
It's even worse.
Like, we need to apply this.
It kills performance even more.
What do you think about this mode?
It's like extreme.
I actually lost you. I think our network connection is not great.
While I discussed network issues, we had network issues. So let me move us to consideration of remote
apply. Well, yeah. So actually a question here then: this is no longer about the
risk of data loss, because if we've flushed, then we're good. So is this for read replicas?
This only makes sense if we're sending read traffic. Okay, great.
So this is about how data is visible there. We know by default on is acting like remote flush, which again doesn't
exist. Changes are not applied, so any reading session, which is only possible
on standbys, like reads on the standby,
in this case won't see this data yet. Unless we're using remote apply, in which case it would always be there.
It will be there right after committed.
It's visible.
It's great.
So the delay between written and visible becomes zero.
Once it's committed, it's immediately visible.
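A sketch of that read-your-writes effect (the table name is made up):

-- on the primary
BEGIN;
SET LOCAL synchronous_commit TO remote_apply;
INSERT INTO orders (id, status) VALUES (42, 'paid');
COMMIT;
-- once COMMIT returns, a SELECT on the synchronous standby already sees id = 42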
Great. And when
do we need this question? I don't know. I don't work with such projects.
Right, because by the point you need read replicas, the point of having read replicas
is because you have so much read traffic that your primary can't serve it all, right? So we're talking about extremely high,
at least extremely high reads, maybe cases where we have a ton more reads than writes
and write latency is not that important, maybe social media again.
Yeah. In this case the replica can be in the same availability zone, because it's not
for HA purposes, right? It's just to redistribute reads maybe. I honestly don't know. We usually,
we already have mechanisms implemented everywhere to deal with the lags of replication,
like sticking to the primary right after writes, in the same user session, for some time, or
these kinds of things. So yeah, you mentioned sticky reads before.
In this case, to slow down commits for the sake of
reducing this delay, this lag...
I don't know.
I would prefer to deal with small lags, but not slow down commits.
Yeah.
So that's, I guess that's remote apply and default is on.
Do you ever, do you see use cases for remote write?
Remote write is the most popular.
Oh really? Interesting.
Of course. Most setups where synchronous standby names is non-empty, I see they use remote
write, most of them. Remote write. Right, that makes more sense. The least strict one.
And here we have new issues, because this is a super popular option. And,
oh, by the way, I wanted to mention that yesterday on Hacker News there was the article from Aphyr,
Jepsen. It was discussing RDS multi-AZ clusters, which is interesting, it's closer to our topic today,
but to understand the article we need time. So, great, we have remote write as being very popular.
What does it mean? The standby node received all data but hasn't flushed it yet to
disk. It means that if all nodes suddenly shut down, only the primary will have this data. After
commit, we know, if the commit happens, the primary flushed it. We don't have settings which let us control the behavior of commit on standby nodes. If we have remote write, on, or remote apply, it means in any of these cases
the disk on the primary has WAL data from this transaction.
So it's already not off at all.
And imagine all nodes disappeared: only the primary has this data, because on the standby,
since it's remote write, flushing is not guaranteed. Its disk might miss this data, and this transaction
might be lost. And this is okay. But in case we lose the primary completely, we lose data.
We lose data here.
But we lose data only if all nodes are shut down.
Usually when we lose the primary, standby nodes are not shut down.
Unless something like...
Yeah, yeah, yeah.
Bigger issue.
Let's say loss of the primary is our corner case, not corner, edge case.
But loss of the primary while standby nodes are not lost but shut down, it's kind of already
a very rare corner case, so several edges, several problems simultaneously.
So usually they're still running. That's why remote write
is so popular, because it says, okay, we lost the primary, we need to perform auto failover,
or manual failover, usually auto failover, and one or two, or however many you configure,
of those standby nodes, per the setting, per synchronous commit equals remote write,
they have it in memory.
And they are still running, so they will flush it.
So the risk is understood here.
I mean, the risk is understood: loss of the primary and everyone else is shut down suddenly,
but it's so low, so we can accept it. And so it still feels like synchronous commit to me, because I doubt that when we lose the primary,
the standby node which received but not yet flushed this WAL data will be suddenly off.
It will be running, and we don't lose this data.
This is great.
Yeah, interesting. This is why it's so popular. But it's great, except it's not. Because if
auto failover needs to happen... It's very rare, like sometimes people allocate, like Crunchy Bridge or RDS, not a Multi-AZ cluster, but a Multi-AZ standby.
Now there are two options.
Multi-AZ standby means there is a synchronous standby serving only HA purposes, you cannot
send queries to it.
Crunchy Bridge has the same, well, not the same. They use Postgres replication to solve this.
Exactly, they configure synchronous_standby_names and they set the synchronous setting to a very weird...
synchronous commit to a very weird value. We will discuss it. But the idea is the HA replica
doesn't receive traffic, it serves only for HA purposes.
Great.
But in many other cases people use, like they say, this quorum commit: one or two additional
nodes must say that data is received, remote write.
And in this case, if we lose the primary, auto-failover will happen to where?
To one of these nodes which received it.
But the problem is, for example, if it's Patroni, Patroni cannot choose this node, because there is no capability to find which one... We have five replicas, five standbys. We know one of them received the latest WAL.
How to find it? There is no such capability. It doesn't exist.
Because there is pg_last_wal_receive_lsn, but it has the wrong name, because it shows
the flush LSN. It's a bug or a feature, I don't know. So this is a problem, and
if Patroni uses that, and I think right now it uses that, it
chooses the node which flushed the LSN with the maximum value, and this might be a different node.
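The functions in question, run on a standby (roughly what a failover tool can look at today):

-- despite the name, this reflects WAL that has been received and already flushed to disk by the standby
SELECT pg_last_wal_receive_lsn();
-- the last WAL position replayed (applied) during recovery
SELECT pg_last_wal_replay_lsn();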
So, with that in mind, I guess it's then an acceptable risk, or that's why it has some data loss, but why not use on in that case?
Because it will slow down commits.
Interestingly, I checked before coming on with our own Google Cloud SQL setup, and I wondered what our setting was, and it was on.
So either they haven't...
But yeah, it's interesting that they could have chosen to set it to remote write.
And synchronous_standby_names is non-empty?
I didn't check. It's a good question.
If it's empty, you have pure asynchronous replication.
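To check what a managed provider has actually configured, something like:

SHOW synchronous_commit;
SHOW synchronous_standby_names;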
This is it. Honestly, it concerns me a lot how overloaded the values and names are here.
On means different things depending on different settings. Why?
Maybe we are trying to save on the number of configuration settings, GUC names, not to expand it too much, okay, 300, let's not expand
it to 500, right? But this causes pain in the brains of people using this. So if it's on, okay.
It's asynchronous. No synchronous standby names.
Yeah, so it waits for flush only on the primary. Yeah, so that's the same,
that's basically just local. Yeah, yeah, yeah. Good. So, there's this problem,
it's unsolved. It should be solved in Postgres,
and I think it's not super difficult to solve. We need a function which will
show the write LSN or receive LSN. By the way, receive and write may be a little bit different.
You receive but... okay, doesn't matter. Because the WAL receiver is the WAL receiver, right?
Anyway, a couple more, deeper things I wanted to mention. One of them is, even if we have remote
write, there is an issue in the code that causes flushing periodically. So we have remote write, but sometimes we wait until it's flushed. And this is also,
I guess this issue is known, it reveals itself only under very
heavy loads, and yeah, it also could be improved maybe, right? Because
remote write is not purely remote write.
It's remote write, write, write, but at some point there is a flush, and all transactions that are
currently being committed wait for this additional flush.
And then it's again pure, like again write, write, write, then again flush.
And it can be seen in the code that there are things there. fdatasync, right?
Roughly, yeah.
I mean, is this also the wal_writer_delay
or a different kind of delay?
It's not documented.
Okay, interesting.
So, I mean, if you use remote write,
there is like behavior which is not purely remote write.
It still involves disk sometimes, periodically.
I guess some buffer is full and needs to be written.
But this causes delays of commits.
Makes sense. Yeah.
And another thing is probably off topic about Crunchy Bridge.
Like, let's discuss what we observed.
Like we just created fresh cluster and saw a few things.
One of them is synchronous standby names is filled with...
I created HA cluster.
Yeah.
So HA cluster was all good.
And synchronous standby names has this host,
which is the HA replica. It's not visible to us, we cannot send queries to it,
and synchronous commit is set to local.
Yeah so the first thing you mentioned it being non-empty means they're using
Postgres replication.
What's the significance of that being non-empty?
If it's not empty, it's supposed to be...
If synchronous commit was default, on, it would mean we have synchronous replication.
Flush, as you said, on means this non-existent remote flush.
But synchronous commit is set to local, which means we don't see the state of what's happening
on standby.
The commit behavior depends only on the primary.
Like my setup, right?
Like on Google Cloud SQL for me, even though it's on, it's kind of like local because
they're not using...
Right, but they still have the replica.
Yeah, yeah.
Strange, right?
But local is confusing because I don't...
Yeah, I would have, as you said, I would have expected it to be on or as you say the
more popular one.
If it would be on, if it was on, this replica would be synchronous.
If it's local, I guess it's unsynchronous.
Yeah.
Or asynchronous, yeah.
Asynchronous, right.
Although it's mentioned in synchronous_standby_names.
Yeah, good point.
What would you prefer it to be, remote write, I guess, rather than on?
No, no, no. I think they are not... I'm not sure it's a mistake.
Well, it's not the default, right? So it's definitely deliberate.
There are some thoughts...
There were some thoughts about this, I think.
And I'm trying to guess what.
Because obviously making it synchronous would cause slow commits, right?
Yeah.
Performance would be not great.
Okay, we go to local, but why do we...
If people are doing...
Yeah, we keep it local, so this HA replica becomes asynchronous.
Do you think it's to look good in benchmarking?
Why can't we configure an asynchronous replica in a normal way?
Why should we put it into synchronous_standby_names, right?
Because it could be a regular asynchronous replica, present,
with a slot, but not available for read-only transactions. But it's there. And the only
idea, it's not my idea, we discussed this situation in the team, and the idea that came up in one great mind was: probably they want WAL to be written by backends,
not by the WAL writer, because there is this effect: when synchronous
standby names is not empty, backends are writing, and this can increase throughput. It's a guess, a hypothesis. If Crunchy is listening to us,
and we know they're listening because they changed random_page_cost to 1.1, right?
Oh, yeah. Actually, it's great. I also wanted to thank them, because since then I stopped saying
one and I say 1.1, because Crunchy benchmarked it and decided to use 1.1 for random_page_cost.
Great, thank you so much.
But if you guys are listening, please let us know what's happening with settings.
It's really interesting. Maybe it should be documented because it's interesting.
Overall impression about synchronous commit, all those options and synchronous replication:
I honestly see more asynchronous setups, purely asynchronous. But I know serious
guys, and I see clusters with synchronous, like quorum commit, configured. My impression is still that there are things to be fixed and improved. This is what I observe. And one more
thing, let's name it. When we created this cluster with the HA checkbox, an HA
replica was also created. We see it in synchronous_standby_names, and then we
decided using our
superuser, which Crunchy has, and it's great. I think it's the only managed
service, one of maybe two actually, which provides superuser access, which is
great. So we decided to change shared_preload_libraries and restart.
And there was an idea, like a question: is this change propagated to the standby node or
not?
And what happened?
A Postgres restart, not a cluster restart, a Postgres restart, caused a failover.
This is not all right.
Freshly created cluster, and we did it three times; two out of three times, a failover happened.
And another time it didn't happen.
So I guess some race condition is present there, right?
And we saw that the change was not propagated to the standby.
So it's actually data loss.
If I have superuser, it's my right, right?
And I ran ALTER SYSTEM, we can see it in
pg_settings and it's not yet applied, like pending_restart is true.
And then we have a failover and we don't see it anymore.
What is it?
It's data loss. Or it's unsupported behavior.
No, it's data loss. I wrote something via...
Using ALTER SYSTEM, I wrote something and I lost it.
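Roughly what that looked like (the library name here is just an example):

ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
SELECT pg_reload_conf();
-- the change is recorded but not applied until restart:
SELECT name, setting, pending_restart FROM pg_settings WHERE pending_restart;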
So you can't have it both ways? I see where you're coming from, 100%.
But on one hand you want superuser access, and on the other hand
you're going to call it data loss if you change anything that doesn't get propagated.
Let me contradict myself.
I'm too harsh, I know.
ALTER SYSTEM is actually local.
We can change settings on any standby, each standby separately, independently.
So it's not propagated by replication.
I just expected maybe some automation exists which would propagate it.
If this replica is an HA replica, maybe we should match settings.
Because this setting requires a restart, but what happens to settings that don't require
a restart?
If I change them
using ALTER SYSTEM, do I lose them? This is not good, right?
But equally, expectations of automations around this stuff could be a nice excuse for other
providers not to add superuser access. So it's yet another...
No, no, no. I think it's... No, no, no.
I'm just saying be careful what you wish for.
Well, it could happen. Like I change something using UI and...
No, because if you use the UI, they change it somewhere else that then does get propagated.
Maybe. But I'm rooting for Crunchy actually here.
I think this problem can be solved.
I don't want them to close superuser access.
It's a great feature, super great feature.
I think everyone should have it.
Based on our previous week's topic, my position is: RDS and others, they are taking too much from us.
It stops feeling open source.
We don't need to make this point again.
Yeah, okay.
Well, I'm going to repeat it many times, so it's not all right.
So, Crunchy are doing a great job here providing superuser access, and I think
this can be solved.
And one more thing, if they are really listening, when I restart Postgres, I see restart cluster.
I feel like I need to chop this bit up and send it to them as a support thing.
Good, yeah, that was good.
Nice one. Thank you.
Yeah, thanks so much Nikolai and thanks for the request and yeah, catch you next week. See you, bye bye.