Postgres FM - synchronous_commit

Episode Date: May 2, 2025

Nikolay and Michael discuss synchronous_commit — what it means on single node setups, for synchronous replication setups, and the pros and cons of the different options for each.

Here are some links to things they mentioned:

synchronous_commit https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-SYNCHRONOUS-COMMIT
synchronous_commit history on pgPedia https://pgpedia.info/s/synchronous_commit.html
Patroni's maximum_lag_on_failover setting https://patroni.readthedocs.io/en/master/replication_modes.html#asynchronous-mode-durability
wal_writer_delay https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-WAL-WRITER-DELAY
Selective asynchronous commits in PostgreSQL - balancing durability and performance (blog post by Shayon Mukherjee) https://www.shayon.dev/post/2025/75/selective-asynchronous-commits-in-postgresql-balancing-durability-and-performance/
Asynchronous Commit https://www.postgresql.org/docs/current/wal-async-commit.html
synchronous_standby_names https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-SYNCHRONOUS-STANDBY-NAMES
Jepsen article about Amazon RDS multi-AZ clusters (by Kyle Kingsbury, aka "Aphyr") https://jepsen.io/analyses/amazon-rds-for-postgresql-17.4

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and this is Nikolay, founder of Postgres.ai. Hey Nikolay, how's it going? Everything is all right, how are you? Yeah, good, thank you. It's been a nice sunny week here in the UK.
Starting point is 00:00:16 Oh, that's nice, okay. So this week we got a listener request. In fact, we get quite a lot of listener requests, thank you everybody. This one was from Shayon Mukherjee, apologies for pronunciation, and they asked us about synchronous_commit and whether, or when, it might make sense to ever turn it off or use different settings. I thought it was an interesting topic full stop, so I wanted to get your thoughts as well on this.
Starting point is 00:00:47 I have many thoughts on this. We need to start somewhere. So I wondered about starting probably not in the most obvious place but on a single node setup. I know this becomes more interesting multi-node, but on single node, pure Postgres, there are still some interesting discussion points here, right? Yeah, and even before that, there is single... So the meaning of synchronous commit is overloaded. I know technically it defines behavior defines the behaviour of the commit. There is local behaviour and if we have synchronous standbys, there is remote behaviour. But originally it was only about local. And then the decision was...
Starting point is 00:01:43 It was only on and off to the options. This is how I remember it for a decade or more, more than a decade of being Postgres user. And then there is overloaded meaning, remote apply, remote... Why? Yeah, remote write, remote apply. Two options. So I think my first thought is it's a huge mistake to mix things here. It causes pain in users like me. All the time I like I read it, I read it, I read it like, I cannot remember this. To remember this, you need to deal only with this.
Starting point is 00:02:29 Yeah, or look up it, well you can't remember it, you just look it up in the, I just look it up. Every time you look it up and try to understand the logic. It's like, we have many places in GUC system, Postgres system like this, like for example, you know, we still do a lot of consulting. We help our customers, many of them are startup companies who grow rapidly. Very smart people. They understand things very quickly, but you need to explain what's happening. One of the things
Starting point is 00:03:00 like we deal with bloat all the time. Bloat is one of the major concerns and we explain okay we need that auto vacuum tuning and there is auto vacuum work mem which is minus one it means you need to look at maintenance workman why this why was this done this is exactly the same thing like why was this done to my brain like somebody made some decision to cause huge pain for ages. It's a UX issue, a user experience issue, because instead of clarifying and separating things, we develop some implicit dependencies and you need to be expert. And even if you are expert, it's really hard to memorize all these things. And the problem is like at some point we will use more and more AI and so on, but to explain
Starting point is 00:03:54 and think these things to AI it's also difficult because there are implicit dependencies. So synchronous commit can be local but it also can be not local. Local is on and off and there is also word local, right? So meaning of on was rewritten. Thank you guys, thank you. Thank you who did this. Thank you. If you try to be a consultant and explain things to others, very smart guys, you will see the pain. It's really hard. And why is this hard? This is my first thought. Why is this so hard? As you say, it's because combining two things, right? And the behavior is... I can see why they combined it though. Like the behavior... I also can see no no questions here and to their credit the documentation is really nicely written so
Starting point is 00:04:49 okay it can be points it's nicely written but it doesn't explain all the things so I hope we will touch additional topics not covered by documentation today nice yeah so let's talk about the single node case on-off, on by default, right? So why would we want to turn it off? What's happening here? Yeah, if we forget about multi-node setups, we have just one node, right? And if we have right-heavy workloads and we think right latency is not good and eventually right throughput is not good, because if latency is not good throughput won't be good as well. There is
Starting point is 00:05:32 some dependency there which is not straightforward, but there is. And it means that, well we know at commit time, Postgres needs to make sure information is written to all and it's present, synchronized to it, it's on disk. Before that, POSGAS cannot respond to the user that the commit was successful. And when we say commit, it can be explicit commit or it can be implicit commit if we have a single statement transaction, which is a super popular approach in web and mobile apps. So when we just have single statement transactions all the time, just separate and so on, separate
Starting point is 00:06:16 update and so on. And all of them have implicit commit as well. So commit can be successful only when wall is already on disk. And it means we need to wait for disk. And if disk is not a RAMFS, which is in 99.99% of cases, right? If it's regular disk, it can be quite slow. And in this case, we deal with... if our transactions are super fast, like extremely fast, which is actually rare, usually they are not so fast, because of various cases, various things like including, for example, index right amplification. I have mood to criticize Postgres today, you know. Well, yeah, so I feel like so far you've described the kind of synchronous commit equals on
Starting point is 00:07:19 case. It's default. Yes, which is default. And we're waiting for the flush to happen to the right side disc before confirming back to the client that that commit has been successful. But turning it off, we don't wait for the extra step of flush into disc before confirming to the client, which sounds risky. Well, the risk is if there happens to be a crash on the node, in the time between the write-head log being written and the flush, so the commit being confirmed and the flush to disk, we lose any of that data. So the risk here is some data loss on crash. Yeah, the short like tail of wall is lost. Postgres will still reach consistency, all
Starting point is 00:08:14 good, but we lose some part of data. It's similar to any multi-node setup with asynchronous node when the fallover happens and if standbys are asynchronous, nobody guarantees that at fallover everything will be present there, right? Because it's asynchronous. There can be lags, but there's consistency, all good, and we can lose, for example, by default in Patroni settings. As I remember, there's a setting, I always forget the name, but setting like maximum lag allowed at failover. As I remember, by default, it's one megabyte. So up to one megabyte is okay to lose. Here it's similar.
Starting point is 00:08:57 There is no threshold, like unlimited, but technically it cannot be very, cannot be huge. I looked into it and there is the default setting, there's a writer log, writer delay of 200 milliseconds by default. And so I thought maybe only 200 milliseconds worth of writes could be lost, but in the documentation, it's actually, what yes good point actually it's it's for for implementation detail reasons it's about three times it's it says it's three times that so by default it's 600 milliseconds um that could be lost but you could reduce that if this setting only makes sense to change if you're setting synchronous commit for any
Starting point is 00:09:45 any of your transactions but you can reduce it if you want. So you could if you do decide to go for some use of synchronous commit equals off then this is a setting you could reduce to to minimize data loss but at that point you're still losing some data you might still be losing some data so it's I'm not sure if it matters that much. Right, right. So maximum is three times of that. Yeah. Wall writer delay, right. And the question is, like, should we consider this? My answer is if writes are super fast, maybe this then maybe yes. If you can, if you like it's, it's there is a trade off here, right? Obviously. Yeah. I was thinking about it in our last episode on, about time series
Starting point is 00:10:30 data, but you made a really good point about if you, if you are writing kind of lots and lots of little writes, often in time series workloads people suggest actually batching inserts. So by that point, yeah, so batching makes a load of sense for optimization. It's a good point. So choosing between synchronous commit off and batching, you just choose batching. Yeah, but then at that point,
Starting point is 00:10:57 your transactions are not that fast because you're batching maybe a thousand or more of them together. And then the benefits of synchronous commit equals off of much smaller yeah exactly so it's an I don't think you can get the benefit twice if that makes sense you don't get any benefit of doing both. Single row inserts versus one insert of thousand rows. Thousand inserts versus thousand rows single insert. Of course, you'll have just one commit, it's great and it will be faster. But of course, or updates for example,
Starting point is 00:11:35 of course you should avoid huge batches because they have also downsides like longer lasting locks they have also downsides like longer lasting locks or a bit if in case if such batch is crushed you lose more you need to retry more right instead of losing just one insert you losing all thousand inserts so as usual batching my rule of thumb is just to choose batch size so it's roughly like one or two seconds maximum to avoid long lasting locks and risks to repeat too much. In this case it's great, like in this case just one commit per second or maybe 10 commits per second, it's very low number. So in this case the overhead of So in this case the overhead of the need to wait for a wall to be synced with two disk is super low, which is like roughly you can think about it like not more than one millisecond. In this case like just forget about it.
Starting point is 00:12:38 Synchronous commit off is not a good idea. That's it. But usually things are much more complicated because we just consider the one type of workload. Any project has many types of workload, right? Yeah, and that's a good point that Shayan actually made in a follow-up post that I think sometimes gets missed in this topic,
Starting point is 00:13:01 which is that it can be set on a per transaction basis or per session basis per user basis so if you're doing a one-off if you're doing like a one-off migration or bulk load of data and you can't for some reason but batch yeah then maybe it's worth considering turning it off just for that user or for that session or per transaction. What kind of situation could it be if you cannot like bulk but not... I was struggling. I was struggling to think of it.
Starting point is 00:13:36 Maybe some streaming like streaming case. Yeah, it felt tenuous. If technically it comes from many, many sources. Exactly. it felt tenuous. If technically it comes from many many sources, instead of saying okay we need Kafka before it comes to Postgres and then we will batch it's like it's heavy solution maybe indeed this is exactly the case when it's worth considering off, right? Yeah. To use off. The other thing I was thinking about is whether this comes up as a thought for people because of tools like PGBench that by default
Starting point is 00:14:17 you get a lot of small transactions, a lot of small, very fast transactions. So because, like I wondered if it's Becomes interesting to people because they just think let's quickly do a test with PG bench and Oh, wow synchronous commit. What's that? Have you tried? PG bench and turn off synchronous commit. I tried many times because it's obvious idea and yeah, remember I saw huge advantage Because the bench write transactions by default they are not so simple it's I remember it's update plus two inserts maybe a delete as well it's multi-statement
Starting point is 00:14:56 transaction which is definitely not super lightweight so, fair enough. I didn't try it. Yeah. Well, I can imagine we can have situations when the effect is quite noticeable, like dozens of percent, dozens percent. But in general, from my practice, I stopped looking in this direction. Yeah, fair enough. So think about often. And of course, data loss is not a good idea. So I don't know, like write heavy workloads, maybe yes, but again, batching and all proper tricks to reduce a number of actual commits happening and that's it.
Starting point is 00:15:38 Yeah. And I think people, when people talk about data loss, they quite often immediately jump to things like banking apps, things where it's absolutely critical. You'd much rather have some downtime than have some data loss. But I also think user experience-wise, even for just regular crud apps, losing data can be really confusing to people, even if it's only a not that important thing. The cases I was thinking of are much more like the time series ones where if we've got one sensor reporting every
Starting point is 00:16:11 minute, if we lose one minute of if we lose a couple of those sensor reports, it's not that important. We can still do a bunch of aggregates on that data. So I guess like for monitoring and logging type things, maybe it's less important. But there can be really important logs too, right? Like, so I personally, I'm not, I don't see too many use cases where this makes a lot of sense, but there is a whole page on the Postgres docs about, it's called asynchronous commit. So it must have been like somewhat of interest to quite a lot of people. Right. Yeah. Right. Yeah.
Starting point is 00:16:45 Yeah. Yeah. Should we switch to the topic? Like it feels like this gets more interesting when we start talking about synchronous. Oh yes. Let's do it. Let's move on and the mode apply. First of all, own is overloaded
Starting point is 00:16:58 when you have synchronous to buy names, non-empty. Synchronous to buy names, non-empty. Synchronous standby names non-empty, right? We can have purely synchronous replication or we can have so-called quorum commit and I just had a chat with one of our team members and I agree. I heard criticism as well, Quorum commit is a very misleading term here, because usually it's used like this. We have multiple standbys. They are synchronous originally, by nature. But then we say synchronous commit, for example, remote. And then we configure POSGAS to allow synchronous standbys to have any. We say any. And for example we have, say, five replicas, five standbys.
Starting point is 00:17:51 And we say any one of them. And it means that commit happens on the primary and on any one additionally. So on two nodes before the user receives success of commit. It means two out of six. Five standby, primary, six nodes. So two out of six is not quorum, because quorum, definition of quorum means more than 50. So that would be four out of six, right?
Starting point is 00:18:21 Usually, usually. I think we can say this. In this case, the criticism is conditional. We may say, in our case, the definition of the quorum is rewritten and we need 2 out of 6. But it sounds strange a little bit. So maybe how should we call this situation when we not just one synchronous replica and every commit must happen on both, but we have multiple replicas and we say one more or two more. So how do we call it? Semisynchronous?
Starting point is 00:18:57 I've heard this term as well. Semisynchronous. Yeah, yeah. I actually, yeah, I like, I kind of like it. It does imply. But official doc says it's confusing. Semisynchronous is confusing enough that means I have to look it up, which is probably helpful. I don't assume I know what it means. Right. Right. But yeah, it doesn't fit into either category. Like it's not synchronous, and it's not asynchronous. Like it's, yeah, I don't like it though.
Starting point is 00:19:26 I don't feel comfortable with it because what's it saving us? Like what's the benefit of our own? Like is the idea that if both crash at the same time. Yeah, before we talk about it, let's finish with terminology. Maybe I'm looking up right now the meaning of our quorum,
Starting point is 00:19:44 maybe it's not that bad because in some cases we can say, okay, 25 meaning of word quorum, maybe it's not that bad because in some cases we can say, okay, 25% is our quorum, right? We can define these rules. So expectation that it should be more than 50, maybe it's like false expectation. Okay. Yeah. Anyway, but you cannot say percentage. You say like any one or two, right?
Starting point is 00:20:07 Configuring Postgres, synchronous standby names, you can say any word and then say... You can have some kind of... It's an interesting syntax, you saw it, right? So... Yeah, I also saw you could do... Is it like priority? Yeah, yeah, yeah. So yeah, there are some interesting things there.
Starting point is 00:20:30 But anyway, the idea is they work together. Synchronous standby names and synchronous commit in this case. And if synchronous standby names is not empty, so there is this magic happening. If it's just hostname, purely synchronous replication. But also purely synchronous, there are flavors we will discuss based on synchronous commit setting. If there is like expression is there, like any or like first blah blah blah, then it's already more complicated but it's interesting and and official documentation mentions it uses the term quorum commit right
Starting point is 00:21:13 i saw it in a crunchy brit in a crunchy data blog post i didn't actually see it in the documentation okay okay doesn't matter maybe doesn't matter Let's talk about flavors. Let's consider we have only, for simplicity, just one host name mentioned in synchronous standby names. No expressions, just one. And if we don't change synchronous commit setting, default is on, here the meaning changes right which like this makes me I started with this right they like overloaded the meaning changes and what it does mean it means that we need to to wait until what like let's let's see so yeah I I think it's worth thinking about them in turn. I like to think of them like, progressively.
Starting point is 00:22:09 I think the order goes, remote apply is the strongest. So remote apply will wait until it's not only been written to the write ahead log and flushed. Flushed to disk disk but also on the... but but available to queries so read queries on the on the standby will be able to... and here I like I think okay apply here but why replay there? In PgStat replication it's called the replay lesson right okay mm-hmm apply or replay
Starting point is 00:22:45 apply replay we consider the same thing right yeah right so so that's the strongest I think that's the highest level we can set and but it has additional latency obviously I think on is the next strongest. I think we wait for FLUSH, but not for, what did you want to call it, REPLAY. REPLAY or APPLY. And then there's one level below ON, which is the WRITE. There is no remote RECEIVE. There is a replay right. Yeah, but I think receive is, they're saying it's the same as right. All data received, transmitted over network received, but not yet flushed to file, but received, right? Yeah, write and receive are the same. There is confusion because
Starting point is 00:23:47 in different parts of Postgres here we have different words, different verbs used, right? But I agree. So there is write, there is flash, and there is apply phases. And there are three steps here. There is no remote flush. Overloading happens, but it would be good to have remote flush and say that ON is acting like flush. But there is no remote flush. There is only remote write and remote apply. And instead of flush, we have on. Saving on number of words, right?
Starting point is 00:24:32 Supported. Number of settings. But they did add a local, which I think, according to the documentation, they're saying is only for completeness. So it's interesting that they did have one. It's not only for completeness, and we will go there in a few minutes.
Starting point is 00:24:47 So ON means remote flush, medium setting. Right? The wall is written. Yes. Like it's happening on the primary. In primary, at commit time, it's flushed to disk. We know that in the case of crash, this whole data is already on disk.
Starting point is 00:25:08 But what will happen with data pages during recovery, they will need to be adjusted. It's not applied yet. Good, good, good. So how do you feel about this? It's like synchronous replication, which means that if we lose the primary, the new primary will have everything, which is great, right? It will have everything and it will be needed to be applied, so recovery, but we don't lose anything. It's great.
Starting point is 00:25:49 Agreed? Yeah, with a little bit of latency cost on each, right? A little bit. Or a lot. Yeah, this latency depends on the network as well, a lot. And usually we prefer to move standbys to different availability zones. And this increases latency, of course. And this means that if we keep this as is, like synchronous commit on and synchronous
Starting point is 00:26:16 bynames have... The setting has one hostname. It means that we will have significant latency overhead. This is why it's hard to use in heavily loaded projects, all TPEG-like projects. But it's quite strict mode, right? Yeah, and you can have an HA setup that's within the same availability zone, or at least the same region, I'm guessing, would massively reduce that? Well, I wouldn't put...
Starting point is 00:26:47 So for me, one availability zone can disappear. If you think about HA, we need to have replica in different availability zone, which of course, there is trade-off here. We need to wait, right? Wait, like what is here, what is here, and then what's worse, what's best for us. And we move, for me, like H&M, means the replica should be in different AZ. It's in different AZ, means like commit in this strict mode, remote flush or on, there's no remote flush. It means we will need to wait both synchronization with
Starting point is 00:27:27 disk on the primary and there. And network will add overhead here. And I expect we won't be able to scale these writes very well. So if it's like thousands of writes per second, it will be noticeable issue. That's why in most cases I see people go to... Well, by the way, remote apply interesting. It's even worse. Like we need to apply this. It's even worse and kills performance even more. What do you think about this mode?
Starting point is 00:28:04 It's like extreme. I actually lost you. I think our network connection is not great. When I discussed network issues, we had network issues. So I moved us to consideration of remote apply. Well, yeah. So actually question here then, this is no longer, I don't think about data risk of data loss because if we've flushed, then we're good. So is this for read replicas? This only makes sense if we're sending read traffic. Okay, great. So this is good in terms of how data is visible there. We know by default on it's acting like remote flash, which again doesn't exist. Changes are not applied, so any reading session which can be possible, the only possible
Starting point is 00:28:55 option we can have on the selects, like reads on standby. In this case it won't see this data yet. In this case, it won't see this data yet. Unless we're using remote apply, in which case it would always be there. It will be there right after committed. It's visible. It's great. So the delay between written and visible becomes zero. Once it's committed, it's immediately visible. Great. And when
Starting point is 00:29:27 do we need this question? I don't know. I don't work with such projects. Right, because by the point you need read replicas, the point of having read replicas is because you have so much read traffic that your primary can't serve at all. Right? So we're talking about extremely high, at least extremely high reads, maybe cases where we have a ton more reads than writes and write latency is not that maybe social media again. Yeah. In this case is replica can be in the same availability zone because it's not for HA purposes, right? It's just to redistribute reads maybe. I honestly don't know. We usually, we already have mechanisms everywhere implemented to deal with the legs of replication.
Starting point is 00:30:26 deal with legs of replication. Like to stick to the primary right after writes and in the same session of user like for some time or like these kinds of things. So yeah you mentioned sticky reads before. In this case to slow down commits for the sake of this like reduction of this delay of this leg. I don't know. I would prefer to deal with small legs, but don't slow down commits. Yeah. So that's, I guess that's remote apply and default is on.
Starting point is 00:30:58 Do you ever, do you see use cases for remote right? Remote right is the most popular. Oh really? Interesting. Of course, most setups where synchronous standby names is non-empty, I see, they use remote apply most of them. Remote write, remote write. Right, that makes more sense. The least complete. And here we have new issues because this is super popular option. And, oh, by the way, I wanted to mention that yesterday on Hacker News, it was the article from Affir, Jepson. It was discussing RDS multi-AZ clusters, which is interesting. It's closer to our topic today,
Starting point is 00:31:49 but to understand the article we need time. So, great, we have remote write as being very popular. What does it mean? The standby node received all data but hasn't flushed it yet to the disk. It means that if all nodes suddenly shut down, only primary will have this data. After commit, we know, if commit happens, primary flushed it. We don't have settings which let us control the behavior of commit on standby nodes, If we have remote write on or remote apply, it means in any of these cases, disk on the primary has wall data from this transaction. So it's not off at all already. And imagine all nodes disappeared, only the primary has this data, because standby, since it's remote write, is not guaranteed. Disk might miss this data, and this transaction
Starting point is 00:33:14 might be lost. And this is okay. But in case we lose the primary completely, we lose data. We lose data here. But we lose data only if all nodes are shut down. Usually when we use the primary, standby nodes are not shut down. Unless something like... Yeah, yeah, yeah. Bigger issue. Let's say loss of primary is our corner case, not corner, edge case.
Starting point is 00:33:51 But loss of primary and standby nodes are not lost but shut down, it's kind of already very corner case, so several edges, so several problems simultaneously. So usually they're still running. That's why remote write is so popular because it says, okay, we lost the primary, we need to perform auto failover, or manual failover, usually auto failover, and standby one or two, or how many you configure, of those standby nodes pair the setting, pair standby commit equals remote, right? They have it in memory. And they are still running, so they will flush it.
Starting point is 00:34:32 So the risk is good here. I mean, the risk is understood, like loss of primary and everyone is shut down suddenly, but it's so low, so we can accept it. And so it still feels like synchronous commit to me because I doubt when we lose primary standby node which received but not yet flushed this whole data, it will be suddenly off. It will be running, and we don't lose this data. This is great. Yeah, interesting. data. This is great. This is why it's so popular. But it's great. Except it's not. Because if autofollower needs to happen, it's very rarely like sometimes people allocate like Crunchy Bridge or RDS, not Multi-ASZ cluster, but Multi-ASZ standby.
Starting point is 00:35:29 Now there are two options. Multi-ASZ standby means there is synchronous standby only serving HA purposes, you cannot send queries to it. Crunchy Bridge has the same, not the same. They use Postgres replication to solve this. Exactly, they configure synchronous by names and they set synchronous setting to very weird... synchronous commit to very weird value. We will discuss it. But the idea is HAI replica doesn't receive traffic, it serves only for HCI purposes. Great.
Starting point is 00:36:07 But in many other cases people use, like they say, this Quorum commit, one or two additional nodes must say that data is received, remote write. And in this case, if we lose the primary, auto-failover will happen to where? To one of these notes which we received. But the problem is, for example, if it's Patroni, Patroni cannot choose this note, because there is no such capability to find which... We have five replicas, five standbys. We know one of them received the latest settings. How to find this? There is no such capability. It doesn't exist. Because this PG last received a lesson. But it has the wrong name because it shows flush lesson. It's a bug or feature I don't know. So this is a problem and
Starting point is 00:37:15 this is like if Patroni uses that I think it right now it uses that. It chooses the node which received like which flushed the same with maximum value, and this might be a different node. So, with that in mind, I guess it's then an acceptable risk, or that's why it has some data loss, but why not use on in that case? Because it will slow down commits. that case? Because it will slow down the commits. Interestingly, I checked before coming on with our own Google Cloud SQL setup, and I wondered what our setting was, and it was on. So either they haven't... But yeah, it's interesting that they could have chosen to set it to remote write.
Starting point is 00:38:20 And synchronous binames is non-empty? I didn't check. It's a good question. If it's empty, you have pure synchronous replication. This is it. Honestly, it concerns me a lot how overloaded values and names here. On means different things depending on different settings. Why? depending on different settings like why? Maybe we are trying to save on the number of configuration settings, GUC names, not to expand it too much, okay, 300 but not let's not to expand it to 500 right but this causes pain in brains like people like using this. So if it's on, okay. It's asynchronous. No asynchronous standby names.
Starting point is 00:39:14 Yeah, so it waits for flush only on the primary. Yeah, so that's the same, that's basically just local. Yeah, yeah, yeah. Good. So, there's this problem, it's unsolved. It should be solved in Postgres. It should be solved in Postgres and I think it's not super difficult to solve, so we need a function which will show write a lesson or receive a lesson. By the way, receive and write may be a little bit different. You receive but okay, doesn't matter. Because valid receiver is valid receiver, right? Anyway, so a couple of more things deeper I wanted to mention. It's deeper things, but one of the things is even if we have remote write, there is issue in the code that causes flashing periodically. So, we have remote write but sometimes we wait until it's flushed. And this is also
Starting point is 00:40:26 like this I guess this issue is known it's revealed itself only under very heavy loads and yeah it also could be improved maybe right because if so remote write is not purely remote write it. It's remote, right, right, right, but at some point it's flush and all transactions are currently being committed, they wait, additional flush. And then it's again, pure, like again, right, right, right, then again flush. And it can be seen in the code that there are things there. FData sync, right? Roughly, yeah. I mean, is this also the wall writer delay
Starting point is 00:41:12 or a different kind of delay? It's not documented. Okay, interesting. So, I mean, if you use remote write, there is like behavior which is not purely remote write. It still involves disk sometimes, periodically. I guess some buffer is full and needs to be written. But this causes delays of commits.
Starting point is 00:41:41 Makes sense. Yeah. And another thing is probably off topic about Crunchy Bridge. Like, let's discuss what we observed. Like we just created fresh cluster and saw a few things. One of them is synchronous standby names is filled with... I created HA cluster. Yeah. So HA cluster was all good.
Starting point is 00:42:03 And synchronous standby names has this host which is HA replica. It's not visible to us, we cannot send queries to it and synchronous commit is set to local. Yeah so the first thing you mentioned it being non-empty means they're using Postgres replication. What's the significance of that being non-empty? If it's not empty, it's supposed to be. If synchronous commit was default on, it would mean we have synchronous replication.
Starting point is 00:42:38 Flash, as you said, on means, in this non-existing remote flash. But synchronous commit is set to local, which means we don't see the state of what's happening on standby. The commit behavior depends only on the primary. Like my setup, right? Like on Google Cloud SQL for me, even though it's on, it's kind of like local because they're not using... Right, but they still have the replica.
Starting point is 00:43:14 Yeah, yeah. Strange, right? But local is confusing because I don't... Yeah, I would have, as you said, I would have expected it to be on or as you say the more popular one. If it would be on, if it was on, this replica would be synchronous. If it's local, I guess it's unsynchronous. Yeah.
Starting point is 00:43:36 Or asynchronous, yeah. Asynchronous, right. Although it's mentioned in synchronous on my names. Yeah, good point. What would you prefer it to be remote right, I guess, rather than on? No, no, no. I think they are not... I'm not sure it's a mistake. Well, it's not the default, right? So it's definitely deliberate. There are some thoughts There are some...
Starting point is 00:44:05 There were some thoughts about this, I think. And I'm trying to guess what. Because obviously making it synchronous would cause slow commits, right? Yeah. Performance would be not great. Okay, we go to local, but why do we... If people are doing... Yeah, we keep it local, so this HA replica becomes asynchronous.
Starting point is 00:44:29 Do you think it's to look good in benchmarking? Why cannot we configure asynchronous replica in a normal way? Why should we put it to synchronous standby names, right? Because it could be regular asynchronous replica, like present and slot, but not available for read-only transactions. But it's there. And the only idea, it's not my idea, we discussed this situation in the team, and idea came up into one great mind was probably they want wall to be written by backends, not by wall writer, but because there is like this effect when you use synchronous binaries it's not empty, backends are writing and this can increase throughput. It's a guess, hypothesis. If Crunchy is listening to us,
Starting point is 00:45:29 and we know they're listening because they changed the random page cost to 1.1, right? Oh, yeah. Actually, it's great. I also wanted to thank them because since then I stopped saying one and I say 1.1 because Crunchy benchmarked it and decided to use 1.1 for random page cost. Great, thank you so much. But if you guys are listening, please let us know what's happening with settings. It's really interesting. Maybe it should be documented because it's interesting. Overall impression about synchronous commit, all those options and synchronous replication, I honestly see more asynchronous, asynchronous setups, purely asynchronous. But I know serious
Starting point is 00:46:16 guys and I see clusters with synchronous like quorum commit configured, my impression still there are things to be fixed and improved. This is what I observe. And one more of let's name it. When we created this cluster with HA checkbox, HA replica was also created. We see it in synchronous by names and then we decided using our superuser, which Crunchy has, and it's great. I think it's the only managed service, one of maybe two actually, which provides superuser access, which is great. So we decided to change shared preload libraries and restart. And there was an idea, is this, like, question, is this change propagated to standby node or
Starting point is 00:47:07 not? And what happened? Postgres restart, not cluster restart, postgres restart, cost fell over. This is not all right. This is not all right. Freshly created cluster and we did it three times, two out of three times, fell over happened. And another time it didn't happen. So I guess it's like some race condition is present there, right?
Starting point is 00:47:34 And we saw that change was not perpetuated to a channel. So it's actually data loss. If I have super user, it's my right, right? And I have outer systemuser, it's my right, and I have an alter system, we can see it in the PgSettings and it's not yet applied, like pending restart is true. And then we have failover and we don't see it anymore. What is it? It's data loss. Or it's unsupported behavior.
Starting point is 00:48:08 No, it's data loss. I wrote something via... Using Altar system, I wrote something and I lost it. So you can't have it both ways? I see where you're coming from, 100%. But on one hand you want super user access, and on the other hand you're going to call it data loss if you change anything that doesn't get propagated. Let me contradict with myself. I'm too harsh, I know. Altosystem is actually local.
Starting point is 00:48:35 We can change settings on any standby, each standby separately, independently. So it's not guaranteed that it's not propagated by replication. I just expected maybe some automation exists which would propagate it. If this replica is a chair replica, maybe we should match settings. Because this setting requires restart, but what happens to settings that don't require restart? If I change them, and then what? If I change them using the other system, do I lose them? This is not good, right?
Starting point is 00:49:14 But equally, expectations of automations around this stuff could be a nice excuse for other providers not to add CPUs access. So it's yet another... No, no, no. I think it's... No, no, no, no, no, no, no, no, no, no, no, no, no, no. I'm just saying be careful what you wish for. Well, it could happen. Like I change something using UI and... No, because if you use the UI, they change it somewhere else that then does get propagated. Maybe. But I... rooting for Crunch actually here. I think it can be solved this problem.
Starting point is 00:49:52 I don't want them to close super user access. It's great feature, super great feature. I think everyone should be. Based on our last previous week's topic, my position, RDS and others, they are taking too much from us. It stops feeling open source. We don't need to make this point again. Yeah, okay. Well, I'm going to repeat it many times, so it's not all right.
Starting point is 00:50:19 So, Crunch are doing great job here to providing, we are providing super user access and I think it can be solved. And one more thing, if they are really listening, when I restart Postgres, I see restart cluster. I feel like I need to chop this bit up and send it to them as a support thing. Good, yeah, that was good. Nice one. Thank you. Yeah, thanks so much Nikolai and thanks for the request and yeah, catch you next week. See you, bye bye.
