Postgres FM - Logical replication

Episode Date: September 15, 2023

Nikolay and Michael discuss logical replication — some history, initialization, change data capture, how to scale it, some limitations, and ways that it is getting better. Here are some links to some things they mentioned:

Logical replication https://www.postgresql.org/docs/current/logical-replication.html
GitLab upgraded multi-terabyte, heavily-loaded clusters with zero downtime https://twitter.com/samokhvalov/status/1700574156222505276
pg_waldump https://www.postgresql.org/docs/current/pgwaldump.html
pg_dump and external snapshots (blog post by Michael Paquier) https://paquier.xyz/postgresql-2/postgres-9-5-feature-highlight-pg-dump-snapshots/
Failover of logical replication slots in Patroni (talk by Alexander Kukushkin) https://www.youtube.com/watch?v=SllJsbPVaow
Our episode on replication https://postgres.fm/episodes/replication

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, this is Nikolay, and this is episode 63. What are we going to talk about today, Nikolay? Let's talk about logical replication, logical decoding, everything logical. Everything logical, yeah, I like it. So this was your choice, why did you choose it? Just because I realized we didn't talk about it. And also I wanted to congratulate our colleagues from GitLab
Starting point is 00:00:26 on very, very successful major upgrades, without downtime at all and without data loss at all. It was based on logical. So I think I have something to say about logical now, based on that work. I expect materials from them; we actually discussed it with them and agreed that there should be materials published, textual and also talks.
Starting point is 00:00:49 We actually could invite them to discuss it. It's probably a good idea. But it will be mostly about major upgrades involving logical. But let's discuss logical itself. Because it's interesting that more and more systems are using it. Yeah, absolutely. And it's getting better and better, right? It's been getting better in recent releases. It's about to get better again in the next release. And you were just telling me before the call, there's some exciting changes coming in the year ahead, hopefully as well.
Starting point is 00:01:21 Yes, I would like to use the version from 2025, I think, or so, because it will be definitely much better. It's getting improvements every year. There are still pieces that people need but don't have yet. For example, DDL, replication of DDL, one of the biggest pieces left unsolved. But it's already much better than five years ago. Definitely. Wonderful. Should we start there then? Should we start with a bit
Starting point is 00:01:48 of history? Yeah, let's talk about history. Probably. I remember when there was a discussion that replication should not be implemented in Postgres core. It was like 15 years ago; there was an opinion that replication is something external to Postgres. Let's rely on other tools. It's not the job of Postgres to have replication. It's so strange, similar to the discussions of autofailover and so on.
Starting point is 00:02:17 And such systems were created. The most noticeable were Slony, Londiste from Skype, and also Bucardo. I used two of these, Slony and Londiste: a lot of work to set up and to maintain, and a constant headache. Especially Slony, it was a lot of maintenance work. These systems were logical replication; they were based, for example, on triggers: any write that happens on a table is also written to a different table and propagated to a different host. And the overhead was not only maintenance but also performance, obviously, because you basically double your writes.
Starting point is 00:03:12 So you need to write to two places, and to avoid data loss it slows down all writing transactions. Yeah. So, I mean, Londiste was great. It was based on PgQ, and PgQ was implemented in a very good way in terms of scalability and performance, because it was using partitioning, and everything was done well. But still, some maintenance was needed, and it was no pleasure to deal with it. But in a later version, I don't remember which one, maybe 9.0,
Starting point is 00:03:47 physical replication was created, and it became the main way to replicate data from one node to another. Before the physical replication implementation, people relied only on third-party tools like Slony and Londiste, only on logical replication, to achieve HA, high availability. With physical replication, where WAL records are replicated as is, it became much less of a headache. I mean, physical replication is also not trivial, but it's much more reliable, and it's based on the idea that we already have WAL, right? WAL, the write-ahead log,
Starting point is 00:04:35 is our way to recover after crashes. So if we crash, we just redo all changes since the last successful checkpoint, and we are in a consistent, good state again. So the idea was: let's be constantly in recovery state on standby nodes, constantly recovering. That's why, by the way, the function to check if it's a primary or a standby is called pg_is_in_recovery, which is confusing, because you might think we recently crashed. We didn't. I mean, we probably crashed at some point, but if we crashed and are in recovery, you cannot connect. If you can connect and ask pg_is_in_recovery, it means it's not in crash recovery, it's just a replica. I mean, a standby.
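For reference, a minimal check using the function just mentioned (not from the episode):

```sql
-- Returns true on a standby (which is permanently "in recovery", replaying WAL),
-- false on a primary.
SELECT pg_is_in_recovery();
```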
Starting point is 00:05:20 Yeah, that's funny. I didn't know that. It also reveals the implementation detail of how replication was created, based on recovery. And then hot standby was created, meaning that you're not only recovering on the standby nodes, but you also allow read-only queries. Yeah, because why not? And so of course, a lot of work was done
Starting point is 00:05:46 to achieve this, but I'm just like trying to have some high-level overview of history here. And then, after several years of development of physical replication,
Starting point is 00:05:58 the obvious idea was that the very same WAL records we use to provide recovery and physical replication can be used also for logical replication. It means that with logical, it's called logical decoding, you just read what is in the WAL record. If it says insert this data, okay, we can say let's execute an insert, a SQL insert. By the way, it's very confusing for those who use tools like
Starting point is 00:06:26 pg_waldump, and in fresh Postgres it's also available through SQL, you can read WAL yourself and see what's there. It might be confusing to see INSERT, DELETE and UPDATE in uppercase, because it's not SQL, it's a physical operation, at tuple level. If you delete 100 records, 100 tuples, you will see 100 independent DELETE records in WAL. And they will be translated to 100 DELETE SQL statements in logical replication. By the way, it also shows that logical can be implemented differently, in different ways.
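As an aside, a sketch of that SQL-level WAL inspection, assuming the pg_walinspect extension available in recent Postgres versions (pg_waldump does the same from the command line):

```sql
CREATE EXTENSION IF NOT EXISTS pg_walinspect;

-- Show a few physical, tuple-level records (heap INSERT/DELETE/UPDATE, commits, etc.).
-- Replace the start LSN with one that still exists in your server's WAL.
SELECT start_lsn, resource_manager, record_type, record_length
FROM pg_get_wal_records_info('0/1000000', pg_current_wal_lsn())
LIMIT 10;
```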
Starting point is 00:07:08 For example, in MySQL it's very different. It can be statement level: you can just take a statement and propagate it to a different machine. But logical decoding and replication in Postgres are based on WAL, the same WAL we use for physical replication. So if you have an operation with multiple tuples, you will see multiple SQL queries on the subscriber side. It's called subscriber. Logical replication has different terminology than physical replication: not primary and standby, but publisher and subscriber. So logical decoding was created, and it opened
Starting point is 00:07:49 a lot of new possibilities to replicate, based on the same WAL, at the logical level. Meaning you recreate the cluster, you have a different cluster, you lose your bloat, you can have a different set of indexes, you can have a different major version, and you can even have a different database management system. It can be not Postgres but something else, based on logical decoding, right? So it's great. I mean, the development of it has not been super fast.
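For readers who haven't set this up before, the built-in interface is a pair of commands; a minimal sketch with hypothetical names and connection details:

```sql
-- On the publisher:
CREATE PUBLICATION my_pub FOR ALL TABLES;

-- On the subscriber (the schema must already exist there, e.g. restored from a
-- schema-only dump); this also triggers the initial table copy:
CREATE SUBSCRIPTION my_sub
  CONNECTION 'host=publisher.internal dbname=app user=replicator'
  PUBLICATION my_pub;
```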
Starting point is 00:08:18 I mean, it's a very complex thing, and other systems had for many years things Postgres still lacked. But currently, in 2023, logical decoding and logical replication, especially with release 16 and the previous release 15, a lot of work was done to make them mature, and they can already be used for very heavily loaded systems, like dozens of terabytes and hundreds of thousands of TPS. And this can help you achieve zero-downtime operations. Of course, you need to understand many things to achieve this, a lot of interesting things.
Starting point is 00:08:55 And still some things are missing, but things are improving. That's my small overview of the history. Yeah, nice. I don't know anyone who would consider Slony or Londiste these days. I mean, maybe some people think it's better right now. But I would just always consider logical decoding and replication if you need the logical level. Yeah, the only other thing that I've seen, the only thing I've
Starting point is 00:09:20 seen it used for that you didn't mention is that, unlike physical replication, you don't have to replicate everything. So you can take a subset. A filter, exactly. Or maybe you're forced to use logical. For example, if you are using managed Postgres, RDS, Cloud SQL, they don't provide physical replication connections outside of their infrastructure. So it means that if you need to replicate between clouds, for example, or off your managed Postgres service, or you need to migrate, you are forced to use logical replication.
Starting point is 00:09:56 It's a bad thing, but it's also a good thing, because these cloud providers also rely on logical replication to migrate customers from Oracle or from competitors. And this helps develop logical replication further, because they are big users of it. Right. And it's in order to achieve zero downtime, right? It's kind of the change data capture part of it. I mean, we could migrate from those systems through dump and restore, right, but not without downtime. It's the key
Starting point is 00:10:31 aspect that enables you to do it without or with minimal downtime. So downtime is not related to dump restore. I'm not sure why you... So let's say we're moving from Oracle to Postgres, and we want to... We could take a dump modifier and send it through something and put it and restore it as Postgres, right? We could take the data. Well, I'm not an expert on Oracle.
Starting point is 00:11:01 And first of all, AWS and GCP, Google Cloud, have so-called database migration services, DMS. And they support different directions, from system to system. Also, there are commercial systems like Qlik and Fivetran, and there's Debezium, open source. So they help to create replication from different database management systems. When we talk about logical decoding and replication, it's actually either between Postgres instances or from Postgres to somewhere else. But I don't know anyone who migrated from Postgres to
Starting point is 00:11:38 Oracle. But let's discuss the general picture. So when you need to use it, you have two stages. First is initialization, and second is so-called CDC, change data capture. Streaming changes... not streaming, replicating changes, because streaming is about physical replication, let's not mix terminology. So initialization is very challenging if you have a lot of data. I mean, yes, it's kind of a dump and restore, and it can be pg_dump with -j, meaning in parallel; of course, you put a lot of stress on your disk system or network
Starting point is 00:12:18 system. But if you can afford it, it's good. I mean, you can move faster. And there is a big trade-off. For example, if you consider initialization of logical replication for Postgres, if you have, for example, 10 terabytes and 100,000 TPS, what do you prefer? To read from disk slower and have... The problem is that while you are reading, you have already opened a logical slot, because you need future CDC to start from the very beginning of your read. And when you read, pg_dump reads it with repeatable read transaction isolation.
Starting point is 00:12:57 Why? Because we need a snapshot, so all data, all tables are in a consistent state, all foreign keys matching and so on. Even if you run pg_dump -j, there is a snapshot there. And this snapshot cannot go away, because we need to finish the initial copy first, and then start CDC from that point where this snapshot was created. So if you read slowly, disks are okay, I mean, like 10% of your capacity in terms of disk I/O. But it means that you need to keep this... Postgres
Starting point is 00:13:34 needs to keep the snapshot very long. And this will affect, guess who? Autovacuum, right? Because autovacuum cannot delete those tuples which became dead after our snapshot creation. And so we have a trade-off. If we want to disturb autovacuum less, we want to move faster. But it means we will put a lot of stress on our disk system. And every time you need to decide based on the capabilities of your infrastructure in the particular situation. So there is no universal recipe here. For example, if we say we are doing this from some system which nobody is using, for example a dedicated replica with no failover and no load balancing flags in Patroni, it's like a dedicated replica. Probably we can utilize 100% of disk I/O capacity
Starting point is 00:14:30 and move as fast as possible, because it will report to the primary with hot_standby_feedback. Or we can turn off hot_standby_feedback in this case and allow replication to be practically paused for a long time. But we need to adjust settings for it. Is this, this is new though, right? Is this, this is new in 16?
Starting point is 00:14:51 What? Being able to do logical replication from a physical replica. It's a good question. You can create a slot on the primary. Remember its position in terms of LSN and snapshot. And then you can ask open transaction on standby, repeatable read transaction, asking to deal with the same snapshot and read it from there. Yes.
Starting point is 00:15:22 Okay. It's possible. It's like synchronization. And pg_dump actually has not only -j, it also has, I don't remember the option name, the ability to specify a snapshot ID. So you can dump from a standby and not disturb the primary. But on a standby we have a different dilemma,
Starting point is 00:15:42 different trade-off. Hot standby dilemma, we discussed it, I think, right? So it will report to primary and we have auto-vacuum issue again. Or it will not be able to replay physical replication while we are dumping because we have long-running transaction. So hot standby dilemma, it's interesting thing. We can zoom into it one day. But in general, it's possible.
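A sketch of that standby-side trade-off in terms of settings (the parameters are real, the chosen values are just one option for a dedicated standby used only for the initial copy; they can also go in postgresql.conf on the standby):

```sql
-- Option A: don't hold back autovacuum on the primary, and let WAL replay wait
-- behind the long-running dump instead.
ALTER SYSTEM SET hot_standby_feedback = off;
ALTER SYSTEM SET max_standby_streaming_delay = -1;  -- never cancel the dump, pause replay
SELECT pg_reload_conf();
-- Option B is the opposite: hot_standby_feedback = on keeps replay flowing,
-- but the primary accumulates dead tuples for the duration of the copy.
```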
Starting point is 00:16:07 And of course, if you do it from the primary, yeah. So what you're referring to is the Postgres 16 feature to have logical replication, including CDC, from standbys. This is great. This is super great. Okay, cool. Yeah. But I'm talking about the initialization process. Yes, good point. And the last bit of information here: by the way, there is a very good post from
Starting point is 00:16:32 Michael Paquier, I'm sorry for the pronunciation. I am constantly using this; there are some blog posts which I have been using for years and recommending all the time, and people say thank you all the time. I think we should have something like a golden library, how to say, a golden hall of blog posts, because there are some posts which stay useful for a very long time. So there is a post about how to synchronize the position of a slot and pg_dump; it was in 9.5, I think, that this kind of thing was implemented. The only problem: if you want to know the logical slot position, not only in terms of LSN, the log sequence number, but also in terms of this snapshot ID,
Starting point is 00:17:14 you need to create the slot. You cannot create the slot with SQL. You need to use a replication protocol. So it's kind of a different type of connection. And many vendors of logical replication systems, including Qlik, don't understand that for years. And I tried to help them, they don't understand. They continue using SQL. In this case, you cannot synchronize properly and you have duplicates if your subscriber lacks unique keys. So inserts might overlap and happen twice.
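A hedged sketch of the synchronization trick being described, with hypothetical names (the exact replication-protocol syntax varies a little between Postgres versions):

```sql
-- Open psql with a replication connection: psql "dbname=app replication=database"
-- Creating the slot this way both fixes the slot's starting LSN and exports a snapshot:
CREATE_REPLICATION_SLOT cdc_slot LOGICAL pgoutput EXPORT_SNAPSHOT;
-- The reply includes consistent_point (the LSN) and snapshot_name; pass the latter to
-- pg_dump --snapshot=<snapshot_name> so the initial copy sees exactly the state at which
-- CDC will start, with no gaps and no duplicates.
```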
Starting point is 00:17:46 So this is initialization. One last bit about initialization. You can convert physical replica to logical. This is super cool for very large databases. You just use recovery target LSN. You create slot, you know LSN. You physically replicate until this LSN using recovery target LSN. And then just reconfigure Postgres to
Starting point is 00:18:05 continue with CDC with logical replication. It's possible. This is so cool. Last time when we were talking about replication more generally, you brought this up, but I think it was a theory. You'd done it in tests, but you'd never done it in production. I believe now
Starting point is 00:18:21 you must have done it. Well, I'm guessing that's how GitLab did it. One of the tricks is this. The only problem if you first achieve recovery target LSN, switch to logical, and then upgrade, this is a problem. And there is a post on hackers which discusses
Starting point is 00:18:37 why on logical you cannot run pg_upgrade. You need to first run pg_upgrade and then switch to logical. And this is one of the features which I think is also under development, to make it officially supported without additional manual actions. But it's doable already. So yeah, this is what was used there. So again, how to understand what you need, which kind of provisioning you need: if you have less than one terabyte, probably you will be fine everywhere. If you have one to 10 terabytes, probably you would like to read faster, to cause less bloat. By the way, if you cause bloat on the old
Starting point is 00:19:17 system and you are going to switch to the new one, you don't care. Yeah, the degradation. Since with logical replication, in this case, we are dealing with a fresh cluster, you're losing your bloat. I even have a joke: with RDS, they don't allow physical replication, they allow only logical replication. So they don't allow me to download my bloat. I can download only my data, losing my bloat. But what if I need my bloat?
Starting point is 00:19:49 In some cases, for example, testing, like lower environments for experimentation, you probably want your bloat to be preserved, right? Because this would help to have more realistic testing, more like in real production. I i mean i get what you're saying but you could create your own like in a way you are creating your own bloat on the logical replica right if you're if you've got auto vacuum running in the same way and you're running the same workload you probably are getting bloat in yourself right but imagine it's not the same workload, you probably are getting bloat yourself.
Starting point is 00:20:27 It's not the same bloat. Exactly. If you have 90% bloat, the real number of pages is higher. It will affect plans. It's not a good problem... I'm very familiar with this problem. But you're getting kind of like a different bloat. Different bloat is different.
Starting point is 00:20:49 Okay, but it's a different topic. So if you have one to two terabytes, for example, you probably want to be very careful optimizing how you do this initialization of logical replication. But it's also important whether you want everything or some part of it, right? Right, true. Above 10 terabytes, I will probably think about... if above 10 terabytes, and if you
Starting point is 00:21:14 want everything, it's better to convert a physical replica to logical, preserving all bloat, by the way. And what is converting a physical replica to a logical replica? Again, you have a physical replica, you replay, you say recovery target LSN,
Starting point is 00:21:35 putting in the position of the slot: you created the slot, you know the LSN, you put it into your physical replica and say pause, for example, or promote, it depends. I don't remember, this is also an interesting detail, I don't remember our decision actually here, we experimented with it.
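A rough sketch of that sequence (parameter and function names are real, the identifiers and LSN are hypothetical; tooling like Patroni manages some of this for you):

```sql
-- 1. On the primary: create the logical slot and note the LSN it returns.
SELECT slot_name, lsn
FROM pg_create_logical_replication_slot('upgrade_slot', 'pgoutput');

-- 2. On the physical replica being converted: replay exactly up to that LSN
--    (recovery_target_* settings take effect at server start).
ALTER SYSTEM SET recovery_target_lsn = '5A/228E36A0';  -- the LSN from step 1
ALTER SYSTEM SET recovery_target_action = 'promote';   -- or 'pause', as discussed next
-- Restart the replica; it stops exactly at the target, and CDC then continues from
-- 'upgrade_slot' (e.g. via a subscription created with create_slot = false, copy_data = false).
```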
Starting point is 00:21:53 You can say either promote or pause when it reaches the LSN. Once it's done, you can say, okay, now it's a logical replica. All my bloat will be with me. That's both bad and good. I mean, it's good for testing, it's bad if I would like to get rid of my bloat. If you provision a logical replica in the regular way, you are getting rid of bloat; it's like a dump and restore, which is a good thing. But in this case, if we have a lot of data, like 15, 20, 30 terabytes, probably we want to convert a physical replica to logical. But only if we need all of it. For example, if we need to take only 10% of it,
Starting point is 00:22:35 it's kind of resource waste. You will need to perform a huge cleanup afterwards. Yeah, of course, because you're physically replicating the entire thing for only 10% of the benefit. Right, because physical replication doesn't allow you filtering. While with logical, you can say, I want these tables only. Yeah, or even as of recently, you can even do just these columns from these tables. It's pretty impressive how much you can filter that now. Right. Next step would be let's probably replace PII with something, right? Yeah, I guess that is a change data capture problem. Yeah, you
Starting point is 00:23:14 could. That's cool idea. Yeah. It's like, let's create some rules and then some tables, some columns, we will apply some modification on the fly. And during initialization as well. This would be a super cool feature, I think. Nice. Some regexp or something. Yeah. I think it's doable.
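Coming back to the filtering mentioned a moment ago: the built-in publication options cover tables, columns and rows, but not transformation. A sketch with hypothetical table names (recent Postgres versions):

```sql
CREATE PUBLICATION partial_pub
  FOR TABLE orders (id, customer_id, total),              -- only these columns
      TABLE events WHERE (created_at >= '2023-01-01');    -- only matching rows
```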
Starting point is 00:23:33 And it's doable right now already if you go down to the logical decoding level, losing all the features native to logical replication, and just apply changes yourself, making some transformations if needed. But again, if it's a small database, use the regular approach. If it's a larger database, think about this trade-off, disk stress versus the overhead related to vacuuming. And if it's a super big database, consider the physical-to-logical approach. It's an interesting trick, underappreciated, I think, right now. Nice. Do you think there's a limit size-wise? If you're accumulating
Starting point is 00:24:11 changes on the primary during that flip, what's the impact? This is already about CDC. Okay, sorry. Do you want to move on to that? Yes. So CDC is interesting. First of all, most of the problems you will experience with Logical are related to CDC. Inability to apply changes from standbys, from physical standbys until Postgres 16, which hopefully will be released the day before we published this podcast since we are not online. Fingers crossed. Fingers crossed, right. So this is solved. Why this is important? Because nobody wants to start from scratch if primary is down, for example, and it's good to be able to replicate
Starting point is 00:25:00 from standbys. Also, in terms of overhead: of course, logical decoding is a small overhead, but there are cases when the WAL sender, which sends data for logical replication, can saturate one core. A single slot is always a single process, a single WAL sender. And usually its utilization is quite low. But in some cases, for example, if you delete a lot of rows and then roll back, delete a lot of rows, roll back, again and again, this is a harmful workload for logical replication. And at a rate of one or two WALs per second, you will see a WAL sender saturating your CPU, even quite a good one, like a modern Xeon or something. You will see 100%. And it will be the bottleneck of
Starting point is 00:25:54 logical replication, which, by the way, is very hard to solve, because even if you, for example, decide to use four slots, all of them will be parsing everything; even if each needs only part of the data, they will still parse all the data. So you will see it on all four slots. But it's very rare. It's interesting: in my experience, I saw it in non-production and I thought it was a big bottleneck, but in reality our real workload is not like that. It's not like marking a lot of tuples with xmax, meaning they are deleted, and then rolling back,
Starting point is 00:26:31 no, no, normal writes are different. So probably you won't see saturation of WAL senders. If you do see saturation of WAL senders, CPU saturation, it's bad, it's not solvable right now. Bad news.
Starting point is 00:26:47 But again, it's rare. So probably you don't, like being able in Postgres 16, being able to perform CDC from standbys, from physical standbys, it's good because you don't have overhead on primary. And different thing, what happens if primary is down i don't know if
Starting point is 00:27:06 if it's possible to... I'm new to Postgres 16 also. So a big problem is how to sustain various failures, because logical replication should also be highly available. Right now, at least until Postgres 16, if the primary is down, or the node from where you perform CDC, change data capture, is down, you need to start from scratch. You need to redo initialization. And this is a huge headache. I see the work to sustain switchovers and failovers ongoing for Postgres 17. And I also know that, using some tricks, Kukushkin implemented support for switchover in Patroni.
Starting point is 00:27:51 So we logically replicate from primary, but if we switchover, we don't lose slots. Slots should persist. This problem, I think it's not solved in 16, but it looks like good chances it will be solved for 17. And it's solved in Patroni using some tricks. Use on your own risk. I remember, I think Craig Ringer helped in the GitHub issue with some good insights how to implement it correctly.
Starting point is 00:28:20 But officially, it's not supported. I mean, officially by Postgres, it's not supported. And it's a big problem of CDC, right? Yeah, that makes sense. I didn't know it was solved in, or there was a solution for it in Patroni. That's very cool. Yeah, so there are two problems, big classes of problems.
Starting point is 00:28:38 High availability and performance. I already touched a little bit both. About high availability. If node is down, okay, use Patroni and this feature is good. What else bad can happen? For example, DDL can happen. Yeah. And DDL is not solved.
Starting point is 00:28:56 It's in commitfest, requires review. There is, I think, version 4 of the patch proposed. It's good to see that it's in progress. This will be super good if it gets solved in future Postgres versions, because
Starting point is 00:29:16 when DDL comes, usually you can at least mitigate it in many cases. What happens? CDC stops. I mean, logical replication stops. The slot starts to accumulate a lot of changes, with the risk of running out of disk space.
Starting point is 00:29:38 And you just need to replay the same DDL on the subscriber; in most cases it will help. Yeah. And then it will restart automatically, or do you have to kick that off again? It will pick it up. Here we must say: logging all DDL is a good idea, always. All DDL should be logged, must be logged. You will save a lot of nerves if you have logs of DDL. For example, here you just fix it manually; I did it manually many times. Some DDL happens, especially if you have some automated partitioning.
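A sketch of that mitigation (the settings and commands are real; the ALTER TABLE is a hypothetical example of the DDL that broke replication):

```sql
-- On both sides, keep a record of every DDL statement in the server log:
ALTER SYSTEM SET log_statement = 'ddl';
SELECT pg_reload_conf();

-- When the subscription worker stops with an error about a missing column or table,
-- replay the same DDL on the subscriber and let the worker retry:
ALTER TABLE orders ADD COLUMN discount numeric;  -- same statement as on the publisher
-- For newly created tables you would also run:
-- ALTER SUBSCRIPTION my_sub REFRESH PUBLICATION;
```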
Starting point is 00:30:09 Yeah, true. I've heard people, like, for example, if you're doing a major version upgrade using Logical, I've heard people try to put a feature freeze on, you know, no changes while we're doing the upgrade. But if you're partitioning, that's happening automatically. You need to pause automatic partitioning. Or create the next few
Starting point is 00:30:29 in advance, depending on how you're doing things. Another interesting thing is that some DDL can be left unnoticed. For example, if some index was created on the publisher, nobody will notice because indexes don't
Starting point is 00:30:45 matter. But you can lose this index and one month later you will realize you have slow queries again and you need to recreate it. Because it doesn't work. Index creation won't block you. But it's also not replicated. Well, that's also a feature
Starting point is 00:31:01 though, isn't it? I've heard of people using it for analytics database where you want the data indexed in a different way or you want to do roll-ups on it or something and you don't want those indexed. So I wonder if they're going to do it so it's optional or you can control which DDL gets replicated. Right. Yeah, I agree.
Starting point is 00:31:23 Yes, that's a good idea. And with logical replicas, we don't have the hot standby feedback problem, but we might have a problem with the lag. For example, in many cases one slot is not enough. And I'm talking not about the WAL sender, which is quite rare, I'm talking about the logical replication worker on the subscriber side and its ability to catch up with all the changes in proper time. It can be related to disk I/O or CPU saturation; I saw both. And in this case you might want to consider multiple slots. But here we have an interesting problem many people don't realize.
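As a quick aside before that problem: two things worth having in place when you rely on slots, shown as a hedged sketch (the size cap comes up again later in the episode):

```sql
-- How much WAL is each logical slot holding back on the publisher (i.e. the lag)?
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical';

-- And the safety cap (Postgres 13+): invalidate a slot rather than fill the disk.
ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';
SELECT pg_reload_conf();
```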
Starting point is 00:32:00 It's a minor problem actually. So the problem is that if you publish everything this DDL is very light you say I'm going to publish all tables to this slot it's easy we can just continue
Starting point is 00:32:18 But if you want to balance, first of all you need some balancing algorithm, and it's good if you base this algorithm on tuple statistics from pg_stat_all_tables, saying, okay, we want to distribute writes between four slots or 10 slots as evenly as possible. But then you are going to publish a bunch of tables to one slot. And this publishing requires ALTER PUBLICATION, and ALTER PUBLICATION, guess what, requires a lock, unlike publishing all tables. And when you want this lock, guess who can block us?
Starting point is 00:32:55 our old friend, autovacuum running in transaction ID wraparound prevention mode, which every DBA should hate, because you prepared a beautiful procedure, you are going to run it at 1 AM, not sleeping, and then... you forgot about this. So for any changes, any DDL, anything complex, always keep in mind that autovacuum yields and will kill itself, unless it's running in transaction ID wraparound prevention mode, which won't yield, and then you need to take care of it yourself. So this is interesting, but fortunately this kind of issue, this kind of conflict, won't
Starting point is 00:33:42 lead to downtime, unlike ALTER TABLE ... ADD COLUMN, because ALTER TABLE ... ADD COLUMN will block selects. This kind of lock, I don't remember the details, but the kind of lock which is needed to add a table to a publication, even if you cannot acquire this lock and sit waiting for 10 minutes, for example, and it happened one day in my experience, you will just be blocked. This is the worst outcome. Selects will still be running. This is good. What about modifications, like updates, deletes? Yes, yes. So, DML won't be blocked. So likely your users won't notice it. So it's a super interesting thing, how to distribute between slots.
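A sketch of doing that per-table publishing defensively (publication and table names are hypothetical): take the lock with a timeout and simply retry, so a non-yielding autovacuum or anything else holding a conflicting lock can't keep the session waiting for long.

```sql
SET lock_timeout = '3s';
ALTER PUBLICATION pub_slot_2 ADD TABLE orders, order_items;
-- If this fails with "canceling statement due to lock timeout", retry later
-- (or deal with the wraparound-mode autovacuum first).
```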
Starting point is 00:34:28 And there is also another very big caveat to having multiple slots. With a single slot, all things are simple. With multiple slots, for example, imagine table A and table B, and they have a foreign key between them. Table A is published to one slot, table B is published to a different slot. Like a race condition or something. Two slots, two different lags. Right, so foreign keys are a problem.
Starting point is 00:34:53 By default, they will be ignored, by the way, by logical replication. I didn't know about it until recently. So logical replication ignores foreign keys and allows eventual consistency if you use multiple slots. Wow. So Postgres eventual consistency is an official approach in Postgres as well.
Starting point is 00:35:14 Wow. Indeed. Just use logical and multiple slots, that's it. And you can enforce it, of course, but you won't like it if you enforce it. So it depends on the situation, of course. But if, for example, it's for a big migration from one major Postgres version to another major Postgres version, or you deal with a glibc version change, something like that, an operating system change, in this case you probably want to allow it, just don't redirect read-only traffic to such replicas. Yeah, yeah.
Starting point is 00:35:48 Of course, don't redirect read-only traffic because it will be split-brain. Or multi-master split-brain. Multi-master is almost the same, right? I'm joking. And you probably want a short pause before resuming traffic. Of course, to catch up all the slots, to ensure no changes are lost.
Starting point is 00:36:04 Of course, of course. And in this case, it will be already in a consistent state again, because publisher guarantees foreign keys on publisher side guaranteed. But it's interesting, right? So if you use multiple slots, you cannot redirect it on the traffic as well. It means testing strategy is different. The only thing I don't think we've covered that I had to
Starting point is 00:36:28 look out for was I've seen a few people put issues with sequences, so like those not synchronizing between the publisher and the subscriber. Honestly, I don't remember details. I think there was some work done to support it. Eventually
Starting point is 00:36:43 it should be supported. Not in 16, right? Not yet. I don't think so, no. Right. But it's not a big problem. You just dump all values and you propagate them to new primary and just add some constant for safety. Sequences are 8 bytes.
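A sketch of that sequence fix, with hypothetical names and a placeholder value taken from the old primary:

```sql
-- On the old primary, capture current sequence values:
SELECT last_value FROM orders_id_seq;

-- On the new primary, before switching traffic, set each sequence to that value
-- plus a safety margin (sequences are 8-byte, so a big gap costs nothing):
SELECT setval('orders_id_seq', 123456 + 1000000);  -- 123456 stands in for the captured value
```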
Starting point is 00:37:02 So if you add 1 million, nobody will notice it. I mean, in some cases, of course, big gaps are a problem, it depends. But it requires some additional coding. It's not as annoying a problem as the lack of DDL replication, in my opinion, in the cases I had. Yeah, it's solvable, as opposed to not solvable. I've just seen it trip people up on their first attempt. Right. Or just use
Starting point is 00:37:31 UUID, but version 7, please, which is not officially supported by Postgres, but some libraries in your favorite language already support it. By the way, the news from there is that progress on the standard is happening. Hopefully we will see the standard, I mean the RFC for UUID version 7, released
Starting point is 00:37:54 soon. And we already have a patch for Postgres. If things go very well, Postgres 17 will support this. Very cool. Because Postgres is waiting for the RFC to be finalized. Nice. And these are like timestamp-ordered UUIDs? Lexicographically, yes. So basically, if you order by them, new IDs will be close to each other. Right. Great. An orderable UUID. There are many things about logical... there are many things, I don't remember everything, I just
Starting point is 00:38:32 wanted to share some of the most challenging things I had recently. Yeah, that's wonderful. I've learned a lot about multi-slot especially. I've learned many things about it myself recently. But I think, honestly, the good thing about logical replication is,
Starting point is 00:38:51 if you do it very carefully, not allowing a slot to accumulate a lot of data and run out of disk space... By the way, Postgres 15, I think, or 14, one of the previous versions, got a setting to say the maximum number of bytes in a slot. If you reach it, it's better to kill the slot than to allow running out of disk space. Disk space, yeah. Right. So if you do it carefully, users shouldn't notice your experiments, and you can run these experiments on production, test it, and prepare a good procedure, and then
Starting point is 00:39:26 when it comes to deployment, you already have everything verified right on production, right? So this is the good thing about it. And if you just start with one slot and see it's enough, good, you're lucky. If you have, for example, multiple thousands of tuple writes per second, for example 10,000 tuple writes per second on the primary, it's quite a significant load. So in this case, probably you will need to have multiple slots. In this case, if you have partitioning, you will say thank you to yourself, because if you have a huge table and don't partition it, it will go to one of the slots, representing, for example, like 50% of all load. Partitioning helps here. It doesn't help in some cases, but here it helps a lot to redistribute writes among slots. And then if you use multiple
Starting point is 00:40:19 slots, be careful with this foreign key issue. Yeah, because your replica will be inconsistent in terms of referential integrity at any given time until you perform a switchover. And switchover is interesting, probably we should discuss switchover one day. Also, I like your idea of the
Starting point is 00:40:40 algorithm to work out which tables should go where but the foreign key thing makes me think if you've got clusters of tables that have foreign keys between them and a separate cluster that have foreign keys. If you could separate those, that would be cool. I explored this idea with machine
Starting point is 00:40:55 learning, with k-means. It works quite well in some cases. It depends on your schema, of course. And we know K here: if you want four slots, our K is four.
Starting point is 00:41:09 So it's easy. But unfortunately, in well-developed systems, relationships between tables are usually quite complex. But it's a valid idea. I agree. Yeah.
Starting point is 00:41:22 Awesome. Well, thank you so much, Nikolay. Thanks, everybody, for listening, and see you next week. See you. Oh, of course, as usual, a reminder not to forget:
Starting point is 00:41:32 we have YouTube. We have podcast published everywhere, literally. So Spotify, Apple podcasts, everything. Just go to PostgresFM and subscribe and you will see it and choose your favorite method. As usual, please subscribe.
Starting point is 00:41:47 Please leave reviews, stars, reviews, and leave ideas on YouTube or on Twitter, what you want. We have a document where you can add your idea for future episodes, some feedback.
Starting point is 00:42:01 So we would really appreciate it. And on YouTube, we continue publishing polished transcripts, subtitles, right? So if you want, you can recommend us to your friends who barely understand my Russian accent or Michael's British accent; subtitles help a lot. And also they can be automatically translated quite well into many, many languages by YouTube itself, so you can see the subtitles in your language. Yeah, so YouTube's great for the transcription and for the translation. And for the podcast ones, we
Starting point is 00:42:38 do edit out some of the ums and the errs, and also some bits where we go off topic a little bit too much. So if you want the slightly shorter version, then those are there for you too. Right, so we invest quite an effort into these subtitles, and we will continue improving. Okay, thank you, see you next time. Bye bye.
