Postgres FM - Replication

Episode Date: December 16, 2022

Here are links to a few things we mentioned:

Replication (docs)
9.0 release notes (including built-in replication)
Slony
Londiste
pg_is_in_recovery (pgPedia page)
Patroni
Multiple Databases with Active Record
hot_standby_feedback (docs)
max_standby_archive_delay and max_standby_streaming_delay (for WAL-shipping and WAL-streaming respectively)
synchronous_commit
Synchronous replicas
Database Lab Engine
Neon branching
Past, Present, and Future of Logical Replication (Postgres TV with Amit Kapila)
Failover of logical replication slots in Patroni (Postgres TV with Alexander Kukushkin)
PostgreSQL Subtransactions Considered Harmful (blog post by Nikolay)
Why we spent the last month eliminating PostgreSQL subtransactions (blog post by GitLab)
Fivetran

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc. If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai. Hey Nikolay, what are we talking about today? This week is my choice and I choose replication. It's just one word, but a huge, huge topic. Let's try to talk about replication, trying to focus only on areas which are interesting to back-end developers and architects, so for people who create something, build something and so on. Yeah, exactly. It'd be cool to cover the different options people have
Starting point is 00:00:36 and the problems they're generally trying to solve when they're looking into this. We also had a couple of requests from listeners that covered this. So we've got the differences between physical and logical, and also challenges on different managed services, especially RDS. So that'd be cool. And options like WAL shipping and WAL streaming. Two subtypes of physical, right? Exactly. Let's start from the problem.
Starting point is 00:01:00 Like, why at all do we need replication? I see three areas where replication is needed, and two of them are major, and third is kind of controversial. So first question where replication might be needed is high availability. For example, in cloud, our VM can be down anytime, so we don't want to have downtime, right? So we want to have almost zero downtime or downtime less than one minute always. Yeah, even if we self-host the machine, things can go wrong with one
Starting point is 00:01:35 and for it to fail over to a secondary is super helpful. I see primary and standby as the language in the docs. Self-host, you mean you have some service in data center. In this case, you have perfect opportunity to go a different route and go blame your provider of hardware and so on. So in the case of clouds, this is less possible. So usually, and the issues happen more often actually naturally you can have very good hardware so each server is running years in cloud is also possible but less often and this is this
Starting point is 00:02:14 is the first option anyway second is i have high cpu or like i i have almost saturated disk io or everything is out of cache or something like I have resources which are close to saturation so one machine is not enough and in my opinion this second option in clouds they it's also definitely possible but you in clouds it's easier to just increase resources switching to more powerful machine right which is like less convenient to do out of clouds. So this reason triggers more like maybe sooner if you're not in cloud. But anyway, in cloud it also triggers
Starting point is 00:02:55 and at some point you want to offload reads to different server and Postgres allows it. And the third option is using replication in addition or instead of backups. And usually it's a bad idea. First of all, if you, for example, you lose, like, if you lose whole one node, it's about HA.
Starting point is 00:03:17 It's not about disaster recovery and backups. But if, for example, someone deleted something, replication immediately delivered this, propagated this change to all nodes. So that's why we always say replication is not for backups. But in Postgres, for physical replication, there is an option to have a special replica which will be delayed by several hours, for example, like four hours. And if someone deleted something, you can use that replica and actually promote it,
Starting point is 00:03:47 not promote, replay additional walls until the point right before deletion happened. And if you know this time, because sometimes it's not easy to find. And then you can restore manually, restore the data which was deleted. It's called delayed replica approach, but I like it not so much
Starting point is 00:04:05 because it's better to use branching or like, for example, Database Lab Engine and to have a specialized replica with snapshots every, for example, two or four hours or one hour and have a recipe to do the same point-in-time recovery starting from some snapshot. In this case, it's very fast. But always there is an option to restore from backups. You should have backups with point-in-time recovery with continuously archived walls so you can restore from there. But if your database is, for example, one terabyte, you should be prepared to spend roughly one hour. If it's 10 terabyte, 10 hours, roughly, sometimes five. So back to the main. So we will focus on two main use cases. We want to offload some workload to different server, or we want also to achieve very low downtime if something bad happens with some node.
Starting point is 00:04:54 And of course, obviously, interesting, historically in Postgres, there was some idea this should be implemented outside of engine. Replication, I mean. It only, as I remember, in 9.0, replication was implemented in core physical replication. Oh, really? That recently? Am I right about 9.0? Or it was 8-something?
Starting point is 00:05:15 We should check. But it was not long ago. Like in the middle of, for example, in 2005, we didn't have replication, and still there are ongoing discussions, like we shouldn't have it in the engine. It should be outside. And there was a system called Slony,
Starting point is 00:05:30 which implemented logical replication, actually, based on triggers. And it was very difficult. Like, you need to allocate a single whole DBA to maintain it. So it was not easy to maintain. And then Skype implemented LoneDist also based on triggers. Slightly better in terms of maintenance. And the good about those systems was that you can add additional logic easily because you can specify which tables you want to take and you can actually make transformations on the fly, a lot of flexibility.
Starting point is 00:06:06 But of course, this is based on triggers. There is overhead. You write to a table and trigger writes to an additional table, right? So then consumers propagate the changes to that additional table. So kind of significant overhead. But you can replicate from one major Postgres version to another major Postgres version, great. But nobody guarantees that subscribers or recipient node will have the same data. There are many risks when you use this system. So it's not as reliable.
Starting point is 00:06:39 And then based on wall, wall existed, write a headlock. so postgres writes any change first to write a headlock and then to and synchronize it with disks right so it's it's guaranteed that write a headlock has all physical instructions of all changes so based on wall the idea is wall is is used when we replay postgres after crash, replace changes since the latest, last checkpoint. And the idea was, let's take this mechanism and have another node which will be always replaying these walls. And it will be constantly replaying them, and that's it. It won't be allowed for this node to write itself, always just to replay.
Starting point is 00:07:27 So like constantly being in recovery mode. That's why function to check if it's primary node or replica node, standby node, it's called PG is in recovery. It's very confusing. Like each standby node, each replica, it's like kind of in recovery constantly. But there is additional recovery process for standby nodes after crash it also can crash right so there is recovery
Starting point is 00:07:53 and then we like there are two types of recovery state i would say like normal and not normal so it's a really good point though because i think some people come to this haven't had experience of other databases and wonder if there is a way of having multiple nodes that they can write to in Postgres. And that's not an option in vanilla Postgres, whereas there's some kind of Postgres compatible database that do allow that. We live in a world of you have one primary that you can write to and then you can have multiple replicas from that that you can read from as well. Right. For physical replication, physical replication is based on wall, and it's just we are constantly in recovery, but accepting connections. Okay, we have physical standby.
Starting point is 00:08:37 What we need to understand that it's everything. Unfortunately, Postgres wall, it's like it's a single thing for whole Postgres server, so-called cluster and Postgres documentation. So we cannot take only one database, unfortunately. Although such projects exist, and maybe it will be implemented at some point, other database systems support it. For example, SQL Server. So what we have here, we have Node, which is constantly replaying walls.
Starting point is 00:09:04 It means that everything is there. Everything, all tables, even bloat is there, which is constantly replaying walls it means that everything is there everything all tables even blood is there which is good actually sometimes and physical this is called physical replication and you mentioned in request there are two subtypes of physical replication based on wall shipping and historically it was implemented first and second streaming why we have two Because they're slightly different. Wall shipping is more like, I would say it's more reliable. And also you can ship walls from archive,
Starting point is 00:09:32 from object store, like S3, AWS S3. And even primary won't know about you. It's convenient. You can have a lot of replicas like that. But lag can be significant, like some seconds already. Okay, interesting. Yeah. And all files are 16 megs by default.
Starting point is 00:09:53 As I remember, RDS increased them to 64, if I'm not mistaken. But to get that file, to store it, it takes some time. So that's why streaming replication was additionally implemented in an effort to minimize the lag and also to get better control of replication slots. Replication slots is just the idea that it should be visible that we have standby nodes and we should be able to understand what are the lag there
Starting point is 00:10:21 and you can see even detailed lag there. So when you design some system, you can understand can understand okay we will have standbys they will be asynchronous by nature so lags are possible and we can also see from sitting on the primary what are the lags including phases of the lag like in transfer or like flashing to disk or replaying, like where, like how lag is, we can zoom into lag and see it under like closer. So standby nodes are asynchronous, but it's possible to make them synchronous as well. With obvious trade-off, if we say we need synchronous behavior,
Starting point is 00:10:59 then we need to, our transaction at commit time will need to wait until the this transaction confirmed to be written committed on both nodes or on three nodes it's very expensive in terms of writing transactions right but with funny trick you can design it your system that not every transaction will behave like that you can be you can have no this by nature, but then say this transaction is super important. So I want this transaction to be, to behave as, as synchronous. It's just controlled via a synchronous commit, right? So you can say, I want this to be saved on multiple servers. This is, this is trick. Not everyone knows. Right. So, yeah, I didn't know that. That's cool.
Starting point is 00:11:43 Do you use that? No, I actually, I see a lot of companies and people who use synchronous nodes. And also there is an option to say, okay, I have many standby nodes. I want at least one of them to receive the change. It's called quorum commit. You say how many nodes should receive it out of all existing nodes. It's also possible like there are many things that are now possible for physical replication but somehow we live with asynchronous nodes and lag are very small we control them usually using slots and later newest
Starting point is 00:12:16 postgres versions have protection there is like i remember my ops people told me they are afraid of slots because if some node is down, slot is accumulating a lot, big lag, and then you're out of disk on the primary. Now we have protection. I don't remember the parameter, but you can say not more than this number of gigabytes. Then this node will be considered as completely lost and slot will be like not working anymore. But you're safe your 3d space on the primary so like it's it's quite powerful but for if if i'm a backend developer i
Starting point is 00:12:53 need to understand that nature is by default is asynchronous if i want synchronous i won't be very careful i won't understand all network latencies because network is involved now, right? And of course, if I use standbys only for a shape, high availability, it matters less. And I, of course, should use Patroni or something. Patroni is the most popular autofillover. By the way, same idea. It should not be in core. That's why it's outside. Here, in replication, the idea, let's have it in core one for autofillover not yet it's
Starting point is 00:13:28 still outside my opinion everything should be in core but there are many things obviously that prevent it from happening so if you have ha you don't put load there you have a lot of money you know why not then you don't deal with like you just need to control the lag in terms of the health of this node right and primary is down and lag is known to be small okay data loss is small if it exceeds for example in patrony we control it we can set a loud lag for failover event if it if Patroni knows that the lag is too high, failover won't happen because we don't want data loss. So we can prioritize.
Starting point is 00:14:12 We have trade-off here. What to do? Failover and have a lower downtime, or we want to save all our data. We don't want data loss, so it's better to have downtime, but try to fix primary and save data. There is trade-off here. If people want to read up more about this kind of thing, I've seen people using the, like, if your boss is demanding zero data loss and instant failover,
Starting point is 00:14:39 I've seen the letters, the acronyms RPO and and rto and people have written a lot of useful guides around well rprto is good to understand but it's more like ops terms yeah exactly how to define as slo sla rpo rto yeah returning to these subtypes we have wall shipping and we have streaming of course streaming should be preferred if you want a smaller lag on your nodes. And especially if you want to offload reads, you want to offload it. So to reduce load, very often it's CPU load, sometimes it's disk IO. You want to offload it to replicas, standby servers, you probably would prefer streaming and with replication slots for better observability.
Starting point is 00:15:30 But the trick everyone implements, everyone who deals with such system, is, okay, sometimes we have lag, a few seconds. What if, and for example, we have a page or a mobile app and user created something and then immediately tries to search trying to check the result if user is very fast and our node has some lag user might see in like no new items in the result set where is my where is my post or comment or something in this case every every experienced backend developer implements the pattern sticky reads i i say i i named it like this sticky reads so you say okay in my session or this user or something
Starting point is 00:16:22 you need to identify somehow your user. If you have sessions, it's session. If you have like REST API, no context on server. So maybe you need somehow to identify like customer ID or user ID. So you say, okay, some recent data change happened. And we mark this user as recently changed database. Some flag in cache cache in some cache like memcache radius anything and in this case you need to teach your backend to work only with
Starting point is 00:16:53 primary for some period of time stick all reads to the primary that's why i call it sticky sticky reads or sticky selects and the funny thing that a couple of years ago, Rails, Ruby on Rails, they implemented it. So it's there and you just can use it. That's it. I think every framework should think about it
Starting point is 00:17:14 because they should think about scaling and performance issues because if you don't scale, you will be, users will be suffering in terms of performance. The closer you to saturation, for example, of CPU,
Starting point is 00:17:29 the bigger problems can happen. So CPU is good. Yeah, that sounds cool. The other thing I see people using replicas for in terms of reads are kind of read-only parts of the app, so like dashboards or their own internal analytics, that kind of thing? What risks are there?
Starting point is 00:17:52 Postgres doesn't allow to do it without consequences. You have two options and both are bad. You just need to choose which is less bad in your case. First option, you have regular standby, which means host and buy feedback is off. Host and buy feedback signals to the primary the state of our standby. It's controlled on standby, on each standby, you can define it. And it says, okay, if it's off, I don't care. Primary shouldn't know about where I am. But this means that any transaction that lasts longer,
Starting point is 00:18:27 and all analytical queries tend to last longer, right? They will pause replication. And there are parameters for streaming and for wall shipping separately, which define maximum duration, because after which we say, okay, it's enough. And we cancel transaction in favor to replay new changes from wall streaming or wall shipped doesn't matter this is not good because it makes your node like single user because other users like for example one user is using okay i'm
Starting point is 00:18:59 using it first danger is that at some point, my read-only select can be cancelled. But I can adjust this parameter and allow a couple of ups. But other users coming to this server observe huge lag. That's why I say virtually this is a single user node. You're alone there. Others will suffer, so it's not convenient. That's why in this case, there is another option, which is also bad. It's called
Starting point is 00:19:29 constant by feedback being on. We turn it on and standby reports to the primary about Xmin Horizon. We have transaction IDs and we have many transactions happening at the same time, like constantly, because it's a multi-user system, right?
Starting point is 00:19:46 And there is a vacuum which should clean that tuples which already dead for all transactions. But how to define for which, like all? To define all here, we need to understand which is the oldest transaction still ongoing. And this is called like transaction still ongoing. And this is called Xmin Horizon. Xmin is because, I remember, you can select Xmin from any table. There is always such column there. Our listeners who listened to the first episodes know.
Starting point is 00:20:15 I like this, like hidden columns in Postgres. Xmin, Xmax, CTID. So Xmin is like birth date in terms of transaction id you can see when each tuple physical row version was created in which transaction so we have multiple transactions and the oldest one is our xmin horizon and this is reported to primary so primary auto vacuum all after vacuum workers on primary respect this and don't delete the tuples because they are considered still needed on some replicas and imagine if you do it some you run some query on some node with hostback hosted by feedback turned on for many hours it will lead to auto vacuum unable to delete the tuples in all tables on the primary.
Starting point is 00:21:05 All tables. It's like cluster-wide problem. So we will eventually accumulate bloat. Later, transaction finished, autovacuum deleted, and we have gaps. Gaps is bloat. Bloat should be distinguished from dead tuples.
Starting point is 00:21:22 But dead tuples, accumulation of large volumes of N lead to bloat. And that's not good, affecting both table and indexes. Health. So constant back-to-back on is bad, off is bad. Choose what you prefer. So I prefer, my idea is, two other options.
Starting point is 00:21:43 Both are good, but both are difficult to implement first is don't use postgres for analytical queries use logical decoding logical replication to other systems like click house snowflake anything this is one second but of course it means like it's easy to say but but it's a lot of new expertise required and a lot of maintenance efforts of course it's easier to create a physical standby and work from there second option is to have again branching or database lab engine and provisions clone or branch and neon just released branching last week, by the way. I'm super excited. This should be everywhere.
Starting point is 00:22:27 Branching is super powerful. And I'm most excited about capabilities for development environments, like non-production environments. But for production environments, it's also good. Like you have specialized replica and when you provision branched or cloned Postgres, it doesn't report to a primary at all. You have frozen state, of course.
Starting point is 00:22:47 You're not receiving fresh changes, but you can do whatever you want. Like you're already detached from anyone. So you cannot be a problem for others. It's good. Well, of course you can have a logical replication. It doesn't report about XminHorizon to the primary, but it learns about system catalog changes.
Starting point is 00:23:11 So if some DDL is happening, there will be a problem, which is a problem in general in logical. So easy options are both bad. Difficult options are both good. But you need additional additional uh you know you know you need additional tools for it so that's that probably covers physical basically just use streaming it's good reliable you use wall ship replicas for standby nodes for cases when you have like for example you want to provision and keep it up to date some environment where you have, for example, you want to provision and keep it up to date,
Starting point is 00:23:46 some environment where you do experiments, for example. It can be based on wall shipping from archive. I prefer to touch the primary less often. And that's it. So physical is reliable, battle-proven, everywhere it's working, it's good. How about logical? Yeah, logical is good good also but it has a lot of limitations first of all in logical can be used in old-fashioned way longest is still a tool
Starting point is 00:24:14 to consider based on triggers or it can be used what we have logical decoding and replication of postgres first of all like for developers you should understand that logical is based on basically on wall as well on wall is is recording tuple changes like this is new tuple this is new whole page that changed and so on very low level changes and logical is based on that so it's not purely logical it's like logical derived from physical right and this leads to many many issues actually so many issues but biggest issues with logical that it it lacks some things but which currently under development for example ddl or replication of sequences there are there is ongoing work a lot of work and hackers which improves it and and, I mean, solves these problems.
Starting point is 00:25:10 So I'm quite positive that Postgres 16 or 17 will be much better in terms of logical. And recent Postgres versions 14 and 15, they also have improvements. On Postgres TV, we had a guest, Amit Kapila, who described logical replication improvements. But logical, in general, is needed when your destination should differ from the source. In logical replication terminology, they call the publisher and the subscriber. Like primary standby terms converted to publisher, subscriber here. So, for example, you want different major version, right? Yes, so if we're doing an upgrade. Or you want not whole database.
Starting point is 00:25:44 For example, you want only a couple of tables. Or it's a different database system at all, like ClickHouse. It's also possible based on logical decoding. Logical decoding idea is quite easy. Let's have a logical replication slot, and it will stream our changes, all records, and we will consume all records and decide what to do with them.
Starting point is 00:26:07 Either ignore or somehow process and allow to consume. So there are plugins, so-called, which define this logic on the publisher. And the problem, one of the biggest problems for logical for me, but it looks like practically it's not that big. So it's always parsing everything.
Starting point is 00:26:26 So like you cannot, it's both performance and scalability problem. It can lead to lags if you have complex records involved, like a lot of columns involved and so on. And also if you add additional stream and you want to say, okay, these tables will be there and those tables will be here like divide and conquer you won't conquer because consumption of cpu from of wall sender on the on the publisher side will be this like if it's it was 100 cpu of one core now you will have two cores both consuming 100 it's quite a problem but i but I see that it hits only under very heavy load.
Starting point is 00:27:12 You should go. Before, you probably hit the problem on the recipient side, on the subscriber. For example, if you miss some index and want to replay changes. And also, splitting to multiple streams makes more sense here, because definitely you can have multiple wall receivers and backends replaying changes
Starting point is 00:27:34 on the recipient side, on the subscriber. So, I also, for those who design and want to think about how to use logical, I wanted to highlight about one interesting point. We can distinguish two big areas of use cases. First is when we need all tables in Postgres.
Starting point is 00:27:54 It can be different major version of Postgres, but still we need all of them. And second, everything else, like we need only partially our database or it's not Postgres at all. So in the first case, I forgot to mention how we need to start with logical. First, we need to take a copy of database, maybe partially.
Starting point is 00:28:13 And then second phase, we stream changes. It's called CDC, change data capture. So there are two phases, initialization and already working state. And the great news, which is not covered in documentation, that you can switch from physical to logical very easily. So you just have physical standby, and then you create a logical replication slot on the primary.
Starting point is 00:28:40 It will provide you LSN while it's paused. So this is a point when it was created and we know that it's like the state is at this lsn lsn is it's like log sequence number it's the sequentially increasing number in in the wall and then you can say recovery target lsn to reach the same lsn on the standby right and then you just switch to logical from physical replication that's it so it it will take a few seconds only so you can have 10 terabyte database and provision logical in the couple like from physical physical it's easy like it's already very well known how to do it. But if you want whole database, you can do this trick. And super cool to provision logical.
Starting point is 00:29:32 And this opens the path to various experiments. For example, if you want to use logical for various migrations to different Postgres major version or something like fight with bloat. I don't know, like change personal system, something like that. That sounds super interesting for major version upgrades. So physical replication, then could we do an upgrade? Can we do the PG upgrade and then turn on logical replication to catch up? Exactly. So you do this trick and from physical if you convert physical node to logical node and then you can stop it upgrade it during
Starting point is 00:30:13 this time the primary will accumulate changes in the slot so yeah once you're up again you will consume and catch up that's a very. Very cool. Yeah. Super. So I don't know why people don't use it often. It's like, I would say this is secret of, how it's called? Secret of Polichanel, right? Like very well known secret, actually. Like many people know it,
Starting point is 00:30:40 but it's somehow keeping, it's not in the, maybe it should be covered in documentation, actually. But documentation in in general, lacks how-to parts. It's very occasional to see how-to parts in documentation, current form. So, that being said,
Starting point is 00:30:55 documentation is awesome, definitely. And if you need part of it, of course, it won't work. You can do it. For example, I want half of database. I do this trick, and then I will drop half of database. It can work. But in other cases, it won't work. You can do it. For example, I want half of database. I do this trick and then I will drop half of database. It can work. But in other cases, it won't. That's awesome. I was wondering about two last things, depending on which one you wanted to cover, either go into a bit more detail
Starting point is 00:31:16 on any of the issues on the logical replication side, or potentially more interesting, one of the questions included any specifics for managed service providers. So like any limitations to be aware of in the popular ones like RDS, Google Cloud? There are limitations everywhere, of course. So how many nodes you can have. For example, many projects prefer having like 10 or 15 physical standbys and offload a lot of reads there. It's constant work for backend developers to find more queries, transactions that can be offloaded.
Starting point is 00:31:51 But in general, as I remember, on RDS, originally, they distinguished two types of standbys. And first one was regular Postgres standby, and it was supported reads. And second was a checkbox multi-AZ, multi-availability zone. It was a hidden standby based not on PostgreSQL
Starting point is 00:32:09 replication. It was before recent changes. And it was not visible you couldn't use it so it's like just for HA. Now as I know they support they changed it. I haven't looked recently but I saw in the news that now for this HA replica,
Starting point is 00:32:27 read queries are also supported, which is good. So in an ideal world, of course, we should have HA replica and others replica. But some people think we should have at least three nodes and don't care about which, symmetric symmetric schema when all replicas can be used for file over and all replicas can receive reads i actually prefer this schema symmetric because if you allocate specific node for ha well caches problem you need to warm up, right? For read queries, it's warm up automatically if it's receiving reads. True. It's like kind of things to think about.
Starting point is 00:33:11 I prefer this approach. Patroni actually is good for it as well. By the way, in Patroni, there is a problem with logical slots. If a physical failover happened, you don't have slots. Interesting. That's bad. And also, unlike physical, physical can over happened, you don't have slots. And interesting, that's bad. And also logical, like unlike physical, physical can be cascaded, you can create interesting architectures. But for logical, it's always from primary unfortunately, for from physical primary. And there are
Starting point is 00:33:38 many, there are many discussions how to enable logical replication from physical standby,, it would be great to have. But risks are shifted to standbys, it's much better. But unfortunately, it's not yet implemented. So if you have physical failover, you'll lose lots. But there is some recipe and Patroni implemented, and there's a talk from Kukushkin, which is Mr. Patroni. Mr. Patroni. Yeah. You can listen to, there are many interesting details there. And actually on Postgres TV, Alexander also presented this talk.
Starting point is 00:34:13 Many interesting details. So it's like how to avoid full initialization for logical replicas. But for upgrades, we don't care about it because our replicas should live like some hours or a couple of days, right? Yeah, temporarily. So we don't care about it. So there are many different use cases, a lot of them. Awesome. Was there anything else you wanted to cover? There are many things to cover. For example, physical standby behavior may be very different
Starting point is 00:34:44 from primary. For example, if you behavior may be very different from primary. For example, if you have sub-transactions, you might want to check my article about it and the disaster GitLab had. They also blog-posted about it, a very good article. And it was quite difficult to troubleshoot, so
Starting point is 00:35:00 the idea that we have the same everything, well, yes, we have the same everything, but there are small details that can differ. And if you have sub-transactions and you should be prepared for growth, definitely worth checking that post because physical standby can have different behavior. I'll definitely include it. So in general, summary is physical replication is very, very reliable in Postgres. Logical has issues, but people also use it. There are also derived products like AWS has DMS.
Starting point is 00:35:32 GCP, Google also provides something in this area to migrate Postgres to Postgres, different versions or from different database systems to Postgres. Also, usually cloud providers provide logical replication connection, but no physical. It's worth keeping in mind. So you can migrate out of them only using logical. So logical should be in Arsenal, definitely, for various kinds of tasks.
Starting point is 00:35:55 And there are products like, I would mention a couple of them, Qlik, a former attunity. Attunity, maybe I'm pronouncing it wrong. They have issues with duplicates. They implement, I communicated with them many times. They just don't understand, don't bother to understand me.
Starting point is 00:36:15 Because they create slot at SQL level, so they cannot synchronize it properly. So if a recipient database like Snowflake don't have unique keys you will have duplicates if you need to when switching from initialization to cdc part or there's also a newer company called five trend this is quite good it's commercial company also works with on top of logical decoding and supports a lot of things and it's very like sas cloud cloud oriented but and and i think pricing is also good but it's worth considering if you want to save time yeah last last thing on the cloud providers is that they do often have that just like a check box where you can turn on high availability right right? So you don't need the physical replication for that if you use their own service.
Starting point is 00:37:09 But normally costs double just as if you're paying for a replica, right? Right, right, right. But as I said, this is about RDS. Checkbox doubled for price for primary and you couldn't use for reads, but they changed it recently worth checking. There are recent changes. So you can have the checkbox and still use every day this node for reads.
Starting point is 00:37:32 Interesting. Cloud SQL is similar, I think, in that you can just turn it on, but I'm not sure about being able to use it for reads. Right. Okay. Good. Nice one.
Starting point is 00:37:44 Thank you so much, Nikolai. That was awesome. Thanks for the suggestions for people that sent those in. And thanks everyone for listening. Thank you, Michael. See you next time. Bye. See you. Bye.
