Postgres FM - Patroni
Episode Date: October 4, 2024

Michael and Nikolay are joined by Alexander Kukushkin, PostgreSQL contributor and maintainer of Patroni, to discuss all things Patroni — what it is, how it works, recent improvements, and more.

Here are some links to things they mentioned:

Alexander Kukushkin https://postgres.fm/people/alexander-kukushkin
Patroni https://github.com/patroni/patroni
Spilo https://github.com/zalando/spilo
Zalando Postgres Operator https://github.com/zalando/postgres-operator
Crunchy Data Postgres Operator https://github.com/CrunchyData/postgres-operator
Split-brain https://en.wikipedia.org/wiki/Split-brain_(computing)
repmgr https://github.com/EnterpriseDB/repmgr
CloudNativePG https://github.com/cloudnative-pg/cloudnative-pg
Patroni release notes https://patroni.readthedocs.io/en/latest/releases.html
Citus & Patroni talk and demo by Alexander (at Citus Con 2023) https://www.youtube.com/watch?v=Mw8O9d0ez7E

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
 Transcript
    
Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL.
                                         
I'm Michael, founder of pgMustard. I'm joined as usual by Nikolay, founder of Postgres.ai.
                                         
                                         Hey, Nikolai.
                                         
                                         Hi, Michael.
                                         
And today we are delighted to be joined by Alexander Kukushkin,
                                         
                                         a Postgres contributor currently working at Microsoft and most famously, maintainer of Patroni.
                                         
                                         We had a listener request to discuss Patroni, so we're delighted you agreed to join us for an episode.
                                         
Alexander.
                                         
    
                                         Hello, Michael. Hello, Nikolai. Thank you for inviting me.
                                         
                                         I'm really excited to talk about my favorite project.
                                         
                                         Us too.
                                         
                                         Perhaps as a starting point, could you give us an introduction?
                                         
Most people, I think, will have heard of Patroni and know what it is, but for anybody that doesn't, could you give us an introduction to what it is and why it's important?

Yeah. So Patroni, in simple words, is a failover manager for Postgres. It solves the problem of availability of the primary. In Postgres we don't use non-inclusive words like "master", that's why we say "primary", and Patroni recently got rid of this non-inclusive word completely. The way Patroni does it, it makes sure that we're running just a single primary at a time. At the same time, Patroni helps you manage as many read-only replicas as you like to have, keeping those replicas ready to become primary in case the primary has failed. Patroni also helps to automate usual DBA tasks like switchover, configuration management, stuff like that.
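As a rough illustration (not from the episode): a single Patroni member is typically described by one YAML file along these lines, where the cluster name, hostnames, paths, and passwords are placeholders:

scope: demo-cluster          # cluster name shared by all members
name: node1                  # unique name of this member

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1.example.com:8008

etcd3:                       # the DCS; Consul, ZooKeeper or Kubernetes can be used instead
  hosts: etcd1:2379,etcd2:2379,etcd3:2379

bootstrap:
  dcs:
    postgresql:
      use_pg_rewind: true    # allow rewinding a failed primary back into a standby

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1.example.com:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    superuser:
      username: postgres
      password: change-me
    replication:
      username: replicator
      password: change-me

Each replica would run the same file with its own name and connect addresses; Patroni then handles the bootstrapping and ongoing management described next.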
                                         
Node provisioning also, right?

Not provisioning, not really. Node provisioning is a task for the DBA. The DBA has to start Patroni, and Patroni will take care of bootstrapping the node. In case it's a totally new cluster, Patroni will start it as a primary. In case the node is joining an existing cluster, the replica node will take a pg_basebackup by default from the running primary and start as a replica. And the most interesting part: let's say we bring back a node which was previously running as a primary. Patroni does everything to convert this failed primary into a new standby, to join the cluster and be prepared for the next unforeseen event.

At least you agree that it does part of node provisioning, because otherwise we wouldn't have situations when all the data was copied, a new copy was created, and we are suddenly out of disk space. And if you don't expect Patroni to participate in node provisioning, then you think: what's happening? Why am I out of disk space? Right?

It happens sometimes.
                                         
                                         It used to happen, I think, with bootstrap mode.
                                         
I don't remember up until which version, but when Patroni tries to create a new cluster, it usually does it by using initdb, but in some cases you can configure Patroni to create a cluster from an existing backup, like from a base backup.

And if something goes wrong, Patroni does not remove this data directory, but renames it. And it used to apply the current timestamp to the directory name, and therefore after the first failure it gives up, waits a little bit, and does the next attempt.

Appending the timestamp to the directory name, right?

Yeah, it uses yet another timestamp, creates a new data directory, fails, and renames it again. Now it is not working like this.
                                         
    
                                         It just renames pgdata to pgdata-alt,
                                         
                                         something like this.
                                         
                                         And that's why you will not have
                                         
                                         an infinite number of directories.
                                         
                                         And having just one is enough
                                         
                                         to investigate the failure.
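For context, the bootstrap paths discussed above are driven by the bootstrap section of the Patroni configuration; a minimal sketch, where the custom method name and restore script are hypothetical:

bootstrap:
  initdb:                    # default path: create a brand-new cluster with initdb
    - encoding: UTF8
    - data-checksums
  # Alternatively, a custom bootstrap method that restores from an existing base backup:
  # method: from_backup                                   # name is arbitrary
  # from_backup:
  #   command: /usr/local/bin/restore_latest_backup.sh    # hypothetical restore script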
                                         
But at maximum, if we expected our data directory to fill 70% of the disk, we could still end up out of disk space.
                                         
                                         Yeah, that's unfortunate.
                                         
But the other option really is that you just drop it, and at the same time all the evidence of what failed and why it failed is also gone. You have nothing to investigate.
                                         
Okay. To me it still sounds like Patroni participates in node provisioning. Yes, it doesn't bring you resources like disks and virtual machines and so on, but it brings data, the most important part of Postgres node provisioning, right? Okay, I just wanted to be right a little bit.
                                         
                                         Okay.
                                         
                                         Okay.
                                         
                                         It's a joke.
                                         
                                         Okay.
                                         
    
                                         I think diving deep quickly is great.
                                         
                                         It'd be good to discuss complex topics,
                                         
                                         but I think starting simple would also be good.
                                         
I would love to hear a little bit about the history of Patroni. Like the early days: what were you doing before Patroni to solve this kind of issue? And why was it built? What problems were there with the existing setups?

To be honest, while working for my previous company we didn't have any automatic failover solution in place. What we relied on was just a good monitoring system that sent you a message, or someone calls the engineer, just calls you in the night: the database failed. There were a lot of false positives, unfortunately, but it still felt more reliable than using solutions like Replication Manager, repmgr.
                                         
                                         Yeah, I remember this.
                                         
                                         Sorry for interrupting.
                                         
                                         I remember this very, very well.
                                         
Like people were constantly saying, we don't need auto-failover. It's evil, because it can switch over suddenly, fail over suddenly, and it's a mess. Let's rely on manual actions. I remember this time very well.
                                         
Yeah, in our defense, the number of databases, of database clusters, that we ran wasn't so high. I think a few dozen, and it was running on-prem, didn't fail so often, and therefore it was manageable.
                                         
                                         A bit later, we started moving to the cloud.
                                         
And suddenly, not suddenly, but luckily for us, we found a project named Governor, which basically brought an idea of how to implement auto-failover in a very nice manner, without having so many false positives and without the risk of running into a split-brain.
                                         
Was it an abandoned project already?
                                         
                                         No, no.
                                         
    
So it was not really abandoned, but it wasn't very active either. So we started playing with it, found some problems, reported those problems to the maintainer of Governor, got no reaction, unfortunately, and started fixing those problems on our own. At some moment a number of fixes and some new nice features accumulated, and we decided just to fork it and give a new name to the project. So this is how Patroni was born.
                                         
Georgian name, right?
                                         
                                         Right.
                                         
                                         Beautiful.
                                         
    
What does it mean?

Governor in Georgian, I think.

Oh, governor.

Yeah, almost, almost. Very close, but I'm not a good person to explain or to translate from Georgian, because I don't... I know yet another word in Georgian, and it's spilo.
                                         
                                         Yeah.
                                         
It translates from Georgian as elephant, right? And the name was chosen, I guess, by Valentin Gogichashvili, right?

Yes. He was... no, at that time he wasn't my boss anymore, but we still worked closely together, and I really appreciate his creativity in inventing good names for projects.
                                         
                                         Yeah, great names.
                                         
                                         And is this a good time to bring up Spilo?
                                         
                                         What is Spilo and how is that relevant?
                                         
Spilo, as I said, translates from Georgian as elephant.
                                         
                                         When we started playing with Governor,
                                         
    
                                         we were already targeting to deploy everything in the cloud.
                                         
                                         We had no other choice but to build a Docker image and provision Postgres in a container.
                                         
                                         And we called this Docker image Spilo.
                                         
Basically, we packaged Governor, Postgres, a few Postgres extensions, and, I think it was WAL-E back then, as a backup and point-in-time recovery solution.

And it still exists to this day as Spilo, but now with Patroni?

Yeah, of course. Now there is Patroni inside, and Spilo includes plenty of Postgres major versions, which might be an antipattern, but it allows doing major upgrades, and WAL-G instead of WAL-E.

And it's a part of the operator.
                                         
    
                                         Not really part of
                                         
                                         operator. So Spilo
                                         
                                         is a product
                                         
                                         on its own.
                                         
                                         I know that some people run
                                         
                                         Postgres on Kubernetes or
                                         
                                         even just on
                                         
                                         virtual machines
                                         
    
                                         with Spilo
                                         
                                         without using Operator.
                                         
                                         But using Docker, for example.
                                         
                                         Yeah, of course.
                                         
                                         But that is a good opportunity to discuss Postgres Operator.
                                         
                                         Postgres-Operator was Zalando's...
                                         
                                         Was that one of the first operators of its type?
                                         
                                         I know we've got lots these days.
                                         
    
Well, maybe it was, but at the same time the same name was used by Crunchy for their operator. They were developed in parallel, and back then Crunchy wasn't relying on Patroni yet. As I said, we started moving things to the cloud, and at some point the vector shifted a little bit and we started running plenty of workloads on Kubernetes, including Postgres. Since deploying everything manually and, more importantly, managing so many Postgres clusters manually was really a nightmare, we started building Postgres Operator. Back then, I don't think a very nice Go library to implement the operator pattern existed, and therefore people had to invent everything from scratch, and there was a lot of boilerplate code that was copied over and so on.
                                         
Was it only the move to the cloud that mattered here, or maybe also moving to microservices, splitting everything into microservices? Because I remember from...
                                         
    
                                         Microservices, of course, played a big role.
                                         
                                         And probably...
                                         
                                         Not probably.
                                         
Microservices were really the driving force to move to the cloud. Because with the scale of the organization, it wasn't possible to keep a monolith. And the idea was, let's split everything into microservices.
                                         
                                         And every microservice usually requires its own database.
                                         
                                         Right. Sometimes sharded database,
                                         
    
                                         like we used application sharding.
                                         
                                         In certain cases, the same database is used
                                         
                                         by multiple microservices, but it's a different story.
                                         
                                         But really, the number of database clusters
                                         
                                         that we had to support exploded,
                                         
                                         like from dozens to hundreds and then to thousands.
                                         
                                         And this is already when you cannot rely on humans to perform a failover, right?
                                         
                                         Even when you run a few hundred database clusters,
                                         
    
                                         it's better not to rely on humans to do maintenance, in my opinion.
                                         
                                         Right. So that's interesting.
                                         
And maybe it's also the right time to discuss why Postgres doesn't have internal, built-in failover. I remember discussions about replication when we relied on Slony, then Londiste, and some people resisted bringing replication inside Postgres, but somehow it was resolved, eventually.
                                         
    
                                         And Postgres has good replication,
                                         
                                         physical, logical,
                                         
                                         sometimes not good, but it's
                                         
                                         a different story. In general,
                                         
                                         it's very good, and improving, improving every release.
                                         
Just last week, Michael and I discussed the improvements to logical replication in 17, and maybe it will resonate a little bit with today's topic, Patroni. But it doesn't happen for auto-failover at all, right? Why?

So, I can only guess. Because to do it correctly, we cannot just have two nodes, which is what most people run, like a primary and a standby,
                                         
                                         because there are many different factors involved. And one of the most critical ones is the network
                                         
    
                                         between those nodes. And when just having two machines,
                                         
                                         you cannot distinguish between failure on the networking
                                         
                                         and failure of the primary.
                                         
Like if you just run a health check from a standby and make decisions based on that health check, you may have a false positive. Basically, the network just experiences some short glitch, which could last even a few seconds, sometimes a few minutes, but at the same time the old primary is still there. If we promote a standby, we get into a split-brain situation with two primaries, and it's not clear on which one transactions are running.
                                         
                                         And in the worst case, you end up in an application connecting to both of them.
                                         
                                         This is what...
                                         
                                         Good luck with assembling all these changes together.
                                         
                                         This is what tools like Replication Manager do.
                                         
So I ended up calling Replication Manager a split-brain solution, because I observed it many, many times.

As a mitigation, what is maybe possible to do: the primary can also run a health check, and in case the standby is not available, just stop accepting writes, either by restarting in read-only mode or maybe by implementing some other mechanism. But it also means that we lose availability without a good reason. So in this scenario, when we promote a standby: technically, if the standby cannot access someone else, it shouldn't accept writes.

Like in a network split.

We came close to the setup that Replication Manager, for example, calls a witness node.

Witness node, yes, exactly. Basically, you need to have more than two nodes.
                                         
                                         And the witness node should help making decision.
                                         
                                         Let's say we have witness node in some third failure domain.
                                         
    
The primary can see the witness node, therefore it can continue running as a primary. And the standby shouldn't be allowed to promote if it cannot access the witness node. And it already reminds us of systems like etcd.

Consensus algorithms.

Yeah, consensus algorithms, where a write is possible when it is accepted by a majority of nodes.

This is all already invented, right?

Yeah, so this is already invented, and it's what Patroni really relies on to implement auto-failover reliably.
                                         
                                         I can guess that at some moment in Postgres it will be added,
                                         
    
and we already have plenty of such components in Postgres. We have the write-ahead log with LSNs, which are always incremented. We have timelines, which are very similar to terms in etcd. So basically, in the end, we will just need to have more than two nodes, better three, so that we don't stop writes while one node is temporarily down. And it will give the possibility to implement auto-failover without even doing pg_rewind, let's say. Because when the primary writes to the write-ahead log, it will be first confirmed by standby nodes, and only after that...
                                         
                                         So, effectively, this is what
                                         
    
                                         we already have, but
                                         
                                         it's not enough, unfortunately.
                                         
                                         So, do you think at some
                                         
                                         point Patroni will not be needed, and everything
                                         
                                         will be inside Postgres, or no?
                                         
                                         I hope so,
                                         
                                         really. I hope so.
                                         
                                         No, no, no, not because I'm tired of maintaining Patroni,
                                         
    
                                         but because this is what people really want to have,
                                         
                                         to deploy highly available Postgres
                                         
                                         without necessity to research
                                         
                                         and learn a lot of external tools like Patroni,
                                         
solutions for backup and point-in-time recovery, and to
                                         
                                         upgrade them sometimes because we're
                                         
                                         always lagging with these
                                         
                                         upgrades
                                         
    
                                         but at the same time
                                         
                                         let's imagine that
                                         
                                         it happens in a couple of years
                                         
                                         but with five
                                         
                                         years support cycle
                                         
                                         there will be still a lot of
                                         
                                         setups that run in not recent
                                         
                                         Postgres versions and they still need to use something external like Patroni.
                                         
    
I'm actually looking right now at the commits of Replication Manager. It looks like the project has been inactive for more than one year, almost; a few commits, that's it. It's like going down.

Well, I probably have some insights about it,
                                         
                                         not about replication manager,
                                         
                                         but I know that EnterpriseDB was contributing
                                         
                                         some features and bug fixes to Patroni.
                                         
                                         So they officially support Patroni.
                                         
                                         So it sounds interesting, right?
                                         
    
                                         So Patroni is a winner, obviously.
                                         
                                         It's used by many Kubernetes operators,
                                         
                                         many of them, and not only Kubernetes, of course.
                                         
And with it winning, of course, some projects were abandoned, not only Replication Manager; we know some others, right? But you're thinking that one day everything will be in core and Patroni will be abandoned, maybe, right? And you think it's maybe for good.

So every project has its own life cycle.
                                         
                                         At some moment,
                                         
                                         the project is abandoned
                                         
    
                                         and not used by anyone.
                                         
                                         We are not there yet.
                                         
                                         Right, right.
                                         
                                         While we're in this area,
                                         
                                         I wanted to ask you
                                         
                                         what you think about
                                         
this: Kubernetes itself also relies on a consensus algorithm. Why do some operators choose to use Patroni, while others, like CloudNativePG, decide to rely on Kubernetes-native mechanisms and avoid using Patroni?
                                         
                                         What's better?
                                         
To be honest, I don't know what's driving the people that build CloudNativePG.
                                         
                                         But what's better in general?
                                         
                                         What are pros and cons?
                                         
    
                                         How to compare?
                                         
                                         What would you do?
                                         
In a sense, in CloudNativePG there is a component that tries to manage all Postgres clusters and decide whether some primary has failed, and promote one of the standbys.
                                         
                                         I'm not sure how they implement fencing of the failed primary, because if you don't correctly
                                         
                                         implement fencing and promote a standby to the primary, you again end up in
                                         
                                         a split-brain situation.
                                         
                                         And let's imagine that one Kubernetes node is isolated in the network.
                                         
    
                                         Network partition.
                                         
                                         Yeah.
                                         
And it automatically means that you will not be able to stop pods or containers that are
                                         
                                         running on this node. At the same time,
                                         
                                         applications that are running on this node will still use Kubernetes services to be able to
                                         
                                         connect to the isolated primary. Right, yeah.
                                         
So, Patroni detects such scenarios very easily, because the Patroni component runs in the same pod with Postgres, and in case it cannot write to the Kubernetes API, it just does self-fencing: it restricts Postgres to read-only.

It's simple, by the way, right?

Yeah. So I don't know if they do something similar; in case they don't, it's dangerous.

We should do a whole separate episode on CloudNativePG, actually. I think that would be a good one.

Yeah. I'm not saying that CloudNativePG is bad or does something wrong. I'm just trying to understand what they're doing and raising my concerns, of course.

Right. Back to Patroni. It worked like this from the beginning, but it feels like in version 4, which is the latest major release... there might be some life in it for a couple of years, by the way.
                                         
From the very beginning, we wanted to support this feature, but what was stopping us is the promise of Patroni with synchronous replication: that we want to promote a node that was synchronous at the time when the primary failed. If we just have a single name in synchronous_standby_names, a single node, it's very easy to say, okay, this node was synchronous and therefore we can just promote it. When there is more than one node and we require all of them to be synchronous, we can promote any of them. But with quorum-based replication, you can have something like any one from a list of, let's say, three nodes. Which one is synchronous when the primary failed?
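In Postgres terms, "any one from a list of three" corresponds to the ANY form of synchronous_standby_names; the node names here are illustrative, and with Patroni this GUC is normally managed by Patroni itself rather than set by hand:

# postgresql.conf: any one out of three listed standbys must confirm each commit
synchronous_standby_names = 'ANY 1 (node2, node3, node4)'

# In Patroni 4, if I read the release notes correctly, this is driven by the
# dynamic configuration rather than by editing the GUC directly, roughly:
synchronous_mode: quorum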
                                         
I'm not demanding that you answer this question, so I will just explain how it works in Patroni, as of the last major release. The information about the current value of synchronous_standby_names is also stored in etcd. Therefore, those three nodes that are listed in synchronous_standby_names know that they are listed as quorum nodes.
                                         
    
And during the leader race, they need to access each other and get some number of votes. If there are three nodes, it means that every node, to become a new primary, a new candidate, needs to access at least the two remaining nodes and get confirmation that they are not ahead of the LSN on the current node.
                                         
Is it clear, or should I elaborate a little bit more?
                                         
So if they were ahead... let me ask this stupid question: if a node checks that it is ahead of the current candidate to be leader, then it's a bad decision to promote that candidate, because a different one would...

So, just for your understanding, in Patroni there is no central component that decides which node to promote.
                                         
                                         Every node makes decisions on its own.
                                         
Therefore, every standby node listed in synchronous_standby_names goes through a cycle of health checks. It accesses the remaining nodes from synchronous_standby_names and checks at what LSN they are. And if they're on the same LSN or behind, we can assume that this node is the healthiest one. And the same procedure happens on the remaining nodes. Basically, this way we can find: okay, this node is eligible to become the new primary. In case we have something like ANY 2 of three nodes, we can make a decision by asking just a single node, because we know that two nodes will have the latest commits, the latest commits that are reported to the client. And it will be enough to just ask a single node. Although it will ask all nodes from synchronous_standby_names, in case one of them, let's say, failed together with the primary, it is still enough to make a decision by asking the remaining one.
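A tiny worked example of that counting argument, with hypothetical node names and LSNs:

# synchronous_standby_names = 'ANY 2 (node2, node3, node4)'
# Every commit acknowledged to a client is on the primary plus at least 2 of the 3 listed standbys.
# Suppose the primary and node4 fail together. node2 only needs to reach node3:
#   node2 reports LSN 0/5000120, node3 reports LSN 0/5000078 -> node3 is not ahead, so node2 may promote.
# This is safe because any acknowledged commit sits on at least 2 of {node2, node3, node4},
# so whichever of the two surviving listed nodes is furthest ahead must already hold it.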
                                         
    
                                         Nice.
                                         
And the tricky part comes when we need to change synchronous_standby_names and the values that we store in etcd. Let's say we want to increase the number of synchronous nodes, like from 1 to 2. What should we change first, the synchronous_standby_names GUC, or the value in etcd, so that we can correctly make a decision? If we change the value in etcd first, it will assume, okay, we need to ask just a single node to make a decision, although there is just one node that has the latest commits, 100%. And in fact, we need to ask two. Therefore, when we're increasing this from one to two, first we need to update synchronous_standby_names, and only after that the change in etcd. And there are almost dozens of rules that one needs to follow to do such changes in the correct order.
                                         
Because it's not only about changing the replication factor, it's also about adding new nodes to synchronous_standby_names, or removing nodes that have disappeared, and so on. And I don't think any other solution applies an algorithm for such changes.

How much time did you spend on this?

Yeah. So, originally this feature was implemented by Ants Aasma.
                                         
He's working for Cybertec.
                                         
                                         It happened in 2018.
                                         
    
                                         I did a few attempts to understand, like, this great logic of this algorithm.
                                         
                                         And finally, like, almost five years after, like, I was able to get enough time to fully focus on the problem.
                                         
                                         And even after that, I spent a couple of months implementing and fixing some bugs and corner cases and implementing all possible unit tests to cover all such transitions.
                                         
                                         There is no book which describes this, that you could follow.
                                         
                                         This is something really new that needs to be invented, right?
                                         
                                         Well, the idea was obvious, how to do it, or what to do.
                                         
But implementing it correctly, and proving that it really works correctly, is a different story.

Thanks for going through all this history. There is one more feature which I would like to mention. It was Patroni 3.0: DCS failsafe mode. So DCS is distributed configuration storage.
                                         
    
                                         And actually we just experienced
                                         
                                         a couple of outages
                                         
because we are in Google Cloud, on their Kubernetes, running the Zalando operator with Patroni, of course.
                                         
                                         And I just checked the version of Patroni, and it seems to have it.
                                         
                                         But I don't think it is enabled.
                                         
    
                                         Exactly. This is my second question, actually.
                                         
Why is it not enabled?
                                         
                                         So, first question, what is it?
                                         
How do you solve this problem when etcd or Consul is temporarily out?

Let's start with the problem statement.
                                         
The promise of Patroni is that it will run a node as a primary only while it can write to the distributed configuration store, like to etcd. If it cannot write to etcd, it means that maybe something is wrong with etcd, or maybe this node is isolated and therefore writes are failing. When the node is isolated, this is working as designed: Patroni cannot write to etcd, so it will stop Postgres and restart it in read-only mode.

But in case etcd is totally down, because of, I don't know, some human mistake, and you cannot access any single node of etcd, in this case Patroni also stops the primary and starts it read-only, to protect from the case where, let's say, some standby nodes can access the DCS at the same time and promote one of the nodes. So people were really annoyed by this problem and were asking why we are demoting the primary, and so far the answer was always: all right, we cannot determine the state, and therefore we demote to be on the safe side.
                                         
The idea of how to improve on that came at one of the Postgres conferences, after talking with other Patroni users.
                                         
How is it improved with failsafe mode? When the primary sees that none of the etcd nodes are accessible, it tries to access all Patroni nodes in the cluster using the Patroni REST API. And if the Patroni primary can get a response from all nodes of the Patroni cluster in failsafe mode, it will continue to run as a primary. This is a much stronger requirement than quorum or consensus: it is not expecting responses from, let's say, a majority; it really wants responses from all standby nodes in order to continue running as a primary.

Yeah, so this feature was introduced in Patroni version 3, but it is not enabled by default, because I think there are some side effects when you enable this mode in certain environments. Probably it is related to environments where your node can come back with a different name.

We need to think about that.

This behavior is documented.

Yes, we will definitely look into it, thank you very much for that.

On Kubernetes it is safe to enable it, because pods always have the same name, just different IP addresses.

Yeah, we should start using it, this is what I think as well. I just got help, and, as usual, I just wanted to publicly thank you for all the help you do, for me and actually for many companies, over many years. This is huge, thank you so much.

Yeah, I'm happy to help.
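To make the failsafe check a bit more concrete, here is a minimal sketch of the idea described above, not Patroni's actual code: when the primary cannot write to the DCS, it keeps accepting writes only if every member of the cluster answers over the REST API. The URLs, the endpoint used, and the demote helper are assumptions for the example.

    import requests

    def can_stay_primary(member_api_urls: list[str], timeout: float = 2.0) -> bool:
        """Return True only if every cluster member responds over the REST API."""
        for url in member_api_urls:
            try:
                # GET /patroni is used here just as a liveness probe (assumption)
                if requests.get(f"{url}/patroni", timeout=timeout).status_code != 200:
                    return False
            except requests.RequestException:
                return False   # a single unreachable member means we must demote
        return True

    # if dcs_unreachable and not can_stay_primary(["http://node2:8008", "http://node3:8008"]):
    #     demote_to_read_only()   # hypothetical helper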
                                         
So another thing I wanted to discuss is probably replication slots. I remember a few years ago you implemented support for failover of logical slots, and now we have it in Postgres, right? So finally one thing, I think, moved from Patroni into Postgres itself, or do you still keep this functionality?

We still keep it, and we didn't do anything special for Postgres 17. It was... I think it was in 16 even, no? Failover of... ah. Or 17. Well, the ability to use a logical slot
                                         
    
                                         on physical standbys was in 16 but failover in 17. We just discussed it. Yes exactly exactly.
                                         
Exactly. I confused them. That's why I'm saying we didn't do anything special, although I did some tricks to make this feature work with Patroni, because it requires having your database name in primary_conninfo. And Patroni did not put the DB name into primary_conninfo, because for physical replication it is not useful.
                                         
But now it is there.

I'm thinking about this case: yes, we have a primary and some standby nodes, and one of them is used to logically replicate something to some other Postgres, Snowflake, or something. Or Kafka or something. If that's the case, then yes, we replicate from a standby, because it's good: we don't use the primary's resources, the WAL sender doesn't consume CPU on the primary, and so on. And there is no risk for its disk. So now we have such a standby, and it suddenly disappeared. It's not Patroni's job to deal with it, right? Because we need some mechanisms for another standby.

You mean to keep the logical replication slot on a new standby, to which you would like to connect. In theory, Patroni could probably provide this, since it is possible to do logical replication from standby nodes since Postgres 16. So how it's implemented currently in Patroni, the logical failover slots feature: it creates logical slots on standby nodes and uses pg_replication_slot_advance to move the slot to the same LSN as it is currently on the primary. So basically the assumption is that logical replication happens on the primary. In theory there is no reason why it cannot be done for standby nodes. Let's say we create logical slots on all standby nodes with the same name, and Patroni can track which one is active and store that in etcd. In theory this could work, but I don't know whether I will find the time to do it.
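As a rough sketch of the mechanism described here, one round of synchronisation could look roughly like this; it is not Patroni's implementation, just the two calls involved, assuming the slot already exists on the standby and using made-up connection strings and slot name.

    import psycopg2

    SLOT = "my_logical_slot"   # hypothetical slot name

    with psycopg2.connect("host=primary dbname=postgres") as prim, \
         psycopg2.connect("host=standby dbname=postgres") as stby:
        with prim.cursor() as cur:
            cur.execute(
                "SELECT confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = %s",
                (SLOT,),
            )
            lsn = cur.fetchone()[0]
        with stby.cursor() as cur:
            # Move the standby's copy of the slot up to the primary's position.
            # As noted later in the conversation, this decodes WAL and can be slow.
            cur.execute("SELECT pg_replication_slot_advance(%s, %s)", (SLOT, lsn))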
                                         
I'm just trying to understand. It's quite a new feature since 16, to logically replicate from physical standbys, but...

But be aware that it still affects the primary. Right. Right. So maybe pg_wal will not blow up, but pg_catalog bloat certainly will.
                                         
                                         Yeah, this for sure.
                                         
I was referring to the need to preserve WAL files on the primary.
                                         
                                         This risk has gone if we do this,
                                         
                                         but I cannot imagine how we can start using logical slots
                                         
    
                                         on physical standbys in serious projects without HA ideas.
                                         
                                         Because right now I don't understand how we solve HA for this.
                                         
Yeah, and unfortunately, this hack that Patroni implements with pg_replication_slot_advance has its downsides. It literally takes as much time to move the position of the logical slot
                                         
as you consume from this slot. That's unfortunate. And how it's solved in Postgres 17: it basically does not need to parse the whole WAL and decode it. It just literally overwrites some numbers in the replication slot, because it knows the exact positions, and it does it safely. Patroni cannot do that. Although, possibly, the pg_failover_slots extension could do the same for older versions.

Okay, that's another area for me to explore more deeply, because I like understanding many of these details. Good pieces of advice as well, thank you so much. Anything else, Michael, you wanted to discuss? Obviously, one of the biggest features was Citus support, right? But I'm not using Citus actively, so I don't know; if you want to discuss this, let's discuss.
                                         
                                         I know that some people certainly do, because from time to time I get questions about Citus with Patroni on Slack,
                                         
or maybe not Citus-specific questions, but according to the output of patronictl list, they are running a Citus cluster.
                                         
So there is certainly demand, and I believe that Patroni implementing Citus support improved the quality of life of some organizations and people that want to run sharded setups.
                                         
                                         Is there anything specific you needed to solve
                                         
                                         to support this or like technical details?
                                         
                                         To support Citus?
                                         
                                         So Citus, I wouldn't say that it was very hard,
                                         
                                         but it wasn't very easy either.
                                         
So Citus has the notion of a Citus coordinator, where originally you are supposed to use the coordinator for everything: to do DDL, to run transactional workload and so on. And on the coordinator there is a metadata table where you register all worker nodes, and the worker nodes are where you keep the actual data, the shards. What I had to implement in Patroni is registering worker nodes in this metadata automatically, and in case a failover happens on the worker nodes, we need to update the metadata and put in the new IPs or host names, whatever. Basically, when you want to scale out your Citus cluster, you just start more worker nodes, and every worker node is in fact another small Patroni cluster. So, technically, in patronictl, it looks like just a single cluster, but in fact, it's one cluster for the coordinator, one cluster for every worker node, and
                                         
                                         on each of them
                                         
    
                                         there is its own failover
                                         
                                         happening. If you start worker nodes in a different group, like in the new one, it joins the existing
                                         
Citus cluster, and Patroni on the coordinator registers the new worker nodes.
                                         
                                         But what Patroni will not do, it will not redistribute existing data to the new workers.
                                         
                                         This is something that you will have to do manually afterwards and it has to be your
                                         
                                         own decision how to scale your data
                                         
                                         and replicate to other nodes. Although nowadays it's possible to do it without downtime because
                                         
all enterprise features of Citus are included since Citus version 10. So everything that was in the enterprise edition is now open source.
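As an illustrative sketch of the coordinator-side bookkeeping mentioned above, the two Citus calls involved look roughly like this. The hostnames, ports, and connection string are made up, and the real automation lives inside Patroni; this only shows what "registering a worker" and "updating metadata after a worker failover" mean.

    import psycopg2

    with psycopg2.connect("host=coordinator dbname=postgres") as conn:
        with conn.cursor() as cur:
            # Register a brand-new worker in the coordinator's metadata.
            cur.execute("SELECT citus_add_node(%s, %s)", ("worker-2.internal", 5432))

            # After a failover inside a worker's own Patroni cluster,
            # point its metadata entry at the new primary's address.
            cur.execute(
                "SELECT citus_update_node(nodeid, %s, %s) "
                "FROM pg_dist_node WHERE nodename = %s",
                ("worker-1-new.internal", 5432, "worker-1.internal"),
            )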
                                         
    
                                         That's cool.
                                         
I saw Alexander has a good demo of this, of Citus and Patroni working together, including
                                         
                                         rebalancing. I think it was CitusCon last year?
                                         
                                         Yeah, it was CitusCon.
                                         
                                         Nice. I'll include that video in the show notes.
                                         
                                         I wish I had all this a few years ago. Yeah, of course.
                                         
                                         There was a little bit more work under the hood.
                                         
In case you do a write workload via the coordinator, it's possible for Patroni to do some tricks to
                                         
    
                                         avoid client connection termination while switchover of worker nodes is happening.
                                         
                                         This is what I did during the demo.
                                         
                                         There are certain tricks, but unfortunately,
                                         
                                         it works only on coordinator and only for write workloads.
                                         
                                         For read-only workloads, your connection will be broken.
                                         
                                         That's unfortunate.
                                         
Maybe one day it will be fixed.
                                         
                                         So in the Citus, maybe one day
                                         
    
                                         the same
                                         
                                         stuff will also work
                                         
                                         on worker nodes. And by the way,
                                         
                                         on Citus, you can run
                                         
                                         transactional workload
                                         
                                         by connecting to every worker node.
                                         
                                         Only DDL
                                         
                                         must happen via
                                         
    
                                         coordinator.
                                         
                                         Nice. Speaking of improvements in the future,
                                         
do you have anything lined up that you still want to improve in Patroni?
                                         
                                         Hmm. That's a very good question.
                                         
Usually some nice improvements come out of nothing.
                                         
                                         Like you don't plan anything, but you talk to people and they say,
                                         
                                         it would be nice to have this improvement or this feature.
                                         
                                         And you start thinking about it.
                                         
    
                                         Wow, yeah, it's a very nice idea and it's great to have it.
                                         
                                         But I rarely plan some big features from the ground up, let's say.
                                         
So what I had in mind, for example, is failover to a standby cluster in Patroni. Right now it's possible to run a standby cluster, which is not aware of the source it replicates from. It could be replicating from another Patroni cluster. And what people ask: we have a primary Patroni cluster, we have standby Patroni clusters, but there is no mechanism to automatically promote the standby cluster, because it's running in a different region and it is using a completely different etcd.
                                         
                                         So they simply don't know about each other.
                                         
    
                                         It would be nice to have, but again I cannot promise when I can start working on it and
                                         
                                         whether it will happen.
                                         
I know that people from CyberTec did some experiments and have some proof-of-concept solutions that seem to work, but for some reason they're also not happy with the solution they implemented.
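For context, what does exist today is the standby cluster definition itself; promoting the whole cluster remains a manual step, which is exactly the gap being discussed. Below is a rough Python rendering of the relevant piece of configuration, which is normally written as YAML; the host, port, and slot name here are made up.

    # Approximate shape of the standby_cluster section in the Patroni DCS config.
    standby_cluster_config = {
        "standby_cluster": {
            "host": "primary-cluster.example.com",   # remote source to replicate from
            "port": 5432,
            "primary_slot_name": "standby_cluster",  # optional physical slot on the source
        }
    }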
                                         
                                         Yeah, sounds tricky.
                                         
                                         Distributed systems are always tricky.
                                         
                                         Yeah, get that on a t-shirt.
                                         
Thank you for coming. As usual, I use the podcast, and all the events I participate in and organize and so on, just for my personal education and for daily work as well. I just thank you so much for the help, again.

Yes, thank you for inviting me. Yeah, it's a nice job that you are doing. I know that many people are listening to your podcast and are very happy that they're learning a lot of great stuff
                                         
                                         and also making
                                         
                                         a big list
                                         
                                         of to-do items
                                         
    
                                         I cannot
                                         
                                         say the same about myself
                                         
                                         that I watch every single
                                         
                                         episode but
                                         
                                         sometimes I do
                                         
                                         
                                         Cool.
                                         
                                         Thank you.
                                         
    
                                         Thanks so much, Alexander.
                                         
                                         Cheers, Nikolai.
                                         
                                         Thank you.
                                         
                                         Bye-bye.
                                         
                                         Bye.
                                         
                                         Bye.
                                         
