Postgres FM - Patroni
Episode Date: October 4, 2024

Michael and Nikolay are joined by Alexander Kukushkin, PostgreSQL contributor and maintainer of Patroni, to discuss all things Patroni — what it is, how it works, recent improvements, and more.

Here are some links to things they mentioned:

Alexander Kukushkin https://postgres.fm/people/alexander-kukushkin
Patroni https://github.com/patroni/patroni
Spilo https://github.com/zalando/spilo
Zalando Postgres Operator https://github.com/zalando/postgres-operator
Crunchy Data Postgres Operator https://github.com/CrunchyData/postgres-operator
Split-brain https://en.wikipedia.org/wiki/Split-brain_(computing)
repmgr https://github.com/EnterpriseDB/repmgr
CloudNativePG https://github.com/cloudnative-pg/cloudnative-pg
Patroni release notes https://patroni.readthedocs.io/en/latest/releases.html
Citus & Patroni talk and demo by Alexander (at Citus Con 2023) https://www.youtube.com/watch?v=Mw8O9d0ez7E

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
 Transcript
    
Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL.
                                         
I'm Michael, founder of pgMustard. I'm joined as usual by Nikolay, founder of Postgres.ai.
                                         
                                         Hey, Nikolai.
                                         
                                         Hi, Michael.
                                         
And today we are delighted to be joined by Alexander Kukushkin,
                                         
                                         a Postgres contributor currently working at Microsoft and most famously, maintainer of Patroni.
                                         
                                         We had a listener request to discuss Patroni, so we're delighted you agreed to join us for an episode.
                                         
Alexander.
                                         
    
                                         Hello, Michael. Hello, Nikolai. Thank you for inviting me.
                                         
                                         I'm really excited to talk about my favorite project.
                                         
                                         Us too.
                                         
                                         Perhaps as a starting point, could you give us an introduction?
                                         
Most people, I think, will have heard of Patroni and know what it is, but for anybody that doesn't, could you give us an introduction to what it is and why it's important?

Yeah. So Patroni, in simple words, is a failover manager for Postgres. It solves the problem of availability of the primary. In Postgres we don't use non-inclusive words like "master", that's why we say "primary", and Patroni recently got rid of this non-inclusive word completely. The way Patroni does it, it makes sure that we're running just a single primary at a time. At the same time, Patroni helps you manage as many read-only replicas as you like to have, keeping those replicas ready to become primary in case the primary has failed. Patroni also helps to automate usual DBA tasks like switchover, configuration management, stuff like that.
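As a rough illustration (not from the episode): a single Patroni member is typically described by one YAML file along these lines, where the cluster name, hostnames, paths, and passwords are placeholders:

scope: demo-cluster          # cluster name shared by all members
name: node1                  # unique name of this member

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1.example.com:8008

etcd3:                       # the DCS; Consul, ZooKeeper or Kubernetes can be used instead
  hosts: etcd1:2379,etcd2:2379,etcd3:2379

bootstrap:
  dcs:
    postgresql:
      use_pg_rewind: true    # allow rewinding a failed primary back into a standby

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1.example.com:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    superuser:
      username: postgres
      password: change-me
    replication:
      username: replicator
      password: change-me

Each replica would run the same file with its own name and connect addresses; Patroni then handles the bootstrapping and ongoing management described next.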
                                         
Node provisioning also, right?

Not provisioning, not really. Node provisioning is a task for the DBA. The DBA has to start Patroni, and Patroni will take care of bootstrapping the node. In case it's a totally new cluster, Patroni will start it as a primary. In case the node is joining an existing cluster, the replica node will take a pg_basebackup by default from the running primary and start as a replica. And the most interesting part: let's say we bring back a node which was previously running as a primary. Patroni does everything to convert this failed primary into a new standby, to join the cluster and be prepared for the next unforeseen event.

At least you agree that it does part of node provisioning, because otherwise we wouldn't have situations when all the data was copied, a new copy was created, and we are suddenly out of disk space. And if you don't expect Patroni to participate in node provisioning, then you think: what's happening? Why am I out of disk space? Right?

It happens sometimes.
                                         
                                         It used to happen, I think, with bootstrap mode.
                                         
I don't remember up until which version, but when Patroni tries to create a new cluster, it usually does it by using initdb, but in some cases you can configure Patroni to create a cluster from an existing backup, like from a base backup.

And if something goes wrong, Patroni does not remove this data directory, but renames it. And it used to apply the current timestamp to the directory name, and therefore after the first failure it gives up, waits a little bit, and does the next attempt.

Appending the timestamp to the directory name, right?

Yeah, it uses yet another timestamp, creates a new data directory, fails, and renames it again. Now it is not working like this.
                                         
    
                                         It just renames pgdata to pgdata-alt,
                                         
                                         something like this.
                                         
                                         And that's why you will not have
                                         
                                         an infinite number of directories.
                                         
                                         And having just one is enough
                                         
                                         to investigate the failure.
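For context, the bootstrap paths discussed above are driven by the bootstrap section of the Patroni configuration; a minimal sketch, where the custom method name and restore script are hypothetical:

bootstrap:
  initdb:                    # default path: create a brand-new cluster with initdb
    - encoding: UTF8
    - data-checksums
  # Alternatively, a custom bootstrap method that restores from an existing base backup:
  # method: from_backup                                   # name is arbitrary
  # from_backup:
  #   command: /usr/local/bin/restore_latest_backup.sh    # hypothetical restore script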
                                         
But at maximum, if we expected our data directory to fill 70% of the disk, we could still end up out of disk space.
                                         
                                         Yeah, that's unfortunate.
                                         
But the other option really is that you just drop it, and at the same time all the evidence of what failed and why it failed is also gone. You have nothing to investigate.
                                         
Okay. To me it still sounds like Patroni participates in node provisioning. Yes, it doesn't bring you resources like disks and virtual machines and so on, but it brings data, the most important part of Postgres node provisioning, right? Okay, I just wanted to be right a little bit.
                                         
                                         Okay.
                                         
                                         Okay.
                                         
                                         It's a joke.
                                         
                                         Okay.
                                         
    
                                         I think diving deep quickly is great.
                                         
                                         It'd be good to discuss complex topics,
                                         
                                         but I think starting simple would also be good.
                                         
I would love to hear a little bit about the history of Patroni. Like the early days: what were you doing before Patroni to solve this kind of issue? And why was it built? What problems were there with the existing setups?

To be honest, while working for my previous company we didn't have any automatic failover solution in place. What we relied on was just a good monitoring system that sent you a message, or someone calls the engineer, just calls you in the night: the database failed. There were a lot of false positives, unfortunately, but it still felt more reliable than using solutions like Replication Manager, repmgr.
                                         
                                         Yeah, I remember this.
                                         
                                         Sorry for interrupting.
                                         
                                         I remember this very, very well.
                                         
Like people were constantly saying, we don't need auto-failover. It's evil, because it can switch over suddenly, fail over suddenly, and it's a mess. Let's rely on manual actions. I remember this time very well.
                                         
Yeah, in our defense, the number of databases, of database clusters, that we ran wasn't so high. I think a few dozen, and it was running on-prem, didn't fail so often, and therefore it was manageable.
                                         
                                         A bit later, we started moving to the cloud.
                                         
And suddenly, not suddenly, but luckily for us, we found a project named Governor, which basically brought an idea of how to implement auto-failover in a very nice manner, without having so many false positives and without the risk of running into a split-brain.
                                         
Was it an abandoned project already?
                                         
                                         No, no.
                                         
    
So it was not really abandoned, but it wasn't very active either. So we started playing with it, found some problems, reported those problems to the maintainer of Governor, got no reaction, unfortunately, and started fixing those problems on our own. At some moment a number of fixes and some new nice features accumulated, and we decided just to fork it and give a new name to the project. So this is how Patroni was born.
                                         
Georgian name, right?
                                         
                                         Right.
                                         
                                         Beautiful.
                                         
    
What does it mean?

Governor in Georgian, I think.

Oh, governor.

Yeah, almost, almost. Very close, but I'm not a good person to explain or to translate from Georgian, because I don't... I know yet another word in Georgian, and it's spilo.
                                         
                                         Yeah.
                                         
It translates from Georgian as elephant, right? And the name was chosen, I guess, by Valentin Gogichashvili, right?

Yes. He was... no, at that time he wasn't my boss anymore, but we still worked closely together, and I really appreciate his creativity in inventing good names for projects.
                                         
                                         Yeah, great names.
                                         
                                         And is this a good time to bring up Spilo?
                                         
                                         What is Spilo and how is that relevant?
                                         
Spilo, as I said, translates from Georgian as elephant.
                                         
                                         When we started playing with Governor,
                                         
    
                                         we were already targeting to deploy everything in the cloud.
                                         
                                         We had no other choice but to build a Docker image and provision Postgres in a container.
                                         
                                         And we called this Docker image Spilo.
                                         
Basically, we packaged Governor, Postgres, a few Postgres extensions, and, I think it was WAL-E back then, as a backup and point-in-time recovery solution.

And it still exists to this day as Spilo, but now with Patroni?

Yeah, of course. Now there is Patroni inside, and Spilo includes plenty of Postgres major versions, which might be an antipattern, but it allows doing major upgrades, and WAL-G instead of WAL-E.

And it's a part of the operator.
                                         
    
                                         Not really part of
                                         
                                         operator. So Spilo
                                         
                                         is a product
                                         
                                         on its own.
                                         
                                         I know that some people run
                                         
                                         Postgres on Kubernetes or
                                         
                                         even just on
                                         
                                         virtual machines
                                         
    
                                         with Spilo
                                         
                                         without using Operator.
                                         
                                         But using Docker, for example.
                                         
                                         Yeah, of course.
                                         
                                         But that is a good opportunity to discuss Postgres Operator.
                                         
                                         Postgres-Operator was Zalando's...
                                         
                                         Was that one of the first operators of its type?
                                         
                                         I know we've got lots these days.
                                         
    
Well, maybe it was, but at the same time the same name was used by Crunchy for their operator. They were developed in parallel, and back then Crunchy wasn't relying on Patroni yet. As I said, we started moving things to the cloud, and at some point the vector shifted a little bit and we started running plenty of workloads on Kubernetes, including Postgres. Since deploying everything manually and, more importantly, managing so many Postgres clusters manually was really a nightmare, we started building Postgres Operator. Back then, I don't think a very nice Go library to implement the operator pattern existed, and therefore people had to invent everything from scratch, and there was a lot of boilerplate code that was copied over and so on.
                                         
Was it only the move to the cloud that mattered here, or maybe also moving to microservices, splitting everything into microservices? Because I remember from...
                                         
    
                                         Microservices, of course, played a big role.
                                         
                                         And probably...
                                         
                                         Not probably.
                                         
Microservices were really the driving force to move to the cloud. Because with the scale of the organization, it wasn't possible to keep a monolith. And the idea was, let's split everything into microservices.
                                         
                                         And every microservice usually requires its own database.
                                         
                                         Right. Sometimes sharded database,
                                         
    
                                         like we used application sharding.
                                         
                                         In certain cases, the same database is used
                                         
                                         by multiple microservices, but it's a different story.
                                         
                                         But really, the number of database clusters
                                         
                                         that we had to support exploded,
                                         
                                         like from dozens to hundreds and then to thousands.
                                         
                                         And this is already when you cannot rely on humans to perform a failover, right?
                                         
                                         Even when you run a few hundred database clusters,
                                         
    
                                         it's better not to rely on humans to do maintenance, in my opinion.
                                         
                                         Right. So that's interesting.
                                         
And maybe it's also the right time to discuss why Postgres doesn't have internal, built-in failover. I remember discussions about replication when we relied on Slony, then Londiste, and some people resisted bringing replication inside Postgres, but somehow it was resolved, eventually.
                                         
    
                                         And Postgres has good replication,
                                         
                                         physical, logical,
                                         
                                         sometimes not good, but it's
                                         
                                         a different story. In general,
                                         
                                         it's very good, and improving, improving every release.
                                         
Just last week, Michael and I discussed the improvements to logical replication in 17, and maybe it will resonate a little bit with today's topic, Patroni. But it doesn't happen for auto-failover at all, right? Why?

So, I can only guess. Because to do it correctly, we cannot just have two nodes, which is what most people run, like a primary and a standby,
                                         
                                         because there are many different factors involved. And one of the most critical ones is the network
                                         
    
                                         between those nodes. And when just having two machines,
                                         
                                         you cannot distinguish between failure on the networking
                                         
                                         and failure of the primary.
                                         
Like if you just run a health check from a standby and make decisions based on that health check, you may have a false positive. Basically, the network just experiences some short glitch, which could last even a few seconds, sometimes a few minutes, but at the same time the old primary is still there. If we promote a standby, we get into a split-brain situation with two primaries, and it's not clear on which one transactions are running.
                                         
                                         And in the worst case, you end up in an application connecting to both of them.
                                         
                                         This is what...
                                         
                                         Good luck with assembling all these changes together.
                                         
                                         This is what tools like Replication Manager do.
                                         
So I ended up calling Replication Manager a split-brain solution, because I observed it many, many times.

As a mitigation, what is maybe possible to do: the primary can also run a health check, and in case the standby is not available, just stop accepting writes, either by restarting in read-only mode or maybe by implementing some other mechanism. But it also means that we lose availability without a good reason. So in this scenario, when we promote a standby: technically, if the standby cannot access someone else, it shouldn't accept writes.

Like in a network split.

We came close to the setup that Replication Manager, for example, calls a witness node.

Witness node, yes, exactly. Basically, you need to have more than two nodes.
                                         
                                         And the witness node should help making decision.
                                         
                                         Let's say we have witness node in some third failure domain.
                                         
    
The primary can see the witness node, therefore it can continue running as a primary. And the standby shouldn't be allowed to promote if it cannot access the witness node. And it already reminds us of systems like etcd.

Consensus algorithms.

Yeah, consensus algorithms, where a write is possible when it is accepted by a majority of nodes.

This is all already invented, right?

Yeah, so this is already invented, and it's what Patroni really relies on to implement auto-failover reliably.
                                         
                                         I can guess that at some moment in Postgres it will be added,
                                         
    
and we already have plenty of such components in Postgres. We have the write-ahead log with LSNs, which are always incremented. We have timelines, which are very similar to terms in etcd. So basically, in the end, we will just need to have more than two nodes, better three, so that we don't stop writes while one node is temporarily down. And it will give the possibility to implement auto-failover without even doing pg_rewind, let's say. Because when the primary writes to the write-ahead log, it will be first confirmed by standby nodes, and only after that...
                                         
                                         So, effectively, this is what
                                         
    
                                         we already have, but
                                         
                                         it's not enough, unfortunately.
                                         
                                         So, do you think at some
                                         
                                         point Patroni will not be needed, and everything
                                         
                                         will be inside Postgres, or no?
                                         
                                         I hope so,
                                         
                                         really. I hope so.
                                         
                                         No, no, no, not because I'm tired of maintaining Patroni,
                                         
    
                                         but because this is what people really want to have,
                                         
                                         to deploy highly available Postgres
                                         
                                         without necessity to research
                                         
                                         and learn a lot of external tools like Patroni,
                                         
solutions for backup and point-in-time recovery, and to
                                         
                                         upgrade them sometimes because we're
                                         
                                         always lagging with these
                                         
                                         upgrades
                                         
    
                                         but at the same time
                                         
                                         let's imagine that
                                         
                                         it happens in a couple of years
                                         
                                         but with five
                                         
                                         years support cycle
                                         
                                         there will be still a lot of
                                         
                                         setups that run in not recent
                                         
                                         Postgres versions and they still need to use something external like Patroni.
                                         
    
I'm actually looking right now at the commits of Replication Manager. It looks like the project has been inactive for more than one year, almost; a few commits, that's it. It's like going down.

Well, I probably have some insights about it,
                                         
                                         not about replication manager,
                                         
                                         but I know that EnterpriseDB was contributing
                                         
                                         some features and bug fixes to Patroni.
                                         
                                         So they officially support Patroni.
                                         
                                         So it sounds interesting, right?
                                         
    
                                         So Patroni is a winner, obviously.
                                         
                                         It's used by many Kubernetes operators,
                                         
                                         many of them, and not only Kubernetes, of course.
                                         
And with it winning, of course, some projects were abandoned, not only Replication Manager; we know some others, right? But you're thinking that one day everything will be in core and Patroni will be abandoned, maybe, right? And you think it's maybe for good.

So every project has its own life cycle.
                                         
                                         At some moment,
                                         
                                         the project is abandoned
                                         
    
                                         and not used by anyone.
                                         
                                         We are not there yet.
                                         
                                         Right, right.
                                         
                                         While we're in this area,
                                         
                                         I wanted to ask you
                                         
                                         what you think about
                                         
this: Kubernetes itself also relies on a consensus algorithm. Why do some operators choose to use Patroni, while others, like CloudNativePG, decide to rely on Kubernetes-native mechanisms and avoid using Patroni?
                                         
                                         What's better?
                                         
To be honest, I don't know what's driving the people that build CloudNativePG.
                                         
                                         But what's better in general?
                                         
                                         What are pros and cons?
                                         
    
                                         How to compare?
                                         
                                         What would you do?
                                         
In a sense, in CloudNativePG there is a component that tries to manage all Postgres clusters and decide whether some primary has failed, and promote one of the standbys.
                                         
                                         I'm not sure how they implement fencing of the failed primary, because if you don't correctly
                                         
                                         implement fencing and promote a standby to the primary, you again end up in
                                         
                                         a split-brain situation.
                                         
                                         And let's imagine that one Kubernetes node is isolated in the network.
                                         
    
                                         Network partition.
                                         
                                         Yeah.
                                         
And it automatically means that you will not be able to stop pods or containers that are
                                         
                                         running on this node. At the same time,
                                         
                                         applications that are running on this node will still use Kubernetes services to be able to
                                         
                                         connect to the isolated primary. Right, yeah.
                                         
So, Patroni detects such scenarios very easily, because the Patroni component runs in the same pod with Postgres, and in case it cannot write to the Kubernetes API, it just does self-fencing: it restricts Postgres to read-only.

It's simple, by the way, right?

Yeah. So I don't know if they do something similar; in case they don't, it's dangerous.

We should do a whole separate episode on CloudNativePG, actually. I think that would be a good one.

Yeah. I'm not saying that CloudNativePG is bad or does something wrong. I'm just trying to understand what they're doing and raising my concerns, of course.

Right. Back to Patroni. It worked like this from the beginning, but it feels like in version 4, which is the latest major release... there might be some life in it for a couple of years, by the way.
                                         
From the very beginning, we wanted to support this feature, but what was stopping us is the promise of Patroni with synchronous replication: that we want to promote a node that was synchronous at the time when the primary failed. If we just have a single name in synchronous_standby_names, a single node, it's very easy to say, okay, this node was synchronous and therefore we can just promote it. When there is more than one node and we require all of them to be synchronous, we can promote any of them. But with quorum-based replication, you can have something like any one from a list of, let's say, three nodes. Which one is synchronous when the primary failed?
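In Postgres terms, "any one from a list of three" corresponds to the ANY form of synchronous_standby_names; the node names here are illustrative, and with Patroni this GUC is normally managed by Patroni itself rather than set by hand:

# postgresql.conf: any one out of three listed standbys must confirm each commit
synchronous_standby_names = 'ANY 1 (node2, node3, node4)'

# In Patroni 4, if I read the release notes correctly, this is driven by the
# dynamic configuration rather than by editing the GUC directly, roughly:
synchronous_mode: quorum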
                                         
I'm not demanding that you answer this question, so I will just explain how it works in Patroni, as of the last major release. The information about the current value of synchronous_standby_names is also stored in etcd. Therefore, those three nodes that are listed in synchronous_standby_names know that they are listed as quorum nodes.
                                         
    
And during the leader race, they need to access each other and get some number of votes. If there are three nodes, it means that every node, to become a new primary, a new candidate, needs to access at least the two remaining nodes and get confirmation that they are not ahead of the LSN on the current node.
                                         
Is it clear, or should I elaborate a little bit more?
                                         
So if they were ahead... let me ask this stupid question: if a node checks that it is ahead of the current candidate to be leader, then it's a bad decision to promote that candidate, because a different one would...

So, just for your understanding, in Patroni there is no central component that decides which node to promote.
                                         
                                         Every node makes decisions on its own.
                                         
Therefore, every standby node listed in synchronous_standby_names goes through a cycle of health checks. It accesses the remaining nodes from synchronous_standby_names and checks at what LSN they are. And if they're on the same LSN or behind, we can assume that this node is the healthiest one. And the same procedure happens on the remaining nodes. Basically, this way we can find: okay, this node is eligible to become the new primary. In case we have something like ANY 2 of three nodes, we can make a decision by asking just a single node, because we know that two nodes will have the latest commits, the latest commits that are reported to the client. And it will be enough to just ask a single node. Although it will ask all nodes from synchronous_standby_names, in case one of them, let's say, failed together with the primary, it is still enough to make a decision by asking the remaining one.
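A tiny worked example of that counting argument, with hypothetical node names and LSNs:

# synchronous_standby_names = 'ANY 2 (node2, node3, node4)'
# Every commit acknowledged to a client is on the primary plus at least 2 of the 3 listed standbys.
# Suppose the primary and node4 fail together. node2 only needs to reach node3:
#   node2 reports LSN 0/5000120, node3 reports LSN 0/5000078 -> node3 is not ahead, so node2 may promote.
# This is safe because any acknowledged commit sits on at least 2 of {node2, node3, node4},
# so whichever of the two surviving listed nodes is furthest ahead must already hold it.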
                                         
    
                                         Nice.
                                         
And the tricky part comes when we need to change synchronous_standby_names and the values that we store in etcd. Let's say we want to increase the number of synchronous nodes, like from 1 to 2. What should we change first, the synchronous_standby_names GUC, or the value in etcd, so that we can correctly make a decision? If we change the value in etcd first, it will assume, okay, we need to ask just a single node to make a decision, although there is just one node that has the latest commits, 100%. And in fact, we need to ask two. Therefore, when we're increasing this from one to two, first we need to update synchronous_standby_names, and only after that the change in etcd. And there are almost dozens of rules that one needs to follow to do such changes in the correct order.
                                         
Because it's not only about changing the replication factor, it's also about adding new nodes to synchronous_standby_names, or removing nodes that have disappeared, and so on. And I don't think any other solution applies an algorithm for such changes.

How much time did you spend on this?

Yeah. So, originally this feature was implemented by Ants Aasma.
                                         
He's working for Cybertec.
                                         
                                         It happened in 2018.
                                         
    
                                         I did a few attempts to understand, like, this great logic of this algorithm.
                                         
                                         And finally, like, almost five years after, like, I was able to get enough time to fully focus on the problem.
                                         
                                         And even after that, I spent a couple of months implementing and fixing some bugs and corner cases and implementing all possible unit tests to cover all such transitions.
                                         
                                         There is no book which describes this, that you could follow.
                                         
                                         This is something really new that needs to be invented, right?
                                         
                                         Well, the idea was obvious, how to do it, or what to do.
                                         
But implementing it correctly, and proving that it really works correctly, is a different story.

Thanks for going through all this history. There is one more feature which I would like to mention. It was Patroni 3.0: DCS failsafe mode. So DCS is distributed configuration storage.
                                         
    
                                         And actually we just experienced
                                         
                                         a couple of outages
                                         
because we are in Google Cloud, on their Kubernetes, running the Zalando operator with Patroni, of course.
                                         
                                         And I just checked the version of Patroni, and it seems to have it.
                                         
                                         But I don't think it is enabled.
                                         
    
                                         Exactly. This is my second question, actually.
                                         
Why is it not enabled?
                                         
                                         So, first question, what is it?
                                         
How do you solve this problem when etcd or Consul is temporarily out?

Let's start with the problem statement.
                                         
The promise of Patroni is that it will run a node as a primary only while it can write to the distributed configuration store, like to etcd. If it cannot write to etcd, it means that maybe something is wrong with etcd, or maybe this node is isolated and therefore writes are failing. When the node is isolated, this is working as designed: Patroni cannot write to etcd, so it will stop Postgres and restart it in read-only mode.

But in case etcd is totally down, because of, I don't know, some human mistake, and you cannot access any single node of etcd, in this case Patroni also stops the primary and starts it read-only, to protect from the case where, let's say, some standby nodes can access the DCS at the same time and promote one of the nodes. So people were really annoyed by this problem and were asking why we are demoting the primary, and so far the answer was always: all right, we cannot determine the state, and therefore we demote to be on the safe side.
                                         
The idea of how to improve on that came at one of the Postgres conferences, after talking with other Patroni users.
                                         
How is it improved with failsafe mode? When the primary sees that none of the etcd nodes are accessible, it tries to access all Patroni nodes in the cluster using the Patroni REST API. And if the Patroni primary can get a response from all nodes of the Patroni cluster in failsafe mode, it will continue to run as a primary. This is a much stronger requirement than quorum or consensus: it is not expecting responses from, let's say, a majority; it really wants responses from all standby nodes in order to continue running as a primary.

Yeah, so this feature was introduced in Patroni version 3, but it is not enabled by default, because I think there are some side effects when you enable this mode in certain environments. Probably it is related to environments where your node can come back with a different name.

We need to think about that.

This behavior is documented.

Yes, we will definitely look into it, thank you very much for that.

On Kubernetes it is safe to enable it, because pods always have the same name, just different IP addresses.

Yeah, we should start using it, this is what I think as well. I just got help, and, as usual, I just wanted to publicly thank you for all the help you do, for me and actually for many companies, over many years. This is huge, thank you so much.

Yeah, I'm happy to help.
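To make the failsafe check a bit more concrete, here is a minimal sketch of the idea described above, not Patroni's actual code: when the primary cannot write to the DCS, it keeps accepting writes only if every member of the cluster answers over the REST API. The URLs, the endpoint used, and the demote helper are assumptions for the example.

    import requests

    def can_stay_primary(member_api_urls: list[str], timeout: float = 2.0) -> bool:
        """Return True only if every cluster member responds over the REST API."""
        for url in member_api_urls:
            try:
                # GET /patroni is used here just as a liveness probe (assumption)
                if requests.get(f"{url}/patroni", timeout=timeout).status_code != 200:
                    return False
            except requests.RequestException:
                return False   # a single unreachable member means we must demote
        return True

    # if dcs_unreachable and not can_stay_primary(["http://node2:8008", "http://node3:8008"]):
    #     demote_to_read_only()   # hypothetical helper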
                                         
So another thing I wanted to discuss is probably replication slots. I remember a few years ago you implemented support for failover of logical slots, and now we have it in Postgres, right? So finally one thing, I think, moved from Patroni into Postgres itself, or do you still keep this functionality?

We still keep it, and we didn't do anything special for Postgres 17. It was... I think it was in 16 even, no? Failover of... ah. Or 17. Well, the ability to use a logical slot
                                         
    
                                         on physical standbys was in 16 but failover in 17. We just discussed it. Yes exactly exactly.
                                         
Exactly. I confused them. That's why I'm saying we didn't do anything special, although I did some tricks to make this feature work with Patroni, because it requires having your database name in primary_conninfo. And Patroni did not put the DB name into primary_conninfo, because for physical replication it is not useful.
                                         
But now it is there.

I'm thinking about this case: yes, we have a primary and some standby nodes, and one of them is used to logically replicate something to some other Postgres, Snowflake, or something. Or Kafka or something. If that's the case, then yes, we replicate from a standby, because it's good: we don't use the primary's resources, the WAL sender doesn't consume CPU on the primary, and so on. And there is no risk for its disk. So now we have such a standby, and it suddenly disappeared. It's not Patroni's job to deal with it, right? Because we need some mechanisms for another standby.

You mean to keep the logical replication slot on a new standby, to which you would like to connect. In theory, Patroni could probably provide this, since it is possible to do logical replication from standby nodes since Postgres 16. So how it's implemented currently in Patroni, the logical failover slots feature: it creates logical slots on standby nodes and uses pg_replication_slot_advance to move the slot to the same LSN as it is currently on the primary. So basically the assumption is that logical replication happens on the primary. In theory there is no reason why it cannot be done for standby nodes. Let's say we create logical slots on all standby nodes with the same name, and Patroni can track which one is active and store that in etcd. In theory this could work, but I don't know whether I will find the time to do it.
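As a rough sketch of the mechanism described here, one round of synchronisation could look roughly like this; it is not Patroni's implementation, just the two calls involved, assuming the slot already exists on the standby and using made-up connection strings and slot name.

    import psycopg2

    SLOT = "my_logical_slot"   # hypothetical slot name

    with psycopg2.connect("host=primary dbname=postgres") as prim, \
         psycopg2.connect("host=standby dbname=postgres") as stby:
        with prim.cursor() as cur:
            cur.execute(
                "SELECT confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = %s",
                (SLOT,),
            )
            lsn = cur.fetchone()[0]
        with stby.cursor() as cur:
            # Move the standby's copy of the slot up to the primary's position.
            # As noted later in the conversation, this decodes WAL and can be slow.
            cur.execute("SELECT pg_replication_slot_advance(%s, %s)", (SLOT, lsn))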
                                         
I'm just trying to understand. It's quite a new feature since 16, to logically replicate from physical standbys, but...

But be aware that it still affects the primary. Right. Right. So maybe pg_wal will not blow up, but pg_catalog bloat certainly will.
                                         
                                         Yeah, this for sure.
                                         
I was referring to the need to preserve WAL files on the primary.
                                         
                                         This risk has gone if we do this,
                                         
                                         but I cannot imagine how we can start using logical slots
                                         
    
                                         on physical standbys in serious projects without HA ideas.
                                         
                                         Because right now I don't understand how we solve HA for this.
                                         
Yeah, and unfortunately, this hack that Patroni implements with pg_replication_slot_advance has its downsides. It literally takes as much time to move the position of the logical slot
                                         
as you consume from this slot. That's unfortunate. And how it's solved in Postgres 17: it basically does not need to parse the whole WAL and decode it. It just literally overwrites some numbers in the replication slot, because it knows the exact positions, and it does it safely. Patroni cannot do that. Although, possibly, the pg_failover_slots extension could do the same for older versions.

Okay, that's another area for me to explore more deeply, because I like understanding many of these details. Good pieces of advice as well, thank you so much. Anything else, Michael, you wanted to discuss? Obviously, one of the biggest features was Citus support, right? But I'm not using Citus actively, so I don't know; if you want to discuss this, let's discuss.
                                         
                                         I know that some people certainly do, because from time to time I get questions about Citus with Patroni on Slack,
                                         
or maybe not Citus-specific questions, but according to the output of patronictl list, they are running a Citus cluster.
                                         
So there is certainly demand, and I believe that Patroni implementing Citus support improved the quality of life of some organizations and people that want to run sharded setups.
                                         
                                         Is there anything specific you needed to solve
                                         
                                         to support this or like technical details?
                                         
                                         To support Citus?
                                         
                                         So Citus, I wouldn't say that it was very hard,
                                         
                                         but it wasn't very easy either.
                                         
So Citus has the notion of a Citus coordinator, where originally you are supposed to use the coordinator for everything: to do DDL, to run transactional workload and so on. And on the coordinator there is a metadata table where you register all worker nodes, and the worker nodes are where you keep the actual data, the shards. What I had to implement in Patroni is registering worker nodes in this metadata automatically, and in case a failover happens on the worker nodes, we need to update the metadata and put in the new IPs or host names, whatever. Basically, when you want to scale out your Citus cluster, you just start more worker nodes, and every worker node is in fact another small Patroni cluster. So, technically, in patronictl, it looks like just a single cluster, but in fact, it's one cluster for the coordinator, one cluster for every worker node, and
                                         
                                         on each of them
                                         
    
                                         there is its own failover
                                         
                                         happening. If you start worker nodes in a different group, like in the new one, it joins the existing
                                         
Citus cluster, and Patroni on the coordinator registers the new worker nodes.
                                         
                                         But what Patroni will not do, it will not redistribute existing data to the new workers.
                                         
                                         This is something that you will have to do manually afterwards and it has to be your
                                         
                                         own decision how to scale your data
                                         
                                         and replicate to other nodes. Although nowadays it's possible to do it without downtime because
                                         
all enterprise features of Citus are included since Citus version 10. So everything that was in the enterprise edition is now open source.
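As an illustrative sketch of the coordinator-side bookkeeping mentioned above, the two Citus calls involved look roughly like this. The hostnames, ports, and connection string are made up, and the real automation lives inside Patroni; this only shows what "registering a worker" and "updating metadata after a worker failover" mean.

    import psycopg2

    with psycopg2.connect("host=coordinator dbname=postgres") as conn:
        with conn.cursor() as cur:
            # Register a brand-new worker in the coordinator's metadata.
            cur.execute("SELECT citus_add_node(%s, %s)", ("worker-2.internal", 5432))

            # After a failover inside a worker's own Patroni cluster,
            # point its metadata entry at the new primary's address.
            cur.execute(
                "SELECT citus_update_node(nodeid, %s, %s) "
                "FROM pg_dist_node WHERE nodename = %s",
                ("worker-1-new.internal", 5432, "worker-1.internal"),
            )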
                                         
    
                                         That's cool.
                                         
I saw Alexander has a good demo of this, of Citus and Patroni working together, including
                                         
                                         rebalancing. I think it was CitusCon last year?
                                         
                                         Yeah, it was CitusCon.
                                         
                                         Nice. I'll include that video in the show notes.
                                         
                                         I wish I had all this a few years ago. Yeah, of course.
                                         
                                         There was a little bit more work under the hood.
                                         
In case you do a write workload via the coordinator, it's possible for Patroni to do some tricks to
                                         
    
                                         avoid client connection termination while switchover of worker nodes is happening.
                                         
                                         This is what I did during the demo.
                                         
                                         There are certain tricks, but unfortunately,
                                         
                                         it works only on coordinator and only for write workloads.
                                         
                                         For read-only workloads, your connection will be broken.
                                         
                                         That's unfortunate.
                                         
Maybe one day it will be fixed.
                                         
                                         So in the Citus, maybe one day
                                         
    
                                         the same
                                         
                                         stuff will also work
                                         
                                         on worker nodes. And by the way,
                                         
                                         on Citus, you can run
                                         
                                         transactional workload
                                         
                                         by connecting to every worker node.
                                         
                                         Only DDL
                                         
                                         must happen via
                                         
    
                                         coordinator.
                                         
                                         Nice. Speaking of improvements in the future,
                                         
do you have anything lined up that you still want to improve in Patroni?
                                         
                                         Hmm. That's a very good question.
                                         
Usually some nice improvements come out of nothing.
                                         
                                         Like you don't plan anything, but you talk to people and they say,
                                         
                                         it would be nice to have this improvement or this feature.
                                         
                                         And you start thinking about it.
                                         
    
                                         Wow, yeah, it's a very nice idea and it's great to have it.
                                         
                                         But I rarely plan some big features from the ground up, let's say.
                                         
So what I had in mind, for example, is failover to a standby cluster in Patroni. Right now it's possible to run a standby cluster, which is not aware of the source it replicates from. It could be replicating from another Patroni cluster. And what people ask: we have a primary Patroni cluster, we have standby Patroni clusters, but there is no mechanism to automatically promote the standby cluster, because it's running in a different region and it is using a completely different etcd.
                                         
                                         So they simply don't know about each other.
                                         
    
                                         It would be nice to have, but again I cannot promise when I can start working on it and
                                         
                                         whether it will happen.
                                         
I know that people from CyberTec did some experiments and have some proof-of-concept solutions that seem to work, but for some reason they're also not happy with the solution they implemented.
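For context, what does exist today is the standby cluster definition itself; promoting the whole cluster remains a manual step, which is exactly the gap being discussed. Below is a rough Python rendering of the relevant piece of configuration, which is normally written as YAML; the host, port, and slot name here are made up.

    # Approximate shape of the standby_cluster section in the Patroni DCS config.
    standby_cluster_config = {
        "standby_cluster": {
            "host": "primary-cluster.example.com",   # remote source to replicate from
            "port": 5432,
            "primary_slot_name": "standby_cluster",  # optional physical slot on the source
        }
    }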
                                         
                                         Yeah, sounds tricky.
                                         
                                         Distributed systems are always tricky.
                                         
                                         Yeah, get that on a t-shirt.
                                         
Thank you for coming. As usual, I use the podcast, and all the events I participate in and organize and so on, just for my personal education and for daily work as well. I just thank you so much for the help, again.

Yes, thank you for inviting me. Yeah, it's a nice job that you are doing. I know that many people are listening to your podcast and are very happy that they're learning a lot of great stuff
                                         
                                         and also making
                                         
                                         a big list
                                         
                                         of to-do items
                                         
    
                                         I cannot
                                         
                                         say the same about myself
                                         
                                         that I watch every single
                                         
                                         episode but
                                         
                                         sometimes I do
                                         
                                         
                                         Cool.
                                         
                                         Thank you.
                                         
    
                                         Thanks so much, Alexander.
                                         
                                         Cheers, Nikolai.
                                         
                                         Thank you.
                                         
                                         Bye-bye.
                                         
                                         Bye.
                                         
                                         Bye.
                                         
