Postgres FM - Patroni
Episode Date: October 4, 2024
Michael and Nikolay are joined by Alexander Kukushkin, PostgreSQL contributor and maintainer of Patroni, to discuss all things Patroni: what it is, how it works, recent improvements, and more.
Here are some links to things they mentioned:
Alexander Kukushkin https://postgres.fm/people/alexander-kukushkin
Patroni https://github.com/patroni/patroni
Spilo https://github.com/zalando/spilo
Zalando Postgres Operator https://github.com/zalando/postgres-operator
Crunchy Data Postgres Operator https://github.com/CrunchyData/postgres-operator
Split-brain https://en.wikipedia.org/wiki/Split-brain_(computing)
repmgr https://github.com/EnterpriseDB/repmgr
CloudNativePG https://github.com/cloudnative-pg/cloudnative-pg
Patroni release notes https://patroni.readthedocs.io/en/latest/releases.html
Citus & Patroni talk and demo by Alexander (at Citus Con 2023) https://www.youtube.com/watch?v=Mw8O9d0ez7E
~~~
What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!
~~~
Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai
With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard. I'm joined as usual by Nikolai, founder of Postgres.ai.
Hey, Nikolai.
Hi, Michael.
And today we are delighted to be joined by Alexander Kukushkin,
a Postgres contributor currently working at Microsoft and most famously, maintainer of Patroni.
We had a listener request to discuss Patroni, so we're delighted you agreed to join us for an episode.
Alexander.
Hello, Michael. Hello, Nikolai. Thank you for inviting me.
I'm really excited to talk about my favorite project.
Us too.
Perhaps as a starting point, could you give us an introduction?
Most people, I think, will have heard of Patroni and know what it is, but for anybody that doesn't, could you give an introduction to what it is and why it's important?
Yeah, so Patroni, in simple words, is a failover manager for Postgres. It solves the problem of availability of the primary. In Postgres we don't use non-inclusive words like master, that's why we call it primary, and Patroni actually recently got rid of this non-inclusive word completely. The way Patroni does it: it makes sure that we're running just a single primary at a time, and at the same time Patroni helps you to manage as many read-only replicas as you like to have, keeping those replicas ready to become primary in case the primary has failed. At the same time, Patroni helps to automate usual DBA tasks like switchover, configuration management, stuff like that.
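For anyone who has not seen it, Patroni is configured per node with a small YAML file. A rough sketch, with the cluster name, host names and passwords invented purely for illustration, might look like this:

```yaml
scope: demo-cluster            # cluster name, shared by all members
name: node1                    # unique name of this member
restapi:
  listen: 0.0.0.0:8008
  connect_address: node1.example.com:8008
etcd3:
  hosts: etcd1.example.com:2379,etcd2.example.com:2379,etcd3.example.com:2379
bootstrap:
  initdb:                      # only used when bootstrapping a brand-new cluster
    - data-checksums
postgresql:
  data_dir: /var/lib/postgresql/data
  listen: 0.0.0.0:5432
  connect_address: node1.example.com:5432
  authentication:
    superuser:
      username: postgres
      password: change-me
    replication:
      username: replicator
      password: change-me
```

A replica is just another node started with the same scope and its own name, which ties into the bootstrapping behaviour discussed next.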
Node provisioning also, right?
Not provisioning, not really. Node provisioning is a task for the DBA. The DBA has to start Patroni, and Patroni will take care of bootstrapping this node. In case it's a totally new cluster, Patroni will start it as a primary. In case of a node joining an existing cluster, the replica node will take a pg_basebackup, by default, from the running primary and start as a replica. And the most interesting part: let's say we bring back a node which was previously running as a primary, and Patroni does everything to convert this failed primary into a new standby, so it joins the cluster and is prepared for the next unforeseen event.
agree that it does part of
node provisioning because otherwise
we wouldn't have situations when
all data directory, all
data was copied
and new one is created
and we are suddenly out of disk space
and if you don't expect Patroni
to participate in node provisioning
then you think, what's happening?
Why am I out of disk space?
Right?
It happens sometimes.
It used to happen, I think, with bootstrap mode.
I don't remember up until which version, but when Patroni tries to create a new cluster, usually it does it using initdb, but in some cases you can configure Patroni to create a cluster from an existing backup, like from a base backup. And if something goes wrong, Patroni does not remove this data directory, but renames it. And it used to apply the current timestamp to the directory name, and therefore after the first failure it gives up, waits a little bit, and does the next attempt...
...applying a new timestamp to the directory name, right?
Yeah, it uses yet another base backup, creates a new data directory, fails, and renames it again. Now it does not work like this. It just renames pgdata to pgdata-alt, something like this. And that's why you will not have an infinite number of directories. And having just one is enough to investigate the failure.
But at maximum, if we expected our data directory to fill 70% of the disk, we still might end up out of disk space.
Yeah, that's unfortunate. But the other option, really, is to just drop it, but at the same time all the evidence of what failed and why it failed is also gone. You have nothing to investigate.
Okay. To me it still sounds like Patroni participates in node provisioning. Yes, it doesn't bring you resources like disk and virtual machines and so on, but it brings data, the most important part of Postgres node provisioning, right? Okay, I just wanted to be right a little bit.
Okay.
Okay.
It's a joke.
Okay.
I think diving deep quickly is great.
It'd be good to discuss complex topics,
but I think starting simple would also be good.
I would love to hear a little bit about the history of Patroni. Like the early days: what were you doing before Patroni to solve this kind of issue?
And why was it built? What problems were there with the existing setups?
To be honest, while working for my previous company we didn't have any automatic failover solution in place. What we relied on was just a good monitoring system that sent you a message, or someone calls the engineer, just calls you in the night: the database failed. There were a lot of false positives, unfortunately, but it still felt more reliable than using solutions like replication manager, repmgr.
Yeah, I remember this.
Sorry for interrupting. I remember this very, very well. People were constantly saying: we don't need autofailover, it's evil, because it can switch over suddenly, fail over suddenly, and it's a mess. Let's rely on manual actions. I remember this time very well.
Yeah, in our excuse, the number of database clusters that we ran wasn't so high, I think a few dozen, and it was running on-prem, didn't fail so often, and therefore it was manageable.
A bit later, we started moving to the cloud.
And suddenly, not suddenly, but luckily for us, we found a project named Governor, which basically brought an idea of how to implement autofailover in a very nice manner, without having so many false positives and without the risk of running into a split-brain.
Was it an abandoned project already?
No, no. So it was not really abandoned, but it wasn't very active either. We started playing with it, found some problems, reported those problems to the maintainer of Governor, got no reaction, unfortunately, and started fixing those problems on our own. At some moment, the number of fixes and some new nice features accumulated, and we decided to just fork it and give a new name to the project. So this is how Patroni was born.
Georgian name, right?
Right.
Beautiful.
What does it mean?
Governor in Georgian, I think so.
Oh, governor.
Yeah, almost, almost. Very close, but I'm not a good person to explain or to translate from Georgian, because I don't... I know yet another word in Georgian, and it's Spilo.
Yeah.
It translates from Georgian as elephant, right? And the name was chosen, I guess, by Valentin Gogichashvili, right?
Yes. At that time he wasn't my boss anymore, but we still worked closely together, and I really appreciate his creativity in inventing good names for projects.
Yeah, great names.
And is this a good time to bring up Spilo?
What is Spilo and how is that relevant?
Spilo, as I said, translates from Georgian as elephant.
When we started playing with Governor,
we were already targeting to deploy everything in the cloud.
We had no other choice but to build a Docker image and provision Postgres in a container.
And we called this Docker image Spilo.
Basically, we packaged Governor, Postgres, a few Postgres extensions, and I think it was WAL-E back then as a backup and point-in-time recovery solution.
And it still exists to this day as Spilo but now with Patroni?
Yeah of course.
Now there is Patroni inside, and Spilo includes plenty of Postgres major versions, which could be an antipattern, but it allows doing major upgrades. And now WAL-G instead of WAL-E.
And it's a part of the operator?
Not really a part of the operator. Spilo is a product on its own. I know that some people run Postgres on Kubernetes, or even just on virtual machines, with Spilo, without using the operator.
But using Docker, for example.
Yeah, of course.
But that is a good opportunity to discuss Postgres Operator.
Postgres-Operator was Zalando's...
Was that one of the first operators of its type?
I know we've got lots these days.
Well, maybe it was, but at the same time the same name was used by Crunchy for their operator. They were developed in parallel, and back then Crunchy wasn't relying on Patroni yet. As I said, we started moving things to the cloud, and at some point the vector shifted a little bit and we started running plenty of workloads on Kubernetes, including Postgres. Since deploying everything manually and, more importantly, managing so many Postgres clusters manually was really a nightmare, we started building the Postgres Operator. Back then, I don't think a nice Go library to implement the operator pattern existed, and therefore people had to invent everything from scratch, and there was a lot of boilerplate code that got copied over, and so on.
Was it only the move to the cloud that mattered here, or maybe also the move to microservices, splitting everything into microservices? Because I remember from...
Microservices, of course, played a big role. And probably... not probably: microservices were really the driving force to move to the cloud. Because with the scale of the organization, it wasn't possible to keep the monolith. And the idea was: let's split everything into microservices. And every microservice usually requires its own database.
Right. Sometimes sharded database,
like we used application sharding.
In certain cases, the same database is used
by multiple microservices, but it's a different story.
But really, the number of database clusters
that we had to support exploded,
like from dozens to hundreds and then to thousands.
And this is already when you cannot rely on humans to perform a failover, right?
Even when you run a few hundred database clusters,
it's better not to rely on humans to do maintenance, in my opinion.
Right. So that's interesting.
And maybe it's also the right time to discuss why Postgres doesn't have internal, built-in autofailover. I remember discussions about replication, when we relied on Slony, then Londiste, and some people resisted bringing replication inside Postgres, but somehow it was resolved, eventually.
And Postgres has good replication,
physical, logical,
sometimes not good, but it's
a different story. In general,
it's very good, and improving every release. Just last week Michael and I discussed the improvements to logical replication in 17, and maybe it will resonate a little bit with today's topic, Patroni. But it doesn't happen for autofailover at all, right? Why?
So, I can only guess. Because to do it correctly, we cannot just have two nodes, which is what most people run, like primary and standby,
because there are many different factors involved. And one of the most critical ones is the network
between those nodes. And when just having two machines,
you cannot distinguish between failure on the networking
and failure of the primary.
If you just run a health check from a standby and make a decision based on that health check, you may have a false positive. Basically, the network just experiences some short glitch, which could last even a few seconds, sometimes a few minutes, but at the same time the old primary is still there. If we promote a standby, we get into a split-brain situation with two primaries, and it's not clear to which one transactions are going. And in the worst case, you end up with applications connecting to both of them.
Good luck with assembling all these changes together.
This is what tools like replication manager do.
So I ended up calling replication manager a split-brain solution, because I observed it many, many times.
As a mitigation, what is maybe possible to do: the primary can also run a health check, and in case the standby is not available, just stop accepting writes, either by restarting in read-only mode or maybe by implementing some other mechanism. But it also means that we lose availability without a good reason. So with this scenario, when we promote a standby: technically, if the standby cannot access someone else, it shouldn't accept writes. Like in a network split.
We're getting close to the setup that replication manager calls a witness node.
Witness node, yes, exactly. Basically, you need to have more than two nodes, and the witness node should help with making the decision. Let's say we have a witness node in some third failure domain. The primary can see the witness node, therefore it can continue to run as a primary. And the standby shouldn't be allowed to promote if it cannot access the witness node. And it already reminds you of some systems like etcd.
Consensus algorithms.
Yeah, consensus algorithms, and a write is possible when it is accepted by a majority of nodes.
This is all already invented, right?
Yeah, so this is already invented, and it's what Patroni really relies on to implement automatic failover reliably.
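To make that concrete: in Patroni, members compete for a leader key in the DCS that expires unless the current leader keeps renewing it, and the timings of that lease live in the dynamic configuration. The values below are the documented defaults, shown only as an illustration:

```yaml
# Dynamic configuration, editable with `patronictl edit-config`
ttl: 30              # seconds the leader key stays valid without renewal
loop_wait: 10        # how often each member runs its HA loop (and the leader renews the key)
retry_timeout: 10    # how long to retry DCS/Postgres operations before demoting to be safe
maximum_lag_on_failover: 1048576   # replicas lagging by more bytes than this are not promoted
```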
I can guess that at some moment it will be added in Postgres, and we already have plenty of such components in Postgres. We have the write-ahead log with the LSN, which is always incremented. We have timelines, which are very similar to terms in etcd. So basically, in the end we would just need to have more than two nodes, better three, so that we don't stop writes while one node is temporarily down. And it would give the possibility to implement autofailover without even doing pg_rewind, let's say. Because when the primary writes to the write-ahead log, it would first be confirmed by standby nodes, and only after that... So, effectively, this is what we already have, but it's not enough, unfortunately.
So, do you think at some
point Patroni will not be needed, and everything
will be inside Postgres, or no?
I hope so,
really. I hope so.
No, no, no, not because I'm tired of maintaining Patroni, but because this is what people really want to have: to deploy highly available Postgres without the necessity to research and learn a lot of external tools like Patroni, solutions for backup and point-in-time recovery, and to upgrade them sometimes, because we're always lagging with these upgrades. But at the same time, let's imagine that it happens in a couple of years. With the five-year support cycle, there will still be a lot of setups running not-so-recent Postgres versions, and they will still need to use something external like Patroni.
I'm actually looking right now at the commits of replication manager. It looks like the project has been inactive for more than one year, almost; a few commits, that's it. It's like going down.
Well, I probably have some insights about it,
not about replication manager,
but I know that EnterpriseDB was contributing
some features and bug fixes to Patroni.
So they officially support Patroni.
So it sounds interesting, right?
So Patroni is a winner, obviously.
It's used by many Kubernetes operators,
many of them, and not only Kubernetes, of course.
And in winning, of course, some projects were abandoned, not only replication manager, we know some others, right? But you're thinking that one day everything will be in core and Patroni will be abandoned maybe, right? And you think it's maybe for good.
So every project has its own life cycle.
At some moment,
the project is abandoned
and not used by anyone.
We are not there yet.
Right, right.
While we're in this area, I wanted to ask you what you think about Kubernetes: it also relies on a consensus algorithm itself. Why do some operators choose to use Patroni, while others, like CloudNativePG, decide to rely on Kubernetes-native mechanisms and avoid using Patroni? What's better?
To be honest, I don't know what's driving the people that build CloudNativePG.
But what's better in general?
What are pros and cons?
How to compare?
What would you do?
In a sense, with CloudNativePG there is a component that tries to manage all Postgres clusters and decide whether some primary has failed, and promote one of the standbys.
I'm not sure how they implement fencing of the failed primary, because if you don't correctly
implement fencing and promote a standby to the primary, you again end up in
a split-brain situation.
And let's imagine that one Kubernetes node is isolated in the network.
Network partition.
Yeah.
And it automatically means that you will not be able to stop pods or containers that are running on this node. At the same time, applications that are running on this node will still be able to use Kubernetes services to connect to the isolated primary.
Right, yeah.
So, Patroni detects such scenarios very easily, because the Patroni component runs in the same pod as Postgres, and in case it cannot write to the Kubernetes API, it does self-fencing: it restarts Postgres in read-only mode.
It's simple, by the way, right?
Yeah. So I don't know if they do something similar; in case they don't, it's dangerous.
We should do a whole separate episode on CloudNativePG, actually. I think that would be a good one.
Yeah. I'm not saying that CloudNativePG is bad or does something wrong. I'm just trying to understand what they're doing and raising my concerns, of course. Right, back to Patroni. It worked like this from the beginning, but it feels like in version 4, which is the latest major release (and there might be some life in it for a couple of years, by the way), something changed around synchronous replication and quorum commit?
From the very beginning, we wanted to support this feature, but what was stopping us is the promise of Patroni with synchronous replication: that we want to promote a node that was synchronous at the time when the primary failed. If we just have a single name in synchronous_standby_names, like a single node, it's very easy to say, okay, this node was synchronous and therefore we can just promote it. When there is more than one node and we require all of them to be synchronous, we can promote any of them. But with quorum-based replication, you can have something like ANY 1 from a list of, let's say, three nodes. Which one was synchronous when the primary failed? I'm not demanding an answer to this question, so I will just explain how it works in Patroni, in the last major release. The information about the current value of synchronous_standby_names is also stored in etcd. Therefore, those three nodes that are listed in synchronous_standby_names know that they are listed as quorum nodes. And during the leader race, they need to access each other and get some number of votes. If there are three nodes, it means that every node, to become a new primary, a new candidate, needs to access at least the two remaining nodes and get confirmation that they are not ahead of the LSN on the current node. Is it clear? Maybe I should elaborate a little bit more.
So if they were ahead... let me ask this stupid question: if a node checks that it is ahead of the current candidate to be leader, then it's a bad decision to promote that leader, because a different one would...
So just for your understanding: in Patroni there is no central component that decides which node to promote.
Every node makes decisions on its own.
Therefore, every standby node listed in synchronous_standby_names goes through the cycle of health checks. It accesses the remaining nodes from synchronous_standby_names and checks at what LSN they are. And if they're at the same LSN or behind, we can assume that this node is the healthiest one. And the same procedure happens on the remaining nodes. Basically, this way we can find: okay, this node is eligible to become the new primary. In case we have something like ANY 2 of three nodes, we can make a decision by asking just a single node, because we know that two nodes will have the latest commits, the latest commits that were reported to the client, and it will be enough to just ask a single node. Although it will ask all nodes from synchronous_standby_names, in case one of them, let's say, failed together with the primary, it is still enough to make a decision by asking the remaining one.
Nice.
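For reference, the quorum commit behaviour described here is switched on through Patroni's dynamic configuration; a rough sketch, with invented member names, might be:

```yaml
# Dynamic configuration sketch (patronictl edit-config)
synchronous_mode: quorum      # quorum-based synchronous replication, new in Patroni 4
synchronous_node_count: 1     # how many standbys must confirm each commit
# With members node1 (primary), node2 and node3, Patroni would then maintain
# something like: synchronous_standby_names = 'ANY 1 (node2, node3)'
```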
And the tricky part comes when we need to change synchronous_standby_names and the value that we store in etcd. Let's say we want to increase the number of synchronous nodes, like from 1 to 2. What should we change first, the synchronous_standby_names GUC or the value in etcd, so that we can still correctly make a decision? If we change the value in etcd first, a node will assume: okay, we need to ask just a single node to make a decision, although there is just one node that has the latest commits, 100%, and in fact we need to ask two. Therefore, when we're increasing this from one to two, first we need to update synchronous_standby_names, and only after that the value in etcd. And there are almost dozens of rules that one needs to follow to do such changes in the correct order. Because it's not only about changing the replication factor, it's also about adding new nodes to synchronous_standby_names, or removing nodes that have gone away, and so on. And I don't think any other solution implements an algorithm for such transitions.
How much time did you spend on this?
Yeah. So, originally this feature was implemented by Ants Aasma. He's working for CyberTec.
It happened in 2018.
I did a few attempts to understand the logic of this algorithm. And finally, almost five years after, I was able to get enough time to fully focus on the problem.
And even after that, I spent a couple of months implementing and fixing some bugs and corner cases and implementing all possible unit tests to cover all such transitions.
There is no book which describes this, that you could follow. This is something really new that needed to be invented, right?
Well, the idea of how to do it, or what to do, was obvious. But implementing it correctly, and proving that it really works correctly, is what takes the effort.
Thank you for explaining all of this. There is one more feature I would like to mention. It came in Patroni 3.0: the DCS failsafe mode. So, DCS is the distributed configuration store.
And actually we just experienced a couple of outages, because we are in Google Cloud, on their Kubernetes, running the Zalando operator, with Patroni of course.
And I just checked the version of Patroni, and it seems to have it.
But I don't think it is enabled.
Exactly. This is my second question, actually: why is it not enabled? So, first question: what is it? How do you solve this problem when etcd or Consul is temporarily out?
Let's start from the problem statement. The promise of Patroni is that it will run as a primary only while it can write to the distributed configuration store, like etcd. If it cannot write to etcd, it means that maybe something is wrong with etcd, or maybe this node is isolated and therefore writes are failing. And when the node is isolated, it's apparently working by design: Patroni cannot write to etcd, so it will restart Postgres in read-only mode. But in case etcd is totally down, because of, I don't know, some human mistake, you cannot access any single node of etcd. And in this case, Patroni also stops the primary and starts it in read-only, to protect from the case where, let's say, some standby nodes can access the DCS at the same time and promote one of the nodes. So people were really annoyed by this problem and were asking why we are demoting the primary, and so far the answer was always: alright, we cannot determine the state and therefore we demote, to be on the safe side. The idea of how to improve on that came at one of the Postgres conferences, after talking with other Patroni users.
How is it improved using failsafe mode? When the primary detects that none of the etcd nodes are accessible, it will try to access all Patroni nodes in the cluster using the Patroni REST API. And in case the Patroni primary can get a response from all nodes in the Patroni cluster in failsafe mode, it will continue to run as a primary. In this case, it's a much stronger requirement than quorum or consensus: it is not expecting to get responses from, let's say, a majority; it really wants to get responses from all standby nodes to continue to run as a primary. Yeah, so this feature was introduced in Patroni version 3, but it is not enabled by default, because I think there are some side effects when you enable this mode in certain environments. Probably it is related to environments where your nodes can respond with different names. You need to think about it. This behavior is documented.
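For reference, enabling it is a single switch in the dynamic configuration, roughly:

```yaml
# Dynamic configuration sketch (patronictl edit-config)
failsafe_mode: true   # off by default; if the whole DCS is unreachable, the primary keeps
                      # running only while it can reach every other member over the REST API
```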
Yes, we will look into it, thank you very much for that.
On Kubernetes it is safe to enable it.
Yeah, we should start using it; that's what I think as well. We'll definitely explore it. Things like pods always having the same name, just different IP addresses. I just got help with this, and, as usual, I just wanted to publicly thank you for all the help you do for me, and actually for many companies, over many years. It's huge, thank you so much.
Yeah, I'm happy to help.
So another thing I wanted to discuss is probably replication slots. I remember a few years ago you implemented support for failover of logical slots. Now we have it in Postgres, right? So finally one thing, I think, is gone from Patroni, or do you still keep this functionality?
We still keep it, and we didn't do anything special for Postgres 17.
It was... I think it was in 16 even, no? Failover of... Or 17.
Well, the ability to use a logical slot on physical standbys was in 16, but failover in 17. We just discussed it.
Yes, exactly, exactly.
Exactly, I confused them. That's why I'm saying we didn't do anything special, although I did a few tricks to make this feature work with Patroni, because it requires having your database name in primary_conninfo, and Patroni didn't put the database name into primary_conninfo, because for physical replication it's not useful. But that's it.
I'm thinking of the case where we have a primary and standby nodes, and one of them is used to logically replicate something to some other Postgres, Snowflake or something. Or Kafka or something. If so, then yes, from a standby. Because it's good: we don't put risks on the primary, the WAL sender doesn't consume CPU there, and so on, and there are no risks for the disk. So now we have such a standby, and it suddenly disappeared. It's now Patroni's job to handle it, right? Because we need some mechanism for the failover of such a standby.
You mean, to keep the logical replication slot on a new standby, to which you would like to connect. In theory, Patroni could probably provide this, since it's possible to do logical replication from standby nodes since Postgres 16. How it's implemented currently in Patroni, for logical failover slots: it creates logical slots on standby nodes and uses pg_replication_slot_advance to move the slot to the same LSN as it is currently at on the primary. So basically, the assumption is that logical replication happens on the primary. In theory, there is no reason why it cannot be done for standby nodes. Let's say we create logical slots on all standby nodes with the same name, and Patroni can track which one is active and store that in etcd. In theory this could work, but I don't know when I would have time to do it.
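For context, the failover logical slots being described are declared as permanent slots in Patroni's dynamic configuration; a sketch, with the slot and database names invented:

```yaml
# Dynamic configuration sketch: a permanent logical slot that Patroni creates on the
# primary and keeps advancing on standbys with pg_replication_slot_advance()
slots:
  cdc_outbound:
    type: logical
    database: appdb
    plugin: pgoutput
```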
I'm just trying to understand. It's quite a new feature, since 16, to logically replicate from physical standbys, but...
But be aware that it still affects the primary.
Right, right. So maybe pg_wal will not blow up, but the pg_catalog certainly will.
Yeah, that's for sure.
I was referring to the need to preserve WAL files on the primary. That risk is gone if we do this. But I cannot imagine how we can start using logical slots on physical standbys in serious projects without HA ideas, because right now I don't understand how we solve HA for this.
Yeah, and unfortunately this hack that Patroni implements with pg_replication_slot_advance has its downsides. It literally takes as much time to move the position of the logical slot as it takes to consume from this slot.
That's unfortunate.
And how it's solved in Postgres 17: it basically does not need to parse the whole WAL and decode it. It just literally writes some numbers into the replication slot, because it knows the exact positions and does it safely. Patroni cannot do this. Although, perhaps, pg_failover_slots can do the same for older versions.
Okay, one more area for me to explore deeper, because I like understanding many places here. Good pieces of advice as well, thank you so much. Anything else, Michael, you wanted to discuss?
Obviously, one of the biggest features was Citus support, right?
But I'm not using Citus actively, so I don't know; if you want to discuss this, let's discuss.
I know that some people certainly do, because from time to time I get questions about Citus with Patroni on Slack, or maybe not Citus-specific questions, but according to the output of patronictl list, they are running a Citus cluster. So there is certainly demand, and I believe that with Patroni implementing Citus support, it improved the quality of life of some organizations and people that want to run sharded setups.
Is there anything specific you needed to solve
to support this or like technical details?
To support Citus?
So Citus, I wouldn't say that it was very hard,
but it wasn't very easy either.
So Citus has the notion of a Citus coordinator, where originally you were supposed to use the coordinator for everything: to do DDL, to run transactional workloads and so on. And on the coordinator there is a metadata table where you register all worker nodes, and the worker nodes are where you keep the actual data, the shards. What I had to implement in Patroni is registering worker nodes in this metadata automatically, and in case a failover happens for the worker nodes, we need to update the metadata and put in the new IPs or host names, whatever. Basically, when you want to scale out your Citus cluster, you just start more worker nodes, and every worker node is in fact another small Patroni cluster. So, technically, in patronictl it looks like just a single cluster, but in fact it's one cluster for the coordinator, one cluster for every worker node, and on each of them there is its own failover happening. If you start worker nodes in a different group, like in a new one, they join the existing Citus cluster, and Patroni on the coordinator registers the new worker nodes. But what Patroni will not do: it will not redistribute existing data to the new workers. This is something that you will have to do manually afterwards, and it has to be your own decision how to scale your data and replicate to other nodes. Although nowadays it's possible to do it without downtime, because all enterprise features of Citus are included in Citus version 10. So everything that was in the enterprise edition is now open source.
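For context, telling Patroni that a member belongs to a Citus cluster is a small per-node setting; a sketch with invented names:

```yaml
# Per-node configuration sketch for Patroni's Citus integration
scope: demo-citus
citus:
  group: 0          # 0 = coordinator; worker groups use 1, 2, ...
  database: citus   # Patroni creates this database and the Citus extension in it
```

Workers use the same block with a different group number, and each group is its own small Patroni cluster, as described above.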
That's cool.
I saw Alexander has a good demo of this, of Citus and Patroni working together, including rebalancing. I think it was Citus Con last year?
Yeah, it was CitusCon.
Nice. I'll include that video in the show notes.
I wish I had all this a few years ago. Yeah, of course.
There was a little bit more work under the hood.
In case you do a write workload via the coordinator, it's possible for Patroni to do some tricks to avoid client connection termination while a switchover of worker nodes is happening.
This is what I did during the demo.
There are certain tricks, but unfortunately,
it works only on coordinator and only for write workloads.
For read-only workloads, your connection will be broken.
That's unfortunate.
Maybe one day it will be fixed.
So in Citus, maybe one day the same stuff will also work on worker nodes. And by the way, in Citus you can run transactional workloads by connecting to every worker node. Only DDL must happen via the coordinator.
Nice. Speaking of improvements in the future,
do you have anything lined up that you still want to improve in Patroni?
Hmm. That's a very good question.
Usually some nice improvements come out of nothing.
Like you don't plan anything, but you talk to people and they say,
it would be nice to have this improvement or this feature.
And you start thinking about it.
Wow, yeah, it's a very nice idea and it's great to have it.
But I rarely plan some big features from the ground up, let's say.
So what I had in my mind, for example, is failover to a standby cluster in Patroni.
Right now it's possible to run a standby cluster, which is not aware of the source it replicates from; it could be replicating from another Patroni cluster. And what people ask for is: we have a primary Patroni cluster, we have standby Patroni clusters, but there is no mechanism to automatically promote the standby cluster, because it's running in a different region and it is using a completely different etcd. So they simply don't know about each other.
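For reference, today's standby-cluster support looks roughly like this in the dynamic configuration (host name invented); promoting it currently means removing this section by hand:

```yaml
# Dynamic configuration sketch for a standby Patroni cluster replicating
# from another cluster in a different region
standby_cluster:
  host: primary-region.example.com   # where to replicate from (invented name)
  port: 5432
  create_replica_methods:
    - basebackup
```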
It would be nice to have, but again I cannot promise when I can start working on it and
whether it will happen.
I know that people from CyberTec did some experiments and have some proof-of-concept solutions that seem to work, but for some reason they're also not happy with the solution they implemented.
Yeah, sounds tricky.
Distributed systems are always tricky.
Yeah, get that on a t-shirt.
Thank you for coming. As usual, I use the podcast, and all the events I participate in and organize, and so on, just for my personal education and for daily work as well. I just thank you so much for the help, again.
Yes, thank you for inviting me. Yeah, it's a nice job that you are doing. I know that many people listen to your podcast and are very happy that they're learning a lot of great stuff, and also making a big list of to-do items. I cannot say the same about myself, that I watch every single episode, but sometimes I do.
Cool.
Thank you.
Thanks so much, Alexander.
Cheers, Nikolai.
Thank you.
Bye-bye.
Bye.
Bye.