Postgres FM - High availability
Episode Date: July 28, 2023

Nikolay and Michael discuss HA (high availability) — what it means, tools and techniques for maximising it, while going through some of the more common causes of downtime.

Here are some links to some things they mentioned:
https://en.wikipedia.org/wiki/High_availability
https://postgres.fm/episodes/upgrades
https://github.com/shayonj/pg_easy_replicate/
pg_easy_replicate discussion on Hacker News https://news.ycombinator.com/item?id=36405761
https://postgres.fm/episodes/connection-poolers
https://www.postgresql.org/docs/current/libpq.html
Support load balancing in libpq (new feature in Postgres 16) https://commitfest.postgresql.org/42/3679/
target_session_attrs options for high availability and scaling (2021, a post by Laurenz Albe) https://www.cybertec-postgresql.com/en/new-target_session_attrs-settings-for-high-availability-and-scaling-in-postgresql-v14/
Postgres 10 highlight - read-write and read-only mode of libpq (2016, a post by Michael Paquier) https://paquier.xyz/postgresql-2/postgres-10-libpq-read-write/
Postgres 10 highlight - Quorum set of synchronous standbys (2017, a post by Michael Paquier) https://paquier.xyz/postgresql-2/postgres-10-quorum-sync/
https://github.com/zalando/patroni
https://postgres.fm/episodes/replication
https://blog.rustprooflabs.com/2021/06/postgres-bigint-by-default
Zero-downtime Postgres schema migrations need this: lock_timeout and retries (2021) https://postgres.ai/blog/20210923-zero-downtime-postgres-schema-migrations-lock-timeout-and-retries
A fix in Patroni to mitigate a very long shutdown attempt when archive_command has a lot of WALs to archive https://github.com/zalando/patroni/pull/2067

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello, hello, this is Postgres FM episode number 56. I'm Nikolay, this is my co-host Michael. Hi, Michael.
Hello, Nikolay.
So we are going to talk today about HA, very funny topic. If you write it three times, you will have ha-ha-ha, right?
Yeah, I never thought of it like that.
Yeah, HA, very funny topic, actually. Many people fight for some achievements in this area,
but it's high availability,
opposite to low availability,
which is like downtime and incidents and so on.
Yeah, or number of nines.
That's how it was popular to describe
how many nines you were a few years ago, wasn't it?
Right. How many nines do you think is good?
Well, I mean, I remember when loads of
the major providers were claiming five nines. And then when you did the math and realized like
quite how little downtime that is, it's like five minutes per year. Yeah, nearly impossible to
guarantee that. It's called downtime budget. So yeah, what do you think? I mean, I think four is
still pretty good. But you, what would you say?
Oh yes, excellent. I think if during the whole year you allowed only five minutes of downtime, I think it's
quite good. Of course, six nines means like half a minute of downtime per year, which I think is absolutely
excellent. But I would take three as well. Like, three nines, it's more than four hours.
Well, for many, it's okay, not for all.
Actually, I skipped one.
Four nines, it's 52 minutes per year.
And it's okay.
Well, it's already quite a good goal, four nines.
Yeah.
Right.
But of course, five nines is, I think,
maybe it's like A mark and six nines is A plus mark, if you can achieve it.
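For reference, a quick sketch of the arithmetic behind these grades, i.e. how much downtime per year each number of nines allows:

```python
# Allowed downtime per year for a given number of nines.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for nines in (3, 4, 5, 6):
    unavailability = 10 ** -nines               # e.g. 4 nines -> 0.0001
    budget_minutes = MINUTES_PER_YEAR * unavailability
    print(f"{nines} nines: ~{budget_minutes:.1f} minutes of downtime per year")
# 3 nines: ~525.6 (about 8.8 hours); 4 nines: ~52.6; 5 nines: ~5.3; 6 nines: ~0.5
```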
And I mentioned budget because some people have management for HA, for downtime, a so-called budget.
And, for example, if you plan some work which has planned downtime, or you have high risks of some downtime,
if you, for example, upgrade something, the OS, or do a major Postgres upgrade
or something, and you expect some things can go wrong. So in this case, if you already had downtime
minutes or seconds which contributed to lowering your HA characteristics this month, you probably
want to postpone your work till next month, right? This is a regular approach to manage this HA budget, downtime budget. And presumably because some services do credits, don't
they, for their customers, if they don't achieve certain goals? So that might avoid them paying
out lots in terms of penalties. I had a question for you as well, though. We normally talk
about extremely high throughput environments,
but if you've got quite a low, like, let's say you might even provide
an internal application for a team or something,
have you ever heard that phrase or the kind of philosophical question,
if a tree falls in a forest and there's no one around to hear it,
does it make a sound?
If a service is unavailable but it receives no requests in that time, is it down?
Well, I usually consider a service down when it's observed as down by an external service. It should be observed
externally. So if from outside, from various points, for example, from various regions in the
world, service cannot be reached. And some simple checks like front page is not responding at all or returning 5xx errors,
meaning some problems on server side, or some simple API calls like health checks are not
responding from outside.
This is downtime.
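A minimal sketch of such an external probe, assuming a hypothetical https://example.com/health endpoint; run something like this from several regions and count timeouts and 5xx responses as downtime:

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service answered with a non-5xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as e:
        return e.code < 500   # 4xx means reachable, just unhappy
    except OSError:
        return False          # unreachable or timed out, counts as down


if __name__ == "__main__":
    print("up" if probe("https://example.com/health") else "down")
```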
Internally, probably we can have everything working and just some
network incident happening. Like, some small issue with network
configuration can lead to huge downtime, a huge outage for many customers. For example,
sometimes we see AWS or GCP has issues with some reconfiguration of network devices or
routing and so on. And this leads to huge incidents. It's always DNS, right?
Not always, actually, but very often, yes. That was the joke. Sorry. Just for listeners,
I've had awful internet today. So this is the first day that we can't see each other.
So Nikolai couldn't see that I was grinning, which is my normal giveaway that I'm telling a joke.
Episode number 56, and I didn't learn how to recognize your smiling just from audio.
Yeah.
Okay.
So HA is about uptime.
Uptime is opposite of downtime.
And what should we discuss in this area?
Where do you want to start?
Well, I think there might be a lot of people out there like me
that conflate HA and high availability with just switchover and failover tools
and processes and runbooks, and you make a really
good point that it's not about those. Those are a couple of techniques we have to achieve higher
availability, but it's really about avoiding downtime. And when you made that point to me
beforehand, while we were preparing, it made sense to then go and list all the ways we can cause
ourselves downtime, or, let's say, the main causes of Postgres-specific downtime.
And then maybe once we've gone through those, or maybe as we go through those,
discuss some of the things that we can then do about it. Does that sound sensible?
Yeah, it does. And you're also right. It depends on your point of view.
If, for example, you are responsible only for infrastructure.
And remember, we can distinguish infrastructure DBA and application DBA.
And if you are purely infrastructure DBA, at some point, you probably should consider everything engineers, developers do with application changes,
for example, schema changes and so on.
Probably you should consider it a black box
and care only about failures related to infrastructure.
For example, as you mentioned, autofailover,
meaning that some node is down due to a number of reasons, or PgBouncer or any pooler also being a single point of failure,
SPOF, right?
Or some networking issues, something like this.
So purely infrastructure, and you consider everything application does
as a black box. But if you are an application DBA, or if you care about both parts,
actually, the boundary here is not so strict, right? So I prefer to consider everything.
And if you're a database expert, or at least part of your work is about databases, related to databases.
In this case, I prefer taking both infrastructure things like autofailover
and purely application things.
For example, we create primary key integer 4.
We have 2 billion rows.
Soon we have a huge outage. Or we forgot to implement lock_timeout and retries logic.
And we release something, some small schema change, while autovacuum is working in transaction ID wraparound prevention mode, and we are going to have a queue of queries which are waiting for us while we are
waiting for this autovacuum. So we have at least a partial outage as well, sometimes global. I mean,
sometimes everything is going down because of this simple mistake, which is easy to make because it's the default behavior. If you don't do anything, at some point you will have it.
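A minimal sketch of the lock_timeout-and-retries idea mentioned here, using psycopg2; the table, column, and connection string are hypothetical:

```python
import time

import psycopg2

# A short, metadata-only change is the best candidate for this pattern.
DDL = "ALTER TABLE orders ADD COLUMN note text"


def run_ddl_with_retries(dsn: str, attempts: int = 10) -> None:
    conn = psycopg2.connect(dsn)
    try:
        for attempt in range(1, attempts + 1):
            try:
                with conn, conn.cursor() as cur:   # commits on success
                    # Fail fast instead of queueing behind autovacuum or a
                    # long transaction and blocking every query behind us.
                    cur.execute("SET LOCAL lock_timeout = '2s'")
                    cur.execute(DDL)
                return
            except psycopg2.errors.LockNotAvailable:
                time.sleep(min(2 ** attempt, 30))  # back off, then retry
        raise RuntimeError("lock never acquired; retry in a quieter period")
    finally:
        conn.close()


# run_ddl_with_retries("host=db.example.com dbname=shop user=app")
```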
So I prefer to care about both worlds
and include application stuff
in the set of measures to prevent downtime.
So for example, for me, primary key integer four, when we know we will
have a lot of rows inserted, it's also definitely related to HA. Yeah, good point. I mean, you've
mentioned the DBA roles, but really, it's development as well, isn't it? It's everybody
who has responsibility for any of the design is involved in this as well.
And even things that we can't avoid, right? We've had episodes already about major version
upgrades, and discussed quite how difficult it is to do those, potentially even impossible to do those
without any downtime, like with absolutely zero downtime.
It's possible, so yeah. Oh yeah, we discussed that it's possible. I provided a recipe. By the way,
we mentioned this new tool to orchestrate major upgrades, the tool called pg_easy_replicate,
which supports major upgrades, written in Ruby as I remember. And
we mentioned that I haven't tried it, but I
criticized it a little bit. But this tool got attention on Hacker News. And yesterday, the author
contacted me directly on Twitter. So we discussed improvements. And obviously, it's possible to
achieve practically zero downtime, I mean, like a one- or two-second latency spike, if you have
not huge data volumes and not big TPS numbers. For example, if you have less than 10
terabytes and less than 10,000 TPS on the primary, it's definitely possible to achieve
basically a zero-downtime upgrade, where you will have just a spike of latency. But this needs a component besides
Postgres: you need PgBouncer or any pooler (we had an episode about poolers) which supports pause-resume,
with the ability to switch the primary to a different machine.
Yeah, right, or presumably some of the
load balancing tools and operators support that kind of thing as well. Pause-resume is not very popular, I think.
But I might be mistaken.
It's an interesting topic.
I think it's underdeveloped.
It received not a lot of attention so far, this topic.
Because, you know, we have bigger problems.
We have bigger points when we can have downtime.
For example, even minor upgrade can lead to downtime
for up to one minute. And there are many risks where you can put everything down to many minutes.
Yeah, I know we don't have time to discuss each one in depth, but should we run through a few
more?
You know, let's discuss autofailover first because, of course, it's a very important component
to achieve HA.
And I remember discussions 10 years ago or so, discussions that it should live outside Postgres.
This should not be inside Postgres itself. So here we are.
We have Patroni and alternatives. And inside Postgres we almost don't have anything.
Actually, we have something. And in recent Postgres versions, we have improvements in libpq. This is a library which is used in many, many, many drivers,
which are used by application code to connect to Postgres. And already for many years, you could
use multiple hosts in connection. So you can say this host or that host or that host.
And automatically, libpq,
so, like, I mean at the driver level,
would try the first one and, if it's down,
immediately try the second one.
And this is already convenient
to achieve lower downtime, right?
Yeah, or higher availability.
Or bigger uptime, or how to say it.
Yeah.
Better uptime.
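To make the multi-host idea concrete, a minimal sketch using psycopg2 (which uses libpq underneath); the host names are hypothetical, and libpq simply tries them in order until one accepts the connection:

```python
import psycopg2

conn = psycopg2.connect(
    "host=pg1.example.com,pg2.example.com,pg3.example.com "
    "dbname=app user=app connect_timeout=3"
)
with conn.cursor() as cur:
    cur.execute("SELECT inet_server_addr(), pg_is_in_recovery()")
    addr, in_recovery = cur.fetchone()
    print(f"connected to {addr}, standby: {in_recovery}")
conn.close()
```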
And in Postgres 15 or 16, I don't remember particularly,
additional improvements happened there.
I don't remember details, but I remember previously it was quite,
you know, like the order is hard-coded.
Ah, load balancing went there.
So load balancing is supported at protocol level right now
in very new Postgres versions.
So maybe in 16, actually, which is not released yet;
it's in beta 2, as we speak.
But this is good addition.
And so, meaning that some things go to Postgres core,
basically, because libpq is part of Postgres.
But also there is target_session_attrs.
I don't remember exactly, but you can say, now I need read-write connection.
And now I need only read-only connection.
And establishing connection, libpq knows what you need.
And it can automatically switch to the primary.
It knows you will need writes.
And this is also some interesting thing.
You can build interesting stuff on top of it.
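A sketch of what that looks like in a connection string, again with hypothetical host names; target_session_attrs=read-write makes libpq skip standbys, and load_balance_hosts is the Postgres 16 addition mentioned above:

```python
import psycopg2

WRITE_DSN = (
    "host=pg1.example.com,pg2.example.com,pg3.example.com "
    "dbname=app user=app target_session_attrs=read-write"
)
READ_DSN = (
    "host=pg1.example.com,pg2.example.com,pg3.example.com "
    "dbname=app user=app target_session_attrs=prefer-standby "
    "load_balance_hosts=random"  # requires libpq from Postgres 16+
)

conn = psycopg2.connect(WRITE_DSN)
with conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery()")
    print("on primary:", not cur.fetchone()[0])   # expected: True
conn.close()
```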
But of course, it's not autofailover.
Autofailover is needed.
I remember when we had a couple of servers or three servers,
and many people, many DBAs, considered autofailover a very bad thing
because who knows what it will do.
It was quite a popular opinion that you should do it yourself.
It's better to be paged, wake up at 1 a.m. and perform failover properly.
And it's okay to be down for a few minutes,
but not to allow something who-knows-who developed; who knows what it will do, maybe it will lead
to split brain or something. So it was a matter of trust. But then tooling started to improve.
As usual, I should warn: don't use replication manager (repmgr) and other tools which are not
based on consensus algorithms like Raft or Paxos.
So use Patroni or similar tool which is based on consensus algorithm.
And a lot of work was invested to make it work properly.
So with replication manager, I personally had several cases of split brain.
With Patroni, I never had it.
I had different issues.
And usually, Kukushkin, the maintainer of Patroni,
fixes them quite quickly.
So it's quite well maintained.
And I think right now it's the de facto standard.
Many systems, many Postgres operators, include it as a component.
And now nobody thinks anymore that you should use manual failover, right?
It's already nonsense. Yeah, I can't imagine somebody still believing that. But for anybody
that isn't familiar with split brain, the time I've come across that most is, well, in fact,
actually, definition wise, is that just when you can end up with one database, one node believing one thing and one node believing another in terms of data.
So we have two primaries.
Old name for primary is master.
We have two masters.
So we have two primary nodes
and some writes go to one node,
other writes go to another node
and nobody knows how to merge them back and solve it.
It's usually very hard in the general case. In some cases
it's possible. For example, if we had a mild version of split brain, when some
writes on one node happened to particular tables and on another node to different tables, in this
case it's probably quite easy to merge. But in the general case, it's an unsolvable problem.
So you should avoid it.
This is the biggest problem of autofailover, basically.
So this risk of split brain is the biggest problem.
And those tools which do not use a consensus algorithm
are basically bad in this area.
Even if they have, like, a witness node
and a lot of heuristics coded, it's not good. And I remember I
spent some time analyzing replication manager's C code, and I found places exactly where split
brain can occur, and it's like an unsolved issue. So just don't use it. Maybe I
sound not polite, but it's based on many years of experience and observations of
troubles I saw and my customers saw and experienced. So Patroni is battle-proven and
the de facto standard right now, but there are a few other alternatives which are also
based on Raft or Paxos. Well, I think a lot of people these days are using cloud provider defaults. You
know, you can just click a button that says high availability and it gives you your replicas
and autofailover as a feature, not having to worry about exactly how that's
implemented. And even some people are using, I believe it's a big feature of the Kubernetes operators,
or still not sure how to pronounce that, but the operators that are coming up left, right, and center.
So I think there are new options where you don't have to worry about the exact underlying technology, or maybe I should be more worried about that.
Yeah, yeah, yeah.
Okay, so split-brain is one risk if you use autofailover,
and we agreed we should use autofailover to achieve better HA characteristics.
Another risk is data loss.
By default, if we use asynchronous standby nodes,
if autofailover happens, we might lose some data which was written on the primary,
but none of the standbys received it.
In this case, autofailover happens,
and different node becomes primary,
and it lacks a tail of writes.
So some portion is not propagated there yet,
because, for example, of replication lag,
which all standbys had.
Because they are allowed to have lag,
because they're asynchronous.
By default, physical standbys are asynchronous.
And Patroni, for example, has a special knob in its configuration
which says lags of more than, by default, I think, 10 MB are not allowed
for autofailover to happen.
So if all asynchronous standby nodes have lag above this threshold,
autofailover won't happen.
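For illustration, a sketch of the same kind of lag signal, read from pg_stat_replication on the primary (this is not Patroni's own code, just the underlying idea); the connection string and threshold are hypothetical:

```python
import psycopg2

LAG_THRESHOLD_BYTES = 10 * 1024 * 1024  # example threshold: 10 MB

QUERY = """
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication
"""

conn = psycopg2.connect("host=primary.example.com dbname=app user=monitor")
with conn.cursor() as cur:
    cur.execute(QUERY)
    for name, lag in cur.fetchall():
        ok = lag is not None and lag <= LAG_THRESHOLD_BYTES
        print(f"{name}: {lag} bytes behind ({'ok' if ok else 'too far behind'})")
conn.close()
```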
So we choose, we need to prioritize.
We have a trade-off here.
What to choose?
Data loss or availability.
Data loss versus downtime and HA goes down.
I mean, HA characteristics, so uptime worsens.
And usually people can just, we have two solutions.
We can tune this threshold and say,
okay, even one megabyte is too much for us.
But we understand that it increases risks to be down
because autofailover doesn't happen.
The cluster exists, but in read-only mode,
because the primary has gone due to a number of reasons. Some hardware failure, for example. And then
manual intervention is needed. But by the way, a story about autofailovers and how they work
properly. So Postgres AI, it's not a big database. I mean, our website.
And it's running on Kubernetes Zalando operator,
Patroni inside.
And at some point, I noticed the timeline exceeds 100.
What do you mean?
100.
When we have a failover or switchover,
when we change primary node, timeline increments by one.
So we created a cluster with initdb, timeline is one.
We have switchover, we change primary, or failover.
So switchover is planned failover, basically.
Failover is like unplanned change of primary.
Switchover is planned. For example, we want to upgrade or something, change something, we perform switchover, and then we, for example, remove old node, old primary.
And I've noticed that our timeline exceeded 100, meaning that we had so many autofailovers and nobody noticed.
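A sketch of how monitoring could watch the timeline number and catch silent failovers like this; the connection string is hypothetical:

```python
import psycopg2

conn = psycopg2.connect("host=pg.example.com dbname=app user=monitor")
with conn.cursor() as cur:
    cur.execute("SELECT timeline_id FROM pg_control_checkpoint()")
    timeline = cur.fetchone()[0]
conn.close()

# Compare against the value your monitoring stored last time; any increase
# means a promotion (failover or switchover) happened since the last check.
print("current timeline:", timeline)
```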
Is that a good thing or a bad thing? Both. Because it was due to, in Kubernetes,
it's usually a regular problem
to have out-of-memory killer killing Postgres.
It's a big problem and people have solutions to it.
So we needed to adjust some settings,
add more memory, so we did.
But the main problem here was
our monitoring didn't tell us about these cases.
And we lived with it for some months,
maybe actually a year or maybe more.
So we had the changing node,
changing primary node all the time.
And it just happened and happened and happened
and nobody complained, which is a good thing.
But the lack of proper monitoring and alerting is a bad thing, of course.
We should have noticed earlier.
But this case demonstrates how autofailover can be helpful
not to notice some issues, right?
Meaning it can hide those issues.
So it worked well. Patroni worked well.
What didn't work well was monitoring, in this case.
And it's, of course, our fault.
We should configure it properly
and receive all alerts and so on.
Every time a failover happens,
we should receive an alert, of course,
and investigate and try to prevent it
because, of course, it's a stress.
Again, I mentioned risks of split-brain,
which I can say Patroni
almost certainly shouldn't have,
and the risk of data loss.
So this is not good.
But in these cases, data loss didn't happen because,
okay, shutdown, but we have two streams of data propagation.
First, streaming replication, though actually
I'm not sure, with OOM killer issues, is it SIGKILL?
So, meaning that the WAL sender won't send all bytes to standbys, right?
But at least we also have archiving, and archiving goes to object storage.
It was WAL-G, and then we restore.
So data loss at least was not noticed there.
But it's a bad story, actually.
I feel right now embarrassed
about it. Yeah, do as I say, not as I do, right? Yeah, don't do this. Yeah, or learn from my mistakes.
It feels like this is a good time to go on to some of the other things. We could... you mentioned
alerting already. A few other things that we've mentioned in previous episodes that you can alert
on are at certain points of running out of transaction IDs.
So making sure you've got some...
Hold on. Sorry, hold on.
Let me finish with the second solution.
The second solution is using synchronous replicas.
And this is what serious setups should have.
You have a synchronous replica, and better, you have quorum commit.
So you have a semi-synchronous approach when you say, okay, I have five replicas, five standby nodes, and when a commit happens on
the primary, it should be received or applied by at least one of the nodes. And
modern Postgres allows very fine-tuned configuration here, so you can choose
and decide. Of course, if you do it, you increase commit latency for writing transactions.
We talked about that in detail, I think, in our replication episode.
Right.
So this is what you should have, like, you should use it.
In this case, ideally, if autofailover happens, you will have zero data loss because data always exists somewhere already.
It's at least at one node.
It's guaranteed.
And in this case, this is the serious approach.
This is what you should do.
And that's it.
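A minimal sketch of quorum commit, assuming standbys whose application_name values are node2, node3, and node4 (hypothetical names); a commit on the primary then waits until at least one of them confirms it:

```python
import psycopg2

conn = psycopg2.connect("host=primary.example.com dbname=postgres user=postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute(
        "ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (node2, node3, node4)'"
    )
    cur.execute("SELECT pg_reload_conf()")  # picked up without a restart
conn.close()
```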
Also, I wanted just to finish about switchover.
Switchover by default will lead to downtime because of shutdown checkpoint.
We discussed it a few times as well.
But if you just decide to perform switchover,
Patroni needs to shut down primary.
I don't remember if Patroni issues an explicit checkpoint.
It probably should, but it's worth checking.
The recipe for a good, fast switchover is:
make sure you issue an explicit CHECKPOINT.
This helps the shutdown process
perform the shutdown checkpoint much faster,
because for a clean shutdown, Postgres
needs to write all dirty buffers from
the buffer pool to disk, and
if we issued an explicit checkpoint,
the shutdown checkpoint will
have only a little to write.
This is the recipe.
And in this case, when you perform
shutdown, during shutdown,
WAL senders make sure that everything
is sent to standbys.
So clean shutdown, standby nodes received all data from the primary,
so no lags.
And also archive command, everything is backed up.
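A sketch of that recipe: issue an explicit CHECKPOINT right before triggering the switchover, so the shutdown checkpoint has little left to flush; the connection string is hypothetical, and the switchover itself is still done by your HA tooling:

```python
import psycopg2

conn = psycopg2.connect("host=primary.example.com dbname=postgres user=postgres")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CHECKPOINT")  # flush dirty buffers ahead of the shutdown
conn.close()
# ...then trigger the switchover (for example via Patroni), so the shutdown
# checkpoint that follows has very little work left to do.
```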
And this is a big problem I noticed in Patroni, actually.
We had an incident when we also, due to poor monitoring,
monitoring is super important for HA, of course, but due to poor monitoring is super important for hr of course but due to
poor monitoring we didn't notice that archive command was lagging a lot meaning that a lot of
walls were not backed up so primary accumulated many like say thousand walls being not backed up
but when you try to perform clean shutdown and patroni tried to do it right
it tries to wait on it postgres this is a postgres behavior postgres waits
while archive command is working and it's working very long because and maybe there is some reason
that cannot work properly so it still waits and waits and patroni waits and so on. So now I know it was fixed in Patroni and now it waits not long and performs failover.
Also trade-off here, right?
Because we want to back up everything.
But Patroni's choice is to perform failover in this case
and let the backup process continue working.
This is super interesting topic.
Again, worth checking details,
but many people fail to understand,
and I also was failing to understand,
that when we shut down Postgres,
it waits while archive command is working.
And this can lead to downtime, actually.
Unarchived WALs.
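A sketch of an archiving health check that would catch this kind of lag, using pg_stat_archiver; the connection string is hypothetical:

```python
import psycopg2

QUERY = """
SELECT archived_count,
       failed_count,
       last_archived_wal,
       last_failed_wal,
       now() - last_archived_time AS since_last_archive
FROM pg_stat_archiver
"""

conn = psycopg2.connect("host=primary.example.com dbname=app user=monitor")
with conn.cursor() as cur:
    cur.execute(QUERY)
    archived, failed, last_ok, last_fail, since = cur.fetchone()
    print(f"archived={archived} failed={failed} last_ok={last_ok} "
          f"last_failed={last_fail} since_last_archive={since}")
conn.close()
# Alert if failed_count keeps growing, or if since_last_archive is large while
# the server is still producing WAL.
```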
I feel like we could do a whole episode solely on Patroni.
That might be a good one in the future.
Yeah, actually, good idea.
I can refresh memory on some interesting cases and so on.
And yeah, we can do it.
I think it's a good idea.
Patroni is an interesting tool developed at good pace
and receiving a lot of good features.
But I must say it's mostly on the shoulders
of a single maintainer,
Alexander Kukushkin, and definitely it's Python. So I think more contributors could help this
project, because more people work with Python than C, right? And for some people, it's easier to
read and write Python code. So I advocate helping Alexander with better
bug descriptions, testing, and sometimes contributions in the form of pull requests to
the GitHub repository of Patroni. It's worth doing.
Nice.
Okay, I think we covered both failover and switchover and how they can contribute to downtime
and affect uptime or HA characteristics.
Let's switch to something else. Well, I don't think we've actually covered it, but I've written
a list in preparation of all the things that I've seen cause downtime in the past. I guess we're
running a bit short of time and we could run through a few of those. But in terms of solutions,
it's not just those, right? It's not just failover, and you made this
point early on. We've also got those alerts that we can set up, we've got logging we can do, to help
not only to avoid this in the first place, but also to help us learn, when it does
happen, exactly what happened, so we can prevent it from happening again. Some of it's just learning,
right? Like, it's good to be notified or alerted when you're running out of integers in an int4 column.
But it's even better to not have int4 columns in the first place.
Like, I read a really good post by Ryan from Rust Proof Labs.
Yeah, just saying why not default to int8?
It's, you know, only a few extra bytes, and you avoid so much pain in the future if you ever scale. So yeah, that makes a lot
of sense to me. So some of it's learning, I think some of it's helping educate your team, and some
of it I've seen you and others do, or suggest people do: your health checks and your capacity
planning, things like that, where you do look ahead a bit. And so a lot of these issues are
scaling issues, right? It's not too surprising that you can cause yourself downtime through scaling. We've all
read the stories, normally on Hacker News, or a lot of good blog posts over the years, of people
causing themselves downtime via lots of traffic or just general scaling issues or just mistakes.
Yeah, absolutely. Which are not mistakes if people don't know about
it. That's why it's good to learn and read about other people's failures. Yeah, exactly. Yeah. So
let me emphasize one thing here. I'm naturally working in the area of prevention of issues.
And I tend to pay too little attention to monitoring and reactive components.
They are very important. So every time something occurs, it should lead to an alert. I think it's
even more important than prevention; it's the first step of building a good HA system. First, you need
to cover yourself with good alerts. And they can be different. For example, sometimes it's an alert
about failover,
about an out-of-disk-space incident,
about some locking,
deadlocks and so on, many things.
But then you cover yourself with alerts.
And I must say,
monitoring systems currently,
Postgres monitoring systems,
none of them have good alerts for Postgres.
None of them.
Like, many have
parts of it. We can also play this game: give me some monitoring system and I'll name you 10
alerts it's lacking. This is the problem we have. We have no mature monitoring systems designed yet
or provided on the market. It's a big, big underdeveloped area. Many, many systems, they are all good in their sense, but no system to which
I would give even a B plus. I'm not speaking about an A mark, right? So no system would get
even a B plus from me. But, so, we need to cover, with alerts, things like: an incident happened, some problem happened, all these things.
Then we should try to think about prevention already and cover with some alerts about thresholds,
like 80% of something.
For example, you mentioned int4 overflow.
We know when it happens.
It's 2.1 billion.
If some sequence has 2.1 billion value, okay, we should define some alert.
In this case, I would define it maybe at 40, 50% even, because it takes a lot of time to develop proper solution and so on.
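A sketch of such an alert, comparing each sequence's current value against the int4 limit (a deliberately rough check; it also flags sequences that feed int8 columns); the connection string and threshold are hypothetical:

```python
import psycopg2

INT4_MAX = 2_147_483_647
ALERT_AT = 0.50  # alert early; migrating to int8 takes time

QUERY = "SELECT schemaname, sequencename, COALESCE(last_value, 0) FROM pg_sequences"

conn = psycopg2.connect("host=pg.example.com dbname=app user=monitor")
with conn.cursor() as cur:
    cur.execute(QUERY)
    for schema, seq, last_value in cur.fetchall():
        used = last_value / INT4_MAX
        if used >= ALERT_AT:
            print(f"{schema}.{seq}: {used:.0%} of int4 capacity used")
conn.close()
```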
Or also transaction ID wraparound prevention.
By the way, for these two cases, failover won't help, because everything you have on the primary, yeah,
the standbys also have, right? So you need to work at the application level, with the schema, adjusting
something, or, in the case of transaction ID wraparound, at, like, the DBA level: you need to fix
vacuuming, you need to make sure autovacuum or your explicit vacuum did its job properly.
Also, monitoring long transactions, a lot of things.
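And a sketch of the corresponding wraparound check, looking at how far each database's datfrozenxid has aged; the connection string is hypothetical, and 2 billion is used as a rough ceiling:

```python
import psycopg2

QUERY = """
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2000000000, 1) AS pct_towards_wraparound
FROM pg_database
ORDER BY 2 DESC
"""

conn = psycopg2.connect("host=pg.example.com dbname=app user=monitor")
with conn.cursor() as cur:
    cur.execute(QUERY)
    for datname, xid_age, pct in cur.fetchall():
        print(f"{datname}: age={xid_age} ({pct}% of the way to wraparound)")
conn.close()
```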
And a lot of alerts can happen. Then we have like 80%, 50% thresholds for alerting for dangerous behavior.
And also we need some forecasting component
because it's not enough usually to say,
okay, we reached 50% of int4 capacity.
Usually people immediately ask next question, when will doomsday happen? So they need some forecast
and you can apply some machine learning techniques here or use some
libraries. Or just basic statistics
and some rough estimates. It doesn't have to be
too fancy.
Yeah, yeah. So, yeah, in this case, this is already about prevention. And yeah, health checks can help, and
this, like, proactive, not reactive but proactive, management of your Postgres can help
find problems.
Yeah. Do you want to mention some other issues we wanted to
mention, or should we do a quick-fire round and you tell me how common they've been in your experience?
Let's do it. Hardware failure?
Well, hardware failure actually improves over time. I remember, like, 15
years ago it was a nightmare. Right now, even in a cloud environment, where hardware failure should be
considered quite a common thing, I feel it has improved. I mean, cloud providers and many providers
improved the characteristics of hardware, and it's becoming less common, actually. So, of course,
if you manage, like, many thousands of servers, you will have multiple incidents per week, definitely.
Maybe almost every day you will have incidents. It's normal.
But if you manage only a dozen of servers, it's quite stable already.
If you have an incident per week with hardware managing just a few dozen virtual machines or physical machines,
it's time to think about moving somewhere
changing provider.
Yeah. How about operating system failures?
Operating system failures also improve. Unless you went with some fancy kernel settings, it's not happening often. Usually here
we have failures of a different kind, not like everything was fine and then suddenly something.
Usually, well, we need tuning, of course, some kind of tuning,
and we need to take care of behavior of page cache in Linux and so on,
especially if you have a lot of writes.
But usually when we upgrade the operating system,
this is where we might have problems, you know, related to,
first of all, related to glibc
version and collation
changes. Not normally downtime,
but corruption, right?
It's not downtime, I agree.
However,
Postgres 15 started to complain
about glibc
version changes. And I don't
remember, is it a warning or an error?
I think it's a log warning.
I'm not sure.
It's good if it's just a warning.
I don't remember.
Yeah, actually, I'm not sure at all.
I've got a couple more, three more, I think.
Pooler or load balancer issues?
Pooler we discussed recently.
If you grow, grow, grow, if you have a single...
First of all, the pooler, for example PgBouncer, can be
a single point of failure, and if this virtual machine is down, it's not good. You need to have
multiple ones and balance between them. And the second big problem, it happens all the time with many
people still: PgBouncer is a single-threaded process, so it utilizes only one core. You can have dozens of cores on this machine,
but it will take only one core.
So you need to run multiple PgBouncers.
Otherwise you will have saturation,
and it will cause downtime, definitely,
somewhere between, like, 10 or 20,000 TPS,
roughly, depending on the particular CPU you are using.
This experience is based mostly on Intel processors.
If you're, for example, on an ARM processor,
in this case, it can happen earlier.
So, I mean, 10,000 TPS,
it's already not a huge load today.
So a single PgBouncer,
it's a big bomb that can trigger at some point, and this is not good.
Second to last one, running out of disk space?
Very common of course. If you don't have monitoring and your project is growing, it's very common.
It's a good thing that one of the reasons was improved. I mean, previously for replication slots,
basically mainly logical replication slots,
if you created it, but no logical replica is using it,
it leads to accumulation of WALs on the primary
and eventually you are out of disk space.
But now you can specify a threshold after which
Postgres will say enough waiting.
I give up this logical slot in favor of
not being down because of disk space. It can be improved in Postgres, I think. People complain
about the behavior: when Postgres runs out of disk space, you need to perform some dance to fix it,
and it can be improved. I remember recent discussions. I don't remember the details, but
I was personally in this position many times.
Usually I escape rather quickly, but it's very annoying when you're out of disk space
and you need to escape. So yeah.
Yeah, I thought you were going to say this is less common now with
bigger disks, and cloud providers, some of them let you kind of scale this, don't they?
Yeah, most of them support zero-downtime scaling, I mean, scaling up anyway. Scaling up, yeah, not down. Yeah, yeah, good
point. And if you have 90... I remember some issues on Azure, but I don't like Azure anyway,
it's not my natural choice. Like, they didn't support online change until recently, and recently
also with some limitations. But GCP and Azure and others support online change. So you can
just say, I want like a terabyte more. And then you, depending on your file system, you
need to issue a few commands in terminal and then Postgres already is fine. Right? So it's
easy. For what it's worth, I'm warming to Azure. I think there's some cool Postgres already is fine, right? So it's easy.
For what it's worth, I'm warming to Azure.
I think there's some cool Postgres stuff going on there,
but I haven't tried it myself.
Well, that's a different level,
what I call the second layer of the cloud world, right?
Managed Postgres services,
of course they should have it fully automated.
If your managed cloud provider doesn't offer scaling up
for disk space, it's not good at all.
All of them should have this feature already.
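Coming back to replication slots as a disk-space risk: a sketch of the threshold mentioned a few minutes ago (max_slot_wal_keep_size, Postgres 13+), together with a check of slot status; the connection string and size are hypothetical:

```python
import psycopg2

conn = psycopg2.connect("host=primary.example.com dbname=postgres user=postgres")
conn.autocommit = True
with conn.cursor() as cur:
    # Cap how much WAL an abandoned slot may retain before it is invalidated.
    cur.execute("ALTER SYSTEM SET max_slot_wal_keep_size = '50GB'")
    cur.execute("SELECT pg_reload_conf()")
    cur.execute(
        "SELECT slot_name, active, wal_status, safe_wal_size "
        "FROM pg_replication_slots"
    )
    for slot_name, active, wal_status, safe_wal_size in cur.fetchall():
        print(slot_name, active, wal_status, safe_wal_size)
conn.close()
```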
Final one, direct denial of service.
Either an external actor, or I've seen this internally as well, where someone's accidentally hammering their own database.
Have you seen this?
I guess it isn't necessarily a full outage, but yeah,
at least a partial outage normally. Yeah, first slow down and then outage,
especially if you have max connections a lot. And like, it's very common if some mistake
happened in application code, and it started to issue some queries at a huge rate,
a lot of queries per second of some kind,
it can put you down definitely, of course.
Yeah.
Awesome, that's all I have.
Good.
I think it was a good overview of some of it,
maybe not comprehensive, because this topic is huge, actually.
And especially if you include,
as I prefer,
if you include application level stuff into it,
HA is a huge topic
and big departments are working
to achieve good HA.
So SREs and DBAs, DBREs and so on.
But I hope we both gave an overview
and dived into a few particular areas.
Of course, not into all of them.
Thank you so much.
Yeah, awesome.
Thanks so much, Nikolay.
Yeah, I forgot to mention the importance of likes and subscriptions.
Please do it because I know we received a lot of attention recently with recent episodes
and also our anniversary and so on.
And every time, if you like it, please make sure you leave a like or a comment or something. This is
super important for us. This is our fuel and also it helps to grow our channels. We have multiple
channels and this helps to grow and more people can benefit from what we discuss here. So please don't underestimate these social media likes and comments and so on.
And share with colleagues as usual.
Thank you everyone for everything.
Yeah, we appreciate it.
Take care.