Postgres FM - Upgrades
Episode Date: March 17, 2023

Nikolay and Michael discuss major and minor version Postgres upgrades — what they are, how often they come out, and how regularly we should be upgrading.

Here are links to a few things we mentioned:
Postgres versioning policy
why-upgrade (by depesz)
postgresqlco.nf (by Ongres)
postgresql.conf comparison (by Rustproof Labs)
pg_upgrade
Logical replication
CHECKPOINT
amcheck
Locale data changes (e.g. glibc upgrades)
ANALYZE
Upgrades are hard (summary of panel discussion by Andreas 'ads' Scherbaum)
spilo
Recent pgsql-hackers discussion about using logical and pg_upgrade together

What did you like or not like? What should we discuss next time? Let us know on social media, or by commenting on our Google doc. If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Hi Michael, let's talk about upgrades.
Your choice, as usual: a very boring topic, which I don't like, but which we all need to do.
I like to always be running the freshest version of Postgres, including the minor version. But no, we don't have
CI/CD with automatic deployments to all Postgres installations, unfortunately, right?
That's one of the reasons I really like this topic, though. I think it's something where a
lot of people want to be on the latest version more often, or there'd be benefits for doing that.
But there are enough downsides, there's enough friction at the moment,
that it's not always easy for people to be on the latest versions, and there can be quite a lot of
work involved. So I wanted to discuss that a little bit: things that we can do to minimize
that, or give people tips on things to look out for. But yeah, I was actually surprised how many
people... I did a quick poll before the episode on social media, and I think about a third of
people, like you, like to upgrade every year. Which, well, I guess doesn't necessarily
mean they're on the latest version.
Well, I like changes at least once per quarter, if we talk
about minor upgrades, right? And yes, so should we go to that first, actually, in terms of what the
differences are?
Yeah, let's discuss. So minor upgrades are security or bug-fix upgrades. They don't change anything in terms of functionality most often, though sometimes some features are affected slightly. And major upgrades: every year we have a big new major Postgres version, with a lot of new things, usually very good ones.
So, yes.
So in terms of versioning, it's not like some projects, right?
So some projects you get kind of semantic or semver,
and they will release new features in minor upgrades,
whereas Postgres doesn't.
And I think that's really important. Not only will they only release new features in major versions,
it also means a major version does not necessarily mean breaking changes,
whereas on some other platforms, a major version will bring breaking changes.
Naturally, it's also the only place Postgres will introduce breaking changes.
But it doesn't mean there will be any.
I think they're quite rare, actually.
I think Postgres is really good at maintaining backwards compatibility.
So I think sometimes people think maybe they're a bit too scared of major versions in Postgres,
whereas obviously you need to test.
But if you read the release notes, there aren't many things that your existing system will
stop being able to do.
Right.
Well, yes.
So speaking of differences between minor and major versions, it's still worth remembering that Postgres changed its versioning scheme from three numbers to two numbers roughly five years ago,
when Postgres 10 was released after Postgres 9.6.
And this caused two issues.
Some people started to think that there is such a version as just 9,
a major version 9, which never existed.
But some people, vice versa, they started to think that if you check...
no, there is no second problem, sorry, right?
Or there was, I don't remember.
So, yeah, I didn't see anybody confusing, for example, 10.0 and 10.1
and thinking those were two different major versions.
But I do see people referring back to 9.5 and 9.6 as if they're the same version.
So that's when it changed.
9.6 was the last of those point releases.
So I'm calling it a point release, but it was a major version.
And then we went to 10.
And then 11 was the next major one after that.
So even back when it was 9.something, these were major versions that were released annually.
And yeah, it's nice that now looking back, the oldest supported version is no longer confusing.
So it must have been more than five years ago, which is cool.
But we still have some production systems with older versions, unfortunately,
unsupported and not upgraded yet, lagging in terms of upgrades a lot.
I wanted to return to this question of when to upgrade:
there are two different directions here.
First, we all want to be up to date and benefit from all fixes, including security fixes, and also from new features, including performance-related features.
And at the same time, my nature is more a development nature, even though I work with a lot of ops people and companies on the administration of Postgres.
I always tell people that my nature is development.
I always try to move forward with new features, new exciting stuff, and so on.
But of course, over time, since I have many years of experience,
I developed some practices to avoid moving too fast and too early,
to untested things and so on. But at the
same time there is a big idea: let's not use the latest version, because it's probably not ready
for our very serious production case. And this is reasonable. So, for example, when a new
major version is released, some people say: okay, I will wait for a couple of bug-fix releases.
Right?
Like, okay, 15.0 released.
I will install 15.2 only.
Right?
I will wait.
And this is reasonable,
because bug fixes happen
after a new major version is released.
A lot of new testing happens in reality,
in production. Right? And you can benefit from it if you wait. But if everyone is thinking in this direction,
imagine that it will not be tested well enough.
Well, I also think in practice it hasn't worked out like that. If you look back, at least since I've been
following Postgres, there haven't been major issues in the point-zero releases, except for one
occasion. So in the last seven, eight years there haven't been any huge issues in .0 that
weren't in .1, except for 14.0, which was only fixed in...
14.4, yes.
Exactly.
So that's the only example.
A lot of corruption cases happened as well and so on.
In which other versions?
Well, in all versions, something happens.
For example, CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY.
This is the example I'm talking about, from 14.0, right?
Right, but there were several problems with CREATE INDEX CONCURRENTLY
and REINDEX CONCURRENTLY. Not only in 14. It also happened before: issues
happened in 13... where did it appear? In 12? I don't remember, in 11 maybe, right? Several times,
almost every year something happened with this unfortunate but very important functionality,
which doesn't follow ACID principles, because it can leave you with some leftovers, and it's not
transactional, of course. But under load this is what you need: CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY.
So this is only one example.
There are issues in GiST, some issues.
In GIN, in GIN.
In GIN indexes, there are issues.
Every release has some issues.
Of course, right?
Otherwise, we wouldn't need bug fixes.
But all I meant was that the biggest issue I've seen in the last six, seven years wasn't fixed until dot four.
So anybody that waited until dot one or dot two didn't benefit from that strategy in that case anyway.
That makes sense.
But this example shows the downside of these tactics.
Let's wait a couple of bug fix releases.
I agree with you.
And also, well, security bug fixes
happen at any time.
It can happen with .10, .15,
any time.
So until it's out of support, and that's
the other thing driving upgrades, right?
If you want
security fixes, you need to be on a supported
version of Postgres. And that's quite important,
because eventually, there's a nice window. Five years is a nice window in terms of length, but it's also not that long. For large enterprises,
by the time you've tested an upgrade, maybe that's taken you six months to a year. You've
already lost one of those five years. And then you need to start testing the new one before that one
gets out of support. So even large companies are probably having to look at this at minimum on a
three to four year horizon. So even if you're not doing it every year, even if you're waiting,
so you're not on the latest version, you then want to make sure you're on a
supported version. Your window isn't that long, but it's long enough, I think. I think it's a very good policy.
Yep. So, long story short: if we are good Postgres ecosystem citizens, we should start using the new major version earlier, as soon as possible, maybe on smaller databases, I mean using it in production, and deliver feedback and so
on. And if bugs happen, okay, you participated and found something. But I understand those who
wait a little bit too. I understand them.
Yeah. I think also, if you've got a system that would
particularly benefit from a new feature, like if you've got, let's say, it might not be one of your
more critical ones, but there might be another reason that you can take that risk with a different database, right?
If the benefits of the new feature might outweigh some of that risk.
So I've definitely seen some people skip some major versions if there's not any major features in there that would make a big difference for them.
But then go quite quickly onto the next major version because there is a feature that's particularly interesting that would help them a lot with some other problem
So I've seen that strategy a little bit as well, kind of picking and choosing.
Or, since an upgrade might require a lot of effort from many people, sometimes companies do major upgrades only
once per two years, just to save on overhead, because it's still painful, a lot of
pain there. But minor upgrades, let's talk about minor upgrades. Do you think it's worth
always upgrading to the very freshest version in terms of minor upgrades? For example, if I'm on 13...
I want... which version? I can't keep track; it's probably the number of
quarters since the release. It's printed on the main page, well, not always, but the latest
releases are. So now it's 13.10, right?
Right, about two and a half years since it came out, the eleventh minor version in this major branch.
So a good question.
I'd say most of the time, yes.
I don't see much downside to upgrading minor versions.
I guess we're going to talk quite soon about some of the difficulties of major version upgrades,
but we don't have those with minor versions.
A lot of the cloud providers make it extremely easy.
But they lag, always lag.
They do, sorry, but for minor versions, they're quite good.
No, no.
Oh, okay.
I mean, it depends, but sometimes it's several weeks.
And if it's security bug, it's not good.
Yeah, agreed.
So, yeah, what do you think?
Do you think that you should always
be on the latest version?
Yes, I think yes, but you still need to test it. Some automated testing, some
workflow should be developed: tested first on lower environments for a couple of days,
and then released. And also extensions: they have their own lifecycle too, except those which are contrib modules inside the Postgres distribution, Postgres contrib.
So they also require testing.
And I had bad incidents causing segfaults when some extension was released and not properly tested,
and it happened to reach some critical production system. Not tested in both cases:
by the developers who decided to release it,
and by the SREs
who just forgot
to put it on some stop list for upgrades
so it could be properly tested first,
and they had automatic upgrades of the extension.
It was
CentOS or Ubuntu,
I don't remember, but it was some automatic
upgrade of packages without proper testing.
So two mistakes, and we have a bad consequence.
Some testing is needed, but in general,
you should be on the latest version
in terms of minor upgrades.
And there is a perfect resource to check the diff
in terms of functionality:
why-upgrade.depesz.com.
That's why-upgrade, with a dash, dot depesz dot com.
Always recommended.
Yeah, it's fantastic.
it's really good for major versions too
I like that they give
kind of big red warnings
for security issues
it's quite a nice resource
if anybody's ever interested
in being able to search
like text wise
as to whether like
there have been any changes
to a certain feature
or any commits with
mentioning a certain feature
in the last couple of major versions,
or if you want to see what's in specific minor versions,
it's a really good resource.
Should we talk through the options people have
for major version upgrades?
Let me confuse you here.
I just wanted to point out,
I saw on why upgrade there is also diff in options
for major versions, which options were removed, which were added in terms of configuration.
Yes, so like configuration.
Yes, yes. So it has that as well, which is also useful: not only the entries from release notes, but also configuration options. Which brings us to another resource
I mention quite often, postgresqlco.nf, to dig into details
about configuration options for different major versions.
Yeah.
Right.
There's one other resource that I've forgotten the name of,
but Ryan Lambert from Rust Proof Labs released it.
And it will show you not just the config,
it will show you any defaults that have changed as well.
I'm not sure if the depesz one does that.
So like if it's an existing configuration option,
but the default has changed.
Defaults are a very, very fragile topic,
because there is the default defined in the PostgreSQL
source code, and there is also a default which is sometimes different, for example, for shared_buffers,
defined by the official PostgreSQL package, the apt or rpm package.
So which default are you talking about?
I think he builds from source, so I suspect it's that one.
That one you can probably see when you select from pg_settings. There is a boot value and a reset value
there, and that default came from the sources. But packages sometimes override defaults
slightly. So if you check how the PostgreSQL apt package is created, for example, you will see that it overrides some defaults.
So it's an interesting topic.
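(For reference, a minimal sketch of one way to check this: pg_settings exposes both the compiled-in default and the value currently in effect.)

```sql
-- Settings whose effective value differs from the compiled-in default,
-- plus where the current value came from (config file, override, etc.)
SELECT name, boot_val, reset_val, source
FROM pg_settings
WHERE boot_val IS DISTINCT FROM reset_val
ORDER BY name;
```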
I think what these sites are good for is kind of giving you reminders of things to check.
You should still do your own checks, right?
It's not that you should treat these as gospel.
It's more that they remind that, oh, some of these settings might have changed as well.
What's that like in your system?
Does your cloud provider change it to something different?
It gives you an idea.
I understand.
But I think it's, I find them a good reminder
of things I should be checking.
Like, are there things I would have forgotten to
without looking at some of these tools?
Right.
Well, yes. So regarding minor upgrades,
how do we do it? I prefer doing it with minimal downtime. It's possible to do it almost without
downtime if you have some proxy, a connection pooler, in the middle supporting the pause/resume approach,
like PgBouncer does. It's possible.
Others also try to implement it, I see.
But it's a slightly different topic.
My main advice here is: don't forget to issue an explicit CHECKPOINT, because if you just restart... you need to restart Postgres, since
a minor upgrade is just a replacement of binaries, basically.
New binaries: you need to
stop using the old ones and start using the new ones, so let's restart. But a restart, if done just straightforwardly, without thinking about what's happening under heavy load with a higher max_wal_size,
if someone tuned checkpoints properly, the restart might take
significant time. Because when a shutdown attempt occurs, Postgres first executes a
shutdown checkpoint to flush all dirty buffers to the disk, basically to save all
pages which are not yet saved from memory to disk, obviously, right?
But while Postgres is doing this, no new queries are accepted.
So clients start observing errors.
And if it takes a minute, in heavy loaded clusters, it might take a minute.
It's not good.
So instead of this, instead of just allowing Postgres to implicitly do everything,
you issue an explicit checkpoint
right before you restart. In this case, shutdown checkpoint will not have work to do.
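(A minimal sketch of the idea on a self-managed install; the service name below is an assumption, and managed services handle restarts differently.)

```bash
# Flush dirty buffers ahead of time so the shutdown checkpoint is quick,
# then restart onto the new minor binaries.
psql -U postgres -Xc 'CHECKPOINT;'
sudo systemctl restart postgresql@15-main   # Debian/Ubuntu-style unit name, as an example
```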
Yeah, I think that's a great tip.
So you already said...
Well, a little bit. Because
how many buffers happened to become dirty again? Dirty means changed in memory. And just a few, right? If you do an explicit
checkpoint, just the SQL CHECKPOINT command under a superuser, and then restart immediately
after it (well, not immediately, of course, everything has a duration), only a few buffers are dirty by then,
the shutdown checkpoint has just a little work to do, and the restart becomes fast. I see a
problem here, that Postgres doesn't tell you somehow. It could tell you; there is an opportunity here,
an opportunity for improvement in Postgres, to show what Postgres is doing
right now in terms of this shutdown checkpoint. For example, every second it could report to the logs or to the output, like a progress bar for the shutdown checkpoint, 20 percent done or something, right?
Because some people, I've observed this many times, issue a restart or stop for the Postgres service,
and after 10-15 seconds they become very nervous, not understanding that it's still doing some work, and they start doing very
crazy, bad things, for example kill -9, SIGKILL, and so on. And this is not
good at all: you already have a crash, basically, and need to recover. And then during recovery they also
have a similar issue of not understanding what Postgres is doing right now.
If you check ps, which is what I usually do, I check ps during recovery time,
if it happens that we have a recovery.
Or in top, for example: you run top and then press 'c', and you see the command details.
And there you will see details for Postgres: even if it's still starting up and
doesn't allow connections, you will see the LSN and that it's progressing. This is the most important thing. I
just discussed it earlier today in a different context: the most important progress-bar UX feature
is to let you know that something is happening, that some good work is being done right now. If you don't have that, you
can start to get very nervous and make mistakes. So this is what I have in my mind speaking of
restarts: for a minor upgrade, we just need to take care of the restart.
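(For example, something like this rough sketch; process titles vary a little between versions and platforms.)

```bash
# Watch the startup/recovery process while Postgres is not yet accepting
# connections; the WAL position being replayed shows up in the process title.
watch -n 1 "ps aux | grep -E '[s]tartup|[r]ecovering'"
# In top: press 'c' to show full command lines and look for the startup process.
```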
Well, what about extensions as well? Do you want to upgrade them at the same time?
Oh, of course you need to check the release notes.
That's why we mentioned why-upgrade,
because the release notes might tell you that, for example,
you need to re-index some indexes.
Yes.
We're not talking about... I guess, actually, yeah, you'd even think it shouldn't be true for minor versions.
Yeah, I was thinking it shouldn't be true for minor versions, but exactly, right.
Yeah.
Really good point.
Fortunately, if you check carefully what happened, for example, for 14.4, it told you that some indexes might be corrupted.
Well, we have some tools to check which indexes are corrupted, at least for B-tree.
There is an official tool, amcheck.
It doesn't work for GIN and GiST indexes yet.
There is work in progress, still not committed.
But for B-tree, we have quite powerful tools.
So you don't have to re-index everything,
although you can also plan to re-index everything just for your safety,
to get rid of bloat and so on.
But if you don't want to do it during downtime,
because for 14.4 we required downtime, unfortunately,
unless you involve logical replication,
and we will talk about that in a minute,
then you need to find exactly which indexes are already corrupted
and re-index only them.
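(A sketch of that approach; the index and database names here are placeholders, and ideally you run the check against a clone or replica to keep the extra load and locking off the primary.)

```sql
-- amcheck ships with Postgres as a contrib extension; bt_index_check with
-- heapallindexed does a deeper (and slower) B-tree verification.
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('some_index'::regclass, heapallindexed => true);

-- Rebuild only what was reported as corrupted, without long blocking locks:
REINDEX INDEX CONCURRENTLY some_index;
```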
Well, yeah, I think you're right.
I think that would give you quite a high degree of safety.
But equally, amcheck will tell you that an index is definitely corrupted,
but it can't tell you for sure that an index is not corrupted.
So I would still be nervous; especially around 14.4 I
would have still re-indexed everything, personally.
I think you are not right. I remember
problems where early versions of GiST and GIN support in amcheck, which is still in progress,
had false positives, telling that some index is... Well, there are different types of corruption,
first of all.
And when we check heavily,
and this check requires some
exclusive...
Not exclusive, but some heavy locks
blocking at least either
DDL or writes, I don't remember exactly.
So you cannot do it on the primary
just online.
You need to do it on some copy of your database.
So if the tool says nothing, it's indeed nothing in terms of what it checked.
So no false negatives.
Okay, this confuses me.
I thought it was the other way around.
I think I might have even made this mistake before.
Well, I might be mistaken as well, worth double-checking. But in real life, we
could not afford re-indexing everything during downtime or immediately after.
Actually, I think for this case, when you do a minor upgrade, if we already know that we have
corruption, we found it, well, we can deal with it online after the minor upgrade, and probably think about the consequences of this kind of corruption.
It depends on the application, what exactly happened, and so on.
But, of course, it's worth checking how amcheck works.
It requires additional expertise.
I don't remember everything. I'm re-checking myself
all the time as well. I can share only my own mistake, and I made it twice: I thought amcheck cannot work in
parallel. Of course, if you want to check, it's like with re-indexing as well: sometimes you need
to move faster, like parallelizing ANALYZE and other processes useful during upgrades. And
amcheck can be executed with the -j option if you use the CLI tool, pg_amcheck. I keep forgetting
about it, and I have several versions of my own implementation for parallelizing
amcheck. So don't be like me, just use the existing tool, it supports it.
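(For reference, a small sketch; the database name and job count are placeholders.)

```bash
# pg_amcheck (shipped since Postgres 14) parallelizes checks across connections
# on its own, so there is no need to hand-roll parallelization.
pg_amcheck --install-missing -d mydb -j 8 --heapallindexed
```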
While we're in this area of talking about corruption, I feel like it might be sensible to discuss glibc versioning.
It's more about major upgrades. Sometimes people try to combine several big changes
at once, because it's stressful: sometimes you need some downtime, most often you need
to plan downtime when you run pg_upgrade. Even with -k, or --link, so with hard
links involved, it should be fast, like a minute or two, but still it's a minute or two you need to
plan. And then you also need to run ANALYZE, probably in stages, depending on the application,
on whether your application works well when default_statistics_target is low,
so with rough statistics.
Still, it's either downtime or not,
it depends.
But the overhead is significant,
and sometimes management might decide
to combine multiple steps into one.
And if you change,
for example, the operating system version, switching from an old Ubuntu, for
example 16.04, which is already out of the picture in terms of support, so we
need to change, then every glibc version change should be
checked very carefully in terms of possible corruption. So
again, amcheck is very important there.
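(One way to spot trouble after a glibc or ICU change, as a hedged sketch: compare the collation versions recorded by Postgres with what the new OS actually provides. Note that on older major versions libc collation versions may not be recorded at all.)

```sql
-- Collations whose recorded version no longer matches what the current
-- libc/ICU on this host provides; a mismatch is a hint that indexes on
-- affected text columns should be checked (amcheck) or rebuilt.
SELECT collname, collprovider, collversion,
       pg_collation_actual_version(oid) AS actual_version
FROM pg_collation
WHERE collversion IS DISTINCT FROM pg_collation_actual_version(oid);
```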
But in general, if you can afford not to do
multiple big steps in one, not merging them, I would do a separate upgrade for the operating
system, a separate upgrade for the Postgres major version, and probably hardware separately; hardware is easier,
usually. Just because it reduces the risk of being in a situation where you don't understand what
caused some issue. Like, you upgrade, you see something goes wrong, and you
have a wider field to analyze what the possible reasons are, right? So root cause
analysis: it could be one, it could be the other, it could be both.
Dependencies.
You have so many, exactly, yeah.
So root cause analysis, in case of issues after an upgrade,
might be very expensive and problematic.
So if you can afford doing separate steps,
you should afford doing separate steps.
This is my usual advice.
But, of course, real life sometimes tells us,
let's combine steps.
In this case, just proper testing and so on.
I kind of forced us onto major version upgrades, I guess. Was there anything else you
wanted to talk about on the minor version upgrade side?
Well, minor upgrades, that's it. My advice is checkpoint, and that's it. I don't know what else to mention there.
I don't think so either, only
that the security patches come out.
I think they come out every quarter unless there's a really big security issue.
Yeah, there is a schedule.
If you Google Postgres version policy or something, there is a wiki page describing the schedule.
I'll make sure to share that.
And that they get back-patched to old versions for about five years, a little bit longer I think,
but not much. And those releases usually happen on Thursdays.
Nice. I don't know, my favorite game is released on Thursdays as well; it's in virtual reality.
I think people want to have Friday to react to issues, not the weekend, right? So it makes sense.
Right, fair enough. Cool. All right, so, major versions: we've mentioned a couple of things already, pg_upgrade,
briefly talked about logical. Do you have favorites in different scenarios? How do you
tend to approach things?
Let's be brief, because we're already approaching our limit in
terms of time, and this is of course a huge topic.
A huge topic. Oh, and I have a recommendation: you did a panel discussion, was it recorded?
There was one.
Oh, that was good. I think it might be... anyway, I'll look for that, if
there's a recording of it. It was great.
It was not recorded. But there is an article
with a summary; we can attach it. So, if you want to upgrade,
right now the de facto standard is using pg_upgrade, an in-place upgrade. It involves dumping
the schema and recreating a new cluster.
Cluster means the PGDATA directory.
Then dumping and restoring the schema.
And then -k, hard links, speeds things up very well.
And then you can already open the gates
and run ANALYZE in stages.
At first, query plans,
query execution plans, will not be good,
because statistics are not yet collected.
But analyzing in stages starts from very rough statistics,
a low number of buckets, and then jumps, jumps, jumps,
and then finally you have proper statistics.
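(Roughly, that flow looks like the sketch below; versions, paths and package layout are placeholders, and it's always worth rehearsing on a clone first.)

```bash
# In-place upgrade with hard links; --check first as a dry run.
/usr/lib/postgresql/16/bin/pg_upgrade \
  --old-bindir=/usr/lib/postgresql/15/bin \
  --new-bindir=/usr/lib/postgresql/16/bin \
  --old-datadir=/var/lib/postgresql/15/main \
  --new-datadir=/var/lib/postgresql/16/main \
  --link --check
# Drop --check for the real run, start the new cluster, then rebuild planner
# statistics in stages so rough stats become available quickly:
vacuumdb --all --analyze-in-stages
```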
I would like to recommend looking at all nuances here,
how Alexander Kukushkin implemented this in Spilo, a part of
PostgreSQL
operator for Kubernetes from Zalando.
Alexander also
is maintainer of Patroni.
So, Alexander implemented it
very well in Python, checking
also things about extensions and so on.
And I suppose
it's already very well battle-tested
at Zalando, where they have a lot of
clusters. So they have already, I think, upgraded... I'm not 100% sure, but I'm almost sure they upgraded
a lot of clusters. We can check with Alexander on Twitter, he's active there sometimes, so
you can ask him. By the way, just a side note: Alexander is not at Zalando anymore, he's working
at Microsoft. Patroni recently started to support Citus, it's interesting. But anyway,
this is, in my opinion, a very good implementation of automation around pg_upgrade. Because
Postgres lacks a lot of things needed to automate this. You need to automate many steps. It's not like a button and that's it.
Unfortunately, this is a problem. But then there is also an approach involving logical replication
to minimize downtime almost to zero. Not really zero, because you need to switch over: to implement a switchover, to execute a switchover. And during the switchover, some of your users will see some errors.
So it's not zero downtime.
It's so-called near zero downtime.
Yeah.
And by the way, before a switchover,
also don't forget about checkpoints.
On older versions, which required restarts during a switchover,
and Patroni restarted all nodes during a switchover,
this is important:
in this case they happen faster. On newer versions, promotion is implemented in a better way.
So, speaking of logical: the usual approach, if you don't have big databases, is to initiate
logical replication to the new cluster. Basically, data will be brought there at the logical level.
Think about it like dump/restore.
Eliminating bloat is a good side effect, right?
But this also involves quite a long-running transaction, resource-wise,
and sometimes this process cannot converge
if you have dozens of terabytes under heavy load.
It's very hard to use it then.
But for smaller clusters, you have logical replication.
It's working.
Then you just switch over there.
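(The bare-bones version of that looks something like the sketch below; the names and connection string are placeholders, and the schema has to be copied to the new cluster first, e.g. with pg_dump --schema-only.)

```sql
-- On the old (source) cluster:
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new cluster, already running the new major version:
CREATE SUBSCRIPTION upgrade_sub
  CONNECTION 'host=old-primary dbname=appdb user=replicator'
  PUBLICATION upgrade_pub;
```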
Good thing here that you can do many additional steps
while logical is working.
You can re-index some indexes, fixing any issues.
You can adjust some things for compatibility.
Of course, logical replication has a huge list of limitations.
Still, they are being improved.
In Postgres 15,
a lot of improvements. In 16, there will be
more improvements. For example,
in general, you should think about sequences.
Sequences need to be synchronized.
Otherwise, after the switchover,
you will have overlapping usage
of old values,
and basically inserts won't work if you have sequences on the old tables. So you need to reset
each sequence value to match the old value, plus maybe some gap. If you use integer-8, bigint, primary keys, you can
afford making gaps of millions; nobody will notice such big gaps, because the capacity is huge.
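(A hedged sketch of the sequence step; the names, values and the gap are placeholders.)

```sql
-- On the old primary: capture current sequence positions to carry over.
SELECT schemaname, sequencename, last_value FROM pg_sequences;

-- On the new cluster: reset each sequence to the captured value plus a
-- generous gap, so post-switchover inserts cannot collide with existing ids.
SELECT setval('orders_id_seq', 1234567890 + 1000000);
```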
Yeah. And then you also need to think about blocking DDL during
this procedure on existing versions, because DDL is not logically replicated, and there are several more
restrictions worth checking. But again, for large clusters this recipe doesn't work,
because we simply cannot initialize them at the logical level. There is a trick: we can
use a physical standby converted to logical using recovery_target_lsn. Now we have logical
replication, and we can upgrade and then switch over. Right? That's what I thought
a month ago, and I was mistaken, because while pg_upgrade is running, logical replication
becomes inactive and we have some data loss there. So, a new recipe.
Yeah.
By the way, there is a good discussion on pgsql-hackers.
There is some hope that future versions of Postgres will allow you to use logical replication and pg_upgrade together officially.
What I'm talking about here are recipes born not of a good life.
We try to invent something for heavily loaded, big systems.
And there is such a recipe.
Instead of using logical, converting physical to logical and then running pg_upgrade, we just create a slot,
allow the physical standby to reach a recovery target LSN
matching the slot position in terms of LSN.
Then we don't switch to using logical replication yet.
We keep accumulating some lag on the slot on the old primary.
And then we run pg_upgrade.
It's very quick, right?
We don't analyze.
We analyze later.
And then already we switch to using logical.
Cluster was already upgraded.
So we're just starting to use logical.
We don't have any data loss in this recipe at all.
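(Very roughly, and only as a sketch of the idea rather than a complete or battle-tested procedure; every name and value below is a placeholder, and a publication is assumed to already exist on the old primary.)

```sql
-- 1. On the old primary: create the logical slot and note the LSN it returns.
SELECT pg_create_logical_replication_slot('upgrade_slot', 'pgoutput');

-- 2. On a physical standby: stop replay exactly at that LSN, then promote.
--    recovery_target_lsn    = '0/1234ABCD'   -- value from step 1
--    recovery_target_action = 'promote'

-- 3. Run pg_upgrade on the promoted clone, then attach a subscription to the
--    pre-created slot so the accumulated changes stream in without loss:
CREATE SUBSCRIPTION upgrade_sub
  CONNECTION 'host=old-primary dbname=appdb user=replicator'
  PUBLICATION upgrade_pub
  WITH (create_slot = false, slot_name = 'upgrade_slot', copy_data = false);
```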
And then we can run ANALYZE. We have
quite a good window to run ANALYZE, even without stages,
while during all of this we keep using logical. The only requirement here is to know 100% that logical replication will
keep up, I mean that it will not lag. And here
it also depends on the workload.
In some cases, the WAL sender is
our bottleneck, because
it's single. You cannot
parallelize it.
If you have multiple WAL senders,
all of them still need to parse
the whole WAL stream.
So if you had 100% CPU on
one WAL sender,
you will have multiple WAL senders
using 100% of multiple cores.
Not good.
But on the recipient side,
we can use a trick:
parallelizing the work among multiple slots,
with multiple WAL receivers and so on.
And in this case,
part of our tables is processed
by one logical replication stream, another part by others,
and so on. In this case we can
move faster, keep the lag
lower, and
again have a zero-downtime
upgrade for large
clusters.
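(Sketched out, the splitting could look like this; the table, publication and connection names are placeholders, and each subscription gets its own slot and apply worker.)

```sql
-- Old cluster: disjoint publications, e.g. with the heaviest tables separated out.
CREATE PUBLICATION pub_hot  FOR TABLE orders, order_items;
CREATE PUBLICATION pub_rest FOR TABLE users, events, audit_log;

-- New cluster: one subscription per publication.
CREATE SUBSCRIPTION sub_hot  CONNECTION 'host=old-primary dbname=appdb' PUBLICATION pub_hot;
CREATE SUBSCRIPTION sub_rest CONNECTION 'host=old-primary dbname=appdb' PUBLICATION pub_rest;
```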
Final trick: there is
PgBouncer here.
If you can afford
pause/resume, then depending on the workload you may have
a real zero-downtime upgrade, right? But I never saw this recipe described anywhere;
it's just some secret of ours we're testing right now. Maybe we will publish an article at some point in the future.
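(The PgBouncer part, in outline; these are commands on the PgBouncer admin console, and the exact choreography depends entirely on your setup.)

```sql
-- Connect to the admin console, e.g.: psql -p 6432 -U pgbouncer pgbouncer
PAUSE;    -- let in-flight transactions finish, hold new queries in the queue
-- ...repoint PgBouncer / perform the switchover to the upgraded cluster...
RESUME;   -- queued clients continue against the new primary
```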
Definitely should. I see this come up from time
to time, but honestly, most of the people I talk to don't have zero downtime as a
really hard requirement. If you can afford a few seconds here or there, or a little window
once every year or every couple of years for this, the options become much, much simpler. So I
tend to see people go that route.
But yeah, for huge systems under load,
I can see how that's awesome.
Right, and I agree.
But sometimes we have a problem that, for example,
if we have a small downtime,
it can create requirements for additional downtime.
The window starts to grow.
It starts to grow because, you know,
if we need to stop the database for one minute,
we need to stop some other components too.
And when you already have a 10-minute requirement and then it grows to half an hour and so on, it's not good.
But I agree.
In many cases, we can afford some brief downtime.
In this case, a recommendation is to take care of your application code and make sure that data loss doesn't happen when the database is down.
And all systems should be resilient to brief outages. So: retries, or read-only mode.
Well, yeah, retries are great, or have a read-only mode.
Read-only mode plus retries for writes: this is the best approach, and it should be used by all
back-end developers, in my opinion. But the problem is that usually people understand it
but postpone it, you know. It's like with CI/CD test coverage:
we have good testing, but there's always room to improve, and so on.
But in reality, you need to design your system not to lose writes,
with a read-only mode, and test it regularly.
For example, a switchover.
Okay, some writes failed, but users didn't notice,
or they were told explicitly that the form was not saved or some data was not saved.
Please retry explicitly.
Press the button one more time.
The worst implementation is when we told the user we saved it, but the database lost it.
It's bad.
This is what can happen here.
Yeah, not good.
Cool.
All right.
Anything else you want to make sure we covered?
Well, we didn't cover a lot of small topics in the pg_upgrade area.
It's a big topic.
How to benchmark properly, how to test it properly.
It's a big topic indeed.
But I think it's good enough for this short episode.
I think so.
We'll link out to various things we mentioned,
a couple of good resources on this, but hopefully we've given some people some good things to look into.
Yeah, and I hope future versions of Postgres will have a better pg_upgrade workflow, more automation around it, and so on.
But unfortunately, for example, this pause/resume which is implemented in PgBouncer: I don't have
hope here, because all attempts to bring a pooler inside Postgres
failed, unfortunately.
So the pooler is considered an external thing,
and we cannot achieve absolutely zero downtime
using just Postgres core.
We need additional pieces.
So it makes things more...
I wouldn't be shocked, yeah.
I wouldn't be shocked to see this solved
at the operator level
or even the cloud provider level.
Lots of cloud providers, maybe they'll work together one day.
Operators.
Yeah, exactly.
I think some already do this.
I hope so.
I'm lagging behind on the news.
A lot of interesting things are happening in the development of Postgres operators for Kubernetes, or Kubernetes operators for Postgres.
What's the right choice here?
They automate more and more and more.
For example, I know StackGres has a lot of automation, including this, maybe not fully,
but some parts of it, and they have index maintenance automation.
We had an episode about it, and so on. So yeah, I think this is a good area to expect improvements
from the developers of operators.
Nice one. Well, thanks again, Nikolay. Thanks everyone for listening, and catch you next week.
Okay, bye.