Postgres FM - max_connections vs migrations
Episode Date: December 5, 2025

Nik and Michael discuss max_connections, especially in the context of increasing it to solve problems like migrations intermittently failing(!)

Here are some links to things they mentioned:

max_connections https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-MAX-CONNECTIONS
Tweet about deployments vs connections issue https://x.com/brankopetric00/status/1991394329886077090
Nik tweet in response https://x.com/samokhvalov/status/1991465573684027443
Analyzing the Limits of Connection Scalability in Postgres (blog post by Andres Freund) https://www.citusdata.com/blog/2020/10/08/analyzing-connection-scalability/
Exponential Backoff And Jitter (blog post by Marc Brooker) https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello, hello. This is Postgres FM. I don't remember the number of episodes. Michael, do you remember?
No, we must be getting close to 170, though.
Yeah, some really big number. Not as big as your Max Connections, right?
I like it. Good segue.
Yeah, my name is Nik, Postgres.AI, and as usual, here is Michael, pgMustard. Hi, Michael.
Hello, Nick.
So we are going to discuss really high max_connections numbers, in general, and maybe we will touch on connection pooling systems a little bit, and maybe we will touch on the per-user connection limit setting, because that's possible too. So, in general: is it a bad idea to raise max_connections, and is it a good mitigation action when you have problems, right? This kind of action. What's the ideal max_connections?
And why do you... I think you've seen some cases recently where people were raising it, and you think it's not a good idea. So is it worth jumping straight into some recent experiences? Why is this on your mind at the moment?
Yeah, so I proposed this topic after a tweet I saw, where it was shared that there was a pattern where deployments, I guess schema migrations or DDL, were failing, but only at particular hours of every work day, or maybe every day overall. And then they realized that at the same time, during a couple of hours, some data pipelines were running, and they just raised max_connections, and that's it. Actually, I'm not sure it was max_connections, because it doesn't say which database system it is. I just assumed it's Postgres, because Postgres is the default choice right now. Maybe it's a different database.
I took a note, and it said the solution was, one, a separate connection pool for migrations, and two, better cross-team communication, which I thought was interesting. But yeah, maybe the separate connection pool for migrations is where you're drawing that conclusion from.
Well, this doesn't sound like mitigation to me. It sounds like isolation of the problem.
Yeah. Maybe. I might be wrong.
So again, like, there are several possible options here, but I bet this is what happened: these pipelines produced long-running transactions, right? And these long-running transactions blocked some DDL, ALTER TABLE ADD COLUMN, for example. And this DDL blocked any queries to this table, and the number of active sessions spiked and reached max_connections.
If you start isolating... actually, if you start isolating, everyone will suffer anyway, right? So this idea that somehow isolation helped... If this theory is right, then a separate connection pooler for migrations and pipelines doesn't help at all, because locks are global, right? You don't lock a table only in the context of some connection pool, you lock it globally in this database, right, in this logical database.
Yes. So if it really did mitigate the issue, either it was the other part of their solution, which is the better cross-team communication, or the theory is wrong, and actually it was a case of these migrations taking up a few slots, a few connections, that the application expected to be able to use.
A migration needs only one connection, right?
Yeah, true.
So how could pipelines, data pipelines, take a few connections...
Yeah.
I don't understand how they can exhaust, reach max_connections at all.
Well, should we talk about the general case a bit more then?
Yeah, so again, this general case is described, and we discussed it a few times. Everyone should understand that any ALTER statement requires a lock, an exclusive lock on the table. Well, there are exceptions, I know exceptions, but actually ALTER TABLE is a serious thing: even if it's super fast, you still need a lock. And with default settings, if your session fails to acquire this lock, it will start waiting, and it will wait forever. Default settings means lock_timeout zero, statement_timeout zero, transaction_timeout zero, I don't know, idle_in_transaction_session_timeout zero, everything zero. Zero means forever, infinite, right? Waiting infinitely. So this means it will be waiting, waiting, waiting, but any query, including any SELECT to this table, will need to wait behind us in the line. So there is a queue forming there, a waiting queue, lock queue, blocking queue. And actually you might see multiple queues, and it forms trees, and actually not just trees but forests. So it's beautiful, it can look really great, like a forest of blocking, blocking trees. But eventually this is what accumulates a lot of active sessions and can overwhelm your server, basically. Even if it's not reaching max_connections, it can lead to high CPU consumption, of course, and so on. But it's very, very common to see that max_connections is reached, because there are too many active sessions and most of them are just waiting. So when you reach max_connections, any additional attempt to connect will see an error, right? Too many connections.
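As a reference point, here is a minimal sketch of a query one might use to see such a lock queue, who is waiting behind whom; it assumes nothing beyond pg_stat_activity and pg_blocking_pids(), and the column list is illustrative:

    SELECT pid,
           pg_blocking_pids(pid) AS blocked_by,
           state,
           wait_event_type,
           wait_event,
           now() - xact_start AS xact_age,
           left(query, 60)    AS query
      FROM pg_stat_activity
     WHERE state <> 'idle'
     ORDER BY xact_start;

Sessions whose blocked_by list points at the DDL's pid are exactly the queue forming behind it.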
Yeah, "too many clients", or how is it? I don't remember. "Too many clients", I think.
Yeah, I can't remember either. I remember back in the day when a lot of people used Heroku Postgres, this used to come up all the time, because they were much stricter on how many connections they would let you have. And these days, with more people on things like RDS, which has much, much higher default max_connections than Heroku ever had, I see it much, much less that people are hitting this. I think it's probably causing other issues, I'm sure we'll get to that. But it's interesting that it used to come up more in various online forums.
Yeah. So we need to distinguish the states of sessions, or connections. Connections, sessions, backends are all synonyms in this context, the context of Postgres. So we need to distinguish active, idle and idle in transaction, the three main states, and you can see the states by selecting from pg_stat_activity. So, before Postgres 12, 13, 14, so five years ago...
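A quick, illustrative way to see those three states (assuming a reasonably recent Postgres with the backend_type column in pg_stat_activity):

    SELECT state, count(*)
      FROM pg_stat_activity
     WHERE backend_type = 'client backend'
     GROUP BY state
     ORDER BY count(*) DESC;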
Yeah, is this the Andres Freund work?
Right, right.
Yeah.
A couple of articles, published originally on the Citus blog — and Citus is at Microsoft Azure now, right? So a couple of good, like, excellent articles: first researching, and then discussing optimization. So the usual fear was that idle connections consume memory, but the articles showed that the main problem is how work with snapshots was organized, and idle connections — for example, if you just add a thousand idle connections — can hurt your performance. And I demonstrated it with benchmarks easily as well, so it was quite easy to see. But after... so if you have, I think, 14 plus, maybe 13 plus. If you have this Postgres...
That's basically everyone right now. Oh, true. It's all supported PostgreSQL versions now. That's incredible.
Because 13... we have a couple of cases where 13 is still in production, and there are talks about upgrading. But basically, right now, if you are on a supported version, idle connections don't hurt performance anymore — too much.
Or anywhere near as much.
Yeah, yeah, yeah. So having one or two thousand additional connections is not a big deal these days. And also processors improved. I really like Graviton 4, it's amazing, and it handles a lot of connections, even active ones, very well. Like, I have my neural network trained on what to expect from, say, 500 connections on a machine with fewer than 100 vCPUs, right? I can imagine what should happen. But on new hardware, I see it much better. It handles more active sessions much better.
When you say your neural network, do you mean your brain?
Yeah.
Yeah.
Yeah.
Yeah.
My LLM internally.
Yeah.
So I just, well, I expected this, but wow, guys, you are fine, actually.
You are not down.
I would expect you to be down by this point, right?
So we just see it in our monitoring what people experience, what their cluster experiences.
And like, wow. When we check what hardware it is — oh, it's Graviton 4 on Amazon, which is great. I'm impressed, with the hardware and with Postgres.
Yeah. Yeah, okay. But you're talking about lots and lots of active sessions rather than...
Yeah. So, yeah. Additionally, on larger machines, RDS and Aurora, they just set really high max_connections. I'm really curious how this decision was made. And we see 2,500 or 5,000 — it's very common. I can already, just looking at the settings, tell you — even without seeing the settings which start with "rds." — just looking at this profile of settings, I already realize: oh, this is RDS, or this is CloudSQL. There are patterns of some decisions. You know, random_page_cost 4 and max_connections 5,000? Oh, very likely it's RDS.
Yes. That's funny.
So this gives us an opportunity to observe how clusters behave with high max_connections. And the old arguments don't work anymore. The old arguments were like: guys, this is bad, you just have 2,000 connections, 1,900 of which are idle, why do you do this, because you basically pay a big tax performance-wise, it hurts all the latencies. But now it's not so. And the only argument left, actually, in my mind, is that if the number of active sessions exceeds the number of vCPUs significantly, then it still hurts performance globally, and all queries are processed for a much longer time, latencies increase.
And none of them has a chance to finish in time, inside statement_timeout, which is, for example, 30 seconds. So instead, it would be better if we reached max_connections sooner and just told some clients: retry later, too many clients. But these guys would have much higher chances to complete execution inside statement_timeout, right? So we have a limited resource, CPU in this case.
Yeah, to make sure I understand: are we talking about, if you allow too many connections, even if most of them are going to be idle most of the time, you run the risk that a bunch of them all become active at the same time?
Yeah, exactly. So if some wave of requests, unexpected, or expected but poorly planned, happens, and we have low max_connections, we will have a limited number of backends which will do the job inside statement_timeout — if it's limited; if statement_timeout is not limited, it's a different story, and also a very wrong setup for OLTP, we also discussed it many times. But if we have 10, 20, 50x the number of vCPUs as max_connections, and this wave comes...
All of them try to execute, but none of them will succeed because they will all bump into
statement timeout.
Or if statement time out, if it's not set, they will bump into some different timeout
on HTTP level, application level or something, like one minute or something.
They will still, like nobody, like, we took all the chances from everyone.
That's the problem.
If we limit, some of them will succeed.
Some of them will need to retry because too many clients and we will need to fix the problem.
And of course, if there is connection pooler, it's great because connection pooler will take a lot of headache from Postgres backends.
Yeah, well, I wanted to ask you about this, actually, because I think sometimes when one of us says connection pool, I think people, or people can assume either direction, right?
They can assume database side or application side.
and a lot of the time I see people
scaling their app and having application side pooling
and that's how I see
I think that's how a lot of the Heroku issues happened
a lot of the people running out of connection on Heroku
because it's so easy to spin up new versions of your app
and scale the app horizontally
but each one comes with another 10 connections
another 10 connections but they're using poolers
but it's application side
So there's no... they don't use anything on the database side.
They are too far from the database. And exactly, this is the pattern I saw so many times. Yeah. Especially, imagine e-commerce, and they have microservices, many clusters. Some cluster, we see it: okay, max_connections is high because they don't use PgBouncer. They say: okay, we have a Java application and all the connection pooling is on our side — HikariCP, for example, right, such a connection pooler — and all great. I cannot convince him to start using PgBouncer. Not bad, okay. Max_connections we keep somehow high because we have a lot of idle connections. Why? Because there are many application nodes. And then, before Black Friday — which happened last week, but this story was many, many years ago — before Black Friday, the infrastructure guys who are responsible for application nodes decided to double capacity. This is in their hands, their area of decision: okay, we need to be better prepared, let's just double the number of application nodes, from 100 to 200, for example, or something. And nobody thought about the configuration of those poolers. So they were like 40, right, 40 connections each.
So it was, for example, I don't know, like 20 nodes
times 40, it's 800, for example, right?
Maximum number of connections, 800.
But they doubled it, it became 1600.
I think numbers were higher in that case.
And nobody talked to the database team. It's just the application nodes — they're stateless, let them scale. And this is how you suddenly see: oh, many more idle connections. The actual load didn't increase, the number of active sessions didn't increase, but since capacity was doubled, the number of idle connections doubled, obviously. And you cannot solve this problem having a connection pooler on the application side at all. Well, you can, but you need to somehow remember, every time you add or remove an application node, to rethink the pool configuration for all of them, to maintain the same overall number. The sum of connections should be maintained, right?
So you mean if we double the number of pools, we can halve how many connections each pool has?
Yeah.
But then, yeah, it seems brittle to me to then need to remember that.
Yeah. There are two numbers, there is like a minimum and a maximum number. I'm talking mostly about the minimum number.
Yeah, these connections are open and they're maintained, and it doesn't go down.
Yeah, all that, right? So, if we doubled
capacity but actual workload didn't increase because it's not Black Friday yet, right?
Why do we open more connections to Postgres? It's not right.
And the good solution here is to have an additional pooler closer to the database, where it will be configured according to the database situation, not the situation with application nodes and the need to increase the fleet. Yeah. So a connection pooler helps, and usually we say, okay, if you have PgBouncer, each PgBouncer can handle 10,000 incoming connections, something like this, you know, thousands of them, and it should open only like 100 connections to the database. This depends on the number of vCPUs. We still say: take the number of vCPUs times some constant. Usually it was two, three, four; now, okay, we say 10, because we see Postgres behavior improved, hardware improved, okay, 10. But not more. If you have only 96 vCPUs, okay, up to a thousand. If you have 32 vCPUs, I mean, max_connections should be, okay, 320. It should not be 5,000.
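A hypothetical pgbouncer.ini fragment along those lines — a sketch, with the numbers taken from the conversation as examples rather than recommendations, and default_pool_size applying per database/user pair:

    [pgbouncer]
    pool_mode = transaction
    ; incoming application connections
    max_client_conn = 10000
    ; server connections opened to Postgres
    default_pool_size = 100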
I'm very curious how this team made this decision. I have several theories here. One theory is: there are so many clients without pooling, and this is just what's needed — yeah, okay, let's do it.
It's just moving the problem, isn't it? Instead of people thinking, oh, I don't have enough connections, they're going to have some...
Yeah, maybe they have some additional reasoning which I don't get. I'm very curious. And another thing: maybe RDS Proxy was not so good — it's still not so good. We saw it in action. RDS Proxy is a very weird proxy compared to PgBouncer's behavior. You see idle connections more... I don't remember all the details, but every time we touch it, we're like: why is it behaving so, right? What is this? Very strange, it's very different. There are no very good multiplexing capabilities like in PgBouncer, where you have 10,000 incoming connections maximum and you transform them into maximum 100 backends. PgBouncer is very reliable and stable at solving this task. RDS Proxy, not so much. So maybe they had issues with RDS Proxy and they needed to raise max_connections because it was not solving this task somehow.
Yeah. Interesting theory. If anybody knows more, let us know.
Yeah. I'm very curious. So this is my approach right now to max_connections: at most 10x the number of vCPUs. Ideally lower. So the database can breathe under heavy load.
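As a sketch, assuming a 32-vCPU machine and the "at most ~10x vCPUs" heuristic above; note that changing max_connections requires a restart:

    SHOW max_connections;
    ALTER SYSTEM SET max_connections = 320;
    -- takes effect only after a restart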
And it lowers the risk, I think... yeah.
So, yeah, just be honest: how many resources do you have, right?
Yeah.
Do you have 1,000 cores? Maybe yes. In this case, no problem, go for it. Because it sits like that all the time — no problem, no problem — but then we have a problem, we have a huge spike, and an unresponsive database, right? A system, I mean, a database system.
Yeah, exactly. It's picking which, like, what do you want your failure case to be, right?
Do you want your failure case to be some people can't access it?
Or do you want your failure case to be nobody can do anything?
Yeah, and also, if some attempt to connect and execute a query from the backend failed, the backend should have retry logic. Okay, yes — in connections and in query execution, in both. So it should be handled there properly, additionally. If users immediately see "too many clients", okay, maybe you should additionally adjust your application code. So the ideal state is retry logic with exponential backoff and jitter. This is like the more scientific approach. There is a good article, an Amazon article about this. It's not rocket science, but it's good — there is some science behind it. Exponential backoff and jitter.
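A tiny illustration of the "full jitter" delay from that article, expressed in SQL just to show the formula (the actual retry loop normally lives in the application); a base of 100 ms, a cap of 10 seconds and 5 attempts are placeholder values:

    -- sleep between 0 and min(cap, base * 2^attempt) before each retry
    SELECT pg_sleep(random() * least(10, 0.1 * 2 ^ attempt))
      FROM generate_series(1, 5) AS attempt;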
So it can happen quite fast, and this adds some resiliency to the system. A little bit. Of course, if you have 500 max_connections and all of them are busy all the time...
Yeah, the users will notice this, right?
Yeah, but still, those who are still working will have more chances to complete, and this is, I think, super important — instead of, like, welcoming a performance cliff, basically.
Yeah. But another thing which I'm curious about is how to demonstrate the problem. Naive attempts have failed so far, so I don't have a good article demonstrating how this performance cliff behaves. I call it a performance cliff — maybe too strong a word as well. But I feel it's somehow, maybe not a cliff, but somehow too high max_connections increases the risk of being down for everyone. And how to demonstrate that, I'm not sure. So these considerations are more theoretical. We have many clients who listen to us and agreed, but, you know, max_connections — to change it, you need a restart. So they still sit with 5,000. That's it. At the same time, I'm not super worried.
Well, I still believe we need to go down with max_connections, and plan a restart, and do it. Unfortunately, some companies are afraid of restarts and try to minimize them. It's another question, right — how to stop fearing restarts and do minor upgrades more often and so on. But another fact: okay, if you have too high max_connections, it's also good, because, you know, in our monitoring we put active session history analysis, wait event analysis, in the center right now, and it's quite great to see very colorful, beautiful graphs. If we have a lot of max_connections, a lot of areas with different colors, it's so good. It was a joke, right? Okay. I mean, if you have max_connections high, your graphs look better, you know?
Oh, I see what you mean, yeah.
Postgres probably is not responsive, right? But the graphs look really great.
Okay.
Well, I'm interested in the pooler. Like, I do think we need to justify why you should have an extra thing running. It's more complexity for people to justify, right? It's another layer, if we're saying you should use PgBouncer. And added latency — maybe not much, but there's a bit extra, right? There's one extra hop. So I do think it is sensible to have to demonstrate what the advantage is.
Yeah.
Yeah.
Well, it's quite easy to demonstrate the benefit of having PgBouncer in between. There are certain types of workload where it will be very noticeable that the bouncer improves things. And also if you have slow clients, it improves things, because backends are not working on transferring data — it's offloaded to PgBouncer.
Yeah, okay, that's a good one.
Yeah, but also if you take a lot of fast selects: with PgBouncer it will be much better than without PgBouncer. Because, for example, you achieve 1 million TPS with select-only pgbench, right? And PgBouncer helps Postgres to finish communication and start working on the next execution. So you can feel it, and you can raise the number of clients more easily and scale better.
As a simple benchmark, actually: usually in this kind of stress load testing we start with one client, two clients, four clients, and so on — the parameters -c and -j, lowercase, in pgbench; we match them, because there is interesting logic not everyone understands from the beginning about how these parameters are connected to each other. And then we just grow, but usually when we grow past the number of vCPUs, we go beyond that, TPS numbers go down and latency goes up, right? And this is the ideal state for a direct-connection benchmark: we just scale until the peak, which matches the number of vCPUs. If the peak starts earlier, there is some interesting bottleneck, I bet, some lightweight lock or something. And then, if you introduce PgBouncer, this peak should shift to the right and become higher.
Yeah, that's how you can see it.
Well, is that an easier way to sell it then?
Instead of trying to convince people to use it because it reduces the risk around
exhaustion of resources, actually just try and sell it on the, oh, increased performance.
There's no problem selling PgBouncer. Everyone wants it. Almost everyone.
Well, I don't know... interesting.
Yeah, yeah. Yeah, well, I talked about that company — there, I couldn't sell it. These days, I see everyone understands much more easily that connection pooling is needed.
And also, keep in mind that I just said a lot of very fast selects.
If you have long-running statements quite long, for example, 10 milliseconds or 100 milliseconds.
It's quite long.
It means if 100 milliseconds, it means execution 100 milliseconds, including planning time.
It means that one backend can do only 10.
A second.
QPS, TPS, yeah.
It's not a lot at all.
And that's it.
And the communication overhead, which PG bouncer will take on its shoulders, will not help a lot in this case.
It will help only if you have tons of fast selects.
In this case, this overhead will be comparable to execution.
time, right, or maybe even higher. Somehow, sub-millisecond selects, this is where we need it. And guys can tell you: you know what, we don't have so many of those — depending on the application, but sometimes we see average query latency exceeding one millisecond significantly. In this case, PgBouncer will help, but not so much. Well, again, depending on the clients: sometimes clients are slow, sometimes they're under control, not so slow. So, but anyway,
I just, from my experience lately, dealing with a lot of startups, including AI startups, which grow really fast.
Everyone understands connection pooling is needed.
Usually it's a trade-off. Like, if it's RDS, what to do? Because RDS Proxy is managed, a no-brainer, but it lacks good behavior, as I said, and it also lacks pause/resume, which is weird, because, yeah, there are blue-green deployments, but RDS Proxy doesn't have pause/resume. So there is no zero-downtime possibility here to maintain connections, right? And PgBouncer has it, but you need to deal with it yourself. And if you go to global database in Aurora, it sticks you to RDS Proxy more.
So what are their blue-green deployments doing behind the scenes, if it's not...
I don't know. Blue-green deployments, they do logical replication and everything, this is what they do. But this switchover — if it's through RDS Proxy... RDS Proxy doesn't have pause/resume. I know it, right?
Yeah, interesting.
Well, maybe I haven't checked for a couple of months, maybe I'm outdated, but I didn't see the news about pause/resume. I think the switchover still interrupts current execution, so it's near-zero downtime, not fully zero downtime.
So, going back a bit — I'm still a bit confused — you were talking about not being able to demonstrate this kind of theoretical limit, the added risk. Why would you need to demonstrate it? Like, if you're seeing everybody using a database-side pooler like PgBouncer already, who is that demonstration for?
So let's distinguish two things here. First is how to demonstrate the need for connection pooling. This is a slightly different topic than reconfiguration of max_connections.
Yeah, but very related, no?
Related, related. But I find it easier to convince people that connection pooling is needed, compared to "let's reduce max_connections".
Okay, so people are like: we'll have a pooler, but we'll also still have 10,000 max_connections. Interesting.
Well, it may be just because it's painful to restart for them.
That's it, honestly.
Yeah.
And if most things are going through the pooler, and the pooler has a lower limit, then actually they might not get near their max_connections. It's just a number, not actually how many connections they have.
Yeah. I'm checking blue-green deployments once again, and I don't see pause/resume, and I think the switchover... it allows connections in both environments, allows write operations on one cluster and only read-only on the other cluster. And then you basically need to take care of everything yourself. This is an undercooked situation, I think. They should have pause/resume in RDS Proxy, or just ship PgBouncer in a managed form, as some other managed platforms do, right? Yeah, that's it. In this case, it would be possible to achieve fully zero downtime.
And we know, like, our standard criticism of blue-green deployments is that it's not blue-green deployments, because you lose one environment after the switchover, which is super strange, so it's not reversible.
It's not reversible, yeah. I don't think you lose it, but it's not reversible, yeah.
Yeah. So, again, I don't find it a problem to convince about the need for a pooler. And again, it's easy to demonstrate that the pooler brings benefits to certain types of workload significantly — by the way, we could also emulate slow clients; in this case it will be even more interesting to see how, with PgBouncer, Postgres behaves much better: higher TPS, healthier state, fewer active sessions and so on. But when we move to the topic of, okay, max_connections, forget about the pooler — how can I see that lower max_connections is somehow beneficial? Here I have a hard time right now. If anyone has ideas, let's work together and create some benchmarks, demonstration tests, experiments.
Yeah.
So maybe, coming full circle, the thing that triggered this for you was the discussion of it in the context of migrations — and I mean specifically, like, schema changes — versus a data pipeline, like long, maybe long-running transactions, maybe things that take heavy locks.
It's the long-running transactions, because if you acquire a lock, it's held at least until the end, so it's a long-running transaction.
Yeah. I was thinking of, like, the read-only ones — at least there's... I mean, there are still massive implications, but it's not as devastating, unless they're combined, right, unless you have both.
Read-only transactions also acquire locks?
Yes, yeah. Of course, but they're fine until they block a modification. The exclusive lock coming from ALTER TABLE cannot be acquired because of an ongoing access share lock from some ongoing select — which probably even finished already, but data pipelines, they love complex long transactions, and some brief select which you executed at the beginning of the transaction holds this access share lock till the very end of the transaction. And it blocks ALTER TABLE. ALTER TABLE just should have a low lock_timeout and retries.
Yes.
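For example, a minimal sketch of that kind of migration step — table and column names are placeholders:

    BEGIN;
    SET LOCAL lock_timeout = '2s';
    ALTER TABLE my_table ADD COLUMN new_col int;
    COMMIT;
    -- if the lock is not acquired within 2s, the ALTER fails with
    -- "canceling statement due to lock timeout" and can be retried later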
So yeah, so that's the approach, which is... yeah. This is the good mitigation, not to be scared of it or something, because if it's not a data pipeline, it can be something else. At any time, Postgres can start an aggressive autovacuum, a process which will be freezing your tuples to prevent transaction ID wraparound, and this will block your DDL. If it hasn't happened yet... either it happened but you didn't notice yet, which is very common, I think.
Yeah, I think in some outage we'll try everything, like, okay, Postgres is weird. Or not even an outage, maybe just some customers getting some slowness.
Well, yeah, like a five-second or ten-second latency spike, and we don't understand why. Or you just got lucky, and it awaits you next week.
So everyone must understand this topic and implement low lock_timeout and retries, until Postgres has the word CONCURRENTLY in all operations — which will make things interesting, because CONCURRENTLY means you lose the transactional behavior of the DDL. So it's very interesting. The future is interesting, because I think ALTER TABLE should have something like CONCURRENTLY or WITH RETRIES, but it cannot be inside a transaction block in this case. So it's such an interesting topic, right?
Yeah. You always combine these, low lock_timeout and retries, and it just struck me that that sounds like a lot of work to people — the retries can feel like a lot of work. Actually, the important part is the lock timeout.
Retries... it's like... that's a good point. Yesterday we opened the gates for the first clients of our Copilot product, Postgres.AI Copilot, and we had a new client — actually, originally they came as a consulting client, but we use our product inside consulting — and we found a lot of issues, as usual, which is great. And observing their situation — I think it's RDS or Aurora, I don't remember exactly — I noticed lock_timeout was very low. It was very, very low. It's so interesting: 100 milliseconds. I haven't talked to them yet, I will soon, but it's an interesting situation.
The global lock_timeout is very low, right? So I even started to think: okay, lock_timeout — does it affect only DDL, or row-level locks as well?
What do you think?
Good question.
I don't know.
Yeah, it affects all.
Yeah, row level as well. So if you update a row but didn't say COMMIT — so it's a multi-statement transaction, you updated a row, you sit inside the transaction — another transaction just fails after one second... oh, not after one second, after 100 milliseconds. Yeah. And this makes me think, okay, this, I think, is not right here. You know that deadlock_timeout is one second by default, but lock_timeout is 100 milliseconds, so deadlock detection never happens.
Yeah, in that case, yeah.
How is it possible? I'm very curious how this application feels, right? And I will be talking to them soon, I hope. And I think maybe I'm just overlooking something — maybe there are some settings at, like, session level, application level, user level, so this global value maybe is overridden — because it's a very interesting situation, to keep it so low and global.
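A small illustration of that, with a placeholder table; lock_timeout applies to the row-level lock wait too, not only to DDL:

    -- session 1
    BEGIN;
    UPDATE accounts SET balance = balance - 1 WHERE id = 1;  -- row lock held, no COMMIT yet

    -- session 2
    SET lock_timeout = '100ms';
    UPDATE accounts SET balance = balance + 1 WHERE id = 1;
    -- ERROR:  canceling statement due to lock timeout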
But what you're saying, basically, is: okay, forget about retries, let's just set lock_timeout to a second or two, and maybe only for DDL, right?
Yeah, that's what I'm thinking: to start with, like, add lock timeouts just to your migrations. And then, if they start failing — because they won't always, depending on your lock timeout, and depending on, especially, if you design them in a way that is not super... you know, if you're not doing full table rewrites and things.
You are right, but I can tell you what happens next: the next pipeline runs for 10 minutes and then it fails, and we need to retry it whole. So the retries will be within the pipeline, you know.
Yeah.
Yeah, that's why I say do retries right there. The most radical, extreme way to do retries is inside transactions, with sub-transactions.
This is the only case when I can say sub-transactions are reasonable.
Well, I think some financial applications might need it as well.
But in general case, I tend to say, like, avoid sub-transactions.
But if you have a complex DDL transaction — not only DDL, it can be DDL, DML or something — and inside it you want to have retries, and you cannot lose... well, you can, but it's too costly, you would need to retry a lot of stuff, you don't want to lose what you have so far in that transaction — okay, you can use savepoints there. But you need to double-check that savepoints don't go too deep in terms of nesting — 64, right? And also you need to check there are no long-running transactions, because that will affect the health and you will see SubtransSLRU in your wait events. So in this case, you just retry, and if the retry fails, you retry again. If you have lock_timeout, your local DDL fails, but you don't lose everything, you lose only what came after the latest savepoint; then you retry again. Usually it's done using PL/pgSQL and BEGIN ... EXCEPTION WHEN ... END blocks — it will create sub-transactions implicitly for you, right, and then you can retry there. And this BEGIN/EXCEPTION block, it's good, I mean, this is a good thing to have, and in this case retries are super local.
And the whole deployment pipeline won't fail.
Yes, yes. If everything is all right. Of course, if there is a long transaction running two hours, those retries probably won't last that long.
Yeah, it fails anyway.
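Putting the pieces together, a minimal sketch of that PL/pgSQL pattern — placeholder table and column, illustrative timeouts, and backoff with jitter between attempts; each inner BEGIN ... EXCEPTION block runs as an implicit sub-transaction, so a failed attempt rolls back only itself:

    DO $$
    DECLARE
      attempt int := 0;
    BEGIN
      LOOP
        attempt := attempt + 1;
        BEGIN
          SET LOCAL lock_timeout = '500ms';
          ALTER TABLE my_table ADD COLUMN IF NOT EXISTS new_col int;
          EXIT;  -- success, leave the loop
        EXCEPTION WHEN lock_not_available THEN
          IF attempt >= 10 THEN
            RAISE;  -- give up after 10 attempts
          END IF;
          -- exponential backoff with jitter, capped at ~10 seconds
          PERFORM pg_sleep(random() * least(10, 0.1 * 2 ^ attempt));
        END;
      END LOOP;
    END
    $$;

Here the retries stay local to the block, so the rest of the surrounding transaction, and the deployment pipeline around it, is not thrown away.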
So that's why I connect these two words.
Oh, I definitely think lock timeout and retries is great. I'm just thinking there is an intermediate step, if people want to get a lot of the benefit without all of the work.
Well, I agree. At least lock_timeout. Yeah, this is already a safeguard from downtime, basically. Yeah, at least partial downtime.
And here it's interesting that we want, like, shift-left testing, you know — we want to test and find bugs earlier; ideally, the developer who developed the code finds bugs immediately, right? This is shifting to the left. Here, with retries, we want to shift to the right, because if we shift it to, like, testing pipelines and CI, it makes the retry heavier, right?
Yeah. These things are also, I think, on the testing front, the hardest things to test as well, right? Because they're dependent on other activity and the situation, and very few people have good test setups that have concurrent activity. You know, it's not common.
Yeah. This is on one hand. On the other hand, it's not rocket science at all, and you just need to avoid two things. I thought about this, and, you know, in DBLab, for example, in our tool for database branching, we eventually developed a methodology where we don't need background workload, because it's very uncertain — one day it has a long-running transaction; we should be prepared for everything, this is our approach. And we just decided to do two things. One thing is that we need lock_timeout and retries. This is
one thing. Just to be prepared that
lock cannot be acquired. Again,
the lock cannot be acquired sometimes
because of auto-vacuum.
Oh yeah, anti-wraparound, do you mean?
Yeah, and it can run — if it's a huge table, it can run for hours. Yeah. If it's throttled, it can run for hours.
And another thing is that
we should not keep
exclusive locks
for long.
That's it. For long means like for
many seconds. They should be brief. If you know how to acquire locks gracefully — so not waiting for long and blocking others, and with retries, right — and also, if you acquired the lock, you don't hold it for too long. That's it. Yeah. There are
exceptions, though. There are exceptions. For example, if you create a new table and you load data into it — well, technically nobody is working with this table yet, right? So if you created the table, you own the lock on it, an exclusive lock. Inside the same transaction, you can load data. Why not? In this case, you are breaking rule number two — you hold the lock too long — but it's harmless, right?
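A minimal sketch of that pattern, with placeholder table names:

    BEGIN;
    CREATE TABLE events_new (LIKE events INCLUDING ALL);
    -- holding the exclusive lock on events_new for a while is harmless:
    -- no one else can see the new table until we commit
    INSERT INTO events_new SELECT * FROM events;
    COMMIT;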
So, but yeah, there are exceptions, but in general these two rules serve great — they are helpful if you keep them in mind: graceful acquisition, and don't hold it too long. It's a very heavy lock, the heaviest lock, an exclusive lock.
Sounds good. Anything else you want to touch on before we call it a day?
I think that's it. I hope it was helpful.
Yep, nice. And I'm interested to hear if anyone gets in touch about any of this as well.
Oh yeah, I'm curious — maybe someone from RDS will comment on this, why max_connections is so high by default. Yeah, it could be interesting.
All right. Thanks, Nikolai. Take care.
You too.
Bye.
