Postgres FM - transaction_timeout
Episode Date: March 8, 2024Nikolay and Michael discuss transaction_timeout (a recently committed addition for Postgres 17) — what it's for, how to get around not having it already, and whether it will replace the nee...d to set statement_timeout globally in future. Here are some links to things they mentioned:transaction_timeout (devel docs) https://www.postgresql.org/docs/devel/runtime-config-client.html#GUC-TRANSACTION-TIMEOUTCommit: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=51efe38cb92f4b15b68811bcce9ab878fbc71ea5 Mailing list thread: https://www.postgresql.org/message-id/flat/CAAhFRxiQsRs2Eq5kCo9nXE3HTugsAAJdSQSmxncivebAxdmBjQ%40mail.gmail.com  Hacking Postgres session on Postgres TV where work started https://www.youtube.com/live/WLoMpg8A4WU?t=50 Our first ever episode, on slow queries and slow transactions https://postgres.fm/episodes/slow-queries-and-slow-transactions statement_timeout (v7.3) https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-STATEMENT-TIMEOUT idle_in_transaction_session_timeout (v9.6) https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-IDLE-IN-TRANSACTION-SESSION-TIMEOUT idle_session_timeout (v14) https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-IDLE-SESSION-TIMEOUT ~~~What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!~~~Postgres FM is brought to you by:Nikolay Samokhvalov, founder of Postgres.aiMichael Christofides, founder of pgMustardWith special thanks to:Jessie Draws for the amazing artworkÂ
Transcript
Discussion (0)
Hello, hello, this is Posgus FM, a podcast about Posgus.
Hi, Mike, Michael.
Sorry for Mike last time.
That's all right, Nick.
I don't even notice normally.
Right.
Well, Nick is fine.
Nick is fine.
So episode number almost 90.
I don't remember exactly.
Like, do you remember?
You always remember.
I remember, but you told me not to remind you anymore until we get to Milestone.
And you're the only one that ever brings up the episode number really i never mention it yeah okay interesting interesting
so we are going to talk about uh what oh yeah so this week was your suggestion and it's a cool one
it's the it's going to be a new feature in postgres 17, hopefully. It's committed, but you never know for sure.
And it's transaction timeout, a new parameter that we'll have,
hopefully, in the next version.
I can show off.
It was my idea.
But most of the work, Andrei did.
Andrei Borodin did.
So it started during our Postgres TV hacking Postgres session.
Which are back now, right? After a long pause.
Well, yes, we'll try. Nobody knows.
It's hard to arrange usually because to do proper hacking, you need a lot of time.
Last time we did restart it, we talked about this transaction timeout
and Andrej tried to write tests because former tests
needed to be removed. By the way, I also wanted to say thanks to Alexander Korotkov who committed
this patch. Honestly, I even didn't ask. So it was good to see that it's committed. And so session
finished after 90 minutes by timeout because we had to go.
Session about timeout finished.
Okay, now you're smiling.
Okay.
So transaction timeout.
And it took a few years for me to understand something is wrong.
Don't have something.
And then last year, like a year ago maybe, I started thinking, oh, it should be.
And my idea was like, maybe I'm very wrong or I'm very right.
So it's so strange in 2023 to think about this basic setting.
But honestly, last few years, I already suggested everyone consider the transaction timeout to be applied
on application and in 100 of cases i saw like very strange look if it was video meeting right
it was like very strange look like postgres cannot do it yes postgres cannot do this transaction
timeout but you do need it and we can discuss why so So I always suggest you do it always on the application side if you
have LTP because you know like anyway HTTP has timeout. I said server for example Nginx
like 30 seconds it's normal right nobody will wait many minutes unless it's specific request.
Sometimes we can wait few minutes but usually a limit. And Postgres doesn't have
this transaction timeout. It has only statement timeout and idle transaction session timeout,
but no transaction timeout. And also, so I looked into when we got each of these,
and statement timeout was ages ago, like 7.3, would you believe? And then idle in transaction session timeout
was actually more recent than I realized.
It's 9.6, which is probably only about seven or eight years ago.
And then we also got more recently idle session timeout.
Which is different.
It's outside of normal work.
It's for, like, let's drop connection if it doesn't do anything.
Yeah, but that was a recent
version 14 edition,
so still only a couple of years ago.
Ah, you want to say that it's also
a very, very basic thing and
only added recently. Yeah, I agree with that.
Usually people implemented
it outside Postgres as well
to drop connections.
I've seen less need for that one
than i have seen for this one but like you i didn't it's one of those things that i kind of
didn't realize i needed because statement timeout can cover a lot of the base like a lot of the use
cases can be just about coped with via statement timeout but one or two really can't and i think i i saw your post and i'll
link it up the mailing list thread that i think andre started but your first reply to that included
some really good examples of when you do need this in addition to statement timeout or like
instead of statement timeout so is it worth like covering a of those? Yeah, let's discuss why we use statement timeout in OTP,
like the most common cases like web and mobile apps.
We had an episode about that,
and I think it was the very first episode maybe.
Speaking of counting.
Yeah, we've done a few other episodes related to this as well.
We did one on connections, one on zero downtime migration.
No, I'm talking about the very first episode.
What was it?
Yeah, slow queries.
Slow queries, yes.
And this is about statement amount, basically.
But people don't do SQL queries.
I mean, end users, web and mobile app users,
they don't do SQL queries, right?
So they work at higher
level, HTTP requests. So why do we need statement amount at all? Why? What's the problem we're
solving? I have this honest question, because in my opinion, when we think what we try to solve with statement timeout
it's not right tool usually for example again like users yeah users work at high level and we know
some intermediate software settings middleware which is usually written in some like python
java ruby anyway anything these application servers and also HTTP server,
like maybe front-end server like Nginx or something, Envoy, right?
They usually have some limits to drop connection which lasts too long
if server is taking too long to respond or communication is taking too long,
usually like 30 or 60 seconds.
And this is what users will get if something which should take 100 milliseconds or less takes 30 seconds.
It's time to admit there is a problem and tell a user that for 502 gateway timeout,
it's usually like NGINX's it returning and then we back to database
we think okay we need to limit as well but what exactly we need to limit this is this is a
question what's the statement amount right tool for that i do think there's a like a lot of cases
where it's enough so if for example we're trying to in that case let's say for example the users or like the
the app has already given up but the database carries on working yeah that's it's doing
pointless kind of that's pointless work so avoiding pointless work would be great but
there's other cases too where even if the app didn't have a timeout is it okay for a few users of a system to be using all the resources when
you know like hogging those resources by running some really really slow or really really heavy
queries is that an acceptable trade-off or do you want to limit that so i think there are other
like slightly different cases right but work is not... Statements is too small.
It's too atomic piece, right?
Well, and I think there's what... Before we move on to why transaction timeout,
there's also one other case which I think...
Or maybe this makes the point perfectly.
These long-running transactions,
even if they're a single statement transaction,
they can block
like internal processes like vacuum and cause issues that we've i mean we've spent multiple
episodes talking about the kind of issues that long-running statements could cause never mind
long-running multi-statement transactions not statements block it transactions but as you've
said multiple times you can't have a statement without
a transaction. So a long-running
statement is enough,
but you're right that...
You can have a statement without transactions.
For example, create index concurrently.
Three transactions.
I mean...
Without transactions completely, no. No dirty reads.
Exactly.
Right.
Right.
So... Yeah, but Without transaction completely, no, no dirty reads. Exactly. Right, right, right.
So, yeah, but we want to apply the right tool.
It means that it should suit for all edge and corner cases, right?
And I'm saying statement amount is like for different hole to close, you know. You gave a really great example that completely sold me in that email thread I mentioned.
Email.
Yes.
And that was a case I've seen actually quite recently, which was... Yeah, I call it.
It's like from Machine Gun.
Yeah.
Statements coming from Machine Gun.
Brief statements with brief pause and an ex and somebody setting a
transaction at the beginning so in the case i saw it was deliberately taking out like doing begin
then doing multiple updates or like upsets in this case and then only at the end committing it
and so any one of those statements was really short, but because they were doing
tens of thousands or even hundreds of thousands, I think in this case, the transaction as a whole
was longer than you'd probably want on an OLTP system. So in that case, statement timeout
wouldn't have helped, or at least any sane statement timeout wouldn't have cancelled that,
but a transaction timeout. An idle transaction session
timeout wouldn't cancel this
transaction because the breaks are also
short.
It's not idle, exactly.
It's like from machine gun.
It can be select plus one, by the way,
if it's wrapped to a transaction block.
Select plus one, this anti-pattern
which...
Like n plus one.
Oh, yes, yes, yes. Yeah. I mean, it's actually a terrible name. But But the
this, the idea is that it's an select in a loop. So we didn't notice that they
have a loop. And it can be short with the idea that we want to make them short
because it's better.
But if it's wrapped into a transaction,
you cannot release any logs until the very end of the transaction.
This is the rule.
All logs are released only if it's commit or rollback.
That's it. And so it's a bad idea to split to batches, updates or
deletes. Insert is almost never, I think, it makes sense to split. Insert should be single massive
insert, right? Usually. Like copy or insert. But if it's updates and deletes, like people
know there is a rule to split, then they put it into a single transaction and like what's
happening here. But what do we do? We're accumulating a lot of logs and not releasing them. So we
definitely, so there are only two reasons, the big reasons I see. Well, the source utilization is one reason,
but there are two problems which are very bright. They scream, let's have transaction timeout,
and they did it for many years. And I always say, okay, this is the problem we must solve
on the application side. We cannot do it in Postgres at all. Surprise. Well, we can use
pgcron and put maybe some, I call it terminator snippet,
working with pg-sub activity, checking exact start, like transaction start timestamp, and then
killing, terminating. So terminator skip. This name is derived from pg-terminate backend,
right? Backend terminator. Yeah, so was it named or no? I don't remember. Backup start, start backup was renamed.
It's about naming and I support it. Well, I'm having three in my speech, sorry. So
start backup was renamed to pgbackup start. So the same should happen with becant terminate,
I think. Now terminate becant should be becant termin should be become terminated.
So this terminate script is also like a workaround.
It should be some simple solution.
Let's not allow too long running transactions
because we keep locks.
This is the problem number one, which I discussed.
And second problem you mentioned,
keeping XminHorizon frozen, frozen basically not shifting which means that all like xmin horizon is
the hidden xmin value of a tuple which is the oldest in the system and we usually cannot
auto vacuum usually comes and delete this as a tu we delete it. But we cannot delete it if there is some transaction which still can read from this.
Because Xmin is like current time, not the past for this transaction.
Which means that if we keep our transaction, we block this Xmin frozen.
And that means that AutoV auto vacuum cannot delete that tuples
more and more of them freshly that tuples I call them or maybe it can be
improved this naming so these two problems locks and Xmin horizon frozen
locks are not released and they both can hit and a lot of people can notice and you can even be
down down server can be down in some cases and you cannot limit it including
in PostgreSQL 16 I was so happy to realize I'm not mad
it's obvious and I think it was discussed maybe, so it shouldn't be discussed many times,
but we just went ahead, many things to Andrea,
just went ahead and coded it.
Then many strange cases started to appear.
For example, sub-transactions,
or conflict with statement on timeout,
which should fire first, for example.
That's a good point.
Do you want to talk a little about in this future,
like once Postgres 17 is out,
if you're designing a new OLTP system,
are you even using, like,
do you normally use global statement timeouts?
And if you did in the past, would you now replace that
or would you use both with different settings? How would you do in the past, would you now replace that? Or would you use both
with different settings? How would you do it? Yeah, good question. Good question. So first of
all, in documentation, we can read that setting statement timeout globally is not a good idea.
And I think it's a very bad idea to write this in documentation. Because we discussed even HTTP
server application server, they have their own timeouts.
Database is usually the most like data intensive work is usually in database,
unless we talk about some other stuff happening now outside of web and mobile apps, right?
I mean, AI stuff like in GPU.
So we want to limit in all TPKs, we definitely want to limit our work 30 seconds maybe.
It's a good limit.
This article and our very first episode, it's about basics, which are so like 30 years,
like 40 years since the beginning of Internet.
People don't like to wait more than a
second in normal case.
It means that we
do want to limit globally and
be protected. Only specific users
which need it always can change
it unless
it's restricted.
So you just set statement
amount of anything and go.
Of course, if DDL is needed to be executed,
for example, create index concurrently,
of course, you also should remove
this limit. But global setting
is protection. It should be there for
OTP case.
And Postgres is, like, OTP
is the main case. We don't
usually think about
analytical queries.
So it means that I always recommend setting this
and documentations having bad advice in this case.
Right?
So 30 seconds, my recommendation.
Sometimes 15 is better.
Oh, wow.
Yeah.
Well, even 10 maybe.
It depends.
Like it raises the bar requirement for queries
that this is about quality, actually.
So if your query is poorly designed
and you forgot to create a proper index,
something like bad plan,
be killed.
But now, with transaction timeout,
I would say statement timeout
might be not needed at all.
Interesting.
I wondered if you were going to say that. So again, my logic is applied at higher level. This is about HTTP. If you go down,
both cases, database specific, Postgres specific cases I mentioned, they also apply to a transaction
level at higher levels. We don't care about statement and breaks between statements.
And do you think same order of magnitude?
Like you mentioned 30,
I've seen 60 seconds quite often mentioned as a sensible default.
So 30 seconds, not too different.
Would you go similar?
Like for any reason to double it or like increase it,
like add a little bit of buffer there or just similar number?
Why buffer?
Set it to 30 to solve all problems and
then consider going even down.
And if we talk about statement amount,
why at all we need it? Okay, we put it,
but what? If we have transaction
timeout 30 seconds, for example,
which statement timeout would I use?
30 seconds maybe as well.
But it's already solved.
I don't think there's any... That wouldn't make sense to me.
But I was wondering if you might say, you know, statement timeout
30 seconds, transaction timeout 60 seconds,
and...
I don't transaction session timeout 60 seconds
when transaction timeout...
What?
Session timeout?
No, no, transaction.
Transactions can consist of multiple
statements, right? Right.
So, there's no need to have a statement timeout that is bigger or equal to the
transaction.
It doesn't make sense.
So the only thing that could make sense would be a statement timeout.
That's less than the transaction timeout.
Right.
But it seems pretty neat.
Yeah.
But,
but again,
like if you take transaction timeout, which is already quite good, restrictive, 30 or 60 seconds,
then I think statement timeout, and I don't see logic to go down with it.
Okay, it's about resources. If you go down, maybe you save some resources, killing some statement earlier than it reaches transaction amount.
Okay.
But it's not the main reason to have transaction amount for me.
Well, of course, maybe actually I just forgot.
Maybe it's actually super important reason, right?
Maybe I just forgot that we use it.
Okay.
I don't know. I saw so many incidents where even statement amount
was not achieved, but resources were spent fully. So I don't know. I mean, we have very
restrictive statement amount, very, very low. And still CPU 100% and everything is down
because a lot of work is happening. So statement amount doesn't protect you. Even 15 years.
I mean, to start protecting maybe I would I would like to have it one second. In this
case, I think to start protecting.
Interesting, but 30 seconds too high.
Yeah, it's a tricky one, right? Because, yeah, I think there's a growing,
and I know you mentioned analytical isn't our primary case,
but there's a growing trend that I'm seeing of kind of hybrid.
I've had it called HTAP databases, where people don't want to spin up a second analytical warehouse or something.
And they're trying to run some small analytical analytical queries on them like oltp database which means you do get these kind of the odd several second query or or
at least hundreds of milliseconds running on the same database so i can see how it could get tricky
for some folks but i think that the 30 seconds and the higher feels like a sane starting point
as you mentioned.
And if you start to see some errors from those, look into them.
And if you don't, look into what your slowest queries are
and whether you can reduce it.
In case of H-Type, I will just have another database user
specifically for long-running queries.
I will adjust this setting, maybe setting to a couple of,
or maybe five minutes and allowing it to last longer, that's it.
Yeah, that's a good point that I don't
know if we mentioned yet, that these can be set
at the role level, at
session level, and globally.
Well, I never tried, but statement timeout
definitely can be set at user level.
I think transaction timeout should work as well.
Worth checking, by the way.
Yeah, I thought the doc said that, but
I was checking because this
is the development doc.
Maybe even it shouldn't say.
Maybe it's
so I'm quite sure it should work.
Yeah. So you
anyway, even if we talk about statement
amount right now, we don't have transaction time
out until 17. So statement amount can
be set for different users differently.
And I also had a case when people had
very low global statement amount, but then
adjusted for people, for
human connections, they adjusted it to set to zero.
How do you feel about that? I feel dangerous. It's like
someone can just run something,
then drink some coffee, and then go for some call.
It's kind of dangerous.
By the way, I've got bad news for you.
The warning seems to have been copied
over to transaction timeout documentation as well,
and it says in the new docs,
setting transaction timeout in PostgreSQL.conf
is not recommended because it would affect all sessions.
Yeah, actually, it's a good point.
I need to raise this again.
I don't agree with this.
It's a bad, very bad statement.
There's also one other note in here that I had forgotten.
I guess it's only a minor note,
but prepared transactions are not subject to this timeout.
Thought worth mentioning. Yeah, it's only a minor note, but prepared transactions are not subject to this timeout. Thought worth mentioning.
Yeah, it's so.
Is there anything else?
I guess the story for how this came about is pretty interesting.
The thread seems quite long, to me at least.
As usual, a lot of unpredicted problems and then people seem to agree.
And then I think, okay, not not for 17 but it was a good
surprise last week right or a couple of weeks ago
yeah
yeah I just wanted to say
for
if our world survives right
for folks who
are listening this in
2025 or later
just start thinking about
limiting a transaction level first and then go down 2025 or later, just start thinking about limiting
at transaction level first and then go down.
This is my advice.
But for others, limit
at application side and limit
statement timeout and idle transaction session
timeout because both is good to have
limited to below a minute maybe.
At least one minute.
Yeah.
I also saw people are afraid of making,
I don't,
I don't transaction session them out low.
I don't understand this.
Put it very low,
very low.
Like if someone is not doing anything,
bye bye.
Okay.
Maybe for human humans,
you can limit it to one or two minutes,
but still you're probably holding some very important exclusive
locks and blocking someone. It's not a good
idea, so
either commit or do something
already, right?
The default is no limit.
There's absolutely no limit
at all.
So any...
Yeah. Even if you're
really... Even if you are concerned,
what would it... What would would several hour limit look like or several days? Like even that's going to be better than none at all. So yeah, I like that advice to people to be checking. Do you even have any of these limits set? Because the default is not to and even, even on a lot of cloud providers, they don't set Yeah, Nginx, if you work on web or mobile apps,
Nginx, for example,
has default proxy
timeout limit as 60
seconds. This is a good
starting point to think about it.
If 99%
of your database work is to serve
HTTP queries,
at least 60 should be there.
And go down to details.
Also, one more thing.
Maybe some folks will be inspired.
I'm not a good drawer of some pictures,
but imagine if we had a schema in documentation
showing the relationships between statement
and breaks between statement and some big transaction and timeouts will be visualized.
So for better understanding, like this is transaction timeout between them,
there are multiple statements, they are also limited and breaks are limited.
And also outside we have session timeout
so idle session how is it called i don't know we have both oh no session timeout what is it
i mean we have idle in transaction session timeout and we have idle session timeout i don't search
timeout this is what like it's outside transaction also have limited so all exactly all periods of
workflow are covered now.
And to visualize this would be a great idea.
Documentation already has a couple of pictures.
It would be a good picture to have.
I think you'd be in the first 10 still, though, if you got this one in.
Right, exactly.
So maybe someone will be inspired.
Good, nice.
Yeah.
Thanks so much, Nikolai.
Thank you.
Catch you next week.
Bye-bye.