Postgres FM - transaction_timeout

Episode Date: March 8, 2024

Nikolay and Michael discuss transaction_timeout (a recently committed addition for Postgres 17) — what it's for, how to get around not having it already, and whether it will replace the nee...d to set statement_timeout globally in future. Here are some links to things they mentioned:transaction_timeout (devel docs) https://www.postgresql.org/docs/devel/runtime-config-client.html#GUC-TRANSACTION-TIMEOUTCommit: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=51efe38cb92f4b15b68811bcce9ab878fbc71ea5 Mailing list thread: https://www.postgresql.org/message-id/flat/CAAhFRxiQsRs2Eq5kCo9nXE3HTugsAAJdSQSmxncivebAxdmBjQ%40mail.gmail.com  Hacking Postgres session on Postgres TV where work started https://www.youtube.com/live/WLoMpg8A4WU?t=50 Our first ever episode, on slow queries and slow transactions https://postgres.fm/episodes/slow-queries-and-slow-transactions statement_timeout (v7.3) https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-STATEMENT-TIMEOUT idle_in_transaction_session_timeout (v9.6) https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-IDLE-IN-TRANSACTION-SESSION-TIMEOUT idle_session_timeout (v14) https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-IDLE-SESSION-TIMEOUT ~~~What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!~~~Postgres FM is brought to you by:Nikolay Samokhvalov, founder of Postgres.aiMichael Christofides, founder of pgMustardWith special thanks to:Jessie Draws for the amazing artwork 

Transcript
Discussion (0)
Starting point is 00:00:00 Hello, hello, this is Posgus FM, a podcast about Posgus. Hi, Mike, Michael. Sorry for Mike last time. That's all right, Nick. I don't even notice normally. Right. Well, Nick is fine. Nick is fine.
Starting point is 00:00:12 So episode number almost 90. I don't remember exactly. Like, do you remember? You always remember. I remember, but you told me not to remind you anymore until we get to Milestone. And you're the only one that ever brings up the episode number really i never mention it yeah okay interesting interesting so we are going to talk about uh what oh yeah so this week was your suggestion and it's a cool one it's the it's going to be a new feature in postgres 17, hopefully. It's committed, but you never know for sure.
Starting point is 00:00:47 And it's transaction timeout, a new parameter that we'll have, hopefully, in the next version. I can show off. It was my idea. But most of the work, Andrei did. Andrei Borodin did. So it started during our Postgres TV hacking Postgres session. Which are back now, right? After a long pause.
Starting point is 00:01:09 Well, yes, we'll try. Nobody knows. It's hard to arrange usually because to do proper hacking, you need a lot of time. Last time we did restart it, we talked about this transaction timeout and Andrej tried to write tests because former tests needed to be removed. By the way, I also wanted to say thanks to Alexander Korotkov who committed this patch. Honestly, I even didn't ask. So it was good to see that it's committed. And so session finished after 90 minutes by timeout because we had to go. Session about timeout finished.
Starting point is 00:01:49 Okay, now you're smiling. Okay. So transaction timeout. And it took a few years for me to understand something is wrong. Don't have something. And then last year, like a year ago maybe, I started thinking, oh, it should be. And my idea was like, maybe I'm very wrong or I'm very right. So it's so strange in 2023 to think about this basic setting.
Starting point is 00:02:18 But honestly, last few years, I already suggested everyone consider the transaction timeout to be applied on application and in 100 of cases i saw like very strange look if it was video meeting right it was like very strange look like postgres cannot do it yes postgres cannot do this transaction timeout but you do need it and we can discuss why so So I always suggest you do it always on the application side if you have LTP because you know like anyway HTTP has timeout. I said server for example Nginx like 30 seconds it's normal right nobody will wait many minutes unless it's specific request. Sometimes we can wait few minutes but usually a limit. And Postgres doesn't have this transaction timeout. It has only statement timeout and idle transaction session timeout,
Starting point is 00:03:11 but no transaction timeout. And also, so I looked into when we got each of these, and statement timeout was ages ago, like 7.3, would you believe? And then idle in transaction session timeout was actually more recent than I realized. It's 9.6, which is probably only about seven or eight years ago. And then we also got more recently idle session timeout. Which is different. It's outside of normal work. It's for, like, let's drop connection if it doesn't do anything.
Starting point is 00:03:46 Yeah, but that was a recent version 14 edition, so still only a couple of years ago. Ah, you want to say that it's also a very, very basic thing and only added recently. Yeah, I agree with that. Usually people implemented it outside Postgres as well
Starting point is 00:04:02 to drop connections. I've seen less need for that one than i have seen for this one but like you i didn't it's one of those things that i kind of didn't realize i needed because statement timeout can cover a lot of the base like a lot of the use cases can be just about coped with via statement timeout but one or two really can't and i think i i saw your post and i'll link it up the mailing list thread that i think andre started but your first reply to that included some really good examples of when you do need this in addition to statement timeout or like instead of statement timeout so is it worth like covering a of those? Yeah, let's discuss why we use statement timeout in OTP,
Starting point is 00:04:48 like the most common cases like web and mobile apps. We had an episode about that, and I think it was the very first episode maybe. Speaking of counting. Yeah, we've done a few other episodes related to this as well. We did one on connections, one on zero downtime migration. No, I'm talking about the very first episode. What was it?
Starting point is 00:05:09 Yeah, slow queries. Slow queries, yes. And this is about statement amount, basically. But people don't do SQL queries. I mean, end users, web and mobile app users, they don't do SQL queries, right? So they work at higher level, HTTP requests. So why do we need statement amount at all? Why? What's the problem we're
Starting point is 00:05:37 solving? I have this honest question, because in my opinion, when we think what we try to solve with statement timeout it's not right tool usually for example again like users yeah users work at high level and we know some intermediate software settings middleware which is usually written in some like python java ruby anyway anything these application servers and also HTTP server, like maybe front-end server like Nginx or something, Envoy, right? They usually have some limits to drop connection which lasts too long if server is taking too long to respond or communication is taking too long, usually like 30 or 60 seconds.
Starting point is 00:06:26 And this is what users will get if something which should take 100 milliseconds or less takes 30 seconds. It's time to admit there is a problem and tell a user that for 502 gateway timeout, it's usually like NGINX's it returning and then we back to database we think okay we need to limit as well but what exactly we need to limit this is this is a question what's the statement amount right tool for that i do think there's a like a lot of cases where it's enough so if for example we're trying to in that case let's say for example the users or like the the app has already given up but the database carries on working yeah that's it's doing pointless kind of that's pointless work so avoiding pointless work would be great but
Starting point is 00:07:17 there's other cases too where even if the app didn't have a timeout is it okay for a few users of a system to be using all the resources when you know like hogging those resources by running some really really slow or really really heavy queries is that an acceptable trade-off or do you want to limit that so i think there are other like slightly different cases right but work is not... Statements is too small. It's too atomic piece, right? Well, and I think there's what... Before we move on to why transaction timeout, there's also one other case which I think... Or maybe this makes the point perfectly.
Starting point is 00:07:59 These long-running transactions, even if they're a single statement transaction, they can block like internal processes like vacuum and cause issues that we've i mean we've spent multiple episodes talking about the kind of issues that long-running statements could cause never mind long-running multi-statement transactions not statements block it transactions but as you've said multiple times you can't have a statement without a transaction. So a long-running
Starting point is 00:08:28 statement is enough, but you're right that... You can have a statement without transactions. For example, create index concurrently. Three transactions. I mean... Without transactions completely, no. No dirty reads. Exactly.
Starting point is 00:08:43 Right. Right. So... Yeah, but Without transaction completely, no, no dirty reads. Exactly. Right, right, right. So, yeah, but we want to apply the right tool. It means that it should suit for all edge and corner cases, right? And I'm saying statement amount is like for different hole to close, you know. You gave a really great example that completely sold me in that email thread I mentioned. Email. Yes.
Starting point is 00:09:15 And that was a case I've seen actually quite recently, which was... Yeah, I call it. It's like from Machine Gun. Yeah. Statements coming from Machine Gun. Brief statements with brief pause and an ex and somebody setting a transaction at the beginning so in the case i saw it was deliberately taking out like doing begin then doing multiple updates or like upsets in this case and then only at the end committing it and so any one of those statements was really short, but because they were doing
Starting point is 00:09:45 tens of thousands or even hundreds of thousands, I think in this case, the transaction as a whole was longer than you'd probably want on an OLTP system. So in that case, statement timeout wouldn't have helped, or at least any sane statement timeout wouldn't have cancelled that, but a transaction timeout. An idle transaction session timeout wouldn't cancel this transaction because the breaks are also short. It's not idle, exactly.
Starting point is 00:10:14 It's like from machine gun. It can be select plus one, by the way, if it's wrapped to a transaction block. Select plus one, this anti-pattern which... Like n plus one. Oh, yes, yes, yes. Yeah. I mean, it's actually a terrible name. But But the this, the idea is that it's an select in a loop. So we didn't notice that they
Starting point is 00:10:40 have a loop. And it can be short with the idea that we want to make them short because it's better. But if it's wrapped into a transaction, you cannot release any logs until the very end of the transaction. This is the rule. All logs are released only if it's commit or rollback. That's it. And so it's a bad idea to split to batches, updates or deletes. Insert is almost never, I think, it makes sense to split. Insert should be single massive
Starting point is 00:11:19 insert, right? Usually. Like copy or insert. But if it's updates and deletes, like people know there is a rule to split, then they put it into a single transaction and like what's happening here. But what do we do? We're accumulating a lot of logs and not releasing them. So we definitely, so there are only two reasons, the big reasons I see. Well, the source utilization is one reason, but there are two problems which are very bright. They scream, let's have transaction timeout, and they did it for many years. And I always say, okay, this is the problem we must solve on the application side. We cannot do it in Postgres at all. Surprise. Well, we can use pgcron and put maybe some, I call it terminator snippet,
Starting point is 00:12:07 working with pg-sub activity, checking exact start, like transaction start timestamp, and then killing, terminating. So terminator skip. This name is derived from pg-terminate backend, right? Backend terminator. Yeah, so was it named or no? I don't remember. Backup start, start backup was renamed. It's about naming and I support it. Well, I'm having three in my speech, sorry. So start backup was renamed to pgbackup start. So the same should happen with becant terminate, I think. Now terminate becant should be becant termin should be become terminated. So this terminate script is also like a workaround. It should be some simple solution.
Starting point is 00:12:51 Let's not allow too long running transactions because we keep locks. This is the problem number one, which I discussed. And second problem you mentioned, keeping XminHorizon frozen, frozen basically not shifting which means that all like xmin horizon is the hidden xmin value of a tuple which is the oldest in the system and we usually cannot auto vacuum usually comes and delete this as a tu we delete it. But we cannot delete it if there is some transaction which still can read from this. Because Xmin is like current time, not the past for this transaction.
Starting point is 00:13:36 Which means that if we keep our transaction, we block this Xmin frozen. And that means that AutoV auto vacuum cannot delete that tuples more and more of them freshly that tuples I call them or maybe it can be improved this naming so these two problems locks and Xmin horizon frozen locks are not released and they both can hit and a lot of people can notice and you can even be down down server can be down in some cases and you cannot limit it including in PostgreSQL 16 I was so happy to realize I'm not mad it's obvious and I think it was discussed maybe, so it shouldn't be discussed many times,
Starting point is 00:14:27 but we just went ahead, many things to Andrea, just went ahead and coded it. Then many strange cases started to appear. For example, sub-transactions, or conflict with statement on timeout, which should fire first, for example. That's a good point. Do you want to talk a little about in this future,
Starting point is 00:14:51 like once Postgres 17 is out, if you're designing a new OLTP system, are you even using, like, do you normally use global statement timeouts? And if you did in the past, would you now replace that or would you use both with different settings? How would you do in the past, would you now replace that? Or would you use both with different settings? How would you do it? Yeah, good question. Good question. So first of all, in documentation, we can read that setting statement timeout globally is not a good idea.
Starting point is 00:15:15 And I think it's a very bad idea to write this in documentation. Because we discussed even HTTP server application server, they have their own timeouts. Database is usually the most like data intensive work is usually in database, unless we talk about some other stuff happening now outside of web and mobile apps, right? I mean, AI stuff like in GPU. So we want to limit in all TPKs, we definitely want to limit our work 30 seconds maybe. It's a good limit. This article and our very first episode, it's about basics, which are so like 30 years,
Starting point is 00:15:59 like 40 years since the beginning of Internet. People don't like to wait more than a second in normal case. It means that we do want to limit globally and be protected. Only specific users which need it always can change it unless
Starting point is 00:16:17 it's restricted. So you just set statement amount of anything and go. Of course, if DDL is needed to be executed, for example, create index concurrently, of course, you also should remove this limit. But global setting is protection. It should be there for
Starting point is 00:16:34 OTP case. And Postgres is, like, OTP is the main case. We don't usually think about analytical queries. So it means that I always recommend setting this and documentations having bad advice in this case. Right?
Starting point is 00:16:53 So 30 seconds, my recommendation. Sometimes 15 is better. Oh, wow. Yeah. Well, even 10 maybe. It depends. Like it raises the bar requirement for queries that this is about quality, actually.
Starting point is 00:17:08 So if your query is poorly designed and you forgot to create a proper index, something like bad plan, be killed. But now, with transaction timeout, I would say statement timeout might be not needed at all. Interesting.
Starting point is 00:17:23 I wondered if you were going to say that. So again, my logic is applied at higher level. This is about HTTP. If you go down, both cases, database specific, Postgres specific cases I mentioned, they also apply to a transaction level at higher levels. We don't care about statement and breaks between statements. And do you think same order of magnitude? Like you mentioned 30, I've seen 60 seconds quite often mentioned as a sensible default. So 30 seconds, not too different. Would you go similar?
Starting point is 00:17:54 Like for any reason to double it or like increase it, like add a little bit of buffer there or just similar number? Why buffer? Set it to 30 to solve all problems and then consider going even down. And if we talk about statement amount, why at all we need it? Okay, we put it, but what? If we have transaction
Starting point is 00:18:13 timeout 30 seconds, for example, which statement timeout would I use? 30 seconds maybe as well. But it's already solved. I don't think there's any... That wouldn't make sense to me. But I was wondering if you might say, you know, statement timeout 30 seconds, transaction timeout 60 seconds, and...
Starting point is 00:18:30 I don't transaction session timeout 60 seconds when transaction timeout... What? Session timeout? No, no, transaction. Transactions can consist of multiple statements, right? Right. So, there's no need to have a statement timeout that is bigger or equal to the
Starting point is 00:18:52 transaction. It doesn't make sense. So the only thing that could make sense would be a statement timeout. That's less than the transaction timeout. Right. But it seems pretty neat. Yeah. But,
Starting point is 00:19:03 but again, like if you take transaction timeout, which is already quite good, restrictive, 30 or 60 seconds, then I think statement timeout, and I don't see logic to go down with it. Okay, it's about resources. If you go down, maybe you save some resources, killing some statement earlier than it reaches transaction amount. Okay. But it's not the main reason to have transaction amount for me. Well, of course, maybe actually I just forgot. Maybe it's actually super important reason, right?
Starting point is 00:19:37 Maybe I just forgot that we use it. Okay. I don't know. I saw so many incidents where even statement amount was not achieved, but resources were spent fully. So I don't know. I mean, we have very restrictive statement amount, very, very low. And still CPU 100% and everything is down because a lot of work is happening. So statement amount doesn't protect you. Even 15 years. I mean, to start protecting maybe I would I would like to have it one second. In this case, I think to start protecting.
Starting point is 00:20:13 Interesting, but 30 seconds too high. Yeah, it's a tricky one, right? Because, yeah, I think there's a growing, and I know you mentioned analytical isn't our primary case, but there's a growing trend that I'm seeing of kind of hybrid. I've had it called HTAP databases, where people don't want to spin up a second analytical warehouse or something. And they're trying to run some small analytical analytical queries on them like oltp database which means you do get these kind of the odd several second query or or at least hundreds of milliseconds running on the same database so i can see how it could get tricky for some folks but i think that the 30 seconds and the higher feels like a sane starting point
Starting point is 00:21:04 as you mentioned. And if you start to see some errors from those, look into them. And if you don't, look into what your slowest queries are and whether you can reduce it. In case of H-Type, I will just have another database user specifically for long-running queries. I will adjust this setting, maybe setting to a couple of, or maybe five minutes and allowing it to last longer, that's it.
Starting point is 00:21:26 Yeah, that's a good point that I don't know if we mentioned yet, that these can be set at the role level, at session level, and globally. Well, I never tried, but statement timeout definitely can be set at user level. I think transaction timeout should work as well. Worth checking, by the way.
Starting point is 00:21:42 Yeah, I thought the doc said that, but I was checking because this is the development doc. Maybe even it shouldn't say. Maybe it's so I'm quite sure it should work. Yeah. So you anyway, even if we talk about statement
Starting point is 00:21:57 amount right now, we don't have transaction time out until 17. So statement amount can be set for different users differently. And I also had a case when people had very low global statement amount, but then adjusted for people, for human connections, they adjusted it to set to zero. How do you feel about that? I feel dangerous. It's like
Starting point is 00:22:24 someone can just run something, then drink some coffee, and then go for some call. It's kind of dangerous. By the way, I've got bad news for you. The warning seems to have been copied over to transaction timeout documentation as well, and it says in the new docs, setting transaction timeout in PostgreSQL.conf
Starting point is 00:22:47 is not recommended because it would affect all sessions. Yeah, actually, it's a good point. I need to raise this again. I don't agree with this. It's a bad, very bad statement. There's also one other note in here that I had forgotten. I guess it's only a minor note, but prepared transactions are not subject to this timeout.
Starting point is 00:23:04 Thought worth mentioning. Yeah, it's only a minor note, but prepared transactions are not subject to this timeout. Thought worth mentioning. Yeah, it's so. Is there anything else? I guess the story for how this came about is pretty interesting. The thread seems quite long, to me at least. As usual, a lot of unpredicted problems and then people seem to agree. And then I think, okay, not not for 17 but it was a good surprise last week right or a couple of weeks ago
Starting point is 00:23:28 yeah yeah I just wanted to say for if our world survives right for folks who are listening this in 2025 or later just start thinking about
Starting point is 00:23:44 limiting a transaction level first and then go down 2025 or later, just start thinking about limiting at transaction level first and then go down. This is my advice. But for others, limit at application side and limit statement timeout and idle transaction session timeout because both is good to have limited to below a minute maybe.
Starting point is 00:24:00 At least one minute. Yeah. I also saw people are afraid of making, I don't, I don't transaction session them out low. I don't understand this. Put it very low, very low.
Starting point is 00:24:13 Like if someone is not doing anything, bye bye. Okay. Maybe for human humans, you can limit it to one or two minutes, but still you're probably holding some very important exclusive locks and blocking someone. It's not a good idea, so
Starting point is 00:24:28 either commit or do something already, right? The default is no limit. There's absolutely no limit at all. So any... Yeah. Even if you're really... Even if you are concerned,
Starting point is 00:25:05 what would it... What would would several hour limit look like or several days? Like even that's going to be better than none at all. So yeah, I like that advice to people to be checking. Do you even have any of these limits set? Because the default is not to and even, even on a lot of cloud providers, they don't set Yeah, Nginx, if you work on web or mobile apps, Nginx, for example, has default proxy timeout limit as 60 seconds. This is a good starting point to think about it. If 99% of your database work is to serve
Starting point is 00:25:22 HTTP queries, at least 60 should be there. And go down to details. Also, one more thing. Maybe some folks will be inspired. I'm not a good drawer of some pictures, but imagine if we had a schema in documentation showing the relationships between statement
Starting point is 00:25:48 and breaks between statement and some big transaction and timeouts will be visualized. So for better understanding, like this is transaction timeout between them, there are multiple statements, they are also limited and breaks are limited. And also outside we have session timeout so idle session how is it called i don't know we have both oh no session timeout what is it i mean we have idle in transaction session timeout and we have idle session timeout i don't search timeout this is what like it's outside transaction also have limited so all exactly all periods of workflow are covered now.
Starting point is 00:26:26 And to visualize this would be a great idea. Documentation already has a couple of pictures. It would be a good picture to have. I think you'd be in the first 10 still, though, if you got this one in. Right, exactly. So maybe someone will be inspired. Good, nice. Yeah.
Starting point is 00:26:43 Thanks so much, Nikolai. Thank you. Catch you next week. Bye-bye.

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.