Postgres FM - Reads causing writes

Episode Date: January 24, 2025

Nikolay and Michael discuss a couple of surprising ways read queries (selects) can cause writes (shared buffers dirtied) in Postgres. Here are some links to things they mentioned:

- Reads causing writes in Postgres (post by Alex Jesipow) https://jesipow.com/blog/postgres-reads-cause-writes/
- Exploring how SELECT queries can produce disk writes (post by Nikolay Sivko) https://blog.okmeter.io/postgresql-exploring-how-select-queries-can-produce-disk-writes-f36c8bee6b6f
- Hint Bits (wiki) https://wiki.postgresql.org/wiki/Hint_Bits
- pg_stat_statements https://www.postgresql.org/docs/current/pgstatstatements.html
- pg_stat_kcache https://github.com/powa-team/pg_stat_kcache
- pg_wait_sampling https://github.com/postgrespro/pg_wait_sampling
- BUFFERS by default (episode 4!) https://postgres.fm/episodes/buffers-by-default
- Page layout (docs) https://www.postgresql.org/docs/current/storage-page-layout.html
- From Reddit (user merlinm), PD_ALL_VISIBLE may be a third case https://www.reddit.com/r/PostgreSQL/comments/1hjuyfn/comment/m3e6e7v/
- Coroot https://coroot.com/

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by Michael Christofides, founder of pgMustard, and Nikolay Samokhvalov, founder of Postgres.ai. With special thanks to Jessie Draws for the elephant artwork.

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly chat about all things PostgreSQL. I am Michael, founder of pgMustard, and this is Nikolay, founder of Postgres.ai. Hey Nikolay, how's it going? Hello Michael, going great, how are you? Yeah, I'm good, thank you. So, what are we talking about this week? Let's talk about why we can have writes on replicas. Yeah.
Starting point is 00:00:23 Or, the thing I've seen people get confused by is why they can get writes even on the primary for select queries, like for read queries. But yeah, I like your framing. Yeah, it can happen on the primary as well. I mean, this surprise can happen on the primary as well. And specifically where people might notice this is if they get buffers in query plans, like shared dirtied. Exactly. But also, I think the place I've seen people spot it, at least when they mention it in blog posts, is pg_stat_statements, or monitoring tools that are looking at pg_stat_statements, and seeing that some of the top offending queries for buffers dirtied could even be select queries. That really surprised them. So it's kind of one of those surprising moments where you think, ah, is something badly wrong, and then you look into it and realize maybe not.
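A minimal sketch of how this usually surfaces, assuming a standard pg_stat_statements installation: list the statements that dirtied shared buffers, ordered by blocks dirtied (the filter and LIMIT are arbitrary). If plain SELECT statements show up near the top, that is the surprise being discussed.

```sql
SELECT queryid,
       left(query, 60)     AS query_start,
       calls,
       shared_blks_dirtied,
       shared_blks_written,
       temp_blks_written
FROM pg_stat_statements
WHERE shared_blks_dirtied > 0
ORDER BY shared_blks_dirtied DESC
LIMIT 20;
```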
Starting point is 00:01:15 Yeah. Actually, to be fully clear, writes can happen during selects because of temporary files as well, right? If you think about writes to disk. But this one is pretty straightforward; we see it in pg_stat_statements too, as temp bytes or blocks written and read. And yeah, is it local or is it temp? I think it's local. No, local is different. I always forget what local is. You're right, local is for temporary tables, that's why I get confused. Local is for temporary tables and temporary objects, but temp is for when, for example, a sort or a hash spills to disk because it's too big for work_mem, or work_mem times hash_mem_multiplier, and then you get temp blocks read, temp blocks written. But I think what people are spotting is the dirtied
Starting point is 00:02:07 and being like, wait, that should... Dirtied normally means, you know, an update or a delete. It's so confusing, right? I'm checking the pg_stat_statements documentation, and for local blocks hit, read, dirtied, written, and local block read time and write time, it only says things like "total number of local blocks read by the statement", as if everyone understands what a local block is. Well, local block, okay, I think you're right, it's a block. The more I deal with it in Postgres,
Starting point is 00:02:42 after how many years, like 20 years, right, the more I dislike the fact that we call operations blocks. These are operations, not blocks. Because if we say 10 blocks written, it's actually 10 block writes. Maybe it's the same block which was written 10 times, right? Or read. So it should be block reads, block writes, block... I don't know the proper word for dirtying. Dirtying, yeah, I don't know. But anyway, of course, I think it would be great to have an understanding of what a local block is. A local block, I think, is an operation. A local block operation is what happens when we work with temporary tables,
Starting point is 00:03:32 which I always try to avoid for various reasons. Maybe we should talk one day about temporary tables. But temp blocks are not about temporary tables, rather about temporary files, which can appear during the execution of a query dealing with normal tables any time work_mem is not enough, right? So again, this is easy. This part is easy. If we see some select or update or anything producing a lot of temp blocks read or written,
Starting point is 00:04:07 we need to just consider raising work_mem, or optimizing queries so they deal with lower volumes, smaller volumes of data, right? For example, just increase the selectivity of our... I don't know, it's a separate topic anyway. Let's just note that writes can happen because of temporary files; of course selects can produce temporary files because work_mem is not enough, and that's it.
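A quick sketch of that first, easy case; the table and column names are hypothetical. A sort that does not fit in work_mem spills to temporary files, and the spill is visible in the plan.

```sql
SET work_mem = '4MB';  -- deliberately small so the sort spills

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM big_table ORDER BY created_at;
-- Look for "Sort Method: external merge  Disk: ..." and
-- "Buffers: ... temp read=... written=..." in the output.
```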
Starting point is 00:04:34 Right. But then sometimes we have a huge work_mem, definitely enough, and a select is still writing to disk, or, well, technically dirtying: blocks dirtied. These operations, and I will keep calling them operations, not data, can happen during selects. And if it's just dirtied, it doesn't necessarily go to disk immediately, because that's the job of the checkpointer and background writer, right?
Starting point is 00:05:09 But it's already, like, not good. A select is producing some writes to the buffer pool, to the shared buffers area. And of course, if it's written, it's even worse, because I'm pretty sure this makes the select slow as well, similar to what happens when it creates a temporary file and writes to it, right? Yeah, good point. Although I think in the two cases that we're going to be talking about... actually, no, no, there could be a lot of them. I was thinking it's generally only... in a lot of the cases I've seen, it's only one block at a time. But actually, you could easily have a scan doing a lot of these.
Starting point is 00:05:50 So yeah, it could be a lot of data. Yeah, and actually, to be absolutely precise, a block written doesn't necessarily go to disk, because it's written to the page cache, and then something like pdflush, this is already about Linux, should write it to disk at some point. So written doesn't necessarily mean that it goes to disk. I think to understand that we should use a different extension, because pg_stat_statements is great, it's available everywhere, and I think it's the most popular extension, or should be,
Starting point is 00:06:30 at least among observers. And there is an opinion that it should go to core at some point, it should stop being an extension, especially after the query ID was propagated to pg_stat_activity and to auto_explain, right? Definitely pg_stat_statements should be in core. But what I'm saying is there is pg_stat_kcache, which extends pg_stat_statements to allow users to see metrics related to actual physical resources, such as CPU, user CPU, system CPU, and also disk I/O:
Starting point is 00:07:02 real disk writes, real disk reads, how many bytes are written, how many bytes are read for each query ID, which is great. And I think, you know, we talked about how finally BUFFERS was committed for EXPLAIN ANALYZE by default. When you run EXPLAIN ANALYZE in Postgres 18, unless it's reverted, and I hope it won't be reverted, it will always include buffers in the output, which is a great victory.
Starting point is 00:07:33 We talked about it for a couple of years on this podcast, I think almost since the very beginning. And I was thinking what topic to choose next, to complain about occasionally, regularly. Or maybe campaign rather than complain. Okay, campaign, campaign, campaign. More positive. Yeah. Right. So what kind of campaign to launch after this? Like, I was happy, but I was like, oh, I will miss us talking about buffers and EXPLAIN ANALYZE and so on. So I chose this topic.
Starting point is 00:08:02 I think we should focus on all managed Postgres platform builders and campaign for them to include pg_wait_sampling and pg_stat_kcache
Starting point is 00:08:18 in their extensions. Because it is super important to have them both, to be able to do proper query analysis in different aspects. pg_stat_statements is great, it has a lot of metrics, and it keeps growing in terms of the number of metrics,
Starting point is 00:08:37 but it doesn't have what pg_stat_kcache and pg_wait_sampling provide. So we need those extensions on each serious database, especially heavily loaded ones, and pg_stat_kcache is great here. This would allow us to see, oh, actually this query, being a select, caused actual physical disk writes. Yeah, I'm going to have a think. If we only get one campaign, I'm not sure that's mine, but maybe. I like it. It's a good idea. Okay, please consider this.
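A rough sketch of the kind of query this enables, assuming pg_stat_kcache is installed alongside pg_stat_statements. Column names differ between pg_stat_kcache versions (newer releases split metrics into plan_*/exec_*), so treat exec_reads and exec_writes as placeholders for whatever your version exposes.

```sql
-- Physical (OS-level) I/O per statement, joined to pg_stat_statements on queryid.
SELECT s.queryid,
       left(s.query, 60) AS query_start,
       k.exec_reads      AS physical_reads,
       k.exec_writes     AS physical_writes
FROM pg_stat_statements s
JOIN pg_stat_kcache() k USING (queryid)
ORDER BY k.exec_writes DESC
LIMIT 20;
```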
Starting point is 00:09:09 I'm definitely launching a campaign, at least on my Twitter and LinkedIn. And actually, I already launched it, and I saw a reaction from... Actually, I mentioned that these extensions are present on all the serious setups which we have helped maintain for quite some time. pg_stat_kcache has been used in the serious databases I'm dealing with for a very long time, many years. And I know some platforms use it, but in different countries, right?
Starting point is 00:09:39 And the only case I see is that pg_wait_sampling is available on Google Cloud SQL. pg_stat_kcache is not available there, but pg_wait_sampling is, which is great. Are you sure? I'm sure, pg_wait_sampling is available. Compare with RDS: they have Performance Insights, which is great, but this information is not available through a SQL interface. I'm not sure about CloudWatch, maybe it's available there, but it's a different API. I want to see details
Starting point is 00:10:11 right in SQL, because I have tools to analyze pg_stat_statements, and I want to extend them to active session history analysis and also to physical metrics, which pg_stat_kcache provides. So Cloud SQL has pg_wait_sampling but it doesn't have pg_stat_kcache; others don't have either of the two. And the reaction I saw was from one of the Cloud SQL engineers, who promised to
Starting point is 00:10:36 think about it. I mean, it would be great, and Cloud SQL could be even better if they had pg_stat_kcache. I think they're already much better in terms of observability, better than RDS, because pg_wait_sampling is better than Performance Insights. Usually, you combine multiple tools for observability,
Starting point is 00:10:59 and you go to active session history. For RDS or Aurora, you go to one place, to their own place. For everything else, you go to another place. It's not convenient. So in case of Google, they have,
Starting point is 00:11:11 they call it, I think, query insights or something like this. They have it exposed in their own monitoring interface, but they also expose it for anyone
Starting point is 00:11:19 through a SQL interface, which is great, right, through pg_wait_sampling. If they had pg_stat_kcache, I would be happy. But I'm also going to ask others to consider these two extensions. For example,
Starting point is 00:11:33 Crunchy Bridge, Supabase, who else? Everyone. I actually think RDS should consider pg_wait_sampling as well, and pg_stat_kcache as well. Why not? It should just be an option users could decide on. So, back to our question. I think if you have pg_stat_kcache, you can see actual writes happening from queries, and you can see it's selects, and you know work_mem is enough, so it's not
Starting point is 00:12:00 temporary files. You can confirm it through pg_stat_statements, or also through individual query execution via EXPLAIN (ANALYZE, BUFFERS). And that's it. You're left wondering what's happening, right? Why is a select writing to disk? Or at least, why is it writing to the buffer pool? Which was already quite a surprise, right? Yeah.
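A minimal sketch of confirming it for a single query; the table and filter are hypothetical. On the first read of freshly written rows you may see a dirtied count in the buffers line, and typically not on a repeat run.

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM orders WHERE created_at > now() - interval '1 day';
-- Example of the line to look for in the output:
--   Buffers: shared hit=1234 read=56 dirtied=42
```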
Starting point is 00:12:28 So, well, until recently I knew this was a phenomenon, and I'd read about it a couple of times, seen it on the mailing list and in a post or two, but I only knew of one of the potential reasons. And the reason I think this came up recently is there was a good blog post by Alex Jesipow, I'm not sure how to pronounce that, sorry, who mentioned a second reason as well. So the first reason is related to the setting of hint bits, and maybe we can go into detail on that one first, because that's the one I've seen most often and it comes up the most. Do you want to discuss that one first? Yeah, let's do it. So why do we have hint bits, and what are hint bits? By the way, maybe it's pronounced "Yesipoff", but Alex is from Germany, I see, and works at Luminovo, which is interesting, because
Starting point is 00:13:20 we worked together last year, I mean with his colleagues. It's interesting, I just realized. It's cool. I'm going to send him this episode for sure. So we have two cases, right? Two cases. And I'm not sure which one is the easiest. Let's start with the easiest, actually. I think hint bits are the easiest to understand, and it's the most common case. So my understanding of this is: when new data is written to a page, at that moment Postgres doesn't yet know if that,
Starting point is 00:13:56 well, that could be part of a much larger, longer transaction. And we don't yet know if that transaction is going to commit or get aborted and rolled back. So at that moment we cannot mark on the data page that this transaction has been committed and that therefore its rows should be visible to new reads of this page. That information is in a separate log once the transaction does commit. And that means if somebody is reading the page, they need to check which of these row versions have already been committed for sure, and whether there are any ambiguous ones where we don't know yet. Based on these hint bits, based on these four different flags, we can tell what status it is: is it unknown, is it definitely committed, is it already out of date, like is it already
Starting point is 00:14:59 being replaced. So in the case of these reads causing writes, it's a subsequent read coming along, seeing a new row version that we don't know yet whether it's been committed, checking the commit log, and then, having read it once, it sets that flag and dirties the page, so the page gets written again, so that any future reads now don't have to check. Right. Yeah. So, yeah, hint bits can mark a tuple as, like, invisible, dead, frozen. And the thing is, the most interesting part of it is that this write can happen on a replica, which is kind of a surprise for people, right? How come? There's a really good...
Starting point is 00:15:52 Oh, sorry, just to go back to hint bits, there's a really good page on the Postgres wiki that describes it really succinctly, and there are four hint bits: xmin committed, xmin aborted, xmax committed, xmax aborted. Yeah. So, yeah, that's it actually. We just need to know about this phenomenon, and that's it. We should not be surprised that it happens. And subsequently... And it's healthy, right? It's not a sign of anything having gone wrong. In a healthy system it will happen quite often.
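For the curious, a sketch of looking at those flags directly with the pageinspect extension; the table name t and page number 0 are hypothetical. The masks are the HEAP_XMIN_*/HEAP_XMAX_* infomask bits.

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

SELECT lp, t_xmin, t_xmax,
       (t_infomask & 256)  > 0 AS xmin_committed,  -- HEAP_XMIN_COMMITTED
       (t_infomask & 512)  > 0 AS xmin_aborted,    -- HEAP_XMIN_INVALID
       (t_infomask & 1024) > 0 AS xmax_committed,  -- HEAP_XMAX_COMMITTED
       (t_infomask & 2048) > 0 AS xmax_aborted     -- HEAP_XMAX_INVALID
FROM heap_page_items(get_raw_page('t', 0));
```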
Starting point is 00:16:29 Yeah, and in this case a subsequent select doesn't cause this, right? Because it's already been updated. And you can consider it kind of like warming up caches, but in reverse, because it's writes. So this overhead is only present on the very first call with the same parameters; of course, different parameters can cause such writes in different pages. Yeah. And it's necessary, it's a necessary
Starting point is 00:16:57 result of us wanting to be able to serve things concurrently. Because of MVCC, we need versions of rows, and we then need to know which one should be visible to which transactions. So in order to do that, this is a necessary part of it. Right, right. So yeah, let's talk about the second case. Yeah, the second was more interesting to me, like, I hadn't come across this before. Yeah, let's talk about it. So in Alex's post, I think you're probably right on the surname pronunciation, but I'm not going to try it again.
Starting point is 00:17:33 He describes this as page pruning, he calls it, which I have not heard it called before. And when I think of pruning, my mind naturally goes to, for example, the mechanism for removing an empty page at the end of the heap. Is that also called pruning? But this is different. I think that's truncation. If you talk about removing the last page, when vacuum does it, it's truncation. That makes sense. So this is almost like a... I think you just,
Starting point is 00:18:07 before the call you mentioned it was described as a kind of in-page vacuum. Yeah. So it's like... And it can happen during selects, on the fly, which is interesting, right? Which is also strange. But recall how HOT updates are organized, which is a great feature, unfortunately not always available because it requires two special conditions to be met. First, we are changing values of columns which are not indexed. And second, there is enough empty space in the same page where our old tuple is stored. In this case, so-called HOT chains are created, and there can be multiple versions of the same tuple in the same page. And in this case, what happens if we have an index scan? We only know, like, the page and offset for the first one.
Starting point is 00:19:06 Indexes are not updated in this case, because it's a HOT update. This is an optimization to fight index write amplification, which is a terrible part of Postgres MVCC behavior that many projects suffer from: when updating one row in a table that has many indexes, we need to update all of the indexes, producing a lot of WAL writes and just making updates slow, heavy, and so on. In the case of HOT updates, when those two conditions I mentioned are met, Postgres writes only to the same page where the tuple is
Starting point is 00:19:38 already stored, because there is enough space, and it doesn't update indexes at all, because we are changing a value which is not indexed. It produces a new row version inside the same page, creating a chain. It can be new, new, new. And when we have an index scan, the index scan points to the first version in the page, and then it's quick to jump between versions and find the actual one inside the same page.
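A small sketch for checking how much a workload benefits from this: pg_stat_user_tables tracks how many updates were HOT, meaning heap-only, with no index entries written.

```sql
SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_update_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 20;
```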
Starting point is 00:20:06 One crucial thing that I think becomes important later is that the chain information is stored in the header of the page, whereas the data for the row versions is stored at the end of the page. Right, right. This is true, yes. So, if we already have old versions, at some point they need to be deleted. That can happen during vacuuming, or it can happen earlier if, while dealing with this page, Postgres decides, okay, we can clean up old versions right now, why not. And that can happen during a select as well, which is very... yeah, well, it's cool. But I think, again, there's a condition where it will only do this if there is not that much space left on the page; I think the number is 10% of the page left in terms of free space. There's some subtlety around whether you've changed the
Starting point is 00:21:08 fillfactor. Which means that this is an effort to maintain the hotness of updates further, because without this, we would need to go to another page, and that would definitely lead to updating all the indexes. Yes, but I think the optimization of not doing it unless the page is quite full means we don't have to do it that often. So, for example, if we have, let's say, a really quick 10 or 15 updates of the same row, and they all fit on the same page, and we're having reads of that row in between, we're not cleaning up each version every time we do a read. We're waiting until the page gets full and then doing it in one go. So I think it's quite a smart optimization.
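A tiny sketch of the knob mentioned here; the table name is hypothetical. A lower fillfactor reserves free space in each heap page for new row versions, which makes HOT updates, and this in-page pruning, more likely.

```sql
ALTER TABLE accounts SET (fillfactor = 80);  -- keep ~20% of each heap page free
-- Existing pages are only repacked by a rewrite (VACUUM FULL, CLUSTER, pg_repack);
-- newly written pages respect the setting right away.
```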
Starting point is 00:21:56 It's a kind of trade-off, a balance between the two situations, and so on. So we don't do it too often, but also we try not to allow this tuple to drift to another page, which would create the need to update indexes, because indexes point to the old page, right? So yeah, this is quite interesting. And again, this can happen during a select.
Starting point is 00:22:22 And yeah, but can it happen only on the primary, right? Yes or no? It's my guess, I don't know, honestly. Why? Because... can it happen on a replica? Yeah, well, I actually don't know. Well, let us know in the comments. This is an interesting question, because if it's happening on a replica, it means we have very different content of pages on the replica and the primary, right? And, like, vacuum happening on a replica? Please, no. Because this changes the content of the page, and I cannot understand how they could differ; page content should be synchronized in terms of tuples.
Starting point is 00:23:09 It's a physical replica, yeah. Yeah, yeah, yeah. So it should happen only on the primary, because... Okay, yeah, that makes sense. ...and the replicas should just get this content. Now, with hint bits it's different. Hint bits add additional information, and they don't change how tuples are stored inside
Starting point is 00:23:26 the page. This thing changes it; it's cleaning up the space for new tuples. So it should happen only on the primary, I think. That makes sense, yeah, logically, on a physical replica. Right. Okay. Well, good. There's one more open question.
Starting point is 00:23:53 Yeah, I saw, when this latest blog post was shared to Reddit, there was a really interesting comment that no one has replied to, and there's almost no activity on it, but somebody posed a question, and I think it's a good one: whether there might be a third possible cause of this. So it's more of a question to listeners, I'm not actually sure myself and haven't been able to confirm it, but there's a PD_ALL_VISIBLE flag on pages, and that might be a third way. If selects can flip that based on the visibility of all the tuples in the page, then that might be a third case, but I don't know if selects can.
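For anyone who wants to poke at that flag, a sketch using the pg_visibility extension; the table name t is hypothetical. pd_all_visible is the page-level flag, while the other two columns come from the visibility map.

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;

SELECT blkno, all_visible, all_frozen, pd_all_visible
FROM pg_visibility('t');
```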
Starting point is 00:24:31 So yeah, again, if you know, I'll be really interested to hear. Yeah. Yeah. So I guess that's it, right?
Starting point is 00:24:37 So yeah, great, great blog post. There's also an old one from Nikolay Sivko on the okmeter blog that I'll share. So there are a few articles to read more about this.
Starting point is 00:24:49 Correction: Nikolay currently has a new startup, which is called Coroot. Yeah. So okmeter is his old startup. And Peter Zaitsev, founder of Percona, joined Nikolay recently, not very long ago. Coroot is a great project. But I guess this blog post was created while Nikolay was working on the previous project, and he just started observing the writes happening from selects, was very surprised, and decided to blog about it. So yeah, well, it was a monitoring project as well, right?
Starting point is 00:25:16 Yeah, yeah, it had very good Postgres-related features, which I guess were in some part inherited by Coroot. Yeah, but Coroot is amazing. Just a couple of words about it. If you like flame graphs, Coroot can show you them dynamically. Like, imagine you have a dashboard, you choose a time range for Postgres, and you see flame graphs, but, like, hanging down.
Starting point is 00:25:53 So they are like 180 degrees rotated, or mirrored, right? A horizontal mirror. And you can see details of what was happening. If you have debug symbols installed, and I think everyone should always have debug symbols installed, you see deep details of what's happening and where time is spent, and dynamic diagnostics like this, it's crazy. I think I should probably add this to the recommendations for managed Postgres platform builders, to consider Coroot as well, also a good thing for observability. It's open source, right?
Starting point is 00:26:32 It has an open source part. I think it's like an open core model, I'm not sure, but the main thing is definitely open source. Again, I'm not sure. Thanks so much, Nikolay. Thanks, everyone, for listening. We'll catch you next week.
Starting point is 00:26:45 Thank you, Michael. See you soon. Bye.
