Postgres FM - Reads causing writes
Episode Date: January 24, 2025

Nikolay and Michael discuss a couple of surprising ways read queries (selects) can cause writes (shared buffers dirtied) in Postgres. Here are some links to things they mentioned:

Reads causing writes in Postgres (post by Alex Jesipow) https://jesipow.com/blog/postgres-reads-cause-writes/
Exploring how SELECT queries can produce disk writes (post by Nikolay Sivko) https://blog.okmeter.io/postgresql-exploring-how-select-queries-can-produce-disk-writes-f36c8bee6b6f
Hint Bits (wiki) https://wiki.postgresql.org/wiki/Hint_Bits
pg_stat_statements https://www.postgresql.org/docs/current/pgstatstatements.html
pg_stat_kcache https://github.com/powa-team/pg_stat_kcache
pg_wait_sampling https://github.com/postgrespro/pg_wait_sampling
BUFFERS by default (episode 4!) https://postgres.fm/episodes/buffers-by-default
Page layout (docs) https://www.postgresql.org/docs/current/storage-page-layout.html
From Reddit (user merlinm), PD_ALL_VISIBLE may be a third case https://www.reddit.com/r/PostgreSQL/comments/1hjuyfn/comment/m3e6e7v/
Coroot https://coroot.com/

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM, a weekly chat about all things PostgreSQL.
I am Michael, founder of pgMustard, and this is Nikolay, founder of Postgres AI.
Hey Nikolay, how's it going?
Hello Michael, going great, how are you?
Yeah, I'm good, thank you.
So, what are we talking about this week?
Let's talk about why we can have writes on replicas.
Yeah.
Or, the thing I've seen people get confused by is why they can get writes even on the primary
for select queries, like for read queries. But yeah, I like your framing.
Yeah, it can happen on primary as well. I mean, this surprise can happen on the primary as well.
And specifically where people might notice this is if they get buffers in query plans, like shared dirtied.
Exactly.
But also, I think the place I've seen people spot it, at least when they mention it in blog posts, is pg_stat_statements, or monitoring tools that are looking at pg_stat_statements, and seeing that some of the top offending queries for buffers dirtied could even be select queries. And that really surprised them. So it's kind of one of those surprising moments where you think, ah, is something badly wrong? And then you look into it and realize maybe not.
Yeah. Actually, to be fully clear, writes can happen during selects because of temporary files as well, right? If you think about writes to disk. But this is pretty straightforward: again, in pg_stat_statements we can see it as temp bytes or blocks written and read.
And yeah, is it local, or is it temp? I think it's local.
No, local is different. I... but I always forget what local is. You're right. Yeah, you're right, local is for temporary tables.
That's why I get confused: local is for temporary tables and temporary objects. But temp is for, like, if for example a sort or a hash spills to disk because it's too big for work_mem, or work_mem times hash_mem_multiplier, then you get temp blocks read, temp blocks written. But I think what people are spotting is the dirtied, and being like, wait, that should...
Dirtied means, you know, an update or delete normally.
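For anyone who wants to see what Michael is describing, here is a minimal sketch against pg_stat_statements (the ILIKE filter is just a crude way to pick out read-looking statements; column names are from Postgres 13 and later):

```sql
-- "Read" queries that nevertheless dirty shared buffers.
SELECT queryid,
       left(query, 60)     AS query,
       calls,
       shared_blks_dirtied,
       shared_blks_written,
       temp_blks_written
FROM pg_stat_statements
WHERE query ILIKE 'select%'          -- crude filter for read queries
  AND shared_blks_dirtied > 0
ORDER BY shared_blks_dirtied DESC
LIMIT 10;
```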
It's so confusing, right?
I'm checking the pg_stat_statements documentation, and for local blocks hit, read, dirtied, written, and local block read time, write time, it only says "total number of local blocks read by the statement", "number of local blocks...", as if everyone understands what a local block is.
Well, local block... okay, I think you're right, it's a block. Also, the more I deal with Postgres, after how many years, like 20 years, right? The more I dislike the fact that we call operations blocks. These are operations, not blocks. Because if we say 10 blocks written, it's actually 10 block writes. Maybe it's the same block which was written 10 times, right? So it should be block reads, block writes, block... I don't know the proper word for dirtying.
Dirtying, dirtying. Yeah.
I don't know. But anyway, of course I think it would be great to have an understanding of what a local block is.
Local block, I think, is an operation.
A local block operation is what's caused when we work with temporary tables,
which I always try to avoid, for various reasons.
Maybe we should talk one day about temporary tables.
But temporary blocks... temp blocks is not about temporary tables, but rather about temporary files, which can appear during execution of a query dealing with normal tables, anytime, because work_mem is not enough, right?
So again, this is easy.
This part is easy.
If we see some select or update or anything producing a lot of temp blocks read or written, we need to consider raising work_mem, or optimizing the query so it deals with lower volumes, smaller volumes of data, right? For example, I don't know, like, just increasing the selectivity of our... I don't know, it's a separate topic anyway. Let's just note that writes can happen because of temporary files; of course selects can produce temporary files because work_mem is not enough, and that's it.
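As a quick illustration of the temp-file case, a sketch that forces a sort to spill (the work_mem value is deliberately tiny; exact output will vary):

```sql
SET work_mem = '64kB';   -- deliberately tiny so the sort spills

EXPLAIN (ANALYZE, BUFFERS)
SELECT g FROM generate_series(1, 100000) AS g ORDER BY g DESC;
-- Expect something roughly like:
--   Sort Method: external merge  Disk: ...kB
--   Buffers: ... temp read=... written=...
```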
Right. But then sometimes we have huge work_mem, definitely enough, and a select is still writing to disk. Or, well, technically dirtying: blocks dirtied. These operations (I will keep calling them operations, not data), these events can happen during selects. And if a block is just dirtied, it doesn't necessarily go to disk immediately, because that's the work of the checkpointer and the background writer, right? But it's already not good: a select is producing writes to the buffer pool, to the shared buffers area. And of course, if blocks are written, it's even worse, because I'm pretty sure that makes the select slow as well.
Similarly, it happens when it creates a temporary file and writes to it, right?
Yeah, good point. Although I think, in the two cases we're going to be talking about... actually, no, no, there could be a lot of them. I was thinking that in a lot of the cases I've seen, it's only one block at a time.
But actually, you could easily have a scan doing a lot of these.
So yeah, it could be a lot of data.
Yeah, and actually, to be absolutely precise, a block written doesn't necessarily go to disk, because it's written to the page cache, and then pdflush or something like that (this is already about Linux) writes it to disk at some point. So written doesn't necessarily mean it goes to disk. I think to understand that, we should use a different extension. pg_stat_statements is great, it's available everywhere, and I think it's the most popular extension, or should be, at least among observers. And there is an opinion that it should go to core at some point, that it should stop being an extension, especially after query ID was propagated to pg_stat_activity and to auto_explain, right? Definitely pg_stat_statements should be in core. But what I'm talking about here is pg_stat_kcache, which extends pg_stat_statements to let users see metrics related to actual physical resources, such as CPU (user CPU, system CPU) and also disk I/O: real disk writes, real disk reads, how many bytes are written and how many bytes are read for each query ID, which is great.
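A sketch of what that looks like, assuming pg_stat_kcache is installed and preloaded; it ships views such as pg_stat_kcache_detail, though exact column names vary between versions (older releases use reads/writes, newer ones exec_reads/exec_writes):

```sql
-- Statements ranked by physical bytes written (column names are from recent
-- pg_stat_kcache versions and may differ on older ones).
SELECT left(query, 60)  AS query,
       exec_reads,        -- bytes physically read from storage
       exec_writes,       -- bytes physically written to storage
       exec_user_time,    -- CPU time spent in user space
       exec_system_time   -- CPU time spent in the kernel
FROM pg_stat_kcache_detail
ORDER BY exec_writes DESC
LIMIT 10;
```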
And, you know, we talked about how buffers were finally committed to EXPLAIN ANALYZE by default. When you run EXPLAIN ANALYZE in Postgres 18, unless it's reverted (I hope it won't be reverted), it will always include buffers in the output, which is a great victory. We talked about it for a couple of years on this podcast, I think almost since the very beginning.
And I was thinking what topic to choose next to complain about occasionally, regularly. Or maybe campaign rather than complain. Okay: campaign, campaign, campaign.
More positive.
Yeah.
Right.
So what kind of campaign to launch after this?
Like, I was happy, but I was also like, oh, I will miss us talking about buffers and EXPLAIN ANALYZE and so on.
So I chose this topic.
I think we should focus on all managed Postgres platform builders and advocate for them to include pg_wait_sampling and pg_stat_kcache in their extensions. Because it's super important to have them both, to be able to do proper query analysis in different aspects.
pg_stat_statements is great. It has a lot of metrics, and it keeps growing in terms of the number of metrics, but it doesn't have what pg_stat_kcache and pg_wait_sampling provide. So we need those extensions on every serious database, especially heavily loaded ones, and pg_stat_kcache is great here. And this would allow us to see: oh, actually this query, being a select, caused actual physical disk writes.
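On self-managed Postgres, enabling the pair is typically just a config change plus CREATE EXTENSION; a sketch (managed platforms would need to expose the equivalent switches):

```sql
-- In postgresql.conf, then restart (both extensions must be preloaded,
-- and pg_stat_kcache builds on pg_stat_statements):
--   shared_preload_libraries = 'pg_stat_statements, pg_stat_kcache, pg_wait_sampling'

-- Then, in each database where the SQL interface is wanted:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
CREATE EXTENSION IF NOT EXISTS pg_stat_kcache;
CREATE EXTENSION IF NOT EXISTS pg_wait_sampling;
```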
Yeah, I'm going to have a think. If we only get one campaign, I'm not sure that's mine,
but maybe. I like it. It's a good idea.
Okay, please consider this.
I'm definitely launching a campaign, at least on my Twitter and LinkedIn.
And actually, I already launched it, and I saw a reaction from...
Actually, I mentioned that these extensions are present on all serious setups,
which we helped to maintain for quite some time.
And pg_stat_kcache has been used in the serious databases I'm dealing with for a very long time, many years. And I know some platforms use it, but in different countries, right? And the only case I see is pg_wait_sampling being available on Google Cloud SQL. pg_stat_kcache is not available there, but pg_wait_sampling is, which is great.
Are you sure?
I'm sure, pg_wait_sampling is available. Imagine RDS: they have Performance Insights, which is great, but that information is not available through a SQL interface. I'm not sure about CloudWatch, maybe it's available there, but it's a different API. I want to see details right in SQL, because I have tools to analyze pg_stat_statements, and I want to extend them to active session history analysis and also to the physical metrics which pg_stat_kcache provides.
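As a sketch of what that SQL interface gives you, assuming pg_wait_sampling is installed, its pg_wait_sampling_profile view aggregates sampled wait events per query ID:

```sql
-- Most frequently sampled wait events, attributed to query IDs.
SELECT queryid, event_type, event, count
FROM pg_wait_sampling_profile
WHERE event IS NOT NULL
ORDER BY count DESC
LIMIT 10;
```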
So Cloud SQL has pg_wait_sampling, but it doesn't have pg_stat_kcache; others don't have either of these two. And the reaction I saw was from one of the Cloud SQL engineers, who promised to think about it. I mean, it would be great, and Cloud SQL could be much better if they had pg_stat_kcache. I think they're already better in terms of observability, better than RDS, because pg_wait_sampling is better than Performance Insights.
Usually, you combine multiple tools for observability, and you need to go to active session history. For RDS or Aurora, you go to one place, to their own place; for everything else, you go to another place. It's not convenient. In the case of Google, they have, they call it, I think, Query Insights or something like this. They have it exposed in their own monitoring interface, but they also expose it for anyone through a SQL interface, which is great, right? Through pg_wait_sampling.
If they add pg_stat_kcache, I will be happy. But I'm also going to ask others to consider these two extensions. For example, Crunchy Bridge, Supabase, who else? Everyone. I actually think RDS should consider pg_wait_sampling as well, and pg_stat_kcache as well. Why not? It should just be an option users could decide on.
So, back to our question. I think if you have pg_stat_kcache, you can see actual writes happening from queries, and you can see selects among them. And you know work_mem is enough, so it's not temporary files; you can confirm that through pg_stat_statements, or also through individual query execution via EXPLAIN (ANALYZE, BUFFERS). And that's it. You're wondering what's happening, right? Why is a select writing to disk? Or at least, why is it writing to the buffer pool, which is already quite a surprise, right?
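For a single execution, the confirmation looks roughly like this (the table name and the numbers in the comments are made up for illustration):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM some_table WHERE id < 1000;   -- some_table is a hypothetical example
-- A plan node might report, for example:
--   Buffers: shared hit=42 read=17 dirtied=17
-- "dirtied" means this read-only query modified pages in shared buffers,
-- even though it changed no user-visible data.
```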
Yeah. So, until recently, I knew this was a phenomenon. I'd read about it a couple of times, seen it on the mailing list and in a post or two, but I only knew of one of the potential reasons. And the reason I think this came up recently is that there was a good blog post by Alex Jesipow (I'm not sure how to pronounce that, sorry) who mentioned a second reason as well. So the first reason is related to the setting of hint bits, and maybe we can go into detail on that one first, because that's the one I've seen most often and it comes up the most. Do you want to discuss that one first?
Yeah, let's do it. So why do we have hint bits, and what are hint bits? By the way, I'm not sure how to pronounce the surname either, but Alex is from Germany, I see, and works at Luminovo, which is interesting because we worked together last year, I mean with his colleagues. It's interesting, I just realized. It's cool.
I'm going to send him this episode for sure.
So we have two cases, right? Two cases.
And I'm not sure which one is the easiest. Let's start with the easiest, actually.
I think hint bits are the easiest to understand, and it's the most common case.
So my understanding of this is: when new data is written to a page, at that moment Postgres doesn't yet know... that could be part of a much larger, longer transaction, and we don't yet know if that transaction is going to commit or get aborted and rolled back. So at that moment we cannot say on the data page that this transaction has been committed and therefore should be visible to new reads of this page. That information is in a separate log once the transaction does commit. And that means if somebody's reading the page, they need to check which of these row versions has already been committed for sure, and whether there are any ambiguous ones where we don't know yet. Based on these hint bits, based on these four different flags, we can tell what status it is: is it unknown, is it definitely committed, is it already out of date, like, is it already being replaced? So in the case of these reads causing writes, it's a subsequent read coming along, seeing a new row version that we don't know yet whether it's been committed, checking the commit log, and then, having read it once, it sets the hint and dirties the page, writing the page again, so that any future reads now don't have to check.
Right.
Yeah.
So, yeah, hint bits: whether a tuple is, like, invisible, dead, frozen.
And the thing is, the most interesting part of it is that this write can happen on the replica,
which is kind of a surprise for people, right?
How come?
There's a really good...
Oh, sorry, just to go back to hint bits,
there's a really good page on the Postgres wiki that describes it really succinctly,
and there are four hint bits.
xmin committed, xmin aborted, xmax committed, xmax aborted.
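A minimal sketch of seeing the hint-bit effect yourself (exact numbers will differ):

```sql
CREATE TABLE hint_demo (id int);
INSERT INTO hint_demo SELECT generate_series(1, 100000);
CHECKPOINT;   -- flush the freshly inserted pages so the next dirtying stands out

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM hint_demo;
-- The first read after the inserting transaction commits typically reports
-- "Buffers: ... dirtied=..." because it sets xmin-committed hint bits.

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM hint_demo;
-- The second run touches the same pages but dirties nothing.
```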
Yeah. So, yeah, that's it actually. So, we just know about this phenomenon and that's
it. We should not be surprised that it happens. And subsequently…
And it's healthy, right? It's not a sign of anything having gone wrong. In a healthy system, it will happen quite often.
Yeah, and in this case a subsequent select no longer causes it, right? Because it's already updated. You can consider it kind of like warming up caches, but in reverse, because it's writes. So this overhead is only present on the very first call with the same parameters; of course, different parameters can cause such writes in different pages.
Yeah, yeah. And it's necessary, it's a necessary
result of us wanting to be able to serve things concurrently because of MVCC, we need versions of rows, and we need to then know
which one should be visible to which transactions. So in order to do that, this is a necessary part
of that. Right, right. So yeah, let's talk about the second case. Yeah, the second was more
interesting to me. Like, I hadn't come across this before. Yeah. Let's talk about it.
So in Alex's post (I think you're probably right on the surname pronunciation, but I'm not going to try it again), he describes this as page pruning, he calls it, which I had not heard it called before.
And when I think of pruning,
I think my mind naturally goes to, for example, the mechanism for
removing an empty page at the end of the heap, for example. Is that also called pruning?
But this is different. I think that's truncation: if you talk about removing the last page, when vacuum does it, it's truncation. That makes sense. So this is almost like, I think before the call you mentioned it was described as a kind of in-page vacuum. Yeah. So it's like...
And it can happen during selects, on the fly, which is interesting, right? Which is also strange. But recall how HOT updates are organized, which is a great feature, unfortunately not always available because it requires two special conditions to be met. First is that we are changing values of columns which are not indexed. And second is that there is enough empty space in the same page where our old tuple is stored. In this case, so-called HOT chains are created, and there can be multiple versions of the same tuple in the same page. And in this case, if we have an index scan, we only know the page and offset for the first one. Indexes are not updated in this case, because it's a HOT update.
This is an optimization to fight index write amplification, which is a terrible Postgres MVCC behavior many projects suffer from: when updating one row in a table with many indexes, we need to update all of the indexes, producing a lot of WAL writes and just making updates slow, heavy, and so on. In the case of HOT updates, when those two conditions I mentioned are met, Postgres writes only to the same page where the tuple is already stored, because there is enough space, and it doesn't update indexes at all, because we are changing a value which is not indexed. And it produces a new row version inside the same page, creating a chain.
It can be new, new, new.
And when we have an index scan, the index entry points to the first version in the page, and then it's quick to jump between versions and find the actual one inside the same page.
One crucial thing that I think becomes important later is that chain information is stored in the header of the page,
whereas the data from the row version is stored at the end of the page.
Right, right. This is true, yes. And if we already have old versions, at some point they need to be deleted. It can happen during vacuuming, or it can happen earlier, if, while dealing with this page, Postgres decides: okay, we can clean up old versions right now, why not. And it can happen during a select as well, which is very...
Yeah, well, it's so cool.
But I think, again, there's a condition where it will only do this if there is not that much space left on the page; I think the number is 10% of the page left in terms of free space. There's some subtlety around if you've changed fillfactor.
So this means it's an effort to maintain the hotness of updates further, because without this we would need to go to another page, and that would definitely lead to updating all indexes.
Yes. But I think the optimization of not doing it unless the page is quite full means we don't have to do it that often. So for example, if we have, let's say, a really quick 10 or 15 updates of the same row, and they all fit on the same page, and we're having reads of that row in between, we're not cleaning up each version each time we do a read. We're waiting until the page gets full and then doing it in one go. So I think it's quite a smart optimization.
It's a kind of trade-off, a balance between two worst-case situations, and so on. So we don't do it too often, but also we try not to allow this tuple
to drift to another page,
which would cause the need to update indexes
because indexes point to the old page, right?
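Here is a hedged sketch of watching a HOT chain inside a single page with the pageinspect extension (assumed to be available); exactly when pruning kicks in depends on how full the page is, so treat it as illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

CREATE TABLE hot_demo (id int PRIMARY KEY, val int) WITH (fillfactor = 90);
INSERT INTO hot_demo VALUES (1, 0);

-- "val" is not indexed and the page has free space, so these are HOT updates:
UPDATE hot_demo SET val = val + 1;
UPDATE hot_demo SET val = val + 1;

-- Line pointers on page 0: the chain of row versions lives inside one page.
SELECT lp, lp_flags, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('hot_demo', 0));

-- Once the page is full enough, a later access (even a plain SELECT) can prune
-- the dead chain members in place, dirtying the page without any UPDATE.
```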
So yeah, this is quite interesting.
And again, this can happen during a select. But it can happen only on the primary, right?
Yes or no? It's my guess, I honestly don't know, because... can it happen on a replica?
Yeah, well, actually, I don't know. Well, let us know in the comments.
This is an interesting question, because if it's happening on a replica, it means we have very different content of pages on the replica and the primary, right? And, like, vacuum happening on a replica? Please, no, because this changes the content of the page, and I cannot understand how they can... Page content should be synchronized in terms of tuples.
It's physical replica, yeah.
Yeah, yeah, yeah.
So it should happen only on the primary because…
Okay, yeah.
That makes sense.
…and the replicas should just get this content.
Now, hint bits, that's different. Hint bits add extra information, and they don't change how tuples are stored inside the page. This thing changes the page: it cleans up space for new tuples. So it should happen only on the primary, I think.
That makes sense, yeah. Logically, on a physical replica.
Right.
Okay.
Well, good.
There's one more open question. Yeah, I saw, when this latest blog post was shared to Reddit, there was a really interesting comment that no one has replied to, there's almost no activity on it, but somebody posed the question, and I think it's a good one, whether there might be a third possible cause of this. So it's more of a question to listeners; I don't know myself and haven't been able to confirm for sure. But there's a PD_ALL_VISIBLE flag on pages, and that might be a third way: if selects can flip that, based on visibility of all the tuples in the page, then that might be a third case. But I don't know if selects can.
So yeah, again, if you know, I'll be really interested to hear.
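For anyone who wants to poke at it, the pg_visibility extension (assumed installed) exposes both the visibility-map bits and the PD_ALL_VISIBLE flag from the page header; whether a plain select can flip the flag is exactly the open question:

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- Per-page visibility state for a table of your choice; 'hint_demo' is just
-- the demo table from earlier, substitute any table name.
SELECT blkno, all_visible, all_frozen, pd_all_visible
FROM pg_visibility('hint_demo'::regclass)
LIMIT 10;
```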
Yeah.
Yeah. So I guess that's it, right?
Yeah.
So yeah, great, great blog post.
There's also an old one from Nikolay Sivko on the Okmeter blog that I'll share.
So there's a few articles to read more about this.
Correction.
Nikolay currently has a new startup, which is called Coroot.
Yeah.
So Okmeter is his old startup. And Peter Zaitsev, founder of Percona, joined Nikolay recently, not very long ago. And I'm like, Coroot is a great project. But I guess this blog post was created while Nikolay was working on the previous project, and he just started observing the writes happening from selects, was very surprised, and decided to blog about it. So yeah, well, it was a monitoring project as well, right?
Yeah, yeah, it had very good Postgres-related features, which I guess in some respects were inherited by Coroot.
Yeah, but Coroot is amazing.
Just a couple of words about it.
If you like flame graphs, Coroot can show you them dynamically. Like, imagine you have a dashboard, you choose a time range for Postgres, and you see flame graphs, but hanging down, so they are 180 degrees rotated, or mirrored, right? A horizontal mirror. And you can see details of what was happening, if you have debug symbols installed (I think everyone should always have debug symbols installed): you see deep details of what's happening, how and where time is spent. Dynamic diagnostics like this, it's crazy. I think probably I should add this to my recommendations for managed Postgres platform builders: to consider Coroot as well, a good thing for observability.
It's open source, right?
It has open source.
I think it's like an open core model.
I'm not sure.
But the main thing is definitely open source.
Again, I'm not sure.
Thanks so much, Nikolay.
Thanks, everyone, for listening.
We'll catch you next week.
Thank you, Michael.
See you soon.
Bye.