Postgres FM - BUFFERS by default
Episode Date: July 29, 2022
Here are links to a few things we mentioned:
- EXPLAIN parameters (PostgreSQL documentation)
- EXPLAIN (ANALYZE) needs BUFFERS (blog post by Nikolay)
- Using BUFFERS for query optimization (blog post by Michael)
- H3 indexes on PostGIS data (blog post by Ryan Lambert)
- Turning BUFFERS on by default (latest patch)
- pgMustard
- explain.depesz.com
- explain.dalibo.com
- Database Lab Engine
------------------------
What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov and @michristofides
If you would like to share this episode, here's a good link (and thank you!)
Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard
With special thanks to:
- Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres.ai. Hey Nikolay, how are you doing today?
Hello, doing great. How are you?
I am doing well, thank you very much. Today we're going to talk about buffers, and I know this is a topic you care a lot about; I have enjoyed reading your opinions on it in the past. So I'm really excited to talk about why they're important. Maybe we can start with what they are?
You know what, let's start slightly off topic. SQL can be written in uppercase or lowercase, or maybe there are more options. I prefer lowercase because I write a lot of SQL; I write SQL code much more than any other language I've used over the last many years. And so I don't like to scream and use uppercase at all. But when I type BUFFERS, I enjoy typing uppercase, because it's so important to use them.
So just to check, do you write explain, open brackets, analyze, comma, and then caps lock on, BUFFERS, and then continue?
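For anyone following along, the command being joked about would look something like this (the table name is hypothetical):

```sql
-- everything lowercase, except BUFFERS in caps, as described above
explain (analyze, BUFFERS)
select * from orders where id = 42;
```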
I mean, I still use uppercase when I explain things to people. When you embed small parts of SQL in your regular text, it makes sense to still use uppercase. That's probably why I remember that. Like BUFFERS: a couple of times per week, I explain to other people how important it is to use buffers. And this is the only word I enjoy typing uppercase.
That's so funny. On that, I think I mix my use of lowercase and uppercase all the time in blog posts. It's really difficult; sometimes it's not clear that you're talking about a keyword or code, and sometimes the code formatting is not great. So yeah, especially when it's inline, I do sometimes use capitals just to show that I'm talking about the keyword, not just a normal word in the sentence. But I'm very inconsistent.
While we're talking about consistency of how we write things, do you always write Postgres? Sometimes PostgreSQL? Do you have, like, a rule on which one you write?
90% Postgres. Just for eight letters instead of 10. Just for the shortness.
I find myself doing the same, probably similar, 90%. I tend to use PostgreSQL if it's a super formal use. So maybe if I'm talking about a version number in a formal setting, I might say PostgreSQL. But I don't have any better rule.
Yeah, I will be trying to pull us into off-topics, you know. It's so bad that Postgres is eight letters, not seven, because in California you can have a custom driver's license plate, and the limit is seven letters. So imagine being in the car with a license plate POSTGRE, without the S.
Well, that would be pretty funny, right?
Yeah. There are people out there that would see that as a hate crime, I think.
Anyway, yeah. So back to a shorter word: buffers. In case anybody's not sure what we're talking about, this is a measure of, well, I guess it's not quite strictly this, but it's a rough measure of I/O in the query, in terms of the number of blocks being read or written by various parts of the query. And it shows up in multiple places. I'm aware of it being in EXPLAIN output, so EXPLAIN ANALYZE mostly, but also now plain EXPLAIN as of recent versions. And of course, as columns in pg_stat_statements, in terms of telling us how much I/O different queries are doing.
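As a sketch, the pg_stat_statements side looks like this (the ordering criterion here is just one choice):

```sql
-- top queries by total shared-buffer work (hits + reads)
select queryid,
       calls,
       shared_blks_hit,
       shared_blks_read,
       shared_blks_dirtied,
       shared_blks_written
from pg_stat_statements
order by shared_blks_hit + shared_blks_read desc
limit 10;
```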
Are there other places that it's showing up?
Yeah, well, let's explain a little bit, very briefly, because it's sometimes confusing for new people. EXPLAIN is just to check what the planner thinks about future query execution, but it does not execute the query. It only shows what the planner thinks right now, for the given data statistics and Postgres parameters. EXPLAIN ANALYZE is for execution. Buffers make sense where? Only in execution, so EXPLAIN ANALYZE, or in both?
I think, since Postgres 13, it also makes sense for the planning stage as well, right? Because the planner can use some buffers to do its work, right? Tricky question, I don't remember 100%.
Well, I guess it's always been possible for the planning stage to read data, but I guess we've not had the ability to ask how much it's doing before. Since Postgres 13 it's possible, I guess, right? So for the planning stage it also shows how many buffers were hit, read, and so on, right?
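A small sketch of that, assuming Postgres 13 or later (Nikolay mentions double-checking this on 14 shortly; the table is hypothetical):

```sql
-- without ANALYZE the query is only planned, not executed;
-- since Postgres 13, BUFFERS then reports buffers used during planning
explain (buffers)
select * from orders where id = 42;
-- output gains a section like:
--   Planning:
--     Buffers: shared hit=... read=...
```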
But what I also wanted to say: there are many confusing places in the database field in general, and in Postgres particularly. For example, there is also the ANALYZE keyword, which is absolutely another thing. It's a command to recalculate statistics. Well, it's not 100% unrelated to plans, because if you run ANALYZE on a table, you can fix the plan, for example, because you will have fresh statistics. But it can be confusing, because ANALYZE after EXPLAIN means a very different thing than ANALYZE on a table, right? So it's basics for people who start with Postgres.
Yeah, absolutely. So today we're probably only going to be talking about ANALYZE in the context of the EXPLAIN parameter.
Exactly. Yeah.
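To make the two meanings concrete, a quick sketch (table name hypothetical):

```sql
-- 1) standalone command: recalculates table statistics for the planner
analyze orders;

-- 2) EXPLAIN parameter: actually executes the query and reports what happened
explain (analyze, buffers)
select * from orders where customer_id = 7;
```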
Right.
Cool.
So is there anything in particular you wanted to make sure we... Like, where did you want to start with this?
Maybe we should discuss what 1,000 buffers hit or read means, right? I've found, working over many years with various engineers, that the most interesting in this case are backend engineers, who are the authors, directly or indirectly, of SQL. They either write SQL directly, or they use some ORM or something that generates it. But they have the biggest influence on the result. And I noticed that most of them don't understand... They maybe understand, but they don't feel what a thousand buffer hits is.
Yeah. Awesome. So when we're talking about this, we're saying, let's say I've done explain analyze on a query that's a bit slower than I'm expecting it to be.
And because they've been told they should always use buffers by some helpful people down the years, maybe they listened to a podcast, they're now using buffers.
But they see under, let's say, an index scan, they see shared hit equals 500,
read equals 500.
So in total, we've got 500 blocks that are shared hits
and 500 blocks that are shared read.
And this in total is a thousand blocks.
And these represent,
each one of these is an eight kilobyte read.
The hits being from Postgres's buffer cache,
and the reads being from, well, maybe from disk,
but maybe from the operating system cache.
We don't, unfortunately, we don't know which one.
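As a sketch, the output being described looks something like this (query and numbers are illustrative):

```sql
explain (analyze, buffers)
select * from orders where created_at > now() - interval '1 day';
-- the plan includes a line like:
--   Buffers: shared hit=500 read=500
-- 1,000 blocks touched in total: 500 from Postgres's buffer cache,
-- 500 from outside it (OS page cache or disk; we can't tell which)
```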
By the way, it would be so great to see... Like for macro query analysis we have an additional extension, pg_stat_kcache, which can show you real disk I/O. But for EXPLAIN we don't have anything. It would be so good to somehow hack that.
We have, I've forgotten the actual wording for it, is it IO timing? The track_io_timing parameter.
But it won't show you that.
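For reference, a sketch of what track_io_timing adds to EXPLAIN; note it reports time spent in read and write calls, and a fast read may still have been served by the OS page cache, so it's not quite real disk I/O:

```sql
set track_io_timing = on;  -- superuser-level setting
explain (analyze, buffers)
select count(*) from orders;
-- plan nodes then gain lines like:
--   I/O Timings: read=12.345
```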
Number of buffers is amount of work. Somebody mentioned this phrase on Twitter; we had yet another discussion about buffers in EXPLAIN on Twitter, and somebody mentioned "amount of work". This is exactly it, a great description of this information. And timing is not amount of work; it's duration of work. Why we are interested in the amount of work, we'll discuss later, because it's maybe more interesting than timing. But first of all, I double-checked EXPLAIN (BUFFERS) without ANALYZE, and it makes sense. I have Postgres 14, but I believe it's since Postgres 13, when it got this planning stage, and I see buffer hits and reads there.
So, 1,000 buffers, is it big or not that big? How to feel it? Because developers may understand, okay, one buffer is 8 KiB. By the way, your article is old school; it says kilobytes in the old-school way. Kibibytes, KiB, because it's not 1,000, it's 1,024, but that's another off-topic. So we have a block size of 8 KiB. Is it big to have 500 hits and 500 reads for buffers at all?
My arithmetic is awful at this kind of thing. That's one of the reasons why, in the tool we make, we display that for people, to try and make it easier. So we take the number of blocks, multiply by 8, divide by 1,000.
1,000 blocks, of course, is 8 mebibytes. It's not that big.
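Postgres can do this arithmetic for you, for example:

```sql
-- 1,000 blocks x 8,192 bytes per block
select pg_size_pretty(1000 * 8192::bigint);
-- returns "8000 kB", i.e. roughly 8 MB
```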
Okay.
It's quite a small number, but it depends, of course.
If you just need to read a very, very small row consisting of a couple of numbers,
probably it's too much to read a tiny row with two columns.
So it's not that big.
But what I'm trying to say is that you're absolutely right. Converting to bytes encourages engineers to imagine how big this data volume is. So if they hear that to read this couple of rows we need to deal with, even hitting, not reading, hitting a gigabyte, it makes them think, oh, something not optimal is happening here, I should find a better way to improve this query, for example, a better index option or something like that. But there is a trick here. When we talk about reads,
something like that but there is a trick here. When we talk about reads,
if we, for example,
okay, 1,000 reads of buffers
can be converted to 8 mebibytes,
but 1,000 hits can be tricky
because some buffers can be hit multiple times
in the buffer pool.
Well, yeah, so I think this is contentious.
I've chosen to mostly ignore this, and if we get some double counting,
then actually, in some ways, Postgres is doing duplicate work.
There is some duplicate work going on, and if we're using it as a measure of work done,
I think it's okay with the double counting.
I also think it's okay.
So we can have much less data stored in memory, but if we need to hit the same buffer multiple times, we still count it the same way as if they were separate buffers. And we just need to understand how much work we need to do. We can imagine cases where the buffer hits, converted to bytes, show an amount of work bigger than the data being hit. It maybe even exceeds the buffer pool size, right?
Well, it could even exceed the amount of data
you have in the database.
It's totally possible.
Theoretically.
Well, I saw an example, I think it was with test data. Did you see the blog post by Ryan Lambert on H3 indexes? It was a few weeks back, and it was really interesting to me. They're a type of geospatial index, and in one of the example query plans he was looking at, he was doing an aggregation on a lot of the data, I believe, and it was doing something like 39 gigabytes of buffers total. And that's a lot, right? But it really shocked him, because he knew his data set was smaller than that. 35 gigabytes of buffers, or buffer hits, buffers total.
So, buffers work total. Because if we say some number of bytes of data, it feels like storage, not an amount of work to be done.
Yeah, good point.
So, I mean, it can lead to confusion much more easily compared to the case when we mention hits and reads all the time, buffer hits, buffer reads. So I think if we convert to bytes, we shouldn't omit these, like, action words.
That's a good, interesting point. Do you mean hits and reads, or do you mean making sure we still mention that they are buffers?
I mean, if we say a number of bytes of buffers, we can provoke the confusion of thinking about it as a number of bytes stored in memory. But if we keep mentioning hits and reads, we avoid this confusion, I think. Maybe it's just my opinion.
Like maybe it maybe just some opinion
right yeah your post also mentions other types of buffers not only shared buffers but also local
and temp and additional confusion that can be made there because local is used for temporary
tables yeah temp buffers used are for some other operations
and it's interesting
that it can lead to confusion.
But I found most of the time
we just work with shared buffers
when we optimize a query.
And let's discuss why
it's more interesting
to focus on buffers
than just on timing.
Yes, I did do a blog post
on this recently.
I'll link it up in the show notes.
This is something I think I learned mostly from listening to you speak in the past.
But the people that are super experienced in Postgres performance work do often tell me that
they focus a lot on buffers at the start. And it took me a while to really work out why that was.
But the super important part is what timings alone give you. So if we just run EXPLAIN ANALYZE and don't ask for BUFFERS, there are a few slight issues with that.
One is we can ask for the same query plan 100 times and get 100 different durations.
You might get some slightly slower ones, some slightly fast ones, and they're mostly around the same time.
But it's different each time.
That's one flaw.
Especially if you're not alone on this server.
Yeah.
And we almost always, we are not alone.
Yeah, really good point.
So conversely, why is it different for buffers?
The number of shared hits might change and the number of shared reads might change. But in combination, unless you change something else, chances are if you run the same query 100 times, those two numbers summed together will sum to the same number each time.
So that is a more consistent number than timings, even if the individual numbers there change.
So that leads on to issue number two, which is if you're looking
at timings, the first time you run a query, that data might not be cached. And as you run it,
yeah, exactly. It might or it might not, but you don't necessarily know without buffers information.
So timings can fluctuate quite a lot based on the cache state. Again with buffers, whilst the number of hits and reads would change, the sum of those two won't change depending on the state of the cache.
And then the third one I pointed out in this blog post doesn't come up as much, but I think it's quite important: the Postgres query planner
is not trying to minimize the number of buffers. What it's trying to do is minimize the amount of time. And sometimes it will pick a plan that is inefficient in terms of buffers if that could make it faster.
So the most obvious example of this, I think, maybe the only one, I'm not sure, is through parallelism.
So if it can spin up multiple workers to do the work quicker and sequentially scan through the entire table,
maybe it would choose to do that, even though on a pure efficiency play,
you might have been able to do less work on a single worker.
So yeah, I'm not sure I see many examples of that,
but it does feel like a flaw of looking at timings alone. Yeah, exactly. I agree with all points.
Also, if you think about time, of course you want to minimize it, this is your final goal. But if you check the query on a clone, for example, which has different hardware, maybe even a different file system and so on, and you deal with time, it doesn't match production. And you think, oh, it's not possible, we need the same level of machine and so on. But then the process becomes very expensive. It's still possible to keep the process cheap: you just need to focus on buffers, forget about timing for a bit, and optimize based on the amount of work. And if we focus on buffer numbers, of course we also focus on row numbers, but buffers are more logical. You have rows, but you don't understand how many row versions were checked and how many dead tuples were removed, because EXPLAIN doesn't show it.
But buffers can help you understand the amount of work to be done. And this is exactly what optimization should be about
because any index is a way to reduce I/O, to reduce the amount of work. Instead of a sequential scan on a large table, for example, when we need to read a lot of pages, an index helps us to read a few pages and reach the target quicker. So an index is a way to reduce the amount of work,
and that's why timing is also reduced.
It's a consequence.
So when you optimize something, analyze a query, optimize it, why deal with the consequences instead of the core of the optimization, the amount of work, so, buffers?
I think I completely agree with you, but I do have a couple of questions. I think it really clicks for people when they see that they didn't have an index before, a sequential scan, it read 500 megabytes of data maybe. And then when they add an index, it's able to look up the exact same rows in 24 kilobytes or something, you know?
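A hedged sketch of that before-and-after, with made-up numbers in the ballpark described:

```sql
explain (analyze, buffers) select * from orders where customer_id = 7;
--   Seq Scan on orders ...
--     Buffers: shared read=64000      -- 64,000 blocks x 8 KiB = ~500 MB

create index on orders (customer_id);

explain (analyze, buffers) select * from orders where customer_id = 7;
--   Index Scan using orders_customer_id_idx on orders ...
--     Buffers: shared hit=3           -- 3 blocks x 8 KiB = 24 kB
```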
Right, right. And instead of seeing how timing was reduced and thinking, oh, good, we see how buffers are reduced and understand why timing was also reduced. We see the reason for this reduction of time.
Exactly. I think there's a risk that people see an index scan and think, oh, an index is magic, that's why it's fast. And it's like, no, it's not magic, it just lets you look things up much more efficiently and therefore faster. So I'm completely with you on that. But where I lose you a little bit is that there are expensive operations that don't report buffers. So, for example, a sort in memory, or some aggregations. Maybe these would count as CPU-intensive rather than I/O, and maybe that's far less often the bottleneck, but we don't get any buffers reported for them if they're done in memory. So I like getting both timing and buffers and using them in combination.
Yeah, of course, we still have other information in the plan, so we can understand,
okay, I/O was quite low, like four buffer hits and that's it, but we have 100 milliseconds, what's happening here? Of course that happens sometimes, but quite rarely. Would you agree that most often the reason for a slow query is a lot of I/O happening under the hood, right?
Well, even with the sort case, right? Why is the sort taking so long? It's because you're sorting a million rows. And if you could instead sort 10, the first 10 that you need, maybe you're paginating or something, you can get those ordered from an index. You're going to massively reduce the I/O and therefore not need to sort as many rows in the first place. So even when it's not the bottleneck, I think it's often the solution. Even if that sort of a million rows is done in memory, it's still going to be a lot, lot slower than only fetching and sorting 10.
Yeah. We also may think about efficiency in the following way.
Okay, we need to return 25 rows or 10 rows.
How many buffers were involved in the whole query?
And the buffer numbers are reported in cumulative form.
So you can look at the root of the query tree and see the total number for everything included underneath. So the question will be,
how many buffers were involved to return our 10 rows?
If it's 10 buffers, it's quite good.
If it's one, it's excellent.
So it means that we had a scan of just one buffer,
one page, and all rows happened to be present in this page.
So few buffers is good to return 10 rows.
1,000 already not so good.
We discussed that it's just
8 mibibytes, but to return 10 rows
probably it's not that efficient.
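A sketch of reading that cumulative number off the root node (illustrative output):

```sql
explain (analyze, buffers)
select * from orders order by created_at desc limit 10;
-- Limit ...
--   Buffers: shared hit=12        -- total for the whole query
--   -> Index Scan Backward using orders_created_at_idx on orders ...
--        Buffers: shared hit=12   -- already includes the work beneath it
-- 12 buffers to return 10 rows: quite reasonable
```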
But also, two slightly deeper comments related to EXPLAIN. It's interesting: as I mentioned, for pg_stat_statements we have the pg_stat_kcache extension. Unfortunately it's not available on most managed Postgres services like RDS, but it's available to all people who manage Postgres themselves. This excellent extension adds information about CPU and real disk I/O, and for CPU it can even distinguish user and system CPU time, also context switches. Excellent. But for EXPLAIN we don't have it.
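For reference, a hedged sketch of the macro-level view pg_stat_kcache provides; the column names (exec_reads, exec_user_time, and so on) are from recent versions of the extension and may differ in older ones:

```sql
-- real (physical) disk reads and CPU time per statement,
-- joined to pg_stat_statements by queryid
select s.query,
       k.exec_reads,        -- bytes read from disk, below the OS page cache
       k.exec_user_time,    -- user CPU time
       k.exec_system_time   -- system CPU time
from pg_stat_statements s
join pg_stat_kcache() k using (queryid)
order by k.exec_reads desc
limit 10;
```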
And a simple idea: we could still get this information if we have access to /proc on the host. We know the process ID, and even if we have parallel workers, we can extract their process IDs, and we could take very interesting information about real disk I/O that happened, and also CPU. You mentioned CPU-intensive work. It could be present in EXPLAIN, right? Somehow, like an additional extension or something, or maybe some hacked Postgres for non-production environments. I think it's quite an interesting area to explore, to improve the observability of single-query analysis, actually. And it can be helpful to see that it's very CPU-intensive work, I/O was low, that's why the query was slow, you just see how much CPU was spent. Or something like this. And of course, real disk I/O is also interesting to see.
And another thing: I lack the ability to understand the second-best and third-best plans in EXPLAIN.
Yeah.
You see, because the planner makes decisions based on virtual cost, something abstract, right?
Yeah.
Which, of course, can be tuned according to parameters like random_page_cost, seq_page_cost, and so on. But you can tune costs. And the planner never thinks about which CPU is used. It doesn't think about it. And how many gigabytes of memory we have, it doesn't think about that either.
So it does factor those into the costs, right? Like it does cpu_tuple_cost and things like that. But I think I know what you mean. It doesn't factor in the server parameters. The planner doesn't know what hardware we have.
Yeah, sure.
And we can even fool the planner, and we do it for query optimization in non-production environments. For example, on production we have almost a terabyte of RAM. On non-production, we don't want to pay for it; we have, I don't know, 32 gigabytes of RAM, and the buffer pool is much smaller than on production. It's not a problem. The planner doesn't even look at the shared_buffers setting value at all. It only looks at effective_cache_size. So you can say we have a terabyte of memory, we set it to like three-fourths of that, the usual approach. So you trick the planner, and it behaves exactly like on production: it chooses the same plan.
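A minimal sketch of that trick on a thin clone (the values are illustrative):

```sql
-- production has ~1 TB of RAM; the clone only 32 GB, but the planner
-- only consults effective_cache_size, not actual memory or shared_buffers
set effective_cache_size = '768GB';  -- ~3/4 of production's RAM, the usual rule of thumb
explain select * from orders where customer_id = 7;  -- now plans as production would
```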
so you you trick the planner and it behaves exactly like on production it chooses the same
plan but what i'm telling like sometimes we, Planner thinks this is the best option to execute the query based on cost, which depends on statistics and our settings.
But we see a lot of IO happening.
Buffers option shows it.
Why?
What else do we have on the plate?
Planner had on the plate.
We don't see it, unfortunately.
I've heard Mongo has this capability, to explain and provide the second option as well.
So what do we usually do?
We apply a trick.
We say, okay, we had, like, a bitmap scan here. Set enable_bitmapscan to off, and try to check what the other option was. So we put a penalty on bitmap scans, and we see the second possible option. Probably the second, we're not sure, but this is a trick we use.
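The trick looks like this (session-local, so safe to experiment with):

```sql
set enable_bitmapscan = off;  -- heavily penalize bitmap scans for this session
explain (analyze, buffers)
select * from orders where customer_id = 7;  -- reveals the planner's next choice
reset enable_bitmapscan;
```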
Well, that's what I wanted to ask. I think it's a really difficult problem; I've not looked into it myself. But what do we mean by second best? Do we mean the second-best plan that's sufficiently different?
What if it did a bitmap scan? Slightly worse cost.
Yeah, so I understand what you mean, but I think we might end up with not quite what we wanted. So if we actually want to see what this would do with an index scan of the same table, maybe disabling bitmap scans is the perfect way to go. But what if the second-best plan Postgres could have chosen would have been a bitmap scan of a different index? Would we want to see that?
Right, good point.
Or what if it just changed the join order a little bit, or the index scan direction? There are so many minor things that it's choosing between.
I agree, I agree. But my intent is to understand what the other options were, several of them maybe, to understand their costs and their buffers, their I/O as well, and compare. Sometimes the cost can be only slightly different, like in some edge case, while the buffers are drastically fewer. So we start thinking maybe we need to adjust our settings for the planner. For example, random_page_cost, default four, should go down towards seq_page_cost, which is one.
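For example, a common adjustment on SSD-backed storage (a judgment call, not a universal recommendation):

```sql
-- bring random reads close to sequential reads in the cost model
alter system set random_page_cost = 1.1;  -- default is 4
select pg_reload_conf();
```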
And this is exactly about understanding the second option. Okay, maybe you're right, maybe there are many options in between, so this "second" may be number 10 already, I don't know. But this is what I lack in EXPLAIN: two things, real physical operations, like CPU and real disk I/O, and also the second, third, other options, and what their costs were, right?
Yeah. So it would be good to... You mentioned somewhere that EXPLAIN is already too complex, too complicated to read. It requires a lot of experience, but it still lacks many interesting points, in my opinion.
I think this is such an
interesting trade-off though, right? And this takes us onto the last topic I did want to make
sure we discussed. I think there's a trade-off between being useful for people that are new to
Postgres versus being useful for super experienced people.
And I'm not sure exactly where we should be drawing that line or where the people in charge
should be drawing that line. And we've talked for quite a while about, you know, defaults and what
should be on by default. So EXPLAIN itself is fairly simple, but with EXPLAIN ANALYZE, once we have timings, there's the extra, let's say, penalty of also asking for BUFFERS, maybe even VERBOSE and other parameters, but definitely BUFFERS, based on our whole conversation today. Should that be on by default? So that when anybody asks for EXPLAIN ANALYZE, they also get those buffer statistics.
Even if they don't know about them, don't ask for them.
You can turn them off.
Maybe if you're an advanced user, you know, you don't want them for some reason, but you
have the ability to shape what beginners ask for.
So if they're reading some guide from three years ago that says use explain analyze, then
they'll get buffers on by default.
Default is very important.
Yeah. Do you have some stats about your users, how many of them have buffers included?
Yeah, last time I checked, it was 95% that do include buffers.
95? Wow, that's impressive.
Well, 95% also include VERBOSE. Not the exact same 95%, but almost.
Because I guess your documentation offers it, right?
Not just that we offer it; the tool does not support the text format of EXPLAIN. So automatically, if somebody tries to get EXPLAIN ANALYZE and paste it into pgMustard, it will tell them we need at minimum FORMAT JSON, but by that point we're also saying, please ask for EXPLAIN (ANALYZE, BUFFERS), for both.
Right, so you're supposed to use it, that's why they use it. If you check the publicly available plans on explain.depesz.com, for example, or explain.dalibo.com, I'm sure more than 50% will be without buffers, unfortunately, because this is the default behavior. And I think,
interestingly enough, there is a consensus. Based on what I saw in the hackers mailing list, I didn't see big objections; it looks like people think that it should be on by default. But still, somehow, the patch needs review. Right now there are several iterations already, let's include the link. If someone can do a review, it would be a great help for the community, because I think we should have buffers enabled by default. I hope we convinced people,
right, that buffers should be used. We said that it's important sometimes to convert the numbers to bytes to have a feeling of how big that is. We discussed some lacking features of EXPLAIN that are probably tricky to develop but would still be good to have. And also we discussed that it's possible to run EXPLAIN (ANALYZE, BUFFERS), of course, on a different environment than production. And in this case, I also would like to mention that our tool,
Database Lab Engine, has an additional chatbot which can be run in Slack or in the browser; it's called Joe bot. It also converts to bytes, and it allows you to have a very good workflow of SQL optimization where you don't touch production.
It's really cool. It even estimates, if, let's say, the timing is 100 milliseconds, how much faster that would be on production, right?
Yeah, well, this is tricky. This option is experimental; it's very tricky to develop, and we still don't consider it a final version. But it's not very much needed: people are fine with just seeing buffers and a different timing, because of a different file system, different state of caches, and so on. But buffers, if we have this shift in mind to focus on buffers when performing optimization, this is the perfect place to play with queries. And Database Lab Engine also provides the ability to create an index without disturbing production and your colleagues.
This is very important. And to see if it is helpful to reduce the amount of work, so buffer numbers, and therefore to reduce timing at the end of the day. So I recommend checking this out on postgres.ai. And of course, pgMustard for understanding plans.
Maybe that's it, right?
So we discussed everything we wanted, right?
Yeah.
So the final thing is: if you or anybody you know, any of your friends, are able to review Postgres patches, please, please, please do check out that one. At the moment, the way Postgres development works, there's a new version of Postgres due out at the back end of this year, Postgres 15.
That's already frozen.
Yes.
So that's already feature-frozen. So even if we do manage to get this committed soon,
it still at best will come out in just over a year's time.
So even if it makes it into Postgres 16.
So these things can take years.
So don't expect fast results.
But if you can, that would be wonderful.
Thank you.
The current CommitFest closes on July 31st.
So in five days.
So get your skates on.
Right, but there will be one more CommitFest, definitely.
A few, actually, for Postgres.
Yeah, of course.
Okay, good.
It was interesting, I guess.
I hope so.
I hope everyone likes our podcast.
We need your help.
Please like, subscribe, and please, please share the links in your social networks and
groups where you discuss Postgres, database engineering, and so on.
Thank you, everyone, for listening.
Thanks so much.
See you next week.