Postgres FM - BUFFERS by default

Episode Date: July 29, 2022

Here are links to a few things we mentioned:

- EXPLAIN parameters (PostgreSQL documentation)
- EXPLAIN (ANALYZE) needs BUFFERS (blog post by Nikolay)
- Using BUFFERS for query optimization (blog post by Michael)
- H3 indexes on PostGIS data (blog post by Ryan Lambert)
- Turning BUFFERS on by default (latest patch)
- pgMustard
- explain.depesz.com
- explain.dalibo.com
- Database Lab Engine

------------------------

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov and @michristofides.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:
- Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres.ai. Hey Nikolay, how are you doing today? Hello, doing great. How are you? I am doing well, thank you very much. Today we're going to talk about buffers, and I know this is a topic you care a lot about and have enjoyed reading your opinions on them in the past, so really excited to talk about why they're important. Maybe we can start with what they are. You know what, let's start slightly off topic. SQL can be written in uppercase or lowercase, or, like, maybe some, maybe there are
Starting point is 00:00:37 more options. So I prefer lowercase, because I write a lot of SQL. So it's like I write SQL code much more than other code, like any other language I use over the last many years. And so I don't like to scream and use uppercase at all. But when I type buffers, I enjoy typing uppercase because it's so important to use them. So just to check, do you write explain, open brackets, analyze, comma, and then caps lock on, buffers, and then you go continue? I mean, I still use uppercase when I explain things to people, and when you embed small parts of SQL in your regular text, it makes sense to
Starting point is 00:01:28 still use uppercase. That's why probably I remember that. Like buffers, just a couple of times per week, I explain how important it is to use buffers to other people. And this is the only
Starting point is 00:01:44 word I enjoy typing uppercase that's so funny on that i think i mix my use of lowercase and uppercase all the time in blog posts it's really difficult to know sometimes it's not clear that you're talking about a keyword or code and sometimes the code formatting is not great so yeah yeah, especially when it's inline, I do sometimes use capitals just to show that I'm talking about the keyword, not just a normal word in the sentence. But I'm very inconsistent. While we're talking about consistency of how we write things,
Starting point is 00:02:17 do you always write Postgres? Sometimes PostgreSQL? Do you have like a rule on which one you write? 90% Postgres. Just for eight letters instead of 10. Just for the shortness. I find myself doing the same, probably similar, 90%. I tend to use PostgreSQL if like it's a super formal use. So maybe if I'm talking about a version number in a formal setting, I might say PostgreSQL, but I don't have any better
Starting point is 00:02:45 Yeah, I will be trying to pull us to off topics, you know. Like, it's so bad that Postgres is eight letters, not seven, because in California you can have a custom driver license plate and the limit is seven letters. So imagine being in the car with a license plate "Postgre", without the S. Well, so that would be pretty funny, right? Yeah. There are people out there that would see that as a hate crime, I think. Anyway, yeah. So back to a shorter word, buffers. So in case anybody's not sure what we're talking about, this is a measure of, well, I guess it's not quite strictly this, but it's a rough measure of IO in the query, so in terms of the number of blocks being read or written by various parts of the query. And it shows up in multiple places. I'm aware of it being in explain output.
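As a rough sketch of what is being described here, with an illustrative table and filter rather than anything from the episode:

    -- execute the query and report buffer usage for each plan node
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE customer_id = 42;

Each node in the resulting plan can then carry a line like "Buffers: shared hit=... read=...", counting 8 KiB blocks.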
Starting point is 00:03:46 So explain analyze mostly, but also now explain as of recent versions. And of course, as columns in pg_stat_statements, in terms of telling us how much IO different queries are doing. Are there other places that it's showing up? Yeah, well, let's explain a little bit very, very briefly, because it's very confusing sometimes for new people. Explain is just to check what the planner thinks about future query execution, but it does not execute the query. It only shows what the planner thinks right now for given data statistics and parameters of Postgres. Explain analyze is for execution. Buffers make sense where?
Starting point is 00:04:31 Only in execution? So explain, analyze, or in both? It's, I think, like, you know, like, since Postgres 13, it also makes sense for planning stage as well, right? Because it can use some buffers for planner to make it work, right? Tricky question. I don't remember 100%.
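For reference, the pg_stat_statements columns mentioned a moment ago look roughly like this; the extension has to be installed and preloaded, and the ordering is just one way to slice the data:

    SELECT queryid,
           calls,
           shared_blks_hit,
           shared_blks_read,
           shared_blks_dirtied,
           shared_blks_written
    FROM pg_stat_statements
    ORDER BY shared_blks_hit + shared_blks_read DESC
    LIMIT 10;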
Starting point is 00:04:51 Well, I guess it's always been possible for the planning stage to read data, but I guess we've not had the ability to ask it how much it's doing before. Since Postgres 13 it's possible, I guess, right? So for the planner stage it also shows how many buffers were hit, read, and so on, right? But what I also
Starting point is 00:05:16 wanted to say is, like, there are many confusing places in the database field in general, and Postgres particularly. For example, there is also the analyze keyword, which is absolutely another thing. It's a command to recalculate statistics. Well, it's not 100%, like, far from getting the plan, because if you run analyze on a table, you can fix the plan, for example, because you will have fresh statistics.
Starting point is 00:05:44 But it can be confusing, because analyze after explain means a very different thing than analyzing a table, right? So it's like basics for people who start with Postgres. Yeah, absolutely. So probably today we're only going to be talking about analyze in the context of the explain parameter. Exactly. Yeah. Right. Cool. So is there anything in particular you wanted to make sure we, like, where did you want to start with this? Maybe we should discuss what 1,000 buffers hit or read means, right?
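To make that distinction concrete, here is a small illustration, with the table name invented for the example:

    -- the standalone command: recalculates planner statistics for a table
    ANALYZE orders;

    -- the EXPLAIN parameter: actually executes the query and shows the real plan
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM orders;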
Starting point is 00:06:18 Like, it's, I found working over many years, working with various engineers, and most interesting in this case are backend engineers who are the authors directly or indirectly of SQL. They either write SQL directly or they use some ORM or something that generates it. But they have the biggest influence on the result. And I noticed that most of them don't understand what... They maybe understand, but they don't feel like a thousand buffer hits. What is it? Yeah. Awesome. So when we're talking about this, we're saying, let's say I've done explain analyze on a query that's a bit slower than I'm expecting it to be. And because I've been told I always should use buffers from some helpful people down the years, maybe they listen to a podcast and they're now using buffers.
Starting point is 00:07:18 But they see under, let's say, an index scan, they see shared hit equals 500, read equals 500. So in total, we've got 500 blocks that are shared hits and 500 blocks that are shared read. And this in total is a thousand blocks. And these represent, each one of these is an eight kilobyte read. The hits being from Postgres's buffer cache,
Starting point is 00:07:45 and the reads being from, well, maybe from disk, but maybe from the operating system cache. We don't, unfortunately, we don't know which one. By the way, it would be so great to see. Like we have for macro query analysis, we have the additional extension pg_stat_kcache, which can show you real disk IO, but for explain we don't have anything. It would be so good somehow to hack
Starting point is 00:08:11 We have, I've forgotten the actual wording for it, is it IO timing? So we can do show IO timing, or, I've forgotten the exact keyword, the track_io_timing parameter. But it won't show you, like, number of buffers is amount of work. Somebody mentioned this phrase on Twitter. We had yet another discussion about buffers in explain on Twitter, and somebody mentioned amount
Starting point is 00:08:36 of work. This is exactly, this is a great description of this information, and timing is not amount of work, it's duration of work. Why we are interested in the amount of work, we'll discuss later, right, because it's maybe more interesting than timing. But first of all, I double checked explain buffers without analyze, it makes sense. I have Postgres 14, but I believe it's since Postgres 13, when it got this planning stage, and I see buffers hits and reads there.
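This is the check being described: on Postgres 13 and later, even a plain explain with the buffers option can report buffers used during planning. A sketch, with the output abridged and the numbers invented:

    EXPLAIN (BUFFERS)
    SELECT * FROM orders WHERE id = 1;
    -- the plan output can end with a section like:
    --   Planning:
    --     Buffers: shared hit=12 read=3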
Starting point is 00:09:11 So, 1,000 buffers, is it big or not that big? How to feel it? Because developers, in their mind, they may understand, okay, one buffer is 8 kilobytes. By the way, your article is old school. It says kilobytes in the old school way, not kibibytes, because it's not 1000, it's 1024, but that's another off topic. So we have block size 8 kibibytes. Is it big to have 500 hits and 500 reads for buffers at all?
Starting point is 00:09:46 My arithmetic is awful at this kind of thing. That's one of the reasons why in the tool we make, we display that for people to try and make that easier. So we take the number of blocks, multiply by 8, divide by 1,000. 1,000 blocks, of course, is 8 mebibytes. It's not that big. Okay. It's quite a small number, but it depends, of course.
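As a worked version of that conversion, just redoing the arithmetic being discussed:

    -- 1,000 buffers of 8 KiB each
    SELECT 1000 * 8192 AS bytes,                                -- 8,192,000 bytes
           pg_size_pretty((1000 * 8192)::bigint) AS readable,   -- "8000 kB", i.e. roughly 8 MB
           round(1000 * 8192 / 1024.0 / 1024.0, 1) AS mib;      -- about 7.8 MiB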
Starting point is 00:10:14 If you just need to read a very, very small row consisting of a couple of numbers, probably it's too much to read a tiny row with two columns. So it's not that big. But what I'm trying to tell is that you're absolutely right. Converting to bytes encourages engineers to think about, to imagine how big this data volume is. So if they hear that, to read this couple of rows, we need to deal with even hitting, not reading, hitting a
Starting point is 00:10:47 gigabyte, so it makes them think, oh, something not optimal is happening here, I should find a better way to improve this query, for example to have a better index option or something like that. But there is a trick here. When we talk about reads, if we, for example, okay, 1,000 reads of buffers can be converted to 8 mebibytes, but 1,000 hits can be tricky, because some buffers can be hit multiple times
Starting point is 00:11:21 in the buffer pool. Well, yeah, so I think this is contentious. I've chosen to mostly ignore this, and if we get some double counting, then actually, in some ways, Postgres is doing duplicate work. There is some duplicate work going on, and if we're using it as a measure of work done, I think it's okay with the double counting. I also think it's okay. So we can have stored much less data in
Starting point is 00:11:47 memory, but if we need to hit the same buffer multiple times, we still count it the same way as if it would be separate buffers. And we just need to understand how much work we need to do. We can imagine cases when, with the buffer hits converted to bytes, you can see the amount of work being bigger than the buffers to be hit. It exceeds the buffer pool size maybe, right? Well, it could even exceed the amount of data you have in the database. It's totally possible.
Starting point is 00:12:17 Theoretically. Well, I saw an example. I think it was through test data, but did you see the blog post by Ryan Lambert on H3 indexes? It was a few weeks back and it was really interesting to me. They were a type of geospatial index, and one of his example query plans he was looking at, he was doing an aggregation on a lot of the data, I believe, and it was doing something like 39 gigabytes of buffers total. And that's a lot, right? But it really shocked him, because his data set, he knew, was smaller than that. 35 gigabytes of buffers, or
Starting point is 00:12:53 buffer hits? Buffers total. So, buffers work total, because, like, if we say some number of bytes of data, it feels like storage, not amount of work to be done. Yeah, good point. So, like, I mean, it can lead to confusion much easier compared to the case when we mention hits and reads all the time, buffer hits, buffer reads. So I think, if we convert to bytes, we shouldn't omit these, like, action words. That's a good, interesting point. Do you mean like hits and reads, or do you mean to make sure we still mention that they are buffers? I mean, if we say a number of bytes of buffers, we can provoke the confusion to think about it
Starting point is 00:13:35 as a number of bytes stored in memory. But if we keep mentioning hits and reads, we avoid this confusion, I think. Like, maybe it's just some opinion, right? Yeah, your post also mentions other types of buffers, not only shared buffers, but also local and temp, and additional confusion that can be made there, because local is used for temporary tables. Yeah, temp buffers are used for some other operations, and it's interesting
Starting point is 00:14:07 that it can lead to confusion. But I found most of the time we just work with shared buffers when we optimize a query. And let's discuss why it's more interesting to focus on buffers than just on timing.
Starting point is 00:14:21 Yes, I did do a blog post on this recently. I'll link it up in the show notes. This is something I think I learned mostly from listening to you speak in the past. But the people that are super experienced in Postgres performance work do often tell me that they focus a lot on buffers at the start. And it took me a while to really work out why that was. But the super important parts are that timings alone. So if we just get explain, analyze and don't ask for buffers, there are a few slight issues with that.
Starting point is 00:14:54 One is we can ask for the same query plan 100 times and get 100 different durations. You might get some slightly slower ones, some slightly fast ones, and they're mostly around the same time. But it's different each time. That's one flaw. Especially if you're not alone on this server. Yeah. And we almost always, we are not alone. Yeah, really good point.
Starting point is 00:15:15 So conversely, why is it different for buffers? The number of shared hits might change and the number of shared reads might change. But in combination, unless you change something else, chances are if you run the same query 100 times, those two numbers summed together will sum to the same number each time. So that is a more consistent number than timings, even if the individual numbers there change. So that leads on to issue number two, which is if you're looking at timings, the first time you run a query, that data might not be cached. And as you run it, yeah, exactly. It might or it might not, but you don't necessarily know without buffers information. So timings can fluctuate quite a lot based on the cache state. Again, buffers: whilst the number of hits and reads would change, the sum of those two
Starting point is 00:16:06 won't change depending on the state of the cache. And then the third one I pointed out in this blog post doesn't come up as much, but I think it's quite important: the Postgres query planner is not trying to minimize the number of buffers. What it's trying to do is minimize the amount of time. And sometimes it will pick a plan that is inefficient in terms of buffers if it could make it faster. So the most obvious example of this, I think, maybe the only one, I'm not sure, is through parallelism. So if it can spin up multiple workers to do the work quicker and sequentially scan through the entire table, maybe it would choose to do that, even though on a pure efficiency play, you might have been able to do less work on a single worker. So yeah, I'm not sure I see many examples of that,
Starting point is 00:17:00 but it does feel like a flaw of looking at timings alone. Yeah, exactly. I agree with all points. Also, if you think about time, of course you want to minimize it, this is your final goal. But indeed, if you check the query on a clone, for example, which has different hardware, maybe even a different file system and so on, it makes you think about timing, like you deal with time, and it doesn't match production. And you think, oh, it's not possible. We need the same level of machine and so on. But then the process becomes very expensive. But it's still possible to keep the process cheap. You just need to focus on buffers, forget about timing for a bit, optimize based on the amount
Starting point is 00:17:42 of work. And if we focus on buffer numbers, of course, we also focus on row numbers, but it's like more logical. You have rows, but you don't understand how many row versions were checked and how many dead tuples were removed, because explain doesn't show it. But buffers can help you understand the amount of work to be done. And this is exactly what optimization should be about, because any index is a way to reduce I/O, just to reduce the amount of work. Instead of a sequential scan, for example, on a large table, when we need to read a lot of pages,
Starting point is 00:18:20 index helps us to read a few pages and reach the target quicker. So the index is the way to reduce amount of work, and that's why timing is also reduced. It's a consequence. So when you optimize something, analyze a query, optimize it, why to deal with consequences instead of the core of optimization, the amount of work, so buffers?
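The kind of before-and-after check this describes might look like the following, with the table, column and value invented for the sketch:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE customer_id = 42;
    -- note the shared hit/read totals on the Seq Scan node

    CREATE INDEX ON orders (customer_id);

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE customer_id = 42;
    -- the index-based scan should now touch far fewer buffers,
    -- and the lower timing follows from that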
Starting point is 00:18:43 I think I completely agree with you, but I do have a couple of questions. I think people really click when they see that they didn't have an index before, sequential scan, it read 500 megabytes of data maybe. And then when they add an index, it's able to look up the exact same rows in 24 kilobytes or something, you know? Right, right. And instead of seeing how timing reduced and thinking, oh, good, we see how buffers are reduced and understand why timing was also reduced. Like, we see the reason for this reduction of time. Exactly. I think there's a risk that people think, they see an index scan, they think, oh, an index is magic, that's why it's fast. It's like, oh no, it's not magic, it just lets you look it up much more efficiently and therefore faster. So I'm
Starting point is 00:19:30 completely with you on that. But where I lose you a little bit is that there are expensive operations that don't report buffers. So, for example, a sort in memory, or some aggregations. Maybe these would count as CPU intensive rather than IO, and maybe that's far less often the bottleneck, but we don't get any buffers reported for them if they're done in memory. So I like getting both timing and buffers and using them in combination. Yeah, of course, we still have other information in the plan, so we can understand, okay, IO was quite low, buffers like four buffers hit and that's it, but we have 100 milliseconds, what's happening here, right? Like, of course it happens sometimes, but quite rarely, would you agree? Like, most often the reason for a slow query is a lot of IO happening under the hood, right?
Starting point is 00:20:26 Well, even with the sort case, right? Like, why is the sort taking so long? It's because you're sorting a million rows. And if you could instead sort 10, the first 10 that you need, maybe you're paginating or something, you can get those ordered from an index. You're going to massively reduce the IO, therefore not need to sort as many rows in the first place. So even when it's not the bottleneck, I think it's often the solution. Even if you speed up that sort of a million rows, it's still going to be a
Starting point is 00:20:55 lot, lot slower than only fetching and sorting 10. Yeah, we also may think about, like, efficiency in the following way. Okay, we need to return 25 rows or 10 rows. How many buffers were involved in the whole query? And the buffer numbers are reported in an accumulative form. So you can look at the root of the query tree and see the total number for everything included underneath. So the question will be, how many buffers were involved to return our 10 rows? If it's 10 buffers, it's quite good.
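A sketch of the pagination idea mentioned above, again with invented names:

    -- without a matching index, Postgres may have to fetch and sort many rows
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;

    -- with an index matching the ORDER BY, the first 10 rows can come straight
    -- from the index, so far fewer buffers and no large sort
    CREATE INDEX ON orders (created_at DESC);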
Starting point is 00:21:33 If it's one, it's excellent. So it means that we had a scan of just one buffer, one page, and all rows happened to be present in this page. So few buffers is good to return 10 rows. 1,000 is already not so good. We discussed that it's just 8 mebibytes, but to return 10 rows it's probably not that efficient.
Starting point is 00:21:52 But also two slightly deeper comments related to explain. It's interesting, like as I mentioned, for pg_stat_statements we have the pg_stat_kcache extension. Unfortunately not available on most managed Postgres services like RDS,
Starting point is 00:22:08 but available for all people who manage Postgres themselves. So this excellent extension, it adds information about CPU and real disk I/O, and for CPU it can even distinguish user
Starting point is 00:22:23 and system CPU time, also context switches. Excellent. But for explain we don't have it. And a simple idea: we could still get this information if we have access to /proc on the host. We know the process ID, and even if we have
Starting point is 00:22:39 parallel workers, we can extract their process IDs, and we could get very interesting information about real disk I/O that happened, and also CPU. You mentioned CPU intensive work. It could be present in explain, right? Somehow, like an additional extension or something, or maybe like some hacked Postgres for non-production environments.
Starting point is 00:23:01 I think it's quite an interesting area to explore, to improve observability of single query analysis, actually. And it can be helpful to see that it's very CPU-intensive work: I/O was low, that's why the query was slow, you just see how much CPU was spent. Or something like this. And of course, real disk I/O is also interesting to see.
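For the macro level, a pg_stat_kcache query along these lines can surface physical reads and CPU time per database; the exact column names vary between extension versions, so treat this as an assumption to check against whatever version is installed:

    SELECT datname,
           exec_user_time,     -- CPU time spent in user space
           exec_system_time,   -- CPU time spent in the kernel
           exec_reads,         -- bytes actually read from disk
           exec_writes         -- bytes actually written to disk
    FROM pg_stat_kcache
    ORDER BY exec_reads DESC;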
Starting point is 00:23:23 And another thing, I lack the ability to understand the second best and third best plan in explain. Yeah. You see, because the planner makes decisions based on virtual cost, like something abstract, right? Yeah. Which, of course, can be tuned according to parameters like random_page_cost, seq_page_cost, and so on. But you can tune costs. And the planner never thinks about which CPU is used. It doesn't think about it.
Starting point is 00:23:58 And how many gigabytes we have doesn't think about it. So it does factor those into the costs, right? Like it does CPU tuple costs and things like that. But I think I know what you mean. It doesn't factor in the server parameters. The planner doesn't know what hardware we have. Yeah, sure.
Starting point is 00:24:15 And the planner even, we can fool the planner and we do it for query optimization in non-production environments. So when we have, for example, on production we have almost a terabyte of RAM. On non-production environments. So when we have, for example, on production, we have almost a terabyte of RAM. On non-production, we don't want to pay for it. We have, for example, I don't know, 32 gigabytes of RAM, and the buffer pool is much smaller than on production.
Starting point is 00:24:37 It's not a problem. The planner doesn't even look at the shared_buffers setting value at all. It only looks at effective_cache_size, so you can say we have a terabyte of memory, so we set, like, three fourths of that, the usual approach. So you trick the planner, and it behaves exactly like on production, it chooses the same plan. But what I'm telling is, like, sometimes the planner thinks this is the best option to execute the query based on cost, which depends on statistics and our settings. But we see a lot of IO happening. The buffers option shows it.
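A minimal sketch of that clone trick, with the sizes purely illustrative:

    -- on a small clone, tell the planner to assume a production-like cache size
    SET effective_cache_size = '768GB';   -- e.g. about 3/4 of a 1 TB production box

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE customer_id = 42;
    -- the plan shape should now match production, even though timing will differ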
Starting point is 00:25:17 Why? What else do we have on the plate? Planner had on the plate. We don't see it, unfortunately. I've heard Mongo has this capability to explain this and provide the second option as well. So what do we usually do? We apply a trick. We say, okay, we had like bitmap scan here.
Starting point is 00:25:39 Set enable_bitmapscan to off, and try to check what the other option was. So we put a penalty on bitmap scans, so we see the second possible option. Probably the second, we're not sure, but this is a trick. Well, that's what I wanted to ask, like, how... I think it's a really difficult problem. I've not looked into it myself, but what do we mean by second best? Do we mean the second best plan that's sufficiently different? What if it did a bitmap? Slightly worse cost.
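The trick being described, sketched with an invented query:

    SET enable_bitmapscan = off;   -- heavily penalize bitmap scans for this session
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE customer_id = 42;
    RESET enable_bitmapscan;       -- put the planner back to normal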
Starting point is 00:26:08 Yeah, so I understand what you mean, but I think we might end up with not quite what we wanted. So if we actually want to see what this would do with an index scan of the same table, maybe disabling bitmap scan is the perfect way to go. But what if the second best plan Postgres could have chosen would have been a bitmap scan of a different index? Would we want to see that? Right, good point. Or, like, what if it just changed the join order a little bit, or the index scan
Starting point is 00:26:36 direction, or, like, there's so many minor things that it's choosing between. I agree, I agree. But my intent is to understand what your other options are, several of them maybe, to understand their costs and their buffers, like their I/O as well, and compare. Sometimes cost can be slightly different, like on some edge case, but buffers are drastically fewer. So we start thinking maybe we need to adjust our settings for the planner. For example, random_page_cost, default 4, should go down towards seq_page_cost, which is 1. And this, like, exactly understanding the second option. Okay, maybe you're right. Maybe there are many options in between. So this second may be number 10 already. I don't know. But this is what I lack in explain.
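The kind of adjustment being described, as a hedged sketch; the exact value depends on the storage, and 1.1 is just a commonly cited figure for SSDs:

    -- bring random access cost closer to sequential cost on fast storage
    ALTER SYSTEM SET random_page_cost = 1.1;   -- default is 4, seq_page_cost default is 1
    SELECT pg_reload_conf();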
Starting point is 00:27:24 Two things: real physical operations, like CPU and real disk IO, and also second, third, other options, what their costs were, right? Yeah. So it would be good to... Like, you mentioned somewhere that it's already too complex, too complicated, to read explain. It requires a lot of experience, but it still lacks many interesting points, in my opinion. I think this is such an interesting trade-off though, right? And this takes us onto the last topic I did want to make sure we discussed. I think there's a trade-off between being useful for people that are new to Postgres versus being useful for super experienced people. And I'm not sure exactly where we should be drawing that line, or where the people in charge
Starting point is 00:28:12 should be drawing that line. And we've talked for quite a while about, you know, defaults and what should be on by default. So explain itself is fairly simple, but explain analyze, once we have timings, the extra, let's say, penalty of also asking for buffers, maybe even verbose, but other parameters and definitely buffers based on our whole conversation today, should that be on by default? So when anybody asks for explain analyze, they also get those buffer statistics. Even if they don't know about them, don't ask for them. You can turn them off. Maybe if you're an advanced user, you know, you don't want them for some reason, but you
Starting point is 00:28:54 have the ability to shape what beginners ask for. So if they're reading some guide from three years ago that says use explain analyze, then they'll get buffers on by default. Default is very important. Yeah. Do you have some stats about your users, how many of them have buffers included? Yeah, last time I checked, it was 95 percent do include buffers. 95? Wow, that's impressive. Well, 95 also include verbose, not the exact same 95, but almost, because I guess your documentation offers it, right? Not just that we offer it, the tool does not support the text format of explain. So automatically, if somebody tries to get explain analyze and paste it into pgMustard,
Starting point is 00:29:39 it will tell them we need, at minimum, format JSON, but by that point we're also saying please ask for explain analyze buffers for both, right? So you're supposed to use it, that's why they use it. If you check the publicly available plans on explain.depesz.com, for example, or explain.dalibo.com, I'm sure more than 50% will be without buffers, unfortunately, because this is the default behavior. And, interestingly enough, like, there is a consensus based on what I saw in the hackers mailing list, I didn't see big objections. It looks like people think that it should be on by default, but still, somehow, the patch needs review. Actually, right now there are several iterations already, and let's include the link also, if
Starting point is 00:30:23 someone can do a review, it would be a great help for the community, because I think we should have buffers enabled by default. I hope we convinced people, right, that buffers should be used. We said that it's important sometimes to convert numbers to bytes, to have a feeling of how big that is. We discussed some lacking features of explain that probably are tricky to develop, but still, like, would be good to have. And also we discussed that it's possible to run explain analyze with buffers, of course, on a different environment than on production. And in this case, I also would like to mention our tool, Database Lab Engine, and an additional chatbot, which can be run in Slack or
Starting point is 00:31:13 browser, it's called Joe bot. And it also converts to bytes, and it allows you to have a very good workflow of SQL optimization, where you don't touch production. It's really cool, it even estimates how, like, if let's say the timing is 100 milliseconds, it even estimates how much faster that would be on production too, right? Yeah, well, this is tricky. This option is experimental, it's very tricky to develop, we still don't consider it as, like, a final version. But it's not like it's very needed, people are fine with just seeing buffers, different timing because of a different file system, different state of caches and so on, but buffers... If we have this shift in mind, to focus on buffers when performing optimization, this is the perfect place to play with queries, and Database Lab Engine also
Starting point is 00:32:05 provides the ability to create an index, not disturbing production and your colleagues. This is very important. And to see if it is helpful to reduce the amount of work, so buffer numbers, and therefore to reduce timing at the end of the day. So I recommend checking this out on postgres.ai. And of course, pgMustard for understanding plans. Maybe that's it, right? So we discussed everything we wanted, right? Yeah. So the final thing is, if you or anybody you know,
Starting point is 00:32:42 any of your friends are able to review Postgres patches, please, please, please do check out this one. At the moment, the way Postgres development works, there's a new version of Postgres due out at the back end of this year, Postgres 15. That's already frozen. Yes. So its feature set is already frozen. So even if we do manage to get this committed soon, it still at best will come out in just over a year's time, even if it makes it into Postgres 16. So these things can take years.
Starting point is 00:33:14 So don't expect fast results. But if you can, that would be wonderful. Thank you. Current commit fest closes on July 31st. So in five days. So get your skates on. Right, but there will be one more CommitFest, definitely. A few, actually, for Postgres.
Starting point is 00:33:35 Yeah, of course. Okay, good. It was interesting, I guess. I hope so. I hope everyone likes our podcast. We need your help. Please like, subscribe, and please, please share the links in your social networks and groups where you discuss Postgres, database engineering, and so on.
Starting point is 00:33:54 Thank you, everyone, for listening. Thanks so much. See you next week.
