Postgres FM - Buffers II (the sequel)
Episode Date: April 26, 2024
Nikolay and Michael return to the topic of using the buffers explain parameter — with a new analogy, some (conspiracy) theories of why it's still not on by default, and some related chat about the serialize parameter coming in 17. Here are some links to things they mentioned:
BUFFERS by default (episode 4) https://postgres.fm/episodes/buffers-by-default
Lightning talk by Michael at pgDay Paris (5 mins) https://www.youtube.com/watch?v=WfY-mSpUzaQ&t=1470s
Waiting for SERIALIZE https://www.depesz.com/2024/04/11/waiting-for-postgresql-17-invent-serialize-option-for-explain/
What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!
Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard
With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello, hello, this is PostgresFM, episode number 94.
And we are going to repeat episode number 4, but not really.
So today we're going to talk about buffers again.
So let's call it buffers 2.
But I forgot to say hi, Michael.
Hello, Nikolay.
I was thinking about buffers already too much.
This topic is super, super important, I think.
Yeah, so last time we called the episode buffers by default, because you especially, and I'm a convert to this, think that buffers, which is an explain parameter, should be on by default, at least with explain analyze.
There have been a couple of patches over the years trying to put this in place, or trying to make steps towards it as a goal, and we were in support of that. But we also wanted to explain why we thought people doing query optimization work, so developers, should include the buffers parameter when they run explain analyze manually, or with auto_explain, so that they can more effectively do query tuning.
Is that a good summary?
Yes, that's a good summary.
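For reference, the auto_explain route mentioned above involves a few settings; a minimal sketch with illustrative values only, not a recommendation:

shared_preload_libraries = 'auto_explain'   # in postgresql.conf
auto_explain.log_min_duration = '500ms'     # log plans for statements slower than this
auto_explain.log_analyze = on
auto_explain.log_buffers = on               # include the buffers numbers in logged plans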
Have you ever thought about why, in explain plans, buffers are not turned on by default, even though there is consensus, I think, that they should be, and the reason it hasn't been changed so far is purely technical? I'm sure, since you checked it, you'll probably explain the details here. But don't you find it interesting that explain plans have buffers off by default, while pg_stat_statements has them on? It's just interesting why, because if you want to connect the data, and I think we should, we should look at the whole picture, the macro level, and we should also work with individual plans provided by the explain command, the explain analyze buffers command. And if you forget buffers and just do explain analyze, which many, many blog posts still suggest, unfortunately, then it's hard for you to connect the dots between the individual plans and the macro level.
Yes, an interesting point about pg_stat_statements. I reread
the two most recent threads about turning buffers on by default. The first one was started by Vik Fearing, and he actually took a step back and said: I personally would like to turn buffers on by default, but because I realize that might not be what everybody wants, and that might be a wider discussion, here's a patch proposing that we allow multiple new parameters for Postgres. Each one allowed you to turn an explain parameter on or off by default, so you could set the default yourself for your instance. You could set buffers on by default for your own instance, and other people could leave it off by default. And it was widely thought that other people did want buffers on by default, but because he also included settings to enable analyze by default, and verbose, and write-ahead log settings, I actually can't remember which ones, but basically every explain parameter that existed at the time the thread was started also came with one of these settings, people fixated on the fact that an administrator turning on the analyze parameter by default would mean that running explain on its own would execute the query, which was deemed to be a foot gun, too dangerous.
So generalization
played a bad trick here. And honestly, I'm not interested in the ability to turn it on for specific servers. I want a real default for everyone, a global default, because I deal with many, many servers and I cannot tell them all. This is what we do right now: we tell them all, use buffers by default when you provide plans. It works only partially; it's never 100% coverage.
So if we can do
it by some additional settings, okay.
Maybe some cloud providers
would do it. I don't know.
But this partial solution I don't like
at all. I do think we're in a
new world with cloud providers. I think
now we have this, you know,
for example, pg_stat_statements is off by default, but in the cloud, most cloud providers have it on by default. So a lot of users get it. Some providers even learned about template databases, template0, template1, right? And put it there, so any newly created database has it. This is how it's implemented, right? So we're in this new world where we don't necessarily have to have things on by default in Postgres core for it to be on by default for most users in the real world. So there's a slightly weird subject here, but to move on to the second...
That's still a partial solution. I want Postgres 18, for example, to have this by default. What should we do? But actually, we jumped straight to this question of why it's still not the default, and we didn't explain what it is, right? So maybe we should explain first, because it's also important why we care so much about this. Why, if a plan doesn't have buffers data, so what? Right?
Yeah, let's do a short version
because I think we did quite a good job of that
in the first episode we did on this,
which we can point people at episode four.
But do you want to give a short version?
Yeah, let's do a short version.
So I see this data.
As I mentioned, this data is always present in pg_stat_statements. So if pg_stat_statements is installed, you have buffers: shared blocks read, shared blocks hit. There are also local and temp numbers, two additional pairs, but they're not as interesting here; let's talk only about the buffer pool numbers.
There are also two more columns, dirtied and written. So if some query changes data, it's dirtied; if it's written out of the buffer pool, it's written. And hit and read are interesting, of course, because they can give us an idea about cache efficiency, in terms of buffer pool cache efficiency, because there is also the underlying page cache additionally, always, right?
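For reference, those counters can be pulled straight from the extension's view; a minimal sketch using the standard pg_stat_statements columns, with the ordering just as an illustration:

SELECT queryid,
       shared_blks_hit,
       shared_blks_read,
       shared_blks_dirtied,
       shared_blks_written
FROM pg_stat_statements
ORDER BY shared_blks_hit + shared_blks_read DESC
LIMIT 10;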
And in the explain plan: if you run explain, it's only the plan. If you run explain analyze, it's the plan with execution. But if you don't explicitly put the buffers word, and if you do put it, you need parentheses additionally, so explain analyze, buffers in parentheses, then you don't see this buffer data. You only see timing.
You cannot, for example, understand cache efficiency in your particular execution of this particular query.
Like how much was taken from the buffer pool, how much was taken outside, maybe from memory, maybe from disk.
If it's a read, it's from outside.
You have no idea.
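As a reference point, the syntax being described is just the option in parentheses; a small sketch with a hypothetical table t:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM t WHERE id = 42;
-- plan nodes then include lines like "Buffers: shared hit=... read=..."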
Of course, you can enable I/O timing, not regular timing, which is always enabled in explain analyze plans as actual timing, right? But if you enable I/O timing, this can give you an idea about how much time was spent on I/O. And as I understand it, I/O in that sense means I/O outside of the buffer pool, right? It's so tricky, by the way. What is I/O? Does I/O include communication with what? With disk? With memory? Which memory, the page cache or the buffer pool? Both, or just the page cache?
In the case of the track_io_timing setting, I guess everything except communication with the buffer pool is counted.
Yes, but I would count everything. So it's tricky. Anyway, you can see timing there, but cache efficiency still wouldn't be visible. If you're curious about that, you do need buffers.
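For reference, that setting can be enabled per session and combined with buffers; a rough sketch, where t is a placeholder table:

SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM t;
-- with track_io_timing on, plan nodes can also carry an "I/O Timings: ..." line next to Buffers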
But in my opinion, cache efficiency is not the number one benefit that buffers provide. The number one benefit is the amount of work. Even if we know timing, to understand the real efficiency of our mechanism we need to understand how much work was done. If we see only the number of rows returned, that's of course the final result, right? Delivered. But what's inside? Inside, we need to understand how many actions were needed. And these actions, in the database world, are I/O. Databases are mostly about I/O, of course. We don't do a lot of calculations; usually we deal with files. The goal is to find the data we need as quickly as possible, to efficiently change it, rewrite it, store it, and so on. So only buffers can show the actual amount of work that was performed while achieving the final result. And this is, again, hit, read, dirtied, written for shared buffers. Of course, temp buffers are also interesting: if you see temp buffers written and read, it means work_mem was not enough and the query needed to write to disk. That's also an amount of work, and it can explain why I/O timing and overall latency increased significantly. So, the amount of work. Let's try maybe a new analogy. For example,
you need to replace the cabinets in your kitchen, I don't know. And we're not interested in details; you just need the final result. Then you hire some contractor, or some team of contractors, right? You pay them some money and they say, oh, you know, this took one month. And you say, well, okay, I see the final result, I understand, one month, I pay, and bye-bye, right? But is it fair? Maybe it's too much, right? How do we understand whether it's too much? We can compare to different similar cases, but it's better to dive in, to understand what exactly they are going to do. Let's check step by step: you will remove this thing, you will put in that one. You start understanding the real amount of work to be done. And then you think, really? You guys spent one month on this kind of work? This is strange. But in some cases they can really explain: oh, you know, you have a difficult case here, you need to change this and that, and it's really a lot of work. And then you start understanding: okay, in this case, probably one month is really reasonable, right?
Let us go back to databases.
If you see that to return some rows, database needed to perform sequential scan, it's reasonable to have bad latency,
very high value, right?
Because it need to scan very big table.
It's like a lot of work.
And then if it can be optimized,
of course, yes, we can create index, right?
Maybe we can buy tools for our contractors,
better tools than they use and so on.
I don't know, but diving inside this is the key
you understand the work to be done does make sense yeah it does i don't think the analogy is great
like i think but i can like for example there's there's one parallel that could make sense how
many people were working for the month like was it four people flat out for a month?
Parallelism is an interesting analogy there. But I think you can see all the steps that explain is doing, you can see what operations it's doing, even without buffers. We can see a sequential scan or an index scan.
When you see, for example, a sequential scan, you see rows. You see an index scan, okay, you see rows filtered out. But, for example, let's move slightly to deeper topics.
If the database is bloated, then to read 1,000 rows, for example with a sequential scan... 1,000 rows can be packed into a few pages, or they can be scattered, sparsely distributed across a thousand pages, right? A thousand blocks. And we can immediately see: okay, to read these thousand rows, we needed to deal with a thousand pages, a thousand shared buffer hits, ideally, not reads, right? In this case you think, okay, something is not good here, there's no data locality. What can I do about it? Maybe we're dealing with bloat, or maybe we just need to think about, I don't know, clustering our data or something; not a great choice, but I mean physically reorganizing the order of storage. But this is good, I mean, you can understand it, and without buffers you don't see it, right?
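For what it's worth, the physical reordering mentioned there is the CLUSTER command; a minimal sketch with made-up table and index names (note it takes an exclusive lock and is a one-time rewrite):

CLUSTER orders USING orders_created_at_idx;  -- rewrite the table in index order to improve locality
ANALYZE orders;                              -- refresh planner statistics afterwards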
Yeah, well, I mean, I'm completely on board with this, I was just trying to push back on the analogy. We've discussed this before; we both agree it should be, we'd love it on by default. I make a tool that really benefits when people use buffers, and we give a tip in exactly that scenario; we call it read efficiency. When we say the read efficiency is not good, it might be bloat, it might be data locality, there might be a different reason, but yeah, you can't spot read efficiency issues without buffers, or other read-related issues like read speed. So if you've got bad disks, or, you mentioned, you're being throttled by a cloud provider or something like that, you can't spot that. Maybe you can get a clue if you're really tuned into your data volumes and the time being taken, but if you've got the number of blocks or pages and can convert that in your head to, you know, we're reading 100 megabytes and it's taking, let's say, 20 seconds, that's only five megabytes a second. That is terrible. You've got an idea that...
You touched several interesting points here. First of all, sequential scan: for example, you have a sequential scan, some filters, and limit one. We need just one row. It can be very quick if this row was found quickly, or it can be very slow; it takes a lot of time if it needs to scan through a lot of irrelevant rows. And without buffers, we have no idea in the plan, right? Because we don't see how many rows were filtered out. An index scan will report how many irrelevant rows there were.
You do see rows removed by filter for a sequential scan as well.
How come? Ah, rows removed by filter... ah, okay, sorry, yeah, you're right, you're right. Okay. But yeah, I understand, I agree with you, it's only about the cases when storage is not efficient, for example bloat or dead tuples; that won't be reported, and if we see buffers, we understand: okay, too many buffers are present here somehow, let's understand why. And
the second point I wanted to extract from your speech: it's super important not only to use buffers but to always convert them to bytes. You said megabytes per second. Yeah, buffers per second is hard to understand; it's also not what disks report, disks don't tell us how many pages per second they read. Exactly, we cannot compare it to our disk characteristics. We usually know our random access and sequential access throughput numbers, maximum numbers; for NVMe, for example, a gigabyte or two gigabytes per second, and then we understand, okay, we have powerful disks, and 100 megabytes per second is already very noticeable. And of course memory is several orders of magnitude faster. But if we say how many buffers per second, I don't know. So we need to always multiply by the block size, which in the absolute majority of cases is eight kilobytes.
Right. And I wanted to emphasize this because I think not everyone watched our episode number four about buffers.
So I wanted to emphasize how important it is.
If you deal with these numbers and you start using them in plans and explaining something, you should understand that your colleagues might not understand you unless you convert them to bytes, kilobytes, megabytes, or kibibytes, mebibytes, gibibytes.
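For reference, the conversion is just blocks multiplied by the block size, which you can check on the server; a small sketch:

SHOW block_size;  -- typically 8192 bytes
-- so, e.g., 131072 shared hits + reads is roughly 131072 * 8192 bytes = 1 GiB of data touched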
So you start putting in these numbers, and it's magic. I saw it so many times. People say: oh, we are reading one gigabyte here for 10 rows? Okay, what's happening here? When they saw boring buffer numbers, they didn't have such a reaction at all. Once they see gigabytes: wow. The reaction is completely different. I had cases when, instead of diving deep into the specifics of the plan, and it was actually a mistake to dive straight into the details of the explain plan, the number one optimization tip was: okay, guys, how many rows do you reasonably want to return, or to touch, I mean change? And let's see the whole work done by the database here. This gives you a very good sense of how inefficient it is. And in many cases it might be some forgotten limit or something, and you understand: oh, okay, this is not what we want, actually, at all. This is a very high-level optimization tip. I see you're skeptical about it, but it works very well.
I don't mind it.
I think I've described it, I've heard it described by others and described it myself as: the best way of doing work faster is not doing the work at all.
So if you can avoid doing it entirely,
if you can do less work in the first place or no work,
if you don't even need to run the query, great, get rid of it.
But you're talking about if you do decide you do need to run the query, can you just
get, like, are you doing the smallest amount of work that's possible?
So what I think would be good, you know, there's a summary in the end of the explain plan.
Yeah.
And in this summary, we should have data in bytes.
It would be great to have it.
Like how much was read, hit?
You know, I've done this.
You know, it's actually...
We both did this with our tools.
But I think I did it in a controversial way.
I did it in a way...
I think this gets down to the trade-off between being accurate and being helpful. If you're already talking about splitting it into the four types... like, I guess there are actually eight, actually ten, you know: you've got shared hit, shared read, shared written, shared dirtied, local read, local hit, and so on. So there are ten statistics already if you want to be accurate about even the overall reporting. And what we ended up doing is horrible and it doesn't make any sense, but it's nice from, like, a... well, we just sum them all together
and give an overall number.
What if we said shared block reads, shared block hits, talking about operations, not the amount of data? Because we both know that for hits, for blocks hit, there might be multiple hits of the same block, many, many times; a nested loop is a very simple example. For reads it's also possible, I think; let's not spend time on that here. So when you summarize it, instead of blocks hit I would say how many block hits happened and how many block reads happened, and then you summarize them and say how many I/O operations happened. Though I would maybe not sum them all together; I would maybe still distinguish the reading and writing directions of operation, you know.
I understand. But the reason I love this approach is...
I'd keep the I/O directions separate.
The reason I like this is most of the time we're looking at read queries.
That's just the general case.
Don't you have write-intensive workloads?
Yeah, of course there are.
But if you're looking at people trying to speed up queries,
the vast majority are read queries.
Or they take an update, extract the search part of it, and optimize that first as a select.
Yeah, that also happens, I agree with you. But my main point was: even if it's a write-intensive query, it's the before versus the after. And it's funny that you used the word magic a few minutes ago, because this is the technique that's magic; it kind of flips the switch for people. I use that word, but in the opposite direction: I think it takes the magic out of query optimization. I think sometimes junior developers, or people beginning, think that adding an index is magic, and then it's gone. You know, we've seen the blog posts: this query is now 10,000 times faster. It's like, there's no magic here. This is why.
So let's agree.
There's no contradiction here at all.
When I started Postgres AI,
my idea was that Postgres database administration
and optimization and all other areas,
it should not be black magic.
It should be white magic.
And black magic happens when people need to spend 10 years because of a lack of transparency in observability tools or regular tools, and a lack of good materials for understanding. Now we have much better materials than 15 years ago, for example, and a lot of good tools. And buffers is one of the cornerstones
of tools and approaches
which really clarifies everything.
And so the black magic disappears and only white magic remains.
It's white magic when you say,
okay, I did this and now the query is a thousand times faster.
It's great, right?
But looking only at timing,
you cannot achieve it usually.
You do need to look at the work to be done, or being done, and this is buffers.
Well, and if you see, for the one that's a thousand times faster, that instead of the sum of buffers being a gigabyte, the sum of buffers is now two megabytes, something on the order of a thousand, then it becomes obvious. No surprises anymore.
Yeah, exactly.
And it doesn't seem like magic to people anymore; you start to realize how the index is working and why it's therefore so much faster. So that's the reason I love the simplicity of it.
But I think it also... I want to get back to... is this now a good time to go back to why adding buffers by default didn't work the second time around?
Yeah, please. I'm all ears.
Well, you started the thread.
Yes, maybe. I forgot. It got lost because of tests, right? Tests.
Yeah, that was the main thing. So, adding it by default: I think one of the problems was that the patch set that ended up getting proposed contained about four different patches that all did slightly different things, and I think that was part of the problem, that it wasn't kept as a simple change. And actually the biggest part of the change proposed was around tests, as you say. So if we turn buffers on by default, the first question, which you asked and which the patch ended up doing, was: should it be on for explain on its own as well as for explain analyze? And I think that was a mistake, I think it overcomplicated matters.
Because you can also have buffers for planning, right?
Yes. But the reason I think that was a mistake is that there are far more plain explains in Postgres' regression tests than there are explain analyzes. So firstly, it complicates that. And secondly, timing is off by... or rather summary, which gives you the planning time, is off by default in plain explain. So it's really weird to have buffers on by default and not summary on by default. So I don't think it makes sense logically, but it also complicates the patch set and the regression tests. So if I was redoing this, I would have a go without that, you know, just have it on for explain analyze, not for explain.
Okay, well, that's an opinion. I don't actually know if it'd be any better without it. But the second issue was then: what do we do with explain analyze, which has timings, right? There is a regression test setting that turns those off, so that the fluctuating timings don't make the tests fail each time the regression test suite runs, and there was an attempt to add a similar thing for buffers, to disable them for tests only, instead of rewriting the tests, I think.
Yeah. And maybe it would be better to rewrite the tests,
especially because
I forgot to mention
one of the very good things
I like about buffers
is that they don't lie.
They're always stable.
So you can use them in tests
and rely because,
well, of course...
They're not quite as stable as...
Hits and reads?
Not just that. Planning buffers as well.
Sometimes you get a lot of planning buffers, sometimes you don't.
It can depend on...
Okay, at least to some extent they are stable.
Even if the query, for example, is blocked by some different session, and our session is waiting for a lock to be acquired, we can spend a lot of time on this wait, but the buffers will be fine, I mean it won't affect them. And this is one of the signs: if you see buffers are low but latency is high, maybe locking issues are involved here, right?
Yeah. But they shouldn't block tests passing, right? You can handle this.
By the way, I would bring my bad, or maybe good, analogy back and say: if your guys are wasting a lot of time but doing little work, maybe there is some blocker, right? I mean, with the kitchen, I see some analogy here.
I see. At least, it's my... I like this approach, when you dive deep into the details of what's happening, what people are really doing. It can be applied to anything, maybe lawyers or accountants or something. If you don't look inside at the real amount of work to be done, you usually end up paying much more. And this is true with databases as well: if you don't use buffers, you will be having inefficient plans, you will need to provision a bigger instance, for example, and you will pay more.
The same. Maybe we found the real reason they won't turn it on by default: too many hackers are working at cloud companies and they are not interested. It's a conspiracy.
To be clear, I don't believe that.
Before we finish, because we don't have much time left, let's touch on the very fresh topic: serialize.
Sure.
Postgres 17 is going to have yet another option in explain,
and it's called serialize,
and everyone is excited about that, right?
As I said.
I have heard quite a lot of people quite excited about it.
It's good.
I think it is good.
We're getting the option to see another case where
your query can be slow,
but explain analyze at the moment won't
show that it's slow.
So yeah, for sure, good. But again, off
by default.
What does serialize show, first of all? Just briefly, it's the amount of what?
Oh, come on. We're back to buffers, but bytes and timing.
Yeah, by default it shows timing and bytes, and it shows buffers as well if you have buffers enabled. But what is it about? It's about serialization. So if you...
It can connect us to the last episode when we discussed select star.
And I said select star is actually not a bad thing
because I was comparing this in my head to the idea
to explicitly list all the columns.
And select star for me is a more convenient way.
But, of course, if you need only, for example, one column,
numeric column, for example, integer column,
but there is a JSON column next to it, which is TOASTed. Of course, if you select it when you don't need it, it's a big piece of inefficiency in this query, because Postgres will need to deal with the TOAST table. And the serialize option can show this very well, right?
A lot of work
to do. Instead of selecting
just one number, we need
to do a lot.
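For reference, in Postgres 17 the new option goes in the same parentheses; a small sketch with a made-up table name:

EXPLAIN (ANALYZE, BUFFERS, SERIALIZE)
SELECT id FROM documents;   -- cheap: the wide TOASTed column is not selected

EXPLAIN (ANALYZE, BUFFERS, SERIALIZE)
SELECT * FROM documents;    -- now TOASTed values must be detoasted and serialized
-- the output gains a "Serialization:" line with the time spent and the output size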
Well, I think the issue is more that
the explain analyze at the moment
doesn't show the inefficiency.
So if you did select star from that table from your client, you would get a really slow result, depending on how much data there was. You'd see in the client that it was really slow, and there's a couple of reasons: one is because a lot of data is being sent across the wire, but also because of the serialization. So there's two reasons.
But in explain analyze, you wouldn't see it, yes.
Actually, you know, now I understand that it's a pity I didn't see the discussion beforehand, because there is demand, actually, and I've seen it more than once, when people want to see the plan and the actual result as well.
Yeah, it's a different conversation.
Different, yeah. There is auto_explain, of course, but... Anyway, this is only the serialization part, and it's off by default again. So it's more output; I like it. We can add it to the list of parameters we ask people to use.
And you think it should be on by default? I mean, it's the same argument, isn't it?
Like, we want to see the actual work being done.
You know, this off by default, that off by default, buffers off by default; and also analyze is a super confusing word because it's used in other areas of Postgres administration. How about having new words invented that would show proper plans? Of course, we can do it in psql, for example, with \set.
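For reference, one way to do that in psql today is a variable shortcut; this is only a sketch, and the alias name and the table t are made up:

-- in psql (for example in ~/.psqlrc), define a shortcut once
\set ea 'EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS) '
-- then prefix queries with it
:ea SELECT count(*) FROM t;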
I did consider... I would be interested in your thoughts on this, and listeners' thoughts, on the idea of an explain all, or something, so that whatever version of Postgres you're on, it will turn on all of the parameters, including analyze, I think. Full is loaded, for sure, because of vacuum full, I think.
All what? All queries? Yeah, okay, let's not pretend we are hackers.
No, but I think there is something to... what's the end goal here? Is explain going to have 300 parameters? How many are we going to keep adding?
Yeah.
Yes.
So, buffers definitely; and serialize, I need to test it myself for some time to conclude.
I would say I have a reservation.
I think it's helpful for showing people where the problem is.
But because it's still off by default,
I think a lot of times when it would have been helpful,
it won't be used, which is a shame.
Yeah, that's interesting.
Everything here... serialize is off by default.
I actually didn't give this thought before we recorded this podcast.
It's funny.
Yeah, it actually should be default on.
Of course, I understand someone will say it will produce an observer effect, but honestly, I don't care, because timing itself produces an observer effect, and is maybe the number one player there.
And also this observer effect is maybe not an observer effect, but an actual, normal effect, because regular clients need this work anyway, right?
Yeah, it's true. This is the opposite. Observer
effect, normally we're talking about
the additional timing added
by observing, whereas
this is a case where explain, analyze will
under-report the time taken
because of this. It's skipping some work that is actually needed.
Weird.
So I need to...
My fingers need to memorize.
By the way, I type explain analyze buffers super fast.
Now I need to add serialize as well.
It's already too much.
I actually now use a text expander.
So every time I can type explain analyze
and it will change it to the full list.
But obviously you need to work in lots of different terminals all over the place, so yeah, probably best to learn to type it anyway.
Okay, so we like the thing, but we don't like that its default is off. Maybe we should change that, actually. That's my opinion, for the same reasons.
Yeah. Okay, it probably makes the tests difficult, of course.
That's difficult, of course. But I would emphasize once again: in tests, in regular application testing, buffers are gold. Because, yeah, I understand that sometimes there may be slight differences, because for example the data is packed slightly differently physically, right? The layout is slightly different. But there is no order-of-magnitude difference. With timing, you can have that: disks are slower, disk I/O is saturated, or there are locking issues, something like that, and you have timing which you don't understand. With buffers, it's a reliable number you can use in testing.
Cool. Thanks so much,
Nikolay. Thank you.