Postgres FM - Return of the BUFFERS

Episode Date: February 7, 2025

Nikolay and Michael return to the topic of BUFFERS for the third (and final?) time! They discuss the news that it'll be on by default with EXPLAIN ANALYZE in Postgres 18, and what effect that... might have.

Here are some links to things they mentioned:

Our first BUFFERS episode: https://postgres.fm/episodes/buffers-by-default
Our second BUFFERS episode: https://postgres.fm/episodes/buffers-ii-the-sequel
BUFFERS enabled for EXPLAIN ANALYZE by default (commit for Postgres 18): https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c2a4078ebad71999dd451ae7d4358be3c9290b07
explain.depesz.com plan history: https://explain.depesz.com/history
Latency Numbers Every Programmer Should Know: https://gist.github.com/jboner/2841832
Twitter conversation about flushing caches: https://x.com/prochazkafilip/status/1881655337499205736
pg_buffercache_evict: https://www.postgresql.org/docs/current/pgbuffercache.html#PGBUFFERCACHE-PG-BUFFERCACHE-EVICT
How to compare the quality of SQL query plans (blog post by Andrei Lepikhov): https://danolivo.substack.com/p/whose-optimisation-is-better
We were added as PostgreSQL Contributors! https://www.postgresql.org/about/news/new-postgresql-contributors-3006

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello, hello, this is Postgres FM. I'm Nikolai from Postgres.ai, and as usual my co-host is Michael from pgMustard. Hi, Michael. Hello, Nikolai. So I chose a very boring topic, and we've talked about it a lot. I hope this is the last time we're going to talk about it, maybe. Buffers, or pages, or blocks, or how to name it, depending on the perspective you have. But buffers. Buffers, episode three. Which title are you going to choose? It's like the third in a trilogy, isn't it? Maybe it's like Return of the Buffers or something, I don't know.
Starting point is 00:00:43 Right. But we have a few reasons to revisit this topic today. Of course, this topic has been very heartwarming for me. For a couple of years, we discussed how important it is to consider buffers when optimizing queries at micro level, when we talk about the EXPLAIN and EXPLAIN ANALYZE commands, or at macro level, when we talk about pg_stat_statements and other extensions, and aggregate everything and see parts of the workload or the whole workload. And recently,
Starting point is 00:01:19 finally, as we already mentioned a few times, buffers were added to explain and analyze by default after a few unsuccessful attempts from previous years I think it was David Rowley right? Yes and
Starting point is 00:01:36 Guillaume Lelage played a big part as well Right and all previous hackers also who made even unsuccessful attempts made some shift into this direction of course it's it's really small technical change so now you don't need to write the word buffers you just write explain analyze and as a reminderlan just provides a plan for the query, which planner is going to choose, optimizer is going to choose. And then X-Plan analyze actually provides both the plan and actually execution.
Starting point is 00:02:13 It actually executes the query, so you see more data. And by default, buffer information, buffer hits, reads, buffers dirted and written we are not present in the plan and that was a bummer for us who dealt with queries at scale right so because this is important information as we discussed in and we will discuss again I think but now in Postgres 18 you don't need to write buffers explicitly this information will be included by default and that's great I think it's a super important small change
Starting point is 00:02:55 which will affect lives of many engineers I know you've alluded to this already it is an upcoming version of Postgres 18. It's been committed, and things that get committed have a very good chance of making it to the final version. But things, yeah, things do get reverted. So there's a small chance it's still, we'll still have to talk about this more in future.
Starting point is 00:03:20 But yeah, we already had the ability to request buffers. We could already add, explain, analyze buffers and other parameters. But a lot of the time when people are asking for help online or when people set up auto explain, although I see buffers more in auto explain. I don't know what your experience is. But yeah, when people are asking for help or people are writing blog posts or people are just getting familiar with postgres they'll come across explain analyze and then not know that buffers exists or not use it when it would actually have shown the issue so there's a bunch of cases where by having it on by default we'll be able to help more people online when they first have an issue without having to ask them to go back and get another
Starting point is 00:04:01 query plan or it'll just be more likely to be in that well it will be in those crew plans by default if they're on postgres 18 or newer yeah yeah and uh i know you have also opinion and i agree with you that some other things should be included if we provide something like all or they're both this verbose but it doesn't include everything and this this is slightly strange and unexpected if you don't know details but uh let's focus on only on buffers yes yeah it's really hard for me to imagine it will be reverted it might be but i think yeah i could just imagine this aging really you know aging like milk we say sometimes just case. I just wanted to make people aware that might have happened. Yeah, so we talk about Postgres 18,
Starting point is 00:04:48 which is going to be released this fall in many months, and it will hit production only in a couple of years, on average, maybe, or maybe even more. So it's still in far future, in distant future. But I wanted to just think in advance, since we are both building some tools and methodologies and everything, I wanted to think,
Starting point is 00:05:13 is it going to have some positive, I think positive effect on how we are going to optimize queries in the future? And if, for example, defaults matter a lot at scale. And I learned it building three social networks in the past. The lesson, the simple lesson I remember, some spammer was sending, on behalf of administration,
Starting point is 00:05:39 was sending the message, send text message to this number without any reasoning and if you send it to 1 million people there will be some people who will follow because you know like why not you know like of course if you add some reasoning it will work even more so defaults matter a lot and we can put everywhere words like don't forget buffers don't forget buffers but if it's still like if default behavior remains without buffers there will be a lot of people who will not use it right and this is what we see if you go to explain the pushcom it has history and of course actually even the explain the pushcom also advocates for buffers since not long
Starting point is 00:06:26 ago, which is great. But if you just check history, you see a lot of plans provided without buffers, right? Because defaults. Now, we like we have good reason to expect that after a couple of years, we will see majority of plans with buffers do you agree yes because maybe not into maybe not in the fall but maybe in like definitely in five years time maybe in two three four years it will become the majority well let's say if it will be 18 plus version this plan like again 99 or even almost 100 it will include buffers because only few people will bother turning it off.
Starting point is 00:07:08 Oh, I meant actually people still running very old versions. Yeah, I understood. That was my more likely. But imagine we're already thinking about people with new versions, and I think we will have buffers. What does it mean for analysis? Are we going to highlight this information or use this information even more intensively? What do you think? Well, good question. I think people will ask, people already have lots of questions when looking to explain, analyze output for the first time. And people are like, what does
Starting point is 00:07:39 this statistic mean? What does that statistic mean? And I think buffers are another kind of one that at first people are a little bit confused by what does it mean like what what is that number is that a lot is it not a lot what what unit is it in all these kind of questions so i suspect it will immediately in people's heads have a bit of extra confusion but once they're familiar it i can't see how it isn't helpful it's just going to be so helpful especially once people start like people and tools start converting that into data volume sizes i know it's an imperfect but let's accept the imperfection of timesing it by eight kilobytes
Starting point is 00:08:20 so each buffer being a page and multiplying it by eight kilobytes you get an idea of the data volume being read when people see a query and notice it's reading a gigabyte of data even if it's from shared memory when it's not using an index or when it's using a specific index and then it's only using a few kilobytes when it's using a different index suddenly they realize the index is helping speed things up because it's more efficient because postgres is only having to read far fewer blocks far like much less in order to serve that query i think it becomes a lot more intuitive that indexes help because they're more they're a more efficient way of running certain queries, not their magic. And I think there's currently, not amongst
Starting point is 00:09:07 Postgres folks, but definitely in the developer community as a whole, there's this perception that indexes are magic. And I think buffers being present help is like a good step towards no, they're not magic, they're just a very clever idea that makes things really simple yeah yeah that's a good point i i was listening to you and thought i'm i'm contradicting to myself when i say buffer buffers hit and read and written and dirtied it's incorrect it should be operations buffers buffer reads hits writes and i don and I don't know how to deal with dirt.
Starting point is 00:09:48 Dirties? I don't know. Yeah, yeah, yeah. Pieces of dirt, right? No, no, no, like actions of dirt. But anyway, I'm saying these are operations, and if you sum up some numbers, it might be the same buffer which you hit multiple times, right? I'm not sure about other operations, but hit, we definitely have in the nested loop,
Starting point is 00:10:11 we can have hits for the same buffer many, many, many times. But when you say we need to convert, I totally agree. I noticed that very long ago, once you converted to bytes megabytes gigabytes sometimes terabytes actually yeah uh tb bytes right gibby bytes once you convert it engineers have a ha moment of always i mean those who see it first time because 100 buffers buffer hits or reads 100 buffer reads there's nothing to them actually it's what it is hard but once you say okay we here we read 800 kb bytes to return a number of like a 8 byte number you can you can feel it already how how much it's just to return one byte, you are reading 100 or 800, how many, kibibytes. Each kibibytes is 1,024 bytes.
Starting point is 00:11:16 Wow. Like, it's not very efficient, right? How about having, I don't know, hash or hash table or something to find this number and then oh okay this is why like because it's a sequential scan and this sequential scan is just scanning whole table if we remember big o notation sequential scan has terrible terrible performance because so many operations and it also depends on the size of of each tuple how many tuples are present in in one eight kilobyte block eight kilobyte buffer right so if we on if only a few tuples there we need to read a lot of data from memory or from disk so So once you switch to a B3 index, it becomes only a few buffer
Starting point is 00:12:08 operations. Even if you have a billion rows, it will be like seven, I don't remember exactly, but it looks at seven buffer hits, that's it. And even Okay, they are still like eight kilobytes, it's a load, but it's already much better than it if it would be sequential scan which is insanely slow in this case and you think okay that's why it's slow because we deal with huge volumes of data and when we apply index we suddenly deal with very low volumes of data right and this is number one number one reason why database becomes fast. This is how indexing works. Like we just reduce the number of IO operations. And when I say IO operations, I often include operations with memory hits.
Starting point is 00:12:55 Because, well, we know operations with memory versus operations with disk, it's like 1,000 times difference. It's, of course, a lot, but still like we can compare it and one read is roughly like 1000 hits to memory. Unless this read is also from memory from the page cache in this case, so they're very similar, right? So what I'm trying to say, we talked about this reasoning,
Starting point is 00:13:21 and I agree with you, we need to convert to bytes and present it at the same time we need somehow like have some remark that these buffer hits are might be like to the same buffer it's not only it it's it's a volume it's from some abstract volume of data. We can have like a megabyte of data hitting the same 8-kibibyte buffer, right? Yeah. So it's okay. But comparing volumes of data, it's so good because first of all, you start thinking about data volumes,
Starting point is 00:14:01 which is number one reason of slowness, especially if we forget about concurrency. Because concurrency is a different topic. If we take just one query, there are cases when it can be very CPU-intensive workload. But in the majority of cases, it will be IO-intensive workload for one query, for databases. And our goal is to help Postgres with indexes, with maybe some redesign,
Starting point is 00:14:28 to help index to deal with as few operations with memory and disk as possible. It means we need to focus on buffers. Yeah, are there any... I think there might be a few other exceptions, but I think they're more... Yeah, I don't think that... Well, maybe it's only one. I think maybe it's a few other exceptions, but I think they're more... Yeah, I don't think that... Well, maybe it's only one.
Starting point is 00:14:48 I think maybe it's CPU as well, actually. I was thinking JIT compilation. Well, we had recently a discussion about RLS, and there we had inefficient, purely in-memory work, CPU work, like check some volatile function or stable function also right so it's if it's in loop for all rows it's cpu intensive work which yeah there have been i was just trying to think there are there have been a couple of other times i've not i've really not been able to spot issues by looking at buffers but they've actually they're much they're not
Starting point is 00:15:23 actually it's not because postgres isn't reading data it's just it's not reporting it in explain analyze yet so actually maybe getting it on by default will encourage some more reporting of buffers as well so for example i don't think triggers report the buffers read like if you have you know the famous case of um not indexing a foreign key and then having on cascade delete like that trigger might tell you it's taking many many seconds because it's having to do a sequential scan of the reference table but no buffers no buffers reported as part of that same with in memory operations like sorts or hashes like they could be really slow even though they're done maybe you've got quite large work mem and it's maybe
Starting point is 00:16:11 it's the dominant part of the query plan but no buffers reported as part of that because it's in memory it will report the buffers if it spills to disk but for read for reads of tables and indexes we get reports even if it's from memory. But for operations like sorts and hashes, we don't get. So there are a few things that are still doing work that it would be good to get those. But it doesn't go against what you're saying in principle. It just means I can't use what Postgres is actually giving me in explain, analyze buffers to actually diagnose the issue there. Yeah, I remember we also touched the topic that it would be really great to see details for buffers.
Starting point is 00:16:51 For example, how many buffers from heap, from index, maybe distinguishing each index, some details to see. Because if we go back to the original approach for reading dealing with explain analyze and we see it like even many hackers block posts provide plans analyze plans but they don't use buffers still i hope this will change of course over time yeah so usually okay we have time we try to guess okay this time is lost here why okay? Okay, because sequential scan. But in terms of data volumes, what do we have? Only rows, right? Expected and actual rows. Logical rows.
Starting point is 00:17:34 And sometimes we think, okay, we fetch one row here. Oh, we get width as well. Like an average estimated width. Okay. Which is, times by rows gives you some idea, but it's only the width of what's being returned, not the width of what's being read. Exactly.
Starting point is 00:17:58 This is like, okay, yeah, we can multiply width by rows, get basically number of bytes, but it will be like logical level. And underlying level might be very different very different drastically different for example if there is a lot of dead tuples which which postgres checked to return this very row right maybe it checked thousand dead tuples and we can it's hidden if we look only at at rows and timing and we have no idea why timing is so bad, right? Okay, return one row, index scan, but why so bad? Here's why. We go to buffers and we see that a lot of buffers were hit or read. I don't know, doesn't matter. That's why timing is bad.
Starting point is 00:18:40 And then my point is that it's already very useful, but imagine if we saw, oh, actually, from this particular index, from this particular table, we got those operations with buffers. It means that to return this row, so many buffers, even with index scan, it's bad. Oh, that's why, because it's bloat. This row had many versions not cleaned,
Starting point is 00:19:09 and index scan needs to check heap table to get version information. So this would be useful to, like, detailed buffers, to have detailed buffers, maybe, right? I had this case recently. In January, I was doing some office hours calls just to get an idea of what kind of issues people were facing. And I had some really good conversations. Thank you to everybody that jumped in on those.
Starting point is 00:19:31 But one of them, they were using our tool, PG Mustard. And it was one of the tips we show people is we call it read efficiency to look for exactly that. When are you reading lots of data and we actually for for long-term users used to call it table bloat potential or like bloat potential so it could be index bloat could be table bloat but we renamed it read efficiency a few years ago because there are other possible causes of it as well so it's a tricky one and because we didn't so the this office that was cool they were hitting quite a few read efficiency issues but they couldn't diagnose exactly where it was like was it index level was
Starting point is 00:20:10 it table level they looked at both it didn't they didn't seem bloated they re-indexed they they tried a lot of the things and it couldn't reduce it so it was a um it was a really interesting call but that yeah that's my first put of call is is and this used to be true more in older versions of protocols but indexes can get especially can get extremely bloated so can tables in some cases but yeah it's it's amazing that you can suddenly see that by turning on buffers um and that is something i wasn't expecting to be true because because i see bloat as kind of like more of a system-wide issue but to be able to spot it in query latencies was really oh bloat can be very tricky sometimes we
Starting point is 00:20:50 have with our consulting clients also we have like bloat is low but some query suffers and it turns out to that like we call it usually local bloat so for particular ids yes the situation is super bad it can also be not a blood but some uh you know like specially stored especially stored records they are like distributed among many buffers and you expect okay i have very narrow narrow table only only like four four or five columns. I expect a lot of tuples to be fit inside one page. But somehow reading 1,000 records, rows, I deal with thousands of buffer operations, what's happening here. And if we just check city ID,
Starting point is 00:21:36 we can understand that each row is stored in each particular page. So we have a lack of data locality here. And so clustering with pgd pack without downtime could help or so partitioning helps usually in such situations and so on yeah yeah yeah so and buffer is exactly the way to feel these all these problems better but i agree i i like your idea to talk about read efficiency and maybe write efficiency as well. Because for end user, it's obvious, okay, I have, I return this data, I return only 25 rows on my page. So I expect not gigabytes of data to be fetched from buffer pool or even worse from disk, right? Or from page cache, which is between disk and the buffer pool.
Starting point is 00:22:29 So I expect maybe only like some kilobytes, right? Maybe tens of kilobytes. Even not megabytes, if it's quite narrow table. Why should we... Postgres should not do a lot of work to show 25 rows. If it does a lot of work to show 25 rows? If it does a lot of work, something is not right and we need to optimize this query.
Starting point is 00:22:51 And without buffers, we only can guess. With buffers, we see it, right? With buffer details, it would be even better, right? Someday, maybe we will have buffer, detailed buffers. So, per each database object. Some statistics would be interesting, I think, to observe. Yeah, and there is criticism of this approach. I recently had an interesting discussion on Twitter on X,
Starting point is 00:23:19 and somebody told me that in Oracle there is ability to flash caches. And in Postgres, it's tricky. Usually it means we need to restart Postgres to flash the buffer pool and also to echo 3 blah, blah, blah to flash Linux caches if you want to go hardcore and very cold state, to check very cold state. And this is interesting. i think you can restart you start what the data like if you're testing locally for example on a clone if you restart the database but doesn't that um reset caches it uh restart of postgres will make empty only
Starting point is 00:24:00 the buffer pool but we also have if it's linux we have also page cache right we can flash page cache with simple command like we of course if you have sudo yes so it's it's easy i always google it echo three to some path and then sync and that's it if you're a managed service what's the best you can do well manage servers you cannot do this of course, right? But restart is like the best you can do. Reboot. Reboot, yeah. Yeah, yeah, reboot. Well, this is hardcore. So let's talk about this. I also see a constant desire to check the cold state. In database lab we build, Some users use it for query optimization with some bot we have, job bot, very old stuff,
Starting point is 00:24:47 but it works really well for query optimization and also like experiments, right? So you can check on think loan, on branch, you can check various ideas. And there people say, okay, I see the hot state.
Starting point is 00:25:03 Data is already cached. I want to check the cold state. And I think it's natural desire, but yeah. Well, every time we say there is no such easy way. And interesting that I didn't know, but interesting that in Postgres 17, in pgBufferCache, there is a new function, pgBufferCacheEvict. And you can try to use it.
Starting point is 00:25:25 I think we are going to try it at some point when Postgres 17 will be more present on production systems. We probably will consider adding this to make the buffer pool empty, right? But unfortunately, this function, I suspect it won't work really well. It's definitely not for production, first of all, right? It's for like lab environments we build. And unfortunately, per documentation, it won't evict buffers which are pinned by, for example, auto vacuum, vacuum process and so on. So in some cases, it will fail. We will need to think how to deal with it.
Starting point is 00:26:04 Of course, restart is definitely working in this case. So if we think about why do people need cold case at all? What's your opinion on this? Well, I think it's coming from a good engineering rationale. You know, I want to make sure the worst case isn't too bad or you know like it's definitely you want good engineers to be thinking about edge cases to be thinking about the worst possible case but i also think that in practice the kinds of optimizations we're doing here the whole point of the discussion here is we're saying try and reduce the amount of data being
Starting point is 00:26:46 read to as minimal as possible if it's an important query try and get a very very good access path for that query and that will involve getting the buffer numbers down as much as possible whether they're cached or not and therefore and maybe you want to explain to people in the in the pr or some you need to explain to your team, how much faster has this made it? So I can understand wanting to compare cold cache state to cold cache state, and then having these kind of relative differences.
Starting point is 00:27:14 But I would encourage going the opposite route. It's much easier to work with warm cache, especially when you're doing query optimization, you're naturally going to warm up the cache if you're running the same query over and over again. So I would encourage more going the opposite direction and say compare what yes this definitely applies for ltp but even for olap or olap queries i don't see the downside of saying to your team these i mean it's still explain analyze right if you're showing career execution plans before and after which is often what people are doing kind of show
Starting point is 00:27:50 this is what it was this is what it will be it's already imperfect in terms of reporting numbers it's not exactly the same as what the client's seeing so if we're already imperfect i don't see the downside of comparing warm cache to warm cache. So I tend to run things a few times. Before optimization and after optimization, we compare two cases. Both should be warm. And we focus on not even if we have reads still. For example, buffer pool may be smaller in labor environment.
Starting point is 00:28:22 We usually have much smaller buffer pool because we run multiple Postgres instances on the same machine. It's a shared environment. So if you deal with large volumes of data, it might not fit the buffer pool. So you are going to see reads. But the idea is let's just summarize reads and hits,
Starting point is 00:28:41 maybe even writes and dirties, how to name it, right? So all four buffer operations, if we just summarize it and use it like as universal metric, this is now our number of, or just maybe reads and hits, depends, right? But overall, we just see this is overall volume of data. And this is what you return, 25 rows. I like like that a lot that's exactly what we do but i i get the sense that what people really want is to compare timings so they want to say this query that used to take because sometimes it's coming from a complaint right or something this query used to take 20 seconds now is 500 milliseconds or this query that used to be 200 milliseconds is now less than a millisecond so they want this kind of like difference in timing even though behind the scenes it's a
Starting point is 00:29:33 difference in buffers so for that if you want kind of apples to apples comparison instead of trying to get cold and be fair on the on the cold side i just say it's easier to be fair and do it on the warmer side. I know it's not perfect, and it might be that your real users are hitting the cold cache state, but if that's what you want, that's the direction I would go. Yeah, well, we have slightly different methodologies here. I usually say, okay, we start from the statement, the query is slow, it takes like 30 seconds or one minute. It's super slow, unacceptable. We want it below 100 milliseconds or 10 milliseconds.
Starting point is 00:30:13 And then I say, forget about timing for now, completely. Focus only on buffers. If we check buffers and see a low volume, 100 buffers, 100 buffer hits and reads to return 25 rows. We know our micro-optimization is over. We need to go and see the whole picture, what's happening with concurrency. Probably we'll have some,
Starting point is 00:30:37 maybe this query is waiting on log acquisition of all those seconds. That's it, right? So it's already this. But if we confirm large volumes of data, we forget about time. We focus only on bringing the... We focus on obtaining sane numbers for buffers.
Starting point is 00:30:55 Sane means, okay, hundreds, thousands, not more. If even thousands to return 25 rows, it's already quite a lot. And we must do it. Sometimes, of course, it's already quite quite a lot and we we must do it sometimes of course it's not easy sometimes we need to do denormalization some complex surgery on our database and involving like long lasting operations with back feeling and so on but once we achieve this in most cases we we can have same number of buffers and then we say okay now we can compare timing and uh comparing timing in in lab environment it still might differ from
Starting point is 00:31:35 production yeah of course but but of course i know i know i use it as well i say we improved timing 100,000, like 10,000 times. It happens, right? And this is good for final comment, you know, in the end. But only when we did work on buffers fully, right? This is where you... Yeah. Given it's you, you could, with your reputation and your teachings you could leave that last line as we we improved buffer read or uh read efficiency for example read efficiency exactly and it would be
Starting point is 00:32:14 probably almost the exact same number not necessarily but yeah maybe you'll be surprised how often it's close okay good good good yeah that's good maybe maybe i should think about it and also present this information during consulting activities and in tools as well yeah i think it's good so yeah read efficiency and you you take a number of rows and uh width and just multiply it get bytes from there and then yeah it's a tricky one because like different scan types you expect different amounts of efficiency so yes it's slightly involved not involved also interesting right yeah but i think i think the main differences are sequential like sequential scans very different from uh well sequential kind and bitmap scan have some similarities index scan
Starting point is 00:33:05 index only scan have some similarities but then like nest like looped index scans are somewhat different to like range index scans in terms of how many tuples but like in a loop for example you can't you can't get fewer than the number of loops like it has to do at least that many so yeah there's uh there's some subtlety but yeah um that's it essentially yeah but but back to the cold state i also agree like i can imagine you would need it and unfortunately it's possible as i see only in level lab environment where you're alone because this is a global operation to flash operational system page cache for example and when you control the hardware right or like the you know the well yeah it's managed situation again restart as you said also an option but again this also only works if you're alone and this is
Starting point is 00:33:59 expensive we aim to make experiments super, super extremely fast and cost efficient. So it means managed environment. Many people work on the same machine. So if one decides to flush everything, others will notice, right? And with this function, pgbuffer cache evict, I think if we manage to make it work, I see it's useful. Even imagine if we have page cache, or in the case of DbLab, it's ZFS arc, which is common. If one block is cached, it's cached for all clones, all branches.
Starting point is 00:34:37 It's very good in terms of efficiency. But sometimes, indeed, engineers want to check inefficient situation okay in this case i think what we would do we try to okay if user requests for their particular database particular postgres instance running on this machine we can ask to evict maybe everything from buffer pool or for particular tables and indexes say everything and. And if we succeeded, there are chances that the same buffer, these buffers are cached in underlying cache, right? Page cache or arc. In this case, query timing is not going to be very different, but plans will be different. We will see reads instead of hits, right? Observing buffer numbers, we will see the difference.
Starting point is 00:35:26 Okay, reads now, not hits. And what I'm thinking, in this case, we could consider it relatively cold, this situation. Lukewarm. Yeah, yeah, yeah. And we could just say, okay, we know that reading from disk is roughly a thousand times slower than from memory.
Starting point is 00:35:48 Right? So we could do some estimation of real cold state. Well, how often do you see clients having track IO timing on? Oh, very often. Great. Okay. That's really positive. Because that gives us an idea of how long those reads rather than hits took. And you could do some division of reads by IO timing to get an idea of where they're coming from. I think there might also be some work. I think I saw something either in a commit for century or hackers thread to try and get yes it's like this is that k k cash for micro level right yeah yeah well it's possible if you
Starting point is 00:36:35 have access to everything it's possible from uh proc process number io you can get it from there i think but it'd be good to have it in postgres core so that we could even get it from managed services for example yeah yeah well there is something that can be done here and the improved observability of part and work with particular query if you have control yeah but i'm trying to say like even if you with this evict we can get some interesting information even if we we cannot allow ourselves to reset everything, including page cache. So it already can be useful in non-production environments to study the query behavior if it's a clone. I feel there's a lot of things that can be improved in this area to work with queries.
Starting point is 00:37:26 And especially it's important to continue this work because I suspect we will use some tools to work on this at massive scale. Like, for example, if we fetched from – we already know during consulting, We did it with a few clients. We photo explained. We fetched a lot of queries, hundreds usually, sometimes thousands. It's insane. Nice. Examples with plans. And then, well, actually, you know, because we partnered, in some cases, we use PGMaster to provide additional insights, which is great.
Starting point is 00:38:03 And then we have a lot of stuff. Unfortunately, you know, unfortunately, I still don't see how to avoid human here. I like to involve human there, but I would like to involve less, you know, because usually right now we have a lot of cases of query analysis like that, and then we need to draw some common picture from it, right? And this, like, going above all those plans and say, okay, we have a pattern here, for example, bloat is obvious, or something like this, or index health is not good. And to draw this picture, human is involved heavily right now i'm thinking like
Starting point is 00:38:46 over time probably we will build some additional um like macro level analysis for micro level cases hundreds or thousands of them and this is this is going to be great and then uh looking at buffers and actually we will have buffers there as well. You were right in auto-explaining. Timing is not present. People are afraid of overhead. Buffers also have overhead, but kind of smaller, right? And yeah, we had articles. You had articles about this as well.
Starting point is 00:39:18 Yeah. So yeah, and so we've called caches tricky, and there is this function. That's it. So Postgres 17 plus. What else? There is macro level, right? And at macro level, we can talk about PG star statements, PG blocks, read, hit. By the way, there, it's, yeah, also naming.
Starting point is 00:39:40 We talked about it. It's also naming like it's like data volumes but it's okay and so read hit written dirtied for shared buffer blocks in pgstat statements also there is pgstat kcache if we want real disk io to be observed and again i advertise to include pgstat kcache in any setup it's a great extension. And there we can talk about some macro level, like, okay, this query ID has this volume per second or this volume per query.
Starting point is 00:40:14 Or it's responsible for 90% of all hits to the buffer pool happening in database. Maybe it doesn't look super slow, but it's so intensive with shared buffer pool. That's a good point. I almost always encourage people to look at their top queries by total time. Unfortunately, with PGStats statements, at least, there's no kind of total blocks. Because it's split into these many, many steps it you could order by shared hit plus shared red and get a good sense of one one type or or by dirtied for a
Starting point is 00:40:54 different but yeah you'd have to sum in the in most cases i'm thinking of that were good give give a good like system-wide aggregation you need to sum two columns i think to get a good system-wide aggregation, you need to sum two columns, I think, to get a good... Reads and hits, right? Ordering, yeah. Yes, yes. So we usually... It's an underestimated approach with PG-STAR statements
Starting point is 00:41:16 to order by sum of reads and hits. Yeah. It's super important. This is how you identify IO-intensive queries. And now they can be mostly hits, but if database grows, they can be converted more and more into reads, right?
Starting point is 00:41:33 Or vice versa, you're going to double your memory size and the buffer pool size, so reads will be gone. But just sum them and consider them together to understand how IO-intensive the query is. And think, oh, this query probably needs optimization, especially if it returns only one row or zero rows. So, yeah, in this case, we can also consider buffer-focused optimization,
Starting point is 00:42:03 which is important. I think it's underestimated. But also, it's worth mentioning this interesting blog post from Andrey Lepikov, who gave, I would say, hacker researcher perspective on this, right?
Starting point is 00:42:18 Saying, actually, I didn't mention that this is my point for sure for lab environments we build. It's important to say timing is super volatile. It's unpredictable. You can have timing one today, another tomorrow. One on production, another on clone of production.
Starting point is 00:42:39 Very different timing. And it will be a big question to track all reasons why timing is unpredictable and there is a concept like there's a simple methodology if we deal with complex problem let's split it to pieces and analyze pieces separately so when we take a clone and bring a single query and optimize it we already get rid of concurrency issues, heavy logs, lightweight logs, everything, forgetting about them. And we deal with a single query alone in one clone. And it's already super good. But we still need to have something reliable in terms
Starting point is 00:43:20 of metrics to optimize. And timing is not reliable because it's unpredictable. It depends on states of caches. If it's a shared environment, maybe CPU and disk are super busy and there is noise from neighbors when we optimize in this case, right? But once you focus on buffers inside optimization,
Starting point is 00:43:38 you have invariant metric. Okay, it can be reads versus hits, but some of it will be constant if you repeat the query. For this particular clone, for this particular query, these parameters, everything is the same. Plan is the same, right? So you have the same thing. It's repeatable. It's reproducible.
Starting point is 00:43:58 This gives you a foundation for optimization because without it, you have very shaky foundation like not not solid foundation right timing so and then you optimize and you i like your idea about optimization like actually i used it i used it i said okay 10 000 times timing and and i remember I mentioned buffers as well, but I didn't expect people will value this comment a lot. Maybe with buffers by default, we should push on this methodology more and more. But what Andrei is talking about in this article is that for research purposes and so on, instead of timing, maybe we should focus only on reads. And reads is something that could be a target itself
Starting point is 00:44:50 for optimization and decision if the system works efficiently or not efficiently. Because we are talking about databases, and I.O. is the number one thing in terms of what takes time. And I like this idea, yes so like this is very close to what i see for many years building lab environments non-production environments which can help people scale their teams and databases and processes and so on yeah so what what what's your opinion about this article yeah i thought it was good i think all of i think this blog's great
Starting point is 00:45:26 but it's very i think it's very hacker focused and i think that's good right like this is a good topic for hackers as well if there's a lot of blog posts i see from people often explaining performance issues and not including buffers and it seems like a missed opportunity to educate on that on that front so this is a great argument for the people doing the educating, I think, or even doing the implementations of why it can be helpful to see this. I think he makes some good points as well about hardware changing over time. And his initial paragraph is even about, I can't reproduce this paper. I can't reproduce timing, yes.
Starting point is 00:46:04 Yeah, because it's not just that it's also there was and this is a different point slightly but there's insufficient setup information which is really common in benchmarks it's really common to not include which exact machines were being used or which exact so if without that information buffers can still be used as like a oh this interesting sure there might be optimizations over time that Postgres can use fewer buffers for certain operations or certain indexes become more efficient. There's been a bunch of work to optimize B-trees in Postgres over the years. And I would expect that to lead to slightly fewer buffer reads in a bunch of cases. But they would be easier to see and easier to see the differences in than timings
Starting point is 00:46:46 and i think he's also got a really good graph here that shows that it's there's not zero variance in buffer numbers but it's much much lower and it's it's way more stable than timings and nearly completely stable with some exceptions which is my experience as well yeah yeah yeah i i agree especially for example if you run the query first time real cache is not there it will give you yeah planning time it will give you huge overhead and huge buffer numbers but second execution will fully shave it off right yeah planning buffers planning buffers is a separate topic it's it's very important topic yeah i'm like i like we have it me too but i don't fully understand them yet like why my understand oh yeah it why do we get so many more on the first execution and then why
Starting point is 00:47:40 do we sometimes get no none reported at all ah Ah, okay. You mean, yeah, yeah, yeah. I think we discussed it maybe. Not sure if you... I saw some very weird behavior in planning buffers. Maybe something is not tracked as well, as you mentioned. I think it must be. Yeah, yeah, yeah.
Starting point is 00:47:58 So maybe we should dig into this topic and raise discussion about that, yeah. So back to Andrey's blog post. Yes. I agree and I cannot reproduce. And we build lab environments sometimes very different from production just for the sake of cost optimization.
Starting point is 00:48:18 Like if you have 360 core EPIC, fifth generation EPIC on production and almost terabyte or more of memory it doesn't mean you cannot optimize your queries on raspberry pi you can on on magnetic disks very cheap and their query fully cached might be slow for you on production, might be like a second. You go and optimize, you understand you have shitty hardware. In this case, you just see, okay, it takes
Starting point is 00:48:53 minutes or hours. But buffers are the same, and plan is the same. You can make decisions on super cheap hardware with smaller caches. Everything is smaller. very slow disk, you can still make cautious decisions and move forward with optimization, find solution, see that you optimize buffer numbers like a thousand times and expect similar effect, maybe the same effect on production affecting time accordingly, right? And this is the power of lab environments. I think everyone should have them. Super cheap, shared, so many people and CI pipelines work together and focusing on buffers is magic, right?
Starting point is 00:49:38 Yeah, I think it's easy. I think you're completely right. I think it's easy for people to make the mistake of then also not using sufficient data volumes. I know you don't. I know that's not what you're advocating for. But yeah, it's the data volumes that matter way more than the power of the disks or the specific CPU. Yeah, usually people say, okay, we are going to have weaker hardware for non-production, and we are going to focus on relative optimization. So if it was
Starting point is 00:50:09 a second in production, 10 seconds in this non-production, which is weak, we are going to optimize from there. But it's still weak because status of cache and so on. Just focus on buffers, and then you can understand it. Again, there are exceptions. I agree. Such as we
Starting point is 00:50:25 discussed with RLS and so on. It's possible that your particular case is CPU intensive. But these are exclusions from the main rule. In the absolute majority of cases, you will be dealing with buffers during optimization, right? So congrats with Postgres 18 change. And I also want to thank Andrea Lepikhov for this blog post, which is very welcome. I think we need more posts about the importance of IOMetrics at various levels, one query level, micro level, or macro level like pgSAT statements, pgSAT, kcache. So yeah, I hope this topic will grow and be more popular, this methodology.
Starting point is 00:51:17 Yeah, agreed. Yeah, another thank you to the people involved in getting this committed to Postgres 18. Yeah. Big thanks. Oh, and I think this might be our first episode since we both got Postgres contributor status, Nikolai. Oh, congratulations. Congratulations to you. And thank you to whoever nominated us or whatever the process is there. Thank you to everyone involved. Good.
Starting point is 00:51:42 See you next time. See you soon. Bye.
