Postgres FM - pg_stat_statements

Episode Date: May 5, 2023

Nikolay and Michael discuss pg_stat_statements — why everyone should use it, but also some downsides! Here are links to a few things we mentioned:
pg_stat_statements (docs)
pg_stat_statements (PgPedia)
PGSQL Phriday
Observer effect in pg_stat_statements and pg_stat_kcache (Postgres Hacking session on Postgres TV)
track_io_timing (docs)
Overhead comment (by Tom Kyte, via Jeremy Schneider)
pg_stat_monitor
PGCon
Our episode on query analysis
Marginalia

What did you like or not like? What should we discuss next time? Let us know on YouTube, on social media, or by commenting on our Google doc. If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard
With special thanks to:
Jessie Draws for the amazing artwork

Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres.ai. Hey Nikolay, I picked the topic this week and we're going to be talking about pg_stat_statements. Let's do it. Good choice. You find this one interesting, finally. The reason for this is I'm also hosting this week a blogging event that was started by, I believe, Ryan Booz, probably amongst others, called PGSQL Phriday. So each month somebody hosts this, somebody different, and they pick the topic, and the topic I've chosen for this week was pg_stat_statements, so I'm hoping
to give our entry as an audio one. I'll write a blog post as well, but it'd be nice to have an audio entry from us as the podcast. Okay. We count on you. Yeah. But it's a good choice. This extension: it's hard to find any Postgres user who doesn't use it. Yeah.
It's really popular. I mean, we call it an extension, but... there's a really good series of interviews by Andreas 'ads' Scherbaum. I'm not sure if I'm pronouncing that correctly, but in this interview series that he does, he asks people: what's your favorite extension? And it's not everybody saying this one. It's super useful, but also, as it's another one of these contrib modules, because it comes with Postgres, most people have access to it if they want to use it. Yeah, well, pg_stat_statements is a major observability part of the Postgres ecosystem, in addition to logs (a single log file, actually, right?) and auto_explain, which we discussed last year, I think. pg_stat_statements is much, much more popular. By the way, let's briefly mention that our previous episode, about auto_explain,
was released on Thursday, and you claimed that Heroku doesn't have auto_explain, right? But we recorded it on Wednesday, and on the very same day, Heroku released it. It took like 10-plus years, right? Because among all those years, they could choose any day, but they chose the same day you claimed it doesn't exist. It's funny. Yeah, exactly.
And to let listeners in on kind of what happens behind the scenes, we normally record on a Tuesday, but this week we couldn't get a time we could both do until the Thursday. So we really were recording last minute on the Thursday for a release on the Friday. So I couldn't believe the timings. You sent me some history from there.
I think it's from their GitHub, showing that they were working on it a week before. So I could have found out if I'd looked deep enough. Proof, to decrease our ego, that it was not a reaction. It was an independent event, but it's a very interesting coincidence. So Heroku has auto_explain. And of course, we have pg_stat_statements. Yes.
And actually, a little bit of trivia to link the two: I didn't realize that they were both added in the same release of Postgres, 8.4. Interesting. 2009, I think. Yes, 2008 was 8.3, when the XML data type and functions went into core. I remember it by heart. But the next year was 8.4.
So, super old as well. It keeps getting better. Lots of additions to pg_stat_statements, even in recent years. So it's an old extension, but one that people keep improving. Right, with a lot of disadvantages. Go on. Well, we should discuss why it's needed and so on. If you want to start from the negative parts, we can start from them as well.
I don't care. I'm intrigued. For a start, it doesn't register failing queries. So some failing queries can take a lot of time and resources, consume a lot of resources, but you don't see them if you analyze only pg_stat_statements. And this is a big issue
because probably it's like half of your traffic which is failing queries, right? If you're in a disaster. In extreme cases, right. But pg_stat_statements tells you it's okay. Yeah, let's start with those then. So any other big limitations or downsides? Yeah, of course.
For example, if our listeners check out the latest Postgres Hacking recording on Postgres TV. By the way, Postgres TV reached 2,000 subscribers, not a huge number, but it's growing pretty well. Thank you. And also our podcast being published there, I think, contributed a lot. So we had a session with Kirk and Andrey, as usual, and we wanted to improve the performance overhead of pg_stat_kcache. It's very related to the performance overhead of pg_stat_statements, because basically it's the same code. So pg_stat_statements has a special list of queries in normalized form, so-called normalized form. In my materials, I quite often call them query groups, to highlight that inside those normalized queries we can have many different queries with different plans, right? So I call them query groups.
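To make the "query group" idea concrete, here is a minimal sketch of what the extension exposes; the column names are the real ones from the pg_stat_statements view, while the example query text in the comment is hypothetical.

```sql
-- One row per normalized query text; constants are replaced by $1, $2, ...
SELECT queryid, query, calls
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 5;

-- A returned "query" value might look like this (hypothetical example):
--   SELECT * FROM orders WHERE customer_id = $1 AND status = $2
-- Many concrete statements, potentially with very different plans,
-- are aggregated into this single entry ("query group").
```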
So when a new entry needs to be inserted, it needs an exclusive lock on the shared memory area for pg_stat_statements or pg_stat_kcache. And in this case, if some selects are happening concurrently, in parallel, those selects hold a shared lock and your insert needs to wait. So for example, imagine we have like 5,000 (that's the default) records in pg_stat_statements, and somebody is slowly reading them during, for example, 100 milliseconds. It's quite slow, and while they are reading it, one session, they hold the shared lock, and any new query which is not yet known to pg_stat_statements and is trying to be inserted is blocked, and we have a spike up to 100 milliseconds in latency for that query. It can be any query which is trying to insert.
Sometimes it's some SET command. We can discuss that also. Also a disadvantage. It's called track_utility: SET application_name, SET something like that. It's any SET, SET LOCAL. Off by default, right?
On by default, I think, no? Okay. We should check. I usually see it from my perspective in client setups. You know, we have many layers of defaults, including cloud defaults from RDS or so. I usually see it being on, and it's an issue exactly because of what I explained here.
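For reference, a quick way to check and change the setting mentioned here; the parameter is pg_stat_statements.track_utility (on by default in the docs), and this is just a sketch, since managed providers may expose it only through their own configuration UI.

```sql
-- Check whether utility commands (SET, DDL, etc.) are being tracked:
SHOW pg_stat_statements.track_utility;

-- The "turn it off" advice discussed later in the episode; a reload is
-- enough, no restart required:
ALTER SYSTEM SET pg_stat_statements.track_utility = off;
SELECT pg_reload_conf();
```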
So if you have cases where a new entry needs to be inserted quite often into pg_stat_statements, for example you call pg_stat_statements_reset() every minute, an extreme case... By the way, it's on by default. Yeah. So if you reset it, it's empty, and a lot of new entries need to be written, and all of them need to lock the whole structure, the whole pg_stat_statements content, quickly, for a very short time, but an exclusive lock is needed. And if some monitoring is reading all the content often, for example every 10 seconds, why not, you have a contention issue here in the shared memory area. pg_stat_statements has its own shared memory area, not the general shared buffers, and there are plans to improve this. So in this case, if, for example, you have a lot of different query groups, but pg_stat_statements keeps only 5,000,
according to some algorithm. I actually don't know the details of this algorithm, but some kind of least recently used, maybe. I don't know. It says it gets... Are you talking about how it deallocates, or how things drop off?
It says the least used. Yeah, sometimes some query is known, but it's evicted because there is no space. Only 5,000: pg_stat_statements.max is 5,000 by default. You can increase it, but not more than 10,000, I guess, right? Which is almost the same, just 2x, right? So if a query is evicted and needs to be registered again,
the problem is that registration is a synchronous process. When you call some query and pg_stat_statements doesn't know it yet, it needs to be inserted, and it's inserted not in the background but synchronously, by the same backend. You have a latency spike for your query execution. In this case, this eviction and the back and forth is also a problem. And this can definitely be the case when you, for example, have track_utility on in pg_stat_statements. track_utility means all SET commands are also considered as queries, and the problem is that in all current Postgres versions, utility commands are not normalized. And if you have a Java app, I had it in my life, a Java app which calls SET application_name
to... and then a session ID. In each session, we do this. Every time, this session ID is quite unique; it's a new session. For several queries in the same session, it will be the same, but we need to set it for each session. And it spams our pg_stat_statements content.
I saw like 50 percent of all normalized queries in pg_stat_statements, which I call query groups, being SET commands. And in this case, this amplifies the problem of this contention. And I observed it for some e-commerce site during Black Friday. It was an interesting case. I was wondering why we had latency spikes, up to 500 milliseconds, every five minutes, but with a shift: minute 02, 07, 12, 17 and so on. And then I recalled: oh, it was me who put some pull, like SELECT * FROM pg_stat_statements, run remotely to have some poor man's storage, because monitoring was weak. And I implemented this,
and I shifted it just because I know sometimes people have a lot of other cron jobs starting at even times. So I shifted it two minutes, and it helped me recognize that it was my own script doing this. And then I reproduced this problem.
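A hedged sketch of how you might spot this kind of SET spam yourself; it is only relevant when track_utility is on and, before Postgres 16, SET commands are stored un-normalized:

```sql
-- How much of pg_stat_statements is taken up by SET commands?
SELECT count(*)   AS set_entries,
       sum(calls) AS set_calls,
       round(100.0 * count(*)
                 / NULLIF((SELECT count(*) FROM pg_stat_statements), 0), 1)
                  AS pct_of_entries
FROM pg_stat_statements
WHERE query ILIKE 'set %';
```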
So a couple of weeks ago we had this session on Thursday with Andrey and Kirk, and we implemented a proposal: when a new entry is inserted, instead of a synchronous, unconditional insert, we try to insert it, we make several attempts during like 10 microseconds, and if it's failing, we give up and just do it next time. Why not? And I proposed it; I think it's a good idea.
We discussed it with the pg_stat_kcache maintainer. There are reasons to say that maybe it's not a good idea, so there are ongoing discussions in the pg_stat_kcache repository. But again, since the code is very similar, I agree with the point that it should be fixed in both extensions, pg_stat_kcache and pg_stat_statements, and maybe somewhat differently. But the problem is quite serious in heavily loaded setups.
So the current advice is: turn track_utility off. Makes sense. And this problem goes down. It doesn't fully disappear, but it goes down; pressure decreases. But good news: in Postgres 16, utility commands are normalized by pg_stat_statements. So it's good.
So SET application_name TO something will become a single entry in pg_stat_statements. Yeah, so there's a commit for this, right? We should probably say that it's not guaranteed that that will be in 16, but it's very, very likely. Yes, yes, yes. It's not huge. It's not merged, right?
So I hope it will make it into Postgres 16. Fingers crossed. So thank you to everyone involved in that. You want more bad things? Yeah, let's do it. Why not?
Okay, well, of course, sometimes people listening to me say: you are a Postgres hater. You know, anybody that knows you knows that's not true. I had it with a new audience: when they joined, they thought I said so many bad things about Postgres, so I'm a hater. Well, if I'm a hater, why have I been using it for 18 years, everywhere? Yeah, and it's not like you're locked in. But one thing I did want to check quickly: you mentioned pg_stat_statements.max having a maximum of 10,000. I can't find documentation that suggests differently, but when I go to edit it on a cloud provider, for example, it says the max value is much higher than that. So I'm a bit confused.
I might easily be wrong here, and we should check the source code, of course. Cloud providers can restrict things additionally. For example, I was checking Cloud SQL documentation yesterday and learned that they allow you to set shared_buffers to not more than 60% of memory, and effective_cache_size to not more than 70% or 80%, I don't remember. But sometimes I want more than 100%, because I want to fool Postgres and make the planner work as if it were on a different-sized machine. So cloud providers can add additional limits, but here I would check the source code, and maybe
it's more than 10,000, right? Cool. Interesting question. What's the overhead of having a bigger number here? Yeah, I was wondering whether that would cause more contention or less, but actually... Contention depends on workload. Yeah, that makes more sense. Of course, on this number as well,
but on workload, definitely. And when pg_stat_statements appeared, as you mentioned, 14 years ago, it was like a relief. Because before that, we used only logs, pgFouine and then pgBadger, understanding that it's the tip of the iceberg, because we only log above some log_min_duration_statement like 1 second or 500 milliseconds;
definitely the tip of the iceberg. pg_stat_statements registers everything; well, almost everything, as we discussed: it's limited in the number of normalized queries, but for each query which is already known to pg_stat_statements, it increments the metrics.
Like, aggregated? Right, in a reliable way; cumulatively, in a reliable way. Yes, that's great. Unlike log-based query analysis, where you always have the problem that you see only the tip of the iceberg in terms of duration.
You see only slow queries, but fast queries can be an issue in many cases. Yeah, exactly. That probably leads us nicely into the times when it's most useful. So, one of the times I see it used most... No, no, no, no, no. We're talking about bad things; that was a small break. So, a bad thing is that the overhead of pg_stat_statements is significant. It's noticeable. So have you seen any... I've looked for benchmarks on this kind of thing. I've seen numbers mentioned, but very, very little data behind them.
There is a common consensus that it's around 7%, but it's a number dumped out of nowhere. It depends on many things. It's hard to measure; it's worth measuring for a particular system, a particular workload, with it and without it. But it's worth mentioning that this is if you create the extension, if you actually add it to shared_preload_libraries and create the extension, globally, for all logical databases inside this cluster, right? I guess it's some kind of several percent, that's it. I've heard a lot of single-digit percent numbers; probably most commonly I've heard 2% as a rough estimate, but I guess there's no point throwing out this. Well,
we didn't even talk numbers when we talked about auto_explain, did we, really? But people need to measure it themselves. But I wish there were a few more publicly available benchmarks on this kind of thing. Also, track timing enabled versus not enabled, right? Because pg_stat_statements can also have it or skip it. Are we talking about track_io_timing in this case? Yeah, track_io_timing, sorry. Yeah, yeah. So, well, that's a recent addition, I think. Was that Postgres 15, or maybe a couple more versions back than that?
But it's good that we can have that now in pg_stat_statements, though there is a warning in the docs, again, that it does cause additional overhead.
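For completeness, a small sketch; track_io_timing is a core parameter, and pg_stat_statements only reports I/O timings (the blk_read_time and blk_write_time columns) when it is enabled. Measure the overhead on your own workload before enabling it everywhere.

```sql
SHOW track_io_timing;

-- Enable it cluster-wide; a reload is enough:
ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();
```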
Right. So I think it makes sense to measure overhead only for particular systems, but in general we can claim quite reliably that it should be below 50%, and I personally would pay up to 10%. There was a good discussion about overhead from observability tools recently, raised by Jeremy Schneider, mentioning Oracle experts, Tom Kyte I guess, and so on, saying that this overhead is negative. Because even if you see it as positive, you optimized queries using this tool, so you're already on the very good side, right? So it's negative overhead. It's an interesting point. Yeah, it's an interesting point. So if you analyze the whole picture, having this tool is beneficial, and you can improve your performance using it, especially if you do it routinely. But speaking of bad sides of pg_stat_statements again: lack of visibility into ongoing queries as well. Yeah, things need to have completed. Right. And also lack of plans, and planning time is off by default, right? We don't even get planning time; even in the newer versions that do include planning time, which is not that many, it's off by default. So even... I use Cloud SQL at the moment and they don't even let you turn it on, so it's quite difficult to get planning information. Yep.
But I'm talking about the fact that a single normalized query can contain many cases. A simple example I usually use: SELECT * FROM some table WHERE a column equals one, or equals two. If 99% of the values are one and only 1% are two, in one case you will get a sequential scan and in the other case you will get an index scan. It's two different cases, very different in terms of performance.
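A hypothetical sketch of that example (table, column and the existence of an index are assumptions); both statements collapse into one pg_stat_statements entry, even though the plans differ:

```sql
-- pg_stat_statements stores only the normalized text:
--   SELECT * FROM t WHERE status = $1
EXPLAIN SELECT * FROM t WHERE status = 1;  -- ~99% of rows: planner tends to pick a seq scan
EXPLAIN SELECT * FROM t WHERE status = 2;  -- ~1% of rows: an index scan is likely
```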
And pg_stat_statements doesn't allow you to distinguish them at all. No, good point. So yeah, no example plans. Also, lack of wait event analysis, because it could be combined with pg_stat_statements. For example, if you have a normalized query, if you could build some pie chart of where we spend time,
okay, we have total time, mean time, min and max time; we have this standardization. Some of the building blocks, yeah. Like the query identifier as well now. Right. In some cases, we lose time doing some actual work: it can be CPU, it can be I/O. In other cases, we just wait, being blocked by other sessions.
True. In this case, the CPU is free, it can do something else, right? But where is this info, right? We don't have this info at all. There are many other things, and I know Percona tried to address this with pg_stat_monitor. It's quite good and worth looking at, but its overhead I would definitely measure, as we discussed recently; I would definitely measure and understand it before going to production. But it's an interesting alternative to pg_stat_statements. If they give
you even more ability to tune queries, maybe it's even more negative as an overhead. Maybe, yeah. It addresses many things, like erroring queries and so on. By the way, the biggest disadvantage, in my opinion, is the lack of examples. Until very fresh Postgres versions, it was very hard to find examples in logs and so on. Like, we have some obviously not-good normalized query in pg_stat_statements; we see it's slow, or it's too frequent. Okay.
What are our examples? Especially, okay, if it's frequent, probably we can find one; but how do we find the proper place in the code? Also, when I call something a disadvantage, it doesn't mean it's bad; maybe it's just not developed yet. There is opportunity here. So for example, we found some query which is not frequent, but just slow in terms of average latency.
We need to troubleshoot it. We need to run EXPLAIN ANALYZE to perform micro-optimization, single-query optimization, right? And in this case, we need examples, because, as mentioned, parameters define the concrete query; we need an example to work with. And here we can go either to logs, or to pg_stat_activity, or we can use a modern approach with eBPF or something like that. Or sometimes developers are familiar enough with their application to know what kind of parameters could be inserted. So if it's a slow dashboard that's been loading in their SaaS application, and they know that an account ID is likely to be of this format, and then you've got, you know, probably a date
range or something; you could know that, maybe when the other things are not available. Exactly. Well, it happens, and that's great, but in the general case it's very uncommon. You hear "it works for me"; well, yeah, right, because different parameters were checked, and for 90% of cases for this query we have a good situation, but for 10% we have a very bad situation. We need those examples. And the good news is that modern Postgres versions, 15, the future 16, maybe even 14, the previous one, have a connection between observability parts. pg_stat_statements has had query ID for ages.
Always. It had it always. If you consider currently supported Postgres versions, all of them have query ID in pg_stat_statements. Some weird numbers, sometimes negative. And logs also have the same number in modern Postgres, and also pg_stat_activity. And pg_stat_activity is my favorite choice, because logs usually only cover what's above log_min_duration_statement. By the way, auto_explain doesn't log query ID, right? I think, again, it's committed for 16, but not... Oh, great, great. So we log the slow log;
sometimes it's the easier way, but in many cases we have a case but we don't have good examples, because none of them reached log_min_duration_statement. In this case, we go to pg_stat_activity, and then we bump into a problem: our limit, track_activity_query_size; I don't remember, it's 1,024 by default. And to change it, we need a restart, and for a serious setup that's a big ask. Well, you need to weigh this trade-off, because of the restart. And yeah, so it's worth increasing this track size to 10,000 or something, because this will allow you to collect samples of queries more easily, and with query ID you join this data with pg_stat_statements data and understand the whole picture: this is what pg_stat_statements says, and here are the examples; even fast examples, we will capture them just with some sampling. It's easy, right? Yeah.
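A minimal sketch of that join, assuming Postgres 14 or newer, where pg_stat_activity exposes query_id (compute_query_id enabled):

```sql
-- Sample currently running statements and attach pg_stat_statements metrics.
-- pg_stat_activity.query gives a concrete example text (with parameters),
-- truncated at track_activity_query_size bytes.
SELECT a.pid,
       a.query          AS example_query_text,
       s.calls,
       s.mean_exec_time
FROM pg_stat_activity a
JOIN pg_stat_statements s ON s.queryid = a.query_id
WHERE a.state = 'active';
```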
So this is also kind of a disadvantage: it requires some effort to benefit from pg_stat_statements, because otherwise it's just a building block. You cannot use it directly without some dance around it, because you need examples. You also need snapshots, because, okay, we have cumulative metrics, they are growing, growing, growing, but it doesn't remember when it was reset. By the way, was that also added in 16? Is there a reset time for pg_stat_statements or not? I remember the patch, but I don't know, was it committed or not? What do you mean specifically? If you reset pg_stat_statements, then you have beautiful numbers
which, you know, can be analyzed. You don't need two snapshots. You can work with a single snapshot if you know the reset time. For example, you reset it one hour ago; you have a snapshot; you know these numbers are an observation over one hour. You have averages, right?
But the reset time is not remembered. Unlike regular statistics, pg_stat_reset(): you can check in pg_stat_database when it was reset. But for pg_stat_statements, you cannot check. That's why I always reset. There's a view, isn't there? Isn't it pg_stat_statements_info or something? Maybe in some new versions; in current versions it's not possible to check. I always reset statistics, if I do it;
I try to do it less often because of the problems we discussed earlier, but if I do it, I usually reset statistics both for the database and for pg_stat_statements, and that's how I know the number. Or deal with two snapshots. Yeah, it was added in Postgres 14. You can now see... Oh, I'm lagging. Good.
So it's in 14 and 15 then. Okay, good, good. Yeah, it contains two things. One is a timestamp, in UTC, of when the last reset was, and the other is about deallocation. So you can see how many queries have been kicked out, how many query groups were evicted from that. Let's say you've got 5,000.
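That view is pg_stat_statements_info, added in Postgres 14; a quick sketch of using it, together with the manual reset function mentioned above:

```sql
-- When were the stats last reset, and how many entries have been
-- deallocated because pg_stat_statements.max was exceeded?
SELECT dealloc, stats_reset FROM pg_stat_statements_info;

-- Manual reset (all databases, users and queries by default):
SELECT pg_stat_statements_reset();
```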
I missed that. I don't have 16, and I still don't have a good example of it. Even 14: in the production systems I work with and observe myself, I work with older versions, so that's how I missed it. Okay, that's great news. Yeah, thank you for correcting me. So anyway, we need to deal with two snapshots, and here we come to: okay, it's a building block, and if you want to work with it, you can work with it directly, it just requires additional effort. But it would be good if all monitoring systems supported it quite well, and almost none of them support it
well. For example, consider calls and total time. We need all aspects, all dimensions. We need to take total time and understand, every second, how many seconds we spend processing queries; assuming that we know all of them, of course, 5,000 by default, so not all of them. And if you disabled SET tracking, by the way, you exclude those from consideration, which is also not good, right? I mean, in terms of observer effect. But this "seconds per second" is a beautiful metric: if you forget about contention, transactions and blocking issues, it roughly gives you an understanding of how many cores you need to process this workload. If it's like two seconds per second, okay, two cores would probably be enough, but we will probably want 100. It's a very, very rough
approach, but it gives you an understanding of your workload. And the same for buffers, block numbers: shared blocks hit, read, temporary blocks, local blocks and so on. We need to deal with two, actually maybe three, derivatives here. The first is to divide all metrics by calls; this gives you the average for an individual query: total time divided by calls is mean time, which you have anyway, and all the block numbers divided by calls show that on average each call, each query, hits this number of blocks, reads this number of blocks, and so on; I/O timing as well. So we can divide by calls, and we can also divide by the duration, the time distance between two snapshots.
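A hedged sketch of the per-call derivatives from a single snapshot (the per-second rates additionally need two snapshots and a delta, which is the monitoring system's job); the column names are the Postgres 13+ ones:

```sql
SELECT queryid,
       calls,
       round((total_exec_time / calls)::numeric, 3)  AS avg_exec_ms,
       round(shared_blks_hit::numeric  / calls, 1)   AS avg_blks_hit,
       round(shared_blks_read::numeric / calls, 1)   AS avg_blks_read,
       round((blk_read_time / calls)::numeric, 3)    AS avg_io_read_ms  -- needs track_io_timing
FROM pg_stat_statements
WHERE calls > 0
ORDER BY total_exec_time DESC
LIMIT 20;
-- For "seconds per second" or "blocks per second", take the same columns from
-- two snapshots, subtract, and divide by the seconds between the snapshots
-- (one block is 8 kB by default, so block rates convert directly to bytes/s).
```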
The monitoring system should do this, and it's also super needed. For example, Datadog provides some of these, but not all of them. And in some cases I was like: why don't I have it here?
And I'm speaking not about the dashboards, the final dashboards in Datadog, but about the metrics they collect. They don't collect this, and you cannot calculate it yourself. So there is a lack of info; they don't collect everything. This is the problem.
And if you check out all the other monitoring systems, this problem is the same. And if you analyze yesterday, for example, and you don't have this data, you cannot understand what was happening. And sometimes... The reason this is a problem is because even if we have direct access to pg_stat_statements,
we only have it since the last reset; we only have the current snapshot's information. Of course, we need to see spikes, when an incident started, when it ended. And inside that, I would like to have all the metrics pg_stat_statements has, divided by calls and divided by the time diff.
And I mentioned a third one: it's percentage. So it's interesting to understand, for analysis, like top-down analysis, which query is very data-intensive in terms of hits and reads in shared buffers. I would combine them, divide, and see. Maybe hits and reads separately are also good.
Because I had incidents when reads were not a problem, hits were the problem, and it was not collected by monitoring, not presented. But eventually we realized that if it had been, we would have done root cause analysis much better, much faster, because some query was dealing with data fully cached in the buffer pool, but it was so intensive, not only because of call frequency, but just because a single query was reading so many gigabytes from the buffer pool. And it was not visible, because nobody expected that blocks hit should be present in monitoring as well. It was a surprise.
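A sketch of the kind of percentage view described here: which query groups dominate buffer hits, even when nothing is read from disk.

```sql
SELECT queryid,
       shared_blks_hit,
       round(100.0 * shared_blks_hit
                 / NULLIF(sum(shared_blks_hit) OVER (), 0), 1) AS pct_of_all_hits
FROM pg_stat_statements
ORDER BY shared_blks_hit DESC
LIMIT 10;
```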
So I would take everything and divide. I know there's a price in terms of monitoring storage and so on, but it's worth it, because we already spent some percentage of overhead in Postgres itself to collect this data; please deliver it to monitoring, right? So currently, I see monitoring systems failing to do this job well. And one highlight: at PGCon I will have a tutorial,
a three-hour tutorial. We will talk about various aspects of Postgres monitoring, and this is a very big aspect of it: how we work with pg_stat_statements and similar extensions, how we present the data, which metrics we need, and which derived metrics we need as well.
Right. Nice. Am I allowed to talk about any positives yet? Okay. So let's talk about positives. 100% you should install it. It's super useful, despite all the downsides.
I imagine there's an exception to this, but most applications I've seen would benefit from having this information. A few times I've tried to help people and they didn't have it installed, or the hosting provider didn't turn it on by default, and we were kind of hamstrung until they'd turned it on for a while. It's very difficult to start to get an idea of what's going on without this kind of information, without any logging. So even if you aren't doing anything with it currently, in the future, if you need help from somebody, having it on can be really beneficial. One thing: you mentioned dividing by the number of calls a lot, and I see quite a lot of guides where that's the kind of example
query they give for querying pg_stat_statements: looking at a few things like total execution time, and then mean execution time using that calculation, and a few other things, but then they order it by mean execution time. And I understand looking at the slowest queries by average time, but more often I see people that want to look at queries by total load; they're looking to reduce total load on the database more than the average, you know, the worst performing in terms of length of query. So I prefer to start looking at that ordered by total time. And that goes to what you were saying about looking at even fast, "fast" in inverted commas, queries that are executed a huge number of times. Well, yeah, you know, I have my approach here.
I think we need to order by almost everything. We need to have different tables. And if, for example, our monitoring also limits the entries, not even 5,000, because it cannot handle all of them and needs to limit them, we probably need to have a few lists. For example, if we limit to 100 or 500, we need to have several ordering approaches. And of course, total time versus mean time is the most popular, but there are also I/O metrics: order by hits plus reads,
or order by hits, or order by reads, two lists. But speaking of total time versus mean time in ORDER BY, I have a simple approach here. If you care more about humans, you should order by mean time. If you care more about machines, computers, you need to order by total time. Because if your optimization goal is to reduce resource utilization, you order by total time, because total time is like the final metric. Of course, again,
there are some lock issues, blocking issues, which can increase total time. And unfortunately, we cannot easily distinguish that, since wait event info isn't combined with pg_stat_statements, right? We could say, okay,
we order by total time, but we exclude locks. We don't want those, because a lock is like some application logic problem, it's not resource utilization. But a rough approach is: order by total time and optimize all queries which exceed 20 percent of it. Calculate the percentage; if some query group, normalized query, exceeds 20 percent, it's a good candidate for optimization, if you care about machines. Probably users won't notice. Cost, yes. Well, yes, if you reduce utilization, it sometimes also converts to dollars.
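A sketch of that "machines" view, ordering by total time and computing each group's share; swap the ORDER BY to mean_exec_time for the "humans" view discussed next.

```sql
SELECT queryid,
       calls,
       round(total_exec_time::numeric, 1)  AS total_ms,
       round(100 * total_exec_time::numeric
                 / NULLIF(sum(total_exec_time::numeric) OVER (), 0), 1) AS pct_of_total,
       round(mean_exec_time::numeric, 3)   AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```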
And users care if a system is overloaded; that's affecting everybody on the system, for example. In the edge case, if it's saturated. But if it's 20% CPU and you order by total time and reduce it, maybe people won't notice, because maybe it's a query which executes below one millisecond on average, but it's so frequent, right? A super fast query, already well optimized, but it's called a thousand times per second. I had many cases, like, for example, SELECT 1. And you see 30% of total time is this SELECT 1. What? Why do we need it? We pay for it, right? But users don't see it, because it's just one millisecond or even below one millisecond. A SELECT 1 should be like
some dozens of microseconds, probably. But if you care about humans, or you have not OLTP but some analytical system, and again users, that matters, right? We need to order by mean time, average time, and find where users suffer due to those queries. And if we optimize them,
probably our computers won't notice, because, for example, it's responsible for less than 1% of total time, but in terms of mean time it's still very bad. Like, above one second, users need to wait. Or in an analytical system, like, one minute. It's very low resource usage, but very bad user experience. So, resource usage is total time; UX, user
experience, is mean time. This is my rule. It's quite easy to explain and people usually understand. I like it. But let's not forget about buffers, right? Our motto is buffers first, well, maybe not here, but if you think, okay, total time is big, or mean time is big, what's happening here? We should take shared blocks hit and read, and then either take them as a whole between two snapshots, let's say, if we care about total time, or divide by calls if we care about mean time. And we understand: okay, mean time is huge,
but because, on average, a single query deals with gigabytes of data. And that's the problem. Probably it's a SELECT count(*) or SELECT sum(), some aggregates, right? Let's think about how to improve it. Yeah. So it's really good for splitting out all of the different buffer numbers; I think there are about 10 different types that you get, the same ones you see in EXPLAIN (ANALYZE, BUFFERS), you get via pg_stat_statements. And it could be used, like, we could look at temp blocks, for example: we could look at which queries are most often using the most temporary space, and what they use on average. That kind of thing could be really handy.
The other thing that they've added, that isn't buffers related, but I didn't realize they had and could be really useful for one certain case... What did they add? I was going to say just-in-time compilation statistics. I've seen quite a lot of people considering upping the cost limits for that or turning it off completely. But this would be a really good way of...
My advice is to turn it off completely for OLTP. That's it. Yes, but it's quite cool that you can look in pg_stat_statements and see which queries are using it, and which ones on average. So you could order by that for that ad hoc analysis.
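Two ad hoc checks along those lines, sketched; the temp_blks_* columns have been there for a long time, while the jit_* columns appear in pg_stat_statements from Postgres 15.

```sql
-- Who writes the most temporary data, and how much per call on average?
SELECT queryid, calls,
       round(temp_blks_written::numeric / calls, 1) AS avg_temp_blks_written
FROM pg_stat_statements
WHERE calls > 0
ORDER BY temp_blks_written DESC
LIMIT 10;

-- Which query groups actually trigger JIT, and how much time does it cost them?
SELECT queryid, calls, jit_functions, jit_generation_time
FROM pg_stat_statements
WHERE jit_functions > 0
ORDER BY jit_generation_time DESC
LIMIT 10;
```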
A data-driven decision, versus just listening to me. Right, right, right. Yeah, that makes sense. Definitely. But speaking of hits and reads and so on: if you divide by calls, you have the average, and it's super important to have the average for hits and reads. It makes
sense also for temporary files, but with temporary files sometimes we just quickly realize we need to increase work_mem, that's it. But sometimes, yes, it makes sense. And shared buffers is the most interesting: okay, our average is this, gigabytes for a single query, not good. But for total-time-centric analysis, for machines, right, if we consider hits, we can again divide by the duration, the number of seconds between two snapshots, and we can imagine this flow: okay, this query group is responsible for one gigabyte per second, or 100 gigabytes per second, of dealing with shared buffers, and 100 megabytes per second of dealing with the underlying page cache; because this is maybe not only disk, it's the page cache as well, depending on the sizes, and we don't have visibility into that without pg_stat_kcache. And this is super good for
engineers, to understand these streams: okay, this generates this I/O. And this is even without timing; it's already super useful, the number of buffers per second. Especially in a world where we're seeing, you know, providers that charge by that, if that's the metric that you're paying for, for example. Aurora, for example. Yeah. Yeah, makes sense. But WAL numbers, that was added in Postgres 13 maybe, or 14?
13, I think. Both to EXPLAIN ANALYZE and to pg_stat_statements, and it's super good. Just checked, yeah. Sometimes you have terabytes of WAL generated per day. You enable WAL compression to reduce it; it's already better, but still a lot. It keeps increasing, your backups are growing, replication needs to
transfer all of these things. Probably it's because of index write amplification, probably because of something else, and pg_stat_statements can show you which queries are responsible for, for example, 20% of all WAL generation among all queries. Of course, actually, CREATE INDEX will also go there, right? It will be registered by pg_stat_statements: CREATE INDEX, or REINDEX CONCURRENTLY. Is that a utility command? Would it be under utility commands? No, I think it's DDL, it's not utility; utility is SET.
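A sketch of that WAL breakdown, using the wal_* columns that pg_stat_statements gained in Postgres 13:

```sql
SELECT queryid,
       calls,
       pg_size_pretty(wal_bytes)                                      AS wal,
       round(100 * wal_bytes / NULLIF(sum(wal_bytes) OVER (), 0), 1)  AS pct_of_wal
FROM pg_stat_statements
ORDER BY wal_bytes DESC
LIMIT 10;
```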
Yes, and autovacuum won't show up here. Autovacuum also generates some WAL; processing tables and indexes generates some WAL, but we won't see it in pg_stat_statements. But normal SQL, including DDL or index maintenance like REINDEX CONCURRENTLY, we will see, and we can again understand the percentage for each query group, for each
normalized query, and make conclusions: okay, our index maintenance is a lot of WAL, but we move it to the weekend, so we are fine. Or this whole group of updates, for example, is responsible for 10 or 20 percent of all WAL generation, and we check: we probably drop or adjust some index, we make these updates HOT, reduce the index write amplification problem, or maybe drop some indexes, sometimes unused or redundant ones. We had an episode about index maintenance; it's important. And in this case, less WAL will be generated, and we will be better in terms of replication, backups, and so on. Less pressure on those subsystems.
I feel like we've gone into quite a lot of detail here. I'm glad we have, but I feel like we might be coming to the end.
Yeah, I'm just illustrating why everyone should enable it: CREATE EXTENSION pg_stat_statements, in just one database. You don't need to install it everywhere; it's a single thing, actually. You can install it in the postgres database, for example, and work as a superuser, and you will see everything from all users, all databases, all queries, globally. And there's a chance that on some cloud providers it's on by default, so you might not have to, but it's worth checking. Well, yeah, rds_superuser and so on. By the way, also interesting dimensions: dbid, or database, and user; which user, if we have segregation, if we split, divide and conquer, these parts of our application use this DB role, this DB user. Humans,
they have their own database users. So we can perform analysis, and we have segmentation here; we can understand who is using what, who is generating load, who is responsible for total time, I/O metrics and so on.
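A sketch of the setup and of that per-database / per-user segmentation; the preload step needs a restart, and the column and catalog names used here are the standard ones.

```sql
-- postgresql.conf (restart required):
--   shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Who is responsible for what, by database and role:
SELECT d.datname,
       r.rolname,
       sum(s.calls)           AS calls,
       sum(s.total_exec_time) AS total_ms
FROM pg_stat_statements s
JOIN pg_database d ON d.oid = s.dbid
JOIN pg_roles    r ON r.oid = s.userid
GROUP BY 1, 2
ORDER BY total_ms DESC;
```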
And also, a super interesting case: okay, we found some query. We talk about performance analysis starting from the database point, but the database is just the heart of any system, right? We also have applications, humans, many things, and sometimes we need to trace and understand which code is doing this. And this is a super hard topic, because, for example, if you take Datadog, it has application performance monitoring, APM. So if you analyze performance from the application side, you can trace and say: this is the code generating it. You also have network latency involved in the latency analyzed, and that has its pros and cons as well. It's beneficial because for the application you have the final latency, including network round-trip time, but you cannot distinguish how much of those milliseconds or microseconds is network versus Postgres; you don't see the network latency separately. And from Postgres you cannot trace to code, unless you use some tricks. For example, sometimes people start
putting some comments in queries. For example, for Ruby there is a library called Marginalia. It can help you trace; it can be a URL, for example, put into the comment. And pg_stat_statements will ignore comments for normalization: it will keep the first comment it sees, but all subsequent ones will be ignored in the normalization process. And this can be used. So you put some comment there and you can trace them: okay, this query group is from that part of my application, that URL, or this module, or this file. You need to invent some way, and it requires some effort. There is an interesting effort from some Russian developers, for pg_stat_kcache originally. So, the idea is to tag.
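A hypothetical sketch of the Marginalia-style pattern: the comment is added by the application, it does not change the query's fingerprint, and the first-seen text (comment included) is what shows up in pg_stat_statements. Table and comment contents here are made up.

```sql
-- What the application sends:
/* app:web, controller:dashboards, action:show */
SELECT * FROM orders WHERE customer_id = 42;

-- pg_stat_statements stores the first-seen text with the constant normalized:
--   /* app:web, controller:dashboards, action:show */
--   SELECT * FROM orders WHERE customer_id = $1
-- so the entry itself hints at which part of the application issued it.
```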
So you have some dimensions for pg_stat_kcache entries; you tag, for example, application module equals this, and you start distinguishing different calls and metrics, and you can have specialized segmentation depending on your application logic. I think it would be good to consider a similar improvement for pg_stat_statements as well, because of this tracing problem. Sometimes we, database experts, lose with our tooling; we lose to application tooling, this APM stuff. They say, you know, it's more convenient for us because we see exactly which code generated it; with your pg_stat_statements, we cannot do that.
Okay, I'm talking about downsides again. I think it's time to call it a day. That was really helpful. Thank you so much, Nikolay. Thanks, everyone, for listening. Thank you for this topic. You can see it resonates with me a lot because, yes, I spent many years working in this area.
I hope it was helpful. Good luck with your blog post. Thank you. Okay, good one. See you next week. Bye-bye. Thank you. Bye.
