Postgres FM - Query macro analysis intro
Episode Date: September 16, 2022

Here are links to a few things we mentioned:
- pg_stat_statements
- pgFouine
- pgBadger
- pg_query
- pg_stat_activity
- auto_explain
- Can auto_explain (with timing) have low overhead? (blog post by Michael)
- track_io_timing
- pgbench
- PgHero
- pgCenter
- pgwatch2 (Postgres AI edition)
- pg_stat_kcache
- PASH Viewer

Postgres FM is brought to you by Nikolay Samokhvalov, founder of Postgres.ai, and Michael Christofides, founder of pgMustard, with special thanks to Jessie Draws for the artwork.
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard, and this is my co-host Nikolai, founder of Postgres.ai.
Hey Nikolai, what are we going to be talking about today?
Hi Michael, a few weeks ago we discussed what I call query micro-analysis.
When we have only one query and we want to understand how it works, how it will be executed in production. Let's talk today about the bigger picture,
about how we can analyze the whole workload and find the worst parts of it.
Yeah, so if that was micro performance analysis,
this I've heard you call macro performance analysis.
So maybe there's nothing wrong, but we want to be able to see at a glance,
are there any big issues?
Is there a spike in some metrics, something like that? Maybe it's monitoring related, maybe it's for others, I don't know, like SREs and so on.
And we want to identify the parts of the workload which behave the worst.
And this is one thing.
Or we just want to optimize resource consumption to prepare for future growth.
And we don't want to spend more on hardware, so we want to find, again, the worst-behaving parts of the workload and optimize them.
There are many different cases.
Yeah.
So, I mean, it could be an application developer telling us, it could be customers reporting that there are issues.
And maybe we want to find out which part is slow or maybe we want to show them that there isn't an issue.
You know, we can't see anything database side.
All right.
One of the cases I like especially is when we perform workload analysis as a whole during various kinds of preparations, various kinds of testing before we deploy.
And this is also interesting.
And there we can also try to understand whether all parts behave well,
or whether there are some
not-so-well-behaving parts, so we should optimize before we deploy. But let's just start from maybe
historical aspects of it. Like 15 years ago or so, we didn't have pg_stat_statements, which right now
is the de facto standard extension for macro analysis. And there is consensus in the community that pg_stat_statements
should be enabled in all Postgres installations.
By default, this extension is not installed,
but everyone should consider installing this extension
because it has a very, very small overhead,
but it's like the place where you probably want to start your macro analysis,
understanding how workload behaves.
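For anyone setting it up on a self-managed instance, a minimal sketch looks roughly like this (assuming superuser access; managed services usually only need the CREATE EXTENSION step):

```sql
-- Add pg_stat_statements to the preload list (note: this overwrites the setting,
-- so append to any existing list instead); a restart is required afterwards.
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';

-- After the restart, in each database you want to analyze:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Quick sanity check:
SELECT count(*) FROM pg_stat_statements;
```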
But before pg_stat_statements, we had only logs.
And the idea was, okay, we log all slow queries,
for example, queries whose execution takes longer than one second.
And I remember when I was briefly, just one week,
I was a user of MySQL and switched to Postgres.
And then it was a confirmation of, like, my choice was right.
It was confirmation when I found that in Postgres, I could go down below one second.
I mean, log_min_duration_statement.
And then log queries, for example, which take longer than 100 milliseconds.
But in MySQL, it was not possible.
And one second was the lowest value.
Right now, they fixed it already, and you can go down below one second.
But at that time, it was in 2005, 2006, 2007.
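For reference, on a modern Postgres the threshold is set like this (a sketch; 100 ms is just an example value, pick what fits your workload):

```sql
-- Log every statement that runs longer than 100 ms:
ALTER SYSTEM SET log_min_duration_statement = '100ms';
SELECT pg_reload_conf();  -- no restart needed for this parameter
```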
And we log all slow queries, and then we can parse logs.
And I remember we had a tool called pgFouine, written in PHP.
And then pgBadger was created, written in Perl.
It's much better in terms of performance.
It has much more features and so on, like more robust.
It's being developed still.
And by the way, yesterday they released version 12, I think.
I saw it in the news, right?
So pgBadger 12.
More and more features; right now it can work with auto_explain and many things.
So the idea was, let's parse those
queries, remove parameters from them, and aggregate them. It's the process which, in pg_stat_statements
source code terminology, is called query normalization. And then we will show the worst, according to some metric,
the worst query groups.
But the problem with this approach is that this is only the tip of the iceberg.
We might have many more queries which are not visible in logs,
but they produce the most load.
Sometimes 90% of load is produced by queries which are
under log_min_duration_statement. So some DBAs used an approach like, let's enable all
query logging with duration for a few minutes and collect logs. And some, including myself,
found a way to store logs in memory. It was quite risky,
so I used it only a couple of times. We create a drive in memory and we put logs there,
but it's only, I don't know, half a gigabyte, because memory is expensive, and we
make very aggressive rotation so we don't let the Postgres log saturate this small disk.
But since it's in memory, we can afford logging a lot.
And we can set log_min_duration_statement to zero, meaning let's log all queries.
That's the big downside: logging does have an overhead if you're logging excessively.
So that seems to be a good argument for not doing it. Well, that's why we don't have to anymore. Right, right. Yes, logging overhead may be
very big, and it's not noticeable until some point when the drive where you write logs is saturated
in terms of write disk I/O, and then everything goes down and it's not fun. This is one of the worst observer effects
I had in big production cases
and it was very painful.
So I don't recommend going down to zero
in terms of log_min_duration_statement
blindly and without proper preparation.
But anyway, this right now
we consider an outdated approach,
because we have pg_stat_statements, right?
And we don't need to log
all queries anymore. We still need to log some queries, because pg_stat_statements doesn't have
examples. In pgBadger reports, for aggregated normalized queries, I call them query groups,
pgBadger provides a few examples. And this is very important because in each query group,
you might have different cases. The same query, abstract query without parameters,
might behave very differently in terms of execution plan,
depending on parameters.
So when you already identified queries you want to improve,
you need examples, always.
And this is tricky.
Guessing examples is a big, not yet solved task.
I think it's a good task for machine learning and so on.
I'm very interested in this area.
If any of our listeners are also interested in this area,
please let's talk.
Because I think it's a very interesting area to automate,
to improve, to allow us to improve more queries with less time, right?
But at the moment, we can patch it together
via a mixture of pg_stat_statements and logging.
Yeah, we can combine logging and pg_stat_statements.
Also, query ID helps in the very recent Postgres versions.
Before query ID, we used pg_query from Lukas Fittl, right? So it's an additional ID. And the good
thing about this library or tool is that if you apply it to an already normalized query, it will produce
the same fingerprint as for the non-normalized raw query. But anyway, if you use logs, you can find examples.
Another source of query examples is pg_stat_activity.
But in most cases where there was a lack of DBA involvement,
I saw that track_activity_query_size, or however it's called, I always forget.
There is a parameter which sets the maximum length of the pg_stat_activity.query column.
And by default, it's 1024 only.
Right, right.
And it's not enough.
Our ORMs or humans, they can create much bigger queries these days.
So we want to put like 10k there or something.
Overhead is very small, so I always recommend increasing it.
But increasing it requires a restart.
This is the main problem.
Yeah, so one of those ones you have to do at the beginning normally, isn't it?
Yeah, yeah, yeah.
So this parameter should be increased, and if we increase it,
we have more opportunity to get samples of queries from pg_stat_activity.
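As a sketch (the size value is just an example, and the join assumes Postgres 14+, where pg_stat_activity exposes query_id):

```sql
-- Raise the length limit for pg_stat_activity.query; requires a restart:
ALTER SYSTEM SET track_activity_query_size = 10240;  -- ~10 kB, default is 1024

-- On Postgres 14+, currently running statements can be matched to their
-- pg_stat_statements entry by query ID to get concrete examples:
SELECT a.pid,
       a.query           AS example_query_text,
       s.calls,
       s.total_exec_time
FROM pg_stat_activity   a
JOIN pg_stat_statements s ON s.queryid = a.query_id
WHERE a.state = 'active';
```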
Then we can join or match them with pg_stat_statements data,
and then probably logs are not that needed,
unless you use auto-explain,
because auto-explain is also very useful.
And again, your article about the overhead
and how to measure it
and the idea that sometimes it's not that big is good.
And maybe you want also this,
because in this case,
you see plans exactly as they were during execution,
because plan flips happen as well.
Yeah, I was going to ask actually,
we've talked about the overhead of these things a little bit.
You mentioned the overhead of pg_stat_statements is low.
I've seen some people mention that they've tried to benchmark it and struggled.
But have you seen anybody actually struggle to measure the overhead on a normal workload? Have you seen any benchmarks of it?
I haven't, I haven't. I trust current experience already, but we can do it. This is not the
most difficult benchmark in the Postgres ecosystem, so we can just do it. Of course, it's
good if you can benchmark for your own workload.
And the question is how to reproduce workload in a reliable way so each run is the same or very close to each other.
And then we can just use various metrics which the database has without pg_stat_statements.
For example, from pg_stat_database, if we enable track_io_timing, and by the way,
usually we should enable it. Of course, we also discussed it. There are cases on some hardware where
it can be expensive, it's worth checking. But if we enable it, we can produce workloads,
check this metric, then reset, or we can also check throughput and especially latency
from the application side, if we use an application called pgbench, which, I don't know why, Ubuntu ships in the server package, not the client package.
But if we use that application, it reports all latencies. Or sysbench, anything.
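For reference, a sketch of enabling track_io_timing and reading the database-wide counters:

```sql
-- Enable I/O timing (worth checking clock-source overhead on your hardware first):
ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();

-- Accumulated read/write time (in milliseconds) for the current database:
SELECT datname, blk_read_time, blk_write_time
FROM pg_stat_database
WHERE datname = current_database();
```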
Yeah, what I meant, I think, when I said struggled, what I meant was the variance of each run is larger than any overhead. So either you...
You're saying you can't say the overhead is zero, right? Because it's definitely doing some work,
but it's not necessarily measurable, because a few percent... I don't think it... I think it might
even be lower than that for some OLTP workloads.
Well, benchmarking is an area we probably need to discuss one day separately, but my general
advice is: by default, many benchmark tools, and pgbench is no exception here, don't do
regular load testing, they do an edge case of load testing called stress testing, like let's load it 100%. By default, that's why I
usually suggest finding the spot in terms of TPS. You can control TPS in pgbench. You find a
spot, like loading your system in terms of, for example, CPU or disk I/O, between 25 and 50 percent,
emulating normal days of your production. Because if it's above that, you already should think about upgrading or very heavy optimization.
And in this case, you should check latencies and compare in each run.
And in this case, variance should be the same.
I don't know why they are different.
Something is wrong.
I would like to see the concrete case and understand.
Yeah.
Well, so back onto... so now that we have pg_stat_statements,
there's a few things to mention there.
It's not on by default unless you're on a cloud provider.
They often do have it on by default.
So people do need to load it if they don't already. I come across quite a lot of customers who don't,
if they're self-managing, they're not even aware it's a thing.
So there probably are a bunch of people out there who don't have it on,
even though experienced people recommend it.
Everyone should have it.
Yeah, I think so.
But then again, so there are a few default settings there
that I think can be improved.
There's like a, is it 5,000 statements or 5,000 unique?
Yeah, you're talking about the pg_stat_statements.max parameter,
which is, as I remember,
5,000 by default.
Usually it's enough,
but in some cases it's not enough
when your queries are very volatile
in terms of structure,
not in terms of parameters
because pg_stat_statements
during query normalization
removes parameters,
but in terms of the structure.
If you just swap two columns
in your query,
it's already considered as two different
cases, two different entries in the pg_stat_statements view. And pg_stat_statements.max is 5,000. So
you can increase it, but as I remember, only to 10,000 maximum. I don't remember exactly.
But why do we care? Because pg_stat_statements has metrics which just grow over time,
incremental metrics, like total time, total exec time, total plan time, because they were split in Postgres 13, as I remember.
In this case, we need two snapshots to analyze.
We need two snapshots and two numbers, and then difference between those two numbers is what we have during our period of observation.
Snapshotting is absolutely needed. Otherwise, there is the manual approach:
let's reset statistics often and then use only the final snapshot, thinking that we started
everything from zero.
It has downsides.
For example, filling it with new entries has overhead as well.
If you check the source code, when we add an entry,
there is a lock there, and it
might be noticeable, for dozens or sometimes like 100 milliseconds, it can be noticeable for
the whole workload. So it's better not to reset very often, in my experience. And anyway, it's not
practical to reset them, you lose information as well. Then the question is how often we have evictions of rare queries,
and what the drift of our query set is. And you can see it, you can compare the
difference between two snapshots, for example with one hour between them. And you can see which queries are new
and which queries disappeared from the list. And this difference indicates the eviction speed.
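A minimal sketch of the snapshot approach (the pgss_snapshot table is hypothetical, and for simplicity it matches entries by queryid only, ignoring userid and dbid):

```sql
-- Take a snapshot of pg_stat_statements, e.g. once per hour:
CREATE TABLE IF NOT EXISTS pgss_snapshot AS
  SELECT now() AS captured_at, * FROM pg_stat_statements WITH NO DATA;
INSERT INTO pgss_snapshot SELECT now(), * FROM pg_stat_statements;

-- Difference between the two most recent snapshots, worst query groups first:
WITH t AS (
  SELECT DISTINCT captured_at FROM pgss_snapshot ORDER BY captured_at DESC LIMIT 2
),
cur  AS (SELECT * FROM pgss_snapshot WHERE captured_at = (SELECT max(captured_at) FROM t)),
prev AS (SELECT * FROM pgss_snapshot WHERE captured_at = (SELECT min(captured_at) FROM t))
SELECT cur.queryid,
       cur.calls           - coalesce(prev.calls, 0)           AS calls_in_period,
       cur.total_exec_time - coalesce(prev.total_exec_time, 0) AS exec_ms_in_period
FROM cur
LEFT JOIN prev USING (queryid)
ORDER BY exec_ms_in_period DESC
LIMIT 10;
```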
And I've noticed that some, for example, Java applications,
they use SET application_name to something very unique,
including maybe some process ID or something.
And pg_stat_statements cannot normalize so-called utility commands, and SET is a utility command.
So these queries are all considered separate, all of those SET commands.
In this case, you might want to turn off pg_stat_statements.track_utility, which is on by default.
And in this case... because I haven't had cases when we needed to analyze the speed of SET commands.
Well, maybe it might happen, but in my experience, not yet.
So it's better to just turn it off.
It's on by default.
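A sketch of turning it off:

```sql
-- Stop counting utility commands (SET and other non-DML statements) separately:
ALTER SYSTEM SET pg_stat_statements.track_utility = off;
SELECT pg_reload_conf();
```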
That makes loads of sense.
I think talking about the snapshot comparisons,
I think that must be how the cloud providers all do it.
And a lot of the dashboards that you'll see in RDS, Google Cloud SQL,
there's a bunch of other ones as well.
Or even open source tools. Talking of tools that have had releases
recently, PgHero came out with version three, I think yesterday.
I haven't noticed.
Good.
Interesting.
Yeah.
It's very, very lightweight and a good tool for small teams.
I enjoy it, yeah.
And it's based on pg_stat_statements again.
So it's the basis for lots of these.
But back to the cloud providers,
the way they get historic data
is by taking these snapshots and rolling them
and comparing them to each other,
not by rolling it forever on the same.
They don't want to reset once a year, for example,
if that makes sense.
Right.
Well, so why do we care about this eviction speed and transition to new list?
Because we want to analyze the whole workload, right?
And in this case, of course, disabling track_utility, we don't see this utility part.
But if you consider it small, we can do it.
But we will have more real queries in our pg_stat_statements.
And we will have 5,000 by default.
It's quite a big number.
But the second place where a cutoff can happen
is the monitoring system or cloud provider.
I don't know.
They usually store like 500 or 1,000.
They don't take everything,
because it's expensive to store everything in a monitoring system.
If you want snapshots, samples of pg_stat_statements, every minute, for example,
imagine how many records you need to store.
Every minute you store 5,000 entries from pg_stat_statements.
So usually they also cut off, and when making decisions about what
to remove and what to leave, we usually think about which metrics are the most important.
Yeah, so actually, I think you already mentioned it briefly, but total time
is my favorite. I know, I know we've talked about mine. Mine as well, but I saw people who prefer not total time.
Actually, in my team, there are such people
who prefer, for example, average time, mean time,
mean exec time, or mean exec time plus mean plan time,
because we probably want to combine them,
because processing a query includes both planning and execution.
I mean, okay.
Sorry.
Yeah, we call that total time by summing the two,
but there's no such parameter.
Total is already used in different contexts.
Exactly.
Well.
But the problem,
this goes back to our conversation
about logs versus pg_stat_statements, though.
The reason, I guess, the reason for you as well,
but I'd be interested.
The reason I prefer total time is
you could easily have your biggest performance problem being...
So you understand why total is used twice here, right?
Because total is sum of all timing for...
I mean, there is total exec time and total plan time. Yeah.
Total, total?
No, it's not good.
Like total whole.
How to name it?
Yeah.
Well, they don't.
And they, but we can sum those at the query level, right?
We can sum the two of them if that's what we care about.
But my, yeah, sorry.
What I guess I was trying to say was: our biggest performance problem could easily be a relatively fast query which has a really low
average, by mean time. So it could be, on average, 20 milliseconds, but it's getting run
so many times, and maybe it's still not optimal, maybe it could be running in sub
one millisecond, and that could be our biggest performance opportunity. And by looking at
total time, total execution time plus total planning time, we could see that it could
rise to the top of our query list. It could be line number one. Whereas if we're looking at average
time, there could easily be so many queries that only run a couple of times but take a few seconds
each. They'd be long above it. This is an interesting topic.
Which metric is more important?
By the way, the lack of words here
indicates that the topic is quite complex, right?
I mean, English doesn't have enough words to provide it.
I'm joking, of course, but interesting.
Total what time if you combine both exec and plan?
Some word should exist,
and we probably already use somewhere.
So total versus average or mean time.
Like, I came to a conclusion like this.
If our primary goal is resource optimization,
if we want to prepare for future growth,
we want to pay less for cloud resources or hardware,
total time is our friend.
Because this is, well, of course, it includes some wait time as well.
For example, we have a lot of contention and some sessions are blocked by other sessions.
It also contributes to total time, but resource consumption probably is not that much because waiting is quite cheap usually, right?
But if we forget about this a little bit:
total time, both plan and exec time, if we combine them, this is our time spent to process our
workload. If we know that we analyzed everything, this is how much work was done. We can even take
this total-total time and divide it by the observation duration. And we will understand how much time we spend every second.
I call it a metric in seconds per second,
my favorite metric.
For example, if it's one second per second,
it means that roughly one core could process this.
It's very rough, we forget about context switches here,
of course, and so on. But it gives someone a feeling of our workload. If we have 10 seconds per second
needed to process, it's quite a workload already. We probably need some beefy server here.
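A rough sketch of that calculation, using the time since the last stats reset (pg_stat_statements_info requires Postgres 14+, and total_plan_time is only populated when pg_stat_statements.track_planning is on):

```sql
-- "Seconds per second": total plan + execution time, divided by the observation window.
-- A value around 1 means roughly one CPU core busy with query processing.
SELECT sum(s.total_exec_time + s.total_plan_time) / 1000.0
       / extract(epoch FROM now() - i.stats_reset) AS seconds_per_second
FROM pg_stat_statements s, pg_stat_statements_info i
GROUP BY i.stats_reset;
```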
As for average time, these numbers are most useful when we have a goal like: let's optimize for the best user experience.
Yeah, so I guess that's our 50th percentile, isn't it? Is it? No, it's not, it's not. So even then, I don't prefer...
I don't even like mean for those, because I'd much rather look at a p95
or something, and look at the client side, not the database side. Well, it depends. But sometimes, as you said, sometimes we have very, very rarely executed queries, but quite
important ones.
For example, it can be some kind of analytics, not analytics, but some aggregation and so
on.
And the average is terrible.
And we do want to improve it, because we know that the users who look at those numbers, or
who use these queries, these users are important. For example, our internal team analyzing
something, or, I don't know, finance people or something, and some kind of a more analytical
workload. It's not necessarily analytical, but I think, I hope you understand, right?
I think I understand. So, to give you an example: when I was at a payments company, we had daily batch
payments. We had a deadline to submit a file, it was like a 10 p.m. UK time
deadline, and the job literally had to finish before then. And as this job got longer and longer,
it got closer to that deadline, and then it
forced some work. So maybe that wouldn't have shown up if we'd looked at duration or total
time. But yeah, I also think those kinds of issues often crop up without you having
to do this macro analysis work, because somebody's telling you about them. Yeah, so if we decided to order by mean time,
sometimes we see on the top,
we see something that we say,
well, it's fine that it executes a minute,
because it's some cron job and nobody cares.
If it's just select, for example,
and no locking involved,
and it lasts one minute, it's not a big deal.
So we probably want to exclude
some queries from the top when ordered by mean time, average time. For total time, it's not so. I'm
usually really interested in each entry from the top. That's why I also prefer total time.
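A sketch of the two orderings (column names as in Postgres 13+, where planning and execution time are split):

```sql
-- Top 10 query groups by total time (plan + execution), the resource angle:
SELECT queryid,
       calls,
       total_plan_time + total_exec_time AS total_time_ms,
       mean_exec_time,
       left(query, 60) AS query_start
FROM pg_stat_statements
ORDER BY total_plan_time + total_exec_time DESC
LIMIT 10;

-- For the "user experience" angle, order by mean_exec_time DESC instead.
```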
But I see people use mean time successfully, caring about mostly users, not about servers.
So roughly, total time is for infrastructure teams to optimize for
servers, while mean time is probably interesting to application development teams, and for humans,
right? So very, very roughly. And there is also calls, an important metric, right? Why do we discuss
which metric to choose? Because when you build good monitoring, you need to choose several
metrics and build a dashboard consisting of multiple charts: top-N charts, top-N by total
time, top-N by mean time, top-N by calls, for example. Why calls? Probably for the database itself
it's not that important, and the most frequent queries might produce not the biggest
load. Of course, if there are a lot of very, very fast queries,
I would check context switches, for example, and so on.
And think about how CPUs are busy in this area.
But I've noticed that sometimes we want to reduce frequency
of some of the most frequent queries
because the overhead on the application side is terrible.
This is an unusual approach, because sometimes people optimizing a workload or database,
they think only about the database.
But I had cases when, with optimization, for example, let's take the top three ordered by calls,
and just by reducing the frequency, we could throw out 50% of our application nodes.
Can you imagine the benefit of it?
The cost saving, right?
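The corresponding top-N by frequency is just a different ORDER BY; candidates here are often caching, batching, or N+1 patterns on the application side:

```sql
-- Top 10 query groups by call count:
SELECT queryid, calls, rows, left(query, 60) AS query_start
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
```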
But so it could, just to give an example
from the application side,
I guess that would be one way of spotting
potential N plus one issues
where if it's the same queries
getting executed over and over again,
that's the kind of thing it could point to.
Right, right.
So it's interesting.
I think I don't
understand all of the aspects here. And I think we lack good documentation on how to use pg_stat_statements.
So many angles, so many derivatives as well. But I would like to finalize the discussion
of metrics. I wanted to mention also I/O metrics, the buffer metrics: shared buffer hits and shared blocks...
They are called shared blocks read and hit, right?
Let me check: shared_blks_hit, shared_blks_read, also written.
There are also local and temp ones in addition, but if we discuss
everything, we need an hour, right?
So many aspects.
But I wanted to mention only a few, like hit and read, shared buffers, hit and read.
Because, okay, if we talk about this only, an interesting thing here is that sometimes the monitoring system thinks that read is enough,
because it's the slowest operation. Well, as we already discussed a few times, Postgres doesn't see the actual disk read.
Like the operating system cache versus disk read.
So this read is from a page cache.
Maybe it's disk read, but maybe not.
We don't know.
But usually the monitoring system says,
okay, ordering by shared_blks_read
is the most interesting.
But I had cases, at least two times,
when I really needed pg_stat_statements shared_blks_hit,
and finding the top by it,
because working with the buffer pool, Postgres shared buffers,
was so intensive for some query group.
And if you don't have it in monitoring,
you need to start sampling pg_stat_statements
yourself, writing some scripts on the fly.
It's not fun at all.
So I think most DBAs who have experience have something in their tool set.
But, for example, pgCenter can sample it for you.
You can use it as an ad hoc tool if you don't have it in monitoring and you have a problem
right now, for example.
But I also suspect they lack a top-N by hit number. So these are the angles.
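A sketch of both buffer-oriented orderings:

```sql
-- Top 10 by buffer cache hits (pressure on shared buffers); swap the ORDER BY
-- column to shared_blks_read for the more common "read" angle:
SELECT queryid, calls,
       shared_blks_hit,
       shared_blks_read,
       left(query, 60) AS query_start
FROM pg_stat_statements
ORDER BY shared_blks_hit DESC
LIMIT 10;
```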
And there are new metrics: WAL. Yes, let's find queries which generate the most WAL data, let's order by that. Let's check, I have the list here. It's called wal_records, wal_fpi (full-page
images), and wal_bytes. Three metrics added to Postgres 13. I didn't see them yet in any
monitoring. I hope... Oh, maybe our pgwatch2 Postgres AI edition, it has it already, right?
Yeah, I remember this was added to
EXPLAIN in version 13. Was it added to pg_stat_statements at the same time? Same time, yes, yes.
Yeah, makes sense. And it's so good, it's so good to order by, because we want to reduce WAL generation,
definitely, because reducing it will have a very positive effect both on our backup subsystem and replication,
both logical and physical. So we do want to produce fewer WAL records, or fewer WAL bytes,
or full-page... Or full-page images, exactly. Yeah, yeah. So I've never used it yet, I hope soon I will use it someday.
So this, I mean, I know it's there, but I've never used it myself yet.
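A sketch of ordering by the WAL metrics (Postgres 13+):

```sql
-- Top 10 WAL producers:
SELECT queryid, calls,
       wal_bytes,
       wal_records,
       wal_fpi,              -- full-page images
       left(query, 60) AS query_start
FROM pg_stat_statements
ORDER BY wal_bytes DESC
LIMIT 10;
```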
Yeah.
Talking about this has given me an idea as well.
We talked a while back about buffers.
And one of the things we do on a per query basis
is look at the total sum of all of the buffers.
And I know that doesn't make tons of sense,
summing dirtied buffers plus temp buffers
plus local plus shared.
We can call it total I/O or something.
Yeah, exactly.
Or some kind of measure of work done.
And actually summing all of those and then ordering by that and looking at the top 10 queries by total I.O.
I think it's a smart idea.
Each I/O has some cost.
And if we find queries which involve the most I/O operations, of course, it's a good angle for our analysis.
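A sketch of that crude but useful ordering, summing every buffer counter pg_stat_statements exposes:

```sql
-- "Total I/O": all shared, local and temp buffer operations added together:
SELECT queryid, calls,
       shared_blks_hit + shared_blks_read + shared_blks_dirtied + shared_blks_written
     + local_blks_hit  + local_blks_read  + local_blks_dirtied  + local_blks_written
     + temp_blks_read  + temp_blks_written AS total_buffers,
       left(query, 60) AS query_start
FROM pg_stat_statements
ORDER BY total_buffers DESC
LIMIT 10;
```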
Yeah.
What else?
We mentioned that we deal with the page cache when we look at I/O.
Yeah.
But sometimes we do want to order by real physical disk I/O, right?
And there is such opportunity for those who manage Postgres themselves.
It's called pg_stat_kcache.
An additional extension to pg_stat_statements.
Extension to extension, I would say.
And it provides you very good things like disk reads and writes,
real physical disk reads and writes,
and also CPU.
Sometimes you want to find queries
that generate the most load to your CPU.
And it even distinguishes system and user CPU.
It's cool.
It's good.
Yeah, yeah.
And also context switches.
So it's a very useful extension
if you care about resource consumption
and you want to prepare for growth
and you want to do some capacity planning
and before that you want to optimize.
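A sketch, assuming pg_stat_kcache is available; the exact column names vary between pg_stat_kcache versions (newer releases split them into plan_/exec_ counterparts), so treat these as illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS pg_stat_kcache;

-- Top 10 query groups by CPU time, with physical (not page-cache) I/O;
-- column names here follow older pg_stat_kcache releases:
SELECT s.queryid,
       k.reads, k.writes,            -- physical reads/writes
       k.user_time, k.system_time,   -- CPU seconds, user vs system
       left(s.query, 60) AS query_start
FROM pg_stat_kcache() k
JOIN pg_stat_statements s USING (queryid)
ORDER BY k.user_time + k.system_time DESC
LIMIT 10;
```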
And I think you've said before,
but did you say it's not available on most managed services?
No, I only know Yandex managed services.
They installed it by default, but I'm not aware of any others.
So yeah, there is also another way to analyze workload,
which we didn't cover today at all:
wait event analysis.
This is what RDS, for example, provides as a starting point,
actually, for workload analysis. I think it came from the Oracle world, Active Session History (ASH) analysis.
So let's, yes, someday let's discuss it and compare it with traditional analysis we discussed today.
Yes. And for anybody that is aware of ASH and wants it for Postgres, there is a...
I've heard it called PASH.
Yeah, P-A-S-H as well.
Yeah, exactly.
But it's a Java client application which will do sampling from pg_stat_activity,
but it can only be used as an ad hoc tool if you're in the middle of an incident.
Same as ASH, right? Isn't it?
Well, you can install, for example, the pg_wait_sampling extension,
and immediately in our pgwatch2 Postgres AI edition,
you will have similar graphs to Performance Insights in RDS.
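Without any extension, a crude ad hoc approximation is to sample pg_stat_activity repeatedly (say, once a second from a script) and aggregate the wait events; a single sample looks like this:

```sql
-- One sample of what active backends are currently waiting on:
SELECT wait_event_type, wait_event, count(*) AS active_backends
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
GROUP BY wait_event_type, wait_event
ORDER BY active_backends DESC;
```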
Nice.
I think Google also implemented it.
I'm not 100% sure, but I think they did it.
Or pgCenter, which I mentioned earlier, is also a good ad hoc tool. It also has wait event
sampling. But let's discuss it some other day. We also have an episode on monitoring that people
can go check out. A deeper discussion on that. And micro analysis. It's good to distinguish
things. Sometimes you have already a query. You need to go inside it and understand what's happening, why is it so
slow.
But sometimes you have no idea where to start.
Database is slow, everything is bad.
In this case, query analysis, macro analysis is definitely worth conducting.
So yeah.
Wonderful.
Okay, sorry about 40 minutes again.
So it's getting longer than we wanted. It's good.
As usual, let's thank all our listeners who provide feedback. This week was excellent
as well. A lot of feedback.
Yeah, we had a lot of great suggestions. It's been really good. Thank you.
Yeah, this drives us. Thank you so much.
Yeah, really appreciate it. Well, thanks again, Nikolai. I hope you have a good week
and see you next time.
Final words. Like, share, share, share. Thank you.
Bye.
Take care. Bye.