Postgres FM - auto_explain

Episode Date: April 28, 2023

Nikolay and Michael discuss auto_explain — what it is, how it can help, and how to check its overhead.

Here are links to a few things we mentioned:
auto_explain (docs)
ScaleGrid guide to … auto_explain
Can auto_explain, with timing, have low overhead? (blog post by Michael)
pgBadger
pg_stat_monitor
EXPLAIN ANALYZE may be lying to you (blog post by Álvaro from OnGres)
pg_test_timing
Our episode on benchmarking
Database Lab Engine

~~~

What did you like or not like? What should we discuss next time? Let us know on YouTube, on social media, or by commenting on our Google doc. If you would like to share this episode, here's a good link (and thank you!)

~~~

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard
With special thanks to:
Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres.ai. Hey Nikolay, what are we talking about today? Hi Michael, the topic is quite obvious for me. Over the last week or two, it popped up several times in various places. Let's talk about auto_explain. Yeah, I think this might be our first purely extension-focused episode, which is cool. And I know
Starting point is 00:00:27 I count it as an extension. Do you? In terms of it being like a contrib module? It's an extension, right? But what else? It's a little extension. There are two kinds of extensions. Main extensions, which are
Starting point is 00:00:44 contrib modules, so which are contrib modules, so-called contrib modules, old name, right? And they come with Postgres. And you can install it from official package, Postgres SQL contrib. And there are all other extensions. But probably we should talk about extensions separately. It's a good episode idea, maybe, right? But this auto-explainer, it's a contrib module.
Starting point is 00:01:04 It's a very, very, like, super official extension. It probably has more potential risk in terms of bugs compared to Postgres itself, but much, much less compared to other extensions, because it's maintained by the same team as Postgres. Yeah, that's a really good point. It comes out on the same schedule, it's shipped with Postgres, so if you install community Postgres you get it — off by default — but it makes it that bit easier for cloud providers to offer it. They don't all offer it, but it does make it easier. Who doesn't offer it? I was looking at Heroku's docs, and I'm pretty sure they still don't offer it — there's a chance the docs are out of date. I don't even think some of the newer ones, like Supabase, offer it, based on their docs — maybe that's
Starting point is 00:01:49 out of date, and there are even some new ones that don't offer extensions at all. But RDS, GCP Cloud SQL, and Azure — they all offer it, right? Yeah, exactly, quite a few of the other popular ones do. Right. Maybe we should cover what auto_explain is. Automatic explain? Yeah, so it logs them, right? So it logs execution plans like you would get if you ran EXPLAIN. EXPLAIN ANALYZE. Yep, so you've got access to all of the parameters. There are some minor differences, but yeah, you've got access to all the parameters you do when you run EXPLAIN, and a couple of additional ones,
Starting point is 00:02:28 right? So you can set a threshold so that it will only log plans above a certain duration or execution time. It's a competitor to log_min_duration_statement, which logs queries which exceeded the threshold. log_min_duration_statement
Starting point is 00:02:44 logs queries after successful execution, right — not at the beginning, as log_statement would do, but when it's successful and the duration exceeds the log_min_duration_statement value, for example one second, or half a second. I remember, by the way, one of Postgres's advantages like 15 years ago compared to MySQL was that Postgres allowed fractions of a second for this parameter, while in MySQL one second was the minimal value you could use. So auto_explain is a competitor here, right? Because sometimes people do it: you have both log_min_duration_statement set to one second or 500 milliseconds, and auto_explain set to the same value. But does it make sense at all
Starting point is 00:03:32 to use this approach? Because the auto_explain output will include the query as well. I thought about this quite a lot in the past, and I think there are some differences that might make it worthwhile, but you are duplicating effort for sure. One example is another bonus parameter we get with auto_explain, which is the sample rate. So because there is a small amount of overhead... Which is not a bonus anymore, because log_min_duration_statement also has this capability since Postgres 14, if I'm not mistaken.
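
For illustration, if both thresholds are set to around 500 ms, a slow query produces two overlapping log entries. The query, table, and numbers below are invented, but the shape is roughly what ends up in the log:

    -- from log_min_duration_statement:
    LOG:  duration: 734.212 ms  statement: SELECT * FROM orders WHERE customer_id = 42;

    -- from auto_explain (query text included again, plus the plan; actual times appear
    -- only when auto_explain.log_analyze is on):
    LOG:  duration: 734.103 ms  plan:
          Query Text: SELECT * FROM orders WHERE customer_id = 42;
          Seq Scan on orders  (cost=0.00..1834.00 rows=107 width=96) (actual time=0.311..734.041 rows=98 loops=1)
            Filter: (customer_id = 42)
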
Starting point is 00:04:03 We can also sample. So what I meant was that if we're sampling, we might want to sample one but not the other, for example. So we might want to get all... Ah, that's a good point. Yeah. So you understood. So I feel like we've actually covered two of the big points. One is that we've got this log_min_duration for auto_explain, and we can set that in milliseconds. It's off by default — it's minus one by default, which means don't log anything — so you have to set it to something. I would heavily, heavily suggest you do not set it to zero, which will log every statement; in many cases it will just kill you if you do it. Yeah, the logging overhead is severe, not just in terms of performance, but also in terms of size, the sheer volume of logs.
Starting point is 00:04:52 Yeah, kill — I mean, the logging subsystem will probably quickly become a bottleneck and it won't be possible for Postgres to execute any queries at all. And we can discuss the details here — it's an interesting topic on its own. Yes, so a typical value for that on a transactional system might be a few hundred milliseconds. So it's in milliseconds by default. You can set it to 100 and that would be 100 milliseconds
Starting point is 00:05:16 or 1,000, and that would be a second. Yeah, and remember, we discussed what a slow query is — and here it's very connected. So why 200 milliseconds? Because this is exactly where human perception kicks in: if something exceeds 200 milliseconds, a human notices the latency. So we don't want queries to exceed it in OLTP systems, and that's why 200 milliseconds. But unfortunately, in some
Starting point is 00:05:46 systems, it's quite a low value, and the volume you will be seeing in logs and the observer effect from logging will already start hitting you. So sometimes people choose 500 or even a second — it depends. This is something where, if you're worried about the overhead, or you feel like you've got a lot of slow queries at the moment and you just want to work your way down, you could start with a higher value and reduce it over time as you have fewer of those. Right. This can be changed without a restart.
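
As a rough sketch — the 500 ms threshold and 10% sample rate are just example values, not recommendations — these settings can be adjusted on a running server once the module is loaded:

    -- assumes auto_explain is already loaded (see shared_preload_libraries below)
    ALTER SYSTEM SET auto_explain.log_min_duration = '500ms';  -- -1 disables, 0 logs every statement
    ALTER SYSTEM SET auto_explain.sample_rate = 0.1;           -- log roughly 10% of qualifying statements
    SELECT pg_reload_conf();                                    -- picked up without a restart
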
Starting point is 00:06:16 What you cannot change without a restart is switching to the logging collector, which I would highly recommend doing. But switching to the logging collector requires a restart. Yeah, actually, that's a really good point. I have seen guides that suggest installing or loading auto_explain in shared_preload_libraries, but I think the general advice is to do it in session_preload_libraries
Starting point is 00:06:39 because then you don't have to restart to get it. Well, it depends, of course. But usually when you need it, if you didn't think about it in advance, you don't have it, right? And it's hard to implement it quickly, even in a session. So I would still recommend putting it into shared_preload_libraries, maybe even turned off or with a very high threshold in terms of duration, and just be prepared to use it for some incidents. Yeah, that's an interesting point — a high duration would mean you don't log too much at least, but you would still be paying the overhead of running EXPLAIN all the time. So maybe also a low sample rate.
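
A minimal sketch of that "load it up front, keep it quiet" approach — adding it to shared_preload_libraries needs a restart, while LOAD works immediately in a superuser session:

    -- in postgresql.conf (restart required for shared_preload_libraries):
    --   shared_preload_libraries = 'pg_stat_statements,auto_explain'
    --   auto_explain.log_min_duration = -1     -- effectively off until an incident
    
    -- or ad hoc, in a single superuser session, with no restart:
    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = '1s';   -- high threshold, this session only
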
Starting point is 00:07:14 There are multiple overheads here indeed. So let's finish discussing why it's useful, and then we'll talk in detail about the various overhead things. So why is it useful? It's useful because Postgres doesn't have... We have pg_stat_statements, but it logs only query bodies, in a normalized way. So the first question we usually have
Starting point is 00:07:44 when we try to optimize some particularly bad query group, or normalized query, is that we need examples. Examples can be taken from various places, but log_min_duration_statement — slow examples from the logs — is quite a good idea.
Starting point is 00:07:59 But then we think, okay, we experiment in some environment — for example, an exact copy, or, for example, Database Lab Engine: we have a thin clone branch and we check the plan there. But then we think: maybe in reality, a few hours ago, when the incident happened in production, this query had a different plan.
Starting point is 00:08:24 And we don't know, right? Maybe, maybe. And we can't reproduce it, right, if you run EXPLAIN ANALYZE? You can reproduce it. With Database Lab Engine, you can do point-in-time recovery to go to a particular point in time
Starting point is 00:08:37 exactly in the middle of an incident. And then you can use the exact same planner settings and work_mem, and then Postgres should behave similarly — like 99% it will be so. But sometimes you don't have Database Lab Engine. Sometimes you still have doubts. In this case, auto_explain is a good way.
Starting point is 00:08:57 It's a very reliable way to tell that this is what we had in terms of the plan, because we have it logged, right? Maybe the most reliable way to understand what plan Postgres used when executing this query. Yeah, exactly. And the metrics involved, right? The buffer numbers, the timings. So it's the only way.
Starting point is 00:09:20 You can choose both, none of them, or just timing, or just buffers — it's possible to tune it and choose. And you have very good articles saying that sometimes the timing overhead is not that big. But let's finish the pros of it. So to me, auto_explain — this approach of working with actual plans — feels like generation one. Same as — remember, we didn't have pg_stat_statements 15 years ago, and we all used pgFouine first and then pgBadger, and normalized queries by processing logs. The same approach as pg_stat_statements, but only for slower queries, those which exceeded the threshold.
Starting point is 00:10:07 And the tool normalizes them, and we have reports — nice reports. We understand that we probably don't see the full picture. Okay, maybe sometimes we allow ourselves to turn on full query logging for some short period. This was considered a valid approach, understanding the risks, overhead, and so on. We turn it on for some minutes just to collect the whole iceberg, right? And then we have query analysis, but without plans, right? We never saw plans there. But then pg_stat_statements almost replaced this approach, because, like, okay, logs are good, but it's the tip of the iceberg. pg_stat_statements is a very good way to cover almost everything.
Starting point is 00:10:47 It has like a 5,000 limit in terms of normalized queries; sometimes it's not enough. But still, you see almost the whole iceberg — but without examples and without plans. So logs are still good for examples. I would choose pg_stat_activity with an increased track_activity_query_size — I don't remember exactly, it's 1,024 by default; usually what you want is to put 10,000 or so there — and see real examples by sampling from pg_stat_activity.
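
For example — note that track_activity_query_size needs a restart to change, and the polling query here is just a sketch:

    ALTER SYSTEM SET track_activity_query_size = 10000;  -- default is 1024; requires a restart
    -- then poll periodically to collect real examples of currently running queries:
    SELECT now() - query_start AS running_for, state, query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY query_start
    LIMIT 20;
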
Starting point is 00:11:15 But where do we get plans? This is the big question, because if you want to gather samples of plans, or maybe even more, normalized plans, and understand what's happening, auto_explain is still the way to go. I think in the future we will have some generation-two tools, maybe based on eBPF, by the way, maybe some extensions. But right now, auto_explain is the best way to collect actual plans — that's why it's so important and why many people want it. Yes, and I think it's the only way of running your queries, getting the output of your queries, and also logging the plan. If you run EXPLAIN ANALYZE manually, the queries run, but you don't get back the results.
Starting point is 00:12:07 So it's an interesting best of both worlds there. I don't get it. auto_explain doesn't log results — it logs query bodies and the plan. It doesn't log the results, but whichever client requested that query still gets them. So let's say you're running queries in client A over here: it's getting results back from those queries, and separately we're logging the plans.
Starting point is 00:12:31 Whereas if we start up a session as client two and run EXPLAIN ANALYZE on one of those queries, we get the plan back, but we don't get the query results. I see. Yeah, so if you request EXPLAIN ANALYZE, you say, I want the plan but not the results. But sometimes you want both. Yeah, it makes sense.
Starting point is 00:12:53 Makes sense. So yeah, so that's… And you can play with the parameters of auto_explain in your session and say, okay, now in this session I want all queries to be logged by auto_explain — with plans, with timing, with buffers, let's go, like tracing all plans. And then you work normally, obtaining results. It can be some application you use, right?
Starting point is 00:13:19 Or just a human who is working with the database using SQL. And for this session, we log all plans because we want to understand what's happening, right? Yeah. Let's say, for example, you have something that's intermittently slow and you can't work out why. You run EXPLAIN ANALYZE ten times and it's not slow in any of those. But then when it's run by your application, one in a hundred is slow. So how do you know when to run EXPLAIN ANALYZE? Like, how do you reproduce that? auto_explain is a really good way.
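
That per-session tracing might look roughly like this (a superuser or appropriately granted role is assumed, and the values are illustrative):

    LOAD 'auto_explain';                      -- not needed if it's already in a preload list
    SET auto_explain.log_min_duration = 0;    -- log every statement in this session
    SET auto_explain.log_analyze = on;
    SET auto_explain.log_buffers = on;
    SET auto_explain.log_timing  = on;
    -- now run the queries (or the application) normally: results come back as usual,
    -- and each plan is written to the Postgres log
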
Starting point is 00:13:55 In terms of the v2 of pg_stat_statements, that's a really interesting topic and not one that I prepared for, but there is that extension from Percona that they're trying to replace it with, and I think that does... I would consider it v2.5, not v3, but v2.5. Nice. Yeah, because it has more things than... pg_stat_monitor, right? pg_stat_monitor — we'll get the correct name and I'll put it in the notes — the replacement for pg_stat_statements. Very promising, interesting, but I think, like, to become popular, we need to check its overhead. It's very interesting. I would like to learn more about it. And also, probably, it should be in contrib modules. Well, I suspect that's their goal long term.
Starting point is 00:14:35 Yeah, I talked to them when they started. I talked to Peter, actually, and asked him about this particular decision to work separately instead of improving pg_stat_statements. And one of the reasons was obviously the independent release cycle. Because if you want to improve something in the contrib modules,
Starting point is 00:14:56 you understand that the release cycle is one year, and you need to convince a lot of people, and experiment, and so on — it's hard to get into it, so it will take years to build something really groundbreaking, because it's hard. Yeah. But separately, you can move faster and do many things. I think it's a smart strategy. So you've mentioned overhead a couple of times, and I think this is a really interesting area, not just for auto_explain but for a lot of these, like —
Starting point is 00:15:22 a lot of the things we've mentioned. I have looked in the past and tried to find even some basic benchmarks of the overhead of things like pg_stat_statements or auto_explain or pg_stat_monitor — you know, anything like this. I'm just interested. I understand that, but the flip side of that is that in quite a few places people mention horror stories, or there are really severe warnings in the docs saying, for some of them, not all of them, that the overhead can be really massive, so be careful — but no one ever quite says how much it is. And then if you go digging in the mailing list, every now and again you see some people mention a few percent here or there, and one person mentioned like two or three x overhead. So there are some really scary
Starting point is 00:16:11 stories out there. So yeah — that's the thing, it can be a killing overhead if you don't use the logging collector, if you use syslog with journald — well, maybe it's improved, but I benchmarked it many years ago — and then you say the threshold is zero, meaning that you log everything. I think you don't even need to track timing and buffers; it will be enough to kill your system, just like that. So speaking of benchmarks, to put things a little bit in context: this week — yesterday, actually — OtterTune published an article saying the worst thing in Postgres is MVCC, like they hate it most. Well, I understand, but I dug into the underlying research, and there is a paper, a scientific paper, comparing MVCC in various
Starting point is 00:16:57 systems, and it doesn't provide any discussion, any numbers — like how exactly they tuned autovacuum, for example, and so on. So details are very important here. And we all know — it should be written on the front page of postgresql.org — that the defaults are very old. Don't use the defaults. You do need to tune it, unfortunately.
Starting point is 00:17:21 This is Postgres. You do need to tune it. You need to understand some things, and you probably need help from an expert, and tune it, right? So auto_explain is a great thing, but of course, if you enable it... Like, saying auto_explain is bad is the same as saying log_min_duration_statement is bad — they both are bad, but they both are good. Let's talk about overhead. There are maybe three places where I see overhead. The first we already touched on, actually — it's logging and disk I/O overhead. So if you log too much, since you log to disk, it's not good. For example, if you logged into some —
Starting point is 00:17:58 by the way, I did it in some cases with tmpfs, like into memory — a small disk, but with very, very quick rotation, understanding the risk that it can be saturated in terms of space. But if you rotate it very quickly and offload the details somewhere, you don't touch disk at all; you can log everything in this case, it's possible to use it. Here, I would blame Postgres for not having an easy way not to log to disk. I would rather, for example, send it over UDP to some different place, and that's it — just send logs. And in this case, we could log much more, and just the network would be our bottleneck, right? But if you log to disk, as we do in most cases with PostgreSQL, this
Starting point is 00:18:45 is the number one thing to consider. And sometimes people store logs on a separate disk, and I've seen several times that this disk is magnetic. So yeah, this is easy — you can shoot yourself in the foot very quickly here. So this is number one. Number two, CPU, right? Maybe you can tell more here. So there have been good blog posts in the past, like one from the OnGres team. Álvaro.
Starting point is 00:19:21 Yeah, yes. So he said: EXPLAIN ANALYZE is lying. Yeah, exactly, it's a very good title. But the main point he has is that if you track the timing calls to measure timings — if you're running EXPLAIN ANALYZE with timings, which is the default for EXPLAIN ANALYZE — then even the fact of measuring them... So there's an observer effect. Classic observer effect, yeah. Classic observer effect. Yeah.
Starting point is 00:19:45 Once you measure it, it's changed — you're measuring something else already. Yeah, exactly. So the timings are reported, and so the query will take longer because you're measuring it. Now, this is true if… But it depends on the CPU, right? On Xeon, for example — modern Xeon, like Cascade Lake, for example.
Starting point is 00:20:04 Well, relatively modern. It's not that bad if you check it. No, there is a tool within Postgres — I'll link it up in the notes, but I think it's called pg_test_timing, something like that. In the directory. Yeah. And it's very quick and easy to check if yours is relatively fast. The only place in recent times that I've seen a not-well-performing one is when somebody was running Postgres in Docker, and most production
Starting point is 00:20:32 systems I've seen run on sufficiently decent hardware that it hasn't been a big issue. But I do trust that there are some very well-respected people who do recommend turning timing off when you use auto_explain for this reason, and I think it's still mentioned in the docs as well. So I suspect there are cases where people are still running such production systems, or there were workloads where — I think Álvaro mentioned nested loops having a particularly high overhead, for example. So if you have a workload that has lots of loops that do things in the tens of thousands, hundreds of thousands — so for some probably not-good query plans — it's possible for certain workloads. Possibly, yeah.
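
One way to gauge this on your own hardware, besides running the pg_test_timing utility, is to compare the same loop-heavy query with and without per-node timing; big_table and other_table here are hypothetical:

    EXPLAIN (ANALYZE, TIMING ON)  SELECT count(*) FROM big_table b JOIN other_table o USING (id);
    EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM big_table b JOIN other_table o USING (id);
    -- run each variant several times and compare the reported Execution Time;
    -- a large gap suggests per-node timing will also be costly under auto_explain
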
Starting point is 00:21:15 I wonder what the overhead is on ARM, like Graviton2 and so on — have you seen anything about it? I think I've published the only blog post on this that's ever been published. I would love to see more on this, but I haven't seen any. It's a good area to research, definitely interesting. I think it's interesting, but based on the reaction I got — which was just a comment from you and nothing else — I think there's limited interest in this for some reason. And I've seen quite a few production systems with auto_explain on: sample rate one, log_min_duration 100 milliseconds, and timing on. So not just timing, right? Buffers as well. So I think this is, like — there are,
Starting point is 00:21:59 you mentioned, it's not just — I think you saw in the past that buffers might have a small overhead as well. Buffers definitely have overhead. If you run EXPLAIN ANALYZE with timing off and buffers off, you see... I mean, manually, right? And you can do the same with auto_explain. You get rid of the timing and buffers overhead. It's also already useful, right? Because you see the structure of the plan, and you have the total timing because it's already calculated — it's similar to log_min_duration_
Starting point is 00:22:30 statement log entries, so it's already useful to see the structure and just the total timing. Good. But then, between buffers and timing, I would choose buffers. They have some overhead; it's also interesting to understand what it is for your system and for your queries. But buffer numbers are honored on this podcast. We have timing already, right? For the overall timing. We don't have to choose either-or though, right? Like, we can get both.
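
The combination being discussed — keep ANALYZE-level plans and buffer counts, drop per-node timing — maps to these knobs (shown with SET, but they can equally go into postgresql.conf); note that log_buffers and log_timing only take effect when log_analyze is on:

    SET auto_explain.log_analyze = on;    -- actual row counts, loops, total runtime
    SET auto_explain.log_buffers = on;    -- shared/local/temp buffer numbers per node
    SET auto_explain.log_timing  = off;   -- skip the per-node clock calls
    -- manual equivalent for a single query: EXPLAIN (ANALYZE, BUFFERS, TIMING OFF) <query>;
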
Starting point is 00:22:59 Well, we can get both, right? But let's agree on this: even if someone performs a very good benchmark of auto_explain overhead, I would consider it valid only for a particular system, a particular workload. That's it. In each particular case, if we are serious and we want our system to be reliable, performant,
Starting point is 00:23:23 and scalable, to handle growth and so on — we need our own specific benchmark for our workload, which is a very interesting topic and hard; we probably also need to talk about it one time — and for our hardware, for our Postgres version, and so on and so on. And in this case, we should measure two overheads: from buffers and from timing. And consider four points: without anything, with both, with only timing, with only buffers. Here I would exclude "with timing only", because in my opinion, in terms of internal metrics inside the query, buffers are more important than timing numbers. What do you think? More important? I think it's like an argument for the sake of an argument. I don't see
Starting point is 00:24:14 it as an either-or. Like, if the timing overhead is not worth it for your system, if it's too much, don't track it; and buffers are useful. But if I had the choice of just buffers, or buffers and timing, I would choose buffers and timing. I'm never in a situation where it's a choice of one or the other. If they had similar overhead and we could afford one but not the other, sure, but I've never seen that, and I can't imagine it's true, because, for example, timings are tracked per loop, right? But buffers are reported not per loop, they're reported per node. So I wouldn't be surprised if it doesn't have the same looping overheads, for example. And there are no warnings about it in the docs. So anyway, yes, sure, in principle I think you're probably right, but I just haven't seen that be a practical question ever.
Starting point is 00:25:06 Interesting. Okay, well, okay. Maybe you should enable both and have sampling. Maybe. But if you do check this on your system, I'd love to hear — even if you can't share details, I'd love to hear what order of magnitude the overheads were. Yeah, I have it enabled in a couple of places actually,
Starting point is 00:25:26 quite loaded ones, recently, but we didn't... Somehow I missed this, we didn't benchmark it. It doesn't follow my principles — my principle is to always have some benchmarks and make changes based on numbers and so on. But since it was enabled with some quite low sample rate, I mean like below 10%, for example, plus a threshold, it was considered low risk. So, like, let's have something, right? Already have something useful. We didn't discuss it, but one of the use cases, probably a very important one, is when we enable it and, if we have big teams, we use this to support their work on optimization — like continuous optimization. We say, okay, these are the bad slow queries, and we have examples with plans — look at them. And in this case, if you enable it above this, like a 500 milliseconds threshold,
Starting point is 00:26:26 and also sampling at 10% — enable both, right? But it's very bad if we do it blindly; we need to run benchmarks and understand what the price is here. How would we do it in some system? For example, we have some OLTP system and we want to measure it. How? We did a whole episode on this, right? Like, benchmarking. Well, yeah, but this is specific benchmarking. The first thing I would do here is to ensure that we don't have overhead from observer effects from the logging system, right? So, to test our timing. Yeah, we should separate these things. And benchmarks should not bump into this ceiling.
Starting point is 00:27:09 We don't want to saturate our disk I/O and then make conclusions — like, oh, you see, auto_explain is to blame; that's a conclusion you can reach very easily that way. So you need to make sure that you have room in terms of disk I/O, and then you can start thinking about overhead. Well, probably I would run two systems exactly the same, one with auto_explain, one without. And I would probably make sure I don't saturate anything at all.
Starting point is 00:27:41 It means that probably I would do it in a single session. I don't need multiple sessions here at all, which means that it's also possible to run it on some non-production environment like Database Lab Engine, for example. You can just compare and see individual queries — a single session, run this query, run that query, and compare the details.
Starting point is 00:28:07 Probably latency, mostly. Yeah, latency, but do it many times with the same query. Yeah. Or a workload that you can replay on both, right? Right. So you don't want to reach the edge in terms
Starting point is 00:28:22 of CPU and so on. It's still a performance experiment, it's performance research, and you repeat the same queries many times to have reliable data, and then you just compare without it and with it — what's the difference? And yeah, latency probably. What else do we have? We can have pg_stat_statements enabled and check not only latency but also, I don't know, buffer numbers as well. Well, buffer numbers should not change, right? They shouldn't, right. What else? Probably wait event analysis could be applied here as well, maybe. Like, I'm trying to invent something on the fly, but definitely latency is the main point here. Yeah.
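
A sketch of that comparison using pg_stat_statements — column names are per Postgres 13 and newer, where it's mean_exec_time rather than the older mean_time:

    -- 1) baseline: auto_explain disabled, reset counters, then replay the workload
    SELECT pg_stat_statements_reset();
    -- ... run the same queries many times from a single client session ...
    SELECT queryid, calls,
           round(mean_exec_time::numeric, 3)   AS mean_ms,
           round(stddev_exec_time::numeric, 3) AS stddev_ms
    FROM pg_stat_statements
    ORDER BY calls DESC
    LIMIT 10;
    -- 2) repeat with auto_explain enabled (with the exact settings you plan to run in production)
    --    and compare mean_ms for the same queryid values
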
Starting point is 00:29:05 The main danger in this benchmark is bumping into some different limit, like disk I/O or anything else, and measuring that instead of the real overhead from auto_explain. Yeah. Even I fell into that trap in my blog post — I was probably running far too high a workload for the machine I was running it on. And I also tested the logging overhead, and on my system there was about 25% overhead just from logging everything, and then I turned that off and got it down to less than a couple of percent. By the way, eBPF analysis would be great here — like, yeah, you should see that the overhead comes from auto_explain. This would be, like, the best: if, okay, you measure for this type of query, we have like 5% overhead, because we know, like,
Starting point is 00:29:54 this query executes with nested loops, like, for different parameters. By the way, different parameters: in pg_stat_statements we have one query, but with different parameters we have different plans, right? So we can have the same query, but depending on the parameters, the overhead may be different because the plans are different. So parameters also define the plan. And of course, we need to take care of planner settings and work_mem and so on. I mean, this experiment can be done on different hardware — I mean, the CPU should be the same,
Starting point is 00:30:30 but I don't care about disks unless they are saturated. If we don't saturate disk I/O, the disks don't have to be the same, the file system can be different — a different file system, that's why ZFS and Database Lab can work. But cores, CPU cores, should be the same. Postgres version should be the same. Planner settings should be the same.
Starting point is 00:30:45 And then we play with queries in a single session and compare. An interesting experiment, by the way. Yeah. And if we have the ability to check with eBPF — like, to perform CPU analysis and see that indeed, like, for this particular query with a particular plan, 5% overhead comes from auto_explain — this is success. So you mentioned three types of overhead. I think you said disk I/O for logging, for example,
Starting point is 00:31:18 CPU for timing. Was there a third? CPU for buffers. CPU for buffers, yeah. Makes sense. Yeah. Well, what else? What about additional memory?
Starting point is 00:31:29 Probably no, right? What about network? Probably no. So I don't see any. Maybe if there are some other types, I just don't see them. No. There are a couple of limitations I wanted to make sure we covered that I've seen catch people out in the past. One is that it doesn't track planning time.
Starting point is 00:31:49 So these timings are all about execution time, and there's no reporting of planning time. So if, for example, your client is reporting that a query is taking 1.1 seconds and it's only a couple of hundred milliseconds in auto_explain in the logs, that difference — I mean, you can verify this by running EXPLAIN ANALYZE — chances are it's a very high planning time. And that's, I guess, another argument for... We have it — since Postgres 13,
Starting point is 00:32:17 we have planning time there. And of course, it would be interesting to dive into... auto_explain's power is the tree, the execution tree it shows, right? It would be interesting to dive into planning time for particular nodes — it's not possible; planning time is a whole. And as a whole, we have it in pg_stat_statements: if we divide it by calls, we have average planning time, and so on. Actually, we have it there — like min, max, it's there.
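
For reference, that's the optional track_planning side of pg_stat_statements — Postgres 13 and newer, off by default because it has its own overhead:

    -- in postgresql.conf or via ALTER SYSTEM:
    --   pg_stat_statements.track_planning = on
    SELECT queryid, calls,
           round(mean_plan_time::numeric, 3) AS mean_plan_ms,
           round(mean_exec_time::numeric, 3) AS mean_exec_ms
    FROM pg_stat_statements
    ORDER BY mean_plan_time DESC
    LIMIT 10;
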
Starting point is 00:32:49 It's tracked there. I think this might be another argument for the log_min... what's the other setting, the logging one we were talking about — having both on. I think that one does include planning time. That's interesting. So, any difference there... Anyway, so that's
Starting point is 00:33:06 kind of a bit of a gotcha. Another small difference — and I think this is coming in Postgres 16, I saw a commit for it — is that it doesn't currently log the query identifier, which can be really useful. Yeah, the query ID — auto_explain currently doesn't log it, even if you use verbose, which is the parameter that adds the query ID in EXPLAIN. So that's another minor difference between them. And log_min_duration_statement does it, right? Yes. Good point. Another reason. Since some... or yes, I remember, in some version. 14, I think.
Starting point is 00:33:40 14, yeah. Since that version, we have a query ID matching the pg_stat_statements query ID both in the logs — the slow log, we call it the slow log, but it's not separate, unfortunately — and in pg_stat_activity. So we can basically load it into some tables and then join them and have quite good analysis. And some people do this.
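
A sketch of that kind of join — slow_log here is a hypothetical table loaded from csvlog or pgBadger output, with the query identifier (the %Q escape in log_line_prefix) captured into a query_id column:

    SELECT s.queryid, s.calls, s.mean_exec_time,
           l.log_time, l.duration_ms
    FROM pg_stat_statements AS s
    JOIN slow_log AS l ON l.query_id = s.queryid  -- the IDs match when query ID computation is enabled (Postgres 14+)
    ORDER BY l.duration_ms DESC
    LIMIT 20;
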
Starting point is 00:33:57 Right, yeah, that makes sense. Interesting. And one other thing I forgot to mention that is different between them is that in auto_explain we have an extra parameter, which is log_nested_statements. And that's another use case for this: we don't have access to that when we want to EXPLAIN queries inside a function, for example. Like, I know a couple of people that use a lot of... You go already too far.
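
A hedged sketch of the nested-statements case — get_user_orders is a made-up PL/pgSQL function; with log_nested_statements on, auto_explain also logs plans for the statements executed inside it:

    SET auto_explain.log_min_duration = 0;
    SET auto_explain.log_analyze = on;
    SET auto_explain.log_nested_statements = on;  -- default is off: only top-level statements are logged
    SELECT * FROM get_user_orders(42);            -- plans of the queries inside the function get logged too
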
Starting point is 00:34:27 So I saw discussions on Twitter the last couple of weeks, and there are different approaches. Some people say, no, no, no, not in production, not in production. But some people say, yes, yes. And I can say yes, just carefully. I like it. Match it.
Starting point is 00:34:45 But the details... it's slowly... Oh, by the way, sampling is very interesting. I didn't realize until maybe a couple of years ago, when you wrote this article, that sampling has existed for many versions already. Unlike query logging,
Starting point is 00:35:02 for which sampling appeared only in Postgres 13 or 14 — only a couple of years ago — in auto_explain it has existed for many years and many Postgres versions. So we can benefit from it, for example, in Postgres 12 or 11. All supported versions of Postgres, for sure. So it's great. It's great. I like sampling in heavily loaded systems. Even if we conclude that we have overhead — like, we enable it and nobody is using this information — if we do need it, it's okay to pay some price in some cases and have this very important information to optimize our system.
Starting point is 00:35:54 Completely agree. It will pay off very quickly. All monitoring of any description has some overhead, right? We could not monitor at all, but most people think monitoring is a good idea. So then the question becomes: how much overhead is worth it for the benefit? Yeah, observers and observer effects are everywhere. Yeah.
Starting point is 00:36:17 Wonderful. Should we call it a day? Yeah, I think, like, the summary is: measure it on your system. Be quite optimistic, I think. Like, the fears that are discussed around it sometimes come from not understanding where it can hit you, right? And if you understand where it can hit you, you'll be fine — like we discussed.
Starting point is 00:36:43 And sampling. So you can be good — even if the overhead for your system is quite significant, with sampling and a high threshold you can still find it very useful, because plans are a very good thing. Yeah, and don't set log_min_duration to zero — that's my final parting advice. Right, right. And don't send queries as-is to ChatGPT. And plans also — don't send plans as-is. Nice. All right.
Starting point is 00:37:13 Well, thank you, Nikolay. Thanks, everybody. And see you next week. See you. Bye.
