Postgres FM - What to log

Episode Date: June 6, 2025

Nikolay and Michael discuss logging in Postgres — mostly what to log, and why changing quite a few settings can pay off big time in the long term.

Here are some links to things they mentioned:

What to log: https://www.postgresql.org/docs/current/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHAT
Our episode about auditing: https://postgres.fm/episodes/auditing
Our episode on auto_explain: https://postgres.fm/episodes/auto_explain

Here are the parameters they mentioned changing:

log_checkpoints
log_autovacuum_min_duration
log_statement
log_connections and log_disconnections
log_lock_waits
log_temp_files
log_min_duration_statement
log_min_duration_sample and log_statement_sample_rate

And finally, some very useful tools they meant to mention but forgot to!

https://pgpedia.info
https://postgresqlco.nf
https://why-upgrade.depesz.com/show?from=16.9&to=17.5

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and this is Nik, founder of Postgres.AI. Hello Nik, how's it going? Hello Michael, everything's alright. Let's discuss logs. Everyone loves to deal with logs, right? What do you mean? Well, I mean, if you have a huge server, you have a lot of logs, so it's good to deal with them.
Starting point is 00:00:29 Well, I guess we're blessed in Postgres with a few ways that we don't always have to check logs, right? There are some system views where we can look at various information. For example, corruption, right? There's no such view. Yeah, there's quite a few things for which there's no such view. Well, there's quite a lot of information that's put in the logs and isn't available elsewhere. So it is important. And when I say it's put in the logs, actually quite often it isn't even put in the logs unless you change some settings. So I thought you had this good idea to discuss all of the
Starting point is 00:01:02 things that most people should change in terms of what to log, so that we can then check the logs for issues and when we're doing performance analysis and that kind of thing. Yeah, so I think ideally, for everything, we should have both system views and detailed logging. Oh, both. Yeah. Yeah, because, like, think about pg_stat_statements. We have normalized queries only. It's not possible to have there every query with all parameters, because cardinality would be insane. Right?
Starting point is 00:01:39 But equally, I think it's quite often not possible to have every query logged to the logs, right? Like there's also limitations there. Exactly. So pros and cons of both approaches. Cumulative metrics, they are good. You can visualize, you can see spikes, everything, but they lack details. And same for example, temporary files, right?
Starting point is 00:02:00 We can see at a high level, from pg_stat_database, that, okay, we generate a lot of temporary files, queries generate a lot of temporary files, some queries. We can go down to pg_stat_statements and see which query groups, or normalized queries, produce those temporary files. It's already super helpful. But obviously, for the same normalized query, if you take one set of parameters, it can produce gigabytes of temporary files. Another set of parameters can produce zero temporary files. Depending on parameters,
Starting point is 00:02:34 it can be very different. So we need examples, and this is exactly when we need to look at the logs. But at the same time, if we log everything, the observer effect is not a joke. And I must confess, several big clients suffered from the observer effect when I decided to log too much. It caused downtime and it was not a pleasant experience. So we always need to think about the observer effect. Those who say log everything, log all queries, they just don't deal with really heavily loaded systems, unfortunately. Just try to do it in a heavily loaded system which has like 100,000 TPS.
Starting point is 00:03:17 And that's it. You will put down everything very quickly. You will learn it's not the way to go. So we need to limit the amount of logging, but we cannot avoid logging, because we need examples. So my approach is to combine both things. Yeah, it's a balance, isn't it? We need to record enough information to be able to diagnose issues when they occur, but not so much that there's excessive overhead. So it's a balancing act. Right. And it's better to think in advance about what to log, because if some incident happened and you somehow mitigated it, you want to do root cause analysis,
Starting point is 00:03:57 but logs lack this information. So you can only guess. And you can guess quite well if you have a lot of experience, but still it's better to have logging tuned properly in advance. And I see Postgres, in the last few years, improving defaults, but still the defaults are not enough. We usually tune some defaults for our clients, always. I mean, we tune some log-related settings to have more logging. Yeah, in my research for this,
Starting point is 00:04:30 I found three different parameters that were either introduced or had the default changed from off to some version of on, all in Postgres 15. So I was kind of expecting a gradual improvement in this over time, but it seems almost, I don't know if it's a coincidence or not, but there was one major version where quite a few things improved, and I haven't seen a whole lot of movement in other areas, unfortunately. Right. Defaults changed in 15. You mean log_checkpoints, and log_autovacuum_min_duration set to 10 minutes, which is still enormous. It's a huge value.
Starting point is 00:05:09 What was the third? There's the new log_startup_progress_interval. But it's new. It's not a default change. But the decision was made not to introduce it turned off by default. Yes. It's a decision, which is a great decision. I agree.
Starting point is 00:05:24 And it's quite like 10 seconds, right? So every 10... Yes. Yeah, let's discuss everything we have, right? Let's start from something, then move to a different topic. But I agree. So 15 was revolutionary in terms of bringing good defaults in logging.
Starting point is 00:05:43 And... In some areas, yeah. And also, worth mentioning, the Postgres log is always a single log. So imagine if we had the ability to separate streams of logging. For example, slow queries could go in one place, maybe to faster disks, or maybe to... there is this idea, a very old idea, that some logging should go not to local disk but straight to some storage over the network, UDP and that's it, just send it. So we could log a lot more stuff. Although when you ask the guys who support, I don't know, Elastic or something, which
Starting point is 00:06:25 is receiving a lot of logging from different systems, they also would like to limit it, you know. Or Splunk, for example. Splunk is an old commercial enterprise system for logs. They can deal with huge volumes, but they are always still cautious about how much you are going to send to them. Yeah. It's a good point though.
Starting point is 00:06:47 Other databases do have, I'm pretty sure MySQL has, a separate slow query log, because I've seen people familiar with MySQL asking about where Postgres's slow query log is. Yeah, it's not a big change. I think it's easy to implement, but I imagine how many infrastructure changes would need to be done in all the systems if the slow query log were separated. What do you think? Am I missing something though? What's the benefit of having it separated out? You can put it on different disks, faster disks, for example. Analysis-wise also it's easier, right? The volume is likely to be so much higher than the other log.
Starting point is 00:07:29 Maybe some rotation policies could be different. And analysis-wise, I see a bunch of log messages which are related to query optimization. And I also see a bunch of log messages which are related to startup, autovacuum, checkpointing, more like not about query processing but everything else. And separating them, to me, sounds reasonable. So I imagine it would be possible, what do you think? Imagine there is a new setting which says everything related to query processing, including things based on log_min_duration_statement. Everything that can contain a query text, actually.
Starting point is 00:08:18 Everything. For example, slow queries exceeding log_min_duration_statement, queries which generated temporary files, queries which are slow because they are waiting for other queries and got logged after deadlock_timeout, one second by default, right? Imagine all of them go in one place, and everything else stays in the old place. But it's optional.
Starting point is 00:08:39 You can decide to choose it or not. I haven't thought long and hard about it, but I do quite like the simplicity of everything being in one place. But it can also be helpful, right? Like, I understand how, if you're not worried about slow queries and are looking for other things, they can be really noisy and get in the way. And in a heavily loaded environment, it would just be impossible without tools to then get anything useful out of that.
Starting point is 00:09:12 But, like, an autovacuum could be the cause of some issue that you're facing with query performance, or a checkpoint could be the cause, and maybe in a less heavily loaded environment where you're reviewing it manually, seeing that in order might help you debug what's actually happening here. Yeah, makes sense. Makes sense. And also some errors, like, what will happen with things which are erroring out? Yeah. But still, it could be a choice, right?
Starting point is 00:09:41 Because people who are interested in errors, corruption-related errors, I hope we will discuss them as well, and maybe checkpointer behavior and so on, are less interested in query processing itself. They often say, you know, your huge queries are being logged just because they are exceeding one second, and they make our log really hard to read. Yeah. Right. Well, hard to read manually, right?
Starting point is 00:10:15 With tools you can filter to the bits you need. Or with DuckDB you can have CSV and talk to it through DuckDB queries, which I think is a great idea, quite new, a great idea. But I like this almost as a way of splitting our conversation. You mentioned the performance-related stuff and everything else. Where do you want to start with one of those? Yeah, we already talked about log_checkpoints. Let's maybe talk about this. So in 15, it was changed to on and that's it. That's great.
Starting point is 00:10:49 I mean, it's easy. I think everyone needs it. And finally in 15, how long ago was it? Three years ago. Yeah, nearly. It was realized that everyone should have it. Everyone. So our advice to people would be, if you're on a version older than that, turn it on.
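For reference, a minimal sketch of the advice here, assuming you can apply settings with ALTER SYSTEM; log_checkpoints is already on by default from Postgres 15 onwards, so this only matters on older versions:

    ALTER SYSTEM SET log_checkpoints = on;  -- the default since Postgres 15; off by default before that
    SELECT pg_reload_conf();                -- picked up on reload, no restart needed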
Starting point is 00:11:07 100% so. If it's not already. Yeah, and another thing happened in 15, it's related to autovacuum logging. So log_autovacuum_min_duration, I think before that it was minus one, meaning autovacuum is not logged at all, and it was changed to 10 minutes. In my opinion, it's not enough. 10 minutes is way too long. It should be 10 seconds. In some places, we set it to zero to see everything. Of course, I can imagine, especially if you have many workers, and you also tuned autovacuum_naptime to a very low
Starting point is 00:11:47 value, so they sleep less between work. Or workers sleep less, right? In this case, I imagine the amount of logging from autovacuum workers might be substantial, but at least a 10-second or a few-seconds threshold is good. Because if you have a lot of tables, with 10 minutes you will miss a lot of details, and interesting details. And it's not that frequent a thing, if we go back to that, you know, recording enough information to diagnose issues without excessive overhead; the overhead of logging all of your autovacuums, unless you have, as you say,
Starting point is 00:12:33 really, like, yeah, a very low naptime and maybe lots and lots of tables, like, maybe you've got thousands of tables, thousands of schemas, like, if they've multiplied out. I think I heard somebody that was talking about having millions of objects in their database. All right. Do you remember, if you have, for example, a naptime of one second, you have 10 workers, for example, and you have a long-running transaction holding the xmin horizon, so autovacuum cannot delete some dead tuples, how often will those workers visit this table which needs processing? Maybe they will do it in a loop, right?
Starting point is 00:13:00 Yeah, good point. I don't know. And if the table is very small, they quickly process it, sleep one second, process it again, sleep one second. So it will be a lot of spam, right? Unless there is additional logic in autovacuum, I'm not 100% remembering. It would be good to check. It's worth checking. But also, once per second, when we're talking about logging, it's also still not that bad. But then, with thousands of tables...
Starting point is 00:13:32 I can imagine something like this, but also the default naptime is one minute and the default number of workers is three. So anyway, definitely we always recommend to our clients to see more of autovacuum's work in the logs. It's really helpful, especially during some incidents. After some incidents, we have this useful information to understand what happened in terms of data that cannot be deleted. It also reports the xmin horizon, by the way. It reports the delta between the current xid and the xmin horizon, so it's super useful.
Starting point is 00:14:19 Nice. So just to be clear, if you're on a version older than 15, make sure you have this set to something, so that you're actually logging autovacuum at all. And even if you're on 15 or above, consider lowering it substantially from the default. Yeah. One more thing here is that if we don't have activity at all, it will still be producing some log entries, which are probably not super useful. But then we don't have any autovacuum activity or any kind of updates, deletes. Yeah, or maybe it's select-only during some long time.
Starting point is 00:15:00 But that was the argument for not having log_checkpoints on, for like, if there's a silent database. But as soon... There was a battle about it, yeah. Yeah, but we're talking about very, very few log entries in the grand scheme of things still. I just mean if you, for example, build some service, if you have lots of databases, or you build a platform for other people. You can imagine many of them will have databases which are not super active, but logging has costs. Say you have like 10,000 databases and 9,000 of them are not super active. You will be paying for this disk space for logs, right?
Starting point is 00:15:50 This will make you think differently and try to save. But this is exactly, this is a different situation compared to, okay, I have one database, it's a hot startup, an AI startup, it's growing, it's a single monolith or a few monoliths or it's sharded, doesn't matter. And we need to understand our workload really well. In this case, we need to log everything, right? Yeah, I get what you mean. But going back to, let's say, the platform that has 10,000 of which 9,000 are quiet, in that case, why not just delete logs more often? Good point.
Starting point is 00:16:30 Good point. So yeah, if you, for example, understand that you're going to log something that's probably not super useful in a non-loaded system, but you know the maximum amount of that already, how many lines it will be, and you know it will be rotated, then there is just a constant price for each node that you're going to accept. And logs also can be compressed really well, for sure. So yeah, I agree. I agree with this point as well. I just wanted us not to forget about different perspectives people might have. One big pet versus a huge herd of cattle you need to take care of. Good. What about autovacuum? That's it, right? We need to know details, especially about larger tables,
Starting point is 00:17:24 especially when autovacuum is blocked. And if you're, for example, a backend developer, I think it's interesting for you as well, because it may sound like autovacuum is just some, okay, it's a garbage collector in Postgres, right? But if you have a chance to deal with the performance of the system you develop, the garbage collection part is very important, because every time you deleted or updated something, or had an insert with a rollback, you produced dead tuples. And basically, if you deleted something, committed, returned something to your application code,
Starting point is 00:17:59 but it's only half of the work done. Autovacuum will do the remaining half of the work, so to speak. And you definitely want to know when it happened, and in which volumes. At some point you will probably need to process a lot of rows, like delete millions of rows. We had an episode about this. And people keep asking, I saw comments yesterday, comments like, okay, delete, but what about updates? Well, similar, updates also produce a new tuple and mark the old tuple as dead, but autovacuum is the process which will actually clean it up. And we need to understand its behavior. And if we want to move really fast, we need to take care
Starting point is 00:18:47 of cleaning. Otherwise, we will see a lot of huge gaps in storage. Space won't be reclaimed, it won't be reused very quickly, and this is called bloat. So we don't want this. So we need to understand garbage collection in Postgres, which is called autovacuum. Yeah, that's why you want autovacuum logging, even if you are not a DBA, but just a backend developer. This was my point. Yeah. And just like you mentioned at the beginning, there are system views where we can look at when the last time autovacuum ran on each table was, which is useful, but you can't see the couple of times before that or... You don't see details, how many dead tuples were deleted, and whether the xmin horizon was actually a problem.
Starting point is 00:19:36 By the way, is there some work in progress to extend it? I don't know. Because it's worth extending, I think. pg_stat_all_tables, or pg_stat_user_tables. It's useful, but it's just a counter and a last timestamp. Not enough. Two counters, two last timestamps. Last timestamps for manual and automatic vacuuming.
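A sketch of the autovacuum logging tuning and the system-view check discussed above; the 10-second threshold is the episode's suggestion rather than a universal rule, and the query simply pulls the counters and timestamps mentioned from pg_stat_user_tables:

    -- log autovacuum (and analyze) runs that take longer than 10 seconds;
    -- -1 disables this logging entirely, 0 logs every run
    ALTER SYSTEM SET log_autovacuum_min_duration = '10s';
    SELECT pg_reload_conf();

    -- the system-view side: just counters and last timestamps per table
    SELECT relname, last_vacuum, last_autovacuum, vacuum_count, autovacuum_count, n_dead_tup
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 20;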
Starting point is 00:19:57 Okay, good. In the same release, just to close with Postgres 15, there was this log_startup_progress_interval. I don't think it's for backend engineers, it's more for DBAs, because if they tune checkpointer behavior and the distance between checkpoints is huge, significant, in this case startup might take a long time, and it's great that now we have this setting. As you said, the default is quite good. So we don't need to think about it at all. Yeah. What's next? What about log_statement? We talked about that in a recent episode as well. The default there is none, right?
Starting point is 00:20:30 Yeah, exactly. And I like to recommend people to set it to ddl, to put ddl there, just to trace all schema changes, when it was done, who did it. Of course, an actor may override it. But it's good to have some default there. And I think there's actually an ability to forbid overriding some settings, right? So I think there's two,
Starting point is 00:21:09 there's almost two categories here, right? You're talking about almost the security element of it. But I think even in the general case, it's still extremely useful to see when DDL changes were happening in the context of everything else. Timestamp, exact query. For example, it may save you time in the future if you, for example, start dealing with logical replication, doing some experiments, and it's blocked by DDL.
Starting point is 00:21:37 But you can trace when exactly it happened and what it was, which is great. Or even look back historically over the past 30 days and be like, which days do people run DDL on? Which ones don't they? You know, like you could plan an upgrade around when it's unlikely to have DDL happening. It's super helpful in various incident analysis, when at some moment something went wrong.
Starting point is 00:22:02 DDL, we discussed it many times, if it's not cooked properly, it may be a reason for downtime or some issues. And if you have log_statement equal to ddl, you can trace it. Yes. Yeah. At least after the fact, after an incident happened.
Starting point is 00:22:22 So definitely DDL is a good idea. And the cost is very low, right? How much DDL are you running on your system? Not a thousand per second, definitely. No. I hope. I see systems which have DDL as a part of their normal workload coming from users. It happens. For example, if it's a multi-tenant system which creates tables for each customer, and if they extend functionality, they also create something. Also, temporary tables, which I don't like at all. I avoid temporary tables at all costs, but people use them, and temporary table creation is also DDL.
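A sketch of the log_statement recommendation, assuming DDL volume on your system is low, as discussed:

    -- trace all schema changes (CREATE / ALTER / DROP and so on) with timestamp and user
    ALTER SYSTEM SET log_statement = 'ddl';  -- other values: none (the default), mod, all
    SELECT pg_reload_conf();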
Starting point is 00:23:06 Yeah, good point. So I can imagine that there are cases where we can have a storm of DDL coming to the logs, but for me it's an edge case rather than normal. Great. On the security side, I wondered about connections and disconnections, which are both off by default. I don't think it's... Sometimes I see people turn it on. It depends on the particular system. In many cases it's off. It's off by default. This information is usually most interesting.
Starting point is 00:23:46 It looks most interesting to DBAs, DBREs, not back-end developers. Let's move on closer to workloads and query optimization maybe. Yeah. And I think we did have an episode either about pgAudit or about... Did we do it specifically about pgAudit or about security or something like that? So I'll link up that episode, because I think that covers probably the more interesting parts of the DBA side of things. Sounds good. Yeah, maybe let's talk about log_lock_waits before we focus on workload analysis. So it's also part of workload analysis, just a specific part related to locks.
Starting point is 00:24:25 And we also discussed it recently, I remember. And this is definitely what we recommend turning on, because it's off by default. Because it's helpful. Also during incidents, it's helpful. If you're developing your app and didn't think properly about multi-session concurrency control and blocking situations, you might have incidents, a spike of active sessions, or maybe even downtime caused by heavy lock contention. And having logs related to this is really helpful. Really helpful, because without it, you can only diagnose something at runtime. Okay, you see there's a spike of exclusive locks.
Starting point is 00:25:14 So with log_lock_waits, you will see that some session tried to run some query, but it was blocked by another session. And it was waiting, our query was waiting, longer than deadlock_timeout, which is by default one second. So if you, for example, try to update one row, but there is another transaction which is not committed yet, it may have already updated this row, but it's not committed. And we need to wait until commit or rollback
Starting point is 00:25:44 for that transaction. Our update's waiting period lasts one second; on some systems it may be longer or shorter if deadlock_timeout was adjusted. This gets logged and you see, okay, our session's process ID, which executed this query, was waiting for long, and it was waiting for those process IDs. Unfortunately we don't see details about those sessions, this is a disadvantage, but it's already half of the coin visible, you know. Yeah, well, imagine trying to debug without this, it
Starting point is 00:26:20 just seems to me that reproducing the ordering for locking issues is so difficult without this kind of information. Yeah, also it's helpful if you have deadlock issues and you try to understand your system better, this kind of logging is helpful. I cannot say it will answer all your questions, unfortunately. I had cases when deadlock troubleshooting was extremely challenging. And we had logging, we had everything,
Starting point is 00:26:49 but it was still challenging. Sometimes it's really challenging. You need to go to the application code, understand what exactly it's doing. But it's already something, and it's better to always have it. Yeah. And this to me seems like a no-brainer. The trickier one, balance-wise, seems to be the
Starting point is 00:27:07 time, whether to reduce it from one second or increase it. And that seems quite loaded, because it's doing two things, right? It's determining at what time lock waits get logged, but also at what time deadlock detection happens. So it's doing two things. I understand the decision from an engineering perspective. It was like, where should we log it? Oh, this is where we're checking it.
Starting point is 00:27:37 It's a good time to log it exactly at that time. So I can imagine this decision, why it happened. But I agree with you, there is a little bit of a lack of flexibility, but it's also protecting us from huge volumes. Let's talk a little bit about the observer effect. Here I don't see it, actually. So for each session, there is this one-second delay, basically, by default a one-second delay before it gets logged. If we have a lot of sessions, okay, how many? 100 active sessions. Max connections. Yeah. We might have 100 of these messages per second. Well, it's a significant volume, but I think on this server, with 100 active sessions and I hope with many more cores, the disk is already quite powerful, so it can handle 100 entries
Starting point is 00:28:26 per second for this type of logging. But it might contribute in some cases, I think. Again, in some edge or corner cases, it might contribute a lot. So it's worth understanding. But personally, I've never seen a case when we decided to turn it off. And I think it should be on by default, but it's off by default. So our advice: turn it on. Yeah, good one.
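A sketch of the log_lock_waits advice; deadlock_timeout is shown only as context, since it controls both when lock waits get logged and when deadlock detection runs, and one second is already its default:

    ALTER SYSTEM SET log_lock_waits = on;  -- off by default; logs sessions waiting longer than deadlock_timeout
    -- deadlock_timeout defaults to '1s'; lowering it also makes deadlock detection run earlier
    SELECT pg_reload_conf();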
Starting point is 00:28:54 What's next? Should we move on to performance things? Yeah, maybe temporary files first. Yeah, good one. Kind of bridge the gap a little bit. Yeah, and by default, it's disabled minus one. Kind of bridge the gap a little bit. Yeah, and by default it's disabled minus one. Yes, so this is when, in fact, actually I looked into it, this is logging when the temporary file
Starting point is 00:29:14 gets deleted, interestingly. But what's happening is a query is using some temporary storage, for example, if a sort spills to disk, or there isn't enough working memory available to do a hash join, or any operation that uses temporary data. So I had this under performance, but I guess it is also just general system-wide tuning as well, around config parameters like work_mem and, more recently, hash_mem_multiplier as well.
Starting point is 00:29:49 Yeah, yeah. Yeah, I can imagine many queries will start producing a lot of logs if they need temporary files. It's such a multi-sided question to me, because if you produce... So if you try to save on logging of temporary file generation cases, deletion cases, as you said, you already have issues with dealing with disk. The query processing itself was dealing with disk. It's better to capture that in full and try to optimize it as soon as possible.
Starting point is 00:30:27 But would I put zero there? Zero means let's log all the cases, even if it's just eight kilobytes. Maybe no. But what I would do, what we usually recommend, I'm sharing three pieces of advice today. We say minus one is definitely not good. And of course, in some cases, if you don't see temporary file generation right now, looking at pg_stat_database or pg_stat_statements, because they also have temporary files written and read, a couple of columns, so you can see it there as well. If you don't see a big problem right now, maybe, okay, maybe let's not log it, but why not be prepared for the future, right? Yeah, I would say the opposite. I'd say if you're not seeing them right now, you could
Starting point is 00:31:12 even afford to set it to zero, because you're just not going to get much volume. What if tomorrow memory is not enough and we start logging very tiny amounts, tiny numbers, and it will be a lot of cases? Well, it depends if you've tuned work_mem. Even if work_mem is the default 4 megabytes, it would most likely only be logging queries that are using more than 4 megabytes of memory in sorts and things. So how many of those per second are you realistically running? It seems unlikely to me that it's going to be super high volume.
Starting point is 00:31:48 First of all, work_mem should be increased. But if you keep it very low and suddenly things change, data volumes become bigger and bigger, and for hash operations, ordering, we need more and more. At some point, we slightly exceed four megabytes. And many processes do it, for example. Boom, you need to generate temporary files, and with log_temp_files equal to zero, that adds extra log output, additional entries. My approach is different. If we don't see it right now, but we grow, this is a scary moment for me. I would prefer seeing very heavily loaded systems already having some temporary files, and we understand them, and we go from a quite significant value, like 10 megabytes, and we go down in stages. Restart is not needed, by the way, for all settings we discussed today so far. Restart is not needed,
Starting point is 00:32:53 which is perfect for going in iterations, gradually, right? So we do a gradual descent to zero, but maybe not to zero, maybe 200 kilobytes or something. But are you even going to get temporary files that are 100 kilobytes? Well, if we put 100 kilobytes into log_temp_files, we will see only the cases which exceeded it. Yeah, but what I mean is, if work_mem is 4 megabytes and hash_mem_multiplier is 2 by default, we're only realistically going to have temp files of, roughly, very roughly, four megabytes or eight megabytes and more. I understood you. Well, obviously we think about it differently and I'm not sure who is right.
Starting point is 00:33:36 You think when memory is not enough, everything goes to a temporary file. I think when memory is not enough, only the delta goes to a temporary file. Which is right? That's a good point. I think it's actually different in two different cases. So for example, with sorts, I think it's all or nothing. But I think, for example, with bitmap scans that have non-exact... Michael, we have homework to do. Yeah, it's a good point though. You're right, there probably are some operations, I think, including bitmap scans. I want to dive into internals in this area. It's a super interesting question. To some hackers, it might sound very simple actually. Well, I'm pretty sure with sorts it's all or nothing.
Starting point is 00:34:27 But it's a good point that going down in stages makes a load of sense. And I was just trying to say, I feel like you can skip it if you've currently got none. Yeah, why go down if it's all or nothing? I understand this logic, yeah. But low, for me low is almost as good as zero. Like, there's no difference realistically. It's 100 kilobytes, great. If that makes things much safer in extremely heavily loaded systems with lots of potential
Starting point is 00:34:52 max connections, great. Low is still good. Yeah. But again, this is a useful parameter, because if you have cases when... We discussed it in the beginning already. When you have cases, you can see some normalized query, but you need examples to understand how to reproduce it and then decide what to do with it: raise work_mem globally or locally, this will be a question. For one user, for like an analytics user, that kind of thing.
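A sketch of the staged approach to log_temp_files described above, plus the cumulative views mentioned earlier in the episode; the thresholds are just the examples from the conversation, and the second query assumes the pg_stat_statements extension is installed:

    -- start from a fairly high threshold and lower it in stages; -1 disables, 0 logs every temp file
    ALTER SYSTEM SET log_temp_files = '10MB';  -- later maybe '1MB', '100kB', or 0
    SELECT pg_reload_conf();

    -- the cumulative picture, without per-query examples
    SELECT datname, temp_files, temp_bytes FROM pg_stat_database;

    SELECT queryid, calls, temp_blks_read, temp_blks_written
    FROM pg_stat_statements
    ORDER BY temp_blks_written DESC
    LIMIT 20;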
Starting point is 00:35:18 Right, right, right, yeah, so okay. And memory management might be quite challenging, like it's sometimes not trivial. Good, so let's talk about the main part of query analysis. pg_stat_statements is not enough, right? So you need to either have a quite low log_min_duration_statement, at least one second, preferably lower, like half a second, maybe 100 milliseconds, to capture more cases of slow queries and then optimize them. Right? Yeah, absolutely.
Starting point is 00:36:00 And I like the point about them being examples. So it's not necessarily that the average of that query is going to be running in a hundred milliseconds. We want to catch the outliers. Like, what are the parameters for that query that mean it does tend to run over a hundred milliseconds? So we get those examples. It's perfect. And again, one that you can lower in stages? Yeah, definitely. So I would start, if it's a big unknown, from five seconds and go down. It doesn't make any sense to start with a really large value,
Starting point is 00:36:36 because I assume we're talking about OLTP systems where, for example, 10 seconds is already an absolutely slow query. So I usually see one second or half a second as the value being used these days, and if you've already performed a lot of optimization efforts and you want to improve even further, you can go down. But maybe it's worth, instead of this... Oh, before we go there, there's actually an idea that you can go down to zero, but switch to sampling. There is statement-level sampling and transaction-level sampling, which has been supported since a few years ago. Yeah, you can have two levels currently. You can set log_min_duration_statement to,
Starting point is 00:37:25 let's say, 100 milliseconds, and that will log everything above 100. Then you can have a second threshold, which is log_min_duration_sample, that's, you could say, 10 milliseconds, and then you separately set log_statement_sample_rate to, you could say, 0.1, and that would do 10%, or 0.01 would do 1%, and that would log, let's say, 1% of the statements that were between 10 and 100 milliseconds. Yeah. So yeah, it would be a way of capturing some examples. Yeah, like 1% of everything, and 100% if it's above the old log_min_duration_statement threshold. And this was added in Postgres 13, I think it was adjusted later, in 14, so we can say it's
Starting point is 00:38:19 already almost all versions. All supported versions, yeah. Actually, 13 is the oldest already. Wow. Yeah. Yeah. Time flies. Good, good.
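A sketch of the two-tier setup Michael describes, using the example thresholds from the conversation (100 ms, 10 ms, 10%); for the transaction-level sampling mentioned, there is also log_transaction_sample_rate:

    ALTER SYSTEM SET log_min_duration_statement = '100ms';  -- everything slower than this is always logged
    ALTER SYSTEM SET log_min_duration_sample = '10ms';      -- statements between 10 ms and 100 ms...
    ALTER SYSTEM SET log_statement_sample_rate = 0.1;       -- ...are logged with 10% probability
    SELECT pg_reload_conf();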
Starting point is 00:38:32 So we talked about this, but there's also the idea that instead of logging queries, let's log them with plans and just jump straight to using auto_explain, and just not use what we just discussed, log_min_duration_statement and log_min_duration_sample, blah, blah, blah. Right. So auto_explain brings plans, which is great. Yeah. And it also does log the statement, so you do still get the query. But you could, in theory, have a third setting. You could have a third value. So you might want to log the queries above 100 milliseconds, and you have a separate threshold
Starting point is 00:39:16 for logging auto_explain plans at a different threshold if you wanted to. It was you who surprised me a lot a couple of years ago when you said auto_explain has supported sampling for ages. Yeah. I hadn't noticed somehow. Yeah, but it's great. You can log actual execution with the plan, but only a one percent sample of it. Right. So, yeah. So this is slightly different though, I think, because with the sampling on log_min_duration_statement, I get the impression, well, I'm not sure, maybe this is a question, I get the impression
Starting point is 00:39:58 it's tracking them all, but only logging a percentage of them. So you're then saving yourself from an overhead around logging, because... Yeah, there are two of them: calculate metrics, and then log to disk. Exactly. So there's the observer effect of observing, of tracking timings, and then there's a second observer effect of actually logging. Whereas with auto_explain sampling, you don't even measure a certain proportion of the... so it is subtly different, I think. What would you recommend? Like, we can have so many options here, and worth saying, it's quite flexible, we tune it all, and then you think, okay, this is a big complex system. We have so much logging.
Starting point is 00:40:52 We log everything, like not everything, like sometimes the same thing, sometimes not. But then you think, actually, you know what? I still need pg_stat_statements, because I want reliable, higher-level metrics to understand. Because I identify 10 queries I need to work on, but where to start? Which is the most influential in various terms? I need pg_stat_statements to realize that. And I connect them by query ID, and query ID in auto_explain is supported only since version 16, I think,
Starting point is 00:41:25 so this was a big missing piece before 16. But once you're on 16, or 17, or maybe already 18, even better, I don't know, then you connect by query ID and you see, okay, based on the logs, I see these plans, these queries, need my work, need attention. But I'm choosing this one because I see it's like 25% of all time spent, based on pg_stat_statements. And this is a great way. So this brings us to the idea that having logs alone is not enough. Yeah, well, they kind of serve slightly different purposes, don't they? Because if a query is taking 25% of total CPU, like if it's just taking 25% of our resources,
Starting point is 00:42:15 it'll probably be relatively easy to reproduce that. If we grab the query, grab an example that's, you know, relatively slow, and run explain analyze with all the parameters on that, we'll probably get a slow query plan. But where auto_explain comes into its own is if queries are only, like, if they're slow sometimes, but not always. If there's very specific... A plan flip happens, for example. Exactly. It's cases like that that might be difficult to diagnose. Maybe they happen at certain times, or, you know, in cases that you're struggling to put your finger on. Getting
Starting point is 00:42:58 the plan at that moment is extremely helpful, and it's not always possible if you just log this... Do you think that auto_explain should be in a different log? Huh... I mean, I still... I may have already mentioned this recipe in the past, I used it. So you have a tmpfs disk, a RAM disk, right? Yeah, yeah, like a gigabyte of memory. So one gigabyte, and the log goes there, and
Starting point is 00:43:34 you quickly rotate it to ensure it's not filled up, and send it somewhere to storage. And it would be great to have a separate log for such very bloaty things like auto_explain, right? Well, and that's the big difference between auto_explain and just logging the query, because even queries can get long, but query plans can be ridiculously big. Like, the largest one I think I've seen was getting close to 100 megabytes of text. Just one. Just one query plan. A lot of that was, like, there was a lot of white space and a lot of...
Starting point is 00:44:12 It should be compressed, immediately compressed and sent over the network, but compressed by whom? The logging collector, or... That's an interesting question. So it is the big downside in terms of volume. So we see so much power we have right now for observability and query processing, query optimization, but still it feels like in heavily loaded systems, it's hard to have good... It's not super hard, but there are opportunities to improve, 100%. And if you've not got issues right now, and if you've got a bit of...
Starting point is 00:44:45 if you've not got issues right now, it's great to set a bunch of these, and you can afford to set things like auto_explain. If you've got headroom, you can set these things up in advance, and then change them. If you start to get into the 90% of CPU being used, you can look into things you can do to reduce
Starting point is 00:45:06 those. But I really like a quote, I think somebody was at an Oracle conference and somebody asked what's the overhead of all of the performance analysis Oracle's doing. Oh, yeah, I think it was Tanel Poder. Oh, nice. Yeah, he said something like, I can pay like 10% or something of overhead but save like 90% thanks to my optimization based on that information. Yes, I think it's along those lines. I think if you have this information.
Starting point is 00:45:40 If you have this but you're not going to use it, you're losing, right? But if you are definitely going to allocate your resources to optimize, you are ready to pay this penalty, or tax, because I know I will optimize and I will be in a better spot after it. Right? Yeah, exactly. It's not a liability. It's an asset at that point. And those 10%, they will shrink, because thanks to my optimization, compared to initial latencies, they may become smaller. Right. You know,
Starting point is 00:46:19 maybe if they're a percentage, I've got a feeling some of them are fixed. Yeah. But if we are not sure that we will be able to optimize, I can understand that it can sound risky to pay this additional price without a reliable outcome, you know, like who knows how successful we will be. But again, all of these things, except putting auto_explain into shared_preload_libraries, you can change them on the fly. So it's so great. You can temporarily lower all thresholds during a week or two while you're optimizing, for example, and then get back to normal, and do it every quarter, for example, as routine optimization. Yeah, so cycles of optimization, or also if it's like non-production or something,
Starting point is 00:47:10 you can be more aggressive in going down with thresholds and having more. To see more. Yeah. That feels like... one last thing on auto_explain: the thing the docs and quite a few experts recommend is turning off the timing parameter. So you collect the explain analyze information, but without per-node operation timings. It does reduce the overhead, of course, but it also reduces the fidelity of those plans, and it is quite useful to know.
Starting point is 00:47:47 Well, here I don't agree with you. For me, the most valuable things are structure and buffers, as usual. Of course, I get that they're more useful, but I think quite often, in a lot of environments, collecting the timing information is also worth it. Well, yeah. You had a post long ago about this, but we also had a whole episode about auto_explain. Do you remember? Yeah. I'll link up the relevant episode. It's a big topic. auto_explain has many settings to tune, so it's a big topic. Good. I think maybe that's it, right?
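A sketch pulling together the auto_explain settings touched on in this discussion; shared_preload_libraries is the one piece that needs a restart (the example value also includes pg_stat_statements, as discussed), everything else can be changed on the fly, and the specific values here are only illustrative:

    -- restart required: load the module, e.g. in postgresql.conf:
    --   shared_preload_libraries = 'pg_stat_statements,auto_explain'
    -- the rest can be changed with a reload:
    ALTER SYSTEM SET auto_explain.log_min_duration = '500ms';  -- log plans for statements slower than this
    ALTER SYSTEM SET auto_explain.sample_rate = 0.01;          -- only 1% of eligible statements
    ALTER SYSTEM SET auto_explain.log_analyze = on;            -- actual row counts; needed for buffers and timing
    ALTER SYSTEM SET auto_explain.log_buffers = on;            -- buffers, which Nikolay finds most valuable
    ALTER SYSTEM SET auto_explain.log_timing = on;             -- per-node timings; docs and some experts suggest off to cut overhead
    SELECT pg_reload_conf();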
Starting point is 00:48:27 Yes, definitely. Thanks so much, Nikolay. Catch you next week. Yeah, bye-bye. Bye.
