Postgres FM - Monitoring checklist

Episode Date: August 19, 2022

Monitoring checklist (dashboard 1):
- TPS and (optional but also desired) QPS
- Latency (query duration) — at least average. Better: histogram, percentiles
- Connections (sessions) — stacked graph of session counts by state (first of all: active and idle-in-transaction; also interesting: idle, others) and how far the sum is from max_connections (+ pool size for PgBouncer)
- Longest transactions (max transaction age or top-n transactions by age), excluding autovacuum activity
- Commits vs rollbacks — how many transactions are rolled back
- Transactions left till transaction ID wraparound
- Replication lags / bytes in replication slot / unused replication slots
- Count of WALs waiting to be archived (archiving lag)
- WAL generation rates
- Locks and deadlocks
- Basic query analysis graph (top-n by total_time or by mean_time?)
- Basic wait event analysis (a.k.a. "active session analysis" or "performance insights")

And links to a few things we mentioned:
- Postgres monitoring review checklist (community document)
- pgstats.dev
- Improving Postgres Connection Scalability: Snapshots (blog post by Andres Freund)
- Transaction ID Wraparound in Postgres (blog post by David Cramer)
- Subtransactions Considered Harmful (blog post by Nikolay)
- datadoghq.com
- pgwatch2 (Postgres.ai Edition)

------------------------

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov and @michristofides.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:
- Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai. Hey Nikolay, what are we talking about today? Hi Michael, today we are going to talk about monitoring, and I think we will touch very, very wide areas of Postgres, and probably we will revisit each one of them later, going deeper into specific sub-areas. Right, but every project should have it. Well, what do you mean? Let's check that first. Naturally, for large enterprises we definitely want monitoring, we definitely want database-specific things being monitored, sometimes with a couple of tools, so like
Starting point is 00:00:42 sometimes I saw big projects using three monitoring tools, and all three were monitoring Postgres. None of them was good enough to be used alone, right? So we needed to combine parts from three places. It happens, I think. By the way, it's not
Starting point is 00:01:00 a problem if you have a couple of tools used, if you know that the overhead is low, right? Yeah, absolutely. So we are talking about Postgres monitoring. This is a huge topic, and my main statement is that there is no good product developed yet. No perfect product, maybe — or even no good product. Okay, "good" is maybe okay. I don't see great products, because I understand what should be present in any Postgres monitoring. And it's not just my point of view. A couple of years ago, we started some community work. We started to discuss various tools. We had several sessions in the Russian community.
Starting point is 00:01:41 We had several sessions reviewing various tools, both open source and commercial. And then we created a document called the Postgres monitoring checklist, collecting ideas about what should be present in monitoring, what should be must-have or nice-to-have, and so on. And then some people outside the Russian community joined as well. And eventually, I think we've got a very good understanding of what should be present, like 100%, if you want your monitoring to be very useful. And I don't see any product, commercial or open source, or maybe even in-house, that meets it. Sometimes among in-house products I see very good ones, but none of them meet these criteria even 80 percent; usually it's below 50. So yeah, so
Starting point is 00:02:33 we're talking about things that you consider probably almost a must-have for most projects that people care about the uptime of, making sure they have each of those covered. Is it monitoring, or monitoring and alerts? Do you count those as the same thing? Well, alerts are a slightly different topic. I'm talking about troubleshooting sessions, when we open monitoring and try to understand what's happening, what's wrong, because we have some complaints from users or from our co-workers, something wrong is happening with Postgres, and we need to understand what exactly is wrong. And I see two big areas which must be present in any monitoring.
Starting point is 00:03:10 The first area is, of course, physical metrics like CPU, memory, disk I/O, network. It should be covered, and almost always it is covered. And it's kind of separate. But then there are Postgres specifics, and here we can recall Alexey Lesovsky's pgstats.dev, this interactive picture which shows you the architecture of Postgres and which system views and extensions or tools are responsible for monitoring
Starting point is 00:03:40 each particular part of the Postgres architecture. It's a great tool. We will provide a link in the description, I'm sure. So this should be present in monitoring, and this first part I call dashboard number one. The goal of dashboard number one is: you have like 30 or at most 60 seconds, and you need to perform a very wide and very shallow overview of all Postgres components, all Postgres areas, and understand which area does not look well.
Starting point is 00:04:13 Then we need to go deeper, using other dashboards probably, or maybe manually looking at Postgres and so on. And the second component is query monitoring. Query monitoring is, like, not the king, it's the queen of observability, right? It should always be present. It's a whole different topic. It should be present in any monitoring. And again, I don't see all,
Starting point is 00:04:33 like, some systems are improving. For example, Datadog is currently improving very well in this area particularly. And by the way, they do it for all databases, not only for Postgres. I just see how much better they are compared to two years ago, but they have big mistakes as well — like, not mistakes but disadvantages, like missing parts which don't allow me to work with it. But probably we should
Starting point is 00:04:58 postpone the discussion about query analysis, it's a different talk, but let's focus on this first, wide and shallow dashboard. Like, I want everything quickly, a little of each component. So the use case is: my pager has just gone off, or I've got a notification, or users mentioned something that doesn't sound good. And I want to look at something and see where the problem is. Like, where do I even start? Right. So some people say there are errors related to Postgres, or some performance degradation, and we want to have ten or a dozen, like 12, metrics present on a single dashboard.
Starting point is 00:05:48 So we quickly understand: are we okay compared to, like, one hour ago or one day ago or one week ago? Or are we not okay in particular areas, so let's dive deeper. So let's probably quickly have some overview of these metrics, because I have a list, and we can also attach this list to our episode, right? Yeah. So what's at the top of your list? At the top of the list are two major things which represent the behavior of any multi-user system: throughput and latency. Not only multi-user, but also where a network is involved. So throughput and latency. Throughput, if we talk about
Starting point is 00:06:20 databases, Postgres particularly, can be measured in several ways, but two ways are most interesting to us: TPS and QPS, transactions per second and queries per second. And unfortunately, Postgres doesn't allow you to measure the second one really well. TPS is present in pg_stat_database or pg_database — I always mix them up — xact_commit and xact_rollback. We observe these two numbers and how fast their sum is increasing. This is our TPS. Or decreasing, right? If there are some problems where it could... It cannot decrease, it's cumulative. It's just a counter of committed transactions and rolled back transactions. Sorry, I thought you were talking about TPS. So, right. What can decrease is if we divide it by the duration — like, we need two snapshots, as always, right? And we need to understand how many seconds passed between them. So we divide. And of course, this speed can go up and down, TPS can go up and down. But QPS, unfortunately, Postgres system views don't provide it. And the usual trick is to use pg_stat_statements, but it's not the exact number, unfortunately.
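As a minimal sketch of the counters mentioned here (assuming the cumulative behavior described above): sample them periodically, and TPS is the delta between two samples divided by the elapsed seconds; the pg_stat_statements sum gives only an approximate QPS.

```sql
-- Cumulative transaction counters; graph the per-second delta between samples as TPS.
select sum(xact_commit + xact_rollback) as xacts_total,
       clock_timestamp() as sampled_at
from pg_stat_database;

-- Approximate QPS source: total calls tracked by pg_stat_statements
-- (undercounts whatever has been evicted once pg_stat_statements.max is exceeded).
select sum(calls) as calls_total,
       clock_timestamp() as sampled_at
from pg_stat_statements;
```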
Starting point is 00:07:25 Because pg_stat_statements — as I remember, the pg_stat_statements.max parameter by default is 5,000, if I'm not mistaken. It can be tuned to 10,000 or 100. It means that we don't register all queries. So we probably see only the tip of the iceberg. And QPS can be wrong if we use this approach, if we use the calls metric from pg_stat_statements. When I have doubts — in most cases we have PgBouncer — I check the PgBouncer logs, and it reports QPS, like a fair number of QPS. But this is maybe already too deep. Yeah, let's focus on the things we can measure in Postgres. Right. This is throughput. Latency is query duration. Again, usually it's measured from
Starting point is 00:08:09 pg_stat_statements timing. And why is it important? Because we need to understand, for example, if QPS and TPS dropped significantly, maybe it's just a problem with the application: errors in the application caused our load to drop. Or maybe it's a problem related to CPU — all CPUs are at 100%, we have CPU saturation or disk saturation on Postgres, and that's why we process fewer TPS. And latency, of course, is very important, and I see that in many cases we don't have it in monitoring. These two metrics are, like, the general health of our database and should always be there: we have a problem, let's check TPS, QPS, and latency, the average duration of queries.
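A rough average-latency number can be derived from the same extension; a minimal sketch (the column names are total_exec_time / mean_exec_time on Postgres 13+, total_time / mean_time on older versions):

```sql
-- Average statement duration in milliseconds across everything tracked by pg_stat_statements.
-- Like the counters above, graph the delta between two snapshots rather than the lifetime average.
select sum(total_exec_time) / nullif(sum(calls), 0) as avg_latency_ms
from pg_stat_statements;
```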
Starting point is 00:08:59 So, the next thing is very, very important, and I also see that more than 50% of systems are not good with this: it's connection monitoring. My ideal is just one single stacked graph of connections: active, idle in transaction, idle, and a fourth one I don't remember. But these three already tell everything, right? So, active: we process something. If active goes above the number of CPUs, it's already a bad sign. So if you have 12 cores and we see 20 active, well, probably there's already too much work to do for this server, and probably we already have huge performance degradation.
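A minimal sketch of the query behind such a stacked graph (the backend_type filter assumes Postgres 10+):

```sql
-- Session counts by state, to be stacked and compared against max_connections.
select state, count(*) as sessions
from pg_stat_activity
where backend_type = 'client backend'
group by state
order by sessions desc;

show max_connections;
```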
Starting point is 00:09:34 It's not always so, but in many cases it is. Idle and idle-in-transaction connections are also very important, especially before Postgres 13 or 14, where optimizations... I think 14. 14, right, right. So I
Starting point is 00:09:50 sometimes saw people say, okay, we have a couple of thousand more idle connections, don't worry, they are idle. But the overhead is big. And I think Andres Freund had a very good couple of posts — we should provide links, right — explaining the overhead.
Starting point is 00:10:06 I think he did a lot of the work in 14. I think he was optimizing... it was related not to memory consumption, as I remember, but to how the work with snapshots is organized. Usually people think a couple of thousand connections give us a lot of memory overhead. No, it's more about
Starting point is 00:10:22 how the work with snapshots is organized. But what I like about stacked graphs is that we can quickly understand the total as well, right? And sometimes people split it into multiple charts, multiple graphs — it's also fine, but I like to see the connection overview in one place. And of course, idle-in-transaction sessions — or connections, sessions, it's basically the same — are sometimes very dangerous, because if the application started a transaction and went to talk to some external API, for example, or to do some other work, it can be very dangerous. So we should also understand, like, we should have some threshold
Starting point is 00:11:01 and not allow the system to go above it. So yeah, at this point, I guess we're talking about long transactions? No. Well, sometimes these things correlate, but not always. Because we can have, for example, at each moment, 200 idle-in-transaction sessions, right? But the duration of each transaction is less than one second. It might happen: very brief transactions, but they do something and keep connections idle in transaction. It's also possible. But the next item is long transactions, actually, and this is a big topic as well, because usually long transactions
Starting point is 00:11:38 have two dangers. One danger is to lock something and block others, that's one point. Another point is disturbing autovacuum's work, but for the latter we should talk not about long transactions but about the xmin horizon — it should be different, and I never saw it; like, no monitoring distinguishes these two things yet. Of course, I didn't see all existing systems, but over the last couple of years I've been observing various systems, like dozens of them. Should we talk about that then straight away? So the xmin horizon being... to avoid transaction ID wraparound, which is, you know — there have been some famous blog posts that I can include on that. Right. Transaction ID wraparound is one danger from the xmin horizon being stuck in the very
Starting point is 00:12:26 distant past, but also vacuum accumulating a lot of dead tuples, which then convert to bloat — also a problem, see the previous episode. Right, right, exactly, we discussed it already. But again, a simple graph: what is the maximum age of the longest transaction right now, excluding vacuum? Because, by the way, it's tricky: a regular vacuum doesn't hold the xmin horizon, it doesn't block anyone — except that with ANALYZE it can hold the xmin horizon, and if the vacuum is working to prevent transaction ID wraparound it can block others. So in each area there are interesting ifs, right? If-then, if-then. So it's quite difficult to build ideal monitoring, of course. But anyway, I want to have a graph: do we understand the maximum age of a transaction right now? Usually people say, I don't know. What
Starting point is 00:13:19 kind of number are you worried about there? Right, right. In OLTP systems — and by the way, we forgot to mention that we are again discussing OLTP systems and monitoring for them, right, because for analytical systems things are different — for OLTP, I would like to have everything ideally under one second. Again, it's related to human perception, which is, like, we can distinguish roughly 200 milliseconds. So one second feels slow. But of course, sometimes we need slightly longer transactions, a few seconds. But a one-minute transaction, if it can block someone else, is already a problem.
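A minimal sketch of the queries behind such a graph (pg_stat_activity's backend_type column, available since Postgres 10, is assumed here to filter out autovacuum workers):

```sql
-- Age of the oldest open transaction among client backends (autovacuum workers excluded).
select max(clock_timestamp() - xact_start) as max_xact_age
from pg_stat_activity
where backend_type = 'client backend'
  and xact_start is not null;

-- How far back the xmin horizon is being held, expressed in transaction IDs.
select max(age(backend_xmin)) as xmin_horizon_age
from pg_stat_activity;
```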
Starting point is 00:13:55 We should split it into batches. Yeah. But we discussed it briefly last time. If you create an index, unfortunately, you hold the xmin horizon. And of course, if your table is big, you sometimes need to spend an hour or maybe even more to build an index. And that's why partitioning is needed,
Starting point is 00:14:16 again, to keep this time sane. So we're about halfway through our, nearly halfway through our dashboard. Good. So the next thing is the ratio of errors. We can measure it — we already discussed this — from pg_stat_database, xact_commit and xact_rollback. How many transactions are committed?
Starting point is 00:14:36 How many transactions are canceled? And we can have a ratio to understand the percentage of successful transactions. And it's interesting: sometimes I see a system produce only 50% or even fewer successful transactions, but developers say this is intentional, this is by design. And it triggers interesting questions.
Starting point is 00:14:56 What negative consequences can there be if applications or users roll back too often? It's a separate topic. But the main reason you want this ratio is to see if it changes from your baseline. So if you normally have a very high ratio, you just want to tell: if it's usually 99% and suddenly we have 70%, obviously it means we have a spike of errors. So we probably need to go to the logs and understand which errors, or if you use the extension called logerrors, you can have counters right inside the database, so you can select them aggregated by error type.
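A minimal sketch of the ratio calculation from the counters mentioned above (they are cumulative, so a dashboard would graph deltas between snapshots rather than the lifetime percentage):

```sql
-- Share of committed transactions per database; a sudden drop from the baseline suggests an error spike.
select datname,
       xact_commit,
       xact_rollback,
       round(100.0 * xact_commit / nullif(xact_commit + xact_rollback, 0), 2) as commit_ratio_pct
from pg_stat_database
where datname is not null;
```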
Starting point is 00:15:26 So the next thing is transaction ID wraparound, a separate metric. Here people always... okay, not always: in 90% of cases, people forget about multixact ID wraparound. They just don't check it. Like, it's more exotic. Like, transaction ID wraparound is already exotic.
Starting point is 00:15:47 But when it happens, it's a big disaster. And multixact ID wraparound is even more... it's even more rare, very unusual. Is this the thing that you mentioned in the GitLab blog post about subtransactions? Briefly. Yes, briefly.
Starting point is 00:16:04 There is a connection to it, yes, because if we use SELECT FOR SHARE, and sometimes SELECT FOR UPDATE, we might have quickly increasing multixact IDs and the related risks. Well, if we use foreign keys — and we just use foreign keys — we have multixact IDs, that's it, of course. So are we talking about two graphs here then? Like, what do we actually need for our dashboard? It can be one, but the idea is that we should not forget about this second ID: it also can wrap around, it has the same risk. So we should monitor both risks and alert when it's already above a threshold and so on.
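A minimal sketch of how both wraparound risks can be watched (mxid_age() is available since Postgres 9.5); the hard limit is around 2 billion, and alerting usually starts at some fraction of that:

```sql
-- Distance travelled towards transaction ID and multixact ID wraparound, per database.
select datname,
       age(datfrozenxid)    as xid_age,
       mxid_age(datminmxid) as multixact_age
from pg_database
order by xid_age desc;
```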
Starting point is 00:16:52 Okay, then the health of replication. Again, for this dashboard we need only a brief analysis. Like, do we have at least one standby, physical or logical, which is lagging? We should define our thresholds and observe. That's it. Like, if we see a spike or a growing lag, it's also a problem: you can get an out-of-disk-space event or something not good. Then only a few things are left. By the way, I realized that in this list — we will provide this list, as I've said — we don't have vacuum behavior, autovacuum behavior, and we don't have checkpointer and bgwriter behavior. We have it in our systems, we have it on dashboard two. But I think we probably should move it to dashboard one as well,
Starting point is 00:17:29 because these are two big components of Postgres, vacuuming and checkpointing, right? Yeah. So we should have it here. Before you move on from this one, I see you also had unused replication slots, which I thought was really interesting. I've heard of some issues where, like, if you've got a static number of those and it decreases, if something ends up using one of those replication slots that you weren't expecting, that could be a sign of a problem as well, right? The problem is that, right, we don't have lag per se here, but if a slot is unused, it's accumulating WALs, so it's easy to run out of disk space soon. So unused slots are definitely dangerous.
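A minimal sketch of the queries behind replication-lag and slot graphs, to be run on the primary (the LSN function names assume Postgres 10+):

```sql
-- Replication lag in bytes for each connected standby.
select application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag_bytes
from pg_stat_replication;

-- Inactive slots keep WAL on disk; watch how many bytes each slot retains.
select slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) as retained_wal_bytes
from pg_replication_slots;
```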
Starting point is 00:18:07 But everything can be combined here on one or a couple of graphs. As I've said, we want to keep this dashboard quite small so we can quickly overview everything. And then, after replication, I'd like to talk about lags related to archiving. Very often people don't have it. They don't monitor the lag, like, how many WALs are not yet archived. Why is this important?
Starting point is 00:18:34 If you try to stop Postgres — restart or just stop — and it has a lot of unarchived WALs, you know, like when we shut down Postgres there is a so-called implicit shutdown checkpoint, and in the same way Postgres tries to archive them all. So it runs archive_command for all pending WALs, and it can take hours. If we accumulate thousands of WALs — it happens — you wait.
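A minimal sketch of how the archiving lag can be measured (pg_ls_archive_status_dir() exists since Postgres 12 and requires superuser or pg_monitor membership by default):

```sql
-- Archiver progress and failures, as cumulative counters.
select archived_count, last_archived_wal, failed_count, last_failed_wal
from pg_stat_archiver;

-- WAL segments still waiting to be archived (.ready status files).
select count(*) as pending_wal_files
from pg_ls_archive_status_dir()
where name like '%.ready';
```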
Starting point is 00:18:59 Well, it's not just if you want to upgrade or something, it could also be if there's a crash: how long is it going to take you to recover? Well, recovery is different. If it crashed, it will recover quite soon — it just replays, redoes what's needed, and that's it. I'm talking about an intentional restart or shutdown. For example, if Patroni wants to perform a switchover or failover for some reason, it tries to shut down the primary — until some version not long ago, because I told Kukushkin about it, and it was a surprise, because we had it in production in some system: Patroni failed to perform a failover because a lot of WALs were pending to be archived, and Kukushkin fixed it in recent versions of Patroni. Upgrade, upgrade!
Starting point is 00:19:48 It's quite a new thing, quite a new bug fix. Well, it's not a bug fix, it's like an improvement: Patroni doesn't wait and just tries to perform the failover. There is a trade-off, because what to do with these WALs, right? I don't remember the details there, because if we don't archive WALs, it disrupts our DR strategy, disaster recovery strategy, right? We need to keep backups in good shape. So this is an interesting question as well. But anyway, monitoring should have
Starting point is 00:20:17 this number of pending WALs, or also maybe the rate of archiving, like how many WALs we archive per second, for example. It's a very interesting metric to understand as well. You've got that next on the list. Ah, okay. Yeah, okay.
Starting point is 00:20:34 WAL generation rates. And yeah, as I recently learned, if you have a few WALs per second, probably you won't be able to use logical replication, because a single WAL sender will saturate a single CPU core.
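A minimal sketch of how a WAL generation rate can be derived: sample the current WAL position periodically and graph the byte difference per interval (the literal LSNs below are only placeholders for two such samples):

```sql
-- Current write-ahead log position; store it together with the timestamp on every sample.
select pg_current_wal_lsn() as current_lsn,
       clock_timestamp() as sampled_at;

-- Bytes of WAL generated between two sampled positions.
select pg_wal_lsn_diff('0/3000000'::pg_lsn, '0/1000000'::pg_lsn) as wal_bytes;
```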
Starting point is 00:21:00 And finally, locks and deadlocks. Well, maybe they should not go together, because they're different things, right? But locks, of course: sometimes people draw both exclusive and share locks — there are many kinds of locks; here we're talking about heavy locks, not lightweight locks, latches. So, heavy locks: we're locking rows and locking tables. And here sometimes people try to draw everything, but we should maybe focus only on exclusive locks, which can be the reason for blocking others. Well, there are also many problems there; this chart alone can be difficult to implement properly. But anyway, we want to have something. I just saw sometimes application engineers say, oh, we had a spike of locks — you look at it, and it was a spike of share locks, it's okay because you had some deployment, it's normal, it's not a problem. So this chart is also interesting — what should be there. And deadlocks, of course: we want to understand how many deadlocks we have.
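A minimal sketch of the underlying data for both graphs (heavyweight locks by mode from pg_locks, and the cumulative deadlock counter from pg_stat_database):

```sql
-- Heavyweight locks currently held or awaited, grouped by mode.
select mode, granted, count(*) as locks
from pg_locks
group by mode, granted
order by locks desc;

-- Deadlocks per database; cumulative, so graph the delta between snapshots.
select datname, deadlocks
from pg_stat_database
where datname is not null;
```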
Starting point is 00:21:53 So many other things are missing here, but I'm not a big fan of combining the physical and the logical. For example, in Datadog, if you check database monitoring, you will see CPU combined with TPS. It's a mix of everything. It's not right. Not right. Host stats I would like to see separately from database stats. What do you think? Interesting. Well, I think in the early days I've seen a lot of teams get really far looking at application stats and then adding some database things into it. They don't have some of these, like, real scaling issues, I guess. I think a lot of people implement monitoring after a big issue, after a big outage. They think, oh, maybe we do need it now.
Starting point is 00:22:47 So probably people put this in place a little bit too late in general. But yeah, I can see an argument for keeping it where the application monitoring is, especially from the query performance point of view — at least from the performance monitoring side, maybe not so much from the troubleshooting, we've-had-an-outage perspective. Right. Well, again, we didn't discuss query analysis here. It's another half an hour at least.
Starting point is 00:23:14 It's a very interesting, very deep topic. But I also forgot to mention that we took pgwatch2 from Cybertec. And I remember when we first reviewed it, it didn't feel right — like, strange dashboards to us, like, you explain, oh, why not add this, why not add that. But the fact that it's open source, good components like Grafana and so on, and it can use Postgres and Timescale... So Vitaly from my team initially implemented the set of dashboards following this philosophy. So dashboard number one is for quick and very shallow troubleshooting, understanding which
Starting point is 00:23:59 areas need further analysis. Dashboard number two: more interesting things for DBAs. Dashboard number three is query analysis. And it's available — it's open source — and recently another contributor helped us support the Timescale version, so Postgres and Timescale can be used as storage. And I encourage everyone to consider it as a secondary monitoring tool. Probably you already have something like Datadog, which is, like, usual, or anything
Starting point is 00:24:31 else which, usually, the organization has already adopted — some tool for everything, not for databases, not for Postgres specifically. And of course it's worth having it everywhere, because it's a unified solution. But it's missing a lot of things.
Starting point is 00:24:48 And I hope I convinced our listeners today that there are many things that should be present there. So pgwatch2 can be added as a secondary monitoring, very gentle, without a big observer effect, and it can provide a lot of insights for database experts. And actually not only experts: this TPS, QPS plus latency — this throughput plus latency — alone is already something very good, which many systems are missing. Yeah, absolutely. So yeah, I guess if you don't have some of these things,
Starting point is 00:25:24 maybe look into how much work it would be to add them. And hopefully it would save you, or make it a lot easier, next time you have a big issue, to spot what's going wrong. Right. And if you develop your tool in-house, or you're a developer of a monitoring product, this list is a product of community work, like several months of work. I discussed it on Twitter, during our sessions in the Russian community and so on. I think it's a very, very useful checklist for everyone who wants to develop great Postgres monitoring. So please use it as a reference for what should be added. Awesome.
Starting point is 00:26:05 I think it will be useful. Right. Probably one day we will discuss query monitoring as well. Yeah. I have something to say about it too. Definitely. Part two coming. Brilliant.
Starting point is 00:26:18 Well, thanks everyone for joining us again. I really appreciate it. Let us know if there's anything else you want to make sure we covered or anything you think we missed. It'd be very helpful. And yeah, check out the show notes. I'll include links to all the tools we mentioned or various things as well, including this list.
Starting point is 00:26:33 Yeah, and thank you again for feedback. Feedback is a very important thing to have, Twitter and other places. And please subscribe, like, and share, share, share everywhere in your social networks and groups where you discuss engineering and databases and so on. Wonderful. Well, thank you, Nikolai.
Starting point is 00:26:55 Good to see you and speak to you next week. Thank you, Michael. Till next time. Bye-bye. Cheers. Bye now.
