Postgres FM - Monitoring checklist
Episode Date: August 19, 2022

Monitoring checklist (dashboard 1):
- TPS and (optional but also desired) QPS
- Latency (query duration) — at least average. Better: histogram, percentiles
- Connections (sessions) — stacked graph of session counts by state (first of all: active and idle-in-transaction; also interesting: idle, others) and how far the sum is from max_connections (+ pool size for PgBouncer)
- Longest transactions (max transaction age or top-n transactions by age), excluding autovacuum activity
- Commits vs rollbacks — how many transactions are rolled back
- Transactions left till transaction ID wraparound
- Replication lags / bytes in replication slot / unused replication slots
- Count of WALs waiting to be archived (archiving lag)
- WAL generation rates
- Locks and deadlocks
- Basic query analysis graph (top-n by total_time or by mean_time?)
- Basic wait event analysis (a.k.a. "active session analysis" or "performance insights")

And links to a few things we mentioned:
- Postgres monitoring review checklist (community document)
- pgstats.dev
- Improving Postgres Connection Scalability: Snapshots (blog post by Andres Freund)
- Transaction ID Wraparound in Postgres (blog post by David Cramer)
- Subtransactions Considered Harmful (blog post by Nikolay)
- datadoghq.com
- pgwatch2 (Postgres.ai Edition)

------------------------

What did you like or not like? What should we discuss next time? Let us know by tweeting us at @samokhvalov and @michristofides.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:
- Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Hi Michael. Today we are going to talk about monitoring, and I think we'll touch very, very wide areas of Postgres, and probably we will revisit each one of them later, going deeper into specific sub-areas. But every project should have it.
Well, do you mean... let's check that first.
So naturally, for large enterprises, we definitely want monitoring, we definitely want database-specific things being monitored, sometimes with a couple of tools. Sometimes I saw big projects using three monitoring tools, and all three tools were monitoring Postgres. None of them was good enough to be used alone, right? So we needed to combine parts from three places. It happens, I think. By the way, it's not a problem if you have a couple of tools in use, if you know that the overhead is low, right?
Yeah, absolutely. So we are talking about Postgres monitoring. This is a huge topic, and my main statement is that there is no good product developed yet. No perfect product, maybe, or no good product even. Okay, "good" is maybe okay. I don't see great products, because I understand what should be present in any Postgres monitoring.
And it's not just my point of view. A couple of years ago we started some community work. We started to discuss various tools. We had several sessions in the Russian community reviewing various tools, both open source and commercial. And then we created a document called the Postgres Monitoring Checklist, collecting ideas about what should be present in monitoring, what should be a must-have or nice-to-have, and so on. Then some people outside of the Russian community joined as well. And eventually, I think, we got a very good understanding of what should be present, like 100%, if you want your monitoring to be very useful. And I don't see any product, commercial or open source, or maybe even in-house (sometimes I see very good in-house products), but none of them meet these criteria even 80%. Usually it's below 50%.
So yeah, we're talking about things that you consider almost a must-have for most projects where people care about uptime, making sure they have each of those covered. And is it monitoring, or monitoring and alerts? Do you count those as the same thing?
Well, alerts are a slightly different topic. I'm talking about troubleshooting sessions, when we open monitoring and try to understand what's happening, what's wrong, because we have some complaints from users or from our co-workers, something wrong is happening with Postgres, and we need to understand what exactly is wrong.
And I see two big areas which must be present in any monitoring. The first area is, of course, physical metrics like CPU, memory, disk I/O, network. They should be covered, and almost always they are covered. And that's kind of separate. But for Postgres specifics, here we can recall Alexey Lesovsky's pgstats.dev, this interactive picture which shows you the architecture of Postgres and which system views, extensions, or tools are responsible for monitoring a particular part of the Postgres architecture. It's a great tool. We will provide a link in the description, I'm sure. So this should be present in monitoring, and this first part I call dashboard number one. The goal of dashboard number one is: you have like 30 or at most 60 seconds, and you need to perform a very wide and very shallow overview of all Postgres components, all Postgres areas, and understand which area doesn't look well. Then we go deeper, using other dashboards probably, or maybe looking at Postgres manually, and so on.
And the second component is query monitoring. Query monitoring is not a king, it's the queen of observability, right? It should always be present. It's a whole different topic, but it should be present in any monitoring. And I see some systems improving. For example, Datadog is currently improving very well in this area in particular, and by the way, they do it for all databases, not only for Postgres. I just see how much better they are compared to two years ago. But they have big mistakes as well, or not mistakes but disadvantages, like missing parts which don't allow me to work with it. But probably we should postpone the discussion about query analysis; it's a different talk. Let's focus on this first wide and shallow dashboard. I want everything quickly, a bit of each component.
So the use case is: my pager has just gone off, or I've got a notification, or users mentioned something that doesn't sound good. And I want to look at something and see: where is the problem? Where do I even start?
Right.
So some people say there are errors related to Postgres, or some performance degradation. And we want to have ten, or like a dozen, 12 metrics present on a single dashboard, so we quickly understand: are we okay compared to one hour ago, one day ago, or one week ago? Or are we not okay in particular areas, so let's dive deeper. So let's probably quickly go over these metrics, because I have a list, and we can also attach this list to our episode, right?
Yeah.
So what's at the top of your list?
At the top of the list are two major things which represent the behavior of any multi-user system: throughput and latency. Not only multi-user, actually, but any system where a network is also involved. So, throughput and latency. Throughput, if we talk about databases, Postgres in particular, can be measured in several ways, but two ways are most interesting to us: TPS and QPS, transactions per second and queries per second. And unfortunately, Postgres doesn't allow you to measure the second one really well. TPS is present in pg_stat_database (or pg_database, I always mix them up): xact_commit and xact_rollback. We observe these two numbers and how fast their sum is increasing. This is our TPS.
Or decreasing, right? If there are some problems where it could...
It cannot decrease. It's cumulative. It's just a counter of committed transactions and rolled-back transactions.
Sorry, I thought you were talking about TPS.
Right. What can decrease is what we get if we divide it by the duration: we need two snapshots, as always, right? And we need to understand how many seconds passed between them. So we divide, and of course this speed can go up and down; TPS can go up and down. But QPS, unfortunately, Postgres system views don't provide. The usual trick is to use pg_stat_statements, but it's not the exact number, unfortunately.
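As a rough sketch of the two-snapshot approach described here (the exact query shape is just an illustration), a TPS graph could be backed by periodically polling the cumulative counters in pg_stat_database and dividing the deltas by the elapsed time:

```sql
-- Run this query periodically, store the result with its timestamp, and
-- compute TPS as the delta of (commits + rollbacks) divided by elapsed seconds.
SELECT now()              AS captured_at,
       sum(xact_commit)   AS commits,
       sum(xact_rollback) AS rollbacks
FROM pg_stat_database;

-- TPS over the interval = (commits_2 + rollbacks_2 - commits_1 - rollbacks_1)
--                         / extract(epoch from captured_at_2 - captured_at_1)
```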
Because pg_stat_statements... as I remember, the pg_stat_statements.max parameter by default is 5,000, if I'm not mistaken. It can be tuned to 10,000 or more. It means that we don't register all queries, so we probably see only the tip of the iceberg. And QPS can be wrong if we use this approach, if we use the calls metric from pg_stat_statements. When I have doubts, in most cases we have PgBouncer, so I check the PgBouncer logs, and it reports QPS, a fair number for QPS. But this is maybe already
too deep. Yeah, let's focus on the things we can measure in Postgres.
Right. So this is throughput. Latency is query duration. Again, usually it's measured from pg_stat_statements timing. And why is it important? Because we need to understand, for example, if QPS and TPS dropped significantly, maybe it's just a problem with the application: errors on the application side caused our load to drop. Or maybe it's a problem on the Postgres side, like all CPUs at 100%, we have CPU or disk saturation, and that's why we process fewer TPS. And latency, of course, is very important, and in many cases I see that we don't have it in monitoring. These two metrics are like the general health of our database and should always be there. We have a problem? Let's check TPS, QPS, and latency, the average duration of queries.
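A similar sketch for QPS and average latency, assuming pg_stat_statements is installed and using the Postgres 13+ column name total_exec_time (older versions call it total_time); as noted above, this undercounts whatever has been evicted from pg_stat_statements:

```sql
-- Run periodically and compute deltas between snapshots:
--   QPS         ~ delta(calls) / elapsed seconds
--   avg latency ~ delta(total_exec_ms) / delta(calls)
SELECT now()                AS captured_at,
       sum(calls)           AS calls,
       sum(total_exec_time) AS total_exec_ms
FROM pg_stat_statements;
```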
So the next thing is very, very important, and I see that more than 50% of systems are not good with this: connection monitoring. My ideal is just one single stacked graph of connections: active, idle-in-transaction, idle, and a fourth state I don't remember. But these three already tell us everything, right? Active means we are processing something. If active goes above the number of CPUs, it's already a bad sign. So if you have 12 cores and we see 20 active, well, probably there's already too much work to do for this server, and probably we already have a huge degradation in performance. It's not always so, but in many cases it is.
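A minimal sketch of a query that could back such a stacked graph, counting sessions by state in pg_stat_activity (the fourth state alluded to here is "idle in transaction (aborted)"):

```sql
-- Session counts by state; background processes without a state fall into 'other'.
SELECT coalesce(state, 'other') AS state,
       count(*)                 AS sessions
FROM pg_stat_activity
GROUP BY 1
ORDER BY 2 DESC;

-- And how far the total is from the limit:
SELECT count(*)                                AS total_backends,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;
```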
Idle and idle-in-transaction connections are also very important, especially before Postgres 13 or 14, where optimizations... I think 14.
14, right, right. So I sometimes saw people say: okay, we have a couple of thousand more idle connections, don't worry, they are idle. But the overhead is big. And I think Andres Freund had a very good couple of posts explaining the overhead; we should provide links. I think he did a lot of the work in 14. He was optimizing... as I remember, it was related not to memory consumption but to how work with snapshots is organized. Usually people think a couple of thousand connections give us a lot of memory overhead. No, it's more about how work with snapshots is organized.
But what I like about stacked graphs is that we can quickly understand the total as well, right? And sometimes people split it into multiple charts, multiple graphs; that's also fine. But I like to see a connection overview in one place.
And of course, idle-in-transaction sessions, or connections (connections, sessions, it's basically the same), are sometimes very dangerous, because if the application started a transaction and went off to talk to some external API, for example, or to do some other work, it can be very dangerous. So we should also have some threshold and not allow the system to go above it.
So yeah, at this point I guess we're talking about long transactions?
No.
Well, sometimes these things correlate, but not always. Because we can have, for example, at each moment, 200 idle-in-transaction sessions, right? But the duration of each transaction is less than one second. It might happen: very brief transactions, but they do something and keep connections idle. It's also possible. But the next item is long transactions, actually. And this is actually a big topic as well, because long transactions usually have two dangers. One danger is to lock something and block others; that's one point. Another point is to disturb autovacuum's work. But for the latter, we should talk not about long transactions but about the xmin horizon; it should be different. And I never saw it, like no monitoring distinguishes these two things yet. Of course, I didn't see all existing systems, but over the last couple of years I've been observing various systems, like dozens of them.
Should we talk about that straight away, then? So the xmin horizon being held back... to avoid transaction ID wraparound, which is, you know, there have been some famous blog posts that I can include on that.
Right. Transaction ID wraparound is one danger of the xmin horizon being stuck in the very far past, but vacuum accumulating a lot of dead tuples, which then convert to bloat, is also a problem. See the previous episode.
Right, exactly, we discussed it already. But again, a simple graph: what is the maximum age of the longest transaction right now, excluding vacuum? Because actually a regular vacuum, by the way, it's tricky: a regular vacuum doesn't hold the xmin horizon and doesn't block anyone, except that with ANALYZE it can hold the xmin horizon, and if the vacuum is working to prevent transaction ID wraparound, it can block others. So in each area there are interesting ifs, right? If this, then that. So it's quite difficult to build ideal monitoring, of course. But anyway, I want to have a graph: do we understand the maximum age of a transaction right now? Usually people say: I don't know.
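A sketch of the two related graphs, assuming Postgres 10+ (backend_type and backend_xmin in pg_stat_activity); the distinction made here is that the first shows the longest currently open transaction, while the second shows how far back the xmin horizon is held:

```sql
-- Longest currently open transaction, excluding autovacuum workers.
SELECT max(now() - xact_start) AS max_xact_age
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND backend_type <> 'autovacuum worker';

-- How far back the xmin horizon is held, in transaction IDs
-- (this is what limits vacuum's ability to clean up dead tuples).
SELECT max(age(backend_xmin)) AS xmin_horizon_age
FROM pg_stat_activity;
```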
What kind of number are you worried about there?
Right, right. In OLTP systems... and by the way, we forgot to mention that we're again discussing OLTP systems and monitoring for them, right? Because for analytical systems, things are different. For OLTP, I would like to have everything below, ideally under, one second. Again, it's related to human perception: we can distinguish roughly 200 milliseconds, so one second feels slow. But of course, sometimes we need slightly longer transactions, a few seconds. But a one-minute transaction, if it can block someone else, is already a problem. We should split it into batches.
Yeah.
But we discussed it briefly last time: if you create an index, unfortunately, you hold the xmin horizon. And of course, if your table is big, you sometimes need to spend an hour or maybe even more to build an index. That's why partitioning is needed, again, to keep this time sane.
So we're about halfway, nearly halfway, through our dashboard.
Good.
So the next thing is the ratio of errors. We can measure it; we already discussed this: from pg_stat_database, xact_commit and xact_rollback. How many transactions are committed, how many transactions are rolled back? And we can have a ratio to understand the percentage of successful transactions. And it's interesting: sometimes I see a system produce only 50% or even fewer successful transactions, but developers say this is intentional, this is by design. And it triggers interesting questions: what negative consequences can there be if application users roll back too often? It's a separate topic.
But the main reason you want this ratio is to see if it changes from your baseline. So if normally you've got a very high ratio, you just want to tell if it's usually 99% and suddenly we have 70%. Obviously it means we have a spike of errors. So we need to go to the logs, probably, and understand which errors, or if you use the extension called logerrors, you can have counters right inside the database, so you can select them aggregated by error type.
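A sketch of the commit/rollback ratio per database; the counters are cumulative since the last stats reset, so in practice you would compute this over deltas between snapshots:

```sql
-- Share of rolled-back transactions per database.
SELECT datname,
       xact_commit,
       xact_rollback,
       round(100.0 * xact_rollback
             / nullif(xact_commit + xact_rollback, 0), 2) AS rollback_pct
FROM pg_stat_database
WHERE datname IS NOT NULL;
```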
So the next thing is transaction ID wraparound, a separate metric. Here people always... okay, not always. In 90% of cases, people forget about multixact ID wraparound. They just don't check it. It's more exotic. Transaction ID wraparound is already exotic, but when it happens, it's a big disaster. And multixact ID wraparound is even more rare, very unusual.
Is this the thing that you mentioned in the GitLab blog post about sub-transactions?
Briefly. Yes, briefly. There is a connection to it, yes, because if we use SELECT FOR SHARE, and sometimes SELECT FOR UPDATE, we might have quickly increasing multixact IDs and the risks of... well, if we use foreign keys, just by using foreign keys we have multixact IDs, that's it.
So are we talking about two graphs here then, or one? What do we actually need for our dashboard?
It can be one, but the idea is that we should not forget about this second ID; it can also wrap around. It has the same risk. So we should monitor both risks and alert when we're already above a threshold, and so on.
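A sketch of how both wraparound risks could be watched per database; age() and mxid_age() report how many IDs have been consumed since the oldest frozen one, and the alert thresholds are up to you (often a few hundred million):

```sql
SELECT datname,
       age(datfrozenxid)    AS xid_age,        -- distance toward transaction ID wraparound
       mxid_age(datminmxid) AS multixact_age   -- distance toward multixact ID wraparound
FROM pg_database
ORDER BY xid_age DESC;
```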
Okay, then the health of replication.
Again, for this dashboard we need only a brief analysis: do we have at least one standby which is lagging, physical or logical? We should define our thresholds and observe. That's it. If we see a lag spike, it's also a problem; you can get an out-of-disk-space event or something not good. Then only a few things are left. By the way, I realize that in this list (we will provide this list, as I've said) we don't have vacuum behavior, autovacuum behavior, and we don't have checkpointer and bgwriter behavior. We have them in our systems, on dashboard two. But I think we probably should move them to dashboard one as well, because these are two big components of Postgres, vacuuming and checkpointing, right?
Yeah.
So we should have them here.
Before you move on from this one, I see you also had unused replication slots, which I thought was really interesting. I've heard of some issues where, if you've got a static number of those and it decreases, if something ends up using one of those replication slots that you weren't expecting, that could be a sign of a problem as well, right?
The problem is that... right, so we don't have lag per se here, but if a slot is unused, it's accumulating WALs, so it's easy to run out of disk space soon. So unused slots are definitely dangerous. But everything here can be combined on one or a couple of graphs. As I've said, we want to keep this dashboard quite small, so we can quickly overview everything.
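A sketch of the replication-health side of this dashboard, assuming Postgres 10+ naming (pg_current_wal_lsn, replay_lsn); run it on the primary:

```sql
-- Physical replication lag per standby, in bytes.
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- Inactive slots silently retain WAL on the primary.
SELECT slot_name,
       slot_type,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
FROM pg_replication_slots
WHERE NOT active;
```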
And then, after replication, I'd like to talk about lags related to archiving. Very often people don't have it; they don't monitor that lag. Like, how many WALs are not yet archived? Why is this important? If you try to stop Postgres, restart or just stop, and it has a lot of unarchived WALs... you know, when we shut down Postgres there is a so-called implicit shutdown checkpoint, and in the same way Postgres tries to archive them all. So it runs archive_command for all pending WALs, and that can take hours. If we accumulate thousands of WALs, it happens. You wait.
Well, it's not just if you want to upgrade or something, it could also be if there's a crash: how long is it going to take you to recover?
Well, recovery is different. If it crashed, it will recover quite soon; it just replays, redoes what's needed, and that's it. I'm talking about an intentional restart or shutdown. For example, if Patroni wants to perform a switchover or failover for some reason, it tries to shut down the primary. Until some version not long ago, because I told Kukushkin about it and it was a surprise, because we had it in production on some system: Patroni failed to perform a failover because a lot of WALs were pending to be archived, and Kukushkin fixed it in recent versions of Patroni. It's quite a new thing, quite a new bug fix. Well, it's not a bug fix, it's an improvement: Patroni doesn't wait and tries to perform the failover. There is a trade-off, because what should be done with these WALs, right? I don't remember the details there, because if we don't archive WALs, it disrupts our DR strategy, our disaster recovery strategy, right? We need to keep backups in good shape. So this is an interesting question as well. But anyway, monitoring should have this number of pending WALs, or also maybe the rate of archiving, like how many WALs we archive per second, for example. It's a very interesting metric as well to understand.
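A sketch of the archiving-lag metrics; pg_stat_archiver is available everywhere, while pg_ls_archive_statusdir() (for counting pending .ready files) assumes Postgres 12+ and superuser or the pg_monitor role:

```sql
-- Archiver activity and failures.
SELECT archived_count,
       last_archived_wal,
       last_archived_time,
       failed_count,
       last_failed_wal
FROM pg_stat_archiver;

-- WAL segments still waiting to be archived.
SELECT count(*) AS wals_pending_archive
FROM pg_ls_archive_statusdir()
WHERE name LIKE '%.ready';
```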
You've got that next on the list.
Ah, okay.
Yeah, okay.
WAL generation rates. And yeah, as I recently learned, if you have a few WALs per second, probably you won't be able to use logical replication, because the WAL sender will saturate a single CPU core.
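A sketch of how WAL generation rate could be measured with the same two-snapshot trick, again assuming Postgres 10+ function names:

```sql
-- Capture the current WAL position periodically.
SELECT now() AS captured_at,
       pg_current_wal_lsn() AS wal_lsn;

-- Rate in bytes/second between two snapshots:
--   pg_wal_lsn_diff(wal_lsn_2, wal_lsn_1)
--   / extract(epoch from captured_at_2 - captured_at_1)
```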
And finally, locks and deadlocks. Well, maybe they should not go together, because they are different things, right? But locks, of course... sometimes people draw both exclusive and share locks; there are many kinds of locks. Here we're talking about heavyweight locks, not lightweight locks and latches. Heavyweight locks: locking rows and locking tables. And here sometimes people try to draw everything, but we should maybe focus only on exclusive locks, which can be a reason for blocking others. Well, there are many problems there; this chart alone can be difficult to implement properly. But anyway, we want to have something. I just saw sometimes application engineers say: oh, we had a spike of locks. You look at it: you had a spike of share locks. It's okay, because you had some deployment; it's normal, it's not a problem. So what should be on this chart is also an interesting question. And deadlocks, of course: we want to understand how many deadlocks we have.
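A sketch of the two lock-related graphs: sessions currently waiting on heavyweight locks, with who blocks them (pg_blocking_pids, Postgres 9.6+), and the cumulative deadlock counter from pg_stat_database:

```sql
-- Who is waiting on a heavyweight lock, and who blocks them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       now() - xact_start    AS xact_age,
       left(query, 60)       AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

-- Deadlocks per database (cumulative; graph the delta over time).
SELECT datname, deadlocks
FROM pg_stat_database
WHERE datname IS NOT NULL;
```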
So many other things are missing here. But I'm not a big fan of combining physical and logical metrics. For example, in Datadog, if you check database monitoring, you will see CPU combined with TPS. It's a mix of everything. It's not right. Host stats I would like to see separately from database stats. What do you think, is that interesting?
Well, I think in the early days I've seen a lot of teams get really far looking at application stats and then adding some database things into it. They don't have some of these real scaling issues yet, I guess. I think a lot of people implement monitoring after a big issue, after a big outage. They think: oh, maybe we do need it now. So probably people put this in, in general, a little bit too late. But yeah, I can see an argument for keeping it where the application monitoring is, especially from the query performance point of view, at least from the performance monitoring perspective, maybe not so much from the "we've had an outage" troubleshooting perspective.
Right.
Well, again, we didn't discuss query analysis here. It's another half an hour at least; it's a very interesting, very deep topic. But I also forgot to mention that we took pgwatch2 from Cybertec. And I remember when we first reviewed it, it felt strange, like strange dashboards to us: why not add this, why not add that? But the fact that it's open source, with good components like Grafana and so on, and that it can use Postgres and Timescale... So Vitaly from my team initially implemented the set of dashboards following this philosophy. Dashboard number one is for quick and very shallow troubleshooting, understanding which areas need further analysis. Dashboard number two has more interesting things for DBAs. Dashboard number three is query analysis. And it's available, it's open source, and recently another contributor helped us support the Timescale version, so Postgres and Timescale can be used as storage. And I encourage everyone to consider it as a secondary monitoring tool.
Probably you already have something like Datadog or anything else, some tool the organization has already adopted for everything, not for databases, not for Postgres specifically. And of course it's worth having it everywhere, because it's a unified solution. But it's missing a lot of things, and I hope I convinced our listeners today that there are many things that should be present there. So pgwatch2 can be added as secondary monitoring, very gentle, without a big observer effect, and it can provide a lot of insights for database experts. And actually not only experts: this TPS and QPS plus latency, this throughput plus latency, alone is already something very good, which many systems are missing.
Yeah, absolutely. So I guess if you don't have some of these things, maybe look into how much work it would be to add them. And hopefully that would save you, or make it a lot easier, next time you have a big issue, to spot what's going wrong.
Right. And if you develop your tool in-house, or you're a developer of a monitoring product, this list is a product of community work, like several months of work. I discussed it on Twitter, during our sessions in the Russian community, and so on. I think this list is a very, very useful checklist for everyone who wants to develop great Postgres monitoring. So please use it as a reference, and suggest what should be added.
Awesome. I think it will be useful.
Right.
Probably one day we will discuss query monitoring as well.
Yeah.
I have something to say about it too.
Definitely.
Part two coming.
Brilliant.
Well, thanks everyone for joining us again.
I really appreciate it.
Let us know if there's anything else you want to make sure we covered
or anything you think we missed.
It'd be very helpful.
And yeah, check out the show notes.
I'll include links to all the tools and other things we mentioned, including this list.
Yeah, and thank you again for the feedback.
Feedback is a very important thing to have, on Twitter and in other places.
And please subscribe, like, and share, share, share everywhere:
in your social networks and in the groups
where you discuss engineering and databases and so on.
Wonderful.
Well, thank you, Nikolay.
Good to see you and speak to you next week.
Thank you, Michael.
Till next time.
Bye-bye.
Cheers. Bye now.