Postgres FM - Postgres Emergency Room
Episode Date: August 30, 2024

Nikolay and Michael discuss PostgreSQL emergencies — both the psychological side of incident management, and some technical aspects too.

Here are some links to things they mentioned:

Site Reliability Engineering resources from Google https://sre.google
GitLab Handbook SRE https://handbook.gitlab.com/job-families/engineering/infrastructure/site-reliability-engineer
Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix https://netflixtechblog.com/keeping-customers-streaming-the-centralized-site-reliability-practice-at-netflix-205cc37aa9fb
Our monitoring checklist episode https://postgres.fm/episodes/monitoring-checklist
Hannu Krosing talk on Postgres TV — Do you vacuum everyday? https://www.youtube.com/watch?v=JcRi8Z7rkPg
Our episode on corruption https://postgres.fm/episodes/corruption
Nikolay’s episode on stopping and starting Postgres faster https://postgres.fm/episodes/stop-and-start-postgres-faster
Our episode on out of disk https://postgres.fm/episodes/out-of-disk
The USE method (Brendan Gregg) https://www.brendangregg.com/usemethod.html
Thundering herd problem https://en.wikipedia.org/wiki/Thundering_herd_problem
pgwatch2 Postgres AI edition https://gitlab.com/postgres-ai/pgwatch2

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello, hello, this is Postgres FM. My name is Nikolay from Postgres.ai, and as usual, my co-host is Michael from pgMustard. Hi, Michael.
Hello, Nikolai.
And the topic I chose is related to acute, unpredictable problems which sometimes happen to production systems, including Postgres. In many cases, the database is at the center of the storm.
And let's discuss how we manage this and how to manage it better.
So yeah, how to handle crisis situations with production Postgres. I called it
Postgres Urgent Care or Emergency Room. I don't know what's a better name here.
But yeah, I guess this is something I can share from my past experience.
Yeah.
So let's discuss this.
Sounds good.
And good point about the database often being in the center of things.
I think when you see large companies, and I guess it is often the large companies that we notice on Twitter and things, people start tweeting that it's down. I think recently there was a big GitHub incident. And I think one of the first communications was about it being database related. Database changes related: not only the database, but also database changes, and when you have a change involving the database, it means increased risk of an incident in most cases, actually.
But it was not about Postgres, so let's exclude this case.
No.
It's MySQL.
But I truly believe that the database, as I usually say, is the heart of any tech system
because it's very, very stateful, right?
Because it's data.
And since it's stateful,
it's really hard to scale it
and handle performance and so on.
So indeed, since the database is the heart of our systems, production systems, it's often at the center of the storm.
Where would you like to start with this? I have a plan. Let's have a two-step discussion. First, we'll discuss the psychological
part of incident management related to databases, and second, purely technical. How about this?
Yeah, I like it. The psychological aspect probably isn't talked about as much as it could be. I think people often, in postmortems, focus on the technical things that could be done differently to avoid the issue reoccurring. I mean, that obviously is the most important thing. But I think a lot can also be learned from how companies communicated, or how much they communicated. Often I see almost no communication from companies that are having big outages,
very poor levels of updates.
And it wouldn't completely alleviate the situation, of course.
If you're down, you're still down.
But I'd be surprised if companies that communicate better
don't have much better outcomes than the ones that don't.
You know, people looking to cancel services afterwards, that kind of thing.
Yeah. So indeed, it may not be well discussed, but I think anyone who is dealing with production at larger scale definitely knows that, first of all, many people don't manage this kind of stress well, and that's fine.
Maybe it's not your kind of thing, right?
Some people manage stress very well.
It's not me, by the way.
I manage stress moderately well.
I've learned how to do it and so on, but in the middle of a production incident, I still find myself with high emotions.
We should do something very quickly, right?
And it's hard.
And also tunnel vision, you know, like you see only something and you don't have time for anything else.
It's very hard to relax and capture the whole picture and so on.
And that's why tooling is also important here, right?
Tooling should be designed for people who are under stress
and some runbooks and so on.
But this is technical stuff.
We will talk about it slightly later.
So I think there are some trainings and so on.
I'm not a good source of advice here, but I know we should look at people who deal with production systems, SREs and so on, and there are many books written about it, handbooks, runbooks and so on. So yeah, there are some good practices for how to deal with such stress
and so on.
What are your favorites of those?
I like to use materials from Google, of course. There's sre.google, that's the hostname.
Nice. There are three books there
and quite good content.
I also like Handbook from GitLab
for production SRE team.
I seem to remember Netflix
having some good stuff as well.
Yeah, yeah.
We are in this area, and it's not only about databases, of course, right? It's the SRE area, basically.
And there are many good materials for this.
But specifically for databases, I think in general, if you feel emotional, maybe you
should tell this to colleagues.
So maybe someone else should be acting more.
You need to understand yourself, basically, right?
If you feel... I sometimes feel it myself. I remember we had a Black Friday, very boring because we were very well prepared. It was a large company, e-commerce, and we had very good preparation, a war room organized, and we were prepared for incidents.
And the whole Black Friday was so boring.
So when we finally found some incident at 9 p.m.,
I was happy.
Finally, we have some material.
And recently, I was helping some customers and I remember exactly this state: finally, something interesting, you know.
This is a great state to be in.
Instead of being stressed, with everything on your shoulders, not knowing what to do, or maybe knowing but fearing it won't work. You have some good materials to check, some monitoring systems to check, but you're so stressed, right? Because you have fears of failure: what if you won't be able to bring the database back up for many hours, and then everything is a disaster? But if you're in the state of "this is interesting, finally we have some work to do in production", you have a high level of curiosity. Maybe it's a new case, fine. This comes with experience: you saw many cases, but you are looking for new kinds of cases, because it's already too boring to deal with the same kind of corruption again, or, I don't know, some database down again. You saw it many times, and you are hunting for new types of cases, and curiosity helps.
So these are two very opposite states, I would say.
And I was in both in my life.
So, yeah.
For you, does it depend a bit on the severity though?
Because for me, even if it was 9pm and I'd been hoping for some interesting case to come up,
if it was super serious and
the whole, like, everything was down, I wouldn't be happy that we finally got it.
Well, honestly, I didn't have cases when, like, for example, life of people depends
on this.
Fair.
I can assume this might happen with some systems, but I was in cases when cost of downtime was super high.
And now I'm not scared already, you know.
I already had it, right?
So I know how it feels and so on.
I'm not super scared if it's only about money.
But life-threatening downtime, honestly, I didn't have it.
And I think if it happened, I would be very concerned, right? Yeah, maybe it's this realization that all of this is only about money, and the worst thing that can happen is somebody will lose money and you will lose your job. It's not the worst case, actually, right? Just relax, right? But life-threatening, that's another story. I'm very curious if someone who is listening to us has some system where the state of Postgres can influence the health or lives of people.
This is interesting.
Yeah, you do hear about healthcare use cases, but also military use cases.
Right, right.
Yeah, it might happen.
It might happen.
I didn't have it.
So that's why I'm not super scared.
It's like, okay, it's not a big deal.
I know it's a big deal.
We will be professionally helping, right?
But let's just do what we can and that's it.
And that's why I'm curious.
Is it new case finally?
Okay, let's work on it.
But again, back to my point, you need to understand yourself.
This is very important.
If you know you don't react well even to small problems,
it's better maybe to be an assistant.
Yeah, I was doing some thinking before the episode on what kinds of emergencies there could be.
And a couple that, I don't know if you were 100% thinking of, that would be really scary for me, would be security-style incidents, either external...
Yeah, exactly.
That would be scary in a different way, potentially, or exciting in a different way.
Exciting in a different way... I think this is CEO level already.
So definitely if something like that happens, it's not only...
There is a technical aspect here, but it's also a very high level organizational aspect of it,
how to handle this situation properly.
I was still talking about psychologically, though.
You know, psychologically, but decisions about how to handle it, that's already CEO level.
Recently, we received from our insurance, I think, a regular routine notice that our data was stolen again. And it just happens. Maybe, we don't know, your record was also stolen, maybe not. And a couple of days later, I found a very good project on GitHub. Some guy created a database of all SSNs of all Americans and just published it on GitHub. The fact is that it's only 1 billion, based on how many digits this number has, so he just published all numbers up to 1 billion.
Oh, okay.
But some people on Twitter started saying, oh, I found my SSN as well. It was, you know, a snowballing joke.
Some people started, okay, I'm going to remove my SSN,
create a pull request.
It's funny.
Yeah.
So back to this, like, if you know yourself, it's good.
It helps you understand your stress level.
On another note, it's funny that we aim to monitor databases and production systems well, sometimes with second-level precision, but we don't monitor ourselves, like cortisol level, right? It would be great to understand, but we don't have it. This bothers me a lot, monitoring of human bodies. I don't understand my own state except by how I feel. So it would be good. Some people see heart rate, for example, right?
Yeah, rings and watches that monitor heart rate, which probably correlates quite well with stress level.
Yeah. But let's maybe slowly shift to technical stuff. Of course, knowing yourself helps. I wanted to share one story. Very long ago, 15 years ago or so, I had a great team, a great startup. I was CTO, I think, maybe no, I was CEO, but yeah, combining the two roles.
And I had a Postgres production system and great Postgres experts on my team. And I remember one guy was a great Postgres expert, and I made a mistake. It was my mistake. I was leaving for a trip for a few days and I said, you will be responsible for production, especially the Postgres state, because he was the best Postgres expert on my team, right? But it was an obvious mistake, because an incident happened, and he couldn't handle it properly, and he completely, like, he lost it, sorry for my French, right?
Right, you mean technically he would have been best placed to handle it in the team?
Yeah.
Technical expert is not necessarily good in terms of incident management, right?
Yeah.
And this is my mistake.
I didn't recognize it.
And this led to the end of our cooperation, unfortunately.
So, sometimes good technical experts should be assistants.
Right?
Yes.
Not pressure on shoulders, right?
This is super important to understand.
So my advice is: try to understand yourself, and whether you should be responsible for incident management or just assisting technically, right?
Yeah, know yourself, but also know your team
and know who in your team could,
like who you can call on for different things.
Yeah.
Now, let's move to technical stuff.
What is helpful?
Very helpful, first of all: many small companies. Our main focus right now is companies that are growing, startups, usually lacking database expertise. And many such companies come to us for help.
And almost none of them have good incident management in place.
It's not only about Postgres, right?
And we always suggest thinking about at least a simple process. Because they say, we had an incident last week. My question is, show us the incident notes. Is it logged anywhere, with timestamps? And in most cases they don't have anything but words.
They have words.
Okay, we saw database was slow, then it was unresponsive, blah, blah, blah.
What you must have for an incident is a sequence: we must have a document with artifacts, like the first known event that happened, some logs, screenshots from monitoring, better with links so we can revisit it.
But screenshots matter a lot because sometimes monitoring has a small retention window.
And the investigation might be long, especially if you involve external consultants like us, right?
So there should be some template and plan for documenting incidents.
And when you have it, it also helps with stress because you know what to do.
You need to identify the first abnormal event and document it, things before it, things after it, in some form.
And everything you notice also documented, important things highlighted.
It can be a Google Doc or something with discussion around it. It's good when it's possible to discuss things, so people can ask questions, clarify, add some additional knowledge, and so on. It can be anything, actually, but it's important to be prepared to document it.
Yeah, I've seen a lot of people start with whatever app you use for chat normally in the team, or some people have a different app specifically for incidents. But if you're using Slack, for example, start a new channel for the incident; all incident-related stuff goes in there: screenshots, logs.
Yeah, that's cool.
And then people turn it into a doc later sometimes, but I could see an argument for starting with a doc.
Yeah.
But normally people are panicking at the beginning.
So chat makes sense.
Yeah.
Chat is more convenient for many people.
It's what you use every day.
So chat is good.
But it's important to have long-term storage for this document.
Convert it to a document. And I can say that of the startups which grew to a terabyte or a couple of terabytes in terms of database size, most of them don't have a proper incident management workflow developed.
They must have it.
It's time already.
So yeah, I definitely encourage,
even if you have a couple of technical teams
and technical experts in your team,
still it's super important to have an incident management workflow developed.
So yeah, detailed, step-by-step, so we understand what's happening.
And you agree on format of this document in advance.
You can use some other companies as examples. Again, sre.google and the GitLab handbook are useful for this particular area. GitLab, in particular, has an example for incident management. Many other companies also share their templates and descriptions of how to document it properly. Super important.
And also, of course, sometimes you feel, okay, I'm documenting, but who will actually be solving the problem, right? So it's good if you have a few folks who can help each other: one of them is responsible for documenting, another is trying to find a quick solution. And the document is also important to have because then, in bigger companies, we have a procedure called root cause analysis, RCA, right? To learn from mistakes, fix them, and prevent them in the future. So that's also why it's important to document. But this helps, and this is, I think, the fundamental number one technical thing you need to do. Oh, it's an organizational
thing, sorry. But it includes some technical aspects. For example, which monitoring we use
when an incident happens, right? Where do we start? This dashboard or that dashboard, right?
What technical things we must document there? For example, of course, we care about CPU level
and disk I.O. basics, right? Hosts. If database seems to be slow or unresponsive,
we must document these things. We had discussions about monitoring dashboard number one,
proposal like these things. Dashboard number one in our PgWatch 2 Postgres edition is designed
for shallow but very wide analysis, very quick, up to one minute analysis of various components
of Postgres and the various properties at very high level, like 30,000 feet level of workload,
to understand which directions to investigate further, right?
Yeah, where's the issue?
Yeah, this is very good to prepare in advance. Like, I know if something happens,
how I will act, where I will start, right? Yeah, so this is important. And you will recommend,
you'll know how to start. This is about monitoring and observability and logs and so on.
Next, there are several particular cases I can quickly share
which are important to be prepared for.
For example, of course, if you already know that database has,
for example, a transaction ID wrap around.
Like you can see straight away that there's an error in the log or something.
Yeah, yeah.
So we have cases very well documented from Sentry, Mailchimp, and somebody else, three of them. And also we have very, very good work from Google, GCP, from Hannu Krosing. He was on Postgres TV presenting his talk about how to handle transaction ID wraparound without single-user mode. He thinks single-user mode is not the right way to do it, but it is the traditional approach, single-user mode, and it takes a very long time for processing the database, for recovering its state.
So this is like you can document if it happens someday.
But I haven't seen it for so long because monitoring has it,
alerts have it, and so on.
And also recent versions have improvements in this area.
I remember, I think, Peter Geoghegan did some good work around this,
probably others too.
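As an illustration of the kind of monitoring and alerting mentioned here, a minimal sketch of a wraparound-risk check; the thresholds you alert on are up to you, and roughly 2 billion is used as the hard ceiling:

    -- how close each database is to transaction ID wraparound
    SELECT datname,
           age(datfrozenxid) AS xid_age,
           round(100.0 * age(datfrozenxid) / 2000000000, 2) AS pct_towards_wraparound
    FROM pg_database
    ORDER BY xid_age DESC;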
Yeah, I just started
from a very scary thing.
This is the scariest, and also scariest is loss of backups, when you cannot perform disaster recovery, right? Also, very low risk these days.
Yeah, I guess these days the major issues are sometimes things like the whole of us-east-1 is down. This hasn't really happened for a while, but a cloud regional outage, I feel like that could still take down a company's day. If you're using a managed service, or the cloud at all, you're at risk of that.
Obviously, you can have plans in place to mitigate.
Even if it's self-managed, not many people have multi-region setup.
Exactly.
It's very hard, actually.
So if you don't have off-site backups,
kind of you're sat there thinking, we just have to wait.
Yeah, it's a complex thing to have multi-region set up properly and well tested in production, with failover tested very well.
Yeah, it's a big topic, actually.
So backups and transaction ID wraparound are the two nightmares of any Postgres DBA, right?
Are they the scariest to you?
I think corruption is pretty scary.
Well, it's a good, interesting topic.
Corruption, we had an episode about corruption as well, right?
But this is good to put to preparation of incidents.
If corruption happens, what we will do?
Some steps.
And the first step is, according to wiki.postgresql.org, copy the database, right?
Because you will try to fix,
maybe you will break it more, right?
So copy. This is the first step to do.
And knowing this helps
because this kind of thing
you can know in advance.
By the way, transaction ID wraparound you can practice as well. There is a recipe I wrote on how to simulate it, right? So you can have it in a lower environment and then, good luck dealing with it. Or you can clone your database and simulate it there. Corruption is a very broad topic, many types of corruption, but some kinds can also be simulated. There are tools for it. So it's good to know, right? But in the cases I saw, in most cases there was some path to escape.
In some cases, escape was we just restore from backups,
losing some data, and for that project it was acceptable.
In some cases it was, okay, we just noticed
that only pg_statistic is corrupted, so running ANALYZE fixes it.
But long term, we see the database is on NFS.
And this is no, no, no.
Don't use NFS for PG data.
It's quite...
In most cases where I saw corruption, it was something silly, actually. But corruption also happens due to bugs, due to various stuff. Or a mistake when planning some major change, like switching to a new operating system, glibc.
Fortunately, over the last few years, such corruption happened in non-production, so we fully prevented it.
Well, the reason I find it scary is more that we could have been returning bad results.
Oh, silent corruption.
Yeah. That's the fear to me. It's more, how far back does this go? Anyway, it's a different kind of emergency. Yeah, we had corruption due to an index and a glibc change in production
with one company last year, and it was our oversight.
But fortunately, it happened only on standby nodes,
which were not used.
So it was a pure matter of luck where this production corruption happened.
And no failover.
Yeah, yeah.
Other clusters used standby nodes.
These clusters didn't use it,
and we just saw some errors.
It was during upgrade with logical replication.
We saw errors in logs and quickly reacted
and then realized,
these standby nodes are not used. Let's pray that failover won't happen soon.
Of course, it's like just imagine. So we quickly mitigated this completely. Nobody noticed.
But if it happens, yeah, the question is, what's the propagation here? But there's also tooling, and learning from other people's mistakes helps, of course, as usual. And knowing tools like amcheck, it should be a very routine tool, used often. amcheck checks B-tree indexes. Hopefully, it will finally support other types of indexes soon.
I think it's still a work in progress,
right? Yeah, I saw some
work going on.
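For reference, a minimal sketch of routine B-tree checking with amcheck; the index name is hypothetical. bt_index_check only takes light locks, while bt_index_parent_check is more thorough but blocks writes:

    CREATE EXTENSION IF NOT EXISTS amcheck;

    -- lightweight check of one B-tree index (index name is hypothetical)
    SELECT bt_index_check('orders_pkey');

    -- deeper check of parent/child relationships; takes a ShareLock on the table
    SELECT bt_index_parent_check('orders_pkey');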
What else?
For example, if the database is shutting down too slowly, it takes a lot of time, or starting up takes a lot of time. More than once I saw people being nervous, not understanding that it's normal, not understanding how to check the progress or what to expect. And it happens when you perform checkpoint tuning, we also had an episode about it, and increase checkpoint_timeout and max_wal_size, which you should do on loaded systems.
In this case, a restart, or just stopping or starting the database, might take many, many minutes. And if it's self-managed, I saw people send kill -9, SIGKILL, to Postgres, because they are nervous, not understanding: oh, Postgres is not starting, what to do? And I think now, in fresh versions, there are some log messages telling us that we are in recovery mode and showing some progress, right?
I thought about it. It's very recent. I can't remember if it did go in or something.
It should be so. It should be very straightforward.
DBA should see the progress
and have understanding when it will finish.
And for older version,
at least definitely older than 16,
it's unclear.
And usually, if it's self-managed, you just run ps to see what the process reports in its title, or use top, right? And you see an LSN there; then you use pg_controldata to understand the point of consistency. And then, if you have two LSNs, you can go to another Postgres and calculate the difference in bytes. So you understand how many bytes, megabytes, gigabytes are left, and then you can monitor it every minute or every second, understand the progress, and have an ETA, expected time of arrival, right?
And this helps.
And I think it's a good idea to learn how to do it.
For older versions and newer versions, I have a how-to about it, actually: what to do if Postgres startup and stop take a long time, what to do about it.
And, yeah, it should be just learned, right?
And if you're prepared, it will reduce stress.
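A rough sketch of the calculation described above, run on any other working Postgres instance; the two LSNs are hypothetical, the first being the consistency point from pg_controldata and the second the current replay position from the process title shown by ps:

    -- bytes of WAL left to replay = target LSN minus current replay LSN
    SELECT pg_size_pretty(pg_wal_lsn_diff('5/A0000000', '4/C8000000'));

Sampling the current LSN every minute or so gives a replay rate, and from that an ETA.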
And, yeah, we had a lot of such cases working on DBLab. Sometimes a clone is not created. Why? It's because max_wal_size is huge and so on. It's just recovering, so you just need to wait a little bit more. But then we improved it. So yeah, this might happen.
This is a very common situation.
Long restart time.
Yeah, I'll definitely share that episode in the show notes as well
so people can find it if they weren't listening back then.
What else?
Somebody deleted data.
We have other episodes like out of disk.
There's other kinds of emergencies.
One we haven't covered, I don't think, in much detail
was the big, like out of integers.
Yeah.
Out of integers, it's a big disaster.
Yeah.
But I guess that's quite common.
In terms of other common issues people come to you with,
is that up there or what tends to be there?
Maybe I'm biased here, because I have a feeling it's a very well-known problem and people already mitigate it, or mitigating it doesn't require a lot of expertise. Our Postgres checkup tool has a report for it, saying how much of the capacity of an int4, regular integer, primary key is left for a particular table. For me, it's straightforward already, right? And I don't see a big deal. But
if it happens, of course, it's like partial, at least partial downtime, because you cannot insert
new rows in this table. And it can be scary.
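A minimal sketch of that kind of check, assuming an int4 primary key fed by a sequence named orders_id_seq (the name is hypothetical):

    -- how much of the int4 range this sequence has already used
    SELECT last_value,
           round(100.0 * last_value / 2147483647, 2) AS pct_of_int4_capacity
    FROM pg_sequences
    WHERE sequencename = 'orders_id_seq';  -- last_value is NULL until the sequence is first used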
That's true of so many of these issues though, right?
Like once you're monitoring for them,
once you know about them and you've got alerts far enough out,
they become not emergencies.
Right.
But I'd also like to mention a common problem: the database is slow, or the database is unresponsive, what to do? Very general: where to start, what do you think?
Well, I think that's the monitoring thing, isn't it?
Like that's to go to the monitoring, that number one dashboard you talked about.
Yeah.
I think that's the work out where is the problem.
It needs to be the first point, doesn't it?
Yeah, I agree.
And the first thing I would start understanding, I think we can talk about methodologies here, like starting from USE, right? And others, there are many of them. But the USE method, utilization, saturation, errors, from Brendan Gregg, like the basics from Netflix, right? This is a very, very trivial approach, I would say.
But yeah, here, the first question: if the database is slow or unresponsive, the first question is, do we really put more workload on it? A very simple question, but sometimes hard to answer.
Because often we find out that, you know, many more clients connected, some background jobs started bombarding the database with new queries, retrying connections.
Like cascading.
Is it the thundering herd?
I actually don't know that term.
So the question is whether more load is coming externally to the database. This can, of course, be a reason why it's slow. And if it's not well prepared to handle spikes of load, for example, you keep max_connections high, ignoring advice from Postgres experts to keep it sane.
Recently I saw, I'm sharing without names so I can share, right?
12,000 max connections.
This is for me, I think, a record.
A new client showed it and they explained.
And I see like it's a trend.
Recently, when I say you need to decrease max connections,
I also say most likely you will not do it right now
because most people
tend not to do it. They all have
reasons why max connections should be
very high. And of course, since
Postgres, I think, 14,
things are improved in terms of handling
idle connections. But
when an incident happens,
these idle connections become active and we have almost zero chance for statements to be finished, right, because the server is overwhelmed with load. But if you have a sane number of max_connections, I would say take your vCPU count multiplied by some relatively low multiplier, less than 10, and this should be max_connections for OLTP workloads. And then you have PgBouncer or something. So if you have this and enormous load is coming, the additional load will be receiving "out of connections" errors.
Timeouts or something.
Yeah, and you have chances to finish the current statements, the current query processing, and so on. So it's much better than trying to please everyone and not being able to do it at all, including with your old clients.
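To make the arithmetic of that rule of thumb concrete, a hypothetical example with purely illustrative numbers:

    -- 16 vCPUs x a multiplier of 5 = 80, well under 10x the core count
    ALTER SYSTEM SET max_connections = 80;  -- takes effect only after a restart
    -- application traffic then goes through a pooler such as PgBouncer,
    -- so excess clients get connection errors instead of overwhelming the server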
It also makes some diagnosis easier, right?
If the database is still responding to anything,
it's easier to diagnose issues than if it's not responding at all, right?
Exactly.
It's kind of just moving the problem, but it's definitely an improvement. But yeah, it's a good point. It could just be overwhelmed, but there are a million other reasons, of course.
But the first question, I would say, is: are we receiving more load? So the reason is already outside of Postgres. Well, technically, I just explained an additional factor, high max_connections. Partially the problem is inside Postgres, but the main reason, the root cause, is outside. We're just receiving much more than usual. This is the number one thing to check.
We don't have time to discuss full recipe
for troubleshooting of such cases.
We've got an episode, I think, for that.
Maybe, yeah.
I already keep forgetting what we have.
It's probably actually just monitoring.
Yeah.
Yeah, yeah.
But maybe we should have, you know, a how-to-troubleshoot-a-slow-database, step by step. So to save time, the second piece of advice I would give: just check wait event analysis.
The second thing: if you have a lot of active sessions... Actually, sometimes the database is slow without a lot of active sessions, which is interesting, but understanding the number of active sessions is very important. The next thing is to understand what they are doing, right? Are they doing a lot of I/O? Or is there contention related to the lock manager, for example, or subtransactions, or anything like that? So wait event analysis is super important.
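For illustration, a minimal ad-hoc version of this analysis, sampling pg_stat_activity directly; monitoring tools effectively chart the same thing over time:

    -- what are active sessions doing right now?
    SELECT wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
    WHERE state = 'active'
      AND pid <> pg_backend_pid()
    GROUP BY 1, 2
    ORDER BY count(*) DESC;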
And we discussed just now how to improve dashboard number one... no, dashboard number three, which is query analysis in pgwatch2 Postgres AI edition, and I'm almost convinced to put wait event query analysis at the top. Previously, I was thinking we should have total time from pg_stat_statements and average time, and that total time maybe should be higher, and we have a long discussion inside the team about what should be higher.
But now I'm almost convinced that wait event analysis should be at the very top, because it gives you very quick understanding: just from one chart, you quickly understand the number of active sessions and the distribution in terms of what they are doing. In any analysis, when you have some number, the next step is to segment it properly, to divide this number into some segments, and I think wait events are a very good direction for segmentation, how to say it.
Yeah, it splits it into fewer things, so it's easier to spot if there's a majority, whereas with query analysis you could have a real long tail. Even your most commonly executed query might only be one percent of your workload.
Well, yeah, it might be 50%, but it might be 1%, whereas wait events are more likely to show it.
Yeah, and timing in pg_stat_statements hides details. It might be actual work the database is doing, and that's why it's spending time, for example, sequential scans due to lack of indexes or something like that. Or it might be waiting for a lock to be acquired. It's also spending time, and you quickly see it. So, very good books as usual, the books from Brendan Gregg. On troubleshooting, I remember also his talks, a two-part talk about tooling for Linux and so on. And he mentioned: if he needed to choose just one Linux tool, only one tool, with the biggest outcome in terms of troubleshooting, what would it be?
Do you remember? No?
No.
It's iostat. It gives you disk I/O, and it also reports CPU, segmented by user, system, iowait.
You see disk I/O and CPU just from one tool. Similar here: we see the number of active sessions and also the wait event segmentation. It's a very good chart to have for troubleshooting.
It feels to me like an interesting trade-off, whether you're looking at monitoring more often or not, or not even necessarily more often, but do you optimize for people in an incident, or do you optimize for people doing general performance work? And I think optimizing for the incident people makes some sense, even though it's less often, hopefully. Yeah, they have less time, they'll...
Less time, but also heightened emotions, and not thinking straight, like we started with. So maybe that's it. Paths should be shorter.
Yeah, I agree.
Right, yeah.
So there are many other things that can happen with database, of course, right?
But if you know some common things, it helps a lot.
Yeah.
And tooling should be prepared and, yeah, observability is important.
Yeah.
One last question.
I think there are some arguments for trying to reduce incidents down to nearly zero, trying to put everything in place so that you never have any incidents, you know, high availability, everything to try and minimize the risk. And then, I think, as a team you can get out of practice dealing with incidents, if you're good at that kind of thing, but then when one does happen, it can really throw you. Some teams like to deal with super minor incidents and treat those as incidents, almost as practice. Do you have any opinions or feelings around that kind of thing?
Yeah, good point. So we actually didn't discuss many things, for example, how to categorize incidents as priority one, priority two, and so on. Because when a client comes, and it happened a couple of times over the last month, a client comes and shows me some graphs with spikes of active sessions exceeding the CPU count significantly, I already say: oh, you are having at least, you know, a P3 incident, or maybe P2. Maybe it's not user-facing, people haven't noticed it, but it's an incident already. It requires investigation. They say the database is slow, but this already needs some reaction and mitigation. So it maybe requires understanding and expertise, and classification rules which require Postgres understanding, right? Because sometimes I have a hard time convincing people that if you have, I don't know, 64 cores, but the active session count jumped to 200, 300, it's already not normal. They say, well, it worked, right?
No one complained?
Yeah, yeah, yeah. Well, it worked. And part of the problem is that in Postgres we don't have a good metric for average latency, for example, for query processing. Because the database's job, like, we want query processing to not produce errors and to be fast. Fast, we have a definition of fast for the OLTP case.
I have an article about it.
Definitely, it's not one second.
It should be below.
It should be below 100 milliseconds.
In most cases, it should be below 10 milliseconds
because one HTTP request consists of multiple SQL queries, usually, in many cases.
And human perception is 200 milliseconds.
So we have some threshold already.
So let's keep latency low.
But funny thing, Postgres doesn't have latency exposed,
average latency.
It doesn't.
So pg_stat_database doesn't have it. Nothing has it. Only pg_stat_statements.
Yeah.
But it's not precise.
It's not in core, and there is a max of 5,000 entries. In some cases the workload is complex, and there is constant eviction of records from pg_stat_statements and appearance of new ones. So latency measured from pg_stat_statements, which is what most monitoring systems do, including dashboard number one we discussed earlier from pgwatch2 Postgres AI edition, feels not fully reliable, right?
But this is important because this is how we can say,
okay, really slow. How much?
We had sub-millisecond latency.
Now we have five-millisecond latency.
Okay, indeed, there is proof of it.
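For example, a rough approximation of average query latency from pg_stat_statements, with the caveats above about eviction and untracked queries; column names are for Postgres 13 and newer:

    SELECT sum(total_exec_time) / nullif(sum(calls), 0) AS avg_query_time_ms
    FROM pg_stat_statements;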
I like that PgBouncer reports it.
I was going to ask, yeah.
It logs it, and in its stats it reports it.
This is great.
This is what we should have, honestly, in Postgres as well.
But, yeah, I actually don't remember discussions about it.
There should be some discussions. So this is maybe our main characteristic of performance. I wish, of course, we had percentiles, not only the average. Many people monitor it from the client side. Datadog has APM and there is the ability to monitor it from the client side. But this is not purely database latency, because it includes RTTs, round-trip times, network, right? And that should be excluded if we talk about the database, to understand the behavior of the database, right? So yeah, this is how we understand latency.
And, yeah, if it's slow, it's slow.
And then we need to, again,
apply segmentation and top-down analysis
and find what exactly is slow.
Everything or just some of it, right?
So it's P2, P3 incidents.
I think for smaller companies, it's hard in terms of database.
It's possible, but it's too much work, maybe.
Well, but I also think there could be an argument for making incidents a bit more normal in your team, and less stressful, for when you do have a stressful one, when you do have a big one that's a bigger deal.
I see your point, unless your team is overwhelmed with P1 incidents, which I also had in my team actually, and I saw it, like every day we have the database down. Other than that, it's a good idea, if you don't have database incidents, to say, okay, let's look for P2, P3 incidents and start processing them routinely, so we build a muscle for incident management. It's great advice.
Indeed.
Indeed.
Yeah, maybe that's it. Let's wrap it up.
Sounds good.
Thanks so much, Nikolai. Catch you next week.
Thank you, Michael.