Postgres FM - Postgres Emergency Room

Episode Date: August 30, 2024

Nikolay and Michael discuss PostgreSQL emergencies — both the psychological side of incident management, and some technical aspects too.

Here are some links to things they mentioned:

Site Reliability Engineering resources from Google https://sre.google
GitLab Handbook SRE https://handbook.gitlab.com/job-families/engineering/infrastructure/site-reliability-engineer
Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix https://netflixtechblog.com/keeping-customers-streaming-the-centralized-site-reliability-practice-at-netflix-205cc37aa9fb
Our monitoring checklist episode https://postgres.fm/episodes/monitoring-checklist
Hannu Krosing talk on Postgres TV — Do you vacuum everyday? https://www.youtube.com/watch?v=JcRi8Z7rkPg
Our episode on corruption https://postgres.fm/episodes/corruption
Nikolay’s episode on stopping and starting Postgres faster https://postgres.fm/episodes/stop-and-start-postgres-faster
Our episode on out of disk https://postgres.fm/episodes/out-of-disk
The USE method (Brendan Gregg) https://www.brendangregg.com/usemethod.html
Thundering herd problem https://en.wikipedia.org/wiki/Thundering_herd_problem
pgwatch2 Postgres AI edition https://gitlab.com/postgres-ai/pgwatch2

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello, hello, this is PostgresFM. My name is Nikolay from Postgres.ai, and as usual, my co-host is Michael from pgMustard. Hi, Michael. Hello, Nikolay. And the topic I chose is related to problems, acute, unpredictable problems which sometimes happen to production systems, including Postgres. In many cases, the database is at the center of the storm. And let's discuss how we manage this and how to manage it better. So yeah, how to handle crisis situations with production Postgres. I called it Postgres Urgent Care or Emergency Room. I don't know what's a better name here. But yeah, I guess this is something I can share from my past experience. Yeah.
Starting point is 00:00:54 So let's discuss this. Sounds good. And good point about the database often being in the center of things. I guess it is often the large companies that we notice on Twitter and such, when people start tweeting that it's down. I think recently there was a big GitHub incident, and one of the first communications was about it being database related. Database changes related, not only the database, but also database changes, and when you have a change involving the database, it means increased risk of an incident in most cases, actually.
Starting point is 00:01:32 But it was not about Postgres, so let's exclude this case. No. It's MySQL. But I truly believe that the database, as I usually say, is the heart of any tech system because it's very, very stateful, right? Because it's data. And since it's stateful, it's really hard to scale it
Starting point is 00:01:53 and handle performance and so on. So indeed, since the database is the heart of our systems, production systems, it's often at the center of the storm. Where would you like to start with this? I have a plan. Let's have a two-step discussion. First, we'll discuss the psychological part of incident management related to databases, and second, the purely technical part. How about this? Yeah, I like it. The psychological aspect probably isn't talked about as much as it could be. I think people often in postmortems focus on the technical things that could be done differently to avoid the issue reoccurring. I mean, that obviously is the most important thing, but I've also learned a lot from how they communicated, or how much they communicated. Often I see almost no communication from companies that are having big outages, very poor levels of updates.
Starting point is 00:02:52 And it wouldn't completely alleviate the situation, of course. If you're down, you're still down. But I'd be surprised if companies that communicate better don't have much better outcomes than the ones that don't. You know, people looking to cancel services afterwards, that kind of thing. Yeah. So indeed, it may not be well discussed, but I think anyone who is dealing with production at larger scale definitely knows that, first of all, many people don't manage that kind of stress well, and that's fine.
Starting point is 00:03:27 Maybe it's not your kind of thing, right? Some people manage stress very well. It's not me, by the way. I manage stress moderately well. I've learned how to do it and so on, but in the middle of a production incident, I still find myself at high emotions. We should do something very quickly, right? And it's hard.
Starting point is 00:03:49 And also tunnel vision, you know, like you see only something and you don't have time for anything else. It's very hard to relax and capture the whole picture and so on. And that's why tooling is also important here, right? Tooling should be designed for people who are under stress and some runbooks and so on. But this is technical stuff. We will talk about it slightly later. So I think there are some trainings and so on.
Starting point is 00:04:12 I'm not a good source of advice here, but I know we should look at people who deal with production systems, SREs and so on, and there are many books written about it, handbooks, runbooks and so on. So yeah, there are some good practices for how to deal with such stress and so on. What are your favorites of those? I like to use materials from Google, of course. There's sre.google, that's the hostname. Nice. There are three books there and quite good content.
Starting point is 00:04:49 I also like the handbook from GitLab for the production SRE team. I seem to remember Netflix having some good stuff as well. Yeah, yeah. We are in an area, it's not only about databases, of course, right?
Starting point is 00:05:03 It's the SRE area, basically. And there are many good materials for this. But specifically for databases, I think in general, if you feel emotional, maybe you should tell this to colleagues. So maybe someone else should be acting more. You need to understand yourself, basically, right? I remember we had a Black Friday that was very boring because we were very well prepared. It was a large e-commerce company, and we had very good preparation, a war room organized, and we were prepared for incidents.
Starting point is 00:05:40 And the whole Black Friday was so boring. So when we finally found some incident at 9 p.m., I was happy. Finally, we have some material. And recently, I was helping some customers and I remember exactly this state: finally, something interesting, you know. This is a great state to be in.
Starting point is 00:06:01 Instead of being stressed, with everything on your shoulders, not knowing what to do, or maybe knowing but worrying what if it won't work. You have some good materials to check, like some monitoring systems, but you're so stressed, right? Because you have fears of failure: what if you won't be able to bring the database back up for many hours, and then everything is a disaster. But if you're in the state of, this is interesting, finally we have some work to do in production, I have a high level of curiosity, maybe it's a new case, fine. This comes with
Starting point is 00:06:46 experience. You saw many cases, but you are looking for new kinds of cases, because it's already too boring to deal with the same kind of corruption again, or, I don't know, some database down again. You saw it many times and you are hunting for new types of cases, and curiosity helps. So these are two very opposite states, I would say. And I was in both in my life. So, yeah. For you, does it depend a bit on the severity though? Because for me, even if it was 9pm and I'd been hoping for some interesting case to come up,
Starting point is 00:07:24 if it was super serious and, like, everything was down, I wouldn't be happy that we finally got it. Well, honestly, I didn't have cases where, for example, people's lives depend on this. Fair. I can assume this might happen with some systems, but I was in cases where the cost of downtime was super high. And now I'm not scared anymore, you know. I already had it, right?
Starting point is 00:07:54 So I know how it feels and so on. I'm not super scared if it's only about money. But life-threatening downtime, honestly, I didn't have it. And I think if it happened I would be very concerned, right? Yeah. Maybe it's this realization that all of this is only about money, and the worst thing that can happen is somebody will lose money and you will lose your job. It's not the worst case, actually. Yeah, right, just relax, right? But life-threatening, it's another story. I'm very curious if someone who is listening to us has some system where the state of Postgres can influence the health or life of people.
Starting point is 00:08:33 This is interesting. Yeah, you do hear about healthcare use cases, but also military use cases. Right, right. Yeah, it might happen. It might happen. I didn't have it. So that's why I'm not super scared. It's like, okay, it's not a big deal.
Starting point is 00:08:50 I know it's a big deal. We will help professionally, right? But let's just do what we can and that's it. And that's why I'm curious. Is it a new case, finally? Okay, let's work on it. But again, back to my point, you need to understand yourself. This is very important.
Starting point is 00:09:15 If you know you don't react well even to small problems, maybe it's better to be an assistant. Yeah, I was doing some thinking before the episode on what kinds of emergencies there could be. And a couple that, I don't know if you were 100% thinking of, that would be really scary for me would be, like, security-style incidents, either external... Yeah, exactly. That would be scary in a different way, potentially. Or exciting in a different way. I think this is CEO level already. So definitely if something like that happens, it's not only...
Starting point is 00:09:52 There is a technical aspect here, but there's also a very high-level organizational aspect of it, how to handle this situation properly. I was still talking about it psychologically, though. You know, psychologically, but decisions like how to handle it properly, it's already CEO level. Recently, we received from our insurance, I think, a regular routine notice that our data was stolen again.
Starting point is 00:10:15 And it just happens. Maybe we don't know if your record is also stolen, maybe not. And a couple of days later, I found a very good project on GitHub. Some guy created a database of all SSNs of all Americans, and just published it on GitHub. The fact is that it's only about 1 billion, given how many digits this
Starting point is 00:10:49 number has, so he just published all numbers up to 1 billion. Oh, okay. But some people on Twitter started thinking, oh, I found my SSN as well. It was like, you know, a snowballing joke. Some people started, okay, I'm going to remove my SSN, create a pull request. It's funny. Yeah. So back to this: if you know yourself, it's good. It helps you understand your stress level.
Starting point is 00:11:27 On another note, it's funny that we aim to monitor databases and production systems well, like with second-level precision sometimes, but we don't monitor ourselves, like cortisol level, right? It would be great to understand, but we don't have it. This bothers me a lot, monitoring of human bodies, and I don't understand my own state except for how I feel, so it would be good. Some people see heart rate, for example, right? Yeah, rings and watches that monitor heart rate, which probably correlates quite well with stress level. Yeah, but let's maybe slowly shift to technical stuff. So of course, knowing yourself helps.
Starting point is 00:12:07 I wanted to share one story. Very long ago, 15 years ago or so, I had a great team, a great startup. I was CTO, I think, maybe, no, I was CEO, but yeah, combining the two roles. And I had a Postgres production system and great Postgres experts in my team. And I remember one guy was a great Postgres expert and I made a mistake. It was my mistake. I was leaving for a trip for a few days and I said, you will be responsible for production, especially the Postgres state, because he was the best Postgres expert in my team, right? But it was an obvious
Starting point is 00:12:51 mistake, because an incident happened and he couldn't handle it properly, and he completely, like, lost his... sorry for the French. Right, you mean technically he would have been best placed to handle it in the team? Yeah. A technical expert is not necessarily good in terms of incident management, right? Yeah. And this is my mistake. I didn't recognize it. And this led to the end of our cooperation, unfortunately.
Starting point is 00:13:24 So, sometimes good technical experts should be assistants. Right? Yes. Not with the pressure on their shoulders, right? This is super important to understand. So my advice is try to understand yourself and whether you should be responsible for incident management or just assisting technically, right?
Starting point is 00:13:44 Yeah, know yourself, but also know your team and know who in your team you can call on for different things. Yeah. Now, let's move to technical stuff. What is helpful? Very helpful, first of all: our main focus right now is companies that are growing, startups,
Starting point is 00:14:07 usually lacking database expertise. And many such companies come to us for help. And almost none of them have good incident management in place. It's not only about Postgres, right? And we always suggest thinking about at least a simple process. Because they say, we had an incident last week. My question is, show us the incident notes.
Starting point is 00:14:33 Is it logged anyhow, like with timestamps? In most cases they don't have anything but words. They have words: okay, we saw the database was slow, then it was unresponsive, blah, blah, blah. What you must have for an incident is a sequence, a document with artifacts: the first known event that happened,
Starting point is 00:14:56 some logs, screenshots from monitoring, better with links so we can revisit it. But screenshots matter a lot because sometimes monitoring has a small retention window. And the investigation might be long, especially if you involve external consultants like us, right? So there should be some template and plan for documenting incidents. And when you have it, it also helps with stress because you know what to do. You need to identify the first abnormal event, document it, and the things before it and after it, in some form.
Starting point is 00:15:36 And everything you notice also documented, important things highlighted. It can be a Google doc or something with discussion around it. It's good when it's possible to discuss some things, so people can ask questions, clarify, add some additional knowledge and so on. It can be anything, actually, but it's important to be prepared to document it. Yeah, I've seen a lot of people start with, like, whatever app you use for chat normally in the team, or some people have a different app for incidents specifically. But if you're using Slack, for example, start a new channel for the incident, and all incident-related stuff goes in there: screenshots,
Starting point is 00:16:17 logs. Yeah, that's cool. And then people turn it into a doc later sometimes, but I could see an argument for starting with a doc. Yeah. But normally people are panicking at the beginning. So chat makes sense. Yeah. Chat is more convenient for many people. It's what you use every day. So chat is good.
Starting point is 00:16:38 But it's important to have long-term storage for this document. Convert it to a document. And I can say that of the startups which grew to a terabyte or a couple of terabytes in terms of database size, most of them don't have a proper incident management workflow developed. They must have it.
Starting point is 00:16:58 It's time already. So yeah, I definitely encourage it: even if you have a couple of technical teams and technical experts in your team, it's still super important to have an incident management workflow developed. So yeah, detailed, step-by-step, so we understand what's happening. And you agree on the format of this document in advance. You can use some other companies as examples. Again, sre.google and
Starting point is 00:17:25 the GitLab handbook are useful for this particular area. GitLab, for example, has an example for incident management. Many other companies also share their templates and descriptions of how to document it
Starting point is 00:17:41 properly. Super important. And also, of course, sometimes you feel, okay, I'm documenting, but who will actually be solving the problem, right? So it's good if you have a few folks who can help each other: some of them are responsible for documenting, another is trying to find a quick solution. And the document is also important to have because then, in bigger companies, we have a procedure called root cause analysis, RCA, right? To learn from mistakes and fix them and prevent them in the future, right? So that's also why it's important to document. But then this helps,
Starting point is 00:18:19 and this is, I think, the fundamental number one technical thing you need to do. Oh, it's an organizational thing, sorry. But it includes some technical aspects. For example, which monitoring we use when an incident happens, right? Where do we start? This dashboard or that dashboard, right? What technical things must we document there? For example, of course, we care about CPU level and disk I/O basics, right? Hosts. If the database seems to be slow or unresponsive, we must document these things. We had discussions about monitoring dashboard number one, proposals like these. Dashboard number one in our pgwatch2 Postgres AI edition is designed for shallow but very wide analysis, very quick, up to one minute, of various components of Postgres and various properties at a very high level, like the 30,000-foot level of the workload, to understand which directions to investigate further, right?
Starting point is 00:19:15 Yeah, where's the issue? Yeah, this is very good to prepare in advance. Like, I know if something happens, how I will act, where I will start, right? Yeah, so this is important. And you'll know how to start. This is about monitoring and observability and logs and so on. Next, there are several particular cases I can quickly share which are important to be prepared for.
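A minimal sketch of the kind of shallow-but-wide first glance described above, as a single query rather than a dashboard; it assumes direct access via psql and only uses pg_stat_activity, and the thresholds you compare these numbers against are up to you:

    -- One-minute health glance: session saturation and the oldest open transaction
    SELECT count(*) FILTER (WHERE state = 'active')  AS active_sessions,
           count(*)                                  AS total_backends,
           current_setting('max_connections')::int   AS max_connections,
           max(clock_timestamp() - xact_start)       AS oldest_transaction_age
    FROM pg_stat_activity
    WHERE backend_type = 'client backend';

If active_sessions is far above the number of vCPUs, or oldest_transaction_age is measured in hours, that already tells you which direction to dig in.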
Starting point is 00:19:57 For example, of course, if you already know that the database has, for example, transaction ID wraparound. Like you can see straight away that there's an error in the log or something. Yeah, yeah. So we have cases very well documented from Sentry, Mailchimp, and somebody else, three of them. And also we have very, very good work from Google, GCP, from Hannu Krosing.
Starting point is 00:20:22 He was on Postgres TV presenting his talk about how to handle transaction ID wraparound without single-user mode. He thinks single-user mode is not the right way to do it, but this is the traditional approach, single-user mode, and a very long time for processing the database, for recovering its state. So this is something you can document, in case it happens someday.
Starting point is 00:20:51 But I haven't seen it for so long, because monitoring has it, alerts have it, and so on. And also recent versions have improvements in this area. I remember, I think Peter Geoghegan did some good work around this, probably others too. Yeah, I just started from a very scary thing. The scariest, this and also
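For monitoring and alerting on this, a hedged sketch of the usual check; the percentage column and any alert threshold are illustrative, and the hard limit the calculation refers to is roughly 2.1 billion transaction IDs:

    -- How close is each database to transaction ID wraparound?
    SELECT datname,
           age(datfrozenxid) AS xid_age,
           round(100 * age(datfrozenxid)::numeric / 2147483648, 1) AS pct_towards_wraparound
    FROM pg_database
    ORDER BY xid_age DESC;

Most monitoring systems alert long before this gets anywhere near 100%, which is part of why, as mentioned, the problem has become rare.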
Starting point is 00:21:10 scariest is, like, loss of backups, when you cannot perform disaster recovery, right? Also very low risk these days. Yeah, I guess these days the major issues are sometimes things
Starting point is 00:21:28 like the whole of us-east-1 being down. This hasn't really happened for a while, but a cloud regional outage, I feel like that could still take down a company's day. If you're using a managed service, or the cloud at all, you're at risk of that. Obviously, you can have plans in place to mitigate. Even if it's self-managed, not many people have a multi-region setup. Exactly. It's very hard, actually.
Starting point is 00:21:55 So if you don't have off-site backups, you're kind of sat there thinking, we just have to wait. Yeah, it's a complex thing to have multi-region done properly and well-tested in production, like failover tested very well. Yeah, it's a big topic, actually. So backups and transaction ID wraparound are the two nightmares of any Postgres DBA, right? Are they the scariest to you? I think corruption is pretty scary.
Starting point is 00:22:23 Well, it's a good, interesting topic. Corruption, we had an episode about corruption as well, right? But this is good to put into the preparation for incidents. If corruption happens, what will we do? Some steps. And the first step, according to wiki.postgresql.org, is copy the database, right? Because you will try to fix it, and maybe you will break it more, right?
Starting point is 00:22:47 So copy. This is the first step to take. And knowing this helps, because this kind of thing you can know in advance. By the way, transaction ID wraparound you can practice as well. There is a recipe I wrote for how to simulate it, right? So you can
Starting point is 00:23:03 have it in a lower environment and then, good luck dealing with it. Or you can clone your database and simulate it there. So corruption is a very broad topic, many types of corruption, but some kinds can also be simulated. There are tools for it. So it's good to know, right? But in the cases I saw, in most cases there was some path to escape. In some cases, the escape was we just restore from backups, losing some data, and for that project it was acceptable. In some cases it was, okay, we just noticed
Starting point is 00:23:44 that only pg_statistic is corrupted, so running ANALYZE fixes it. But long term, we see the database is on NFS, and this is no, no, no. Don't use NFS for PGDATA. In most cases where I saw corruption, it was something silly, actually.
Starting point is 00:24:10 But corruption also happens due to bugs, due to various stuff. Or a mistake when planning some major change, like switching to a new operating system and glibc version. Fortunately, over the last few years, such corruption happened in non-production, so we fully prevented it. Well, the reason I find it scary is more that we could have been returning bad results. Oh, silent corruption. Yeah. That's the fear to me. It's more, how far back does this go? Anyway, but it's a different kind of emergency. Yeah, we had corruption due to indexes and a glibc change in production with one company last year, and it was our oversight. But fortunately, it happened only on standby nodes,
Starting point is 00:24:59 which were not used. So it was a pure matter of luck that this production corruption happened only there. And no failover.
Starting point is 00:25:21 Yeah, yeah. Other clusters used their standby nodes. These clusters didn't use them, and we just saw some errors. It was during an upgrade with logical replication. We saw errors in the logs and quickly reacted, and then realized these standby nodes are not used. Let's pray that failover won't happen soon. Of course, just imagine. So we quickly mitigated this completely. Nobody noticed. But if it happens, yeah, the question is what's the propagation here? But there's also tooling, and learning from other people's mistakes helps, of course, as usual. And amcheck should be a very routine tool, used often: amcheck to check B-tree indexes.
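A hedged sketch of what a routine amcheck run can look like; the extension ships with Postgres as a contrib module, the index name below is hypothetical, and on version 14+ the pg_amcheck command-line tool can do the same across a whole database:

    -- amcheck: verify B-tree index structure; heapallindexed = true also checks
    -- that every heap tuple has a matching index entry (slower, catches more)
    CREATE EXTENSION IF NOT EXISTS amcheck;
    SELECT bt_index_check('orders_pkey'::regclass, true);   -- hypothetical index name

    -- List the B-tree indexes you might want to loop over
    SELECT c.oid::regclass AS index_name
    FROM pg_class c
    JOIN pg_am am ON am.oid = c.relam
    WHERE c.relkind = 'i' AND am.amname = 'btree';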
Starting point is 00:26:01 Hopefully, it will finally support other types of indexes soon. I think it's still a work in progress, right? Yeah, I saw some work going on. What else? For example, if database is shutting down
Starting point is 00:26:18 More than once I saw people being nervous, not understanding that it's normal, not understanding how to check the progress, what to expect. And it happens when you perform checkpoint tuning, we also had an episode about it, and increase checkpoint_timeout and max_wal_size, which you should do on loaded systems. In this case, a restart, or just stopping the database
Starting point is 00:26:51 or starting the database, might take many, many, many minutes. And if it's self-managed, I saw people send kill -9, SIGKILL, right, to Postgres, because they are nervous, not understanding, oh, Postgres is not starting, what to do. And I think now in fresh versions there are some log messages telling you that we are in recovery mode and showing some progress, right? I thought about it. It's very recent. I can't remember if it did go into something. It should be so. It should be very straightforward.
Starting point is 00:27:27 The DBA should see the progress and have an understanding of when it will finish. And for older versions, at least definitely older than 16, it's unclear. And usually, if it's self-managed, you just run ps to see what the process reports in its title, or top, right?
Starting point is 00:27:50 And you see an LSN there, then you use pg_controldata to understand the point of consistency. And then, if you have two LSNs, you go to another Postgres and you can calculate the difference in bytes. So you understand how many bytes, megabytes, gigabytes are left, and then you can monitor it every minute or every second, understand the progress, and have an ETA, expected time of arrival, right? And this helps. And I think it's a good idea to learn how to do it.
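A sketch of that arithmetic, run on any other available Postgres instance; the two LSNs are placeholders standing in for the one shown in the recovering process's ps/top title and the consistency point reported by pg_controldata:

    -- How much WAL is left to replay? (placeholder LSNs)
    SELECT pg_size_pretty(
             pg_wal_lsn_diff('4E/9A000000',    -- target LSN, e.g. from pg_controldata
                             '4C/28000000')    -- current replay LSN, e.g. from the ps title
           ) AS left_to_replay;

Repeating this a minute later gives the replay rate, and dividing what is left by that rate gives a rough ETA.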
Starting point is 00:28:27 For older versions and newer versions, I have a how-to about it, actually: what if Postgres startup or stop takes a long time, what to do about it. And, yeah, it should just be learned, right? And if you're prepared, it will reduce stress. And, yeah, we had a lot of such cases working on DBLab. Sometimes a clone is not created. Why? It's because max_wal_size is huge and so on. It's just recovering. So you just need to wait a little bit more. But then we improved it. So yeah, this might happen. This is a very common situation. Long restart time.
Starting point is 00:28:56 is huge and so on. It's just recovering. So you just need to wait a little bit more. But then we improved it. So yeah, this might happen. This is a very common situation. Long restart time. Yeah, I'll definitely share that episode in the show notes as well so people can find it if they weren't listening back then. What else? Somebody deleted data.
Starting point is 00:29:20 We have other episodes, like out of disk. There are other kinds of emergencies. One we haven't covered in much detail, I don't think, was, like, out of integers. Yeah. Out of integers, it's a big disaster. Yeah. But I guess that's quite common.
Starting point is 00:29:39 In terms of other common issues people come to you with, is that up there, or what tends to be there? Maybe I'm biased here, because I have a feeling it's a very well-known problem and people already mitigate it, and mitigating it doesn't require a lot of expertise. Our Postgres checkup tool has a report for it, saying how much of the capacity of an int4, regular integer, primary key is left for a particular table. For me, it's straightforward already, right? And I don't see a big deal. But if it happens, of course, it's at least partial downtime, because you cannot insert
Starting point is 00:30:19 new rows into this table. And it can be scary. That's true of so many of these issues though, right? Like once you're monitoring for them, once you know about them and you've got alerts far enough out, they stop being emergencies. Right. But I'd also like to mention a common problem, like, the database is slow
Starting point is 00:30:40 or the database is unresponsive, what to do. Like, very general, where to start, what do you think? Well, I think that's the monitoring thing, isn't it? Like, go to the monitoring, that number one dashboard you talked about. Yeah. I think that "work out where the problem is" step needs to be the first point, doesn't it? Yeah, I agree.
Starting point is 00:31:01 And the first thing I would start understanding, I think we can talk about methodologies here, like starting from USE, right? And others, there are many of them. But the USE method, utilization, saturation, errors, from Brendan Gregg, like basics from Netflix, right? This is a very, very simple approach, I would say. But yeah, here, first question,
Starting point is 00:31:30 if the database is slow or unresponsive, first question: did we really put more workload on it? Very simple question, but sometimes hard to answer. Because often we find out that, you know, many more clients connected, some background jobs started bombarding the database with new queries, retrying
Starting point is 00:31:49 connections. Like cascading. Is it the elephants? I actually don't know that term. So the question is whether more load is coming externally to the database.
Starting point is 00:32:06 This can, of course, be a reason why it's slow. And if it's not prepared to handle spikes of load, for example, you keep max_connections high, ignoring advice from Postgres experts to keep it sane. Recently I saw, I'm sharing without names so I can share, right? 12,000 max_connections.
Starting point is 00:32:31 This is, for me, I think, a record. A new client showed it and they explained it. And I see it's a trend. Recently, when I say you need to decrease max_connections, I also say most likely you will not do it right now, because most people tend not to do it. They all have reasons why max_connections should be
Starting point is 00:32:52 very high. And of course, since Postgres 14, I think, things have improved in terms of handling idle connections. But when an incident happens, these idle connections become active and we have almost zero chance for statements to be finished, right, because the server is overwhelmed with load. But if you have a
Starting point is 00:33:17 sane number for max_connections, I would say take your vCPU count multiplied by some relatively low multiplier, like less than 10, and this should be max_connections for OLTP workloads. And then you have PgBouncer or something. Yeah, so if you have this and enormous load is coming, the additional load will receive "out of connections" errors. Timeouts or something. Yeah, and you have a chance to finish the current statements, the current query processing, and so on. So it's much better than trying to please everyone and not being able to do it at all, including for your existing clients. It also makes some diagnosis easier, right?
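A back-of-the-envelope sketch of that sizing rule, with purely illustrative numbers (64 vCPUs and a multiplier of 5), on the assumption that a pooler like PgBouncer queues everything above this:

    -- Illustrative sizing: 64 vCPUs * 5 = 320 backend slots for an OLTP workload
    ALTER SYSTEM SET max_connections = 320;
    -- Note: max_connections only changes at server restart, a reload is not enough
    SHOW max_connections;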
Starting point is 00:34:12 If the database is still responding to anything, it's easier to diagnose issues than if it's not responding at all, right? Exactly. It's kind of just moving the problem, but it's definitely an improvement. But yeah, it's a good point. It could just be overwhelmed, but there could be, like, a million other reasons, of course. But the first question, I would say, is: are we receiving more load, which already
Starting point is 00:34:37 means the reason is outside of Postgres? Well, technically, I just explained an additional factor, high max_connections. Partially the problem is inside Postgres, but the main reason, the root cause, is outside. We're just receiving much more than usual. This is the number one thing to check. We don't have time to discuss a full recipe for troubleshooting such cases.
Starting point is 00:35:03 We've got an episode, I think, for that. Maybe, yeah. I already keep forgetting what we have. It's probably actually just monitoring. Yeah. Yeah, yeah. But maybe we should have, like, you know, like, how to troubleshoot slow database step by step.
Starting point is 00:35:18 So to save time, the second piece of advice I would give: just check wait event analysis. Second thing, if you have a lot of active sessions, although actually sometimes the database is slow without a lot of active sessions, which is interesting, understanding the number of active sessions is very important. But the next thing is to understand what they are doing, right? So are they doing a lot of I/O? Or is there contention related to the lock manager, for example, or subtransactions, or anything like that? So wait event analysis is super important.
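A minimal sketch of that kind of segmentation as a one-off query (a dashboard would sample this over time); it only assumes pg_stat_activity:

    -- Segment active sessions by what they are currently waiting on.
    -- A NULL wait_event_type usually means the backend is actually running on CPU.
    SELECT wait_event_type,
           wait_event,
           count(*) AS sessions
    FROM pg_stat_activity
    WHERE state = 'active'
      AND backend_type = 'client backend'
    GROUP BY 1, 2
    ORDER BY sessions DESC;

A pile of Lock or LWLock rows points in a very different direction than a pile of IO rows, which is exactly the quick segmentation being described.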
Starting point is 00:35:56 And we were just discussing how to improve dashboard number one... no, dashboard number three, which is query analysis in pgwatch2 Postgres AI edition, and I'm almost convinced to put wait event query analysis at the top. Previously, I was thinking we should have total time from pg_stat_statements and average time, and that total time maybe should be higher, and we have a long discussion inside the team
Starting point is 00:36:26 about what should be higher. But now I'm almost convinced that wait event analysis should be at the very top, because it gives you very quick understanding: just from one chart you quickly understand the number of active sessions and the distribution in terms of what they are doing. In any analysis, when you have some number, the next step is to segment it properly, to divide this number into some segments,
Starting point is 00:36:55 and I think wait events are a very good direction for segmentation, how to say it. Yeah, it splits it into fewer things, so therefore it's easier to spot if there's, like, a majority. Whereas with query analysis you could have a real long tail, like even your most commonly executed query might only be one percent of your workload. Well, yeah, it might be 50%, but it might be 1%, whereas wait events are more likely to show a majority. Yeah, and timing in pg_stat_statements hides details. It might be actual work the database
Starting point is 00:37:32 is doing, and that's why it's spending time, for example, sequential scans due to lack of indexes or something like that. Or it might be waiting for a lock to be acquired. So it's also spending time, and you quickly see which. So, very good books as usual, the books from Brendan Gregg. In troubleshooting, I remember also
Starting point is 00:37:55 his talks, a two-part talk about tooling for Linux and so on. And he mentioned that if he needed to choose just one Linux tool, if you can use only one tool with the biggest outcome in terms of troubleshooting, what is it? Do you remember? No? No. It's iostat. It gives you disk I/O and it also reports CPU, segmented by user, system, iowait.
Starting point is 00:38:23 So it's super good. You see disk I/O and CPU just from one tool. Similar here: we see the number of active sessions and also the wait event segmentation.
Starting point is 00:38:38 It's a very good chart to have for troubleshooting. It feels to me like an interesting trade-off: do you optimize for people in an incident, or do you optimize for people doing general performance work? And I think optimizing for the incident people makes some sense, even though it's less frequent, hopefully. Yeah, they have less time. Less time, but also heightened emotions
Starting point is 00:39:07 and not thinking straight, like we started with. So maybe that's it. Paths should be shorter. Yeah, I agree. Right, yeah. So there are many other things that can happen with a database, of course, right? But if you know some common things, it helps a lot. Yeah.
Starting point is 00:39:24 And tooling should be prepared and, yeah, observability is important. Yeah. One last question. I think there are some arguments for trying to reduce incidents down to, like, nearly zero, trying to put everything in place so that you never have any incidents, you know, high availability, everything to try and minimize the risk. And then I think as a team you can get out of practice dealing with incidents, if you're good at that kind of thing, but then when one does happen it can really throw you. Some teams like to deal with, like,
Starting point is 00:40:04 super minor incidents and treat those as incidents, almost as practice. Do you have any opinions or feelings around that kind of thing? Yeah, good point. So we actually didn't discuss many things, for example, how to categorize incidents, like priority one, priority two, and so on. Because when a client comes, and it happened a couple of times over the last month, a client comes and shows me some graphs with spikes of active sessions exceeding the CPU count significantly, I already say, oh, you are having at least, you know, a P3 incident, or maybe P2. Maybe it's not user-facing, people haven't noticed it, but it's an incident
Starting point is 00:40:47 already. It requires investigation, and they say the database is slow, but this already means you need some reaction and mitigation for it. So it maybe requires understanding and expertise, and classification rules which require Postgres understanding, right? Because sometimes I have a hard time convincing people that if you have, I don't know, 64 cores but the active session count jumped to 200 or 300, it's already not normal. They say, well, it worked, right? No one complained? Yeah, yeah, yeah. Well, it worked. And part of the problem in Postgres is we don't have a good metric for average latency, for example, for query processing, because the database's job, like, we want query processing to not produce errors and to be
Starting point is 00:41:42 fast. Fast, we have a definition of fast for the OLTP case. I have an article about it. Definitely, it's not one second. It should be below 100 milliseconds. In most cases, it should be below 10 milliseconds, because one HTTP request consists of multiple SQL queries, usually, in many cases.
Starting point is 00:42:00 And human perception is 200 milliseconds. So we have some threshold already. So let's keep latency low. But the funny thing is, Postgres doesn't have latency exposed, average latency. It doesn't. pg_stat_database doesn't have it. Nothing has it.
Starting point is 00:42:21 Only pg_stat_statements. Yeah. But it's not precise. It's not in core. It's not precise. There is pg_stat_statements.max, 5,000. In some cases the workload is complex and there is constant eviction of records from pg_stat_statements and appearance of new ones. So latency measured from pg_stat_statements,
Starting point is 00:42:45 which is what most monitoring systems do, including the dashboard number one we discussed earlier from pgwatch2 Postgres AI edition, feels not fully reliable, right? But this is important, because this is how we can say, okay, it's really slow. How much? We had sub-millisecond latency. Now we have five-millisecond latency.
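A sketch of the rough average-latency estimate monitoring usually derives from pg_stat_statements, with the same caveats about eviction; it assumes the extension is enabled, and the column names assume Postgres 13+, where total_exec_time replaced total_time:

    -- Rough server-side average statement latency in milliseconds
    SELECT round(sum(total_exec_time)::numeric / sum(calls), 3) AS avg_latency_ms,
           sum(calls)                                           AS calls_tracked
    FROM pg_stat_statements;

For comparison, PgBouncer exposes an average query time as well, via SHOW STATS in its admin console and in its periodic stats log lines.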
Starting point is 00:43:09 Okay, indeed, there is proof of it. I like that PgBouncer reports it. I was going to ask, yeah. It logs it, in its stats, and reports it. This is great. This is what we should have, honestly, in Postgres as well. But, yeah, I actually don't remember discussions about it. There should be some discussions. So this is maybe our main characteristic of performance.
Starting point is 00:43:33 I wish, of course, we had percentiles, not only the average. Many people monitor it from the client side. Datadog has APM, and there's the ability to monitor it from the client side. But this is not purely database latency, because it includes RTTs, round-trip times, the network, right? And that should be excluded if we talk about the database, to understand the behavior of the database, right? So yeah, this is how we understand latency. And, yeah, if it's slow, it's slow. And then we need to, again, apply segmentation and top-down analysis
Starting point is 00:44:13 and find what exactly is slow. Everything, or just some of it, right? So those are P2, P3 incidents. I think for smaller companies it's hard in terms of the database. It's possible, but it's too much work, maybe. Well, but I also think there could be an argument for, like, making incidents a bit more normal in your team and less stressful.
Starting point is 00:44:38 So when you do have a stressful one, or when you do have a big one, that's a bigger deal. I see your point, unless your team is overwhelmed with P1 incidents, which I also had in my team actually, and I saw it, like, every day
Starting point is 00:44:55 we have the database down. Apart from that, it's a good idea, if you don't have database incidents, to say, okay, let's look for P2, P3 incidents and start processing them routinely, so we build a muscle
Starting point is 00:45:12 for incident management. That's great advice. Indeed. Indeed. Yeah, maybe that's it. Let's wrap it up. Sounds good. Thanks so much, Nikolay. Catch you next week. Thank you, Michael.
