Postgres FM - pgBadger

Episode Date: January 12, 2024

Michael is joined by Alicja Kucharczyk, Program Manager for Azure Database for PostgreSQL at Microsoft and organiser of the Warsaw PostgreSQL User Group, to discuss Alicja's favorite tool: pgBadger. They cover what it is, how and when it's useful, and some guidance for getting the most out of it (without causing an outage).

Here are some links to things they mentioned:

pgBadger website https://pgbadger.darold.net
pgBadger repo https://github.com/darold/pgbadger
An example pgBadger report https://pgbadger.darold.net/examples/sample.html
pg_stat_statements https://www.postgresql.org/docs/current/pgstatstatements.html
Azure guide https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/how-to-generate-pgbadger-report-from-azure-database-for/ba-p/3756328
Google Cloud SQL guide https://cloud.google.com/blog/products/databases/guide-to-the-database-observability-with-cloud-sql
RDS guide https://aws.amazon.com/blogs/database/optimizing-and-tuning-queries-in-amazon-rds-postgresql-based-on-native-and-external-tools/
Community Insights on pgBadger (PGSQL Phriday recap) https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/community-insights-on-pgbadger-a-pgsql-phriday-010-recap/ba-p/3880911
PGSQL Phriday #010: Log analysis (blog post by Lætitia Avrot) https://mydbanotebook.org/post/log-analysis/
Nothing Compares To VACUUM / The Ballad of Bloat https://www.youtube.com/watch?v=2NxIngqq1y0
Explain Analyze (Feliz Navidad cover) https://www.youtube.com/watch?v=qznnzYZPdkM

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and today I am joined by Alicja Kucharczyk, Program Manager for Azure Database for PostgreSQL at Microsoft and organiser of the Warsaw PostgreSQL User Group. Thank you so much for joining me today, Alicja. It's an honor to have you here. Thank you for having me, Michael. Wonderful. Well, I proposed this week's topic, mainly because you're one of the biggest fans, or at least advocates, of this tool, I think, in the world.
Starting point is 00:00:33 But it's probably an underused tool, in my opinion anyway. I definitely don't use it at times when I could or should, and I know most people that could be using it have never even heard of it, at least newer users. So what are we going to be talking about today? Yeah, about my favorite tool, which is pgBadger, which is totally open source, which is really great. It allows you to do really comprehensive troubleshooting, and I believe it's top of the class. Yeah, absolutely. Do you want to give us a little bit of background on the tool, or how people use it?
Starting point is 00:01:12 Yes, absolutely. Maybe I will start with my own story, because I learned that the tool exists over 10 years ago, I believe, when I was working for a Polish bank. And the process there was: each time something was happening, they started by generating a pgBadger report. That was always the first step that they'd take. So that was the place where I learned that this tool exists. And over the years, because I was moving from company to company, I started to be a Postgres consultant, and back then it started to be my main tool for doing troubleshooting,
Starting point is 00:01:53 because as a consultant you don't really know the system. You're going to the customer, something has happened, and the customer is saying, okay, for instance, CPU is very high, or the latency of a query has suddenly grown, but you still don't really have any knowledge about the system, right? And pgBadger is really one of the best tools to really quickly get knowledge about the workload that is happening within your database. And I also really like that pgBadger is showing you a really comprehensive picture,
Starting point is 00:02:29 because you might know other, similar tools, like pg_stat_statements. And it's not that I want to say that this is a bad tool or anything. I don't use pg_stat_statements; I know that many people do. But pg_stat_statements is actually just one tab in the pgBadger report. And next to that,
Starting point is 00:02:52 you've also got a report about the temporary files, and you've got really nicely aggregated errors. So everything in the logs that is an error, fatal, or warning is really nicely aggregated on the events tab. You would also have the locks, locks with a C-K, right? Not logs, but locks. I know it's easy to mix those two. So you also have the locks, and you've got the number of connections and the number of sessions. So you really have a comprehensive view.
Starting point is 00:03:26 For instance, if you've got pg_stat_statements, you see, okay, this query was executed 1,000 times, the total time was this, and the average time was this, right? You would have the same thing in pgBadger, but the difference is that in pgBadger you can quickly go to another tab, for instance the temporary files tab, and see that, okay, this execution time was actually caused by this query generating temporary files, or maybe it was waiting for a lock. So that's why I do prefer pgBadger over pg_stat_statements and other tools: because it gives me this comprehensive view. Yeah, I like it. Shots fired right up top.
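For reference, the pg_stat_statements view she describes looks roughly like this; a minimal sketch, assuming the extension is installed and Postgres 13 or newer (where the columns are total_exec_time and mean_exec_time; older versions call them total_time and mean_time):

    psql -c "
      SELECT calls,
             round(total_exec_time::numeric, 1) AS total_ms,
             round(mean_exec_time::numeric, 1)  AS mean_ms,
             left(query, 60)                    AS query
        FROM pg_stat_statements
       ORDER BY total_exec_time DESC
       LIMIT 10;"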
Starting point is 00:04:10 I'm a big fan of pg_stat_statements as well, so I'm hoping to be somewhat converted here. I definitely think there are pros to both. So pgBadger does log analysis, and it's a standalone tool as well, right? So open source, as you said; we can install it wherever and analyze our log files for anything that we can spit out to the log.
Starting point is 00:04:32 So, as you mentioned, locks, wait events, and errors as well. That's a big missing part of pg_stat_statements. It's quite common advice to have a timeout, right? If you have queries that are taking more than, I don't know, on some systems it might be 30 seconds, time them out instead of letting them complete. In pg_stat_statements we don't see those at all, but in pgBadger we can see them, because of the errors. Absolutely, yes.
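A sketch of that timeout setup; the database name is a placeholder, but the error text is the standard message Postgres emits, which is exactly the kind of entry pgBadger aggregates on its events tab:

    # Cancel anything running longer than 30 seconds in database "app"
    # ("app" is a placeholder; statement_timeout can also be set per
    # role or per session):
    psql -c "ALTER DATABASE app SET statement_timeout = '30s'"
    # A cancelled query then shows up in the Postgres log as:
    #   ERROR:  canceling statement due to statement timeout
    #   STATEMENT:  SELECT ...
    # (the STATEMENT line appears because log_min_error_statement
    # defaults to error)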
Starting point is 00:05:08 And the errors actually help with other problems too. Like recently, I was troubleshooting logical replication errors. And if you just have the plain text log and you've got just one error, it's really not easy to go thousands of lines up and see what the root-cause error was, right? Because sometimes you've got one error that is causing a whole chain of other errors, like in this case. And in pgBadger you're seeing that, which is great, because of the aggregation that pgBadger is doing. Nice. So, because we're reliant on the logs here, I'd like to ask: a lot of the default settings are to not log a lot of important things. So a lot of people's logs, if they're new to Postgres or if they're using a managed provider and haven't set any settings, won't have many of these things. So if they ran pgBadger now, I'm guessing a lot of them wouldn't get some of the reports you're talking about, because they wouldn't have the right settings. Is that fair?
Starting point is 00:06:04 Yes, absolutely. With default settings you would probably have a very poor pgBadger report, because, as you said, pgBadger is just a Perl script that takes the text logs as the input and then generates an HTML report as the output, right? So whatever is in the logs will be parsed and aggregated in this really nice pgBadger format, but if something is not in the logs, then obviously you will not see it. So, for instance, if you want to see the lock waits, you need to have, lock, log, I might mix up the name, log_lock_waits on. There's one setting in Postgres that you need to have on to be able to generate that. Yeah, log_lock_waits, I think it is.
Starting point is 00:06:59 log_lock_waits, yes, I believe this is the one. And this is responsible for the locks tab in pgBadger. So if you've got this off, you will not see anything there. Also, log_line_prefix is super important. The default prefix is pretty poor, and I'm using the one that is actually recommended by pgBadger. It contains a lot of stuff, like the database name, the username,
Starting point is 00:07:32 the application name, of course the timestamp, and it also has the host, so the IP of the server that the connection is coming from. And this allows you to have a really rich pgBadger report, because pgBadger is also able to split the queries into different tabs per database and per user. So it's able to show you, for instance, how many select queries were run on a particular database, or were issued by a particular user. But you need log_line_prefix for that; otherwise, with the really short default, you will not see a lot of useful information. But I believe the pgBadger documentation is pretty good here: you've got the entire list of the settings that you need to set, and most of them are safe.
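As a sketch, the two settings just discussed, using a prefix along the lines the pgBadger docs suggest (check the docs for the exact recommendation for your log destination; the log path below is a placeholder), followed by the basic logs-in, HTML-out invocation:

    psql -c "ALTER SYSTEM SET log_lock_waits = on"
    psql -c "ALTER SYSTEM SET log_line_prefix = '%t [%p]: user=%u,db=%d,app=%a,client=%h '"
    psql -c "SELECT pg_reload_conf()"   # both settings apply on reload
    # Then: parse the text logs, write an HTML report
    pgbadger /var/log/postgresql/postgresql-*.log -o report.html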
Starting point is 00:08:10 is pretty good like you've got the entire list of the settings that you need to set and most of that like most of that is safe like i would say most because you've got, of course, one setting, log mean duration statement, which is super, might be super dangerous, right? Of course, maybe just to explain log mean duration statement, it's the setting that is regulating like how many queries we want to log into the text logs. And by default, this is minus one, it means no. And zero means log everything. So everything means like even select one, like something that is under milliseconds, right? With zero, you will see like a huge, huge impact on your database. So, you know, you always need to kind of adjust this to your system, to the workload you've got,
Starting point is 00:09:10 how intensive, you know, your database is processing the queries. If you've got analytical database, right, like super huge data amounts that you've got, yeah, five queries per hour, probably not many of you, but let's say, look, five queries per hour. Probably not many of you, but yeah, let's say log mean duration statement zero is okay then. But with millions, thousands of queries per second, this is, yes, this is a killer.
Starting point is 00:09:34 Like I saw even like 40% overhead of the overall performance, like with log mean duration statement zero. So this is dangerous. I usually start with something around between one minute to 10 minutes, depending on the system. Oh, wow. Yeah.
Starting point is 00:09:52 So, you know, depending on the declared intensivity, like let's say of the system, depends what customer is telling me, okay, it's super intensive. It's not, right? So let's say we are starting from one minute and then we see the overhead. If it's huge, we are going up.
Starting point is 00:10:09 If it's not huge, then we try and go a bit down with the setting. When you say one minute, you mean only log the duration of statements that take longer than 60 seconds? Yes, yes, yes, yes, yes. I thought you were going to go much lower lower but that makes a lot of sense just in case there's a huge overhead exactly exactly this is like to be on the you know safe side right because otherwise if you've got production system then you go too
Starting point is 00:10:39 low and then you know you've got an outage and and i don't think this requires a restart right so you can reload the config and it's not a huge overhead to changing that one but yeah that's a really good exactly without that we get almost nothing like we don't get any uh reporting on the statements but i guess the trade-off is if we if our system's biggest bottleneck on the cp let's say the customer mentioned they have high cpu and that's mostly being caused by millions of executions of very fast queries then we don't spot them so it's that trade-off between slow queries being the issue versus lots of fast queries being the issue it is but you know in this kind of this is a very good
Starting point is 00:11:21 point because you are right like i i saw outages when we had like yes a lot of super small queries right that were executed super super uh often like thousand times a second and you know we've got two ways actually to do that one way would be like to have some kind of cron job that would change log mean duration statement to zero and after some short period of time like switch back all right that would be one and yeah and and this the second one would be there is like the new method that i haven't tried yet on production but there is this something so you know in postgres like i haven't tried it yet um but it might be something to consider so for now i was just doing cron job but to be to be very honest with you uh it's a
Starting point is 00:12:13 problem when you are analyzing then the report right because the yeah you see really weird like charts so you always need to keep that in mind that that you've got this switch in the cron job yeah or for the fast queries rely on something like pgstat statements if you have that on if you have that on already for example but um back to pg badger i've got i looked at the docs and some of the other ones that people might want to switch on that have low overhead stop me if i mention one that is dangerous but we have log checkpoints log connections log disconnections log lock weights you mentioned already log temp files do you do any so the recommendation in the pg badge docs is to set that to zero no i just set it to zero you know okay great so that's good that that's
Starting point is 00:13:03 you don't see that as being anywhere near as dangerous i have seen some recommendations to start with the higher setting of that and go downwards as well which i guess people can do if they want to be extra cautious but if you've tuned workmem at all already then i guess you should be pretty safe to put that to zero i've never seen any problem when i tweak twitch temp files or checkpoints, to be very honest. Like maybe there are some narrow use cases, but for, you know, over the years, I've never seen any problems. So we do it with other settings, like the same with auto vacuum. Like people also are a bit afraid about the auto vacuum,
Starting point is 00:13:36 but it's not, it's really, it's a couple of lines like per minute. And yeah, it will not hit your system that much. Nice. And yeah, you've got one I i forgot there which is log auto vacuum min duration set that to zero as well so that we get the important vacuum stuff great so that's that lets people get get this set up and i've checked some hosting so seeing as you work on a managed process provided do you have some of these set by default or is this something that people need to go in and change yes for instance uh yeah because i work at microsoft it's it's the easiest for me to talk about how we we've got the setup down right i don't know that much other cloud providers but
Starting point is 00:14:17 we've got log connections log these connections to on and users cannot change it it's read only and it's because we need it for the internal telemetry of ours so this is just on also you know pgpager needs to have logs in english so also this is by default on azure so a lot of them are already enabled already on like on azure so this is this is good news but still you need to enable logs, you need to enable auto vacuum, you need to enable logging for other stuff and logline prefix. I know that not on all cloud providers
Starting point is 00:14:57 you are able to change logline prefix. As far as I know, at least. Yes, we do allow that. So this is good news, but I've heard that not on all, like probably AWS doesn't allow that, as I remember. At least it wasn't a couple of years ago.
Starting point is 00:15:15 I remember writing like a script like to parse the logs to be able to generate like pgPager because we were not able to change logline prefix. And yes, they also have this this they had really weird formats so we needed like to script that would parse that but good news is that a pg badger introduced a support for this log format so you've got separate format which is rds so you don't need to have custom scripts anymore so this is this is good news yeah i was so impressed reading through the pg badge docs all the different formats they support
Starting point is 00:15:50 it's it's incredibly flexible tool almost as if it's designed by people that are using it and short by people that use it all the time but definitely tons of options and lots of auto recognizing formats and file types and things like that. So for a lot of people, I would guess it would just work. And for people on cloud providers, you've written a great post for using it with Azure, but there's also, I found a couple of posts from people from the RDS team and from Google Cloud SQL teams that explain how to use it with their systems as well. If any other cloud providers are listening and have guides,
Starting point is 00:16:26 please send them to me and I'll include them up in the show notes. But yeah, it feels like this is a tool people could be using even if they're unmanaged Postgres providers. Exactly. It makes me really happy to see the blog posts from AWS and Google Cloud because exactly as you said, it means that they're using PgPager, like the users of the ATt manage services are using right the pg badger so it was it was probably needed right nice yep and also it means this thing can
Starting point is 00:16:52 live on right it's we don't have to completely replace it with cloud native tools it still should work right still postgres the only place that currently puts some of this information is in the logs so at some point we're going to have to look at the logs and having a tool to do so often helps cool all right so we've mentioned the min duration being a potential gotcha are there any others that we need to be cautious of when using this about the parameter settings actually this is like the the most dangerous so uh not that much i would say like the other thing you might have problem with like it's the size of the lock itself so i had sometimes like a problem you know when i was like receiving from the customer just terabytes of locks and it was okay here are my logs just generate whatever you want from that, right? And yeah, that also might be a problem.
Starting point is 00:17:47 And although, you know, PgPager has this minus J option, like most of Postgres tools, right? It's still like on my own laptop, it was like days. So yes, so still like if you've got like really, you know, huge amount of logs you probably still need powerful vm just to parse the logs otherwise like yeah it will not work yeah i saw i saw it has both dash j lowercase and dash j uppercase so it has like two different ways of handling um i i think the first one is for parallelizing,
Starting point is 00:18:25 like having multiple CPUs handling a single large log file. I didn't hear terabytes. All the examples I saw were in gigabytes. So that's quite scary. But I think dash capital J was even faster if you had, let's say, like if you were given 20 large log files
Starting point is 00:18:41 and you wanted to give 20 CPUs and handle them one at a time, like one CPUpu per file so that's quite cool that it has multiple ways of handling like huge workloads but terabytes of logs is terrifying um was it did they have to so i saw in that like it has so many options by the way i have a in fact a quick trivia question for you it has so many dash single letter commands that i noticed there are only two letters in the english alphabet that it doesn't have a uh dash that letter can you guess which letters they are oh it's not gonna be easy to be honest only k and y um it's not a fair i
Starting point is 00:19:23 only k and y and and there is even a they even have an option that's keep comments so my my recommendation to the pg badger team is to to make that the k and then they only have to find one more and they have the whole alphabet um anyway sorry that's a that's not very useful for people but no no this this is actually great point because sometimes you know the customers has got like some non-common questions, right? For instance, I had a customer that had a problem, like performance problem, and they wanted to share PgPager with me, but they were not able to because me the queries and i didn't even know but yes pgpager has so many options that i quickly check and there is like this anonymize option like yeah so you can you can find a lot of stuff like really a lot of options uh in pgpager so this is actually a great
Starting point is 00:20:19 point i was going to ask about two of them actually b, B and E, so beginning and end. If we have a huge log file, but let's say it's from a month of logs or something, and we only care about, let's say, a day or two that we had problems, if we set beginning and end, would that drastically speed things up as well? Or are we still going to... Depends, right, of the size of the file so it's more like operating system question i would say because you know sometimes if you've got really huge file like it you need to read your operating system need to read it anyways right so in this case might not help that much so you know it's it's long story but you know it's really huge logs like huge files i've been trying
Starting point is 00:21:08 like a lot of stuff and i've been trying like to split the files uh you know using just the comments like i was trying like a lot of different ways but finally what works the best it's just huge vm like this is this is the easiest way and you just put that on huge vm and you've got it solved nice cool okay i saw you hosted a one of the pg sequel friday we've done a couple of submission a podcast submissions to that event and yours was on pg badger but i saw submitted a post and one of the things they mentioned was flexibility so they they really like to typical postgres consultant or not typical but uh extra on the postgres side they pass the logs into a postgres database in order to be able to query them
Starting point is 00:21:58 but like giving them a little bit more flexibility so i guess we are somewhat limited by the reports that pg badger gives us but in the real world it's giving us so much useful flexibility so i guess we are somewhat limited by the reports that pg badger gives us but in the real world it's giving us so much useful information that i guess those cases are a bit more like few and far between and we get most of the way there just with the pg badger report is that fair you know i yeah i saw the answer from leticia and you know this is actually interesting like parsing it this to the database especially like i had a lot of ask about that from ex-Oracle DBAs, right? They always asking, we're asking if there is a way like to do that, right? So it might be useful.
Starting point is 00:22:35 But when I was a consultant, because I'm not that much anymore, like I do prefer like to use something that I can just ping a link to, like the tool that is already there and it works, and just pinging the link generates me data, focusing on writing my own tools. But okay, I don't say it's wrong or something, but it's just easier for me. And also what I like is the visual part,
Starting point is 00:23:03 because the visual part is really something that allows you super quickly to go to the root cause. It's not that easy with the text. If you're doing the query and you've got the result, okay, but you need to focus a bit more. And with Pidgey Badger, you've got this chart and you just, you know, see everything straight away. So it's way easier, way faster for me. And you've got like all the answers straight away. So, you know, it's just a matter of preferences, I would say.
Starting point is 00:23:35 My preference is to have graphical presentation. Yeah, that's a huge pro for a lot of people, me included. You've also mentioned another one there which is which i don't think we've mentioned is how tried and tested this is this is a i think i looked it up and it was 2012 pg badger v1 came out april or may time and it's version 12.4 now with a ton of bug fixes and all sorts of feature improvements since then so to to start writing your own tool we've already mentioned some of the features we get but there's honestly dozens of settings uh that that it has more than the ones i mentioned not just not just every letter covered
Starting point is 00:24:16 but multiple letters have lowercase and uppercase and there are quite a few that don't have shortcuts like there's another few dozen that are just uh you know you have to set separately so you're yeah it's incredibly flexible and tested like it's been tested for many many years which is worth it worth a lot exactly exactly you know like to be honest to write your own tool that would be that comprehensive it would have like so many dashboards so many information in it like it's years you know so yeah i do prefer to have this tool instead of you know writing writing my own but i totally get people like you know that prefer to write their own tools nice and is there anything in that they could be doing like i think is is it Gilles Deroux?
Starting point is 00:25:06 I don't know quite how to pronounce their name, but the maintainer. Is there anything they could be doing or is there anything you'd like to see in PgBadger in the future? I would like to see the support for Azure. That would be my main thing. And, you know, probably that also like the thing about the logs, and I know this is not easy because it's not that easy to get that from the logs. But in the logs tab, you would see only the queries that were logged, not the queries that caused the log. And of course, there is a simple reason because of that because it's how postgres is doing this right but sometimes if you've got log min duration statements to zero right
Starting point is 00:25:52 and you've got pid in um in the prefix you can just correlate right because postgres would normally tell you like okay uh this PID is blocking your query. So then you can search for this PID and maybe you've got the query and try to correlate that. Because it's not that useful if you just see the queries that are locked. And it's not what is actually locking the queries. So you need to have additional cron job with the log query to compare with.
Starting point is 00:26:29 So yeah, that's for sure would be an ask, but I'm not sure if this is actually the ask to Giles. I also am not sure how to announce his name, so I'm saying Giles. I'm not sure if this is like an ask to Giles or more to postgres
Starting point is 00:26:46 commuters to be honest like just to have that in the logs yeah to log more information so yeah that yeah makes sense that it might be there cool and yeah it's released under the postgresql license which is one of the more most permissive licenses i've ever seen which is awesome so thank you for doing that thanks for all your hard work on it is there anything else you'd like people to know nothing coming to my mind to be honest well actually i had because you started with um throwing shade at pgstat statements i did wonder if there were any so like for example with pg set statements we can see things like buffers like we can see the amount of work done by each query so reads right so temp shared that kind of
Starting point is 00:27:33 thing i i don't think is there a way to see that kind of information i guess we could look at the auto explain logs but i don't think pg badger is doing anything at the moment to aggregate those per query no i've never seen that just thinking if it would have like more uh if we would change error verbosity if that would be logged and then presented but i haven't seen that so i don't i don't want to guess but you know like for me actually the PG Badger, it's like I'm going to the top queries. And when I see top query, like, yeah, maybe let's go first through the process that will be easier. So, you know, when I see the top queries, sometimes it's just, you know, you just see what's wrong. Like it's really pretty often, like much more often that,
Starting point is 00:28:26 you know, you would guess like, just you see, okay, this is not in, right? Or this is something similar. So you just see that, right? So you don't need to even run explain analyze. But if you don't, then I just, you know, take the query and run explain analyze on on that and yeah that's the
Starting point is 00:28:47 next step i would do so but yes you are right and then you can ask for buffers exactly and then like buffers yes absolutely i would i would use the option so just um to clarify on that i think this is great so top queries is a really good point and it orders those by total execution time so across all calls right so you see it could be a slow query that's run a few times that's taking up the most resources or it could be a fairly fast query that's being run thousands of times per second as you mentioned and they're ordered by i believe the main report is by total execution time is that right you've got got like couple sections on this page. So you've got the most important one that I'm using.
Starting point is 00:29:30 It's exactly the one that you mentioned. So the total execution time, that this is always like the first thing I'm looking at, like, you know, what took the most time of this, you know, from the system, which queries, and it doesn't matter for me, just one query, or it was like multiple execution of of
Starting point is 00:29:45 another queries right so this is like probably the third uh section from the top but you also have another aggregations like uh you know like the uh slowest individual queries other stuff but i i usually don't look at that to be very honest uh i just you you know, I just taking the, this is absolutely my favorite. If I need to then, yeah, I go, but it's absolutely my main point, like to go there and see. Nice. That's what I like to recommend people do
Starting point is 00:30:14 on PGStats statements as well. But I guess with this one, with PgBadger, we get more of an opinionated, shows people what to look at rather than PGStats statements. People can query and order by whatever they want. So I see a lot more people making that mistake or just looking at their slowest queries rather than the totals amazing alicia thanks so much for your time today is there any anything
Starting point is 00:30:36 you want to plug or anything you want to share with people actually if i may plug my songs like because i just realized not many people like to know my songs. I produced two songs. The first was about the vacuum and the second one, actually, you are author of the lyrics, right? So, you know, there are like two Postgres songs out there. And yes, we are co-producers of the second one. So, yeah, that would be absolutely my my plug that's kind of you the lyrics are extremely basic for that second one so i apologize to people in advance
Starting point is 00:31:11 but you did an awesome job with the vocals and the there was not me the vocals and the video as well was a nice one of all the postgres events so yeah you did an awesome job i will happily link those up in the show notes thank you thank you so much alicia it's been a pleasure take care thank you michael bye
