Postgres FM - pgBadger
Episode Date: January 12, 2024
Michael is joined by Alicja Kucharczyk, Program Manager for Azure Database for PostgreSQL at Microsoft and organiser of the Warsaw PostgreSQL User Group, to discuss Alicja's favorite tool: pgBadger. They cover what it is, how and when it's useful, and some guidance for getting the most out of it (without causing an outage).
Here are some links to things they mentioned:
pgBadger website https://pgbadger.darold.net
pgBadger repo https://github.com/darold/pgbadger
An example pgBadger report https://pgbadger.darold.net/examples/sample.html
pg_stat_statements https://www.postgresql.org/docs/current/pgstatstatements.html
Azure guide https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/how-to-generate-pgbadger-report-from-azure-database-for/ba-p/3756328
Google Cloud SQL guide https://cloud.google.com/blog/products/databases/guide-to-the-database-observability-with-cloud-sql
RDS guide https://aws.amazon.com/blogs/database/optimizing-and-tuning-queries-in-amazon-rds-postgresql-based-on-native-and-external-tools/
Community Insights on pgBadger (PGSQL Phriday Recap) https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/community-insights-on-pgbadger-a-pgsql-phriday-010-recap/ba-p/3880911
PGSQL Phriday #010: Log analysis (blog post by Lætitia Avrot) https://mydbanotebook.org/post/log-analysis/
Nothing Compares To VACUUM / The Ballad of Bloat https://www.youtube.com/watch?v=2NxIngqq1y0
Explain Analyze (Feliz Navidad cover) https://www.youtube.com/watch?v=qznnzYZPdkM
~~~
What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!
~~~
Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai
With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and today I am joined by Alicja Kucharczyk,
Program Manager for Azure Database for PostgreSQL at Microsoft
and Organizer of the Warsaw PostgreSQL User Group.
Thank you so much for joining me today, Alicja. It's an honor to have you here.
Thank you for having me, Michael.
Wonderful. Well, I proposed a topic this week, but mainly
because you're one of the biggest fans or at least advocates of this tool, I think in the world.
But it's probably an underused tool, in my opinion. Anyway, I definitely don't use it at times when I could or should, and I know most people that could be using it have never even heard of it, at least newer users. So what are we going to be talking about today?
Yeah, about my favorite tool, which is pgBadger, which is totally open source, which is really great. It allows you to do really comprehensive troubleshooting, and I believe it's, yeah, top of the class.
Yeah absolutely do you want to give us a little bit of background on the tool or how people use it?
Yes, absolutely. So maybe I will start with my own story, because I learned that the tool exists, I believe, over 10 years ago, when I was working for a Polish bank. And the process there was, each time something was happening, they started to generate a pgBadger report.
And that was like just always the first step that they've been doing.
So that was the place where I learned that this tool exists.
And over the years, actually, because, you know, I was moving from company to company and I started to be a Postgres consultant, that became my main tool for doing troubleshooting. Because, you know, as a consultant you don't really know the system. You're going to the customer, something has happened, and the customer is saying, okay, for instance, the CPU is very high, or the latency of a query has suddenly grown, but you still don't really have any knowledge about the system, right? And pgBadger is really one of the best tools to really quickly get knowledge about the workload that is happening within your database. And I also really like that pgBadger is showing you a really comprehensive picture, because you might know other tools, similar tools, like pg_stat_statements.
And it's not that I want to say
that this is a bad tool or anything.
I don't use pg_stat_statements.
I know that many people do. But, you know, pg_stat_statements is actually just one tab in the pgBadger report. And next to that, you've also got a report about the temporary files, and you've got really nicely aggregated errors. So everything that you've got in the logs that is an error, fatal, or warning is really nicely aggregated on the events tab.
You would also have the locks, locks with a C-K, right? Not logs, but locks. I know it's easy to mix these two. So you also have the locks, and you've got the number of connections, the number of sessions. So you really have a comprehensive view.
For instance, if you've got pg_stat_statements, you see, okay, this query was executed 1,000 times, the total time was this, and the average time was this, right? You would have the same thing in pgBadger, but the difference is, in pgBadger you can quickly go to another tab, for instance the temporary files tab, and see that, okay, this execution time was actually caused by this query generating temporary files, or maybe it was waiting for a lock. So that's why, you know, I do prefer pgBadger over pg_stat_statements and other tools, because it's giving me this comprehensive view.
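For comparison, a minimal sketch of the kind of pg_stat_statements ordering being described, by total time across all calls (column names assume Postgres 13 or newer; older versions use total_time):

    SELECT query, calls, total_exec_time, mean_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC   -- queries taking the most total time first
    LIMIT 10;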
Yeah, I like it.
Shots fired right up top.
I'm a big fan of pg_stat_statements as well,
so I'm hoping to be somewhat converted here.
I definitely think there's pros of both.
So pgBadger is a log analysis tool. It's a standalone tool as well, right? So open source, as you said, we can install it wherever and analyze our log files for anything that we can spit out to the log. So, as you mentioned, locks, wait events, errors as well. That's a big missing part of pg_stat_statements.
So it's quite common advice to have a timeout, right? If you have queries that are taking more than, I don't know, on some systems it might be 30 seconds, time them out instead of letting them complete. In pg_stat_statements we don't see those at all, but in pgBadger we can see those because of the errors.
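For illustration, a timeout like that is usually configured with statement_timeout; a minimal sketch, where the database name and the 30-second value are only placeholders:

    ALTER DATABASE app_db SET statement_timeout = '30s';  -- app_db is a placeholder name
    -- statements exceeding this are cancelled and logged as errors,
    -- which pgBadger then aggregates on its events tab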
Absolutely, yes. And the errors actually, you know, help with other problems. Like, recently I was troubleshooting logical replication errors. And if you just have the plain text log and you've got just one error, it's really not easy to go, you know, thousands of lines up and see what the root cause error was, right? Because sometimes you've got one error and this is causing all the other errors, right, like in this case. And in pgBadger you're seeing that, which is great, because of the aggregation that pgBadger is doing.
Nice. So, because we're reliant on the logs here, I would like to ask: a lot of the default settings are to not log a lot of important things. So a lot of people's logs, if they're new to Postgres, or if they're using a managed provider and haven't set any settings, won't have many of these things. So if they ran pgBadger now, I'm guessing a lot of them wouldn't get some of the reports you're talking about, because they wouldn't have the right settings. Is that fair?
Yes, yes, absolutely. With the default settings you would probably have a very poor pgBadger report, because, as you said, pgBadger is just a Perl script that takes the text files, the text logs, as the input and then generates an HTML report as the output, right? So whatever is in the logs will be parsed and aggregated in this really nice pgBadger format, but if something is not in the logs, then obviously you will not see that, right?
So, for instance, if you want to see the lock wait times, you need to have... lock, log... I might mix the name... locks on. There's one setting in Postgres that you need to have on, to log locks, to be able to generate that.
Yeah, log_lock_waits, I think it is.
log_lock_waits, yes. I believe this is the one. And this is responsible for the locks tab in pgBadger. So if you've got this off, you will not see anything.
Also, the log_line_prefix is super important. The default prefix is pretty poor. I'm using the one that is actually recommended by pgBadger, and it contains a lot of stuff, like the database name, username, application name, of course the timestamp, and it also has the host, so the IP that the connection is coming from.
And this allows you to have a really rich pgBadger report, because pgBadger is also able to split the queries into different tabs, per database, per user. So it's able to show you, for instance, how many SELECT queries were run on a particular database or were issued by a particular user. But you need the log_line_prefix for that; otherwise, if you've got the really short default, you will not see a lot of useful information. But, you know, I believe the pgBadger documentation is pretty good here; you've got the entire list of the settings that you need to set.
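As a rough sketch in postgresql.conf, the kind of prefix being described looks something like this (check the pgBadger docs for the exact recommended string):

    log_line_prefix = '%t [%p]: user=%u,db=%d,app=%a,client=%h '  # timestamp, PID, user, database, application, client host
    log_lock_waits = on                                           # feeds the locks tab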
And most of that, I would say most, is safe, because you've got, of course, one setting, log_min_duration_statement, which might be super dangerous, right? Maybe just to explain log_min_duration_statement: it's the setting that regulates how many queries we want to log into the text logs. By default it is minus one, which means none, and zero means log everything. So everything means even a SELECT 1, something that takes under a millisecond, right? With zero, you will see a huge, huge impact on your database. So, you know, you always need to kind of adjust this
to your system, to the workload you've got, how intensively your database is processing queries. If you've got an analytical database, right, with super huge amounts of data and, let's say, five queries per hour, probably not many of you have that, but let's say, then log_min_duration_statement = 0 is okay. But with thousands or millions of queries per second, this is a killer. I saw even a 40% overhead on overall performance with log_min_duration_statement at zero.
So this is dangerous.
I usually start with something between one minute and 10 minutes,
depending on the system.
Oh, wow.
Yeah.
So, you know, depending on the declared intensity of the system, let's say, depending on what the customer is telling me, okay, it's super intensive or it's not, right? So let's say we are starting from one minute and then we see the overhead. If it's huge, we go up. If it's not huge, then we try to go a bit lower with the setting.
When you say one minute,
you mean only log the duration of statements
that take longer than 60 seconds?
Yes, yes, yes, yes, yes.
I thought you were going to go much lower, but that makes a lot of sense, just in case there's a huge overhead.
Exactly, exactly. This is to be on the, you know, safe side, right? Because otherwise, if you've got a production system and you go too low, then, you know, you've got an outage.
And I don't think this requires a restart, right? So you can reload the config, and it's not a huge overhead to change that one. But yeah, that's a really good point.
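A minimal sketch of that reload-only change, using the one-minute starting point mentioned above:

    ALTER SYSTEM SET log_min_duration_statement = '1min';
    SELECT pg_reload_conf();  -- picked up without a restart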
Exactly.
Without that, we get almost nothing. Like, we don't get any reporting on the statements. But I guess the trade-off is, if our system's biggest bottleneck is on the CPU, let's say the customer mentioned they have high CPU and that's mostly being caused by millions of executions of very fast queries, then we don't spot them. So it's that trade-off between slow queries being the issue versus lots of fast queries being the issue.
It is, but, you know, this is a very good point, because you are right. I saw outages when we had, yes, a lot of super small queries, right, that were executed super often, like a thousand times a second. And we've got two ways, actually, to deal with that. One way would be to have some kind of cron job that would change log_min_duration_statement to zero and, after some short period of time, switch it back. That would be one way.
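A rough sketch of that kind of cron job; the ten-minute window, schedule, and connection details are made up for illustration:

    # crontab: log everything for ten minutes each morning, then switch back
    0 9 * * *  psql -c "ALTER SYSTEM SET log_min_duration_statement = 0" -c "SELECT pg_reload_conf()"
    10 9 * * * psql -c "ALTER SYSTEM SET log_min_duration_statement = '1min'" -c "SELECT pg_reload_conf()"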
And the second one... there is a new method, you know, in Postgres that I haven't tried yet on production, but it might be something to consider. So for now I was just doing the cron job. But to be very honest with you, it's then a problem when you are analyzing the report, right? Because you see really weird charts. So you always need to keep in mind that you've got this switch in the cron job.
Yeah, or for the fast queries, rely on something like pg_stat_statements, if you have that on already, for example. But back to pgBadger. I looked at the
docs and some of the other settings that people might want to switch on that have low overhead. Stop me if I mention one that is dangerous, but we have log_checkpoints, log_connections, log_disconnections, log_lock_waits, which you mentioned already, and log_temp_files. So the recommendation in the pgBadger docs is to set that one to zero?
No, I just set it to zero, you know.
Okay, great. So that's good, you don't see that as being anywhere near as dangerous. I have seen some recommendations to start with a higher setting for that and go downwards as well, which I guess people can do if they want to be extra cautious. But if you've tuned work_mem at all already, then I guess you should be pretty safe to set that to zero.
I've never seen any problem when I tweak temp files or checkpoints, to be very honest. Maybe there are some narrow use cases,
but for, you know, over the years, I've never seen any problems.
It's the same with other settings, like with autovacuum.
Like people also are a bit afraid about the auto vacuum,
but it's not, it's really, it's a couple of lines like per minute.
And yeah, it will not hit your system that much.
Nice. And yeah, you've got one I forgot there, which is log_autovacuum_min_duration. Set that to zero as well, so that we get the important vacuum stuff. Great, so that lets people get this set up.
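Pulling those together, a rough postgresql.conf sketch of the low-overhead settings discussed here, with the cautious log_min_duration_statement starting point from earlier:

    log_checkpoints = on
    log_connections = on
    log_disconnections = on
    log_lock_waits = on
    log_temp_files = 0                   # log every temporary file, with its size
    log_autovacuum_min_duration = 0      # log every autovacuum run
    log_min_duration_statement = '1min'  # lower this carefully once the overhead is known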
And I've checked some hosting providers. So, seeing as you work on a managed Postgres provider, do you have some of these set by default, or is this something that people need to go in and change?
Yes. For instance, because I work at Microsoft, it's easiest for me to talk about how we've got the setup, right? I don't know that much about other cloud providers, but we've got log_connections and log_disconnections set to on, and users cannot change it. It's read-only, and that's because we need it for our own internal telemetry, so this is just on. Also, you know, pgBadger needs the logs to be in English, and this is also the default on Azure. So a lot of these are already enabled on Azure, which is good news. But you still need to enable the lock waits, you need to enable autovacuum logging, you need to enable logging for other stuff, and the log_line_prefix. I know that not on all cloud providers you are able to change log_line_prefix. As far as I know, at least.
Yes, we do allow that.
So this is good news,
but I've heard that not on all,
like probably AWS doesn't allow that,
as I remember.
At least it wasn't possible a couple of years ago. I remember writing a script to parse the logs, to be able to generate the pgBadger report, because we were not able to change log_line_prefix.
And yes, they also had really weird formats, so we needed a script that would parse that. But the good news is that pgBadger introduced support for this log format, so you've got a separate format, which is rds, and you don't need custom scripts anymore. So this is good news.
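On the command line that looks roughly like this; the file name is just a placeholder:

    pgbadger -f rds downloaded_rds_postgresql.log -o report.html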
Yeah, I was so impressed reading through the pgBadger docs at all the different formats they support. It's an incredibly flexible tool, almost as if it's designed by people that are using it, by people that use it all the time. But definitely tons of options, and lots of auto-recognition of formats and file types and things like that.
So for a lot of people, I would guess it would just work.
And for people on cloud providers, you've written a great post for using it with Azure,
but there's also, I found a couple of posts from people from the RDS team
and from Google Cloud SQL teams that explain how to use it with their systems as well.
If any other cloud providers are listening and have guides,
please send them to me and I'll include them in the show notes.
But yeah, it feels like this is a tool people could be using even if they're on managed Postgres providers.
Exactly. It makes me really happy to see the blog posts from AWS and Google Cloud, because, exactly as you said, it means that they're using pgBadger, like the users of their managed services are using pgBadger. So it was probably needed, right?
Nice. Yep, and also it means this thing can live on, right? We don't have to completely replace it with cloud-native tools; it still should work. The logs are still the only place Postgres currently puts some of this information, so at some point we're going to have to look at the logs, and having a tool to do so often helps. Cool. All right, so
we've mentioned the min duration setting being a potential gotcha. Are there any others that we need to be cautious of when using this?
About the parameter settings, actually, this is the most dangerous one, so not that much, I would say. The other thing
you might have a problem with is the size of the log itself. So I sometimes had a problem, you know, when I was receiving from the customer just terabytes of logs, and it was, okay, here are my logs, just generate whatever you want from that, right? And yeah, that also might be a problem.
And although, you know, pgBadger has this -j option, like most Postgres tools, right, it still took days on my own laptop. So yes, if you've got a really huge amount of logs, you probably still need a powerful VM just to parse the logs; otherwise it will not work.
Yeah, I saw it has both dash-j lowercase and dash-J uppercase, so it has two different ways of handling that. I think the first one is for parallelizing, having multiple CPUs handle a single large log file.
I didn't hear terabytes.
All the examples I saw were in gigabytes.
So that's quite scary.
But I think dash capital J was even faster if you had, let's say, 20 large log files and you wanted to give it 20 CPUs and have them handled one at a time, one CPU per file. So that's quite cool that it has multiple ways of handling huge workloads.
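Roughly, the two options look like this; file names and CPU counts are placeholders:

    pgbadger -j 8 postgresql.log -o report.html       # up to 8 CPUs parsing a single large file
    pgbadger -J 20 postgresql-*.log -o report.html    # one process per file across many files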
But terabytes of logs is terrifying. I saw that it has so many options, by the way. In fact, I have a quick trivia question for you: it has so many dash single-letter options that I noticed there are only two letters in the English alphabet that it doesn't use. Can you guess which letters they are?
Oh, it's not going to be easy, to be honest.
Only K and Y. It's not a fair question. Only K and Y. And they even have an option that's keep comments, so my recommendation to the pgBadger team is to make that the -K, and then they only have to find one more and they have the whole alphabet. Anyway, sorry, that's not very useful for people.
No, no, this is actually a great point, because sometimes,
you know, the customers have got some uncommon questions, right? For instance, I had a customer that had a performance problem, and they wanted to share a pgBadger report with me, but they were not able to, because they couldn't show me the queries. And I didn't even know, but yes, pgBadger has so many options that I quickly checked, and there is this anonymize option. So you can find a lot of stuff, really a lot of options, in pgBadger. So this is actually a great
point.
I was going to ask about two of them, actually, -b and -e, so beginning and end. If we have a huge log file, but let's say it's from a month of logs or something,
and we only care about, let's say, a day or two that we had problems,
if we set beginning and end, would that drastically speed things up as well?
Or are we still going to...
It depends, right, on the size of the file. So it's more of an operating system question, I would say, because, you know, sometimes if you've got a really huge file, your operating system needs to read it anyway, right? So in this case it might not help that much. So, you know, it's a long story, but with really huge logs, huge files, I've been trying a lot of stuff. I've been trying to split the files, you know, using just the commands; I was trying a lot of different ways, but finally what works the best is just a huge VM. This is the easiest way: you just put that on a huge VM and you've got it solved.
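For illustration, restricting the report to a time window and hiding the query literals might look like this; the dates and file name are placeholders:

    pgbadger -b "2024-01-10 00:00:00" -e "2024-01-12 00:00:00" --anonymize postgresql.log -o report.html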
Nice, cool. Okay, I saw you hosted one of the PGSQL Phriday events. We've done a couple of podcast submissions to that event, and yours was on pgBadger. But I saw Lætitia submitted a post, and one of the things they mentioned was flexibility.
So, like a typical Postgres consultant, or not typical, but very much on the Postgres side, they parse the logs into a Postgres database in order to be able to query them, which gives them a little bit more flexibility. So I guess we are somewhat limited by the reports that pgBadger gives us, but in the real world it's giving us so much useful information that I guess those cases are a bit more few and far between, and we get most of the way there just with the pgBadger report. Is
that fair?
You know, yeah, I saw the answer from Lætitia, and, you know, this is actually interesting,
like parsing this into the database. Especially, I had a lot of asks about that from ex-Oracle DBAs, right? They were always asking if there is a way to do that, right? So it might be useful. But when I was a consultant, because I'm not that much anymore, I do prefer to use something that I can just ping a link to, the tool that is already there and it works, and just pinging the link generates the data for me, rather than focusing on writing my own tools.
But okay, I don't say it's wrong or something,
but it's just easier for me.
And also what I like is the visual part,
because the visual part is really something that allows you super quickly to go to the root cause.
It's not that easy with the text.
If you're doing the query and you've got the result, okay,
but you need to focus a bit more.
And with pgBadger, you've got this chart and you just, you know, see everything straight away.
So it's way easier, way faster for me.
And you've got like all the answers straight away.
So, you know, it's just a matter of preferences, I would say.
My preference is to have graphical presentation.
Yeah, that's a huge pro for a lot of people, me included.
You've also mentioned another one there, which I don't think we've mentioned, which is how tried and tested this is. I think I looked it up, and pgBadger v1 came out in 2012, around April or May, and it's version 12.4 now, with a ton of bug fixes and all sorts of feature improvements since then. So to start writing your own tool... we've already mentioned some of the features we get, but there are honestly dozens of settings it has, more than the ones I mentioned. Not just every letter covered, but multiple letters have lowercase and uppercase, and there are quite a few that don't have shortcuts, like there's another few dozen that you have to set separately. So yeah, it's incredibly flexible and tested, like it's been tested for many, many years, which is worth a lot.
Exactly, exactly. You know, to be honest, to write your own tool that would be that comprehensive, it would have so many dashboards, so much information in it, it's years, you know. So yeah, I do prefer to have this tool instead of writing my own, but I totally get people, you know, that prefer to write their own tools.
Nice. And is there anything that they
could be doing? Like, I think, is it Gilles Darold? I don't know quite how to pronounce their name, but the maintainer. Is there anything they could be doing, or is there anything you'd like to see in pgBadger in the future?
I would like to see the support for Azure.
That would be my main thing.
And, you know, probably also the thing about the locks, and I know this is not easy, because it's not that easy to get that from the logs. In the locks tab, you would see only the queries that were locked, not the queries that caused the lock. And of course, there is a simple reason for that, because it's how Postgres is doing this, right? But sometimes, if you've got log_min_duration_statement at zero, right, and you've got the PID in the prefix, you can just correlate, right? Because Postgres would normally tell you, okay, this PID is blocking your query. So then you can search for this PID, and maybe you've got the query, and you try to correlate that. Because it's not that useful if you just see the queries that are locked and not what is actually locking the queries. So you need to have an additional cron job with the logged queries to compare with.
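As a rough sketch of that manual correlation, assuming %p (the PID) is in log_line_prefix and log_lock_waits is on; the PID value is a placeholder:

    grep "still waiting for" postgresql.log   # lock wait messages; the DETAIL line names the blocking process
    grep "\[12345\]" postgresql.log           # then pull every line logged by that blocking PID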
So yeah, that's for sure would be an ask, but I'm not sure if this is actually an ask to Gilles. I am also not sure how to pronounce his name, so I'm saying Gilles. I'm not sure if this is an ask to Gilles or more to the Postgres community, to be honest, just to have that in the logs.
Yeah, to log more information. So yeah, that
makes sense, that it might be there. Cool. And yeah, it's released under the PostgreSQL license, which is one of the most permissive licenses I've ever seen, which is awesome. So thank you for doing that, thanks for
all your hard work on it. Is there anything else you'd like people to know?
Nothing coming to my mind, to be honest.
Well, actually, I had one, because you started with, um, throwing shade at pg_stat_statements, so I did wonder. For example, with pg_stat_statements we can see things like buffers, like we can see the amount of work done by each query, so reads, right, temp, shared, that kind of thing. Is there a way to see that kind of information? I guess we could look at the auto_explain logs, but I don't think pgBadger is doing anything at the moment to aggregate those
per query.
No, I've never seen that. I'm just thinking, if we would change the error verbosity, whether that would be logged and then presented, but I haven't seen that, so I don't want to guess. But, you know, for me, actually, with pgBadger, it's like I'm going to the top queries.
And when I see top query, like, yeah, maybe let's go first through the process that will be easier.
So, you know, when I see the top queries, sometimes it's just, you know, you just see what's wrong.
It's really pretty often, much more often than you would guess, that you just see, okay, this is a NOT IN, right? Or something similar. So you just see that, right? So you don't even need to run EXPLAIN ANALYZE.
But if you don't, then I just, you know, take the query and run EXPLAIN ANALYZE on that, and yeah, that's the next step I would do.
But yes, you are right, and then you can ask for buffers?
Exactly, and then buffers, yes, absolutely, I would use the option.
So just to clarify on that, I think this
is great. So top queries is a really good point, and it orders those by total execution time, so across all calls, right? So you see, it could be a slow query that's run a few times that's taking up the most resources, or it could be a fairly fast query that's being run thousands of times per second, as you mentioned. And they're ordered by, I believe the main report is by total execution time, is that right?
You've got a couple of sections on this page.
So you've got the most important one that I'm using.
It's exactly the one that you mentioned.
So the total execution time, this is always the first thing I'm looking at, like, you know, what took the most time from the system, which queries, and it doesn't matter for me if it was just one query or multiple executions of other queries, right? So this is probably the third section from the top. But you also have other aggregations, like, you know, the slowest individual queries and other stuff, but I usually don't look at that, to be very honest. I just take this one; this is absolutely my favorite.
If I need to then, yeah, I go,
but it's absolutely my main point,
like to go there and see.
Nice.
That's what I like to recommend people do with pg_stat_statements as well. But I guess with this one, with pgBadger, we get more of an opinionated view; it shows people what to look at, whereas with pg_stat_statements people can query and order by whatever they want. So I see a lot more people making that mistake of just looking at their slowest queries rather than the totals.
Amazing. Alicja, thanks so much for your time today. Is there anything
you want to plug, or anything you want to share with people?
Actually, if I may, I'd plug my songs, because I just realized not many people know my songs.
I produced two songs.
The first was about vacuum, and the second one, actually, you are the author of the lyrics, right?
So, you know, there are like two Postgres songs out there.
And yes, we are co-producers of the second one.
So, yeah, that would absolutely be my plug.
That's kind of you. The lyrics are extremely basic for that second one, so I apologize to people in advance, but you did an awesome job with the vocals (that was not me), and the video as well, which was a nice one of all the Postgres events. So yeah, you did an awesome job. I will happily link those up in the show notes.
Thank you.
Thank you so much, Alicja, it's been a pleasure. Take care.
Thank you, Michael. Bye.