Postgres FM - Query macro analysis intro
Episode Date: September 16, 2022

Here are links to a few things we mentioned:
- pg_stat_statements
- pgFouine
- pgBadger
- pg_query
- pg_stat_activity
- auto_explain
- Can auto_explain (with timing) have low overhead? (blog post by Michael)
- track_io_timing
- pgbench
- PgHero
- pgCenter
- pgwatch2 (Postgres AI edition)
- pg_stat_kcache
- PASH Viewer

Postgres FM is brought to you by Nikolay Samokhvalov, founder of Postgres.ai, and Michael Christofides, founder of pgMustard, with special thanks to Jessie Draws for the artwork.
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard, and this is my co-host Nikolai, founder of Postgres.ai.
Hey Nikolai, what are we going to be talking about today?
Hi Michael, a few weeks ago we discussed what I call query micro-analysis.
When we have only one query and we want to understand how it works, how it will be executed in production. Let's talk today about the bigger picture,
about how we can analyze the whole workload and find the worst parts of it.
Yeah, so if that was micro performance analysis,
this I've heard you call macro performance analysis.
So maybe there's nothing wrong, but we want to be able to see at a glance,
are there any big issues?
Is there a spike in some metrics, something like that? Maybe it's monitoring related, maybe it's for others, I don't know, like SREs and so on.
And we want to identify the parts of the workload which behave the worst.
And this is one thing.
Or we just want to optimize resource consumption to prepare for future growth.
And we don't want to spend more on hardware, so we want to find, again, the worst-behaving parts of the workload and optimize them.
There are many different cases.
Yeah.
So, I mean, it could be an application developer telling us, it could be customers reporting that there are issues.
And maybe we want to find out which part is slow or maybe we want to show them that there isn't an issue.
You know, we can't see anything database side.
All right.
One of the cases I like especially is when we perform workload analysis as a whole during various kinds of preparations, various kinds of testing before we deploy.
And this is also interesting.
And there we can also try to understand whether all parts behave well,
or whether there are some
not-so-well-behaving parts, so we should optimize before we deploy. But let's just start from maybe
historical aspects of it. Like 15 years ago or so, we didn't have pg_stat_statements, which right now
is the de facto standard extension for macro analysis. And there is consensus in the community that pg_stat_statements
should be enabled in all Postgres installations.
By default, this extension is not installed,
but everyone should consider installing this extension
because it has a very, very small overhead,
but it's like the place where you probably want to start your macro analysis,
understanding how workload behaves.
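For anyone setting it up on a self-managed instance, a minimal sketch looks roughly like this (assuming superuser access; managed services usually only need the CREATE EXTENSION step):

```sql
-- Add pg_stat_statements to the preload list (note: this overwrites the setting,
-- so append to any existing list instead); a restart is required afterwards.
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';

-- After the restart, in each database you want to analyze:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Quick sanity check:
SELECT count(*) FROM pg_stat_statements;
```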
But before pg_stat_statements, we had only logs.
And the idea was, okay, we log all slow queries,
for example, queries whose execution takes longer than one second.
And I remember when I was briefly, just one week,
I was a user of MySQL and switched to Postgres.
And then it was a confirmation of, like, my choice was right.
It was confirmation when I found that in Postgres, I could go down below one second.
I mean, log_min_duration_statement.
And then log queries, for example, which take longer than 100 milliseconds.
But in MySQL, it was not possible.
And one second was the lowest value.
Right now, they fixed it already, and you can go down below one second.
But at that time, it was in 2005, 2006, 2007.
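For reference, on a modern Postgres the threshold is set like this (a sketch; 100 ms is just an example value, pick what fits your workload):

```sql
-- Log every statement that runs longer than 100 ms:
ALTER SYSTEM SET log_min_duration_statement = '100ms';
SELECT pg_reload_conf();  -- no restart needed for this parameter
```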
And we log all slow queries, and then we can parse logs.
And I remember we had a tool called pgFouine, written in PHP.
And then pgBadger was created, written in Perl.
It's much better in terms of performance.
It has much more features and so on, like more robust.
It's being developed still.
And by the way, yesterday they released version 12, I think.
I saw it in the news, right?
So pgBadger 12.
More and more features; right now it can work with auto_explain and many things.
So the idea was, let's parse those
queries, remove parameters from them, and aggregate them. It's the process which, in pg_stat_statements
source code terminology, is called query normalization. And then we will show the worst, according to some metric,
the worst query groups.
But the problem with this approach is that this is only the tip of the iceberg.
We might have many more queries which are not visible in logs,
but they produce the most load.
Sometimes 90% of load is produced by queries which are
under log_min_duration_statement. So some DBAs used an approach like, let's enable all
query logging with duration for a few minutes and collect logs. And some, including myself,
found a way to store logs in memory. It was quite risky,
so I used it only a couple of times. We create a drive in memory and we put logs there,
but it's only, I don't know, half a gigabyte, because memory is expensive, and we
make very aggressive rotation so we don't let the Postgres log saturate this small disk.
But since it's in memory, we can afford logging a lot.
And we can set log_min_duration_statement to zero, meaning let's log all queries.
That's the big downside: logging does have an overhead if you're logging excessively.
So that seems to be a good argument for not doing it. Well, that's why we don't have to anymore. Right, right. Yes, logging overhead may be
very big, and it's not noticeable until some point when the drive where you write logs is saturated
in terms of write disk I/O, and then everything goes down and it's not fun. This is one of the worst observer effects
I had in big production cases
and it was very painful.
So I don't recommend going down to zero
in terms of log_min_duration_statement
blindly and without proper preparation.
But anyway, this right now
we consider an outdated approach,
because we have pg_stat_statements, right?
And we don't need to log
all queries anymore. We still need to log some queries, because pg_stat_statements doesn't have
examples. In pgBadger reports, for aggregated normalized queries, I call them query groups,
pgBadger provides a few examples. And this is very important because in each query group,
you might have different cases. The same query, abstract query without parameters,
might behave very differently in terms of execution plan,
depending on parameters.
So when you already identified queries you want to improve,
you need examples, always.
And this is tricky.
Guessing examples is a big, not yet solved task.
I think it's a good task for machine learning and so on.
I'm very interested in this area.
If any of our listeners are also interested in this area,
please let's talk.
Because I think it's a very interesting area to automate,
to improve, to allow us to improve more queries with less time, right?
But at the moment, we can patch it together
via a mixture of pg_stat_statements and logging.
Yeah, we can combine logging and pg_stat_statements.
Also, query ID helps in the very recent Postgres versions.
Before query ID, we used pg_query from Lukas Fittl, right? So it's an additional ID. And the good
thing about this library or tool is that if you apply it to an already normalized query, it will produce
the same fingerprint as for the non-normalized raw query. But anyway, if you use logs, you can find examples.
Another source of query examples is pg_stat_activity.
But in most cases where there was a lack of DBA involvement,
I saw that track_activity_query_size, or however it's called, I always forget.
There is a parameter which sets the maximum length of the pg_stat_activity.query column.
And by default, it's 1024 only.
Right, right.
And it's not enough.
Our ORMs or humans, they can create much bigger queries these days.
So we want to put like 10k there or something.
Overhead is very small, so I always recommend increasing it.
But increasing it requires a restart.
This is the main problem.
Yeah, so one of those ones you have to do at the beginning normally, isn't it?
Yeah, yeah, yeah.
So this parameter should be increased, and if we increase it,
we have more opportunity to get samples of queries from pg_stat_activity.
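As a sketch (the size value is just an example, and the join assumes Postgres 14+, where pg_stat_activity exposes query_id):

```sql
-- Raise the length limit for pg_stat_activity.query; requires a restart:
ALTER SYSTEM SET track_activity_query_size = 10240;  -- ~10 kB, default is 1024

-- On Postgres 14+, currently running statements can be matched to their
-- pg_stat_statements entry by query ID to get concrete examples:
SELECT a.pid,
       a.query           AS example_query_text,
       s.calls,
       s.total_exec_time
FROM pg_stat_activity   a
JOIN pg_stat_statements s ON s.queryid = a.query_id
WHERE a.state = 'active';
```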
Then we can join or match them with pg_stat_statements data,
and then probably logs are not that needed,
unless you use auto-explain,
because auto-explain is also very useful.
And again, your article about the overhead
and how to measure it
and the idea that sometimes it's not that big is good.
And maybe you want also this,
because in this case,
you see plans exactly as they were during execution,
because plan flips happen as well.
Yeah, I was going to ask actually,
we've talked about the overhead of these things a little bit.
You mentioned the overhead of pg_stat_statements is low.
I've seen some people mention that they've tried to benchmark it and struggled.
But have you seen anybody actually struggle to measure the overhead on a normal workload? Have you seen any benchmarks of it?
I haven't, I haven't. I trust current experience already, but we can do it. This is not the
most difficult benchmark in the Postgres ecosystem, so we can just do it. Of course, it's
good if you can benchmark for your own workload.
And the question is how to reproduce workload in a reliable way so each run is the same or very close to each other.
And then we can just use various metrics which the database has without pg_stat_statements.
For example, from pg_stat_database, if we enable track_io_timing, and by the way,
usually we should enable it. Of course, we also discussed it. There are cases on some hardware where
it can be expensive, it's worth checking. But if we enable it, we can produce workloads,
check this metric, then reset, or we can also check throughput and especially latency
from the application side, if we use an application called pgbench, which, I don't know why, Ubuntu ships in the server package, not the client package.
But if we use that application, it reports all latencies. Or sysbench, anything.
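For reference, a sketch of enabling track_io_timing and reading the database-wide counters:

```sql
-- Enable I/O timing (worth checking clock-source overhead on your hardware first):
ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();

-- Accumulated read/write time (in milliseconds) for the current database:
SELECT datname, blk_read_time, blk_write_time
FROM pg_stat_database
WHERE datname = current_database();
```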
Yeah, what I meant, I think, when I said struggled, what I meant was the variance of each run is larger than any overhead. So either you...
You're saying you can't say the overhead is zero, right? Because it's definitely doing some work,
but it's not necessarily measurable, because a few percent... I don't think it... I think it might
even be lower than that for some OLTP workloads.
Well, benchmarking is an area we probably need to discuss one day separately, but my general
advice is: by default, many benchmark tools, and pgbench is no exception here, don't do
regular load testing, they do an edge case of load testing called stress testing, like let's load it 100%. By default, that's why I
usually suggest finding the spot in terms of TPS. You can control TPS in pgbench. You find a
spot, like loading your system in terms of, for example, CPU or disk I/O, between 25 and 50 percent,
emulating normal days of your production. Because if it's above that, you already should think about upgrading or very heavy optimization.
And in this case, you should check latencies and compare in each run.
And in this case, variance should be the same.
I don't know why they are different.
Something is wrong.
I would like to see the concrete case and understand.
Yeah.
Well, so back onto... so now that we have pg_stat_statements,
there's a few things to mention there.
It's not on by default unless you're on a cloud provider.
They often do have it on by default.
So people do need to load it if they don't already. I come across quite a lot of customers who don't,
if they're self-managing, they're not even aware it's a thing.
So there probably are a bunch of people out there who don't have it on,
even though experienced people recommend it.
Everyone should have it.
Yeah, I think so.
But then again, so there are a few default settings there
that I think can be improved.
There's like a, is it 5,000 statements or 5,000 unique?
Yeah, you're talking about the pg_stat_statements.max parameter,
which is, as I remember,
5,000 by default.
Usually it's enough,
but in some cases it's not enough
when your queries are very volatile
in terms of structure,
not in terms of parameters
because pg_stat_statements
during query normalization
removes parameters,
but in terms of the structure.
If you just swap two columns
in your query,
it's already considered as two different
cases, two different entries in the pg_stat_statements view. And pg_stat_statements.max is 5,000. So
you can increase it, but as I remember, only to 10,000 maximum. I don't remember exactly.
But why do we care? Because pg_stat_statements has metrics which just grow over time,
incremental metrics, like total time, total exec time, total plan time, because they were split in Postgres 13, as I remember.
In this case, we need two snapshots to analyze.
We need two snapshots and two numbers, and then difference between those two numbers is what we have during our period of observation.
Snapshotting is absolutely needed. Otherwise, there is the manual approach:
let's reset statistics often and then use only the final snapshot, thinking that we started
everything from zero.
It has downsides.
For example, filling it with new entries has overhead as well.
If you check the source code, when we add an entry,
there is a lock there, and it
might be noticeable, for dozens or sometimes like 100 milliseconds, it can be noticeable for
the whole workload. So it's better not to reset very often, in my experience. And anyway, it's not
practical to reset them, you lose information as well. Then the question is how often we have evictions of rare queries,
and what the drift of our query set is. And you can see it, you can compare the
difference between two snapshots, for example with one hour between them. And you can see which queries are new
and which queries disappeared from the list. And this difference indicates the eviction speed.
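A minimal sketch of the snapshot approach (the pgss_snapshot table is hypothetical, and for simplicity it matches entries by queryid only, ignoring userid and dbid):

```sql
-- Take a snapshot of pg_stat_statements, e.g. once per hour:
CREATE TABLE IF NOT EXISTS pgss_snapshot AS
  SELECT now() AS captured_at, * FROM pg_stat_statements WITH NO DATA;
INSERT INTO pgss_snapshot SELECT now(), * FROM pg_stat_statements;

-- Difference between the two most recent snapshots, worst query groups first:
WITH t AS (
  SELECT DISTINCT captured_at FROM pgss_snapshot ORDER BY captured_at DESC LIMIT 2
),
cur  AS (SELECT * FROM pgss_snapshot WHERE captured_at = (SELECT max(captured_at) FROM t)),
prev AS (SELECT * FROM pgss_snapshot WHERE captured_at = (SELECT min(captured_at) FROM t))
SELECT cur.queryid,
       cur.calls           - coalesce(prev.calls, 0)           AS calls_in_period,
       cur.total_exec_time - coalesce(prev.total_exec_time, 0) AS exec_ms_in_period
FROM cur
LEFT JOIN prev USING (queryid)
ORDER BY exec_ms_in_period DESC
LIMIT 10;
```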
And I've noticed that some, for example, Java applications,
they use SET application_name to something very unique,
including maybe some process ID or something.
And pg_stat_statements cannot normalize so-called utility commands, and SET is a utility command.
So these queries are all considered separate, all of those SET commands.
In this case, you might want to turn off pg_stat_statements.track_utility, which is on by default.
And in this case... because I haven't had cases when we needed to analyze the speed of SET commands.
Well, maybe it might happen, but in my experience, not yet.
So it's better to just turn it off.
It's on by default.
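A sketch of turning it off:

```sql
-- Stop counting utility commands (SET and other non-DML statements) separately:
ALTER SYSTEM SET pg_stat_statements.track_utility = off;
SELECT pg_reload_conf();
```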
That makes loads of sense.
I think talking about the snapshot comparisons,
I think that must be how the cloud providers all do it.
And a lot of the dashboards that you'll see in RDS, Google Cloud SQL,
there's a bunch of other ones as well.
Or even open source tools. Talking of tools that have had releases
recently, PgHero came out with version three, I think yesterday.
I haven't noticed.
Good.
Interesting.
Yeah.
It's very, very lightweight and a good tool for small teams.
I enjoy it, yeah.
And it's based on pg_stat_statements again.
So it's the basis for lots of these.
But back to the cloud providers,
the way they get historic data
is by taking these snapshots and rolling them
and comparing them to each other,
not by rolling it forever on the same.
They don't want to reset once a year, for example,
if that makes sense.
Right.
Well, so why do we care about this eviction speed and transition to new list?
Because we want to analyze the whole workload, right?
And in this case, of course, disabling track_utility, we don't see this utility part.
But if you consider it small, we can do it.
But we will have more real queries in our pg_stat_statements.
And we will have 5,000 by default.
It's quite a big number.
But the second place where a cutoff can happen
is the monitoring system or cloud provider.
I don't know.
They usually store like 500 or 1,000.
They don't take everything,
because it's expensive to store everything in a monitoring system.
If you want snapshots, samples of pg_stat_statements, every minute, for example,
imagine how many records you need to store.
Every minute you store 5,000 entries from pg_stat_statements.
So usually they also cut off, and when making decisions about what
to remove and what to leave, we usually think about which metrics are the most important.
Yeah, so actually, I think you already mentioned it briefly, but total time
is my favorite. I know, I know we've talked about mine. Mine as well, but I saw people who prefer not total time.
Actually, in my team, there are such people
who prefer, for example, average time, mean time,
mean exec time, or mean exec time plus mean plan time,
because we probably want to combine them,
because processing a query includes both planning and execution.
I mean, okay.
Sorry.
Yeah, we call that total time by summing the two,
but there's no such parameter.
Total is already used in different contexts.
Exactly.
Well.
But the problem,
this goes back to our conversation
about logs versus pg_stat_statements, though.
The reason, I guess, the reason for you as well,
but I'd be interested.
The reason I prefer total time is
you could easily have your biggest performance problem being...
So you understand why total is used twice here, right?
Because total is sum of all timing for...
I mean, there is total exec time and total plan time. Yeah.
Total, total?
No, it's not good.
Like total whole.
How to name it?
Yeah.
Well, they don't.
And they, but we can sum those at the query level, right?
We can sum the two of them if that's what we care about.
But my, yeah, sorry.
What I guess I was trying to say was: our biggest performance problem could easily be a relatively fast query which has a really low
average, by mean time. So it could be, on average, 20 milliseconds, but it's getting run
so many times, and maybe it's still not optimal, maybe it could be running in sub
one millisecond, and that could be our biggest performance opportunity. And by looking at
total time, total execution time plus total planning time, we could see that it could
rise to the top of our query list. It could be line number one. Whereas if we're looking at average
time, there could easily be so many queries that only run a couple of times but take a few seconds
each. They'd be long above it. This is an interesting topic.
Which metric is more important?
By the way, the lack of words here
indicates that the topic is quite complex, right?
I mean, English doesn't have enough words to provide it.
I'm joking, of course, but interesting.
Total what time if you combine both exec and plan?
Some word should exist,
and we probably already use somewhere.
So total versus average or mean time.
Like, I came to a conclusion like this.
If our primary goal is resource optimization,
if we want to prepare for future growth,
we want to pay less for cloud resources or hardware,
total time is our friend.
Because this is, well, of course, it includes some wait time as well.
For example, we have a lot of contention and some sessions are blocked by other sessions.
It also contributes to total time, but resource consumption probably is not that much because waiting is quite cheap usually, right?
But if we forget about this a little bit:
total time, both plan and exec time, if we combine them, this is our time spent to process our
workload. If we know that we analyzed everything, this is how much work was done. We can even take
this total-total time and divide it by the observation duration. And we will understand how much time we spend every second.
I call it a metric in seconds per second,
my favorite metric.
For example, if it's one second per second,
it means that roughly one core could process this.
It's very rough, we forget about context switches here,
of course, and so on. But it gives someone a feeling of our workload. If we have 10 seconds per second
needed to process, it's quite a workload already. We probably need some beefy server here.
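A rough sketch of that calculation, using the time since the last stats reset (pg_stat_statements_info requires Postgres 14+, and total_plan_time is only populated when pg_stat_statements.track_planning is on):

```sql
-- "Seconds per second": total plan + execution time, divided by the observation window.
-- A value around 1 means roughly one CPU core busy with query processing.
SELECT sum(s.total_exec_time + s.total_plan_time) / 1000.0
       / extract(epoch FROM now() - i.stats_reset) AS seconds_per_second
FROM pg_stat_statements s, pg_stat_statements_info i
GROUP BY i.stats_reset;
```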
As for average time, these numbers are most useful when we have a goal like: let's optimize for the best user experience.
Yeah, so I guess that's our 50th percentile, isn't it? Is it? No, it's not, it's not. So even then, I don't prefer...
I don't even like mean for those, because I'd much rather look at a p95
or something, and look at the client side, not the database side. Well, it depends. But sometimes, as you said, sometimes we have very, very rarely executed queries, but quite
important ones.
For example, it can be some kind of analytics, not analytics, but some aggregation and so
on.
And the average is terrible.
And we do want to improve it, because we know that the users who look at those numbers, or
who use these queries, these users are important. For example, our internal team analyzing
something, or, I don't know, finance people or something, and some kind of a more analytical
workload. It's not necessarily analytical, but I think, I hope you understand, right?
I think I understand. So, to give you an example: when I was at a payments company, we had daily batch
payments. We had a deadline to submit a file, it was like a 10 p.m. UK time
deadline, and the job literally had to finish before then. And as this job got longer and longer,
it got closer to that deadline, and then it
forced some work. So maybe that wouldn't have shown up if we'd looked at duration or total
time. But yeah, I also think those kinds of issues often crop up without you having
to do this macro analysis work, because somebody's telling you about them. Yeah, so if we decided to order by mean time,
sometimes we see on the top,
we see something that we say,
well, it's fine that it executes a minute,
because it's some cron job and nobody cares.
If it's just select, for example,
and no locking involved,
and it lasts one minute, it's not a big deal.
So we probably want to exclude
some queries from the top when ordered by mean time, average time. For total time, it's not so. I'm
usually really interested in each entry from the top. That's why I also prefer total time.
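A sketch of the two orderings (column names as in Postgres 13+, where planning and execution time are split):

```sql
-- Top 10 query groups by total time (plan + execution), the resource angle:
SELECT queryid,
       calls,
       total_plan_time + total_exec_time AS total_time_ms,
       mean_exec_time,
       left(query, 60) AS query_start
FROM pg_stat_statements
ORDER BY total_plan_time + total_exec_time DESC
LIMIT 10;

-- For the "user experience" angle, order by mean_exec_time DESC instead.
```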
But I see people use mean time successfully, caring about mostly users, not about servers.
So roughly, total time is for infrastructure teams to optimize for
servers, while mean time is probably interesting to application development teams, and for humans,
right? So very, very roughly. And there is also calls, an important metric, right? Why do we discuss
which metric to choose? Because when you build good monitoring, you need to choose several
metrics and build a dashboard consisting of multiple charts: top-N charts, top-N by total
time, top-N by mean time, top-N by calls, for example. Why calls? Probably for the database itself
it's not that important, and the most frequent queries might produce not the biggest
load. Of course, if there are a lot of very, very fast queries,
I would check context switches, for example, and so on.
And think about how CPUs are busy in this area.
But I've noticed that sometimes we want to reduce frequency
of some of the most frequent queries
because the overhead on the application side is terrible.
This is an unusual approach, because sometimes people optimizing a workload or database,
they think only about the database.
But I had cases when, with optimization, for example, let's take the top three ordered by calls,
and just by reducing the frequency, we could throw out 50% of our application nodes.
Can you imagine the benefit of it?
The cost saving, right?
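The corresponding top-N by frequency is just a different ORDER BY; candidates here are often caching, batching, or N+1 patterns on the application side:

```sql
-- Top 10 query groups by call count:
SELECT queryid, calls, rows, left(query, 60) AS query_start
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
```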
But so it could, just to give an example
from the application side,
I guess that would be one way of spotting
potential N plus one issues
where if it's the same queries
getting executed over and over again,
that's the kind of thing it could point to.
Right, right.
So it's interesting.
I think I don't
understand all of the aspects here. And I think we lack good documentation on how to use pg_stat_statements.
So many angles, so many derivatives as well. But I would like to finalize the discussion
of metrics. I wanted to mention also I/O metrics, the buffer metrics: shared buffer hits and shared blocks...
They are called shared blocks read and hit, right?
Let me check: shared_blks_hit, shared_blks_read, also written.
There are also local and temp ones in addition, but if we discuss
everything, we need an hour, right?
So many aspects.
But I wanted to mention only a few, like hit and read, shared buffers, hit and read.
Because, okay, if we talk about this only, an interesting thing here is that sometimes the monitoring system thinks that read is enough,
because it's the slowest operation. Well, as we already discussed a few times, Postgres doesn't see the actual disk read.
Like the operating system cache versus disk read.
So this read is from a page cache.
Maybe it's disk read, but maybe not.
We don't know.
But usually the monitoring system says,
okay, ordering by shared_blks_read
is the most interesting.
But I had cases, at least two times,
when I really needed pg_stat_statements shared_blks_hit,
and finding the top by it,
because working with the buffer pool, Postgres shared buffers,
was so intensive for some query group.
And if you don't have it in monitoring,
you need to start sampling pg_stat_statements
yourself, writing some scripts on the fly.
It's not fun at all.
So I think most DBAs who have experience have something in their tool set.
But, for example, pgCenter can sample it for you.
You can use it as an ad hoc tool if you don't have it in monitoring and you have a problem
right now, for example.
But I also suspect they lack a top-N by hit number. So these are the angles.
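A sketch of both buffer-oriented orderings:

```sql
-- Top 10 by buffer cache hits (pressure on shared buffers); swap the ORDER BY
-- column to shared_blks_read for the more common "read" angle:
SELECT queryid, calls,
       shared_blks_hit,
       shared_blks_read,
       left(query, 60) AS query_start
FROM pg_stat_statements
ORDER BY shared_blks_hit DESC
LIMIT 10;
```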
And there are new metrics: WAL. Yes, let's find queries which generate the most WAL data, let's order by that. Let's check, I have the list here. It's called wal_records, wal_fpi (full-page
images), and wal_bytes. Three metrics added to Postgres 13. I didn't see them yet in any
monitoring. I hope... Oh, maybe our pgwatch2 Postgres AI edition, it has it already, right?
Yeah, I remember this was added to
EXPLAIN in version 13. Was it added to pg_stat_statements at the same time? Same time, yes, yes.
Yeah, makes sense. And it's so good, it's so good to order by, because we want to reduce WAL generation,
definitely, because reducing it will have a very positive effect both on our backup subsystem and replication,
both logical and physical. So we do want to produce fewer WAL records, or fewer WAL bytes,
or full-page... Or full-page images, exactly. Yeah, yeah. So I've never used it yet, I hope soon I will use it someday.
So this, I mean, I know it's there, but I've never used it myself yet.
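A sketch of ordering by the WAL metrics (Postgres 13+):

```sql
-- Top 10 WAL producers:
SELECT queryid, calls,
       wal_bytes,
       wal_records,
       wal_fpi,              -- full-page images
       left(query, 60) AS query_start
FROM pg_stat_statements
ORDER BY wal_bytes DESC
LIMIT 10;
```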
Yeah.
Talking about this has given me an idea as well.
We talked a while back about buffers.
And one of the things we do on a per query basis
is look at the total sum of all of the buffers.
And I know that doesn't make tons of sense,
summing dirtied buffers plus temp buffers
plus local plus shared.
We can call it total I/O or something.
Yeah, exactly.
Or some kind of measure of work done.
And actually summing all of those and then ordering by that and looking at the top 10 queries by total I.O.
I think it's a smart idea.
Each I/O has some cost.
And if we find queries which involve the most I/O operations, of course, it's a good angle for our analysis.
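A sketch of that crude but useful ordering, summing every buffer counter pg_stat_statements exposes:

```sql
-- "Total I/O": all shared, local and temp buffer operations added together:
SELECT queryid, calls,
       shared_blks_hit + shared_blks_read + shared_blks_dirtied + shared_blks_written
     + local_blks_hit  + local_blks_read  + local_blks_dirtied  + local_blks_written
     + temp_blks_read  + temp_blks_written AS total_buffers,
       left(query, 60) AS query_start
FROM pg_stat_statements
ORDER BY total_buffers DESC
LIMIT 10;
```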
Yeah.
What else?
We mentioned that we deal with the page cache when we look at I/O.
Yeah.
But sometimes we do want to order by real physical disk I/O, right?
And there is such opportunity for those who manage Postgres themselves.
It's called pg_stat_kcache.
An additional extension to pg_stat_statements.
Extension to extension, I would say.
And it provides you very good things like disk reads and writes,
real physical disk reads and writes,
and also CPU.
Sometimes you want to find queries
that generate the most load to your CPU.
And it even distinguishes system and user CPU.
It's cool.
It's good.
Yeah, yeah.
And also context switches.
So it's a very useful extension
if you care about resource consumption
and you want to prepare for growth
and you want to do some capacity planning
and before that you want to optimize.
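A sketch, assuming pg_stat_kcache is available; the exact column names vary between pg_stat_kcache versions (newer releases split them into plan_/exec_ counterparts), so treat these as illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS pg_stat_kcache;

-- Top 10 query groups by CPU time, with physical (not page-cache) I/O;
-- column names here follow older pg_stat_kcache releases:
SELECT s.queryid,
       k.reads, k.writes,            -- physical reads/writes
       k.user_time, k.system_time,   -- CPU seconds, user vs system
       left(s.query, 60) AS query_start
FROM pg_stat_kcache() k
JOIN pg_stat_statements s USING (queryid)
ORDER BY k.user_time + k.system_time DESC
LIMIT 10;
```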
And I think you've said before,
but did you say it's not available on most managed services?
No, I only know Yandex managed services.
They installed it by default, but I'm not aware of any others.
So yeah, there is also another way to analyze workload,
which we didn't cover today at all:
wait event analysis.
This is what RDS, for example, provides as a starting point,
actually, for workload analysis. I think it came from the Oracle world, Active Session History (ASH) analysis.
So let's, yes, someday let's discuss it and compare it with traditional analysis we discussed today.
Yes. And for anybody that is aware of ASH and wants it for Postgres, there is a...
I've heard it called PASH.
Yeah, P-A-S-H as well.
Yeah, exactly.
But it's a Java client application which will do sampling from pg_stat_activity,
but it can only be used as an ad hoc tool if you're in the middle of an incident.
Same as ASH, right? Isn't it?
Well, you can install, for example, the pg_wait_sampling extension,
and immediately in our pgwatch2 Postgres AI edition,
you will have similar graphs to Performance Insights in RDS.
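Without any extension, a crude ad hoc approximation is to sample pg_stat_activity repeatedly (say, once a second from a script) and aggregate the wait events; a single sample looks like this:

```sql
-- One sample of what active backends are currently waiting on:
SELECT wait_event_type, wait_event, count(*) AS active_backends
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
GROUP BY wait_event_type, wait_event
ORDER BY active_backends DESC;
```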
Nice.
I think Google also implemented it.
I'm not 100% sure, but I think they did it.
Or pgCenter, which I mentioned earlier, is also a good ad hoc tool. It also has wait event
sampling. But let's discuss it some other day. We also have an episode on monitoring that people
can go check out. A deeper discussion on that. And micro analysis. It's good to distinguish
things. Sometimes you have already a query. You need to go inside it and understand what's happening, why is it so
slow.
But sometimes you have no idea where to start.
Database is slow, everything is bad.
In this case, query analysis, macro analysis is definitely worth conducting.
So yeah.
Wonderful.
Okay, sorry about 40 minutes again.
So it's getting longer than we wanted. It's good.
As usual, let's thank all our listeners who provide feedback. This week was excellent
as well. A lot of feedback.
Yeah, we had a lot of great suggestions. It's been really good. Thank you.
Yeah, this drives us. Thank you so much.
Yeah, really appreciate it. Well, thanks again, Nikolai. I hope you have a good week
and see you next time.
Final words. Like, share, share, share. Thank you.
Bye.
Take care. Bye.