Postgres FM - Overhead of pg_stat_statements and pg_stat_kcache
Episode Date: February 16, 2024

Nikolay and Michael discuss the overhead of pg_stat_statements and pg_stat_kcache — mostly focusing on some interesting things Nikolay found while stress-testing some large spot instances up to and beyond 2M TPS(!)

Here are some links to things they mentioned:
pg_stat_statements https://www.postgresql.org/docs/current/pgstatstatements.html
pg_stat_kcache https://github.com/powa-team/pg_stat_kcache
pg_stat_statements overhead estimate in pganalyze FAQs https://pganalyze.com/faq
What is the performance impact of pg_stat_statements? (DBA Stack Exchange answer with several links) https://dba.stackexchange.com/questions/303503/what-is-the-performance-impact-of-pg-stat-statements
Towards Millions TPS (blog post by Alexander Korotkov) https://akorotkov.github.io/blog/2016/05/09/scalability-towards-millions-tps
Postgres.ai blog post with links to benchmarks https://postgres.ai/blog/20240127-postges-ai-bot
postgresql_cluster https://github.com/vitabaks/postgresql_cluster
pg_wait_sampling https://github.com/postgrespro/pg_wait_sampling
pg_stat_kcache issue, discussion, and fast resolution https://github.com/powa-team/pg_stat_kcache/issues/41
log_statement_sample_rate https://postgresqlco.nf/doc/en/param/log_statement_sample_rate/
auto_explain.sample_rate https://www.postgresql.org/docs/current/auto-explain.html#id-1.11.7.14.5.3.13.1.3
pg_stat_statements performance penalty on Intel much higher than on AMD https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues/52

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres AI.
Welcome back, Nikolay. How are you getting on and what would you like to talk about today?
Hi, Michael. First of all, thank you for keeping, how is it called, the ball rolling, or how do you say it, long time no see, or long time no hear.
Yeah.
Yeah.
Thank you for very interesting interviews.
I liked listening to them while I was slightly off.
But maybe it's time to return to our regular format or have a mix.
I don't know what we will decide next,
but today we have a regular format, right?
Yeah.
And the topic I brought, this topic, is the overhead of observability tools, the most popular ones. Well, one is the most popular one: pg_stat_statements.
And usually people... like, I remember exactly when we didn't have pg_stat_statements.
We didn't have wait events, didn't have pg_stat_statements, like 15 years ago, right?
And we only had log-based analysis.
And there was a tool written in PHP called pgFouine.
And then another tool replaced it, written in Perl, called pgBadger.
And the idea was, people always, at first, figure out that,
oh, usually we only see the tip of the iceberg.
We see only some queries which are the slowest ones,
but you cannot properly perform holistic analysis if you don't see the fast queries.
Because fast queries might be consuming even more resources than slow queries.
So what people usually did, the experienced DBAs, they said,
okay, I'm going to switch on logging of all queries for a few minutes, to collect everything, and then analyze.
In very many cases, it worked well.
Yes, there is an observer effect, because we might put our Postgres down completely if we log all queries.
But not always.
Sometimes it depends.
So, yeah.
This is quite understandable.
But then pg_stat_statements was created, and some DBAs, I remember, were saying,
oh, you know, like, we still need to study its overhead.
And then somehow there was, like, there were some benchmarks.
I don't remember any benchmarks that I would say I trust.
But I remember many experienced folks started saying,
oh, you know, yes, there is overhead, but it's below 10%.
And since this thing is measuring everything,
it's observing the whole workload, it's worth keeping it enabled.
Let's have it enabled. I remember also there was some number, like many people saying, you know, pg_stat_statements overhead is 7%.
It's kind of strange, but like, kind of 7%. Okay, whose 7%? I don't know. I don't remember any benchmarks that proved it, but I remember the consensus was it's below 10%, so we are all good.
If you could find a good benchmark from those early days of pg_stat_statements, it would be great.
I don't think it exists. But honestly, we as a community should perform good benchmarks there. I don't remember
any. It doesn't mean they don't exist, of course. I did a little bit of searching beforehand because
I was interested in if there was any up-to-date ones as well. I couldn't find anything great,
but there were a few trustworthy sources. One was pganalyze, a commercial monitoring tool that uses pg_stat_statements.
In their frequently asked questions, they gave an estimate of approximately 1% of CPU
time, which I thought was interesting that they're telling their customers. Anyway, as we're going to
discuss, this is all very dependent on workload, right? But at least that's interesting. And I
trust them enough to say that's probably based on some tests on their side, even if
I didn't see the actual benchmarking.
And then the other things I've seen are a series of, well, there's a question on the
DBA Stack Exchange, and somebody's quoting various Postgres authors and a benchmarking
series.
One that was like a more realistic workload that was about 0.5% measured overhead, and another that was a more stress test that measured it about 10%.
So those are numbers, but that's a wild range of different numbers.
And as we're going to discuss later, you've done some work that shows even that it can be very different, which is cool.
I hope by the end of this episode, people will start having some ideas
why this range can be so wide.
But I can imagine like 0.5% or 10%.
In both cases, it can be so.
I mean, it depends on workload, actually.
It very much depends on the workload.
And also not only on workload.
We will talk about it later, right?
So, yeah, okay.
But when I say overhead, of course, I mean CPU overhead, first of all,
because we don't think about disk IO or disk usage, disk space usage here,
network, nothing like that.
Only pure CPU overhead. By the way, when
guys like RDS and others, probably inheriting this from Oracle,
consider the wait event being null in pg_stat_activity as CPU, marking it as green,
CPU green, in Performance Insights,
do they mean that all the other wait events are not CPU?
I don't understand this.
I don't know, because many of them,
for example lightweight locks,
spin locks,
they are purely CPU.
I know we have several listeners
from AWS so maybe they can let us know
we have listeners
yeah yeah
on the other hand if you say
CPU instead of null, null means unknown.
And according to Postgres documentation and source code, it means no wait.
In reality, it means either really no wait, and maybe some kind of CPU work.
But also, in many cases, it means a wait event which is not yet created, so the code is not yet covered with it.
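For illustration, here is a rough sketch of what an ASH-style sampler effectively does: poll pg_stat_activity and label a NULL wait_event as "CPU". This is a hypothetical illustration, not RDS's actual implementation; the labeling choice is exactly the point being questioned.

  # Hypothetical sketch of an ASH-style sampler (not RDS's actual code):
  # a NULL wait_event really means "no known wait" -- maybe CPU, maybe an
  # uninstrumented wait -- yet tools often label it "CPU".
  while true; do
    psql -X -At -c "
      SELECT now(), pid,
             coalesce(wait_event_type || ':' || wait_event, 'CPU?') AS wait,
             left(query, 60)
      FROM pg_stat_activity
      WHERE state = 'active' AND pid <> pg_backend_pid();"
    sleep 1
  done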
For example,
Postgres 16, we recently had some benchmarks
and we saw wait event
which you see in Postgres 16, but
you don't see it in Postgres 14
because it was not yet created there.
So I'm
thinking, okay, it means in performance
insights on RDS,
and I think in many other systems, maybe in Cloud SQL as well,
or this PASH Viewer, an ad hoc tool written in Java,
it also likes to use green color and say this is CPU,
but it's actually not CPU, it's null.
Because CPU doesn't exist in the list of wait events in pg_stat_activity.
So if you marked it as CPU... oh, for 14, because it was null, but then you start distinguishing it for 16,
something is not right here, right? So I'm asking, is it really CPU?
Yeah. So interesting question, but it's slightly off topic.
Back to pg_stat_statements. Let's talk about the benchmarks we had. So the idea was, we got some credits from Google Cloud, and it's great. This year we will be, I mean, the Postgres AI team, we will be performing a lot of benchmarks. I'm going to do a lot of Postgres benchmarking. We already had a lot of benchmarks.
I think thousands of them if you count each iteration.
And hundreds of them if you count whole benchmark consisting of many iterations.
So we are curious how many TPS we can squeeze from Postgres 16 on various big machines,
Intel, AMD.
And there was a good article from 2016 from Alexander Korotkov when both Postgres and
MySQL, I remember, teamed up.
It was interesting.
I think, I don't remember, somebody from Percona also was working on MySQL.
The goal was to demonstrate that both systems can show a million TPS. And during that work,
some contention issues were improved, fixed in Postgres. So Postgres reached a million TPS on some strong machines.
And I like that benchmark because it's simple. It's PgBench, select only.
So it's only selects, very simple ones. So we just repeated the same benchmark.
And as usual for such benchmarks, it's actually stress testing, because you are exploring this
edge. It's not what you need to do for your application, for example, unless you're
studying it on purpose. I hate this behavior being the default in pgbench. It should not be so, because
it provokes performing stress tests instead of regular load tests, exploring the normal situation, for example 25 or 50% of CPU load.
But in our case, we just want to squeeze as much as we can.
So we did this, and I published a blog post about our new AI bot, and it has details and links to how we did it. So we took the newest Intel,
fourth generation Intel Scalable, Sapphire Rapids it's called, right?
And also fourth generation AMD EPYC,
C3 and C3D instances on GCP.
One has, the biggest number is 176
vCPUs, and the AMD one
has 360 vCPUs.
And both have
more than a terabyte of RAM.
So it's insane, but
we use spots.
AWS has spots, GCP
has spots. It's cool. You pay a couple
of bucks for such experiment,
honestly.
I realized even if we didn't have credits,
I would probably pay for it myself.
It's interesting to explore these things
because we just provision temporary machines
for one or two hours,
and spot means huge discount.
Yeah.
So, this kind of experiment, it's classic experiment for stress testing.
You first start with one connection, then more and more connections,
and you control.
We chose Alexander Korotkov's approach.
In pgbench, both -c and -j are the same.
So, we start from one, then we put 50.
With such huge machines, this step is reasonable.
Jumping to 50 right away.
Then 100, then 150, 200, and so on until, I think, 500.
Exceeding the number of vCPUs in both cases.
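For a rough idea, this kind of run looks something like the following; the flags and durations here are illustrative assumptions, not the exact commands from these benchmarks.

  # Illustrative sketch of a select-only stress run (flags and durations are
  # assumptions, not the exact benchmark scripts used here):
  pgbench -i -s 100 bench                    # initialize pgbench_accounts etc.
  for c in 1 50 100 150 200 250 300 350 400 450 500; do
    # -S: built-in select-only script; -c and -j kept equal, as described
    pgbench -S -P 10 -T 60 -c "$c" -j "$c" bench
  done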
So since we use postgresql_cluster,
which is a project, an open-source project,
it's a very good thing,
an Ansible set of playbooks
to provision Patroni clusters with a lot of things.
It's maintained by Vitaly, who is working in my
team. So it's a great
tool, and it has a lot of things
including those we
over years discussed
together, and he just added
into PostgreSQL cluster.
So we
use it, and
because of that, we had a lot of
observability tools, including those for query analysis: pg_stat_statements,
pg_stat_kcache (much less popular), and pg_wait_sampling for
wait event analysis.
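For context, a typical way these three extensions get enabled looks like this; it is a generic sketch, not necessarily the exact configuration postgresql_cluster applies.

  # Generic setup sketch (postgresql_cluster automates this; exact settings may differ).
  # All three extensions must be preloaded, which requires a restart.
  psql -c "ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements, pg_stat_kcache, pg_wait_sampling'"
  sudo systemctl restart postgresql
  psql -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements"
  psql -c "CREATE EXTENSION IF NOT EXISTS pg_stat_kcache"
  psql -c "CREATE EXTENSION IF NOT EXISTS pg_wait_sampling"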
And then we saw a very strange picture: until 50 connections we grow in terms of TPS,
kind of approaching a million TPS, but then go down
very quickly.
And on Intel, we went down even more, but it's a different story.
And thanks to pg_wait_sampling, we saw that the number one wait event is related to pg_stat_kcache.
So we immediately realized this is an observer effect from this pg_stat_kcache.
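For reference, spotting a dominant wait event with pg_wait_sampling can be as simple as aggregating its profile view; a minimal sketch:

  # Minimal sketch: top wait events from pg_wait_sampling's profile view.
  psql -c "
    SELECT event_type, event, sum(count) AS samples
    FROM pg_wait_sampling_profile
    GROUP BY 1, 2
    ORDER BY samples DESC
    LIMIT 10;"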
For those who don't know, pg_stat_kcache is an additional extension to the pg_stat_statements
extension, which provides physical-level metrics: user CPU, sys CPU, real disk I/O at the physical
level, context switches.
So it's very useful to understand real CPU usage, unlike this "wait_event is null", which I think is wrongly presented as CPU in RDS Performance Insights.
This thing is trustworthy; we have used it for many years.
A few big companies use it as well.
It's not super popular.
RDS doesn't have it.
Cloud SQL doesn't have it.
Most managed Postgres providers don't have it. I know Yandex Cloud has it, so they might be the only ones.
Yeah, maybe.
Maybe, yeah. But also, one of the biggest e-commerce companies we worked
with also has it, for many years, and they have very critical systems running on Postgres.
But in this case,
overhead was so huge.
Like, what's happening?
And I immediately published it on Twitter and Vitaly created an issue
in the pg_stat_kcache GitHub
repository.
And then maintainers,
they created a fix
in four hours, I think.
So I think they expected something.
There was some additional lock, which was removed in this fix,
with the idea this lock is something old and it's not really needed.
But in our case, it was slowing everything down when contention is high,
when many, many sessions fight.
Not fight.
Basically, what's happening?
With pgbench, you have, by default, only four queries, right?
If you say select-only, dash uppercase S (-S), it's only a single query,
only one select to the pgbench_accounts table.
That's it.
In this case, hundreds of sessions try to increment
metrics in a single record of pg_stat_kcache, right? And due to that lock,
which was removed in the latest version, it was no good. I mean, the more sessions
we have, the bigger the contention and the bigger the overhead is. And we
see it: as we increase the number of clients,
the number of connections and jobs,
the -c and -j pgbench parameters,
TPS goes down.
So we have room.
We have more vCPUs, right?
But we cannot use them properly
because all of them try to execute basically the same normalized query.
Parameters don't matter here.
Yeah, I've pulled the chart up from your tweet.
And just to give people an idea, at 50 clients, it's about 550,000 TPS.
At 100 clients, it does go up to about 700,000 TPS, but to 150, we're down below the 50 client rate,
and it's only 400,000. Then it settles at about 200 clients and above, about 300,000 TPS,
which is less than we had at 50. So it's a super interesting curve and cool to hear that you got
so quickly to the root cause and that the team
were able to fix it so quickly.
Yeah, in a few hours. Well, let's split the story here into two
paths. The first path is purely pg_stat_kcache, and the second path is without it. Because when we
saw this overhead, we continued our discovery of maximum TPS on modern Intel and modern AMD without pg_stat_kcache, of course,
because we didn't expect it would be fixed so quickly.
But when the fix was ready,
we asked the bot to repeat this benchmark.
It was interesting to convince the bot
that it's safe to download fresh code from GitHub and compile it.
It hallucinated, saying it's against
policy. When I asked which policy, it said the Postgres AI policy. So it was funny. Like, I
went to check it: do we have such a policy? I checked all our docs, because the bot does know our documents.
But it was just pure hallucination.
Yeah.
So then we convinced it, it verified.
And indeed,
we saw that the fix
indeed resolves the problem
and no more such overhead.
So it was good.
And next week,
the new release of pg_stat_kcache was issued.
But interesting question,
like,
as I mentioned,
I trust those people, and it's not only trust, I worked with some of them, and
I touched
production systems with my hands, right?
So I didn't see such problems
for years on production.
But here we saw
obviously very big overhead.
Why so?
Why did we survive with this in production?
This is an interesting question.
Let's return to it after we explore the second path.
Second path: okay, forget about pg_stat_kcache.
We only have pg_stat_statements and pg_wait_sampling.
We know the overhead of both is not so big.
And then we just explore from 1 to 500 connections on both modern Intel platform and modern AMD.
176 vCPUs on one and 360 vCPUs on another.
And a huge surprise was how Intel behaved. Again, the same workload, pgbench -S.
So, select only, single query, very simple, sub-millisecond latency.
Huge surprise was Intel behaves not well at all.
It behaves similar to what we had with pg_stat_kcache enabled.
But higher, yes, higher, it reached 1 million, I think, or so.
Maybe slightly below it. And then went down.
Like, this going down
even before you reach
the number of vCPUs,
while increasing
the number of connections and jobs in
pgbench, it's not normal.
Some kind of problem, obviously.
While AMD also was not good, but it didn't go down. It went down slightly, right? It demonstrated almost a plateau. Very different behavior. And we started studying what's happening here. And obviously, the problem was pg_stat_statements in this case. And we saw it, I think, from wait events by pg_wait_sampling, but we also collected flame graphs.
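A common way to collect such CPU flame graphs is perf plus Brendan Gregg's FlameGraph scripts; a sketch, with paths and durations as assumptions:

  # Sketch: collect a CPU flame graph during the run (paths/durations are assumptions).
  perf record -F 99 -a -g -- sleep 30
  perf script \
    | ./FlameGraph/stackcollapse-perf.pl \
    | ./FlameGraph/flamegraph.pl > flamegraph.svg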
And we obtained two very different pictures for these platforms.
Everything is the same. Ubuntu 22.04, Postgres 16 latest version, everything is like some kind of tuning applied,
the kind of default tuning we usually apply to new clusters under our control.
Nothing special, nothing fancy, but very different behavior.
And on flame graphs, we could see that in the case of Intel, these Sapphire Rapids, pg_stat_statements has a huge s_lock function call, which consumed a lot of time.
And in the case of AMD, this call is much smaller.
So in the case of Intel, it was like 75% of the whole CPU time in the flame graph.
Like, why?
What's happening?
Something not normal at all. And I talked to a few guys,
talked to Andrei Borodin and Alexander Korotkov, and interestingly, they both mentioned the idea that
probably pg_stat_statements needs sampling here, right? Sampling. So yeah, it's an interesting
idea. We don't currently have a parameter.
Yeah, we don't.
Like for a lot of the other things we have, like for logging, for example,
because there's such overhead to logging, we have parameters for that kind of thing.
We can sample to only measure one in 10 or one in 100.
Yeah, yeah.
Well, I'm thinking, I know we've discussed this before,
but for log_min_duration_statement, yeah, I think since more recently, but for auto_explain from a long time ago, for example.
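To make the analogy concrete, these are the existing sampling knobs on the logging side; pg_stat_statements itself has no such parameter today, and the values below are just examples.

  # Existing logging-side sampling knobs (values are examples only);
  # auto_explain.sample_rate requires auto_explain to be preloaded.
  psql -c "ALTER SYSTEM SET log_min_duration_sample = '100ms'"
  psql -c "ALTER SYSTEM SET log_statement_sample_rate = 0.1"    # keep 1 in 10
  psql -c "ALTER SYSTEM SET auto_explain.sample_rate = 0.01"    # explain 1 in 100
  psql -c "SELECT pg_reload_conf()"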
But the 75% makes sense as well. Just to give people an idea of the exact numbers, I've pulled up that chart as well. Intel with pg_stat_statements does get to very nearly a million TPS without pg_stat_kcache, and then drops a little
bit less rapidly, but still it drops down to 500,000 at 150, and then down to about 300,000 by the
time you get into the hundreds of clients. Whereas AMD with pg_stat_statements continues to rise at 100 clients to about 1.2 million, gets above 1.5 million at about
150 clients, then seems saturated, and it's mostly a plateau, a slight decline as you get more.
So that's about five times more by the time you get to, you know, 300,000 versus 1.5 million.
Starts to make sense that that's like roughly 75% overhead, I guess.
Yeah.
So in all cases, when you reach the usual number of vCPUs, you go down.
But normally you go down slowly, gradually, right?
The bad picture demonstrated on Intel, like very acute, acutely going down,
going down very fast. And this is not
normal. And, of course, we wanted
to confirm that pg_stat_statements is
involved, and also to get the numbers
we wanted, we wanted big numbers,
right?
When I posted it on Twitter,
of course, people started liking the post.
Like, with AMD,
I think we've got
without pg_stat_statements, just removing it,
we've got 2.5 million
TPS, right?
On AMD. Nearly, yeah.
Almost, almost, right.
Above 2 million TPS.
My tweet was 2 million TPS on
Postgres 16, right.
Well, yeah
it was funny to
see the reaction from
CEO of PlanetScale.
Like, it's
not real workload.
Well, I don't know it's not real
workload. It's select only.
It's purely synthetic
workload. We're just exploring some
edge of what system
can provide. Of course it's not
realistic, it's just some selects, that's it. Well, maybe there are some systems which need mostly
this kind of workload, and this is an interesting question, because if they have such a
workload, they suffer from bigger problems from pg_stat_statements. This is, I think, the main
idea, or should be the main idea, of today's episode.
Look at your workload and understand the nature of it.
But obviously, CEO of PlanetScale
is not interested in single node performance, right?
Because the main thing is sharding, right?
So single node should not provide millions of TPS, right?
It's not normal, right?
It doesn't help the marketing.
Right, right, right.
Everyone realizes it's select only.
Everything is cached in memory.
But still, 2.5 million TPS.
Wow.
On machine, you can just rent easily.
Well, it's expensive machine.
I think if it's not spot, if it's normal without any discounts,
it's above $10,000
per month. So it's
expensive machine. Of course.
But it's possible. And this is just
Postgres with minimal tuning.
It's good. Like 2
million TPS.
Of course, I'm slightly sad
comparing to 2016. How many
years already passed?
Like eight?
One million?
Two million?
Maybe we can do more.
We can squeeze maybe more, but it's a separate question.
We will probably think about it.
But returning to pg_stat_statements, what's happening?
Again, a lot of sessions compete, trying to update the same record in pg_stat_statements.
It's just a single record, a single select query, that's it.
If it was different queries, right, it would be okay.
So I even started calling this workload pathological.
But then I say, okay, what about all SaaS systems and social media? Do they have
some query which is executed, which should be fast, and it's executed in many cases?
And the answer is yes. Usually, if people work with your system, you need to select,
it can be cached, of course, but I saw it many times, some primary key lookup to tables like users or posts or projects or blogs or something.
And you see most of sessions, I mean, not only database sessions, but for example, like session in terms of HTTP communication, web sessions.
Most of them need this query, right?
And in this case, you probably might have this problem.
You might have this observer effect.
And, of course, solution would be to start caching them, probably, right?
And so on.
But this is still an issue.
So if you have, for example, a thousand TPS or more of some
primary key lookup, you probably already might have this problem, overhead from pg_stat_statements,
which maybe for the whole workload you have is not so big, as we mentioned, it can be 1% or 0.5%,
but if you zoom into this part of the workload, the primary key lookups, there this
overhead is probably bigger. And maybe the guys I talked to are right, and maybe sampling could be
a good solution. Maybe pg_stat_statements could guess: oh, this is a high-frequency query, a lot of QPS are
happening here, the calls number is high. Maybe I just need to start sampling.
Maybe.
I don't know.
It's an interesting question.
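One way to check whether your own workload has this shape is to look at the highest-calls entries in pg_stat_statements; a sketch (column names per Postgres 13+; for true QPS you would diff two snapshots over an interval):

  # Sketch: find the high-frequency queries where this observer effect matters most.
  # For real QPS, take two snapshots of "calls" and divide the delta by the interval.
  psql -c "
    SELECT queryid, calls,
           round(mean_exec_time::numeric, 3) AS mean_ms,
           left(query, 60) AS query
    FROM pg_stat_statements
    ORDER BY calls DESC
    LIMIT 10;"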
And, of course, another interesting question.
Why Intel, right?
Why Intel?
And I don't have an answer yet.
We see these flame graphs. We realize, okay, on Intel this code is running much longer than on AMD.
Right now, there is an idea to explore older Intels, Cascade Lake,
and maybe even older Xeons, right?
Which may be used much more in production systems.
And maybe also EPYCs, older EPYCs, third and second generation, maybe.
We also have an issue, not related, but we also observe an issue with,
we discussed it, lightweight locks, lock manager contention, where AMD behaves worse than Intel.
So it's kind of interesting. But what we need to understand is: if a lot of sessions run the same query,
it can be bad in terms of the lock manager.
So the solution would be to get rid of planning time using prepared statements.
This is ideal.
Or just reduce frequency and you won't notice this, right?
Or make sure fast path is true always.
And it means that you have only a few indexes and partition pruning works in plans.
And also you have an observer effect
in pg_stat_statements in this case.
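The prepared-statements point is easy to try with pgbench itself; a sketch (the database name and numbers are assumptions):

  # Sketch: -M prepared reuses parsed statements instead of re-parsing and
  # re-planning each execution (the default is the simple query protocol).
  pgbench -S -M prepared -P 10 -T 60 -c 100 -j 100 bench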
Yeah.
And second, Intel versus AMD.
This is like, I don't have answers here,
but it's interesting to just dig into it and understand it.
The thing I've heard you say multiple times in the past is, when you're doing this kind of work, or when you're looking
at your system, it's trying to understand where the bottleneck is. Something I really
like about this piece of work that you've done is you first identified that pg_stat_kcache was
a bottleneck, then switched that off, and then tried to work out what the next
bottleneck was. It looks like it actually might be pg_stat_statements, even though
maybe in your head you were thinking it probably isn't, but let's try turning it off and see if that
makes a difference. And that made a big difference. So it's each time, like, even when
you're thinking about maybe trying to get above that 2.5 million, or maybe 3 million, we have to work out what's the current bottleneck.
That's how we – yeah, exactly.
So that's a really nice thing that I don't see enough people thinking about on their own workloads, but also when they're benchmarking what we're currently limited by.
Right, right, right. So when we perform any performance research, like benchmarks, analysis,
root cause analysis after some incidents on production,
or we try to reproduce problems, we perform, like always,
database systems is a complex thing, and workload usually is quite complex.
So to study the whole, we need to apply this, like I mentioned before we had this call.
I mentioned René Descartes, or however you pronounce it in English.
I think that's because it's French, right?
It's French, yes.
Yeah.
Yes.
I like Rene Descartes.
Right.
So the idea is we need to properly split whole into segments
and then try to study each segment separately.
When you study each segment separately and know how each of it behaves,
for example, okay, there is a high-frequent select.
Let's study how it behaves without anything else. We know how it usually behaves without anything else,
like in a vacuum, right? And by the way, when we study it, we also can divide
into smaller pieces. For example, okay, let's remove this. Let's remove that extension.
So go deeper, deeper.
So basically minimal pieces.
It takes time, of course, right?
But then we know small pieces, how they behave.
We can try to compose it back to complex workload and study already an ensemble of it as a whole.
This is a regular scientific approach, I think, one of the oldest ones, right?
But we must do it here,
and I cannot agree with PlanetScale's
CEO. Okay, it's not
normal, it's not, but we study
it because it's
presented in our complex
production workloads, right?
For example, primary key lookups with
1,000 or more QPS.
It's not
uncommon.
Yeah, well, but to be clear, I'm still a big fan
of encouraging every workload I've ever seen, or every setup
I've ever seen, especially when we're talking about SaaS companies,
like that kind of thing, I would still encourage them to have pg_stat_statements
unless they've done some testing
that shows it somehow is affecting their workload.
I agree.
Yeah.
Even if it was 30%,
I would say in many cases,
we still need to have it
because without it, we are blind.
Yeah.
But if it was 30%,
if we did find that out,
it would be cool to have sampling.
If we then took one in 10, we could
reduce that maybe to 3%.
Maybe it's not quite linear, but you know.
But sampling here should be smart.
It should be applied only to high-frequency
queries.
So anyway,
I think we're almost out of time.
We are.
The bottom line,
I think we should also check our ARM platform as well and see how it behaves.
So there is a difference in pg_stat_statements behavior on regular queries, regular slow queries like updates or deletes, versus high-frequency, very fast selects like primary key lookups.
And there's a difference between AMD and Intel.
So this is the bottom line and I'm excited to see results of our further investigation
of what's happening.
We plan to understand details here.
Nice one.
Thanks so much, Nikolai.
Thank you for listening and catch you soon.
Bye bye.