Postgres FM - Overhead of pg_stat_statements and pg_stat_kcache
Episode Date: February 16, 2024

Nikolay and Michael discuss the overhead of pg_stat_statements and pg_stat_kcache — mostly focusing on some interesting things Nikolay found while stress-testing some large spot instances up to and beyond 2M TPS(!)

Here are some links to things they mentioned:
pg_stat_statements https://www.postgresql.org/docs/current/pgstatstatements.html
pg_stat_kcache https://github.com/powa-team/pg_stat_kcache
pg_stat_statements overhead estimate in pganalyze FAQs https://pganalyze.com/faq
What is the performance impact of pg_stat_statements? (DBA Stack Exchange answer with several links) https://dba.stackexchange.com/questions/303503/what-is-the-performance-impact-of-pg-stat-statements
Towards Millions TPS (blog post by Alexander Korotkov) https://akorotkov.github.io/blog/2016/05/09/scalability-towards-millions-tps
Postgres.ai blog post with links to benchmarks https://postgres.ai/blog/20240127-postges-ai-bot
postgresql_cluster https://github.com/vitabaks/postgresql_cluster
pg_wait_sampling https://github.com/postgrespro/pg_wait_sampling
pg_stat_kcache issue, discussion, and fast resolution https://github.com/powa-team/pg_stat_kcache/issues/41
log_statement_sample_rate https://postgresqlco.nf/doc/en/param/log_statement_sample_rate/
auto_explain.sample_rate https://www.postgresql.org/docs/current/auto-explain.html#id-1.11.7.14.5.3.13.1.3
pg_stat_statements performance penalty on Intel much higher than on AMD https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues/52

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres AI.
Welcome back, Nikolay. How are you getting on and what would you like to talk about today?
Hi, Michael. First of all, thank you for keeping, how is it called, the ball rolling, or how do you say it, long time no see, or long time no hear.
Yeah.
Yeah.
Thank you for very interesting interviews.
I liked listening to them while I was slightly off.
But maybe it's time to return to our regular format or have a mix.
I don't know what we will decide next,
but today we have a regular format, right?
Yeah.
And the topic I brought, this topic, is the overhead of observability tools, the most popular ones. Well, one is the most popular one: pg_stat_statements.
And usually people... like, I remember exactly when we didn't have pg_stat_statements.
We didn't have wait events, didn't have pg_stat_statements, like 15 years ago, right?
And we only had log-based analysis.
And there was a tool written in PHP called pgFouine.
And then another tool replaced it, written in Perl, called pgBadger.
And the idea was, people always, at first, figure out that,
oh, usually we only see the tip of the iceberg.
We see only some queries which are the slowest ones,
but you cannot properly perform holistic analysis if you don't see the fast queries.
Because fast queries might be consuming even more resources than slow queries.
So what people usually did, the experienced DBAs, they said,
okay, I'm going to switch on logging of all queries for a few minutes, to collect everything, and then analyze.
In very many cases, it worked well.
Yes, there is an observer effect, because we might put our Postgres down completely if we log all queries.
But not always.
Sometimes it depends.
So, yeah.
This is quite understandable.
But then pg_stat_statements was created, and some DBAs, I remember, were saying,
oh, you know, like, we still need to study its overhead.
And then somehow there was, like, there were some benchmarks.
I don't remember any benchmarks that I would say I trust.
But I remember many experienced folks started saying,
oh, you know, yes, there is overhead, but it's below 10%.
And since this thing is measuring everything,
it's observing the whole workload, it's worth keeping it enabled.
Let's have it enabled. I remember also there was some number, like many people saying, you know, pg_stat_statements overhead is 7%.
It's kind of strange, but like, kind of 7%. Okay, whose 7%? I don't know. I don't remember any benchmarks that proved it, but I remember the consensus was it's below 10%, so we are all good.
If you could find a good benchmark from those early days of pg_stat_statements, it would be great.
I don't think it exists. But honestly, we as a community should perform good benchmarks there. I don't remember
any. It doesn't mean they don't exist, of course. I did a little bit of searching beforehand because
I was interested in if there was any up-to-date ones as well. I couldn't find anything great,
but there were a few trustworthy sources. One was pganalyze, a commercial monitoring tool that uses pg_stat_statements.
In their frequently asked questions, they gave an estimate of approximately 1% of CPU
time, which I thought was interesting that they're telling their customers. Anyway, as we're going to
discuss, this is all very dependent on workload, right? But at least that's interesting. And I
trust them enough to say that's probably based on some tests on their side, even if
I didn't see the actual benchmarking.
And then the other things I've seen are a series of, well, there's a question on the
DBA Stack Exchange, and somebody's quoting various Postgres authors and a benchmarking
series.
One that was like a more realistic workload that was about 0.5% measured overhead, and another that was a more stress test that measured it about 10%.
So those are numbers, but that's a wild range of different numbers.
And as we're going to discuss later, you've done some work that shows even that it can be very different, which is cool.
I hope by the end of this episode, people will start having some ideas
why this range can be so wide.
But I can imagine like 0.5% or 10%.
In both cases, it can be so.
I mean, it depends on workload, actually.
It very much depends on the workload.
And also not only on workload.
We will talk about it later, right?
So, yeah, okay.
But when I say overhead, of course, I mean CPU overhead, first of all,
because we don't think about disk IO or disk usage, disk space usage here,
network, nothing like that.
Only pure CPU overhead. By the way, when
guys like RDS and others, probably inheriting this from Oracle,
consider the wait event being null in pg_stat_activity as CPU, marking it as green,
CPU green, in Performance Insights,
do they mean that all the other wait events are not CPU?
I don't understand this.
I don't know, because many of them,
for example lightweight locks,
spin locks,
they are purely CPU.
I know we have several listeners
from AWS so maybe they can let us know
we have listeners
yeah yeah
on the other hand if you say
CPU instead of null, null means unknown.
And according to Postgres documentation and source code, it means no wait.
In reality, it means either really no wait, and maybe some kind of CPU work.
But also, in many cases, it means a wait event which is not yet created, so the code is not yet covered with it.
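For illustration, here is a rough sketch of what an ASH-style sampler effectively does: poll pg_stat_activity and label a NULL wait_event as "CPU". This is a hypothetical illustration, not RDS's actual implementation; the labeling choice is exactly the point being questioned.

  # Hypothetical sketch of an ASH-style sampler (not RDS's actual code):
  # a NULL wait_event really means "no known wait" -- maybe CPU, maybe an
  # uninstrumented wait -- yet tools often label it "CPU".
  while true; do
    psql -X -At -c "
      SELECT now(), pid,
             coalesce(wait_event_type || ':' || wait_event, 'CPU?') AS wait,
             left(query, 60)
      FROM pg_stat_activity
      WHERE state = 'active' AND pid <> pg_backend_pid();"
    sleep 1
  done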
For example,
Postgres 16, we recently had some benchmarks
and we saw wait event
which you see in Postgres 16, but
you don't see it in Postgres 14
because it was not yet created there.
So I'm
thinking, okay, it means in performance
insights on RDS,
and I think in many other systems, maybe in Cloud SQL as well,
or this PASH Viewer, an ad hoc tool written in Java,
it also likes to use green color and say this is CPU,
but it's actually not CPU, it's null.
Because CPU doesn't exist in the list of wait events in pg_stat_activity.
So if you marked it as CPU... oh, for 14, because it was null, but then you start distinguishing it for 16,
something is not right here, right? So I'm asking, is it really CPU?
Yeah. So interesting question, but it's slightly off topic.
Back to pg_stat_statements. Let's talk about the benchmarks we had. So the idea was, we got some credits from Google Cloud, and it's great. This year we will be, I mean, the Postgres AI team, we will be performing a lot of benchmarks. I'm going to do a lot of Postgres benchmarking. We already had a lot of benchmarks.
I think thousands of them if you count each iteration.
And hundreds of them if you count whole benchmark consisting of many iterations.
So we are curious how many TPS we can squeeze from Postgres 16 on various big machines,
Intel, AMD.
And there was a good article from 2016 from Alexander Korotkov when both Postgres and
MySQL, I remember, teamed up.
It was interesting.
I think, I don't remember, somebody from Percona also was working on MySQL.
The goal was to demonstrate that both systems can show a million TPS. And during that work,
some contention issues were improved, fixed in Postgres. So Postgres reached a million TPS on some strong machines.
And I like that benchmark because it's simple. It's PgBench, select only.
So it's only selects, very simple ones. So we just repeated the same benchmark.
And as usual for such benchmarks, it's actually stress testing, because you are exploring this
edge. It's not what you need to do for your application, for example, unless you're
studying it on purpose. I hate this behavior being the default in pgbench. It should not be so, because
it provokes performing stress tests instead of regular load tests, exploring the normal situation, for example 25 or 50% of CPU load.
But in our case, we just want to squeeze as much as we can.
So we did this, and I published a blog post about our new AI bot, and it has details and links to how we did it. So we took the newest Intel,
fourth generation Intel Scalable, Sapphire Rapids it's called, right?
And also fourth generation AMD EPYC,
C3 and C3D instances on GCP.
One has, the biggest number is 176
vCPUs, and the AMD one
has 360 vCPUs.
And both have
more than a terabyte of RAM.
So it's insane, but
we use spots.
AWS has spots, GCP
has spots. It's cool. You pay a couple
of bucks for such experiment,
honestly.
I realized even if we didn't have credits,
I would probably pay for it myself.
It's interesting to explore these things
because we just provision temporary machines
for one or two hours,
and spot means huge discount.
Yeah.
So, this kind of experiment, it's classic experiment for stress testing.
You first start with one connection, then more and more connections,
and you control.
We chose Alexander Korotkov's approach.
In pgbench, both -c and -j are the same.
So, we start from one, then we put 50.
With such huge machines, this step is reasonable.
Jumping to 50 right away.
Then 100, then 150, 200, and so on until, I think, 500.
Exceeding the number of vCPUs in both cases.
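For a rough idea, this kind of run looks something like the following; the flags and durations here are illustrative assumptions, not the exact commands from these benchmarks.

  # Illustrative sketch of a select-only stress run (flags and durations are
  # assumptions, not the exact benchmark scripts used here):
  pgbench -i -s 100 bench                    # initialize pgbench_accounts etc.
  for c in 1 50 100 150 200 250 300 350 400 450 500; do
    # -S: built-in select-only script; -c and -j kept equal, as described
    pgbench -S -P 10 -T 60 -c "$c" -j "$c" bench
  done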
So since we use postgresql_cluster,
which is a project, an open-source project,
it's a very good thing,
an Ansible set of playbooks
to provision Patroni clusters with a lot of things.
It's maintained by Vitaly, who is working in my
team. So it's a great
tool, and it has a lot of things
including those we
over years discussed
together, and he just added
into PostgreSQL cluster.
So we
use it, and
because of that, we had a lot of
observability tools, including those for query analysis: pg_stat_statements,
pg_stat_kcache (much less popular), and pg_wait_sampling for
wait event analysis.
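For context, a typical way these three extensions get enabled looks like this; it is a generic sketch, not necessarily the exact configuration postgresql_cluster applies.

  # Generic setup sketch (postgresql_cluster automates this; exact settings may differ).
  # All three extensions must be preloaded, which requires a restart.
  psql -c "ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements, pg_stat_kcache, pg_wait_sampling'"
  sudo systemctl restart postgresql
  psql -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements"
  psql -c "CREATE EXTENSION IF NOT EXISTS pg_stat_kcache"
  psql -c "CREATE EXTENSION IF NOT EXISTS pg_wait_sampling"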
And then we saw a very strange picture: until 50 connections we grow in terms of TPS,
kind of approaching a million TPS, but then go down
very quickly.
And on Intel, we went down even more, but it's a different story.
And thanks to pg_wait_sampling, we saw that the number one wait event is related to pg_stat_kcache.
So we immediately realized this is an observer effect from this pg_stat_kcache.
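For reference, spotting a dominant wait event with pg_wait_sampling can be as simple as aggregating its profile view; a minimal sketch:

  # Minimal sketch: top wait events from pg_wait_sampling's profile view.
  psql -c "
    SELECT event_type, event, sum(count) AS samples
    FROM pg_wait_sampling_profile
    GROUP BY 1, 2
    ORDER BY samples DESC
    LIMIT 10;"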
For those who don't know, pg_stat_kcache is an additional extension to the pg_stat_statements
extension, which provides physical-level metrics: user CPU, sys CPU, real disk I/O at the physical
level, context switches.
So it's very useful to understand real CPU usage, unlike this "wait_event is null", which I think is wrongly presented as CPU in RDS Performance Insights.
This thing is trustworthy; we have used it for many years.
A few big companies use it as well.
It's not super popular.
RDS doesn't have it.
Cloud SQL doesn't have it.
Most managed Postgres providers don't have it. I know Yandex Cloud has it, so they might be the only ones.
Yeah, maybe.
Maybe, yeah. But also, one of the biggest e-commerce companies we worked
with also has it, for many years, and they have very critical systems running on Postgres.
But in this case,
overhead was so huge.
Like, what's happening?
And I immediately published it on Twitter and Vitaly created an issue
in the pg_stat_kcache GitHub
repository.
And then maintainers,
they created a fix
in four hours, I think.
So I think they expected something.
There was some additional lock, which was removed in this fix,
with the idea this lock is something old and it's not really needed.
But in our case, it was slowing everything down when contention is high,
when many, many sessions fight.
Not fight.
Basically, what's happening?
With pgbench, you have, by default, only four queries, right?
If you say select-only, dash uppercase S (-S), it's only a single query,
only one select to the pgbench_accounts table.
That's it.
In this case, hundreds of sessions try to increment
metrics in a single record of pg_stat_kcache, right? And due to that lock,
which was removed in the latest version, it was no good. I mean, the more sessions
we have, the bigger the contention and the bigger the overhead is. And we
see it: as we increase the number of clients,
the number of connections and jobs,
the -c and -j pgbench parameters,
TPS goes down.
So we have room.
We have more vCPUs, right?
But we cannot use them properly
because all of them try to execute basically the same normalized query.
Parameters don't matter here.
Yeah, I've pulled the chart up from your tweet.
And just to give people an idea, at 50 clients, it's about 550,000 TPS.
At 100 clients, it does go up to about 700,000 TPS, but to 150, we're down below the 50 client rate,
and it's only 400,000. Then it settles at about 200 clients and above, about 300,000 TPS,
which is less than we had at 50. So it's a super interesting curve and cool to hear that you got
so quickly to the root cause and that the team
were able to fix it so quickly.
Yeah, in a few hours. Well, let's split the story here into two
paths. The first path is purely pg_stat_kcache, and the second path is without it. Because when we
saw this overhead, we continued our discovery of maximum TPS on modern Intel and modern AMD without pg_stat_kcache, of course,
because we didn't expect it would be fixed so quickly.
But when the fix was ready,
we asked the bot to repeat this benchmark.
It was interesting to convince the bot
that it's safe to download fresh code from GitHub and compile it.
It hallucinated, saying it's against
policy. When I asked which policy, it said the Postgres AI policy. So it was funny. Like, I
went to check it: do we have such a policy? I checked all our docs, because the bot does know our documents.
But it was just pure hallucination.
Yeah.
So then we convinced it, it verified.
And indeed,
we saw that the fix
indeed resolves the problem
and no more such overhead.
So it was good.
And next week,
the new release of pg_stat_kcache was issued.
But interesting question,
like,
as I mentioned,
I trust those people, and it's not only trust, I worked with some of them, and
I touched
production systems with my hands, right?
So I didn't see such problems
for years on production.
But here we saw
obviously very big overhead.
Why so?
Why did we survive with this in production?
This is an interesting question.
Let's return to it after we explore the second path.
Second path: okay, forget about pg_stat_kcache.
We only have pg_stat_statements and pg_wait_sampling.
We know the overhead of both is not so big.
And then we just explore from 1 to 500 connections on both modern Intel platform and modern AMD.
176 vCPUs on one and 360 vCPUs on another.
And a huge surprise was how Intel behaved. Again, the same workload, pgbench -S.
So, select only, single query, very simple, sub-millisecond latency.
Huge surprise was Intel behaves not well at all.
It behaves similar to what we had with pg_stat_kcache enabled.
But higher, yes, higher, it reached 1 million, I think, or so.
Maybe slightly below it. And then went down.
Like, this going down
even before you reach
the number of vCPUs,
while increasing
the number of connections and jobs in
pgbench, it's not normal.
Some kind of problem, obviously.
While AMD also was not good, but it didn't go down. It went down slightly, right? It demonstrated almost a plateau. Very different behavior. And we started studying what's happening here. And obviously, the problem was pg_stat_statements in this case. And we saw it, I think, from wait events by pg_wait_sampling, but we also collected flame graphs.
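A common way to collect such CPU flame graphs is perf plus Brendan Gregg's FlameGraph scripts; a sketch, with paths and durations as assumptions:

  # Sketch: collect a CPU flame graph during the run (paths/durations are assumptions).
  perf record -F 99 -a -g -- sleep 30
  perf script \
    | ./FlameGraph/stackcollapse-perf.pl \
    | ./FlameGraph/flamegraph.pl > flamegraph.svg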
And we obtained two very different pictures for these platforms.
Everything is the same. Ubuntu 22.04, Postgres 16 latest version, everything is like some kind of tuning applied,
the kind of default tuning we usually apply to new clusters under our control.
Nothing special, nothing fancy, but very different behavior.
And on flame graphs, we could see that in the case of Intel, these Sapphire Rapids, pg_stat_statements has a huge s_lock function call, which consumed a lot of time.
And in the case of AMD, this call is much smaller.
So in the case of Intel, it was like 75% of the whole CPU time in the flame graph.
Like, why?
What's happening?
Something not normal at all. And I talked to a few guys,
talked to Andrei Borodin and Alexander Korotkov, and interestingly, they both mentioned the idea that
probably pg_stat_statements needs sampling here, right? Sampling. So yeah, it's an interesting
idea. We don't currently have a parameter.
Yeah, we don't.
Like for a lot of the other things we have, like for logging, for example,
because there's such overhead to logging, we have parameters for that kind of thing.
We can sample to only measure one in 10 or one in 100.
Yeah, yeah.
Well, I'm thinking, I know we've discussed this before,
but for log_min_duration_statement, yeah, I think since more recently, but for auto_explain from a long time ago, for example.
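To make the analogy concrete, these are the existing sampling knobs on the logging side; pg_stat_statements itself has no such parameter today, and the values below are just examples.

  # Existing logging-side sampling knobs (values are examples only);
  # auto_explain.sample_rate requires auto_explain to be preloaded.
  psql -c "ALTER SYSTEM SET log_min_duration_sample = '100ms'"
  psql -c "ALTER SYSTEM SET log_statement_sample_rate = 0.1"    # keep 1 in 10
  psql -c "ALTER SYSTEM SET auto_explain.sample_rate = 0.01"    # explain 1 in 100
  psql -c "SELECT pg_reload_conf()"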
But the 75% makes sense as well. Just to give people an idea of the exact numbers, I've pulled up that chart as well. Intel with pg_stat_statements does get to very nearly a million TPS without pg_stat_kcache, and then drops a little
bit less rapidly, but still it drops down to 500,000 at 150, and then down to about 300,000 by the
time you get into the hundreds of clients. Whereas AMD with pg_stat_statements continues to rise at 100 clients to about 1.2 million, gets above 1.5 million at about
150 clients, then seems saturated, and it's mostly a plateau, a slight decline as you get more.
So that's about five times more by the time you get to, you know, 300,000 versus 1.5 million.
Starts to make sense that that's like roughly 75% overhead, I guess.
Yeah.
So in all cases, when you reach the usual number of vCPUs, you go down.
But normally you go down slowly, gradually, right?
The bad picture demonstrated on Intel, like very acute, acutely going down,
going down very fast. And this is not
normal. And, of course, we wanted
to confirm that pg_stat_statements is
involved, and also to get the numbers
we wanted, we wanted big numbers,
right?
When I posted it on Twitter,
of course, people started liking the post.
Like, with AMD,
I think we've got
without pg_stat_statements, just removing it,
we've got 2.5 million
TPS, right?
On AMD. Nearly, yeah.
Almost, almost, right.
Above 2 million TPS.
My tweet was 2 million TPS on
Postgres 16, right.
Well, yeah
it was funny to
see the reaction from
CEO of PlanetScale.
Like, it's
not real workload.
Well, I don't know it's not real
workload. It's select only.
It's purely synthetic
workload. We're just exploring some
edge of what system
can provide. Of course it's not
realistic, it's just some selects, that's it. Well, maybe there are some systems which need mostly
this kind of workload, and this is an interesting question, because if they have such a
workload, they suffer from bigger problems from pg_stat_statements. This is, I think, the main
idea, or should be the main idea, of today's episode.
Look at your workload and understand the nature of it.
But obviously, CEO of PlanetScale
is not interested in single node performance, right?
Because the main thing is sharding, right?
So single node should not provide millions of TPS, right?
It's not normal, right?
It doesn't help the marketing.
Right, right, right.
Everyone realizes it's select only.
Everything is cached in memory.
But still, 2.5 million TPS.
Wow.
On machine, you can just rent easily.
Well, it's expensive machine.
I think if it's not spot, if it's normal without any discounts,
it's above $10,000
per month. So it's
expensive machine. Of course.
But it's possible. And this is just
Postgres with minimal tuning.
It's good. Like 2
million TPS.
Of course, I'm slightly sad
comparing to 2016. How many
years already passed?
Like eight?
One million?
Two million?
Maybe we can do more.
We can squeeze maybe more, but it's a separate question.
We will probably think about it.
But returning to pg_stat_statements, what's happening?
Again, a lot of sessions compete, trying to update the same record in pg_stat_statements.
It's just a single record, a single select query, that's it.
If it was different queries, right, it would be okay.
So I even started calling this workload pathological.
But then I say, okay, what about all SaaS systems and social media? Do they have
some query which is executed, which should be fast, and it's executed in many cases?
And the answer is yes. Usually, if people work with your system, you need to select,
it can be cached, of course, but I saw it many times, some primary key lookup to tables like users or posts or projects or blogs or something.
And you see most of sessions, I mean, not only database sessions, but for example, like session in terms of HTTP communication, web sessions.
Most of them need this query, right?
And in this case, you probably might have this problem.
You might have this observer effect.
And, of course, solution would be to start caching them, probably, right?
And so on.
But this is still an issue.
So if you have, for example, a thousand TPS or more of some
primary key lookup, you probably already might have this problem, overhead from pg_stat_statements,
which maybe for the whole workload you have is not so big, as we mentioned, it can be 1% or 0.5%,
but if you zoom into this part of the workload, the primary key lookups, there this
overhead is probably bigger. And maybe the guys I talked to are right, and maybe sampling could be
a good solution. Maybe pg_stat_statements could guess: oh, this is a high-frequency query, a lot of QPS are
happening here, the calls number is high. Maybe I just need to start sampling.
Maybe.
I don't know.
It's an interesting question.
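One way to check whether your own workload has this shape is to look at the highest-calls entries in pg_stat_statements; a sketch (column names per Postgres 13+; for true QPS you would diff two snapshots over an interval):

  # Sketch: find the high-frequency queries where this observer effect matters most.
  # For real QPS, take two snapshots of "calls" and divide the delta by the interval.
  psql -c "
    SELECT queryid, calls,
           round(mean_exec_time::numeric, 3) AS mean_ms,
           left(query, 60) AS query
    FROM pg_stat_statements
    ORDER BY calls DESC
    LIMIT 10;"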
And, of course, another interesting question.
Why Intel, right?
Why Intel?
And I don't have an answer yet.
We see these flame graphs. We realize, okay, on Intel this code is running much longer than on AMD.
Right now, there is an idea to explore older Intels, Cascade Lake,
and maybe even older Xeons, right?
Which may be used much more in production systems.
And maybe also EPYCs, older EPYCs, third and second generation, maybe.
We also have an issue, not related, but we also observe an issue with,
we discussed it, lightweight locks, lock manager contention, where AMD behaves worse than Intel.
So it's kind of interesting. But what we need to understand is: if a lot of sessions run the same query,
it can be bad in terms of the lock manager.
So the solution would be to get rid of planning time using prepared statements.
This is ideal.
Or just reduce frequency and you won't notice this, right?
Or make sure fast path is true always.
And it means that you have only a few indexes and partition pruning works in plans.
And also you have an observer effect
in pg_stat_statements in this case.
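The prepared-statements point is easy to try with pgbench itself; a sketch (the database name and numbers are assumptions):

  # Sketch: -M prepared reuses parsed statements instead of re-parsing and
  # re-planning each execution (the default is the simple query protocol).
  pgbench -S -M prepared -P 10 -T 60 -c 100 -j 100 bench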
Yeah.
And second, Intel versus AMD.
This is like, I don't have answers here,
but it's interesting to just dig into it and understand it.
The thing I've heard you say multiple times in the past is, when you're doing this kind of work, or when you're looking
at your system, it's trying to understand where the bottleneck is. Something I really
like about this piece of work that you've done is you first identified that pg_stat_kcache was
a bottleneck, then switched that off, and then tried to work out what the next
bottleneck was. It looks like it actually might be pg_stat_statements, even though
maybe in your head you were thinking it probably isn't, but let's try turning it off and see if that
makes a difference. And that made a big difference. So it's each time, like, even when
you're thinking about maybe trying to get above that 2.5 million, or maybe 3 million, we have to work out what's the current bottleneck.
That's how we – yeah, exactly.
So that's a really nice thing that I don't see enough people thinking about on their own workloads, but also when they're benchmarking what we're currently limited by.
Right, right, right. So when we perform any performance research, like benchmarks, analysis,
root cause analysis after some incidents on production,
or we try to reproduce problems, we perform, like always,
database systems is a complex thing, and workload usually is quite complex.
So to study the whole, we need to apply this, like I mentioned before we had this call.
I mentioned René Descartes, or however you pronounce it in English.
I think that's because it's French, right?
It's French, yes.
Yeah.
Yes.
I like Rene Descartes.
Right.
So the idea is we need to properly split whole into segments
and then try to study each segment separately.
When you study each segment separately and know how each of it behaves,
for example, okay, there is a high-frequent select.
Let's study how it behaves without anything else. We know how it usually behaves without anything else,
like in a vacuum, right? And by the way, when we study it, we also can divide
into smaller pieces. For example, okay, let's remove this. Let's remove that extension.
So go deeper, deeper.
So basically minimal pieces.
It takes time, of course, right?
But then we know small pieces, how they behave.
We can try to compose it back to complex workload and study already an ensemble of it as a whole.
This is a regular scientific approach, I think, one of the oldest ones, right?
But we must do it here,
and I cannot agree with PlanetScale's
CEO. Okay, it's not
normal, it's not, but we study
it because it's
presented in our complex
production workloads, right?
For example, primary key lookups with
1,000 or more QPS.
It's not
uncommon.
Yeah, well, but to be clear, I'm still a big fan
of encouraging every workload I've ever seen, or every setup
I've ever seen, especially when we're talking about SaaS companies,
like that kind of thing, I would still encourage them to have pg_stat_statements
unless they've done some testing
that shows it somehow is affecting their workload.
I agree.
Yeah.
Even if it was 30%,
I would say in many cases,
we still need to have it
because without it, we are blind.
Yeah.
But if it was 30%,
if we did find that out,
it would be cool to have sampling.
If we then took one in 10, we could
reduce that maybe to 3%.
Maybe it's not quite linear, but you know.
But sampling here should be smart.
It should be applied only to high-frequency
queries.
So anyway,
I think we're almost out of time.
We are.
The bottom line,
I think we should also check our ARM platform as well and see how it behaves.
So there is a difference in pg_stat_statements behavior on regular queries, regular slow queries like updates or deletes, versus high-frequency, very fast selects like primary key lookups.
And there's a difference between AMD and Intel.
So this is the bottom line and I'm excited to see results of our further investigation
of what's happening.
We plan to understand details here.
Nice one.
Thanks so much, Nikolai.
Thank you for listening and catch you soon.
Bye bye.