Postgres FM - Four million TPS

Episode Date: July 5, 2024

Nikolay talks Michael through a recent experiment to find the current maximum transactions per second single-node Postgres can achieve — why he was looking into it, what bottlenecks occurred along the way, and ideas for follow-up experiments.

Here are some links to things they mentioned:

How many TPS can we get from a single Postgres node? (Article by Nikolay) https://www.linkedin.com/pulse/how-many-tps-can-we-get-from-single-postgres-node-nikolay-samokhvalov-yu0rc
Chat history with Postgres AI bot https://postgres.ai/chats/01905a83-4573-7dca-b47a-bb60ce30fe6c
Our episode on the overhead of pg_stat_statements and pg_stat_kcache https://postgres.fm/episodes/overhead-of-pg_stat_statements-and-pg_stat_kcache
PostgreSQL 17 beta 2 is out https://www.postgresql.org/about/news/postgresql-17-beta-2-released-2885/
PostgreSQL and MySQL: Millions of Queries per Second (about the work by Sveta and Alexander) https://www.percona.com/blog/millions-queries-per-second-postgresql-and-mysql-peaceful-battle-at-modern-demanding-workloads
postgresql_cluster https://github.com/vitabaks/postgresql_cluster
Track on-CPU events for pg_wait_sampling https://github.com/postgrespro/pg_wait_sampling/pull/74
The year of the Lock Manager's Revenge (post by Jeremy Schneider) https://ardentperf.com/2024/03/03/postgres-indexes-partitioning-and-lwlocklockmanager-scalability
Pluggable cumulative statistics (Postgres hackers thread started by Michael Paquier) https://www.postgresql.org/message-id/flat/Zmqm9j5EO0I4W8dx%40paquier.xyz

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.AI. Nikolay, what are we talking about today? Hi, Michael. Let's talk about how many TPS we can get from a single Postgres node. I recently achieved something. I'm curious, is it some interesting achievement or, like, no? I'm sure it's possible to get more on bigger machines, because my machine... yeah, big enough, right? It's like just 360 vCPUs and just two terabytes of RAM or something, maybe 1.5 terabytes of RAM. Yeah, fourth generation AMD EPYC.
Starting point is 00:00:42 There is already fifth generation AMD EPYC, and there are Intels also, Xeon Scalable, bigger instances and so on. It's possible. But I've almost achieved 4 million transactions per second, full-fledged transactions, not just like what other NoSQL databases have.
Starting point is 00:01:02 It's transactions. Read-only, we should admit, read-only, and meaningful transactions. I mean, it was a table with, I think, 10 million rows. It's a tiny table for this machine. Everything is cached, of course. And it was a primary key lookup and PgBench. Regular PgBench accounts table, you know it, right?
Starting point is 00:01:27 So, yeah. Is it single read per transaction, single query per transaction? Is that right? That's a good question, actually. Let's double check. I will double check it right now. But I think, yes, it's just select from PgBench, random number, random ID. And that's it.
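For reference, the built-in select-only transaction that pgbench runs looks roughly like this; the random aid is generated client-side by pgbench:

```sql
-- pgbench's built-in "select-only" transaction, roughly:
\set aid random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
```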
Starting point is 00:01:44 It should be. I was just hoping you'd get to multiply your 4 million by 2 or something. Get queries per second. Let me tell you what you don't know. Yesterday I was driving to LA and I used this. I'm thinking I should check my AI and get even more TPS. And thankfully I had AI with autopilot in the car, right? So I could...
Starting point is 00:02:12 Maybe it's still not a good idea to use phone, but it's just a quick question, right? So I asked my bot to repeat this TPS. Oh, I forgot to mention, all these experiments are done with our Postgres AI bot, which simplifies experimentation a lot. So, if you need to provision a huge machine,
Starting point is 00:02:32 run some PgBench during a couple of hours according to some script, you just ask for it, then you think, oh, what if we change something else? And it's just a few words. You just ask, and then it reports back in a couple of hours. So I asked, can you please check not regular select only for PgBench,
Starting point is 00:02:52 which is very standard Postgres for benchmarks, quite simple but widely used. But rather, let's just use select semicolon. Not meaningful, right? Not dealing with data at all, but still it semicolon. Not meaningful, right? Not dealing with data at all, but still it needs to be parsed, right? And so on. I had it on my list of questions for you to ask what would happen if we got, I was going to suggest select one, which I guess is the beginner's version. I can tell you, we exceeded 6 million.
Starting point is 00:03:21 6 million, wow, nice. But I don't consider it fair because it's just heartbeat. It depends what you're measuring, right? If you're trying to measure what's the most we can achieve, then sure, it's meaningful. But if you're actually trying to do something meaningful, read-only is probably not realistic. There's so many things you could say.
Starting point is 00:03:43 I like the idea of what's the maximum we could get if we don't worry about constraints, about how realistic it is. Right. So, yeah, in this case, we only checked the parser and general transaction, read-only transaction, not full-fledged transaction, all these pieces.
Starting point is 00:04:03 But we didn't check how Postgres works with, like, for example, buffer pool with reading data and so on. Because I know there was some contention in the past when even for read-only queries, which was improved in, I think, in 9.6 on the road to achieve 1 million TPS. We will probably discuss this slightly later. But in general, what's interesting, I needed to explain. I hate select one statements. You know why, right? No. Because one is not needed.
Starting point is 00:04:36 Fair enough. You can say select in Postgres. It will return you results that consisting of zero rows and zero columns. Can you just run semicolon on its own or something? Yes, you can. I think we also discussed it. Must have done. Yeah, it has issues, negative side effects,
Starting point is 00:04:54 because, for example, you won't see it in logs if somehow it's slow. It's like empty query. You won't see it in PGC statements, for example, unlike regular SELLECT semicolon. And I failed to convince bot to run semicolon. Somehow it didn't work. And I also had difficulties convincing bot to select.
Starting point is 00:05:15 It started telling me, I think it was Gemini. It was the specific aggressive manner of Gemini for sure. So it started telling me, you know, like it's incorrect. It's invalid SQL statement. So let's like it offered me let's put one at least they're like, or where's the table name? Like, okay, let's I said no tables, no tables, just like, idle queries, like no real action. And then said, like, let's put at least one because otherwise it's incorrect statement i i of course like are you sure you can check so we had some battle but in general yeah it worked and we achieved six million tps which was good
Starting point is 00:05:58 but semicolon i didn't work didn't work for our. I think it's possible. I don't expect a huge win compared to Select. Maybe it will be Parse. I don't know. You could run the experiment manually, right? Could you just submit the JSON yourself? It's better to convince the board. You prefer to fight the LLM, fair enough. A little bit, yeah, because every time we do it,
Starting point is 00:06:24 I find opportunity to improve, of course. Fair. But back to meaningful results, which is interesting. We almost achieved 4 million, but originally it was just 1 million. And when I saw just 1 million on modern Postgres, we checked 16 and 17 beta 1. By the way, beta 2 is already out. I think we should update and switch to it in our tests.
Starting point is 00:06:48 So I expected much more because I remember the post from Alexander Korotkov from 2016 when I remember he worked with Percona people, Sveta Smirnova particularly. They had had friendly battle let's see what's better, Postgres or MySQL in terms of how many TPS we can get from a single node. And I think
Starting point is 00:07:15 both Postgres and MySQL learned something from it and Alexander created a few patches because some issues were found, some contention was found. And it was improved and in Postgres 9.5 it was not so good and 9.6 it was improved just based on this work. And he achieved 1 million TPS. MySQL also achieved 1 million TPS, good. But I was surprised like so many years like 8 years right
Starting point is 00:07:45 we have so many Postgres versions basically 8 Postgres versions already since then and maybe 7 and much better hardware not much but significantly better hardware so I expected to get more from this
Starting point is 00:08:01 well we looked into the machines didn't we and they even back then they had managed to get more from this. Well, we looked into the machines, didn't we? And even back then, they had managed to get their hands on a 144 vCPU machine. And I think you mentioned yours was a bit more than double that, was it? 360. And I think they had the Intels, all much older Intels, and I had AMD, which is very easily available on GCP, which I use because I have credits there. Right. So yeah, before we started this call, I, I, I cannot miss this.
Starting point is 00:08:35 You showed me what I double yes has super impressive. How many, almost six, six, almost 800 more 896 vCPU and it's Intel scalable fourth generation yeah it'll cost you but also 32 terabytes of RAM maximum 224
Starting point is 00:09:02 x large and they don't have spots, of course. This is something new. Yeah, and it's fourth generation, little scalable. What I found interesting, like during last year and this year, various benchmarks, not this like toy benchmark, it's a side project for me, but working with customers, various cases.
Starting point is 00:09:22 What I found, what my impression right now from Intel and AMD, recent versions of them available in cloud, is that AMD, for example, on GCP, you can get more vCPUs and reach more TPS on very simple workloads. And it will be cheaper in general in terms of how many, like if you divide TPS per dollar, find TPS per dollar, it will be cheaper probably and you can achieve higher throughput for simple workloads. But when it
Starting point is 00:09:55 comes to complex workloads and various problems like lightweight lock manager contention we discussed a few times. And in general complex workloads and behavioral photo vacuum and so on lightweight lock manager contention we discussed a few times. And in general, complex workloads and behavioral photo vacuum and so on, AMD looks less preferable and Intel's are winning.
Starting point is 00:10:14 So that's why I'm additionally impressed that AWS offers Intel scalable Xeons for generation, almost 900 vCPUs. It's insane for me. 32 terabytes. Wow. And you think I can exceed 4 million TPS, right? It's a joke, I guess.
Starting point is 00:10:36 It'd be interesting to hear more about what you think the current bottlenecks are, like where you think but I guess my stupid brain is thinking this is probably CPU constrained at this point throwing more at it makes sense that it could throw double at it, maybe you can
Starting point is 00:10:59 get close to doubling it well, we collect a lot of information, we a weight event analysis yeah we have flame graphs collected automatically for each step and still like still needs to be analyzed deeper but what i see right now the bottlenecks are like mostly communication let's actually maybe let's take a step or a couple of steps back and discuss what issues I had originally. Why only one million? Can we take a step back further than that?
Starting point is 00:11:32 Why were you doing this? Yeah, it's a good question. I was asking myself, why do we do this? Well, for fun, first of all, actually. I was very curious. I remember that work. I think for Alexander
Starting point is 00:11:51 and Sveta, it was also for fun eight years ago. But it led to some interesting findings and eventually to optimizations. So this fun was converted to something useful. And we started doing this for fun because we have like big machines, we have very good automation, we have newer
Starting point is 00:12:12 Postgres. I was just curious how easy it is for us to achieve 1 million and maybe go further. And first thing we saw, since we have a complex system, a lot of things and observability tools. First thing we saw is that one of extensions is a huge bottleneck. It was pgstart kcache. And it even didn't let us achieve 1 million TPS, quickly showing huge observer effect, meaning that just measuring... pgRK cache extends PGSR statements to see metrics related to physical metrics like CPU, real CPU,
Starting point is 00:12:51 user time, system time, then real disk I.O., not just talking to the page cache. And context switches and very good detailed physical query analysis, right? And we like to have it,
Starting point is 00:13:11 but if you have it until the latest versions, it quickly became huge overhead when you have a lot of queries per second for a specific query. I remember we discussed this back in, I just looked up the episode back in February about the overhead. Right, exactly. Yeah, I'll link that episode up as well.
Starting point is 00:13:31 Right, and yeah, just to recap, we found that it shows up in weight event analysis using PGA Web Sampling, great. And we just quickly excluded it and also reported to the maintainers and they fixed it in four hours. It was amazing. And we tested it with Bot, confirmed that now it's improved.
Starting point is 00:13:52 So this is how we achieved 1 million. But why only 1 million? We have newer Postgres and better hardware, right? Did I answer the question why, by the way? You said for fun. For fun, right? right yeah this is the key actually here i i enjoy it you know but let's just run pg bench like that way or this way let's collect more data points let's visualize it let's repeat this but with adjusted version let's compare
Starting point is 00:14:20 various versions for example right so it sounds like for fun but also like you've got credits and also dogfooding right like trying yeah yeah trying it out on here's an idea like yeah yeah we try to collect successful experiments and we do it on smaller machines as well but on bigger machines we collect just to understand like how system is working. Of course, we're moving towards more useful benchmarks, for example, when it will be already some specific database and specific kind of workload, and maybe not only PgBench and so on. And all the tooling will be in place,
Starting point is 00:14:58 and automation is super high, so you can just talk to chatbot and get results visualized and collected with all details and you you don't miss any point so we collect almost 80 artifacts for each data point it's like huge it reflects many years of experience i would say not only mine for example you can say okay let's compare versions 12 to 17 beta 1. This is what I did as well. And I confirmed that for regular PgBench workload, or for select-only PgBench workload, no difference. These versions are stable.
Starting point is 00:15:34 No regression, but also no wins. 12 through 17. Yeah. But for select-only, this is exactly like workload which allowed me to achieve almost 4 million TPS. For this, 12 is losing significantly. Wait, wait, wait. You lost me. So for read only... No, no, for regular. Oh, for read-write. For read-write.
Starting point is 00:16:03 You know, regular transactions, PagerBench has, it's like, I don't know, like a couple of updates, insert, delete, a couple of selects, and they are packaged as a single transaction, and then you just say, okay, let's just check it. And the way we check it, it's called like actually not load testing, but stress testing, specialized version of load testing, and we check in the edge. We say, okay, with one client, how many?
Starting point is 00:16:30 50 clients, how many? 100 clients, how many? And then you have a graph dependency of TPS, how TPS depends on the number of clients. Yeah. And it's interesting that we cannot say zero and due to scale, we need to jump with big steps like 50.
Starting point is 00:16:48 So I say one, 50, 100, then dot, dot, dot, 500. And the bot understands that it needs to go with 50 increments. But first, dot is shifted to one, and it's okay. So it's interesting it's like it adjusts very quickly so yeah for fun and this and uh back to versions comparison so we found that for select only we don't see difference but for regular workload and we don't get million transactions there we have only like 30 000 maybe 40 000 something like quite lower because it's already writing transactions and so on we saw that 12 is
Starting point is 00:17:33 losing apparently and 16 17 are winning slightly compared to like average for these old versions but yeah i think we will start continue exploring and maybe test some specific workloads, not just standard ones. And just to check, this is like completely untuned, like just default config. Okay, it's tuned. Good to check. Yeah, we use PostgreSQL clusters,
Starting point is 00:17:59 Ansible playbooks, maintained by Vitaly, who works with me, and Vitaly Kukharik and it's a great project if you don't like Kubernetes for example and actually containers you just need bare Postgres installed on Debian
Starting point is 00:18:15 Ubuntu or something and you want automation for it this is a good project it goes with everything like everything most popular stuff like Patroni BGP Crest or RollG
Starting point is 00:18:29 what else like a lot of a lot of simple things actually even with timescale DB it's packaged so you can choose
Starting point is 00:18:37 to use it and quickly have it on your like cloud or so so it's for self-managed
Starting point is 00:18:43 Postgres it's a good option yeah it takes time to install of course because it will be running like apt install like cloud or self. So for self-managed Postgres, it's a good option. It takes time to install, of course, because it will be running like apt-install, apt-get-install, right? So it's coming with some tuning.
Starting point is 00:18:56 It's similar to what cloud providers do, like 25% for shared buffers. It also adjusts operational system parameters, like kernel settings. I guess the dataset is so small that 25% is still so easily fits within. Yeah, yeah, of course. And maybe we can tune additionally, but I didn't see the point yet.
Starting point is 00:19:17 This default tuning this project provides was also like... I know it quite well because we adjusted in the past together a few times. So I know how it works. So I relied on it for now, but maybe we will additionally adjust some things, but queries are super simple, right? Data is cached. So I understand. I'm just trying to think like the major things like all in all cached i guess basic like right head log stuff when you've got the read write stuff going on just sim yeah it doesn't
Starting point is 00:19:54 matter because we don't write so in one of them in the 30 000 i reckon the 30 000 is low and you could get much higher there. Well, yeah. If we talk about writing transactions, of course, we should think about checkpoints and so on. And it comes with some adjustments based on the size of machine already. So I just relied on this default behavior. We can look further, and good thing,
Starting point is 00:20:22 we consider settings as one of artifacts we collect. So for history, we record everything, all custom settings and some system info as well. So we can revisit it later and think how to improve and next step. But bottlenecks are not there yet. And I was wondering, last question I had on the top line number, I get keeping PG weight sampling for being able to introspect, for being able to try and work out where is the bottleneck now.
Starting point is 00:20:53 It's great for that. But doesn't it also add a small amount of overhead? I know it's only sampling, so it's probably minimal. But were you tempted to try without it as well and just see what tickets you got? I think we did, and I think it's very minimal. It's try without it as well and just see what tickets you got? Yeah, I think we did. And I think it's very minimal. It's worth revisiting as well. But for now, our impression is PG1 sampling among...
Starting point is 00:21:11 That's why, by the way, we encountered the problems with PG.k cache because we used basically the package we use for various kinds of experiments in the case of self-managed approach, right? Had it more automatically on by default. Yeah, we already had these observability tools there
Starting point is 00:21:32 and this is how we found the PG-SAT-K cache problem. As for PG-Wide sampling among all three extensions, PG-SAT-K cache, PG-SAT statements, and PG-Wide sampling, by the way, both PG-Wide sampling and PG-SAT-K cache depend on PG-SAT statements PG-Word sampling. By the way, both PG-Word sampling and PG-stat K-cache depend on PG-stat statements
Starting point is 00:21:48 because they basically extend it for query analysis. They all provide query analysis. Regular PG-stat statements provides query analysis like regular database metrics like calls, timing, IEO metrics at upper level, buffer pool metrics, right? KCache goes down to physical level, CPU and disk. And PGWet sampling for each query, it also provides like profile for in terms of weight events, which is super useful.
Starting point is 00:22:21 It's like different angles of analysis, right? So if you can have them all three, it's super useful. It's like different angles of analysis, right? So if you can have them all three, it's super good. Also worth mentioning, it's off topic, but interesting. Until very recently, PGW sampling didn't register events where wait event is null, which most wait event analysis we know, such as RDS Performance Insights, they picture it as a green area called CPU. And by the way, Alexander Korotkov, who was originally the author of PGA weight sampling,
Starting point is 00:22:59 confirmed my idea that it's not fair to name it CPU. It should be like CPU or some unknown wait event which is not yet implemented because not everything is covered by a wait event analysis in the code base, right? And until recently, the PGW sampling had a huge problem. Huge problem. It didn't register nulls,
Starting point is 00:23:21 like naming them somehow, at least like CPU or something, at all. But guys from CyberTech implemented it a couple of weeks ago. It's huge news. It's good. Thank you, guys, if you listen to us. I'm happy that it's implemented. And I know GCP
Starting point is 00:23:37 Cloud SQL also has PGWet sampling. And I like the idea that unlike in Performance Insights in RDS, we have interface to this data. I mean, good interface with SQL, right? We can just query this data right from our database and build our own observability extensions or automation.
Starting point is 00:24:01 For experiments, it's super important because I want to take snapshot and it's easier for me just to take snapshot using SQL. Let's collect profile global and for per query. And we dump it to CSV files, basically. That's it for long-term
Starting point is 00:24:18 storage. So we quickly found KCache, solved it, and we stuck at 1 million TPS, which made me sad because it just repeated the experiment Alexander had eight years ago. So what's next? Do you know what was next?
Starting point is 00:24:34 I've cheated. I've read your post already. You know. Yeah. And I think the last episode as well gave it away in February with huge set statements. Is this a set statement? Huge bottleneck for such cases.
Starting point is 00:24:48 I consider this edge case because normally we don't have super high-frequent query, super fast, super high-frequent, and a lot of CPUs. But theoretically, it can happen. For example, you have Postgres and a few replicas, and you have some huge table, maybe not so huge, single index, and you need just to find records. Basically, almost like a key value approach. Maybe partitioned. And then queries are super fast, a lot below one millisecond. So in this case, if you have a query which after you remove parameters is the same,
Starting point is 00:25:34 you have single normalized query, very high frequency. Yeah, there are probably only two or three normalized queries in that PGBench. Well, by the way, I double-checked for select only just a single query. It's one. Okay, wow. It has also set this macro backslash set to get random, but I think it's client-side, so it's not going to Postgres. It's only for PGgbench itself to generate random number and i don't think it's a bottleneck i hope no oh by the way our pgbench clients were sitting
Starting point is 00:26:13 on the same host as posgus which is interesting right because they consume cpu as well and unfortunately right now we don't collect host stats and so on. We don't know how much of CPU was written. But usually, it's noticeable. It's not 50% usually when you do experiments like that, co-hosting clients and server. But I think it's noticeable, I would say maybe 10%, 20%. So if you offload them somehow, but if you offload them, you bring a network. Yeah.
Starting point is 00:26:44 And we will get back to that point. So single machine, clients on the same machine, eating some CPU. And if you have this pattern, just a single query, normalized query, PGC statements becomes huge bottleneck.
Starting point is 00:26:59 Just removing PGC statements, we jumped from 1 million to 2.5 million TPS. Yeah. Yeah. And then, what's next? moving PGCR statements, we jumped from 1 million to 2.5 million TPS. Yeah, wow. Yeah. And then what's next? We reached 3 million. How? Let's recall. Then we reached 3 million just
Starting point is 00:27:16 because I forgot originally I should use it. I remember Alexander used it in 2016. Prepare statements, right? I forgot to use it and how did I understand that I forgot? Looking at flame graphs. Our bot collects flame graphs.
Starting point is 00:27:32 And I just was inspecting flame graphs and I just saw planning time. And I'm thinking, oh, we spent a lot on planning time. Let's just get rid of it. So the right way is just to say dash capital M hyphen, capital M prepared for PidginBench.
Starting point is 00:27:54 And it's a perfect workload for it because it's simple, same query over and over again, no variance. Perfect. Right. So also interesting that when you, interesting, again, step back. When you see that starting from 150, 100, 150, 200, you go to 500 clients, then you have 360 cores. And when you see it goes up to 100 or 150,
Starting point is 00:28:24 and then it's plateau. It also gives you a strong feeling it's not normal, right? Because it means other Postgres doesn't scale in terms of number of parallel clients, right, to number of cores. Are you expected to scale to number of cores? Of course, we have clients running on the same host, so maybe the situation will start earlier, but definitely not at 100 or 150 if you have 360 cores.
Starting point is 00:28:53 So this is how I saw this is not right. And it happened with pgstart kcache and then pgstart statements. Situation started earlier. So yeah, some bottleneck, obviously. But now the curve looks already much better. It's already like we reach maximum point already close to number of cores.
Starting point is 00:29:13 300 or 360. Maybe 400, actually, slightly bigger. So then it doesn't make sense to increase, but we just check it, trying to find plateau. So prepared statements gave us more than additional half a million of TPS. Oh, I expected more, actually. Only because, you know, when you're looking at super fast primary key lookups, quite often planning time is more than execution time
Starting point is 00:29:42 in a simple explain-analyze. Interesting, yeah. Yeah, so it's just interesting to me that it wasn't like double. But yeah, half a million more is nothing to look down our noses at. Yeah, well, interesting. Maybe I should think about it and explore additionally. But this is how it is. Did you do force generic plan as well?
Starting point is 00:30:02 No, not yet. Mind you, that would only that would only like be five yeah no that probably won't help here's the thing is it's my my to-do maybe for this week or next when i have time i don't have time this is again this is side project just for fun and understand general behavior of postgres what i remember about this uh generic plan forced is that Jeremy Schneider posted a very good overview of problems others had
Starting point is 00:30:31 with lock manager recently, right? And I remember some benchmarks, some specific case someone showed on Twitter. If you just tell pgbench, if you adjust user, for example, alter user
Starting point is 00:30:47 and set force plan cache mode, this force generic plan, it leads to huge degradation in terms of TPS. So I need to check it. I need to check it. And lightweight log manager pops up
Starting point is 00:31:03 in top of wait events. So I need to double check it because I don't understand the nature of it. It's super interesting. I realized soon after saying it that the reason I thought it would help is not going to help here because you've got the same query over and over, not lots of little different queries. But that should be cached for sooner, right? Yeah, but after five executions which
Starting point is 00:31:26 is probably yeah it's like less than half a millisecond total there should be no difference in my opinion but no degradation why degradation is also true yeah yeah fair fair something interesting to check or maybe i just misunderstood this is my to do to to clarify so i guess uh it's like it's interesting like exploring Postgres behavior on the edge. So let's just recap. Like, pgSat kcache, we removed it, also fixed, but did we get it back? We should get it back, actually,
Starting point is 00:31:54 because now it's much better. They just removed some lock, and it's good again. Then we found pgSat statements. Oh, so we achieved 1 million. pgSat statements, we removed pgSat statements. This is one million. We removed pg-star statements. This is the key. By the way, when I say
Starting point is 00:32:08 we keep only pg-weight sampling, but it also depends on pg-star statements, I also feel some inconsistency in my... I've looked it up. When you said it, I got confused. It doesn't depend on it. It's just more stuff is possible.
Starting point is 00:32:25 It's optional, right? Yeah. It can run alone, as we do right now in these experiments. But if pgSatK cache is present, we can join data using query ID, right? Exactly. This is a good correction because I said the wrong thing. But pgSatK cache does depend on pgSatK cache. You cannot install it without it.
Starting point is 00:32:45 This is for sure. Okay. And these guys are not available in cloud environments usually. PgWord Sampling is in Cloud SQL. I mean, not in cloud, in managed Postgres. PgWord Sampling is available on Cloud SQL, but that's it. But maybe after this podcast, people will consider adding it. Especially with PgBot sampling
Starting point is 00:33:07 which is really great. Especially now with this fix from CyberTech. So we removed PgStat statements, reached 2.5 million. We added prepared statements, exceeded 3 million.
Starting point is 00:33:24 And then final step how I almost approached 4 million. It's like 3.75 million TPS. How? Like last optimization. It's actually not super fair optimization. My last optimization was let's just switch from TCP connection to Unix domain circuit connection, which is possible in limited
Starting point is 00:33:45 cases because it works only when you're on the same node. You should exclude network completely. And of course, this is obvious. TCP, IP, TCP, TCP, IP connections are heavier, definitely, like more overhead than just when processes talk through Unix domain circuit. It's much more lightweight. And this allowed me to jump from 3-something to 3.75 million CPS, which is good. And also I found that 17 works worse on Unix domain circuit than 16. Interesting.
Starting point is 00:34:23 Yeah, this is also in my to-do. Why? For TCP connections, no difference. On this edge case, like select only. But for Unix domain circuit, I see
Starting point is 00:34:34 degradation and it's worth exploring why. So maybe I found some degradation for 17. I also need to double check. Quite a big
Starting point is 00:34:41 difference. Yeah. Oh, no, no, no. So sorry. 7 to 8% as I, on higher number of clients. Yeah, sure. Sure, sure, sure. I've made the mistake of looking at the 17 beta chart
Starting point is 00:34:55 and looking at TCP versus Unix, not 17 versus 16. By the way, when you have a long chart with both and different series of experiments, then you want to cross-compare something, it's super easy. Yeah, just visualize this and that on the same graph. That's it. Why not? Much easier than getting it to give you the correct JSON I saw from the chat transcript.
Starting point is 00:35:23 Well, yeah, it depends and gemini is better in json than gpt well we have a lot of fun stories so like let's not go there but it happens yeah so so yeah almost four four million tps for quite meaningful workloads i think it's worth checking bigger tables for example to see like how it depends for example we can just take like 300 clients and then check uh draw different picture it's my it's in my mind like for example let's like very small table bigger bigger bigger and huge right and how it will degrade for example right when you say bigger table do you mean like 100 million rows instead of 10 million rows yeah 200 million half a billion rows billion rows partition not partition like it's many it's
Starting point is 00:36:12 it's like a huge maze you can go in many in many directions and turn and then have like that and maybe a return and inspect another like it's it's it's. The cool thing is that I do it, like, anywhere. Like, I don't know, like, I'm eating breakfast and checking what's up with our experiments. Let's do something else. It's just my current hobby, you know, to explore. Yeah, I think we should have more tools like Sysbench, for example, to have different kinds of workloads to be brought up to this.
Starting point is 00:36:44 So, yeah, that's it. I'm very curious for select-only PGA Bench, anyone had more? But now, like I was thinking, oh, maybe it's the biggest number. It's definitely biggest I ever saw, but I didn't see everything. Maybe someone has bigger numbers
Starting point is 00:37:02 or saw bigger numbers for the Pidgey bench. If they have, I couldn't find it. I did do some looking before the episode just to try and see if anyone pushed it further. If it's been published, I've struggled to find it. But it's definitely possible on these huge machines on AWS, right? Because you have more vCPUs and as we see Postgres scales quite well in terms of number of clients
Starting point is 00:37:29 and number of CPUs. That's the direction. If you're talking about that maze, the way I would be tempted to go with this is with the currently available like cloud provided VMs, what's the most TPS I can get? I don't care about cheating.
Starting point is 00:37:44 I don't care if it's only select semicolon. I just would love to know, what is that number? I'm guessing it's above 10 million, but is it 12? Is it 15? How many could we... That would be super interesting. Actually, it's another good
Starting point is 00:38:00 point. By the way, I forgot to mention that the problem with PGSF statements identified. Again, if you have a select-only key-value approach and you need to squeeze more, PGSAT statements can drop performance three times here, I think, by Michael Pacquare. Sorry, I pronounce always wrong. How to optimize, what to do with PGS statements in general. And he reacted on Twitter on my benchmark. So definitely there is opportunity here to improve PGS statements. One of the idea was start sampling at some point.
Starting point is 00:38:44 Sampling is the obvious one. Also, like, deal with high contention case differently, and somehow then, like, update less. There are ideas. I would hate for the
Starting point is 00:39:00 takeaway here to be that there's a real overhead here of PgStat statements mostly because this is such the this is so much the extreme version right this is the same query being what i i put it everywhere disclaimer this is edge case on production you probably don't see yeah yeah with your workload you won't see with your workload probably uh it's like a few percent only overhead but in some cases when your workload is just one or a few normalized queries and frequency is super high you have a lot of vcpus
Starting point is 00:39:31 and they start like competing this is high contention situation in this case just check it and in terms of weights weight event analysis pg start statements didn't show up in weight event analysis until I think Postgres 16 because basically it was not covered by weight events, I think. Interesting. So maybe I remember incorrectly, but definitely it was not covered until some version and then it started to be covered. So you can see it if you have a very fresh version of Postgres, you can see it in, for example, RDS performance insights. Right.
Starting point is 00:40:12 In this case, it's worth just removing it. But I think some guys have such kind of it's not for everyone. Quite rare. Let's say it's quite rare, but it's very It's quite rare. But it's valid key value, right? Just selects. You need to serve a lot of, like, you need to respond very quickly using just B3, index search, just single row, super fast query, and then you have a lot, right? So just removing PGSA statements, you can get more. So this leads to question how to optimize PGSA statements.
Starting point is 00:40:44 There is interesting discussion in hackers. Let's just advertise it a little bit for those who are interested. Yeah, for sure. And another point I wanted to mention, you started this discussion, various clouds. Our bot knows already about price, about cost. So we could potentially check different machines on GCP and bring also AWS and start discovering. I want 1 million TPS, just select where it's cheaper, which machine.
Starting point is 00:41:14 ARM, Intel, AMD, let's compare different clouds, different machines, where. It's interesting, right? I mean, performance cost. I think so. I liked your transactions per second per dollar. TPSPD, I reckon you should go with. Starting from very simple, trivial workload like these selects, and then maybe to extend it closer to what you have,
Starting point is 00:41:35 and then you can decide which machine is better for you. This is something I've actually been struggling with. I think it could be really useful. How do you let someone know that their workload could be done on the next instance size down? Maybe through a bit of optimization, but if people have a rough idea of their workload, maybe they know it's all OLTP and they know roughly the average transactions per second, what size instance should they roughly be on? Just as a rule of thumb. This is called capacity planning. I know, but also I think so many people are so over-provisioned because they're so
Starting point is 00:42:11 under-tuned. Many people are under-provisioned. They don't realize. I have cases when people already hit the sailing for a specific kind of platform, for example. For example, Graviton on AWS or Nardius for example and you have already the biggest size maybe it's time to
Starting point is 00:42:31 I don't know like to scale somehow or to take different kinds of machines because you see how big intels you have but it's not available on Nardius of course yet right? So I see cases both under-provisioned, and maybe you're right, it's more common,
Starting point is 00:42:48 and over-provisioned more common, when you have CPU 5% maximum, right? And you pay for everything else, like, why? But I also see cases when it's already hitting ceilings and it's time to do something about it, right? Of course. Maybe I see them because people go to us for consulting and help, right? So they don't go for, well, sometimes we have this,
Starting point is 00:43:15 like let's optimize our spendings, right? But it's like we're mostly useful when you have problems with performance and how to grow further yeah well i see the opposite i think not because i not because they come to me but or they come to us it's more that i know friends like founder friends and things quite often the cheapest solution in the early days is double the instant size you know if you're you're going from $30 to $60 to $120, there's not many engineering hours that can go into fixing that.
Starting point is 00:43:51 You might as well just upgrade and then worry about it later. For example, you became a big fan of Graviton ARM, right? And you decide, oh, it's great. And it's like checked, performance costs, great. But maybe it's great and it's like checked performance costs great but maybe it's not like it was good year ago but maybe not now and things change very quickly and different nodes uh instances and types of instances are added in cloud very quickly so what in ideal world we should have some rule how to find the best optimal choice for instance, family, and size.
Starting point is 00:44:29 And also, we should have great migration automation to switch zero downtime, right? That's the other thing. I think people are scared. People are scared of going back down. Test and then switch. Everything is possible, but requires some effort, unfortunately. So, good. 4 million TPS, almost.
Starting point is 00:44:51 Almost, yeah. Or 6 million. I'm going with 6 million TPS. And maybe someone already has this machine with almost 900 vCPUs and they can show us 10 million or something. I'd expect more. Maybe. Yeah, 10 million TPS. I'd expect more. Maybe.
Starting point is 00:45:06 10 million TPS next milestone for Postgres on single node. Great. Yeah. By the way, final note, I tried to find information about MongoDB. I asked guys like, I found only some slightly outdated benchmarks, like three years old or so. And
Starting point is 00:45:22 I didn't see anything above like from single node. And I didn't see anything above, like from single node. I even didn't care about if they have transaction mode enabled or not, right? Because they can cheat here, right? But I didn't see for million. I saw only 2 million, 2.5 million or something TPS. That's it.
Starting point is 00:45:41 So I think Postgres is very competitive. Good to know. Yeah. Okay. Good. Nice one. Cheers, Nikolai. Catch you next week.
