Postgres FM - Getting started with benchmarking
Episode Date: August 16, 2024

Michael and Nikolay are joined by Melanie Plageman, database internals engineer at Microsoft and major contributor and committer to PostgreSQL, to discuss getting started with benchmarking: how it differs for users and developers of Postgres, how and when it comes up during development, some tools and lessons, as well as what she's working on at the moment.

Here are some links to things they mentioned:

Melanie Plageman https://postgres.fm/people/melanie-plageman
Melanie's Introduction to Benchmarking With pgbench talk slides https://postgresql.us/events/pgconfnyc2023/schedule/session/1410-introduction-to-benchmarking-with-pgbench/#slides
Melanie's Visualizing Postgres I/O Performance for Development talk recording https://www.youtube.com/watch?v=CxyPZHG5beI
Melanie's Visualizing Postgres I/O Performance for Development talk slides https://speakerdeck.com/melanieplageman/o-performance-for-development
pgbench https://www.postgresql.org/docs/current/pgbench.html
Mark Callaghan's blog https://smalldatum.blogspot.com
Some of Tomas Vondra's blog posts https://www.2ndquadrant.com/en/blog/author/tomas-vondra
Some of Andres Freund's blog posts https://www.citusdata.com/blog/authors/andres-freund/
An example of Alexander Lakhin's work https://www.postgresql.org/message-id/b32bed1b-0746-9b20-1472-4bdc9ca66d52%40gmail.com
Simplifying the TPC Benchmark C, an OLTP Workload (talk by Mark Wong) https://www.youtube.com/watch?v=qi0I74urLoY
Matplotlib https://matplotlib.org
pandas https://pandas.pydata.org
pg_stat_io https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-IO-VIEW
pg_stat_io commit https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a9c70b46dbe152e094f137f7e6ba9cd3a638ee25
pg_buffercache https://www.postgresql.org/docs/current/pgbuffercache.html
tmpfs https://docs.kernel.org/filesystems/tmpfs.html
Eager page freeze criteria mailing list thread https://www.postgresql.org/message-id/CAAKRu_b3tpbdRPUPh1Q5h35gXhY%3DspH2ssNsEsJ9sDfw6%3DPEAg%40mail.gmail.com
The path to using AIO in postgres (talk by Andres Freund) https://www.youtube.com/watch?v=qX50xrHwQa4
Improve dead tuple storage for lazy vacuum (Masahiko Sawada) https://www.postgresql.org/message-id/flat/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA@mail.gmail.com

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and as usual, I'm joined by Nikolay, founder of Postgres.AI.
Hello, Nikolay.
Hi, Michael.
Today is no normal episode, though, because we are delighted to be joined by Melanie Plageman,
who is a database internals engineer working at Microsoft and a major contributor and committer to PostgreSQL.
It's a real honor to have you on the show. Melanie, welcome.
Thank you. I'm excited to be here.
We're excited to have you. There are so many things we could have chosen as a topic today,
but off the back of a couple of talks you've given at Postgres conferences, we've opted to go
for the topic of getting started with benchmarking. Would you mind giving us a little background about
why it's a topic you're particularly interested in?
Yeah, sure. So since I started working at Microsoft a few years ago, a lot of the work
that our team does is performance related. So I would say as a Postgres engineer, even for features
that are not performance related, you have to make sure that you understand their performance impact.
But we tend to work on things like AIO and other features where you wouldn't do them if they didn't have a positive performance impact.
So assessing that is a huge part of my job.
And it's an area where every little detail matters.
And the things that don't matter are kind of counterintuitive.
And it's just, it's kind of an art, and it's so hard to learn.
It's like a dark art, you know, and I think I've been really blessed in that I've
gotten to talk to and learn from people that know how to do it, like Andres Freund and
other people at Microsoft and in the community, but not everyone gets to learn from people
that have, you know, 15 years or 20 years of experience.
So I like talking about it and trying to see if I can help people get started
and demystify what I can about it.
Yeah, that's a good point.
So you think it can be better, right?
And people can understand benchmarks better, right?
What are the tools, do you think, for that? To find solutions, to better understand bottlenecks faster, and so on.
Misleading results, we have them all the time: we think we found something, but it's wrong, and then one day later you realize it's completely the opposite, sometimes, right?
So you think it can be improved, right?
So I think one of the things that users have going for them is that they have a baseline understanding of how you should configure your database.
And some of the mistakes that maybe I made when beginning performance work, you might not make.
So I remember the first set of benchmarks that I did, I didn't configure shared_buffers at all. This was years ago, so I give myself a little bit of a pass, but basically I was looking at it and I was saying, this is supposed to be an improvement, and it looks like it's actually hurting performance.
And I was showing Andres, and he was like, well, tell me how you configured it. And, you know, I didn't change any configuration. And he was like, okay, we'll just throw all of this away.
And so some of the basic stuff around benchmarking, I think that's something that developers who maybe don't have as much user experience have to learn, but users have that.
I think one of the things that I see happen more than you would think is that there's some unexplained performance impact. Like, a lot of times you develop a change and you want to see whether it improves performance.
And then when it doesn't, or it has an effect that you can't explain, you have to start investigating.
And one of the things that people almost never do, at least like performance engineers, they don't look at the logs.
And it's like, maybe because it's a development branch, right?
Like maybe you actually had a bug and it's crashing, or, you know, in the logs there's something that's actually going wrong, and that's what's slowing it down.
And it's such a basic thing that you don't think about.
You're looking at iostat and looking at these more advanced tools when you really just need to, like, look at the log, you know?
Or it was fast because it didn't do any work, right?
Yeah. And looking at logs, you can notice it, yeah.
Good point. Sometimes even some companies, we had one, I won't name it, but sometimes some benchmarks are published and they claim something, and if you have 10-plus years of experience, you quickly have some ideas of what's wrong with those benchmarks.
For example, full-text search in Postgres is bad, it shows O(N). You think, how come? It's not possible. Like, you quickly realize it's not possible, and you look at this and realize, oh yeah, they just didn't build a GIN index at all, right?
But if you don't have experience, it's easy to make errors, right?
To have errors, mistakes.
And my idea, it would be good
if something would help people to brainstorm what's wrong
or what else to look at as early as possible.
So we need some brainstorming mechanism, right? What do you think?
Yeah, almost like a checklist or that kind of thing.
I think that's one of the reasons why, when I started out with development, I thought, you know, with Postgres there's not enough basic resources, like for performance work, that give you a checklist.
But then I think I realized over time, I sort of moved towards thinking you almost need that one-on-one interaction with someone who has more experience, because it's just so individual. And so it's just so hard to give
general advice to people. So I think that there are basic questions you can ask yourself.
There's a big difference between... I think a lot of users, when they do benchmarking, a common case is they're thinking about upgrading. And so they're like, let me just see if it's slower or faster, whatever, right?
In that case, using pgbench for that is not necessarily going to help you, right? Because first of all, benchmarking what you actually want to benchmark with pgbench or any benchmarking tool is hard. And second of all, what if that's not your workload at all?
So let's say that in Postgres there were changes to the way that vacuum works, and you run a benchmark for less than, you know, a few minutes. Well, did you even end up having anything be vacuumed? You almost have to think about, what do I actually care about?
So if you're a user who's worried about upgrading, I guess you guys would know better than me, but you probably have options like, maybe not in production, running your real workload on a newer version and seeing how it is.
And like, that's the ideal scenario. But if that's not an option for you, then I think what you have to do is really think about, what are the highest-risk areas for me? Like, do I have one critical report I have to run, or, you know, I need to make sure that I'm able to reach this TPS at this time of day, or whatever it is.
And then you have to have a way to reduce that. Like, you almost have to have a minimal repro for the performance question that you're asking and be able to test it.
And I would say even then, I don't know if you're going to get a perfect answer outside of an actual production environment.
The use case that users have for benchmarking,
I haven't heard of one where I think it's going to work out for them,
but I don't know.
Maybe you can tell me what people want to use it for.
Let me shamelessly ask you your opinion about the approach I ended up having for upgrades with our customers, big and small, doesn't matter.
So I was a big fan of replaying workload, like, five years ago, but we don't have tooling for it. And mirroring, like, PgCat I think has mirroring, but it's hard. It should be in between clients and server.
And it's very critical infrastructure.
Usually, PgBouncer is there.
And you cannot just quickly replace it with PgCat and so on, and overhead and so on.
And mirroring also is a tricky thing, actually.
So what I ended up doing, I split it into two things.
First thing is general load testing with pgbench.
And usually we have it already, it's done by hackers and so on, like performance farms or something. And we can check it additionally with some synthetic workload. pgbench, or not pgbench, sysbench, doesn't matter.
doesn't matter. So this is one thing. And usually we rely on community results,
or we can extend them as well. But as for our workload, for our project,
instead of replaying the whole thing,
we think, okay, we forget about lock manager,
buffer pool, and so on, all those components.
And we care only about plan flips.
We want just to check the planner behavior.
And we just grab query examples, like 100 of them, which are most critical in terms of total time or calls or something, from pg_stat_statements.
Sometimes we collect multiple examples for one normalized query in pg_stat_statements.
And then this is our testing set.
And then we don't care about the machine we use, actually.
It can be a smaller machine.
We just have an exact clone of the production database, to have the same data and pg_statistic, and then we also adjust settings: all planner settings and work_mem, which is not a planner setting but also affects planner behavior.
And then we just reproduce plans, upgrade, reproduce plans again, and compare and see if some plans changed. And we also use BUFFERS, of course, and we see costs and buffers, we usually look at those. We don't care about timing, because it can be a smaller machine, cache state can be different. And if some plan flip occurs, we quickly find it.
And it can happen on a small machine, in a shared environment, basically. pgbench is not needed for a plan flip check, right? And that's it.
Yeah, I think, if I understand correctly, at some point you used pgbench, and then you realized that it wasn't actually going to help you with this particular case. It's super hard to reproduce an actual workload.
Yeah, so I support that.
I think really, honestly, what you're doing,
which is looking for plan flips and starting with PG stat statements,
is probably the best thing that you could do
to try to make sure that when you upgrade,
that it's going to work out for you.
Because I think that's one of the only things
that you can
realistically reproduce, because like the planner doesn't take into account other things that are
happening. Each plan is planned in isolation with the statistics and the data that you're talking
about. It's not looking at overall system load. I mean, maybe we should do things like that,
but planner doesn't do that right now. So it's actually possible to do that.
So I think that's a really good sort of precautionary thing that you can do to see if the next version of Postgres
is going to cause you problems.
So yes, I sign off on that.
That makes sense as a strategy.
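For anyone who wants to try the plan-flip check described above, here is a minimal sketch of the idea in Python. The connection string, file names, and use of psycopg2 are assumptions, not part of the workflow described in the episode; the real setup (production clone, matching pg_statistic, planner settings) is up to you, and you may want EXPLAIN (ANALYZE, BUFFERS) instead of plain EXPLAIN if you also care about buffer numbers.

```python
# Minimal sketch: capture EXPLAIN plans for a set of queries before and after
# an upgrade, then diff them to spot plan flips. Assumes queries.sql contains
# one query per line (collected from pg_stat_statements) and that the target
# database is a clone of production with the same data and statistics.
import json
import psycopg2

DSN = "dbname=clone user=postgres"   # hypothetical connection string
QUERIES_FILE = "queries.sql"         # hypothetical file of sample queries


def capture_plans(outfile):
    plans = {}
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        for i, query in enumerate(open(QUERIES_FILE)):
            query = query.strip()
            if not query:
                continue
            cur.execute("EXPLAIN (FORMAT JSON) " + query)
            val = cur.fetchone()[0]
            # EXPLAIN output may arrive as a string; parse it if so.
            plans[i] = json.loads(val) if isinstance(val, str) else val
    json.dump(plans, open(outfile, "w"), indent=2)


def shape(plan_json):
    """Reduce an EXPLAIN JSON plan to its tree of node types."""
    def walk(node):
        return (node.get("Node Type"),
                tuple(walk(child) for child in node.get("Plans", [])))
    return walk(plan_json[0]["Plan"])


def compare(before_file, after_file):
    before = json.load(open(before_file))
    after = json.load(open(after_file))
    for key, old_plan in before.items():
        new_plan = after.get(key)
        # Compare only plan shape (node types), ignoring costs and rows.
        if new_plan is None or shape(old_plan) != shape(new_plan):
            print(f"plan flip for query #{key}")


# Usage: run capture_plans("plans_before.json") against the old version,
# upgrade (or point DSN at the new version), run capture_plans("plans_after.json"),
# then compare("plans_before.json", "plans_after.json").
```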
Before that, we were using pgbench, and we took the most frequent and the most time-consuming queries from pg_stat_statements, and we called it crafted workload.
We tried to find some parameters or examples of queries and put them into pgbench using -f, and then the @ sign to balance somehow, like to have some balanced workload, and then ignore some errors if they happen, like foreign key violations or something.
I cannot say it worked well.
It helped, but then I just realized,
why do we care about buffer pool and so on?
Let's just split this.
So you were doing that before upgrading
to see if you would get the same?
I stopped doing this because it's hard, actually.
It's time-consuming.
And if we care about plan flips, that's it.
We have a recipe.
If we care about improvements in some components of Postgres,
we can research them using some simple workloads like PGBench.
That's it.
Right.
Or specific workloads.
You really can't use pgbench to replicate real workloads.
So, for example, if I want to analyze a patch that I'm doing and I want to replicate some scenario, I might run multiple instances of pgbench at the same time.
So I run one that's doing a select query and another one that's doing a transactional workload.
And you can combine different pgbenches and you can sequence them and that kind of thing, but ultimately each pgbench is going to do the same thing, no matter what kind of custom script you provide.
There are different variables, and you can do things like have variables in the script and then, you know, interpolate random numbers and things like that. But in a real system, you'll have some kind of work happening, and then another kind of work happening that's different work, interspersed, or at unpredictable intervals.
And with pgbench, you know, during a pgbench run, all of the workers are going to be doing the same thing, whatever's in the pgbench script. So it's very, very hard to replicate realistic scenarios with it.
And there's lots of creative things you can do. Like you can have some if-statement logic where you're like, okay, if this worker is this worker number, then do this other thing. But at that point, I think it's not useful if you're trying to understand what the performance impact is going to be for your real-world workload.
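As a rough illustration of the "multiple pgbench instances with custom scripts" approach mentioned above, here is a minimal sketch. The database name, script contents, client counts, and durations are made-up placeholders; it just shows launching two pgbench processes concurrently, one read-only and one transactional, each with its own -f script using pgbench variables and random().

```python
# Minimal sketch: run two pgbench instances concurrently against the same
# database, each with its own custom script. Assumes pgbench is on PATH and
# the standard pgbench tables already exist (pgbench -i has been run).
import subprocess
import tempfile

DB = "bench"  # hypothetical database name

READ_ONLY_SCRIPT = """
\\set aid random(1, 1000000)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
"""

WRITE_SCRIPT = """
\\set aid random(1, 1000000)
\\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
END;
"""


def run(script_text, clients, seconds):
    # Write the custom script to a temporary file and launch pgbench on it.
    f = tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False)
    f.write(script_text)
    f.close()
    return subprocess.Popen(
        ["pgbench", "-n", "-f", f.name, "-c", str(clients),
         "-j", str(clients), "-T", str(seconds), DB])


# Launch both workloads at the same time, then wait for them to finish.
procs = [run(READ_ONLY_SCRIPT, clients=8, seconds=60),
         run(WRITE_SCRIPT, clients=4, seconds=60)]
for p in procs:
    p.wait()
```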
Like, I can see users maybe using it to figure out, to understand a particular GUC or something like that.
Or maybe if you're thinking about changing your hardware, you know, using a different SKU from the same vendor or something like that, you can try to understand how it might affect general Postgres workloads, and then it could be helpful.
But pgbench is mainly a developer tool, and I mean, that's what it's marketed at. Not that it's marketed, but I don't think it's billed to anyone as a tool for users to understand Postgres performance of their workloads.
But I mean, all the other benchmarks that are out there,
I think PG Bench kind of gets lumped in with them.
And it's not really useful for any of the things
other than development, in my opinion.
It's very hard to make it useful for, like, comparing whatever, Postgres to MySQL, or that kind of thing.
I mean, because you can't use it.
Different hardware options, for example, it can be useful.
Yeah, you can use it for that for sure.
Yeah, but it definitely serves a different purpose than the TPC benchmarks, for example, which allow you to compare across databases.
And then, of course, we also have a community of people that are doing
performance testing from version to version of Postgres. That's different than just testing one
patch and saying, what is this patch doing? They're like sort of looking for broad themes
and regressions across versions. And you can use PG Bench for that. But ultimately, like,
I would say it's easier or sort of more meaningful to do something
like run the TPC benchmarks when you're looking to see if Postgres has a regression from one version
to another for the purpose of community work.
You mentioned we've got a community of people doing that. I've seen a couple, but I don't know of many. Do you actually have any names or things I could look into?
So I think the benchmarker that I follow closely, and that many people follow, is Mark Callaghan.
Yeah, great.
Yeah, so he's great, and he actually came to PGConf.dev this year and I got to meet him. I was starstruck when I met him, you know. And he's, like, super nice.
I like his blog's name, Small Datum.
Yeah, yeah. And he's great because, I mean, he's an engineer who has a ton of experience in doing performance work for a long time, mainly in the MySQL community.
And then, with benchmarking, describing your methodology is so important.
So he does that in great detail.
And because he's active and he's doing these benchmarks like all the time, he's providing
a lot of really useful information for the community around, you know, how he does version
to version of Postgres, different types of benchmarks, different on different size machines and that kind of thing. So he's great. And if you want to learn more about
benchmarking, I would say his blog posts are fairly advanced. So like, if you're getting
started, it might not be the best resource. I don't know. But I still am kind of like, okay,
let me read through this. And you know, he's getting down to looking at how does your CPU
frequency governor affect your benchmark results, right?
So it's definitely at the point of... it's definitely past configuring shared_buffers, for sure.
So he's great.
And then I would say within the Postgres community, I mean, there's developers like Tomas Vondra and, of course, Andres, who do benchmarks when they're developing.
And then usually they'll look at the holistic release.
But most of those results are just getting posted on hackers
and they're not...
Because it takes so much time to do what Mark does
and sort of describe your methodology,
like double check every result,
investigate every discrepancy,
and then publish it and sort of be accountable for it.
Because people will come after you.
They'll be like, no, that is not a regression or whatever.
You have to be ready to defend your work.
And it's almost like a full-time job in itself.
So a lot of the benchmarks that get done within the community are people doing them and then
just saying, hey, I found this.
Can anyone else look into it?
So, you know, there's also Alexander Lakhin, who runs TPC-DS and I think TPC-H also, every release, and does a lot of investigation.
And he's great because he does so much investigation of bugs, and reporting of bugs, and reproducing very hard to reproduce bugs.
So if he finds some sort of regression, performance regression, he'll also bisect it
down to the commit that actually caused it. And as a developer, it's nice to have someone doing
that level of analysis. And so again, he's not necessarily publishing a methodology post and all
of that.
Then there's some people that maintain the kits, because if you want to implement the TPC benchmarks, it's not like you just run them. It's more involved than that. So there are people that maintain different kits, the bash scripts and stuff. And I think Mark Wong just did a new one for TPC-DS. So he has a lot of database internals experience, and then he was
involved with the TPC council for a while. And so he kind of is an expert on just the TPC benchmarks.
And so he tends to put together these kits to try to help people because it's not that
straightforward. And then I know there's some ongoing work with the Postgres Performance Farm initiative, but it's really hard, right, to
agree upon what is a useful benchmark in general to run. So you can find some regression or
whatever. And it's like, in order to actually prove that that's correct or valid is a lot of
work on the part of everyone else. So I think in some ways, publishing benchmarks,
you can't just say I set up this farm and I ran all these benchmarks and now I expect people to
go look at them. As a benchmarker, you're kind of accountable for the analysis and trying to find
out if your results are valid.
Practically useful.
Practically useful.
I found one day... I found, we found, my team found a bottleneck in WalSender for logical replication.
It was so easy to reproduce.
You just create some transactions with deletes and roll them back.
And at some point, very quickly, you reach 100% of CPU for WalSender.
And it led to bad conclusions, that in production we won't be able to use logical replication for many things. But it turned out that this kind of workload doesn't happen in the wild at all, because it's a very specific kind of workload. So I guess this bottleneck, maybe it's not worth fixing right now, right?
I mean, it's interesting.
Yeah, you can definitely report things like that on the hackers mailing list or on performance mailing list
and have a discussion about, you know,
if it's kind of the right thing to focus on
and also, you know, if it's reproducible.
I had some discussions with some hackers about this and didn't meet understanding that it's worth fixing.
And then I realized in production we don't see such a bottleneck at all, and I just postponed this research for the future, maybe.
What do you think about observability tools
to be used in pgbench benchmarks?
What extensions or additional tools and so on,
what to look at?
Yeah.
To avoid observer effect as well, right?
Yeah, I mean, I tend to not worry too much
about the observer effect,
depending on what tools I'm using,
because, I mean, as long as you don't have, like, log_min_duration_statement set to zero or something like that. But if you're comparing to another version of Postgres and they're using the same observability tools, it's sort of the price you have to pay, some overhead.
But every person that does performance work is different. But when I started, I made the
mistake of trying to generalize and make some of my tools useful to other people, and no one wants that. So I learned that the hard way.
But personally, I find that it's very hard to look at aggregated numbers at the end of a benchmark run and make sense of them. And that might be what you present to other people, because they don't want to look at every detailed graph that you produced or chart that you produced.
But while I'm doing the work, I can't do it at all without visual representations and charts and things like that.
For the actual charting, I use Matplotlib and pandas and that kind of thing.
But I have all sorts of scripts
that do things, because I use a lot of data input sources. So, I mean, it depends on the patch, but I'll query pg_stat_io or, depending on what the patch is, pg_stat_activity or pg_stat_all_tables, and then gather certain information every two seconds or every one second.
And then I have scripts that take this output, and then I'm able to basically make it CSVs and then load it into pandas DataFrames.
So, for example, I've been working on vacuum recently. And one of the things that you consider when you're doing vacuum performance work is how often vacuum work is being done.
So you might want to look at the wait events and look at, you know, whether autovacuum workers are waiting on the vacuum delay wait event a lot and not actually doing the vacuuming work, and whether you have the right autovacuum settings to actually
be able to observe the thing you're trying to observe.
So, you know, just gathering that information and then plotting it correctly and that kind of thing.
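As a minimal sketch of that kind of sampling script (assuming Postgres 16+ for pg_stat_io, psycopg2 for the connection, and made-up column choices and intervals), it could look something like this; the same pattern works for pg_stat_activity wait events or pg_stat_all_tables.

```python
# Minimal sketch: sample pg_stat_io every couple of seconds during a benchmark
# run, write the samples to a CSV, then load them into pandas for plotting.
import csv
import time
import psycopg2
import pandas as pd
import matplotlib.pyplot as plt

DSN = "dbname=bench user=postgres"   # hypothetical connection string
INTERVAL_SECONDS = 2
DURATION_SECONDS = 120
OUTFILE = "pg_stat_io_samples.csv"

COLUMNS = ["sampled_at", "backend_type", "object", "context",
           "reads", "writes", "extends", "hits", "evictions"]
QUERY = """
SELECT clock_timestamp() AS sampled_at, backend_type, object, context,
       reads, writes, extends, hits, evictions
FROM pg_stat_io
"""

conn = psycopg2.connect(DSN)
conn.autocommit = True  # each sample should see fresh stats, not a cached snapshot
cur = conn.cursor()

with open(OUTFILE, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(COLUMNS)
    deadline = time.time() + DURATION_SECONDS
    while time.time() < deadline:
        cur.execute(QUERY)
        writer.writerows(cur.fetchall())
        time.sleep(INTERVAL_SECONDS)

cur.close()
conn.close()

# Afterwards, load and plot, e.g. reads by backend type over time.
df = pd.read_csv(OUTFILE, parse_dates=["sampled_at"])
pivot = df.pivot_table(index="sampled_at", columns="backend_type",
                       values="reads", aggfunc="sum")
pivot.plot()
plt.ylabel("cumulative reads (pg_stat_io)")
plt.savefig("pg_stat_io_reads.png")
```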
Another thing: because Postgres uses buffered IO right now, I tend to use external tools for IO observability, like iostat, because you can't actually tell with, you know, reads and writes in the Postgres statistics, that could be a read or write to the kernel buffer cache.
You could be reading from there, and that's obviously going to be quite different than actual reads from disk.
So I use iostat input, depending on what I'm doing.
And then also there's different files.
Well, I use Linux, so there's different CPU statistics
or memory
usage statistics that you can get. And so a lot of what my scripts do is just query whatever file
somewhere in the file system that has that information over time at different intervals
so that I can see, okay, what's the memory utilization or what kinds of effects is this patch having on my resource utilization.
So that's a source that I use.
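For instance, a tiny sketch of that pattern, sampling /proc/meminfo on Linux at an interval alongside a benchmark run (the fields and interval here are just examples):

```python
# Minimal sketch: poll a /proc file at a fixed interval and keep a time series
# of a few fields, so it can be lined up against the benchmark timeline later.
import csv
import time

FIELDS = ["MemFree", "Cached", "Dirty", "Writeback"]  # example fields
INTERVAL_SECONDS = 1
DURATION_SECONDS = 120


def read_meminfo():
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in FIELDS:
                values[key] = int(rest.strip().split()[0])  # value in kB
    return values


with open("meminfo_samples.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["elapsed_s"] + FIELDS)
    start = time.time()
    while time.time() - start < DURATION_SECONDS:
        sample = read_meminfo()
        writer.writerow([round(time.time() - start, 1)] +
                        [sample.get(k) for k in FIELDS])
        time.sleep(INTERVAL_SECONDS)
```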
And then I also use some different Postgres extensions,
again, depending on the patch.
So pg_buffercache, again, I just query it at intervals and then see, in terms of the shared buffers that I have, how many are pinned.
And it helps me to see how the patch is affecting the behavior of shared buffers and utilization there.
So there's basically all of the statistics views are fair game.
All of the output from pgbench itself.
So I parse that.
There's progress output, which, you know, pgbench does some averaging and that kind of thing. So I also parse the execution reports that come out, which... they call it a transaction, very confusing, it's literally just an execution of the script by a worker.
So you might get an average TPS, but because of how it's calculated, that might not actually show you the P99 latency of an individual execution of one of the scripts.
So I typically parse all the execution reports. There's one line for each execution, and then I do
various analysis of that and plot that. So that's an input. And there's always new sources. And
it's funny, I'll take some analysis challenges and say, I haven't figured out why there's this
weird spike in IO at this one particular time. And then he's like, well, here's this other input
source you could use to investigate this. And I was like, I have 55 input sources. How is it
possible there's something else
that I could look at?
But I guess everything is caused by something.
And so it's possible to find out what it is,
I guess, if you look hard enough.
Right.
Don't you feel like it would be so good
to have P99, P95 in pgbench?
It would be so good.
It would.
It would work well, right?
I'm also using progress all the time,
like 10 seconds, 30 seconds.
It's so great to see how caches are filled and so on,
but lack of percentiles.
Yeah.
So if you do get the execution reports, you can just then read them in with some Python tooling.
Execution reports, is that -r, or...?
It's -l, --log, and then you can provide a log prefix. Yeah. So that gives you the actual time that each execution of the script
took. But this is so slow, no?
Well, I mean, it's not free.
Yeah, that's true.
Not free at all.
But I think if you wanted the P99 latency,
I'm sure there's statistical methods for getting it without recording every execution,
but you would need something like that.
Yeah, do you use an actual file there, or some RAM disk, or something?
I use tmpfs, and then when it's over, I copy it somewhere to a disk, yeah.
Good idea, I will add it to my toolset.
Yeah. At first I was pretty apprehensive about doing it, because it seemed like a lot of work, but once you actually have the execution reports, with the statistics module in Python,
you can just give it the quantile you want,
and then it just calculates it.
It's two lines of code or three or four or whatever.
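For reference, a small sketch of that (assuming pgbench was run with -l so the per-transaction logs are in files named pgbench_log.*, where the third field of each line is the latency in microseconds):

```python
# Minimal sketch: read pgbench per-transaction log files (-l / --log) and
# compute latency percentiles with Python's statistics module.
import glob
import statistics

latencies_ms = []
for path in glob.glob("pgbench_log.*"):   # adjust if --log-prefix was used
    with open(path) as f:
        for line in f:
            fields = line.split()
            # pgbench log format: client_id transaction_no time script_no ...
            # where "time" is the transaction latency in microseconds.
            if fields[2] == "skipped":
                continue
            latencies_ms.append(int(fields[2]) / 1000.0)

cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
print(f"executions: {len(latencies_ms)}")
print(f"p50: {statistics.median(latencies_ms):.2f} ms")
print(f"p95: {cuts[94]:.2f} ms")
print(f"p99: {cuts[98]:.2f} ms")
```

Writing the log files to tmpfs, as mentioned above, keeps the logging overhead down, and they can be copied to disk once the run is over.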
Yeah, and you mentioned pg_stat_io.
This was great when I first saw it.
Well, this should have existed for many years already,
and we discussed it with Michael.
And we here in PostgresFM,
we often discuss lack of buffers in plans, for example, right?
In individual query plans.
Buffers is like, we try to advertise
it's so important to have buffers in execution plans.
And this is a macro level, like high level.
So good to have it, right?
But unfortunately, I cannot say I see it often,
because it's only there since Postgres 16.
We don't have it in most production systems yet.
But maybe next year I will start seeing it in production, right?
With our customers.
Thank you for doing this.
My question is how you came up with this idea originally.
Well, I can't take credit for it.
So Andres has done a lot of performance work
for years and years.
So when I joined Microsoft,
one of the first projects he suggested
was to add some view
that would improve observability of IO.
And at that time, we didn't have the shared memory stats system.
So the patch was much bigger at that point
because of the way the stats collector worked.
So I did an early version of it.
And then I used it in my own development
and I ended up working on some of the AIO work
and found it really useful for that.
And then I ended up doing some review
of the shared memory stats patch. And after that, it made the pg_stat_io patch much smaller and
simpler. So then I went back to it. But I honestly, like for me, I just looked at it as this
helps me as a developer doing benchmarking.
And so it's cool that users like it because I don't have that perspective, really.
Obviously, it's useful for benchmarking itself.
But I'm wondering about the other way around.
Did you have to look at its overhead?
And how did you go about that?
Yeah, so it was much easier, like I said, once we had shared memory stats, because it didn't need any extra infrastructure. It's basically just another data structure, a couple of other data structures, and then everything works the way that it does.
Now, we already had some counting for different kinds of reads and writes, because that's how, for EXPLAIN ANALYZE and that kind of thing, we had those counters.
And then there were some reads and writes in pg_statio_all_tables, one of the views that has some IO things.
And then there was some for pg_stat_statements.
So there were some places where we already had some IO counting.
And then the other places where we didn't, there wasn't too much risk because if you're adding some
timing for whatever background writer or check pointer, there are processes that it's okay to
add a tiny bit of overhead. So you didn't have to think too much about it. I think the hard thing was actually finding all of the places that we could possibly do IO for the types of IO that we were representing,
and then thinking about how to categorize it, and especially for the different buffer access
strategies. So that's one of the things that was completely not surfaced to users: how buffer access
strategies work and how they reuse buffers.
So for copy and vacuum and things like that, literally you have no information about that.
So I think just figuring out how to represent all of that, that was probably the least performance-focused work of that patch, in terms of the performance impact of it, just because there was a lot of instrumentation already, it just was not complete, if that makes sense.
So you're saying we're already paying a lot of the overhead. Like, if we're running Postgres 15, for example, we're doing a lot of that counting anyway, so storing it, or also displaying it in an extra view, there's very little added in actual instrumentation?
Okay, cool.
Yeah, I mean, we had, for backends that were doing queries, counting of reads and writes and hits in most places. So the places that it was sort of net new were mostly for types of processes where we don't mind a little bit of overhead if there was any, and it would be so small. I mean, it's measuring IO, right? So if you're doing IO, the overhead of that will cover any sort of overhead of instrumentation, unless it's hits.
Actually, it also counts hits, and I found in some cases, for example, monitoring systems try to avoid storing this metric at all, for example, from pg_stat_statements.
But I had some incidents where we lacked hit numbers. But pg_stat_io has hits, right? I'm looking at the documentation just to confirm.
Yeah, it does. This might be kind of overhead of counting.
This might be interesting because it doesn't have real IO.
I mean, in terms of disk.
Yeah, it actually was already being counted.
We had a hit somewhere else, one of the other views,
or maybe it was for pg_stat_statements. I'm not sure.
So that was already kind of being counted. And when you go to get something from shared buffers, there's blocking and all sorts
of other things that happen that are expensive enough that it was pretty negligible. Nice.
Yeah, okay. But are there some benchmarks proving that pg_stat_io didn't add overhead, or, like, just...
Oh, no. You want to see my receipts?
Yeah, yeah. I'm just curious. It's just a general feeling that it doesn't bring overhead or there
is solid proof of it in terms of benchmarks.
That's a good point. It's mostly based on a feeling.
I'm asking just to, like, I'm curious, is it worth checking this, for example?
Because maybe we can strip it out somehow, or compare 15 versus 16 in various cases and see if...
I would look at those benchmarks, but I guess my gut feeling would be that, because when track_io_timing is on we already were doing a lot of the measurements, it wouldn't be much different.
Actually, I can answer myself.
We have benchmarks with insane numbers.
Oh, you do?
Okay.
No, no.
I mean, 15.
We compare all versions and we try to reach as many TPS as possible.
I achieved, how many, Michael? 4 million?
Nearly, 3.75, I think.
Yeah, just select-only pgbench, right?
And we didn't see any degradation for version 16 at all.
It's already some kind of proof that there is no overhead here.
And we have wait events and flame graphs and so on.
We would notice it.
But Nikolay, you wouldn't have had track_io_timing on?
I want to check it. Maybe.
No way. Yeah, if you did, you should repeat it again, maybe you get 4 million.
Okay, okay, right now. Good point.
But yeah, this is great, it's super interesting. Like, how do you choose which things to benchmark and which things not to, then?
Yeah, so I think developing intuition around that is... So I actually became a software engineer, like, the first thing that I worked on professionally was Greenplum, which is a fork of Postgres. So almost all of my software engineering experience comes from being a Postgres engineer. I was pretty new to software engineering,
right? So I think a lot of software engineers develop intuition around like, you know,
if you take a lock every millisecond versus every 30 seconds, like which one of those is okay or
whatever, right? Like it obviously depends on the context, but I think there's a lot of sort of
performance intuition that people develop just from being software engineers. And that's sort of like the first step of, you know, thinking about, okay, well, is it okay for me to take an exclusive lock here? So if you find a performance question and you say, is it okay that I do X, then you have to think about, okay, how would I benchmark this? And in a lot of cases maybe not benchmark: how would I profile this, how would I evaluate or micro-benchmark this, or something like that.
So, for example, in 17 I worked on some refactoring of HOT pruning. That code, adding any sort of additional instructions, you know, potentially has overhead. So of course, if you're doing actual HOT pruning, then you have IO from the WAL and you have IO from writing out the dirty buffer and that kind of thing.
And so you have to think about: is the thing that I'm changing actually going to matter in the context that it's in?
And then I think evaluate from there.
And also how important would it be if there was a difference, right?
So like on every select query, if it's a little bit more expensive, that's obviously bad, even if it's only a tiny bit more expensive.
So Heikki and I did a lot of micro-benchmarking around those changes to make sure that they seemed right. And then you have to design the benchmark so that it's actually exercising
what you think might be the bottleneck, which is its own challenge. And then like right now,
I'm working on something where it has to do with improving eager freezing so that you freeze more data, vacuum freezes more data sooner.
And there's a bunch of different pieces to it, but one of them is a new sort of responsibility that background writer would have for maintaining a new data structure that contains some new statistics. And one of the questions I got in review was like, have you profiled this to make sure that it doesn't have an overhead? And my
first reaction was like, but it's background writer. No one cares about background writer's performance, you know. It's kind of different than on every select query, where you're
going to have some overhead. But again, it was doing something that was like
more, I'd say heavyweight, like it's doing some calculations that could take time and doing them
while holding an exclusive lock. And so even though it's background writer, like it's worth
proving that, you know, the number of extra instructions, for example, is tolerable.
And that's a judgment call. But what you want is a paper trail showing that you did your due diligence, which is why it was funny when you were asking me about pg_stat_io's performance impact, because in that case I kind of thought, you know, it's negligible compared to the fact that you're already doing IO, first of all, and second of all, we had a lot of that instrumentation there.
But for most other things, I haven't had a patch, other than that, probably in the last couple of years where there wasn't a discussion of what the performance impact could be, and then benchmarking or profiling.
Nice.
Yeah, Michael, it was on.
Huh?
track_io_timing was on.
Well, that's good news for Melanie.
There's potential for 4 million.
Yeah, and we didn't see a difference between 15 and 16.
So, I mean...
Wow, wow, wow.
Yeah, I already asked our robot to double-check with track_io_timing off, so we'll see this as well. Good, interesting. I hope we'll have more TPS.
I'm curious what you are working on right now, and the direction of future work, ideas, and so on.
Yeah, so I mean, I started a project last release that had to do with
reducing the amount of WAL emitted by vacuum, and that was kind of a refactoring, but it spun out a lot of other projects. And I think I sort of wanted to take a break from vacuum this release, but it didn't work out that way. So I'm working on a couple of different things around vacuum.
And some of them are leftover pieces around combining WAL records; vacuum used to emit up to six WAL records per block. So we've combined a few of them and there's more to do. So that's one of the things. But probably the more exciting work is around the freezing work.
So that basically was born out of... Peter Geoghegan, a few years ago, I think it was 16, did some work to have us do more eager freezing.
So one of the things we hear a lot from users is they have, say, insert-only tables and none of the data gets frozen.
And then all of a sudden they have anti-wraparound vacuums and then eventually their system becomes read-only.
But even before that, anti-wraparound vacuums, if you don't have any actual pruning or vacuuming to do, can be expensive from an IO perspective.
And so it can slow your whole system down. And
so being more aggressive about freezing data sooner was something that people had been asking
about. And so Peter worked on that. And it's really difficult to find a heuristic that works
for determining whether or not to freeze something. Because if you have a table where you're doing a lot of updates,
like a standard pgbench built-in workload,
then it's a waste to freeze things
because freezing emits wall and it dirties the page
and that kind of thing.
So finding a heuristic that freezes the stuff you want to freeze
and doesn't freeze the stuff you don't want to freeze
has been a journey.
And one of the parts that's really hard is that you have to try to come up with all of the
workloads that are important, the best case and worst case, especially the worst case workloads
that users might have and think about how it would perform. And that's like, so hard. So that took a lot of time. And
I spent a lot of time improving my different benchmarking setups and coming up with these
different workloads and evaluating them, looking at them. So that's one thing I've been working on.
And then last release, I also... so there's the new read stream API that does vectored IO, does larger reads, that Thomas Munro did last release, and I worked on the different users. So for sequential scans. And I had some additional users that didn't go into 17, because there are performance effects that you have to sort of analyze really closely, and they weren't ready, and one of them was...
Was the word "analyze" deliberate there, was it?
Yeah... no, no, analyze went in, actually. That was Bilal's work. So analyze went in, but bitmap heap scan and vacuum didn't.
Got it.
So vacuum is an interesting one, because streaming vacuum sounds great,
but vacuum uses a ring buffer to sort of keep it from having too much of a negative effect on shared buffers.
So you don't wash out all of the things that are resident in shared buffers from your workload.
It reuses buffers, which means that it has to write WAL, typically.
And there's a lot of calculation around how big the ring is, such that the WAL writer can help you, or help vacuum, to write out the WAL.
So vacuum is not doing all of it.
And if you do IO, and if you do reads in a different size, that means that you're dirtying and needing to clean buffers at a different rate. And that actually was affecting how many individual WAL writes and syncs we would do.
So in a given fsync, you can have different amounts of data, but each fsync is a system call and has overhead, right? So you want to amortize that cost over a large enough amount of data and also have the WAL writer help to do some of the work.
So by doing bigger IOs, we were actually making the performance worse in some cases, because of these interactions with additional WAL syncs.
So Thomas Munro has been working on, I think he's calling it streaming write-behind.
It's kind of a way of thinking about, for the purposes eventually of AIO, how backends and maintenance processes, and writes in general in Postgres, can make sure that, if they're doing larger writes, they're actually being conscious of when they should do those writes, based on the WAL that's required that you emit.
So if you're doing reads, you don't have to really think about that. But if you're writing out data, the WAL associated with those dirty buffers has to be flushed first. So you have to think about when do I want to do the writes, because you need to think about when you actually want to do the WAL writes.
So he's been working on that. And it's a tough problem. A lot of benchmarking,
a lot of thinking about it. And buffer access strategy is really complicated.
And those are used for large selects.
Those are used for, you know, copy, vacuum.
So he's doing some work around that.
And I'll probably sort of jump in after some of this vacuum stuff is done and try to work on some of the AIO work that's coming up. And I think that ultimately, like just between those things,
that'll probably be most of what I end up doing on my own.
But I would like to see... for pg_stat_io, there's been a patch to add WAL-related IO statistics to it
that's been around for about a release.
And basically, we haven't come to an agreement on what the right thing to do is. Our thought was that you can configure, you know, your block size for Postgres, you can configure it for WAL, and that they would be in units of the WAL block size. But that actually isn't always the unit that we do reads and writes in, exactly; it usually is.
So now we're actually talking about rejiggering pg_stat_io, especially in light of vectored IO, to change it so that it's not looking at, you know, if you have a thousand writes, and then you have the block size, and then you multiply them to get the number of bytes. Do we change it to represent it differently and put it just in the number of bytes? And how do you sort of represent that, do you want the number of writes to be the number of system calls?
So we were just thinking about how to actually represent it. But I would love to see WAL stats go into pg_stat_io in 18 also. So we'll see.
And there's also a bunch of other exciting things going on, like Masahiko Sawada's working on
parallelizing the first and third phases of vacuum. So he did that really exciting work in 17 on the TID store, and making it so that...
I think that's my favorite feature from 17.
Can you explain it? Yeah, I don't know that one.
Yeah, so, you know, one of people's biggest complaints about vacuum is when they have to do multiple rounds of index vacuuming, because maintenance_work_mem, even if you set it to a high value, you might end up filling it up with dead TIDs. And then you have to do a round of index vacuuming.
And if your indexes are very big, you can imagine that that ends up really affecting your vacuum performance. So what he did was introduce a new data structure
that organized the dead TIDs in a way that was much, much, much more efficient.
And that made it so that you're using way less memory for the actual dead TID storage, but then also you can end up sort of changing it on the fly. Basically, you're not tied to the same restrictions that we had around the size of maintenance_work_mem as before.
So most people are never going to need to do multiple passes of index vacuuming. So I think that people with very large indexes and
large tables are going to see their vacuum performance be a lot better.
I have big customers with big partitioned tables. We've talked about partitioning for years, and it's hard, and they will benefit. So vacuuming will be faster, basically, right?
Yeah. I mean, if you don't have this problem
with the multiple index vacuuming passes,
then maybe not.
But it's something we hear a lot from sophisticated customers.
And also, I remember, in Postgres 17... before that, only up to one gigabyte could be used.
Yeah, that restriction is lifted now.
Is it related to this work?
Yes, it is related, yeah.
I suspected so. Good, yeah.
So that's really cool, and I think it's something
that's sort of hard to explain unless you've had the problem.
I think if you haven't had this problem,
you're not really thinking about it.
But for people that do have this problem,
I think it's going to make a big difference.
Yeah, so he's sort of building on that work and there's a lot of things that we can do with vacuum because of having, so this read stream API
will make it somewhat easier to do parallelization of the first and third phases of vacuuming. So
he's working on that as well. And then I think, you know, the dreams of having global indexes involve a step where you persist the dead TIDs to disk, because otherwise you wouldn't be able to. Basically, there's discussion of being able
to split up the phases of vacuuming and be able to vacuum indexes at some
point and come back to it and do it later. Like vacuum, just do the first stage of vacuuming.
And if you want to do the phases of vacuum separately, you have to have the dead TIDs that you need, basically what index entries you need to delete; you need to save that somewhere that's not in memory. And so there's some modifications to the TID store that they can make, and probably make it easy to persist.
And I don't know that he's planning on working on that now, but Dilip had done some work on
dead TID storage. And so I think there's a lot of exciting things around vacuum that'll be happening.
And there's also been discussion, which I know has happened in the past, but new discussion, about autovacuum scheduling. And not just autovacuum scheduling, but cost-based delay, and how to change that to be more adaptive, and to use statistics that are collected while vacuuming to decrease the delay adaptively if you're not able to finish the vacuum appropriately, and things like that. So we'll see.
We talked to Peter Geoghegan in the past as well, and I remember the deduplication work. So you both sound not like people who dream of getting rid of vacuum completely,
but you sound like there is a big potential for improvements and it's
happening.
And so it's going to stay,
but with improvements,
this is like,
I'm trying to simplify the undo people.
I mean,
they still,
the undo people.
Right.
I think that there has been a lot more work on vacuum this release than there has been in 16 and 17.
Yeah, yeah, that's great. That's great. Many people need it. I mean, both instances.
I like this kind of work. I think Peter Geoghegan's done a lot of work in the last few releases, and you and others, that focuses on things that will help people, where almost everybody will benefit without having to change anything. Transparent. And those are so powerful, they're so easily forgotten or easily ignored, but everybody benefits.
It gets better for almost everybody without them, A, noticing, and, B, having to do anything.
Just so powerful,
those kinds of changes.
So thank you for working on those and encouraging others to as well.
I appreciate it a lot.
Yeah.
The dream is that we get rid of all of those autovacuum GUCs, because
that's terrible.
Like, there's, I don't know, the fact that there are so many blog posts.
I mean, yeah, like, they should all be gone.
Your auto-vacuum configuration should be basically nothing.
I mean, the system should figure out what to do.
We often discuss, like, outdated defaults, but this is a radical position, right?
Yeah, I mean, it's not going to happen overnight, but I think my vision, and a lot of people I think want this... I think the cost-based delay system is really hard to understand, because it's not intuitive. The relationship, like scale factor and all of that,
it's just like, what is that?
I mean, I understand why it happened
from a developer perspective,
but I think that getting vacuum to do the right thing
is like our job as engineers
and shouldn't be users' jobs, basically.
So we'll see.
Well, we don't do clips on this show, but if we did, I think that would be one.
Also, the number of workers: three workers. You have, like, almost 200 cores, and you have only three workers by default. This is insane, right?
Yeah. Well, actually, I will take it back. I would say that the one configuration that I think users should be able to provide is the max number of workers, because that's your system preference, right?
I mean, maybe not the number of workers, the minimum number, but the maximum number.
You should be able to keep the system from having 50 processes just doing vacuuming work
if you want.
I think that should be...
Because this is the second phase of writes, we need it, right? Because if you just deleted something or updated something, the work is not done until it's vacuumed. You need more workers anyway.
Also, Postgres doesn't know how many cores, right?
It doesn't know, it doesn't take that into account.
Yeah, I mean... but, like, people are allowed to use their servers for other things at the same time.
And maybe you don't, but like we sort of, we don't assume that we have access to everything, I guess.
Yeah. Makes sense.
Thank you for coming. It was very interesting.
Thank you for having me.
And for this work. Again, thank you for pg_stat_io, your work, and so on, and benchmarks.
We will keep an eye on it. And good luck.
Thank you. Yeah, thanks so much. Thanks. Bye.