Postgres FM - Performance cliffs
Episode Date: April 4, 2025

Nikolay and Michael are joined by Tomas Vondra to discuss single query performance cliffs — what they are, why they happen, some things we can do to make them less likely or less severe, and some potential improvements to Postgres that could help.

Here are some links to things they mentioned:

Tomas Vondra https://postgres.fm/people/tomas-vondra
Where do performance cliffs come from? (Talk by Tomas) https://www.youtube.com/watch?v=UzdAelm-QSY
Where do performance cliffs come from? (Slides) https://vondra.me/pdf/performance-cliffs-posette-2024.pdf
Increase the number of fast-path lock slots (committed for Postgres 18) https://www.postgresql.org/message-id/flat/E1ss4gX-000IvX-63%40gemulon.postgresql.org
San Francisco Bay Area Postgres meet-up with Tomas on 8th April (online) https://www.meetup.com/postgresql-1/events/306484787
Our episode on Extended Statistics https://postgres.fm/episodes/extended-statistics
Logging plan of the currently running query (proposed patch by Rafael Thofehrn Castro and Atsushi Torikoshi) https://commitfest.postgresql.org/patch/5330
Our episode with Peter Geoghegan on Skip Scan https://postgres.fm/episodes/skip-scan
Index Prefetching patch that Tomas is collaborating with Peter Geoghegan on https://commitfest.postgresql.org/patch/4351
A generalized join algorithm, G-Join (paper by Goetz Graefe) https://dl.gi.de/server/api/core/bitstreams/ce8e3fab-0bac-45fc-a6d4-66edaa52d574/content
Smooth Scan: Robust Access Path Selection without Cardinality Estimation (paper by R. Borovica, S. Idreos, A. Ailamaki, M. Zukowski, C. Fraser) https://stratos.seas.harvard.edu/sites/g/files/omnuum4611/files/stratos/files/smoothscan.pdf
Just-in-Time Compilation (JIT) https://www.postgresql.org/docs/current/jit.html
Notes from a pgconf.dev unconference session in 2024 about JIT (discusses issues) https://wiki.postgresql.org/wiki/PGConf.dev_2024_Developer_Unconference#JIT_compilation
Implementing an alternative JIT provider for PostgreSQL (by Xing Guo) https://higuoxing.com/archives/implementing-jit-provider-for-pgsql
Tomas’ Office Hours https://vondra.me/posts/office-hours-experiment

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM,
a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard,
and as usual, I'm joined by my co-host, Nikolay,
founder of Postgres.ai.
Hey, Nikolay.
Hi, Michael.
Hello, and today we are excited to have a special guest,
Tomas Vondra, who is a PostgreSQL major contributor
and committer, now working at Microsoft
on their Postgres team.
Welcome to the show, Tomas.
Hello, everyone.
Right, so a few months ago, you gave an excellent talk about where performance cliffs come from,
and we have been excited to talk to you about this. So perhaps as a starting point, could
you let us know a little bit like how you got interested in this as a topic or why you
chose to talk about it there?
So I think I started thinking about performance cliffs
when I actually started working on Postgres
or started using Postgres.
Because performance cliffs are something people often
run into and have to investigate changes,
sudden changes in behavior after small changes in the query.
Like a single additional row in the query could kind of
like flip the behavior in a way that is completely like unexpected and unpredictable. And it's
causing problems for like robustness and so on. And I actually started with Postgres as
like a user or like a developer using Postgres. And one of the things I was responsible for was investigating performance problems.
So that's how I actually came to work on performance cliffs.
But after I started working on development of Postgres, you also see the other side of
the performance cliffs, which is why it actually happens in practice,
because the database needs to do some decisions.
So it's also an engineering problem,
which is interesting for me as a developer committer.
I've been working on the planner some of the time,
and that's where some of the performance cliffs actually happen,
because of decisions made either during planning or then during the execution.
So I've been reading some papers, and that's actually where I learned that there even is,
like, a name for this, which is the performance cliff.
And there is a whole area of research for, like, making robust decisions, where the performance cliffs
do not happen at all or not as often. So that's why I'm interested in this.
Nice. And yeah, I really like the term performance cliff. It's so intuitive in that,
from the example you were talking about,
kind of adding one additional row.
So if we go back a few steps,
if we're adding one row incrementally up until the cliff,
we might get a certain gradient,
might even be relatively horizontal.
And then all of a sudden there's a massive spike
and that's like the cliff.
I'm probably thinking of it as a spike,
but maybe a cliff would be a drop.
So yeah, I really like that as a phrase. You mentioned kind of planner versus execution time cliffs. And I guess we're talking about individual execution more than kind of system level cliffs,
mostly here. What, from your experience as a user, or maybe since you've been looking at trying to prioritize which things to work on, where do you see the majority of
cliffs coming from: plan time, execution time, some mix?
So that's a good question. I don't know. Because I am inherently biased, right?
The fact that a lot of performance cliffs,
which, for example, happen during planning,
means that you can't actually see a lot of cliffs
that happen later, right?
So I don't have very good data to explain that,
or to say where most of the cliffs happen.
But I would say that the sooner the decision needs to be made in query planning and query execution,
the less information you actually have
to make a good, robust decision.
And the more likely it is that it will be wrong, right?
Because for example, we do planning based on statistics and those are like inherently
inaccurate, right?
There are a lot of details that we remove from the statistics to keep them small.
And therefore, those are the things that can actually cause problems. And then there is, of course, the thing that even the cost model itself is extremely simplified.
So I would say that the main problem is that we are making a lot of decisions early in
the query execution. And because of that, we are kind of like fixed in a certain
execution path. And that may not be the optimal one, right? So that's where I think the performance
cliffs come from.
I wanted to ask, like, there's also a challenge, not just about planner versus executor, but about
their behavior in a single session, when there is no multi-user
situation, versus various contention situations, when we have many, many sessions
working in parallel and all of them have some issue. For
example, the infamous lightweight lock manager issue happens when, during planning, the
planner needs to lock all the indexes for the table, and if we exceed 16,
fast-path locking becomes not available for some relations. And we cannot see it when we're just checking the single query,
but it becomes a performance
cliff immediately when we have
thousands of transactions per second doing the same thing,
and planning becomes a problem, right?
I'm curious how you think this should be understood,
like with what tools, and earlier,
not when you have already fallen from that cliff.
But before, and this is the trickiest part, how to diagnose cliffs which are going to happen
soon in our system.
I understand the example that you've been describing,
but I think that's a slightly different performance cliff,
and not what I usually mean by the term.
Because you are talking about resource exhaustion.
You have a resource which is like a lock queue or
the buffer for the fast-path locks,
which was increased or like made larger in Postgres 18.
And it's configurable now, right? I think it's maybe configurable.
Yes. It's tied to the expected number of locks per transaction. So it is configurable in this way, yes. So it's not fixed size like before.
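As a small aside for readers who want to see this in practice, here is a hedged sketch: pg_locks really does expose a fastpath column, but the workload that exhausts the per-backend slots (for example, a query over a heavily partitioned table) is up to you to supply.

```sql
-- Count how many locks each backend took via the fast path versus the shared
-- lock table (the slower, contended path that kicks in once the slots run out).
SELECT pid, fastpath, count(*)
FROM pg_locks
WHERE pid <> pg_backend_pid()   -- ignore this monitoring session itself
GROUP BY pid, fastpath
ORDER BY pid, fastpath;
```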
But I think what I mean by performance cliffs, and what I describe in my talk, is mostly performance
cliffs that happen because of, like, a binary decision somewhere in the processing, right? And the
locking bottleneck is not that, right? I mean, like, yes, you do have the case where you
can either fit the fast-path lock or not, right? And if not, you have to do something
much more expensive. That's a problem, but it's not the thing that I had
in mind when thinking about performance cliffs.
Now I remember when I was checking your talk, this is exactly what I thought, like we have
different views on this term. And I grabbed this term from various blog posts where people talked about multi-exact SLRU
buffer problem, like it was also a cliff.
Sub-transactions, also calling this performance cliff, like saying sub-transactions are cursed,
don't use them at all because there is a cliff that is hidden there and so on.
So this definition is very different from yours.
I understand.
Yeah, I agree with you.
Yeah, I don't feel like I'm the authority to define what exactly that means. I'm trying to
more explain like what I've been describing, like the cases that I've been discussing in my talk.
Because also the reasons why that happens are slightly different, right? For the performance cliffs, as I understand them
or as I discuss them,
it's more like an engineering question
because you have multiple algorithms
that you can choose from.
Each of those algorithms is kind of optimal,
best suited for certain type of conditions
or selectivity ranges or something like that.
And therefore, we make the decision based on the assumption that you know the correct
answer.
And I think the bottlenecks that you've been describing are more about, say, changes in
the hardware available, right? Because years ago, when we initially sized the number
of fast-path locks, I think like 16 was probably fine, right?
It's like, it was like reasonable, right?
Like you probably didn't have that many tables
or like connections and it wasn't such a problem.
But nowadays we have many more CPUs,
many more cores available usually.
And we also have things like partitioning
which can completely explode the number of things
that you need to lock.
So I think it's more like a consequence
of changes in Postgres that leads to hitting this bottleneck.
I'm not saying it's not a valid problem. It definitely is. We need to improve that.
But it's a slightly different thing.
Right. Yeah, I agree. Yeah.
The topic's already huge if we only limit it to single session, single query, performance
cliffs.
So if we think about that, just those, you talked about both reducing kind of the frequency
of these, so making them less likely to happen, but also reducing the severity of them.
So kind of accepting they will happen, make them less of a cliff or less of a drop when
they do.
Can you talk us through some things from a user perspective that we can do to reduce the severity
and or to reduce the likelihood of running into them?
That's a good question. I think it probably depends on each individual case of the performance cliff.
And in some cases, you probably can't do anything.
Right?
But it's just good to know that this is what's happening.
In the talk, which you can see on YouTube
(and I'm actually invited to give the talk at the San Francisco meetup in like a week),
the first example that I gave is about the number of elements in the
IN condition, right?
Where you have column, IN, and then an array of elements.
And there is like a flip, because at an exact point we switch from
one algorithm to the other, like from a linear search to a hash.
But that's hard-coded.
There's nothing the user can do about that.
The one thing you could do is to pick the right size of the list, right?
But that's not a very practical thing.
But I think in many cases, just knowing what's causing the problem
can be a very useful thing because you can say like,
okay, this is what's happening, I can't do anything about that.
I have quantified like the impact
and we just have to live with that for now.
I can imagine like patches doing something
like improving that by adjusting the point
where exactly that flips
by doing some better estimation.
But at this point, people can't do anything about that.
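To make the IN-list example concrete, here is a minimal sketch; the table, data, and list sizes are made up, and the exact element count at which the executor switches is hard-coded and version-dependent.

```sql
-- Hypothetical setup just to have something to query.
CREATE TABLE events (id int, status text);
INSERT INTO events SELECT g, (g % 100)::text FROM generate_series(1, 1000000) g;
ANALYZE events;

-- Short list: the IN condition is evaluated as a simple linear search of the array.
EXPLAIN (ANALYZE, COSTS OFF)
SELECT count(*) FROM events WHERE status IN ('1', '2', '3');

-- Longer list: past a hard-coded number of elements the executor switches to a
-- hashed lookup for the array. The plan shape looks the same; the flip shows up
-- as a sudden change in execution time rather than in the plan, which is exactly
-- why it surprises people.
EXPLAIN (ANALYZE, COSTS OFF)
SELECT count(*) FROM events
WHERE status IN ('1','2','3','4','5','6','7','8','9','10','11','12');
```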
For some of the cases in planning,
you can do the usual thing,
trying to convince the planner to do the right decision.
There are different ways to do that.
I don't know what's the best solution to that.
Things I really liked were... I think, proactively, people can adjust
some of the cost parameters.
some of the cost parameters.
Like we still have random page cost as four as the default,
even on most cloud providers.
So I feel like things like that,
when we've mostly got SSDs these days, just feel like
they're going to be more likely to push plan flips at the wrong times.
You know, some of these... if there are other cost signals that are massively out as
well (I'm not aware of any), that feels like a footgun, or something that if
people change proactively, or reduce proactively, they'd be less likely to encounter this kind of thing.
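For example, a hedged sketch of that tuning; 1.1 is just a commonly used value for SSD or cloud storage, not an official recommendation.

```sql
-- Lowering random_page_cost moves the planner's flip point between sequential
-- and index scans closer to where it actually sits on SSD-backed storage.
SHOW random_page_cost;                      -- 4 by default
ALTER SYSTEM SET random_page_cost = 1.1;    -- cluster-wide, needs a config reload
SELECT pg_reload_conf();

-- Or just per session while experimenting with a particular query:
SET random_page_cost = 1.1;
```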
And also, you did a lot of the work on extended statistics,
and that feels like another tool that used well could start to teach the planner
about certain relationships and make cardinality errors far less likely.
So, and again, maybe then less likely to have plan flips at the worst times.
When we do have the plan flip,
it's more likely to be closer to the optimal point.
So we're less likely to have a severe cliff at least.
What do you think?
I think you are absolutely right that like
tuning some of the parameters for like random page cost,
for example, and something like that,
is exactly one of the things you can do
to convince the planner to make the correct decisions
like more often, right?
Because you are essentially moving the flip point
between the costs closer to the actual flip point, right?
Between the execution costs,
like measured in time, for example.
So that's definitely true.
I think most of the time we actually make the correct
decision for, like, regular queries,
because the selectivity where we would have a problem
is kind of like in the middle, right?
Somewhere.
And usually queries have, like, very low or very high selectivity, or something.
So in between, it's not very common.
It can happen, but I don't really have a perfect solution to that.
Extended statistics definitely can help, but it's still the, you know,
the very early decision, right?
It's going to help with improving that in some cases.
And I know you had a whole podcast about that.
The problem with the extended statistics
is that it still is like an extremely simplified
representation, an approximation of the data.
You know, like, that's the whole point of having statistics: to have a very compact
data set, which you can use to do the planning quickly, right?
Which means that it definitely is missing some of the details.
And that can be fatal, right?
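For concreteness, here is a minimal sketch of extended statistics, assuming a hypothetical addresses table where city and zip_code are strongly correlated (the classic functional-dependency example; names and values are made up).

```sql
-- Without extended statistics the planner multiplies the two selectivities as
-- if the columns were independent, which can underestimate the row count badly
-- and push the plan over a flip point.
CREATE STATISTICS city_zip_stats (dependencies, ndistinct)
  ON city, zip_code FROM addresses;
ANALYZE addresses;

EXPLAIN SELECT * FROM addresses
WHERE city = 'Prague' AND zip_code = '11000';
```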
So I think what we need to focus on in the future
is kind of like making the execution part a bit more robust,
like moving some of the decisions to that,
instead of expecting the decisions in planning to be perfectly correct all the time
to kind of like allow the execution to actually recover from mistakes.
Change its mind on the fly, right? Change the execution path?
Kind of, right? Or like, yes, there are different approaches to that. Some
are kind of, like, difficult to do correctly, like replanning, for example, right? Or, a
typical problem for a planner is, like, a sequence of nested loops, right? When you have underestimated the join size
or like a condition selectivity,
then you pick a nested loop
and then it explodes and takes forever.
So there are approaches which kind of
blend two different paths through the join
and either do like a hash join or like a nested
loop, right?
But there are, like, practical issues with what you can do, limiting the behavior of the optimizer,
right?
Because if you want to do this, then for example, you cannot emit any tuples from that join
until you know that that's the correct approach,
right?
Because then you wouldn't be actually able to flip to the other and so on.
And I see another challenge, it's like user experience: we all are very scared about the
probability of plan flips sometimes, right?
And it's dangerous, especially when you do an upgrade, a major upgrade; plan
flips are not uncommon.
And I know, for example, AWS RDS Aurora has an extension, which is not open
source, to freeze plans, so no flips are possible.
With what you describe, if I see some plan in EXPLAIN, but then during execution the plan
can be changed, it makes my head think, like, how can I control it?
I'm not controlling it, I just understand what's happening, right? Because EXPLAIN will lie in this case, right?
Hmm.
How would... or maybe you see something that could help, to see, like,
two plans in EXPLAIN, or three plans, and we know that...
But we already can show this kind of stuff in the plan, right?
Where when you have a sub plan,
we can either have, like, a regular subplan
or a hashed subplan, I think.
So we kind of like can show like alternative plans.
I don't think that would be a problem.
Obviously that wouldn't tell you
like what exactly happened during execution. It
would need some additional improvements. But yes, it adds ambiguity to the explain plan.
We would tell you: these are the possibilities. You don't know what's going to happen, and it could actually differ between executions
of the same query even, right?
But that's the price for higher robustness, right?
Right.
I'm just thinking observability tools,
many people want them to be improved,
like to track plans; pg_stat_statements is not enough to have plans.
One normalized query can have multiple plans.
In the past, there were attempts to have pg_stat_plans
or something.
There are also extensions which allow
you to see the plan for an ongoing query.
I think if this direction is going to become the norm, to alter the plan on the fly, then observability
tooling needs improvements for sure, because we need to understand what's happening.
And auto_explain as well, right?
Probably it should report the initial plan and then the additional plan, or something like this alternation. So I
just feel a bigger need to improve the observability part of Postgres as well.
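As a rough sketch of the kind of tooling already available for catching plan flips after the fact (the thresholds below are arbitrary examples, not recommendations):

```sql
-- auto_explain logs the actual plan of any statement slower than the threshold,
-- which is often how an unexpected plan change is first noticed in production.
LOAD 'auto_explain';                           -- or add it to shared_preload_libraries
SET auto_explain.log_min_duration = '500ms';
SET auto_explain.log_analyze = on;             -- include actual row counts and timings
SET auto_explain.log_buffers = on;
```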
Yes, I have nothing against observability. And I think... I haven't tried pg_stat_plans for a while,
but I think it still exists.
It's still available, right?
I think it's very useful.
I think there's also a patch to actually add ability
to watch a plan as it's being executed, right?
Which can actually help.
That would be a tremendous benefit
when investigating queries that never complete,
where you can't actually get like explain analyze and so on,
or just watch the execution of a query in another backend.
There is a patch for that.
It's definitely not going to be in Postgres 18,
but I think Rafael actually is working on that.
I hope I didn't get the name wrong.
Yeah, but again, I don't think that's necessarily
about the performance cliffs only, right?
That's about like monitoring in general.
This concept of robustness is really interesting.
I don't think I had come across it as a defined term
until reading some of the papers that you linked to
at the end of your talk.
And I found it fascinating.
I think it aligns well with my experience
of what people really care about in performance.
I think
correct me if I'm wrong, but I got the impression it's really about trying to avoid
really bad worst case situations at the cost of potentially slightly worse average execution time, for example, if we want to optimize based on time. So this idea that on average, we might perform slightly worse,
but our worst case will be nowhere near as bad as the,
so the P99 would be way better or way less severe,
but the P50 might be a bit higher.
Is that about right?
Yeah, I think that's exactly right.
You're exchanging the, you know,
worse average performance for having to do
fewer investigations of performance issues.
And I think that's perfectly reasonable trade off. I mean,
I do like working on like micro optimizations and micro benchmarks and everything.
That's great. But in the end, someone needs to be using the database, right? I mean, like,
that's the point. And, I don't know, when people had, like, one database,
and they cared about that only, that was fine. As soon as you have, like, a larger fleet of
database servers that are actually used by people or by applications
or something, you definitely will value robustness, like not having issues, even if the peak
performance is maybe 25% lower or something like that. Because it's much easier to spend a bit more on the hardware
than having to deal, having to run a large team
investigating and fixing issues
and firefighting all the time.
That just doesn't work at scale.
And I think it's also what some of the cloud hosting
providers, like, realize, right? So it's a bit against my
nature, because I definitely do want to do like the best
performance possible, right. And I've been benchmarking a lot and
like doing like performance related patches for a long time.
But at some point, I think it's kind of pointless, right?
Because people will not be actually able to use that in production anyway.
Yeah.
I mean, there are definitely, it feels like there are cases and there are people trying
to get the best of both worlds, right?
Trying to get a faster than we currently have solution that is also more robust.
Like we had Peter Geoghegan on to talk about the work he's been doing, like in the direction of skip
scan. But that work seems to make some decisions post-planning, during execution, as to whether it's going
to skip or continue trying to skip
or just run through the index sequentially,
depending on certain statistics it gathers along the way.
That strikes me as work in this direction
that in the real world doesn't seem to give up too much
on the optimal path either.
So I think you are talking about
the multi-dimensional access method that Peter was talking about.
And I think it's actually a win-win situation in this case, right?
Because he actually can make better decisions based on the information that he has
at the planning stage, right?
So he actually is not sacrificing anything.
He actually, he's improving the behavior in many cases.
So I think it's amazing what he's doing with indexes.
And I've been grateful to actually collaborate with him
on the index prefetching patch.
So, and his comments were very useful.
And I think it's actually Peter Geoghegan who pointed out,
like, a quote from Goetz Graefe, who is one of the
authors of the papers, who said that choice is confusion.
By adding more different ways to execute a query to the optimizer, you are just
giving it more chances to make a mistake. So I think the robustness,
like the main parts of the features that improve robustness,
need to be in the execution path.
It's like adjusting based on the actual data, right?
Yeah. So we've got probably more than this,
but three main times we're having to make,
or the planners having to make decisions
that are really impactful for plan flips.
We've got scan type.
So we've got kind of sequential scan,
index scan, bitmap scan, index only scan,
perhaps some others
if you've got extensions installed.
And then we've got join algorithm.
And then we've also got join order.
Based on your talk,
I've read a bit about the first two,
but nothing about the last one
in terms of execution time optimizations.
Is there anything around that that you'd be wanting to be looking into?
So for the first two parts that you mentioned, there are actually, you know,
papers proposing, like, on-average-slower, but more robust, execution paths.
And I think it would be interesting to have some of those
as extensions in Postgres.
Because we do have the custom scan interface
and we can actually inject a different path.
So that would be fine.
For the join order search, I'm not sure. I'm aware of, like, some issues in that,
but I think for the typical queries the algorithm actually works quite fine,
but there are some examples of queries where that actually, you know, blows up,
right? And I don't know if that's what you've been pointing at, but, like, a star join, for example,
is like a great example, right?
Because we kind of, like, end up exploring n-factorial different plans.
And we can actually improve that.
I actually posted like a proof of concept patch for improving this case. But there's a lot of work that needs to be
done on that. But that could actually help tremendously.
Because I think in another talk, I compared like performance for
different kinds of workloads, since like Postgres 8, I think.
And on like a star join with, I mean, OLTP star join,
which means like you select one row from
or a small number of rows from the fact table,
and then have, like, 20 dimension tables joined to it.
So for that, we didn't actually improve
throughput at all since Postgres 8, right?
Which is, like, crazy, compared to the, you know, one or two orders of magnitude
better throughput for the other like OLTP workloads.
I think that's a clear bottleneck there.
And it's exactly because, you know, the join order search is not great.
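The query shape being described looks roughly like this; table and column names are hypothetical, and only three of the "like 20" dimensions are shown.

```sql
-- OLTP-style star join: fetch one fact row plus attributes from many dimensions.
-- With many such joins, the planner's exhaustive join-order search blows up.
SELECT f.*, d1.name, d2.name, d3.name
FROM fact f
JOIN dim1 d1 ON d1.id = f.dim1_id
JOIN dim2 d2 ON d2.id = f.dim2_id
JOIN dim3 d3 ON d3.id = f.dim3_id
WHERE f.id = 42;
```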
We do have something we call genetic optimization,
but I think almost no one actually
uses that because the default configuration parameters
actually force us to do a different thing.
So we don't actually use the genetic optimizer at all. Well, when we have 20 tables joined, it's used, right? Because the threshold...
No, no, no, no, no, no. Because first we split that at 8 into smaller join searches, right?
And we do the regular algorithm for that.
I see.
So you would actually need to change the defaults, and only then would the genetic algorithm be used.
I didn't know this. Okay.
Yeah, I didn't know that either, until I started doing the benchmarks to compare the different versions.
I was wondering why the genetic optimizer is not actually visible anywhere.
And it's because of this.
So people don't use that.
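These are the defaults being referred to; raising the collapse limits is shown only to illustrate how GEQO could ever be reached, not as a recommendation.

```sql
SHOW geqo;                 -- on
SHOW geqo_threshold;       -- 12: GEQO kicks in at this many FROM items
SHOW join_collapse_limit;  -- 8: explicit JOINs are collapsed into lists of at most 8
SHOW from_collapse_limit;  -- 8: same for subqueries pulled up into the FROM list
-- Because the join lists are capped at 8, the planner rarely sees 12+ items at
-- once, so the genetic optimizer effectively never runs with default settings.
SET join_collapse_limit = 20;
SET from_collapse_limit = 20;
```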
When you enabled it,
did it improve on later versions versus version eight?
I think it did.
But I don't remember the details.
I haven't like included that in the charts at all,
for example.
That's a good question.
I don't know.
Thinking about, like, next steps for Postgres,
what would you be most excited to see people working on
or what are you looking to work on next
that you're excited about?
So I definitely would like to see someone working on
the robustness patches, for either the smooth scan (I do have,
like, a proof of concept patch for that) or the G-Join, I think they call it, which is like
a robust algorithm for joins. So I think that would be, like, extremely useful. And I'm 99%
sure it could be done as an extension.
So that makes it like much easier to hack on
without breaking everything else.
And it also makes it like immediately available for users,
which I think is, that's very useful.
And then I think there are actually some smaller patches
where people could improve individual performance
cliffs.
If I go back to the example that I used in the talk, I can imagine people experimenting
with choosing the flip, the number at which we flip from a linear search to a hash search, right?
So I can imagine people doing some runtime measurements and doing that.
I think another problem for robustness is just-in-time compilation for us, right?
Which adds a completely different level of robustness issues, like unpredictable behavior.
We just switched it off.
But that's great.
Yes.
I think that's the default recommendation, but it's also quite sad, right?
Because just-in-time compilation can be really helpful for some queries.
But it's a robustness decision, right?
Like Nikolay often does, and I've seen people as well,
choosing potentially worse on-average performance
on certain queries, to not get those outliers.
So it's exactly about robustness.
Yes, I think that's a great example,
showcasing the decisions people make, right?
And they almost always choose better robustness if that's a production workload, right?
I think that's fine, right?
And I think we could have like a, I don't know if you had a session about just-in-time
compilation, but I think there's a whole set
of things people could do to improve that.
It could be better planning, or it could be using a different library for JIT.
Because a lot of the problems that we have are actually due to LLVM not being a well-suited library for the kind of JIT that we do for
the use case.
Yeah, interesting.
We haven't done an episode on it.
I think you're right, it would be a good one to do.
Who would you recommend we talk to about that?
Is it Andres or who is it?
I think Andres is the person who knows the most about that, because I think he's
the one implementing that.
So you can also like complain to him and he's the right person, right?
But I think there are actually other people who submitted some patches for, like, improvements.
I think, like two years ago, there was a blog post about someone actually implementing, like, a custom JIT.
And that might be another good guest, I think.
I completely forgot about that. That's a good idea.
And the other thing is, I think it's based solely around, like, a cost limit.
That feels very crude to me.
Yes, as I say, the planning is very, very simplistic. It essentially makes the
decision for the whole query at once, right? I mean, the fact that it's based on a
cost limit, that's roughly fine, but then crossing the cost limit for the whole
query can also enable just-in-time compilation in parts of the query where
that's not really useful, right?
So that's definitely one of the things that we could improve.
I mean, like it's not a perfect solution for all the cases,
but it would be, I think, a better approach.
But I'm sure, like, there are reasons
why it was done like this, right?
So I don't want to, like, make a definitive statement
about solutions.
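For reference, these are the JIT knobs being discussed, with their stock defaults in recent Postgres versions; turning JIT off entirely is the blunt robustness choice mentioned above.

```sql
SHOW jit;                        -- on by default since Postgres 12
SHOW jit_above_cost;             -- 100000: JIT only for queries estimated above this cost
SHOW jit_inline_above_cost;      -- 500000
SHOW jit_optimize_above_cost;    -- 500000
-- The decision is made per query from the total estimated cost, which is the
-- "crude" part: a single threshold for the whole query, not per expression.
ALTER SYSTEM SET jit = off;      -- the common "robustness over peak performance" choice
SELECT pg_reload_conf();
```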
Yeah, would be an interesting discussion.
Was there anything you want, anything we should
have asked about that we didn't? No, I think we discussed a lot of the different parts
that I mentioned in the talk. So I think that's fine. It's a really excellent talk
and we definitely didn't discuss it all and it comes across really well with examples.
So I will include the links to the talk and the slides as well.
And I'll include a link to the San Francisco meetup.
So if anybody's got questions themselves, I guess you're attending live so they
can ask them at the meetup as well.
Yeah.
Or they can just like send me an email or something.
I also do something I call office hours
where people can just let me know
what they would like to talk about.
And it could be, like, a patch they are considering, or whatever.
And we can definitely, like, schedule a call or something.
That's also what I do for the community, yes.
Yeah, that's awesome.
I did actually want to say thank you
for everything you're doing,
because you're also doing some mentoring and some patch ideas for new potential committers.
And I think that kind of work is very, very valued and very, very valuable. So thank you.
Yeah.
Awesome. Well, thanks so much for joining us, Tomas. And yeah, I hope you have a great week.
Yep. Bye.
Bye bye.