Postgres FM - Plan flips
Episode Date: March 6, 2026

Nik and Michael discuss query plan flips in Postgres — what they are, some causes, mitigations, longer term solutions, and the recent outage at Clerk.

Here are some links to things they mentioned:

Recent postmortem from Clerk https://clerk.com/blog/2026-02-19-system-outage-postmortem
The real cost of random I/O (blog post by Tomas Vondra) https://vondra.me/posts/the-real-cost-of-random-io
autovacuum_analyze_scale_factor https://www.postgresql.org/docs/current/runtime-config-vacuum.html#GUC-AUTOVACUUM-ANALYZE-SCALE-FACTOR
default_statistics_target https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-DEFAULT-STATISTICS-TARGET
pg_hint_plan https://github.com/ossc-db/pg_hint_plan
Aurora PostgreSQL query plan management https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Optimize.Start.html
pg_stat_plans https://github.com/pganalyze/pg_stat_plans
pg_plan_alternatives https://jnidzwetzki.github.io/2026/03/04/pg-plan-alternatives.html
Waiting for Postgres 19: Better Planner Hints with Path Generation Strategies https://pganalyze.com/blog/5mins-postgres-19-better-planner-hints

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM.
We share about all things PostgreSQL.
I am Michael, founder of pgMustard,
and I'm joined as always by Nik,
founder of Postgres.AI.
Hey, Nik.
Hi, Michael.
How are you?
I am good.
How are you?
Very good.
Fantastic.
Oh, great.
What are we talking about this week?
Not about AI.
Yes, phew.
We're having a detox.
Everyone is talking about AI,
so we are not going
to talk about AI.
Great. So what are we talking about?
Let's talk about plan flips. It's quite a complex topic, because I feel
Postgres lacks a lot of stuff here, and I'm not sure what it needs to have in future to be
better. And there is a lot of criticism, including a recent one.
A recent wave of criticism against Postgres in this area, right?
After some incidents at one company. The company
name is Clerk, right? Yeah, Clerk. How do you pronounce it? I say Clark. Okay, that's
interesting. So it's E, but that might be... I don't know. Yeah. Okay. Yeah. And there was an outage, or a
series of outages, I don't know. Yeah, quite a long one, two weeks ago. And, as usual,
kudos to all companies who share their experience. This is great. They shared their experience. They had
a blog post with a timeline and root cause analysis, to some extent. And then on Twitter
I saw discussions which looked like "Postgres is not good enough, let's choose better systems".
Because look what you might have with Postgres. Yeah, look what can happen.
Yeah, might happen, exactly. Something quite innocuous. So they said the plan flip was,
maybe not the root cause, but the thing that took them down, due to insufficient
statistics.
But yeah, they're an infrastructure provider, right?
So they do authentication.
And if auth is down, that means all of their customers are down.
Yeah, exactly.
So this was a huge issue for them and for all of their customers.
They are very wary of things that can change unexpectedly, understandably.
But equally, lots of services that need
extreme high availability run on Postgres.
So how can they do it?
I think that would be a good discussion.
So in my opinion, their root cause analysis is halfway done.
And naming a plan flip the root cause is not all right.
Why?
Well, "five whys".
You know this approach.
Yeah, I think Toyota.
Yeah, the basic approach to root cause analysis is to ask "why" until you cannot answer
at all. So their
analysis was saying
the problem happened
because Postgres triggered
analyze, obviously an
automatic analyze by the autovacuum
process, which
caused a plan flip. So
they did only two steps in this
five-whys approach. They said,
okay, what happened? Why are we
down? Because statistics
changed. Why is that? Because of a plan flip.
Why? Because it was triggered
by autovacuum. Two or three steps.
But you can go deeper, right?
Yeah.
You can go deeper.
I think there was one more.
I think they did increase the statistics target, for example,
to make it less likely to happen again,
but there still are more steps, for sure.
Yeah, did they increase default_statistics_target?
No, maybe not the default, but for the table, I think.
Ah, or for the column, I forget which.
Yeah, it's either per column or global.
Anyway, that's interesting.
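For listeners following along, a minimal sketch of the statistics-target changes being discussed. The table and column names here are hypothetical, not from the Clerk postmortem:

```sql
-- Raise the statistics target for one skewed column
-- (table and column names are made up for illustration):
ALTER TABLE events ALTER COLUMN external_id SET STATISTICS 1000;

-- Or globally, for all columns without a per-column override:
ALTER SYSTEM SET default_statistics_target = 1000;
SELECT pg_reload_conf();

-- Either way, the new target only takes effect at the next ANALYZE:
ANALYZE events;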
because in my opinion
there are different
plan flips,
different kinds of plan flips
and I'm still trying to figure out
for myself
but if you have a plan flip
where suddenly,
like, it flipped after
recalculation of statistics,
that's interesting,
because that might happen...
this is actually a difficult type,
this is a hard type of plan flip.
But the question is why
it happened only now, and not one hour ago, for example.
And what's to stop it happening again?
Yeah, yeah.
And this is interesting, right?
So I have some thoughts on this specific case, but I think it might actually be useful for us to go back to basics a little bit and define what we mean by a plan flip, and what the normal causes of them are.
And I think this is a little bit of a special case that should be interesting and might be completely avoidable, I think.
But a lot of cases aren't avoidable.
And I think that's an interesting discussion.
But yeah, in terms of plan flips, I wanted to ask you, do you define them only when they're bad?
Because we have a cost-based optimizer, right?
The reason we can flip query plans is because the planner has choices, and it makes those choices based on estimated costs of doing different join algorithms or different scan types.
But flipping is part of its design, right?
The same query running on different data, as the data changes, perhaps should use a different
execution plan. It would be faster with a different execution plan. So flipping is designed, it's
deliberate. But the idea is it flips to a better plan when the data suits that. So a sudden
plan change is normal. Yes, yes. But when people say "plan flip", though, it feels like they're
only ever talking about the bad case. Yes, exactly.
Exactly.
Yeah, because this is when it changes to a wrong state.
Because the planner sometimes makes mistakes.
Yes, or misestimations, yeah.
For example, if you have random_page_cost at 4, which is the default...
I saw many situations where we have two candidate plans
which have very similar cost
under these circumstances. So for this data, and what we have in pg_statistic,
and for these settings, including random_page_cost at 4, which is the default, I saw cases where
there are two plans. One is obviously wrong: for example, a sequential scan of a huge table.
And another is obviously right, which is an index scan. I'm remembering some simple case.
Simple in terms of: it's obvious to a human, or even to AI, that this plan, for
this table, shouldn't be chosen, because it's a sequential scan of the whole table.
But since, for example, Postgres thinks that random I/O is so expensive,
it suddenly chooses the sequential scan plan instead of the index.
And you don't notice until some point, while these costs are very close to each other.
And then suddenly, when a little bit more data comes to the table and analyze happens,
the sequential, bad one wins, and that's it. This we usually solve by adjusting
random_page_cost, telling the planner: please don't choose sequential plans. I know we have,
like, Tomas Vondra... yeah, yeah... he has great proof that maybe random_page_cost at 4 is
not bad, maybe it should even be bigger. But I have a brain split here, because
I see that his reasoning makes sense theoretically, but what I saw in production
proves the opposite. I saw these plan flips happening just suddenly, like
some ticking bomb. It was sitting there, and then boom, suddenly the
planner chooses the sequential scan and we are down, because it times out, right? Yeah, I have a theory
about this. I've been thinking a little bit about it. And I think,
It's like, Tomas's work is very good in theory, because he cares about the optimal time to flip.
He cares about how much data we need to be reading before a sequential scan is faster.
But in practice, with large tables, the flipping point is not that important.
The main point is we almost never want to request that much data.
Often, especially in OLTP, we want very small selects:
we're only selecting one record, or a very small number of records.
So the flipping point, we want to put so far out of sight that even if our row estimate is really bad...
because often it's a combination of two things, right?
It's not just the cost of doing random access, it's also how many rows we expect to be reading.
And if that's out by a lot and the cost is high, those multiply together, and suddenly Postgres is thinking:
you know what, I might as well scan the whole table here, because I'm expecting to read so much.
But if we decrease the random_page_cost, we get away with the bad row estimate.
So I think you can fix either and get towards it.
But if you do both, then you massively decrease your chance of this kind of issue.
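A sketch of the kind of change being discussed. The value 1.1 is a common community suggestion for SSD or cache-heavy systems, not a number from this episode:

```sql
-- Lower random_page_cost on an SSD/cache-heavy OLTP system (default is 4):
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

-- Verify the running value:
SHOW random_page_cost;
```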
I hear you, and in my mind this connects very well to the topic of OLTP versus analytical versus hybrid,
because there are good advancements in the area of analytical query processing for Postgres, in various ways.
And there is also... I have a strong belief that there is a very simple case called pure OLTP.
Pure OLTP means that we need to serve users who are mostly humans, maybe.
And they have human perception.
Our very first episode was about these 200 milliseconds.
This is when they recognize something
is already happening.
And if it's significantly more than 200 milliseconds, it feels slow.
Which means we need to have our queries below 20 milliseconds, because one web page consists
of multiple queries, usually roughly 10.
Say 10, some guess, right?
It means we need to be below 20 milliseconds.
In this case, by no means do I want any sequential scan to be applied.
Never.
I don't want it.
Just forget about sequential scans.
I want to turn it off
completely.
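Worth noting, as a hedge: Postgres has no hard switch for this. enable_seqscan = off only adds a large cost penalty, and the planner will still fall back to a seq scan when no other path exists. A sketch (table and role names hypothetical):

```sql
-- "Turning off" seq scans is really a strong discouragement: the planner
-- adds a huge cost penalty but can still pick one if nothing else works.
SET enable_seqscan = off;

EXPLAIN SELECT * FROM events WHERE id = 42;  -- hypothetical table

-- Often safer scoped to the OLTP application role than set globally:
ALTER ROLE app_oltp SET enable_seqscan = off;
```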
I would counter that sequential scans on tiny tables are fine, right?
If it's only got 100 rows in it,
but also an index scan on that table would be fine as well
because it's still only got 100 rows.
So the times where sequential scan is best
don't matter anymore because an index scan whilst being less efficient
or even a bitmap scan would be absolutely fine
even if it's twice as slow
or even five times as slow because it's such a tiny table.
But we also have different...
And there is also the bitmap scan, the bitmap index scan plus bitmap heap scan.
So it's like in the middle.
And how the planner thinks about the cost is interesting.
So, how we tune some service for OLTP...
and this is obviously OLTP,
people didn't want this service to be down.
Of course.
Yeah.
We definitely want very strict timeouts.
I think 15 seconds is a lot.
We should go down to 10 seconds.
10 seconds. So statement_timeout should be 10 seconds. Even transaction timeout should be 10 seconds,
if we're radical. Let's introduce the concept of radical OLTP. Yes, you may have some workload that needs to be
analytical, and some people can wait, or some background jobs. Yes, create index taking a couple of hours
is also a different story. But if it's a pure OLTP mobile app, which many people depend on,
put transaction timeout and statement timeout at very low values, be very strict,
and just cut those queries which start running longer.
In this case, if you, for example, have a bunch of them happening,
there's a chance that on a powerful machine you will survive.
If you don't have it, there will be a snowball effect of many sessions getting stuck,
running sequential scans or something very far from optimal,
and it will be consuming resources.
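A sketch of these safeguards. The 10-second values are the episode's examples, not a universal recommendation, and transaction_timeout only exists in Postgres 17 and newer:

```sql
-- Strict "radical OLTP" safeguards; values here are illustrative.
ALTER SYSTEM SET statement_timeout = '10s';
-- transaction_timeout is available in Postgres 17+:
ALTER SYSTEM SET transaction_timeout = '10s';
SELECT pg_reload_conf();

-- Long maintenance work (e.g. CREATE INDEX) then needs a session override:
SET statement_timeout = 0;
```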
The other thing it might do is it might more quickly point you in the right direction
as to what's causing the issues.
Yeah, how to diagnose.
Let's maybe, I think I'm pushing us to some like problems and resolutions already,
but let's go back.
You raised a very good point about what plan flip is.
So in my head it's also, like, a negative term.
A plan flip is a sudden plan change in the wrong
direction, when basically the planner chooses a plan we don't want, like a sequential scan, or maybe
a bitmap scan which is timing out, for example. Yeah, I think it might even need more than that.
I think it has to be noticeably worse. If it flips and it's 1.1 times slower,
I think most people don't really care. But if it's double or triple, or it takes
you from being at 90% CPU to 100% CPU, if it crosses that threshold... maybe again it's the
Tomas episode we did on performance cliffs. If it's
a cliff for you, then...
In my opinion, the performance cliff term is different, and it's a macro term,
while plan flip is a micro term. It's just one session, and we see that suddenly
the execution of this query became much worse.
It was 10 milliseconds, and suddenly it became like one minute, boom.
An obvious example is: we had an index scan, but suddenly the planner... and usually it happens
in complex queries, right?
So joins, subqueries, CTEs, multiple stages, something complex.
And then boom.
While performance cliffs for me are macro level: like, some lightweight lock contention didn't happen, and then suddenly it happens.
There is no official terminology here, right?
So I'm just reading some blog posts and trying to adjust my own mind to them, mostly.
So this is how I feel.
So plan flip is a negative term, right?
So we don't want it to
happen. And we started with a difficult example. This company obviously experienced this difficult
example. And my question to them is: what random_page_cost did you have? Did you have statement_timeout,
idle transaction timeout (maybe not related in this case), and transaction timeout?
So, some safeguards, safeguards which would protect from a huge wave of sessions
experiencing this bad plan being executed. And also, what were your autovacuum settings?
You think that's relevant? Before we go to... before we go to an
increased number of buckets with the statistics target, these are macro settings we need to keep in
mind. And this is what I had... yeah, I had this thought. Yeah, sure. This for me felt like a
data skew issue. They mentioned the column in question was 99.996% null values, and this query was
looking for the not-null values, of which there were still 17,000. So we're talking about
billions of records, but only 17,000 were not null. In cases like that, where you've got such a data
skew, you could be looking at increased statistics for those columns,
or often you want a partial index on a column like that, right?
Because you don't care about indexing the billions of null values.
You only care about indexing the not-null values.
And once you've got the perfect index on that...
I know it could be a more complex query where you want different columns,
but if you've got those columns in the index with WHERE the value IS NOT NULL,
then actually the planner is almost certainly going to choose that,
because it's such a perfect index for that query.
And you don't even have to worry about mis-
estimations; it's never going to pick a seq scan in that case.
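A sketch of the partial-index idea for a mostly-NULL column. The schema here is hypothetical, not from the Clerk postmortem:

```sql
-- Hypothetical schema: billions of rows, external_ref NULL for ~99.996%.
-- The partial index stores only the rare not-null values, so it stays tiny.
CREATE INDEX CONCURRENTLY idx_events_external_ref
    ON events (external_ref)
    WHERE external_ref IS NOT NULL;

-- An equality search implies IS NOT NULL, so the planner can prove the
-- index predicate matches and use the tiny partial index:
EXPLAIN SELECT * FROM events WHERE external_ref = 'abc-123';
```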
Yeah.
Yeah.
So we definitely lack information here.
Yeah.
We didn't see the plans.
We don't know of their settings.
We even don't know the version.
True.
Right.
So we can only guess.
And again, there are difficult situations with plan flips.
I saw quite difficult ones, where obviously everything was normal.
But what I'm trying to say is: you can get a plan flip because you didn't do a good job
configuring Postgres well enough and putting in these safeguards for OLTP workloads. And this is what
we do, number one, with all the clients during consulting, and with the clients who install our
monitoring and checkup service. This is, like, the number one thing: we recognize we are
OLTP, the priority is to have super fast queries, and we need to prepare the basics, right?
Unfortunately, they cannot protect us in 100% of cases.
But if you have your autovacuum not tuned... for example, there is this relatively new setting.
I remember Darafei raised this topic. We didn't have autovacuum_vacuum_insert_scale_factor.
And autovacuum_vacuum_insert_threshold is there too, right?
So it triggers on inserts.
And by default, it's 20%.
And we also have the old regular autovacuum_vacuum_scale_factor,
which triggers on updates and deletes, also 20%.
And for analyze, it's 10%, 0.10, right?
Which means 10% of your table needs to change.
If it's a huge table, that's a lot of data.
And again, this data skew, it's a good point.
If you had one distribution,
but suddenly all the new records are very different,
and you don't have a properly tuned autovacuum, you accumulate a lot of that data.
And then, only after 10 or 20% of the data changed or was inserted, autovacuum finally ran analyze.
Suddenly we have a flip.
If it had happened earlier, maybe it also would have been bad.
That's actually a good point.
Maybe we still need the statistics target.
But in my opinion, it's better to have a more frequent process
running on large tables, so statistics would represent the actual data better, more often, right?
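The more-frequent-analyze idea can be applied per table via storage parameters. A sketch with illustrative values (the table name and the specific fractions are assumptions, not recommendations from the show):

```sql
-- Hypothetical per-table override so autovacuum analyzes a large table
-- far more often than the global defaults would:
ALTER TABLE events SET (
    autovacuum_analyze_scale_factor       = 0.01,  -- default 0.1 (10%)
    autovacuum_vacuum_scale_factor        = 0.02,  -- default 0.2 (20%)
    autovacuum_vacuum_insert_scale_factor = 0.02   -- default 0.2 (20%)
);
```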
To allocate more resources, to have more workers: this is what we do all the time with larger
services. And unfortunately, we see, I would say, two out of three clients coming in like this.
Most of our clients are startups which have, say, a database exceeding 100 gigabytes,
up to 10 or 20 terabytes.
This is like a sweet spot for us,
where they've experienced some problems already.
And two out of three, roughly,
have poorly tuned autovacuum.
Only two out of three is quite surprising to me.
This is my rough estimate.
I can check the statistics.
You need to update your statistics.
My internal... yeah, I need to ask AI
to maintain statistics.
Not sure about that.
Anyway... but I had a question, though.
Yeah.
default_statistics_target is 100 by default,
and I quite often see people increasing it to 1,000 when an issue like this happens,
or at least for specific columns.
Other than increased analyze time,
which is important, but maybe not that important,
and it doesn't increase that much...
and also it's not that critical for OLTP systems, right?
It's a background thing.
I'm wondering...
It matters during upgrades, though, a little bit.
Only some types of upgrades, right?
Because if you've got such an important infrastructure,
an OLTP system, you're probably not going to be doing pg_upgrade.
You're probably not going to be doing those types of upgrades that require that analyze.
You're probably going to be doing a logical failover.
It still needs analyze.
You create this logical replica;
you need to analyze this replica.
Not while you're down, though.
You don't have to be down during that.
That could be done while the other system is serving
queries. You are right. You are right.
And the statistics will have been getting
updated while it's being populated,
because vacuum will be running
and analyze will be running.
So it's not an issue in this case.
So I was wondering
for really important OLTP systems,
why not just boost
the default statistics target everywhere
and just pay that extra cost?
Did we raise this question
with Tomas Vondra, or no?
I don't think so.
Maybe we raised it when he was on the
Postgres TV YouTube channel.
I remember the opinion that
100 buckets is a lot.
This topic
connects to partitioning
for me. Because when I see
a table with 10 billion records,
like
a table multi-terabyte in size,
it's...
this is already
quite late. We should have it partitioned.
In this case
we would have many tables,
and at least
we would have a localized problem:
only for this partition do we have a problem.
And let's say it had 30 partitions:
we would have 30 times more data in the samples.
Different statistics, yeah.
Yeah.
Yeah.
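The per-partition statistics point can be sketched like this. Names are hypothetical; the key property is that each partition is analyzed and sampled separately, so skew stays localized:

```sql
-- Hypothetical range-partitioned table: each partition gets its own
-- pg_statistic sample instead of one sample for the whole giant table.
CREATE TABLE events (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2026_03 PARTITION OF events
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

-- ANALYZE on the parent gathers inheritance stats and recurses into
-- each partition, updating its individual statistics:
ANALYZE events;
```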
So, I don't know.
I see people raise it.
Okay, does it affect analyze?
Yes, it does.
How much can we afford?
Okay, set it to 2,000 globally.
Okay.
How to prove whether it helps or not?
We need holistic analysis of
plans. Yeah. So we need... this is actually what we tried to do many years ago, before the current state
of the art of things with branching and so on. Maybe we should revisit this topic. I've just realized
I disagree. I don't think this is about improving things on average. I don't think this is a case of
"on average, I want a slightly better execution plan". I think this is a case of "I never want to
completely flip to an awful plan". So it's about risk, mitigating the outliers, not about improving.
I would actually accept slightly worse average latencies.
I'm not talking about averages.
I'm not talking about averages.
I'm talking about complete analysis, when you have, for example,
thousands of plans based on production.
You have a clone of production,
and using branching, you can create replicas and compare plans before and after the change.
This is the "what if" framework which we have with DBLab.
And you can compare plans.
There's a question of how to properly compare plans at a grand scale.
We talked about this
before. Our approach is to just compare structure, and cost, and the buffers, and also timing. But timing
is tricky, because it's volatile and so on. If cost changes seriously, this is already a signal
to investigate what happened. If structure changes significantly, also a signal. Structure, by the
way... maybe you don't need the whole structure, probably just the access methods. Like,
you had three index scans, and suddenly you have two index scans and one sequential scan, right? That's bad.
Right. So what happened? Anyway, this holistic approach is great. You can have a huge amount of plans, and then you have the ability to make the change and compare before and after.
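The structure/cost/buffers comparison described here maps naturally onto EXPLAIN's machine-readable output. A sketch (the query and table are hypothetical):

```sql
-- EXPLAIN in JSON form exposes structured fields (node types, estimated
-- costs, buffer counts) that two runs of the same query can be diffed on:
EXPLAIN (ANALYZE, BUFFERS, COSTS, FORMAT JSON)
SELECT * FROM events WHERE external_ref = 'abc-123';
```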
Again, I haven't visited the topic of default_statistics_target, particularly in this context, for quite a while. But it would be great to do it. And again, I have a bunch of unfinished thoughts. This topic of plan flips is unfinished in my head. I'm still
thinking about what's better. So once again, you don't need averages or anything. It's a classic
optimization problem. You want to optimize something, and you don't want everything else to
degrade significantly. You say: I'm okay if something degrades, like, a maximum of 10% on some metric.
There's a question of which metric to choose, also. But I want this to be improved. This is one
thing. So we have a framework. We don't have all the answers yet.
The second thing is how to understand we are on the edge of a flip, or cliff if you like, doesn't matter.
Yeah, how to understand.
I think I actually don't know if this is important as much as how do we stick with what we've got.
I want to predict.
Yeah, but I think it's really difficult to predict bad plan flips without also predicting good ones.
There will be so many false positives,
if you think about how you would do it.
I forgot to mention that in our case, we execute on a real database: full size, a full clone,
using database branching, provisioned very quickly.
And we see not only the planner but the executor in action, like realistic testing, right?
Full-fledged realistic testing.
We execute hundreds or thousands of query samples, properly collected.
So it's not like we have 100 samples and 90 of them are the same
query ID. We take a query ID, we collect multiple samples for each query ID, and then we execute before and after the change. We developed this framework specifically to predict plan flips for major upgrades.
Yeah, which is slightly different, right, I think. No, it's not different. I think it is different. Do you know why I think it's different? Often people talk about plan flips in the context of "I didn't change anything and this thing
just happened."
Whereas with a major version upgrade, you know in advance you're doing it.
Major change.
And you can look at, yeah, you have this reference point if we change.
You have new engine, basically.
Yeah.
Yeah.
So I think psychologically, at least, it's different.
I understand that in practice, there's a lot of the same harness.
So that's why I'm saying there are easy plan flips and there are difficult
plan flips.
Easy is when you did some action:
you upgraded Postgres,
you jumped from 13 to 18, five versions, for
example. There are new nodes in plans, right? They never existed before. And definitely new settings,
new configuration parameter defaults. In my opinion, plan flip analysis should be a part of a
major Postgres upgrade, and we have it for our clients. Yeah, we have it for our clients.
It still requires some... it's semi-automated right now. I hope at some point we will
fully automate it, and it should be automated, in my opinion. And it shines when you have
database branching properly implemented. But...
There are difficult plan flips, like...
So, to explain how to properly categorize it,
let's stop for a little bit and think about what the planner
takes into account when choosing plans.
It's only three things.
It's the version, which defines the version of the planner, basically,
which engine we are using.
And the second is settings. Not all settings... yeah, my current vision is the planner settings plus
work_mem. Why am I saying plus work_mem? Because for the planner settings, there is a
category: if you check pg_settings, one of the columns is category, right? And there is a category
of planner settings, but work_mem is beyond it. So, plus work_mem. And finally, number three is pg_statistic content.
And that's it.
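The pg_settings lookup mentioned above can be sketched like this. The exact category strings vary a bit by Postgres version, so treat the LIKE pattern as an assumption:

```sql
-- Planner-related settings live under the "Query Tuning" categories:
SELECT name, setting, category
FROM pg_settings
WHERE category LIKE 'Query Tuning%'
ORDER BY category, name;

-- work_mem sits in a different category ("Resource Usage / Memory"),
-- which is why it has to be added to the planner-input list separately:
SELECT name, category FROM pg_settings WHERE name = 'work_mem';
```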
It doesn't care about hardware or actual data.
Yeah, true.
It's not actually the data volume.
It's how much data it thinks it has, which is the statistics.
Yeah.
And if the statistics is lagging, it might live in different reality.
That's why I say I would prefer, it might still be wrong,
but I would prefer it be wrong sooner and be closer to reality, right?
Yeah, yeah.
So when we change engine, it's obvious.
We know when it happens.
And before that, we can perform this plan flip analysis holistically according to our methodology.
This is great.
We can predict this plan flip and prepare a resolution for it in advance.
This is what we do with clients.
Second, if we change settings, basically it's the same.
We also know when we do it.
For example: okay, we are going to raise default_statistics_target globally.
This will help this plan.
Question: how will the other plans behave?
We don't want them to degrade significantly, at least not now.
The same approach can
be applied.
The third thing is: the pg_statistic content is changing.
Again, I would prefer it to be changing more often, like continuously changing.
So auto-analyze should happen more often, to reflect the current reality.
For this, we need autovacuum tuning.
And then, how to predict...
Also, you missed something, which is: we could gather more statistics.
It's a sample, and we could increase it.
Yeah, this is mitigation activity already.
I agree that moving from 100 to 1,000 sometimes might make sense.
Honestly, I don't have a strong opinion here.
I have clients with 100, and with 1,000.
In the past, I had clients who thought that going down to 10 was a good idea.
I don't have them anymore.
Somehow, right now, all the thoughts are in the direction of raising this,
not reducing it.
Yeah, but I remember we had such discussions.
Maybe we should raise it.
Counterintuitive ideas, sometimes, you know.
We don't have them anymore.
Anyway, the first two things we usually control.
We don't have self-driving Postgres yet, so we control them.
And we can perform plan flip analysis if we have such tools.
Wait, there's one more thing.
It's not available everywhere, but another thing that does affect this is, for example,
if you're using hints with pg_hint_plan.
Okay.
Okay, hints.
I think that's a big topic, though.
In production.
That's a great topic.
Let's consider it an exclusion, because it's not a major approach right now.
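For context, a minimal sketch of what a pg_hint_plan hint looks like. It assumes the extension is installed and loaded (e.g. via shared_preload_libraries or LOAD), and the table and index names are hypothetical:

```sql
-- pg_hint_plan reads hints from a special comment placed before the query:
/*+ IndexScan(events idx_events_external_ref) */
SELECT * FROM events WHERE external_ref = 'abc-123';
```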
It's very rare.
Is that true?
I hope so.
I think in other databases, this is actually often one of the main approaches.
You're just trying to open another gate or something.
And I agree, let's open it.
But let's just finish with this while I think about the second-plan idea.
Again, if you change the major version, you control it, you can predict plan flips.
If you change settings, it's the same thing.
You control it, you can predict.
If changing pg_statistic content due to data changes leads to some plan flip, this is hard to predict.
How to predict it?
When I think about it, I want to be able to check the second plan,
or maybe the first five plans which were almost winning.
Because in this case, I can say: okay, I see the winning plan, the current plan, has cost, say, 500.
But there is also a second plan which has cost 490,
and it's quite close.
If it's close, say below 20%... if the difference is below 20%, I will start thinking: might it flip soon?
That's why I want the second plan.
I think this is incredibly tricky, though.
Maybe it's simple, but, for example, parameters matter a lot.
Per query, different parameters.
If you get, for example... yeah, parameters matter.
Yeah, I agree.
Yeah, a huge amount.
If you look at the solutions in SQL Server and Oracle...
bear with me, because I think SQL Server and Oracle are a bit further ahead of Postgres on this.
Probably others too; maybe DB2 sometimes has some features in this kind of area.
But the solutions they've ended up implementing
are around forcing the existing plan,
so, plan pinning.
Aurora has it.
Postgres Pro has it.
Interesting.
So yeah, these are the solutions
that other databases have ended up implementing.
Rather than trying to predict a plan flip,
what they try to do is pin it, or freeze it,
and then maybe require approval when it does flip.
So instead of trying to predict it,
they prevent it, and then have it
as an opt-in thing.
Maybe this plan changed for the better,
but maybe you want
someone to approve it first.
So there are several things here.
You touch several things.
First, regarding depending on parameters:
I agree with you.
But in a system which is quite big,
we usually have...
the Pareto principle is working, basically.
A lot of the workload is defined by a
limited number of specific plans.
And this is
how, in our plan flip analysis, we have a methodology to identify not only
normalized queries from pg_stat_statements, but also particular examples, particular parameters, so we
reproduce exact queries with exact execution. So, plans which contribute to the workload the most, for
example in terms of calls, or in terms of total execution and planning time, right? So we want
to cover like 80% of the workload in several directions, several metrics, right?
And this is the magic: how we collect those samples and prepare our testing query set,
so it would represent a majority of the workload.
Once we found a way to do it... there is some tail.
Those queries are less important, because if you just think about database time in terms of
execution and planning, if some query defines only a minority of that time, even if its plan
flips, it wouldn't affect the whole cluster so much, right?
Yeah.
Does it make sense?
It does.
And it actually reminds me of a different idea.
Maybe I'm wrong, actually.
Maybe I'm wrong.
You know what?
It might be like: okay, now it's only 1% of the whole time,
and then, due to a plan flip, boom, it's already half of the time.
It reminds me, though, of something that I think could help here.
It's not in Postgres...
actually, maybe there are a couple of examples in Postgres core at the moment,
but the idea is that the plan is chosen and then stuck to, no matter what, during execution.
It could be... I think Tomas actually brought this up at the end of our call on performance cliffs.
It could be adaptive at runtime.
If it realizes it's chosen a bad plan,
instead of sticking to it till the end...
We should invite Andrei Lepikhov, who, I think,
can talk about this adaptive query optimization a lot.
I consider him an expert in this area.
Great.
But do you see what I mean?
Yeah.
It's able to adapt during execution.
Maybe we avoid the issue,
or at least it's way less bad when it happens.
Yeah.
And also, to finish my thought:
maybe I said this approach is wrong because
some plan which contributed only like 1%
to the whole database time can suddenly be half of it.
Because of this, as I said, we need multiple directions.
One of the directions: when we collect this working set, we also order by calls.
So for a query whose plan flip influences the whole cluster so much that it causes downtime,
it should be quite frequent, I think, right?
If it's super infrequent, it should be fine;
it means it won't occupy multiple backends in parallel.
Unless it gets really bad, yeah.
Unless you forgot to tune statement_timeout or a transaction timeout.
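The guardrails mentioned here are ordinary settings; for example (the database name `app` and the values are illustrative, and `transaction_timeout` only exists since Postgres 17):

```sql
-- Cap how long a single statement or a whole transaction may run,
-- so one flipped plan can't tie up backends indefinitely.
ALTER DATABASE app SET statement_timeout = '15s';    -- per statement
ALTER DATABASE app SET transaction_timeout = '30s';  -- per transaction (PG 17+)
```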
Yeah, we have multiple vectors, and I think this is a good approach, good enough approach.
Maybe it can be improved.
So I'm very curious about this area, and I'm coming from practice.
Yeah, I do think there are still complex cases, where you might have clusters:
even the same query in pg_stat_statements could have
four different plans that are actually all good.
Let's say you're a multi-tenant system.
Maybe tiny customers have this plan,
typical huge customers have this plan,
huge customers with a weird distribution have another one,
or maybe healthcare customers look different for some other reason.
Yeah.
Yeah.
And in the analysis, again, as I said,
we collect multiple plans per query ID.
Why multiple? Because of this.
So we take, say, the top five plans for each query ID,
but all of them we take from this top N by each metric.
So we try to capture the concept of how influential this query is.
And multiple plans might appear.
It would be great to have a plan ID, right?
And to shift the aggregation logic there.
But for now we have query IDs as the main tool.
I think Lukas from pganalyze did some work on a plan ID.
Yeah, yeah.
Yeah, there's a post, and I commented on LinkedIn, and it's a great work.
Absolutely needed, I think.
Yeah, it's great work.
So then, if we have this, there are two questions. First: how to understand that some plan candidate is approaching, and the plan can suddenly flip soon?
Because in large tables, data usually is not changing too much suddenly.
I agree with you.
It's like gradual.
Yeah, it's quite gradual.
So the planner behavior involving large tables: it's dangerous when we have two plans.
This is my hypothesis, right?
So again, for me, there are a lot of open questions still.
It might be right on the edge.
And I want to feel this edge for each query ID, or for each plan,
slightly in advance, to be able to predict a flip.
I think someone recently did some work in this area to show
plan candidates. I might be mistaken, but I saw something somewhere;
didn't dive deep. And the second thing is: in my opinion, hints and plan freeze are two
great features. I put them in the enterprise basket; for me, it's an enterprise thing.
Because it's interesting: do we need hints? Do we need plan freeze? For larger systems,
probably we should have both. But I also understand the reasoning
that it makes a system complex and it requires a lot of management.
There is a price you pay.
If you put in some hint, you start to be responsible for it.
You need to maintain it.
And if you freeze some plan, it's the same thing.
You're responsible now, right?
Basically, do we have the capacity to manage it?
Because maybe later it helps less, and maybe then it's not helping but the opposite.
That's the problem.
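For context, the "hints" side of this trade-off usually means the pg_hint_plan extension linked in the show notes; a minimal sketch, where the table and index names are made up for illustration:

```sql
LOAD 'pg_hint_plan';

-- Force an index scan for this query; "orders" and
-- "orders_customer_id_idx" are hypothetical names.
/*+ IndexScan(orders orders_customer_id_idx) */
SELECT *
FROM orders
WHERE customer_id = 42;
```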
Absolutely.
But I do think it's interesting that these other databases,
given the choice between the two, went in that direction.
I'm not saying it's a good idea,
but they've clearly thought about it,
and there's some merit to it,
and it's been working for many years,
of course with issues,
of course with unintended side effects,
but it is a solution, I think, worth considering.
So, yeah, this is good.
I thought it was actually really interesting.
A lot of open questions.
Of course, I saw you sent me this about Postgres 19 improvements
in terms of plan management,
and pg_hint_plan was mentioned there.
I wish the Aurora team open-sourced their solution.
It's not really an Aurora-specific thing.
It's called the apg_plan_mgmt extension:
Aurora PostgreSQL plan management.
So it's a great thing,
and I wish it existed in the open-source ecosystem.
I think eventually it should exist in open-source ecosystem.
Because I know when I
would use it: when suddenly we have a plan flip, we know what plan we would prefer. And let's just
pin it right now, before we apply a permanent solution; just mitigate as fast as possible, without any
deployments. But let's say we're Clerk, and we know we've got
15 to 20 really important queries. They're on the hot path for every single auth request.
They're really frequent; they run thousands of times a second.
We could deliberately pin all of those, or put hints into all of those, to maintain them.
So this is different.
So what I'm saying is: temporarily pin, before we apply a permanent solution.
What you say is: permanently pin, which means there might be some cost you will pay to maintain it.
Maybe, yes, but my point is, at least we can then guarantee it won't go bad suddenly.
Maybe it will gradually degrade, and we can look at fixing it.
If it's going to degrade, I think it's more likely to degrade gradually instead of all of a sudden,
which means we can then fix it Monday to Friday, 9 to 5, instead of during a huge database outage.
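The "pin it now" mitigation maps onto Aurora's apg_plan_mgmt extension mentioned above; a rough sketch, paraphrased from memory of the AWS docs linked in the show notes, so the parameter and function names should be verified there:

```sql
-- Capture plan baselines automatically, then enforce approved plans.
SET apg_plan_mgmt.capture_plan_baselines = 'automatic';
SET apg_plan_mgmt.use_plan_baselines = true;

-- Captured plans appear in the apg_plan_mgmt.dba_plans view; a specific
-- plan can then be approved, rejected, or marked preferred, e.g.:
-- SELECT apg_plan_mgmt.set_plan_status(<sql_hash>, <plan_hash>, 'Approved');
```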
And again, just partition your tables.
I know it costs some effort, but it makes things easier.
And gather statistics on the parent partition.
Yeah, which is another topic.
It's fixed.
Laurenz Albe from CyberTec
fixed it.
I think it will be in 19.
Nice
So automatic gathering of statistics
Yeah, so it's a headache,
during upgrades especially.
I think it's solved now.
I mean, not now,
in the future.
Yeah
Yeah
Great good
Nice one
Yeah enjoy it
Thank you
Likewise
Thanks so much
And see you soon.
