Postgres FM - Mean vs p99
Episode Date: June 13, 2025

Nikolay and Michael discuss looking at queries by mean time — when it makes sense, why ordering by a percentile (like p99) might be better, and the merits of approximating percentiles in pg_stat_statements using the standard deviation column.

Here are some links to things they mentioned:

Approximate the p99 of a query with pg_stat_statements (blog post by Michael) https://www.pgmustard.com/blog/approximate-the-p99-of-a-query-with-pgstatstatements
pg_stat_statements https://www.postgresql.org/docs/current/pgstatstatements.html
Our episode about track_planning https://postgres.fm/episodes/pg-stat-statements-track-planning
pg_stat_monitor https://github.com/percona/pg_stat_monitor
statement_timeout https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-STATEMENT-TIMEOUT

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello, hello, this is Postgres FM. I'm Nikolay, Postgres.ai, and as usual with me is Michael, pgMustard.
Hello, Michael.
Hello, Nikolay. How's it going?
Everything is all right — a lot of work. There is progress on many fronts, so yeah, I'm excited to discuss quite an important topic. Let's see what we can learn from this discussion — mutually learn, or something.
Yeah.
I'm glad you think it's interesting.
It's interesting because it's always with us.
We, when we deal with performance, we need to work with many numbers and
this topic pops up all the time.
Yeah.
So, specifically, the topic is: when we're looking at latencies — you know, monitoring or troubleshooting, maybe prioritization — we often need to look at queries by some measure of their duration. Often — and I see this in blog posts, in tools, in various places — they're ordered by the mean or average of their latencies. That's very easy to get from pg_stat_statements, for example; it has a dedicated column. But in a lot of other places, different types of monitoring tools — and I'd say more on the application developer side — I'm increasingly seeing people move to monitoring the P95, P99 type latencies. So percentile-based, the tail latencies, not the average. So I've been wondering about that. And yeah, I recently saw another feature request for the ability to see these in pg_stat_statements, and I realized it's something we can approximate, but not something we can get from it directly. So I wrote a post about it recently, and I'm glad you find it interesting too.
Yeah. So, that's an important topic, I think,
because we all work with averages or mean values all the time.
And it's not only about Postgres, right?
The mean value — the average — is often misunderstood,
and it's often misleading.
I remember the idea that in any organization, if you check the mean or average salary, most people might find out: oh wow, this average salary is much above my own salary.
What's happening here?
Well, what's happening?
A couple of dudes have huge salaries, that's why.
So it means that, back to Postgres, back to queries: you see the average or mean time is like one millisecond. Oh, not bad, not bad, right? But how many queries are much above that? It might be a lot. And I must note that the mean time columns — mean_exec_time, mean_plan_time — are redundant; they could be avoided, because we can always calculate them by dividing the total by the number of calls, right? Or by plans, in the case of planning time.
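For illustration, a minimal sketch of that point (column names as in pg_stat_statements on Postgres 13 and newer, where the planning counters exist even if track_planning is off):

```sql
-- mean_exec_time and mean_plan_time can always be derived from totals
SELECT
  queryid,
  total_exec_time / calls            AS derived_mean_exec_time,
  total_plan_time / NULLIF(plans, 0) AS derived_mean_plan_time
FROM pg_stat_statements;
```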
By the way, reading your article, I noticed you optimistically involve planning metrics. Unfortunately, in 99% of cases we don't see them, because they are off by default. I remember last year, or the previous year, it took some effort for us to prove that you can basically turn them on. But unfortunately — we had an episode about this — the performance cliff will be two times closer to you if you turn it on. Still, my default recommendation is to turn it on, because it's very valuable information.
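For reference, a minimal sketch of enabling that setting (it is off by default and can be changed without a restart; weigh the overhead discussed in the track_planning episode):

```sql
-- Enable planning-time metrics in pg_stat_statements
ALTER SYSTEM SET pg_stat_statements.track_planning = on;
SELECT pg_reload_conf();
```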
So again, back to averages: we could always calculate them dynamically. It's not a big deal. But I guess it was just a matter of convenience to add them and keep them pre-calculated in pg_stat_statements and other similar extensions. But it would be really great to see the percentiles, right? Or better, to see a whole histogram of what's happening. Unfortunately, I guess it's a very complex topic. I mean, I know pg_stat_monitor implements this, but we need to check the overhead. That's the critical part.
And also, it's more columns — we've already got a lot of columns in pg_stat_statements. So there's additional overhead, both in tracking but also in terms of user experience. And I mean, I think at this point we have so many columns that, why not have a few more — but it is a problem.
Like, I think just-in-time compilation got a bunch of additional columns for every single part of it. And there has been discussion about whether those should be collapsed down, because do we really need granular reporting there?
But yeah, I would be in favor of some histograms.
I also haven't yet seen any analysis of — so, pg_stat_monitor is an alternative to pg_stat_statements, relatively new in the grand scheme of things, probably a few years old at this point — but I don't see it in many installations, and I haven't yet seen anybody publish benchmarks on its performance relative to pg_stat_statements.
It was created at Percona, and I remember I discussed it with Peter Zaitsev, the founder of Percona, when this extension was created. They said that instead of contributing to pg_stat_statements, they wanted a separate, independent release cycle. As we know, pg_stat_statements is part of the set of so-called contrib modules, which means it follows Postgres's own release cycle — once per year. And if you want to release something a few times per year, that's hard. Then, I know the original creator left Percona, and I don't know what's happening with the development of this extension. But when it was created, our team did a first look — I remember we did some analysis — and early versions looked promising but challenging in terms of performance overhead. And of course, if you talk about RDS and the others, they don't have it. I saw it on maybe one or two platforms only. So yeah, performance overhead is a big question. That being said, I don't know the current performance overhead.
It would be great to have histograms and so on, right? So, percentiles. And there are two more aspects here in my approach. First, in many cases I don't care about the average at all, because — as we've also discussed several times — depending on our goals it might be relevant or not. For example, if we target optimizing resource utilization on this server, we care about total time and total values, because that's what actually shows the picture. A query might have a tiny average and a tiny standard deviation — all query executions are similar, and planning occurrences are similar — but there are so many of them, like 10,000 QPS, queries per second. This is going to load our server a lot, and we need to optimize this query in spite of its good average values. The total will be high. So averages, as I feel it, you mostly need when you target delivering a good experience to your users. Otherwise, don't use them.
Yeah, you're completely right, obviously. If we're trying to reduce utilization, there's no point looking at things by how slow they are on average, or even at the extremes at the P99 level, because if they don't run that often, they're not the things taking the most time. So ordering by totals — whether that's total time, which I still like to use, or some combination of buffers if, for example, the resource you're constrained by is IO, which could be the case, or if you're paying by IO, like on Aurora or something — then ordering by something else could make sense. But yeah, totals always make sense for those cases.
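A minimal sketch of that kind of ordering (the buffer expression is just one possible IO-oriented alternative, not the only reasonable choice):

```sql
-- Top consumers by total execution time; swap the ORDER BY for an
-- IO-oriented metric if IO is the constrained (or billed) resource
SELECT
  queryid,
  calls,
  total_exec_time,
  shared_blks_hit + shared_blks_read AS shared_blks_accessed
FROM pg_stat_statements
ORDER BY total_exec_time DESC  -- or: shared_blks_accessed DESC
LIMIT 10;
```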
But yeah, this is for those other cases, where you care about user experience — perhaps how long a dashboard takes to load for a large customer, or the boss has complained, that kind of thing. Or — we've talked a few times about the issues with long-running queries — you might be investigating: what queries do I even have that run long sometimes? So I think that's the other use case I thought of for why you might want to order by some version of either an average or a percentile.
Yeah, let me argue with myself — my favorite thing. So, when we optimize for lower resource consumption, we look at totals. But at the second step, we should look at the average, because we need to separate average and frequency, which multiplied bring us to the total. So we need to go down and see: is it about a too-frequent query, or about a not very frequent but very slow query, and decide which optimization methods to apply, right? And in this case, I'm wondering: does it really matter to look at standard deviation, basically — to think about percentiles?
I think you're conflating two things here,
looking at something versus ordering by it.
And yes, when you're looking at resource consumption,
you want to look at average time sometimes,
but you don't order by it.
Crucially, you need to order by total time.
And the same is true for standard deviation, I think.
I'm talking about adding maybe one, maybe two,
depending on exactly what percentile you want,
adding a couple of standard deviations to the mean
to order by it. Not necessarily to look at it, but just to get the initial
prioritization order, which thing should I start looking at? What are my top 10? Because
the order will be different. Assuming you have some variance in the distribution of your queries, and some of them have a longer tail than others, the order will be different if you order by mean plus a couple of standard deviations than if you just order by mean. So it's about the prioritization order; you then care about looking at a lot of the other columns, but you don't order by them. So I think that's a subtle difference.
Yeah, yeah, I agree. Order by standard deviation — it's interesting. I'm joking, I'm joking. I understand: you are thinking about ordering by, for example, average plus some factor times standard deviation. This is a smart idea, I think. And before we move to that idea, which is explained in your blog post — so again, we order by total time, and then we look deeper to decide which approach to use for optimization.
And still, maybe it also makes sense to have percentiles, for example, because we would understand: oh, this query has an average which is not that bad. But we don't have percentiles in pg_stat_statements, no histogram. Yet we see: oh, the standard deviation is so huge — it means some occurrences are okay and some are not at all. And this brings us to the idea: oh, we have instability here, depending maybe on data volumes, or maybe we are being blocked by someone. By the way, I would prefer to have this information for metrics other than time. For hits and reads — this is where I would really love to see that piece: standard deviation, or even better, percentiles.
Oh my goodness.
I'm just imagining the number of columns at that point.
Yeah. Because it always concerns me how much of that slowness was associated with the query being blocked by another query — just wait time. We don't know, and it's hidden here.
Yeah, I mean, it's a good point that these aggregates are always going to hide information; that is the nature of aggregation — we don't get all the information. It's the trade-off, right? We get some useful information in aggregate, but not all of it. I quite often see people include min, mean, and max in the queries they use to query pg_stat_statements, so you get kind of the full distribution. I could imagine adding a couple more, either side of the mean: what's the mean minus the standard deviation, what's the mean plus the standard deviation.
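A sketch of such a query (min, mean, and max are real pg_stat_statements columns; the ± standard deviation bounds are derived):

```sql
-- More points on the latency distribution for each statement
SELECT
  queryid,
  min_exec_time,
  mean_exec_time - stddev_exec_time AS mean_minus_stddev,
  mean_exec_time,
  mean_exec_time + stddev_exec_time AS mean_plus_stddev,
  max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC;
```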
Without standard deviation, min and max can deceive: you could have a million calls, and just one of them was so far out that it affected your whole picture — while standard deviation brings some good sense, right?
Yeah.
I mean, I guess min is possibly more interesting than max, but yeah, of course, sometimes there'll be one parameter set that causes the critical case to behave very differently.
Zero rows returned — we have a min close to zero; it doesn't make any sense, right?
Good point, yeah.
Yeah, but by including more, you get a better picture of the distribution — more points on the graph to get an idea of quite how this is distributed. Is that max an outlier, or are there quite a few queries running nearly that long? Without some measure close to P95, P99, that kind of thing, I think it's really hard to know. And I think possibly what's happening at the moment is that quite a few of these dashboards, monitoring tools, diagnosis tools give you an order by mean when you look at them — and that's fine; it's useful, but it could be more useful. You could maybe start with the more problematic queries if you ordered them by a higher percentile and started from the top there.
Yeah. Because it's like we say: oh,
on average we deliver everything in like 10 milliseconds, everything is good. Or one millisecond — again, for queries we ideally want to be below 10 milliseconds, definitely not 100 milliseconds, that's too much. So we say it's good. But what if almost 50% of executions were one millisecond and almost 50% were... how many?
Wait — what did you say the average was? My maths is going to be really bad.
I was trying to say you might have two segments, or clusters, of cases.
One is very close to zero; another is very bad.
And they call that a bimodal distribution.
Yeah.
Perhaps because of a different plan for different amounts of data.
Maybe like...
Different parameters, maybe, or maybe these locking issues.
And then you observe something in the middle.
Well, maybe 50-50 is a bad example, because we are going to be in the middle. What if, like, 70% are very close to zero and 30% are very far from zero, and those users really struggle with performance, right? And we see the average is not bad — it's going to be shifted towards zero, because... oh no, it's not going to... okay, it depends, right? But the average will tell us it's not that bad, and we can leave it.
It's very easy to imagine distributions that on average are below some threshold we might not care about, like 50 milliseconds or something. Maybe we set our threshold there; on average, they're 40, 45. But for a non-zero — you know, a decent — fraction, maybe it's P95 or P99, so we're talking about either 5% or 1% of executions, it's running for hundreds of milliseconds.
And we've talked in the past about how...
Or seconds, go on.
Or even seconds, yeah, exactly.
So those are durations noticeable to humans,
even though on average it's running in 45 milliseconds.
So it's very easy, I think, to hide things that on average maybe aren't perceptible to humans, but if the tail's long enough, then for quite a lot of people — let's say it's 5% of people — that's one in 20 times somebody's loading that dashboard or using that feature. One in 20 sounds quite often.
Yeah. If it's one in a thousand, well, things happen. We can say: okay, we have a very large user base; we need to take care of everyone, but we cannot — it happens, right? But this approach works only if your P99 is good. Okay, one percent is not that good — but if it happens to 20% of your users, they suffer, and you missed that fact just by looking at the average. That's not good.
Yeah, exactly.
Well, imagine, going back to our example, you've got two queries that mostly look the same in terms of average. Let's say one is 45 milliseconds on average and the other is 50 milliseconds on average. But the one that's 45 has a P99 that's 10 times that — 450 milliseconds — and the one that's 50 milliseconds has a P99 of only 60 milliseconds, only a little bit higher. I would much, much rather start with — I'd rank higher — the one that had the higher P99 in that case, even though it has a lower average duration. So that's kind of the whole point.
Yeah, let's explain the idea.
Yeah, the idea is simple. We don't have buckets, we don't have percentiles, we don't have a histogram — we cannot draw it — but we have standard deviation. In all the analysis we've done over the years with pg_stat_statements and other things — pg_stat_kcache, maybe pg_wait_sampling, and so on — we always look at throughput and average latency, that's it. We know — you mentioned it, the article mentioned it — that on the application side, some monitoring systems bring you P99, P95. That's great: two values, even better than just one. And sometimes colorful, good-looking graphs. But in Postgres we somehow don't do it, even though we have standard deviation. And in which Postgres monitoring is standard deviation really appreciated?
I don't know.
I don't know of any either, yeah.
We don't have P99 or P95, P90, nothing like that. But what if we assume the distribution is — well, the assumption is that it's normal; it's not exactly normal, maybe half-normal. But the idea is: let's take the average and the standard deviation and some multiplier. So we have a formula — average plus some multiplier times standard deviation — and we get a kind of P-something, P90-something, kind of, right?
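A sketch of that formula as a query, in the spirit of Michael's blog post (the 2.33 multiplier is the normal-distribution z-score for p99, about 2.326; under a different distribution it approximates some other percentile):

```sql
-- Order statements by an approximate p99 rather than by mean
SELECT
  queryid,
  calls,
  mean_exec_time,
  stddev_exec_time,
  mean_exec_time + 2.33 * stddev_exec_time AS approx_p99_exec_time
FROM pg_stat_statements
ORDER BY approx_p99_exec_time DESC
LIMIT 10;
```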
Yeah, I called it approximate only because it really does assume a normal distribution — not that standard deviation only makes sense there, but you can only approximate percentiles this way if you assume a normal distribution. And I made the case that I think it will be quite rare for you to have queries — or at least problematic queries — that are normally distributed. I suspect a lot of the problematic ones are going to have long tails. So it definitely isn't a safe assumption that they're all going to be normally distributed. But once you factor that in, I think this is still useful. I think queries that have a low average and a higher standard deviation — where this ordering change makes a difference — are probably going to have, well, are definitely going to have high P99s as well. It's still directionally correct — I think I used that phrase — so it still points you, probably, in the right direction. This order is probably not going to be that different from the order by true P99. So it's a step in the right direction. It's not necessarily accurate, but it's directionally correct.
Yeah, I agree.
And since I suck at statistics, as listeners should already understand by this point, I talked to my daughter, who is learning data science at UCSD right now.
Nice.
And yeah, she told me: oh, this is quite well known — for a normal distribution, you just take the standard deviation times two to both sides. This is how you can get P90, she said, or P95, I forgot. So this is reasonable, she said. But then, of course, it's not a normal distribution, and that's a big assumption.
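For reference, the rule of thumb restated (assuming a normal distribution, with one-sided z-scores):

$$x_p \approx \mu + z_p \, \sigma, \qquad z_{0.95} \approx 1.645, \quad z_{0.99} \approx 2.326$$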
And what I thought — idea number one I had, and maybe I will encourage some students or some people who are willing to do something here — what if we had some raw data from somewhere? I don't know who can afford logging all queries with durations; last week we talked about how dangerous it can be to log all queries. But imagine we collected some raw data and compared it with what's present in pg_stat_statements. What I'm trying to say is, we could understand the nature of the distribution, maybe by observing many, many cases from OLTP — like web apps, mobile apps. And then we could probably find some heuristics-based multiplier, right, which would be closer to reality — to avoid this assumption that it's normal. But it's big work. You need to collect a lot of data, process it, and so on. What do you think?
Well, it would be different for different queries... I think it's a tricky problem, unless you actually approximate it while you're monitoring.
Look, I understand what you said. But if we take the assumption that it's kind of normal, that sounds to me like a fragile assumption. If we learn how some OLTP system — social media or something, with all these queries: select from users, select from something, join some tables, and so on — how these queries behave, at least a few dozen of them, then this multiplier, learned from practice, I'm pretty sure will be very close in another system.
Yeah, actually, that's a good point. You could get more accurate with the number — like the exact multiplier — to get something more likely to be a P99 measure. It might turn out that the number I've suggested based on the normal distribution is actually more likely to be a P96 or a P97 instead of a P99. Does it make that much difference?
If we just order by it — no, I agree with you. And what I'm proposing breaks the Pareto principle, right?
Yeah, it's a lot of work for arguably not that much additional value. I think if someone's going to put a lot of effort into this, it'd be much better to actually track this and put it into pg_stat_statements.
Well, that will be, again, an overhead question. It would be great to have, but okay, I understand. What I'm proposing is going to be outside of the P80 of the effort we're going to have... I like it. P20, P20, I don't know. P80 of the value, P20 of... okay.
So, another thing, another thought I have. I've noticed over time that when I run some benchmarks — very often with pgbench — I'm absolutely unsatisfied with the output. So I use — we have it in automation as well, and all my team members use it — the lowercase -r option, to report results at the end for each statement used in the transactions. Sometimes we use multiple files with different transactions, with some weights assigned to them. That can be achieved with the -f option and the @ sign. It's a great feature, by the way, underappreciated.
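A sketch of such an invocation (the script names and weights here are hypothetical; -r, -P, and the -f file@weight syntax are real pgbench options):

```
pgbench -c 16 -j 8 -T 600 -r -P 30 \
  -f read_txn.sql@8 -f write_txn.sql@2 mydb
```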
But anyway, we see these reports and we discuss them: oh, it looks great; oh, it looks not that great. But I realized it's not enough. I want histograms, right?
Yeah.
And to mitigate that, at least somewhat, we started using the capital -P option. For example, -P 30 means it will report every 30 seconds.
Oh, interesting. Different P notation.
Yeah, yeah. It's reported as progress. So it reports latencies every 30 seconds, for example — or every 10 seconds if the benchmark is very fast, or every minute if it's longer; it depends. And from there we can already understand: oh, the distribution is like this. So we can get a feel for the distribution. It reports standard deviation as well, by the way, but it's still not enough. And there's another aspect here: it's not only about the distribution — P90, P95 — it's also about the distribution over time, because there might be a spike of latency, and then it gets better.
But still, what I was trying to say: there is an idea to implement histograms or percentiles. Maybe it's better to start in pgbench, not in pg_stat_statements first.
Yeah, interesting idea. I hadn't thought of that, but I get where you're coming from.
I think it'd be easier, right? You don't worry so much about... well, you do worry about overhead, but this is a tool used to make decisions and so on.
Well, it's also used...
But anyway, maybe it's easier because it's less risky. The idea is, it's not production.
So are you thinking, educationally, people could see that this is a useful thing? And once people start seeing it — oh, this is really helpful, why can't I have it elsewhere? — so it kind of builds.
Yeah, I think most people understand that this is helpful, but it's just risky, because pg_stat_statements has overhead — we know it — and it's probably increasing, because more and more things are being added. And again, pg_stat_statements, it's not about production usually. Depends, of course. And I think—
You mean pgbench is not about—
It's not used in production. It's a learning tool, for experiments.
Yep.
Yeah, and pg_stat_statements is for production. So it's harder to bring something heavy there, right?
Yeah, I agree.
pgbench — again, optionally — and see how it works. It's easier, in my opinion. And also, first of all, it's needed there. It's needed.
Yeah. It would be very useful to have it there.
Do you know if sysbench brings histograms or not?
Oh, no, I don't know. I don't remember.
Yeah, it brings percentiles, as I remember.
Nice.
How many buckets, or which percentiles?
I don't know — it just reports a couple. It brings P95, and I don't see any more. Anyway, it's already better than just standard deviation, I guess.
But I agree with your point: what if we order by just, maybe, two times the standard deviation?
Yeah — in the blog post, I did consider rounding to just two instead of using the multiplier that's actually an approximation of P99.
Yeah, exactly. Because at that point I'm saying: look, this is approximate already — why are you multiplying by such a specific number?
But then what are we approximating? It's like a different p-value. And I thought it would just be confusing to people — people reading the results, rather than people looking at the query. I figured that if you put these into dashboards and things, you're going to have more people reading the dashboards than looking at the queries behind the dashboards. So I'd rather have something confusing in the query than something confusing in the column header.
Makes sense. Yeah.
Yeah — quite strange to use such a precise figure to calculate an approximation.
You know what? I'll think about it next time we do some reports or dashboards. Maybe we will implement this approach. We'll call it P-question-mark-question-mark.
I suggested calling it approx p99, just so that you're really clearly leading with the word approximate.
So it's a mysterious percentile.
Yes. And even if you don't include it — it's probably confusing to look at a dashboard that doesn't include it in the results but still includes all the other columns — just ordering by it is then, I think, a useful...
Ordering, in monitoring, means another chart. Like...
Exactly.
Top-N. Top-N by what? That's the ordering. So instead of top-N by average, or in addition to top-N by average — this would be helpful to bring a sense of how users feel about your service.
I really respect monitoring tools and reports that try to limit the number of pages or dashboards they show, but most I come across don't seem to try to limit it too much, so I think: what's another chart? But if you are worried about quite how many charts you have, my argument is I'd rather have this than the one ordered by mean, which pretty much all of them include. So if you include that one, I would rather have this one.
Yeah. And do you think it's worth keeping a chart for the average, the pure average, or not?
Again, it depends on the constraints, right? If you've already got a hundred charts you're showing, then it's 99 versus 100 — why not include the average still? But if you're talking about only having two, then two versus three is quite a big difference. So I think it depends on the context.
And this chart is going to be a regular one, non-stacked, because you cannot sum this up — averages are not a cumulative thing, right?
Yeah.
And it should probably include only, like, the top 10 or something. It's not really a chart; it's just a top-N.
A table.
Yeah, like a table. Yeah, I think so.
A chart also matters, though. Again, a table is good, but it loses historical information.
Yeah, okay. So maybe your chart is your most important queries, looking at their P99 over time — but then you can include the mean on the same chart, can't you? I've seen those in monitoring tools, where you're looking at the same query.
For the chart, we can get the average for a specific period, just dealing with total and calls: we have a delta for total and a delta for calls, so we can have a pure average for a specific period of time. But what about standard deviation? We don't have it.
You can calculate it using the deltas — the same way pg_stat_statements calculates it every time a new query execution comes in.
Oh yeah.
Yeah, I was looking at the source code. I was like, oh, that's clever.
Yeah, yeah, exactly.
But you also have the option of snapshotting pg_stat_statements. Some people do that, right? Reset and snapshot — some monitoring tools or some providers reset it regularly.
It's not a good idea to do that often — you will have trouble under load, because adding new entries doesn't come for free.
Well, then we have deltas.
Deltas are a good idea. Yeah.
Well, standard deviation can also be computed for a specific period of time.
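A sketch of recovering the interval statistics from two snapshots (snap_a and snap_b are hypothetical tables copied from pg_stat_statements, earlier and later; this assumes its stddev is the population standard deviation, so calls * (stddev^2 + mean^2) reconstructs the sum of squares):

```sql
-- Delta (interval) mean and stddev between two pg_stat_statements snapshots;
-- GREATEST(..., 0) guards against tiny negative values from rounding
SELECT
  b.queryid,
  b.calls - a.calls AS d_calls,
  (b.calls * b.mean_exec_time - a.calls * a.mean_exec_time)
    / (b.calls - a.calls) AS d_mean,
  sqrt(greatest(
    (b.calls * (b.stddev_exec_time ^ 2 + b.mean_exec_time ^ 2)
     - a.calls * (a.stddev_exec_time ^ 2 + a.mean_exec_time ^ 2))
      / (b.calls - a.calls)
    - ((b.calls * b.mean_exec_time - a.calls * a.mean_exec_time)
        / (b.calls - a.calls)) ^ 2
  , 0)) AS d_stddev
FROM snap_b b
JOIN snap_a a USING (queryid)
WHERE b.calls > a.calls;
```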
And in this case, you can have this metric — the magic, mysterious percentile — and you can have a chart of, for example, the top 10 queries.
Yeah, and maybe also the average of everything.
Plus the standard deviation of everything.
Why? The standard deviation of everything?
Why not?
So, I don't know.
Well, actually, that's a good point — there are a couple of... I don't even know if I mentioned this in the blog post, but there's at least one major caveat here. A lot of people have a timeout, often like a 30-second statement_timeout or...
Not just statement_timeout.
A cancelled query won't be registered, so it quite seriously affects those metrics — the max, the standard deviation.
And it affects them in which way?
It's not registered, so you miss it — you miss an outlier. So it's bringing the average down, it's bringing the P99 down. It looks more positive than it is.
Yes.
But you know, if a query was cancelled, the job was not done, so we don't count it. As I remember, this is another point pg_stat_monitor aims to solve.
Yes. It's a common problem with pg_stat_statements and all its friends, like pg_stat_kcache, that cancelled queries are not counted.
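A small illustration of the caveat (the 30-second figure is just the example from the discussion):

```sql
-- Executions cancelled by a timeout never reach pg_stat_statements:
SET statement_timeout = '30s';
-- a statement cancelled at 30 s fails with
--   ERROR:  canceling statement due to statement timeout
-- and is not counted, so mean, max, and stddev all look better than reality
```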
And I think another big missing piece here, honestly, is that in pg_stat_database we have TPS — two metrics, committed and rolled back, right? But we don't have QPS, we don't have latencies or standard deviation. It should be there, actually. That's another idea: to register these metrics — calls, min, max, average, and standard deviation — at the database level, for everything, regardless of pg_stat_statements.
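A sketch of what is available at that level today (real pg_stat_database columns; the counters are cumulative, so rates come from deltas between readings):

```sql
-- Database-level commit/rollback counters (the "TPS" Nikolay mentions);
-- there are no calls/latency/stddev equivalents here today
SELECT datname, xact_commit, xact_rollback
FROM pg_stat_database
WHERE datname = current_database();
```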
And maybe involving errored-out queries. Maybe — I'm not sure, because if we involve them, the values observed by pg_stat_statements and at this higher level won't match at all.
Yeah, I was unsure what ideal looks like here. I did wonder: in an ideal world, would you add the timed-out query? Let's say one got cancelled at 30 seconds — would you add an instance of a query that ran for 30 seconds, and add that to the average, add that to the total duration? Would you want that, or would you not?
It's easy. Let's just have another pg_stat_statements — we'll call it pg_stat_error_statements — with a whole set of metrics. I'm joking, of course, but...
Well, it's not a bad idea though, right?
How many of each error type do we have?
And a query can be cancelled before statement_timeout fires. It can be cancelled due to deadlock_timeout, for example, or something else happens, I don't know — a kill, a pg_cancel_backend, for example, came and cancelled us. So yeah, it would be great to count all those things too, like buffer hits and reads.
Yeah.
So yeah, that's the long and short of it. Was there anything else you wanted to talk about?
I feel we went into some very dark corner of Postgres observability, you know?
Yeah. It definitely seems less explored than I was expecting. As your daughter said, this is not novel — it's not an out-there revelation. I was just surprised when I saw the feature request for it, and I've heard it a few times. I was thinking: why don't more people have this in their queries? Why don't I see it in blog posts or in monitoring tools?
And also for buffer operations and WAL operations.
Well, I know why we don't have that — standard deviation is not even available there.
Well, it should be available. Let's add it.
So yeah, my summary is: I'd love to see these in pg_stat_statements, but in the meantime, like you mentioned, if you're doing a consulting report, or if you've got a dashboard internally, consider adding this — or replacing the ordering by mean. Just don't call it P95; call it fake P95, mysterious P, fake-news P99.
Yeah, I liked your mysterious... what was it? MMP — magic P? Yeah, mysterious magic P. Okay?
Yep, absolutely.
Nice one, Nikolay. See you next week. Bye.
Bye.