Postgres FM - pg_wait_tracer

Episode Date: April 24, 2026

Nik and Michael are joined by Dmitry Fomin to discuss his new tool pg_wait_tracer, as well as changes that could be made to core to allow wait event tracing with lower overhead, and on managed services. Here are some links to things they mentioned:

Dmitry Fomin https://postgres.fm/people/dmitry-fomin
pg_wait_tracer https://github.com/DmitryNFomin/pg_wait_tracer
pg_wait_sampling https://github.com/postgrespro/pg_wait_sampling
pg_10046 https://github.com/DmitryNFomin/pg_10046
Jeremy Schneider reply on LinkedIn https://www.linkedin.com/feed/update/urn:li:activity:7414966981847748608
RDS for PostgreSQL wait event docs https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL.Tuning.concepts.summary.html
Custom wait events for extensions (added in PostgreSQL 17) https://www.postgresql.org/docs/current/xfunc-c.html#XFUNC-ADDIN-WAIT-EVENTS
Hacking Postgres with Dmitry, Kirk, and Nik (Part 1) https://www.youtube.com/watch?v=3Gtuc2lnnsE
Hacking Postgres with Dmitry, Kirk, and Nik (Part 2) https://www.youtube.com/watch?v=6kqpjnpl5Gc
Tanel Poder https://tanelpoder.com/about/
USDT static tracepoints for wait event tracing (proof of concept by Nik) https://github.com/NikolayS/postgres/pull/18
Add wait_event_timing: Oracle-style wait event instrumentation (patch by Dmitry) https://github.com/DmitryNFomin/postgres/pull/1
PgQue benchmarks https://github.com/NikolayS/pgque/blob/main/docs/benchmarks.md
The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling (by Raj K. Jain) https://www.researchgate.net/publication/259310412_The_Art_of_Computer_Systems_Performance_Analysis_Techniques_For_Experimental_Design_Measurement_Simulation_and_Modeling_NY_Wiley
Performance Modeling and Design of Computer Systems: Queueing Theory in Action (by Prof. Mor Harchol-Balter) https://www.cs.cmu.edu/~harchol/PerformanceModeling/book.html
Oracle Performance Firefighting (by Craig Shallahamer) https://www.abebooks.co.uk/9780984102303/Oracle-Performance-Firefighting-Craig-Shallahamer-0984102302/plp
Process Mining: Data Science in Action (by Wil van der Aalst) https://link.springer.com/book/10.1007/978-3-662-49851-4

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello, hello. This is Postgres FM. My name is Nik, Postgres.AI. And as usual, my co-host is Michael, pgMustard. Hi, Michael. Hi, Nik. And we have a great guest today, Dmitry, who is a very experienced DBA, originally Oracle, now Postgres, for many years already. Hi, Dmitry. Hi, hi, thanks for inviting me. Thank you for coming. So, I think, for me it was a surprise when you did it, when you built this tool. I didn't expect that it was coming to the Postgres ecosystem. But obviously it's something new, and in an area which is among my personal key interests lately: active session history analysis, wait event analysis. But you did it very differently compared to what we usually did with pg_wait_sampling or other tools, like Performance Insights in RDS.
Starting point is 00:00:51 So it's a great tool. Where to start? Yeah. So I think we can start with the time when I first started working with Postgres, when I was switching from Oracle. In Oracle, one of my favorite topics and tasks was always performance-related investigations. And when I switched to Postgres, of course, I started looking for similar tasks,
Starting point is 00:01:12 and I understood that the way of Postgres troubleshooting is very different from how it's done in Oracle. And I definitely felt the lack of tooling, the lack of a standard approach, and so on. Okay, so after some time, I realized that there are some tools that can bring a similar feeling, let's say, to what I had in Oracle. And I'm talking about wait event analysis, because I know in the Oracle world it's a given thing. Yeah.
Starting point is 00:01:43 I'm going to interrupt. So when did this switch happen from Oracle to Postgres for you? Recently I checked this. It happened 10 years ago, 2016, roughly. I remember very well a big wave of people coming from Oracle in 2013, 2014, and all of them were complaining about the lack of wait event analysis. This is how two columns were added to pg_stat_activity. It was
Starting point is 00:02:09 maybe a couple of years before your switch, basically. And you also had this same feeling, that these tools are needed. Yeah, because when you work with Oracle since, I don't know, 2002 or something like that, it's a given way of troubleshooting. Any Oracle guideline, any book, any training, official or non-official, says: just open the active session history. It's a gate that you go through, that you start with. And so I tried to use the same approach with Postgres, because for me it was so obvious why it's the way to analyze performance. Because basically, the wait event says why it was slow, where the session, or the database in general, spent the time. So you do not need to guess whether it was IO or CPU, basically where it spent time; it says it very clearly. Then why it happened that way, that you need to investigate. But at least you know where. Because the regular approach in Postgres can just say that this query was slow. On average, or with some hacking, you can get some precise numbers for the execution, but it does not give the idea why it was slow. Okay, yes, you can use EXPLAIN ANALYZE to see where in the plan you spend the time, but again, it does not explain why. Sometimes this join takes, I don't know, one second, sometimes 10 seconds. From the plan perspective, it looks the same.
Starting point is 00:03:40 And the tooling for wait event analysis evolved year by year. And so we have pg_wait_sampling. And since new wait events were introduced as well, it became a tool that really gives good data, a good source for wait event analysis. And I use it everywhere I manage Postgres. And if it's self-hosted Postgres, for sure I will convince the owner of this Postgres to install pg_wait_sampling. It's the default way of analyzing.
Starting point is 00:04:17 On all my dashboards, it's the first panel you see when you open any dashboard that I created. So it's wait event analysis. I convinced all my colleagues to use this approach. You convinced me to put this ASH graph at the top of the first dashboard, which I resisted initially, because the first dashboard for me was meant for very shallow and very wide troubleshooting, where you identify the areas. But you convinced me to put the ASH graph at the very top. And now I'm convinced. Yeah, it's a great idea.
Starting point is 00:04:50 So it very quickly helps you understand some pain points for performance. Yeah, I will try to make it brief: two examples where wait event analysis shines. One: there was a migration, a hardware migration, from, let's call it, old hardware to new hardware. The data files were stored in a storage appliance, so we just remounted. So the data layout was the same. Everything was the same in terms of the data layer, the storage layer, but the hardware, the compute part, changed, and changed to a higher spec. So the number of virtual cores increased, the clock increased, everything increased.
Starting point is 00:05:32 All tests were good. But when we started loading production, we saw a really huge performance degradation. And so what would you think? A plan flip or something like that. But everything is the same. Just the hardware is different, so it basically cannot be the plan, because the parameters are the same. Your slide deck has a very interesting example, very simple, so anyone can understand it, about contractors doing some work.
Starting point is 00:06:00 So in this very example you describe right now, it's like you basically hired a different team to do the same work, but it does it very differently, right? Yeah. So that example, just to explain for those who haven't seen it. It's simple: there are contractors. They say, we are busy. We spent so much time.
Starting point is 00:06:21 And without wait event analysis, you just know they were busy. But what exactly they were busy with, you don't know. And in this case, okay, whether they were busy with IO or something else, right? Yeah, yeah, exactly. Switching hardware, you see a different picture. And so what was in the end of that story? So the problem was the lock manager. So, you know, it's one of your favorites, LWLock:LockManager. A lot of dots are connected here, because we spoke about so many problems which were explored with wait event analysis on this very podcast.
Starting point is 00:06:59 So right, Michael? You remember lock manager? Yeah, of course. It feels like whenever there's a system-wide issue, it's such a great starting point, right? That's... sorry. And that's the reason for the next case, which I would say is low level. I disagree with Michael, because I agree on one side, yeah, system-wide. But on the other side, more and more, I think... we just don't have this tool yet,
Starting point is 00:07:27 but it should exist. In pgMustard, we should have it. I mean, we have plan analysis. And for one call, it would be great to see the wait event profile as well. Because imagine we were blocked by a lock acquisition which was pending. You will see a lot of time spent, buffer numbers are low. That's it. You need to guess. With wait event analysis for a single query execution, you would see: okay, this was the Lock wait event type. But wait, I think this is where we get into the core of what Dmitry's done, which is cool, which is moving from sampling to tracing.
Starting point is 00:08:05 Which I think it has historically been reliant on. I do think it's really important. Maybe I'm wrong, but my understanding is... I also thought this, this is the expectation I had as well: it's about single backend tracing. But let's move on and hear what Dmitry will say. Exactly. Yeah, just the next case where wait events really saved a lot of time, and this single backend analysis... I mean, the single backend analysis case: it does not work only on the instance level; on a single backend it also works. I had a case.
Starting point is 00:08:40 I got a complaint that a query that takes a few milliseconds sometimes gets killed by a one-second timeout. We have, I don't know, 100 calls; 10 of them failed with the one-second timeout. And the query is a primary key lookup. So they cannot mess up the plan for this case. It's just a B-tree primary key lookup, very simple. It's not a join, nothing, just a very simple lookup. And it's not lock manager contention, because lock manager contention,
Starting point is 00:09:12 in many cases, is exactly about primary key lookups. It was not that. Yeah. Okay. So you open the dashboard, and you don't see any spikes of any specific wait events. So it's definitely some very low-level, backend-specific problem. So I also captured, just in case, the EXPLAIN ANALYZE; the plan was the same, so everything was the same. But sometimes it goes from a few milliseconds to one second,
Starting point is 00:09:35 and I started sampling. So I made some custom scripts that would just catch the needed backend and start sampling. And I saw that when it's slow, it's on IO. I mean, I thought: how can it be, it's just a very simple read? It just reads very few blocks from disk. So, next step: with the wait event analysis I understood that in some cases it's nothing else but just slow IO. And with eBPF and tracing these backends, I understood that the problem was long-tail latency on the storage appliance that we use. Sometimes it goes from one millisecond to tens of milliseconds. And indeed, if you're not lucky enough and you need to read these blocks from disk, not from memory, not from the OS cache, even these few blocks add up towards a second. And yeah,
Starting point is 00:10:29 I think without wait event analysis, it would be just a guessing game. But with this, in just a couple of days, I found the problem. A couple of quick questions. One is around... you can get timings in EXPLAIN ANALYZE output. Obviously, it's a bit tricky if they're always getting cancelled. I don't know how you got the EXPLAIN ANALYZE for the queries that were timing out, actually. How did you do that? I just increased the timeout for some proportion of them.
Starting point is 00:10:59 To make this finish in a few seconds, yes. A DBA should have IO timings on, right? Yeah. Did you have IO timings on the server? Yes, I do. But when you have 1K calls a minute of this kind and just a very few fail, you cannot track it. It's one of my favorite topics, that averages lie.
Starting point is 00:11:25 It's like that every time. So you have to have P99, and better, more nines, to see the actual picture of the system. Just to push back, I think this is a really interesting topic. I think actually in this one specific case, auto_explain might have been okay. That's where it can be really good, in terms of catching the outlier queries that go beyond a certain threshold.
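Dmitry's point about percentiles can be made concrete with a toy calculation (the numbers below are invented for illustration, not from the incident he describes):

```python
import math

# 1,000 primary-key lookups: 98% take ~2 ms, 2% hit a slow storage path.
latencies_ms = [2.0] * 980 + [1000.0] * 20

def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p% of the data is at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

avg = sum(latencies_ms) / len(latencies_ms)
p99 = percentile(latencies_ms, 99)

print(f"average = {avg:.2f} ms")  # ~22 ms: looks almost healthy
print(f"p99     = {p99:.2f} ms")  # 1000 ms: the timeouts become visible
```

The average barely moves, while P99 (and more nines, for rarer tails) exposes the latency that was killing one call in fifty.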
Starting point is 00:11:48 As long as you've got the IO timings on, in this one case, we might have seen: oh, IO timings are the issue for these slow queries. But I'm sure there are other examples where the wait events would have shown something more interesting. But wait event analysis... or, no, auto_explain will show you the plan, will show you timing if you enable it,
Starting point is 00:12:07 will show you buffer numbers if you enable it. Good. But it won't show you the wait event profile. Although now I think: why not? It's probably a good idea. This is maybe, yeah, even better for auto_explain to have it than for manual EXPLAIN. I think you're right, but I think I was talking about track_io_timing. You know the parameter track_io_timing?
Starting point is 00:12:31 Yeah, IO timing, great. But track_io_timing, first of all, works at a high level. And what if it were not IO? Yeah. IPC, or LWLock, or Lock, or something: the full picture would be great to have, right? Also, with wait events... here we're talking at the wait event type level. We have how many, seven, nine wait event types? But how many wait events? Hundreds.
Starting point is 00:12:57 So it's a very precise understanding of what the system was doing. So I think, just imagine auto_explain having a wait event profile logged, optionally. I think it's a good idea for those who want to hack Postgres, right? Yeah, me too. And I think it probably transitions onto: why don't we have this? What are the downsides? What is the overhead? What are the tricky parts of this? And we will go there. Yeah, because it's definitely a thing that I cannot get out of my mind for weeks already. And so, this last case with this single backend problem, I reflected on this experience. I spent a couple of days on it. And I thought: what tooling could help me finish this investigation in hours, and better, probably even enable other people to do it instead of DBAs. And Oracle has a very simple answer for this type of problem. It's the 10046 event. It's tracing: when you enable it at a certain level,
Starting point is 00:14:01 it writes everything to a file, every single syscall, every single tuple, everything, I mean, very detailed, like real tracing. And I decided to create the pg_10046 extension that would do it in Postgres. And the prototype was really good, so I'm really proud of it. But for wait events, Oracle
Starting point is 00:14:23 also traces the wait events. And I did the same, but I had to sample them, because I didn't find, by that time, any tracing approach for wait events in Postgres. I know, when I started posting about this on LinkedIn, Jeremy Schneider came and commented: yeah, while it's really nice, you cannot call it 10046, because the wait events are not traced. Very true, but there was no way in Postgres to do it.
Starting point is 00:14:48 Yeah, yeah, one second. I just wanted to send a big hello, if Jeremy Schneider listens to this. And I hope one day we will have him on this podcast. And Jeremy Schneider is important because he was making this popular, blog posting about ASH a lot. And he was in the RDS team when they also started to do this Performance Insights and so on. So it's great that you connected with him over LinkedIn. I think he was also involved in the very good wait event documentation that RDS has. Oh yes. We discussed a lot offline.
Starting point is 00:15:25 Yeah. Yeah. Yes. I mean, really, a lot of kudos to them for this work. Okay. And again, I was brainstorming with all the LLMs that I had by that moment, and I started complaining. Sometimes I talk to them like to real people. And I was complaining.
Starting point is 00:15:44 So, yeah, I had a really nice extension, but the lack of real tracing, it's a big caveat. And Claude said to me: yeah, but you can trace. And I said: no way, how? And it explained to me in detail how I can really trace the wait events. And for two days, I challenged this idea a lot, for a reason: it could not be that I was the first to find this. Not me, okay, Claude found it, but I'm not the first guy in the Postgres world to use it. Either this just does not work, or it was basically the only idea and it just does not work,
Starting point is 00:16:21 because otherwise someone should have found it before. Anyway, two days of testing this idea showed me that it basically works. And I started prototyping the tool. It's still in a very immature
Starting point is 00:16:44 state, but it's already showing nice pictures and can already help to find some interesting, I'd say, correlations. For example, I found that most extensions now still use the generic Extension wait event.
Starting point is 00:17:01 And when an extension does something, you really cannot distinguish whether it's doing a real job or it's just in a wait loop. When I used pg_wait_sampling, for example, to cross-check the numbers that I get from my tool against pg_wait_sampling, I understood that the background worker of pg_wait_sampling, even when it does nothing, because it works in the background and there is no job for it, is still in that waiting state.
Starting point is 00:17:23 And so, I think since PostgreSQL 17, you can have custom wait events for extensions. And I really hope, if someone is listening: please use custom wait events in your extensions. It will really help to understand whether your extension is busy doing some valuable work or just spinning in a loop. Right, and this is why, in my pgash, also listening to Dmitry, I also spent some time analyzing colors. So green is what's usually called CPU, and this is why in pgash, and in our monitoring tool, CPU has an asterisk. So CPU** means that it's CPU, or maybe not. And we also discussed it a few times on our podcast. And yeah, so this is exactly why. So some extensions... Yes. Something. Yeah. I was just thinking whether pgash would show up in pgash with its own custom wait events. But then I remembered it's not really... it's not an extension. So it can't. Exactly. Fully. Sorry. Yeah. So that's basically the story of how this came to life. So, do you want me to show it?
Starting point is 00:18:36 We cannot show it on a podcast, unfortunately. Oh, okay. But just to recap: you complained to an LLM that something is impossible, and the LLM said, let's do it. Yeah. And this is how it was. Basically, it is. This is crazy.
Starting point is 00:18:50 It means that it's good to sometimes complain. It would be good to have... unfortunately, we cannot show it. And then maybe you just don't know how it's possible. Yeah, brainstorming with the LLMs, I think it's one of the best features, yes. Similarly, in our monitoring: I remember I was talking to you, and you said, oh, it's really hard or maybe impossible to show the beginning of the query in query analysis in Grafana. And that's it. Oh, you mean the query text.
Starting point is 00:19:19 Yeah, we see the chart, we see query IDs, but it would be good to see the query itself. Is it a SELECT or something? So I just went and did it with AI. And it's there right now, because we had a limiting belief that it's not possible. It's interesting. So, yeah, the AI helps here. Sometimes it doesn't help, but here it definitely helps to explore new options.
Starting point is 00:19:43 And yeah, create something. Yeah, the validation of a new idea, it's also one of the fantastic features: you can really quickly check whether an idea basically works or not. Of course, bringing mature software will take time. I mean, an LLM cannot do it in one night. But prototyping the idea, cross-checking... So this new tool is tracing. So is it true that it won't miss any wait event, or does it still do some sampling? Because when you showed me the pictures, which look amazing... I still think you need to add those pictures to the README, because this is super
Starting point is 00:20:27 impressive. And it was microsecond precision. My mind was blown. So it's really crazy. But then we started talking, and Dmitry actually came to our hacking sessions, which happen on Wednesdays, online, on YouTube. So a couple of times we sat and thought about how to do something. And that something was about changing Postgres, right? So why is it needed to change Postgres for this tool? And here... no, for this tool, we do not need to change Postgres. But the thing is, this tool has overhead. Observer effect, right?
Starting point is 00:21:08 Yes. More or less, I'd say, with what you can call regular load, not pathological, it's at an acceptable level, like 10%, plus or minus. So you pay overhead for every wait event transition. So when you go from one to another one. And if you get more transitions per second, you pay more overhead. And the maximum that I got was 220K transitions per second. And in this case, yeah, 220,000 transitions.
Starting point is 00:21:41 Yeah. And so in this case, my overhead was basically 30% of the query duration. So was it with a very tiny buffer pool, tiny shared_buffers? Yes. When it reads, that's why: it needs to go to the OS to get the block, but the block is in the OS cache. So that's why it's really fast.
Starting point is 00:22:01 But you need to go there for almost every block. Yes. And even for evicting, you need pinning, unpinning, all this kind of stuff. So it goes... I mean, in this case, I was able to generate this amount of transitions. And while it's, how to say, not an edge case, it is kinda very synthetic.
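A back-of-the-envelope check of these figures (treating the 30% as time lost per second of execution, which is a simplification of what was measured):

```python
# Figures quoted in the conversation:
transitions_per_sec = 220_000  # peak wait-event transition rate observed
overhead_fraction = 0.30       # ~30% of query duration at that rate

# Implied cost of instrumenting a single transition:
cost_per_transition_us = overhead_fraction / transitions_per_sec * 1e6
print(f"~{cost_per_transition_us:.2f} us per transition")  # ~1.36 us

# The same per-transition cost at a calmer 20K transitions/sec:
overhead_at_20k = 20_000 * cost_per_transition_us / 1e6
print(f"~{overhead_at_20k:.1%} overhead")  # ~2.7%
```

So the overhead scales with the transition rate, which is why a synthetic, transition-heavy workload pays 30% while a regular one pays far less.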
Starting point is 00:22:21 But anyway, it says that the overhead, you cannot avoid it. And when Jeremy Schneider raised this concern, that it's not tracing, he also said that the patch for Postgres could be very simple: you just add two probes in
Starting point is 00:22:39 the functions that are used to set and unset wait events, and basically that's it. And then you created this eBPF... yes. Yeah, yeah. And you created this patch for... I wanted to say that even
Starting point is 00:22:53 if, like you said, for a regular workload it's not 30% but like 10% overhead, observer effect, it still sounds super useful to me to use it in a lab environment. So if we reproduce some problem and study it not in production but on some clone or something, right? I'm ready to pay 10%, 20%, maybe even 30%, because it gives me an exact understanding of everything that happened, right? But of course, it would be good to have a very low observer effect and have it in production right away. To be honest, I personally would pay even 20% in production if it instruments things for me to find the problem really fast. Because once I spent, I don't know, two or three days finding where the problem was, because there was some change in the security configuration of the server,
Starting point is 00:23:44 and basically everything became very slow. And again, I spent a lot of time writing custom tooling, custom scripts, to find out and get to the problem. But if I had the really good instrumentation like we are now discussing, it would take a couple of hours to find the problem. And as Tanel Poder said in some of the comments, he was also, I mean, triggered by these pictures: observability is an investment. It's not for free.
Starting point is 00:24:13 And for example, in the Oracle world, you just paid for it, you just don't know it. Right. Because you cannot have Oracle without the instrumentation. I think they instrumented the kernel in version 7. And you cannot strip it out and test without it. Tanel Poder, it's just worth mentioning, is a very noticeable expert in the Oracle world.
Starting point is 00:24:36 Yeah. It's great that paths also connected here. Yeah. And that's why I think wait event analysis should be among the capabilities of Postgres core. And there are two ways of doing it. A small patch, like the one you did with these probes... which patch, I already forgot.
Starting point is 00:24:54 Yes. Yeah, yeah. But I wanted to go further and build the patch that will bring an Oracle level of wait event analysis, bring this system time model, the tracing at the backend level. So without this... and at the beginning I was skeptical, and I thought that the overhead would be kind of similar to this, because why should it be really less? But in reality, it's way less. I have this patch in my GitHub.
Starting point is 00:25:25 And so for the last couple of weeks I was iterating on it, making it better and better. And I hoped it would be ready to propose to the community, I mean, today. But today I was analyzing... there was a commit in master that breaks a lot. Yes, they changed some memory management, and it basically breaks some paths in my code. And I need to rework some of it. Yeah, and now there is the heavy work of preparing Postgres 19, and maybe slightly later it will be less super active in terms of code changes, I guess.
Starting point is 00:26:10 I wanted to mention that it was not only one limiting belief: first it was, okay, it's not possible. Now you think, okay, the observer effect is too high. And obviously it's not so high. And I remember, with the upgrade using pause/resume in PgBouncer, I had such a big belief that it wouldn't work, but we see now how it works beautifully under huge loads and on huge clusters. So we just go and check, right? Now experimentation is cheap with AI. It's great to check instead of thinking it's not possible or hard. It's great. It's super inspiring to hear what you say.
Starting point is 00:26:36 Yeah, and back to the tool: I still think it's very applicable in case... I don't know if the patch, my patch or a similar patch with the same idea, will be committed upstream any day, or you cannot afford a new version, so you will stick with some vanilla Postgres without these new capabilities. The nice thing is that it's fully independent from Postgres. It runs like a sidecar, so you can switch it on and off. You do not need to recompile Postgres. You don't need to change anything in Postgres. It just runs. You can start it next to it. Yes, you need to self-host Postgres, and you need to have access to the box. Not root, but with elevated access rights for running eBPF, basically.
Starting point is 00:27:23 Okay. And so that's a nice thing about the tracing. As soon as it's pure tracing, not sampling at any moment, it opens doors for very interesting things that are not even implemented in the Oracle world. And here we can beat Oracle in Postgres, because it's tracing. It means that we have a full history of transitions from one state to another state to the next state, and so on. And you can reconstruct all performance problems or clashes or bursts in your system,
Starting point is 00:27:58 not only in time, but also find the dependencies: who started the problem. Because even with wait event analysis with a sampling approach, for example, you cannot distinguish whether you had 10 short IO waits or one long one. With tracing, you can do it, and you can find the session that started the problem, because in a regular way, it's really hard to distinguish whether a session suffered from the problem or caused it. So with tracing, I believe, you can find the root of the problem and all the consequences of it.
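The "10 short waits vs one long wait" problem is easy to simulate (hypothetical wait intervals, not real trace data): a 10 ms sampler sees the same number of "in IO wait" samples for both backends, while a tracer, which records every transition, tells them apart immediately.

```python
# Each wait is (start_ms, end_ms) within a 1-second window.
# Backend A: one 100 ms IO wait. Backend B: ten separate 10 ms IO waits.
waits_a = [(450, 550)]
waits_b = [(i * 100, i * 100 + 10) for i in range(10)]

def sample_hits(waits, interval_ms=10):
    """A sampler wakes every interval_ms and notes whether a wait is active."""
    return sum(
        any(start <= t < end for start, end in waits)
        for t in range(0, 1000, interval_ms)
    )

# Sampling: both backends look identical (about 100 ms of IO apiece).
print(sample_hits(waits_a), sample_hits(waits_b))  # 10 10

# Tracing: the number and duration of the individual waits survive.
print(len(waits_a), len(waits_b))  # 1 10
```

Knowing whether it was one long wait or many short ones is exactly what lets you tell a victim session from the session that caused the problem.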
Starting point is 00:28:37 And with all these transitions, you can see them as servers and can analyze this from the queueing theory perspective. You can take the already really mature mathematical approach; everything is already there. You can analyze how much you can serve in a given period of time. And you can find the real bottleneck. For example, with regular wait event analysis, you can see a lot of IO, say.
Starting point is 00:29:07 That IO is the bottleneck. But in reality, you might see from the tracing that after IO it always takes an LWLock, for example, for some reason. Even if you were able to solve the problem with IO, for example, by finding faster disks, the IO would be shorter, but anyway, you would be blocked on LWLocks. So the bottleneck would just be moved to the next stage.
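This "bottleneck just moves" effect is the classic serial-pipeline property: throughput is capped by the slowest stage, so speeding up a stage that is not the binding one buys nothing. A sketch with made-up stage capacities:

```python
# Hypothetical per-stage service capacities, in requests/second.
stages = {"IO": 5_000, "LWLock:LockManager": 2_000, "CPU": 8_000}

def pipeline_throughput(capacities):
    """Every request passes through each stage in series, so the
    pipeline cannot run faster than its slowest stage."""
    return min(capacities.values())

print(pipeline_throughput(stages))  # 2000: the lock stage binds

# Buy faster disks: IO capacity doubles, throughput does not move.
stages["IO"] = 10_000
print(pipeline_throughput(stages))  # still 2000
```

Only the transition history shows that IO waits are always followed by lock waits, which is what identifies the stage actually worth fixing.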
Starting point is 00:29:36 This is what makes benchmarking so hard. Yes. And what you're saying, as I hear it... I don't know all the details, but what I hear is: there is a mathematical apparatus, basically, which could help here, analyze bottlenecks much better, and understand the real bottleneck reliably. That's all amazing. I want this. Yeah, but it's not there at all yet. Not yet. I have more ideas that are semi-implemented, so I'm just playing with it. Because collecting data, and even analyzing the data, is not the last step. The last step is presenting data to the human, making it
Starting point is 00:30:14 consumable, so a human can understand this data and take action. So this is process analysis. There is, again... really smart people already spent a lot of time thinking, and they already have a really good approach for how to analyze the graphs. When you move through states, for example, tickets: when a ticket moves from one state to another, open, in progress, I don't know, test, completed. And the same with logistics, when you need to find the bottlenecks in your graph. And with this math, you can find the patterns:
Starting point is 00:30:56 the most frequent path is this one, and the bottleneck is this specific state. It means that, I don't know, most of your backends go through the buffer pin, for example, and the buffer pin is basically the limiting factor. You do not even need to look at the dashboard with your eyes. A system, not an LLM, simple code, can do it for you. We analyze the backends' traces.
Starting point is 00:31:23 Yes, everything goes through the buffer pin. And so we analyze that: the buffer pin... and the system can make 100K buffer pins a second, but your workload demands 200K. So basically you need to solve this problem here. So the problem is here. It's a very deterministic approach. It can produce the report really easily. You do not need hallucinations.
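The process-mining idea Dmitry describes, counting transitions between wait states across backend traces to find the hot path deterministically, can be sketched in a few lines (the traces below are invented for illustration):

```python
from collections import Counter

# Hypothetical per-backend wait-state sequences reconstructed from a trace.
traces = [
    ["CPU", "BufferPin", "IO", "CPU"],
    ["CPU", "BufferPin", "IO", "BufferPin", "CPU"],
    ["CPU", "BufferPin", "CPU"],
]

# Count the edges of the transition graph, as in process mining.
edges = Counter()
for trace in traces:
    edges.update(zip(trace, trace[1:]))

# The most frequent edge points at the path every backend takes.
edge, count = edges.most_common(1)[0]
print(edge, count)  # ('CPU', 'BufferPin') 3
```

With real traces you would weight each edge by time spent as well as frequency; the point is that this report falls out of plain counting, with no dashboard-staring and no LLM involved.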
Starting point is 00:31:44 You don't need to look at your dashboards for that. And so I think it opens doors for really interesting stuff. And one last idea that I have, and have already tried to implement: the system can find deviations. For example, for a query ID, the usual wait event profile is, I don't know, one millisecond spent on CPU, one millisecond in I/O, and after that some time in locks. But instead of 40 milliseconds, it took 50 milliseconds, and it's because a new wait event appeared in between. So again, it's very easy, I think, to write the code to do this analysis in a semi-automatic way. Like in Oracle, you open the one-hour AWR report, but at the bottom you will have: so, I made the analysis for you, the problem is here. The query with this query ID changed its wait event profile from that to that, and you have a problem with this, I don't know, disk or file or something like that. So it's a bit of a project, but I think it's doable nowadays. Going back quickly, you mentioned, let's say, we're self-managing Postgres today, so presumably this tool would already work for us. So I'm wondering, what are the changes in core that you want to make? What added benefit would they have? Or is it mostly about getting it to work on managed service providers, or
Starting point is 00:33:13 something else? On one hand, these tools, I think, will enable, I hope, a lot of debate about making the perfect wait event analysis. But moving these capabilities into the Postgres core will, on one hand, reduce the overhead. On the other hand, you will have it in vanilla Postgres everywhere. So you would not need to bring any tool, you would not need to have access to the box, because not all Postgres DBAs have that, right? You have an interface, maybe a SQL interface or something. Yeah, yeah. So it's definitely going to be available through SQL views. Yes. But one question: we haven't started any discussions on hackers yet, right? Yes. So this is the interesting point,
Starting point is 00:34:02 what will they say. Yeah, and that's why I still want to make this patch really solid, at least at first look, because I expect a lot of pushback. Because there's a kind of consensus,
Starting point is 00:34:18 for some reason: when I talk at any conference, any place, to the production DBAs, everybody says yes, we need it, no doubts. But when you talk to the people who have
Starting point is 00:34:29 some weight in the community, who can really promote something, they, how do I say it? I would say they have spent less time with real production issues. And that's probably why they push back on any changes in core that can introduce performance degradation, even if they bring a lot of new, I don't know, tooling for DBAs. Because it really depends on your perspective. I look at Postgres as a production DBA. That's why I think I would pay 20% overhead; I don't care.
Starting point is 00:35:04 Just give me a tool. It's already there, so you don't need to go to them. As you said, you're already working with this, like 10 or 20% or something overhead. Yeah. But a quick question on the tool as it is, on the GitHub repo: version 0.8 is the last one I saw. Are you working towards 1.0? What does 1.0
Starting point is 00:35:25 look like? Is it a reliability thing, is it a more-tests thing? What's missing that makes you not want to call it 1.0 yet? Yes. In some tests it still runs out of memory, so it seems there is, I mean, yes, I need more tests, to increase the quality of the code. And I want to run it on a real box, not a virtual machine from some cloud provider, on a real box, to be able to say to people that it's, if not production-ready, at least safe to run somewhere in a test environment. Because right now I still cannot say that for sure. But again, I felt I was really close with the patch, to make it ready for the community, to start a conversation. I know it's a really long journey, and I wanted to start it, and as soon as I had a feeling that I was really close, I invested
Starting point is 00:36:17 all my time over the last couple of weeks in the patch, and I didn't commit anything to this tool; it's kind of paused. But I will get back to it for sure. A couple of questions for those who are interested in trying it out early: first, what are the best use cases for it in its current state, not its future state? And second, how to start quickly? Just clone the repo, and that's it? Yeah, I hope so. At least I try to keep the README up to date, and it should be sufficient just to go through the README, compile it, and run it. I tailored it for my use case, I mean, for my day-to-day work. And I hope it's a good example, and people should be able to use it right after cloning. But I obviously see, for example,
Starting point is 00:37:11 for the next big benchmark it's probably worth using it, to see how it helps to understand what's happening with that benchmark. But maybe, what are the other cases? Like troubleshooting some complex situation in production, trying to reproduce it on a clone, or something else? I think if, for example, you have a problem and, fortunately enough, you can reproduce it in some isolated environment, but you just don't know what happens during the problem, you can run the tool, and it would be like looking into the problem with a microscope.
Starting point is 00:37:46 It will trace everything, it will show everything to you. So I think at this moment it's good enough for that type of usage, investigating some problem in an isolated environment. But if you're brave enough, you can try it on something close to production, on some production load, because you can always just kill it. Brave enough, or desperate enough. Sometimes you have a problem you cannot reproduce, and then it's: let's go to production, if it's self-managed.
Starting point is 00:38:13 In this case, someone goes and uses it in production, because otherwise it's not possible to understand what's happening. Do you recommend attaching to only one backend, or analyzing all of them? What's the best way to use this tool? I built it to trace everything, including the auxiliary processes, the Postgres background processes, because sometimes that's where the problem is. I built this tool with that in mind: just trace everything and have everything.
Starting point is 00:38:43 But when you don't know, yes, yes, yes, that's the default mode. And it's possible to narrow it to only one session or only one backend. Is that possible? That's possible as well. Yes. Good.
Starting point is 00:38:54 So the risk is lower if you decide on only one backend, right? Yeah. Theoretically, yes. The surface should be smaller. Yeah, that's great. Yeah. I definitely plan to use it in some of the next benchmarks I will be doing. I had some benchmarks last Friday.
Starting point is 00:39:17 I just realized I should have involved this tool. But I'm going to revisit those benchmarks. They are fully scripted, so I guess I will just try it and see what it says. It's about this PgQue tool, queue in Postgres. I will definitely connect with you about that. So anyway, I'm excited to see this work. It's, for me, a very unexpected angle on active session history analysis,
Starting point is 00:39:41 which actually dissolves one of the biggest concerns about the ASH methodology, which is sampling, right? pg_stat_statements is precise, unless a query ID is evicted, or unless we talk about the part it doesn't see, like cancelled statements, but it is precise. It's just cumulative metrics, counters incrementing, tracking everything. I really liked pg_stat_statements when it was created, because before that, pre-pgBadger, we had only pgFouine.
Starting point is 00:40:16 It's a French name, and it was a PHP script, and it was based on sampling. Everything related to logging is painful, and sampling is painful. So, okay, we now have an exact, precise instrument to work with. But then ASH is an impressive, great approach, but it's sampling-based. Now you say ASH can be precise. And this is like a new horizon opening for me.
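The sampling concern can be made concrete with a quick back-of-envelope model (illustrative parameters, not a claim about any specific tool): a wait event of duration d is caught by a sampler that fires every T seconds with probability roughly d/T, so a 10 ms wait is missed about 99% of the time at 1 Hz sampling, while a tracer records every occurrence. A Monte Carlo check of that formula:

```python
import random

# Why sampling-based ASH misses short waits: a wait of duration d < T is
# seen by a sampler with interval T with probability roughly d/T.
random.seed(42)
T = 1.0        # sampling interval in seconds (1 Hz, pg_wait_sampling-style)
d = 0.01       # a 10 ms wait event
trials = 100_000

caught = 0
for _ in range(trials):
    s = random.uniform(0.0, T)   # wait starts at a random phase within a tick
    if s + d >= T:               # the next sampler tick at time T lands inside
        caught += 1

hit_rate = caught / trials
print(f"caught {hit_rate:.3%} of 10 ms waits at 1 Hz (theory: {d / T:.1%})")
```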
Starting point is 00:40:42 I'm super excited. Thank you for making this work. It's great. And when I, working on this tool and working on the Pigeful Progress, I understood that there are some, how to say, gaps in the attributing time to the queries that's not really, I would say, straightforward to understand. For example, the client reads the weight event that we see really often, but is it really database time or it's still client time? It's not that simple and easy, but even if you go deeper, so there are, for example, you, you know, You ended the query, I mean, you open transaction, begin, you made some changes. The last update finished.
Starting point is 00:41:25 And you issue commit or end. And really, you need to attribute the time and ways to end to commit. I think it's default behavior. But also even after commit, there are some work that progress does, some cleaning up. And from EGSTAT statements perspective, it's already not there. and some part of the work, it's kind of unadributed. And it goes to nowhere.
Starting point is 00:41:52 It's not really big amount of work, but still, playing with this precise tool, if you're curious enough, so we'll get you to the next level of understanding how Postgres actually works to the internals. Really interesting. I really like for that, because I never thought about it
Starting point is 00:42:08 until I start working with it. And it's true that not only you can understand like exact sequence of weight events for this particular load but also you can map it to source code right of course nowadays lLM can explain it to you so it's there is not it's not a foreign language anymore so you can speak to it you can really easy to understand which opens doors for more optimization of post goes itself i hope so yes yeah like usually with gdb or with perf now with this tool also it's possible. If you have some pathological workload,
Starting point is 00:42:46 you reproduce it in synthetic environment, let's go. That's great. Michael, you wanted to add something, right? Oh, only when you were talking about unexplained overhead, it often shows up in explain, analyze, when you look at the difference between the execution time at the end of the query plan and the actual total time of the, like, the top-level operation.
Starting point is 00:43:10 You almost always see a small discreet. there as well and not accounted for anywhere, which is, yeah, the same thing you're talking about, I think. Yeah, a few percent there and there and you lost 10 bucks. Yeah. Well, yeah, I'd like to echo what Nick's saying. It's really interesting work that you're doing. And thanks for sharing it. And also, I think you published the tool under the Postgres license, which is really cool.
Starting point is 00:43:36 Nice one. Yeah, I personally deal with it. So if you develop for Postgres, just and share it. And what book on Q theory you are reading for quite long time, as I know? Yeah, so it's a quite old one. Because it connects me to my current work, like this tool I just released. And I guess I need to read it as well. I will send it.
Starting point is 00:44:02 You can put it to the description. Great. Okay, good luck with the tool. Definitely. I hope it will be popular. And I hope your patches will be welcomed in hackers. Yeah, we will see how it's going, how it goes. Great.
Starting point is 00:44:21 Thank you so much for coming again. Thank you. Thank you. Thanks a lot.
