Postgres FM - pg_ash
Episode Date: February 20, 2026

Nik and Michael discuss pg_ash — a new tool (not extension!) from Nik that samples and stores wait events from pg_stat_activity.

Here are some links to things they mentioned:

- pg_ash: https://github.com/NikolayS/pg_ash
- pg_wait_sampling: https://github.com/postgrespro/pg_wait_sampling
- Amazon RDS Performance Insights: https://aws.amazon.com/rds/performance-insights
- Our episode on wait events: https://postgres.fm/episodes/wait-events
- pg-flight-recorder: https://github.com/dventimisupabase/pg-flight-recorder
- pg_profile: https://github.com/zubkov-andrei/pg_profile
- pg_cron: https://github.com/citusdata/pg_cron

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
- Michael Christofides, founder of pgMustard
- Nikolay Samokhvalov, founder of Postgres.ai

With credit to Jessie Draws for the elephant artwork.
Transcript
Hello, hello. This is Postgres FM. My name is Nik, Postgres.AI. And as usual, with me is Michael, pgMustard. Hello, Michael. Hello, Nik.
How are you? What's new?
I'm good. Not as much new as you, I think. You wrote a blog post recently. You've been working on...
Oh, I saw. Yeah, little improvements here and there. Doing some research in the background as to what people want and need. But yeah, you've got some much more exciting news, I think.
It's small, but maybe good. I don't know.
I'm experimenting a lot lately with AI as well,
constantly learning.
It's crazy times.
People reconsider workflows, obviously, right?
How we work.
So it's easier right now to ship some stuff.
And all the ideas you had over a long time,
it's quite easy right now to implement,
if you just shift from coding to engineering
and organize design
and verification of all the details, like benchmarks,
and focus more on architecture and so on.
It's a great time to have helpers.
And honestly, Opus 4.6 is quite good.
Even better than 4.5.
So yeah.
And others are also good.
I use other models for reviews too, including Gemini and GPT, of course.
Yeah.
Anyway, I wanted to discuss this small tool I just created over the weekend.
And today I'm releasing version 1.2 and publishing it finally.
I have good feedback from a couple of folks, not only from my team, but also external
ones.
And it's working on our production already.
And yeah, I think it gives me the same feeling as I had when I proposed transaction_timeout to Andrey to implement during our Postgres hacking sessions on YouTube.
Because the feeling is: why doesn't this thing exist yet?
Yeah, do you want to say what it is?
Yeah, pg_ash: active session history.
I believe, with all the new setups and growing existing setups of Postgres, millions of
database clusters, that wait event analysis has huge potential and is heavily underappreciated.
And Performance Insights on RDS is great.
Other tools like pganalyze, all tools are great.
But there is huge potential for various tools, and for people to understand what it is.
And I think a lot of non-experts don't understand what it is.
And they are going to explore it soon, I hope, because it's a great tool to troubleshoot database issues.
You know, like in Brendan Gregg's books and presentations: he says if somebody asks him for a single Linux performance analysis tool in the console, it would be iostat.
Because iostat shows disk metrics, right? Latency, throughput, queue size. And it also shows a little bit about CPU as well.
If you have one tool to quickly understand as much as possible,
this is it.
Of course, he also mentions sar, which is great, and so on.
But the same thing here.
If you choose just one approach or tool which will give you as much understanding as possible
of problems during troubleshooting, like why the database is slow,
this is going to be wait event analysis, or active session history analysis.
We had a separate episode about it, maybe a couple of them.
And I have a huge feeling inside me
that it's hugely underappreciated.
When I talk to people who are not Postgres experts:
if they are on RDS, this is the default tool there.
But outside of RDS, they don't know what it is.
And even with RDS Performance Insights, or Cloud SQL Query Insights
(or however it's called now),
there's still a huge lack of use of it all the time.
I think it should be central in all troubleshooting.
It should be the starting point.
Okay, so why?
Okay, go on.
Yes.
So I've been a big fan of pg_wait_sampling for many years.
pg_wait_sampling is an extension
which provides very precise sampling, 10 milliseconds, internally.
But you need an additional tool, like monitoring, to export that data for analysis.
It's hard to analyze it inside, because you still need persistent storage.
The history it keeps doesn't cover a lot.
It's just some tail, and you need to export it.
But it gives you 10-millisecond sampling.
It's a very precise tool.
The problem with pg_wait_sampling,
and we had it for a long time in our monitoring stack,
but slowly we moved our attention away from it,
and implemented what we call lazy sampling,
where every 15 seconds our monitoring tool collects the data,
is availability: a lot of customers don't have pg_wait_sampling.
Among managed platforms, it's present only on Cloud SQL.
I wish it was installed everywhere.
But no.
Unfortunately, that's reality.
So we shifted to lazy sampling.
And it's working; it's enough if the database is huge.
15-second sampling is already good enough.
And we cannot do one-second sampling, because it's too high a frequency.
Observer effect: it's too much.
Because we are external, pulling this data from outside,
connecting to an instance on Supabase or RDS or anywhere.
There is a feeling that this approach, lazy, infrequent,
is not okay.
We need something internal to sample, and then we can export it.
This was the number one reason why I thought we need something else.
Reason number two: there are more and more cases when databases are small,
and it's hard to justify having full-fledged monitoring for them,
because you need to pay basically at least 100 bucks for a monitoring setup.
But if your database is tiny and you pay 50 bucks, it's like some pet project.
There's no way you will decide to spend a few hundred bucks per month for this pet project.
But you still need this, because if something happened a couple of hours ago,
you need to troubleshoot, and Postgres internally has everything, right?
It just doesn't have memory.
So wait events: there are two columns, wait_event_type and wait_event,
in pg_stat_activity.
But nobody samples them by default.
And the third reason was
understanding that current monitoring tools are going to change,
to be better, to provide information in better form for LLMs during troubleshooting.
This was reason number three.
That's how I came to the idea: okay, let's implement something.
And this something must
be low footprint in terms of storage and observer effect. It must work everywhere. And it must
be LLM-friendly in terms of how information is provided. Not a lot of JSON, not needing
to write long prompts all the time, but some compact form you can feed to your AI agent
and ask it to explain what's happening here and where to dig further. And this is how
pg_ash was created. Because of the first requirement, I couldn't create it as an extension.
I couldn't. So I called it an anti-extension. And remember, we discussed pg_index_pilot,
which is automation for recreation of indexes: it was also created in the same fashion.
And also, as a sneak peek: we talked with some Supabase folks, and they seem to
understand this as well, and there is also a new project called pg-flight-recorder.
I hope we will discuss it separately, maybe. It is actually very similar to my pg_ash,
but it covers more.
So we are thinking about how to maybe unite our efforts.
It's an ongoing discussion.
But yeah, it samples pg_stat_statements, pg_stat_io, and so on, all the stuff,
SLRU.
Oh, interesting.
So there was an older project called pg_profile. Andrei
Zubkov created it, and it has existed for quite some time.
It's great.
It's AWR from Oracle, but for Postgres, right?
It also records all the statistics inside, but it's an extension.
And since it's an extension, it's not available on most managed Postgres platforms.
That's why I like the idea of an anti-extension: it's just pure SQL and PL/pgSQL.
So installation is just a case of running an SQL script that, you know,
creates a schema and functions.
Exactly.
Yeah.
So I think this...
Which you could bundle as an extension, right?
But the problem then becomes... well, let me be transparent.
Why not?
Yesterday we discussed it.
And there is an idea that, yes, it could also be an extension, not only a non-extension.
It's just a bunch of SQL.
There is an idea.
Okay, this could be.
I honestly have the thought that maybe Postgres could have some concept of
packages or something.
There was Trusted Language Extensions.
Yes, exactly.
Yes.
That's it.
Exactly.
So David Ventimiglia
(I hope I pronounced the last name right;
I apologize, David,
if I got it wrong)
from Supabase said...
and he's the one developing pg-flight-recorder?
Yeah.
Yes.
So he said
this is the direction he's thinking:
towards packages,
TLE extensions, right?
Because TLE, where it's available
(I think it's available on Supabase and on RDS, right?)
is a straightforward concept.
But in some cases, we need to think about all platforms.
And in this case, we want maximum flexibility.
So I'm thinking now, in future versions,
to package it both ways: pure SQL installation or TLE installation.
Why not?
Yeah, absolutely.
It's just a bit sad that
the idea of having an extension framework was to make distribution easier,
and now we've gone full circle,
and it makes it harder, because everyone's on managed service providers that pick which extensions are available.
So yeah, it's just a bit of a shame.
This sentiment you just shared: it's so deep in me.
I remember 2005, '06, '07, when it was like: Postgres is extensible.
Then the extensions concept was created by Dimitri Fontaine, right?
Oh, really?
Yeah.
He created it; extensions are his thing.
I hope I'm not mistaken.
I can hallucinate often.
This happens because I'm overloaded with information.
And then everyone was excited:
Postgres is so extensible.
But we should blame managed Postgres platforms,
starting with Heroku and RDS and all the others,
for why extensions became an unextendable concept.
Because it's in their hands to decide which extensions to add.
And they have their own reasons, obviously:
every time they add some extension,
it means they need to support it
and take some responsibility for whether
it's secure and so on. I understand that. But extensions became unextendable.
And every time I thought, oh, I want to implement this for Postgres, every time I decided:
no way will I make this an extension. Of course, the concept of Trusted Language
Extensions is great, but it's still not adopted everywhere. It's limited, right?
Yeah. So this is not an extension. It could be, but it's not to start with.
And it installs a couple of tables, some functions.
What does it do exactly?
Yeah, let's talk about what exactly it does.
The idea is: let's just sample,
collect wait event samples from pg_stat_activity with query IDs,
and have some history.
It's quite simple.
So the problem is, we need some helper
to invoke the function which will do the sampling.
And I don't want any Lambda functions or anything.
I want everything inside Postgres,
but I cannot build a background worker or anything,
because I don't want to create an extension.
pg_cron was chosen as a dependency,
and pg_cron is present everywhere.
And I remember I was driving at that time.
I was brainstorming with ChatGPT.
Unfortunately, Claude's
voice abilities are not as good as ChatGPT's.
So I was brainstorming,
and we both agreed
that pg_cron was going to be a problem, because, as any cron, it has only one-minute precision.
I was driving.
I reached my destination.
I was starting to doubt.
I always doubt when working with AI, and with humans, actually.
It's not only about AI.
Humans also can make mistakes.
So I was in doubt, and I said, let's verify.
I checked the documentation.
Check with me.
And thankfully, starting with pg_cron 1.5, per-second precision is available.
Okay.
And this was a moment, okay, I'm going to build this because I need per second precision.
Is one second the minimum, though?
It's the minimum, yeah.
But you're going to ask why.
I was going to ask why one second is the default, and now it makes sense.
Yes.
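For context, the scheduling call looks roughly like this. This is a sketch: the job name and the `pg_ash.sample()` function are hypothetical (the tool's actual function names are not given in the episode), and the exact interval-string format should be checked against the pg_cron docs for your version.

```sql
-- Requires pg_cron >= 1.5, which added second-level (sub-minute) schedules.
-- 'pg-ash-sampler' and pg_ash.sample() are illustrative names only.
SELECT cron.schedule(
  'pg-ash-sampler',
  '1 second',                 -- sub-minute interval syntax, pg_cron 1.5+
  $$SELECT pg_ash.sample()$$
);
```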
Then I was thinking: okay, we are going to write a lot.
We need to write the query ID and the wait event.
So how do we optimize storage?
And there are two big problems.
First, how much we are going to spend in terms of bytes if we're just writing.
And second, we know Postgres, MVCC, and bloat issues, right?
So if you implement some ring buffer for your storage... anyway, let's jump to the solution.
I knew the proper solution from Skype: SkyTools, PgQ, three partitions and rotation, right?
So I just created partitions: yesterday, today, and tomorrow, logically.
So yesterday is a read-only partition, fully filled:
24 times 60 times 60,
86,400 records, right?
Actually, no.
We need to think: do we store one row per
wait event pair, or do we store it differently?
Let's return to this point.
So we have the current partition, today, and tomorrow is truncated,
because TRUNCATE is super efficient, as we all know, right?
No deletes, no updates: inserts and TRUNCATE, that's it.
So this is how we implemented it.
And right now, in the current version,
we have visibility only for yesterday and today.
And you can export it to your monitoring tool, as we are going to implement
compatibility with this extension (sorry, anti-extension) in our monitoring stack.
But there is also a plan
(maybe by the time you listen to this podcast, I'll have already implemented it)
to have roll-up storage for the longer term.
I plan to have one year with a precision of maybe one hour or one minute.
It depends.
It's not a lot.
So this is how we solved MVCC:
just borrowed an idea from good old Skype from 20 years ago.
I still think it's a great idea.
So no bloat.
So that solves the bloat issue.
Yeah, exactly.
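A minimal sketch of that PgQ-style rotation, with hypothetical table names (the real pg_ash schema may differ): three identical tables take turns playing the yesterday / today / tomorrow roles, and the table entering the "tomorrow" role is emptied with TRUNCATE rather than DELETE.

```sql
-- Three physical tables whose logical roles rotate once a day.
CREATE TABLE ash_part_a (ts integer, dbid oid, data smallint[]);
CREATE TABLE ash_part_b (LIKE ash_part_a);
CREATE TABLE ash_part_c (LIKE ash_part_a);

-- Daily rotation step: whichever table becomes "tomorrow" is truncated.
-- TRUNCATE is near-instant and leaves no bloat behind, unlike DELETE,
-- which would create dead tuples for VACUUM to clean up later.
TRUNCATE ash_part_c;
```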
And second, how to optimize storage.
We had several ideas.
So we decided (I actually decided, but it was backed by a lot of benchmarks,
many iterations):
we chose the idea to store one row per second.
So it's a timestamp
and a database ID, because... oh, not one row.
If you have multiple logical databases,
there will be as many rows as there are databases currently active,
in terms of workload, at this very second.
So at this very second...
One row per database per second.
But only those which
have active sessions right now.
Okay.
Yeah.
I thought about it.
If you have a thousand databases, it's insane.
But it's unlikely all of them have active sessions right now.
It might happen if you have a huge machine
and a lot of things happening.
But still, if you have huge machines,
you can maybe afford 1,000 rows per second.
Maybe not; it depends.
There is room for improvement here.
But anyway, right now it's one row per second
if only one database is active right now.
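The raw material for one sample could be gathered with a query along these lines. This is a sketch, not pg_ash's actual sampler; note that `query_id` in `pg_stat_activity` requires Postgres 14+ with `compute_query_id` enabled (or pg_stat_statements installed).

```sql
-- One second's worth of raw data: active sessions grouped by database
-- and wait event; idle backends and this session itself are excluded.
-- A NULL wait event means the backend is running on CPU, not waiting.
SELECT datid,
       coalesce(wait_event_type, 'CPU') AS wait_event_type,
       coalesce(wait_event, 'CPU')      AS wait_event,
       count(*)                         AS sessions,
       array_agg(query_id)              AS query_ids
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
GROUP BY 1, 2, 3;
```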
And then we encode everything.
But how?
The first thought was about JSON or JSONB.
Then I proposed correlated arrays, and benchmarks proved that in this case they're better than JSONB.
Two arrays, two columns.
And then I proposed a single encoded array, and benchmarks proved it's better still.
So we encode the wait event and the number of sessions. The wait event, for example
LWLock / LockManager, our favorite, right?
And then a number: how many active sessions are present, for example, five.
And then five query IDs.
So seven numbers, and then the next wait event, and so on.
So we have a bunch of numbers.
I wanted to use two byte integers because it should be enough.
So we had also, we created two dictionaries, one to encode all weight events.
Actually, if you write every time if you write LW log manager, it's a lot of bytes.
Why?
We can encode it and to support, as in recent couple of major versions, there is a PG weight event, weight events dictionary, right, in PostGGGGS.
Yeah.
Which is just a static list of them.
But I wanted to support older versions (this thing supports Postgres 14+), so that's why we created our own dictionary, to backport that.
Yeah. One thing I didn't understand: you mentioned not using eight-byte
integers. Does that include query IDs? Because query IDs are eight bytes by default, right? Yeah, so query IDs are
eight bytes, and that's why we have two dictionaries: one to encode wait events and a second
to encode query IDs, to map them. Interesting. That's it. And yeah, we also... I don't
remember exactly, but we somehow solved the problem that the dictionary could
grow without limits. We solved this. So it cannot grow without limits. We keep it short.
Ah, I know how we solved it. I remember, because I proposed this solution. By the way, I forgot to say: I created it using three AIs plus me. So a team of four worked on it.
One AI was the engineer, one was focusing on benchmarks, and another was focusing just on quality, documentation, and reviews.
The benchmark engineer also focused on reviews,
but with a specific goal:
storage efficiency and performance,
low observer effect,
and so on.
And we worked many iterations,
like crazy.
And I did it all.
The whole thing was coded from Telegram,
so I didn't touch anything.
But you reviewed it, right?
I reviewed what?
I didn't review the code.
Really?
I reviewed only the results of benchmarks.
I cross-checked between the two AI engineers.
I cross-checked results.
Like, they work.
I trust this code, actually.
It works in my production already.
Yeah.
Why should I look at this code if I know that it was very thoroughly tested and benchmarked
and reviewed many times by AI?
and reviewed many times by AI?
Well, yeah, that's the worry for me.
But, yeah, anyway, it's interesting.
I understand your question.
But in this case, I think it's quite solid.
I actually checked that the code follows my style.
Wait, you just said you didn't check the code.
I checked pieces of the code, just to be sure.
Before I decided not to look into the code,
I made sure it was written according to our style guide.
And I like what I see.
So the style is correct: how we write PL/pgSQL.
I have mine very well established already,
because PL/pgSQL has been the most popular language for me for the last 10 years.
I wrote a lot in it myself.
So when I had made sure it was producing good code, I didn't see the final version of everything,
but I know it was thoroughly tested in many aspects.
Yeah.
So, back to storage.
This is how we encode, and how we solved the unbounded growth of the query ID dictionary.
My idea, actually.
I'm excited, because you can engineer solutions, thinking about algorithms and data structures,
and you don't need to code everything.
AI is coding. So I decided to have three tables, one per day, that's it. And the same
rotation: we have a dictionary for today, we have a dictionary for tomorrow. The AI raised a doubt,
because a query ID might not match, but we don't care: if the same query ID receives a different
encoded ID tomorrow, it doesn't matter, because we don't work with this data directly. We work
with it through interface functions, which are exposed to the
user, and the user doesn't see the encoded numbers at all.
Yeah, and once you look them up, it's the same query ID anyway.
Okay, yeah, that makes sense.
Great.
And it cannot grow unbounded anymore, right?
Yeah.
Yeah.
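A sketch of such a daily dictionary, with hypothetical names: an 8-byte `queryid` maps to a 2-byte code, and because the dictionary tables rotate just like the sample tables, truncating with `RESTART IDENTITY` restarts the codes each day.

```sql
-- Per-day query-ID dictionary: 8-byte queryid -> 2-byte code.
CREATE TABLE ash_qid_dict_a (
  code    smallint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  queryid bigint NOT NULL UNIQUE
);

-- Rotation: the dictionary entering the "tomorrow" role starts empty,
-- and code numbering restarts from 1 -- bounded growth by construction.
TRUNCATE ash_qid_dict_a RESTART IDENTITY;
```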
And I guess for the roll-up approach, we will also need a separate dictionary.
That's it.
Which might be bigger. But why? Because I want to have history for one year at least, or maybe half a year.
Good. Okay, fine.
But it will already be compact,
compacted, and not all query IDs will go there, only the most important ones, which
are most popular or participating in spikes. Yeah. So, to make sure I've understood: we've got three
main tables, only two of which will contain data at any time, a day's worth of data each, like all of
yesterday's data. Per second. And then, yeah, and then today's data until now. So it's like
one and a bit days of data at any point in time.
And that's fine, because the main point of this is: something was slow recently.
Like, we had an incident 10 minutes ago, or an hour ago, or over the weekend.
In fact, over the weekend is interesting.
But there was something fairly recent we want to look into.
And normally that's within a couple of days.
Yeah, I think maybe we should allow configuring how many days of raw data are stored.
Anyway, I know this approach with three partitions in rotation is from PgQ, as I said.
You cannot see the day before yesterday with it, until we implement the roll-up approach for longer-term storage.
And I forgot to mention how much data it is.
Per one row, if there are five backends active, it's roughly 100 bytes only.
I also forgot to mention that, knowing about alignment padding,
I decided to also encode timestamps.
So it's 4 bytes instead of 8:
a 4-byte Unix timestamp, but shifted.
It starts at January 1st of this year.
So it will be enough almost until the end of the century.
And the database ID is 4 bytes.
It was already:
it's an OID, so 4 bytes.
8 bytes, plus the encoded data: roughly 100 bytes
if you have five active sessions.
And, for example, if you have a more loaded machine,
say, 50 active sessions on average,
it will produce only 30 to 50 megabytes per day,
which is acceptable for us.
Absolutely.
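The arithmetic behind that estimate, as a quick sanity check (the roughly 100-byte and 500-byte row sizes are the episode's ballpark figures, not measured values):

```sql
-- 86,400 rows per day per active database; row size grows with session count.
-- ~100 bytes/row at ~5 active sessions, ~500 bytes/row at ~50 sessions.
SELECT pg_size_pretty(86400 * 100::bigint) AS per_day_5_sessions,   -- ~8 MB
       pg_size_pretty(86400 * 500::bigint) AS per_day_50_sessions;  -- ~41 MB
```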
I also forgot to mention that you cannot use it on replicas.
This is a downside.
We forgot to mention that in the README, but I thought about it before we implemented it.
It's obvious, because we need to write.
But it's fine, because
we observe more and more even bigger clusters coming to our consulting which don't have replicas.
Or only HA replicas.
Yeah, this is a good point.
Actually, yes: they recognize HA is needed, but they are okay to live with one node.
Somehow, for quite a long time,
they live with it.
It's so unusual.
I still think the classic three-node setup is much better.
But reality is showing a different picture, because cloud resources, as we discussed multiple times,
are so reliable already that people are okay with it.
And also Postgres performance:
you don't need read replicas for quite some time, and they add complexity.
So that's why I thought: okay, this is only for the primary.
It's enough.
And I'm satisfied with results like 50 megabytes per day for a loaded cluster.
It's great.
And I'm thinking maybe we'll keep seven days of raw data, or maybe 14 days of
raw data.
Yeah, I could easily see that.
At least the weekend thing makes sense to me.
You come in on a Monday, and it turns out there was some blip on Saturday.
It would be a shame if that data's gone.
I agree.
Yeah.
Yeah.
And we have a bunch of functions which allow you to see what's happening:
give me an overview of the last few hours, for example, or the previous day.
Which top wait events happened, and which query IDs are participating there?
Yeah.
And it joins with pg_stat_statements, if it's available, to present some high-level,
macro-level stats for each query ID which is participating in some wait events.
We also implemented visualization with bars.
It can be a monochrome visualization, or even colorful in psql if you do some trick.
And I enjoy it.
You see basically Performance Insights right inside psql.
It's quite fun.
And you can, for example, overview the last 24 hours, then zoom in to a specific area.
You can understand which queries are involved.
And of course it requires some typing work.
But since it's organized quite straightforwardly, you can let your LLM do it,
if you connect using some read-only user role.
Yeah.
Let it troubleshoot: it can quickly find the bottlenecks.
Is it an I/O-bound workload,
so you need to increase disks or give more memory?
Or is there heavyweight lock contention,
and which queries participate?
It's quite good for this kind of troubleshooting.
Yeah, definitely macro-level stuff.
But I saw there were even some micro-level abilities as well,
like looking at a specific query's waits by its query ID.
Yeah. For example, for a specific query ID,
you can ask which wait events it had.
It mimics a little bit
what we have in full-fledged monitoring,
because we have all of this there.
And here, just in textual form,
you can do the same kind of thing:
for this query ID,
what were the wait events during the last hour, or the previous day?
And for this wait event type,
or a specific wait event,
what are the top five query IDs
and their macro-level characteristics?
It's still macro level, actually,
because a lot of aggregation is happening here.
But you can jump to micro level, a specific query ID, and see plans and start working with this.
So there is a bridge here.
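Usage might look something like this. The function names below are purely illustrative, since the actual pg_ash interface isn't spelled out in the episode:

```sql
-- Overview of the last 24 hours: top wait events and participating query IDs
-- (pg_ash.overview and pg_ash.query_waits are hypothetical names).
SELECT * FROM pg_ash.overview(now() - interval '24 hours', now());

-- Drill down: wait-event profile for one query ID over the last hour.
SELECT * FROM pg_ash.query_waits(-870652361914627020, now() - interval '1 hour');
```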
Right.
Yeah, there is a bridge.
So anyway, it's early days.
I'm releasing it this week.
Let's see how it works.
Is it useful or not?
I just wanted to add that this thing feels to me as... oh yeah, one more thing.
So this is poor man's monitoring, self-monitoring, right?
I don't think it is, but yeah, sure.
It's quite straightforward.
This trick with pg_cron... I forgot to mention, by the way, that if we sample every second,
pg_cron by default writes logs to a table.
Oh, yeah, yeah.
Yeah, so I think pg_cron can be improved here.
For example, I would turn logging off for this specific job.
I don't need logging here.
I just need to see errors, and I don't need errors to be stored with full ACID guarantees.
The Postgres log itself would be enough for me to troubleshoot what's happening.
Anyway, there's potential for pg_cron to be slightly improved for these high-frequency
jobs.
But what I wanted to say: this is a self-contained, self-monitoring system, right?
So even without an LLM, you can write some analysis inside Postgres, so it could produce
some reports and analyze what's happening.
Maybe it's time to add more memory
or increase shared_buffers, for example,
because we see a lot of IO / DataFileRead, or something, right?
Or maybe we see a lot of time spent on heavyweight locks,
so we need to signal that the workload must be redesigned
to avoid contention on heavyweight locks, right?
But there is a trade-off:
should we monitor inside or outside?
And I think there are pros and cons.
We discussed it already, right?
We did.
Yeah, that's...
I had one last question, though.
I think, by the way, the reason I disagree is:
I think it's actually just another tool, right?
It's another tool in the tool belt.
If you self-manage, or use a provider that doesn't have wait sampling,
this is a valuable tool in its own right,
regardless of monitoring.
Especially when you mentioned the overhead from monitoring externally:
you'd only get, let's say, 15-second frequency, or something of that order of magnitude.
Getting one-second frequency is an upgrade,
not a poor man's solution.
Yeah.
It's not 10 milliseconds, though.
No, true.
Good point. So the last thing I wanted to ask, because you always ask our guests this: on the license
front, you've chosen a really permissive license, Apache 2.0. Why?
Why not? It can be embeddable into some stuff. And I even think, honestly (let me be transparent here), it will
contribute to my business, because if we have this everywhere, it's easier for us to explain which
problems exist and what should be done about them, and to improve health and support people.
Yeah, yeah.
So right now we are in a not-quite-good position, because on Cloud SQL (and this is one of the not-so-many
but quite great things they have) there is pg_wait_sampling. Super great.
But RDS has their own proprietary tool, Performance Insights, which is great if you're
only on AWS.
But if you think about interoperability, how to say it, you want to be able to work with
any Postgres. Performance Insights (Database Insights, as I call it right now) is some proprietary
API, and it's not transparent. And it's also an external dependency. I always think: okay, how good is it?
Is it trustworthy? Maybe there are also bugs. I don't see the code, right? I cannot know.
Maybe there's something there. RDS quality is great, let's agree. The quality is great.
But still, sometimes you doubt how exactly it works. This is fully transparent,
works everywhere, right? And that's it.
Yeah, nice. All right.
Thanks so much,
Nikolay.
Yeah, thank you.
So yeah,
everyone is welcome to try and
contribute,
fork, anything.
Yeah, I will.
I'll stick it in the show notes,
obviously.
Make it nice and easy.
Thank you.
Yeah, good.
All right,
take care.
Catch you in a bit.
Bye.
