Postgres FM - pg_flight_recorder

Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I'm Michael, founder of pgMustard, and I'm joined as always by Nick, founder of Postgres.AI. Hey, Nick. Hi, Michael. And we have a special guest, David Ventimiglia, Solutions Architect at Supabase and the creator of pg_flight_recorder, which we're talking about today. Welcome, David. Thank you so much. Good morning and I guess good afternoon and good evening.

Starting point is 00:00:23 It's nice to meet you. Yeah, I think we've got all three bases covered. Where would you like to start? Perhaps with why another tool in this area? What's the origin story? What's the motivation? The motivation, as a solutions architect at Supabase, my job evidently is to help our customers. And often they come to us and they say, I've got this problem. My database is slow or this query is behaving weirdly. Please help us. And then we try to bring whatever tools we have to bear on the subject. And to be honest, often we're starting out with no idea what the problem is. And our beautiful

Starting point is 00:01:05 customers cannot be relied upon to, you know, relay all the information perfectly. And I just needed more. I was telling Nick this at one point, you know, some people say less is more. I say more is more. I needed more data. And I just wasn't getting it. And I did the usual thing that people would do these days. Version zero of this, maybe four or five months back, was a rush job with our good friend Claude Code. And I just cobbled something together to get the data that I needed for a particular customer. And we got the data that we needed. And we were able to get through that particular instance rather swimmingly.

Starting point is 00:01:44 And I was pleased by the outcome. And I thought, you know, let's try to turn it into something a little better and a little more real. As a sidebar, I think most of us are familiar with the excellent pg_wait_sampling, which is an excellent extension, but sadly is not available on all managed Postgres services, even Supabase. Cloud SQL is one little one. But even Supabase, on whose staff is Alexander Korotkov, who wrote it, but we don't have that extension on the platform.

Starting point is 00:02:23 And there's, you know, there's some resistance within managed platforms to getting new extensions added. I mean, there's a rich and vibrant ecosystem for extensions, and that's one of the strengths of PostgreSQL, but a weakness is getting those into all of the places where you need them. So I needed something that was a poor person's substitute for pg_wait_sampling. That's really how it started out: it was just a worse version of pg_wait_sampling written in SQL and PL/pgSQL. And then you know how these things go. It just took off from there.

Starting point is 00:02:59 Yeah, there's a lot to unwrap here, actually. Yeah. Yeah, and I agree with you, starting from the end of your intro. I keep saying, like, extensions are not extending us anymore, because in the reality of managed Postgres, we are limited to only the set of extensions which are present. There is pg_tle, and we should discuss that separately, which is, I think, a great idea.

Starting point is 00:03:23 And actually, you told me, there's a lot happening there also. So let me tell my story. I think every experienced Postgres DBA at least once wrote this snapshot tool for pg_stat_statements and other pg_stat views. Because it's only cumulative statistics. Some numbers are growing. That's it. And we need persistent storage. Usually, for bigger clusters, we have monitoring.

Starting point is 00:03:51 But I remember I was working with a really big company called Chewy. And I remember they had great monitoring already. Some clusters were on RDS already. So they also had, like, Performance Insights and so on. Still, I wrote my snapshot tooling because I didn't fully trust the monitoring things. And I also wanted to verify, and some details were missing, because they didn't capture all the metrics I needed, specific metrics and so on. So meanwhile, like, there are some projects which implement this idea, but they are extensions, so they are not available. pg_profile, I think, there is

Starting point is 00:04:32 such a great pg_profile. So that's why many, many DBAs, maybe not all, but many DBAs at least once wrote some snapshotting tool. And I remember our checkup tool was also snapshotting. The first versions of the checkup tool were a shell script. So with snapshots, you have two snapshots, now you need a diff. And since we were in Bash, we didn't want to do the diff there. So we sent these snapshots back to do the diff on the observed Postgres. So I saw a lot of solutions which try to make this data persistent and have snapshots and then show everything without the need to set up full-fledged monitoring.
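
As a rough illustration of the snapshot-and-diff idea described above, here is a minimal Python sketch. It assumes each snapshot is a plain dictionary keyed by a query identifier, with counter names borrowed from pg_stat_statements (`calls`, `total_exec_time`). The function name `diff_snapshots` and the way it treats a counter reset are illustrative assumptions, not taken from any of the tools mentioned.

```python
from typing import Dict

Snapshot = Dict[str, Dict[str, float]]


def diff_snapshots(before: Snapshot, after: Snapshot) -> Snapshot:
    """Return per-query counter deltas between two cumulative snapshots."""
    deltas: Snapshot = {}
    for query_id, counters in after.items():
        previous = before.get(query_id, {})
        row = {}
        for name, value in counters.items():
            old = previous.get(name, 0)
            # A counter that went backwards means the statistics were reset
            # between the snapshots, so the new value is the whole delta.
            row[name] = value - old if value >= old else value
        deltas[query_id] = row
    return deltas


# Two hypothetical snapshots taken some time apart.
snap_1 = {"q1": {"calls": 100, "total_exec_time": 250.0}}
snap_2 = {
    "q1": {"calls": 160, "total_exec_time": 400.0},
    "q2": {"calls": 5, "total_exec_time": 12.5},
}
delta = diff_snapshots(snap_1, snap_2)
print(delta)
```

Queries that appear only in the later snapshot are treated as starting from zero, which is why persisting the raw snapshots matters: the interesting number is always the difference between two points in time, not the cumulative total.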

Starting point is 00:05:15 And even if it exists, still, sometimes we need an additional lightweight solution. This is one thing to warm us up on why it's needed. Another thing is that I guess Supabase has a lot of clusters. Some of them are... We have a few, yeah. Yeah, like millions, right? It's quite a unique, interesting story, and many of them are small, and you cannot justify full-fledged monitoring, right?

Starting point is 00:05:40 And getting smaller, yeah. Getting smaller. Yeah, because people just experiment so much, right? And they need so many databases. Yeah, we are now, I mean, we have a skewed distribution, of course. We have a few giant customers, but we have lots of medium-sized customers and many more small ones and millions of tiny ones. And now with the AI builders, we are shoveling millions of nano instances into the AI builder

Starting point is 00:06:08 furnaces. Like, where are these databases going? What's the long-term prognosis for these tiny databases? Who knows? Who cares? That's a completely different animal. But yes. It's a very, it's a vibrant ecological niche that we've developed here.

Starting point is 00:06:25 Yeah. On that note, who's this tool for? Like, of that distribution, are there some where it's not appropriate for them and some where it's ideal? Or like, where does it fit? Yeah, that's a great question, Michael. Again, and as Nick indicated, this is, you know, time is a flat circle: all these things have happened before and will happen again. You know, versions of this have been written before and versions of this will be written in the future.

Starting point is 00:06:52 There's nothing really that profound about this. But this is the tool that I needed right now at this time for the reasons that Nick just described. Who this tool would be for, I think, would be startups, SMBs, builders, sort of the canonical Supabase customer, those who are starting out, you know, building a business, building a backend, building a project. They need PostgreSQL, and off they go. You know, we at Supabase, I don't think it's any big surprise, we don't really have that many migrations onto Supabase. I mean, we would love to have more, and anybody who's willing to bring giant workloads over to Supabase, come on over.

Starting point is 00:07:37 But we know that databases are infamously sticky tools anyway. People don't really migrate that often. And they're probably less likely to migrate giant workloads over from Oracle or Microsoft SQL Server to Supabase. Although we are entertaining that option. But if we did do those things, those folks would probably come over with DBAs, database experience, database expertise. So our sort of customer portfolio doesn't really reflect that. What we have are, even our largest customers, I would say tend to be, I mean, they may have three or four or five years of experience with PostgreSQL now by dint of hard effort,

Starting point is 00:08:24 but they all started out small. Every one of our large customers was a little acorn that grew into a giant oak. And we try to make Supabase easy, and we do. It's certainly easy to get into. I've used this analogy too many times, but it's like the car dealership. You can drive it off the lot in five minutes, but actually operating it, especially at scale, is something different. And we, Nick knows this, Michael, you know this.

Starting point is 00:08:51 We would all benefit from more and better automation. And it's coming. It's coming from within the community, and it's coming from Supabase. We will be able to help these customers more seamlessly operate their databases in the future. But right now what we need is tooling to help customers as they grow and as they scale. So in a nutshell, like who's this for? People who are not DBAs, not database experts, they just want to run a business. They want to grow that business.

Starting point is 00:09:21 And they want some tools to help them. That's it. Some of them start with some small, very small database instance, paying 25 or some very low number of bucks, right? And it's hard to justify paying right away $150 or $400 or $500 for a full-fledged monitoring solution, and then you need to spend time there and so on. It cannot be justified easily. And also, I wanted to mention it's quite elastic. So if you just inject this tool inside your Postgres database, it starts collecting inside,

Starting point is 00:09:56 like self-observed. Yeah. And you pay a little bit for those megabytes per day. I don't know. I think since I helped to rewrite the storage, it's quite efficient, and I again used this approach from PgQ, rotation of partitions and truncate, so it's very efficient and so on. And I'm just saying, you pay a little bit and it's self-observed, right? And I was thinking about the pros and cons, self-observed versus externally observed. Ideally we need to have both, actually, because you cannot install

Starting point is 00:10:32 an agent, right, on an RDS or Supabase machine. So if you observe it from outside with an external monitoring tool, if something bad happens, maybe you don't have connectivity, right? While this thing sitting inside still keeps observing, right? That's right. At the same time, if everything is down, you cannot reach the data, right? So, like, external tools also have benefits. They both have pros and cons if you think about it.

Starting point is 00:10:59 It's interesting. So in my realization, even bigger clusters should have maybe a small thing like this black box or flight recorder, right? While we have a full-fledged solution outside, there are both, like, remote telemetry and something internal, right? Yeah, that's right. And I think I landed on the name pg_flight_recorder. And then at some point I think I had some reservations because I thought, sure, but

Starting point is 00:11:22 you know, in the event of a crash, then maybe the data aren't available and it's not really that useful. But then, I mean, it's not. If an actual airplane crashes, then that airplane also is not really useful either. To find it. That airplane is dead. No one will be using that. Just a side note, I just learned David has a PhD in astrophysics.

Starting point is 00:11:43 So this name is not a random thing, I guess, right? And a master's in aerospace engineering. But at every turn, I was trying to do something else. And I was trying to get away from computers. And I just kept getting sucked back in. But I grew up in the 70s when it seemed like airplanes were crashing all the time when they weren't being hijacked. Mercifully, that doesn't really seem to happen all that often.

Starting point is 00:12:05 But I think, I'm not a pilot, but it's my understanding that actual flight recorders are useful far beyond crash investigation. They're useful for optimization, for troubleshooting, like in-flight incidents. And so I think the nature of this is hopefully a little bit more like that. Nick, you would know better than I would, but I have the feeling that in reality, databases don't really actually crash all that often. What they do is they exhibit behavior, and we want to be able to investigate that behavior. And that's what this helps us do. Briefly, about the tool itself.

Starting point is 00:12:41 Again, all it really does is it takes snapshots of wait events. That's how it started. It's like pg_ash. We developed it in parallel, actually. Yeah. So when I told David that there should be something small which self-observes, he just sent me a link. It's already done.

Starting point is 00:13:00 It was interesting that we had parallel courses of development of pg_ash and pg_flight_recorder. They're very similar in this case. Yeah, very similar. And like that, it captures active session history, wait events. Nature abhors a vacuum, and idle hands are the devil's workshop. You know, with the tools available, I couldn't resist the urge to just keep pouring more into it. So it records lock activity and checkpointer activity and background activity and I/O stats and statement stats. And config changes, right?

Starting point is 00:13:32 Config changes as well. And hopefully the conjecture is that there's some value in, if not capturing everything, having an opinionated and curated set of many things that are captured simultaneously in a correlated fashion so that maybe you experience a checkpoint storm. And then you notice that there has been a config change recently and you're able to bring these things together.

Starting point is 00:13:59 That's the idea. As Nick indicated, it was, when we talked about this, version zero was done. Again, it's a pretty simple tool. I had a few guiding principles, one of which was, sort of the Hippocratic Oath, try to do no harm. I put a lot of effort into making this safe to run. Statement timeouts and so on, right?

Starting point is 00:14:21 Yeah, statement timeouts, circuit breakers, graceful degradation of some of the components, dozens, too many configuration settings, but then configuration profiles that capture those to make it easy to use. I think I got three quarters of the way there or maybe 50% of the way there, but there were still some improvements to be made, among which the storage engine, which we can thank Nick for rewriting using PgQ, or essentially the engine that is part of PgQ. Correct, Nick? Yeah, it's like partitions, rotation, I think daily partitions. There is also a roll-up for old data to have it less precise, not raw, but aggregated.
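
The circuit-breaker idea mentioned above can be sketched generically in Python. This is not pg_flight_recorder's implementation (the tool is written in SQL and PL/pgSQL); the class name `CollectionCircuitBreaker`, the thresholds, and the cool-down behavior are all assumptions chosen to illustrate the "do no harm" pattern: if a collection run was itself slow, skip the next runs for a while.

```python
import time
from typing import Callable


class CollectionCircuitBreaker:
    """Skip collections for a cool-down period after a slow run (generic pattern)."""

    def __init__(self, max_duration_s: float, cooldown_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_duration_s = max_duration_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.open_until = 0.0  # collections are skipped until this time

    def run(self, collect: Callable[[], None]) -> bool:
        """Run collect() unless the breaker is open; return True if it ran."""
        started = self.clock()
        if started < self.open_until:
            return False  # still cooling down, so stay out of the way
        collect()
        elapsed = self.clock() - started
        if elapsed > self.max_duration_s:
            # The collector itself was slow: back off for a while.
            self.open_until = self.clock() + self.cooldown_s
        return True
```

Injecting the clock keeps the sketch easy to exercise without real waiting; in a database setting the same decision could be driven by the recorded duration of the previous collection.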

Starting point is 00:15:02 And everything is already implemented. And I remember I was brainstorming with Claude Code, like what kind of storage we should choose. Because I think originally you used a lot of JSON, right? It's quite bloated, in my opinion, sometimes. Well, it wasn't JSON, but I had originally, I was, you know, I was using skip locked and unlogged tables in a vain attempt to mitigate dead tuples and bloat, but if you're not diligent, then they're still there. And so that's why you rewrote the engine.
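
The rotate-and-truncate storage discussed above can be modeled as a small ring of partitions. The Python sketch below is only a toy model under stated assumptions: the class `RotatingStore` and its methods are hypothetical names, and the in-memory `clear()` stands in for what, in Postgres, would be a TRUNCATE of the oldest partition. The point of the design is that old data is discarded in bulk, rather than row by row with DELETE, which is what leaves dead tuples and bloat behind.

```python
from typing import List, Tuple

Row = Tuple[int, str]


class RotatingStore:
    """A ring of fixed partitions; rotating empties the oldest one in bulk."""

    def __init__(self, partitions: int):
        self.parts: List[List[Row]] = [[] for _ in range(partitions)]
        self.current = 0  # index of the partition receiving writes

    def append(self, row: Row) -> None:
        self.parts[self.current].append(row)

    def rotate(self) -> None:
        """Advance to the next partition and empty it, as TRUNCATE would."""
        self.current = (self.current + 1) % len(self.parts)
        self.parts[self.current].clear()  # bulk discard, no row-by-row deletes

    def rows(self) -> List[Row]:
        """All retained rows, oldest partition first."""
        n = len(self.parts)
        order = [(self.current + 1 + i) % n for i in range(n)]
        return [row for idx in order for row in self.parts[idx]]


# Three partitions, one rotation per simulated day, five days of samples.
store = RotatingStore(partitions=3)
for day in range(5):
    if day > 0:
        store.rotate()
    store.append((day, "sample"))
print(store.rows())
```

With three partitions and five days of writes, only the last three days remain; retention is simply the number of partitions times the rotation period.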

Starting point is 00:15:36 But also, like, the data format is interesting. And I had multiple ideas, and I have some, like, brainstorm document where I was thinking what to choose, and some ideas were compressing data quite a lot. But it was hard to deal with because it was basically encoded so much that it's inconvenient. So I made, like, a trade-off choice. It should be human-readable even in raw form. Although I did apply some tricks from pg_ash as well. Like, timestamps are relative to, I think, I don't know, 2020 or something.
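
The relative-timestamp trick mentioned here can be illustrated with a short Python sketch. The exact epoch and integer width used by the tool are not stated in the conversation, so both are assumptions: the sketch uses 2020-01-01 UTC and an unsigned 32-bit count of seconds, and the names `CUSTOM_EPOCH`, `encode_ts`, and `decode_ts` are hypothetical. Four bytes is half the size of an 8-byte PostgreSQL timestamp.

```python
import struct
from datetime import datetime, timedelta, timezone

# Assumed epoch; the conversation only says "2020 or something".
CUSTOM_EPOCH = datetime(2020, 1, 1, tzinfo=timezone.utc)


def encode_ts(ts: datetime) -> bytes:
    """Pack a timestamp as unsigned 32-bit seconds since CUSTOM_EPOCH."""
    seconds = int((ts - CUSTOM_EPOCH).total_seconds())
    return struct.pack("<I", seconds)


def decode_ts(raw: bytes) -> datetime:
    """Inverse of encode_ts, with one-second resolution."""
    (seconds,) = struct.unpack("<I", raw)
    return CUSTOM_EPOCH + timedelta(seconds=seconds)


sample = datetime(2025, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
packed = encode_ts(sample)
print(len(packed), decode_ts(packed))

# Largest value that still fits in the 4-byte counter.
last_representable = CUSTOM_EPOCH + timedelta(seconds=2**32 - 1)
print(last_representable.year)
```

Under these assumptions the counter does not overflow until well after the year 2100, at the cost of dropping sub-second precision.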

Starting point is 00:16:08 Like a Unix timestamp, but shifted. So we have capacity until the end of the century. And you have as few bytes wasted as possible, like a very compact way. Plus this PgQ-style rotation. And also, worth mentioning, there is like a soft requirement, pg_cron. It's not a requirement, but it's very recommended, because this is how it's ticking as well, right? That's correct. So again, it's a simple tool. It's two sort of packages, two simple install scripts, two schemas, one of which is required, one of which is

Starting point is 00:16:39 optional. The part that's required, it's the data model, the tables and the views and the functions to record those data. And then the other optional piece is a set of functions for analyzing those data, but again, they could be analyzed in raw form in whatever way you like. But as Nick indicated, somebody has got to generate the ticks. Somebody has got to force the samples. And that could be pg_cron. It could be something like an outside scheduler. Somebody's got to do it.
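
For the "outside scheduler" option mentioned above, a minimal ticker could look like the Python sketch below. Everything here is an assumption for illustration: `run_ticker` is a hypothetical helper, and `take_sample` stands in for whatever call triggers a sample (for example, executing the tool's sampling function over a database connection, which is deliberately left out). The loop schedules against the intended tick times rather than sleeping a fixed amount, so slow samples do not make the ticks drift.

```python
import time
from typing import Callable


def run_ticker(take_sample: Callable[[], None], interval_s: float, ticks: int,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Call take_sample() every interval_s seconds, `ticks` times, without drift."""
    next_due = clock()
    for _ in range(ticks):
        take_sample()
        next_due += interval_s
        delay = next_due - clock()
        if delay > 0:
            sleep(delay)  # wait only for the remainder of the interval


if __name__ == "__main__":
    # Placeholder sample action; a real one would talk to the database.
    run_ticker(lambda: print("tick"), interval_s=1.0, ticks=3)
```

If a sample takes longer than the interval, the loop simply proceeds to the next tick immediately instead of sleeping.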

Starting point is 00:17:10 The sort of default way is with pg_cron. And pg_cron is available everywhere. Yeah, pg_cron sort of snuck in before the sort of iron curtain started to drop on extensions. Maybe. So it's in a lot of places at least, you know. Yeah, maybe that's true. I got the impression it solved such a useful problem and was from such a reputable author that I think people trusted it and also thought it's simple enough that we can maintain it. I understand why managed service providers don't just offer any extension. But if you think about how much work it would be to maintain pg_cron if the author ditched it or, you know,

Starting point is 00:17:48 actually it's not huge and it's so useful. And it should be in core. That's all of it. That's true. I wish pg_cron was in core. And we would have, for example, automated new partition creation out of the box without any extensions.

Starting point is 00:18:05 It's magic. Such a simple thing actually, right? But the one thing about pg_cron from a database perspective is, I think, once per second is the lowest you can schedule. So what do you use, David, when you're using this with people? Do you use pg_cron with like a one-second tick, or do you suggest something else?

Starting point is 00:18:30 So far I have used it just with pg_cron. The resolution has been sufficient so far, you know, because with the customers that I've worked with, the resolution before has been, I guess, infinite, as in they didn't have this at all. So it's just worth having the data. And a finer resolution, I haven't encountered a demand for that or a need for that yet, although it's certainly plausible. But again, yeah, that has worked well so far. And again, it's just a simple tool.

Starting point is 00:19:01 The idea, the objective anyway, is sort of set it and forget it. Install the tool, then forget that you've installed the tool. Because it's safe to run, by virtue of Nick's hard effort. It is safe to run. So just forget that it's there. And then you have an incident. And then you think, oh, wait, I have pg_flight_recorder. Let's find out what happened. And just point your AI at the data, maybe a dump of that data or something. And that's it. And this second package you mentioned, it has interesting functions, like what happened at or something, right? It's based on function names. I see already your thought, oh, AI should guess,

Starting point is 00:19:43 right? You designed it so it's self-explanatory, right? So it's great. But I also wanted to say a little bit about pg_cron. Since version 1.5, as I remember, the lowest resolution is once per second, and I use it for pg_ash. But I guess you use it by default at much less frequency, especially for ASH data, right? Maybe once per 30 seconds or 60, but it's tunable, right? Everything is tunable. You know, default sample collection is, I think I haven't said to once per second, but there are some that are taken at a coarser resolution. There are roll-ups that are taken at a coarser resolution.

Starting point is 00:20:22 Data are archived at a coarser resolution. Then there's retention. For the core tables, I think my default is seven days, for the aggregates of the seven days, and then for snapshots, I think it's by default 30 days. But all of these are configurable. So there are a few different cadences that are happening. And there is, again, as you indicated,

Starting point is 00:20:45 on the analyze side, there is a wall of functions appropriately named, meant to be understandable by humans and by AI alike, so that they can use these functions to analyze the data. But then, of course, it's always available to be analyzed in raw form as well. Yeah. I can share an interesting story from PgQ about function names. So when I was dealing recently with client libraries for PgQ, multiple times Claude Code made a mistake because there is a function force tick, but it's not ticking, it's just shifting this pointer. And then you need to run the ticker in a separate transaction. And Claude couldn't get it because it's a confusing name, actually. And it made the mistake multiple times developing this.

Starting point is 00:21:33 And I had a huge flashback to 15-plus years ago, when I made the same mistake manually, without AI, because it was also confusing to me 15 years ago. So I just renamed that function to force next tick. So you need to understand this is about the next tick. You're not doing it, you're just preparing this job. And looking at your functions, what happened at, incident timeline, I'm just thinking this is self-explanatory, like maybe long, but everyone will understand what it is for. So it's worth making it long. Yeah. That's right. And if there are too many functions, then again, we can use AI to paw our way through and figure out which ones to use. Just the final thing. I mean, it's meant for a few things, not just incident response, but also capacity planning, blast radius evaluation.

Starting point is 00:22:23 You know, again, where I intend to go with this is just getting back to Supabase briefly. I have lots of customers that I have to, I should say, I'm blessed with helping, but so many of them I want to get to early. It's, you know, again, we talked about this, but all of these databases are small. They start out small anyway. I would say from a certain point of view, from the point of view of scale, many of them are sort of doing things wrong, but that's okay because they're small.

Starting point is 00:22:57 You can do, with a small database, you can do everything wrong and it's fine, no problems. But it's when you start to scale that you need to think about this. So it's like exercise. It's something you have to get into the habit of doing it early and often, even though you don't want to. And maybe you would benefit from, a personal trainer and some encouragement to get you on the path early so that it pays dividends

Starting point is 00:23:19 when you're old like I am, and when you're a big database, like some of these eventually will become. I wanted also to mention, like, by default it's consuming up to a couple of gigabytes for those seven days, right? But again, it's tunable. Or less, yeah. It's tunable. It's a few gigabytes. Nick, I think you and I benchmarked this. I don't remember, with the new storage engine, I think we estimated maybe, on the happy path, I think around like 20 gigabytes for the month. It depends. Yeah, one of my goals is to sort of draw more attention to this so that I could get feedback and improve it, and there is some low-hanging fruit to be plucked insofar as data retention. Yeah, I think it depends on, like, how many queries you have in

Starting point is 00:24:11 pg_stat_statements, by default up to 5,000. And also, like, you collect data about indexes and tables, so how many tables and indexes you have. And I think you have limits there, but still, like, it depends a lot on the cardinality of these things. And about use cases. I used it recently for benchmarking PgQ. So I just, and for me it's so natural.

Starting point is 00:24:35 I had multiple projects like this already, and I just, okay, I injected both pg_ash and pg_flight_recorder, because pg_ash has more frequency and more details about ASH data, and pg_flight_recorder brings a lot of stuff, right? So I just injected it into some synthetic database, provisioned with pg_cron configured. And I just asked AI, of course, to do it, right? So just inject it. And then don't forget to dump after each run.

Starting point is 00:25:00 And then visualize it. That's it. Only three sentences. Yeah, exactly. And this is how I created a beautiful looking. I actually asked to animate benchmarks because it's great to look at how lines go. And this is what brought PgQ good attention, because this data is easy to understand. So for example, how much WAL was generated, right?

Starting point is 00:25:24 A lot of stuff. What was the behavior of the checkpointer or autovacuum and so on? That was going to be my next question, actually, on the WAL front. You mentioned a few minutes ago about how you originally went with unlogged tables. Does that mean these are now logged and there is WAL generated? It was my decision. It was my decision to say that, first of all, an important limitation of all those tools which are ticking on pg_cron and write something, and it's only PL/pgSQL.

Starting point is 00:25:53 It's primary only, right? But we live in this strange situation for me, an old DBA, when a lot of clusters are single-node. I even have cases where clients are coming with like 10-plus, like 15 terabytes on a single node, and they are fine. Cloud, like, resources became quite relevant. For some, actually, I think it's okay, because backups matter more than HA. Because they are fine to be down, but not to pay for an additional couple of nodes and so on. So anyway, this is primary only because we cannot write. Yeah, can I add something to that?

Starting point is 00:26:26 Because you say single node, but I think that's slightly simplistic, because a lot of the ones I see, they're HA, but the replicas are not read replicas. They're like a failover replica, a shadow standby node. I wouldn't call that a one-node cluster, but still you only need to monitor the primary. You're right, actually. You're right.

Starting point is 00:26:44 But in this case, you are not interested because you don't have any workload on that hidden standby, right? Exactly, exactly. So you are interested in the primary, and we see so many projects reaching dozens of terabytes already, which are single-node.

Starting point is 00:26:59 You inject it, we need to understand, okay, it's self-recording, so it's going to produce some writes. If it's an unlogged table, if it crashed, it's gone. That's the key idea. We cannot... At least the...

Starting point is 00:27:11 Where the data lands initially, those data would be gone, right? When they were unlogged tables. You need to snapshot, you need external means. So to understand the incident after a crash, we should use regular tables. And since we've redesigned the storage, it's not so super expensive. Of course, there is some WAL to be written and some data storage to be paid for. And of course, a little bit of shared buffers is occupied by our data.

Starting point is 00:27:38 If you have replicas, it goes to replicas. Maybe it's not a bad thing because if it's half of primary, we can pay this price. What are we roughly talking? You mentioned a few gigabytes, up to maybe 20 gigabytes, like that kind of amount for storage. What are we talking about in terms of WAL generation by default, just to give people a rough idea?

Starting point is 00:27:58 I don't remember. I thought about baby clusters, like one-gigabyte ones, like free tiers up to one gigabyte, right? So I thought they should afford this, maybe with a little bit of tuning to less frequency or something retention-wise. Are you talking about storage now or WAL too? Both, both. Okay. They are connected, actually.

Starting point is 00:28:19 If you need to write. You mean on Supabase? Anywhere, any Postgres. If you need to write 100 megabytes to storage, you will produce, like, very roughly, kind of close to 100 megabytes of WAL, because this is the same data. Yes, in a different form, but it's the same data, right? If you need to write 100 times more,

Starting point is 00:28:40 expect 100 times more WAL. Yeah, order of magnitude, it would be about the same. Yeah, very roughly. Of course, like full-page writes, WAL compression, but it's very different. I also think of them very differently because with WAL, I think of it as like megabytes per second always. There's always like a time component, if that makes sense.

Starting point is 00:29:00 So it's like a constant amount that we're generating. Of course, over a month or whatever it is, that's a few gigabytes. But I guess that doesn't actually add up to very much per second in terms of, yeah. I think we should expect something like 100 to a few hundred megabytes per day with a lot of queries and indexes and so on. Yeah, if you like ballpark math, if it were 30 gigabytes of data per month, that would be roughly, I guess, by the power of arithmetic, maybe a gigabyte per day. It's very stable.
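
Carrying that ballpark arithmetic through to a per-second rate is a one-liner, sketched below in Python. The only inputs are the figures quoted in the conversation (roughly 30 gigabytes per month); the 30-day month, decimal units, and the helper name `average_rate` are assumptions. Note that a day has 86,400 seconds, so a gigabyte per day works out to kilobytes per second rather than megabytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
GB = 10**9  # decimal gigabyte, assumed for simplicity


def average_rate(bytes_per_month: float, days_per_month: int = 30):
    """Return (bytes per day, bytes per second) for a steady monthly volume."""
    per_day = bytes_per_month / days_per_month
    per_second = per_day / SECONDS_PER_DAY
    return per_day, per_second


per_day, per_second = average_rate(30 * GB)
print(f"{per_day / GB:.1f} GB/day, {per_second / 1000:.1f} kB/s")
# prints "1.0 GB/day, 11.6 kB/s"
```

Since the WAL volume is very roughly the same order of magnitude as the data written, this also serves as a first-order estimate of the steady WAL rate.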

Starting point is 00:29:35 And by 3,600, you could figure out megabytes per second of WAL generation. Kilobytes maybe already, right? And it's very stable because it depends only on this cardinality. And if you have some spikes of workload, it doesn't affect the amount of data these snapshots write to WAL and the data directory, right? Yeah, it's a baseline. Yeah. So just, you know, you're paying maybe $25 to Supabase, maybe pay $26

Starting point is 00:30:03 and just pay for a little bit more storage. Not hundreds more, as you would pay if you installed full-fledged monitoring. And if it's only, if it's to first order for the primary node, again, all of these small projects are starting out with only one node anyway. I mean, life is complicated and these people are just trying to get, like, a business off the ground and a job done. They're not thinking about multiple nodes, especially early on. But, I mean, we three know that they will need data to guide them on their journey.

Starting point is 00:30:44 So this is just part of that. I mean, Michael, you work in the observability space. If we sort of dilate this out to a wider view, this is just another entry in the observability space, like maybe a new generation 0.5 of observability tools for PostgreSQL, but, I mean, and there's pg_ash, there's... So it's not only about observability. I see actually the words "new kind of breed" you used, I think, in our discussions. So I have pg_ash, you have pg_flight_recorder. I'm also trying to revive PgQ in this very format, pg_cron and PL/pgSQL only. That's it. So it can be

Starting point is 00:31:20 installed anywhere and just tick. I also have index which is not yet released, which is rebuilding indexes on pg_cron. That's it. You can inject PL/pgSQL and tick on pg_cron. That's it. It's super simple. I already think about a tool for automated partition creation without heavy tools. I don't know. It should be easy to use. But then you mentioned pg_tle. Yes. So can you, like, maybe elaborate a little bit on why pg_tle? Why not just a single SQL file or PL/pgSQL file? It's both. So for those who don't know, TLE is Trusted Language Extensions, which I have my own view on. I regard it as just a little bit of extra

Starting point is 00:32:08 housekeeping that's associated with just a simple SQL install file. But, you know, TLEs sort of dress up SQL and PL/pgSQL code as if they were a sort of managed extension, but they can be installed without superuser privileges. pg_flight_recorder comprises both. There are just simple install scripts, and you can use psql to install it. But it also is available as a trusted language extension that can be installed through, I think, dbdev, because that's how some people want to be able to install. Just to track, like, metadata. Yeah, it just makes housekeeping a little bit easier. You can slide pg_flight_recorder in with an install, and if you don't like it, you can uninstall it in a very managed fashion. So that's all that's meant there. But yeah,

Starting point is 00:33:00 TLE is a very lightweight way to have managed extensions, and Flight Recorder offers that as well. Yeah, I actually wish pg_cron and TLE were both inside Postgres itself. Yeah. And we would say something like CREATE PACKAGE or something, I don't know, and it's just a bunch of SQL and PL/pgSQL code, or maybe PL/Python if you want, anything. And it can just be installed anywhere, with versioning and so on, with tracking of CVEs, if there are any, and so on. Who knows? Yeah, but extensions just don't feel like a part of the extensibility of Postgres to me anymore. This is my honest feeling lately. Well, what also worries me is

Starting point is 00:33:44 who is testing? I mean, with major version upgrades, let alone minor versions, who's testing all of these extensions? But this question is also applicable to any regular backend code. You use some libraries. You just import them into your code, include them somehow, right? And that's it. And versions also matter there. And it's on your shoulders, right? It should be on your shoulders.

Starting point is 00:34:09 This idea that we are not providing some extensions because we will need to maintain them: give it to the shoulders of people, right? This is a different part of the thing. This is in tension with products which telegraph or advertise that we're easy to use and you don't need to worry, we will handle it for you. Yeah.

Starting point is 00:34:29 But I think it's great to have flexibility, and people can choose what to use, but they need to be responsible for upgrades and part of the maintenance, as they already are for libraries in Go, in any language, right? TypeScript and so on. So there is something here. And I think it's the AWS guys who created pg_tle, right? So definitely this project was created with the realization that something is limiting people here, and let's bring something here. That's a great idea. And I wish they learned. Michael, do you experience that at all with your customers?

Starting point is 00:35:04 Their challenges. I mean, I know you work in a slightly different space, but you certainly must encounter this as well, like tensions with extensions, with managed database providers. Yeah, so I don't speak to people all the time about this kind of thing, but I get the impression that people are looking for a little bit of advice, almost, from their managed service provider, on which extensions they should trust, which are the best at what they do. You know, often there's a choice of two or three, and they kind of want their managed service provider to pick one and say,

Starting point is 00:35:40 you know, this is the one we suggest or this is the one we support. And I feel like there's a little bit of that going on as well. I don't know if it's kingmaking or something, but people want someone saying this is the one everyone's using. When people come to Postgres for the first time, for example, they're like: which backup should I use? Which monitoring? What's everyone else using? And there's no official one; there are barely any extension management systems. There have been about three or four created, and I think PGXN is still probably the most used? But it doesn't have reviews, it doesn't have a lot of the things people are looking for in terms of which ones are actually used, which ones people actually like, which ones have a good track record when it comes to major versions, or low, maybe no CVEs or very few, you know, that kind of thing. So I think trust is a big part of it, and also people want a shortcut as to which of these they should be using.

Starting point is 00:36:37 I should agree. Supabase's database.dev, it's another attempt to have this registry of extensions, right? It is yet another attempt, which seems to be somewhat honored in the breach. But it's yet another attempt. But Michael, I take your point definitely that there is a need, and I would say a growing need, for, if not kingmaking, at least someone to offer guidance and to bless these. You would know as well as I would, the sort of persona for database operators

Starting point is 00:37:08 that definitely does seem to be changing. I mean, there once was a time when databases were an arena for people to sort of develop and then project expertise, which is certainly true. But more and more I encounter people, customers, Supabase users, whoever it may be. They will say to me or to us, I don't know what I'm doing. I'm not a DBA. Some of them will say, I'm not even technical. I'm a founder.

Starting point is 00:37:37 And I just vibe-coded my way into this. And, you know, there's less of an urge now than there was in the past to sort of burnish your credentials as a database expert. People are very happy to be candid. And they will say, I am not a database expert at all. So please, can you help us? Can you offer guidance? If you tell me what extension to install, I'll install it. So there's a growing need for those kinds of tools

Starting point is 00:38:04 and for greater and better automation. Oh, yes. That can be a topic for another session. That's where my mind is going. And I agree, actually, about this authority on what's good, what's reliable. Sometimes I have cases where we have huge self-managed Postgres clusters, and when I'm saying we should add some extension, I'm saying it's available on this and this managed platform. So it's reliable, you know.

Starting point is 00:38:35 Let's add it. It helps me to speed things up. So yeah. I had a couple of last things it would be great to get your thoughts on. One is, is there anything that we haven't talked about that is one of your favorite features of the tool? I've seen there are quite a few in there. And then also, is there anything missing that you really want to add?

Starting point is 00:38:57 In reverse order, things that I want to add: again, I would really like to fortify this against observer effect, against deleterious consequences. I already know that there are some important fixes to be made, and I'm committed to doing that. My tender hope is that people will use this to some degree so that I can get feedback. If it needs to be strengthened, I will strengthen it. I will pour effort into it to make sure that happens. So I wouldn't say there are features that I want to add. It's probably already bloated in terms of features anyway. So maybe I'll just stop in terms of adding features. In terms of favorite features, I mean, capacity planning, Nick, you know, is another worthwhile endeavor besides incident management. And it's something that is sorely needed for

Starting point is 00:39:54 Supabase customers. There are functions within Flight Recorder to help you project your capacity needs. Like everything else, those functions will be strengthened and improved, but I really would like to exercise those. I'd rather this tool be used to forestall problems than to investigate them. Let's just not have problems at all. You touched on several things.
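
To make the idea of projecting capacity needs concrete, here is a minimal sketch of one common approach: fit a least-squares line through periodic size samples and extrapolate to a limit. It is not how Flight Recorder's own functions work (the episode does not describe them); the function name `days_until_full` and the sample numbers are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def days_until_full(samples: Sequence[Tuple[float, float]],
                    limit_bytes: float) -> Optional[float]:
    """Estimate days left until a size limit is reached.

    samples is a sequence of (day, size_bytes) observations.
    Returns None when the fitted trend is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")

    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n

    # Ordinary least squares for y = intercept + slope * x.
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    if slope <= 0:
        return None  # no growth, so no projected exhaustion
    intercept = mean_y - slope * mean_x

    # Day on which the fitted line crosses the limit, relative to the last sample.
    day_full = (limit_bytes - intercept) / slope
    last_day = max(x for x, _ in samples)
    return max(0.0, day_full - last_day)


if __name__ == "__main__":
    # Hypothetical daily database sizes in GiB, with a 20 GiB disk.
    print(days_until_full([(0, 10.0), (1, 12.0), (2, 14.0)], 20.0))
```

A straight line is the simplest possible model; real growth is often bursty, so a projection like this is a prompt to look closer rather than a forecast to rely on.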

Starting point is 00:40:19 I wish we had a separate episode on how we would approach incident response and RCA with this tool in particular, step by step. It's possible. And also capacity planning, I agree. And a thing related to observer effect: one thing everyone who is using this new breed of tools ticking on pg_cron should remember is that pg_cron records logs, right?

Starting point is 00:40:45 And you need to clean them up. And I think we need to team up and bring some pull requests to make this configurable and make it the Unix way, the Linux way: when everything is cool, don't say anything and don't log anything, because it works, right? Like some levels, right? Like a warning or error level for each job in pg_cron. And another thing is this verbosity,

Starting point is 00:41:09 which in pg_cron also depends on frequency, and it can produce a lot of bloated logs right in Postgres as well. Yeah, especially if we go to sub-second frequency, which was just implemented for PgQ: just running a single stored procedure, ticking 10 times within one second. Yeah, I don't know if it's needed for PG Flight Recorder. Maybe not at this point.

Starting point is 00:41:34 It's too much precision, right? Yeah, anyway. Anyway, yes. So there is a soft dependency on pg_cron. pg_cron is also maybe a little overly chatty. My kitchen wall clock just silently ticks away. It doesn't generate a daily journal of the fact that it ticked. Yes.
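
The cleanup mentioned above can be sketched as follows. This is a minimal illustration, not part of PG Flight Recorder: the helper name `build_cleanup_job`, the job name, the schedule, and the seven-day retention are all assumptions. It relies on pg_cron's `cron.schedule` function and its `cron.job_run_details` history table, and it only builds the SQL text so it can be reviewed before being run with psql.

```python
import re


def build_cleanup_job(retention_days: int = 7,
                      schedule: str = "0 3 * * *",
                      job_name: str = "purge-cron-run-details") -> str:
    """Return SQL that schedules a recurring purge of old pg_cron run history."""
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    # Keep the embedded values boring so they are safe inside quoted SQL.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", job_name):
        raise ValueError("job_name may only contain letters, digits, _ and -")
    if not re.fullmatch(r"[0-9*/,\- ]+", schedule):
        raise ValueError("schedule must be a plain five-field cron expression")

    # The statement pg_cron will run: delete finished runs older than the cutoff.
    purge = ("DELETE FROM cron.job_run_details "
             f"WHERE end_time < now() - interval '{retention_days} days'")
    # Dollar-quote the command so its own single quotes need no escaping.
    return f"SELECT cron.schedule('{job_name}', '{schedule}', $${purge}$$);"


if __name__ == "__main__":
    print(build_cleanup_job())
```

pg_cron also has a `cron.log_run` setting that turns off recording into that table altogether, where the hosting platform allows changing it; a scheduled purge like the one above is the option that needs no configuration access.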

Starting point is 00:41:54 I would like... This is what we have to live with. So yeah. Let's create a pull request. I can create it or you can create it, and we just support each other. Maybe the pg_cron maintainers will agree. Actually, I opened an issue. I didn't see feedback from them, but it's definitely an issue. It was an issue before we started creating these tools. I have other places where pg_cron was chatty in logs. Yeah, anyway. And a final thing from me: a good place to start is benchmarking. If you just do benchmarks, just inject these tools, inject PG Flight Recorder,

Starting point is 00:42:28 and ask your AI to visualize the result of that data. Like, before destroying the instance or the database, just dump that data and visualize it. It's so easy these days. It is, yeah, the nature of these tools is changing, and this makes it super easy. That's what I learned as well. But yeah, that's it.

Starting point is 00:42:45 It's not very profound. It's not very complicated. But we need more and better automation in the community. This is just one small contribution. Many more to come, I think, from you two and hopefully me and others in the community. So looking forward to that. Nice one, David. And just to check, what's the license here?

Starting point is 00:43:04 It's as generous as I could make it. Nice. So open and permissive. Yeah, exactly. Supabase style. Supabase style. I work at Supabase, but this belongs, if to anybody, to the community. Wonderful.

Starting point is 00:43:20 Lovely to meet you. You as well. Very kind. I appreciate it. Thank you. Thank you.
