Postgres FM - PgQue

Episode Date: May 8, 2026

Nik and Michael discuss Nik's new project PgQue, a descendent of Skype's PgQ, for running queue-like workloads in Postgres. Here are some links to things they mentioned:

Our first episode on Queues in Postgres https://postgres.fm/episodes/queues-in-postgres
PgQue https://github.com/NikolayS/pgque
HN discussion https://news.ycombinator.com/item?id=47817349
PgQ https://github.com/pgq/pgq
pgmq https://github.com/pgmq/pgmq
River https://riverqueue.com
Keeping a Postgres queue healthy (blog post by Simeon Griggs / PlanetScale) https://planetscale.com/blog/keeping-a-postgres-queue-healthy
Postgres Job Queues & Failure By MVCC (blog post by Brandur) https://brandur.org/postgres-queues
My queries to monitor autovacuum (blog post by Laurenz Albe) https://www.cybertec-postgresql.com/en/monitor-autovacuum-my-queries/
SELECT FOR UPDATE considered harmful (blog post by Laurenz Albe) https://www.cybertec-postgresql.com/en/select-for-update-considered-harmful-postgresql/
Christophe Pettus blog post https://thebuild.com/blog/2026/05/03/pgque-two-snapshots-and-a-diff
Our episode on pg_ash https://postgres.fm/episodes/pg_ash
Rediscovering PgQ (Alexander Kukushkin slides) https://speakerdeck.com/cyberdemn/rediscovering-pgq
Tick frequency tuning https://github.com/NikolayS/PgQue/blob/main/docs/tick-frequency.md

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and I'm joined as usual by Nik, founder of Postgres.ai. Hey Nik, how's it going? Hi, Michael. Everything is all right. How is your business and life?
Starting point is 00:00:15 Yeah, good. We're in spring in the UK and it's just getting a bit warmer now. And yeah, business is good, ticking along. I've got some upcoming news soon, actually, that I'll publish in the newsletter. That's great, yeah, looking forward. How about you? Yeah, obviously I'm a guest today, right? Yeah, what are we talking about?
Starting point is 00:00:36 We're talking about queues again, queues and Postgres. My favorite topic, remember, I told you so many times: I like to observe how many of them are created and how many of them have issues. Actually, all of them have issues. Now I've dug into the topic deeper, and actually I had surprises: my understanding in the past was not fully correct.
Starting point is 00:01:02 And I'm going to confess today where I was not right, like, where there were gaps. Yeah. And just to clarify, when you say that they all have issues, do you mean queue implementations inside Postgres, or inside a relational database, an OLTP database? Yeah. So when we have a new client, for example,
Starting point is 00:01:24 at Postgres AI, our most popular type of client is a startup without a DBA, who are on RDS or Cloud SQL or Supabase, whatever. And they have issues because they have growth. This is my favorite type of client. And they bump into some problems. We check, we have various tooling for health checking, and almost always we recognize one of a few patterns,
Starting point is 00:01:50 like a log-like, append-only, unpartitioned, huge table, or a table, usually also unpartitioned, which receives some events to process, and it gets updates and deletes. And if this startup is more mature, we see some, like, pgmq, for example. Or if it's smaller, usually nothing. And everything is wrong there, starting from heavyweight lock contention, right?
Starting point is 00:02:22 But also bloat and a lot of complaints. Are we talking about, like, a self-rolled queue inside the database? Yeah, so like a naive implementation of a queue in Postgres. You can see it naturally just looking at pg_stat_all_tables, noticing the patterns of workload: a lot of updates and deletes, but also a lot of dead tuples, and autovacuum, if it's untuned especially, cannot keep up; but also if they have long-running transactions or other reasons to block the xmin horizon.
Starting point is 00:02:50 We need to discuss it slightly deeper today. They will have a lot of bloat accumulated, and all latencies suffer. And this piece of workload, this table, usually just one table, becomes a hot spot. This is a huge reason for them to complain about how Postgres is bad. And I don't fully disagree.
Starting point is 00:03:11 Who didn't complain about Postgres and MVCC and vacuum, all that, like, being a headache all the time? Everyone did, right? So, yeah. And I usually said: all you need is two things.
Starting point is 00:03:30 And actually not only said it, I came to some implementations of queues, less naive ones, and even helped them. For example, long ago, there was a project called Delayed Job in Ruby. And I added a couple of things, like an index,
Starting point is 00:03:48 which was easy, like just a missing index. But also I said: let's use SKIP LOCKED, FOR UPDATE SKIP LOCKED. So you just don't need this heavyweight lock contention when multiple sessions compete to update or delete the same row. It's quite straightforward.
Starting point is 00:04:07 And to some others, I had always said, like, you just need two things: SKIP LOCKED and partitioning. And as I understand, this is where everyone went. So SKIP LOCKED was created roughly 10, 11 years ago, actually 11, in 2015. I think it was Postgres 9.5, right, because it's 2015. Yeah.
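For reference, a minimal sketch of the consumer query pattern being described here, with hypothetical table and column names (a jobs table with a status column), not any particular tool's schema:

    -- each worker claims a small batch of pending jobs; rows already
    -- locked by other workers are skipped instead of waited on
    with next_jobs as (
        select id
        from jobs
        where status = 'pending'
        order by id
        limit 10
        for update skip locked
    )
    update jobs
    set status = 'processing'
    from next_jobs
    where jobs.id = next_jobs.id
    returning jobs.id;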
Starting point is 00:04:31 And it's a great feature to get rid of heavyweight lock contention. But it's not enough. It doesn't solve the bloat problem. Actually, somehow, in my recent work over the last few weeks, in Hacker News discussions and some other places, I noticed people think that SKIP LOCKED will solve their bloat issues, vacuum issues. It's not so. So partitioning plus SKIP LOCKED is quite good enough, and this is where everyone went. I think in pgmq, actually in all modern queue systems in Postgres, they love SKIP LOCKED.
Starting point is 00:05:10 They are, like, all about SKIP LOCKED. To some extent, when my AI bot started, we did a lot of research, a lot of experimenting, benchmarks. So at some point I noticed they started to name all these guys 'SKIP LOCKED architecture' somehow. Sometimes I don't like it. I like 'update-delete architecture' more. So, update-delete queue systems, not SKIP LOCKED systems. Because SKIP LOCKED, it's shifting too much attention to itself. But SKIP LOCKED is a simple thing.
Starting point is 00:05:38 Just let's get rid of heavyweight lock contention. That's it. Other problems, which are actually bigger and harder to solve, are not eliminated. They can only be mitigated with partitioning and rotation. And I saw, for example, pgmq, it's quite popular. I think this is a good legacy from Tembo. Yeah. It's supported, I think it's supported on Supabase, quite popular there.
Starting point is 00:06:06 And they actually also went to get rid of the need for CREATE EXTENSION. So they re-implemented it fully in PL/pgSQL. Oh, wow. And then in the form of, like, a trusted language extension. You know, pg_tle, trusted language extension. Since it's purely PL/pgSQL, you don't need to ask the provider to support it. If the provider supports pg_tle, you can have it.
Starting point is 00:06:31 And not only load it as just a SQL file, but you can have it as a tracked extension without provider support, because it's just PL/pgSQL. So they also focus on SKIP LOCKED, and they have support for partitioning, but it requires pg_partman and additional effort. Right?
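A hedged sketch of that trusted-language-extension idea: because the whole extension body is plain SQL and PL/pgSQL, it can be registered through pg_tle's install function without filesystem access on a managed provider. The extension name and body below are illustrative, not pgmq's actual packaging:

    select pgtle.install_extension(
        'my_queue_ext',                              -- hypothetical name
        '0.1',
        'queue implementation in pure PL/pgSQL',
        $sql$
            create table my_queue (id bigserial primary key, payload jsonb);
            -- ... the rest of the extension's SQL / PL/pgSQL objects ...
        $sql$
    );
    create extension my_queue_ext;  -- now behaves like a regular extension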
Starting point is 00:06:50 And just a question about their partitioning: is it, like, time-based partitions and you detach and drop them over time? Or is it a rotational thing? Like, I don't remember, but it doesn't matter, actually. So what matters here: you cannot rotate partitions every minute. It's not practical. Sure, sure, sure. And it will actually lead to some other issues. So partition rotation should happen less often. And by the way, for those who use partitioning to mitigate bloat, it's a great idea, but instead of detaching and attaching partitions, which I see in some queue systems, it's better to have several static partitions. I mean, stop creating them; it will lead to catalog bloat eventually, right? And also, detaching and attaching has its own issues under heavy loads. It's better just to truncate
Starting point is 00:07:43 and have rotation, like round-robin over partitions, and you just truncate them, and that's it. So it's much better in many senses. And this is what PgQ does. So I always said, like, these two things are enough, but I realized that they are not. That's the, like, yeah... Can I just, when you say PgQ, are we talking about your new tool, PgQue, or PgQ, the Skype-based original? Because the letters are just P-G-Q. Let me, yeah,
Starting point is 00:08:04 let me explain how it all started. So on our podcast, I was saying: that's it, just SKIP LOCKED and partitions, rotation or something. SKIP LOCKED felt natural because this is how we solve heavyweight lock contention when you have multiple backends competing to update or delete the same row.
Starting point is 00:08:35 Yes, yes. Inserts cannot compete. Inserts don't need it. They are independent, right? But updates and deletes, they compete. And when I said partitioning, I always said: look at PgQ, which Skype created 20 years ago. Yes.
Starting point is 00:08:50 And that's it. And I thought it's enough. I thought Skype didn't have it. I mean, Skype didn't have SKIP LOCKED at that time, right? So I was thinking, they did it differently, because SKIP LOCKED didn't exist in 2006 and 2007. PgQ was created exactly 20 years ago, in 2006, and it was open sourced in 2007. And now I'm talking about SkyTools PgQ.
Starting point is 00:09:15 Right. Yes. But okay, it didn't do SKIP LOCKED because... but it was also not doing updates or deletes, right? Like it was... Exactly. We will come to that. Okay.
Starting point is 00:09:28 But I was thinking, I had a false impression, that we should use SKIP LOCKED because it's the modern path, but we also should mitigate bloat issues, vacuum problems and so on, just using partitioning and rotation. And coincidentally, this is how all current guys are doing it: River, pgmq, que (Q-U-E), others. There are many now. And also, some of them are quite agnostic to languages.
Starting point is 00:09:59 Some, like River, are Go-oriented, so they're focused on Go frameworks. And that's it. And then, what happened, actually. So this was my understanding three weeks ago. Then what happened? I was in Zion Canyon at a campground, and they have a good connection, actually. I was in my tent,
Starting point is 00:10:19 and I saw, it was Friday evening, I think, and I saw the PlanetScale blog post, yeah, right, and I started to read, not the blog post itself, because I quickly realized what it's about, I started reading discussions of it on Twitter,
Starting point is 00:10:37 on X, right. So that post was dancing around Brandur's, Brandur Leach's, right? Who is actually one of the creators of River queue. So Brandur had a post in 2015 about how,
Starting point is 00:10:53 like, basically, how challenging it is to have a queue in Postgres because of MVCC, and if you have a long-running transaction with an XID assigned, or repeatable read... I think Brandur used a repeatable read transaction
Starting point is 00:11:10 lasting one hour or half an hour, I don't remember. And I think in his post it was, like, below 1,000 events per second inserted, maybe 800. And quickly, something like 60,000 events were accumulated, unprocessed by consumers, because everything started to lag and so on. It was 2015. I think this is actually the year when SKIP LOCKED was added to Postgres. Interesting, right? Yeah, good timing.
Starting point is 00:11:45 Yeah. So PlanetScale discussed how bad it is. Not like queues in Postgres are bad, but it's bad to have long-running transactions or something which is blocking the xmin horizon, right? And they promoted their new feature to mitigate it. But mitigate how? Just cancel them. So they have some smart approach, like, which traffic is more important, which is less important. My opinion: what we created with Andrey,
Starting point is 00:12:15 transaction_timeout, is good enough for everyone as a default solution against long-running transactions, although there might be other problems, like an unused or lagging logical replication slot, right? Yeah. Or maybe someone is using 2PC, and prepared transactions can also be a problem. So anything that holds the xmin horizon and prevents the cleanup of dead tuples, like, old row versions.
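For illustration, a sketch of how those xmin horizon holders can be checked with the standard catalog views; this covers four of the blockers mentioned (the exact queries in the monitoring discussed next will differ):

    select 'backend' as source, pid::text as who, age(backend_xmin) as xmin_age
    from pg_stat_activity
    where backend_xmin is not null
    union all
    select 'replication slot', slot_name::text, age(coalesce(catalog_xmin, xmin))
    from pg_replication_slots
    union all
    select 'prepared xact (2PC)', gid, age(transaction)
    from pg_prepared_xacts
    union all
    select 'standby feedback', application_name, age(backend_xmin)
    from pg_stat_replication
    where backend_xmin is not null
    order by xmin_age desc nulls last;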
Starting point is 00:12:41 By the way, we just released our monitoring, and when we say monitoring, we mean Grafana and VictoriaMetrics and Postgres inside, everything. We just released our new dashboard for xmin horizon analysis. And there are five possible reasons. And also, Laurenz Albe very timely posted a blog post about how he monitors autovacuum. I stole a couple of thoughts there,
Starting point is 00:13:22 just implemented them in the dashboard, and it's already released, and it's free to use, Apache license. But it works much better if you go and become our customer, because we have great new health metrics. I will blog about it separately. Anyway, this is connected, because xmin horizon, like, we also talked about it: everyone monitors long-running transactions, but it's off, it's the wrong thing to monitor in this context. You need to understand the xmin horizon being blocked, and by whom, to unblock it promptly, because this is how you can put your River or pgmq or something down. Actually, not down, but basically lagging and having very poor performance, accumulating a lot of bloat that is never recovered if it's not using, like... I think that's the other thing that people don't realize: there's no recovering from that, because once that's bloated, unless you're using, like, the partition rotation... There are several things here. Several...
Starting point is 00:14:08 First, a lot of dead tuples are accumulated. Yeah. Because the xmin horizon is blocked, dead tuples are created every time you produce a successful delete, a successful update, or an unsuccessful insert. It means, by the way, that we can also produce dead tuples if some inserts are failing. But this is very subtle, it's a nuance, we can omit it, right? So, regular approach: we always produce dead tuples. A dead tuple
Starting point is 00:14:38 is a row version. We agreed on our first episode that I say 'toople', you say 'tupple', or vice versa, I don't remember. Yeah. Anyway, a tuple is a row version. So the old version becomes dead, but it's still hanging out everywhere, actually, including shared buffers. It's polluting everything. So garbage collection, called vacuum, is needed to delete it. It's a two-phase process: first, it's only marked dead, and then it's deleted. And the first problem: a lot of dead tuples are accumulated, and they cannot be deleted by the garbage collection called autovacuum. Right.
Starting point is 00:15:15 And the first bad effect is that latencies of consumption degrade. Very fast, actually. Because to find the next thing, you need to scan through all the dead tuples with your index scan, and it becomes less and less performant. Yeah, the next bad thing is that we accumulate a big set of unprocessed events. Sometimes, not always. Okay, we have degradation, and when I say degradation, it means, like, it was one millisecond to fetch the next event, for example, or a bunch of events, and then it
Starting point is 00:15:55 degrades to a second or a few seconds within one hour. I saw, I think, five seconds for some queues, just to fetch one event. Five seconds, can you imagine? And at some point, if you keep inserting a lot of events and consuming them, it might start timing out on statement_timeout, if you have a strict statement_timeout, as you should for all OLTP systems; we always recommend having strict timeouts. Right. So this is: a lot of dead tuples, and degradation of consumer performance, the consumer query. The second effect is that a lot of events accumulate just because consumer capacity, throughput, is not enough.
Starting point is 00:16:34 You have, for example, 10 consumers working in parallel; SKIP LOCKED helps them not to fight with each other. And you have capacity, for example, to consume 2,000 events per second. But now suddenly latency went from one millisecond to one second, a thousand times worse. It can happen within 10 minutes or so. And 10 minutes, just, for example: you have a multi-terabyte database. If you decided to dump it, right, or create a logical replica in the traditional way, this is it. This is how you can reach second-level consumer query latency.
Starting point is 00:17:11 And this leads to accumulation of unprocessed events. And for some queues, I noticed that even when we stopped the long-running transaction and unblocked the xmin horizon, first of all, autovacuum comes, cleans up dead tuples, and immediately the latency of the consumer query drops. But for some, it didn't recover to the previous level. It didn't recover to the one-millisecond level. It stayed at, like, 50 milliseconds or so.
Starting point is 00:17:38 I think it was que, which is spelled Q-U-E, very similar naming to mine. I call it 'ke', like in Spanish. If you check, I'm not speaking Spanish, but for those who speak Spanish: how is my tool pronounced? It's crazy, right? Something like 'peh-keh' or something. I don't know.
Starting point is 00:17:57 I have an issue, and actually I already have pull requests to rename it, but I didn't like any of the brainstormed names so far. pg_belt, this was the best I had, and I didn't like it. And we will come to why pg_belt, right? And why 'que' actually is misleading, as I learned. I learned two big things during this journey, the last three weeks or two weeks.
Starting point is 00:18:18 So... Just quickly, before we move on from side effects, I want... So eventually, even in a non-partitioned system, the heap bloat would get reused. Once the tuples are marked dead and vacuum comes along, it marks the space as reusable. New jobs or new events
Starting point is 00:18:35 can go into that space in the table. But people don't, and I know we've talked about this, like, thousands of times, but it's the indexes that can end up bloated in a way that isn't recoverable without a reindex. So I wonder if that could be the source of the never recovering. Maybe, yes. Yes, you're right. And the bloat might still be there. And this is what can be solved post factum, actually.
Starting point is 00:19:01 It can be solved with partitions and truncate, in rotation, because truncate means you will have a fresh, empty index, and it will start growing from scratch. It's great. But again, I was thinking at first, actually, it's an interesting nuance as well, I was thinking: oh, partitioning won't help in the middle of a long-running transaction, in the middle of, like, when the xmin horizon is blocked. Actually, it will. It will help.
Starting point is 00:19:24 It will help. You will switch to a new partition. The old partition will keep the old stuff with dead tuples, degraded indexes, and that's it. And this actually will lead us to an interesting optimization in my tool. But what won't be possible, I think it's possible, but I wouldn't recommend it, is to have aggressive, like, every-minute partition rotation and truncation. First of all, you need a lot of partitions. Maybe, maybe not, actually; maybe you can find a way to just jump between them, but it's not practical, again,
Starting point is 00:19:59 because of catalog bloat and the stress when you switch to a new partition, and you need to clean up, and everything should be processed. And if degraded latency leads to unprocessed event accumulation, you won't be able to recycle the old partition, right? Because it still has useful data. You can move it to the new one... like, it's becoming a nightmare, actually. So anyway, it feels architecturally wrong to me to have very frequent partition switching,
Starting point is 00:20:23 right? So this is... And back to your question, time-based or size-based: well, time-based is fine. I actually don't remember what I have in the settings, I need to check. But here, your other question: are we talking about PgQ new, or PgQ old, SkyTools? So what I did: I took it as is, right? The first thing I did, I took it as is, and I just started to build around it. So I'm not touching the core engine. That's the...
Starting point is 00:21:09 I mean, I knew that, but not actually by reading your... I saw... Did you see? I thought it was a really good blog post by Christophe Pettus covering the 0.1 announcement. Yeah, he's done it. He's been on a bit of a blogging spree lately, but he's done a blog post about PgQue, your tool, already. I missed it. It's cool. Nice.
Starting point is 00:21:31 Well, yeah. What I wanted, I wanted to say: guys, there is an alternative, forgotten kung fu, I say, because I know how trustworthy it is, 20 years. I used it in two of my three social network startups myself, before we, I think it was a mistake, switched to RabbitMQ. I regret it now. We used it heavily. We used it even as Skype originally did. Skype built it not only, like, for event processing, but also to have logical replication: instead of Slony, they built
Starting point is 00:22:03 Londiste, and it was working on top of PgQ. Native logical replication didn't exist at the time. So it was serving many purposes. And when they built it, Kafka didn't exist. Kafka was created in 2011, I looked
Starting point is 00:22:19 it up. So this is called a queue, but by nature, and this is the second big thing I learned, it's not a queue system. It's an immutable log, similar to Kafka, not distributed like Kafka, or Redpanda, also a modern thing, right, as I learned. It just guarantees that something is inserted, in this order, and a consumer is just a pointer shifting, right?
Starting point is 00:22:44 But it can be used for queue-like workloads, maybe not all of them, and we can discuss that in a bit. But what I wanted: I saw this blog post from PlanetScale, and again these discussions, like, 'just use SKIP LOCKED'. And the first thing I posted, I found big feedback, like, people saying: yeah, that's it. I said: first of all, SKIP LOCKED doesn't solve vacuum problems, that's just very wrong; SKIP LOCKED solves heavyweight lock contention, right? And also, about SKIP LOCKED, there is another post from Laurenz Albe, 'SELECT FOR UPDATE considered harmful', right? So there are issues with SKIP LOCKED: it's a part of SELECT FOR UPDATE, you cannot use it without SELECT FOR UPDATE, and there are issues with that approach as well. Additional danger is there. But I wanted to show, like, there is
Starting point is 00:23:31 PgQ from SkyTools, and we all know another tool called PgBouncer, and I'm wondering... And I know PgQ is still used in very large companies as an important building block, very reliable, very performant. Skype originally built the architecture, as I learned from their talks, for one billion users. They achieved hundreds of millions; I don't know after the acquisition by Microsoft. By the way, Skype was shut down last year, also part of this story. So there's an interesting legacy here, and I'm thinking: why did PgBouncer become quite popular,
Starting point is 00:24:08 but PgQ didn't become quite popular? And my hypothesis is that it's because of this extension requiring an additional daemon, and, maybe I'm wrong, but maybe providers just didn't want to bring in any additional daemon which is not a regular background worker but something that needs to be managed separately. Whereas PgBouncer is completely independent and... Well, yeah, yeah, yeah. But yeah, that's a good point. It needs to be managed.
Starting point is 00:24:41 There are some providers who provided it as a managed service. But PgBouncer, it's, like, old tools solving that problem. Here, it's inside Postgres, and we need an additional daemon. I don't know, I didn't participate in any of those decisions, but I just don't see any of the providers supporting PgQ. Meanwhile, maybe on Cloud SQL, actually, is it...
Starting point is 00:25:03 they support PL... This is, maybe I told you this and I was wrong: I mixed it up with another tool from SkyTools, called PL/Proxy. Cloud SQL supports PL/Proxy. Yeah. And yeah, hello, Hannu Krosing,
Starting point is 00:25:16 who actually liked every single post of mine on LinkedIn about PgQ, I think, because he was from Skype as well. So it's great, it's great. And yeah, so then,
Starting point is 00:25:29 an interesting thing. Like, I'm thinking: okay, you know my work, pg_ash, the anti-extension concept, right? Yes, yeah, we talked about it. Yeah, and there are others, like, I have a couple more coming. But then I think: okay, I just need to repackage it, right? And being in my tent, I have a new tool to create a thorough specification, a spec creation tool, which I use for many things now. It's a CLI, like, just: you start from an idea, then explore questions, research, then build a comprehensive spec, and then iterate with multiple LLMs, the most powerful you can reach. And then you already have, like, version seven, for example, which is ready for implementation. So I wrote a spec: how to repackage PgQ from Skype in this anti-extension manner. So no CREATE
Starting point is 00:26:24 EXTENSION is needed. And since, like we discussed, since we have pg_cron, like with pg_ash: same thing. Almost everywhere. Yeah. You need a ticker. So that's why that daemon was needed. You need a ticker.
Starting point is 00:26:37 So PgQ, it's a log. There are inserts, a single insert or a batch. Never updates, never deletes. And there is basically their own horizon. It's based on snapshots of data, right? So every consumer knows its position. And to shift the position, you need a tick. You need to announce: okay, we shifted,
Starting point is 00:26:58 because something new arrived. And by default, PgQ from SkyTools ticks every second. So it shifts, and consumers see new data and fetch a whole batch of events. So batch processing is the default there. The thing I didn't understand is that you have a second table for keeping track of those. There are meta tables. Yeah, for ticking and for subscriptions. But the queue itself is three partitions, old-school inheritance partitioning, because it was created before native partitioning existed. So, three partitions: one is in work, one is, like, in the past, and one is in the future. And there is rotation, rotation using TRUNCATE. There is also a delayed table,
Starting point is 00:27:47 separate, for those events which cannot be processed now and need a retry; we put them there. Yeah. So that you don't end up truncating jobs that haven't been done. Yeah. And it was a single table. I think I also implemented the same pattern there, because sometimes we might have a lot of jobs to be retried, events to be retried for processing. So I created three partitions there as well.
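For a flavor of that model, a hedged sketch using the classic SkyTools PgQ function names (the producer does plain inserts; a consumer asks for the next batch that a tick has made visible); if PgQue diverges from the original API, treat these calls as illustrative:

    -- one-time setup
    select pgq.create_queue('my_queue');
    select pgq.register_consumer('my_queue', 'my_consumer');

    -- producer side: inserts only, never updates or deletes
    select pgq.insert_event('my_queue', 'job', '{"video_id": 42}');

    -- consumer loop (:batch_id stands for the value returned by next_batch)
    select pgq.next_batch('my_queue', 'my_consumer');  -- NULL until a tick exposes new events
    select * from pgq.get_batch_events(:batch_id);     -- fetch the whole batch
    select pgq.finish_batch(:batch_id);                -- shift this consumer's position forward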
Starting point is 00:28:13 You also created a dead letter queue, for the concept of maximum retries, and then we put it into the dead letter queue, not to retry forever. So I just, I suggested: this is how many modern systems work. I said: oh, good idea, let's adopt it here on top of what already exists. So I created the spec. And I was thinking: guys, I understand PlanetScale's position. Let's just have a shotgun and fire at all those long-running transactions. Great. transaction_timeout, that's also a shotgun, maybe less smart, but still working and reliable, right? But let's just compare performance. I said: okay, let's compare performance.
Starting point is 00:28:54 And I said it in the same session of Claude Code where we had just created the spec. And then we started benchmarking, provisioned, I think, seven VMs in the cloud for the alternatives and PgQ. And I noticed it provisioned two machines for PgQ: one was original PgQ, and one it called, like, 'PL mode'. PL mode? I was thinking: okay, what is PL mode? We were supposed to create it according to the spec, but we hadn't implemented it yet. It said:
Starting point is 00:29:24 you created it in 2019. I said: okay, first of all, it was not me, it was Marko Kreen,
Starting point is 00:29:31 the author of PgQ. But second: how come it exists? Apparently, this is also an interesting part of the story.
Starting point is 00:29:39 Alexander Kukushkin, the main maintainer of Patroni, in January, I think at PGDay Prague, I might be mistaken, presented a talk
Starting point is 00:29:50 about PgQ: because PgQ is an excellent thing, let's revive it, right? Great, goals align here. I also want to revive it. And I was like, you know, I told him: but unfortunately, you cannot install it on RDS, right? I was slightly provocative. He said: no, my slides have a recipe. I go: how come? How did you learn it? He said: everything is written in the commit messages. The problem with PgQ has always been a lack of
Starting point is 00:30:21 documentation. Everyone can tell you that; I've heard it for 15 years and can say it myself. So yes, indeed, there is a commit from 2019 or so to make it work on RDS and others: let's have this support without CREATE EXTENSION, just a single file, or maybe multiple files, doesn't matter. A PL/pgSQL-only mode; it's called PL mode internally in PgQ. So apparently it was already ready. Meanwhile, I talked to my Claude Code, I said: okay, but I cannot find that file. Like, where did you get it? I don't see it in the repository.
Starting point is 00:31:03 It says: okay, you just need to run make. What? You need to run make to get the SQL file you can load into your RDS. We live in different worlds here a little bit. Yeah. I told Kukushkin: I think Gen Z won't understand us with this make. I don't know.
Starting point is 00:31:27 It's interesting. For me, it's, like, mind-blowing: everything exists, but it's buried under these walls. For some people, it's not a wall: you git clone, cd, make, psql, import the file, and everything works. But imagine, for some people it's a big barrier, right? It'd be weird not seeing it in the repo. Yeah, well, you need to read commit messages. And it's great
Starting point is 00:31:57 that Alexander has a talk promoting that this is possible. But it inspired me even more: let's do it, and add more and more to the documentation. So my tool basically has it, even as a module, and then it compiles and presents this as SQL. But of course, I started adding more things around it. So this is it. This is the idea of PgQ with 'ue', which is 'universal edition', so it can be used anywhere. And I'm releasing the second version this week, with a lot of stuff, actually, a lot of stuff: some requests and so on. First of all, I realized I need libraries.
Starting point is 00:32:37 So we now have TypeScript, Go, and Python libraries. And one person promised to bring a Ruby library as well. Oh, nice. Yeah, and I also already have two external contributors. So it has some life. I reached a thousand stars in four days or so. It was good. I mean, it felt great. But I also learned it's not a queue.
Starting point is 00:33:00 It's, like, a log, because it's more like Kafka than RabbitMQ or ActiveMQ. And I agree, after understanding it. That's why in version two I'm bringing in another concept, from SkyTools originally. It's called cooperative consumers, or sub-consumers. So logically it's a single consumer, but there is a group of consumers which distribute the load between them. This is needed, for example, when you have... imagine you have a queue of jobs like: process a video. Some videos are super small, some videos are super big. In the case of PgQ, if you just shift your position and read them one by one... Some people, by the way, looking at PgQ,
Starting point is 00:33:43 think: okay, if I'm adding more consumers, I'm increasing capacity. No, throughput won't increase, because every consumer in PgQ reads everything. Everyone reads everything. You need different queues to distribute load. It's like topics in Kafka, because it's actually not a queue, it's a log. Yeah, but do you support multiple queues? I guess it just involves... Yeah, yeah, you can create as many queues as you want, at any time.
Starting point is 00:34:26 They will all be partitioned, and all the mechanics will work. Yeah. But with the concept of sub-consumers, which is not my idea, it's the original project's idea, I couldn't import it, because it's a separate repo, pgq_coop, and it doesn't have a license. There are only two issues there, asking which license it is, because people want to package it as a Debian package or something. And I couldn't take it, but I stole the idea, of course, right, and re-implemented it with my own code,
Starting point is 00:34:51 but the idea is the same. The feature is experimental right now; I need to play with it. We already started benchmarking it and so on. It looks good. So this, I think, should be natively supported. So there is a lot of stuff, but the key idea is that now it's, like, a single file. You can load it. I also made it a pg_tle extension, for those who want to properly track it as an extension. Sure.
Starting point is 00:35:09 Right. And you just inject it. You configure pg_cron. I actually thought about... and Hannu Krosing, who is now at GCP, ex-Skype, we discussed it on LinkedIn and I implemented it. So pg_cron cannot tick more often than once a second. I was going to ask you about this. Yeah.
Starting point is 00:35:31 But now the default is 100 milliseconds. This is new for version 2, yes. I just made it yesterday. Okay. Yeah. So I was thinking... First of all, people think about latencies. I recognize three latencies. First is producer query latency:
Starting point is 00:35:48 how long it takes to insert one event or a batch of events, like 100 or 1,000. Second is consumer query latency: how fast it is to take the next batch and fetch it, right? And we measure them, and we see how badly the second one degrades for all the alternative, modern... I cannot call my own tool modern, it's very old; the engine is 20 years old. So they all degrade.
Starting point is 00:36:13 Here, we almost don't degrade. We slightly degrade: from 100 microseconds we go slightly above one millisecond, while they go from one millisecond to one second, sometimes five, I saw. And degradation we can discuss separately; our degradation is also solvable, but not solved in version 2 yet. When I say version 2, it's 0.2, because it's early,
Starting point is 00:36:37 but it's a super solid engine, we know. And there is a third latency: end-to-end event delivery latency. If you tick, if you shift your visibility horizon only every second, it can be up to a second, at least. Also, a consumer itself might not wake up immediately; you need to listen for a NOTIFY or something, it's partially supported right now, or you need polling or something. You might lose some milliseconds there as well.
Starting point is 00:37:05 And I was thinking: this decision to have once per second was made 10 to 20 years ago. We have better hardware, so let's have 10 per second by default. And how? Okay, pg_cron, which we rely on. By the way, pg_cron is optional; you can put it in cron or something, you just need to run the ticker, PgQ's ticker. Okay. Tick, tick. Yeah, every second by default, that was originally from SkyTools, it was in the first version. In the second version, I made it 10 times per second. And it's simple: in pg_cron, there is a stored procedure which has a loop with COMMIT, because we need separate transactions, actually, to shift the snapshot.
Starting point is 00:37:50 So it's not every... this is the same misconception as with \watch in psql: it's not every 100 milliseconds, it's 100 milliseconds of wait time. The operation itself has a non-zero duration, right? So roughly it should be fine, and it ticks about 10 times per second, but not exactly; it's slightly drifting. But updating one row is super fast. Yeah.
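A minimal sketch of that loop idea: a procedure that pg_cron (or anything else) can CALL once per second. The procedure name and parameters are hypothetical; pgq.ticker() is the classic SkyTools entry point:

    create or replace procedure tick_loop(n_ticks int, wait_ms int)
    language plpgsql
    as $$
    begin
        for i in 1..n_ticks loop
            perform pgq.ticker();            -- shift the visibility horizon
            commit;                          -- each tick in its own transaction,
                                             -- so each tick gets its own snapshot
            perform pg_sleep(wait_ms / 1000.0);
        end loop;
    end;
    $$;

    -- scheduled roughly once per second, e.g.: call tick_loop(10, 100);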
Starting point is 00:38:18 And it will be even better when I implement bloat mitigation for the system tables, because this is why we go from 100 microseconds to 1 millisecond or so under a blocked xmin horizon: we accumulate dead tuples in these meta tables, the tick and subscription tables. Are there any other downsides to increasing the ticker speed, decreasing the tick interval? So I did preliminary benchmarks yesterday, and an important thing to understand: when it's time to tick, if there is nothing to do, nothing new arrived, it doesn't write. And it means it's great: if load is low, it won't produce new writes.
Starting point is 00:39:02 But imagine every 100 milliseconds you have new events. In this case, every tick is updating this row which has metadata. And we estimated, for ticking every second, it's 24 megabytes per second of WAL... It's very rough, because it doesn't take into account full-page writes. So it's very rough, just the overhead from ticking: 24 megabytes, from a single tick per second... Oh, per day, that makes more sense. And 240 megabytes per day, sorry, per day,
Starting point is 00:39:39 if you have the current default in version 2, 10 times per second, which is acceptable. I mean, this means you have load already, right? So the database is loaded if ticking happens every 100 milliseconds; if it doesn't happen, again, no writes. So it sounds like it scales fairly linearly then, like 10 times more. This is the overhead from updating this meta table with the row of where we are.
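(A rough check of those figures, assuming the ~24 MB/day estimate is right: one tick per second is 86,400 ticks per day, roughly 280 bytes of WAL per tick; at 10 ticks per second, 864,000 ticks per day, that gives the ~240 MB/day mentioned, so the overhead is indeed linear in tick rate.)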
Starting point is 00:40:10 That's it. Yeah. Right. So, of course, if you ingest a lot of data, there are other mechanics; interesting things might happen. And again, if you have long-running transactions, the xmin horizon blocked, in this case dead tuples will accumulate, unfortunately, in the system, in the metadata tables, which I'm going to solve also with partitioning and TRUNCATE.
Starting point is 00:40:26 So, something I don't quite understand is: when doesn't this make sense? You mentioned it's not really a queue; it can be used for queue-like workloads, but maybe sometimes it doesn't make sense. Should we go into that a little bit? This is a great question. I'm deep in databases, so I would like to hear from backend engineers and people who build systems.
Starting point is 00:40:55 What's lacking here? One thing I can understand is the lack of, for example, priorities for events. Yeah. Because this is very linear. With cooperative consumers, I think the problem of big tasks blocking small tasks will be basically resolved. But priority, I don't know; this is definitely not a pattern here. Also, if you need almost immediate delivery, pgmq or River might be better, because they deliver faster, right? Because they take...
Starting point is 00:41:28 End-to-end, I mean, end-to-end latency for job processing: what we have here is worse end-to-end latency, controlled by this tick frequency. I actually wrote a document about frequency tuning, with some considerations. In the docs directory, there is a special document right now.
Starting point is 00:41:48 So the problem will be: if you want almost immediate delivery, like, for example, I don't know, chat or something, maybe you should choose pgmq or River. But you need to fight xmin horizon blockers very actively, right, and install our monitoring and connect it to our
Starting point is 00:42:09 platform, and check the health and so on, and fight those blockers actively. What I can say is that the update-delete, SKIP LOCKED systems have better end-to-end delivery, but they degrade badly when the xmin horizon is blocked. We are worse initially, but it's predictable, reliable, right? And in the case of background processing, for example: we discussed how to convert an int4 primary key to an int8 primary key. You have, for example, a billion rows; you need to change them. And you chose, for example, the approach I call the 'new column' approach.
Starting point is 00:42:49 You create a new column with int8. Then you need to install a trigger for future rows. And then you need to process your big backlog, a billion rows. You do it in batches. How to schedule this processing? This is exactly the kind of background processing where, like, 50 milliseconds of end-to-end latency is fine. This is it.
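A hedged sketch of that 'new column' migration, with hypothetical names; the batched UPDATE at the end is exactly the kind of job you would enqueue as an event:

    alter table orders add column id_new bigint;   -- the future int8 key

    -- keep rows written from now on in sync
    create function orders_sync_id() returns trigger
    language plpgsql as $$
    begin
        new.id_new := new.id;
        return new;
    end;
    $$;

    create trigger orders_sync_id
        before insert or update on orders
        for each row execute function orders_sync_id();

    -- backfill the backlog in batches; bounds would come from a queued event
    update orders
    set id_new = id
    where id >= 1 and id < 10001
      and id_new is null;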
Starting point is 00:43:10 It's good. It works in batches. If you need one millisecond, okay, choose the newer tools, but fight xmin horizon blockers. Yeah. I mean, my understanding of when you use queues is for asynchronous stuff.
Starting point is 00:43:27 So I'm struggling to imagine something that can't cope with 50 milliseconds of overhead on something asynchronous. Even, like, a password reset: if it comes through a second later, it's fine. I agree. And in this case, maybe you should consider doing it outside of Postgres, with different systems like Redpanda or something. I can imagine some systems where you need very responsive behavior, but you need to learn how MVCC works and what dangers await you if you build like that.
Starting point is 00:44:02 You will have good latency in the beginning, but then suddenly something blocks you, and it degrades quickly. I wanted to mention that a queue in the database is great because it's ACID. Nothing will be lost, right? It's, like, replicated,
Starting point is 00:44:19 it goes to backups, nothing is lost. And there's isolation. All four properties are properly followed, right? If you go and use a different system, you need to think about consistency, right? You need to think about: if you have something in the database, you already wrote it here but didn't delete it there, it can be inconsistent. These days, GitHub works very poorly, and since I posted this project on GitHub... I'm usually,
Starting point is 00:44:53 usually, on GitLab, and I know GitLab's issues as well, because our clients have been there for many years. But they're great. On GitHub lately, I'm like: wow, it's interesting, you already merged a pull request, but it takes some seconds for the counter to propagate, also for this pull request to disappear. So they have big lags, asynchronous processing, right? But lags are fine, eventual consistency, right? But data loss is not fine.
Starting point is 00:45:22 So I would say: if you need predictable performance, a reliable approach, good throughput, not suffering from degradation when the xmin horizon is blocked, PgQ is great. When you need much faster delivery and you want to stay inside the database, ACID and so on, choose the different queue systems for Postgres, but fight xmin horizon blockers. And if you want better throughput, go with Redpanda, Kafka, or anything, if you can afford supporting it or paying for a managed version. But in this case, do look at the transactional outbox pattern. Yeah.
Starting point is 00:46:03 Because this is from microservices theory, so to speak. There is a pattern for organizing data delivery from the database to a queue properly, with all the statuses. This is how you should do it, because otherwise data loss is eventually inevitable. Yeah, this is how to navigate the solutions, advice from me. Yeah. Anything else you wanted to make sure we cover before we wrap up?
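A minimal sketch of the transactional outbox pattern just described, with hypothetical table names; the key point is that the business write and the outgoing event commit atomically:

    begin;
    insert into payments (id, amount) values (123, 9.99);
    insert into outbox (aggregate_id, payload)
    values (123, jsonb_build_object('event', 'payment_created', 'amount', 9.99));
    commit;  -- both rows exist, or neither does

    -- a separate relay process then polls the outbox, publishes rows to the
    -- broker (Kafka, Redpanda, ...), and marks or deletes them on acknowledgment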
Starting point is 00:46:37 Well, I'm just excited that, among others, as you said, Christophe Pettus, and also, as I said, Kukushkin, I guess we teamed up a little bit, somehow, in a distributed fashion, to shed new light on PgQ, because it's a great piece of software. It solved problems before people encountered them. But somehow the knowledge got lost, and I hope more people at least keep in mind what's possible, and consider it when building systems, and tell their AI to consider it. Because maybe they'll just say: yeah, I will look at it, do some benchmarks, research, and make a decision, right? That's it, maybe. Yeah. All right, nice one. Well, thanks so much, Nikolay, and catch you soon. Thank you for listening.
Starting point is 00:47:17 See you soon. Bye.
