Postgres FM - PgDog update
Episode Date: January 23, 2026

Nik and Michael are joined by Lev Kokotov for an update on all things PgDog. Here are some links to things they mentioned:

Lev Kokotov https://postgres.fm/people/lev-kokotov
PgDog https://github.com/pgdogdev/pgdog
Our first PgDog episode (March 2025) https://postgres.fm/episodes/pgdog
Sharding pgvector (blog post by Lev) https://pgdog.dev/blog/sharding-pgvector
Prepared statements and partitioned table lock explosion (series by Nik) https://postgres.ai/blog/20251028-postgres-marathon-2-009

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello, hello. This is Postgres FM, your favorite podcast about Postgres.
And as usual, my name is Nik, Postgres.AI.
And Michael is here as usual as well, from pgMustard. Hi, Michael.
Hello, Nick.
And we have a great guest for the second time, Lev, PgDog.
Hello, Lev. Thank you for coming.
Hello. So thank you for having me for a second time.
Yeah, I'm sure there is a lot new to discuss.
So yeah, I think it's a great idea to see what has changed since the last time we talked, because it was around your first release, I guess, right? So where to start, where to start?
Yeah, we're in a different year, 2026, so it's been, what, almost eight months since we last chatted? Something like that, seven months?
It was a lot of... yeah, now we're in January. I think 10 months.
10 months. Oh boy. I mean, I've been working full time on this stuff, so I have a whole list of things that have been developed since then. I think the biggest thing is PgDog is very much real and in production with real companies and real customers.
You know, last time we chatted it was, like, you know, a thing, but now it's a real thing. So that's the big update. It's in prod. Most of the code, I think, I would comfortably say is stable and performant, which,
obviously I would say that, but there's real, real usage behind this, which is really cool.
Let's see, there's a long list of features that have been developed since then.
Before we go, before we go on to those, I wondered of the people using it in production,
what are the primary drivers?
Is it sharding or is it like an upgraded pooler?
You know, I feel like there are a few selling points to PgDog that we talked about last time,
and I wondered which ones were driving, like the primary adoption?
That's a great question. One of our early adopters, they went straight for sharding, which to me was like, wow, you're brave, sir.
And they went straight for it and they replaced Citus with it. So that was really cool.
Can you name some names or no?
They're on my website. One of their logos, I'll let you guess which one. I'm still trying to close them as an official customer.
So soon I'll be able to say their name, but their logo is there. And yeah,
they're pushing, I think, over 100K QPS per shard.
And they have 12 shards, I think.
Yeah, yeah, it's kind of crazy.
They're using direct-to-shard queries, so that's, you know, why it worked so early, because
they didn't use any of the fancy like aggregates and sorting across multiple shards.
That's the real tricky part, although we've made some progress there, but, you know, it's still not 100% compatible with how Postgres does it.
So on its way, though.
But other use cases, you know, load balancing is a huge, huge thing.
Like people love having the ability to talk to multiple replicas without actually manually creating a connection pool for each.
You know, there's health checks and failover.
One of the big use cases is we're deployed alongside, you know, Aurora or RDS, where Patroni is basically not used.
And it can detect primary replica failover and it'll move the traffic to the new primary, like,
within like a second or so, which is really great.
And then of course, when one of the replicas starts failing,
it'll ban it and then use the remaining replicas for queries.
So you get that, you know, that amazing, like, four nines of uptime for Postgres,
which is- I have questions about all these features.
All right.
Before that, I have an underlying question.
10 months, is it really 10 months or 10 years?
Because you use AI.
Do you use AI to develop?
I do.
Yes, Claude, Claude, right?
Yeah, I use Claude Opus specifically.
Claude Code, Opus 4.5 right now, right?
So this is what you use.
That's right, yeah.
For only certain areas of the code, not all of it.
There are some areas where I don't trust it at all, but some areas, like,
like we're building like a dashboard like with all the metrics and stuff.
Not critical paths, right?
Yeah, critical paths are still organic code, if you will.
Organic code.
H.I.
Exactly.
This is the first time I hear this.
It's hilarious.
Yeah.
Homegrown organic code, but I don't write JavaScript anymore, for example.
I don't write CSS anymore.
Like that kind of stuff.
So you use Claude Code for these additional pieces of software, not the super critical paths.
And I'm sure you ask it to deduplicate and reuse the code as much as possible, recycle existing pieces. Okay.
It's crazy how much that machine loves to generate copy-pasted code. It's allergic to
function reuse. You really have to tell it, like, hey...
Yeah, you know, I have a special additional
codebase review process which is basically targeting this duplication issue.
Yeah.
But it's possible to achieve.
Yeah.
If you don't do that, by default, you will get a lot of duplicated code.
Yeah, that's great to hear.
Okay.
I see, by the way, great logos on the website.
I missed that.
I haven't visited the front page for a while.
I see Coinbase, Circleback.
We use Circleback at work.
Some other names.
That's great.
Congrats.
Thank you.
Hard at work.
Yeah.
So what is your current pace, how often do releases happen? How do you ship? And do you still write a lot of code? I guess it's not like an everyday new version, it's less, right?
Yep. Not every day. Our release schedule is every Thursday. I'm trying to be very diligent about it. Whatever code that's in main is usually stable, but I give it a week before I tag a release. And then
every Thursday, I compile release notes and ship it out.
And then you get it on GitHub, and there's a Docker image that gets built for both, you know, x86 and ARM.
So you can pretty much run it anywhere, including on your Mac if you wanted to.
And we have a Helm chart if you're using Kubernetes; our Helm chart is getting a lot of good usage and contributions from a bunch of people.
So, you know, I'm not a Kubernetes expert, but they are.
So they're helping me along.
So if you're running Kubernetes, you should be able to deploy PgDog in, like, 30 seconds, honestly.
but otherwise, you know, it's just a container, run it anywhere.
Right. And you use libpg_query for query parsing, to build ASTs, and you recently optimized it, right? Your optimization is not merged to upstream yet, but can you tell us a little bit more about it? Also, specifically, I think my question is also about, is it worth optimizing? Like, how many microseconds do we save on a query? Should we worry about those microseconds at all?
Yeah, you should. It was, oh, so first of all, full disclosure, I told Claude to do it.
Yeah. And then it did it. I checked it. I was very open about it. I put in the PR. I know how
open source projects hate this kind of stuff. Like, you know, 3,000 lines of boilerplate.
Like, I was up front: look, guys, this is Claude-generated. I reviewed it. It works okay. But, like, give it a few months before seriously looking at it. But the idea is, in libpg_query there's a translation layer between the C library and a bunch of other bindings. Notably, we use the Rust binding, and the translation layer is in Protobuf. So every single time you call parse or fingerprint,
every time you get the AST back, basically, it goes through a serialization and deserialization
process with Protobuf inside the same process. And that's really slow, actually. So I just replaced
that, only for the Rust bindings, with direct C-to-Rust and Rust-to-C mappings of the Postgres AST, and that improved the translation part of it by, honestly, I'm not even exaggerating, like 5x. The throughput went from, I think, a thousand per millisecond or per second to like five or 10K or something like that. The benchmarks are in the PR, but it's huge. And I think another important factor is CPU utilization. For us, we're a
network proxy, so our CPU budget has to be pretty reasonable. If you can decrease it by 5x... and most of our CPU comes from query parsing. Everything else is pretty straightforward stuff, just I/O back and forth and atomics. So for us, it's huge.
So this optimization is just that you reduced the number of times the query is copied between various buffers?
Exactly. And Protobuf serialization is probably one of the fastest in the industry, but it's still very, very slow compared to just copying a C struct in memory with memcpy.
It sounds to me like shorter queries should suffer less from this, while huge CTEs, you know, multiple megabytes in size, should suffer even more, right?
Yeah, yeah, definitely.
I haven't measured how much faster slow queries or long queries get. I just took a pretty sizable query, like six hundred lines, and used that as my benchmark, because a lot of people who use PgDog do like to write long and complex SQL statements for the database.
So that's what I went with and that's where latency will really matter.
So yeah, for small ones, maybe the gain is smaller, but doing Protobuf serialization and deserialization in memory is a big waste of time when you can just map a C struct directly with a few allocations.
Right, right.
I also saw there was an optimization with, like, SIMD instructions directly, and the vector calculations.
So is this just cool stuff, or did you see some real benefit?
Well, the vector one is kind of funny.
There's a blog post about it, but basically I tried to shard a pgvector index among multiple Postgres instances. And to do that, I implemented one of their algorithms, IVFFlat, inside PgDog. And then it's able to basically separate the vectors between multiple databases and then search them using IVFFlat.
And for that, I needed SIMD to calculate, I think I'm supporting,
we're supporting cosine distance only, just as a POC just to see if anyone even cares about this stuff.
And that's where the SIMD optimization comes from,
just to calculate the cosine distance between the input and one of the centroids,
because we store the centroids basically in a config file.
Everything in PgDog, as you know, is config-driven, because I don't trust the disk to do anything. Eventually, we'll move away to something more dynamic. Right now, you pre-calculate your centroids in advance, and then you store them, and then you can basically query one shard or multiple shards to get your probes, the number of probes, if you will. And then underneath, each Postgres database has its own IVFFlat or HNSW index.
So it's an approximation on top of an approximation, which is fun, because you can scale pgvector into presumably billions of vectors, like others announced, right?
And we said it's not true, because it's not a single instance.
So yeah.
Yeah.
One billion vectors on a single instance, I haven't seen this.
No, no, it's not possible with pgvector, but with sharding it is.
Right.
So if you have a sharding schema defined, you can store a lot.
Right. So it's a great, great feature.
Is anyone using it, or was it just a showcase?
I don't think anyone.
I think I got an email from somebody who's like,
this is really cool.
And I'm like, well, this is how you need to use it.
And they're like, cool.
And I haven't heard back from them.
But yeah, it's more like a, yeah.
Speaking of performance, let's talk about two-phase commit.
There is support for it, right?
Native support.
Can you describe what your experience and view is
on using two-phase commit in larger systems?
Yeah.
Oh, we have a dog.
Um,
We have a dog.
And we have a doc on two-phase commit as well.
Okay.
Just for those who are listening on the audio version,
we just saw a dog actually walking in the background.
That's great.
Big dog.
She is the reason why PgDog is named PgDog.
Right.
You explained it to us last time.
Right.
So what about two-phase commit and the performance overhead it has and so on?
I'll be completely honest with you.
I haven't benchmarked.
it very much. Nobody really talked to me about using it yet in production, so I don't know
how much the Postgres internals themselves create an overhead. I just know from the proxy side,
it's two extra statements to execute, and also the additional complexity of managing
phase one, phase two of the commit. So if phase one fails, you need to roll back, and if phase
two fails, you need to roll forward, or commit. And you also need to make PgDog crash-resistant,
which we haven't done yet.
So there's still work to do there to make it, I think, production grade.
But from our point of view, it was just an extra couple statements to execute.
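For reference, this is roughly what those extra statements look like at the SQL level on each shard; the table and transaction identifier here are made up, and the shards need max_prepared_transactions set above zero for this to work at all:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;  -- the sharded write
PREPARE TRANSACTION 'pgdog_tx_42';  -- phase one: this shard promises it can commit

-- Phase two, once every shard has prepared successfully:
COMMIT PREPARED 'pgdog_tx_42';

-- If any shard failed phase one, roll the others back instead:
-- ROLLBACK PREPARED 'pgdog_tx_42';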
Yeah, okay, I understood.
So this is just like also like some foundation for future work maybe, right?
Definitely, yeah.
Some feedback from production systems.
Okay, I see.
Another thing I noticed, also a fun, fun feature.
You created an ID generator, right?
Which is, like, Postgres 18 brings UUID version 7,
but it's 16 bytes.
It's quite a lot of space spent for a single ID value.
Yours is 8 bytes, right?
Yeah, bigint, yeah.
It's an 8-byte integer, but it behaves similarly, so there is a timestamp component in it.
This is a cool feature, and it's going to serve us until when?
70 years, right?
70 years left.
It's not a lot, but for many systems it should be enough.
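For reference, the size difference is easy to check in SQL; this assumes Postgres 18, where the uuidv7() function is available:

SELECT pg_column_size(uuidv7())              AS uuid_bytes,    -- 16
       pg_column_size(1234567890123::bigint) AS bigint_bytes;  -- 8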
What inspired you to build this?
I have a potential production deployment, and there was big usage of omnisharded tables, or
replicated tables or mirror tables or whatever you want to call it. Basically the data in the
table is the same on all shards and they needed to generate the primary key for that row.
And you couldn't rely on a sequence, because even if you used two-phase commit, which
they originally thought they would do, there's still a chance that the sequence numbers on each
shard will be different, especially because sequences are consumed, whether the transaction succeeds or fails.
So the ID could be different on each shard.
So you needed an external ID generation service, and the first thought was about using a third database, or another database, to use its sequence, pull the value in, and then insert it into the query, which, you know, creates a single point of failure.
But it's much simpler to actually be able to generate a unique bigint inside your proxy, inside PgDog, which is not that hard to do.
Couldn't sequences just have some extra offset added? Like, for example, if we say we don't need more than 100 billion per shard, we could say on shard number one we always add 100 billion, on the second shard we add 200 billion, why not?
They need to be the same on every shard. This is the table that's replicated.
Okay, I like it, I think it's interesting. So I guess it's because we don't have unique constraints across shards,
right?
Yeah, that's right.
Well, we don't need them, basically,
because the ID generation is guaranteed to be unique.
Before.
Yeah, before.
I mean, all good database engineers will tell you you should still have a unique index,
even for UUIDs, which are theoretically guaranteed to be unique,
but even in this case, you don't necessarily need one.
So the idea is, again, so you insert the same ID into all the shards
and you need to be able to generate it quickly and efficiently.
And that's why this is, like, a first of its kind: PgDog executes a Postgres SQL function.
Okay.
Because we're using the Postgres parser, right.
So why not, you know, just hook into the query execution.
One step towards Citus.
And having the whole of Postgres in your middleware, no?
I mean, you heard it here first, on Postgres FM.
No, I don't know.
I think it's a good start, but it was so simple to implement.
You extract, like, you find the function call, and then you generate a unique ID inside the pooler.
And the way you do that is, well, you get the time stamp.
You kind of assume your clock is reasonably accurate.
Then you have an internal sequence number, so you can generate multiple IDs per millisecond.
So the clock tick is millisecond.
And then there's another unique number, call it node ID or PgDog node ID.
So you can have multiple PgDog containers in front of the same deployment,
and it will guarantee that none of them will hit a duplicate.
By design, yeah, no collisions.
Exactly. So, you know, the 70 years is a little bit arbitrary. We can reduce any one of these components and increase the timestamp component. So we can extend it by just reducing the number of PgDog nodes that you can deploy. Right now it's 1,024, which, if you're running that many, you're probably Facebook. But you can reduce that by half and give that extra bit to the timestamp. And then you can also reduce the number of per-millisecond generations. Right now it's 4 million. You can generate 4 million per second, per millisecond... It's a lot. Um, yeah, probably reduce that. Actually no, let me double check. I think it's probably 4,000 per millisecond. That sounds more reasonable. Four million per second would make sense then, yeah. That's right. Yeah.
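As a sketch of the layout being described: 41 bits of milliseconds covers roughly 70 years, 10 bits fit 1,024 node IDs, and 12 bits give 4,096 IDs per millisecond per node, all packed into one bigint. The function name, epoch, and exact field widths here are illustrative assumptions, not PgDog's actual definition:

-- Hypothetical Snowflake-style layout: [41 bits of milliseconds][10 bits node ID][12 bits sequence]
CREATE FUNCTION demo_next_id(node_id int, seq int) RETURNS bigint AS $$
  SELECT ((((extract(epoch FROM clock_timestamp()) * 1000)::bigint) - 1735689600000) << 22)  -- ms since 2025-01-01 UTC (made-up epoch)
       | (((node_id & 1023)::bigint) << 12)                                                  -- which generator produced it
       | ((seq & 4095)::bigint);                                                             -- per-millisecond counter
$$ LANGUAGE sql;

SELECT demo_next_id(1, 0);  -- a time-ordered, unique 8-byte ID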
This is where documentation or a good documentation comes in handy. That's a big change from
last time as well. Yep. The biggest thing I'm proud of is adding an icon next to every single link.
just, you know, being very attention-to-detail, I suppose.
But, yeah, like, every single comment I get from people is like, well, is this documented, is this documented, anytime I talk about a feature. And I'm like, good point.
Now it is.
Thank you.
So if it's not in the docs, I want to say it doesn't exist, but that's not true.
There's a backlog of documentation to be done.
Yeah.
And I should help with this.
I would love for you guys to help out with anything.
Not I, AI.
Oh, yeah.
Maybe I as well.
What will help with AI.
Yeah, I'm looking at PgDog from time to time and definitely looking for some use case.
And as we all know, or maybe not all know, my team benchmarked it.
So I'm looking forward to revisiting that benchmark to see how that optimization related to the Protobuf helped.
I'm very curious.
This time we are going to test not with trivial queries like a simple SELECT, that's it, but with some heavy ones. That's why I mentioned I expect some benefits for heavier queries, bigger in text. Yeah. So yeah, yeah.
Documentation, I think AI should help with it. This is my point.
No, no way, man. Because they all sound the same. They all look the same.
You could give it a couple of tips, like, sound original. But if you write text using AI, it all just reads like it was written by the same author.
But at least the reference part... The usual problem with reference documentation, which lists various functions, config parameters, everything, is that it usually lags. And to maintain it, I think AI is great.
Yeah, I mean, I don't disagree. For me, writing documentation is kind of like a ritual. You know, I go through the things that we built and I'm like, oh, document it. And then I look at what I wrote and I'm like, wait a minute, this is wrong. I'm putting myself into the user's shoes. Like, I'm reading the docs and I'm like, okay, this is what it looks like.
It's an additional point where you can verify the whole picture and see if it all makes sense or needs to be adjusted.
Yeah, I agree. Also demos, right? Like, when you think how to
show what was built, sometimes you end up changing everything. Yep. That has happened before.
Because if you simply cannot explain it, it means it probably should be done differently, right?
Exactly.
Exactly.
So let's talk maybe about the hard stuff, like prepared statements, what happened to them.
Like, how PgDog supports prepared statements in transaction mode and so on.
Can you tell us?
Funny.
You know, we can do sharding, but prepared statements is where we really have issues.
Yeah, let's talk about it.
Sure, they work, they're fine, everything's great. It works.
I think you save them, right?
Yeah, they're cached. It's a global cache, so if multiple clients send the same statement, the same statement on the server side is executed, so pretty high cache hits.
Implicit or explicit? Like, protocol-level or the PREPARE keyword?
Protocol level right now, and partial support for explicit,
you know, since we have the parser.
For the explicit ones, I need to differentiate on the number of parameters.
Sometimes you can pass the data types in the prepared statements.
Sometimes you don't.
I just need to add a little bit more support to the parser there to extract the differentiation between otherwise identical prepared statements that expect different data types for some reason.
So protocol ones work.
I guarantee it.
If they don't, I swear they do.
But the other kind is still a work in progress.
Yeah.
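For context, the explicit kind are SQL-level PREPARE statements; the parameter type list is exactly the part that has to be used to tell otherwise identical statements apart (the statement and table names here are made up):

PREPARE find_user (int) AS SELECT * FROM users WHERE id = $1;
PREPARE find_user_by_email (text) AS SELECT * FROM users WHERE email = $1;
EXECUTE find_user(42);
DEALLOCATE find_user;
-- Protocol-level prepared statements are the Parse/Bind/Execute messages that drivers send automatically.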
And have you seen any issues with prepared statements used together with partitioning, or do your clients not use partitioning at all because they have sharding?
Why use partitioning?
I think we use them with partitioning.
I haven't seen any issues.
Like, if the query, if the prepared statement's plan is wrong, Postgres will return an error, and we evict it from the cache and then prepare it again.
That simple, right? Because the first few calls are not cached yet with prepared statements, because first it uses a custom plan, for the first five times. There is a hard-coded constant in the Postgres source code. And with partitioning it's much more complicated. I have a series of blog posts about this. It's super complicated. Yeah, I'm curious about this, maybe just nobody has encountered issues yet, but yeah, I'm curious about this.
No clue.
Yeah, let me know what I need to fix.
Yeah, I understand.
Very practical.
I understand that.
Yeah, but also, Nik, I think, like,
this feels like the kind of thing that actually gets a little bit better with sharding.
Like, you just reduce the number, you reduce the QPS, you reduce the load on each.
Like, it, everything that could cause issues around that is,
however many shards you have times less bad.
or less likely to hit the cliff?
Absolutely. Sharding reduces the need for partitioning, hands down.
But maybe it doesn't fully eliminate it, because partitioning,
if it's time-based, still makes sense within one shard.
We know cases of really large databases,
which combine both approaches.
Yeah, for maintenance reasons, it still makes complete sense, right?
Yeah, so yeah, it's tricky.
And if you want prepared statements, oh, it's, like... it's good?
Oh, it's, like, super complex.
So what's, what doesn't work?
With partitioning, compared to without partitioning. So partitioning brings the plan cache decision not at planning time, but at execution time. And this increases complexity, because if you just deal with a planning-time cache, it's quite simple. So if it's not partitioned, only the planner can decide, cache or not, right? That's it. If it's partitioned, the decision point is shifted to execution time.
And there is a point when basically you are without the cache and you cannot avoid it. Because when the cache is being built, at that execution, the sixth execution or fifth execution, at that point the executor doesn't have this cache, because it's being built. So we are without the cache. And in high-QPS cases, when the frequency of calls is super high, there is basically a moment when we must live without the cache, and it might be super non-performant. So performance can be terrible at that point.
Oh, I see.
Which doesn't happen in the unpartitioned case at all. This is, like... I already start forgetting the details, because there are so many nuances if you don't live there, if you only visit it like I did for a couple of weeks and then... But I have several blog posts about this problem, with demonstrations.
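A quick way to reproduce the behaviour described above, assuming a hypothetical partitioned table called measurements (the threshold of five custom plans is the hard-coded constant mentioned, and plan_cache_mode defaults to auto):

PREPARE q (int) AS SELECT * FROM measurements WHERE device_id = $1;
EXECUTE q(1);  -- executions 1 through 5 each build a custom plan for the concrete value
EXECUTE q(1);
EXPLAIN (ANALYZE) EXECUTE q(1);
-- From the sixth execution the planner may switch to a cached generic plan; on a partitioned
-- table, look for "Subplans Removed" in the output, meaning pruning now happens at run time.
SET plan_cache_mode = force_custom_plan;  -- one way to opt out of generic plans per session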
See, this is the prime example of why being a proxy is the best place to be for Postgres sharding.
Because to me, that problem is completely irrelevant, right?
Because I'm just passing statements through, right?
I think, unless I'm, because like they will work or, or they won't.
But from a performance perspective, like, if you have trouble with partitioned tables and prepared statements, you just don't use prepared statements, right?
Yeah, yeah.
Yeah, with partitioning, we, like, we must... So partitioning also brings the problem of an explosion of AccessShareLocks, because...
Okay.
Yeah, I'm talking about this in particular, but yeah.
So if you have a lot of partitions, we have the plan cache, that's one idea, but we also have routing to a specific partition, which is called constraint exclusion, right? We need to route to a specific partition. This is another idea. And at planning time, if the planner knows where to route, this is the normal case when you don't have prepared statements, it just routes to the specific partition, all good, no problem, we lock only one partition with all its indexes. But if we postpone it to execution time, it needs to lock all the partitions with all their indexes, and it's terrible.
With sharding, I don't know, you just...
Well, you reduce the QPS.
Yeah, this already helps. And the lock manager is not a problem anymore.
Or you just need to use Postgres 18 and raise max_locks_per_transaction.
In this case, the lock manager is also fine.
So yeah, anyway, this is an interesting topic.
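A sketch of how to observe that lock fan-out on the same hypothetical measurements table; with a generic plan, every partition and its indexes are locked at executor startup, and Postgres 18 sizes its fast-path lock slots from max_locks_per_transaction, which is why raising it helps:

SET plan_cache_mode = force_generic_plan;
PREPARE q (int) AS SELECT * FROM measurements WHERE device_id = $1;

BEGIN;
EXECUTE q(1);
SELECT count(*) FROM pg_locks
WHERE pid = pg_backend_pid() AND mode = 'AccessShareLock';
-- With many partitions this counts a lock for every partition and index, not just the one touched.
COMMIT;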
And I'm curious if some people combine sharding and partitioning.
I think they do sometimes. They do sometimes, and it's just in really big setups.
Yeah. I mean, as the guy who is trying to build sharding: you should always do sharding unless you can't, but, like, that's, you know.
I think there's a maintenance argument. Imagine if you've got like a logs table per shard and you want to delete old logs after a set, you've got a retention period. You probably still want partitioning to make those deletes really efficient. Like I think there's still a use case for it.
but it's not so much for performance reasons, but it's for maintenance reasons. That'd be my guess.
Yeah, to build indexes, to have vacuum working more efficiently.
Yeah, but I think those are just... I still categorize those slightly differently. I know they are maintenance, but they're kind of still performance reasons, right? You want them because you don't want the table to be too big. I'm talking about doing, like, deletes efficiently, being able to, instead of having to do batch deletes, you can...
Exactly. It's, exactly.
It's called not data retention, but there is some additional term... yeah, yeah, the opposite of it, yeah.
Yeah, there could be, like, a legal requirement for data retention. I think that's the only reason why people would ever delete data, right?
Well, the other reason is performance. So if you eliminate the performance argument, like if you make performance just free, because, you know, sharding makes that... I mean, obviously that's not the case, but yeah.
And if you compare sharding and partitioning, with partitioning we are often willing to have one active partition where we write, and all others are archived.
In case of sharding, it's strange, right?
Because a lot of hardware is not used.
Like if you have only one shard read, write, and others like archive, it's strange,
simply strange, right?
So you want to distribute rights, usually.
You want it to be equivalent.
You want like a perfect, like, all shards have the same load.
And that's why you'd usually use like hash-based partitioning,
if you can. Or, again, the way you design your schema for sharding is you make sure that each shard roughly has the same load. If you have a hot shard, I mean, you're going to have a hot shard problem sooner rather than later.
Yeah, exactly.
Especially for, like, multi-tenant, what I call multi-tenant B2B, basically, where your sharding key is, like, a customer ID, if you will, and then one of your customers grows big and you need a second sharding tier or, if you will, a second partitioning...
Does PgDog support resharding without downtime?
Oh yeah, yeah, of course.
How does it technically work?
I mean, it's good. It's been tested, so that's kind of cool. The idea is, I mean, I don't want to say simple, but I think practical. For resharding, this is where the contrast between how this works and how, like, NoSQL databases work is a little bit more apparent. With NoSQL, you kind of add another machine, and then it kind of rebalances all the data using the existing machines. The PgDog approach is you create a brand new, empty cluster, and then you copy all the data out into the new cluster. And this design decision is actually not because of a particular limitation, it's intentional.
Because you want to reduce the load on the existing cluster during resharding, because usually when you start resharding, it's already under load, it's already underperforming, it's already having problems. So you don't want to add additional load to it with resharding. So you create a
brand new cluster, you move the data over and we are able to use replicas since Postgres 16
to copy data out with logical replication. That was, that was a huge thing. That was, I think,
one of the game changers that made this possible is logical replication slots on replicas.
Why? Why does it matter? Because you can create 16 replicas and parallelize your copying.
Why does it matter? Because of disk I/O, or why does it matter? Like, you don't want to load your own primary, or what?
Well, you definitely don't. Your primary is already shot. Like, when you're sharding, your primary is already having a really, really bad time. That's the main reason behind sharding, you want to scale your writes. So adding additional replication load to an already, like, underwater database is bad. And especially when you want to create, let's say you go from one shard to 16 shards, an extreme example but realistic, you're going to add 16 replication streams to your primary. That's like 16 times streaming the same data. The thing is going to fall over, I guarantee it.
But the overhead is amplified here, so the same work is done on different CPUs, vCPUs, I guess.
Yeah, yeah, decoded and re-encoded.
Yeah, yeah, yeah.
So you want to shard this load, right?
Exactly.
Exactly.
Like we're able to use one replica per table, basically, for resharding.
So if you have like, say, a thousand tables,
you're able to shard your syncing stream and basically split it into 16 different machines
or 32 different machines and sync it in parallel, which I think is pretty huge.
It's all the same LSN, right?
Because they're replicas, right?
You can synchronize at the end and make sure that you're at the right point, and, like, pause the live traffic and then wait for the logical stream to catch up and cut over to the new cluster. Like, all of that, the pieces are there, you just need a button to make it work. We haven't built the button yet, but we built all the levers you need to push, you know.
So when you say you create a cluster, you mean create a Postgres cluster, basically one new primary?
That's right.
And when you say you copy, it means that there is already something in PgDog that copies, you just need to use it? It's not on the shoulders of the user, right?
No, no, we hook into the replication protocol to stream the rows, and we use COPY, binary COPY, underneath, and we shard the copy stream.
All right, this is the option since Postgres 16, or 14 I think, binary...
Oh, it's been a minute, yeah.
Yeah, you initiate a logical replica, I guess, with this option, binary equals true, for the initialization part. Interesting.
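PgDog speaks the replication protocol itself, but the plain-SQL equivalent of the same primitives looks roughly like this; the object names are invented, and logical slots on physical replicas need Postgres 16 or newer:

-- On the primary of the old cluster (the publication definition reaches physical replicas through WAL):
CREATE PUBLICATION reshard_pub FOR ALL TABLES;

-- On each new shard, point the subscription at a replica of the old cluster:
CREATE SUBSCRIPTION reshard_sub
  CONNECTION 'host=old-cluster-replica-1 dbname=app user=replicator'
  PUBLICATION reshard_pub
  WITH (binary = true, copy_data = true);  -- binary format for the initial copy and the stream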
Yeah, there is also a trick to convert a physical replica to logical, which is official since Postgres 17, I think. It's called pg_create_logical... pg_createsubscriber, a CLI tool.
Oh, interesting. I'm using, like, pgoutput, which streams everything in text format.
Yeah, yeah. But yes, right, but still there are additional things here. You can convert a physical replica to a logical replica, and we could do it before, but now it's official.
Oh, pg_createsubscriber. Unfortunately, we won't be able to use it on most managed services, because they don't expose this.
They should, actually. Yeah, they should expose this in the API, I hope they will. But in the case of self-managed, it's good, right?
Yeah, with self-managed you have a lot more options, for sure. I think the original goal was to make this work in managed environments, because that's where the...
Right. Yeah, you already mentioned Aurora, support for Aurora. That's great.
Yeah. And these replicas, so you create a logical replica by copying data and performing CDC,
right from a slot on a replica, which is supported since recently, right?
I'm curious, when you initialize a logical replica from this physical standby, does it have hot_standby_feedback on or off? Does it report to the primary that it still needs some data? Because if it does, the primary's health can be affected. If you eventually switch over, then it doesn't matter, but if you roll back... So I think, if this physical replica is not used by anyone else, I would keep hot_standby_feedback off.
To let it lag.
I would raise the max.
There are a couple of settings which allow a replica to lag.
If hot_standby_feedback is off, it's allowed to lag.
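The settings in question, sketched for a dedicated reshard-source replica; values are illustrative:

ALTER SYSTEM SET hot_standby_feedback = off;        -- don't make the primary hold back vacuum for this replica
ALTER SYSTEM SET max_standby_streaming_delay = -1;  -- never cancel replica queries; the replica is allowed to lag instead
SELECT pg_reload_conf();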
There are many, many nuances here in this process.
I'm curious how you do it.
But, okay, you're using a logical replica here.
Yeah, that's a good question.
Actually, my recommendation is to create a dedicated replica for this, or a dedicated set of replicas, but without hot_standby_feedback. I would do it; if it's a temporary replica, why not?
And, you know, also, how is this replica created, right? It's an interesting topic. Do you integrate with some backup tools, or is that just out of scope for PgDog?
I feel like at the moment it is, but maybe in six months it won't be.
Like right now I'm relying on people using like some kind of reliable way to create replicas.
If you use RDS, you just click a button.
Again, like if you're using like any of the managed services, you just click a button, you get a replica.
It's using disk snapshots.
If you're self-hosting, like if you have ZFS, presumably you'll have your snapshots somewhere.
You can restore from a snapshot and resynchronized pretty quickly.
But even then, it takes a while to sync the ZFS snapshot on disk.
Tested that one.
It takes minutes.
sometimes hours.
Oh, well, if you have like a 2 terabyte file system,
You mean the first send/receive?
Yeah, between the hosts?
Yeah, I understand.
Yeah.
It's surprisingly limited by, I think, actually, network.
I needed to, like, compress it and then do something else to it and...
I wrote a blog post about it on our old blog at PostgresML.
Oh my God.
I know.
Thank you.
but the website went down
so my shame went down with it as well
the blog post is gone
I think it was a good blog post, but I learned how to use
compression tools and ZFS and
you know big deal
Just out of interest,
are people generally using this with a managed service,
or is it quite split between
managed service and self-managed?
Right now it's split.
A lot of managed service usage,
especially, like, failover
for Aurora is a big deal.
Like failover detection, because you don't run Patroni typically.
So it's able to, like, query the pg_is_standby...
Sorry, what was the name of the function?
pg_is_in_recovery.
pg_is_in_recovery, thank you.
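That check is a one-liner, roughly what a health probe runs against each host:

SELECT pg_is_in_recovery();  -- false on the primary, true on a replica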
In my opinion.
Stop confusing people.
The people who know, know,
and the people who don't know don't need to know.
Yeah, and I'm sure you know,
Aurora loves to fail over on a perfectly good day.
So it helps a lot with uptime.
But one of our biggest sharded deployments
is literally on-prem, on-prem.
Like they rent their own data center and everything.
So it varies.
Like that's the beauty of it, like operating
on the application network layer.
You really don't care that much where you're deployed.
So based on this, almost a year since you started this,
what's your current vision for your
like, customer who needs PgDog? And what's your vision for future steps, like, where do you move?
Yeah,
this is a really interesting question, because sharding is always going to be the North Star, because it explains to people, like, this is a scaling product, ultimately. You can start anywhere. I already feel like it's a better connection pooler than PgBouncer. I already feel like the load balancing aspect of it is a game changer. And sharding, as much as it is a work in progress, is the North Star for us. That's how we want this project to evolve. We want to make Postgres as scalable as Cassandra or Scylla or Elasticsearch. I don't know what people think about those particular databases, but I think they're, like, horizontally scalable, that we can all agree. So we want Postgres to be like that as well. We're building, it hasn't been released yet, we're building an enterprise edition of PgDog, which adds, you know, the classic things like monitoring, understanding how everything works, like a control plane, something that we can sell as a product to people. You know, so you don't have to go through the many, many steps required to set up a sharding layer network proxy.
So that should come out within the next few months, hopefully.
Will it include automated sharding with AI?
Like, press a button and your database is sharded already.
You want automatic sharding with AI as part of our feature set?
I think we just want more consulting clients.
I don't think we'll ever get automatic sharding with AI, but I don't know.
Right now my idea for that is like,
dump the... use pg_dump, get the schema out,
pipe it into ChatGPT and ask it what to do.
With the prompt.
And test.
Yeah, exactly.
And test, yeah.
I don't know.
A human being is able to come up with a sharding key really quickly.
Like I talk to like a,
like a senior database person at a company or an app
engineer and they're like, oh, my sharding key is, like, they already know it's org ID or customer
ID or like they instinctively know how the data is organized and they even know what the edge
cases are. You know, like, I don't think you need AI for that one unless you literally have,
you've never seen this database before. But yeah, that's, that's the way I think about it.
Interesting.
On the vision stuff, how closely are you paying attention to the others doing sharding in the space, and how do you differ in terms of what you think your North Star is versus what you think theirs is, based on what they're saying?
I'll be completely honest with you, I literally never look at them. And it's not because, like, I think I'm cool or something, it's literally, I just don't have the time. Like, I have something in my mind, I'm building it. What the other guys are doing is usually more of a distraction than a learning point to me personally. So I have my vision, I have my North Star, and I'm going to follow it until I hit a wall or something. And so far I haven't, so
that's how I think about it.
Great.
A good note to finish this podcast on, I guess.
Is there anything left that you wanted to mention that we haven't asked about?
Well, let's see.
I'm going to do a post on Hacker News soon with all the stuff that we've done.
But I'm just going to blab with you.
It's kind of boring.
We added a bunch of support for aggregate functions.
We do query rewriting now, which is really cool.
We take the AST and we add whatever we need to that query to execute it.
One thing that I think is really, really phenomenal is we support sharding key updates.
That's something I think Citus never got around to doing: changing the actual value of the sharding key and moving the rows between the shards.
Wow.
That one's really fun.
I don't know how often it's going to be used, but anytime somebody asks about sharding,
they ask about that feature, so it must be useful.
You know, we support multi-tuple inserts, we rewrite insert statements, we separate the tuples and we send them individually to each shard.
You know, all of that happens transparently, which is really cool.
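As an illustration of what that rewrite amounts to, with an invented table where org_id is the sharding key:

-- The client sends one statement:
INSERT INTO users (id, org_id, email) VALUES
  (1, 10, 'a@example.com'),
  (2, 77, 'b@example.com');

-- If org_id values 10 and 77 hash to different shards, the proxy forwards one rewritten statement to each:
-- shard 0:
INSERT INTO users (id, org_id, email) VALUES (1, 10, 'a@example.com');
-- shard 1:
INSERT INTO users (id, org_id, email) VALUES (2, 77, 'b@example.com');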
We're just, we're focusing on usability.
Like the idea is you just deploy this and you don't change your app.
Like don't rewrite your queries, don't rewrite your code and just make it work.
But, you know, obviously, we're not there yet, but we're getting closer.
So every week there's something new that comes out.
And it comes from the field, right?
Yep.
It comes from people asking me, hey, is this supported?
I'm like, yes.
Almost.
Next week.
Thank you for the suggestion.
Cool.
Yeah.
Production driven development.
It is.
I really enjoy it because every single feature is immediately used.
It's great.
Cool.
Great.
Thank you.
Thank you for the update.
It's really good to see the project is growing.
Excited to test it once again soon.
That'll be great.
And I still have to review and merge your PR.
It's okay.
But for, I believe, 10 months since the last time we talked, something like this, it's good. Well, yeah, really good to see you again, Lev. Take care. Thank you so much.
Yeah, thanks so much for having me, guys.
Bye-bye.
