Postgres FM - PgDog
Episode Date: March 28, 2025
Nikolay and Michael are joined by Lev Kokotov to discuss PgDog — including whether or when sharding is needed, the origin story (via PgCat), what's already supported, and what's coming next.
Here are some links to things they mentioned:
Lev Kokotov https://postgres.fm/people/lev-kokotov
PgDog https://github.com/pgdogdev/pgdog
PgCat https://github.com/postgresml/pgcat
Adopting PgCat (Instacart blog post) https://www.instacart.com/company/how-its-made/adopting-pgcat-a-nextgen-postgres-proxy
PgDog discussion on Hacker News https://news.ycombinator.com/item?id=43364668
Citus https://github.com/citusdata/citus
Sharding & IDs at Instagram (blog post) https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c
Sharding pgvector (blog post by Lev) https://pgdog.dev/blog/sharding-pgvector
~~~
What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!
~~~
Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai
With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello, hello, this is Postgres FM. My name is Nikolay, Postgres.AI, and as usual, my co-host is Michael,
pgMustard. Hi, Michael. Hello, Nikolay. And we have a super interesting guest today, Lev Kokotov. Hi,
thank you for coming. Thanks. Hi, Nikolay. Glad to be here.
Yeah, and the reason we invited you is when I saw PgCat,
it was interesting, like at that time I had a spike of interest in sharding solutions.
And I saw PgCat was started as a connection pooler
alternative to PgBouncer.
Actually, at the same time, a few other teams started
some new projects, but then sharding was, okay, sharding was done in quite a straightforward way, just with comments.
It was great, interesting.
But I'm not a cat guy.
I'm a dog guy.
And once I saw PgDog started, focused on sharding,
it obviously attracted much more attention from me.
So that's why we invited you, actually.
Awesome. Yeah. Let's talk about, I don't know, like maybe sharding itself, and probably let's
start with the idea that not everyone needs it, because on a single Postgres cluster you
can grow and reach a multi-billion valuation, go to IPO, I have examples.
But obviously sometimes it's really tough, right?
Tough job to maintain a huge monolith.
So what's your take in general about the idea
that you don't need sharding?
Oh yeah, well, that's a good one.
I agree in principle.
So, you know, Postgres can be pushed quite far. And we
pushed it very, very far at Instacart, pre-IPO, I should add. But we IPO'd as a massively sharded
database. And we absolutely had to, because a lot of things started to run quite slow.
We had a lot of writes. Instacart was... you'd think most apps are 90% read,
10% write. I don't know if that was the case for us. I think we were a little bit more like
maybe like 80-20, maybe 70-30, but we were doing a lot of writes.
Hold on. 80-20 means 80 writes or?
No, no, no.
Reads, as usual reads, right?
Yeah, yeah, reads. Yeah, yeah. I'm thinking like 90% typical workloads, 90% reads, 10% writes.
Or even more sometimes.
Social media is definitely more reads, right?
But here's different.
Yeah, it was a little bit different, not by much, but even that small percentage, like
10%, 15% writes, was enough to push an r5.24xlarge, I don't know, like 192 cores, almost a terabyte of
RAM, over the edge.
Sounds like RDS.
Actually, I'm super curious, it's off topic, but I'm super curious about the so-called Instacart
zero-downtime upgrade approach, because we believe at Postgres.AI, we believe it leads to
corruption, but I hope we will discuss it at a different time.
Let's hold on to this.
Yeah, happy to talk about it on another podcast. I was one of the guys who worked on it.
Consider you invited. Zero-downtime upgrades made on managed Postgres,
which doesn't allow changing recovery_target_lsn.
Yep, I did it myself.
We had zero downtime.
Yeah, it's a super interesting topic.
Maybe in a few weeks, I'm really looking forward to it.
Sounds good.
So yeah, I understand that writes,
we can scale reads easily because it's just
a replica until some point.
But writes, we have only the primary.
That's, and you need to go either to services, microservices, or sharding, or a combination of them,
right? There are several routes, actually two major routes here. Yeah, Instacart actually sharded, what
we call functionally sharded. I don't know if that's a real term in the industry, but we just
took tables out of our main database and put them into a different database, you know, functional
sharding, air quotes, vertical split, vertical partitioning.
Exactly.
And that happened even before I joined the company.
So that was way before, you know, IPO talks and all that stuff.
So you can run out of capacity on a single machine
quite quickly if you use it a lot.
Sorry, that's a little bit of a tautology,
but that's just the case.
It depends on your workload.
We did a lot of machine learning.
So we wrote a lot of bulk data, like, daily.
We would have hundreds of gigabytes of data
that were completely new data every day, basically.
So a lot of writes were happening,
and sharding was the only way basically.
We were running out of... Our vacuum was always behind, and when vacuum is behind, everyone gets a little bit scared.
You know, performance is worse as well. You know, we were getting a lot of lock contention on the WAL.
That's, you know, that happens a lot. You know, the WAL is single-threaded to this day, and that's totally fine, but
you end up, you know, when you write like hundreds of thousands of transactions per second. Well, single-threaded is if you
use the WAL writer, but backends can write to the WAL themselves if you turn it
off, this setting. Yeah, you're right, you're absolutely right,
but there's still a lock, there's still a lock, yeah, they have to grab a lock. We see
in Performance Insights, if it's on RDS, like, queries are spending time on
LWLock.
That's the guy.
And that's the guy who takes your website offline on Sunday every morning, like a clockwork.
WAL write contention, of course.
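For reference, a minimal sketch of that kind of wait event check, using pg_stat_activity (in practice you would sample it repeatedly; wait event names as in recent Postgres versions):

```sql
-- Sessions currently waiting on WAL-related lightweight locks (e.g. WALWrite, WALInsert)
SELECT wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event LIKE 'WAL%'
GROUP BY wait_event
ORDER BY count(*) DESC;
```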
And I would say traditional optimization would be, let's write less, but I assume only in Postgres 13
WAL-related metrics were added to pg_stat_statements and EXPLAIN. So before that it was very difficult to
understand which queries, well, you can, based on the wait event analysis,
identify queries which are WAL-, well, write-intensive, right?
And maybe optimize them, but it's still a limited approach, right?
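For example, something like this surfaces the most WAL-heavy statements, assuming pg_stat_statements is installed on Postgres 13 or newer:

```sql
-- Top statements by WAL generated (wal_bytes, wal_fpi, wal_records exist since Postgres 13)
SELECT left(query, 60) AS query,
       calls,
       wal_bytes,
       wal_fpi,      -- full-page images, often the hidden cost after checkpoints
       wal_records
FROM pg_stat_statements
ORDER BY wal_bytes DESC
LIMIT 10;
```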
Well, it's limited in its success in a company because when you tell somebody like, hey,
can you write less to your database, people are like, I'm sorry, but is that your job?
Like, you can write less, but we have...
WAL writes are our problem. For a query,
consider a query which writes to a table, and two situations.
This table has a couple of indexes versus 20 indexes.
And when we insert something, every index gets some insert.
And it amplifies the amount of WAL written for the same insert, for each index we have, right?
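A rough way to see that amplification from psql, as a sketch with a hypothetical orders table; note that the LSN is instance-wide, so concurrent activity on a busy server gets counted too:

```sql
SELECT pg_current_wal_insert_lsn() AS lsn_before \gset
INSERT INTO orders (customer_id, total) VALUES (1, 9.99);
-- WAL bytes generated since the first call; grows with every extra index on the table
SELECT pg_wal_lsn_diff(pg_current_wal_insert_lsn(), :'lsn_before') AS wal_bytes;
```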
It's really important to know that every single index has
an important business use case.
And that's why it was put there in the first place.
So, like, all of these are invariants.
Like when you say, like, I have 20 indexes on a table
and it's write amplification, like, that's by design.
And when you want to write more data to that table,
because your business is growing,
like us as a database engineering team,
we're like, all right, we're going to make it happen
because we can't go to like 25 individual teams
and tell them like, hey, can you fix your query?
They're going to be like, I don't have the time.
I'm trying to sell groceries here.
You know, like I have other concerns.
But anyway, I'm trying to say, I agree with you that
without sharding, we are limited, like,
we need to squeeze, squeeze, squeeze, but it's very limited. And at some point, you cannot say we can
grow 2x anymore, right? That's right. And that 2x is actually the low bound. What most engineering
leaders expect is like 3, 5, 6x. Yeah, 10, to have that runway. If you don't have that runway in your
system, it's a red flag for most. And it goes to CTO level, basically. The decision is like,
should we migrate off Postgres to I don't know where. Right. So exactly. I've seen this so
many times, including sitting in the CTO position myself, like 15 years ago. I know this pain, sharding.
It's really funny.
You're just kind of sitting there and be like,
hey guys, I need 10x capacity.
And your database engineer is like, well, it's the WAL.
There's nothing I can do.
It's the WAL.
It's like, there's just like a disconnect
between these two people.
You need to like, you just need to make this thing scale.
Like that's the end.
And a lot of these are out of control.
But let's, like, we can also enable compression for full-page
writes.
We can also partition so writes become more local.
I don't know, you're skeptical.
Because all of that, partitioning is great.
I would never speak badly against partitioning.
Not every partitioning.
Sorry for interrupting.
Not every partitioning. If, for example, people say,
okay, we are going to spread all customers
evenly among partitions,
this is not going to help us in terms of WAL writes.
But if you say we will have a hot partition,
like, receiving most writes,
and kind of archive partitions where autovacuum
and everything has, like, stabilized. In this case it's super beneficial for WAL writes and write-
intensive workload optimization, right? Absolutely, but then you're gonna end up
having to read those cold partitions anyway. It's a temporary band-aid, which
we did. Again, absolutely we partitioned a lot of tables and that helped us with a
lot of things, but at the end of the day, you just need more compute.
It's the bitter lesson of databases.
I don't know if you're familiar with the bitter lesson in AI where you just need more machines
to solve the problem.
It's the same thing in databases.
You just need more compute to just be able to read more stuff and write more stuff.
Yeah.
I'm still pulling us back intentionally, because we are having conversations about,
like, scaling problems, and an interesting insight I got yesterday, talking to a very experienced
Postgres expert, is that partitioning not only helps with, like, data locality, write locality
and full-page writes and WAL amount as well, but also with backups, surprisingly,
if we do incremental backups with WAL-G, pgBackRest,
with the new incremental API, I still
need to learn about this more.
Or snapshots, cloud snapshots of EBS volumes, and RDS
also relies on it, as I understand.
Imagine writes are spread out everywhere,
and you constantly change a lot of blocks, versus you change
only specific blocks, and some blocks are not changed.
It's less pressure on backups as well.
It's interesting, and it affects DR and RPO, RTO.
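To make the hot-versus-archive idea above concrete, a minimal sketch with a hypothetical events table, range-partitioned by time:

```sql
-- The newest partition takes nearly all writes; older partitions go quiet,
-- so autovacuum, full-page writes and incremental backups touch far fewer blocks.
CREATE TABLE events (
    id         bigint,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');  -- effectively archive
CREATE TABLE events_2025_03 PARTITION OF events
    FOR VALUES FROM ('2025-03-01') TO ('2025-04-01');  -- hot
```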
But I just want to test my understanding. If we go to sharding blindly and miss some
optimizations... I remember I was dealing with Mongo, maybe 10-plus years
ago, and I saw each node of Mongo could handle much less workload than one node of Postgres.
Maybe jumping to sharding too early
is premature optimization, and you will
not be in good shape in terms of how much money
you spend on compute nodes,
because you missed a lot of optimization steps.
What do you think?
Oh yeah, absolutely.
If your first step is like, my database is broken,
I need to shard, you missed a lot.
Of course, you should look at all the possible
optimizations. And that's really important to keep you, like, on a
single machine. I think the benchmark for me, for when you
need to start thinking about sharding, is when you've tried like
three or four or five different things. And then you're like a
person or two working on it full time. And at the end of the day,
they're like, I ran out of ideas.
Do you have some numbers, like, for modern hardware,
like how much WAL can be written per day, for example,
like a terabyte, five terabytes, where it's already at the edge?
Right?
Yeah, for sure.
I can give you the exact number.
We benchmarked Postgres in the best possible scenario,
actually.
It was on EC2, it had NVMe drives, it had ext4, RAID
zero. So we did not want durability, we wanted performance. That thing could write four or
five gigabytes per second using Bonnie++. Now Postgres was able to write about 300 megabytes
per second of just, like, whatever. I think we were using COPY, like the fastest possible
way to dump data into PG. And it's just, it's the nature of the beast, right?
You have to like process it,
you have to put it into a table.
I'm sure there's a lot of like checks and balances
going between and again, lock contention
and all of that stuff.
So you can't necessarily squeeze out everything
out of your machine if you have the capacity.
So ultimately you need to split the processes themselves between multiple machines once you reach that number. And then, you know,
for everything else, if you have, again, if you have write amplification, things are getting
a little bit more tricky. And so I think that's usually the number: once you're
writing, like, you know, 200 megabytes per second of just WAL, and you can see that in RDS,
there's a graph for that, just
WAL writes. 200 megabytes of WAL per second; each WAL file is 16 megabytes, but on RDS it's 64,
they changed it. So how many WAL files is that, like, it's a lot, right? It's a lot, and that's the
scale you're looking at. Ten-plus per second is already too much.
Yeah, I agree.
Yeah.
And so that's the red line.
Right?
If you're at the red line and you're thinking about sharding, we're going to have a bad
time.
And that's okay.
It's okay to have a bad time.
These lessons are learned when you're having a bad time.
I had a really bad time.
Yeah.
And I pulled the calculator just to check 200 megs per second, it gives you 17 terabytes
per day.
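The arithmetic behind that, as a quick check (runnable as plain SQL):

```sql
-- 200 MB/s of WAL with the default 16 MB segment size
SELECT 200.0 / 16                  AS wal_files_per_second,  -- about 12.5
       200.0 * 86400 / 1024 / 1024 AS terabytes_per_day;     -- about 16.5
```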
That's mild, honestly, for what we did at Instacart.
We had, like, petabytes of WAL in S3 after, like, a week.
And that's just, again, it's the nature of the beast.
From a single cluster or multiple clusters?
Single cluster.
That's insane.
It is.
It's called selling groceries.
What can I tell you?
We had like, you know, hundreds of gigabytes
of machine learning data that would come in every day
because the model was retrained on a nightly basis
and all the embeddings have changed, right?
Again, the use cases are almost like,
I mean, they're interesting, but they're kind of off topic, I guess, because they're just
like, you just write a lot of data.
That's what you do for a living, right?
That's what Postgres is for.
That's what databases are for.
So the numbers are like, well, this is the number and that's when you should split up.
Yeah.
So let's probably, like, just for more sane situations, let's say 10 terabytes per day
or something of WAL data, it's already too much.
If you approach that, soon it's already a wall
you will be hitting.
Several walls inside Postgres you will be hitting.
Including lightweight locks.
Yeah.
Yeah.
Several walls.
Yeah.
I also wanted to say, lightweight locks,
WAL-related lightweight locks, WALWrite, right?
I don't know.
There are several of them.
Yeah, but it's a...
Yeah, this is the scale to be scared about.
It's hard to put a number on it because you can push that
further and who knows what's going to happen to your app.
Maybe that's as high as you'll go.
And not everyone has local NVMe's because in cloud environments, they are ephemeral.
It's kind of exotic and risky.
It is. It's definitely risky.
I love those because when you click Reboot on the machine, it wipes the encryption key and all your data is gone.
So you better be careful which button you press.
There's like a hot restart, which is safe,
and there's the actual restart,
which will delete all your data.
So, you know, stay on RDS if you can.
But no, I don't agree here, because you just,
you should just use Patroni and multiple replicas,
at least three nodes, and you'll be fine to lose one node.
You lose one, something random, bad, anyway.
That's off topic, yeah.
Okay, thank you.
For me, it's enough to understand
that there are cases definitely
when you do need to split vertically
and splitting vertically usually is limited as well.
It's hard sometimes and at some point you need, we need sharding.
Yeah, so it's enough to explain that definitely
there are cases where you need to scale Postgres
beyond one primary.
It can be vertical or horizontal
and the result is sharding, right?
That's right.
Good.
Yeah, this is great explanation.
Well, sharding is definitely needed, but not for
everyone maybe, but for many.
I mean, yeah, I hope not for everyone. That would be, I mean, you know, we solved the
problem once and for all. Actually, last week, Michael and I had discussion about snapshots
and I said that Cloud SQL has a hard limit of 64 terabytes, because this is the limit for GCP persistent
disks, including PD-SSD.
And I said RDS obviously allows you to grow further, beyond 64, but it doesn't.
After we recorded, I checked: same limitation, 64 terabytes. It's another reason to have multiple primaries and
scale beyond one cluster. Well, you'll hit that sooner on Postgres because of the 32-terabyte table
limit, and actually I did hit that limit once, and that was a scary error. The only thing that it
says is like there's no more bytes to be allocated or something
like that in some kind of database file.
I'm like, what?
And then you're done.
There's nothing you can do.
You have to move data out of that table.
You have to take the database offline, obviously.
Even if it's a single table.
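For reference, that limit follows from 32-bit block numbers times the page size; a quick check:

```sql
-- Max heap size per table (excluding TOAST and partitions): 2^32 blocks x block_size
SELECT current_setting('block_size')::bigint * 4294967296 / 1024^4 AS max_table_size_tb;  -- 32 with 8 kB pages
```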
Okay.
Okay.
Yeah.
Yeah.
A single table.
Yeah.
Which is actually not uncommon, like, 32 terabytes these days.
Well, it's still not super common as well.
I see only cases like kind of 10 terabytes per table
and it's already when any DBA should scream,
where is partitioning?
DBAs like to scream, but application engineers be like,
I can't, aren't you the DBA?
Partitioning unfortunately requires application code changes.
Yeah, so sharding is needed for big projects, obviously.
Agreed.
What's next? Can we go... I reckon it's time to go back to the origin story of PgDog, and obviously
then PgCat.
It would be great to hear like a little bit of that story, Lev.
Yeah, absolutely.
I'm happy to.
Do you guys care about the name of the project?
I don't know.
Or do you want to know what the inspiration for that came from?
Sure, it's fun.
Yeah.
Okay. Well, when I started PgCat, we just got a new cat
and I really loved that cat.
And I was working on Postgres,
so the two things came together naturally.
You could probably guess the origin story of PgDog now.
I got a dog and then I'm like, look, I love this dog.
What could I call my next project?
Obviously PgDog.
Yeah, so the naming issue is solved.
PgCat came from an idea that was really simple.
Sharding was not even in scope back then.
It was just like, we ran PgBouncer.
PgBouncer could only talk to one database at a time.
It didn't make sense.
You were pooling like just the Postgres instance.
We had a bunch of replicas.
We needed to load balance.
And we needed a load balancing algorithm that was smart.
When a replica went offline because of, again,
hardware issues, scaling issues, whatever,
we needed to remove it from the rotation
without affecting the app.
So we would regularly lose a replica
and then most of the site would go offline
because we had a Ruby gem that would randomize
access to those replicas, and when one of them broke...
It worked okay, but doing this in the application code is really hard. Especially in Ruby,
there's a way to inject exceptions into Ruby threads, like Sidekiq does, and basically that breaks your state.
So we had multiple gems working against each other. And we just needed to do that in a place
where it made more sense. A load balancer is typically outside the application
because you have multiple applications
talking to it anyway.
So I just built a load balancer basically.
It was after actually I left Instacart.
I was just doing it as a site project
just to keep my mind going.
So I built it.
It was really simple, used, you know,
banning logic, which was kind of novel at the time
for Postgres, you know.
If you receive one single error from the database,
it's removed from the rotation. It's very aggressive, but Postgres never,
never throws errors like network-related errors unless it's a serious problem. So that actually
worked pretty well. I talked to a friend of mine at Instacart, I was like, hey, look, I built this
on the side. That looks fun, right? Like we thought about this. And he's like, yeah, all right.
Didn't you quit? Yeah, but now I have some free time to work on this. Right. And he's like, okay. And then he took the code, added a bunch of features that I didn't add.
Cause obviously I didn't have a use case anymore for it.
And he's like, all right, great.
We're going to deploy it.
Right.
And we tried it and they use it and they're, I mean, they put so much work into it.
They wrote a blog post about it.
You probably know about this.
So that went pretty well, you know, so it's working in production.
It's great.
And then I'm like, all right, well, you know, sharding is the next one.
They have a bunch of sharded databases that we sharded, and adding sharding routing
to that would be great because again, it was done in the application layer and
application layer routing, you know, I think we all agree is a little bit iffy,
especially if you have more than one app written in more than one language, like
you have to repeat the same logic across all apps.
So I had to just build a common system.
I knew like there's two sharding schemes at Instacart.
One uses the actually the hashing function
from partitions in Postgres.
I love that because you can actually split data
both at the client and in the server.
So you have multiple ways to move your data around.
And the other one is just a custom one.
Like we use like SHA1, take the last few bytes
and then mod that.
That's just, you know, it's random.
But it was available in multiple systems as well.
The data for that came from Snowflake.
So we could actually shard the data in Snowflake
and then ingest it into the instances directly.
And then on the routing layer in Ruby,
same hashing function, you know,
the sharding key is always available.
So that was good.
So I added both of them, and they're like, great, that's great.
And they chose, I think, the SHA-1 function,
and I think it's working pretty well for them.
So that was fun.
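As a rough illustration of the first scheme, reusing Postgres's own hash-partitioning function so that client, proxy and server agree on placement, here is a sketch with a hypothetical orders table sharded on customer_id:

```sql
-- A 4-way hash layout expressed as Postgres hash partitions
CREATE TABLE orders (customer_id bigint NOT NULL, total numeric)
    PARTITION BY HASH (customer_id);
CREATE TABLE orders_0 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ... orders_1 .. orders_3 with remainders 1..3

-- The same function the server uses to place rows can be called by a client or proxy,
-- so routing logic outside Postgres stays consistent with the data layout:
SELECT satisfies_hash_partition('orders'::regclass, 4, 0, 12345::bigint);  -- true if 12345 belongs to shard 0
```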
Then I started another company that had nothing
to do with any of this.
PostgresML, you might've heard about it.
That came from, you know, the idea that we shouldn't ingest,
you know, hundreds of gigabytes of machine learning data
into Postgres,
why we should just ingest the three megabyte model
and run inference online.
It was okay.
Stayed about two and a half years there.
Didn't work out.
There's a lot of startups too.
Left.
And then I had some free time and I'm like,
well, what do I like to do in my free time, guys?
Writing Postgres poolers.
This is what I do.
This is what I do to, you know, rest on vacation.
I write Postgres poolers.
So I'm like, well, let's do sharding for real this time.
Let's actually take what we built at Instacart,
make it into software because that's what we do, right?
We come up with an idea with a problem.
We find a problem, we find a solution.
We write it in code and we don't have to solve it again
every single time manually for like, you know,
hopefully hundreds, hopefully thousands of use cases
for this, you know, we'll see,
I'm still doing my research, but yeah.
So that's PgDog.
Sharding is, you know, the number one selling point,
everything else is, you know, there's obviously,
it's obviously a pooler, it's obviously a load balancer,
it has all the features that PgCat has,
almost all of them, I'm adding them as I go.
It's a rewrite, it's brand new.
I like new code.
That's what everyone loves to hear.
Hey, you rewrote it from scratch, great.
You know, that code that we battle tested in production
and serving like half a million transactions per second,
well, that's obsolete now, I guess.
You're just gonna take this brand new code base
and check it out.
And it's written in Rust.
Yeah, absolutely.
It's in my kind of dreams to find some spare time to learn it and try it, because it looks like many folks are moving to Rust.
So what do you think?
Well, it took me about a decade to get good at it.
So the sooner you start, the sooner you will get it.
You know, it takes about 10 years to be good at it.
So at least that's what it took me.
Honestly, I was like, that's okay.
Again, if you start today,
eventually you'll get good at it and that's okay.
It's a journey.
You don't have to learn it like immediately.
It's just, it's such a paradigm shift in how things work.
The compiler is very aggressive about checking for things, especially concurrency errors, which is
for multi-threaded asynchronous poolers, very important. I don't have concurrency bugs in
PgDog or PgCat because I'm using Rust. I don't have data races. And that's really important.
And the number of bugs that I ship is considerably lower
because the compiler is like, hey, listen, this variable, you're not using it. And I'm like, oh,
yeah, crap, I'm actually importing the wrong variable here. I'm using the wrong variable.
Good catch, because that was going to be a bug. A lot of things that just in other languages are
not available, Rust makes really nice. So it's really worth, again, you ask anyone who writes Rust
to be like, that's the best thing that ever happened to me
since sliced bread.
I'm like, that's true.
I haven't thought about sliced bread in a while,
but Rust is great.
Cool.
Another, like, side question is the license.
I saw feedback, I joined that feedback.
So PgCat was, I think, on Apache or MIT, I don't remember
exactly, but kind of permissive. And for PgDog, you chose AGPL, right? Can you elaborate a
little bit why?
Yeah. Yeah, of course. Yeah, happy to. The answer is really simple: AGPL is actually
pretty misunderstood.
You can use that code and the application anywhere you want
and never have to share anything back.
As long as you use it internally and you don't provide,
like, you don't use PgDog, like, publicly
as, like, a service for running PgDog.
And even if you do that in that case,
all you have to do is just tell us like,
what did you change and send us patches, right?
Or what do you change?
So like, it's pretty minimal, but it's a red flag for everyone. And that's okay. Yeah, I'm building
a company around it. Building a company around MIT codebase is, I think, probably possible. I've
never done it successfully. Building a company around AGPL, I think, has been done before.
And I think it's probably fine. If it becomes a hurdle, I'm not,
like, I'm not, like, married to it. I just thought AGPL looks cool. I like the
ideas behind it. I like free and open source code. I don't think MIT is
necessarily the original idea behind, you know, free and open source code. I
like MIT because I don't have to think about it. I'm checking PostgresML. PostgresML is MIT, right?
Yeah.
The codebase.
Yeah, it is.
Yeah.
Rust and MIT.
So it's interesting how you decided to change it.
I mean, I agree if it's misunderstood.
Yeah, but it's already so.
Majority of people misunderstood it,
and we cannot change that with a single project.
So it's a reality.
Yeah.
But you don't care if it's okay.
Like you do it.
Yeah.
Yeah.
Because, like, if somebody tells me, like, look, I would love to use your code, but AGPL is
a deal breaker,
I'll be like, well, we'll work something out.
You know, that's, that's not a big deal, you know.
But I think that's a good, it's a good thing to have a good license.
It's important.
You mentioned starting a company around it.
It strikes me that it's going to be tough.
Like, obviously, new code base, the main use cases are at scale.
But normally startups, like the easiest way of getting started is serving smaller companies,
right?
Like it's harder, like going straight to the enterprise
with something that's not yet battle-tested.
Like what's the, you've got a plan though.
It'd be great to hear like, what's the plan?
What's the plan?
It's okay, don't freak out, it's gonna be okay.
Yes, it's not actually that uncommon
to have enterprise startup products.
If the problem is interesting enough,
there's always gonna be somebody's gonna be like,
oh great, somebody's working on it full time.
That'll be amazing.
At this early stage, how this works usually
is I'm looking for design partners.
So it's companies like Instacart who are like,
hey, this is a great idea.
We're gonna try it out.
We're gonna develop it together.
And at the end of the day, it's gonna be in production
because we built it together.
And that's actually good
because you wanna build it with the users.
Like you don't wanna build it by yourself
like, for several years, and then show up and be like,
hey, does anyone need Postgres sharding?
And Nikolay is like, well, I don't know, maybe.
Maybe not, depends.
So what I'm actively looking for right now,
like code base is okay.
There's, I'm sure there's bugs in it,
performance issues and that's totally fine.
I'm just looking for people who'd be like,
this is an interesting idea.
Like I like the idea of post-gust sharding.
I like the way it's done at the pooler.
It's not done as an extension.
It's not done as like some kind of like other thing
that I can't even think of.
Like I like the way it's done.
So I'd like to try it out and help you finish the job.
You know, by deploying it in production,
by benchmarking it, by finding bugs, by reporting bugs,
by even fixing bugs would be great, but not required. That's my job. I can confirm you were very quick reacting to requests. I remember
looking at PgCat, I had the idea of mirroring, right, let's have mirroring to have, like, a kind
of A-B performance testing right in production, and you implemented it, it was great.
I think you were like, you were-
Actually to be perfectly correct,
it was actually Mustafa at Instacart who implemented it
because he had a use case for it.
He was my design partner.
It's a very common request, and the only problem is, like,
you need to have this pooler already in production
to use it. This is the trickiest part, because it's zero to one. It's always tricky, new stuff,
especially in the hot path, is always gonna be hard. But if the problem is there and
if the problem is big enough, I'll find my champion. Yeah, exactly. So, yeah, and
you mentioned PgDog.
Maybe let's move back to technical discussions.
A little bit out of business and license.
So first of all, it's not like PgCat.
It's not explicit sharding, where you command with SQL
comments how to route.
PgDog has automated routing, right? And second thing, there's no Postgres
as, like,
middleware for this, right? So it's just a pooler with routing, right?
Can you explain like architectural decisions here and what do you use and
like what kind of components like, like you got a parser from Postgres to understand queries.
I'm very curious how you're going to automatically route
selects of functions which are writing, for example.
Or select for update, which you cannot route to a physical
standby replica, right?
Or I don't know, something else, like how you are going to,
or you obviously will have some limitations,
already have some limitations, right?
So can you talk about this a little bit?
Yeah, of course.
Well, select for update is actually really simple.
That's a clear intent to write something. So that's an easy one straight to the primary. No problem. That's
an easy one, which I actually should implement now that I'm thinking about it. I'm routing
it to the replica right now. Bug issue incoming. Thank you very much.
The other one is the functions. Obviously, it's impossible to know if a function is writing
or not by just looking at it.
Even if you look at the code, you're not going to know.
So like static analysis, I don't think is necessarily possible.
So for that one, I think it should be pretty easy.
You put it in the config, you have a list of functions that actually write.
This is what Pgpool does, right?
I think so.
Yeah, I'm not sure.
Yeah, but you know, Pgpool does everything, so I'm sure they do it as well.
Exactly. You have to be careful. You can't do everything. People don't believe.
Right.
I know that approach is not perfect because if you add a new function, you have to update the config.
You're always going to forget and you're always going to have issues.
So for that one, I don't have a solution. My theory so far is that it is not as common as
you'd think, but I will probably be proven wrong. And then we'll
figure something out, probably some kind of migration process
that says, if you want to add a function that writes, send it
through the... like, you should be writing migrations and sending
them through the pooler, not some kind of
side channel. And you can probably mark that function as,
like, hey, this function writes, like, put it in a comment or something.
And then PgDog is gonna be like, great, good to know.
You're gonna need persistent storage for that kind of stuff,
which you could probably implement.
Yeah, I agree.
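To make that concrete, here is why parsing alone can't catch it, with a small hypothetical example (the clicks table and function are made up):

```sql
-- To any query parser this call site looks like a plain read-only SELECT,
-- but the function body writes, so it has to be routed to the primary.
CREATE FUNCTION record_click(p_user bigint) RETURNS void
LANGUAGE sql AS $$
    INSERT INTO clicks (user_id, clicked_at) VALUES (p_user, now());
$$;

SELECT record_click(42);  -- on a replica this fails: cannot execute INSERT in a read-only transaction
```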
Some dogs behave like cats sometimes.
Exactly. Yeah, exactly.
People will forget to put that comment in
and then there are gonna be issues,
but that's just software.
When there's people involved,
there's always gonna be...
The more manual stuff you have to do, the more problems there are going to be.
But if you're writing a function by hand in the first place, you know, that's just going
to be part of your review process.
I was reading Hacker News comments and somebody asked about aggregates as well.
I think there's a limitation around those at the moment.
What's the story there?
Well, when I posted the thing, I didn't have support for aggregates at all. But then I'm like,
hey, you know what, let's add some I've been thinking about it for months. I might as well
just do a couple of simple ones. So I added count, a simple one, you just sum the counts across
all shards. Max, min, again, super simple.
I added sum as well. Sum is really just sum everything.
Small comment here, it's MapReduce basically,
like, an analogy for it. It's cool, right?
So it is cool, right?
This is cool, like, you know, it's-
You can scale to billions, trillions and so on, right?
Precisely, yeah, this is like a MapReduce
for those guys is phenomenal, you know?
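A sketch of the idea, not PgDog's actual code: each shard runs the same aggregate locally, and the proxy merges the partial results (collected here into a hypothetical shard_results relation); note that avg has to be recomputed from sum and count:

```sql
-- Map: every shard answers the same query, e.g.
--   SELECT count(*) AS cnt, sum(total) AS sum_total, max(total) AS max_total FROM orders;
-- Reduce: merge the per-shard partial rows
SELECT sum(cnt)                  AS total_count,   -- counts add up
       sum(sum_total)            AS total_sum,     -- sums add up
       max(max_total)            AS overall_max,   -- max of maxes
       sum(sum_total) / sum(cnt) AS overall_avg    -- averaging the per-shard averages would be wrong
FROM shard_results;
```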
I'm actually, okay, so I'm gonna release a blog post in a couple of days. I'm actually sharding and doing MapReduce for pgvector. I don't know if you've heard about this one. Yeah, that one's really fun. pgvector is like a...
No, no, pgvector we know. What do you do with it?
Well, you're going to have to wait till the blog post comes out, but it's really fun. I'm doing both MapReduce and a machine learning algorithm to route queries in the cluster
because like, scale, like searching vectors is a completely different problem than searching
B-trees.
So I don't know how many people will need that solution.
Well, HNSW is terrible if you go beyond one million records, it's a big problem
still.
So that's the number I had in my blog post as well.
I don't know why 1 million just feels like the right number,
but that's exactly what I said.
Anyhow, over a million,
you probably need to shard your pgvector index.
Or use a different approach.
Yeah.
Exactly.
That's great.
And MapReduce, someone mentioned in the past
that PL/Proxy by Skype, very old tech, also looked like
MapReduce, but it required you to use only functions, which is a huge limitation, especially
if you have ORM or GraphQL.
It's a big showstopper.
And also it was Postgres in the middle for routing and for this MapReduce. In your case, this is
more lightweight software in the middle, PgDog, right? And it does some simple, like, arithmetic
operations. Do you plan to define some interface for more advanced operations that the user could define, like, beyond simple sum or count or other aggregates?
I haven't worked much with custom data types, UDFs and all that stuff, so that's going to be a learning curve for me, I'm sure.
I'm sure it's not that hard, but once you add custom functions, you need to add custom logic.
I think that should be pretty straightforward to implement if there's synchronization between how the cluster is working
and PgDog.
This would give full-fledged MapReduce capabilities
to this, right?
Yeah, absolutely.
That opens up interesting perspectives, I suppose.
Absolutely, yeah.
If I find someone who thinks this is cool as well,
we could definitely build it together.
Not only thinks it's cool, but has some production to try, right?
Because it's just...
Absolutely.
Exactly.
Yes, absolutely.
I think that would be pretty terrific.
It will work pretty well.
But yeah, again, there's a lot of interesting things about aggregates that, for example, like
percentiles, notoriously difficult,
basically impossible to solve, I think,
at the sharding level,
because you need to look at the whole data set
to compute it, you can approximate it.
Approximation functions should be like a feature
that we add to, like, PgDog, that says, like,
you know what, I don't care about the exact number,
like average, the simplest one,
you could estimate it.
If you have-
For percentiles, we could define some custom data type
to remember how many members were analyzed and so on.
I don't know.
To bring not just one number from each shard,
but a couple of numbers,
and then it will be possible to understand percentiles.
Maybe, should be.
Maybe, yeah.
Not super difficult.
Feels like a similar, like, HyperLogLog had some similar,
like, I don't know how you do that cross-shard,
but it feels like there might be some secret source
in what they've done already
that could be applied more like cross-shard.
Yeah, HyperLogLog is like a,
it's like a counter,
basically approximates how many members are in a set.
Actually, there's an extension for it in Postgres.
It's pretty fun.
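For example, with the postgresql-hll extension (assuming it's installed), the sketches can be unioned, which is what makes approximate distinct counts mergeable across shards; the events and shard_sketches relations here are hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS hll;

-- Each shard builds a small HyperLogLog sketch of its distinct users
SELECT hll_add_agg(hll_hash_bigint(user_id)) AS users_sketch FROM events;

-- Merging: union the per-shard sketches and read off the estimate
SELECT hll_cardinality(hll_union_agg(users_sketch)) AS approx_distinct_users
FROM shard_sketches;
```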
But yeah, I need a statistician, like a data scientist,
to come and be like, all right, this is how you approximate
percentiles.
And they're like, great.
Do you know Rust?
Because, yeah.
What about PL/Rust? Cool stuff. Yeah. What do you think?
Exactly, yeah.
Well, so what is, like what's next?
What are you looking,
I guess it depends on what people want,
but what does tomorrow look like over the next week?
Well, you're gonna be shocked to hear this,
but I'm an engineer who does sales now.
I literally just like, yeah, as I said,
I'm building a company,
so I'm literally
just sending as many like LinkedIn and emails to whoever I can think of to find design partners
to for people to be like, Hey, I want to help. I want to, I, this problem exists. First of all,
that's the first feedback I need to get. Be like, I would like sharding. I would like it solved.
And I would like to solve it your way. That would be great. Or, you know, just tell me how
you'd like it to be solved.
And if there's an overlap,
and I think there should be overlap, we solve it together.
So that's what I'm doing mostly.
But I'm an engineer,
so I need a safe space from all the social activities.
So I still code.
So that's why the pgvector sharding is coming out,
because I needed to code a little bit.
And I thought the idea would be cool.
We're gonna keep doing that.
I'm gonna keep adding these kinds of features, keep adding tests, benchmarks,
fixing bugs, finding more use cases, like FOR UPDATE, that I forgot.
But yeah, that's the plan.
I have specific question about rebalancing without downtime.
Right?
If one shard is huge, others are smaller, we need to rebalance.
What do you think?
Will this feature first of all be inside open source core offering?
Because we remember in Citus it was not there until the Microsoft decision, as I understand, the Microsoft
decision, or, like, the Citus team's decision, to make everything
open source. And a second question, actually triggered by this question: how do you compare
PgDog to Citus? Yeah, the open source part, I would like it to stay. I don't foresee myself writing
closed-source code, maybe things around, like, deployments and orchestrating the stuff in companies.
My idea, we'll see if it's realistic, is to sell managed deployments of
PgDog, to on-prem deployments, to companies. So probably the actual code
that orchestrates that stuff will probably be proprietary, mostly because
I'm embarrassed how much bash I use. Nobody really wants to know how that sausage is made. Do you follow the Google bash style guide? Yeah, well, somebody taught me to use curly
braces for my bash variables, and ever since then I've been doing that religiously. So I
learn new things every day. But I think the core will stay open source forever. I don't see,
I don't even know, even that data migration part, like, that's a known, there's a known
solution for it. There's no point in building a closed-source
solution that does the same thing, like, that's already been
solved. So might as well just...
It doesn't exist yet, right? This rebalancing feature.
Well, the rebalancing feature is basically, I get it depends on
your sharding key, it depends how you store data on the shard
like Instagram wrote that blog post a long time ago where they pre-partition everything
into smaller shards and that's how Citus does it underneath.
Instead of like, if you say you want three shards, it's not going to build you three
tables, it's going to build 128 tables and move them between the shards.
How?
Well, logical replication became a thing in 10, so Citus decided to use that to move things around.
I think logical replication makes sense up to a point.
You really have to catch the tables at the right time.
Once they get a little bit too big,
logical replication can't catch up anymore.
So that's gonna be an orchestration problem.
I've seen logical kind of-
You can partition it virtually,
like what PeerDB did, like they implemented virtual partitioning for splitting.
If your primary key, or partition key,
partition key in this case, allows to define some ranges,
you can use multiple streams to copy initially
and then even to have CDC, so it's kind of interesting.
That's right.
So yeah.
Well, it's funny that you mentioned that,
because what is it, Postgres 16 allows us to now create
logical replication from the replica, because-
Binary equals true.
Postgres 17.
17?
Or 16.
Maybe 16.
Maybe you are right.
I just looked at the documentation a few days ago,
and I already forget.
Yeah, but it's still binary.
It's good, but it's a whole table.
Maybe we don't need it.
If we rebalance, we don't.
Well, in the case where you split already-existing partitions,
it's fine, but if not, yeah, I'm very interested
to understand design decisions here.
It's going to be interesting.
Because I think for sharding at large scale, this is
one of the key features, to understand this. Yeah, well, so when you go from, like, one big
database to shard it into, like, 12, what we did at Instacart was: we created a replication slot,
we took a snapshot, we restored it 12 different times, deleted the data that's not part of each shard,
synchronized the logical replication, and launched.
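A heavily simplified sketch of that kind of split, using a hypothetical orders table sharded on customer_id; real orchestration needs more care about which changes replicate after the snapshot (row filters or repeated cleanup):

```sql
-- On the original primary: publish changes so the new shards can catch up later
CREATE PUBLICATION shard_split FOR ALL TABLES;

-- Each new shard starts from a snapshot/restore of the whole database, then drops
-- the rows that don't belong to it (here: shard 3 of 12):
DELETE FROM orders WHERE abs(hashtext(customer_id::text)) % 12 <> 3;

-- Catch up on changes made since the snapshot (data already copied, so copy_data = false)
CREATE SUBSCRIPTION shard_3
    CONNECTION 'host=old-primary dbname=app'
    PUBLICATION shard_split
    WITH (copy_data = false);
```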
So deleting data is faster than writing it from-
Copy it as physical replica first, right?
Or copy it logically, like provisioning logical replica,
basically dump restore before binary?
No dump restore.
So RDS, we were on RDS. So EBS. This leads us to
this discussion about upgrades, because recovery_target_lsn doesn't exist. How, like,
let's follow up on this. Like, this is a good bridge to zero-downtime upgrades on RDS.
Yeah, those are fun. Oh yeah. So the second question, yeah, Citus. So the difference is
mostly philosophical in the architecture.
Citus runs inside the database,
which limits it to two things.
First of all, the database host has to allow you
to run Citus, which in a lot of cases is not the case.
Maybe because of AGPL,
maybe because they just don't want you competing
with their internal products.
Again, it's business, it's all fair game.
And then the second one is performance.
When you run something inside Postgres that needs to be massively parallel, you
know, you're limited by the number of processes you can spawn and by number of
connections you can serve.
So PgDog is asynchronous, Tokio, Rust, lightweight, it's not even threaded.
I mean, it's multi-threaded, but it's mostly like asynchronous, like
task-based runtime.
So you can connect, I mean, I'm going to pull out a big number here just for, you know,
for effect, but you could have, like, a million connections
going to PgDog from a single machine,
and that technically should work
because it's using epoll underneath.
But for Postgres, you could probably do like, you know.
You need PgBouncer or something in front of it.
You need PgBouncer
and you need most of those connections to be idle
because concurrency-wise,
Postgres can only do maybe like, you know, two per core. That's the myth, at least.
So again, like Citus, single machine, they have some kind of support for multiple coordinators,
but I think the readme just says, like, please contact us.
In any case, you're in the hands of Microsoft, you need to go to Azure, or you need to do
self-host everything.
You cannot use it on RDS because extensions are required.
In your case, no extensions required.
This is the key.
So you can run it on RDS because the extensions are not needed.
Right?
Exactly.
So you can run it on any Postgres.
I see a lot of guys are trying to develop an extension ecosystem.
I over time became an opponent of the extensions idea, because we have a lot of managed services,
and if you develop an extension, it takes a lot of time to bring that extension to
a managed provider, and, like, it takes years. So if you can do something without extensions, it
might be better in some cases, and sharding maybe is such a case, right?
Mm-hmm, no, absolutely, yeah. Because if you develop an extension, then it needs to get
installed by RDS, like, I don't know if they're gonna, you know, support it,
or all that stuff.
RDS notoriously upgrades like once a year.
We need to wrap up soon, but like a few words
about encryption.
Does PgDog support encryption now already,
because it's super important for,
and it can be a bottleneck, I know.
Odyssey connection pooler was created
because PgBouncer needed two layers of PgBouncer
to handle a lot of... yeah, this is what several companies
did in the past, two layers of PgBouncer,
because of the handshake.
So, TLS handshake.
What's in this area, encryption?
Sure, yeah, it supports TLS.
It's using a library, Tokio.
It's like one of the Rust libraries that implements TLS.
It's completely fine.
You could use TLS.
And my personal favorite,
I finally found a library that implements
SCRAM-SHA-256 authentication.
So now that's finally supported.
Yeah, I saw MD5 is going to be deprecated
in the next few years in Postgres.
It still comes up.
I mean, it's been deprecated for 10 years and people still use it just because it's
really simple to implement.
SCRAM is really hard.
And one more small question about prepared statements.
I know PgCat supported them in transaction pooling mode, right?
Yeah, the implementation wasn't great,
but it does support it, yeah.
PgDog supports them too.
Much better implementation this time.
Okay.
Cool, well, I'm not out of questions.
I'm out of time.
Yes!
But it was absolutely interesting.
Thank you so much.
I'm definitely rooting and going to follow the project.
Best luck to you and your new company.
And maybe Michael has some questions additionally.
I took a microphone for too long this time.
Apologies.
Well, I'll let Lev, is there anything we should have asked
that we didn't?
Oh, no, you guys are, I think you covered it.
Well, really nice to meet you.
Thanks so much for joining us.
And yeah, catch you next week, Nikolay.
Thank you so much.
Bye bye.
Thank you.
Bye bye.