Postgres FM - Multigres

Episode Date: July 11, 2025

Nikolay and Michael are joined by Sugu Sougoumarane to discuss Multigres — a project he's joined Supabase to lead, building an adaptation of Vitess for Postgres!

Here are some links to things they mentioned:

Sugu Sougoumarane https://postgres.fm/people/sugu-sougoumarane
Supabase https://supabase.com
Announcing Multigres https://supabase.com/blog/multigres-vitess-for-postgres
Vitess https://github.com/vitessio/vitess
SPQR https://github.com/pg-sharding/spqr
Citus https://github.com/citusdata/citus
PgDog https://github.com/pgdogdev/pgdog
Myths and Truths about Synchronous Replication in PostgreSQL (talk by Alexander Kukushkin) https://www.youtube.com/watch?v=PFn9qRGzTMc
Consensus algorithms at scale (8-part series by Sugu) https://planetscale.com/blog/consensus-algorithms-at-scale-part-1
A More Flexible Paxos (blog post by Sugu) https://www.sougou.io/a-more-flexible-paxos
libpg_query https://github.com/pganalyze/libpg_query
PL/Proxy https://github.com/plproxy/plproxy
PlanetScale Postgres Benchmarking https://planetscale.com/blog/benchmarking-postgres
MultiXact member exhaustion incidents (blog post by Cosmo Wolfe / Metronome) https://metronome.com/blog/root-cause-analysis-postgresql-multixact-member-exhaustion-incidents-may-2025

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and I'm joined as usual by Nik, founder of Postgres.ai. Hey Nik. Hi Michael, and let's welcome our guest. Yeah, we are joined by a very special guest, Sugu, who is a co-creator of Vitess, co-founded PlanetScale, and is now at Supabase working on an exciting project called Multigres. So welcome Sugu. Thank you. Glad to be here.
Starting point is 00:00:30 All right, it's our pleasure. So it's my job to ask you a couple of the easy questions to start off. So what is Multigres and why are you working on it? Multigres is a Vitess adaptation for Postgres. It's been on my mind for a long time, many years. And we've even had a few false starts with this project. And I guess there is a timing for everything. And finally, the timing has come. So I'm very excited to get started on this finally. Yeah, timing is an interesting one. It feels like for many
Starting point is 00:01:12 years I was looking at PlanetScale and Vitess specifically very jealously, thinking you can promise the world, you can promise this, you know, horizontal scaling with a relational database for OLTP, and it's, you know, all of the things that people want, and we didn't really have a good answer for it in Postgres. But all of a sudden in the last few months it seems almost there are now three or four competing projects all doing it. So why now? Why is it all happening now? I would say I think there is a timing for every new idea to gather momentum. Like, the idea may be good, but the timing may
Starting point is 00:01:55 be bad for one reason or another. But once the timing is good, then it is obvious that it is the right time. And it feels like now is that time. Like I would say, for example, PayPal, there have been previous attempts at moving money online, but it never worked. But PayPal just came at the right time and therefore took off during that time.
Starting point is 00:02:18 There are many such examples where, you know, some ideas came too early and didn't take off. But for Postgres, this feels like it's the right time. That's interesting. Let me reveal something. We met in 2018, I think, at Percona Conference, and in 2019 we met again at a meetup at Nutanix. Nutanix, yes.
Starting point is 00:02:38 Go Meetup, right. Discussions about the test for Postgres started. Couple of times we had a call I tried to involve a couple of guys to and from my understanding it never worked because you could not do it yourself being busy with other thing I guess my sequel related and the guys are looking at the complexity other guys looking at the complexity of related and and didn't proceed and actually in one case it was they decided to build from scratch it was it was a speak your project it's life and there is a it's just starting for post yeah
Starting point is 00:03:14 yeah and the report borrowed in borrowed in borrowed in yeah yeah so another folks also involved and and so for me it was Disappointing that it does work and at some point I saw the message in Vitesse I think that we are not going to do it. So like don't expect I Feel so bad because I was so excited about doing it and then I realized my god, you know, yeah But but now you started in last week plan scale, they started to support Postgres. So what's happening? Like, I don't understand.
Starting point is 00:03:51 Just, just right time, right? Yeah. Enough, enough number of companies using Postgres, which really needed at least one horse will win. So, yeah, it's great. But, but yeah, long long long story to this point. Yeah sometimes when there are multiple projects there's kind of slight differences in philosophy or approach or trade-offs willing to trade one thing
Starting point is 00:04:19 off in relation to another and I saw your plan I really liked that you use you mentioned building incrementally. So Vitesse is a huge project, lots and lots of features, but I've heard you talk in the past about building it quite incrementally while at YouTube. It didn't start off as complex as it is now, obviously, and you did it kind of one feature at a time. And it sounds like that's the plan again with Multigrass.
Starting point is 00:04:46 Is that different to some of the other projects or what do you see as your philosophy and how it might differ slightly to some of the others? I think my philosophy is that I would say I don't want to compromise on what the final project is going to look like. So any path that deviates me from hitting the target on the final project, I do not want to take. But any shortcuts that can get us early results that we can do, but only those that ensure that it doesn't risk what we become eventually, which is a project that should feel native
Starting point is 00:05:29 as if it was for Postgres, by Postgres kind of thing. I want it to be a pure Postgres project. In this case, I must ask, in this case, are you sure you're still good with choice of goal language. Because sometimes we see projects which have really like sub-millisecond latency on average, like 100 microseconds for example on average, and Go definitely will bring a few hundred microseconds of latency overhead. So usually it's not a big deal, but maybe in some cases it's some deal, right?
Starting point is 00:06:08 Are you happy with Go? Yeah, yeah. I mean, because you were one of the first big Go language users building with this, as we know from various interviews and so on. Yeah, so it's still a good choice, because now there is a rust, right? Yes. I would say, by the way, when we started Go compared to where Go is today, it was a
Starting point is 00:06:34 nightmare. Like 10 milliseconds or something round trip is what we were paying for. Those days we had hard disks by the way. So that's another three to five milliseconds just within the database. But things are a lot better now. And at this point, the way I would put it is like the trade-offs are in favor of go. Let's put it that way, mainly because there is a huge amount of existing code that you can just lift and port. And rewriting all of that in Rust is going to just delay us.
Starting point is 00:07:14 And at least in Vitesse, it has proven itself to scale for hundreds of terabytes. And the latencies that people see are, they are not affected by a couple of hundred microseconds. So I think the, and plus there's this inherent acceptance of this network latency for storage and stuff. And if you bring the storage local, then this actually wins out over anything that's there.
Starting point is 00:07:46 This is exactly what I wanted to mention. Yeah, I see PlanSkill right now they came out with Postgres support but no Vitesse and I'm very curious how much it will take for them to bring it like and compete with you. It's not an interesting question but from past week, I see my main impression is like, main stake is on local storage. And this is great because local storage for Postgres, we use it in some places where we struggle with EBS volumes and so on. But it's considered like not standard, not safe, blah, blah, blah. There are companies who use it. I know.
Starting point is 00:08:22 Like I know myself, right? And it's great. Today, for example, Patroni and since Postgres 12, we don't need to restart nodes when we have failover. So if we lose node, forget about node, we just failover, and so on. And with local storage, not a big deal. But now I expect with your plans to bring local storage, it will
Starting point is 00:08:46 become more, like I expect it will be more and more popular and that's great. So you shave off latency there and keep go, which brings 200... Good compromise. Effectively, it's a win because one network hop completely eliminates language level overheads. Sounds good. Yeah. Maybe a goal will improve additionally, but yeah. Good. I wanted to go back, Sugi, you mentioned not wanting to compromise on feeling Postgres native. And that feels to me like a really big statement coming from Vitesse being very MySQL specific saying you want to be Postgres native feels like it adds a lot of work to the project or you know it feels like a lot to me. What
Starting point is 00:09:41 does it mean? Is that about compatibility with the protocol? What does it mean to be Postgres Native? There's two answers. One is why do we still think we can bring Vitesse if it wants if it was built for MySQL and how do you make it Postgres Native? That's because of Vitesse's history. For the longest time, Vitesse was built not to be tied to MySQL. It was built to be a generic SQL 92 compliant database. That was actually our restriction for a very long time until the MySQL community said, you need to support all these MySQL features said, you need us to, you know, you need to support all these MySQL features.
Starting point is 00:10:27 Otherwise we won't move. CTEs, right? Common Table Expressions with, right? It's I guess SQL 99 feature, not 99. Yeah, I think the first parser I built was SQL 92, which is the most popular one that I know of. So that's answer one. Answer two is more with the behavior of Postgres.
Starting point is 00:10:50 What we want is completely mimic the Postgres behavior right from the beginning. Basically, in other words, we plan to actually copy or translate what we can from the Postgres engine itself, where that behavior is very specific to Postgres. And the goal is not compatibility just at the communication layer, but even internally,
Starting point is 00:11:18 possibly even recreating bugs at the risk of recreating bugs. In this case, it's very like... So there is Citus. In the hands of Microsoft, it got everything open sourced. So before, resharding was only in paid version. Now it's in free version open source. So it's fully open source.
Starting point is 00:11:41 And they put Postgres in between. So they don't need to mimic it. They can use it, right? And latency overhead is surprisingly low. We checked it. Well, it's C, but it's whole database in between. But it's sub-millisecond, so it's acceptable as well. I think it's half a millisecond or so in our experiments with simple select one or
Starting point is 00:12:03 something. Select, just like that. experiments with simple select one or something. So don't you think it's like in comparison, it's quite a challenging point when you say I'm going to mimic a lot of stuff but they just use Postgres in between? Yeah, I think there's a difference in architecture between our approach between multigrass versus Citus. I think the main difference is it's a single coordinator for Citus and there is some bottleneck issue with that if you scale to extremely large workloads like that goes into millions of QPS, hundreds of terabytes. So having that single bottleneck, I think,
Starting point is 00:12:46 would be a problem in the future. Whereas... Yeah. As I understand, you can put multiple, like you can have multiple of nodes there, and also you can put a Pidgey bouncer to mitigate the connection issues. So it can scale as well.
Starting point is 00:13:03 Oh, that's good. That's not, that's something I haven't known before. So it's possible then that they may also have something that can viably scales for OLTP. Yeah. So, we are still exploring this and more benchmarks are needed. Actually I'm surprised how few like not comprehensive benchmarks there are there are published for this. Yeah what I know of Citus is probably what you told me when we met.
Starting point is 00:13:33 So it's about five years old. Yeah yeah. Another big difference and this is typically Nikolai's question is on the license front I think you've picked about as open a license as you could possibly pick, which is not the case, I think, for many of the other projects. So that feels to me like a very super based thing to do and also in line with what the test did. And that seems to me like a major advantage in terms of collaborating with others, other providers also adopting this or working with you to make it better.
Starting point is 00:14:10 What's your philosophy on that side of things? My philosophy is my metric for success is adoption. Yeah. And the only way to have a project be adopted is to have a good license, a license that people are confident to use. That has been the case from day one of with us. We actually first launched with BSD license,
Starting point is 00:14:34 which is even more permissive than Apache. And then when we moved the project to CNCF, they said, oh, no, no, we need Apache license, which is why we converted it to that one. So adoption has always been, yeah. Why do they say it? Do you know? Why CNCF wants Apache? I think Apache is a pretty good license. They just made it a policy. I mean, had we asked to keep the BSD license, they would have allowed us, but we didn't feel like it was a problem to move to Apache. Yeah, and I remember you described, when you did it in YouTube, you thought about external users. You need external users for this project to grow.
Starting point is 00:15:21 And I guess at Google, HPPL is not popular at all. We know. Oh, banned. Yeah, banned. And Citus is AGPL. Oh, okay. Yeah. Also, compared to Citus, I think you have chances to be compatible with RDS and other
Starting point is 00:15:41 managed Postgres, to work on top of them, unlike Cytos which requires extension and so on. Correct. This was actually something that we learned very early on. We made a five-line change on MySQL just to make the test work initially, and it was such a nightmare to keep that binary up, to keep that build running. Fork. Yeah, to keep that fork alive. So we decided, no, it's like,
Starting point is 00:16:14 we are going to make this work without a single line of code of change in VITA in MySQL. And that actually is what helped VITAS move forward. Because people would come in with all kinds of configurations and say, you know, make it work for this. So in this case, actually, we'll probably talk about the consensus part. That is one part that we think it is worth making a patch
Starting point is 00:16:41 for Postgres, and we're going to work hard at getting that patch accepted. But I think what we will do is, we will also make multigrass work for unpatched Postgres for those who want it that way, except they will lose all the cool things about what consensus can give you. I'm smiling because we have so many variations
Starting point is 00:17:03 of Postgres these days. I would expect people coming not only so many variations of Postgres these days. I would expect people coming not only with different configurations of Postgres, but also with various flavors like Avar. We have a client who just migrated from regular Postgres to AloyDB. Hello from Google Cloud. And they already shared that on the application side, But imagine they could come to you and say, let's support AloyDB now. It's almost postgres, right?
Starting point is 00:17:33 So these things might happen as well. Don't they claim full compatibility with Postgres? I thought you are over at that. Not full, but most of it. They did interesting stuff in memory like Column storage in memory for tables. It's raw storage on disk, but column store in memory. But it looks like kind of Postgres and we actually even had some questions answered from my team unexpectedly because we don't normally work with AloyDB, but it looks likeers so I could imagine the request let's support a Lloyd DB as well anyway
Starting point is 00:18:10 but my question like I remember featuring the test that we work with RDS and managed my sequel did this features like has this feature survived no actually later we decided that at least we call it actually managed versus unmanaged. Managed meaning that means that Vitesse manages its databases and unmanaged means that the database is managed by somebody else with us just access a proxy to serve queries. At some point of time, we realized that supporting both is diluting our efforts. And that's when we decided, okay, you know,
Starting point is 00:18:53 it's not worth it to try and make this work with every available version that exists out there in the world. And we said, okay, we will do only managed, which means that we will manage it ourselves. And if you want, we'll build the tools to migrate out of wherever you are and they will, we'll make it safe. We'll make it completely transparent.
Starting point is 00:19:15 In other words, you deploy with this on both and then we'll migrate your data out without you having to change your application. But then with this can be more intentional about its features, about more opinionated about how clusters are to be managed. And we were able to commit to that because at that point, Vitesse had become mature enough, people were completely trusting it.
Starting point is 00:19:38 They actually preferred it over previous other managed solutions. So it wasn't a problem at that time. Yeah, it's not a surprise, that's why I asked. Because you talk about local disks, backups, HA, a lot of stuff, right? And basically... Yeah, 5.9 is like what VITAS shoots for. And like most big companies that run VITAS do operate at that level of availability. So what's the plan for multigrass? Going to support, so not only managed version, right? Only, so the, it would be, yes, it would be only managed versions because I believe
Starting point is 00:20:21 that the cluster management section of VITES will port directly over to Postgres, which means that you will once it goes live, it will be coming with batteries included on cluster management, which should hopefully be equal to or better than what is already out there. So I don't see a reason why we should try to make it work with everything that exists today. So it means there is no, like, this is the same like with Cytos. It doesn't work with RDS on one hand,
Starting point is 00:20:56 but on another hand, we, like, I don't see it's only a sharding solution. It's everything, which is great. I mean, it's interesting, super interesting. A lot of problems will be solved and I expect even more managed services will be created. I don't know how it will continue in terms of super base because of the very open license and so on, but also I expect that many people will reconsider their opinion about managed. many people will think, reconsider their opinion about managed. We had an episode about this.
Starting point is 00:21:27 This is my usual opinion about managed services because they hide super user from you. They don't have provided your access. It's hard to troubleshoot problems. In this case, if problems are solved in this and this gives you a new way to run Postgres, so if many problems solved, it's great. If you want to do self to solve problems, for example. Yeah, if you may not know, the initial focus of with test is actually solving these problems first.
Starting point is 00:21:54 Sharding was actually came much later, protecting the database, making sure that they survive abusive queries. Basically, that's what we built with us for initially. And the counterpart of taking away power from the user, like you said is one is, well, we now know exactly how to make sure that the cluster doesn't go down. And two, we countered that by building
Starting point is 00:22:21 really, really good metrics. So when there is an outage, you can very quickly zero in on a query. If a query was responsible, we test will have it on top of, like on the top of the line saying that this is a query that's killing your database. So we build some really, really good metrics
Starting point is 00:22:39 and which should become available in multigrass, probably from day one. and which should become available in multigrass, probably from day one. That's interesting. I didn't see, maybe I missed, I didn't see in the readme you were writing right now in the project. There's a last section called observability. I missed it, I need to revisit.
Starting point is 00:22:58 We're actually building something there as well. I for regular postgres, I'm very curious. I will definitely revisit this. Interesting. Okay. So yeah, great. And yeah. Also quite a big, I feel like this is quite a big difference on the, at least with Citus in terms of the philosophy or at least the origin story. I feel like that started much more with OLAP focused features in terms of distributed queries and parallelized across multiple shards and aggregations and column and loads of things that really benefit OLAP workloads.
Starting point is 00:23:36 Whereas this has come from a philosophy of let's not worry about optimizing for those cross shard queries, this is much more, let's optimize for the cross shard queries this is much more let's optimize for the single shard very very short quick LTP queries and let's make sure we protect it against abusive query you know so it feels like it's coming the architecture it's coming from a very different place of what to optimize for first and historically that was YouTube's problem. Surviving the onslaught of a huge number of QPS and making sure that one single QPS doesn't take the rest of the site down. Yeah, perfect. Makes loads of sense.
Starting point is 00:24:20 So actually before we move on too much from that, where do you see sharding as becoming necessary? Like is it just a case of a total number of QPS or like rights per second type? We've talked about sharding in the past and talked about kind of a max that you can scale up to perhaps in terms of rights, in terms of wall per second I think was the metric we ended up discussing. Are there other reasons or kind of bottlenecks that you see people getting to that sharding then kind of makes sense as it's now time or you should be
Starting point is 00:24:54 considering at this point? Well, there is a physical limiting factor, which is the single, if you max out your single machine, that is your Postgres server, then that's the end of your scale. There is nothing more to do beyond that. And there are a lot of people already hitting those limits from what I hear. And the sad part of it is they probably don't realize it.
Starting point is 00:25:19 As soon as that limit is hit, in order to protect the database, they actually push back on engineering features, indirectly, saying that limit is hit, in order to protect the database, they actually push back on engineering features indirectly saying that, you know, this data, can you make it smaller? Can you somehow lower the QPS? Or could you put it elsewhere? Let's stop showing this number on front page.
Starting point is 00:25:39 And so, yeah. Yeah, yeah. And it affects the entire organization. It's a very small, it's a very subtle change, but the entire organization slows down. Like we experienced that at YouTube when we were at our limits, the default answer from a DBA was always no.
Starting point is 00:25:58 We used to even kid, no? The answer is no, what's your question? And when we started sharding, it took us a while to change our answer, to say that, you know, bring your data, we can scale as much as you want. Believe it or not, we went from 16 shards to 256 in no time.
Starting point is 00:26:22 And the number of features in YouTube exploded during that time because there was just no restriction on how much data you wanted to put. And coming back here, the upper, like reaching the limit of a machine is actually something you should never do. It's very unhealthy for a large number of reasons. Like even if there is a crash, like how long is it going to take to recover? Like the thing that we found out is once you can shard, it actually makes sense to keep your instances way, way small.
Starting point is 00:26:54 So we used to run like 20 to 50 instances of MySQLs per machine. And that was a lot healthier than running big ones for a couple of reasons. One is if you try to run so many threads within a process, that itself is a huge overhead for the machine. And it doesn't do that very efficiently, whereas it does it better if you run it as smaller instances.
Starting point is 00:27:20 I think it's more of a feeling, but I don't know if there is proof or whatever. But in like Go, for example, wouldn't do well. Go, I think beyond a certain memory size or beyond a certain number of Go routines would start to slow down, would not be as efficient as it was before, mainly because the data structures to keep track of those threads and stuff, they are getting, they are growing bigger, right? But more importantly, on an outage, a smaller number of users are affected.
Starting point is 00:27:54 If you have 256 shards and one shard goes down, it is 1-256th of the outage, right? And so the site looks a lot healthier, behaves a lot healthier. There's less panic if a shard goes down. So people are, you know, a lot less stressed managing such instances. Right, I wanted to mention that this discussion was with Lev Kokotov, Pidgey Dog, which is competitor as well, new sharding tool written in Rust. And we discussed that there is a big limitation when Postgres.
Starting point is 00:28:34 So replication, physical replication has limitation because it's single threaded process on standby if we reach like somewhat like 150, 200, 250 megabytes per second depending on core and also number of, not number, structure of tuples and so on. We hit one single CPU 100% one process and it becomes bottleneck and replica standbys they start lagging. It's a big nightmare because you usually buy that, but that's like, at high scale you have multiple replicas and you offload a lot of read-only queries there and then you
Starting point is 00:29:14 don't want, don't know what to do except as you described let's remove this feature and slow down development and this is not not fun at all. So what I'm trying to do here is trying to move us to discussion of replication, not physical but logical. I noticed your plans involve heavily logical replication in Postgres, but we know it's improving every year. So when we started the discussion five, six years ago, it was much worse. Right now it's much better. Many things are solved, improved, but many things still are not solved. For example, schema changes are not replicated. And sequences. There is work in progress, but if it's committed it will be only in Postgres
Starting point is 00:29:58 19, not in 18. So it means like long wait for many people. So what are your plans here? Are you ready to deal with problems like this? Pure Postgres problems, you know? Yeah, yeah. If you ask me, I think the Postgres problems are less than what we faced with MySQL. I wanted to involve physical as well because this great talk by Kukushkin which describes very bad anomalies when data loss happens and so on. Let's talk about this. Yeah, we should talk about both. I think overall the Postgres design is cleaner is what I would say. You can feel that from things.
Starting point is 00:30:45 Like the design somewhat supersedes performance, which I think in my case is a good trade off, especially for sharded solutions, because some of these design decisions affect you only if you are running at, you know, if you are pushing it really, really hard, then these design decisions affect you. But if your instances are small to medium size,
Starting point is 00:31:06 you won't even know and then you benefit from the fact that these designs are good. So I actually like the approaches that Postgres has taken with respect to the wall as well as logical replication. And by the way, I think logical replication theoretically can do better things than what it does now and we should push those limits but yes I think the the issue about schema not being as part of logical replication it feels like that is also a theoretically solvable problem except that people haven't gotten to it I think there are issues about the transactionality of DDLs, which doesn't even exist in MySQL. So at least in Postgres, it's like it exists in most cases. There are only a few cases where it is not.
Starting point is 00:31:53 And for such things like a proxy layer, like multigrass or withers, it's a no problem for them because you should say, Oh, okay. This particular construct is not transactionally safe, well, then we'll even prevent you from doing it transactionally because we don't want you to get the wrong impression. We'll let you do it non-transactionally and we know that it's non-transactional and therefore we can do something about it, right? Those abilities don't exist previously, but eventually if it becomes transactional, then we can actually include it in a transaction. Yeah, just for those who are curious because there is like concept all DDL in Postgres is transactional, here we talk
Starting point is 00:32:34 about things like creating this concurrently because we had discussion offline about this before recording. So yeah, creating this concurrently can be an issue but you obviously have a solution for it. That's great. The way I would say it is we have dealt with much worse at with MySQL. So this is much better than what was there then. Sounds good. Okay, good. And let's talk about physical replication because I saw you are going to use it. So each shard is going to have a standbys. And we've, yeah, so we've, we've Quorum commit, right? So like we don't lose data
Starting point is 00:33:15 because data is on local disks by default, as I understand. Like if a metal in cloud, so we want to be sure data is written to at least two places, for example, or three, right? Configurable, of course. Here this interesting talk by Kokushkin, he presented it recently online conference by Microsoft describing that synchronous replication in Postgres is not what you think. Correct.
Starting point is 00:33:43 What are you going to do about this? Well, I was just chatting with someone and essentially synchronous replication is theoretically impure when it comes to consensus. I think it's provable that if you use synchronous replication then you will hit corner cases that you can't handle. And the most egregious situation is that it can lead to some level of definitely split brain, but in some cases it can even lead to downstream issues. Because it's a leaky abstraction, it's a leaky implementation,
Starting point is 00:34:21 there are situations where you can see a transaction and think that it is committed, and later the system may fail and in the recovery may choose not to propagate that transaction, or may not be able to, and it's going to discard that transaction and move forward. But this is the same as with asynchronous replication, it's the same, we're just losing some data, right? Exactly. It is the same as asynchronous replication. Yes. It's not split-brain.
Starting point is 00:34:51 It's just data loss. It's data loss, correct. It's data loss. But for example, if you are running a logical replication of one of those, then that logical replication may actually propagate it into an external system and now you have corrupted downstream systems that don't match the source.
Starting point is 00:35:14 So those risks exist and at VITAS scale people see this all the time, for example, and they have to build defenses against this and it's very, very painful. It's not impossible, but it's very hard to reason about failures when a system is behaving like this. So that is the problem with synchronous replication. And this is the reason why I feel like it may be worth patching Postgres. Because there is no existing primitive in Postgres on which you can build a clean consensus system. I feel like that primitive should be in Postgres.
Starting point is 00:35:52 I now remember from Kukushkin's talk, there is another case when primary transaction looks like not committed, because we wait a replica, but the replica somehow was like lost connection or something. And then we suddenly, and client thinks it's not committed, We wait a replica, but the replica somehow is like lost connection or something. And client thinks it's not committed because commit was not returned.
Starting point is 00:36:10 But then it suddenly looks committed. It's like not data loss, it's data un-loss somehow. Correct. Like suddenly. And this is not all right as well. And when you think about consensus, I think you are very good describing these things, like concept and distributed systems. It feels like if you have two places to write, definitely there will be corner cases where
Starting point is 00:36:36 something will go off if you don't use two-phase commit, right? Correct. And here we have this. But when you say you're going to bring something with consensus, it immediately triggers my memory how difficult it is and how many attempts it was made to bring pure HA to POSGAS just to have auto failover. All of them failed. All of them.
Starting point is 00:37:01 And they say, let's be outside of Postgres. So here maybe it will be similar complexity to bring these two inside Postgres. Is it possible to build it outside this thing maybe? It is not possible to build it outside. Because if it was, that is what I would have proposed. The reason is because building it outside is like putting Band-Aid over the problem. It will not solve the core problem. The core problem is you've committed data in one place and if that data can be lost and there is a gap
Starting point is 00:37:38 when the data can be read by someone, causes is the root cause of the problem. That is unsolvable. Even if you later Raft may choose to honor that transaction or not, and that becomes ambiguous, but we don't want ambiguity. What if we created something extension to commit, like make it extendable to talk to some external stuff to understand that commit can be finalized or something. I don't know, consensus.
Starting point is 00:38:06 Correct, correct. So essentially if you reason through about this, your answer will become a two-phase system. Yeah. Without a two-phase system. Which spurs me. But as I told you, a two-phase commit in OTP world, in Postgres OTP world considered really, really slow and the rule is let's just avoid it. I see your enthusiasm and I think I couldn't find good
Starting point is 00:38:31 benchmarks. Zero. Published. This is not two-phase commit by the way. This is two-phase synchronization. I understand. It's not in two-phase commit. It's like more communication happens. I understand this. two-phase commit, it's like more communication happens. I understand this. So two-phase synchronization, the network overhead is exactly the same as full sync, because the transaction completes on the first sync. Later it sends an acknowledgement saying that yes, I'm happy you can commit it, but the transaction completes on the first sync, so it will be no worse than full sync. Yeah, compared to the current situation, when primary commit happens, but there is a lock which is being held until... Correct, it is the same cost.
Starting point is 00:39:15 Yeah, we wait until standby. And for users, it looks like when lock is released, it thinks, okay, commit happens. But the problem with this design, if, for example, standby restarts, lock is released, it thinks, okay, commit happens. But the problem with this design, if for example, standby restarts, lock is automatically released and commit is here and it's unexpected. This is data on loss, right? So you are saying we can redesign this, network cost will be the same, but it will be pure.
Starting point is 00:39:40 Yeah, that's great. I like this. I'm just thinking, will it be acceptable? Because bringing Autofill over is not acceptable. There was another attempt last year from someone with great enthusiasm. Let's bring Autofill over inside Postgres. Actually, maybe you know this guy, it was Konstantin Osipov, who built Taranto database system. It's like memory. He was ex MySQL in performance after Zaitsev. Zaitsev was ex MySQL when Osipov was MySQL. So Konstantin came to Postgres saying, let's build this. Great enthusiasm, but it's extremely hard to convince such big thing to be in core.
Starting point is 00:40:23 So if you say it's not big thing, this already. So I can, it's a, I'll probably have to explain it in a bigger blog. But essentially, now that I've studied the problem well enough, the reason why it's hard to implement consensus in Postgres with the wall, is because they are trying to make Raft work with wall and there are limitations about how the Raft, how commits work within Postgres that mismatch with how Raft wants commits to be processed and that mismatch, so far, I have not found a way to work around that. But a
Starting point is 00:41:09 variation of Raft can be made to work. Interesting. The way the, I don't know if you know about my blog series that I wrote when I was at PlanetScale. It's an eight-part blog series about generalized consensus. People think that Raft is the only way to do consensus, but it is one of a thousand ways to do consensus. So that blog series explains the rules you must follow if you have to build a consensus system,
Starting point is 00:41:37 and if you follow those rules, you will get all the properties that are required by a consensus system. So this one that I have, the design that I have in mind, follows those rules. And I am able to prove to myself that it will work, but it's not Raft. It's going to be similar to Raft. I think we can make Raft also work, but that may require changes to the wall, which I don't want to do.
Starting point is 00:42:04 So this system I want to implement without changes to the wall, which I don't want to do. So this system I want to implement without changes to the wall as possibly a plugin. Well, now I understand why you could like another reason you cannot take Patroni not only because it's Python versus Postgres, but also because you need another version of consensus algorithm. Correct. And among those hundreds of thousands, millions of ways to... By the way, Patroni can take this and use it, because he's very close to how FullSync works. Good. Okay. I was just thinking, watching Alexander Kokushkin's talk, he said a couple of things that were interesting.
Starting point is 00:42:44 One is that he was surprised that this hasn't happened upstream. So you definitely have an ally in Kokushkin in terms of trying to get this upstreamed. But also that he thinks every cloud provider has had to patch Postgres to, in order to offer their own high availability products with Postgres, each one has had to patch it and they are having to, or you, you mentioned earlier today, how painful it is to maintain even a small patch on something. I don't think it's every, I think it's Microsoft for sure, knowing where Kukushkin works at, but maybe more, not every.
Starting point is 00:43:24 Yeah. sure, knowing where Kukushkin works at. But maybe more, not everything. Yeah, all I mean is that there are a growing number of committers working for hyperscale and hosting providers, so I suspect you might have more optimism for consensus or at least a few allies in terms of getting something committed upstream. So I personally think there might be growing chance of this happening, even if it hasn't in the past for some reason. Yeah, I feel like also being new to the PostFest community, I am feeling a little shy about proposing this upfront. So what I'm thinking of doing is at least show it working,
Starting point is 00:44:03 show it working at scale, have people gain confidence that this is actually efficient and performant and safe. So I also plan to, I don't know if you've heard of Flexback source, which is actually, in my opinion, a better way to handle durability, because today's cloud environments are a lot more complex and a simple majority based quorum is actually very hard to configure if your needs are different, which actually FlexPaxos does handle. It's actually something I'm a co-inventor of some sort.
Starting point is 00:44:40 And this blog post. I only heard the name that's it. Can you explain a little bit? Oh sure, yeah. So actually let me explain what is the reason why. So FlexPackSource was published a few years ago, about seven years ago or so. And you'll see my name mentioned there, you know, which I feel very proud of. And the block series that I wrote is actually a refinement of flexboxes. And that actually explains better why these things are important. The reason why it's important is because people think of consensus as either a bunch of nodes agreeing on a value, right?
Starting point is 00:45:23 That's what you commonly hear, or you think of like reaching majority, reaching quorum is important. But the true reason for consensus is just durability. Is when you say, when you ask for a commit and the system says, yes, I have it, you don't want the system to lose it. is yes, I have it. You don't want the system to lose it.
Starting point is 00:45:45 So instead of defining quorum and all those things, define the problem as it is and solve it the way it was asked for is, how do you solve the problem of durability in a transactional system? And the simple answer to that is make sure your data is elsewhere. And the simple answer to that is, make sure your data is elsewhere.
Starting point is 00:46:07 Yeah. I love how simple you make it. Yeah, if you make sure your data is elsewhere, if there is a failure, your challenge is to find out where the data is and continue from where it went. And that is all that consensus is about. And then all you have to do is have rules
Starting point is 00:46:24 to make sure that these properties are preserved. And Raft is only just one way to do this. So if you look at this problem, if you approach this problem this way, you could ask for something like, I just want my data to go across availability zones. As long as it's in a different availability zone, I'm happy, right?
Starting point is 00:46:46 Or you can say, I want the data to be across regions, or I want at least two other nodes to have it, right? So that's your durability requirement. But you could say, I want two other nodes to have it, but I want to run seven nodes in the system or 20 nodes. It sounds outrageous but it is actually very practical. In YouTube we had 70 replicas but only one node the data have to be in one other node for it to be durable and we were able to run this at scale. The trade-off is that when you do a failover,
Starting point is 00:47:27 you have a wild goose chase looking for the transaction that went elsewhere, but you find it and then you continue. So that is basically the principle of this consensus system, and that's what I want to bring in multigrass, while making sure that the people that want simpler majority-based core items to also work using those same primitives.
Starting point is 00:47:52 Just quickly, to clarify, when you say the wild goose chase, is that because it was one of 70, but different transactions could have gone to different of the 70? Or it's always the same one, but you have to know which one that is. No, it could be anyone. The way we ran it, the way we ran it, it is one. It could not be at any given point of time. There's only one primary,
Starting point is 00:48:16 which means that there is only one transaction that you have to chase down. Ah, the latest one. The latest one, yes. Yeah, makes sense. Yeah. There was a time when we found that transaction in a different country. So we had to bring it back home and then continue. It was once it happened in whatever the 10 years that we ran. It's interesting that talking about sharding, we need to discuss these things, which are not sharding per se.
Starting point is 00:48:48 So it's about HA inside each shard. It's actually like what I would call healthy database principles, which is, I think, somewhat more important than sharding. Yeah. Yeah. Yeah. It is, it is true that it is to do with it being a distributed system, right? Like, and that is because it's sharded, no?
Starting point is 00:49:19 Uh, I think they are orthogonal. Okay. Yeah. I think sharding, like you can do sharding on anything. You can do sharding on RDS. Somebody asked me, what about neon? I said, you can do sharding on neon too.
Starting point is 00:49:39 It's like you put a proxy in front and then it does a sharding. But the problem with sharding is it is not just a proxy. That's what people think of it when they first think of the problem because they haven't looked ahead. Once you have sharded, you have to evolve. You start with four shards, then you have to go to eight shards. And the evolution is not linear this way. Actually, it's an exponential growth because 4, 8, 16.
Starting point is 00:50:10 But at some point of time, it changes because your sharding scheme itself will not scale. Like if you, for example, are in a multi-tenant workload and you say shard by tenant, at some point of time, a single tenant is going to be so big that they won't fit in an instance. And that we have seen. And at that time, we have to change the sharding scheme. So how do you change the sharding scheme?
Starting point is 00:50:37 Slack had to go through this, where they were a tenant-based sharding scheme and a single tenant just became too big. They couldn't even fit one tenant-based sharding scheme and a single tenant just became too big. They couldn't even fit one tenant in one shard. So they had to change their sharding scheme to be user-based. They actually talk about it in one of their presentations. And Vitesse has the tools to do these changes
Starting point is 00:50:59 without actually incurring any kind of downtime, which again, multigrace will have. I keep talking about Vitesse, but these are all again, multigrace will have. I keep talking about vitess, but these are all things that multigrace will have, which means that you are future proofed when it comes to. And these are extremely difficult problems to solve. Because when you're talking about changing the sharding scheme, you are basically looking
Starting point is 00:51:21 at a full crisscross replication of data. looking at a full crisscross replication of data. And across data centers. Yeah, and also, like I know with this version three, right? It was when you changed, basically created a new planner, right, to deal with arbitrary query and understand how to route it properly and where to execute it. Is it a single shard or it's global or it's different shards and so on?
Starting point is 00:51:55 Are you going to do the same with Postgres? I think yes, right? So that's the part that I'm still on the fence. By the way, the V3 now has become Gen 4. It's actually much better than what it was when I built it. The problem with V3 is that it is still not a full query. It doesn't support the full query set yet. It controls supports like 90% of it, I would say,
Starting point is 00:52:27 but not everything. On the temptation side, there's the Postgres engine that supports everything. So I'm still debating how do we bring the two together. If it was possible to do a simple Git merge, I would do it. But obviously, this one is in C, this was in goal. And the part that I'm trying to figure out
Starting point is 00:52:50 is how much of the sharding bias exists in the current engine in VITES. If we brought the postgres engine as is, without that sharding bias, would this engine work well for a sharded system? So this looks like Citus already, right? If you bring a whole Postgres. There's a library libpg query by Lucas Fittal,
Starting point is 00:53:16 which basically takes the parsers part of Postgres and brings it to, and there is a Go version of it as well. So we don't talk on top of it. Oh, libpgquery, you said? Yeah, yeah, yeah, I will send it to you. So and many, many, many, many systems use it when we need to parse. Yeah, one day I told it to the SPQR guys about this,
Starting point is 00:53:43 and eventually I think they ended up using it. I think Lef uses it, PGDock also uses it, the Rust version. Is it like 100% Postgres compatible? Well it's Postgres source, it's based on Postgres source code. So parsers is fully broad, but it's not whole Postgres. So maybe you should consider this. If thinking about parsing, I mean queries and so on, but I'm very curious, I also noticed you mentioned routing, read-only queries routed to replicas automatically, and this concerns me a lot because many POSGAS developers, I mean, who use it, users, they use PLPG scale functions, all PLPython functions, anything, which are
Starting point is 00:54:34 writing data. And the standard way to call function is SELECT, SELECT function name. So understanding that this function is actually writing data is not trivial. And we know in PGPool, which I hold my life, I just avoid. I touched it a few times, decided not to use it at all, because it tries to do a lot of stuff at once. And always considered like, no, I'm not going to use this tool so Pidgeypool solves it like saying okay like let's build a list of functions which are actually writing or something like this so it's like patch approach you know workaround approach so this is going to be a huge challenge I think if you yeah yeah yeah for
Starting point is 00:55:19 automatic routing it's it's a huge Yeah, I think this is the reason why I think it is important to have the full Postgres functional engine in Multigress, because then these things will work as intended, is my hope. What we will have to do is add our own sharded understanding to these functions and figure, oh, what does it mean to call this function, right? If this function is going to call out to a different shard, then that interpretation has to happen at the higher level. But if that function is going to be accessing something within shards, then push the whole thing down and just let the push the whole select along with the
Starting point is 00:56:03 function down and let the individual Postgres instance do it. Yeah, but how to understand? Function can contain another function and so on. It can be so complex in some cases. It's also funny that there is still, there is actually a Google Cloud, Cloud SQL supports it, like kind of language, it's not language called PL proxy, which is sharding for those who use, who have workload only in functions. This can route to proper shard.
Starting point is 00:56:36 It was created at Skype. It still exists, but not super popular these days. But there is a big requirement to write everything in functions. In your case, if you continue like, I would expect in some cases you would say, okay, don't use functions. But I'm afraid it's not possible. Like I love functions. Actually, SuperBase loves functions because they use Postgres. Postgres like it provokes you to use functions. Oh really? Oh yeah, yeah, yeah. Actually I saw that, yeah. So the, in VITES, I feel like this was a mistake that we made, which is if we felt that anything,
Starting point is 00:57:15 any functionality that you used didn't make sense. Like if I were you, I wouldn't do this, right? Because it's not, it won't scale, it's a bad idea, right? Because it's not, it won't scale. It's a bad idea, you know, it's like, those we didn't support. We didn't want to support. We said, no, we will never do this for you because, you know, we'll not give you a rope
Starting point is 00:57:36 long enough to hang yourself. This is, basically, that was our philosophy. But in the new, in multigrass, we want to move away from that. Which means that if you want to call a function that writes, have at it. Just put a comment that's going to write something. Yeah, if you want a function that calls a function that writes, have at it. If we cannot, like the worst case scenario for us is we don't know how to optimize this. And what we will do is we'll execute the whole thing
Starting point is 00:58:13 on the VT gate side. There's another, I remember there is interesting solution in I think in AWS RDS proxy, proxy, which, as I know it, maybe I'm wrong, when they needed to create a global, it's called, I think, Aurora global database, maybe, or something like this. So there is a secondary cluster living in a different region, and it's purely read-only, but it accepts writes. And when write comes, this proxy routes it to original primary waits until this write is propagated back to replica and responds.
Starting point is 00:58:54 Oh wow! I don't think that feature can be supported. No, it's just some exotic interesting solution I just wanted to share. Maybe, you know, if we, for example, if you originally wrote a write to a replica, then somehow in Postgres you understand, oh, it's actually a write. So maybe 100% is theoretically impossible to support. Okay, it's super exotic. Okay. But I think if people are doing like that, doing things like that, it means that they are trying to solve
Starting point is 00:59:30 a problem that doesn't have a good existing solution. So if we can find a good existing solution, I think they'll be very happy to adopt that instead of whatever they were trying to do. Well, this is just multi-region setup. When I saw not one CTO which wanted it for dealing with Postgres, like I say, we are still single region. We need to be present in multiple regions in case if one AWS region is down. It's also about the chain. So availability and the business characteristics.
Starting point is 01:00:06 So yeah, anyway. Okay. Yeah, it's exotic. But interesting still. Yeah. So you've got a lot of work ahead of you, Sergey. I feel like we barely covered like one of so many topics. Let's touch something else. Maybe it's a very long episode, but it's worth it, I think. It's super interesting. What else? I think the other interesting one would be 2PC and isolation.
Starting point is 01:00:47 Isolation from what? The one issue with the sharded solution is that, again, this is a philosophy for the longest time in VITESS we didn't allow 2PC. You said you sharded in such a way that you do not have distributed transactions. And many people lived with that. And some people actually did not. Let me interrupt you here because this is the best feature I liked about Vitesse. It's this materialized feature when data is brought. Oh yeah, materialized is another topic.
Starting point is 01:01:24 That's actually a better topic than 2PC. Well, yeah, because this is your strength, right? So this is like, I love this idea. Basically, distributed materialized view, which is incrementally updated. Yes, yes, yes. That's great. We need it in Postgres ecosystem.
Starting point is 01:01:43 Just maybe as a separate project, even. We like it in Postgres ecosystem. Correct. Just maybe as a separate project even. We like it everywhere. So yeah, this is how you avoid distributed transactions basically, right? No, this is one way to avoid it. One way. Yeah. There are two use cases where materialized views are super awesome. You know, a table that has multiple foreign keys,
Starting point is 01:02:06 that has foreign keys to two different tables is the classic use case, where the example I gave was a user that's producing music and listeners that are listening to it, which means that the row where I listen to this music has two foreign keys, one to the creator and one to the listener. And where should this role live?
Starting point is 01:02:27 Should this row live with the creator, or should it live with the listener? That's a classic problem, and there is no perfect solution for it. It depends on your traffic pattern. But what if the traffic pattern is one way in one case and another way in another case? There is no perfect solution.
Starting point is 01:02:47 So this is where, in Multigres, what you could do is say, okay, in most cases, this row should live with the creator. Let's assume that, right? So then you say this row lives with the creator and we shard it this way, which means that if you join the creator table with this event row, it'll be all local joins. But if you join the listeners table with this event row, it's a huge cross-shard wild goose chase. So in this case, you can say: materialize this table using a different foreign key, which is the listener's foreign key, into
Starting point is 01:03:26 the same sharded database under a different table name. And now you can do a local join with the listener and this event table. And this materialized view is near real time, basically the time it takes to read the WAL and apply it. And this can go on forever. And this is actually also the secret behind resharding, changing the sharding key. This is essentially a table that is present in real time with two sharding keys.
Starting point is 01:03:57 If you say, oh, at some point in time, this one is more authoritative, all you have to do is swap this out, make one the source and the other the target, and you've changed your sharding key. Actually, changing the sharding key works exactly like this for a table. A distributed denormalization technique, that's what it is.
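To make the routing idea concrete, here is a minimal sketch in Go. The table names (listens, listens_by_listener) and the Route helper are hypothetical, not Vitess or Multigres APIs: the router picks whichever materialized copy is sharded on the join column, so the join can stay shard-local, and otherwise falls back to a cross-shard plan.

```go
package main

import "fmt"

// Copy describes one physical, sharded copy of the same logical table.
// In this sketch the "listens" table exists twice: once sharded by
// creator_id (the source) and once materialized by listener_id.
type Copy struct {
	PhysicalName string // table name inside the sharded keyspace
	ShardKey     string // column the copy is sharded on
}

// Route picks the copy whose shard key matches the join column, so the
// join can be pushed down to each shard as a local join; otherwise the
// proxy would have to run a scatter, cross-shard join.
func Route(copies []Copy, joinColumn string) (Copy, bool) {
	for _, c := range copies {
		if c.ShardKey == joinColumn {
			return c, true
		}
	}
	return Copy{}, false
}

func main() {
	listens := []Copy{
		{PhysicalName: "listens", ShardKey: "creator_id"},              // original sharding
		{PhysicalName: "listens_by_listener", ShardKey: "listener_id"}, // materialized copy
	}

	for _, join := range []string{"creator_id", "listener_id", "track_id"} {
		if c, ok := Route(listens, join); ok {
			fmt.Printf("join on %s -> local join against %s\n", join, c.PhysicalName)
		} else {
			fmt.Printf("join on %s -> cross-shard plan\n", join)
		}
	}
}
```

Swapping the sharding key then amounts to swapping which copy is treated as the source and which as the materialized target, as described above.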
Starting point is 01:04:16 Yeah, exactly. The other use case is when you reshard, you leave behind smaller tables, reference tables, we call them. And they have to live in a different database because they are too small. And even if you shard them, they won't shard well. Like if you have, you know, a billion rows in one table and a thousand rows in a smaller table, you don't want to shard your thousand-row table.
Starting point is 01:04:45 And there's no benefit to sharding that either. So it's better that that table lives in a separate database. But if you want to join between these two, how do you do it, right? The only way is to join at the application level, read one and then read the other. And at high QPS, that's not efficient. So what we can do
Starting point is 01:05:06 is actually materialize this table on all the shards as a reference, and then all joins become local. Yeah, and you definitely need logical replication for all this. So this is where we started, the challenges with logical replication. Yeah. Yeah. Great. You do have the reason why 2PC is still important, because there are trade-offs to this solution, which is that there's a lag. It takes time for the things to go through the WAL and come out the other side. Whereas 2PC is essentially the transaction system itself trying to complete a transaction.
Starting point is 01:05:49 Which means that it will handle cases where there are race conditions, right? If somebody else tries to change that row elsewhere while this row is being changed, 2PC will block that from happening, whereas in the other case you cannot do that. Yeah, if it's just user views of some video, like on YouTube, we can say, okay, there will be some lag, probably some small mistake, it's fine. But if it's financial data, it should be 2PC, but the latency of writes will be high and throughput will be low, right? This is... I actually read the design of it, which is, again, by the way, a very elegant API.
Starting point is 01:06:30 And I assume I can see the implementation behind the API. And I don't think we will see performance problems with 2PC. We need to benchmark it. We will benchmark it, but I would be very surprised. I think there are some isolation issues that we may not have time to go through today, because it's a long topic. But the way 2PC is currently supported in Postgres, I think it will perform really well.
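For context, Postgres already exposes the building blocks for this via PREPARE TRANSACTION / COMMIT PREPARED (which requires max_prepared_transactions > 0). Below is a minimal coordinator-style sketch in Go. The connection strings, the accounts table, and the global transaction id are made up for illustration, and this is not how Multigres implements it; it only shows the two phases a coordinator drives against two shard primaries.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // any Postgres driver would do; lib/pq assumed here
)

// prepareOn runs the local part of a distributed transaction on one shard
// and leaves it in the prepared state under the given global id.
func prepareOn(ctx context.Context, conn *sql.Conn, gid, stmt string) error {
	if _, err := conn.ExecContext(ctx, "BEGIN"); err != nil {
		return err
	}
	if _, err := conn.ExecContext(ctx, stmt); err != nil {
		conn.ExecContext(ctx, "ROLLBACK")
		return err
	}
	// PREPARE TRANSACTION persists the transaction so it survives a crash
	// and can later be committed or rolled back by the coordinator.
	_, err := conn.ExecContext(ctx, fmt.Sprintf("PREPARE TRANSACTION '%s'", gid))
	return err
}

func main() {
	ctx := context.Background()
	gid := "demo-xact-001"

	// Hypothetical shard endpoints; error handling shortened for brevity.
	shard1, err := sql.Open("postgres", "postgres://localhost:5433/shard1?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	shard2, err := sql.Open("postgres", "postgres://localhost:5434/shard2?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	c1, _ := shard1.Conn(ctx)
	c2, _ := shard2.Conn(ctx)

	// Phase 1: prepare on both shards.
	if err := prepareOn(ctx, c1, gid, "UPDATE accounts SET balance = balance - 10 WHERE id = 1"); err != nil {
		log.Fatal(err)
	}
	if err := prepareOn(ctx, c2, gid, "UPDATE accounts SET balance = balance + 10 WHERE id = 2"); err != nil {
		// If the second prepare fails, roll back the one already prepared.
		c1.ExecContext(ctx, fmt.Sprintf("ROLLBACK PREPARED '%s'", gid))
		log.Fatal(err)
	}

	// Phase 2: commit the prepared transactions on both shards.
	for _, c := range []*sql.Conn{c1, c2} {
		if _, err := c.ExecContext(ctx, fmt.Sprintf("COMMIT PREPARED '%s'", gid)); err != nil {
			log.Fatal(err)
		}
	}
	fmt.Println("distributed transaction committed")
}
```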
Starting point is 01:07:00 Isolation issues when we sit in read committed and use 2PC, you mean this, right? Not repeatable read; read committed is the default. Yeah, there will be some trade-offs on read committed, but not the kind that will affect most applications. MVCC will be the bigger challenge. But from what I hear, the most common use case is read committed. Of course, it's the default. Yeah, it's faster. It's the default. Yeah.
Starting point is 01:07:31 So people won't even notice, yeah, I think... They're already in a bad state in some of these cases; it won't be worse. It won't be worse. Yes. Yes. Yeah. As for 2PC, it of course depends a lot on the distance
Starting point is 01:07:45 between nodes, right. Like if they are far apart, we need to talk about where things are: the client is somewhere, the two nodes are somewhere, and if it's different availability zones, it depends, right? So this distance, the network, is a big contributor to latency, because there are four communication
Starting point is 01:08:06 messages that are needed. Correct. Correct. Actually, I have the mathematics for it, but you're probably right, it's about double the number of round trips. Yeah, if we put everything in one AZ, the client and both primaries, we are fine. But in reality they will be in different places, and if it's different regions, it's a nightmare, of course.
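A rough sketch of that math, as a back-of-the-envelope model rather than measured numbers (assuming one prepare round plus one commit round, each bounded by the slowest participating shard):

```latex
% Back-of-the-envelope 2PC latency, assuming one prepare round and one
% commit round, each waiting on the slowest participating shard.
\[
T_{\mathrm{2PC}} \;\approx\; 2 \cdot \max_i \mathrm{RTT}_i
\qquad\text{vs.}\qquad
T_{\mathrm{single}} \;\approx\; \mathrm{RTT}
\]
% Typical (assumed) values: within one AZ, RTT is on the order of 0.5 ms,
% so the extra round trip adds roughly half a millisecond; across regions,
% RTT is 50-100 ms, so it adds 50-100 ms per distributed transaction.
```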
Starting point is 01:08:33 But at least it is. Yeah, the 2PC is not done by the client, by the way. The 2PC would be done by VTGate, which should have the nodes nearby. Ah, should have. In one availability zone? Unless you did some crazy configuration, they should be pretty close to each other. Pretty close means still different availability zones, in the general case, right?
Starting point is 01:08:56 No. The availability zone is only for durability, at the replica level. But 2PC is coordinating between two primaries, which may actually be on the same machine for all you care. It's not... Well, imagine a real practical case. We have shards, and every shard has a primary and a couple of standbys, right? Correct.
Starting point is 01:09:21 So, are you saying that we need to keep the primaries all in the same availability zone? That's usually how things are. Ah, interesting. I didn't know about this. By the way, I wanted to rant a little bit about the PlanetScale benchmarks they published last week. They compared to everyone. They compared to everyone.
Starting point is 01:09:40 I'm sorry, I will take a little bit of time. They compare to everyone and they just publish, like, PlanetScale versus something. And on this very topic, on the charts they have PlanetScale in a single AZ, everything, client and server in the same AZ. The line for the normal case, client in a different AZ, is there too, but the same-AZ line is the active one and the normal line is not active. And for the others, like Neon, Supabase, everyone, it's different AZs. And of course PlanetScale looks really good, because by default they presented numbers for the same availability zone.
Starting point is 01:10:27 And below the chart everything is explained, but who reads it, right? People just see the graphs. And you can select the proper PlanetScale numbers and see that they are similar. But by default the same-AZ numbers are chosen. This is, like, benchmarking, you know, like...
Starting point is 01:10:53 Well, I think if you look at the architecture, even in a fair comparison PlanetScale should come out ahead, given the performance of a local disk. Of course. Should. But this was SELECT 1. Disks don't matter. Disks don't matter. But SELECT 1 is not a benchmark.
Starting point is 01:11:08 Well, it was part of the benchmark. It's just checking the query path, but it fully depends on where the client and server are located. So what's the point of showing better numbers just by putting the client closer, right? I don't like that part of that benchmark. Oh, okay. Yeah. I saw the publications, but I didn't go into the details, because I thought, well, it has to be faster because it's on local disks. Well, for data which is not fully in cache, of course. Yeah. Local disks are amazing versus EBS volumes.
Starting point is 01:11:42 You're right. Yeah. If the data is in cache, then, yeah, the performance of everything would be the same. Yeah, well, I wanted to share this. I was annoyed about it, but I fully support the idea of local disks. It's great. I think we need to use them in more and more systems. I wouldn't be surprised, if you reached out to PlanetScale and wanted to run your own benchmarks, that they may be willing to give you access. Yeah, this just got published, and in general the benchmarks look great. The idea is great. Okay, and actually with local disks the only concern is usually the hard limit, we cannot have more space. But if you have a sharded solution, there is no such thing.
Starting point is 01:12:30 Correct. But speaking about the hard limit, with today's SSDs you can buy a 100-plus-terabyte SSD, a single SSD, and you can probably stack them up one next to the other. But in the cloud there are limitations. In the cloud there are limitations, for instance. Okay, yeah. I saw AWS SSDs over 100 terabytes. In Google Cloud, 72 terabytes is the hard limit for Z3 metal, and I didn't see more. So 72 terabytes is a lot, but sometimes it's already not a lot.
Starting point is 01:13:06 At that limit, your storage is not the limit. You will not be able to run a database of that size on a single machine. Why not? You'll max out your CPU. We have such cases. Well, again, the problem will be replication. If we talk about a single node, we can have 360 cores, or even almost 1,000 cores already, for Xeon Scalable Generation 5 or something. So hundreds of cores. The problem is Postgres design. If physical replication were multi-threaded,
Starting point is 01:13:41 we could scale more. By the way, replication is not the only problem. Backup recovery. If your machine goes down, you're down for hours, recovering something of that size. Yeah, yeah. Not many hours. Someone on my team recently saw 17 or 19 terabytes per hour for recovery with pgBackRest. Wow!
Starting point is 01:14:08 Yes, in AWS. I was like, my jaw dropped. On our podcast, I usually say one terabyte is... Can you repeat that? 17 or 19 terabytes per hour? 17 terabytes per hour. 17 terabytes per hour with local disks. So this is important.
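To put those figures side by side, using the 72-terabyte instance limit and the throughputs mentioned in the conversation (the one-terabyte-per-hour figure is the older rule of thumb referenced here):

```latex
% Time to restore a full 72 TB instance at different throughputs (rounded).
\[
\frac{72\ \text{TB}}{17\ \text{TB/h}} \approx 4.2\ \text{hours}
\qquad\text{vs.}\qquad
\frac{72\ \text{TB}}{1\ \text{TB/h}} = 72\ \text{hours}
\]
```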
Starting point is 01:14:26 With EBS, it's not possible. It's good to know. Yeah, well, with Michael I was always saying one terabyte per hour is what you should achieve, and if it's below that, it's bad. Now I'm thinking one terabyte per hour is already low. Yeah, yeah. So with EBS volumes, we managed to achieve, I think,
Starting point is 01:14:44 seven terabytes per hour to restore with WAL-G. But there is a danger there: you could become a noisy neighbor. So we actually built throttling into our restore just to prevent noisy neighbors. With local disks, you lose the ability to use EBS snapshots, cloud disk snapshots. Correct, correct, yeah. That's what you lose, unfortunately. And they're great, and people enjoy them more and more.
Starting point is 01:15:16 So I agree. And as I remember, for the 17 terabytes it was 128 threads of WAL-G or pgBackRest, I don't remember which. Wow. But with local disks, from S3. I need to update my knowledge; this technology is changing too fast.
Starting point is 01:15:34 Exactly, yeah. Hundreds of cores, terabytes of RAM already, right? Like, yeah. Yeah, yeah. But it does go straight to your point: the smaller they are, the faster you can recover still. And you don't hit some of these limits. These systems were not designed with these types of limits in mind.
Starting point is 01:15:56 Some weird data structure, you know, suddenly the limit of it is only, you know, a hundred items, and you hit that limit and then you're stuck. Like recently Metronome had an issue. Yeah. They had that outage. Multixact. The multixact member thing, which nobody had ever run into before, but they hit that problem.
Starting point is 01:16:22 Yeah. We also saw many problems when you're on the edge. It actually pushes Postgres forward sometimes. But if you want to be on the safe side, I really like it. It's a kind of resilience characteristic: even if something is down, only a small part of your system is down. That's great. That's a mature architecture already. That actually makes it easier to achieve five nines of uptime, because that's the way you calculate it. If only one node is down, you divide by the number of users being affected.
Starting point is 01:16:59 Downtime budget. Yeah. That's good. Cool. I think it's maybe one of the longest episodes we've had. Oh my God! I enjoyed it. I hope we will continue the discussion of the issues with logical, for example, and so on, and maybe things will be improved. Looking forward to testing a POC once you have it. Thank you so much for coming. I am so excited. Yes. Thank you. Yeah. Is there any last thing you wanted to add, or
Starting point is 01:17:30 anything you wanted help from people on? I would say it feels like nothing is happening on the repository except me pushing a few things, changes, but a huge amount of work is happening in the background. Some of the design work, about consensus, is almost ready to go, and there's also hiring going on; there are people coming on board very soon. So you will see this snowball. It's a very tiny snowball right now, but it's going to get very big as momentum builds up. So I'm pretty excited about that. We may still have one or two spots open to add to the team, but it's filling
Starting point is 01:18:13 up fast. So if any of you are very familiar... this is a very high bar to contribute to Multigres. You have to understand consensus, you have to understand query processing. But if there are people who want to contribute, we are still looking for maybe one or two people, also on the orchestration side and the Kubernetes side of things. Do you mind a small joke at the end, just not to finish on a serious note? I do not mind at all, let's hear it. Yeah, so I know what you're doing. You're writing a lot of markdown right now,
Starting point is 01:18:51 and then you will feed it to AI. I wish. My god, I almost hope that day never comes, but... It is so fun working on this project, creating it. Why would I want to give it to an AI to do? You know? Okay. Good. Thank you. Enjoyed it a lot.
Starting point is 01:19:12 Yeah. Yeah. Thank you so much for joining us. It's great to have you as part of the Postgres community now. And I'm excited to see what you get up to. And we are too. Wonderful. Thanks so much.
Starting point is 01:19:24 Thank you. Bye-bye. Bye.
