Postgres FM - To 100TB, and beyond!

Episode Date: June 7, 2024

Michael and Nikolay are joined by three special guests for episode 100 who have all scaled Postgres to significant scale: Arka Ganguli from Notion, Sammy Steele from Figma, and Derk van Veen from Adyen. They cover how their setups have evolved, what their plans are for the future, and get into the weeds of some fun and interesting challenges along the way!

Links to some of the things discussed:
Arka Ganguli from Notion: https://postgres.fm/people/arka-ganguli
Sammy Steele from Figma: https://postgres.fm/people/sammy-steele
Derk van Veen from Adyen: https://postgres.fm/people/derk-van-veen
Thank you to yerrysherry on Reddit for the idea! https://www.reddit.com/r/PostgreSQL/comments/1cn8ajh/what_should_we_do_for_episode_100_of_postgres_fm
Lessons learned from sharding Postgres at Notion (October 2021): https://www.notion.so/blog/sharding-postgres-at-notion
Adding Postgres capacity (again) with zero downtime (July 2023): https://www.notion.so/blog/the-great-re-shard
How Figma’s databases team lived to tell the scale (March 2024): https://www.figma.com/blog/how-figmas-databases-team-lived-to-tell-the-scale
Updating a 50 terabyte PostgreSQL database (March 2018): https://www.adyen.com/knowledge-hub/updating-a-50-terabyte-postgresql-database
Part 1: Introduction to Table Partitioning (July 2023): https://www.adyen.com/knowledge-hub/introduction-to-table-partioning
Part 2: Partitioning at Adyen (October 2023): https://www.adyen.com/knowledge-hub/partitioning-at-adyen
Part 3: Maintenance Under Pressure (January 2024): https://www.adyen.com/knowledge-hub/maintenance-under-pressure
Fighting PostgreSQL write amplification with HOT updates (May 2022): https://www.adyen.com/knowledge-hub/postgresql-hot-updates
Tracking HOT updates and tuning FillFactor (October 2022): https://www.adyen.com/knowledge-hub/postgresql-hot-updates-part2
Partitioning your Postgres tables for 20X better performance (upcoming talk by Derk at POSETTE): https://www.citusdata.com/posette/speakers/derk-van-veen
Figma, including Sammy’s team, is hiring: https://www.figma.com/careers
Notion’s engineering team is hiring: https://www.notion.so/careers
Adyen’s engineering team is hiring: https://careers.adyen.com
~~~
What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!
~~~
Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai
With special thanks to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and I'm joined as usual by Nikolay, founder of Postgres.ai. Hey Nikolay. Hi Michael. And this week we have not one, not two, but three special guests with us to celebrate episode number 100. We asked you, our listeners, for ideas for this episode and we got some great suggestions, including the topic of scaling to 100 terabytes for episode 100, so we thought that was... And beyond. Oh, and beyond, yes.
Starting point is 00:00:31 Always beyond, yeah. No downtime allowed. We're delighted to welcome three people who have successfully scaled Postgres exceptionally well, and those are Arka Ganguli from Notion. Hey Arka. Hey, good to be here. Really excited to talk about Postgres. And hey, Sammy. Sammy Steele from Figma. Thanks so much for having me on. And finally, Derk van Veen from Adyen. Yeah, thank you very much for the invitation. I'm honoured. We are honoured to have all three of you. So yeah, thank you all for joining. So we're going to try a kind of panel roundtable style discussion. I'll play good cop trying to keep us on track, but you are all welcome to chime in with questions for each other.
Starting point is 00:01:12 And I know Nikolai will as well. So let's start on the simpler side of things. And starting with you, Dirk, would you mind letting us know what kind of things you're using Postgres for at Adyen there? Well, that's an easy question. We keep our data within Postgres database. The thing is we are processing financial transactions. And the thing about financial transactions is you'd better not lose them. Because if you have a social media kind of platform, right, and you refresh your page, and you're missing a picture and refresh again, And the picture's there. Nobody really cares.
Starting point is 00:01:46 But if you refresh your bank app and your money is gone, then people, for some reason, they kind of freak out. So, yeah, we try to not lose a single bit of data within our Postgres databases. So you don't use data type money as well, right? No, I couldn't. Okay, nice. Thanks so much. Sammy, how about at Fma hi everyone i'm sammy and at
Starting point is 00:02:08 figma our platform is an online web-based sass software that allows designers to collaborate together and you know kind of the magic secret sauce is this real-time platform where you can comment on a design and it immediately shows up for other people. And so all of that core data is powered by Postgres. And it has to be really reliable, really fast and low latency so it can feel real time. And we've grown 100x in the last four years. So that's where we're getting our 100 for this episode from. Nice. Even more hundreds.
Starting point is 00:02:40 We like it. And Arca, how about Nation? Yeah. So unsurprisingly, we also use Postgres to store our data and make sure that it continues being there when our customers expect it to be. For those of you that don't know, Notion is kind of a collaborative workspace. You can use it for documents, wikis, product management, and it has a lot of AI built in recently. So really, it's like the place where a lot of customers store their entire knowledge base and run their company or their personal use cases as well. I just wanted to mention that I'm a user of all three companies' products, because for example, Notion, I'm a big fan of lists and so on. So it's a great thing for that, like for such kind of people I am. And Figma, it's like, I actually remember I created
Starting point is 00:03:27 a lot of interfaces in Figma myself. It's super cool too. And especially this real time thing as well. And somehow I noticed Figma replaced all things I used in the past. And I just told Dirk before we started recording that every time I go with my daughter to Legoland, I see a DN on payment machine when I pay for something. And I think, oh, Postgres processing is there. So it's cool. Thank you for coming once again. I love how humble you all are, assuming people might not know who your companies are as well. It's nice. Let's flip the order and let's look at how you set things up Postgres-wise. So how things were in the past, how things are set up now. I realize this might be quite an in-depth answer and any future plans perhaps as well. So
Starting point is 00:04:10 maybe at the highest level you can describe that would be awesome. Arca, we should restart on your side. Sure. So I joined Notion about a little bit over three years ago. And when I got there, we had actually just finished like our first sharding effort. But essentially at Notion, we run a pretty simple deployment. We use RDS, which is Amazon's managed database offering. We run Postgres on that. And we run pgBouncer in front of our RDS cluster for connection pooling. And that's pretty much it. We do have sharding, like I mentioned, which basically is just like splitting up the data across more databases. And the logic for how to talk to the database is actually controlled by the application. So depending on some ID, we know what database
Starting point is 00:04:57 to send your query to. And yeah, that's kind of how it is even today. We have a few more databases than when I started, but the overall high level architecture is pretty much the same. Nice. I've read both of your sharding and resharding blog posts. I'll link both up in the show notes, but I'm guessing that will continue to work. Any reason it won't continue to work for the foreseeable? Yeah, I think our plans are not really to move away from Postgres. It's worked really well for us so far. And we've got a lot of people at the company now that have built up a lot of knowledge around how to operate this thing at scale. So definitely something we want to stick with moving forward. Probably we do have to reshard in the future at some point. and moving to more of like a cellular architecture where we can kind of deploy maybe like different
Starting point is 00:05:45 versions of Notion to scale our capacity with like their own databases and caches and things like that. So we don't have to continuously spend our time resharding over and over again, because that's a very time intensive and operationally intensive kind of thing for engineers to take on. So we probably do have to resard in the near future, but in the longer term, we hope our more like cellular architecture will be able to scale much further. Super interesting. Sammy, let's go to you. I know you're sharding as well, but slightly differently, I believe. Yeah, well, we definitely have a lot in common with what Arka just described. And so your blog posts were a great inspiration for us as we
Starting point is 00:06:25 were in the early days of horizontal sharding. So similar high-level stacks, and that we're also on RDS Postgres, Amazon's managed system, and we also use PGBouncer for connection pooling, at least for now. And where we start diverging from Notion is that I think our data model is somewhat more complex and less naturally a good fit for sharding. My understanding is Notion is that I think our data model is somewhat more complex and less naturally a good fit for sharding. My understanding is Notion had shard IDs based on the orgs or the workspaces, and that worked pretty well. But at Figma, we have a lot of data not associated with an org, and data moves quite frequently between orgs, which makes that kind of sharding model quite hard. And so we actually ended up going
Starting point is 00:07:05 with a more tactical approach for each table, picking a set of a few shard keys that would be more optimal. And we ended up building a proxy layer so that instead of the application having to be aware of shards, we actually have this DB proxy service we built, which is a Golang service. And it's able to do all the routing and handle all the horizontal sharding, scatter gathers where a query hits many shards, and then you aggregate results back. And so that's kind of been a superpower for us to have a lot more control in that layer. And long-term, we have sharded our simplest, highest write rate tables, but we still have a ways to go until everything is sharded. And we have to build out a lot more infrastructure to support that. So unlike Notion, we actually expect reshard
Starting point is 00:07:49 operations to be quite common. And so our goal is like one click failover, where you can just hit a button and then have a reshard operation happen and transparently in the background. Sounds like logical replication involved, but maybe we'll talk about later. I'm so interested in this proxy layer. Did you build it yourself in-house? We did. Yeah. So it's a Golang service. So it has a couple of components. It has a query engine, which has a parsing layer that takes in raw SQL and transforms it to an AST. And that part, we did borrow Cockroach's Postgres parser, which has mostly worked well for us with a few modifications but then all of the logic for we have a topology layer that we've built that knows where all the databases
Starting point is 00:08:30 are and what shard keys mapped what tables and so the query engine really does all that evaluation and query rewriting yeah that sounds pretty mind-blowing to me it's really cool yeah there are a couple of projects like pgcat and spqr which do that but maybe when you started they are very early stage yes actually to clarify so we wish we had named this something besides db proxy because pgcat would not be a replacement for what we have because we actually might use pgcat and replace a pg balancer because we are hitting some pg balancer scaling problems i'd be curious if you guys are also running into challenges but pd proxy is really it's a stateless service that talks to every single database and so a request gets routed to it and it has a connection pooling
Starting point is 00:09:14 layer which talks to a pg bouncer or pg cat like layer that is specific to a single host so it's really playing more of a high level orchestrator role so cool and i love that we've already got two different approaches to sharding, kind of one that's transparent to the application and the developers and one that's not. And I think we're about to get a third answer from Dirk's side. How are things set up at Adyen and past, present, future? Let's start with the past. I think we had a pretty, it was by far our most well-read blog article, and that's how to upgrade a 50-terabyte database. And when
Starting point is 00:09:52 looking back on it, it's quite a while ago. So I think two years ago, we had an internal event, and it was like, no, it's not how you upgrade a 50-terabyte database. It's how you upgrade a terabyte database with multiple zeros in the end. At least two, I guess, right? Yeah, we have two zeros in the end. And not just upgrade, but zero downtime upgrade. Or like, I just want to clarify,
Starting point is 00:10:14 because regular upgrade is not a super big problem if you can afford a couple of hours downtime. Yeah, we tried to think about the ordinary upgrade, but the rough calculation, it will take us like four or five weeks downtime, which didn't cut the bill. We've got links. No, that one is much faster, fortunately.
Starting point is 00:10:33 I think we limited to 15 minutes of downtime and we already had a big fight for this 15 minutes, but it has to happen. And it's not even a database downtime, but it's more like all the infrastructure around it to get it fixed, which requires downtime. If you look purely at Postgres, I think it was minutes, but it's not just a database you run.
Starting point is 00:10:53 I know this problem very well. So, yeah. Everything around adds more and more minutes and then sometimes hours, right? Yeah. I'm curious, what's the biggest individual table or do you have partitioning under the hood there to have so many terabytes on one box?
Starting point is 00:11:07 What is the biggest table we have? I don't know, but the biggest partition is over 30 terabytes, which is a single partition. Wow. And you don't run into vacuuming issues with that? I think we are the most skilled company when it comes to vacuum and how to tweak
Starting point is 00:11:23 vacuum. You build your indexes, which blocks xmin horizon right yeah we have seen many vacuum issues when i started to work with postgres and i was installing this database and transaction wraparound was some kind of weird theoretical thing and then i joined dot yen and then something you do every week at least so i yesterday asked my Twitter audience, what would you ask to people who manage many, many, many terabytes? And one of the questions was, how are you? I guess this is the case when you have problems like 30 terabytes one table. How long is the index creation time or rebuild time it's like days or
Starting point is 00:12:07 yeah days days during which xmin horizon is blocked right so we accumulate bloat in whole database for all tables right it's a super annoying problem but most of these tables are partitioned so first we create an index per partition. And then finally, in the end, we create the index on the parent. Otherwise, we don't make it before we hit wraparound. Oh, wraparound. It's also a problem. But I'm talking about also vacuuming and bloat problems. So multiple problems here.
Starting point is 00:12:35 Well, that's a super interesting experience. It keeps us busy. But it's also an honor to have these problems. But I think all three of us got these big databases because the company is very successful. Right. And I think at the end of the day, having a big database is actually a design problem. You forgot to design in time for the size you need to accommodate. See, I actually disagree.
Starting point is 00:12:59 I think it's an expected problem. If you design a company too early to scale, then you're not going to have a company probably if you're trying to solve those problems when you're five or 10 people. I think one of the cool things about Postgres is all of our companies did get very far on one host. Figma existed for seven or eight years and was pretty successful
Starting point is 00:13:18 before we had to actually start scaling out. That's definitely true. I agree with that. And my team helped a couple of companies who went to IPO being like on a single Postgres cluster, and they got evaluation, thousands of billions of dollars having one cluster. And this was impressive. Of course, later, it should be changed. But yeah, so I agree with one cluster you can scale a lot. But I would not like to be in your shoes dealing with like multi-days index rebuild and also transaction ID wrap around problems.
Starting point is 00:13:54 So how do you escape from these problems today? Like 30 terabytes, one partition. It's quite difficult, right? So is it to be partitioned into smaller partitions? Yeah, I'm thinking about how to answer this question. I think before I joined Atyen, I tried to manage my database in a proper way. You do it by the book and by the rules and everything is fine. And Atyen is pushing me to find which boundaries can be bended or violated without too much risk. I think that's the way to put it.
Starting point is 00:14:31 We still are on the safe side, but with more knowledge, you're better able to decide when do I cross a boundary and when is it really dangerous and when is it safe to do, but not strictly advised. Yeah, I think that's definitely one of the benefits of staying on a system like Postgres that you end up understanding really well is all these limits are extremely specific to the workloads and the tables and even what works for one table at Figma doesn't work for another. And so there is a lot of just accumulated organizational knowledge that makes it easier to run these systems at scale and understand actually at what point do things get scary. Yeah, I agree. If, for example, index creation takes one day, but our transaction, real transaction ID doesn't grow a lot, it's not a big problem. But if it grows very fast, then we have two kinds
Starting point is 00:15:17 of problems. So transaction ID are up around and below it accumulated. I agree. So it's very specific to particular workloads. want to go go back to dirt quickly because you've done such a great series of blog posts on partitioning i'll share those i reread them today and part three promises a part four so i'm i'm hopeful of getting another installment at some point but yeah is it would you say partitioning is the big like the way you've largely scaled there or are there other strategies like are you moving data out like how are you dealing with it and how are you planning to in the future yeah i think we started with horizontal sharding and then within every shard tables are partitioned but financial
Starting point is 00:16:03 transactions you need to be able to refund them for a very, very long time. So it takes quite a long time before you can actually archive data. So we have a lot of old data, which is still around, but yeah, these partitions kind of shift more back in the row and vacuum is much easier if you don't change a lot on a partition. But at the moment, we are actually reaching the limits of our sharding solution again. So we basically start redesigning the entire system again.
Starting point is 00:16:32 You know, that's really interesting because that notion, like it's interesting that you chose partitioning Postgres as like your thing that like got you really far because that notion, we actually ran into, I don't know, like really weird issues
Starting point is 00:16:44 with dropping indexes on partition tables. really far because at Notion, we actually ran into, I don't know, like really weird issues with dropping indexes on partition tables. But that's because I think we were running on like, like this database was running on like Postgres 12. And I think a lot of these things may have been fixed in the in future releases. But we actually went the opposite way where we're like, we don't want Postgres partitions, let us manage our own. Because I don't know, there's like some weird behavior we ran into when creating and dropping indexes especially because some things you can't do concurrently on partition tables yeah dropping indexes is is a nightmare we i always tell my developers like you can create
Starting point is 00:17:16 indexes on on a partition table but if you want to drop them i come after you yeah fun problem actually a duck you said something earlier about knowing which rules you can kind of bend or break one of the ones you mentioned in those posts was around adding check constraints and so adding them in a not valid state but they're not running validate afterwards i think you mentioned just updating the system catalog and that's such a cool trick but like how yeah i guess you have to become comfortable we used this trick like 15 years ago when validate was not an option so like something like that we we used it but it wasn't like undocumented thing don't do it like if you're if you're not owning a database. For example, if you're a consultant,
Starting point is 00:18:07 this is not something you should recommend because who knows what will happen next. If you own a database, if you work inside a company, it's probably okay. It's not official. By the way, I wanted to emphasize problems we just started touching most people who listen to us don't have, right? Because it's like extreme problems.
Starting point is 00:18:26 I'm not completely sure. I think partitioning becomes useful for a lot of people way before you hit the limit where you have to. So if you're listening to this podcast and you think like, yeah, partitioning is too far away, just think again. Because if you do it now, when you have all the time in the world to figure it out, find a good strategy, then you don't end up with a huge partition,
Starting point is 00:18:47 a small partition, weirdly partitioning things. So I would say just starting time. I'm on the same page with you here because you mentioned this rule, like 100 gigabytes as threshold when you need to start partitioning. Same rule we also apply everywhere. If you exceed 100 gigabytes, it's time to partition, I agree.
Starting point is 00:19:07 Interesting. We have a lot of tables larger than that that aren't partitioned, but probably about a terabyte is what we aim for max for our largest shards, although we have one 16-terabyte table today. But another interesting topic that I think is more relevant to people are upgrades. We have built this no downtime tooling that we've mostly used for reshard and horizontal sharding operations. But more recently, we've been using it for no downtime major version upgrades with the ability to roll
Starting point is 00:19:35 back if you run into problems. And so I think that's something that could be interesting for anyone who's running Postgres. Upgrades are always a hard thing to do. Yeah. By the way, we forgot to mention that ADN is on self-managed situation. It's not on RDS, right? Because it's very different. Yeah. I'm very curious how you did this for RDS case, not using Bluegreen deployments, right? Just fully...
Starting point is 00:20:02 Yeah, at least today. Well, Bluegreen only very recently became available for postgres last year it was only really my sequel and so we'll probably explore it going forward one blocker that actually we're talking to the rds team today is our rollback or our there is no way to roll back and that's the wrong blue green deployment's idea it should be symmetric and when you switch over, reverse replication should be installed immediately. And without this, you don't have rollback plan, right?
Starting point is 00:20:32 Exactly. It means data loss. Yeah, so that's what our failover operation does. I think it's also nice to have more control when you do it locally because we're able to, for instance, do a no-downtime replicas failover, and because most of our P0 do a no downtime replicas failover. And because most of our P0 critical workflows read from replicas, that means that most failovers don't actually incur any downtime. Whereas if you're using a managed thing like Bluegreen, you lose that kind
Starting point is 00:20:54 of control. Yeah. Well, I'm super curious about details here, but maybe it should be a separate discussion. And first of all, I wanted to thank you, all of you. We invited you because you shared very great posts. So this knowledge sharing is super important. So I would be happy to see more and more coming. And maybe if we talk more and exchange ideas, we could collect better materials. For example, how to do zero downtime upgrades, because I know a very good recipe for self-managed Postgres,
Starting point is 00:21:24 but I don't know for rds because they don't allow you to control recovery target lsn so i'm very curious to talk about details but maybe the separate discussion because i know michael has different plan not to talk half an hour about upgrades well maybe we could actually do a quick like lightning round of which major version each of you are currently running if you know and you're happy to share is it a single version only or also a good question sammy you happy to share that one sure i can definitely talk about that well we are newly as of two months ago fully on pg-13. So we had a couple of PG-11 very legacy boxes. There were some of our earliest, largest databases.
Starting point is 00:22:11 And that actually was the main motivation for building out this upgrade workflow was to get those into PG-13. And we would love to use some of the features that are available in like PG-15, PG-16. And so we're probably going to look into fleet wide upgrades sometime next year to get to a newer version. But this is where this one click upgrades or failovers really matters because today it's quite toilsome and it would be pretty painful to upgrade the whole fleet. Yeah, absolutely. How about you, Arka? Yeah, we're actually, we're like 90% on Postgres 15 And it is my life's mission to get us that final 10% over. Yeah, like most of our databases are on actually Postgres 15. We finished that upgrade last year.
Starting point is 00:22:52 And yeah, we really wanted some of the features that I think got introduced maybe after Postgres 14, especially around being able to filter the replication stream, because that's something we want to use in the future. Probably if we want to reshard, that'll be a very useful thing for us. And just wanted to touch on Sammy's point a little bit. We actually use the exact same kind of, I think, failover that you guys use as well at Figma based on PGBouncer and with the reversibility as well. And that's something that has been really, really surprisingly amazing. Yeah. You mean PgBouncer post-resume. I guess this is why you don't use RDS proxy being on RDS, right? Because it doesn't have post-resume.
Starting point is 00:23:31 Yeah, that's great. I also was surprised last year that it can handle really heavy workloads. Yeah. Just a few seconds spike of latency, you can upgrade and switch over to different primary. Yeah, it was one of those things where when we were writing out the tech spec for it, I was like, is this thing going to actually work? The problem with this, by the way, there are no good materials about this.
Starting point is 00:23:53 Proving, like, nobody say, like, it's a public secret. Nobody talks about it somehow, but it works really well. So we need more materials proving that POSresume works well in pgBouncer. Yeah, it's been great for us. We have some pgBouncer scaling problems, but everything on the failovers, pgPause, pgResume side has been flawless. That's so good to hear. And finally, Dec, on the major version side of things. We are fully on 13, and we're having discussions about the next version we want to upgrade to because every version has nice things i want certain partitioning functionality
Starting point is 00:24:31 a colleague wants more logical replication functionality so of course 17 would be the ideal but then do we really want to run 17 already with the biggest clusters we have so yeah to be continued and i think also the fact that i mean notion being on 15 was somewhat of a surprise to me i think because of these major version upgrade challenges i see so many companies even at the cutting edge of what postgres can do lagging several years behind because upgrades are so difficult so i know i know we won't get a chance to discuss all the reasons why, but I think it's proof because we're all lagging. Yeah. Well, I actually think it's not for us so much fear of newer versions. I actually worked in MySQL before Postgres and MySQL major version upgrades were truly terrifying. These were like
Starting point is 00:25:19 two or three year long efforts where you found a lot of major incompatibilities. But one thing I love about Postgres is it is mostly backwards compatible across major versions. And so it's not so much that we're scared of the newer ones as versus we can't take down time to do these upgrades. And so there it's much more than manual toil. Yeah, exactly. I think that is I think that's the case for a lot of people. So last one I was going to ask each of you is if you had it funny or scary and i don't know which one i want most story that you can share with us related to your scaling any of you got one prepared already um feel free to jump in or nicolai white wet logs log manager multi-exact transactions multi-exact ids and so on like something like that maybe yeah i think we
Starting point is 00:26:04 definitely got bitten by the log manager, like most people who start doing partitioning. It's like we started partitioning and we did a good job and I built this framework around partitioning. By the way, Michael, the fourth blog post is,
Starting point is 00:26:19 we are writing it at the moment. Yes. So it will be there. It's a nice ID. Now I lost the rest of my story. Oh, yeah, partitioning. So it sounds like we have prepared statements. No, we just have the prepared statements.
Starting point is 00:26:33 But you know everything about prepared statements, right? Where you try it five times, and then the optimizer decides, do we go with the generic plan or the custom plan? But we also created a lot and a lot of partitions and a lot of partition tables. So we definitely ran out of logs and our CPU, basically every time we created a new partition for a table, the CPU usually jumped 10%. But jumping 10% was fine because we had a lot of
Starting point is 00:26:57 CPU to spare. So by the time it alerted, it was pretty high. And then we were looking back and we see these jumps of 10% and 10% and 10%. And then we were looking like the next jump of 10%, that's above 100. And then we are not in a good position. So then we had this small window with all kinds of things prepared, like don't create any partitions anymore on this cluster. But can we break open partition boundaries so we can basically extend the final partition to hold more data? And then we were researching this problem. Like, why is the CPU usually jumping like this? And in the end, we started to force the optimizer to pick generic plans because custom plans were faster in execution, but definitely not
Starting point is 00:27:41 a prepared time. So we won like a few milliseconds in execution time, but we were paying like 100 milliseconds in prepared time. And that was really heating up the CPUs. Because during planning, Postgres logs all indexes and tables, all partitions, all its indexes, if partition brewing is not working. And so it's like, it's terrible. Yeah.
Starting point is 00:28:02 So you're running out of your fast path logs because you need to log like 20 partitions and 20 index on these partitions for a single table. And then you join with a partition table and with partition table, all these logs go to the log manager, which is burning CPU cycles. You're in a good club of a bunch of companies
Starting point is 00:28:18 who experienced this over a couple of last years. Yeah, I was going to say, we hit this for vacuuming on our bigger tables. When you get above like four terabytes you start seeing that at the end of the cycle when you run analyze and the cache gets cleared out and then you suddenly have this custom plans that are 5x more expensive and that's when cpu spikes to 100 for a few seconds i have a question to all of you when you have these problems do you like discuss this with people who can influence this,
Starting point is 00:28:46 like in hackers somehow? Do you raise these problems in mailing lists or somewhere else where hackers live? We work with AWS support quite a bit. Well, some hackers there are for sure, but not all of them, right? Yeah. I mean, we're good friends with the
Starting point is 00:29:02 AWS RDS team, and they have actually a team that contributes to Postgres, so we try to leverage them to get some of these patches in. But it's honestly a lot of us reading the source code ourselves as well and trying to figure out workarounds. created originally by Andrei Borodin, which will be in Postgres 17. They are already adapted by AWS RDS Aurora for a couple of years, almost. I was super surprised. Actually, Andrei was also surprised. But you're not going to use Aurora, right? Instead of regular RDS. Maybe.
Starting point is 00:29:38 I think, I mean, for Notion, I think we are actually starting to look into if Aurora makes sense, mostly because the AWS team promises a lot better replication for global replicas compared to the RDS built-in replicas. So it's something we're considering, but we haven't actually used before. So less curious if any of you have any experience with that or like scaling read replicas in
Starting point is 00:30:04 general, because that's also something we haven't leveraged too much at Notion yet. Well, less scaling read replicas, but actually my scary story was going to be around moving things more to replicas because at the same time that we were growing 100x, we were dealing with a lot more pressure on the reliability of our system and trying to move towards four nines globally as a company. And so our application, because it's a web browser and we have a kind of single threaded Ruby monolith that serves a lot of traffic, what is very latency sensitive that if your database slows down for a minute, then the whole application can back up and your whole website can be down. And so that was one of our biggest reliability risks and most
Starting point is 00:30:45 persistent pain points is our RDS does have some big latency spikes on EBS that most customers don't notice, but we were really hitting. And so our two big solutions there were to move, basically remove writes from our most business-critical routes wherever possible, and put all reads or 99% of reads onto replicas for these P0 user workflows. And then we build out replica hedging. So every request hits two replicas and returns the first results that are successful there, which allows us to tolerate any one replica being down. And that's been hugely useful. We also got onto IO2 for EBS,
Starting point is 00:31:29 which has been a game changer for EBS reliability. Oh, nice. How many replicas do you guys normally have off one database? Today, it's quite, there are only two large replicas. In the future, we want to move towards more smaller replicas. We have multiple, let's put it that way. Reading from a replica is much harder when it is about financial data, because you want to have the correct data, and it might just been updated. So we try to move a lot of stuff to the replicas, but it's kind of hard. Yeah, that would definitely be a problem for financial data.
Starting point is 00:32:02 Are we talking about asynchronous replicas only or maybe some semi-synchronous quorum commit approach oh we have both i mean also cache invalidation i think that's one of the bigger concerns for us at notion because we're super super heavy users of memcache and like reasoning about how that works for replicas is another thing that we're we don't really want to do but we have to do i don't really have like a super scary story but actually it's something i've been working on recently which we ran into some surprising behavior so maybe it'd be just fun to talk about so recently at notion we've been trying to upgrade that last 10 and of, those databases are
Starting point is 00:32:45 just the hardest because they're just so big. So we're trying to upgrade this partition table that is around 20 terabytes with like 16 partitions on each database. So, you know, each partition is about like a terabyte-ish. And we were having trouble getting this to work with just regular logical replication. So we actually started exploring DMS, which is like an AWS service that lets you migrate data between two separate databases. Side note, it's actually super cool. Like you can go from like a completely random system to like another new system. It doesn't have to be like Postgres to Postgres only. I don't know how good it is at huge scale.
Starting point is 00:33:25 It just requires a department in your company to manage. Yeah, so originally I was like, oh yeah, I can do this. And then it's been like four months now and I'm still like, oh my God, this is a lot. But anyway, one of the problems we ran into was actually related to
Starting point is 00:33:41 Postgres bloat. Because it turns out with DMS, you get a lot of kind of knobs that you can tune for the concurrency of like the copy. And that goes pretty fast, like we're able to copy, you know, a terabyte of data in maybe 1214 hours with the way we have our DMS set up. But the thing we noticed after we set up the target databases is somehow the data size, like, so we use PG analyze as like our Postgres monitoring tool. And like, we noticed that the data size literally on disk was like three times larger on the new tables for some reason than the old one. And, you know, we were going from like Postgres 12 to Postgres 15. So at first we thought maybe it was like a toast compression thing,
Starting point is 00:34:26 because we know that the toast compression algorithm changed between, I think, 12 to 14 or something. It went from like PGLZ to LZ4 or something. I don't remember exactly, but I know that that changed. So I was like, oh, like, is it the new post compression that's causing like the data size to be literally much larger on disk, which obviously leads to much worse latency because then every one of your queries are doing much more IO,
Starting point is 00:34:50 which is... Buffers, buffers. Yeah. So, you know, we tried a few tests with like changing the compression back to the old one. We weren't fully convinced still, but it turns out actually we were just hitting like bloat. And this clicked when I was actually listening to one of your earlier podcasts, I think, on Postgres bloat. And I realized that our vacuums weren't actually fully completing on the new table because there was just so many new transactions being written, the vacuum could not keep up, which meant that the bloat just kept on accumulating and accumulating. And on top of that, a lot of the data in this table was being toasted
Starting point is 00:35:25 because they're just JSON V columns. So anyway, it was just like two weeks of investigation to be like, oh, we should just run a full vacuum. And it turns out that brought the data size right back to what it should be. So it was a fun little kind of investigation. It sounds like Pidginalyze doesn't tell you about Xmin Horizon and proper things like why vacuum can be lagging and skipping some
Starting point is 00:35:50 data tuples that cannot be deleted. PGA Analyze has a very good vacuum dashboard. So I'm very curious. Does it speak about Xmin Horizon? It actually does point those things out. We didn't have the new databases in PGA Analyze yet
Starting point is 00:36:07 because we're like, oh, we're not using production traffic on these, so we don't need it there. When we put it there, we realized that, oh, yeah, like the vacuums are falling behind. I'm just personally myself very annoyed that Postgres monitoring systems lack better Xmin Horizon and long transactions, which is two separate things, basically. Alerts. Long transactions can
Starting point is 00:36:28 happen and nobody tells you. It's not the right thing if you have sometimes two Postgres monitorings and nobody tells you about this. So I'm curious if Pidginalyze has some alerts. It does. It sends you an email if, for example, you've had any transaction running for over
Starting point is 00:36:43 an hour. Right, but if you use Logical, it's not you've had any transaction running for over an hour. Right. But if you use Logical, it's not the only reason of Xmin Horizon being installed. Right. Okay. For us, TGN Live has been really great, actually. I love working with that team. Shout out to Lucas. He's been so amazing.
Starting point is 00:37:01 And also content. Again, back to content sharing. Lucas does a great job and his team does a great job. Yeah, Lucas is awesome. I talked to him as well a few weeks ago and everything at MPG Analyze, their blogs are really great. Okay, I think it's interesting you guys had the data copy be such a big bottleneck. We also had a big pain point there when we were first doing vertical sharding. And what we ended up finding out is we kind of went into the source code for Postgres and logical replication and realized that it's really, really expensive to have indexes on your database table while you're
Starting point is 00:37:35 doing logical replication and much faster to drop the indexes and then recreate them afterwards. So it went from like weeks to a day for a multi-terabyte data copy when you dropped all of your indexes and foreign keys. Yeah, that's a binary copy option. And also there is, if you need it for upgrades, for example, or you need majority of data, sometimes it's better to convert physical replica to logical, which I hope soon will become standard recipe inside Postgres itself. But now it's only possible if you orchestrate it properly. I'm very curious if you do this trick on RDS, both of you or no. We rely a lot on logical replication for all of our core failover tooling so we haven't really explored physical replication i'm not sure if we really can because
Starting point is 00:38:30 of eds on rds it's possible if you apply so-called instacart recipe but but it's there are debates about is it safe to do it actually lucas blog posted about it as well recently you're talking about the thing where you like create a replication slot take a backup and then like advance the replication slot yeah either this or yeah on on rds only this yes yeah on self-managed you can play with recovery target lsn right yeah so i'm curious is it only in stack art or what we've actually used that before too so actually before we knew the trick that sammy just told us about with the indexes, we could never get logical replication to catch up ever to our tables. So we actually ended up using the Instacart trick to do our first round of upgrades that I ever did at Notion.
Starting point is 00:39:19 It worked well. I think one of the problems you run into with that, though, is similarly, when you provision the new database off the backup, you have to do another extra step of actually dropping the indexes again. Because even just catching up with logical replication from the point that you took the backup for a sufficiently large database still took us like a day or so just to catch up the changes between the replication time and that thing. So I think, yeah, that trick could work really well. Probably combined with dropping the indexes would speed it up significantly. But actually for us, after that, we've always just used regular logical replication and it's been fast enough if we can provision the schemas without any index. Actually, dropping indexes lead us to one topic I wanted not to miss. Huge wall volumes generated
Starting point is 00:40:05 and index write amplification. Anyone can tell anything about these problems, like experience something about it. Because this is one of the key criticism points from Uber when they posted this infamous article, right? Postgres has index write
Starting point is 00:40:22 amplification. We have hot updates, but still, it's a problem right actually Dirk I remember article about maybe a couple of years ago about fill factor right is it still the tool you use actively well fill factor is not a tool right
Starting point is 00:40:38 it's just a parameter you can set for your tables and your indexes but the tool is to use it to provoke more hot updates. Yeah. If we lose hot updates like we did last week, it's like we doubled our wall usage. So that's already, and that was only for single table.
Starting point is 00:40:58 For some reason, with the next release or the new release, we had a longer running query, which means pretty long because it was holding back hot updates. And that means like you just double the wall volume on your system. But we have many tables, right? And then only this single table is responsible or becomes responsible for half the wall
Starting point is 00:41:17 of the entire cluster. For me, that's an amazing number that you can generate or skip creating that amount of wall when using hot updates. So yeah, if we are having heavily updated tables, we always try to find the right fill factor. I've been trying to get this formula, like what tables do you need to fill factor and how high do you set it? And it's still a mystery to me because we have some tables which do fine at 95% fail factor and some do really great at 70%. And also you need to protect this table from new index being created and losing all the
Starting point is 00:41:54 hot updates somehow, right? If developers decide to add one more index. Yeah, I can't. Right. I have no way to force my developers. I'm actually curious. A big pain point we have right now is for really expensive index creations, there's no good way of throttling it.
Starting point is 00:42:11 So if it eats a lot of your IOPS, there is not really a Postgres way we've found to tune this so that maybe you want the index creation to happen much more slowly in the background over a few days for some really large tables. But it will hold Xmin Horizon again, this infamous problem. Sure. But I mean, maybe three days of holding that is better than an hour of the website being down. This might be a stupid question, but would reducing maintenance work mem in a, can you
Starting point is 00:42:38 do that in a session? Would that help? Or just call your mind, maybe. Yeah, I think we have the minimum. Yeah, we're only using one thread for the maintenance settings there. But it has been a couple of times enough to spike our IOPS to basically the RDS limits.
Starting point is 00:42:53 I always want indexes to be created faster. Yeah. But it requires faster Discaio capabilities, of course. This is what I love about TrueScale is you just hit trade-offs you hit times where one thing's best and you hit other times where the the complete opposite is best so yeah um i'm really conscious of time you've all been amazing thanks so much for joining i wanted to give you each an opportunity to say anything else you wanted to add or forgot to say
Starting point is 00:43:24 or feel free to shout out if you're hiring, that kind of thing. Let me ask one more technical question. Everyone uses foreign keys everywhere or decided to drop some of them everywhere, right? No, I'll actually say something about this. I think honestly, we use probably Postgres in the least interesting way possible. For our sharded databases, we don't have foreign keys. We don't do joins. We don't have any triggers. We don't have really anything that can make, we don't have any generated columns.
Starting point is 00:43:55 Like we limit so many things that you can do that we're basically using Postgres as like a blob store at this point, which is why we end up with really giant databases with a lot of indexes. But the queries are quite simple, actually, to retrieve the data. And I think that's been like probably one of the primary reasons that I've actually, and like my team has actually been able to kind of keep up with the growth of our Postgres cluster, because we're able to do things like resharding without having to, you know, talk to every team at the company and be like, can you please change this query? Can you please not do this thing? You know, I think a lot of people get really excited about using all the fun features of relational databases, but actually you
Starting point is 00:44:32 can use it in a very boring way, but still get some of the benefits of relational databases, such as consistency and things like that. So I think that's like my one trick to high scale with any database, not just Postgres, MySQL, and really any other thing that you can think of. It's good to be aware of some of the pitfalls of the more interesting features. So no foreign keys? Absolutely not. Any more input? Yeah, we do allow foreign keys on the sharding key. And similarly, we allow joins on the sharding key. So for us, it's a matter of picking the sharding key. And similarly, we allow joins on the sharding key. So for us, it's a matter of picking a sharding key
Starting point is 00:45:08 so that most operations are happening within that. And then that allows us to avoid rewriting a lot of application logic. Smart. We also have foreign keys at POSET, which is in two weeks time, a little bit less. I tell you how to find a proper partitioning strategy, because up to now, I've been focusing on partitioning a single table in the best way possible. But I kind of forgot to look at the bigger picture. So all my tables are partitioned slightly different with different partition boundaries. And then the entire system becomes
Starting point is 00:45:39 a bit messy. So now we are working on basically straighten out all the partition boundaries, because then the joining and the foreign keys work much more efficient. And so is partition pruning. Clever. And I kid you not, some queries became 20 times faster by straightening out partition boundaries. Wow. Wow. I'll make my last plug on the hiring side.
Starting point is 00:46:02 My team is hiring really aggressively right now. We're basically trying to double this year. And in particular, we just had an awesome senior staff engineer become a manager. And so we have a big gap in senior staff principal type engineers, folks who have worked at scale before. So if you're interested in those problems, please reach out to me or come apply. Notion is also hiring. But my pitch is definitely not as refined as Sammy's. But if you want to come work on fun problems,
Starting point is 00:46:33 check out our careers page. I actually think this might be the first time in 20 years we are not expanding my team this year. Oh, wow. But we should plug Dirk's talk at Pazet, which you can all watch. So I'll include links to all of these things final one final thank you from me this has been amazing thank you so much for joining us for episode 100 yeah thank you so much i wanted just to once again like say
Starting point is 00:46:56 thank you for knowledge sharing super important looking forward to more definitely to more blog posts and maybe some run books published i don't know like things like you use in your daily work maintaining these large databases large postcard databases so super curious to see more from you and maybe to collaborate on some things like some knowledge and exchanging and so on thank you thank you Thank you very much, both of you. It's really cool to be here, especially in such good company. Yeah, this is a really fun conversation.
Starting point is 00:47:31 Thanks for getting us all together. Yeah, thanks again. I really enjoyed this. I hope to be back on the future as well. This was the best episode number 100 we could imagine. Agreed.
