Postgres FM - To 100TB, and beyond!
Episode Date: June 7, 2024

Michael and Nikolay are joined by three special guests for episode 100 who have all scaled Postgres to significant size: Arka Ganguli from Notion, Sammy Steele from Figma, and Derk van Veen from Adyen. They cover how their setups have evolved, what their plans are for the future, and get into the weeds of some fun and interesting challenges along the way!

Links to some of the things discussed:
Arka Ganguli from Notion https://postgres.fm/people/arka-ganguli
Sammy Steele from Figma https://postgres.fm/people/sammy-steele
Derk van Veen from Adyen https://postgres.fm/people/derk-van-veen
Thank you to yerrysherry on Reddit for the idea! https://www.reddit.com/r/PostgreSQL/comments/1cn8ajh/what_should_we_do_for_episode_100_of_postgres_fm
Lessons learned from sharding Postgres at Notion (October 2021) https://www.notion.so/blog/sharding-postgres-at-notion
Adding Postgres capacity (again) with zero downtime (July 2023) https://www.notion.so/blog/the-great-re-shard
How Figma's databases team lived to tell the scale (March 2024) https://www.figma.com/blog/how-figmas-databases-team-lived-to-tell-the-scale
Updating a 50 terabyte PostgreSQL database (March 2018) https://www.adyen.com/knowledge-hub/updating-a-50-terabyte-postgresql-database
Part 1: Introduction to Table Partitioning (July 2023) https://www.adyen.com/knowledge-hub/introduction-to-table-partioning
Part 2: Partitioning at Adyen (October 2023) https://www.adyen.com/knowledge-hub/partitioning-at-adyen
Part 3: Maintenance Under Pressure (January 2024) https://www.adyen.com/knowledge-hub/maintenance-under-pressure
Fighting PostgreSQL write amplification with HOT updates (May 2022) https://www.adyen.com/knowledge-hub/postgresql-hot-updates
Tracking HOT updates and tuning FillFactor (October 2022) https://www.adyen.com/knowledge-hub/postgresql-hot-updates-part2
Partitioning your Postgres tables for 20X better performance (upcoming talk by Derk at Posette) https://www.citusdata.com/posette/speakers/derk-van-veen
Figma, including Sammy's team, is hiring https://www.figma.com/careers
Notion's engineering team is hiring https://www.notion.so/careers
Adyen's engineering team is hiring https://careers.adyen.com

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and I'm joined as usual by Nikolay, founder of Postgres.ai.
Hey Nikolay.
Hi Michael.
And this week we have not one, not two, but three special guests with us to celebrate
episode number 100.
We asked you, our listeners, for ideas for this episode and we got some great suggestions
including the topic of scaling to 100 terabytes for episode 100. So we thought that was...
And beyond. Oh, and beyond, yes. Always beyond.
Yeah, no downtime allowed. We're delighted to welcome three people who have successfully scaled Postgres exceptionally well, and those are Arka Ganguli from Notion. Hey, Arka.
Hey, good to be here. Really excited to talk about Postgres.
And hey, Sammy. Sammy Steele from Figma.
Thanks so much for having me on.
And finally, Derk van Veen from Adyen.
Yeah, thank you very much for the invitation. I'm honoured.
We are honoured to have all three of you. So yeah, thank you all for joining.
So we're going to try a kind of panel roundtable style discussion.
I'll play good cop trying to keep us on track, but you are all welcome to chime in with questions for each other.
And I know Nikolai will as well. So let's start on the simpler side of things.
And starting with you, Derk, would you mind letting us know what kind of things you're using Postgres for at Adyen there?
Well, that's an easy question.
We keep our data within Postgres database.
The thing is we are processing financial transactions.
And the thing about financial transactions is you'd better not lose them.
Because if you have a social media kind of platform, right, and you refresh your page,
and you're missing a picture and refresh again, And the picture's there. Nobody really cares.
But if you refresh your bank app and your money is gone,
then people, for some reason, they kind of freak out.
So, yeah, we try to not lose a single bit of data within our Postgres databases.
So you don't use data type money as well, right?
No, I couldn't.
Okay, nice.
Thanks so much.
Sammy, how about at Figma?
Hi everyone, I'm Sammy, and at Figma our platform is online, web-based SaaS software that allows designers to collaborate together. And, you know, kind of the magic secret sauce is this real-time platform where you can comment on a design and it immediately shows up for other people. And so all of that core data is powered by Postgres.
And it has to be really reliable, really fast and low latency so it can feel real time.
And we've grown 100x in the last four years.
So that's where we're getting our 100 for this episode from.
Nice.
Even more hundreds.
We like it.
And Arka, how about Notion?
Yeah. So unsurprisingly, we also use Postgres to store our data and make sure that it continues being there when our customers expect it to be.
For those of you that don't know, Notion is kind of a collaborative workspace.
You can use it for documents, wikis, product management, and it has a lot of AI built in recently. So really, it's like the place where a lot of customers store their entire knowledge base and run their company or their personal
use cases as well. I just wanted to mention that I'm a user of all three companies' products,
because for example, Notion, I'm a big fan of lists and so on. So it's a great thing for that,
like for such kind of people I am. And Figma, it's like, I actually remember I created
a lot of interfaces in Figma myself. It's super cool too. And especially this real time thing as
well. And somehow I noticed Figma replaced all the things I used in the past. And I just told Derk before we started recording that every time I go with my daughter to Legoland, I see Adyen on the payment machine when I pay for something. And I think, oh, Postgres processing is there. So it's
cool. Thank you for coming once again. I love how humble you all are, assuming people might
not know who your companies are as well. It's nice. Let's flip the order and let's look at
how you set things up Postgres-wise. So how things were in the past, how things are set
up now. I realize this might be quite an in-depth answer and any future plans perhaps as well. So
maybe at the highest level you can describe, that would be awesome. Arka, maybe we should start on your side.
Sure. So I joined Notion a little bit over three years ago. And when I got there,
we had actually just finished like our first sharding
effort. But essentially at Notion, we run a pretty simple deployment. We use RDS, which is Amazon's
managed database offering. We run Postgres on that. And we run pgBouncer in front of our RDS
cluster for connection pooling. And that's pretty much it. We do have sharding, like I mentioned, which basically is just like
splitting up the data across more databases. And the logic for how to talk to the database is
actually controlled by the application. So depending on some ID, we know what database
to send your query to. And yeah, that's kind of how it is even today. We have a few more databases
than when I started, but the overall high level
architecture is pretty much the same. Nice. I've read both of your sharding and resharding
blog posts. I'll link both up in the show notes, but I'm guessing that will continue to work. Any
reason it won't continue to work for the foreseeable? Yeah, I think our plans are not
really to move away from Postgres. It's worked really well for us so far. And we've got a lot of people at the company now that have built up a lot of knowledge around how to operate this thing at scale.
So definitely something we want to stick with moving forward.
We probably do have to reshard in the future at some point, and we're moving to more of like a cellular architecture where we can kind of deploy maybe like different
versions of Notion to scale our capacity with like their own databases and caches and things like
that. So we don't have to continuously spend our time resharding over and over again, because that's
a very time intensive and operationally intensive kind of thing for engineers to take on. So we
probably do have to reshard in the near future,
but in the longer term, we hope our more like cellular architecture will be able to scale much
further. Super interesting. Sammy, let's go to you. I know you're sharding as well, but slightly
differently, I believe. Yeah, well, we definitely have a lot in common with what Arka just described.
And so your blog posts were a great inspiration for us as we
were in the early days of horizontal sharding. So similar high-level stacks, and that we're also on
RDS Postgres, Amazon's managed system, and we also use PGBouncer for connection pooling,
at least for now. And where we start diverging from Notion is that I think our data model is somewhat more complex and less naturally a good fit for sharding. My understanding is Notion had shard IDs based on the orgs or the workspaces,
and that worked pretty well. But at Figma, we have a lot of data not associated with an org,
and data moves quite frequently between orgs, which makes that kind of sharding model quite hard.
And so we actually ended up going
with a more tactical approach for each table, picking a set of a few shard keys that would be
more optimal. And we ended up building a proxy layer so that instead of the application having
to be aware of shards, we actually have this DB proxy service we built, which is a Golang service.
And it's able to do all the routing and handle all the
horizontal sharding, scatter gathers where a query hits many shards, and then you aggregate results
back. And so that's kind of been a superpower for us to have a lot more control in that layer.
And long-term, we have sharded our simplest, highest write rate tables, but we still have
a ways to go until everything is sharded. And we have to build out a lot more infrastructure to support that. So unlike Notion, we actually expect reshard
operations to be quite common. And so our goal is like one click failover, where you can just
hit a button and then have a reshard operation happen and transparently in the background.
Sounds like there's logical replication involved, but maybe we'll talk about that later.
I'm so interested in this proxy layer. Did you build it yourself in-house? We did. Yeah. So it's a Golang service.
So it has a couple of components. It has a query engine, which has a parsing layer that takes in
raw SQL and transforms it to an AST. And that part, we did borrow Cockroach's Postgres parser,
which has mostly worked well for us with a few modifications. But then, for all of the routing logic, we have a topology layer that we've built that knows where all the databases are and which shard keys map to which tables, and so the query engine really does all that evaluation and query rewriting.
Yeah, that sounds pretty mind-blowing to me. It's really cool.
Yeah, there are a couple of projects like PgCat and SPQR which do that, but maybe when you started they were at a very early stage.
Yes. Actually, to clarify, we wish we had named this something besides DB proxy, because PgCat would not be a replacement for what we have. We actually might use PgCat to replace PgBouncer, because we are hitting some PgBouncer scaling problems, and I'd be curious if you guys are also running into challenges. But DB proxy is really a stateless service that talks to every single database. A request gets routed to it, and it has a connection pooling layer which talks to a PgBouncer- or PgCat-like layer that is specific to a single host. So it's really playing more of a high-level orchestrator role.
So cool. And I love that we've already got two different approaches to sharding, kind of one that's transparent to the application and the developers and one that's not. And I think we're about to get a third answer from Derk's side.
How are things set up at Adyen and past, present, future?
Let's start with the past. I think we had a pretty, it was by far our most well-read blog article, and that's how to
upgrade a 50-terabyte database. And when
looking back on it, it's quite a while ago. So I think two years
ago, we had an internal event, and it was like, no, it's not how you upgrade
a 50-terabyte database. It's how you upgrade a terabyte database with multiple
zeros in the end.
At least two, I guess, right?
Yeah, we have two zeros in the end.
And not just upgrade, but zero downtime upgrade.
Or like, I just want to clarify,
because regular upgrade is not a super big problem
if you can afford a couple of hours downtime.
Yeah, we tried to think about the ordinary upgrade,
but the rough calculation,
it will take us like four or five weeks downtime,
which didn't cut the bill.
We've got links.
No, that one is much faster, fortunately.
I think we limited to 15 minutes of downtime
and we already had a big fight for this 15 minutes,
but it has to happen.
And it's not even a database downtime,
but it's more like all the infrastructure around it
to get it fixed, which requires downtime.
If you look purely at Postgres, I think it was minutes,
but it's not just a database you run.
I know this problem very well.
So, yeah.
Everything around adds more and more minutes
and then sometimes hours, right?
Yeah.
I'm curious, what's the biggest individual table
or do you have partitioning under the hood there to have
so many terabytes on one box?
What is the biggest table we have?
I don't know, but the biggest partition is
over 30 terabytes,
which is a single partition. Wow. And you don't
run into vacuuming issues with that?
I think we are the
most skilled company
when it comes to vacuum and how to tweak
vacuum.
You build your indexes, which blocks the xmin horizon, right?
Yeah, we have seen many vacuum issues. When I started to work with Postgres and I was installing this database, transaction wraparound was some kind of weird theoretical thing. Then I joined Adyen, and it became something you do every week, at least.
So yesterday I asked my Twitter audience,
what would you ask to people who manage many, many, many terabytes?
And one of the questions was, how are you?
I guess this is the case when you have problems like 30 terabytes one table.
How long is the index creation or rebuild time? It's like days, or...
Yeah, days. Days.
During which the xmin horizon is blocked, right? So we accumulate bloat in the whole database, for all tables. It's a super annoying problem.
But most of these tables are partitioned, so first we create an index per partition. And then finally, in the end, we create the index on the parent.
Otherwise, we don't make it before we hit wraparound.
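For readers following along, this is roughly the standard per-partition recipe being described here, shown as a sketch with hypothetical table and index names:

```sql
-- 1. Build the index on each partition first; CONCURRENTLY avoids long write locks.
CREATE INDEX CONCURRENTLY payments_2024_06_merchant_idx
    ON payments_2024_06 (merchant_id);

-- 2. Create an (initially invalid) index on the parent only, without touching partitions.
CREATE INDEX payments_merchant_idx ON ONLY payments (merchant_id);

-- 3. Attach each partition's index; the parent index becomes valid once all are attached.
ALTER INDEX payments_merchant_idx
    ATTACH PARTITION payments_2024_06_merchant_idx;
```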
Oh, wraparound.
It's also a problem.
But I'm talking about also vacuuming and bloat problems.
So multiple problems here.
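For anyone who wants to watch how close they are to the wraparound being discussed, the standard catalog checks look roughly like this:

```sql
-- Age of the oldest unfrozen transaction ID per database; Postgres starts
-- refusing writes as this approaches roughly two billion.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

-- The same per table, to see which relations most urgently need an aggressive vacuum.
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind IN ('r', 'm', 't')
ORDER BY xid_age DESC
LIMIT 20;
```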
Well, that's a super interesting experience.
It keeps us busy.
But it's also an honor to have these problems.
But I think all three of us got these big databases because the company is very successful.
Right.
And I think at the end of the day, having a big database is actually a design problem.
You forgot to design in time for the size you need to accommodate.
See, I actually disagree.
I think it's an expected problem.
If you design a company too early to scale, then you're not going to have a company probably
if you're trying to solve those problems
when you're five or 10 people.
I think one of the cool things about Postgres
is all of our companies did get very far on one host.
Figma existed for seven or eight years
and was pretty successful
before we had to actually start scaling out.
That's definitely true.
I agree with that.
And my team helped a couple of companies
who went to IPO being on a single Postgres cluster, and they got valuations in the billions of dollars with one cluster. And this was impressive. Of course, later it should be changed. But yeah, so I agree, with one cluster you can scale a lot.
But I would not like to be in your shoes dealing with multi-day index rebuilds and also transaction ID wraparound problems.
So how do you escape from these problems today?
Like 30 terabytes, one partition.
It's quite difficult, right?
So is it to be partitioned into smaller partitions?
Yeah, I'm thinking about how to answer this question. I think before I joined
Adyen, I tried to manage my database in a proper way. You do it by the book and by the rules and everything is fine. And Adyen is pushing me to find which boundaries can be bent or violated without too much risk.
I think that's the way to put it.
We still are on the safe side, but with more knowledge, you're better able to decide when do I cross a boundary and when is it really dangerous and when is it safe to do, but not strictly advised. Yeah, I think that's definitely one of the benefits of staying on a system like Postgres
that you end up understanding really well is all these limits are extremely specific
to the workloads and the tables and even what works for one table at Figma doesn't work
for another.
And so there is a lot of just accumulated organizational knowledge that makes it easier
to run these systems at scale and understand actually at what point do things get scary.
Yeah, I agree. If, for example, index creation takes one day, but our real transaction ID doesn't grow a lot, it's not a big problem. But if it grows very fast, then we have two kinds of problems: transaction ID wraparound, and bloat accumulating. I agree, it's very specific to particular workloads.
I want to go back to Derk quickly, because you've done such a great series of blog posts on partitioning. I'll share those. I reread them today, and part three promises a part four, so I'm hopeful of getting another installment at some point. But would you say partitioning is the way you've largely scaled there, or are there other strategies? Like, are you moving data out? How are you dealing with it, and how are you planning to in the future?
Yeah, I think we started
with horizontal sharding, and then within every shard, tables are partitioned. But with financial transactions, you need to be able to refund them for a very, very long time. So it takes quite a long time before you can actually archive data. So we have a lot of old data which is still around, but yeah, these partitions kind of shift more to the back of the row, and vacuum is much easier if you don't change a lot on a partition.
But at the moment, we are actually reaching the limits
of our sharding solution again.
So we basically start redesigning
the entire system again.
You know, that's really interesting, because it's interesting that you chose Postgres partitioning as the thing that got you really far, because at Notion we actually ran into, I don't know, like really weird issues with dropping indexes on partitioned tables. But that's because I think this database was running on Postgres 12, and I think a lot of these things may have been fixed in future releases. But we actually went the opposite way, where we're like, we don't want Postgres partitions, let us manage our own. Because, I don't know, there's some weird behavior we ran into when creating and dropping indexes, especially because some things you can't do concurrently on partitioned tables.
Yeah, dropping indexes is a nightmare. I always tell my developers, like, you can create indexes on a partitioned table, but if you want to drop them, I come after you.
Fun problem. Actually, Derk, you said something earlier about knowing
which rules you can kind of bend or break. One of the ones you mentioned in those posts was around adding check constraints, and adding them in a NOT VALID state but then not running VALIDATE afterwards. I think you mentioned just updating the system catalog, and that's such a cool trick, but, like, how? Yeah, I guess you have to become comfortable.
We used this trick like 15 years ago, when VALIDATE was not an option, so something like that. We used it, but it's an undocumented thing. Don't do it if you're not owning a database.
For example, if you're a consultant,
this is not something you should recommend
because who knows what will happen next.
If you own a database, if you work inside a company,
it's probably okay.
It's not official.
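For context, roughly what this looks like, with hypothetical names. The catalog update at the end is the unofficial part being discussed; it is shown only to illustrate the trick, not as a recommendation:

```sql
-- Add the constraint without scanning the table; only new writes are checked.
ALTER TABLE payments
    ADD CONSTRAINT payments_amount_positive CHECK (amount >= 0) NOT VALID;

-- The supported follow-up: scans the table, but only takes a light lock.
ALTER TABLE payments VALIDATE CONSTRAINT payments_amount_positive;

-- The unsupported shortcut: mark it validated directly in the catalog.
-- Only conceivable if you own the database and know the data already complies.
UPDATE pg_constraint
SET convalidated = true
WHERE conname  = 'payments_amount_positive'
  AND conrelid = 'payments'::regclass;
```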
By the way, I wanted to emphasize that the problems we've just started touching on, most people who listen to us don't have, right?
Because it's like extreme problems.
I'm not completely sure.
I think partitioning becomes useful for a lot of people
way before you hit the limit where you have to.
So if you're listening to this podcast and you think like,
yeah, partitioning is too far away, just think again.
Because if you do it now, when you have all the time in the world
to figure it out, find a good strategy,
then you don't end up with a huge partition,
a small partition, weirdly partitioning things.
So I would say, just start in time.
I'm on the same page with you here
because you mentioned this rule,
like 100 gigabytes as threshold
when you need to start partitioning.
Same rule we also apply everywhere.
If you exceed 100 gigabytes, it's time to partition, I agree.
Interesting.
We have a lot of tables larger than that that aren't partitioned,
but probably about a terabyte is what we aim for max for our largest shards,
although we have one 16-terabyte table today.
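As a reference point for the 100-gigabyte rule of thumb mentioned above, here is a minimal sketch of checking table sizes and declaring a range-partitioned table (hypothetical names):

```sql
-- Which tables are past the ~100 GB threshold?
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;

-- Declarative range partitioning by time.
CREATE TABLE events (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_06 PARTITION OF events
    FOR VALUES FROM ('2024-06-01') TO ('2024-07-01');
```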
But another interesting topic that I think is more relevant to people are upgrades.
We have built this no
downtime tooling that we've mostly used for reshard and horizontal sharding operations. But
more recently, we've been using it for no downtime major version upgrades with the ability to roll
back if you run into problems. And so I think that's something that could be interesting for
anyone who's running Postgres. Upgrades are always a hard thing to do.
Yeah. By the way, we forgot to mention that Adyen is in a self-managed situation. It's not on RDS, right?
Because it's very different.
Yeah.
I'm very curious how you did this for RDS case, not using Bluegreen deployments, right?
Just fully...
Yeah, at least today.
Well, Blue/Green only very recently became available for Postgres; last year it was really only MySQL. And so we'll probably explore it going forward. One blocker, which we're actually talking to the RDS team about today, is the rollback: there is no way to roll back.
And that's wrong. The Blue/Green deployment idea is that it should be symmetric, and when you switch over, reverse replication should be installed immediately. And without this, you don't have a rollback plan, right?
Exactly.
It means data loss.
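A rough sketch of the symmetric, reverse-replication idea being described here, using plain logical replication and hypothetical names; real failover tooling also has to deal with sequences, DDL, and the cutover window:

```sql
-- On the NEW primary, immediately after switchover: publish all changes.
CREATE PUBLICATION rollback_pub FOR ALL TABLES;

-- On the OLD primary, now the rollback target: subscribe without copying data,
-- since it already has everything up to the switchover point.
CREATE SUBSCRIPTION rollback_sub
    CONNECTION 'host=new-primary dbname=app user=replicator'
    PUBLICATION rollback_pub
    WITH (copy_data = false);
```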
Yeah, so that's what our failover operation does.
I think it's also nice to have more control when you do it locally
because we're able to, for instance, do a no-downtime replica failover. And because most of our P0 critical workflows read from replicas, that means that most failovers don't actually
incur any downtime. Whereas if you're using a managed thing like Bluegreen, you lose that kind
of control. Yeah. Well, I'm super curious about details here, but maybe it should be a separate
discussion. And first of all, I wanted to thank you, all of you. We invited you because you shared very great posts.
So this knowledge sharing is super important.
So I would be happy to see more and more coming.
And maybe if we talk more and exchange ideas,
we could collect better materials.
For example, how to do zero downtime upgrades,
because I know a very good recipe for self-managed Postgres,
but I don't know for RDS, because they don't allow you to control recovery_target_lsn. So I'm very curious to talk about details, but maybe that's a separate discussion, because I know Michael has a different plan, not to talk half an hour about upgrades.
Well, maybe we could actually do a quick lightning round of which major version each of you are currently running, if you know and you're happy to share.
Is it a single version only, or...?
Also a good question. Sammy, you happy to share that one?
Sure, I can definitely talk about that. Well, we are newly, as of two months ago, fully on PG 13. So we had a couple of PG 11 very legacy boxes.
There were some of our earliest, largest databases.
And that actually was the main motivation for building out this upgrade workflow: to get those onto PG 13. And we would love to use some of the features that are available in, like, PG 15, PG 16.
And so we're probably going to look into fleet wide upgrades
sometime next year to get to a newer version. But this is where this one click upgrades or
failovers really matters because today it's quite toilsome and it would be pretty painful
to upgrade the whole fleet. Yeah, absolutely. How about you, Arka?
Yeah, we're actually like 90% on Postgres 15, and it is my life's mission to get us that final 10% over.
Yeah, like most of our databases are on actually Postgres 15. We finished that upgrade last year.
And yeah, we really wanted some of the features that I think got introduced maybe after Postgres
14, especially around being able to filter the replication stream, because that's something we
want to use in the future. Probably if we want to reshard, that'll be a very useful thing for us. And just wanted to touch on Sammy's point a little
bit. We actually use the exact same kind of, I think, failover that you guys use as well at Figma
based on PGBouncer and with the reversibility as well. And that's something that has been really,
really surprisingly amazing.
Yeah. You mean PgBouncer pause/resume. I guess this is why you don't use RDS Proxy, being on RDS, right? Because it doesn't have pause/resume.
Yeah, that's great.
I also was surprised last year that it can handle really heavy workloads.
Yeah.
Just a few seconds spike of latency,
you can upgrade and switch over to different primary.
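For reference, the general PgBouncer pause/resume switchover pattern looks roughly like this, issued on PgBouncer's admin console (the database name and config handling are hypothetical):

```sql
-- psql -p 6432 -U pgbouncer pgbouncer
PAUSE appdb;   -- let in-flight queries finish, hold new clients

-- ...promote the new primary and point pgbouncer.ini at it...

RELOAD;        -- pick up the new host from the config
RESUME appdb;  -- release held clients; they reconnect to the new primary
```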
Yeah, it was one of those things where when we were writing out the tech spec for it,
I was like, is this thing going to actually work?
The problem with this, by the way, is that there are no good materials proving it. Like, nobody says it; it's a public secret. Nobody talks about it somehow, but it works really well. So we need more materials proving that pause/resume works well in PgBouncer.
Yeah, it's been great for us. We have some pgBouncer scaling problems, but
everything on the failovers, pause/resume side has been flawless.
That's so good to hear. And finally, Derk, on the major version side of things.
We are fully on 13, and we're having discussions about the next version we want to upgrade to
because every version has nice things. I want certain partitioning functionality, a colleague wants more logical replication functionality, so of course 17 would be the ideal. But then, do we really want to run 17 already with the biggest clusters we have? So yeah, to be continued.
And I think also the fact that, I mean, Notion being on 15 was somewhat of a surprise to me, I think because of these major version upgrade challenges. I see so many companies, even at the cutting edge of what Postgres can do, lagging several years behind because upgrades are so difficult. So I know we won't get a chance to discuss all the reasons why, but I think it's proof, because we're all lagging.
Yeah. Well, I actually think it's not for us so much fear of newer versions. I actually worked
in MySQL before Postgres and MySQL major version upgrades were truly terrifying. These were like
two or three year long efforts where you found a lot of major incompatibilities. But one thing I
love about Postgres is it is mostly backwards compatible across major versions. And so it's not so much
that we're scared of the newer ones as versus we can't take down time to do these upgrades. And so
there it's much more than manual toil. Yeah, exactly. I think that is I think that's the
case for a lot of people. So the last one I was going to ask each of you is if you have a funny or scary, and I don't know which one I want most, story that you can share with us related to your scaling. Any of you got one prepared already? Feel free to jump in. Or, Nikolay...
Lightweight locks, lock manager, multixact transactions, multixact IDs, and so on. Something like that, maybe.
Yeah, I think we definitely got bitten by the lock manager, like most people who start doing partitioning.
It's like we started partitioning
and we did a good job
and I built this framework around partitioning.
By the way, Michael,
the fourth blog post is,
we are writing it at the moment.
Yes.
So it will be there.
It's a nice idea.
Now I lost the rest of my story.
Oh, yeah, partitioning.
So it sounds like we have prepared statements.
No, we just have the prepared statements.
But you know everything about prepared statements, right?
Where you try it five times,
and then the optimizer decides,
do we go with the generic plan or the custom plan?
But we also created a lot and a lot of partitions
and a lot of partition
tables. So we definitely ran out of locks. And our CPU, basically every time we created a new partition for a table, the CPU usage jumped 10%. But jumping 10% was fine because we had a lot of
CPU to spare. So by the time it alerted, it was pretty high. And then we were looking back and
we see these jumps of 10% and 10% and 10%.
And then we were looking like the next jump of 10%, that's above 100. And then we are not in a
good position. So then we had this small window with all kinds of things prepared, like don't
create any partitions anymore on this cluster. But can we break open partition boundaries so we can basically
extend the final partition to hold more data? And then we were researching this problem. Like,
why is the CPU usage jumping like this? And in the end, we started to force the optimizer to pick generic plans, because custom plans were faster in execution but definitely not at prepare time. So we won like a few milliseconds in execution time, but we were paying like 100 milliseconds in prepare time.
And that was really heating up the CPUs.
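One way to force generic plans, shown as a sketch with a hypothetical prepared statement; plan_cache_mode has been available since Postgres 12:

```sql
-- Default behavior: after five executions of a prepared statement, the planner
-- compares custom-plan and generic-plan costs and may switch to the generic plan.
-- Forcing the generic plan skips the repeated (and, with many partitions, expensive) planning.
SET plan_cache_mode = force_generic_plan;

PREPARE recent_payments (bigint) AS
    SELECT * FROM payments
    WHERE merchant_id = $1
    ORDER BY created_at DESC
    LIMIT 100;

EXECUTE recent_payments(42);

-- Verify which plan shape you are actually getting:
EXPLAIN (ANALYZE) EXECUTE recent_payments(42);
```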
Because during planning, Postgres locks all indexes and tables, all partitions, all their indexes, if partition pruning is not working. And so it's, like, terrible.
Yeah.
So you're running out of your fast-path locks, because you need to lock like 20 partitions and 20 indexes on these partitions for a single table. And then you join with a partitioned table, and with a partitioned table all these locks go to the lock manager, which is burning CPU cycles.
You're in a good club of a bunch of companies
who experienced this over a couple of last years.
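Two queries that are commonly used to spot this kind of lock-manager pressure (a sketch against standard system views, not anyone's exact monitoring):

```sql
-- Backends currently waiting on the lock manager lightweight lock.
SELECT pid, wait_event_type, wait_event, left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event = 'LockManager';

-- Locks held outside the fast path; each backend has only 16 fast-path slots,
-- and everything beyond them goes through the shared lock manager.
SELECT pid, count(*) FILTER (WHERE NOT fastpath) AS non_fastpath_locks
FROM pg_locks
GROUP BY pid
ORDER BY non_fastpath_locks DESC
LIMIT 20;
```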
Yeah, I was going to say,
we hit this for vacuuming on our bigger tables.
When you get
above like four terabytes, you start seeing that at the end of the cycle, when you run analyze and the cache gets cleared out, and then you suddenly have these custom plans that are 5x more expensive, and that's when CPU spikes to 100% for a few seconds.
I have a question to all of you. When
you have these problems do you like discuss this with people who can influence this,
like in hackers somehow? Do you
raise these problems in mailing lists
or somewhere else where
hackers live?
We work with AWS support quite
a bit. Well, some hackers
there are for sure, but not all of them, right?
Yeah. I mean, we're good friends with the
AWS RDS team, and they have actually
a team that contributes to Postgres, so we try to leverage them to get some of these patches in. But it's honestly a lot of us reading the source code ourselves as well and trying to figure out workarounds.
There are also improvements, created originally by Andrei Borodin, which will be in Postgres 17. They have already been adopted by AWS RDS Aurora
for a couple of years, almost.
I was super surprised.
Actually, Andrei was also surprised.
But you're not going to use Aurora, right?
Instead of regular RDS.
Maybe.
I think, I mean, for Notion,
I think we are actually starting to look into
if Aurora makes sense,
mostly because the
AWS team promises a lot better replication for global replicas compared to the RDS built-in
replicas.
So it's something we're considering, but we haven't actually used before.
So I'm curious if any of you have any experience with that, or with scaling read replicas in general, because that's also something we haven't leveraged too much at Notion yet.
Well, less on scaling read replicas, but actually
my scary story was going to be around moving things more to replicas because at the same time
that we were growing 100x, we were dealing with a lot more pressure on the reliability of our
system and trying to move towards four nines globally as a company. And so our application,
because it's a web browser and we have a kind of single-threaded Ruby monolith that serves a lot of traffic, which is very latency sensitive, so that if your database slows down for a minute, then the whole application can back up and your whole website can be down. And so that was one of our biggest reliability risks and most
persistent pain points is our RDS does have some big latency spikes on EBS that most customers
don't notice, but we were really hitting. And so our two big solutions there were to move,
basically remove writes from our most business-critical routes wherever possible, and put all reads or 99% of reads onto replicas for these P0 user workflows.
And then we build out replica hedging.
So every request hits two replicas and returns the first results that are successful there,
which allows us to tolerate any one replica being down.
And that's been hugely useful.
We also got onto IO2 for EBS,
which has been a game changer for EBS reliability.
Oh, nice.
How many replicas do you guys normally have off one database?
Today, it's quite, there are only two large replicas. In the future, we want to move towards more smaller replicas.
We have multiple, let's put
it that way. Reading from a replica is much harder when it is about financial data, because you want
to have the correct data, and it might just been updated. So we try to move a lot of stuff to the
replicas, but it's kind of hard. Yeah, that would definitely be a problem for financial data.
Are we talking about asynchronous replicas only or maybe
some semi-synchronous quorum commit approach?
Oh, we have both.
I mean, also cache invalidation. I think that's one of the bigger concerns for us at Notion, because we're super, super heavy users of memcache, and reasoning about how that works for replicas is another thing that we don't really want to do but have to do. I don't really have, like, a super scary story, but there's actually something I've been working on recently where we ran into some surprising behavior, so maybe it'd be just fun to talk about. So recently at Notion, we've been trying to upgrade that last 10%, and, of course, those databases are
just the hardest because they're just so big. So we're trying to upgrade this partition table
that is around 20 terabytes with like 16 partitions on each database. So, you know,
each partition is about like a terabyte-ish. And we were having trouble getting this to work with
just regular logical replication.
So we actually started exploring DMS, which is like an AWS service that lets you migrate data between two separate databases.
Side note, it's actually super cool.
Like you can go from like a completely random system to like another new system.
It doesn't have to be like Postgres to Postgres only. I don't know how good it is at huge scale.
It just requires a department
in your company to manage.
Yeah, so
originally I was like, oh yeah, I can do this.
And then it's been like four months now and I'm still
like, oh my God, this is
a lot. But anyway, one of the
problems we ran into was actually related to
Postgres bloat.
Because it turns out with DMS, you get a
lot of kind of knobs that you can tune for the concurrency of like the copy. And that goes pretty
fast, like we're able to copy, you know, a terabyte of data in maybe 12-14 hours with the way we have our DMS set up. But the thing we noticed after we set up the target databases is somehow the data size... So we use pganalyze as our Postgres monitoring tool. And, like, we noticed that the data size literally on disk was like
three times larger on the new tables for some reason than the old one. And, you know, we were
going from like Postgres 12 to Postgres 15. So at first we thought maybe it was like a TOAST compression thing, because we know that the TOAST compression algorithm changed between, I think, 12 to 14 or something. It went from like pglz to LZ4 or something. I don't remember exactly, but I know that that changed. So I was like, oh, is it the new TOAST compression
that's causing like the data size
to be literally much larger on disk,
which obviously leads to much worse latency because then every one of your queries are doing much more IO,
which is... Buffers, buffers.
Yeah. So, you know, we tried a few tests with like changing the compression back to the old one.
We weren't fully convinced still, but it turns out actually we were just hitting like bloat.
And this clicked when I
was actually listening to one of your earlier podcasts, I think, on Postgres bloat. And I
realized that our vacuums weren't actually fully completing on the new table because there was just
so many new transactions being written, the vacuum could not keep up, which meant that the bloat just
kept on accumulating and accumulating. And on top of that, a lot of the data in this table was being toasted
because they're just JSONB columns.
So anyway, it was just like two weeks of investigation to be like,
oh, we should just run a VACUUM FULL.
And it turns out that brought the data size right back to what it should be.
So it was a fun little kind of investigation.
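A minimal sketch of the kind of check and fix being described, with a hypothetical table name; note that VACUUM FULL rewrites the table under an ACCESS EXCLUSIVE lock, so pg_repack or pg_squeeze are the usual online alternatives:

```sql
-- How big is the table on disk, and is autovacuum keeping up?
SELECT pg_size_pretty(pg_total_relation_size('blocks')) AS total_size,
       n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'blocks';

-- Rewrite the table and its indexes to reclaim the bloat (blocks all access while it runs).
VACUUM (FULL, VERBOSE) blocks;
```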
It sounds like pganalyze doesn't tell you about the xmin horizon and proper things like why vacuum can be lagging and skipping some dead tuples that cannot be deleted. pganalyze has a very good vacuum dashboard, so I'm very curious: does it speak about the xmin horizon?
It actually does point those things out. We didn't have the new databases in pganalyze yet
because we're like, oh, we're not using production traffic on these,
so we don't need it there.
When we put it there, we realized that, oh, yeah,
like the vacuums are falling behind.
I'm just personally very annoyed that Postgres monitoring systems lack better xmin horizon and long-transaction alerts, which are two separate things, basically. Long transactions can happen and nobody tells you. It's not right if you sometimes have two Postgres monitoring systems and nobody tells you about this. So I'm curious if pganalyze has some alerts.
It does. It sends you an email if, for example, you've had any transaction running for over an hour.
Right, but if you use logical replication, a long transaction is not the only reason for the xmin horizon being held back.
Right.
Okay.
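For readers, a sketch of the kind of catalog queries that surface what is holding the xmin horizon back; these are standard views, not any particular monitoring product:

```sql
-- Long-running transactions still visible to the xmin horizon.
SELECT pid, state, xact_start, backend_xmin,
       age(backend_xmin) AS xmin_age, left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;

-- Other common holders: replication slots and prepared transactions.
SELECT slot_name, xmin, catalog_xmin, active FROM pg_replication_slots;
SELECT gid, prepared FROM pg_prepared_xacts;
```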
For us, pganalyze has been really great, actually.
I love working with that team.
Shout out to Lukas.
He's been so amazing.
And also content.
Again, back to content sharing.
Lukas does a great job and his team does a great job.
Yeah, Lukas is awesome. I talked to him as well a few weeks ago, and everything at pganalyze, their blogs, are really great.
Okay, I think it's interesting you guys had the data copy
be such a big bottleneck. We also had a big pain point there when we were first doing vertical
sharding. And what we ended up
finding out is we kind of went into the source code for Postgres and logical replication and
realized that it's really, really expensive to have indexes on your database table while you're
doing logical replication and much faster to drop the indexes and then recreate them afterwards.
So it went from like weeks to a day for a multi-terabyte data copy when you dropped all of your indexes and foreign keys.
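A rough sketch of the general pattern being described (not Figma's actual tooling), with hypothetical object names:

```sql
-- On the target, before the initial copy: keep the primary key / replica identity,
-- drop secondary indexes and foreign keys.
DROP INDEX IF EXISTS payments_merchant_idx;
ALTER TABLE payments DROP CONSTRAINT IF EXISTS payments_order_fk;

-- Start the subscription with the initial table copy.
CREATE SUBSCRIPTION shard_copy
    CONNECTION 'host=source-db dbname=app user=replicator'
    PUBLICATION shard_pub
    WITH (copy_data = true);

-- Once the copy has caught up, rebuild without blocking writes.
CREATE INDEX CONCURRENTLY payments_merchant_idx ON payments (merchant_id);
```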
Yeah, that's a binary copy option. And also there is, if you need it for upgrades, for example,
or you need majority of data, sometimes it's better to convert physical replica to logical,
which I hope soon will become standard recipe inside Postgres itself. But now it's only possible
if you orchestrate it properly. I'm very curious if you do this trick on RDS, both of you or no.
We rely a lot on logical replication for all of our core failover tooling, so we haven't really explored physical replication. I'm not sure if we really can because of RDS.
On RDS, it's possible if you apply the so-called Instacart recipe, but there are debates about whether it's safe to do. Actually, Lukas blogged about it as well recently.
You're talking about the thing where you, like, create a replication slot, take a backup, and then advance the replication slot?
Yeah, either this or... On RDS, only this, yes. On self-managed, you can play with recovery_target_lsn, right? So I'm curious, is it only Instacart, or...?
We've actually used that before too. So,
actually, before we knew the trick that Sammy just told us about with the indexes,
we could never get logical replication to catch up ever to our tables.
So we actually ended up using the Instacart trick to do our first round of upgrades that I ever did at Notion.
It worked well.
I think one of the problems you run into with that, though, is similarly, when you provision the new database off the backup, you have to do another extra step of actually dropping the indexes again.
Because even just catching up with logical replication from the point that you took the backup for a sufficiently large database still took us like a day or so just to catch up the changes between the replication time and that thing.
So I think, yeah, that trick could work really well.
Probably combined with dropping the indexes would speed it up significantly.
But actually for us, after that, we've always just used regular logical replication and it's been fast enough if we can provision the schemas without any index.
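Very roughly, the slot-advance recipe being referred to looks like the sketch below; the LSN is a placeholder, and as mentioned above there are debates about whether this is safe, so treat it as an illustration rather than a runbook:

```sql
-- 1. On the source: create a logical slot and note the LSN it returns.
SELECT * FROM pg_create_logical_replication_slot('upgrade_slot', 'pgoutput');

-- 2. Restore a physical backup of the source as the new target. Then, on the
--    source, advance the slot to the LSN the target has already replayed
--    (placeholder value here).
SELECT pg_replication_slot_advance('upgrade_slot', '0/5D3A1B8'::pg_lsn);

-- 3. On the target: subscribe reusing the slot, without re-copying data.
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=source-db dbname=app user=replicator'
    PUBLICATION upgrade_pub
    WITH (create_slot = false, slot_name = 'upgrade_slot', copy_data = false);
```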
Actually, dropping indexes leads us to one topic I didn't want to miss: huge WAL volumes generated, and index write amplification. Can anyone tell us anything about these problems, like, have you experienced something with them? Because this is one of the key criticism points from Uber when they posted this infamous article, right? Postgres has index write amplification. We have HOT updates, but still, it's a problem, right?
Actually, Derk, I remember an article from maybe a couple of years ago about fill factor, right? Is it still the tool you use actively?
Well, fill factor is not a tool, right? It's just a parameter you can set for your tables and your indexes. But the tool is to use it to provoke more HOT updates.
Yeah.
If we lose HOT updates like we did last week, it's like we doubled our WAL usage. So that's already... and that was only for a single table. For some reason, with the next release, or the new release, we had a longer-running query, which means pretty long, because it was holding back HOT updates. And that means you just double the WAL volume on your system. But we have many tables, right? And then only this single table is responsible, or becomes responsible, for half the WAL of the entire cluster. For me, that's an amazing number, that you can generate, or skip creating, that amount of WAL when using HOT updates. So yeah, if we have heavily updated tables, we always try to find the right fill factor. I've been trying to get this formula, like, which tables do you need to set fill factor on and how high do you set it? And it's still a mystery to me, because we have some tables which do fine at a 95% fill factor and some do really great at 70%.
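A sketch of the usual knobs and the check being discussed, with a hypothetical table name:

```sql
-- Leave free space in heap pages so updates can stay on the same page (HOT).
ALTER TABLE authorisations SET (fillfactor = 90);
-- Applies to newly written pages; a rewrite (VACUUM FULL, pg_repack) is needed
-- for existing data.

-- Watch the HOT ratio to judge whether the setting is paying off.
SELECT relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 20;
```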
And also you need to protect this table from a new index being created and losing all the HOT updates somehow, right?
If developers decide to add one more index.
Yeah, I can't.
Right.
I have no way to force my developers.
I'm actually curious.
A big pain point we have right now is for really expensive index creations, there's
no good way of throttling it.
So if it eats a lot of your IOPS, there is not really a Postgres way we've found to tune
this so that maybe you want the index creation to happen much more slowly in the background
over a few days for some really large tables.
But it will hold Xmin Horizon again, this infamous problem.
Sure.
But I mean, maybe three days of holding that is better than an hour of the website being
down.
This might be a stupid question, but would reducing maintenance_work_mem... can you do that in a session? Would that help?
Or just call your mind, maybe.
Yeah, I think we have the minimum.
Yeah, we're only using one thread
for the maintenance settings there.
But it has been a couple of times enough
to spike our IOPS to basically the RDS limits.
I always want indexes to be created faster.
Yeah.
But it requires faster disk I/O capabilities, of course.
This is what I love about true scale: you just hit trade-offs. You hit times where one thing's best, and you hit other times where the complete opposite is best.
So yeah, I'm really conscious of time. You've all been amazing, thanks so much for joining. I wanted to give you each an opportunity to say anything else you wanted to add or forgot to say,
or feel free to shout out if you're hiring, that kind of thing.
Let me ask one more technical question.
Does everyone use foreign keys everywhere, or did you decide to drop some of them?
No, I'll actually say something about this.
I think honestly, we use probably Postgres in the least interesting way possible.
For our sharded databases, we don't have foreign keys.
We don't do joins.
We don't have any triggers. We don't have really anything that can make, we don't have any generated columns.
Like we limit so many things that you can do that we're basically using Postgres as
like a blob store at this point, which is why we end up with really giant databases
with a lot of indexes. But the queries are quite simple, actually, to retrieve the data. And I think
that's been like probably one of the primary reasons that I've actually, and like my team
has actually been able to kind of keep up with the growth of our Postgres cluster, because we're able
to do things like resharding without having to, you know, talk to every team at the company and
be like, can you please change this query? Can you please not do this thing? You know, I think a lot of
people get really excited about using all the fun features of relational databases, but actually you
can use it in a very boring way, but still get some of the benefits of relational databases,
such as consistency and things like that. So I think that's like my one trick to high scale with
any database, not just Postgres, MySQL, and really any other
thing that you can think of. It's good to be aware of some of the pitfalls of the more
interesting features. So no foreign keys? Absolutely not.
Any more input?
Yeah, we do allow foreign keys on the sharding key. And similarly, we allow joins on the sharding key. So for us, it's a matter of picking a sharding key so that most operations are happening within that.
And then that allows us to avoid rewriting a lot of application logic.
Smart.
We also have foreign keys. At Posette, which is in two weeks' time, a little bit less, I'll tell you how to find a proper partitioning strategy, because up to now I've been focusing on partitioning a single table in the best way possible. But I kind of forgot to look at the bigger picture. So all my tables are partitioned slightly differently, with different partition boundaries, and then the entire system becomes a bit messy. So now we are working on basically straightening out all the partition boundaries, because then the joins and the foreign keys work much more efficiently, and so does partition pruning.
Clever.
And I kid you not, some queries became 20 times faster by straightening out partition boundaries.
Wow.
Wow.
I'll make my last plug on the hiring side.
My team is hiring really aggressively right now.
We're basically trying to double this year.
And in particular, we just had an awesome senior staff engineer become a manager.
And so we have a big gap in senior staff principal type engineers, folks who have worked at scale before.
So if you're interested in those problems, please reach out to me or come apply.
Notion is also hiring.
But my pitch is definitely not as refined as Sammy's.
But if you want to come work on fun problems,
check out our careers page.
I actually think this might be the first time in 20 years we are not expanding my team this year.
Oh, wow.
But we should plug Derk's talk at Posette,
which you can all watch.
So I'll include links
to all of these things. One final thank you from me, this has been amazing, thank you so much for joining us for episode 100.
Yeah, thank you so much. I wanted just to once again say thank you for knowledge sharing, it's super important. Looking forward to more, definitely to more blog posts and maybe some runbooks published, I don't know, like things you use in your daily work maintaining these large databases, large Postgres databases. So I'm super curious to see more from you, and maybe to collaborate on some things, some knowledge exchanging and so on. Thank you.
Thank you. Thank you very much, both of you.
It's really cool to be here,
especially in such good company.
Yeah, this is a really fun conversation.
Thanks for getting us all together.
Yeah, thanks again.
I really enjoyed this.
I hope to be back in the future as well.
This was the best episode number 100
we could imagine.
Agreed.