Postgres FM - Zero-downtime migrations

Episode Date: June 2, 2023

Nikolay and Michael discuss zero-downtime schema migrations — why they're a challenge, a variety of different cases, and some things you can do to achieve them.

Here are links to a few things we mentioned:

- Comparison of JOINS: MongoDB vs. PostgreSQL (blog post by Michael Stonebraker and Álvaro Hernández)
- Common DB schema change mistakes (blog post by Nikolay)
- lock_timeout and retries (blog post by Nikolay)
- lock_timeout
- Fast Column Creation with Defaults (blog post by Brandur)
- Database Schema Changes Without Downtime (new version of blog post by Braintree)
- Zero-downtime Postgres migrations - the hard parts (blog post from GoCardless)
- GitLab migration_helpers.rb
- GitLab migration style guide
- dblab

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc! If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:
- Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello, hello, this is Postgres FM. I'm Nikolay and my co-host is Michael. Hi, Michael. Hello, Nikolay. So we will be talking today about one of my favorite topics. I think it's quite often underestimated when our database and workloads grow. It will be about migrations, right? But what kind of migrations? Migrations is not a good term. I don't like it, but we all use it because of some, I guess, some Ruby guys, right? Or where did it start from? Database migrations. It sounds like we're migrating from one place to another, but actually
Starting point is 00:00:34 it's about schema changes. Yeah. I've heard the phrase schema migrations as well. So we're talking specifically about, and I know this is implicit in what you're saying, but I don't think everybody has this implicit, but zero-downtime migrations. So how do we do a schema migration in a way that doesn't block writes or reads? Schema evolution is the better term, I think. Yeah, absolutely.
Starting point is 00:00:58 Well, anybody that's building products, building applications, managing a database, if it has any success at all, it's going to change over time, right? You're going to want to add things, you're going to want to change things, you might even want to rename things. Three deployments per day, as usual. Yeah, and a lot of those changes, by default, the way that they happen take locks, right? So those
Starting point is 00:01:22 locks on small tables are nice and fast, but once they start getting bigger, those locks can be really problematic. So we're talking about changes that can cause these heavy locks for long periods of time and alternatives to those. Or cannot acquire a lock. Yeah, yeah, absolutely. Also problematic. This is even more dangerous and it usually remains unnoticed until some level of workload. But I have a good recipe and probably this recipe will work for everyone
Starting point is 00:01:49 and we will just end this episode much earlier than usual. So we promised to not exceed 30 minutes, to ourselves and to some of our listeners, and let's try to do it. I see the timer here. So the recipe is simple. Let's have an ID column with a perfect ULID or UUID
Starting point is 00:02:11 version 7 or 8, I don't remember. Some ID column, it can be integer 8. Just don't use integer 4, right? And a JSON column, B or without B, it depends, but a JSON column. And that's it. We don't need migrations. Well, actually the integer 4 versus integer 8 question is a really good point. And the reason that's problematic is because changing the type of a column is one of these issues. It was one of these operations that takes... I know it very well.
Starting point is 00:02:41 I've earned more than 100k on it. Just on this sole task for multiple companies, at least four companies. Yeah. But yeah, so I think we could probably do quite a long time on that issue alone. But my main point was that it's because that requires a migration. So that is one of the category of issues here. I guess it would take a long time to discuss the details of every single one. So I was hoping we'd give some, maybe newbies, a little bit of an insight into what kinds of things can cause problems. You don't understand. Let's put everything to JSON and we don't need migrations. That's it.
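(For illustration, the half-joking recipe would look roughly like this; the table name is made up:)

    CREATE TABLE events (
        id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- int8, never int4
        payload jsonb NOT NULL  -- everything else lives here, "schemaless"
    );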
Starting point is 00:03:22 I understood, but I thought you were being facetious. That's the whole NoSQL argument. They had the marketing claim of schemaless, don't worry about migrations ever. And there's some compelling stuff in there. But then the benefit of relational databases, the reason a lot of us love them, is we get validation. Let's just mention that multiple banks and financial organizations and also e-commerce organizations used this approach just for everything, schemaless or however it's called.
Starting point is 00:03:51 And it doesn't work well for data where you need good quality of your data, right? For example, a billing system. And this is on one hand. On the other hand, I can encourage listeners to read the article from Michael Stonebraker. It was an EDB blog post, co-authored with Álvaro Hernández, about how bad it is to avoid good normalization. So normalization and denormalization should both be in our toolbox, by the way. And if we talk about it,
Starting point is 00:04:27 schema changes are inevitable, right? So yeah, I want to just exclude this because someone can ask, why do we need it at all, the schema changes? But right, let's start from something simple. You try to add a column. Yeah, perfect example. It's a very simple column without defaults, without anything. Like I just want some small column on the right side of my table. And suddenly you have
Starting point is 00:04:56 traffic went down for many minutes. What happened? I mean, traffic to this table. It's like a partial downtime. Other tables are accessible, but on this table even selects are blocked. What happened? What do you think? Something waiting on a lock, right?
Starting point is 00:05:10 Something gets stuck behind. Yes. I think this is something that eventually should be improved in Postgres. Like retries? How would you... Right. So we have CREATE INDEX CONCURRENTLY. We have a two-stage procedure to create various constraints like foreign keys or check constraints or unique
Starting point is 00:05:35 constraints, and unique indexes, of course, concurrently. So we have tools to avoid the long exclusive locks. Exclusive locks are inevitable. To adjust the schema in system catalogs, a brief, at least brief, exclusive lock is needed. But there are two big problems, at least two, very big problems.
Starting point is 00:05:57 First, okay, three. First is very obvious. We cause a full table rewrite. For example, if you change integer 4 to integer 8 with a single line, a single command, ALTER TABLE ... ALTER COLUMN, you will cause a full table rewrite. And if it has 1 billion rows, probably this is when you think, oh, it's already 50% of capacity, which is 2.1 billion roughly. This is when you already start being nervous and try to do something. In this case, the full table rewrite will cause a very long lock and nobody will
Starting point is 00:06:31 be able to talk to this table. But this is obvious, right? We do some work while still holding this exclusive lock. Of course, even selects are not possible. There is another problem, a slightly lighter problem, when we don't need to rewrite the whole table, but we need to at least scan it once to check that, for example, constraints are not violated by existing rows. So it's just a read, but still long because of a billion rows, right? And during this we also hold this exclusive lock. But these two problems are quite obvious. There is one particular problem which hits everyone, and we don't have anything out of the box inside Postgres itself. And people just keep not noticing it until they reach some thousands of TPS. And this problem is nasty. It's like we try to obtain our lock
Starting point is 00:07:26 for one millisecond or less just to adjust our system catalogs, but we cannot because, for example, somebody read something from this table and didn't close the transaction, or is still reading from it. Or autovacuum is running in non-regular mode,
Starting point is 00:07:43 in transaction ID wraparound prevention mode. And in this case, autovacuum doesn't yield, so it won't kill itself. In this case, we need to just wait for this lock to be acquired. And this is easy to see in pg_stat_activity, in the wait event for our session. But the problem is, why is it nasty? It's okay to be blocked if it's only about us, right? But the problem is, we have a long line of all other transactions that try to even read from this table,
Starting point is 00:08:18 and they form a long line behind us, waiting to see if we will succeed or fail. This is it. This is our landscape of problems with making schema changes. Yeah, I feel like you dove into, I know it's one of the simpler examples, but it's also one of the more complex ones, right? Like, it's one of those things that looks fine in a lot of cases. Yeah, it's simple, but... A lot of the time it's fine, and then sometimes it's a problem, right? Before we move on from that one, the solution, I think you've written a really good blog post that includes this as one of the examples. So I'll make sure I link that one up.
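(A rough sketch of how to spot that situation, using the standard pg_stat_activity view; nothing here is specific to this episode's examples:)

    -- Who is currently waiting on a lock, and what are they running?
    SELECT pid, wait_event_type, wait_event, state, query
    FROM pg_stat_activity
    WHERE wait_event_type = 'Lock';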
Starting point is 00:08:57 But there's a setting where we can cancel things, where we can set a timeout. Lock timeout or statement timeout. Yeah. For example, pg_repack does something like that, and it uses statement_timeout. Okay, cool. But it's the old school, old-fashioned way, because it will also affect DML and so on. There is lock_timeout, and, well, statement_timeout: if you do it in your session, if you have a quite complex transaction, very properly organized, so all lock acquisitions are closer to the end,
Starting point is 00:09:30 because all acquired locks will be held until the very end of the transaction, rollback or commit. So in this case, if you use statement_timeout, it will be applied both to DDL and DML, including selects. So there is a finer-grained, how to say? Fine-grained? A special, fine-grained setting, which will be applied only to lock acquisition,
Starting point is 00:09:56 not to the whole statement duration. And it's called lock_timeout. And of course, I prefer using it. I don't remember what pg_repack decided. They originally said, oh, we need to think, when I proposed to use lock_timeout, because statement_timeout, you know, was not working well in some cases with pg_repack. I don't remember the details. It was long ago.
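(A minimal sketch of the finer-grained setting in action; the table and column names and the value are arbitrary:)

    -- Fail fast instead of queueing everyone up behind our DDL:
    SET lock_timeout = '500ms';
    ALTER TABLE my_table ADD COLUMN new_col int8;
    -- If the lock isn't acquired within 500 ms, the ALTER fails with
    -- "canceling statement due to lock timeout" and can be retried later.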
Starting point is 00:10:15 But lock_timeout is what you can use at least not to block others for long. Because if you are being blocked, not being able to acquire a lock, you block others. At least it can be, like, okay, half a second. Why this value? We had our first episode about query duration, latency, what is long, what is short for human perception in the OLTP case, web and mobile apps. So a second is already very noticeable, but half a second or maybe 100 milliseconds is quite good,
Starting point is 00:10:56 but it depends on the case. Maybe you try, but you fail and you need to retry. And I have another article about it, right, specifically for this problem, proposing the solution: retries and a low lock_timeout, maybe increasing, with some backoff, jitter and so on, like, you know... Yeah, sometimes you could do, like, exponential backoff, kind of, right? But not until hours, right? Not until minutes even. Still quite limited, and maybe try a bit later and so on. And be prepared for your deployments to fail, it's fine. If you have autovacuum processing... Well, it also depends. I see some people prefer observing autovacuum running in this mode, understanding that it kicks off early, at 10%, 20% usually,
Starting point is 00:11:48 it depends, of all capacity. It's very safe to just terminate it and let it restart later. But if you terminate it, your session should already be in the process of attempting to acquire the lock. Because if you just terminate it
Starting point is 00:12:04 and go try, I bet autovacuum will be sooner than you and you will have the problem again. So you need to start the attempt and kill autovacuum, pg_terminate_backend for the autovacuum worker, and then it opens the road in front of you. Right, but I wanted to mention, you're right, this is one kind of problem where you think it's so simple; everyone hits it, at least silently, not noticing until some level of workload. It's so simple, but it's also complex under the hood. If you dive inside, there are many things, but it's so simple. How can people live without a solution to it? It still amuses me actually. So I don't understand. We definitely need something,
Starting point is 00:12:54 but I have a solution at the upper level, outside of Postgres, but I think at some point something should be done inside Postgres. We see some improvements in the area of partitioning, like detaching a partition is already not blocking, and so on, many things like that. But this is just let me add a column. I would like to specify the number of retries, some logic, right inside this ALTER TABLE statement. Yeah, it's good, that's an interesting idea. I always think, how can you make these things the default? Because until you know that those settings exist, you're still going to hit that the first time. But yeah, it would be nice to empower people that do know it's an issue as well.
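(One possible shape of the retries-plus-low-lock_timeout approach, driven by the migration tool rather than by Postgres itself; a sketch, with made-up names and values:)

    -- Pseudo-loop, run by the deployment tool, e.g. up to 10 attempts:
    SET lock_timeout = '100ms';
    ALTER TABLE my_table ADD COLUMN new_col int8;
    -- On "canceling statement due to lock timeout": sleep with backoff
    -- plus jitter (100ms, 200ms, 400ms... capped, never hours), retry,
    -- and let the whole deployment fail after the final attempt.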
Starting point is 00:13:31 I've seen this catch out really experienced engineers, some people I consider some of the best engineers I've ever worked with. And for example, this was, I think, must have been pre-Postgres 11, but somebody adding
Starting point is 00:13:45 a column with a default. So that was always going to cause an issue, and it did. So that was a real problem, but it caught them out. This falls into, sorry, I'm interrupting again. It falls into the previous ones: you have table rewrite, table full scan, and this unsuccessful lock acquisition problem. What you mentioned falls into full table rewrite, because pre-Postgres 11, when we add a new column with a default, it causes a full table rewrite.
Starting point is 00:14:16 It's quite... This I would expect. Postgres is not perfect here, I understand this. Okay. But that, I'm like, what? How come? Even a simple example: I want to drop a column, say.
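(To make the version difference concrete, a sketch with a hypothetical table; the statement itself is the one being discussed:)

    ALTER TABLE my_table ADD COLUMN flag boolean DEFAULT false;
    -- Postgres 10 and older: full table rewrite under an exclusive lock.
    -- Postgres 11+: metadata-only; the default is stored once and applied
    -- virtually to existing rows, so the statement is nearly instant.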
Starting point is 00:14:31 There's some... I'm conscious we're not going to have time to discuss every single case of this. So there are a couple of good articles. One, the one I just mentioned, it was by Brandur, wasn't it? On the change in version 11. And I think the nice thing about that article is it goes into some, not all, of the complexities; he does link to the patch, for example, but how complicated some of the remediations needed to be
Starting point is 00:14:56 to make adding a column with a default not a rewrite case. So that changed in version 11. And it's not trivial, that change. There are a lot of places that it touches. So I do have sympathy that these things are difficult to do, and difficult to do in a way that still lets people on simple databases just get things done. So yeah, I understand the complexity, but equally, it would be nice to not have people shoot themselves in the foot quite so often as happens at the moment.
Starting point is 00:15:29 Right. And actually, this change in Postgres 11 is very connected to the integer 4 primary key problem, because if you think about how you can convert to integer 8, there are two big approaches. I call them new column and new table. Both should be zero downtime. The second one, new table, is more like pg_repack style. You need to create another table, but first implement triggers that will record all changes in some delta table. And then you do CREATE TABLE AS SELECT,
Starting point is 00:16:06 or just copy, like some snapshot of the existing table. And then, actually, there are interesting tricks there as well. And then you need to take care of indexes, constraints, foreign keys especially. It's interesting, very interesting.
Starting point is 00:16:20 And then you apply the delta, maybe in several steps, actually. It's possible to split it into multiple transactions, which I didn't know initially. I thought, oh, the whole delta needs to be applied at once. No, pg_repack is much smarter. We can do it in steps.
Starting point is 00:16:35 Just the last step should be inside our renaming transaction. But this is a more complex approach than just new column. If you do new column, eventually you need to redefine the primary key. A primary key, not only in Postgres even, consists of two things: uniqueness and not null. Two constraints. For uniqueness we can create the index concurrently, CREATE UNIQUE INDEX CONCURRENTLY. It might fail if there are duplicates.
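(A sketch of the uniqueness half; names are hypothetical:)

    CREATE UNIQUE INDEX CONCURRENTLY my_table_new_id_key ON my_table (new_id);
    -- On duplicates this fails and leaves an INVALID index behind:
    -- drop it, deduplicate, and retry. The primary key can later be
    -- defined on top of it:
    --   ALTER TABLE my_table ADD CONSTRAINT my_table_pkey
    --     PRIMARY KEY USING INDEX my_table_new_id_key;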
Starting point is 00:17:06 We fix it. Eventually, we can create it without long locks. So it's a so-called online operation. CREATE INDEX CONCURRENTLY is an online operation, and so is CREATE UNIQUE INDEX CONCURRENTLY. Plain CREATE INDEX is not an online operation. Not, how to say, online-safe. There should be some term for it. Zero downtime is not a
Starting point is 00:17:27 proper term, because if we block one table, it's not downtime. I call it partial downtime. Only some part of the workload, the part which deals with this table, is experiencing problems, but everything else is fine, right? I like online operation as the term. Yeah. A good friend of mine
Starting point is 00:17:43 who is from the SQL world, a very big professional, gave me this term. I like it as well, online operation. But what about NOT NULL? Before Postgres 11, it's terrible. Absolutely terrible. You cannot create it.
Starting point is 00:18:01 Period. Wow. So, yeah, if you create it, you have the second type of our problems. You need a full scan. Postgres needs to fully scan the table at creation time, and during this time, it holds this very expensive exclusive lock.
Starting point is 00:18:17 But a check constraint can be done in two phases. So first you say NOT VALID. You create the check constraint very fast, NOT VALID. Again, you need to do retries. If you do it very properly, you need retries because it's still changing the schema, right? But NOT VALID means it will not be applied
Starting point is 00:18:36 for... Some people think NOT VALID means it's not valid. Like, no. I mean, it's not validated for all data, but it's already being checked for all new rows, updated or inserted. Of course, deleted rows don't count. It's like no validation rather than not valid. Is that fair?
Starting point is 00:18:51 Right, right. Yeah, it's already working, but we skipped validation for existing rows. And then, in another transaction, ALTER TABLE ... VALIDATE CONSTRAINT, and this doesn't need the long lock. The constraint is already there, it's just a flag. And this is like two phases, two separate transactions, and this is how we can create a check constraint saying our new column is not null. But before Postgres 12, even this check constraint doesn't help: the real NOT NULL cannot be done in two phases.
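(The two-phase pattern in SQL; constraint and table names are made up:)

    -- Phase 1: instant; enforced for new and updated rows only:
    ALTER TABLE my_table
        ADD CONSTRAINT new_id_not_null CHECK (new_id IS NOT NULL) NOT VALID;
    -- Phase 2: the long scan, but under a much lighter lock,
    -- so reads and writes keep flowing:
    ALTER TABLE my_table VALIDATE CONSTRAINT new_id_not_null;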
Starting point is 00:19:25 I think since Postgres 12, if there is such a check constraint and then you create NOT NULL, it's smart enough. I think Sergey Kornilov implemented this. It sees, oh, there is a check constraint checking not null, so I can rely on it, it already validated all rows. And it skips the scan in Postgres 12. But using some brainpower, we solved the problem for Postgres 11 as well.
Starting point is 00:19:52 What can we do? We just rely on this new feature, the non-blocking default. When we create our new integer 8 ID column, we say DEFAULT -1, NOT NULL. Well, yeah. And it's magic. Postgres thinks, oh, this minus one, this default is virtual.
Starting point is 00:20:16 Postgres doesn't rewrite the whole table. It just says, okay, for all existing rows, we think it's minus one, virtually. Right? And since it's already filled with something, we can say NOT NULL, right? Then we backfill in batches and drop the default. Minus one goes away. We have real values,
Starting point is 00:20:38 and NOT NULL is there. So this trick we used for Postgres 11. It's funny, it's very interesting. It feels like art. You've used what you have to achieve something. Multiple things work together and you achieve something very useful. But then I was even slightly disappointed in Postgres 12, seeing, like, okay, now we can just build the check constraint and then NOT NULL in two steps. Then we create the regular NOT NULL.
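(The whole trick, roughly; every name and the batch predicate are made up:)

    -- 1. Instant in Postgres 11+ thanks to the non-blocking default,
    --    and NOT NULL is accepted because every row virtually has -1:
    ALTER TABLE my_table ADD COLUMN new_id int8 NOT NULL DEFAULT -1;
    -- 2. Backfill real values in batches, driven from outside:
    UPDATE my_table SET new_id = id
    WHERE id BETWEEN 1 AND 10000 AND new_id = -1;  -- next batch, and so on
    -- 3. Drop the default; the virtual -1 goes away, NOT NULL stays:
    ALTER TABLE my_table ALTER COLUMN new_id DROP DEFAULT;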
Starting point is 00:21:11 And this interesting trick is not needed anymore. Yeah. I know it's not a good analogy, but I quite like thinking about it as, you know, if you've ever seen those magicians putting lots of things on a table, like a dining table, and then they take the tablecloth and whip it out, and everything stays as it is. A good migration is a bit like that. There's always one step at the end that's kind of the little finesse, the last little thing to make everything work, but it needs to be really quick and it needs to, like, not change anything.
Starting point is 00:21:39 But it can all go wrong as well. If they pull the tablecloth and you haven't quite sorted everything out, everything falls, everything smashes, basically. So yeah, there's one other blog post that discusses some of that in more detail, which is the Braintree one. I think they've updated it on the PayPal blog, so I'll link that up; it has some of these kinds of worked examples in it. Another one I did want to make sure we covered, or at least I think it's on the slightly more complex side, is one that one of my colleagues covered from
Starting point is 00:22:07 GoCardless, that caught them out, and I think you cover it really well in your post. But it's not obvious, which is the foreign key update. So, like, the fact that it takes locks on both tables, I think it's really that which catches people out. And different locks, different locks. And also if partitioning is involved, it's very interesting. Let me also advertise a couple of things. First,
Starting point is 00:22:33 GitLab has very good migration helpers. If you use Ruby... and if you don't use Ruby, it's still very insightful. It's in their source code. And they have very, very good documentation about all these things. So it's very well tested, both on their production GitLab.com and all GitLab instances.
Starting point is 00:23:00 There are so many of them, right? And you need to deploy without seeing what is happening. So, fully automated. And sometimes tables are huge, even if it's not GitLab.com. So it's a very challenging task and they solved it. I find their experience one of the best in this area for Postgres. And also, of course, our tool dbLab. So if you deal with foreign keys, on both sides sometimes we need
Starting point is 00:23:30 different types of locks. Fortunately, we can create foreign keys in two steps, but if partitioning is involved, it's tricky. How foreign keys are propagated from parent to partitions, from the main table to partitions, I never keep it in my mind. It's too much.
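(For the non-partitioned case, the two-step foreign key creation looks like this; names are hypothetical, and note that locks are taken on both tables:)

    -- Step 1: quick; existing rows are not checked yet:
    ALTER TABLE orders
        ADD CONSTRAINT orders_user_id_fkey
        FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;
    -- Step 2: the long scan, under lighter locks on both tables:
    ALTER TABLE orders VALIDATE CONSTRAINT orders_user_id_fkey;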
Starting point is 00:23:47 I trust two things. First, documentation and source code, and second, our tool. Our tool detects long-lasting exclusive locks. You just need to apply it either to existing tables taken from production, without PII or something, or just raw production.
Starting point is 00:24:03 It's even better if you can afford it, in terms of how data and infrastructure are organized. Or you need to generate some rows, right, for testing. And by the way, today I just used ChatGPT to fill tables. I had some rows in my table
Starting point is 00:24:21 and I needed to have more. Okay, I have one megabyte, I want one gigabyte, or ten, a hundred gigabytes. Help me. It's good at this. You show examples and it helps you increase the size, according to some patterns we already have. So if you just test with big data, our tool, Database Lab Engine, dbLab, can catch long-running exclusive locks. And in this case, you understand your solution is not good.
Starting point is 00:24:55 And you go fix it. Yeah, I think this is the right approach, right? And documentation. And yeah, this is great for catching two of those classes, right? Like, let's say it's a large table, and... oh, but the third case is still difficult, right? Because, as in the GoCardless blog, you need to have some long-running transaction or vacuum, as you said, something holding a lock. That's still tricky, it still catches people. So I'm not sure, maybe the GitLab one does one for that as well, but I think that's actually super tricky. I guess... the migration helpers of GitLab have implemented this approach, a low lock_timeout and retries. They have it, they have helpers for this thing. But you're right, our tool right now doesn't do it, like, we don't have background transactions
Starting point is 00:25:48 which do this. It's technically possible. Yes, we can think about it, but so far this is something you need to do yourself. And so, yeah, we cover only the first two classes. You're right. But, yeah,
Starting point is 00:26:04 for this, I think everyone just needs to have it always, you know. Be prepared for long-running things, like autovacuum or some transactions holding even share locks, blocking our DDL. Yeah.
Starting point is 00:26:20 We actually didn't cover an interesting topic, how DDL is related to DML, like updates or deletes in batches, or massive inserts as well, also interesting. There are many things, right? And we already reached our time limit. So let's wrap up. I think we only scratched the surface, but we did well because we brought many good links to the table, right?
Starting point is 00:26:46 Articles and so on. Materials. Definitely. And I think we've given some good warning signs, you know: if you're thinking about adding a column, if you're thinking about adding an index, anything to do with partitioning, create, detach, foreign keys especially, anything around renaming. These things are scary and you should be careful, and try and remember this and read up about it.
Starting point is 00:27:07 Oh, renaming has issues. Renaming also has issues related to the application, because you have many application nodes; it's also interesting, sometimes you cannot simply rename. Yeah, well, it's interesting. And the topic is very broad, and yeah, the warnings are right. And I bet most people have issues in their database migrations. If you think everything is all right, you just don't have proper cases,
Starting point is 00:27:35 don't have enough data. You haven't met problems yet, because there might be problems lasting a second and nobody noticed, right? But later it will be 10 seconds, then one minute, and it's already bad. So I hope that was helpful. Please read the articles. There is a lot of wisdom there. Yeah.
Starting point is 00:27:56 Brilliant. Nice one, Nikolay, thanks so much and catch you next week. Good. Yeah. Thank you, all listeners. Let us know if you like shorter episodes. Take care. Bye.
Starting point is 00:28:06 Okay. Bye.
