Postgres FM - Zero-downtime migrations
Episode Date: June 2, 2023

Nikolay and Michael discuss zero-downtime schema migrations — why they're a challenge, a variety of different cases, and some things you can do to achieve them.

Here are links to a few things we mentioned:
Comparison of JOINS: MongoDB vs. PostgreSQL (blog post by Michael Stonebraker and Álvaro Hernández)
Common DB schema change mistakes (blog post by Nikolay)
lock_timeout and retries (blog post by Nikolay)
lock_timeout
Fast Column Creation with Defaults (blog post by Brandur)
Database Schema Changes Without Downtime (new version of blog post by Braintree)
Zero-downtime Postgres migrations - the hard parts (blog post from GoCardless)
GitLab migration_helpers.rb
GitLab migration style guide
dblab

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello, hello, this is Postgres FM. I'm Nikolay and my co-host is Michael. Hi, Michael.
Hello, Nikolay.
So we will be talking today about one of my favorite topics. I think it's quite often
underestimated when our database and workloads grow. It will be about migrations, right? But
what kind of migrations? Migrations is not a good term. I don't like it, but we all use it because of some, I guess some Ruby guys, right?
Or where did it start from?
Database migrations.
It sounds like we're migrating from one place to another, but actually
it's about schema changes.
Yeah.
I've heard the phrase schema migrations as well.
So we're talking specifically about, and I know this is implicit in what you're
saying, but I don't think everybody has this implicit, but zero downtime migration.
So how do we do a schema migration in a way that doesn't block writes or reads?
Schema evolution is the better term, I think.
Yeah, absolutely.
Well, anybody that's building products, building applications, managing a database, if it has any success at all, it's going to change over time, right?
You're going to want to add things, you're going to want to
change things, you might even want to rename things.
Three deployments per day, as usual.
Yeah, and a lot of those
changes, by default,
the way that they happen take
locks, right? So those
locks on small tables are nice and fast,
but once they start getting bigger,
those locks can be really problematic. So we're talking about changes that can cause these heavy
locks for long periods of time and alternatives to those. Or cannot acquire a lock.
Yeah, yeah, absolutely. Also problematic. This is even more dangerous and it usually remains
unnoticed until some level of workload.
But I have a good recipe and probably
this recipe will work for everyone
and we will just end this episode
much earlier than usual.
So we promised to not
exceed 30 minutes to ourselves and
to some of our listeners and let's
try to do it. I see the
timer here. So the
recipe is simple. Let's have an ID column with perfect ULID or UUID
version 7 or 8, I don't remember. Some ID column, that can be integer 8. Just don't use integer 4,
right? And JSONB, or without the B, it depends, but a JSON column. And that's it.
We don't need migrations.
Well, actually the integer 4 to integer 8 question is a really good point.
And the reason that's problematic is because changing the type of a column is one of these
issues.
It was one of these operations that takes...
I know it very well.
I've earned more than 100k on it. Just on this sole task for multiple
companies, at least four companies.
Yeah. But yeah, so I think we could probably do quite a long time on that issue alone.
But my main point was that it's because that requires a migration. So that is one of the
category of issues here. I guess it would take a long time
to discuss the details of every single one. So I was hoping we'd give some, maybe newbies,
a little bit of an insight into what kinds of things can cause problems.
You don't understand. Let's put everything to JSON and we don't need migrations. That's it.
I understood, but I thought you were being facetious.
That's the whole NoSQL argument.
They had the marketing claim of schema-less,
don't worry about migrations ever.
And there's some compelling stuff in there.
But then the benefit of relational database,
the reason a lot of us love them, is we get validation.
Let's just mention that multiple banks and financial organizations and also e-commerce organizations used this approach just for everything, schemaless or how it's called.
And it doesn't work well for data where you need good quality of your data, right?
For example, billing system.
And this is on one hand.
And on another hand, also, I can encourage listeners to read the article from Michael
Stonebraker.
It was an EDB blog, co-authored with Álvaro Hernández, about how bad it is to avoid good
normalization.
So normalization, denormalization should be in our toolbox, both, by the way. And if we talk about it,
schema changes are inevitable, right? So yeah, I want to just exclude this because someone can ask,
why do we need it at all, the schema changes? But right, let's start from something simple.
You try to add a column. Yeah, perfect example. It's a very simple column
without defaults,
without anything. Like I just want
some small column on the right
side of my table.
And suddenly you have
traffic going down for many
minutes. What happened?
I mean, traffic to this table. It's like a partial
downtime. Other tables are accessible,
but on this table even selects are blocked.
What happened?
What do you think?
Something waiting on a lock, right?
Something gets stuck behind.
Yes.
I think this is something that eventually should be improved in Postgres.
Like retries?
How would you...
Right.
So we have CREATE INDEX CONCURRENTLY. We have a two-stage
procedure to create various constraints like foreign keys or check constraints or unique
constraints, and unique indexes of course concurrently. So we have tools to avoid the long
exclusive locks. Exclusive locks are inevitable. To adjust
schema in system
catalogs, a brief,
at least brief, exclusive
lock is needed. But
there are two big problems,
at least two, very big problems.
First, okay, three.
First is very obvious. We cause a
full table rewrite. For example, if you change
integer 4 to integer 8
with a single line, a single command, ALTER TABLE, ALTER COLUMN, you will cause a full table rewrite.
And if it has 1 billion rows, probably this is when you think, oh, it's already 50% of capacity,
which is 2.1 billion roughly. This is when you already start being nervous and
try to do something. In this case, the full table rewrite will cause a very long lock and nobody will
be able to talk to this table. But this is obvious, right? We do some work while still
keeping, holding this exclusive lock. Of course, even selects are not possible. There is another problem, a slightly lighter problem,
when we don't need to rewrite the whole table, but we need to at least scan it at least once to
check that, for example, constraints are not violated by existing rows. So it's just read,
but still long because billion rows, right? And during this we also hold this exclusive lock.
But these two problems are quite obvious. There is one particular problem which hits everyone,
and we don't have anything inside out of a box or inside Postgres itself.
And people just keep not noticing it until they reach some thousands of TPS. And this problem is nasty. It's like we try to obtain our lock
for one millisecond or less
just to adjust our system catalogs,
but we cannot because, for example,
somebody read something from this table
and didn't close transaction
or still reading from it.
Or AutoVacuum is running
in non-regular mode
in transaction ID wraparound prevention mode.
And in this case, auto-vacuum doesn't yield, so it won't auto-kill itself.
In this case, we just need to wait for this lock to be acquired.
And this is easy to see in pg_stat_activity, the wait event for our session.
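A rough way to see this, as a sketch (the 'Lock' wait_event_type is what to look for):

    SELECT pid, pg_blocking_pids(pid) AS blocked_by,
           wait_event_type, wait_event, state, query
    FROM pg_stat_activity
    WHERE wait_event_type = 'Lock';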
But the problem is, why is it nasty?
It's okay to be blocked if it's only about us, right?
But the problem is, we have a long line of all other transactions
who try to even read from this table,
and they form a long line behind us,
waiting if we will succeed or fail. This is it. This is our landscape of problems with
making schema changes. Yeah, I feel like you dove into kind of like, I know it's one of the simpler
examples, but it's also one of the more complex ones, right? Like it's one of those things that
looks fine in a lot of them. Yeah, it's simple, but...
A lot of the time it's fine, and then sometimes it's a problem, right?
Before we move on from that one, the solution, I think you've written a really good blog post that includes this as one of the examples.
So I'll make sure I link that one up.
But there's a setting where we can cancel things, where we can set a timeout.
Lock timeout or statement timeout.
Yeah. For example, pg_repack does something like that and it uses statement timeout.
Okay, cool.
But it's the old school, old-fashioned way, because it will also affect DML and so on.
There is lock_timeout and, well, statement_timeout. If you do it in your session, if you have a quite complex transaction,
very properly organized,
so all locks are closer to the end,
because all acquired locks will be held
until the very end of transaction, rollback or commit.
So in this case, if you use statement timeout,
it will be applied both to DDL and DML, including selects.
So there is a finer grain, how to say?
Fine-grained?
A special timeout, fine-grained setting,
which will be applied only to lock acquisition,
not to the whole statement duration.
And it's called lock_timeout.
And of course, I prefer using it.
I don't remember what pg_repack decided.
They originally said, oh, we need to think.
When I proposed to use lock timeout
because statement timeout,
you know, it was not working well in some cases with pg_repack.
I don't remember details.
It was long ago.
But lock timeout is what you can use
at least not to block others for long.
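As a minimal sketch of that pattern (the table and column are hypothetical):

    BEGIN;
    SET LOCAL lock_timeout = '500ms';          -- fail fast instead of queueing everyone behind us
    ALTER TABLE orders ADD COLUMN note text;
    COMMIT;
    -- on SQLSTATE 55P03 (lock_not_available), roll back, sleep with some jitter, and retry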
Because if you are being blocked, not being able to acquire
a lock, you block others; at least it can be limited to, okay, half a second. Why this value?
We had our first episode about query duration, latency, what is long, what is short for human perception in the OLTP case, web and mobile apps.
So a second is already very noticeable, but half a second or maybe 100 milliseconds is quite good,
but it depends on the case. Maybe you try, but you fail, and you need to retry. And I have another article about it, right,
specifically for this problem, proposing the solution: retries and a low lock_timeout, maybe
increasing with some backoff, jitter and so on, you know.
Yeah, sometimes you could, like, exponential decay, kind of, right?
But not until hours, right? Not until minutes even. Still quite limited,
and maybe try a bit later and so on. And be prepared for your deployments to fail,
it's fine. If you have autovacuum processing... Well, it also depends. I see some people prefer
observing auto-vacuum running in this mode, understanding that it kicks off early
like at 10%, 20% usually
it depends, of all capacity
it's very safe to just
terminate it and let it restart later.
But if you terminate it,
your session should
already be in the process of
attempting to acquire the lock.
Because if you just terminate it
and go try, I bet autovacuum will be
sooner than you and you will have the problem again. So you need to start the attempt and kill
autovacuum, pg_terminate_backend for autovacuum, and then it opens the road in front of you.
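Roughly, and assuming you have identified the anti-wraparound autovacuum worker in pg_stat_activity, it looks like this (the pid is made up, and the termination should happen only after your DDL attempt is already waiting):

    SELECT pid, query
    FROM pg_stat_activity
    WHERE query LIKE 'autovacuum:%to prevent wraparound%';

    SELECT pg_terminate_backend(12345);  -- autovacuum will redo this work later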
Right, but I wanted to mention, you're right, this is one kind of a problem when
you think it's so simple, everyone hits it, at least silently, not noticing until some
level of workload. It's so simple, but it's also complex under the hood. If you dive inside,
it's like many things, but it's so simple. How can people live without
solution to it? And it still amuses me actually. So I don't understand. We definitely need something,
but I have a solution at the upper level, outside of Postgres, but I think at some point inside
Postgres something should be done. We see some improvements in the area of partitioning, like detach partition is
already not blocking and so on, and many things like that, but this is just "let me add a column".
I would like to specify the number of retries, some logic, right inside this ALTER
TABLE statement. Yeah, it's good, that's an interesting idea. I always think how can you
make these things the default?
Because until you know that those settings exist,
you're still going to hit that the first time.
But yeah, it would be nice to empower people
that do know it's an issue as well.
I've seen this catch out really experienced engineers.
Some people I consider some of the best engineers
I've ever worked with.
And for example, this was, I think,
must have been pre-Postgres 11,
but somebody adding
a column with a default. So that was always going to cause an issue and it did. So that was a real
problem, but it caught them out. This falls into, sorry, I'm interrupting again. It falls into
previous, like you have table rewrite, table full scan, and this unsuccessful lock acquisition problem.
What you mentioned
falls into full table rewrite, because
pre-Postgres 11,
when we add a new column with a default,
it causes a full table rewrite.
It's quite...
This I would expect.
Postgres is not perfect here, I understand this.
Okay. But that,
I'm like, what?
How come?
Okay.
Like, how come? Even a simple example: I want to drop a column.
Say.
There's some, I'm conscious we're not going to have time to discuss every single case of this.
So there is a couple of good articles.
One, the one I just mentioned, it was by Brandur, wasn't it?
On the change in version 11.
And I think the nice thing about that article is it goes into some of,
not all of the complexities he does link to the patch, for example,
but how complicated some of the remediations needed to be
to make adding a column with a default not a rewrite case.
So that changed in version 11.
And it's not trivial, that change. There's
a lot of places that that touches. So I do have sympathy with that these things are difficult
to do and difficult to do in a way that still lets people on simple databases just get things
done. So yeah, I understand the complexity, but equally, it would be nice to not have
people shoot
themselves in the foot quite so often as happens at the moment.
Right.
And actually, this change in Postgres 11 is very connected to Integer 4 primary key problem
because if you think about how you can convert to Integer 8, there are two big approaches.
I call it new column and new table.
Both should be zero downtime. The second one, new table, is more like pg_repack style. You need to
create another table, but first implement triggers that will record all changes in some delta table.
And then you do select, create table as select
or just copy,
like some snapshot of existing table.
And then you, actually,
there are interesting tricks there as well.
And then you need to take care
about indexes, constraints,
foreign keys especially.
It's interesting, very interesting.
And then you apply delta,
maybe in several steps, actually.
It's possible to split it
into multiple transactions,
which I didn't know initially.
I thought, oh, whole delta needs to be applied at once.
No, pg_repack is much smarter.
We can do it in steps.
Just last step should be inside our renaming transaction.
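Very roughly, the skeleton looks something like this (all names are hypothetical; index, constraint, and foreign key handling is omitted):

    CREATE TABLE orders_new (LIKE orders INCLUDING DEFAULTS);
    ALTER TABLE orders_new ALTER COLUMN id TYPE bigint;
    -- 1) add triggers on orders that log every INSERT/UPDATE/DELETE into a delta table
    -- 2) copy a snapshot: INSERT INTO orders_new SELECT * FROM orders;
    -- 3) apply the accumulated delta, possibly in several batches
    -- 4) one short final transaction: apply the last delta, then swap names
    --    (ALTER TABLE orders RENAME TO orders_old; ALTER TABLE orders_new RENAME TO orders;)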
But this is a more complex approach than just new column.
But if you do new column, eventually you need to redefine primary key.
But a primary key, not only in Postgres even, it consists of two things, uniqueness
and not null.
Two constraints.
For uniqueness, we can create an index concurrently with UNIQUE: CREATE UNIQUE INDEX CONCURRENTLY.
It might fail if there are duplicates.
We fix it.
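For example (the index and column names are made up):

    CREATE UNIQUE INDEX CONCURRENTLY orders_new_id_uq ON orders (new_id);
    -- if it fails on duplicates, it leaves an INVALID index behind:
    -- drop it, fix the data, and retry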
Eventually, we can create it without long locks. So it's like so-called online operation.
So create index concurrently is online operation,
and create unique index concurrently.
Create index is not online operation.
Not, how to say, online-safe.
There should be some term for it. Zero downtime is not the
proper term because if we block one
table, it's not downtime. I call it partial
downtime. Only some part of
workload which deals with its table
is experiencing problems, but
everything else is fine, right?
I like online operation as the term.
Yeah. A good friend of mine
who is from SQL world,
a very big professional,
gave me this term.
I like it as well, online operation.
But what about not null?
Before Postgres 11, it's terrible.
Absolutely terrible.
You cannot create it.
Period.
Wow.
So, yeah, if you create it,
you have the second type of our problem.
You need a full scan.
Postgres needs to fully scan the table at creation time,
and during this time,
it holds this very expensive exclusive lock.
And check constraint can be done in two phases.
So first you say not valid.
You create check constraint very fast, not valid.
Again, you need to do retries.
If you do it very properly,
you need retries because
it's still changing schema, right?
But not valid means it will not be applied
for... Some people think not valid means it's not
valid. Like, no. I mean,
it's not validated for all existing data, but it's
already being checked for all new
rows, updated or inserted.
Of course, deleted don't count.
It's like no validation rather than not valid.
Is that fair?
Right, right.
Yeah, it's already working, but we skipped validation for existing rows.
And then, in our transaction, we do ALTER TABLE VALIDATE CONSTRAINT, and this doesn't need the long lock.
Constraint is already there, it's just a flag.
And this is like two phases, two transactions, separate transactions, and this is how we
can mark it, we can create a check constraint:
our new column is not null.
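As a sketch (constraint and column names are made up):

    ALTER TABLE orders
      ADD CONSTRAINT new_id_not_null CHECK (new_id IS NOT NULL) NOT VALID;  -- fast, brief lock, use retries

    ALTER TABLE orders VALIDATE CONSTRAINT new_id_not_null;  -- long scan, but no heavy lock held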
But before Postgres 11, if you create a check constraint, it cannot be done in two phases.
I think since Postgres 12, if there is check constraint,
and then you create not null, it's smart enough.
I think Sergey Kornilov implemented this.
It sees, oh, there is check constraint, it's not null,
so I can rely on it, it already validated all rows.
And it skips it in Postgres 12.
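So since Postgres 12, roughly, continuing the hypothetical example above:

    ALTER TABLE orders ALTER COLUMN new_id SET NOT NULL;  -- no full scan: the validated CHECK is used
    ALTER TABLE orders DROP CONSTRAINT new_id_not_null;   -- the helper constraint can then be dropped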
But using some brainpower,
we solved the problem for Postgres 11 as well.
What can we do?
We say, let's just rely on this new feature,
non-blocking default.
When we create our integer 8 new ID,
we say default minus 1.
Not null.
Well, yeah.
And it's magic. Postgres thinks, oh, this minus one, this default is virtual.
Postgres doesn't rewrite the whole table. It just says, okay, for all existing rows,
we think it's minus one, virtually. Right? But
since it's already filled with something, we can
say not null, right? Then we
backfill in batches
and drop default.
Minus one
goes away. We have real values
and not null is there.
So this trick we used for Postgres 11.
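Putting it together, a sketch of that trick for Postgres 11 (hypothetical names again):

    ALTER TABLE orders
      ADD COLUMN new_id bigint NOT NULL DEFAULT -1;  -- fast in PG 11+: no rewrite, the default is virtual

    -- backfill real values in batches from a script, then:
    ALTER TABLE orders ALTER COLUMN new_id DROP DEFAULT;  -- minus one goes away, NOT NULL stays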
It's funny. It's like very
interesting.
It feels like art. You've used what you have to achieve something.
Like multiple things work together and you achieve something very useful.
But then I even was slightly disappointed in Postgres 12, seeing this, like, okay, now we can build just the check and then not null, in two steps.
Then we create not null, a regular not null.
And this interesting trick is not needed anymore.
Yeah.
I know it's not a good analogy, but I quite like thinking about it as the, you know, if
you ever seen those magicians putting lots on the table, like a dining table and then
they get the tablecloth and
they whip it out, and everything stays as it is. A good migration is a bit like that:
there's always one step at the end that's kind of like the little finesse, the last little thing to
make everything work, but it needs to be really quick and it needs to, like, not change anything.
But it can all go wrong as well, so if they pull the tablecloth and you haven't quite sorted everything
out, everything falls, everything smashes basically.
So yeah, that's, there's one other blog post that discusses some of that in more
detail, which is the Braintree one.
I think they've updated on the PayPal blog.
So I'll link that up that has some of these kind of worked examples in it.
Another one I did want to make sure we covered, or at least I think that's
on the slightly more complex side is one that one of my colleagues covered from
GoCardless that caught them and I think you cover really well in your post.
But it's not obvious, which is the foreign key update.
So like the, the fact that it takes locks on both tables, I think it's
really that catches people out.
And different locks, different locks.
And also if partitioning is involved, it's
very interesting. Let me
also advertise a couple of things. First,
GitLab has very good
migration helpers.
If you use
Ruby, and if you don't use Ruby,
it's still very insightful.
It's their source code.
And they have very, very good documentation about all these things.
So it's very well tested, both on their production, GitLab.com, and on all GitLab instances.
There are so many of them, right?
And you need to deploy without seeing what is happening. So
fully automated. And sometimes tables are huge, even if it's not GitLab.com. So it's like, it's
very challenging task and they solved it. So their experience, I think, I find their experiences
very like one of the best in this area for Postgres. And also, of course, our tool dbLab.
So if you
deal with foreign keys on both sides,
sometimes we need
different types of locks.
Fortunately, we can create foreign
keys in two steps, but if partitioning
is involved, it's tricky.
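Without partitioning, the two-step foreign key version is roughly (hypothetical tables):

    ALTER TABLE orders
      ADD CONSTRAINT orders_customer_fk
      FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;

    ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;  -- the long validation no longer needs the heavy locks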
How foreign keys are propagated from parent
to partitions,
from main table to partitions.
I never keep it in my mind.
It's too much.
I trust two things.
First, documentation source code,
and second, our tool.
Our tool detects long-lasting exclusive locks.
You just need to apply it either for existing tables
taken from production without PIR or something,
or just raw production.
It's even better if you can afford it
in terms of how data,
how infrastructure is organized.
Or you need to generate some rows, right,
for testing.
And by the way,
today I just used ChatGPT to fill tables.
I had some rows on my table
and I needed to have more.
Okay, I have one megabyte. I want one gigabyte or ten, hundred gigabytes.
And it helps me, it's good at this. If you show examples
and it helps you increase the size.
According to some patterns we already have.
So if you just test with big data,
our tool, Database Lab Engine, dbLab, can catch
long-running exclusive locks. And in this case, you understand your solution is not good.
And you go fix it.
Yeah, I think this is great, right, and the documentation. And yeah,
this is great for catching two of those classes, right? Like, including the, let's say it's a large table case. But the third case is still difficult, right? Because,
like, as in the GoCardless blog, you need to have some long-running transaction or vacuum, as you
said, something holding a lock. It's still tricky, it still catches people. So I'm not sure if maybe the GitLab one does one for that as well, but I think that's actually super tricky.
I guess, well, the migration helpers of GitLab have implemented this approach, a low lock_timeout and retries. They have it,
they have helpers for this thing. But you're right, our tool right now doesn't do it, like we don't
have background transactions
which do this.
It's technically possible.
Yes, like we can
think about it, but so far
this is something you need to
do. And so, yeah,
we cover only the first two classes.
You are right. But, yeah,
for this, I think everyone just needs to have it.
Always, you know,
be prepared for long running
like autovacuum
or some transactions
holding some even share locks
blocking our DDL.
Yeah.
We actually didn't cover an interesting topic,
how DDL is related to DML and
like updates or deletes in batches or like massive insert as well, also interesting.
There are many things, right?
And we already reached our limit in time.
So let's wrap up.
I think we scratched the surface, but we did well because we brought many good links
to the table, right?
Articles and so on.
Materials.
Definitely.
And I think we've given some good warning signs, you know, if you're thinking about adding a column, if you're thinking about adding an index,
anything to do with partitioning, create, detach, we've given some good
warning, foreign keys especially, anything around renaming.
These things are scary and you should be careful and try and remember
this and read up about it.
Oh, renaming has issues.
Renaming has issues also related to application because you have many
application nodes, it's also interesting sometimes you cannot simply rename.
Yeah, well, it's interesting.
And the topic is very broad and yeah, warnings are right.
And I bet most people have issues in their database migrations.
They have like, they just don't, if you think everything is all right, you,
you just don't have proper cases.
Don't have enough data.
You haven't met problems yet because there might be problems lasting a
second and nobody noticed, right?
But later it will be 10 seconds, then one minute and it's already bad.
So I thought that was helpful.
Please read articles.
There is a lot of wisdom there.
Yeah.
Brilliant.
Nice one, Nikolay, thanks so much and catch you next week.
Good.
Yeah.
Thank you all listeners.
Let us know if you like shorter episodes.
Nice one, take care.
Bye.
Okay.
Bye.