Postgres FM - NOT VALID constraints

Episode Date: January 17, 2025

Nikolay and Michael discuss adding constraints in an online fashion, using NOT VALID and then VALIDATE CONSTRAINT. Here are some links to things they mentioned:

ADD table_constraint NOT VALID https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-DESC-ADD-TABLE-CONSTRAINT
Our episode on zero-downtime migrations https://postgres.fm/episodes/zero-downtime-migrations
VALIDATE CONSTRAINT https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-DESC-VALIDATE-CONSTRAINT
ALTER TABLE notes https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-NOTES
pgroll https://github.com/xataio/pgroll

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello, hello, this is PostgresFM. My name is Nikolay, Postgres.AI, and as usual, my co-host is Michael, pgMustard. Hi, Michael. Hello, Nikolay. So, there is a topic I wanted to discuss, and it's widely used by those projects which are grown-up projects. Usually in tiny projects and new projects, this topic is skipped. And this topic is related to my favorite area in Postgres administration: zero-downtime changes. First of all, schema changes, right? And we all know that it's a mistake to create an index without the word CONCURRENTLY, because it's not an online operation.
Starting point is 00:00:44 There is such a term, online operations. A non-online operation means it's blocking other sessions which deal with the same table, and it's bad. While CREATE INDEX CONCURRENTLY is good: it takes roughly two times longer, because it scans the table twice, but if it succeeds, we have an index and nobody is blocked, which is great. And we discussed many times that there are many changes which are tricky. For example, if you just add a column, you can block others if you are blocked by, for example, autovacuum running in transaction ID wraparound prevention mode. And in many cases, there are tricks how to, instead of a single step, do multiple steps, but achieve the same result in truly online fashion. And let's talk about some details here. First of all, how we introduce new constraints on existing columns.
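A minimal sketch of that difference, assuming a hypothetical users table with an email column:

    -- Blocking: holds a lock that blocks writes to the table until it finishes
    CREATE INDEX users_email_idx ON users (email);

    -- Online alternative: roughly twice as slow (two table scans),
    -- but doesn't block other sessions writing to the table
    CREATE INDEX CONCURRENTLY users_email_idx ON users (email);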
Starting point is 00:01:40 First of all, check constraints and foreign keys. Yeah. Should we talk about if we didn't have this feature? So if we didn't have the ability to do it in multiple steps, if we couldn't set it NOT VALID first, what would we do? Or, like, would we live without the constraint? You know, I think it's worth discussing what kind of lock is taken, and why we need this feature, or why this feature was implemented in the first place. Yeah, let's talk about this. If we imagine that we have already a column
Starting point is 00:02:09 and we want to change something about it. For example, we want to change default. It's easy because default is only about the future. Postgres doesn't need to check the existing rows, doesn't need to change anything in existing rows. It's only some metadata change. So alter table, alter columns are default super fast, unless we acquired the lock. Right.
Starting point is 00:02:36 And this topic is important almost to like to any alter. It's important to lock acquisition. In some cases you need need exclusive log acquisition. In some cases, also exclusive. But I remember differences. For example, if you alter table, adding a table to publication, to enable publication with logical application, there it's not that bad.
Starting point is 00:03:00 But if we alter table, alter column, set default, it's quite bad. It's fast, but you need a lock. And if you cannot acquire a lock, you are blocking others. But this is a simple example. It's super fast, nothing to worry about in terms of this operation takes long. Lock acquisition might take long, right? And we need to take care of it. We need to have a low lock timeout and retry logic, be ready to do it.
Starting point is 00:03:28 Again, I'm discussing here, my comment in the beginning was specifically, we talk about projects which are grown already and they have significant loads, say more than 1,000 TPS. Other projects also experience it, but they just don't notice because traffic is too low. Right. Yeah, and we did a whole episode on the general case here. I think we did one on zero downtime migrations, which I think we went into detail on the general case, didn't we? Yeah, I agree. If you are interested in this topic, you listen to this episode.
Starting point is 00:04:01 Please listen to that episode as well, because today we are going to go deeper in a specific narrow area, which listen to that episode as well. Because we today we are going to go deeper in specific narrow area, which is not that narrow, we in any like, if you take any project, take three years of its lifetime, definitely these things happen. So adjustment of default is relatively easy, we need just to remember about this log acquisition issue. But if we go further, for example, we created a table. It has, oh, by the way, another comment. Folks who just prefer JSONB or very flexible schema would say, it doesn't matter we keep everything in text no additional constraints let's application take care of everything and i i understand this position but i fully disagree with it because i think database is the only like database constraints is very powerful mechanism to
Starting point is 00:05:01 ensure that data that data is clean and high quality. And I think there are even extensions that help you validate schemas even within JSON B columns. And I think they're using constraint. But anyway, constraint capabilities in database system, it's very powerful. And when you shift this to application logic, you should be 100% sure that in future
Starting point is 00:05:30 you don't write second application, which deals with the same database or different parts of it in different code. Or don't use some UIs, which can allow people to change data and bypass those checks, which are left in your application. So, constraint validation should be done closer to data.
Starting point is 00:05:52 And this is database system. Okay, default, easy. Next, for example, let's jump to the main case. We have not null. Not null is a tricky tricky constraint because it in my opinion it should be it should not exist because we have check constraint which is much more powerful it it's a super set of not now right it includes it because we can say check this column is not now that's it easy but is not now also it exists because it's Easy. But is not null also, it exists
Starting point is 00:06:25 because it's SQL standard. I think it was created before check constraint was created, added to standard. I guess. It's my guess. I might be wrong. I'm not even sure. Check constraint is definitely in standard. Check constraint should be in standard.
Starting point is 00:06:42 Not 100% sure, but it would be very strange if not but it's obviously like why do we have not null constraints at all we have check constraint right and interesting that if you define a primary key it requires not null constraint not like it needs it it requires two things in heaven yeah unique constraint and not null constraint and fortunately since postgres 12 optimizations we can forget about this distinction and we can build the check constraint with not null and implicitly not now actual constraint will be built efficiently on top of our check constraint and primary key will use it but some implementation detail right before 12 it was nightmare and i i remember
Starting point is 00:07:27 discussing with very experienced dba is discussing how we can introduce not now constraints like because not now doesn't support this two-phase approach we discussed today right you you cannot define not now constrained first not valid and then validate. It's not possible. And before 12, it always led to full table scan. And I remember Experience DBA said, you know what, between us, we just adjust system catalogs because we know there are no records violating these constraints, so we accept this, we just do it. Since like Postgres 8.0
Starting point is 00:08:07 or something. That's it. But yeah, this is an ugly hack, actually. I mean... Risky if you don't know exactly what you're doing. Well, it seems like not super ugly, not super risky, but fortunately, since Postgres 12, we don't need it
Starting point is 00:08:22 and we can define check constraint in this two-phased manner. And then not null constraint will be implicitly defined very fast without full table scan. So, for example, okay, we create check constraint. We define some logic in this. For example, not null or something else. For example, we can say all integer values in this column should be positive. Yeah, well, it's super flexible, isn't it?
Starting point is 00:08:51 You can pretty much define whatever you want via any query via a check constraint. It's really cool. Not any query, unfortunately. There is a huge limitation. For multiple tables, even, to write columns? Exactly. What is the limitation? You need to stay inside the same table context. You cannot refer other tables,
Starting point is 00:09:08 unfortunately. But it's pretty powerful, right? Of course. You can reference multiple columns within the same table. But yeah, the normal use cases are like, this number cannot be negative. This number should be I've not seen that in the wild or certain maybe like a maximum value or something yeah ranges
Starting point is 00:09:33 many things can be it's good it's good practice because this is how you ensure data has high quality but imagine we have a billion rows table which is data has high quality. Yes. But imagine we have a billion rows table,
Starting point is 00:09:51 which should not happen. You should have partitioning in place and don't allow it to grow beyond a few dozens of millions of rows, maybe. But it happens all the time in grown-up projects. I see it billion-plus rows all the time and we we define this constraint and how postgres like if we just say alter blah blah blah once lock is acquired to adjust metadata system catalogs postgres acquires log and then what it does
Starting point is 00:10:21 of course it goes and checks existing rows to see if there is any violation. If there is, it will roll back. And this is super bad because for one billion rows, it takes a lot of time. Full table. And while it's doing that, it's blocking any new writes. Exactly. Not only writes, selects as well because it's an algorithm. Yeah, true.
Starting point is 00:10:44 Yeah. So it's super bad this is where some engineers have feeling like they promised us that selects are not blocked what's happening here well they are not blocked by rights regular rights but you are changing the rules right now. Rules I mean you're changing the schema. So this is not simple. So you acquire the log, you start reading this. It can take a lot like minutes, maybe more, depends on these capabilities and actual table size. And even selects are blocked. This is super bad. That's why we have for check constraints, not for not null.
Starting point is 00:11:29 For not null, we don't have it. For check constraints, we have ability to say not valid. We say alter table, blah, blah, blah. We say add constraint, and then we specify not valid. Flag. It's a flag.
Starting point is 00:11:44 By the way, terrible naming. Yeah. Or confusing. If you had to guess what it was without knowing, I think you wouldn't get the details right. Right. Why is it terrible? Because once this transaction is fast because it doesn't check the existing volume of data at all.
Starting point is 00:12:02 It just defines this constraint as the table was very small, like empty, super fast, just local position was needed. Again, we need low log timeout and retries. This is a rule. But I think that's not the bit that would confuse me. The thing that would confuse me is that it will be checking new rows and updates. Exactly, yes. That's what's confusing, right? You're okay. You brought it before. It's okay.
Starting point is 00:12:31 Yeah. But, I mean, mechanics is like we create constraint, like table's empty, and that's it. And it's marked not valid in table definition if you use backslash D in PSQL, and you think, oh oh this is not very I don't care and I remember I was preparing complex first complex transformation of four byte primary key to eight byte primary and it was very complex operation, like seven steps.
Starting point is 00:13:06 Each step was reversible. And I remember for the sake of reversibility, it was not about check constraints, it was about foreign keys. It's quite similar there. You can also say not valid for foreign key. And I left not valid foreign keys for some like shadow table or old table or something and I thought they won't be used. It was a huge mistake led to some bug and fortunately I quickly realized okay not
Starting point is 00:13:37 valid doesn't mean it's not working. It's still working it means that it's applied to all new writes. Inserts and updates, not deletes. In the case of check and straight deletes don't matter, right? Because we don't need to check. But in the case of foreign keys deletes also matter, right? Yeah, true. Yeah, because if you delete the referenced row... Referencing row. Yeah, referencing row becomes an orphan if...
Starting point is 00:14:18 Yeah, so for all new writes, let's say writes, like aggregate all changes of data here. And if you define check constraint, you try to insert, if you have, for example, let's say we have not null, you try to insert null, it says, okay, violation of not valid constraint. How does it sound? Sounds not good. It's confusing. So I think a better name could be... I had an idea yesterday. We were discussing with one client.
Starting point is 00:14:57 I forgot. What would be better? Not verified maybe, right? Not verified. Yeah, it's better. Naming things is notoriously difficult right yeah but yeah we've got this now and we're stuck with it i think that distinction between the types of rights is quite important though like i think i would then have gone away and thought oh insert like if i change that column, it will validate it. But I wouldn't have realized that if I changed a different column.
Starting point is 00:15:31 Or no columns at all. As you remember, my favorite example, update set ID equals ID doing nothing logically, but physically it produces a new tuple. A new tuple is also check against with not valid constraint and it complains violation yeah so in case that wasn't
Starting point is 00:15:52 clear if that tuple has a violation in it and because we haven't validated yet it didn't fail then it tries to write that tuple again with the invalid column so yeah not verified feels good, actually. Yeah, that makes sense.
Starting point is 00:16:08 But I don't think the side effects of this, leaving a constraint in a not valid state, are obvious. Or at least not to me when I first came across it. But despite naming, I think the idea that it's already working for new rights, it's great because it gives you opportunity to say, okay, we now are covered by all future rights, all future data. Now let's take care of existing data.
Starting point is 00:16:42 And how we do this, how do we do this? How do we do this? First of all, we need to... I would not go straight to alter table, validate constraint, which scans the table in non-blocking way. This is the key. It doesn't block anyone. It just scans it.
Starting point is 00:17:02 And if everything is clear, it updates system catalogs. Again, I think lock will be needed, maybe right, but it only in the end, I'm not sure if it logic is inside this. It's interesting. I looked it up. There's a route. There's a really good note in the altar table docs. And says validation acquires only a share update exclusive lock table being altered for like a check constraint and there is an additional lock when it's a foreign key on the if concerns
Starting point is 00:17:33 funky than a row Sherlock is also required on the table reference by the constraint. Yeah, yeah, it might it might fail, for example, again, if AutoVacuum is running transaction ID wrap around prevention mode. I think it can be blocked by it. But it's not like... Fortunately, we can retry it multiple times. Yes, it will read the whole table,
Starting point is 00:17:58 quiet data, disk I.O. intensive operation, but it's already much better. The key is we don't block anyone. And if we succeed, it's great. But we might not succeed if there are nulls, or if it's different check constraint, if there are rows which violated. In this case, before doing this second step, there should be additional step in between. We should take care of violations ourselves. Either we need to delete those rows or update them, setting value, right? there should be additional step in between we should take care of violations ourselves either we need to delete those rows or update them setting value right look question why not do
Starting point is 00:18:32 that before the exactly good question setting that okay yeah good question and i think if we think like since not valid constraint already covers all future rights, I would prefer first introduce constraint, then clean up, taking care of violations in one way or another. It's very domain specific. And then validate constraint. Why? validate constraint why because in this way we don't introduce any gaps where new violations can be added right if we move the step of cleanup before like if you have it as the very first step in this case what is like again depending on the logic of application but in general case after we're done and before we start creating constraint some new rights might happen and violate it once again
Starting point is 00:19:36 right and this is so yeah i i guess it depends how many we've got but i i would still like if we've got loads then we're gonna we're gonna end up with a not valid constraint for quite a while while we clean up quite a lot of data my feeling would be what i guess both might be the solution check like get an idea of how many you've got in your existing data work out a plan for fixing those deal with the ones that you knew about at that time so there's some snapshot from like yesterday we checked which rows were in violation we've got this script to fix them all up we've run that now we check maybe we check again see if there's any new ones or maybe then you flip it to not value because at least you're only going to have a day's
Starting point is 00:20:22 worth of uh or you know however long it takes your company to or your organization to deal with those so that so it could be that you do it before and then in between yes unless we don't want to scan table too many times because we already need when we do it to clean up and also this validate constraint step the final step also needs to do it but maybe we should do it before like this very first step then introduce constraint do it again and then validate maybe this is the ideal situation but regardless of that if we just blindly say, okay, we introduce constraint, then clean up, then validate, this may lead to not understanding that not valid constraint is already applied to new rights, this may lead to late code fixes. And if code, application code or whatever, continues to perform new rights which violate constraint this is a problem because
Starting point is 00:21:26 it will bring more more and more right but there is another problem also exactly what you brought like imagine we perform an update changing the other column for example we i don't know like if it's some post or something we just edit. And then there is some additional column which we want to be not null. We had some nulls, we want to convert them to, I don't know, false or zero or something because we finally realized we don't consider nulls as zeros or empty strings. And then we have a bunch of rows which still have nulls and we already introduced constraint then user goes to and tries to for example edit the title and cannot because because constraint is already working and during editing a title postgres as we know mvcc postgres copies the whole
Starting point is 00:22:20 row it's physical row it's called tuple or tuple, right? And new tuple is violating not valid constraint. So editing of title is broken. It means we must clean up as the first step. Accepting gaps. So this is why we always need to validate the constraint eventually. When I say eventually, I mean at the end of this process and ideally as quickly as possible to avoid issues do you see how deep the topic is like yeah it's it's interesting for sure unexpectedly right what do you see in the wild yeah do you see people trying to do this like almost instantaneously
Starting point is 00:23:04 because running still if we run them back to back it's still much much better than introducing or it's much better than the alternative introducing the constraint without you know without the not valid honestly i think i saw systems maybe it was smaller projects and maybe it was because of me where we made mistake and did clean up as the second step, not having it before the first alter table. It should be before the first alter table, definitely. And we just didn't notice this bug. I mean, it's a bug of application. It's not a bug of Postgres. When user tries to edit a title, but because of some additional column violation, what's happening here?
Starting point is 00:23:45 And now I'm more cautious about it. So I would put it on the first place. And yeah. So just to clarify, first step we manually run, we manually check for violations. Yeah, exactly. So you know what I mean?
Starting point is 00:24:03 We don't use the validation step to validate whether it's actually... First step is get rid of violations even before you introduce constraint. Then you introduce not valid constraint then ideally once again get rid of violations because maybe we had a gap, maybe new came
Starting point is 00:24:19 and then validate constraint. I've got... I actually think I'm wrong there. I think we'd be better off checking for violations, fixing them, check the violations again, and if you find any, don't go ahead with your plan, because you've still got a problem at the application level of bad data being inserted. Probably then don't get started,
Starting point is 00:24:41 because you need to go back and stop. This algorithm needs to be visualized already yeah yeah but do you see what i mean because otherwise we're going to have errors yes exactly but what how does it change it okay clean up create constraint with not valid why you are shaking your head i'm talking about leaving a gap at the beginning. So, clean up the existing data. Wait a while. Wait is a good idea, yeah. To see if...
Starting point is 00:25:10 See if you need to clean up again. First, fix your application. Fix your application. Not to produce wrong data anymore. Second, clean up. One more step. Wait. Yeah.
Starting point is 00:25:25 Like days, weeks, I don't know. Then introduce. No need to second clean up. Introduce constraint. Okay, second clean up. Why not check real quick? Check without clean up. Okay, I mean, just, yeah.
Starting point is 00:25:39 It's just a minor step. It doesn't take long. Okay. One more scan of the table, though. Okay, this is, yeah, this is full-fledged. Okay, one more time. Fix application code. Clean up.
Starting point is 00:25:53 Check and clean up if needed again. But if clean up needed again the second time, it means code is not fixed yet. This is great. Then alter table constraint not valid. And then what do you think? Do a maybe no already right shouldn't do right yeah yeah well because at this point if you just validate it if you needed to clean up it will fail yeah it's similar to just checking yeah yeah yeah so yeah i agree i let's remove cleanup between two alter table steps
Starting point is 00:26:26 but have two cleanup plus additional check and some time between them. Yeah, that's great. Final question I have here. Could we have just alter table concurrently and have less headache when doing these things? What do you think? For example example if you create like similar situation create index concurrently create unique index concurrently it also it's it's not just index it's constrained basically and and if we have duplicates it will
Starting point is 00:26:58 fail right what we need to do in this case clean clean up manually invalid index. It will be showed as invalid. And then make a new attempt from scratch. You said index there, but did you mean constraint? Well, constraint will be created implicitly. There's a subtle difference between unique indexes and constraints. Let's not go there. I know places in Postgres where confusion is like violating logic.
Starting point is 00:27:30 I have a patch proposed five years ago or so. Nobody took care of it. It's okay. Anyway, when we create a unique index, logically, we are building constraints. So if we have duplicates, this operation will fail, leaving invalid index behind because this attempt also to face
Starting point is 00:27:48 just single line of code, but it's to face, right? We need to drop that index again concurrently and then start from scratch. Oh, again, we need to clean up duplicates and start from scratch. How about here? We could say alter table concurrently, alter table blah, blah, blah, like add, check, constrain concurrently. If there are problems, it would fail.
Starting point is 00:28:14 And we need to clean up and try again. I don't know. I don't understand the reason. I never really thought about this before, but why does create index concurrently, if it fails, leave behind... Why can't it fully tidy up after itself? Because it's not transactional.
Starting point is 00:28:33 But if it's not transactional already, why can't it do the drop index? Why can't it know that it's left? Yeah, it's a good question, actually. And it could drop index concurrently. Yeah. It's a good question. I. And it could drop index concurrently. Yeah. Yeah. It's a good question. I will think about
Starting point is 00:28:47 this. There's probably a reason, right? Like, people would have thought of that, so I imagine there is a good reason, but I don't know it. Maybe it's just easier to implement. That's it. I mean... Maybe, but it would help. Like, that would be a good feature if it's not really difficult. Yeah. User should be okay
Starting point is 00:29:04 just with error message. Yeah, user should be okay just with error message. Yeah, exactly. It didn't work. You're back where you started, even if I haven't seen several transactions. Fix your duplicates and try again. Yeah, but if you don't use explicit naming for your indexes, there is a risk that you had 10 attempts and didn't notice that you have
Starting point is 00:29:27 9 invalid indexes left behind because you didn't check the definition this is the risk here if you use explicit naming of course it will say the index with such name already exists I actually don't know for invalid I assume so i think i think yeah it will be yeah yeah so i always prefer to to use explicit naming for indexes just to control the naming
Starting point is 00:29:56 schema like understanding which columns are indexed and so on but also because of this for to better control of my attempts if it's concurrent concurrently, it's always concurrently, almost always. Yeah, so anyway, back to this alter table, add constraint, check constraint. Concurrently, why not? It could just do two phases. First phase not valid. Second phase, validation. Validation fails.
Starting point is 00:30:23 Clean up or no is a separate topic now. You detached this topic. I prefer clean up, but yeah. But wouldn't it be so great? Yeah. I feel like concurrently is coming up in every episode these days. It's good. Great, great. Yeah. So alter table concurrently. I think it'd be a great feature. It's not an easy to implement, I'm sure. And again, non-transactional.
Starting point is 00:30:48 It's another one of those things that would have to be. It means that there is a promise of Postgres has transactional SQL, but when you start working, it's already so. In real life, we don't have transactional DDL because creating this concurrently is not transactional and these steps when you like split it to
Starting point is 00:31:09 several steps each one of them is transactional which is great but sometimes you cannot reverse this and overall you lose some data or something yeah I think it's fair to say we do have transactional DDO and Postgres,
Starting point is 00:31:28 but it's not practical in larger projects with high load because you just can't afford the downtime that comes as a result of that transactional DDO. So in medium to heavy load, large enough setups, we don't have transactional DDO or we can't practically use it so these practical solutions are really valuable i think the other approach seems to be yeah like the um completely rewriting the table type approaches like we talked about last week or there's another project and i'd be maybe it's a whole it's definitely a whole nother topic but
Starting point is 00:32:04 have you come across pg roll i think it's been brought up other topic, but have you come across PG Roll? I think it's been brought up a couple of times on the podcast. Yeah, in the past. Fully right off the table. Yeah, exactly. I think, well, that's another solution to this, right? Just to introduce constraint? Thank you, no. I don't know if that's what they do for this, but I think that's how they handle most
Starting point is 00:32:21 schema changes. I understand that. It's heavy-handed, but it does also solve this problem. Yeah, yeah, yeah. Now, remember, when we discussed PgSqueeze, we discussed changing order of columns. This makes sense. And, of course, concurrently, not concurrently.
Starting point is 00:32:38 Yeah. And the other, I guess, one final solution is, again, super heavy-handed, but you could logically replicate to an instance that has the constraints that you want in place already. Mm-hmm. Lots of huge amount of work. When we want to add just constraint,
Starting point is 00:32:59 we need just to read everything to ensure there are no violations. It's not a big change. it's just adding a restriction so we don't need to change data it shouldn't be anyway that's i think that's it what i wanted to discuss yes anything else yes super useful super useful feature i think feature. I'm hoping more people become aware of it now. You mean alter table concurrently doesn't exist? No, I mean alter table add constraint not valid.
Starting point is 00:33:33 Not valid. And a little bit confusing naming. So watch out. New rights are already being checked. Yeah, I'll include some links to the documentation because that explains it pretty well as well. Great.
Starting point is 00:33:49 Thank you. Nice one, Nikolai. Thanks so much. See you soon. Bye-bye. See you soon. Bye.
