Postgres FM - Version control for databases
Episode Date: November 4, 2022

Here are links to a few things we mentioned:

A brief history of Version Control Systems (blog post by Mehran Hajirajabi)
Redgate Source Control for Oracle (and SQL Server)
Flyway
Sqitch
Liquibase
pgAdmin Schema Diff
Migra
PostgresCompare
Schema based versioning and deployment (blog post by Valentine Gogichashvili)
Change management tools and techniques (PostgreSQL Wiki)
GitLab migration_helpers.rb
Database schema changes without downtime (blog post by James Coleman from Braintree/PayPal)
Zero-downtime Postgres schema migrations need lock_timeout and retries (blog post by Nikolay)

------------------------

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Hi Michael, I have no idea what to call it. Let's have you do it.
Well, let's start with calling it database version control.
So source control, schema...
Schema change management.
Yeah.
Database schema version control.
I don't know.
Yeah.
If you say just database version control, someone can think that it's also about data changes and time travel for data.
It's very different.
We mostly talk about schema, but talking about schema, we should not forget about big data volumes, when you need to adjust your data if you need to make some schema change which requires data changes, right? For example, a data type change of a column.
Absolutely.
Well, I think probably
in the application world, this is
a well-covered topic. There's been
various generations
of tools that have improved.
Yeah. I've recently read articles about it. There are three generations of code version control, or code change management, or source code management. The first was very naive and lock-based. The second was CVS, Subversion, and others. Of course, it was better, but it was very centralized.
Subversion, I remember. Subversion, Team Foundation Server.
Yeah, some old ones.
And the third generation is Git and Mercurial, which are also decentralized.
Yeah.
And I think lock-free, decentralized. I don't know anybody using Mercurial anymore, so it seems like Git won several years ago. I know for sure Facebook used it, but I'm not sure about now.
Yes. So yeah, on the application side it feels like this is pretty standard. It's very rare that I meet anybody not source controlling their application code, but I do still meet people who don't have the same systems for their database. And it's super interesting, because this is a problem I was working on back in... even as early as 2010, I was working at a company called Redgate that were bringing out a source control product for a database, for SQL Server. So it's super interesting that still, what, 12 years later, it's still pretty common for me to come across teams that don't have it. And even the ones that...
For the database, right? For exactly that?
Yes. And even the ones that do don't normally have it hooked up to... they're using it for kind of communicating changes between developers, but not always for deployments or continuous integration.
Yeah, let's discuss the goals.
Let's start with regular code.
What goals is a tool like Git aiming to solve?
For example, if you're just a solo developer, it still makes sense for you to track changes and to have history, right?
Like, so you can time travel, basically, in terms of source code.
But do you need branches?
And do you need multiple versions of your software?
Well, this is super...
Do you mean for databases or do you mean for applications?
For regular applications.
Simple, like a single-page application.
I don't know, something simple.
If you're on your own, I think branches have minimal additional benefit.
I know some people love them for kind of exploring different paths.
If they want to try something experimental and then quickly come back and fix a bug for a customer,
it's like nice to not have been developing on the main branch.
So I can see benefits to that.
And also I think it teaches good discipline
when you do have collaborators,
but I don't see massive benefits.
How about you?
Well, yeah, this is interesting.
So branches can be used in various ways.
And if I'm alone and I'm a solo developer,
I still have some benefit from them because, exactly like you said, I can experiment with some path of development and then abandon it if I consider it wrong. And with branches, it's good that they are very quick to create, it costs nothing: very quickly, boom, and I have a branch. I can experiment, and I can have multiple paths of development, and then I choose which one to take.
Maybe I will combine multiple ones.
This gives flexibility and freedom, right?
But also, branches sometimes are used not only for feature development or bug fixes.
They are used to track different versions of software.
And Postgres itself, Postgres project, doesn't use feature branches, right?
It uses major version branches, like stable release number 15.
We have a branch and all bug fixes are happening inside this branch.
So you have multiple lines of your tree of development.
And I can easily imagine if I'm a solo developer, I develop some popular software, I still can benefit from branches.
Like in this particular case, I maintain some old version.
I already refactored it a lot.
I already have a new version, but I still maintain the old one.
I need to bug fix it sometimes.
So branches are needed for this as well.
But maybe we went too far from databases already, right? Maybe not, maybe yes.
I think this is all useful, but I would say that this is where the next
major benefit that I was going to mention is obviously in teams being able to work in isolation
and not have to worry about stepping on each other's toes. And that's where it gets
super relevant for databases. So when I was at Redgate, we actually supported two different
models of source controlling a database. One we called the shared model because most teams at the
time, and probably still now, tended to have, even if they had databases locally for developers,
they would tend to have one big database. So, I know this is where your product comes in, Nikolay, but they would have one large database where you could do proper testing, proper checking on decent data volumes, and they had to be really careful of stepping on each other's toes. For example, if you and somebody else happened to be working on the same stored procedure or function at the same time, you get the case where the last one to change it wins. And that's a real problem in terms of testing, if you don't understand why your feature's not working, and then you realize somebody else changed it. So we did support that model, and we actually ended up implementing it via locks, so triggers actually; we called it locks, but you could lock an object so other people couldn't change it, for example. So there were some super interesting features around that, but that was a stopgap. I saw it as a stopgap, because the better version was what we called the dedicated model, where everybody had their own dedicated version of the database, and they could make changes in isolation and then commit those back to the repository for the source code, and maybe push those to a central system, and deploy those maybe to an environment.
But okay, back to Git as the third generation of source code management systems. There are also some small things which deliver some value.
For example, you can easily check diff before you push.
So you're already committed.
You can review your own changes.
And with additions in the form of GitHub or GitLab, you can also have a very, very powerful, like, review process: pull requests, merge requests.
So there are multiple levels of quality control,
I would say, right?
And this is great.
And also forking.
So if you have open source project,
it's easy to just fully clone it, fork it,
and develop.
And if you see some value, you can push changes back: propose, suggest these changes to the upstream project, in the form of a pull request or merge request.
But the benefits are not in the same area, they are in different areas. But all of them allow you to build a very powerful and very good workflow. And there are many workflows actually, not one, which are
proven to be successful in various teams. But back to databases, we know that currently,
I would say we also have, like, some generation of schema change management tools. Some people name them... I believe it came from Ruby, I'm not sure.
Database migration tools.
Yeah. Right. Because Ruby on Rails Active Record has a part of it called database migrations. And it's not migrating from Oracle or to Postgres, it's about schema changes, basically.
Yeah. Migrating from one version of your schema to another version of your schema.
And this is really important. This comes back to your diff point from earlier. In code, in
applications, we can diff simple text files, and we can deploy it by overwriting one with the other.
And that's fine, because there's no data. Whereas in databases, we can't just replace. If we rename a column, for example, we can't just delete that column and add a new one with the new name. We need to be aware of the data.
Yeah, of course.
So that's, that's a very simple example, but there are loads of these examples.
And I think this is where at Redgate, we had a huge advantage because we had a schema comparison tool that did these diffs.
That was the starting point.
Is this schema comparison, does it happen without context?
Like what happens with this example?
This is where it gets really complicated.
And I think it's mostly unviable in Postgres at the moment.
I don't think anybody's doing this in Postgres at the moment.
You need to have a live database, source code files. So these files are create statements,
a create statement per object, and you need to be able to do comparisons. But of course,
if you're comparing two static files, how do you know if it's a rename or dropping one column and
creating another one? Through static files, you can't't tell so you need to add migrations or what rails calls migrations at some level so they ended up
kind of backing into that and ended up also acquiring a tool called flyaway db because
that was very migration based so i think just just quickly i wanted to say i think there are two
two completely different models here one of which is kind of nice in principle, but I don't think works in practice, at least not yet,
which is the static create statements.
And the other one, which is a single dump of your schema from a point in time,
and then migrations between versions thereafter,
which is what all of the frameworks like Rails and Django are doing, and the other tools do as well.
Right.
And if you have some live database,
even if it's not production,
you can distinguish drop create versus alter column
if you check system catalogs.
Object IDs.
Yeah, exactly.
Yeah.
Or in Postgres, pg_attribute has attnum, so like a number of where the column is, and you can say: oh, this is the same column, just renamed. So you can distinguish this case versus it being a completely new column.
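For illustration, a minimal sketch of how a diff tool could use this (the users table is hypothetical):

```sql
-- attnum stays stable across a column rename, so comparing snapshots of
-- pg_attribute can distinguish RENAME COLUMN from DROP plus ADD.
SELECT a.attnum,
       a.attname,
       a.atttypid::regtype AS data_type
FROM pg_attribute a
WHERE a.attrelid = 'users'::regclass  -- hypothetical table
  AND a.attnum > 0                    -- skip system columns
  AND NOT a.attisdropped
ORDER BY a.attnum;
-- Same attnum with a new attname: a rename.
-- A new attnum with no prior match: a genuinely new column.
```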
Interesting. So you mentioned Flyway. I know Redgate purchased this project a few years ago. There are other projects, like
Sqitch. I like Sqitch because it's
quite powerful.
Although it's written in Perl, so
it's hard for me to contribute.
It's very powerful and works with many
databases as well. Also
there is a very popular Liquibase.
Liquibase is very...
Django also, being a framework,
Django has database migration capabilities similar to Rails.
But all of them, correct me if I'm wrong,
all of them live on top of Git. They rely on Git for version
control, right? Okay. So what's their value?
Why do we need them at all? Is it to do with
deployments? Is it to do with actually applying the changes, picking them up, and tying them to code changes as well?
For example, if I have database change that needs to go alongside an application change, that's my understanding.
Yeah.
For example, if we talk about...
Well, obviously, Rails and Django, being frameworks, they are also ORMs, so the migration parts of those ORMs also support, like, a higher-level language, not SQL, right? So we can change the schema also using our native language, trying to avoid SQL.
But for quite advanced changes, you still
need SQL. For example,
if you want a trigger, probably you need to write some PL/pgSQL code and some SQL code and so on. If you need some advanced CTE, or to define some view with an advanced CTE, you need some SQL. But generally,
they try to also support
native language like Ruby or Python.
And also, for example, in Rails, I see that people start... I observed it in multiple projects, and GitLab as well. They start with not just dumping the schema: they support steps, and also dumping schemas for additional control, like, this is our current baseline, and we shifted. But usually projects start with schema.rb, I think, in Rails, and then they switch to structure.sql, which is a regular SQL dump, and it's easier for Postgres guys to compare it and deal with it.
I remember GitLab did it when they decided to stop supporting MySQL. Before, they supported both MySQL and Postgres, and it was not possible to have structure.sql, which should be single-database only. But my question to these systems is: okay, you usually provide forward and backward; you can define forward and backward steps. Sometimes you can also define a testing step. They allow you to deploy. They allow you sometimes to have some dry deploy, which doesn't do anything but checks the correctness of your changes.
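As a rough sketch of those steps, here is what a forward/backward/verify trio can look like in a Sqitch-style layout (file names and the orders table are hypothetical):

```sql
-- deploy/add_status.sql  (forward step)
ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'new';

-- revert/add_status.sql  (backward step)
ALTER TABLE orders DROP COLUMN status;

-- verify/add_status.sql  (test step: errors out if the column is missing)
SELECT status FROM orders WHERE false;
```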
But why don't you care about database data? A database is data. It's not tiny, unlike source code. Sometimes we have millions of lines of code, of course, but usually it's relatively small compared to a database. But a database is usually big. And if you need to change something without blocking others, you need to have some DML as well, not only DDL. And this DML should be split into batches.
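A minimal sketch of such batching, assuming a hypothetical orders table where a new bigint column is backfilled from an old integer one (a real tool would also tune batch size, pause between batches, and retry):

```sql
-- Each batch commits separately, so row locks are held only briefly.
-- Transaction control inside DO blocks requires Postgres 11+.
DO $$
DECLARE
  rows_updated bigint;
BEGIN
  LOOP
    UPDATE orders
       SET id_new = id
     WHERE id IN (SELECT id
                    FROM orders
                   WHERE id_new IS NULL
                   LIMIT 1000);  -- batch size to adjust under real load
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    COMMIT;  -- release locks between batches
  END LOOP;
END $$;
```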
Why don't these old tools support it at all?
I think it's difficult. And I think, if we talked about those three stages of application source control progress, I think in database land we're kind of in step one still. I think the tools are still quite immature for supporting us.
Generation one.
Maybe, yeah.
Maybe that's a bit harsh. Maybe we're just sneaking into number two
with some of these tools becoming better. But if you're not on Rails or Django, I think it's actually quite difficult to get this stuff up and running. So, one issue, for example, even before you worry about online changes with big data, you know, actually worrying about zero-downtime deployments, even before then: what about merging? What about if I make some changes on a branch and add a migration, and you make some changes on a branch and add a migration, but we want to deploy those together?
Conflict resolution, it's called.
Yeah, exactly.
Conflict resolution is difficult.
Yeah, so two questions we raised: zero-downtime migrations, for example, and actually basic diff capability. If we rely only on Git, it will be very weak. For schema, we need diff, and not such as pgAdmin and others have, very naive, or migra or others. They have a very naive diff. For example, they say CREATE INDEX, not mentioning CONCURRENTLY. So if we have a substantially big and substantially loaded database, we will block everyone. And only tiny projects can afford to use such an approach. So CREATE INDEX should be CONCURRENTLY.
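To make the difference concrete, a small sketch (index and table names are hypothetical):

```sql
-- What a naive diff tool emits: the build blocks all writes to the table.
CREATE INDEX idx_orders_created_at ON orders (created_at);

-- What a loaded database needs: builds without blocking writes.
-- Note it cannot run inside a transaction block, which is one reason
-- naive migration tools avoid emitting it.
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders (created_at);
```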
ALTER TABLE, changing a column data type, also. This is harder. This should be in batches. So: diff; you say conflict resolution; I also say deployment in accordance with this zero-downtime approach. So we need batch management, background workers, sometimes lasting days. If you need, for example, to change a data type of a column in a table which has a billion rows, you need to be very patient, wait one or two days, and split it into batches. Batch size should be properly adjusted. So they don't care about it. Why? Hard problems? Or am I mistaken and it's not needed at all?
I think it's a hard problem personally, I think.
Also, three problems already.
Already three problems, yeah, right. And I've actually been... so I've got a friend working in this area, somebody I used to work with, called Neil Anderson, working on a tool called PostgresCompare. And I've been trying to encourage him because, as you say, the diff tools that have been added to most IDEs, and the open source ones, aren't super intelligent. They're good, I'm sure a lot of people have put time into them, but he charges, it's a commercial tool, and he's been able to put a lot more time into it. So it's called PostgresCompare.
Zero-downtime migrations? I'm not sure. It's mostly for live schemas?
Yeah, it's mostly for, like, live schema-to-schema comparisons.
And I don't know if they're, like, without data, right?
No, he does do data as well, actually. It's in the same product.
Okay, yeah, but it's kind of a different feature, right?
So yeah, I think data... like, I was going to mention data here actually, because I think data can sometimes be an important part of your schema, which, again, some of these tools forget and neglect.
If you've got a lookup table, for example... what's the famous one? Country codes, or...
Yeah, exactly.
Anything where you need that data.
People put it into migrations. I see it quite often in Flyway, for example: just insert.
Yeah.
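For what it's worth, a sketch of what that typically looks like in a migration (the table is hypothetical, and ON CONFLICT assumes a unique constraint on code, which keeps the migration safe to replay):

```sql
INSERT INTO country_codes (code, name)
VALUES ('DE', 'Germany'),
       ('FR', 'France')
ON CONFLICT (code) DO NOTHING;  -- idempotent reference data
```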
And sometimes some teams consider this a bad practice and try to avoid it, but some teams do it.
I don't know what the alternative is, though,
if that's the migration.
Good question. I don't know either.
So we had to worry about that
because we had create statements in version control.
So if you're trying to do the alternative, which is have a create statement per object and manage diffs, you have to choose which tables need to also go in data-wise.
So I think it's a hard problem and complicated.
So I do have empathy with the people making these products and making these services. I do think databases are fundamentally harder to do this for than applications, but I do also think there's a lot we can do to make them better. The CREATE INDEX CONCURRENTLY one is a perfect example of quite a simple change that probably should be in most of them.
Yeah. Also, you know, I think if somebody developed an
open source tool
which supports zero downtime diff,
I think it would solve the problem
of merging and conflict resolution
because we would see this diff
and this is our merging process.
We just, maybe not.
Actually, if we merge to,
for example, main branch,
it's our deployment to production.
What do you think?
Yeah, I think merging is trickier,
but we might be able to lean on Git for some of that.
So, for example, if we're changing the same function,
the function is going to look very similar between our two.
But if we're altering a table,
maybe we can apply one after the other,
but I'm not sure when order matters.
It's a tricky one.
I haven't thought about this for a few years, to be honest.
If we consider functions and triggers as a part of schema, and I think we should, this also... we probably already mentioned, in previous episodes, the problem of includes in all these tools, current generation 1.5 tools, right. If we include a different file and keep the function body there for convenience, like, for example, each function in its own file, easier to find it, navigate, and so on.
Do we still want to support long sequences of deployments?
For example, we have several installations of our product,
and one is already very up-to-date,
and another one is lagging a few months behind in changes.
And the problem is, when you include, you need the current version of the file. But when your function has already changed many times, and you try to replay an old migration file, it tries to include the very fresh, very new function body, and it's not what was expected, like, a couple of months ago, right? And we have a problem. So you need to somehow fetch... you can use Git history and fetch an older version which corresponds to this migration file. But why does nobody care about this? I needed to implement it myself, and I thought about how to contribute it to Sqitch, and it was quite hard actually. I raised the topic about it in the mailing list, and I didn't see support, and it looked to me like I'm alone with this problem, like nobody cares, right? Maybe I'm really alone. How do people do this, from your practice? Is it a problem, or do people just not need to replay long sequences of changes?
I've seen more people kind of like try migrating forwards.
I actually see quite a few people not even worrying about backwards, like moving backwards.
And it's an interesting one, potentially.
It's not only backwards.
It's also about when you have multiple setups, and one of them is lagging quite a lot.
They need to replay, say,
10 database migrations.
And they fetch the very fresh version
of our code, but this code
already has a much
newer function file.
And when they try
to replay some old
migration which tries to include this file, it expects a very different function or view, for example.
So including different files, it's a problem. You always need to snapshot the whole function body into the migration, which bloats your code, right?
You cannot keep a function in a separate file.
This is the problem with all these
tools.
If we only have migrations from
a baseline, I don't
think this is a problem.
The problem starts when you try to use includes, including some file.
Like with Sqitch, it's very... Sqitch feeds files to psql, so you can use backslash-i (\i).
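A small sketch of why that bites later (the paths are hypothetical):

```sql
-- deploy/2022_06_add_pricing.sql, an old migration that Sqitch feeds to psql:
BEGIN;
\i functions/calculate_price.sql  -- psql reads the file as it exists NOW on disk
COMMIT;
-- Replaying this migration months later pulls in the current function body,
-- not the body as it existed when the migration was written.
```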
Okay.
Yeah, okay.
This is a problem.
And if you use backslash-i and this is, like, a two-month-old migration, this file has already changed, and you need to rewind it. And this is a problem. Otherwise, you need every time to put the whole function body into the migration, and it bloats everything.
It's very huge.
If every time you say CREATE OR REPLACE FUNCTION, or DROP FUNCTION and create it again, and put everything, and revert as well, like, the whole function again, it bloats it so much that I'm starting to ask: do we still have Git? It's like nobody cares that we have Git here. This is what it was created for, right? Why do we copy and paste functions, function versions? Where is version control at all?
This is what I liked about the single create files per object.
But it has such big limitations right now that I don't think it's viable.
So I think we have to accept putting the full object back in. Maybe you could argue it's...
I don't understand.
If you keep an object in a separate file, how do you replay sequences of changes? So, like, you still have versions, right? Like, a first version, then maybe a timestamp, per day or something, as a timestamp. This is a new version, new version, new version.
If you keep an object in a separate file, you need to include it. And if you try to replay a very old migration with a very new state of code, you have a problem.
I think you need to have database-aware diff functionality, and it also probably needs another folder of overrides. So if you have, for example, easy migrations, like adding: if we're always adding columns, adding indexes, adding tables, easy stuff, we don't need any special cases, the diffs are easy. If we do something more tricky, let's say, and I know Git doesn't work like this, but let's say from version 15 to version 16 we need a more complex migration, we can put it in a folder somewhere for going between 15 and 16, which includes going from 14 to 18 or any other way past that boundary. We can say: don't use what you plan to use, use this override instead. But it gets complicated.
That's kind of like a hybrid between the two models we've been discussing.
Well, there's a big difference between objects with data
and very lightweight objects.
Objects with data is tables.
It's called tables.
That's it.
Only tables have data.
If we forget about materialized views, which are derivative from tables, right?
Yeah.
Views, triggers, functions: they are light, they don't have data, so we can always rewrite them, we can throw them away.
Indexes, though?
Well, yeah, but indexes, it's like part of tables, probably.
Okay, cool, fair enough.
They cannot live without tables. It's like an add-on to tables.
But functions, triggers, and views, regular views.
Maybe we can trust views as well, because they have only derivative data.
You always can redefine them, and you can keep them in files.
That's why.
Well, that's a really interesting point.
I haven't seen a system that separates those two. I built it a couple of times.
Okay, cool.
So you had one system for tables. Remember we
discussed the approach
when functions are kept in separate
schemas and you can keep the old
version of... This is smart and good.
And I think it's possible to build it without copy-paste still, like, to rely on Git for version control and have, like, schema functions version 1, functions version 2, and be able to keep several versions at the same time. And different application nodes can address different versions of functions. This is very smart. And this is exactly when we probably need to keep functions in separate files as well.
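A minimal sketch of that idea, with hypothetical names: each release gets its own schema, and application nodes pick a version via search_path.

```sql
CREATE SCHEMA functions_v1;
CREATE FUNCTION functions_v1.get_price(p_order_id bigint)
  RETURNS numeric LANGUAGE sql AS $$ SELECT 10.0 $$;

CREATE SCHEMA functions_v2;
CREATE FUNCTION functions_v2.get_price(p_order_id bigint)
  RETURNS numeric LANGUAGE sql AS $$ SELECT 12.5 $$;

-- Old and new application nodes run side by side, each resolving
-- get_price() through its own search_path during a rolling deploy:
SET search_path = functions_v2, public;
SELECT get_price(1);
```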
Right. And I should mention that this idea came to the mind of Valentine Gogichashvili long, long ago, from Zalando, and then propagated to Russia, to Avito and others, and so on. So it's a great idea, still relevant, I think.
I'm not sure about triggers, because triggers consist of a trigger function plus the trigger itself. So the trigger itself is a bit easier; it also requires things... Oh, by the way, big question number four, or five, probably: why do all these guys not care about the simple problem of locking attempts? If you try to ALTER TABLE, you always must have a small lock_timeout and be ready to retry.
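The pattern being described looks roughly like this (the table is hypothetical):

```sql
-- Wait at most 100 ms for the ACCESS EXCLUSIVE lock, instead of queueing
-- behind a long transaction and blocking every query that arrives after us.
SET lock_timeout = '100ms';
ALTER TABLE orders ADD COLUMN note text;
-- If this fails with "canceling statement due to lock timeout",
-- the migration tool should sleep briefly and retry, not abort the release.
```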
Otherwise, well, you don't notice the problem if you don't have a big load, not high load, right? Not heavy load. But if you have noticeable load, like 1,000 TPS, sooner or later you will bump into this problem very badly, as many did already. So you need to be graceful. Otherwise, you try to alter, somebody holds locks, and you cannot acquire a lock, and you have a huge line, including SELECTs, waiting after you. So you block everyone. They don't care about it. Why?
Well, I guess...
Because it's generation 1.5.
Generation 1.5.
It's a really good point.
But that also lives in a slightly different place, right?
That lives in the config file.
And that's not necessarily...
Is that in version control?
Hopefully, but it's not in the same version control for most people.
So it's deployed in a different way, updated in a different way.
So it's very, very interesting.
Yeah, really good point.
Maybe that should be in there as well.
I hope they will develop in the right directions.
People need these directions.
People need zero downtime migrations,
conflict resolution, graceful deployments, right?
And by the way, I want to advertise: GitLab has open-sourced their background migrations. It's a file called migration_helpers.rb. So it's open source, anyone can check it. And if you are a Ruby developer, you can benefit from learning from their experience. An enormous volume of experience is already coded there to support zero-downtime migrations.
Also, actually, GitLab has many installations and some of them may lag and need to...
This is exactly the problem I described.
I'm not sure if they keep some objects in separate files, probably not, but they have
this problem.
Deployments should be without issues on many, many setups.
So it's interesting to check.
I hope that one day they will open source it separately, as a separate gem or library. That would be great for Ruby developers.
That'd be really cool.
There's a really good blog post
about zero downtime migrations
from the team at Braintree.
That's excellent.
I think I'll include that as well.
Don't forget to include my lock_timeout post about this retries problem.
It's a big one.
And yeah, I would love to hear from anybody listening to this that's screaming at us and saying that you've got this solved or you've got a really good process on your side.
It would be great to hear how people are doing.
Yeah, maybe we are wrong.
This episode is full of criticism, but it's an opportunity to improve our tooling.
So maybe someone can say there is some very magic tool which solves all problems, right?
Yeah, let us know. Wonderful. Okay, well, thanks, Nikolay. Thanks everybody for listening. Appreciate everybody that's got in touch recently with feedback, and yeah, looking forward to speaking to you next week.
Great, thank you. Bye.