Postgres FM - Version control for databases
Episode Date: November 4, 2022

Here are links to a few things we mentioned:

A brief history of Version Control Systems (blog post by Mehran Hajirajabi)
Redgate Source Control for Oracle (and SQL Server)
Flyway
Sqitch
Liquibase
pgAdmin Schema Diff
Migra
PostgresCompare
Schema based versioning and deployment (blog post by Valentine Gogichashvili)
Change management tools and techniques (PostgreSQL Wiki)
GitLab migration_helpers.rb
Database schema changes without downtime (blog post by James Coleman from Braintree/PayPal)
Zero-downtime Postgres schema migrations need lock_timeout and retries (blog post by Nikolay)

------------------------

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Hi Michael, I have no idea what to call it. Let's have you do it.
Well, let's start with calling it database version control.
So source control, schema...
Schema change management.
Yeah.
Database schema version control.
I don't know.
Yeah.
If you say just database version control, someone can think that it's also about data changes and time travel for data.
It's very different.
We mostly talk about schema, but talking about schema, we should not forget about big data volumes, when you need to adjust your data if you need to make some schema change which requires data changes, right? For example, a data type change of a column.
Absolutely.
Well, I think probably
in the application world, this is
a well-covered topic. There's been
various generations
of tools that have improved.
Yeah. I've recently read articles about it. There are three generations of code version control, or code change management, or source code management. The first was very naive and lock-based. The second was CVS, Subversion, and others. Of course, it was better, but it was very centralized.
Subversion, I remember. Subversion, Team Foundation Server.
Yeah, some old ones.
And the third generation is Git and Mercurial, which are also decentralized.
Yeah.
And I think lock-free, decentralized. I don't know anybody using Mercurial anymore, so it seems like Git won several years ago. I know for sure Facebook used it, but I'm not sure about now.
Yes. So yeah, on the application side it feels like this is pretty standard. It's very rare that I meet anybody not source controlling their application code, but I do still meet people who don't have the same systems for their database. And it's super interesting, because this is a problem I was working on back in... even as early as 2010, I was working at a company called Redgate that were bringing out a source control product for a database, for SQL Server. So it's super interesting that still, what, 12 years later, it's still pretty common for me to come across teams that don't have it. And even the ones that...
For the database, right? For exactly that?
Yes. And even the ones that do don't normally have it hooked up to... they're using it for kind of communicating changes between developers, but not always for deployments or continuous integration.
Yeah, let's discuss the goals.
Let's start with regular code.
What goals is a tool like Git aiming to solve?
For example, if you're just a solo developer, it still makes sense for you to track changes and to have history, right?
Like, so you can time travel, basically, in terms of source code.
But do you need branches?
And do you need multiple versions of your software?
Well, this is super...
Do you mean for databases or do you mean for applications?
For regular applications.
Simple, like a single-page application.
I don't know, something simple.
If you're on your own, I think branches have minimal additional benefit.
I know some people love them for kind of exploring different paths.
If they want to try something experimental and then quickly come back and fix a bug for a customer,
it's like nice to not have been developing on the main branch.
So I can see benefits to that.
And also I think it teaches good discipline
when you do have collaborators,
but I don't see massive benefits.
How about you?
Well, yeah, this is interesting.
So branches can be used in various ways.
And if I'm alone and I'm a solo developer,
I still have some benefit from them because, exactly like you said, I can experiment with some path of development and then abandon it if I consider it wrong. And with branches, it's good that they are very quick to create, it costs nothing: very quickly, boom, and I have a branch. I can experiment, and I can have multiple paths of development, and then I choose which one to take.
Maybe I will combine multiple ones.
This gives flexibility and freedom, right?
But also, branches sometimes are used not only for feature development or bug fixes.
They are used to track different versions of software.
And Postgres itself, Postgres project, doesn't use feature branches, right?
It uses major version branches, like stable release number 15.
We have a branch and all bug fixes are happening inside this branch.
So you have multiple lines of your tree of development.
And I can easily imagine if I'm a solo developer, I develop some popular software, I still can benefit from branches.
Like in this particular case, I maintain some old version.
I already refactored it a lot.
I already have a new version, but I still maintain the old one.
I need to bug fix it sometimes.
So branches are needed for this as well.
But maybe we went too far from databases already, right? Maybe not, maybe yes.
I think this is all useful, but I would say that this is where the next
major benefit that I was going to mention is obviously in teams being able to work in isolation
and not have to worry about stepping on each other's toes. And that's where it gets
super relevant for databases. So when I was at Redgate, we actually supported two different
models of source controlling a database. One we called the shared model because most teams at the
time, and probably still now, tended to have, even if they had databases locally for developers,
they would tend to have one big database. So, I know this is where your product comes in, Nikolay, but they would have one large database where you could do proper testing, proper checking on decent data volumes, and they had to be really careful of stepping on each other's toes. For example, if you and somebody else happened to be working on the same stored procedure or function at the same time, you get the case where the last one to change it wins. And that's a real problem in terms of testing, if you don't understand why your feature's not working, and then you realize somebody else changed it. So we did support that model, and we actually ended up implementing it via locks, so triggers actually; we called it locks, but you could lock an object so other people couldn't change it, for example. So there were some super interesting features around that, but that was a stopgap. I saw it as a stopgap, because the better version was what we called the dedicated model, where everybody had their own dedicated version of the database, and they could make changes in isolation and then commit those back to the repository for the source code, and maybe push those to a central system, and deploy those maybe to an environment.
But okay, back to Git as the third generation of source code management systems. There are also some small things which deliver some value.
For example, you can easily check diff before you push.
So you're already committed.
You can review your own changes.
And with additions in the form of GitHub or GitLab, you can also have a very, very powerful, like, review process: pull requests, merge requests.
So there are multiple levels of quality control,
I would say, right?
And this is great.
And also forking.
So if you have open source project,
it's easy to just fully clone it, fork it,
and develop.
And if you see some value, you can push changes back: propose, suggest these changes to the upstream project, in the form of a pull request or merge request.
But the benefits are not in the same area, they are in different areas. But all of them allow you to build a very powerful and very good workflow. And there are many workflows actually, not one, which are
proven to be successful in various teams. But back to databases, we know that currently,
I would say we also have, like, some generation of schema change management tools. Some people name them... I believe it came from Ruby, I'm not sure.
Database migration tools.
Yeah. Right. Because Ruby on Rails Active Record has a part of it called database migrations. And it's not migrating from Oracle or to Postgres, it's about schema changes, basically.
Yeah. Migrating from one version of your schema to another version of your schema.
And this is really important. This comes back to your diff point from earlier. In code, in
applications, we can diff simple text files, and we can deploy it by overwriting one with the other.
And that's fine, because there's no data. Whereas in databases, we can't just replace. If we rename a column, for example, we can't just delete that column and add a new one with the new name. We need to be aware of the data.
Yeah, of course.
So that's, that's a very simple example, but there are loads of these examples.
And I think this is where at Redgate, we had a huge advantage because we had a schema comparison tool that did these diffs.
That was the starting point.
Is this schema comparison, does it happen without context?
Like what happens with this example?
This is where it gets really complicated.
And I think it's mostly unviable in Postgres at the moment.
I don't think anybody's doing this in Postgres at the moment.
You need to have a live database, source code files. So these files are create statements,
a create statement per object, and you need to be able to do comparisons. But of course,
if you're comparing two static files, how do you know if it's a rename or dropping one column and
creating another one? Through static files, you can't't tell so you need to add migrations or what rails calls migrations at some level so they ended up
kind of backing into that and ended up also acquiring a tool called flyaway db because
that was very migration based so i think just just quickly i wanted to say i think there are two
two completely different models here one of which is kind of nice in principle, but I don't think works in practice, at least not yet,
which is the static create statements.
And the other one, which is a single dump of your schema from a point in time,
and then migrations between versions thereafter,
which is what all of the frameworks like Rails and Django are doing, and the other tools do as well.
Right.
And if you have some live database,
even if it's not production,
you can distinguish drop create versus alter column
if you check system catalogs.
Object IDs.
Yeah, exactly.
Yeah.
Or in Postgres, pg_attribute has attnum, so like a number of where the column is, and you can say: oh, this is the same column, just renamed. So you can distinguish this case versus it being a completely new column.
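For illustration, a minimal sketch of how a diff tool could use this (the users table is hypothetical):

```sql
-- attnum stays stable across a column rename, so comparing snapshots of
-- pg_attribute can distinguish RENAME COLUMN from DROP plus ADD.
SELECT a.attnum,
       a.attname,
       a.atttypid::regtype AS data_type
FROM pg_attribute a
WHERE a.attrelid = 'users'::regclass  -- hypothetical table
  AND a.attnum > 0                    -- skip system columns
  AND NOT a.attisdropped
ORDER BY a.attnum;
-- Same attnum with a new attname: a rename.
-- A new attnum with no prior match: a genuinely new column.
```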
Interesting. So you mentioned Flyway. I know Redgate purchased this project a few years ago. There are other projects, like
Sqitch. I like Sqitch because it's
quite powerful.
Although it's written in Perl, so
it's hard for me to contribute.
It's very powerful and works with many
databases as well. Also
there is a very popular Liquibase.
Liquibase is very...
Django also, being a framework,
Django has database migration capabilities similar to Rails.
But all of them, correct me if I'm wrong,
all of them live on top of Git. They rely on Git for version
control, right? Okay. So what's their value?
Why do we need them at all? Is it to do with
deployments? Is it to do with actually applying the changes, picking them up, and tying them to code changes as well?
For example, if I have database change that needs to go alongside an application change, that's my understanding.
Yeah.
For example, if we talk about...
Well, obviously, Rails and Django, being frameworks, they are also ORMs, so the migration parts of those ORMs also support, like, a higher-level language, not SQL, right? So we can change the schema also using our native language, trying to avoid SQL.
But for quite advanced changes, you still
need SQL. For example,
if you want a trigger, probably you need to write some PL/pgSQL code and some SQL code and so on. If you need some advanced CTE, or to define some view with an advanced CTE, you need some SQL. But generally,
they try to also support
native language like Ruby or Python.
And also, for example, in Rails, I see that people start... I observed it in multiple projects, and GitLab as well. They start with not just dumping the schema: they support steps, and also dumping schemas for additional control, like, this is our current baseline, and we shifted. But usually projects start with schema.rb, I think, in Rails, and then they switch to structure.sql, which is a regular SQL dump, and it's easier for Postgres guys to compare it and deal with it.
I remember GitLab did it when they decided to stop supporting MySQL. Before, they supported both MySQL and Postgres, and it was not possible to have structure.sql, which should be single-database only. But my question to these systems is: okay, you usually provide forward and backward; you can define forward and backward steps. Sometimes you can also define a testing step. They allow you to deploy. They allow you sometimes to have some dry deploy, which doesn't do anything but checks the correctness of your changes.
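As a rough sketch of those steps, here is what a forward/backward/verify trio can look like in a Sqitch-style layout (file names and the orders table are hypothetical):

```sql
-- deploy/add_status.sql  (forward step)
ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'new';

-- revert/add_status.sql  (backward step)
ALTER TABLE orders DROP COLUMN status;

-- verify/add_status.sql  (test step: errors out if the column is missing)
SELECT status FROM orders WHERE false;
```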
But why don't you care about database data? A database is data. It's not tiny, unlike source code. Sometimes we have millions of lines of code, of course, but usually it's relatively small compared to a database. But a database is usually big. And if you need to change something without blocking others, you need to have some DML as well, not only DDL. And this DML should be split into batches.
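A minimal sketch of such batching, assuming a hypothetical orders table where a new bigint column is backfilled from an old integer one (a real tool would also tune batch size, pause between batches, and retry):

```sql
-- Each batch commits separately, so row locks are held only briefly.
-- Transaction control inside DO blocks requires Postgres 11+.
DO $$
DECLARE
  rows_updated bigint;
BEGIN
  LOOP
    UPDATE orders
       SET id_new = id
     WHERE id IN (SELECT id
                    FROM orders
                   WHERE id_new IS NULL
                   LIMIT 1000);  -- batch size to adjust under real load
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    COMMIT;  -- release locks between batches
  END LOOP;
END $$;
```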
Why don't these old tools support it at all?
I think it's difficult. And I think, if we talked about those three stages of application source control progress, I think in database land we're kind of in step one still. I think the tools are still quite immature for supporting us.
Generation one.
Maybe, yeah.
Maybe that's a bit harsh. Maybe we're just sneaking into number two
with some of these tools becoming better. But if you're not on Rails or Django, I think it's actually quite difficult to get this stuff up and running. So, one issue, for example, even before you worry about online changes with big data, you know, actually worrying about zero-downtime deployments, even before then: what about merging? What about if I make some changes on a branch and add a migration, and you make some changes on a branch and add a migration, but we want to deploy those together?
Conflict resolution, it's called.
Yeah, exactly.
Conflict resolution is difficult.
Yeah, so two questions we raised: zero-downtime migrations, for example, and actually basic diff capability. If we rely only on Git, it will be very weak. For schema, we need diff, and not such as pgAdmin and others have, very naive, or migra or others. They have a very naive diff. For example, they say CREATE INDEX, not mentioning CONCURRENTLY. So if we have a substantially big and substantially loaded database, we will block everyone. And only tiny projects can afford to use such an approach. So CREATE INDEX should be CONCURRENTLY.
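To make the difference concrete, a small sketch (index and table names are hypothetical):

```sql
-- What a naive diff tool emits: the build blocks all writes to the table.
CREATE INDEX idx_orders_created_at ON orders (created_at);

-- What a loaded database needs: builds without blocking writes.
-- Note it cannot run inside a transaction block, which is one reason
-- naive migration tools avoid emitting it.
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders (created_at);
```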
ALTER TABLE, changing a column data type, also. This is harder. This should be in batches. So: diff; you say conflict resolution; I also say deployment in accordance with this zero-downtime approach. So we need batch management, background workers, sometimes lasting days. If you need, for example, to change a data type of a column in a table which has a billion rows, you need to be very patient, wait one or two days, and split it into batches. Batch size should be properly adjusted. So they don't care about it. Why? Hard problems? Or am I mistaken and it's not needed at all?
I think it's a hard problem personally, I think.
Also, three problems already.
Already three problems, yeah, right. And I've actually been... so I've got a friend working in this area, somebody I used to work with, called Neil Anderson, working on a tool called PostgresCompare. And I've been trying to encourage him because, as you say, the diff tools that have been added to most IDEs, and the open source ones, aren't super intelligent. They're good, I'm sure a lot of people have put time into them, but he charges, it's a commercial tool, and he's been able to put a lot more time into it. So it's called PostgresCompare.
Zero-downtime migrations? I'm not sure. It's mostly for live schemas?
Yeah, it's mostly for, like, live schema-to-schema comparisons.
And I don't know if they're, like, without data, right?
No, he does do data as well, actually. It's in the same product.
Okay, yeah, but it's kind of a different feature, right?
So yeah, I think data... like, I was going to mention data here actually, because I think data can sometimes be an important part of your schema, which, again, some of these tools forget and neglect.
If you've got a lookup table, for example... what's the famous one? Country codes, or...
Yeah, exactly.
Anything where you need that data.
People put it into migrations. I see it quite often in Flyway, for example: just insert.
Yeah.
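For what it's worth, a sketch of what that typically looks like in a migration (the table is hypothetical, and ON CONFLICT assumes a unique constraint on code, which keeps the migration safe to replay):

```sql
INSERT INTO country_codes (code, name)
VALUES ('DE', 'Germany'),
       ('FR', 'France')
ON CONFLICT (code) DO NOTHING;  -- idempotent reference data
```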
And sometimes some teams consider this a bad practice and try to avoid it, but some teams do it.
I don't know what the alternative is, though,
if that's the migration.
Good question. I don't know either.
So we had to worry about that
because we had create statements in version control.
So if you're trying to do the alternative, which is have a create statement per object and manage diffs, you have to choose which tables need to also go in data-wise.
So I think it's a hard problem and complicated.
So I do have empathy with the people making these products and making these services. I do think databases are fundamentally harder to do this for than applications, but I do also think there's a lot we can do to make them better. The CREATE INDEX CONCURRENTLY one is a perfect example of quite a simple change that probably should be in most of them.
Yeah. Also, you know, I think if somebody developed an
open source tool
which supports zero downtime diff,
I think it would solve the problem
of merging and conflict resolution
because we would see this diff
and this is our merging process.
We just, maybe not.
Actually, if we merge to,
for example, main branch,
it's our deployment to production.
What do you think?
Yeah, I think merging is trickier,
but we might be able to lean on Git for some of that.
So, for example, if we're changing the same function,
the function is going to look very similar between our two.
But if we're altering a table,
maybe we can apply one after the other,
but I'm not sure when order matters.
It's a tricky one.
I haven't thought about this for a few years, to be honest.
If we consider functions and triggers as a part of schema, and I think we should, this also... we probably already mentioned, in previous episodes, the problem of includes in all these tools, current generation 1.5 tools, right. If we include a different file and keep the function body there for convenience, like, for example, each function in its own file, easier to find it, navigate, and so on.
Do we still want to support long sequences of deployments?
For example, we have several installations of our product,
and one is already very up-to-date,
and another one is lagging a few months behind in changes.
And the problem is, when you include, you need the current version of the file. But when your function has already changed many times, and you try to replay an old migration file, it tries to include the very fresh, very new function body, and it's not what was expected, like, a couple of months ago, right? And we have a problem. So you need to somehow fetch... you can use Git history and fetch an older version which corresponds to this migration file. But why does nobody care about this? I needed to implement it myself, and I thought about how to contribute it to Sqitch, and it was quite hard actually. I raised the topic about it in the mailing list, and I didn't see support, and it looked to me like I'm alone with this problem, like nobody cares, right? Maybe I'm really alone. How do people do this, from your practice? Is it a problem, or do people just not need to replay long sequences of changes?
I've seen more people kind of like try migrating forwards.
I actually see quite a few people not even worrying about backwards, like moving backwards.
And it's an interesting one, potentially.
It's not only backwards.
It's also about when you have multiple setups, and one of them is lagging quite a lot.
They need to replay, say,
10 database migrations.
And they fetch the very fresh version
of our code, but this code
already has a much
newer function file.
And when they try
to replay some old
migration which tries to include this file, it expects a very different function or view, for example.
So including different files, it's a problem. You always need to snapshot the whole function body into the migration, which bloats your code, right?
You cannot keep a function in a separate file.
This is the problem with all these
tools.
If we only have migrations from
a baseline, I don't
think this is a problem.
The problem starts when you try to use includes, including some file.
Like with Sqitch, it's very... Sqitch feeds files to psql, so you can use backslash-i (\i).
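A small sketch of why that bites later (the paths are hypothetical):

```sql
-- deploy/2022_06_add_pricing.sql, an old migration that Sqitch feeds to psql:
BEGIN;
\i functions/calculate_price.sql  -- psql reads the file as it exists NOW on disk
COMMIT;
-- Replaying this migration months later pulls in the current function body,
-- not the body as it existed when the migration was written.
```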
Okay.
Yeah, okay.
This is a problem.
And if you use backslash-i and this is, like, a two-month-old migration, this file has already changed, and you need to rewind it. And this is a problem. Otherwise, you need every time to put the whole function body into the migration, and it bloats everything.
It's very huge.
If every time you say CREATE OR REPLACE FUNCTION, or DROP FUNCTION and create it again, and put everything, and revert as well, like, the whole function again, it bloats it so much that I'm starting to ask: do we still have Git? It's like nobody cares that we have Git here. This is what it was created for, right? Why do we copy and paste functions, function versions? Where is version control at all?
This is what I liked about the single create files per object.
But it has such big limitations right now that I don't think it's viable.
So I think we have to accept putting the full object back in. Maybe you could argue it's...
I don't understand.
If you keep an object in a separate file, how do you replay sequences of changes? So, like, you still have versions, right? Like, a first version, then maybe a timestamp, per day or something, as a timestamp. This is a new version, new version, new version.
If you keep an object in a separate file, you need to include it. And if you try to replay a very old migration with a very new state of code, you have a problem.
I think you need to have database-aware diff functionality, and it also probably needs another folder of overrides. So if you have, for example, easy migrations, like adding: if we're always adding columns, adding indexes, adding tables, easy stuff, we don't need any special cases, the diffs are easy. If we do something more tricky, let's say, and I know Git doesn't work like this, but let's say from version 15 to version 16 we need a more complex migration, we can put it in a folder somewhere for going between 15 and 16, which includes going from 14 to 18 or any other way past that boundary. We can say: don't use what you plan to use, use this override instead. But it gets complicated.
That's kind of like a hybrid between the two models we've been discussing.
Well, there's a big difference between objects with data
and very lightweight objects.
Objects with data is tables.
It's called tables.
That's it.
Only tables have data.
If we forget about materialized views, which are derivative from tables, right?
Yeah.
Views, triggers, functions: they are light, they don't have data, so we can always rewrite them, we can throw them away.
Indexes, though?
Well, yeah, but indexes, it's like part of tables, probably.
Okay, cool, fair enough.
They cannot live without tables. It's like an add-on to tables.
But functions, triggers, and views, regular views.
Maybe we can trust views as well, because they have only derivative data.
You always can redefine them, and you can keep them in files.
That's why.
Well, that's a really interesting point.
I haven't seen a system that separates those two. I built it a couple of times.
Okay, cool.
So you had one system for tables. Remember we
discussed the approach
when functions are kept in separate
schemas and you can keep the old
version of... This is smart and good.
And I think it's possible to build it without copy-paste still, like, to rely on Git for version control and have, like, schema functions version 1, functions version 2, and be able to keep several versions at the same time. And different application nodes can address different versions of functions. This is very smart. And this is exactly when we probably need to keep functions in separate files as well.
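A minimal sketch of that idea, with hypothetical names: each release gets its own schema, and application nodes pick a version via search_path.

```sql
CREATE SCHEMA functions_v1;
CREATE FUNCTION functions_v1.get_price(p_order_id bigint)
  RETURNS numeric LANGUAGE sql AS $$ SELECT 10.0 $$;

CREATE SCHEMA functions_v2;
CREATE FUNCTION functions_v2.get_price(p_order_id bigint)
  RETURNS numeric LANGUAGE sql AS $$ SELECT 12.5 $$;

-- Old and new application nodes run side by side, each resolving
-- get_price() through its own search_path during a rolling deploy:
SET search_path = functions_v2, public;
SELECT get_price(1);
```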
Right. And I should mention that this idea came to the mind of Valentine Gogichashvili long, long ago, from Zalando, and then propagated to Russia, to Avito and others, and so on. So it's a great idea, still relevant, I think.
I'm not sure about triggers, because triggers consist of a trigger function plus the trigger itself. So the trigger itself is a bit easier; it also requires things... Oh, by the way, big question number four, or five, probably: why do all these guys not care about the simple problem of locking attempts? If you try to ALTER TABLE, you always must have a small lock_timeout and be ready to retry.
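The pattern being described looks roughly like this (the table is hypothetical):

```sql
-- Wait at most 100 ms for the ACCESS EXCLUSIVE lock, instead of queueing
-- behind a long transaction and blocking every query that arrives after us.
SET lock_timeout = '100ms';
ALTER TABLE orders ADD COLUMN note text;
-- If this fails with "canceling statement due to lock timeout",
-- the migration tool should sleep briefly and retry, not abort the release.
```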
Otherwise, well, you don't notice the problem if you don't have a big load, not high load, right? Not heavy load. But if you have noticeable load, like 1,000 TPS, sooner or later you will bump into this problem very badly, as many did already. So you need to be graceful. Otherwise, you try to alter, somebody holds locks, and you cannot acquire a lock, and you have a huge line, including SELECTs, waiting after you. So you block everyone. They don't care about it. Why?
Well, I guess...
Because it's generation 1.5.
Generation 1.5.
It's a really good point.
But that also lives in a slightly different place, right?
That lives in the config file.
And that's not necessarily...
Is that in version control?
Hopefully, but it's not in the same version control for most people.
So it's deployed in a different way, updated in a different way.
So it's very, very interesting.
Yeah, really good point.
Maybe that should be in there as well.
I hope they will develop in the right directions.
People need these directions.
People need zero downtime migrations,
conflict resolution, graceful deployments, right?
And by the way, I want to advertise: GitLab has open-sourced their background migrations. It's a file called migration_helpers.rb. So it's open source, anyone can check it. And if you are a Ruby developer, you can benefit from learning from their experience. An enormous volume of experience is already coded there to support zero-downtime migrations.
Also, actually, GitLab has many installations and some of them may lag and need to...
This is exactly the problem I described.
I'm not sure if they keep some objects in separate files, probably not, but they have
this problem.
Deployments should be without issues on many, many setups.
So it's interesting to check.
I hope that one day they will open source it separately, as a separate gem or library. That would be great for Ruby developers.
That'd be really cool.
There's a really good blog post
about zero downtime migrations
from the team at Braintree.
That's excellent.
I think I'll include that as well.
Don't forget to include my lock_timeout post about this retries problem.
It's a big one.
And yeah, I would love to hear from anybody listening to this that's screaming at us and saying that you've got this solved or you've got a really good process on your side.
It would be great to hear how people are doing.
Yeah, maybe we are wrong.
This episode is full of criticism, but it's an opportunity to improve our tooling.
So maybe someone can say there is some very magic tool which solves all problems, right?
Yeah, let us know. Wonderful. Okay, well, thanks, Nikolay. Thanks everybody for listening. Appreciate everybody that's got in touch recently with feedback, and yeah, looking forward to speaking to you next week.
Great, thank you. Bye.