Postgres FM - Database branching
Episode Date: November 11, 2022

Here are links to a few things we mentioned:

Database Lab Engine
Overview of Aurora cloning
PlanetScale branching
Common DB schema change mistakes (blog post by Nikolay)
Supabase vision
Neon branching
Shift-left testing

------------------------

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:

Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:

Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard, and this is my co-host Nikolai, founder of Postgres.ai.
Hey Nikolai, what are we talking about today?
Hi Michael. Last time we discussed versioning of database schemas, or database migration management; I already forgot the proper name for it. We can continue in this field and discuss database branching today, and how it differs from snapshotting, for example, or from schema version control, because it's an adjacent area which is not yet well developed. We don't have good tools yet, by the way. So we will probably discuss some ideas and concepts, and what to expect in the future in terms of various tooling and what other companies do.
And also what my company does, because it looks like we're going in this direction.
We're developing database branching right now.
Yeah, exactly.
Depending on exactly how you define it, it seems like there are quite a few database companies at the moment talking about branching, but each one means something slightly different as you dig into it. I've seen your conversations on Twitter with a few people, trying to understand what they mean by it and trying to get some definitions down. So I'm looking forward to hearing your thoughts around that. Maybe Database Lab Engine is worth discussing first, in terms of what it does, what you're calling branching, and what some others are. Is that worth doing first?
Yeah, well, I think a discussion of Database Lab Engine as a whole is maybe a separate discussion, because there are many things it can do and many different use cases where it is useful. But briefly, I think it's a good idea. So let me do some overview of the database branching topic. Database Lab Engine, which we at Postgres.ai develop, was born when we needed to experiment very quickly: first of all, to check SQL optimization ideas, not on production, but in some non-production environment where Postgres behaves identically to production.
And we needed to isolate experiments of different people.
And we also needed to iterate: to reset quickly, to throw out bad ideas and switch to new ideas quickly. But when you build indexes for hours, or you've already changed your schema heavily, changing data, sometimes you need to spend many hours converting some column, and then you realize it's a dead end and you need to start from scratch. It's usually quite difficult to get another environment provisioned quickly. So we solved this, originally for optimization, using thin cloning: thin provisioning based on ZFS, either ZFS or LVM, although other options are also possible to implement. Without going into much detail: you can run a single server with dozens of logically independent Postgres instances, where the database is the same everywhere, but it's writable, so you can deviate, you can create your own index.
And the planner behaves exactly as on production.
This is the trick.
And the creation of a new clone takes only a few seconds, regardless of database size.
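To illustrate why clone creation can take seconds regardless of database size, here is a conceptual Python sketch of copy-on-write, the mechanism behind ZFS/LVM thin clones. This is a toy model, not Database Lab Engine's actual implementation: a clone just references the shared base and pays only for the pages it writes.

```python
# Conceptual sketch of copy-on-write thin cloning (a toy model, not
# Database Lab Engine's real code): creating a clone copies no data.
class Base:
    def __init__(self, pages):
        self.pages = pages          # shared, read-only "database"

class Clone:
    def __init__(self, base):
        self.base = base            # no data is copied at creation time
        self.overlay = {}           # only pages this clone has written

    def read(self, page_id):
        # a clone's own writes shadow the shared base
        return self.overlay.get(page_id, self.base.pages[page_id])

    def write(self, page_id, data):
        # divergence costs space; creation does not
        self.overlay[page_id] = data

base = Base({0: "row v1", 1: "row v1"})
a, b = Clone(base), Clone(base)     # two "instant" clones of the same data
a.write(0, "row v2 (a's experiment)")
print(a.read(0))  # a sees its own change
print(b.read(0))  # b still sees the shared base, undisturbed
```

This is why dozens of writable clones can share one machine: the cost of a clone is proportional to what it changes, not to the size of the database.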
Sounds like magic, but this magic is going to kill another magic.
Many people call it black magic: Postgres DBA knowledge and skills are like an area of black magic.
You need to spend 15 years, and then you can quickly say: this will work, this won't work.
And people say, oh, you're like a black magic guy.
Black magic means something is hidden.
Our magic is white magic.
Nothing is hidden, and any developer, any engineer can see the behavior without disturbing others, and experiment, and fail, and so on.
So this is what we did.
Many clones are running on a single machine, so you pay for one machine and have dozens of clones.
And then we moved into the area of testing in CI/CD pipelines.
It's a whole new world as well.
Again, a whole big topic.
What can be tested in CI/CD pipelines in terms of the database, in terms of Postgres?
Our idea is that we finally have a real database there, not a tiny, small one, like one gigabyte of something just generated or, I don't know, pulled via Docker. We have the whole database; it can be 100 gigabytes or terabytes, it doesn't matter for us. We can set it up and make pipelines work, provisioned very quickly, in a few seconds.
So testing is another area.
Some people do various things.
For example, some people just test pg_upgrade inside a container.
It's also possible and so on.
The key here is that we have already been doing this for a couple of years, maybe three.
We always said this is thin cloning; we used the term cloning.
Clone, clone, clone. It's very natural for cloud engineers, for DBAs, for SREs, because cloning is a term used in clouds, right? You can clone your EBS volume, also from snapshots.
So it's copy-on-write there as well.
It's also thin provisioning, but it's kind of different, because you pay for each volume separately. Still, the cloning term is used: there are RDS clones, and Aurora also has thin clones, where you have a single storage but multiple instances running, all of them using the same storage, and you can have multiple write instances, so you can do writes independently. Similar, but again, you need to pay for each compute node separately. That's why neither RDS clones nor Aurora clones are good for testing in CI/CD pipelines, because you want a constant cost. Also, by the way, you need to wait many minutes to provision both of them.
I guess thin cloning is specifically named as opposed to thick cloning, where you take a full copy. That's what a lot of systems have offered for a long time, and this is obviously a step above that in terms of speed, but also in terms of not having to have that extra...
Thick cloning, actually, we could devote a whole episode to it.
And maybe we should, because it's also an interesting question: how to clone a large database using regular tools?
For example, do you just copy the PGDATA directory, or use pg_basebackup at the physical level, right?
How to do it live, without interruptions?
Of course, any experienced Postgres DBA knows the answer: pg_start_backup and pg_stop_backup, or just use pg_basebackup.
By default it will be okay, and you can do it live.
Or you can do it at the logical level, using pg_dump and pg_restore.
There are questions about how to speed it up, and so on.
But roughly, we consider both of them thick cloning, and we distinguish physical and logical levels.
So dump/restore is also cloning, but at the logical level.
But there you can choose which objects to clone, right?
And you can speed it up using -j, but in that case you need additional space, because you cannot use -j and do it on the fly. That problem is solved by another tool Dimitri Fontaine is developing, pgcopydb (maybe I'm wrong about the name of the tool), but it's quite new, and it provides exactly that ability at the logical level: to use multiple threads and avoid intermediate backup files, so you can do it on the fly.
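The parallelism idea behind pg_dump -j and pgcopydb can be sketched like this: copy independent tables concurrently instead of one after another. The sketch below is a hypothetical stand-in (the dicts and copy_table function are illustrative only); real tools stream COPY data directly between source and target connections.

```python
# Sketch of parallel logical copy: independent tables copied concurrently.
# `source`, `target`, and `copy_table` are illustrative stand-ins for real
# database connections streaming COPY ... TO STDOUT into COPY ... FROM STDIN.
from concurrent.futures import ThreadPoolExecutor

source = {"users": ["u1", "u2"], "orders": ["o1"], "events": ["e1", "e2", "e3"]}
target = {}

def copy_table(name):
    # a real tool streams rows here, with no intermediate dump files on disk
    target[name] = list(source[name])
    return name

# one worker per table, up to the pool size
with ThreadPoolExecutor(max_workers=4) as pool:
    done = list(pool.map(copy_table, source))

print(sorted(done))  # all tables copied, in parallel
```

The design point Nikolay raises still applies: the longer the overall copy runs, the longer a consistent snapshot (and its transaction) must be held open on the source.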
That's interesting.
But it also raises the question how long our transactions are on the source.
Many, many things in the area of thick cloning.
I know it very well, because to provision Database Lab Engine, we first need to get the data in a regular way, either logically or physically.
Right.
Like one thick clone that you can base the thin clones off.
And we need also to maintain it either continuously
or to do a full refresh on schedule.
Everything is possible.
Like nightly or something.
Yeah.
But that's where this becomes really useful, I think, for the branching discussion. Because suddenly, if we can do thin clones, or something like them, we get the concept of branches that aren't just empty, that aren't just the schema: they can have real data behind them as well.
So, what happened with the branching term? First of all, I didn't realize it in the past, but now I see it very well.
Cloning is a very infrastructural language.
It's not friendly to developers because in Git, there is Git clone, but it's kind of
different.
You clone whole repository.
There you have revisions, commit numbers, and branches.
And clones, it's a language of SRE people or DBAs, DBREs, infrastructure people.
Of course, any engineer knows cloning in various contexts, but still, they prefer branching. And then, some time ago, PlanetScale, which originally provided sharding for MySQL (Vitess; the founders are the same people who created Vitess), suddenly, from my perspective, added schema management capabilities on top of the sharding problem. And they called it: okay, we now have database branching.
And we have zero-downtime deployments for your schema changes.
Hassle-free.
So, like, great.
It was on their front page maybe last year, in 2021: I saw database branching.
But when I clicked into the documentation, I was curious, because it's our area.
I felt like: okay, do they play with thin clones, with thin provisioning, or what? But inside, you already see "database schema branching", so a slightly different term, right? And you realize that they clone only the schema. Then you can change it, then they produce a diff (one of the topics of our previous episode), and you can see this diff, you can approve it, other people can approve it, so there is some flow, and then it's deployed in a zero-downtime fashion. But nothing about data.
So that raises the question about testing: how do you test on the performance side of things? And how do multiple people work together, that kind of thing?
Right. Well, first of all, this is also a viable approach, I should admit, because zero-downtime migrations are a hard problem. By the way, last time we didn't mention my article, something like "18 mistakes of making schema changes in Postgres"; it's worth mentioning. There are many types of mistakes you can make, but I selected some and discussed them in detail. It's on our website, postgres.ai. So it's a good problem to solve, and actually very hard, because most of the diff tools we see (and maybe I'm wrong) are like the diff tool Liquibase has, which we mentioned.
And somebody in the comments on YouTube raised this, thank you so much.
Liquibase has a diff tool, and pgAdmin has a diff tool.
There are separate projects like migra, but all of them show a simple diff, like CREATE INDEX without the word CONCURRENTLY, not discussing the problem of how to change a data type in a one-billion-row table.
So they solve a hard problem, but there is a harder problem: how to generate a diff in a zero-downtime fashion.
As far as I understand, PlanetScale shows the diff in regular form, but when they apply changes, they perform something similar to the pg_repack approach: you create a full copy of the table, recording all concurrent changes in some delta table, a change log, right? And then, in a single transaction or in multiple transactions (that's also an interesting topic), in steps, you apply all the changes, and then you switch to the new table.
Of course, this approach requires some disk space.
And it's sometimes too heavy for small changes.
It depends.
But it's interesting that they have full automation of it.
But again, they don't care about data in this case.
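The shadow-table rewrite described above can be sketched in a few lines. This is a simplified model of the pg_repack-style approach, not PlanetScale's or pg_repack's actual code: copy the table, log writes that arrive during the copy, replay the log, then swap.

```python
# Minimal sketch of the pg_repack-style table rewrite: copy the table,
# record concurrent changes in a delta log, replay them, then swap names.
table = {1: "a", 2: "b"}           # live table
delta_log = []                     # changes arriving during the copy

# step 1: start copying into a shadow table (with the new schema/layout)
shadow = dict(table)

# step 2: a concurrent write lands while the copy is in flight;
# in pg_repack, a trigger appends it to the change log
table[3] = "c"
delta_log.append(("insert", 3, "c"))

# step 3: replay the delta log onto the shadow table
for op, key, value in delta_log:
    if op == "insert":
        shadow[key] = value
    elif op == "delete":
        shadow.pop(key, None)

# step 4: atomic swap: the shadow becomes the live table
table = shadow
print(table)  # {1: 'a', 2: 'b', 3: 'c'}
```

The disk-space cost Nikolay mentions is visible here: until the swap, both the original table and the shadow copy exist at once.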
But their CEO said in a Twitter discussion that they are working on data branching.
Nice.
I'm very curious how they will solve the terminology problem. You know, the two biggest problems in computer science, right? Naming and cache invalidation.
And off-by-one errors.
Yeah, yeah. So obviously they have a naming issue, because they already use database branching for schema-only branching.
Then, earlier this year, I saw that Supabase have branching on their roadmap, and Neon also appeared.
And Neon said: we are an open-source Aurora, right?
Yeah.
Aurora has thin cloning, which is, in my opinion, not good for testing.
It's too expensive and it's too slow. It's thin cloning, yes, but you need to wait minutes, and you need to pay for compute for each clone separately.
If you use big-O notation in terms of money, it's O(number of clones) for compute cost. Fortunately not for storage, but of course, Aurora also charges you for I/O, which is also not very pleasant for testing. But I guess big enterprises are okay with this, and it's better than nothing, right? It's better than not having it.
Yeah, you can test some heavy changes this way, but this is not something you will use for each pull request or merge request.
Good point.
For all code changes, it's too much, right?
But also, by the way, observing this area, I found that heavy clones, where we can use all the CPUs and so on, are needed only infrequently, by infrastructure teams: for example for upgrades, or big migrations to new hardware or a new operating system. But developers these days make changes many times per day sometimes, right? So it's very frequent, and there we need very, very cheap and fast cloning. Okay, back to database branching. So Neon, in the very beginning, said: we are also going to be a very good database for CI/CD pipelines, and we have database branching.
Not discussing what it means, actually, like in detail.
How is it different from cloning, for example, or snapshotting, these infrastructure terms?
And someone else also said, some other projects also said,
we have Git-like approach for databases, for Postgres.
And then I spent some time trying to realize how branching could behave for database and for Postgres to solve problems of development
and testing.
And finally, I realized that branches are very different from our clones because clones,
they take some memory.
They consume memory.
For example, in Git, I can have thousands of branches and I don't pay extra for them. Okay, some small storage penalty, but that's it, right? And if all branches are identical, I don't pay at all. Nothing, right? But when you run a thin clone in Database Lab Engine, it consumes some memory, because it has shared buffers allocated. So it has something, like Postgres running.
So it has some cost.
We have some limit, of course, defined by the size of the memory on the server and shared_buffers, for example.
All right, so we can adjust and run more clones, but still, we have some limit.
For branches, we don't want to have a limit, right?
This is one thing.
Of course, the name is also a thing.
But there's another thing that is very good in Git; we discussed it. Git is decentralized, and one of its properties is that it allows several stages of review.
You can review yourself before you push your commits.
You can ask your colleagues to review, if you have a pull request or merge request in GitHub or GitLab, to see the difference between branches before you merge your development branch into your main branch.
But if we say our clones are branches, we cannot do this, and we want to. We want to say: this is our state, and then have multiple colleagues or multiple CI pipelines test it, check it, and continue working with it, for example.
So obviously, we realized we need it; it was on our roadmap for a while. But we realized that branching looks like snapshotting on demand for your clones. So you have a clone, you change something, you make a snapshot via API, CLI, or UI; we have all of them. And then you say: okay, this is a snapshot, this is a commit, continue working with it.
So snapshots started to look like branching, right? Though the name was still snapshot.
And then, you know what I did? I just opened the Git documentation and started to read it from scratch.
And by the way, they have slightly conflicting definitions of branching.
There is no good definition right at the beginning; you need to go deeper.
But I found a good definition: a branch is a named pointer to a commit.
Well, there are issues with this term, but it kind of works for us too.
We say: okay, we have a named pointer to a commit.
It can shift: if a new commit is created, the pointer shifts automatically on that branch.
And that's it.
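The "branch as a named pointer to a commit" definition can be made concrete with a small sketch. This is a conceptual model only (not Database Lab Engine's API): snapshots are immutable, and committing on a branch just moves that branch's pointer to the newest snapshot.

```python
# Sketch of "a branch is a named pointer to a commit": committing moves the
# pointer automatically; snapshots themselves are immutable and shared.
class Repo:
    def __init__(self):
        self.snapshots = []   # immutable snapshots ("commits")
        self.branches = {}    # branch name -> index of its latest snapshot

    def commit(self, branch, state):
        self.snapshots.append(state)
        self.branches[branch] = len(self.snapshots) - 1  # pointer shifts

    def head(self, branch):
        return self.snapshots[self.branches[branch]]

repo = Repo()
repo.commit("main", "schema v1")
repo.commit("dev", "schema v1 + new index")              # dev deviates
repo.commit("dev", "schema v1 + new index + new column") # dev pointer shifts
print(repo.head("main"))  # main is untouched: schema v1
print(repo.head("dev"))   # dev points at its latest snapshot
```

Unlike a running clone, a branch here costs nothing while idle: it is just a name and an index, which is why you can have thousands of them.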
And we have already developed a prototype.
I think it will be Database Lab Engine 4.0 when it's released.
We are not in a hurry; we want everything to work very smoothly and be tested properly by many teams.
But we already have a prototype.
It works at the CLI level so far, not in the UI.
So maybe, if you're listening a month or so later, we should already have the UI as well.
So you say: I want a branch. You start to deviate: you run cloning for this branch, and others can run their clones too. A clone is like your working directory: you just grabbed the content of your code base, opened some IDE or editor, and started to change it. So a clone is a means to change the state.
Well, it's like a running application, isn't it? The source code isn't like a running application.
Right. But usually a running application doesn't change the schema; I usually avoid that. Changing the schema should happen during deployment, not during normal operation. Some people do it, actually: some evolution of the schema initiated by your users.
Also, temporary tables are a part of it.
But I consider it a very questionable practice, leading to management issues from a DBA point of view.
So I would say a normal running application should not change your schema.
You should try to avoid it. But a developer opens their editor, changes it, and then commits: git commit, git push, right?
What I meant, more, is that I really like the snapshot analogy. I think the code at a specific commit is kind of like a snapshot of the application, but it's not the application running. And in the same way, a clone is a running database, right? One that you can create from a snapshot, maybe, or from... I don't know quite how you're defining these things. But we don't need them running all the time, just like a developer doesn't need a version of the application running all the time. Just while we're debugging something, just while we're actually trying to test it.
So yeah, it makes a load of sense in theory.
Yeah, I wanted to emphasize also that we consider these snapshots as a whole, with data.
It can be production data, if you can afford it, if there are no issues with PII, GDPR, and others.
But we focus on schema changes.
So we also snapshot the data, and we provide it to clones, branches, snapshots, and so on.
But what is most meaningful is schema changes, right?
Because these should be deployed.
Sometimes, of course, data should also be deployed, but we want a Git-like approach applied to schema only. We don't want to do a data comparison and then deploy that. Maybe we will want it as well, because we have the data here, right? We can do something here as well.
But so far, looking at the problems we have, we just want to mirror the capabilities of Git and bring branches, database branches, to effortlessly build non-production environments matching your code. So you have a development branch of the code, and you have a development branch of the database. You can quickly take this code somewhere, on your laptop or on some non-production machine in the cloud, and then request a clone for this branch; the latest snapshot in this branch will be used, and you have Postgres running. They go together, and you can start testing and developing your application and see how it works with a lot of data, similar to production. This is what we do. If you make some data changes, we think they are not so relevant, because some tester can do many weird things with the data there, and then we just need to throw it away. So when we commit, we commit fully, but we look mostly at the schema.
And for the schema, we rely on people using one of the tools we criticized last week: Sqitch, Liquibase, Flyway, Rails migrations.
We see that people already use them, so we are not going to solve that part; there is already something to help there.
We discussed the problems they have.
A new generation, I'm sure, will be born in the near future.
But what's really not solved is how to test all of this properly, with a lot of data.
This is where we have this branching.
So I just described it, maybe not very well, because this concept is still not very precise.
It's already clear, but not super clear.
But what I'm trying to do here is define what database branching is.
And this is how we see it.
We're already developing in this direction.
I'm curious what other companies think, actually.
But I think it would be good to synchronize thoughts
and to move in similar directions
because in this case, everyone wins
and developers have similar concepts
in various products, right?
Actually, in source control management systems, snapshots, clones, branches, and so on also have different meanings. If you compare them in detail, there are differences. So it will probably happen here as well: database branching can have different meanings in different tools, obviously, right? Sorry, I'm talking too much.
No, this is great.
And I think you're working on a document, right?
Are you going to share that on Twitter when you're ready?
Yeah, I have a draft RFC in this area, discussing goals and anti-goals.
Because, for example: do we want to deal with data, diff data, and so on? We focus mostly on schema changes, because production probably has different data, for example.
We don't want to release data patches.
Also, there is an interesting problem.
You created a column...
I'm very sorry; I feel sorry all the time, because we probably have a delay, and you were trying to interrupt me but I had already moved on. You wanted to ask something?
Sorry, I was just going to add, on the data front (I think I mentioned it last week): when I was doing this before, we found that sometimes data is schema, like lookup tables.
So you might need to worry about that.
Yeah, exactly.
You might have to worry about that at some point, but it wasn't there at first; I think we added it in version two or something. So you can definitely get away with not having it at first, but it does feel like it is important.
Right. I spent some time trying to think about merges too. A merge is basically a deploy, if you consider main to be what should be present on our production environments. So a merge is: you need to have a diff, and then you need to go with this diff to production and roll it out.
But a diff is slightly different, isn't it? A diff is: compare these two things. But then I need a script in order to actually make one of them the same as the other, and that's not the same as a diff.
Yeah, I agree. So the first step is a diff, and the second step is a script. Well, a diff can be seen as a series of DDL commands; in this case, it's the same. But the problem is that diff tools for Postgres schemas show ALTER commands, CREATE INDEX commands.
Yeah, most of them. But I consider that a second feature, right?
The first feature is compare these two schemas,
and they highlight differences.
And then the second is...
Yeah, yeah.
So we have two approaches to diff.
I agree.
But the problem with deployment is that if you run a plain CREATE INDEX command on production, you will block people. So you need to have an advanced diff.
We spent some time prototyping this as well, and then I realized it's already kind of solved. Not solved well, but we have a zoo of various tools for deployment management, like Sqitch, Ruby on Rails database migrations, and so on. And they all ignore the fact that data should be changed in batches. Again, I'm advertising GitLab's migration helpers, which solve this very well for Ruby.
And it's hard for us: either we need to choose something, or we need to somehow abstract it.
And then I realized: okay, people already solve all this somehow.
Let's just avoid this problem.
Currently, I consider this an anti-goal, and we just take care of conflicts. So if someone has already changed the schema in the branch you're trying to change, then, in other words, we have something like CVS or Subversion, very centralized: before you push your changes, you need to update, and then you can push, resolving conflicts and maybe replaying your changes on top of the other person's changes.
So we just take care of conflicts in quite a simple way, and we don't fully solve the merge problem, maybe postponing it.
But among our goals, for example: imagine you created a column which is empty. Everything else is filled (you have a full database), but this one column is new and empty. How do you test it? You need something in there to test it. So it looks like we need to think about the ability to provide some... I don't like fixtures and seed databases with fake data, but here we need it.
We need to fill new columns and new tables, and developers should decide how to do it.
We should provide some means for testing.
So we have everything, but somehow we need to fill the new column.
Yeah, like data generation.
Right.
Maybe snapshot it and say: this is our test data.
We have a deviation from our production, from the main branch, but we already have test data.
It's good.
Other engineers can work with it, test it, play with it, and explore how this feature behaves with many rows, right?
Yeah.
Feels like a whole other topic.
Yeah, well, testing is a whole other topic, definitely.
There are major areas we can discuss there as well.
So what excites me here is that in any direction I go, I think: how come this is still not developed?
How can we live without it?
Well, I see how we live: we test on production.
Every time I have a question like this, it's about this not very beautiful term called shift-left testing: we want developers to test first.
Testing should be done at the very beginning; it should be shifted to the very left in that infinite DevOps sign, you know, the infinity sign.
Yeah, I know what you mean.
But even if it's not done in production, it's often done in a thick clone of production.
Staging is often very different from production, so we end up testing on production.
I can see what you mean.
Really testing, really testing.
We sometimes pretend, with marked checkboxes, that it was tested in lower environments, in staging, everywhere.
But if you think about whether that testing was real, it was fake testing.
And this is what we want to fix in our development processes.
Yeah, that'd be great.
Any last thoughts or things you wanted to share with people?
Well, just keep an eye on what we are doing.
Stay tuned.
And any feedback, any ideas are welcome.
I'm always ready to discuss this topic with everyone.
I think it's one of the hottest topics in the area of databases right now. I mean, many problems are being solved (Kubernetes and so on), but this problem needs to be solved as well. And the majority of development teams will benefit immediately from better, new-generation tooling to build non-production environments. So I think we spend too much time thinking about production, but to solve problems on production, we need to start from non-production.
And this is interesting.
So I'm ready to talk with everyone; just reach out to me on Twitter, email, anywhere.
And our regular mantra,
thank you for feedback, everyone,
for subscriptions, likes, topic ideas.
I think next time we should again choose one of the topics proposed by our audience.
The ideas are very much appreciated.
And that's it.
Please share in your social networks, work Slacks, Discord, Mastodon, right?
Yeah, of course, absolutely. Well, thank you, Nikolai. Thanks, everybody. Take care.
Thank you, Michael. Bye.