Postgres FM - Database branching
Episode Date: November 11, 2022

Here are links to a few things we mentioned:

Database Lab Engine
Overview of Aurora cloning
PlanetScale branching
Common DB schema change mistakes (blog post by Nikolay)
Supabase vision
Neon branching
Shift-left testing

------------------------

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:

Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:

Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard, and this is my co-host Nikolai, founder of Postgres.ai.
Hey Nikolai, what are we talking about today?
Hi Michael. Last time we discussed versioning of database schemas, or database migration management; I already forgot the proper name for it. We can continue in this field and discuss database branching today, and how it differs from snapshotting, for example, or from schema version control, because it's an adjacent area which is not yet well developed. We don't have good tools yet, by the way. So we will probably discuss some ideas and concepts, and what to expect in the future in terms of various tooling and what other companies do.
And also what my company does, because it looks like we're going in this direction.
We're developing database branching right now.
Yeah, exactly.
Depending on exactly how you define it, it seems like there are quite a few database companies at the moment talking about branching, but each one means something slightly different as you dig into it. I've seen your conversations on Twitter with a few people, trying to understand what they mean by it and trying to get some definitions down. So I'm looking forward to hearing your thoughts around that. Maybe Database Lab Engine is worth discussing first, in terms of what it does, what you're calling branching, and what some others are. Is that worth doing first?
Yeah, well, I think a discussion of Database Lab Engine as a whole is maybe a separate discussion, because there are many things it can do and many different use cases where it is useful. But briefly, I think it's a good idea. So let me do some overview of the database branching topic. Database Lab Engine, which we at Postgres.ai develop, was born when we needed to experiment very quickly: first of all, to check SQL optimization ideas, not on production, but in some non-production environment where Postgres behaves identically to production.
And we needed to isolate experiments of different people.
And we also needed to iterate: to reset quickly, to throw out bad ideas and switch to new ideas quickly. But when you build indexes for hours, or you've already changed your schema heavily, changing data, sometimes you need to spend many hours converting some column, and then you realize it's a dead end and you need to start from scratch. It's usually quite difficult to get another environment provisioned quickly. So we solved this, originally for optimization, using thin cloning: thin provisioning based on ZFS, either ZFS or LVM, although other options are also possible to implement. Without going into much detail: you can run a single server with dozens of logically independent Postgres instances, where the database is the same everywhere, but it's writable, so you can deviate, you can create your own index.
And the planner behaves exactly as on production.
This is the trick.
And the creation of a new clone takes only a few seconds, regardless of database size.
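To illustrate why clone creation can take seconds regardless of database size, here is a conceptual Python sketch of copy-on-write, the mechanism behind ZFS/LVM thin clones. This is a toy model, not Database Lab Engine's actual implementation: a clone just references the shared base and pays only for the pages it writes.

```python
# Conceptual sketch of copy-on-write thin cloning (a toy model, not
# Database Lab Engine's real code): creating a clone copies no data.
class Base:
    def __init__(self, pages):
        self.pages = pages          # shared, read-only "database"

class Clone:
    def __init__(self, base):
        self.base = base            # no data is copied at creation time
        self.overlay = {}           # only pages this clone has written

    def read(self, page_id):
        # a clone's own writes shadow the shared base
        return self.overlay.get(page_id, self.base.pages[page_id])

    def write(self, page_id, data):
        # divergence costs space; creation does not
        self.overlay[page_id] = data

base = Base({0: "row v1", 1: "row v1"})
a, b = Clone(base), Clone(base)     # two "instant" clones of the same data
a.write(0, "row v2 (a's experiment)")
print(a.read(0))  # a sees its own change
print(b.read(0))  # b still sees the shared base, undisturbed
```

This is why dozens of writable clones can share one machine: the cost of a clone is proportional to what it changes, not to the size of the database.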
Sounds like magic, but this magic is going to kill another magic.
Many people call it black magic: Postgres DBA knowledge and skills are like an area of black magic.
You need to spend 15 years, and then you can quickly say: this will work, this won't work.
And people say, oh, you're like a black magic guy.
Black magic means something is hidden.
Our magic is white magic.
Nothing is hidden, and any developer, any engineer can see the behavior without disturbing others, and experiment, and fail, and so on.
So this is what we did.
Many clones are running on a single machine, so you pay for one machine and have dozens of clones.
And then we moved into the area of testing in CI/CD pipelines.
It's a whole new world as well.
Again, a whole big topic.
What can be tested in CI/CD pipelines in terms of the database, in terms of Postgres?
Our idea is that we finally have a real database there, not a tiny, small one, like one gigabyte of something just generated or, I don't know, pulled via Docker. We have the whole database; it can be 100 gigabytes or terabytes, it doesn't matter for us. We can set it up and make pipelines work, provisioned very quickly, in a few seconds.
So testing is another area.
Some people do various things.
For example, some people just test pg_upgrade inside a container.
It's also possible and so on.
The key here is that we have already been doing this for a couple of years, maybe three.
We always said this is thin cloning; we used the term cloning.
Clone, clone, clone. It's very natural for cloud engineers, for DBAs, for SREs, because cloning is a term used in clouds, right? You can clone your EBS volume, also from snapshots.
So it's copy-on-write there as well.
It's also thin provisioning, but it's kind of different, because you pay for each volume separately. Still, the cloning term is used: there are RDS clones, and Aurora also has thin clones, where you have a single storage but multiple instances running, all of them using the same storage, and you can have multiple write instances, so you can do writes independently. Similar, but again, you need to pay for each compute node separately. That's why neither RDS clones nor Aurora clones are good for testing in CI/CD pipelines, because you want a constant cost. Also, by the way, you need to wait many minutes to provision both of them.
I guess thin cloning is specifically named as opposed to thick cloning, where you take a full copy. That's what a lot of systems have offered for a long time, and this is obviously a step above that in terms of speed, but also in terms of not having to have that extra...
Thick cloning, actually, we could devote a whole episode to it.
And maybe we should, because it's also an interesting question: how to clone a large database using regular tools?
For example, do you just copy the PGDATA directory, or use pg_basebackup at the physical level, right?
How to do it live, without interruptions?
Of course, any experienced Postgres DBA knows the answer: pg_start_backup and pg_stop_backup, or just use pg_basebackup.
By default it will be okay, and you can do it live.
Or you can do it at the logical level, using pg_dump and pg_restore.
There are questions about how to speed it up, and so on.
But roughly, we consider both of them thick cloning, and we distinguish physical and logical levels.
So dump/restore is also cloning, but at the logical level.
But there you can choose which objects to clone, right?
And you can speed it up using -j, but in that case you need additional space, because you cannot use -j and do it on the fly. That problem is solved by another tool Dimitri Fontaine is developing, pgcopydb (maybe I'm wrong about the name of the tool), but it's quite new, and it provides exactly that ability at the logical level: to use multiple threads and avoid intermediate backup files, so you can do it on the fly.
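The parallelism idea behind pg_dump -j and pgcopydb can be sketched like this: copy independent tables concurrently instead of one after another. The sketch below is a hypothetical stand-in (the dicts and copy_table function are illustrative only); real tools stream COPY data directly between source and target connections.

```python
# Sketch of parallel logical copy: independent tables copied concurrently.
# `source`, `target`, and `copy_table` are illustrative stand-ins for real
# database connections streaming COPY ... TO STDOUT into COPY ... FROM STDIN.
from concurrent.futures import ThreadPoolExecutor

source = {"users": ["u1", "u2"], "orders": ["o1"], "events": ["e1", "e2", "e3"]}
target = {}

def copy_table(name):
    # a real tool streams rows here, with no intermediate dump files on disk
    target[name] = list(source[name])
    return name

# one worker per table, up to the pool size
with ThreadPoolExecutor(max_workers=4) as pool:
    done = list(pool.map(copy_table, source))

print(sorted(done))  # all tables copied, in parallel
```

The design point Nikolay raises still applies: the longer the overall copy runs, the longer a consistent snapshot (and its transaction) must be held open on the source.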
That's interesting.
But it also raises the question how long our transactions are on the source.
Many, many things in the area of thick cloning.
I know it very well, because to provision Database Lab Engine, we first need to get the data in a regular way, either logically or physically.
Right.
Like one thick clone that you can base the thin clones off.
And we need also to maintain it either continuously
or to do a full refresh on schedule.
Everything is possible.
Like nightly or something.
Yeah.
But that's where this becomes really useful, I think, for the branching discussion. Because suddenly, if we can do thin clones, or something like them, we get the concept of branches that aren't just empty, that aren't just the schema: they can have real data behind them as well.
So, what happened with the branching term? First of all, I didn't realize it in the past, but now I see it very well.
Cloning is a very infrastructural language.
It's not friendly to developers because in Git, there is Git clone, but it's kind of
different.
You clone whole repository.
There you have revisions, commit numbers, and branches.
And clones, it's a language of SRE people or DBAs, DBREs, infrastructure people.
Of course, any engineer knows cloning in various contexts, but still, they prefer branching. And then, some time ago, PlanetScale, which originally provided sharding for MySQL (Vitess; the founders are the same people who created Vitess), suddenly, from my perspective, added schema management capabilities on top of the sharding problem. And they called it: okay, we now have database branching.
And we have zero-downtime deployments for your schema changes.
Hassle-free.
So, like, great.
It was on their front page maybe last year, in 2021: I saw database branching.
But when I clicked into the documentation, I was curious, because it's our area.
I felt like: okay, do they play with thin clones, with thin provisioning, or what? But inside, you already see "database schema branching", so a slightly different term, right? And you realize that they clone only the schema. Then you can change it, then they produce a diff (one of the topics of our previous episode), and you can see this diff, you can approve it, other people can approve it, so there is some flow, and then it's deployed in a zero-downtime fashion. But nothing about data.
So that raises the question about testing: how do you test on the performance side of things? And how do multiple people work together, that kind of thing?
Right. Well, first of all, this is also a viable approach, I should admit, because zero-downtime migrations are a hard problem. By the way, last time we didn't mention my article, something like "18 mistakes of making schema changes in Postgres"; it's worth mentioning. There are many types of mistakes you can make, but I selected some and discussed them in detail. It's on our website, postgres.ai. So it's a good problem to solve, and actually very hard, because most of the diff tools we see (and maybe I'm wrong) are like the diff tool Liquibase has, which we mentioned.
And somebody in the comments on YouTube raised this, thank you so much.
Liquibase has a diff tool, and pgAdmin has a diff tool.
There are separate projects like migra, but all of them show a simple diff, like CREATE INDEX without the word CONCURRENTLY, not discussing the problem of how to change a data type in a one-billion-row table.
So they solve a hard problem, but there is a harder problem: how to generate a diff in a zero-downtime fashion.
As far as I understand, PlanetScale shows the diff in regular form, but when they apply changes, they perform something similar to the pg_repack approach: you create a full copy of the table, recording all concurrent changes in some delta table, a change log, right? And then, in a single transaction or in multiple transactions (that's also an interesting topic), in steps, you apply all the changes, and then you switch to the new table.
Of course, this approach requires some disk space.
And it's sometimes too heavy for small changes.
It depends.
But it's interesting that they have full automation of it.
But again, they don't care about data in this case.
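The shadow-table rewrite described above can be sketched in a few lines. This is a simplified model of the pg_repack-style approach, not PlanetScale's or pg_repack's actual code: copy the table, log writes that arrive during the copy, replay the log, then swap.

```python
# Minimal sketch of the pg_repack-style table rewrite: copy the table,
# record concurrent changes in a delta log, replay them, then swap names.
table = {1: "a", 2: "b"}           # live table
delta_log = []                     # changes arriving during the copy

# step 1: start copying into a shadow table (with the new schema/layout)
shadow = dict(table)

# step 2: a concurrent write lands while the copy is in flight;
# in pg_repack, a trigger appends it to the change log
table[3] = "c"
delta_log.append(("insert", 3, "c"))

# step 3: replay the delta log onto the shadow table
for op, key, value in delta_log:
    if op == "insert":
        shadow[key] = value
    elif op == "delete":
        shadow.pop(key, None)

# step 4: atomic swap: the shadow becomes the live table
table = shadow
print(table)  # {1: 'a', 2: 'b', 3: 'c'}
```

The disk-space cost Nikolay mentions is visible here: until the swap, both the original table and the shadow copy exist at once.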
But their CEO said in a Twitter discussion that they are working on data branching.
Nice.
I'm very curious how they will solve the terminology problem. You know, the two biggest problems in computer science, right? Naming and cache invalidation.
And off-by-one errors.
Yeah, yeah. So obviously they have a naming issue, because they already use database branching for schema-only branching.
Then, earlier this year, I saw that Supabase have branching on their roadmap, and Neon also appeared.
And Neon said: we are an open-source Aurora, right?
Yeah.
Aurora has thin cloning, which is, in my opinion, not good for testing.
It's too expensive and it's too slow. It's thin cloning, yes, but you need to wait minutes, and you need to pay for compute for each clone separately.
If you use big-O notation in terms of money, it's O(number of clones) for compute cost. Fortunately not for storage, but of course, Aurora also charges you for I/O, which is also not very pleasant for testing. But I guess big enterprises are okay with this, and it's better than nothing, right? It's better than not having it.
Yeah, you can test some heavy changes this way, but this is not something you will use for each pull request or merge request.
Good point.
For all code changes, it's too much, right?
But also, by the way, observing this area, I found that heavy clones, where we can use all the CPUs and so on, are needed only infrequently, by infrastructure teams: for example for upgrades, or big migrations to new hardware or a new operating system. But developers these days make changes many times per day sometimes, right? So it's very frequent, and there we need very, very cheap and fast cloning. Okay, back to database branching. So Neon, in the very beginning, said: we are also going to be a very good database for CI/CD pipelines, and we have database branching.
Not discussing what it means, actually, like in detail.
How is it different from cloning, for example, or snapshotting, these infrastructure terms?
And someone else also said, some other projects also said,
we have Git-like approach for databases, for Postgres.
And then I spent some time trying to realize how branching could behave for database and for Postgres to solve problems of development
and testing.
And finally, I realized that branches are very different from our clones because clones,
they take some memory.
They consume memory.
For example, in Git, I can have thousands of branches and I don't pay extra for them. Okay, some small storage penalty, but that's it, right? And if all branches are identical, I don't pay at all. Nothing, right? But when you run a thin clone in Database Lab Engine, it consumes some memory, because it has shared buffers allocated. So it has something, like Postgres running.
So it has some cost.
We have some limit, of course, defined by the size of the memory on the server and shared_buffers, for example.
All right, so we can adjust and run more clones, but still, we have some limit.
For branches, we don't want to have a limit, right?
This is one thing.
Of course, the name is also a thing.
But there's another thing that is very good in Git; we discussed it. Git is decentralized, and one of its properties is that it allows several stages of review.
You can review yourself before you push your commits.
You can ask your colleagues to review, if you have a pull request or merge request in GitHub or GitLab, to see the difference between branches before you merge your development branch into your main branch.
But if we say our clones are branches, we cannot do this, and we want to. We want to say: this is our state, and then have multiple colleagues or multiple CI pipelines test it, check it, and continue working with it, for example.
So obviously, we realized we need it; it was on our roadmap for a while. But we realized that branching looks like snapshotting on demand for your clones. So you have a clone, you change something, you make a snapshot via API, CLI, or UI; we have all of them. And then you say: okay, this is a snapshot, this is a commit, continue working with it.
So snapshots started to look like branching, right? Though the name was still snapshot.
And then, you know what I did? I just opened the Git documentation and started to read it from scratch.
And by the way, they have slightly conflicting definitions of branching.
There is no good definition right at the beginning; you need to go deeper.
But I found a good definition: a branch is a named pointer to a commit.
Well, there are issues with this term, but it kind of works for us too.
We say: okay, we have a named pointer to a commit.
It can shift: if a new commit is created, the pointer shifts automatically on that branch.
And that's it.
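The "branch as a named pointer to a commit" definition can be made concrete with a small sketch. This is a conceptual model only (not Database Lab Engine's API): snapshots are immutable, and committing on a branch just moves that branch's pointer to the newest snapshot.

```python
# Sketch of "a branch is a named pointer to a commit": committing moves the
# pointer automatically; snapshots themselves are immutable and shared.
class Repo:
    def __init__(self):
        self.snapshots = []   # immutable snapshots ("commits")
        self.branches = {}    # branch name -> index of its latest snapshot

    def commit(self, branch, state):
        self.snapshots.append(state)
        self.branches[branch] = len(self.snapshots) - 1  # pointer shifts

    def head(self, branch):
        return self.snapshots[self.branches[branch]]

repo = Repo()
repo.commit("main", "schema v1")
repo.commit("dev", "schema v1 + new index")              # dev deviates
repo.commit("dev", "schema v1 + new index + new column") # dev pointer shifts
print(repo.head("main"))  # main is untouched: schema v1
print(repo.head("dev"))   # dev points at its latest snapshot
```

Unlike a running clone, a branch here costs nothing while idle: it is just a name and an index, which is why you can have thousands of them.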
And we have already developed a prototype.
I think it will be Database Lab Engine 4.0 when it's released.
We are not in a hurry; we want everything to work very smoothly and be tested properly by many teams.
But we already have a prototype.
It works at the CLI level so far, not in the UI.
So maybe, if you're listening a month or so later, we should already have the UI as well.
So you say: I want a branch. You start to deviate: you run cloning for this branch, and others can run their clones too. A clone is like your working directory: you just grabbed the content of your code base, opened some IDE or editor, and started to change it. So a clone is a means to change the state.
Well, it's like a running application, isn't it? The source code isn't like a running application.
Right. But usually a running application doesn't change the schema; I usually avoid that. Changing the schema should happen during deployment, not during normal operation. Some people do it, actually: some evolution of the schema initiated by your users.
Also, temporary tables are a part of it.
But I consider it a very questionable practice, leading to management issues from a DBA point of view.
So I would say a normal running application should not change your schema.
You should try to avoid it. But a developer opens their editor, changes it, and then commits: git commit, git push, right?
What I meant, more, is that I really like the snapshot analogy. I think the code at a specific commit is kind of like a snapshot of the application, but it's not the application running. And in the same way, a clone is a running database, right? One that you can create from a snapshot, maybe, or from... I don't know quite how you're defining these things. But we don't need them running all the time, just like a developer doesn't need a version of the application running all the time. Just while we're debugging something, just while we're actually trying to test it.
So yeah, it makes a load of sense in theory.
Yeah, I wanted to emphasize also that we consider these snapshots as a whole, with data.
It can be production data, if you can afford it, if there are no issues with PII, GDPR, and others.
But we focus on schema changes.
So we also snapshot the data, and we provide it to clones, branches, snapshots, and so on.
But what is most meaningful is schema changes, right?
Because these should be deployed.
Sometimes, of course, data should also be deployed, but we want a Git-like approach applied to schema only. We don't want to do a data comparison and then deploy that. Maybe we will want it as well, because we have the data here, right? We can do something here as well.
But so far, looking at the problems we have, we just want to mirror the capabilities of Git and bring branches, database branches, to effortlessly build non-production environments matching your code. So you have a development branch of the code, and you have a development branch of the database. You can quickly take this code somewhere, on your laptop or on some non-production machine in the cloud, and then request a clone for this branch; the latest snapshot in this branch will be used, and you have Postgres running. They go together, and you can start testing and developing your application and see how it works with a lot of data, similar to production. This is what we do. If you make some data changes, we think they are not so relevant, because some tester can do many weird things with the data there, and then we just need to throw it away. So when we commit, we commit fully, but we look mostly at the schema.
And for the schema, we rely on people using one of the tools we criticized last week: Sqitch, Liquibase, Flyway, Rails migrations.
We see that people already use them, so we are not going to solve that part; there is already something to help there.
We discussed the problems they have.
A new generation, I'm sure, will be born in the near future.
But what's really not solved is how to test all of this properly, with a lot of data.
This is where we have this branching.
So I just described it, maybe not very well, because this concept is still not very precise.
It's already clear, but not super clear.
But what I'm trying to do here is define what database branching is.
And this is how we see it.
We're already developing in this direction.
I'm curious what other companies think, actually.
But I think it would be good to synchronize thoughts
and to move in similar directions
because in this case, everyone wins
and developers have similar concepts
in various products, right?
Actually, in source control management systems, snapshots, clones, branches, and so on also have different meanings. If you compare them in detail, there are differences. So it will probably happen here as well: database branching can have different meanings in different tools, obviously, right? Sorry, I'm talking too much.
No, this is great.
And I think you're working on a document, right?
Are you going to share that on Twitter when you're ready?
Yeah, I have a draft RFC in this area, discussing goals and anti-goals.
Because, for example: do we want to deal with data, diff data, and so on? We focus mostly on schema changes, because production probably has different data, for example.
We don't want to release data patches.
Also, there is an interesting problem.
You created a column...
I'm very sorry; I feel sorry all the time, because we probably have a delay, and you were trying to interrupt me but I had already moved on. You wanted to ask something?
Sorry, I was just going to add, on the data front (I think I mentioned it last week): when I was doing this before, we found that sometimes data is schema, like lookup tables.
So you might need to worry about that.
Yeah, exactly.
You might have to worry about that at some point, but it wasn't there at first; I think we added it in version two or something. So you can definitely get away with not having it at first, but it does feel like it is important.
Right. I spent some time trying to think about merges too. A merge is basically a deploy, if you consider main to be what should be present on our production environments. So a merge is: you need to have a diff, and then you need to go with this diff to production and roll it out.
But a diff is slightly different, isn't it? A diff is: compare these two things. But then I need a script in order to actually make one of them the same as the other, and that's not the same as a diff.
Yeah, I agree. So the first step is a diff, and the second step is a script. Well, a diff can be seen as a series of DDL commands; in this case, it's the same. But the problem is that diff tools for Postgres schemas show ALTER commands, CREATE INDEX commands.
Yeah, most of them. But I consider that a second feature, right?
The first feature is compare these two schemas,
and they highlight differences.
And then the second is...
Yeah, yeah.
So we have two approaches to diff.
I agree.
But the problem with deployment is that if you run a plain CREATE INDEX command on production, you will block people. So you need to have an advanced diff.
We spent some time prototyping this as well, and then I realized it's already kind of solved. Not solved well, but we have a zoo of various tools for deployment management, like Sqitch, Ruby on Rails database migrations, and so on. And they all ignore the fact that data should be changed in batches. Again, I'm advertising GitLab's migration helpers, which solve this very well for Ruby.
And it's hard for us: either we need to choose something, or we need to somehow abstract it.
And then I realized: okay, people already solve all this somehow.
Let's just avoid this problem.
Currently, I consider this an anti-goal, and we just take care of conflicts. So if someone has already changed the schema in the branch you're trying to change, then, in other words, we have something like CVS or Subversion, very centralized: before you push your changes, you need to update, and then you can push, resolving conflicts and maybe replaying your changes on top of the other person's changes.
So we just take care of conflicts in quite a simple way, and we don't fully solve the merge problem, maybe postponing it.
But among our goals, for example: imagine you created a column which is empty. Everything else is filled (you have a full database), but this one column is new and empty. How do you test it? You need something in there to test it. So it looks like we need to think about the ability to provide some... I don't like fixtures and seed databases with fake data, but here we need it.
We need to fill new columns and new tables, and developers should decide how to do it.
We should provide some means for testing.
So we have everything, but somehow we need to fill the new column.
Yeah, like data generation.
Right.
Maybe snapshot it and say: this is our test data.
We have a deviation from our production, from the main branch, but we already have test data.
It's good.
Other engineers can work with it, test it, play with it, and explore how this feature behaves with many rows, right?
Yeah.
Feels like a whole other topic.
Yeah, well, testing is a whole other topic, definitely.
There are major areas we can discuss there as well.
So what excites me here is that in any direction I go, I think: how come this is still not developed?
How can we live without it?
Well, I see how we live: we test on production.
Every time I have a question like this, it's about this not very beautiful term called shift-left testing: we want developers to test first.
Testing should be done at the very beginning; it should be shifted to the very left in that infinite DevOps sign, you know, the infinity sign.
Yeah, I know what you mean.
But even if it's not done in production, it's often done in a thick clone of production.
Staging is often very different from production, so we end up testing on production.
I can see what you mean.
Really testing, really testing.
We sometimes pretend, with marked checkboxes, that it was tested in lower environments, in staging, everywhere.
But if you think about whether that testing was real, it was fake testing.
And this is what we want to fix in our development processes.
Yeah, that'd be great.
Any last thoughts or things you wanted to share with people?
Well, just keep an eye on what we are doing.
Stay tuned.
And any feedback, any ideas are welcome.
I'm always ready to discuss this topic with everyone.
I think it's one of the hottest topics in the area of databases right now. I mean, many problems are being solved (Kubernetes and so on), but this problem needs to be solved as well. And the majority of development teams will benefit immediately from better, new-generation tooling to build non-production environments. So I think we spend too much time thinking about production, but to solve problems on production, we need to start from non-production.
And this is interesting.
So I'm ready to talk with everyone; just reach out to me on Twitter, email, anywhere.
And our regular mantra,
thank you for feedback, everyone,
for subscriptions, likes, topic ideas.
I think next time we should again choose one of the topics proposed by our audience.
The ideas are very much appreciated.
And that's it.
Please share in your social networks, work Slacks, Discord, Mastodon, right?
Yeah, of course, absolutely. Well, thank you, Nikolai. Thanks, everybody. Take care.
Thank you, Michael. Bye.