Postgres FM - Blue-green deployments

Episode Date: November 10, 2023

Nikolay and Michael discuss blue-green deployments — specifically an RDS blog post, how similar this is (or not) to what they understand to be blue-green deployments, and how applicable the... methodology might be in the database world more generally.

Here are some links to things they mentioned:

Fully managed Blue/Green Deployment in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL https://aws.amazon.com/blogs/database/new-fully-managed-blue-green-deployment-in-amazon-aurora-postgresql-and-amazon-rds-for-postgresql/
Blue-green deployment (blog post by Martin Fowler) https://martinfowler.com/bliki/BlueGreenDeployment.html
Our episode on logical replication https://postgres.fm/episodes/logical-replication
pgroll https://github.com/xataio/pgroll

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork

Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres.ai. Hello Nikolay, what are we talking about today? Hello Michael, the RDS team just released a blog post about blue-green deployments, and I thought it's a good opportunity to discuss this topic in general, and maybe the RDS implementation in particular, although I haven't used it myself. I've just read the blog post, but I know some issues and problems this topic has. So I thought it's a good moment to discuss those problems.
Starting point is 00:00:37 Yeah, awesome. Even if we look at the basics, I think it's interesting to most people. Everyone has to make changes to their database. Everyone needs to deploy those changes. Most people want to do that in as safe a way as possible with as little downtime as possible. So I think it's a good topic in general to revisit, and it looks interesting.
Right, right. So I think in general, it's a great direction of development of methodologies, technologies, and the ecosystem, like various tools and so on. Because bigger projects need it, and not only the biggest projects, some smaller projects also need it, especially those that change things very often. But before we continue, I would like to split this topic into two subtopics. First is
infrequent changes we do when we, for example, perform a major upgrade of Postgres, or we switch to a new operating system if we have self-managed Postgres, with a glibc version switch, right?
Or, for example, we switch hardware, I don't know, something like big, big changes. Major version upgrade. Right, or maybe we try to enable data checksums. Maybe also this is one of the... Interesting, yeah. It's generally possible with a rolling upgrade approach, when you just change it on one replica and then another, a rolling upgrade. But maybe this idea of blue-green, which came from the stateless part of systems... Originally this idea avoided the topic of databases, but we will discuss it. So this is a big class of changes, which is usually performed by infrastructure teams. And it's not very often, a few times per year usually, right? Versus a very different category of problem,
which is changing our application code, maybe several times per day, trying to react to market needs and our competitors' changes, trying to move forward, like a go-to-market strategy and so on. So continuous deployment, schema changes, various stuff. So, obviously, it's interesting that the original idea described by Martin Fowler is about the second thing, schema changes and so on, like application changes, which is done probably not by the infrastructure team but by the engineering or development team, which is usually bigger in size, and they need changes more often, but each one of those changes is lighter; it's not as heavy as a major Postgres upgrade, right?
But it needs to be done very often and probably in a fully automated fashion, like through CI/CD pipelines, a continuous integration approach, right? So we just change it, a lot of automated testing, and we just approve, merge, and it's already in production, right? So the original idea by Martin Fowler, and I think we need to start discussing it already, right? It's about the second problem, for developers. While what the RDS team developed is for infrastructure teams and major upgrades, a very different class of tasks to solve, right? Do you agree? Yeah, I do. And I'll probably jump the gun a little bit here, but I feel like they might be slightly misusing the phrase blue-green deployments for the
description of this feature. And I really like this feature. If I was on RDS, I think I would use it, especially for major version upgrades. I think it makes that process really simple, with lower downtime than most other options smaller database users have. But yeah, I completely agree that this is not at all appropriate for application teams wanting to roll out new features, add a column to a table, add an index. It just doesn't make sense. Because logical replication doesn't support DDL replication yet, right? That's why this is a full stop. And even if it did, I think the way that this is done wouldn't be appropriate. Here I would argue with you, but let's do it later. Just let me make a mark that I have multiple opinions here, no final opinion. So I have different thoughts.
Let's discuss it slightly later. So, okay, let's talk about the original idea, blue-green. First of all, why such weird naming? It reminds me of red-black trees from algorithms and data structures in computer science, basically. Binary trees, then red-black trees, and so on. So why this name?
You've read about it, right? Yeah, I saw in an old Martin Fowler blog post that I'll link up that they had, I suspect (I didn't actually look at the timelines, but I suspect it was back from when they were consulting, I think probably at ThoughtWorks; that seems to be where a lot of these things have come from), some difficult-to-convince clients. They wanted to increase the deployment frequency, but people were scared of risk, as always, and they had this idea that, well, I mean, it's kind of standard now, but I guess back in the day it wasn't as standard, that staging needed to be as close to production as possible, so that you could do some testing in it and deploy the changes to production in as risk-free a manner as possible. And then they took that a bit further and said, well, what if staging was production, but with only the change we wanted to make different? And instead of making that change on production, we instead switched traffic to what we would previously have called the staging environment. And they talked about naming for this. I don't even know what you'd call it, but methodology, I guess.
And they thought about calling it A/B deployments, which makes a lot of sense, but they didn't want to do it. A/B means we split our traffic, maybe only read-only traffic in the case of databases, and we compare two paths for this traffic. Well, the main objection that Martin had with that naming is that they were scared the client would feel that there's a hierarchy there. And if we talked about there being a problem and we were on the B instance instead of the A instance, the question is, why were you on the B one when the A was available? And I think that's... I'm not sure. I think you're quite right that A/B testing might have already been a loaded term at the time, but it also is a good counterexample, where most people understand that in an A/B test we're not assuming a hierarchy between A and B.
Starting point is 00:07:42 Right. But also the approach, this approach says the second cluster, like secondary cluster, which follows... Okay, I'm thinking about databases only, right? Let's switch. Since we discussed Martin Fowler's ideas, we should talk only about
stateless parts of our system, and databases we should touch on a little, right? So, okay, stateless. For example, we have a lot of application nodes, and some of them are our production; some others are not production. And what I'm trying to say is, it's not only about hierarchy and which is higher, of course. So, yeah, by the way, I remember a similar naming problem. Okay, I'm a database guy. I remember: if you give the hostname "primary" to your primary,
but after failover you don't switch names, it's a stupid idea, because this replica now has the hostname "primary". It's similar here, right? So we need some way to distinguish them, but not to permanently say this is
the main one, because we want them interchangeable, symmetric, right? So we switch there, then we switch back, back and forth, and always one set of nodes is our real production and another is considered like a kind of powerful staging, right? But the key question is not only about hierarchy but how exactly testing is done. In one case, we can consider: this is our staging, and we send only test workload there, which is generated, for example, from our QA test sets, from pipelines. Or we consider this secondary cluster, this secondary node set, as part of production and put, for example, 1% of the whole traffic there.
These are very different testing strategies, right? So, two different strategies. I think in the original idea it was like: it's staging; all production traffic goes to the main node set, blue or green depending on the current state, and that's it, right? So we cannot say it's A/B, because in A/B we need to split 50/50 or 20/80 and then compare. Yeah, sometimes in marketing I've heard people talk about A/B testing, which is concurrently testing two things at the same time. And then sometimes they call what this might be cohort testing; they say, we're going to test this month, the timelines will be different. But if you wanted to switch from blue to green in one go and send all traffic to the new one, that would be considered... it's not A/B, because it's not concurrent, but you might say this cohort is going to this new one. I would say that they're both A/B, in my opinion, because they both use production traffic to test.
Starting point is 00:10:54 So this is exactly, by the way, the idea, we can switch there for one hour, then switch back, and then during next week, study the results, for example, right? It makes sense to me, or next hour, I don't know. I don't really care if it's like concurrent or sequential, but the idea is we use production real traffic. It's a very powerful idea. Not only data or not only applications, application nodes are configured exactly like on production
because they are production sometimes, right? We switch them. But also we use real traffic to test. I think the original idea was that we don't do it; this secondary node set is used as a lower environment. It's still production data, right, or production; it talks to the production database, but we generate traffic ourselves, like special traffic, special workloads under control. This is the idea, the original idea, like we do with staging. But we know this is our final testing. It's very powerful.
Starting point is 00:11:58 It uses the same database, first of all. So we should be careful not to send emails, not to call external APIs and also to convince various auditors that it's fine because they always say if you do production testing, maybe it's not a good idea. Who knows? But it's very powerful testing, right? But it's
Yeah, interesting. And have you heard the phrase "testing in production"? This feels like a... I do it all the time, yeah, and I like it. Yeah, well, it kind of feels like that, partly when we're switching over as well, because as much testing as we've possibly done, most of us with a bit of experience know that you can do all the testing in the world and production is just different. Users are just different; they will use it or break it in ways you didn't imagine, or have access patterns you just didn't imagine. So we kind of are testing, and I think that's one of the big promises of blue-green deployments in the theoretical, or at least in the stateless, world (let's use this term, stateless) is that you can switch back if there's a problem. That feels to me like a real core premise, and why it's so valuable is that if something goes wrong, if you notice an issue really quickly, you can go back to the previous one. It's still alive and there are no ill effects of moving backwards. And I think that's a tricky concept in the database world, but we can get to that later. Yes, and this is exactly it. Let's continue with this. I think we already covered the major parts of the original idea, of the stateless idea. We can switch to stateful ideas.
And this is the first part where the RDS blue-green deployment implementation radically differs from the original stateless ideas. I noticed that from the very beginning of reading the article: they say, this is our blue, this is our green, and they distinguish that. Yeah. It's a different approach. It's not what Martin Fowler talked about.
Starting point is 00:14:08 Very different. So, and obviously, like, reading from this article, obviously the reverse replication is not supported, but it could be supported. It's possible. And actually we already implemented it a couple of times. And I hope soon we will have good materials to be shared but
in general, when you perform switchover, why not create reverse replication and consider the old cluster as a kind of staging now? You're not losing anything. And without this idea, it's a one-way ticket, and this is not an enterprise solution, sorry. Plain and simple, it's not an enterprise solution. It's definitely not. Well, it's not blue-green either, I don't think. Right. But it's an interesting point about scale.
So if I'm just a small business with a small database, and I'm doing a major version upgrade and I want to be able to go backwards, it would be tricky to do with this, I think. Yeah. It's tricky. If you don't want data loss: you can go back, right, but you lose a lot of writes. So, but I can't... let's say it's not a major version upgrade; if it's something like maybe changing a configuration parameter, I could do what Amazon are calling blue to green: change the parameter in the green one, switch to it, and then I can do the same process again, switching it back. But you lose data. New writes will not be replicated back. So let's say I go blue to green, change the parameter on green, switch over.
Now I've realized there's a problem, and I set up a new one, a new green, as they call it, and switch again. Well, okay, okay. In this case, we're dealing with a very basic example, which probably doesn't require such a heavy solution, because, depending on which parameter you want to test, it would be easier to just do that, especially because the second consideration, reading this article, is that downtime is not zero. A restart is not zero downtime, and here it is also not zero. I don't remember if they mentioned issuing an explicit checkpoint to minimize downtime. I think no, right? In fact, that alone is probably enough in your books to say it's not enterprise-ready. And to their credit, they do say "low downtime switchover".
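For context, the explicit checkpoint Nikolay refers to is the usual trick of issuing one right before a planned restart or switchover, so that the shutdown checkpoint has very little left to flush. A minimal sketch, not something taken from the RDS article:

-- Right before the planned restart/switchover:
CHECKPOINT;
-- ...then trigger the restart or switchover immediately, while little dirty data remains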
They're not trying to claim that it is zero downtime. Right. If that is its characteristic, not zero downtime, it means that this solution competes with regular restarts. That's it. So why would I need this to try different parameters? I can do it just with restarts and not losing data,
Starting point is 00:17:19 not paying for extra machines and so on. But for major upgrades, it's a different story. You cannot downgrade, unfortunately. There is no PG downgrade tool. Exactly. So you just need to use reverse logical replication and orchestrate it properly, and it's possible 100%. And this would mean... But not through Amazon right now.
Yeah, it's not implemented, but it's solvable. And I think everyone can implement it. It's not easy; I know a lot of details. It's not easy, but it's definitely doable. What were the tricky parts? Tricky parts are... if you need to deal with... We had a whole episode about logical replication, right?
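As a rough illustration of what such reverse replication could look like once the new cluster has become the primary: a minimal sketch with hypothetical names, assuming writes to the old cluster are already fenced off and both sides are in sync at the switchover point. Real orchestration also has to handle slots, exact LSN positions, sequences, and DDL.

-- On the new primary ("green"), after switchover:
CREATE PUBLICATION reverse_pub FOR ALL TABLES;

-- On the old cluster ("blue"), which already has all data as of the switchover,
-- so the initial table copy is skipped:
CREATE SUBSCRIPTION reverse_sub
    CONNECTION 'host=green.internal dbname=app user=replicator'
    PUBLICATION reverse_pub
    WITH (copy_data = false);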
So the main tricky parts are always not only, like, sequences or DDL replication. These are very well-known limitations of current logical replication. Hopefully they will be solved; there is good work in progress for both problems. There are a few additional problems which are observed not in every cluster, but these two are usually observed in any cluster, because everyone uses sequences, even if they use the new syntax, generated identity. I don't remember, I still use bigserial. It always has identity, I think. Yes, but behind the scenes it's also sequences, actually.
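To make the sequence point concrete: identity columns are still backed by sequences, and sequence values are not carried over by logical replication, so they typically have to be bumped manually on the target around switchover. A hedged sketch with hypothetical names:

-- Identity columns still use a sequence under the hood:
CREATE TABLE t (id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, payload text);

-- Sequence state is not replicated logically, so before accepting writes on the target,
-- its sequences are usually advanced manually, often with a safety margin:
SELECT setval('t_id_seq', (SELECT coalesce(max(id), 0) + 100000 FROM t));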
And everyone is usually doing schema changes. So these two problems are big limitations of current logical replication. But the trickiest parts are performance and lag. So, two capacity limitations, on both sides. On the publisher, it's the WAL sender. We discussed it, right?
Yeah, we can link up that episode. Yes, yes. So, the WAL sender limitation and the logical replication worker limitation. And you can have multiple logical replication workers, and interestingly, this actually shows that the article needs some polishing, because they write max_logical_replication_worker. I'm reading it and I don't see the S, and the setting is plural, so there's an inaccuracy. And then the whole sentence says that when you have a lot of tables in the database, this needs to be higher.
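The settings being discussed are real subscriber-side GUCs; the values below are only illustrative, and on managed services like RDS they are set through parameter groups rather than ALTER SYSTEM:

ALTER SYSTEM SET max_logical_replication_workers = 8;    -- note the plural; takes effect only after a restart
ALTER SYSTEM SET max_sync_workers_per_subscription = 4;  -- a config reload is enough for this one
SELECT pg_reload_conf();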
And I'm thinking, oh, do you actually use multiple publications and multiple slots automatically if I have a lot of tables? This is super interesting, because if you do, as we discussed in our logical replication episode, you have big issues with foreign key violations on the logical replica side, on the subscriber side, because by default foreign keys are not respected when replicating tables using multiple pub/sub streams, right?
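The kind of inconsistency being described can be illustrated with a hypothetical parent-child pair of tables: while multiple apply streams are in flight, a check like this on the subscriber may transiently find child rows whose parent has not arrived yet.

-- Hypothetical schema: orders (parent) and order_items (child)
SELECT count(*) AS orphaned_children
FROM order_items i
LEFT JOIN orders o ON o.id = i.order_id
WHERE o.id IS NULL;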
And this is a huge problem if you want to use such a replica for some testing, even if it's not production traffic. You will see: okay, this row exists, but the parent row is not created yet. Foreign key violated. And it's normal for a logical replica which is replicated via multiple slots and publication-subscription pairs. Not discussing this problem means that probably there are limitations at large scale as well. Actually, if you have a lot of tables, it's not a problem. The biggest problem is how many tuple writes you have per second. This is the biggest problem.
Roughly, at thousands or a couple of thousands of tuple writes per second on modern hardware with a lot of vCPUs, like 64, 96, or 128 (I'm talking usual Intel numbers), you will see a single logical replication worker hit 100% CPU. And that's a nasty problem. That's a huge problem. Because you switch to multiple workers, but now your foreign keys are broken. It's hard to solve for testing. So, I mean, if you use multiple, you need to pause sometimes
to wait until consistency is reached and then test in frozen mode. This is okay, but it adds complexity. But if your traffic is below, like, 1,000 tuple writes per second roughly (depending also: is it Intel? Actually, by the way, it doesn't matter how many cores, because I'm talking here about the limitation of a single core; it matters only whether it's a modern core or quite outdated, and it depends on the family of EC2 instances, or RDS instances, you try to use). So this single-core limitation on the subscriber side is quite interesting. But if you have below 1,000 tuple writes per second on the source, inserts, updates, deletes, probably you're fine. Yeah. So this is interesting to check. And this lag is, I think, the biggest problem, because when you switch over, you need to catch up; when you
install the reverse logical replication, you also need to make sure you catch up. This defines the downtime, actually. Yeah, because we can't switch back until... We can't switch, or switch back, until the other one is caught up. Right, because we prioritize avoidance of data loss over HA here, over high availability. And I'm sure, since the RDS blog post says it's not zero downtime, they have additional overhead. But if you have, for example, PgBouncer, and you're going to use pause/resume to achieve real zero downtime, then you need the lag to be close to zero. And the limitation of the logical replication worker will be the number one problem. Another problem is long-running transactions, which, until Postgres 16, I think, or 15, cannot be parallelized. So if you have a long transaction, you have a big logical replication lag. So you need to wait until you have a good opportunity to switch over with lower downtime.
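Two checks that are commonly used around this; the queries below are hedged sketches using standard catalog views, with thresholds and usage left up to you: logical slot lag on the publisher, and long-running transactions that keep the lag high and make a switchover a bad idea right now.

-- On the publisher: how far behind each logical slot is, in bytes
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_type = 'logical';

-- Long-running transactions that delay apply
SELECT pid, now() - xact_start AS xact_age, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;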
That's one thing I do want to give them some credit for: this does catch some of those. So, for example, if you do have long-running transactions, they'll prevent you from switching over. Equally, there are a few other cases where they'll stop you from causing yourself issues, which is quite nice. And I wanted to give a shout-out to the Postgres core team and everybody working on features to improve logical replication; that has enabled cloud providers to start to provide features like this, and that's really cool. It feels like the good features going into Postgres core are enabling cloud providers to work at the level they should be working at, to add additional functionality. So it's quite cool. Not necessarily that we're there yet, and as logical replication improves, so can this improve, but they are checking for things like long-running transactions, which is cool. Yeah, and definitely Amit Kapila and others who work on logical replication, kudos to them, 100%. And also the RDS team. I'm criticizing a lot, but it's hard to criticize someone who is doing nothing; you cannot criticize guys who don't do anything.
So the fact that they move in this direction is super cool. A lot of problems, right? But these problems are solvable, right? And eventually we might have a real blue-green. So the question is: is the blue-green deployment terminology going to stay in the area of databases, and the Postgres ecosystem in particular? What do you think? Because this is a sign that probably yes.
Starting point is 00:25:26 It should be reworked a lot, I think. But in general, maybe yes. What do you think? Yeah, I don't know. Obviously, predicting the future is difficult, but I do think that badly naming things in the early days makes it less likely.
Calling this blue-green when it's not, actually, I think reduces people's trust in you using blue-green later, in the future, when it is more like that. But you've got more experience with this than me, for example in the category of database branching: taking these developer terms that people have a lot of prior assumptions about and then using them in a database context that they don't 100% apply to, or that they're much more difficult in, I think is dangerous. But equally, what choice do we have? How else would you describe this kind of thing? Maybe it's a marketing thing. I'm not sure. That's a cool direction of thinking. So let me show you an analogy. Until some point, not many years ago,
I thought, as many others did, that to change something we need to perform full-fledged benchmarking. Like, for example, if we drop some index, we need to check that all our queries are okay. In this case, okay, we can do it with pgbench, sysbench, JMeter, or anything like that... Or simulate the workload with our own application using multiple application nodes. A lot of sessions running, like 50% CPU, and this is just to test an attempt to drop indexes. It sounds like overkill. I mean, nobody is doing it actually, because it's too expensive. But people think in this direction: it would be good to test holistically. But actually there is another approach, lean benchmarking: a single session, EXPLAIN ANALYZE with BUFFERS, focus on buffers, I/O, and so on.
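The lean approach described here is essentially a single-session plan check focused on buffer numbers; the query and table below are hypothetical, just to show the shape of it:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE customer_id = 123;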
Similar here. And the first class of testing is needed for infra tasks, mostly upgrades and so on, to compare the whole system: lock manager, buffer pool behavior, everything. File system, disk, everything. But it's needed only, as I said, like once per quarter, for example. Of course, per cluster; if you have thousands of clusters, you need to run such benchmarks almost daily, I think, right? And similarly, these upgrades, major upgrades and so on, these tasks usually go together: you need upgrades, so you need to benchmark. But for small schema changes,
you do it every day, multiple times maybe, okay, once per week maybe, depending on the project. You release often. You develop your application quickly. You don't need full-fledged benchmarks, and you also probably don't need full-fledged blue-green deployments, right? But maybe you still need it. This is where, as I said, I have open-ended questions, like: what should we use for better testing? Because, if we are okay to pay two times more, we could have two clusters with one-way replication, but when we perform a switchover, a zero-downtime switchover, immediately we set up reverse replication. So, a real blue-green approach. In this case, probably we could use them for DDL as well. Of course, DDL should be solved, but we can solve it by applying DDL manually on both sides, actually. This unblocks logical replication.
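Applying DDL manually on both sides means running the same statement on publisher and subscriber, since logical replication does not carry DDL. A minimal sketch with a hypothetical table:

-- On the publisher ("blue"):
ALTER TABLE orders ADD COLUMN note text;
-- And the same statement on the subscriber ("green"), applied in a controlled order:
ALTER TABLE orders ADD COLUMN note text;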
So we just need to control DDL additionally, not just alter: we need to alter there and alter here. In this case, it would probably be a great tool to test everything. And then, if we slightly diverge from the blue-green deployments idea and use the A/B testing idea, we point, like, 1% of traffic to this cluster, read-only traffic only. I'm not going to work with an active-active schema,
like multi-master, no, no, no. So then we can test at least read-only traffic against the changed schema. But again, there will be a problem with schema replication, because logical replication is going to be blocked; we need to deploy the schema change on both. It's not only about the lack of logical replication of DDL. It's also that, even if DDL were replicated, if you deploy it only on one side and don't deploy it on the other side, logical replication is not working. Or it replicates it, right? So I'm not quite sure. Actually, we can drop an index on the subscriber, or we can add a column on the subscriber, and logical replication will still be working. But certain cases of DDL will be hard to test in this approach. But still, imagine such an approach. It would be a full-fledged blue-green deployment with a simple, symmetric schema, a simple switch back and forth, reliable.
I don't know, maybe it's a good way to handle all changes in general. We just pay two times more, but for some people it's fine, if the costs of errors and risks, the costs of problems, are higher than this. What do you think? Yeah, this is a tricky one. The first database-related company I worked for did a lot of work in the schema change management tooling area, not for Postgres, but for other databases. And it gets really complicated fast, just trying to manage deployments between versions while maintaining data. And the concept of rolling back is a really strange one. Like, going backwards, let's say you've deployed a simple change.
You've added a column for a new feature. You've gathered some data. Does rolling back, like, maybe temporarily involve dropping that column? I don't think so, because then you destroy that data. But then it's now in the old version as well. And there's this weird third version that I often talked about in the past: rolling forwards rather than rolling back. And I think that's gained quite a lot of steam in the past few years.
The idea that you can't... like, with data, can you actually roll back? Because do you really want to drop that data? Yeah, you know, dropping a column doesn't remove the data, you know, right? That's why it's fast. But that's another story. Well, there is this approach with Reshape, and now this new tool, whatever it's called, to handle DDL in the Reshape model. It's similar to what PlanetScale does with MySQL: the whole table is recreated additionally. So you need two times more storage.
And we have a view which masks this machinery, right? And then, in chunks, we just update something. There you have the ability to roll back, right? Because it's maintaining it in both places? Because for some period of time, you have both versions working and synchronized inside one cluster. But the price is quite high, and views have limitations, right?
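A rough, hypothetical sketch of that expand-and-contract-behind-a-view model, just the general shape of the idea rather than pgroll's or Reshape's actual implementation:

-- Expand: add the new column alongside the old one
ALTER TABLE orders ADD COLUMN status_new text;

-- Old and new "versions" of the schema exposed as views
CREATE VIEW orders_v1 AS SELECT id, customer_id, status FROM orders;
CREATE VIEW orders_v2 AS SELECT id, customer_id, status_new AS status FROM orders;

-- Backfill in small batches, then later drop the old column (contract)
UPDATE orders SET status_new = status
WHERE id BETWEEN 1 AND 10000 AND status_new IS NULL;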
But here, if we talk about replicating the whole cluster, the price is even higher. Yeah, and the complexity is even higher, I think. Of course, yes. Managing it within one database feels complex. It's called pgroll. Oh, nice. A new tool which is a further development of the idea of the Reshape tool, which is not being developed anymore, as far as I know, because the creator of that tool, Reshape, went to work at some bigger company, not a Postgres user, unfortunately. So, I don't know. The problem exists. People want to simplify schema changes
and be able to revert them easily. And right now we do it the hard way, I mean, hard in terms of physical implementation. I mean, if we say revert, we definitely revert. But dropping a column is usually considered a non-revertible step. And it's usually quite well known; in larger projects people usually design it so that first the application stops using the column, and a month later you drop it, and by then you already know. So, I'm actually talking about adding a column, which is way more common. I'm talking about adding a column because, if you need to support rolling back, that becomes dropping a column. Okay, so what's the problem?
Data loss if you do roll back, or what? Yeah. Oh, you want to move forth and back, then forth again, without data loss? Possibly. You want too much. Yeah, but I think that's... like, we talked about blue-green deployments, right? Let's say part of what you're doing is rolling out a new feature, and so you roll it out for a few hours and some of your customers start using that feature, but then it's causing a major issue in the rest of your system, so you want to roll back. Are we willing to scrap those users' work in order to fix the rest of the system? I think people would want to retain that data.
Yeah, well, let's discuss it in detail. First of all, on the subscriber, we can add a column. If it has a default, logical replication won't be broken, because it will just be inserting, updating. Okay, we have extra columns, so what? Not a problem, right? But when we switch forth, the setup of reverse replication will be hard,
because we now have extra columns and our old cluster doesn't have them. So we cannot replicate this table. Unless we replicate DDL, which, if we start replicating DDL backwards, then we're kind of reverting to our existing state, which is strange. This is one option, yes. And another option is... I know there is an ability to filter the rows and columns, I guess, right? So you can replicate only specific columns, right? I never did it myself, but I think this is… Yeah, yeah, yeah. So if you replicate only a limited set of columns, you're fine. But in this case, moving back, you lose this data, and it's similar to what you do with Flyway, Sqitch, Liquibase, or Rails migrations: usually you define up and down, or upgrade and downgrade, steps. In this case it's ALTER TABLE ... ADD COLUMN, then ALTER TABLE ... DROP COLUMN, and if you went back, of course you lost the data which was inserted already, and it's usually considered normal, actually.
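The row and column filtering mentioned just above is a real capability of logical replication in Postgres 15 and newer: a publication can list only the columns the other side knows about, so an extra column added on one side does not break replication. A hedged sketch with hypothetical table and column names:

-- Publish only the columns that exist on the old cluster (Postgres 15+):
CREATE PUBLICATION reverse_pub_orders FOR TABLE orders (id, customer_id, created_at);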
Well, yeah, but that's my background as well: people often wouldn't end up actually using the rollback scripts. What they would do is roll forwards. They would end up with an old version of the application, but the column and the data are still there in the database. You're talking about people who... like, you're talking about companies who are both the developers and the users of this system. But if you imagine some system which is developed and, for example, installed in many places, some software, they definitely need a downgrade procedure to be automated,
even with data loss, because it's usually more important to fix the system and make it alive again. And users in this case don't necessarily understand the details, because they are not developers. And it's okay to lose this data and downgrade, but make the system healthy again. In this case, we're okay with this data loss. Well, yeah, but I guess going back to the original topic: you asked, do I think blue-green deployments will take off in the database world? And I think it's the switching back that's tricky.
Starting point is 00:38:26 But I don't want to diminish this work that's been done here, regardless of what we call it, because I think it will make more people able to do major version upgrades specifically with less downtime than previously they would have been able to, even though it will still be a little bit. Yeah, I don't know. Maybe we need to develop the idea further
and consider this blue-green concept as something intermediate. It reminds me of red-black trees, right? Like binary tree, red-black tree, then AVL tree and so on, and then finally B-tree. And this is like the development of indexing approaches, algorithms and data structures. So maybe, like, closer to self-balancing, with a lot of children for each node. Maybe here also, it's a very distant analogy, of course,
Starting point is 00:39:26 because we talk about architectures here, but maybe these blue-green deployments or green-blue deployments, I think we should start mixing this to emphasize that they are balanced, right, and symmetric.
And also, like, tell the RDS guys that it's not fair to consider one of them as always the source and the other as always the target. We need to balance them. So I think there should be some new concepts also developed. So it's interesting to me; I don't know how the future will look. Also, let me tell you a story about naming. In the systems we developed, we had to choose. We know master/slave in the past, and primary/secondary, or primary/standby, which the official Postgres documentation follows right now; writer/reader in Aurora terminology; also leader/follower, the Patroni terminology; then the logical replication terminology, publisher/subscriber. Here we have blue-green, right? In our development, we chose source and target clusters. And it was definitely fine in every way, monitoring and all testing; everyone understands: this is our source cluster, this is our target cluster. But then we implemented reverse logical replication
to support moving back, and the source-target naming immediately showed it was the wrong idea, right? So I started to think: in our particular case, we set up these clusters temporarily. Temporarily might mean multiple days, but not persistent, not forever. In the original blue-green deployment, as I understand Fowler, if I understand correctly, it's forever, right? This is production, this is staging, then we switch. So I chose the new naming: old cluster, new cluster, right? But if it's persistent, that's also bad naming. Maybe blue-green is okay, green-blue,
blue-green, but definitely, yeah. Why don't you use the Excel naming convention, with "final, final v2" at the end? This is the "final" server, this is the "final final" server. So naming is hard, as we know, right? Wonderful. I enjoyed this. Thank you so much. Thank you.
Thank you, everybody, and catch you next week. Yeah, don't forget to like, share, subscribe. Share is the most important, I think. Or like is the most important. What is it? I think comments. Comments are the most important to me, anyway. Let us know what you think, in YouTube comments maybe, or on Twitter, or on Mastodon. You know, I wanted to take a moment and emphasize that we continue working on subtitles. Subtitles, they are great.
They are high quality. Yesterday, in a Russian-speaking Telegram channel where 11,000 people talk about Postgres, I asked them to check YouTube, because we have good-quality English subtitles. They understand terms; we have 240 terms in our glossary that we feed to our AI-based pipeline to generate subtitles. And I wanted to say thank you to my son, who is helping with this, actually, who is still a teenager, in school, but also learning Python and so on. So YouTube provides automatically generated subtitles in any language. So to me, the most important is sharing, because this delivers our content to more people. And if those people cannot understand English very well, especially with two very weird accents, British and Russian... Yeah, sorry about that. Yes.
So it's good, on YouTube, to switch to automatically generated subtitles in any language. And people say it's quite good and understandable. So share, and tell them that even if they don't understand English, they can consume this content. And then maybe, if they have ideas, they can write to us.
Perfect. This is the way, right? Thank you so much. Bye-bye.
