Postgres FM - How to move off RDS

Episode Date: May 30, 2025

Nikolay and Michael discuss moving off managed services — when and why you might want to, and some tips on how for very large databases. Here are some links to things they mentioned:

Patroni: https://github.com/patroni/patroni
pgBackRest: https://github.com/pgbackrest/pgbackrest
WAL-G: https://github.com/wal-g/wal-g
Hetzner Cloud: https://www.hetzner.com/cloud
Postgres Extensions Day: https://pgext.day
pg_wait_sampling: https://github.com/postgrespro/pg_wait_sampling
pg_stat_kcache: https://github.com/powa-team/pg_stat_kcache
auto_explain: https://www.postgresql.org/docs/current/auto-explain.html
Fivetran: https://www.fivetran.com
pgcopydb: https://github.com/dimitri/pgcopydb
Kafka: https://kafka.apache.org
Debezium: https://debezium.io
max_slot_wal_keep_size: https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-MAX-SLOT-WAL-KEEP-SIZE
log_statement (DDL): https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-STATEMENT
PgBouncer pause/resume: https://www.pgbouncer.org/usage.html#pause-db

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and I'm joined by Nik as usual, from Postgres.AI. Hey Nik, how's it going? Everything is fantastic. How are you, Michael? Yeah, good. Good also. Thank you.
Starting point is 00:00:16 What are we talking about this week? So today, this week we are talking about, let's move out of RDS or put your own Managed Postgres name there. And how? Yeah. So you mentioned how to move off RDS, but I was going to ask how specific you wanted to be to, well, because there's even multiple flavors of RDS in terms, even Postgres compatible. Yeah, it's great. And we have a lot of managed service providers and that's super good.
Starting point is 00:00:49 All of them have their pros and cons. Some of them are already dead. We must say, you know, there's a saying that a human not only can be dead, he can be suddenly dead. This is from Russian literature, okay: Bulgakov, The Master and Margarita. So a managed Postgres service can also be suddenly dead, and we know several examples, and
Starting point is 00:01:19 What annoys me is how they provide very short notice sometimes. I don't expect this from RDS, but who knows. No, definitely not. RDS is huge, but we know Google can kill services. Not with 30 days' short notice, or two weeks. But among recent examples, I remember Bit.io, which was an interesting idea: drop a CSV file and it gives you Postgres and you can start querying. I remember, I think, Tembo is a recent example, right? Yeah, but how much notice are they giving?
Starting point is 00:02:03 Bit.io gave, I think, 30 day notice only. Oh, wow. It's not all right for databases. We know how difficult it is to do any action. If you exceed, for example, one terabyte, and it's already a significant size, any move, you need to plan carefully and practice and test and then perform.
Yeah, well, Tembo, right? Tembo, I think, completely pivoted to a different area. And also PostgresML is a new example. But PostgresML, I think one of the founders is the author of PgCat. He was here long ago. Right. But PostgresML is closed, and others as well. Right. So it happens. So one of the reasons to migrate out is lack of control. Have we already
discussed reasons? Yeah, let's do it a little bit. I hear people talking a little bit about costs, but I think the ones I see more often seem to be low-level access and extensions. Like if you've got a specific extension you want supported, you might migrate, I guess, sometimes between providers rather than off. But yeah, migrating off could also include migrating onto a different one. Yeah, yeah, definitely, because some of them are much more open. Some of them... not many, but some. Crunchy, right?
Crunchy Bridge. Yeah, or some support different extensions, like maybe the extension you want is on one managed service but not another. You want TimescaleDB, so you need to migrate to Timescale Cloud? Or even, I think I saw a conversation you were having... is it PL/Proxy? Oh, Cloud SQL. Yeah. Yeah.
Because of how Cloud SQL is working there, this is my theory. That's why PgQ and SkyTools are there, basically, right? So that works as well. So yeah, well, and all others only have... not actually all others. RDS doesn't even have PgBouncer, which also originated from Skype almost 20 years ago. I didn't know that. Yeah, PgBouncer is the most popular product
created inside Skype 20 years ago. Others are PgQ, Londiste, and PL/Proxy. I think there are more, but these I used and remember very well. If we talk about reasons, there are many reasons. For me... well, managed services are different. Some of them are open, some of them are fully based on open source, which is great, like Supabase.
Some of them provide superuser access, like Crunchy Bridge, or access to physical backups, which is a miracle among managed Postgres providers. But most of them are telling you: we are protecting you, give us almost twice as much money and we will limit your access to your own things, and we will take care of backups and failover, HA and DR and some other features. And all of these features are already well established in the open source ecosystem. So I think my anger towards managed service providers lives inside me because, first of all, they make money, much more money than they should, on open source, not contributing back enough.
Not everyone, but many. But the biggest problem is, I think, I truly believe, that they stole the openness from us. The open source which we loved, always comparing it to SQL Server, Oracle, or Linux versus Microsoft Windows. Those guys who create managed service providers, among them are great engineers, and I admire and have huge respect for many of them. But in the grand scheme, this is like a new Oracle: vendor lock-in, closed everything, bad processes, inability to fix problems or understand what's happening, and so on. All things which open source intended to solve.
Starting point is 00:06:23 For example, why I stopped using Oracle 8i in 2000? Because I spent three days trying to fix something. It worked on one machine, it didn't work in our clients, at our client machine. And then I read somewhere, okay, Internet already was a thing. I read somewhere that if hostname has closing parentheses, then Oracle cannot connect. You cannot connect Oracle, and no errors, nothing. And I eventually switched to open source where such kind of problems you can troubleshoot easily.
Like if you have advanced skills. And many people, they are not dumb. Many people are quite smart and can learn how to use these tools and troubleshoot. But managed service providers stole this openness and the ability to see what's happening. They just packaged open source in the form of a service, charged a lot of money, and also closed important features like access to physical backups or the ability to make a physical replication connection, basically implementing vendor lock-in. So, well, but they didn't steal it, right? It's licensed that way deliberately, to be very permissive, at Postgres. So it's Postgres, right? It's open source. It's the same as Postgres.
We almost didn't change it. We use it as is. It's Postgres. And it's like enterprise level. You mean the calling it Postgres is what you're...? Yeah. They call it Postgres and lean on Postgres popularity. Actually, it's a two-sided question, because I also think RDS contributed to the rise of Postgres popularity around 2015. I think JSON was a big contributor, but if you check carefully, RDS probably was a bigger contributor to the growth of Postgres popularity, because everyone hated installing Postgres.
This is true, but things are changing. It's much, much easier these days to install Postgres. When I started, when I got more and more involved in the Postgres community, maybe six years ago, most Postgres committers were... well, I didn't look at the numbers, but I think from memory most committers and contributors to Postgres were working at consulting firms who did a lot of community work, but also, I guess, implementing features for their clients. These days, I wouldn't be surprised if... and I know the lines are blurred between consulting firms and managed service providers these days, but I definitely know that lots and lots of committers are moving to the hyperscalers and other managed service providers.
Starting point is 00:09:02 Of course, of course. But what I mean is there is some contributing back from them, but I do take your point that there's an interesting dynamic here. And I would be interested in your take on what proportion of people are moving for those kind of ideological reasons versus cost reasons. Because you mentioned it's like double the price for less access.
Which of those is more important to people, the double cost or the reduced access? That's a good question. First of all, again, the question is not one-sided. I admit all the good things RDS brought, and others brought; that's for sure. But also, I just feel it's like working with Oracle and SQL Server: not understanding what's happening, always needing to reach out to support and waiting months to get any small response.
Starting point is 00:09:56 I was like, it's hard. Then working with open source, you just see how it works and you can even fix it if needed or you can troubleshoot. It's like documentation always lacks details. So source code is best documentation and ability to debug. This is a huge part of open source constantly for me. And this lacks there. So for me, it's a big problem.
But I understand I'm in the minority here. But I'm not afraid to be in the minority. 20 years ago, Postgres itself was deeply in the minority, deeply. It was like MySQL everywhere, Oracle in the enterprise. What is Postgres? So I truly believe that the RDS rise started roughly 10 years ago, maybe 12, around that time. It was like a storm. Heroku started first, and I think Heroku was historically the first, well, good managed Postgres, right?
Then RDS, others, everyone has it. Even IBM has it, or Oracle has it. It's insane. Absolutely insane. So I truly believe this peak should already go down, or will soon start to go down, and something new has to be created. This is what I feel. Maybe it will take 10 years or so, another 10 years. But it should not be so, because the whole philosophical point of open source is deeply
disturbed. It sucks to troubleshoot a problem on RDS by trying to reproduce it on regular Postgres. You try to guess things all the time. You don't see; you are blind, right? Some database is suffering, their support is slow or inexperienced. We're repeating a conversation we've had several times. Yeah, yeah, well, the problem, for me, is deep, it's strong, but I also understand I'm among very few people who feel it, I think. So why bring this one up, about how to move off? Is it because you want people to move, or is it because you're seeing that they're moving and they need advice?
I want to start this discussion. I also know that probably more people will move because of budgets, especially for economic reasons. At some point you realize you spend a lot on RDS and other services, but RDS sometimes is huge; sometimes it's between 20 and 50 percent of the whole cloud budget. And I think there are interesting pieces of how you can get support: as I understand it, to get support the whole account should be upgraded, so you cannot upgrade only the RDS part. But this I'm not sure about.
Starting point is 00:12:50 What I'm sure, it costs a lot, and it costs for something that can be achieved with open source and a little bit of support from some guys. You can hire those guys or you can use services, but it will cost a fraction of what you can have on AWS. And so, sometimes some companies are in trouble. They check spendings. They see big check, like say 10,000 per month for some small company. It's significant money, right?
So you can hire a guy for this. And you check what you can achieve on low cost like Hetzner or OVH or something, and you see you can drop it five times. Why not, right? Or maybe it's 100,000, and 20,000 after migration. And you see that, as I said, between 20 and 50% of costs are databases. We're talking about guys who are not growing fast and not raising a lot, but they need to survive as well. There are such companies, and sometimes they need help as well. Yeah, so the cynic in me is thinking: oh, so Nikolay wants more customers. No, unfortunately, in this case I can help one time, but we don't have a solution
Starting point is 00:14:15 yet. So others might have solutions to this. And sometimes it's Kubernetes, sometimes it's not Kubernetes. I would prefer not Kubernetes in this case. If you're very familiar with Kubernetes, okay, good. There are options. But I would just prefer maybe like old school solution without additional layers. If it's a small project, it's maybe easier.
And yeah, we can help, but I'm not looking for customers, because this is usually not a super huge customer. We had such cases, but it's like a small project for us to migrate. And I just wanted to share some pieces of advice and spark this conversation about why we use RDS so much while backups, HA and DR are already solved in open source, fully solved. Take Patroni and pgBackRest or WAL-G, that's it. And just find some packaging or build your own, and that's it, yeah.
So, not convincing, or what? Yeah, well, I mean, I think you're right at a certain scale, but I think there are a lot of smaller scales where it just doesn't make as much sense to try and save that amount of money. For the... How much, like, let's take some company: how much is the total cost, and in percentage, how much is databases? Yeah, good question. But I think even up to spending... well, definitely less: if you're spending less than a thousand dollars a month on the database, why bother? That is not noticeable, right? But if it's several...
But do you see what I mean? That's quite a big... For a lot of very small startups, they won't be spending that. So I think there are a lot... Well, it depends on the term small, because if it's still a small team but they accumulated a lot of data, the budget to keep this data in RDS Postgres will be high. And I think we touched on an interesting question. I think there is some threshold where you can estimate how much effort in terms of engineering resources it will take.
And then there should be some threshold below which it's not really reasonable to move out of RDS. But above that, it might be reasonable, especially considering that the quality of Patroni, WAL-G, and pgBackRest is good enough. They are battle-proven, many years already, right?
If you compare, say, an RDS multi-AZ instance to a three-node cluster, it already looks like a classic Patroni setup. And if you compare costs, it will be interesting. But also if you move out of AWS, out of cloud, or to a cheaper cloud like Hetzner Cloud, which has regions in the US and a lot of regions... not a lot, some regions. As of recently, yeah. And they also, since I think last year, have S3-compatible object storage for backups, but only in Europe. That's a problem.
Oh, really? I hope they will add it to US regions as well soon. And I actually really like their dedicated offering. I used it many years ago when I was bootstrapping, and I know many people use it. And sometimes customers come to us and they use Hetzner for bootstrapping, and it makes sense. And it's even more cost-saving, but it's also not available in the US, unfortunately. So anyway, if you take that price, for instances,
EC2 versus virtual machines in Hetzner Cloud, it's already 5x. But on top of that is the premium AWS adds to run backups and failover and the other services they provide. It's additionally like 60, 70, right? Yeah, well, but okay, so I think we've covered enough on the why. I think people... Good reasons. Yeah, control and money. Money and control. So then I think the trickier question is how. And especially how to do it with minimal or no downtime.
Yeah, well, again, this is not only about RDS. Maybe you want to migrate out of others, like Cloud SQL or Crunchy Bridge or something, despite their achievements. First thing: let's talk about versions, compatibility, plugins. If you move to a different cloud, it's of course important to compare all the extensions and capabilities, the versions supported, how fast the service delivers minor upgrades. This is a big deal. If they lag, it's a flag for me. We need to make sure that if bugs happen in Postgres, the fix will be delivered to my setup quickly. Also, how major upgrades are done, right? What kind of control do I have there? But if it's in your own hands, well, it's easier, because you can take... It's open source.
You can take any extension. Besides, there are extensions AWS, for example, created for RDS and didn't publish as open source. They have things like the aws_lambda extension, and extensions for plan management in Aurora, to freeze plans, to mitigate plan flips, to avoid plan flips. These extensions, unfortunately, are not available in open source, so you need to find a replacement. But this is rare, actually. So usually people... It's either observability-related capabilities, where you can find a replacement, or it's something people rarely use.
So most of the offering in RDS is based on open source pieces. And you can bring even more extensions if you have everything in your hands, right? You can have... Make your own? Yeah, make your own extension. Easy, compile everything. So... BYOE, yeah.
Yeah, yeah. So yeah, there was a mini-conference inside a recent conference in Montreal, right? Organized by Yurii, right? It was Postgres Extensions Day or something. I know several people who presented talks there. It was interesting. So actually, part of my lack of love for extensions over the last five to ten years is because of RDS and these guys. Because I understand that if I create an extension, it will take ages for them to adopt it. So if you move out, you have control and can
use more extensions. What? Yeah. So, but now we're talking about... yeah, of course you need to do a bit of research, a bit of prep as to where you're going, but I'm assuming most people that are thinking "how do I move off" have done that step. I think the tricky part is often you're not able to, for example, set up logical replication to... you know, an easy-ish way to do this would be: set up a logical replica somewhere else, set up replication, decide a cutover point, pause writes, cut over, send writes to the new
primary. Great, easy, done, episode done. But I believe that's not possible in most cases. Well, it depends on the scale. Before we go there, one more thing related to extensions and compatibility. So if you want to think about how not to lose anything, it's important to think about observability bits. And in this case, pg_wait_sampling, pg_stat_kcache, pg_stat_statements, and auto_explain bring a lot of observability bits. And there are several monitoring solutions available that can be used on top of them. So the idea is: everyone who works with RDS, literally everyone, uses Performance Insights. And it's not a good idea to migrate off and lose this bit.
Fortunately, with pg_wait_sampling, especially with the recent fixes, it's now possible to use it and have similar charts in your monitoring thanks to that. So it's not only about extension functionality, but also the additional observability and management extensions. As for cutover and so on: logical replication is usually available everywhere, RDS included. So you can provision logical replicas. The tricky part is how to do it if your database exceeds, say, a few terabytes or 10 terabytes.
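For reference, a minimal sketch of the straightforward provisioning path mentioned here. Object names and the connection string are illustrative; the specific claim that logical decoding on RDS is switched on via the rds.logical_replication parameter is an assumption about that platform's parameter group, while self-managed Postgres uses wal_level = logical.

```sql
-- On the source (publisher). Self-managed Postgres needs wal_level = logical;
-- on RDS the equivalent switch is assumed to be rds.logical_replication = 1
-- in the parameter group.
CREATE PUBLICATION pub_move_out FOR ALL TABLES;

-- On the target (subscriber); names and connection details are illustrative.
CREATE SUBSCRIPTION sub_move_out
  CONNECTION 'host=source.example.com dbname=app user=replicator password=secret'
  PUBLICATION pub_move_out;
-- copy_data = true is the default, so this runs the initial table sync,
-- exactly the step that becomes painful at multi-terabyte scale.
```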
Starting point is 00:23:12 It's really not easy. As in that first initialization, you just never catch up. Yeah. Yeah. Well, yeah, never catch up. You can use multiple slots to catch up, right? Yeah, but so, okay, but you said it's not easy. Why is it because of the time between setting it up
and completing it? To create a logical replica, we need to do two things. First is initialization, and next is switching to CDC and catching up. I don't think it's a huge deal in terms of CDC to catch up if you use multiple slots. Usually with one slot you can handle workloads like a thousand tuple writes per second. If you check pg_stat_user_tables, the n_tup_del, n_tup_upd, and n_tup_ins columns, these three numbers, you check how many of them you have per second. On modern hardware like Intel, AMD as well, ARM maybe, but maybe less, you need to go
down with this threshold. A thousand tuple writes per second, of course, is a relatively big number, because it depends on the content of those tuples as well. But at roughly a thousand tuples per second you should catch up well with just a single slot. If you have more, just use multiple slots. The only problem is that when you do logical replication using multiple slots, foreign key violations can happen temporarily; it's eventually consistent on the subscriber.
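A rough way to measure that write rate, as a sketch: it assumes the n_tup_* counters in pg_stat_user_tables are the numbers being referred to, samples the totals twice, and divides by the interval.

```sql
-- Approximate tuple writes per second across all user tables, sampled over 60 s.
CREATE TEMP TABLE tup_before AS
  SELECT sum(n_tup_ins + n_tup_upd + n_tup_del) AS writes
  FROM pg_stat_user_tables;

SELECT pg_sleep(60);

SELECT round((sum(t.n_tup_ins + t.n_tup_upd + t.n_tup_del) - b.writes) / 60.0)
         AS tuple_writes_per_sec
FROM pg_stat_user_tables t
CROSS JOIN tup_before b
GROUP BY b.writes;
```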
You cannot use it while it's still a replica. But this is not the problem. The problem is the initialization, because of the standard approach. So there are several tricks here. First, there is binary mode. There is the traditional way to create a replica with copy_data set to true, but you can also set binary to true. I think it's supported since Postgres 15 or 16. In this case, it gets data in binary form and it's kind of faster. It's not the
Starting point is 00:25:13 same as to create physical replica and convert it to logical replica, which became, by the way, official in Postgres 17. I missed this. There's a new CLI tool, but it will be only available post-17, so in future. Now usually we have older version on production. Although if you can upgrade to 17 first, then this tool is available. But these guys don't provide the physical backups besides Crunchy Bridge. I was going to say, so you wouldn't be able to use this to migrate off RDS. But this flag, what I'm talking about, flag binary set to true, this is for regular logical replica provisioning.
It's not physical-to-logical conversion, it's something else. It should speed things up. But additionally, you can speed things up by implementing it manually. So if you just create a slot, open a transaction, export a snapshot, and then dump everything with many workers... pg_dump actually supports this. In pg_dump you can specify -j, the number of jobs, and you can also specify the exact snapshot to work with. If that transaction is still open and this snapshot is being held in a repeatable read transaction isolation mode,
snapshot isolation, so it holds the snapshot. Then you can use multiple pg_dump workers to export data faster. There is even more, even more can be done here. If you have huge tables, a single worker will work alone: a single-threaded worker processing a huge table, like a billion rows, for example. And in this case, it's good to think about custom dumping and restoring using multiple workers and using logical partitioning: you have ID ranges, for example, if it's integer, bigint, or
UUID version 7, or actually even if it's UUID version 4, you can have ranges, because it's randomly distributed. So no longer using pg_dump, but copying to CSV or something? Yeah, yeah. COPY TO, yeah. So something like this, and you can specify ranges and export data of a huge table using multiple workers. This is to speed up the process.
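One possible shape of this manual initialization, as a sketch under the assumption that you can keep a replication connection open to the source; the slot name, snapshot name, ID range, paths, and connection string are all illustrative, and the publication is the one from the earlier sketch.

```sql
-- 1) On the source, over a replication connection
--    (e.g. psql "dbname=app replication=database"); keep this session open.
CREATE_REPLICATION_SLOT move_out LOGICAL pgoutput EXPORT_SNAPSHOT;
-- The command returns a snapshot name, e.g. '00000003-000001AB-1'.

-- 2) Parallel dump pinned to that snapshot (run from a shell):
--    pg_dump -Fd -j 8 --snapshot='00000003-000001AB-1' -f /backups/app.dir app

-- 3) For one huge table, several extra sessions can each export a range,
--    all pinned to the same snapshot:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-000001AB-1';
COPY (SELECT * FROM big_table WHERE id >= 0 AND id < 100000000)
  TO STDOUT WITH (FORMAT csv);    -- worker 1 of N; redirect or stream as needed
COMMIT;

-- 4) After restoring on the target, attach to the already-created slot
--    without re-copying data:
CREATE SUBSCRIPTION sub_move_out
  CONNECTION 'host=source.example.com dbname=app user=replicator password=secret'
  PUBLICATION pub_move_out
  WITH (create_slot = false, slot_name = 'move_out', copy_data = false);
```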
The more we speed up, the less we need to catch up. Right. Okay. Obviously. But the problem with speeding up too much is if you do it from the primary: only recently, in 16 or 17, did it become possible to create logical replication from physical standbys. Before that, you need to deal with the primary, and this is not fun,
because, also, I don't understand how you... Can you, on RDS, set up logical replicas from a standby? Never tried, but it should work in 17, right? Oh, I mean the feature is there in Postgres, I just didn't know it was available. I also don't know. But actually you can think about... So it's interesting, right? If we create a slot on the primary and we have a physical replica, theoretically we could have a replica lagging a little bit. We create a slot, and then we use recovery_target_lsn to catch up and synchronize with the slot position using physical replication. And then we pause again, and we can grab data from the physical replica. And we know it will correspond exactly to the slot position which is waiting for us. But the problem is
RDS doesn't support recovery_target_lsn; you cannot change it. So my question is: can we pause the replica and, instead of recovery_target_lsn, advance the slot? There is a function, pg_replication_slot_advance or something like this. Knowing the position of your physical standby, you could advance the slot so they are again synchronized. And then you can grab your data from the physical standby, and you know this data will correspond to the slot position (there's a sketch of this idea just below). So, but in reality, how are you seeing people... when you do a project to move people off, what are you actually ending up doing, normally? Well, recently we decided to do it with downtime, honestly, because this
project could allocate a maintenance window of three hours, no problem. But the scale was, I think, a couple of terabytes; it was not big. At that scale, actually, I would probably use traditional logical replication provisioning. But what I'm describing right now is interesting because, again, we want to move as fast as possible with initialization, to catch up less later. But we also don't want to overload the primary if it's still being used. So this trade-off is like competing reasons. That's why it's a good idea to move this reading, your disk reads, off the primary to a standby.
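A sketch of that untested idea only (it is explicitly not something verified on RDS in the conversation): pause replay on the physical standby, then advance the logical slot on the primary to the standby's replay position. The slot name and LSN value are illustrative.

```sql
-- On the physical standby: stop WAL replay and note the position.
SELECT pg_wal_replay_pause();
SELECT pg_last_wal_replay_lsn();          -- suppose it returns '5A/22000000'

-- On the primary: move the logical slot forward to exactly that position,
-- so data dumped from the paused standby matches what the slot will stream next.
SELECT pg_replication_slot_advance('move_out', '5A/22000000');

-- When the copy from the standby is done, resume replay:
SELECT pg_wal_replay_resume();
```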
This is what I just tried to elaborate. Honestly, I never tried it on RDS. I never tried. I only tried from self-managed. That's why I wanted to bring us back a little bit to kind of how to move, because it feels like, even though these things might be possible, they start to get quite complicated.
Starting point is 00:30:44 And sometimes complexity in these situations is, well, it's scary, it's hard to test, it's hard to be sure it's gonna work well and often because it's a project that won't happen, hopefully you won't be migrating every year or two, so hopefully it's a fairly infrequent task, people can offer, I think sometimes people can arrange for a couple of hours of downtime. I really, you know, over a weekend, announce it really far in advance, let people like, obviously some, some can't, but I think plenty can.
So, well, there are tools, there are open source tools and there are proprietary tools additionally, which help with logical replication. For example, CDC-type tools. Yeah. So there's DMS from AWS, and Google also has its own tooling, and there are third-party tools like Qlik. I wouldn't recommend Qlik; they suck at Postgres, for sure. But there is Fivetran. Fivetran is a good proprietary tool. They promise to work very reliably on very big databases. So you can use them to migrate out, and then that's it. Or there are open source tools like pgcopydb, which I think works fully at the logical level, right? So it should be compatible with RDS. By Dimitri Fontaine.
Again, I only know of it, I never used it myself. There's something from Xata, I think. No? Should be. I don't know. They promise support of DDL, because the problem with logical is always DDL. If you need to run DDL often, sometimes it's a part of user activity, DDL creation, and this is not a good position to be in, because DDL is not propagated with logical replication. And if you can pause it, it's great, but if you cannot... Another reason to keep the window. Yeah, so...
Starting point is 00:32:44 Yeah, I agree with you. Complexity can grow, but it's already not rocket science. Yeah, that's fair. And you can evaluate this complexity and how much you need to engineer, compare with your savings. Yeah. Yeah, it's a good point though, because if we're talking like projects that are already spending more like a few thousand a month or more, they're going to
be of a certain scale. So there is going to be that. I think you mentioned a couple of terabytes and, say, about a couple... two hours of downtime or something. Sometimes people spend millions. And there are cases of quite big companies, in terms of scale, who migrated out of the cloud entirely, you know, like the posts by DHH and 37signals. That was, I think, a couple of years ago already, right? Oh yeah, all I mean is the bigger you are, the more that complexity makes sense, because you... well,
Starting point is 00:33:45 unless I'm mistaken I was always thinking that the downtime you'd need to take would be somewhat proportional to the amount of data because of the dump restore time so it's like yeah if you've got a small database the amount of downtime you would need to cut over would be much lower. Yeah, yeah, that's fair. But you're not going to have a small database and be moving for cost reasons. That doesn't make sense to me. Yeah, by the way, there is a good approach. So we don't want to have a long lasting initialization just because we lag a lot.
We just don't want to risk the health of the primary in case we cancel this migration, because if we cancel, the health of the primary will be disturbed because of the accumulation of dead tuples and, eventually, bloat. And there is an approach, which I think some of the tools I just mentioned implement: you start consuming from CDC immediately when the slot is created and put the changes into some intermediate place like Kafka or something, or object storage. In this case, the xmin horizon is already advancing. So you're already using this slot.
You're just using it not with the final consumer, Postgres, but with some intermediate consumer who promises to deliver all the changes later, when the final destination is ready. So this is also a reasonable approach, but it adds even more complexity, because now you need to manage Kafka, and this is a whole other story, or something else: files on object storage.
Is that what Debezium is used for? Yeah, well, I stopped hearing about Debezium quite a while ago. I don't know what's happening. Whoever's listening, if you know, can you leave some comments somewhere? I'm curious what's happening with that project. But you raised a good point about testing. What if you need to go back? Or, I mean, I think there's a valid question around how do you even test this kind of thing. How do you do a dry run? So the trickiest part is the switchover, because it's hard to undo; it's already like a jump.
But provisioning of a logical replica can be tested in production. You start from non-production, then you go to production, and you are very careful. There are two big risks with logical. One is to run out of disk space, and there is a setting that can mitigate this (I keep forgetting the name: max_slot_wal_keep_size), the maximum size of the lag. You can control it and say: better kill my slot if you reach this threshold. So you can set, I don't know, 10 or 100 gigabytes there to avoid risks.
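A sketch of that guardrail, using the setting linked in the show notes; the threshold is only an example, and on RDS it would go into the parameter group rather than ALTER SYSTEM.

```sql
ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';  -- slot is invalidated past this lag
SELECT pg_reload_conf();

-- Watch how much WAL each slot is holding back while the migration runs.
SELECT slot_name, active, wal_status, safe_wal_size,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```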
And the second danger is affecting the health of the primary, because the xmin horizon is not advancing. So this you can just monitor and define some threshold where you say: stop, we are killing the slot, and let it go. But besides these two risks, users won't notice anything. Well, disk I/O as well: if you provision right from the primary, disk I/O can be significant if you use multiple workers, so it's better to control this. Yeah. You also can't be doing migrations during that time. Like, you can't be doing DDL? You can, you can. If it's... Oh, good point. Yeah, it will just put your replication on pause, and if it's not a lot, you can just... In the test, what I did, I just checked the logs. I see DDL. It's good to have log_statement set to 'ddl'; it's good advice anyway, for any setup, just to track all changes.
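That logging advice, as a sketch; again, on a managed service this is a parameter-group change rather than ALTER SYSTEM.

```sql
ALTER SYSTEM SET log_statement = 'ddl';  -- every CREATE/ALTER/DROP lands in the log
SELECT pg_reload_conf();
```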
And you see: this is when the DDL happened, and now our replication got stuck because of this DDL. Okay, we have lag of up to one minute while I manually propagate it. It's a test, right? But the good test is a production test, because otherwise it's really hard. Well, I recommend starting from non-production tests, but eventually, due to the complexity, I would definitely recommend having production tests here as well. Under control, the risk is not high. If you see the lag: okay, propagate it. If DDL happens, a new partition is created every hour or whatever, you can just propagate it manually and understand that this spike was that. That's it. So it doesn't affect users.
But the good thing is that users don't notice. So it gives you the benefit of performing tests in the field where the real battle happens. Unlike switchover; switchover is different. Unless you have the switchover backed by PgBouncer pause/resume, it won't be fully zero-downtime. You will need to accept up to, say, one minute of loss. Loss of availability, not of data; data should not be lost in any case. It's a switch, right? And PgBouncer, if you want pause/resume, I guess it should be installed on separate nodes, in Kubernetes or somehow, and under your control. Because if it's the PgBouncer provided by that service, you need to switch out of that service, right?
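For a PgBouncer you control, the switchover lever looks roughly like this; these are PgBouncer admin-console commands, and the database name is illustrative.

```sql
-- Connect to the admin console, e.g.: psql -p 6432 -U pgbouncer pgbouncer
PAUSE app;    -- waits for in-flight queries to finish, then holds new ones
-- repoint the [databases] entry (or DNS) at the new primary, then:
RELOAD;
RESUME app;   -- queued clients continue against the new primary
```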
So it's quite rare that people control their own PgBouncers, I mean in the case of users of managed Postgres, but it happens. Makes sense. It happened recently with us. So might that be a sensible initial thing to consider migrating? Like, run your own proxy. Oh yeah. To run your own.
Oh yeah. Well, and you can start from it, actually. This is what happened recently. It was a managed service and PgBouncer: that managed service provided PgBouncer, but they didn't provide important bits of it, some control. I think pause/resume was either not supported, I don't remember the details, but definitely not all the details about monitoring were supported.
So you could not export these bits to some custom monitoring solution. So the first thing was to migrate to a self-managed PgBouncer and get control of it. Then you can have proper pause/resume. And that's good. Oh, an important thing I forgot. I also wanted to mention: when you do all this migrating out, you first need to plan. And inside planning, it's super important to understand the topology and the round-trip times between the nodes.
The distance: is it the same region? If, for example, you're migrating out of RDS but you stay inside AWS, you can have the same region, and that's good, right? Because where will your users be, and I mean application nodes? If they are far, it's bad. Sometimes people migrate to, okay, Hetzner or something, and in this case it's better to be closer. And some AWS regions have like a couple of milliseconds of latency, round-trip time, to a few
Hetzner regions. I think they have only two or three, I don't remember. So maybe you first need to migrate to a different region in AWS, your application also, I mean, to be closer to that Hetzner region. Geography matters here. How many miles or kilometers between them? And you need to test latency. How to test latency? An absolutely simple test: you have Postgres, you connect to it, you write backslash timing in psql and just run "select;" multiple times. You already see... it's not scientific, right? You need to do something better, but it's super easy, because ping usually doesn't work and you need some way to test the round-trip time at the TCP/IP level.
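The quick check described here, in psql:

```sql
\timing on
SELECT;   -- an empty SELECT is valid; the reported "Time:" approximates one round trip
SELECT;   -- repeat a few times and take the typical value
```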
I think there are tools; what we used in the past, I don't remember. But this is the easiest way just to check round-trip time to Postgres and choose better, because actually one millisecond is already noticeable, right? Like two milliseconds, if you have many queries... then I would already start hesitating to have it, right? You remember our first episode, right? Yeah, I was going to say, that's what you're aiming for, to have most queries be
less than 10 milliseconds. Yeah, yeah, because one HTTP request can have many queries, for example 10, and 100 milliseconds is already noticeable. Not noticeable until 200, okay.
It should be noticeable for engineers, not to users yet. Okay, that's it. I think we covered many specific areas. I just wanted to say it feels maybe scary, but it should not be scary. I think in the future we will have more mature products, purely open source, delivering HA and DR in packaged form, not only in Kubernetes. But in Kubernetes it's already so: you can choose among multiple Kubernetes operators, fully open source, and they have, or at least they promise to have, everything. But if you don't like Kubernetes... actually, more and more, I don't like Kubernetes for databases. In this case, I think more products will arise and help you stop worrying about backups and availability.
Starting point is 00:43:29 And at that point it helps to have a, oh, we could actually migrate. We've looked into it and here's our plan. We can move on to something that will cost us this much less. You might be able to get a slightly best of both worlds and just get the cost reduced without having to migrate. But yeah, great. So we covered a bit of planning, we covered how to in terms of technically, covered things you have to make sure you don't forget about, a few bits there. Anything else
Starting point is 00:43:56 before we wrap it up? Well, I just wanted to comment that I understand that this discussion maybe is a little bit early. So let's like see what happens in next like say five years. And am I right to say like to predict that the rise of new products around the Postgres should happen? Right. We'll see. Right. Yeah.
Well, I'm going to predict that, actually, I think we might even go slightly in the other direction. I think we're going to see more and more managed services, and I think they're going to be... maybe because of AI, but also some of the companies that are spinning up thousands of databases these days, like Supabase, seem to be going from strength to strength. Good for them. I know Neon just got acquired, but the number of new databases on their platform is significant. They both... because of AI builders, so to speak, right? Yeah, but not only. I think there was... This is what I read on social media.
I think the recent continued trend is definitely that. But Supabase, I know a lot of people building small businesses, and a lot of them are doing it on Supabase instead of Firebase, like in the old days. So I think there are a lot of non-AI projects using that kind of service as well. So I'm seeing a lot of those, maybe not the enterprise, the larger-scale projects,
but it'll be interesting to see in a few years' time. Hopefully we'll have both. Hopefully we'll have both ends of the spectrum. Yeah, what I remember is the saying... not the saying, the phrase that more and more, maybe more than half of new databases created, come from integration, automation, from, say, some AI web coding or something, and they need a database
Starting point is 00:45:46 and it's just created to basically fully automatically. Yeah. Yeah, interesting. Yeah, what the other half then? I don't know. I don't know exact numbers. It's just what like, yeah, yeah. I just put some random number 50.
Starting point is 00:46:01 Yeah, yeah. So and scale matters too, right? Like I think whilst that's a lot of databases, I don't imagine that's a lot of huge databases. Like we're probably not talking about, that's not the ones that are going to be migrating to RDS anytime soon. Most of them. Yeah. Well, yeah. It's interesting. Like I would, I would love to see some report from good trustworthy sources, landscape of databases in terms of size, budgets and so on. About open source, or maybe only Postgres, I remember some Gartner report from 2018, yeah, that was already many years ago.
Starting point is 00:46:43 It was saying that open source database market exceeded 1 billion already. So it was great. But what's the distribution? Those who are created automatically, they are small. And they are definitely below the threshold we discussed, right? And also, what counts?
What counts as open source? Is Gartner including Aurora Postgres as open source? Well, the basis is open, not bad. I understand that; I think it's a blurred line at this point. Yeah, it's hard to say, there are so many flavors. I agree. Yeah, all right, anyway, really a pleasure speaking with you as always, and catch you next week. Goodbye.
