Postgres FM - Snapshots

Episode Date: March 21, 2025

Nikolay talks Michael through using cloud snapshots — how they can be used to reduce RTO for huge Postgres setups, also to improve provisioning time, and some major catches to be aware of.

Here are some links to things they mentioned:

Snapshots on RDS https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_CreateSnapshot.html
pgBackRest https://pgbackrest.org
WAL-G https://github.com/wal-g/wal-g
pg_backup_start and pg_backup_stop (docs) https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-BACKUP
How to troubleshoot long Postgres startup (by Nikolay) https://gitlab.com/postgres-ai/postgresql-consulting/postgres-howtos/-/blob/main/0003_how_to_troubleshoot_long_startup.md
Restoring to a DB instance (RDS docs mentioning lazy loading) https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_RestoreFromSnapshot.html
Amazon EBS fast snapshot restore https://docs.aws.amazon.com/ebs/latest/userguide/ebs-fast-snapshot-restore.html
Our 100th episode “To 100TB, and beyond!” https://postgres.fm/episodes/to-100tb-and-beyond

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and as usual I'm joined by Nikolai, founder of Postgres.ai. Hey Nikolai, what are we talking about today? Hi Michael. We are talking about snapshots, and not only snapshots, but backups of large databases, and RPO, RTO. Actually, more RTO. RTO means how much time it will take for us to recover from some disaster, from backups.
Starting point is 00:00:33 So is it minutes, hours, days? For example, let's take some hypothetical database of 10 terabytes and consider what we would expect from a good backup system in various managed Postgres situations, and also self-managed in the cloud, maybe bare metal as well. Okay, pardon the bird sounds; it's a nice sound, it's spring. I've got to close the door; they decided to have a nest again here. This is what they try to do every time. Yeah. So yeah, good topic.
Starting point is 00:01:12 Why is it on your mind at the moment? Well, yeah, it's a good question. I just observe larger and larger databases, and I also observed different situations with managed services recently, and I see that it's really a big pain when you reach some level of scale, maybe not 10 terabytes, maybe 20 or 50 terabytes. At this level, it becomes a big pain to do various operations, including backups (I mean recovery first of all; backups as well, we will talk about them too), but also provisioning of new nodes, provisioning of replicas or clones. It becomes more and more difficult. And the only solution I see is snapshots.
Starting point is 00:02:09 And we have two big, I mean, the two most popular backup systems right now in the Postgres ecosystem: pgBackRest and WAL-G. In my opinion, all others became less popular. This is my perception, I might be wrong; I have only anecdotal data to prove it. But it's time for both of them to seriously think about cloud snapshots in cloud environments. When I say cloud snapshots, let's clarify: I mean snapshots of EBS volumes in AWS, of persistent disks (PD-SSD or others, Hyperdisk) in Google Cloud, and the alternatives in other clouds. So let's cover all three, but again, I have much less experience with Azure; my main experience is with GCP and AWS. So these snapshots, I think, must be used by those who create large database systems,
Starting point is 00:03:10 maintain them. And especially if you are a managed Postgres provider, it's inevitable that you should rely on these disk snapshots and restoration from them. And of course, I think AWS RDS Postgres and Google Cloud SQL already do it. I suspect Azure should also do it, but we have many more managed Postgres providers, right? So many now, yeah. Right. And we also have some people who self-manage their Postgres clusters. And we also have, of course, many users who use RDS or Cloud SQL or Azure. And I think it's good to understand some details about how backups of large databases are done and the role of snapshots in those backups. So yeah, I understand the attraction for huge, huge databases in terms of recovery
Starting point is 00:04:10 time, but are there also benefits for smaller databases as well? Or are we just fine recovering from a normal backup, pgBackRest style, or WAL-G, as you say? I know there are speed advantages for smaller sizes as well. That's a good question. Well, yeah, let me walk us through the steps of backup and recovery. Good idea. With WAL-G or pgBackRest, let's forget about delta backups and consider the basic, most fundamental process. First of all, we need to take a copy of the data directory and bring it to the backup archive. It's usually in an object store like AWS S3 or Google Cloud GCS, right? And in Azure it's Blob Storage, or however they call it.
Starting point is 00:05:13 And this copy of the data directory is not consistent by default, and that's okay, we can have an inconsistent copy. We can copy the data directory manually with rsync or with cp, it doesn't matter. It will take time to copy. And usually, for a terabyte, it's roughly one hour, or half an hour. If it's faster, you're lucky, you have good infrastructure; if it's slower than one hour, something should be tuned. But if you have 10 terabytes, we're already talking about five to ten hours. If you have 50 terabytes, it's more than one day. And it's inconsistent, but it's fine, because we do it while
Starting point is 00:06:02 wrapping this process up with two calls, pg_backup_start and pg_backup_stop. These two important calls tell Postgres that it should keep the WAL that will be needed in recovery. It's understood that the data directory copy is not consistent, but we have enough WAL to reach a consistency point. So when we recover this backup, this full backup, it will be copied back to some normal disk on another machine, right, and enough WAL will be replayed to reach the consistency point.
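For reference, a minimal sketch of such a manually wrapped base backup on Postgres 15+ (the paths and backup host below are placeholders, not from the episode; both calls must run in the same session, which is why the copy is launched from inside psql):

```bash
# Hedged sketch: manual base backup wrapped in the non-exclusive backup API.
# pg_backup_start/pg_backup_stop must be called from the same session, kept
# open for the whole copy, so the rsync is run from inside psql via \!
psql -X -v ON_ERROR_STOP=1 <<'SQL'
SELECT pg_backup_start(label => 'manual_base_backup', fast => true);
\! rsync -a --exclude 'pg_wal/*' --exclude 'postmaster.pid' /var/lib/postgresql/16/main/ backup-host:/backups/base/
-- Returns lsn, labelfile, spcmapfile; labelfile must be stored as backup_label
-- alongside the copied data directory for recovery to work.
SELECT labelfile, spcmapfile FROM pg_backup_stop(wait_for_archive => true);
SQL
```

In practice, pgBackRest and WAL-G do this orchestration (plus compression and parallel upload to object storage) for you; the sketch just shows what they wrap.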
Starting point is 00:06:53 And here is a problem: when you recover, sometimes you think, oh, how long will it take to reach the consistency point? And I have a recipe in my set of how-to articles: how to understand the current position in terms of LSN and how much is left to reach the consistency point. Because while that WAL is being applied, any connection attempt will be rejected; Postgres needs to reach the consistency point to open the gates and accept connections. And I think there was some work to improve logging of this process, and maybe it will land in Postgres 18; I saw some discussions and some patches, I think. But before that, while we still don't have transparent logging of that process,
Starting point is 00:07:40 I have some recipes, so you can use them to be less anxious, because sometimes you wait many, many minutes without understanding what's going on, and it's not fun.
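A rough way to watch this from the outside, along the lines of the how-to linked above (the data directory path is a placeholder; exact output wording varies by Postgres version):

```bash
# While WAL is replayed to reach the consistency point, connections are
# rejected, but progress can still be estimated from outside Postgres.
PGDATA=/var/lib/postgresql/16/main

# Where we need to get to: "Minimum recovery ending location" (and the
# checkpoint REDO location) from the control file.
pg_controldata "$PGDATA" | grep -E 'Minimum recovery ending location|REDO location'

# Where we are now: the startup process shows the WAL segment it is applying
# in its process title.
ps aux | grep '[s]tartup recovering'
```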
Starting point is 00:08:21 And that's it. This is the full backup, but it's only part one, because it will reach the consistency point and will represent the backup for some time in the past. But we also need to reach the latest state of the database, and for that we need the second part of the backup archive: the WAL archive. Write-ahead log files are 16 MB by default; on RDS they are tuned to 64 MB, but most clusters are at the default 16 MB. Each of these small files is also compressed in one way or another, so it takes not 16 but say seven, ten, or five megabytes maybe, depending on the nature of the data inside it. And then when we recover, we reach the consistency point and start replaying WAL: Postgres starts replaying WAL according to the restore_command configuration, which just tells it how to fetch the next WAL file, and Postgres will apply it. And there's prefetching, because, of course, object storage is good with parallelization.
Starting point is 00:09:15 It's not a good idea to use just a single process to fetch files one by one, sequentially. We can fetch, say, 10 files in parallel to move faster, because in a heavily loaded system, sometimes we see the source, the production, can generate many, many WAL files per second, and recovery can be like dozens of WAL files per second, depending on hardware and so on. So it's worth fetching very fast. And this process can take a long time.
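A hedged sketch of that restore side, assuming WAL-G (the bucket, data directory, and concurrency value are placeholders; in real setups the WAL-G environment usually lives in a service env file):

```bash
# Configure archive recovery: Postgres calls restore_command for each WAL file it needs.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command = 'wal-g wal-fetch %f %p'
EOF

# WAL-G can pull from object storage with several parallel workers.
export WALG_S3_PREFIX=s3://my-backup-bucket/wal-g
export WALG_DOWNLOAD_CONCURRENCY=10

touch "$PGDATA/recovery.signal"   # enter archive recovery on startup
pg_ctl -D "$PGDATA" start
```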
Starting point is 00:09:56 Of course, after you did the full backup, many WAL files can be archived until the latest point. You can restore to the latest point, or to some point in time; it's called point-in-time recovery. You can configure: I need this timestamp, or this LSN, and then you need to wait. So, of course, it's good if you have quite frequent full backups. It means that this additional WAL replay (there is WAL replay to reach the consistency point, but there is additional WAL replay to reach the latest or desired point in time) will be much shorter. But usually, replaying a whole day of WAL definitely takes less than a whole day; it takes maybe an hour or two. It depends a lot on the particular case, but usually it's much faster than it took to generate that WAL.
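Point-in-time recovery itself is configured with a recovery target; a minimal sketch (the timestamp and LSN values are placeholders; set only one recovery_target_* option):

```bash
cat >> "$PGDATA/postgresql.conf" <<'EOF'
# Recover up to a timestamp...
recovery_target_time = '2025-03-20 14:30:00+00'
# ...or, alternatively, up to an LSN:
# recovery_target_lsn = '5D/69000028'
recovery_target_action = 'promote'   # open for writes once the target is reached
EOF
touch "$PGDATA/recovery.signal"
```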
Starting point is 00:10:41 And to replay seven days of WAL, well, it might take up to one day, maybe less, it depends again. So people need to configure the frequency of full backups and understand that we might need to replay up to the whole period between two backups' worth of WAL. This affects recovery time as well, RTO, recovery time objective. To improve that situation, both pgBackRest and WAL-G support delta backups. So, instead of copying the whole 10 or 50 terabytes to GCS or S3, they copy only the delta compared to the previous backup. And then recovery looks like: restore the full backup and apply the delta backups. It's just diffs of files, basically, which is supposed to be faster than WAL replay and also much cheaper,
Starting point is 00:11:47 because you don't need to keep, for example, seven days of full backups; maybe you keep only one full backup plus deltas. It's already an optimization of budget, and also an optimization of our RTO, because replaying WAL is slower than applying those deltas. However, we still need to have the first full backup fetched and applied, and that's definitely slower than if we had this full backup made recently, so we could just apply it. Yeah.
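A hedged sketch of the delta options in both tools (the stanza name, data directory, and delta limit are placeholders, not from the episode):

```bash
# pgBackRest: differential (relative to the last full backup) or
# incremental (relative to the last backup of any type).
pgbackrest --stanza=main --type=diff backup
pgbackrest --stanza=main --type=incr backup

# WAL-G: backup-push creates a delta automatically when a base backup exists;
# WALG_DELTA_MAX_STEPS caps how many deltas may chain off one full backup.
export WALG_DELTA_MAX_STEPS=6
wal-g backup-push "$PGDATA"
```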
Starting point is 00:12:26 So anyway, if you use pgBackRest or WAL-G, basically, they ignore the existence of another alternative. Instead of copying the data directory to object storage, you can use cloud-native capabilities and snapshot the disk. And snapshotting a disk also brings data to object storage in the background, but it's on the shoulders of the cloud provider, AWS or GCP or Azure. And they have this automation, they use it in many processes, and these days it's already quite reliable in most cases.
Starting point is 00:13:08 This is, by the way, an objection I hear: copying the whole directory to object storage is reliable, but what will happen with those snapshots, who knows? Well, yes, okay. But a lot of mission-critical systems have already been using them for many years, right? And it's also a copy to object storage, just done in the background. Basically, you create a snapshot, and it takes only minutes even for a large volume like 10 or 50 terabytes.
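Creating such a snapshot is a single CLI call per disk; a sketch (the volume/disk IDs, zone, and names are placeholders):

```bash
# AWS: snapshot an EBS volume; the copy to object storage happens in the background.
aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "pg-primary-data $(date -u +%FT%TZ)"

# GCP: snapshot a persistent disk.
gcloud compute disks snapshot pg-data-disk \
  --zone=us-central1-a \
  --snapshot-names=pg-data-$(date -u +%Y%m%d%H%M)
```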
Starting point is 00:13:57 There is also a concept of delta snapshots, it's incremental: there's one snapshot, and then additional snapshots are done faster and take less space in object storage, so you pay less. And again, this is a capability of the cloud provider, like AWS or GCP, and it works. Then you can restore: a snapshot restore takes only a few minutes. This is the magic. Restore takes only a few minutes even for, like, 50 terabytes; it can take like 10-15 minutes. But, because I'm speaking about GCP and AWS here, there is a concept of lazy load. So it looks like the disk already exists and the data is already there, but in the background it's still pulling data from object storage. And that's the catch: recovery is fast, but then it's slow. I mean, you try to select from a table, but it's slow.
Starting point is 00:14:48 This concept is very well known to people who use RDS, for example, or work with EBS volume snapshots. It might be bad in some cases, and there is a trade-off. You may say, okay, I have a recovery time of only a few minutes, RTO is perfect in our case, but we will be slow initially, right? We will need to warm up the disks. Or you can say, okay, recovery time, RTO, is worse, but we are fast from the very beginning. Well, we need to warm up the buffer pool and page cache anyway, but that's a different story.
Starting point is 00:15:28 That usually takes only minutes, definitely not hours. So that's the story about snapshots. And I just observe that people have a tendency to like snapshots and prefer them, both for DR and for cloning of nodes, for purposes of creating replicas, standbys, or forking, anything, once they reach a level like 10-20 terabytes. Before that, it's not that obvious that snapshots are beneficial. Because, okay, we can wait a few hours. If you have 5 terabytes, okay, we can wait five hours, or two and a half hours, for a backup. We can wait to recover.
Starting point is 00:16:19 A couple of hours, not a big deal. But once we have 10, 20, 30, 50 terabytes, it already exceeds one day, or it exceeds eight hours, a working day, right? So this process already becomes a two-day job for someone. Right, this is, I think, the psychological threshold here. But I'm just thinking there are a couple of other use cases where that time does matter. Like if you decide that you don't want to go down the logical replication route for major version upgrades, for example, and you're willing to take a certain amount of downtime, one of the steps is to clone your database, right?
Starting point is 00:17:03 Do the upgrade on the clone and then switch over. And on the cloud, with a lot of cloud providers, that would be quicker, or less downtime. Why do you need the clone? I think if you do it in place, then you're down the whole time, whereas if you do it on a clone and then transfer any changes across
Starting point is 00:17:22 that customers have made in the meantime, you might be able to minimize it. How do you transfer changes? A script. Let's say it's not super busy. It's an interesting approach. Yeah. Sounds like reinventing the wheel of logical replication. So, if you have a complex system, how will you understand the diff, the changes? I'm thinking mostly about quiet systems where it's unlikely that there are that many changes, or maybe zero. Like if you're doing it at a quiet time and it's like an internal system. Well, for me, there is the in-place upgrade, and there is the zero-downtime approach involving logical replication.
Starting point is 00:18:06 Logical has variations, actually. But what you propose, it kind of breaks my brain a little bit. Well, all I'm thinking is I can imagine cases where being able to take a clone in a minute or two, instead of it being 10 or 20 minutes, is actually quite attractive, for the same reason you mentioned about eight hours. It's a different thing: if something's going to take 20 minutes, I'd stop and come back to it another time, I'd move on to a different task, but if it's only going to take one or two minutes, I might only check my emails, you know. Like there's a
Starting point is 00:18:38 very different feel to it. It might be that you need to do some research on a fork, on a clone. I call it a clone, some people call it a fork. You need to verify, for example, how long it will take to upgrade, an in-place upgrade. For testing, it's good to have an identical copy on the same kind of machine and just run the whole thing and see the log, because how else can you check it, right? And if forking takes a day, well, it's frustrating. Yeah, it's frustrating. But if it takes only dozens of minutes, it's good.
Starting point is 00:19:22 The only catch here, again, is that there will be lazy load. Lazy load, right, so you need to understand: okay, you have a fork, it pretends it's up and running. It's up and running, but it's slow. Doesn't that invalidate the test then? Because then some tests will be invalidated for sure. Maybe not the upgrade one. For a test of a major in-place upgrade with hard links, what I would do in this case is warm up only the system catalog part, because this is what pg_upgrade does: it dumps and restores system catalogs. Then I would ignore the remainder of the data, knowing that it will be handled by the hard links. So this is my warm-up. It's super fast.
Starting point is 00:20:10 So this is a perfect example then. But we need to understand the details, the internals. So we fork; during those 10-20 minutes, I mean the snapshot restore, we get an up-and-running Postgres, we dump the schema a couple of times (well, one time is enough actually), and then we run pg_upgrade -k, or --link, and see what happens, how long it takes for us. And we are not worried about data actually still being in GCS or S3 or Blob Storage. It doesn't matter in this particular case.
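A sketch of that experiment on the clone (binary and data directory paths are placeholders for the old and new major versions; this is an illustration, not the episode's exact commands):

```bash
# Warm up only the system catalogs, e.g. by dumping the schema, since that is
# what pg_upgrade actually reads; user data will be handled by hard links.
pg_dump --schema-only --dbname=postgres > /dev/null

# Time the in-place upgrade with hard links (-k / --link).
time pg_upgrade \
  --old-bindir=/usr/lib/postgresql/15/bin \
  --new-bindir=/usr/lib/postgresql/16/bin \
  --old-datadir=/var/lib/postgresql/15/main \
  --new-datadir=/var/lib/postgresql/16/main \
  --link
```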
Starting point is 00:20:56 But if it's something else, for example, you want to check how long it will take to run pg_repack or VACUUM FULL or pg_squeeze on a large table without additional traffic, it's also useful data. You know that additional traffic will affect the timing in a negative way, but you want to measure at least in empty space, right? Or even adding a new index, like how long would that take? Also, yeah, yeah, yeah. In this case, the fact that we have the lazy load problem with snapshot restore, well, it will affect your experiment for sure.
Starting point is 00:21:30 Here you will have bad timing, in production you will have good timing, and you will regret that you did the experiment at all. You will stop trusting those experiments if you don't understand this lazy load problem. If you understand lazy load, you first warm up the particular table, just doing, I don't know, a select of everything from it, right, and then create the index and measure. Right. So yeah, just to get an idea, how much slower are we talking, is there some kind of rough number? Let's say you select star: how much slower would you expect it to be while it's still lazy loading, versus on production? That's a tough question. I don't have good recent examples. The first time I heard about lazy load was 2015 or something. That was the first time I was working with RDS.
Starting point is 00:22:26 This is when the importance of working with clones started to form in my mind. I was super impressed by how fast we can clone a terabyte database and start working and experimenting with it. But I was upset about why it was slow, so I reached out to support and they explained lazy load and pointed to their articles. By the way, AWS explains lazy load quite well, at various levels I think, both at the basic EBS volume level and also for RDS.
Starting point is 00:22:58 But I must admit Google Cloud doesn't do a good job here in terms of documenting this. So I know it exists, I know they admit it exists, but it's not well documented. There's a good opportunity to catch up here in terms of documentation and understanding. As for the question of duration, I don't know; definitely it can be frustrating. So if you select, for example, count(*) from a table, if it's a sequential scan, it will have the same warming-up effect, right? And it will not spam your output. Well, there is a different way not to spam your output.
Starting point is 00:23:36 It's just select from the table without specifying columns, and maybe also somehow select an array. Will EXPLAIN ANALYZE show it? Yeah, by the way, yes. Yes. So the I/O timing will be really high in that EXPLAIN ANALYZE, if it's enabled; I hope it's enabled.
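For example (the table name is a placeholder):

```bash
psql -X <<'SQL'
SET track_io_timing = on;
-- "I/O Timings: read=..." in the output shows how much of the runtime was
-- spent waiting on reads, which is where lazy load shows up.
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM my_big_table;
SQL
```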
Starting point is 00:24:07 So I think, while select count(*) is already known to be slow in Postgres (because of its row storage, MVCC, and so on), here it can be an order of magnitude slower, I think. Maybe two orders of magnitude. I don't have numbers in my head. Yeah, I was just trying to understand if it was 10% or double or an order of magnitude, and yeah, that makes more sense.
Starting point is 00:24:24 It's really fetching data from object storage in the background. So it can be frustrating, this is what I can tell you. But again, if you aim to warm up a node just restored from a snapshot, a Postgres node, you can benefit from the fact that object storage is good in terms of parallelization of operations. You can warm it up with not just a single session, a single connection to Postgres; you can run 10 or 20 parallel warming-up queries. And this is the recommendation for how to warm up, because of lazy load.
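Two hedged ways to do that warm-up in parallel (the data directory path, degree of parallelism, and table name are placeholders, not from the episode):

```bash
# Read every data file with several parallel readers; this forces blocks to be
# pulled from object storage and warms the OS page cache.
find "$PGDATA/base" -type f -print0 \
  | xargs -0 -P 16 -n 32 sh -c 'cat "$@" > /dev/null' _

# Or, from inside Postgres, warm specific tables (run several sessions in parallel):
psql -X -c "CREATE EXTENSION IF NOT EXISTS pg_prewarm;"
psql -X -c "SELECT pg_prewarm('my_big_table', 'read');"
```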
Starting point is 00:24:53 By the way, AWS also offers something they implemented a few years ago; I think it's called fast snapshot restore, FSR, or something like that, I might be mistaken. But if my memory is right, they support up to 50 snapshots marked as available for fast restore in a region, I think, for a particular account.
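Enabling it is one call per snapshot and availability zone; a sketch (the snapshot ID and AZ are placeholders, and per-account limits apply as discussed):

```bash
aws ec2 enable-fast-snapshot-restores \
  --availability-zones us-east-1a \
  --source-snapshot-ids snap-0123456789abcdef0
```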
Starting point is 00:25:32 So you can mark some snapshots and the lazy load problem will disappear for those snapshots. This is a good thing to have, and I'm not aware of a similar thing for Google Cloud or Azure. But this means that, say once per week, I don't know, you have a good snapshot from which you can recover very fast and then replay WAL. So you have super fast recovery; for experiments, that's already super cool. Super cool, right. What about the actual RTO discussion? What are you seeing people with dozens of terabytes opt for in terms of that trade-off? Are they happy to accept that things will be slow, it's kind of like a brownout situation for a while, but they're back online at least? Or do they prefer a longer RTO but
Starting point is 00:26:26 everything works exactly as expected once it is back online? Well, I think nobody is happy when recovery takes 10, 20, 30 hours. Nobody. Of course. Right, so this becomes a problem, and the characteristics of the system become very bad. I mean, it becomes unacceptable for the business to realize that we will be down a whole day or two just to recover from backups. But before the business realizes it, operations engineers realize it, because it's really hard to experiment.
Starting point is 00:27:02 So, well, and the cases where this happens are super rare, right? Like, we're not talking about failover; all of these setups would generally have high availability anyway, right? So we're talking only about a very specific class of recovery where you have to recover from backups, like there's no failover option, which I'm guessing is pretty rare. People have to think about it, but they don't face it very often. Yeah. Ten-plus years ago, cloud environments were super unreliable and failovers happened very
Starting point is 00:27:35 often. Right now, again, at least for Google Cloud and AWS, I see nodes can run for years without disruption, so it's much better right now in terms of lower risks. Still, we have a habit of considering cloud environments as not super reliable. For example, it also depends on particular instance classes and sizes, I think. If you have a very popular instance family, very small, migration might happen often. For example, we were down, because we are on GCP, I mean Postgres.ai. We were down because the Kubernetes cluster was migrated again, like our nodes were migrated. And then we had a problem, let me admit, we had a problem with backups, because after a reconfiguration
Starting point is 00:28:41 half a year ago, a .history file for some timeline in the future (which was in the past, but for us currently was in the future) was left in a pg_wal subdirectory. And after such a migration, caused by the migration of VMs in Kubernetes, we couldn't start, because a promotion happened, the timeline changed, and there was basically a rogue file from a previous life corresponding to the same timeline number. It was a nasty incident. I think we were down an hour until we figured out the misconfiguration, unfortunately. Yeah, it's a bad problem.
Starting point is 00:29:16 It was good that it happened on a Saturday, so not many users of our systems noticed. And I must thank my team, a couple of folks who jumped very quickly onto Zoom and helped to diagnose and fix it. So yeah, back to the question. In my opinion, again, snapshots are inevitable to use if you reach dozens-of-terabytes scale. And again, there are several levels of trust here. First level: are they consistent? The answer is we don't care, because we have pg_backup_start and pg_backup_stop, right? Second question: are they reliable, and what about possible corruption and so on?
Starting point is 00:30:00 The answer is: test them, and test backups. You must test backups anyway, because otherwise it's a Schrödinger backup system; we don't know if it will work, so we must test. And once a large enough number of tests show positive results, trust will be built. And I have several... At the beginning of the year I announced that every managed Postgres provider must consider pg_wait_sampling and pg_stat_kcache as very important pieces of observability, to extend what we have with pg_stat_statements. Even if you are RDS and have Performance Insights, still consider adding pg_wait_sampling,
Starting point is 00:30:48 because this will add very flexible tooling for your customers. Now I say, for backups: of course, the big guys like RDS already use snapshots, 100%. We know it indirectly from my own suffering from lazy load back in 2015 or 2016. And now, all competitors of RDS, or alternatives to RDS: please consider snapshots, because I know you don't use them yet. It's worth it, because your customers grow, and once they
Starting point is 00:31:28 reach 10, 20, 30, 40, 50 terabytes, they will go to RDS or self-managed and stop paying you, 100%, because it's a big pain; we just described this pain. And those who self-manage should also consider snapshots, right? But most importantly, developers of WAL-G and pgBackRest: consider supporting snapshots, cloud disk snapshots, as an alternative, an official alternative to the full backups and delta backups you already have. In this case it would become native, and yeah, I think it's worth doing. Maybe other types of snapshots too. So like an external
Starting point is 00:32:13 tool for snapshotting: if it exists in the infrastructure and is natively supported by the backup tool, I think it's a good idea. So the backup tool would perform all the orchestration and skip some of the full backups. It can be a mix. The ideal backup system for me would be a mix of occasional full backups, a continuous WAL stream, and snapshots as the main tool to provision nodes and the main strategy for DR, but not the only one. So full backups would exist separately, occasionally. Again, like, I would
Starting point is 00:32:47 have full backups. By the way, both snapshots and full backups must be done, and I see on some systems they do it on the primary; it must be done on replicas. They must be done on replicas, because it's a big stress for the primary; even a snapshot is a big stress for the primary, as I learned. So, yeah, it's good to do it on a replica. And pg_backup_start and pg_backup_stop support so-called non-exclusive backups, so it can be done on replicas, on physical replicas. And even if they are lagging for some time, it doesn't matter, because even a lag of some minutes, nobody will notice, since we have a continuous WAL stream and replaying a few minutes of WAL is very fast. Yeah, I saw in RDS's documentation that for multi-availability-zone deployments, they take the backups from the standby. That's absolutely right, this is what should happen.
Starting point is 00:33:49 Yeah. And snapshots should also be done from there. I saw, well, maybe they were specific cases, but I saw some cases when a snapshot of the primary's disk affected performance, because in the background data is being sent to object storage, and it definitely generates some read I/O on our disks. Yeah. So I'm curious though, why have you mentioned
Starting point is 00:34:18 that in an ideal world you have a mix? What's the benefit, like once you move to snapshots, what's the benefit of also having traditional pgBackRest-style backups as well? It's a good question. I remember reading Google Cloud documentation many years ago, and it said that when we create a snapshot of a disk, this snapshot is not guaranteed to be consistent if an application is running and writing.
Starting point is 00:34:49 And then suddenly that sentence was removed from the Google Cloud documentation. It's a matter of trust, right? So now I trust Google Cloud snapshots and EBS volume snapshots in AWS quite a lot. I have trust. But still, some part of me is paranoid. I would keep full backups, not very frequent, just to be on the safe side. But maybe over time trust will become complete, like if we have a lot of data proving that snapshot creation and recovery is very reliable. And, well, consistency, by the way, doesn't matter. The problem I mentioned in that documentation, it's not about whether snapshots are consistent or not.
Starting point is 00:35:40 Honestly, I don't even know. And I don't care, because I have pg_backup_start and pg_backup_stop and I'm protected from an inconsistent state; Postgres will take care of it. But it's a matter of trust, like lack of transparency in documentation, understanding what's happening, and so on. So only obtaining experience over time can bring some trust and an understanding that the system works fine. And regular testing; testing should be automated, recovery testing of backups and so on. So now I think it doesn't matter if they are consistent or not, it matters if they
Starting point is 00:36:21 work, right? If we can recover and then reach the consistency point by our own means, I mean by what Postgres has, that's it. But still, what if I see snapshots created, but I cannot use them? And also bugs happen, people change systems. I mean, cloud providers are also being developed all the time, and new types of disks and other changes are introduced all the time. And so what if the snapshots will not work? If I have a very reliable testing process, if I test maybe, for example, all snapshots, I restore from them automatically all the time, and I see that over the last few months, every day we created snapshots, maybe every few hours we created snapshots, and we tested them all, maybe at some point I will say, okay, full backups by pgBackRest or WAL-G are not needed anymore. But for now, my mind is in a state of paranoia. Very reasonable.
Starting point is 00:37:27 Sounds good? Yeah. Only one more interesting thing here: if you use snapshots, there is a catch for large databases. EBS volumes and persistent disks in Google Cloud are limited to 64 terabytes. I think some types of EBS volumes are limited to 128 terabytes, if I'm not mistaken, so the limit has increased. But 64 terabytes, if you experience problems being at the level of 10, 20, 30, 50 terabytes,
Starting point is 00:38:02 64 terabytes is not far away already. Yeah, that's true. And then the question will be what to do. Because if you use a regular backup, a full backup, by means of pgBackRest or WAL-G, you can have multiple disks and, again, it doesn't matter: thanks to pg_backup_start and pg_backup_stop, you can copy data from multiple disks. It can be LVM and multiple disks organized to give you more disk space. Like tablespaces? No, no, no. Not tablespaces; a tablespace is a different thing, it's at the Postgres level.
Starting point is 00:38:51 If you need more, you can combine multiple disks using LVM2 and have one big, basically, logical volume for Postgres. For regular backups, when you copy data, pgBackRest or WAL-G copies data to object storage, and it doesn't matter that there are many underlying disks contributing to this big logical volume, because in Postgres we only have files up to one gigabyte: each huge table is split into one-gigabyte files, so they are compressed and copied by our backup tool.
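A minimal sketch of combining disks that way (device names, volume names, and the stripe count are placeholders, not from the episode):

```bash
# Combine two cloud disks into one striped logical volume for the data directory.
pvcreate /dev/nvme1n1 /dev/nvme2n1
vgcreate pgdata_vg /dev/nvme1n1 /dev/nvme2n1
lvcreate -i 2 -l 100%FREE -n pgdata_lv pgdata_vg   # -i 2 = stripe across 2 disks
mkfs.ext4 /dev/pgdata_vg/pgdata_lv
mount /dev/pgdata_vg/pgdata_lv /var/lib/postgresql
```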
Starting point is 00:39:36 But if you start using snapshots, it's interesting. Well, also, people at that scale... I'm thinking of the companies we've talked to in that area. We had a good hundredth episode, didn't we? And a lot of them are considering sharding by that point anyway. So I guess in some cases that... Or deleting data, right? My favorite advice: just delete some data and sleep well. Yeah. But, well, the problem is, again, it's not a problem that you will create multiple snapshots. If those snapshots are created between pg_backup_start and pg_backup_stop, you will be able to recover from them. The only question will be: will LVM survive this? And this is a question not about Postgres, it's a question about the cloud provider and how the snapshot orchestration is done.
Starting point is 00:40:25 And this topic already goes beyond my current understanding of things here. I can just suggest, I can recommend: okay, use snapshots. But if you reach the limit and you need to start using LVM, it's a different story. So you need to check how LVM will survive a snapshot restore, and test. And test, yeah. I think it's definitely solvable, because, like, RDS did solve that, right? So yeah, good point. Right, right. Well, that's actually probably a really good place to...
Starting point is 00:41:02 I don't know how you're doing for time, but it makes me think about the likes of Aurora, or people that are doing innovations at the storage layer, where this kind of problem just doesn't exist anymore. You know, they've already got multiple... I guess we do still need to have disaster recovery; you know, if all of Aurora is down, we need a way of recovering. But they are promising to handle that for us in this distributed world, right? Like, that's part of the promise of keeping copies at the storage layer.
Starting point is 00:41:59 Yeah, well, there are definitely new approaches where data is stored on object storage originally. Not only backups; it becomes the primary storage. And then there is big orchestration to bring data closer to compute. But it's all hidden from the user. It's an interesting approach, but there are also doubts that this approach will win in terms of market share. So it's interesting. It's definitely an area where I don't know many things. Maybe our listeners will share some interesting ideas and we can explore them. Maybe we should invite someone who can talk about this more. Yeah, sounds good. Yeah, yeah.
Starting point is 00:42:30 But anyway, I think snapshots are great. I think they can be great for smaller databases as well; like smaller, I mean one terabyte, still, because instead of one hour, it will be a minute, right? I'm curious about that area, because I think there are a lot more databases in that kind of range of hundreds of gigabytes to a couple of terabytes than there are companies having to deal with dozens of terabytes. But yeah, if we can
Starting point is 00:42:59 make their developer experiences much better or... I agree. A few terabytes, I would say, is already mid-market in terms of database size, so it's very common to have one or two terabytes. And snapshots can be beneficial, because operations speed up significantly and RTO is improved, if you remember about lazy load and know how to mitigate it or accept it. Good. Good. Nice one, Nikolai. Thank you for listening. Again, if you have ideas in this area, I will be happy to learn more, because I'm learning all the time myself. And I think Michael is also in the same
Starting point is 00:43:46 shoes, right? Yeah, I find this stuff interesting. I have to deal with it a lot less than you, which I'm grateful for, but yeah, definitely let us know, I guess via YouTube comments or on various social media things. Yeah, good, nice one. We'll catch you next week, Nikolai. Bye bye. Bye.
