Postgres FM - Disks

Episode Date: August 29, 2025

Nik and Michael discuss disks in relation to Postgres — why they matter, how saturation can happen, some modern nuances, and how to prepare to avoid issues.

Here are some links to things they mentioned:

Nik's tweet demonstrating a NOTIFY hot spot: https://x.com/samokhvalov/status/1959468091035009245
Postgres LISTEN/NOTIFY does not scale (blog post by Recall.ai): https://www.recall.ai/blog/postgres-listen-notify-does-not-scale
track_io_timing: https://www.postgresql.org/docs/current/runtime-config-statistics.html#GUC-TRACK-IO-TIMING
pg_test_timing: https://www.postgresql.org/docs/current/pgtesttiming.html
PlanetScale for Postgres: https://planetscale.com/blog/planetscale-for-postgres
Out of disk episode: https://postgres.fm/episodes/out-of-disk
100TB episode: https://postgres.fm/episodes/to-100tb-and-beyond
Latency Numbers Every Programmer Should Know: https://gist.github.com/jboner/2841832
Fio: https://github.com/axboe/fio

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork

Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard. I'm joined as usual by Nik, founder of Postgres.AI. Hey, Nik. Hi, Michael. How are you? I am good. How are you? Very good. Great. And what are we talking about this week?

Disks. If you imagine the typical database icon — picture how we usually visualize a database on various diagrams — it consists of disks, right? Yeah, like three. I'm thinking of a cylinder. Sometimes a cylinder, yeah, with normally three layers. Yeah, three or four. And obviously databases and disks are close to each other, right? But my first question: why do we keep calling them disks?

Hmm. An outdated term, you mean? Yeah, obviously. I don't know — what does the D in SSD stand for? Yeah, actually, sometimes we say logical volumes, storage volumes, something like this, and in the cloud context especially EBS volumes, right? We talk about them like that. But in all cases it's still acceptable to say disks. But disks don't look like disks anymore, right? They are rectangular, microchips instead of rotational devices. Yeah, makes sense — in most cases, not in all cases. Rotational devices can still be seen in the world, but not often if we talk about OLTP databases, because it's not okay to use rotational devices if you want good latency. So, disks — because databases require good disks and depend on them heavily. In most cases, not in all.
Sometimes it's fully cached, so we don't care. If it's cached, right. Yeah, I was going to ask you about that, because I think even in the fully cached state, if we've got a lot of writes, for example, we might still want really good disks — there are things where we're still writing out to disk, and we want that to be fast, not just reading. But we are not writing to disk — if we move to the Postgres context, we don't write to the disk except WAL, right? Yes.

Yeah, and that's it. Well, yeah, I agree it can be expensive if a lot of data is written. So, yeah, you're right, because we need to write our tuples, and if it's a full-page write after a checkpoint, we need to write the whole page, an eight-kilobyte page. Yes. And we need fsync before the commit is finalized, so that definitely goes to disk. But the data in terms of tables and indexes — that is written only to memory, and it's normally left for the checkpointer to later write it first to the page cache, and then the page cache can use pdflush or something to write it further to disk. But yeah, in terms of fsync, write latency is important — it affects commit time. By the way, I just had a case.

It's slightly off-topic, but I published a tweet and LinkedIn post about LISTEN/NOTIFY. I added it to my list of deprecated stuff. It's not deprecated, right? But you're saying you recommend not using it at scale? Yeah, well... Or possibly at all. Yes. My Postgres vision deviates from the official vision in some cases. For example, the official documentation says don't set statement_timeout globally because blah, blah, blah, and I don't agree with this. In OLTP, it's a good idea to set it globally to some value and override it locally when needed.
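For reference, a minimal sketch of what that looks like — the values here are made-up examples, not recommendations from the episode:

```bash
# Set a conservative global default for an OLTP system (value is an example):
psql -c "ALTER SYSTEM SET statement_timeout = '30s'"
psql -c "SELECT pg_reload_conf()"

# Override locally where a longer-running statement is expected, e.g. per transaction:
psql <<'SQL'
BEGIN;
SET LOCAL statement_timeout = '30min';  -- applies only inside this transaction
-- ... long maintenance query here ...
COMMIT;
SQL
```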
And here, with LISTEN/NOTIFY, I just think we should abandon it completely until it's fully redesigned, because there is a global lock. One of our customers, Recall.ai, published a great post about this because they had outages. And it's related to our topic in an interesting way. To reproduce it, I used a bigger machine, and the issue is with NOTIFY: at commit time, it takes a global lock to serialize NOTIFY events. A global lock, like on the database — an exclusive lock — insane. And if commit is fast, everything is fine. But if, in the same transaction, you write something, the commit writes WAL and waits a little bit, right? In this case, contention starts because of that lock. So if you have a lot of commits which are writing something to WAL, meaning they need fsync and have to wait on disk — if the disk is slow and you use NOTIFY, this doesn't scale. Performance will be terrible very soon; at some concurrency level you will have issues, and you will see commits spanning many milliseconds, then dozens of milliseconds, then up to seconds, and eventually the system will be down. Anyway, this is related to slow disks. You are right: if write latency is bad, we might have issues.
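A rough way to see this effect yourself — this is a sketch, not the exact script from Nik's tweet, and the table and channel names are made up:

```bash
# Each transaction writes a row (forcing a WAL fsync at commit) and sends NOTIFY,
# so commits have to queue on the global notification lock under concurrency.
psql -c "CREATE TABLE IF NOT EXISTS notify_bench (id bigserial PRIMARY KEY, payload text)"

cat > notify_bench.sql <<'SQL'
BEGIN;
INSERT INTO notify_bench (payload) VALUES ('x');
NOTIFY bench_channel, 'x';
COMMIT;
SQL

# Watch latency climb as the client count grows; compare with the NOTIFY line removed.
pgbench -n -f notify_bench.sql -c 50 -j 8 -T 60
```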
Yeah, but you're right too that the majority of the time we care about the quality of our disks, it's when our data isn't fully in memory and we're worrying about reading things either from disk or even from the operating system's page cache. It's hard to tell from Postgres sometimes where it's coming from. But we have a— It's impossible to tell in Postgres unless you have the pg_stat_kcache extension. That's why — since BUFFERS is already on by default in Postgres 18 — again, I advertise to all people who develop systems with Postgres: if possible, include the extensions pg_wait_sampling and pg_stat_kcache. And pg_stat_kcache can show it.

Yeah, I think you're right that it's impossible to be certain without those. But, for example, with the I/O timings — which is another thing people might want to consider having on, obviously with a bit of overhead... track_io_timing, you mean? track_io_timing gives you an indication: if you're seeing not too many reads from either the disk or the operating system, and the I/O timings are bad, you've got a clue that it's coming from disk. Yeah, indirectly we can guess that that's where the time was spent. "Not too many" is a good point, because sometimes it's fully cached in the page cache: we see reads, and since there are so many of them, the I/O timing is spent reading from the page cache into the buffer pool, and disk is not involved — if volumes are huge. But if volumes are not huge and a significant time is still spent, very likely it's from disk. Exactly, yeah. This is something we added to our product just as a tip; it doesn't come up that often. And when track_io_timing isn't available — because most people don't have it on — we actually just use the buffers, like shared read, and then the timing, the total time of the operation.

By default it's not on. Yeah. In big systems, we have it on. I never saw big problems on modern hardware, at least Intel and ARM — Graviton 2 on Amazon — I just see it working well. There is a utility to check your infrastructure and understand if it's worth enabling, but my default recommendation is to enable it. Of course, there might be an observer effect, and it can be double-checked if you want to be serious with this change, but I just see: we enable it.
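For anyone following along, enabling it is a small change (superuser required; this is a sketch, and exactly where the timings show up depends on your Postgres version and extensions):

```bash
# Enable cumulative I/O timing; it can be turned on with a config reload, no restart.
psql -c "ALTER SYSTEM SET track_io_timing = on"
psql -c "SELECT pg_reload_conf()"
psql -c "SHOW track_io_timing"

# Once enabled, read/write timings appear in EXPLAIN (ANALYZE, BUFFERS) output
# and in the timing columns of pg_stat_statements and pg_stat_database.
```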
Yeah, it's all to do with the performance of the system clock checks. And I think, for example, the setups I've seen with really bad performance there are dev systems running Postgres inside Docker and things like that, which still have really slow system clock lookups. But most people aren't doing that with production Postgres databases, and I haven't seen any of the cloud providers be slow. I think it's pg_test_timing or something like that, to double-check? Yeah, you can run it really easily. But what if it's managed Postgres? You cannot run it there. In this case, you need to understand what type of instance is behind that managed Postgres instance and take the same instance in the cloud. For example, if it's RDS, from the RDS instance class you can easily understand which EC2 instance it is, right? Yeah. You can install it there.
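A sketch of that check on a self-provisioned twin of your instance (the duration flag is optional):

```bash
# Reports the per-call overhead of reading the system clock; on healthy modern
# hardware the per-loop time is typically tens of nanoseconds, which keeps
# track_io_timing and EXPLAIN ANALYZE timing cheap. Microseconds would be a red flag.
pg_test_timing -d 5   # run the measurement loop for 5 seconds
```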
It will be... well, the operating system matters also, right? There are some, yeah. There are some tricks you can do, like running things that call the system clock a lot — nested-loop-type things, or counts, aggregation, things like that — trying to get lots of loops. Ah, you're talking about testing at a higher level, at the Postgres level. Yeah. Oh, that's a good idea. Yeah, a lot of nested loops. And you test with this and without it, running it like 100 times, taking the average, and comparing averages — from that you can guess. Yeah, it's a good test, by the way. I think the first time I saw that was from Lukas Fittl — I think he must have done a Five minutes of Postgres episode on this kind of thing, so I'll link that up.
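A minimal version of that higher-level check — the query is just a synthetic example that produces many timed row operations; repeat each variant several times and compare averages:

```bash
# Per-node timing on: every row touch reads the clock.
psql -c "EXPLAIN (ANALYZE, TIMING ON)  SELECT count(*) FROM generate_series(1, 5000000)"

# Per-node timing off: same work without per-row clock reads.
psql -c "EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM generate_series(1, 5000000)"

# A large gap between the two total execution times suggests expensive clock lookups.
```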
Yeah, I'm glad we touched on this, because, again, our default recommendation is to have it enabled. It's super helpful in pg_stat_statements analysis and in EXPLAIN ANALYZE plans. So yeah, track_io_timing, if possible, should be enabled. And this is related to disks directly, of course. Yeah. Although, strictly speaking, it's not timing of disks — it's timing of reading from the page cache into the buffer pool, so it might include pure memory timing as well. That's why your comment about large or not large volumes is important. But honestly, if you're a backend engineer, for example, listening to this episode, I can easily imagine that in one month you will forget about this nuance and will think about track_io_timing as being only about disks, right? And it's okay, because it's a really super narrow thing to memorize.

I guess this is moving the topic on a tiny bit, but if you're on a managed Postgres setup — which a lot of backend engineers working with Postgres are — you don't have control over the disks. You're probably not going to migrate provider just for quality of disks. Maybe you would, but it would have to be really bad, and you'd have to be in a setup that really was hammering them: maybe a super write-heavy workload, or huge data volumes that you can't afford to have enough memory for. You know, those kinds of edge cases where you're really hammering things.

Well, there are two big areas where things can be bad — bad meaning saturation, right? Yeah. We can saturate disk space, so to speak — out of disk space — and we can saturate disk I/O. Both happen quite often. And managed Postgres providers are not all equal, and clouds are not all equal.
They manage disk capacities quite differently. For example, at Google, at GCP, I know regular PD-SSD — quite old stuff — has a maximum of 1,200 mebibytes per second, separately for reads and separately for writes, speaking of throughput, and they have 100,000 or 120,000 IOPS maximum, right? And I know from past discussions with Google engineers that the real capacity is actually bigger, but it was not sustainable, so it was not guaranteed all the time. They could raise the bar, but it would not be guaranteed, so they decided to choose the guaranteed bar for us. Makes sense, yeah. But basically we're not using the full possible capacity — we could use more, but we cannot, so they throttle it. Okay, interesting — artificially, to have guaranteed capacity for disk I/O. Interesting.

I guess the subtlety that I was missing was not when you're at the maximum. So in between tiers — imagine you're in a much smaller setup — I see a lot of people just upgrading to the next level up within that cloud provider to get more IOPS. You know, if you're on Aurora, just scaling up a little bit instead of switching from Aurora to Google Cloud. But you're right that when you're at the last, or second to last, level is when people start to worry, isn't it? When you're at the last level, you can't just scale up on that cloud provider anymore. So yeah, really good point.

And also at Google, for example — I know these rules are artificial — this throttling I just told you about can be tightened further if you don't have many vCPUs. The maximum possible throughput is achieved only if you have 32 vCPUs or more, I remember. If it's less, it's lower, and it can also depend on the instance family, I think. So, complex rules. On Amazon, with EBS volumes, there is gp2, gp3, io1 — you choose between them — and there is also provisioned IOPS. Really complex, right? And you haven't even mentioned burst IOPS, yeah. Yeah, yeah. So hitting the IOPS limit is really easy, actually.
If you... When do you... Yeah, well, the times I see people hitting it, it's like they're doing a massive migration. No, no, that's really... Okay, when do you... Just growing. Yeah, the project just grows, and then database latency becomes worse. Why? We check and we see... Well, if you have experience looking at graphs, you can easily identify some plateau — it's not an ideal plateau, usually there are some small spikes — but you feel: oh, we are hitting the ceiling here. You're checking disk I/O. A performance cliff? It's not a cliff, no, it's a wall instead of a cliff. This is an important distinction. A cliff is when everything was okay, okay, okay, and then with slightly more load or something you're completely down, or down drastically, 50-plus percent. Okay, yeah. Here we have a wall: everything is okay, okay, okay, and then slightly not okay, slightly not okay, you know? And then more load is coming, and we start queueing processing, right, accumulating active processes. So with a performance cliff, if you raise load slowly, there is an acute drop in the capability to process the workload. In the case of hitting the ceiling in terms of saturation of disk I/O or CPU, it's different: you grow your load slowly, and then you grow further, and things become worse, worse, worse. It's not acute — it's slightly more, more, more — and things become very bad only if you grow a lot further, right? So it's not an acute drop. It feels like hitting a wall. You know, like if you imagine many lines in a store — for example, we have several cashiers, eight, for example.
And then normally the lines should be zero or one person, maybe two. This is ideal: throughput is fine, everything's good, we haven't saturated them. Once we've saturated them, we see lines accumulating, and latency — meaning how much time we spend processing each customer — starts to grow, but it doesn't grow acutely, boom, no. A performance cliff is, for example — if we talk about cash only, no cards involved — suddenly we run low on cash for change in all lines, right? And the cashiers suddenly all start asking: okay, do you have change? I have change. Okay, we're processing. And then suddenly we're completely out of cash to give change. This is an acute performance cliff: they say, okay, we cannot work anymore. Boom. Right? We need to wait until someone goes somewhere, and that's basically 15 minutes of waiting. This is the important distinction between a performance cliff and hitting the wall or ceiling.

Okay, I haven't heard that stricter definition before. It sounds to me like you're describing the difference between a blackout and a brownout — have you heard of a brownout? A blackout is kind of like your database can't accept writes anymore, or even selects — no reads, everything is down. A brownout would be: it's still working, but people are seeing spinning loaders, and maybe it loads after 30 seconds, or maybe some people are hitting timeouts and some people aren't — like the queuing issue in the supermarket you talked about. Performance is severely degraded, but it's not completely offline; it's still working, at least for some people. So it feels like that's the kind of distinction. Yeah, and a brownout can become a blackout if you keep adding load — if saturation happened at some workload level but you give it 10x, of course it will be a blackout, because of context switching and so on. But that's different; a performance cliff happens very quickly, it's a much more acute situation.
I think I'm also biased by the cases that I've seen, which are more acute, because they are bulk loads or backfills that are running at a much, much higher rate than they normally would — they're consuming IOPS at a much higher rate than normal — so they hit it really fast, and it's like running at the wall extremely fast. But I guess if you approach the wall slowly, it's not going to hurt quite as much. Yeah. Okay, I think I understand. Yeah, back to disks. Definitely we should check disk I/O usage and saturation risks. So you mean, like, monitor for it, and alert when we're close to our limits? Yeah, yeah.
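On a self-managed box, a quick way to keep an eye on this is plain iostat; the managed-service metric names mentioned below are examples of what to watch, not a complete list:

```bash
# Per-device IOPS (r/s, w/s), throughput, request latency (r_await/w_await, in ms)
# and %util, refreshed every 5 seconds. Sustained %util near 100 with growing await
# usually means the volume's IOPS or throughput cap is being hit.
iostat -x 5

# On managed Postgres, use the provider's metrics instead — e.g. on RDS, CloudWatch
# ReadIOPS, WriteIOPS and DiskQueueDepth — and alert well before the advertised ceiling.
```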
Yeah, and also it might be interesting — for example, I remember, many years ago on RDS, we asked: okay, we're a small system maybe, we need 10,000 IOPS, but we see saturation at 2,500 somehow. Oh, there is RAID there, actually — there are four disks — and that's like, okay, okay. So there are interesting nuances there. But understanding your limits is super important, and I think clouds could do a better job explaining where the limits are, because right now you need to do a lot of legwork to figure out what your advertised limit even is. For example, as I said, at GCP you need to understand how many vCPUs you have. Also the disk size — I forget whether it's at 10 terabytes or one terabyte that you reach the maximum; memory fails me a little bit. So you need to take many factors into account to understand: oh, our theoretical limit is this. And then ideally you should test it, to see that it can actually be achieved. Testing is also interesting because, of course, it depends on the block size you're using, and on whether you're testing through the page cache or with direct I/O, writing directly to the device. And then you go to the graphs in monitoring and see some disk I/O in terms of IOPS and throughput, separately for reads and writes, and then you think: okay, let's draw a line here, this is our limit.

So what I'm saying is, they should draw the line. Clouds should draw the line. They know all these damned rules, which are really complex, so this should be automated. This line should be automated: okay, with this, this, this and this, we give you this — this is your line in terms of the capabilities of your disk, and here you are, okay, at 50%. Right now it's a whole day of work for someone to understand all the details, double-check them, and then correct mistakes. Even if you know all the nuances, you return to this topic and — oh, I forgot this, redo.

Yeah, when you mentioned the terabytes thing — I was working with somebody a while back who wasn't using the disk space they already had. Like, let's say they had a one-terabyte disk and only a couple of hundred gigabytes of data, but they expanded their disks to a few terabytes so that they would get more provisioned IOPS, because that was the way to do it. So is that what you're talking about — you need a certain size? Yeah, so the rule for throttling is so multi-factor: you need to read a lot of docs, and with GCP and
AWS, I have pages which I read many, many times per year, carefully, trying to remember: oh, this rule I forgot again. Why isn't this automated? Someone might say, okay, these limits depend on block sizes. Okay, but if it's RDS, the block size is already chosen: Postgres uses 8 kilobytes; if it's ext4, it's 4 kilobytes there. Everything is already defined, so we can talk about limits for throughput quite precisely, right? So yeah, this is, I think, a lack of automation. Also, as you mentioned, the number of vCPUs — they have all the settings, right? They have all the knowledge, and they define these rules. Yeah. So give me this usage level and an understanding of how far from saturation I am, because it's so important. No — in reality, we wait until that plateau I mentioned, and only then do we go and do something about it and raise the bar. There should even be alerts: your database is spending at 80-plus percent of its disk I/O capacity — prepare to upgrade, you know, add more.

Yeah, well, I was going to say, sometimes there are perverse incentives here, where they're not incentivized to help you improve your performance so that you upgrade. But in this case the incentives should be aligned: if they let you know earlier that upgrading might help prevent an issue, you're going to be paying more up front. So the incentives are aligned. Yeah. At the same time,
these complaints we are currently expressing — they all remind me of the complaints of a guy sitting on an airplane saying there is no legroom and so on, while flying 30,000 feet above the ground — it's magic, right? So these EBS volumes, PD-SSD, other newer disks on GCP, or NVMes — they are great. I mean, snapshots, elasticity of everything — it's great, right? We just want even more.

It's good that you're being positive about them, but I feel like I hear quite a lot of people saying that one of the cases still for self-hosting is better disks. Actually, I think a lot of the time with the cloud you're paying for hardware that might be a bit on the older side, and you have no control over that. So I'm interested in your take on that, as somebody who's historically been, you know, pro-self-managing, or some hybrid version. So, you know, I love clones and snapshots. That's why, actually, EBS volumes and what RDS has — even if there's lazy load involved, and when we restore from a snapshot it's actually getting data from S3 — it still feels like magic, and it's great, very good for reproducing incidents and so on. And the snapshots are cheap because they are stored in S3. At GCP it's the same — although there is lazy load there as well, and their documentation still doesn't admit it — but just looking at the price, we understand that snapshots of Google Cloud disks are stored in GCS, the S3 analog. It's great.

But also, if we think about a cluster of three nodes, or four, five, six, up to ten nodes and more — some people have more — the database is basically copied to all replicas, and on replicas it's stored on disk, and this becomes more and more expensive over time, right? So it can be significant; it can sometimes be even more than compute. That's the point. If we have a large database but the working set is not that large, we can have much smaller memory and thus a much smaller, not big, compute instance. We had these cases: for example, a lot of time-series data, so we have a much bigger disk than you would expect, and then all replicas need to have the same disk, and this disk, if it's an EBS volume,
becomes expensive, very expensive, and contributes to costs so much. So then you think: why not use local disks? Well, we used local disks for benchmarks — it was an i3 instance, years ago, seven years ago maybe — and I started liking them because they're always included in the price of the EC2 instance, right? And it's super fast: basically one order of magnitude faster in terms of IOPS — it can give you a million IOPS these days — and throughput of three gigabytes per second. Well, and the resiliency — if you've already got replicas provisioned for failovers, you don't need the resiliency the cloud disks give you. The point is they're ephemeral: if a restart happens, you might lose this data. But if a restart happens, we have replicas. Yes, that's what I mean — so that doesn't actually matter.

In fact, this reminds me a lot of the PlanetScale stuff — PlanetScale for Postgres, I think they call it Metal. They've got two products, but the Metal one has the local disks, and this is a lot of the same thing. You can have local ephemeral NVMe on virtual machines too, of course, at smaller sizes than Metal. Yeah, yeah. Sorry, all I meant was that a lot of their publicity, a lot of their blog posts and things, are relevant to this discussion. You don't have to use their services, and you could also do it at a much smaller scale. Yeah, and it's such a big cost saving, and it brings so much more disk I/O capacity — amazing. Yeah. And latency reduction, right? Because the systems are just closer together. Yeah, yeah. So it can handle OLTP workloads much better. There are two caveats: the ephemeral property, and also limits in terms of disk space — we didn't touch the disk space topic yet. Yeah. We have a whole separate episode on that, but we should still touch on it.

Right. And on AWS, I like local disks much more because they are usually bigger — each disk is bigger, and the aggregated disk volume is also bigger. On GCP, I think local disks are somehow still only 375 gigabytes each, which looks old, but you can stack a lot of them — I think up to 72 terabytes, or however many. Yeah, quite a lot.
But in this case you need to go with, maybe, metal — like the maximum, take a whole machine, basically, right? But it's possible — and then this 72 terabytes will be your hard limit, hard stop. And it's not that bad; most people will be fine. Yeah, it's okay, I mean, to have this limit, but it is a hard limit. Yeah, the hard limits are the interesting thing. So you're saying, let's say we start on small machines and they only have a set amount, and we suddenly realize we're at 80 or 90 percent capacity, right? But at the same time, an EBS volume has a limit of 64 terabytes, and PD-SSD on GCP has the same limit, 64 terabytes, and RDS and Google Cloud SQL also have hard stops at 64 terabytes. Aurora has 128, double that size. And that's it, right? So these are hard stops. And I think in 2025 this is not a lot of data anymore — 50, 100 terabytes, we had an episode about it — it's already achievable for bigger startups. So RDS — I think they should solve it soon, and I think Google Cloud SQL should solve it soon too, but to my knowledge they haven't solved it yet. So if you approach this, it's a hard stop, and basically you need to go to self-managed, maybe, right? And there you can combine multiple EBS volumes. Most that we've talked to that do this shard at that point. This is different. Yeah, that's why I think for PlanetScale it's easier to choose local disks and deal with those hard limits on size as well, because if there is rebalancing — zero-downtime rebalancing — you can just make sure no shard will reach that limit. That's it. It's good. Yeah, they have that for MySQL, but they don't have that for Postgres. Well, not yet — they're building it. They announced it, right? Yeah, well, they announced building it. I think lots of people are announcing building sharding at the moment. Well, I see Multigres already has some code — I even commented in a couple of places, proposing some improvements. Yeah, well, I know they all have some code, right? Like, PgDog's got some code. It's not— PgDog you can already test, yeah. I think Multigres also will have some at some point. All I mean is that you can shard in other ways, right, without these solutions — like Notion talked about doing it, Figma has done it. So-called application-side sharding, as I...
Yeah, but they did it without leaving RDS in those cases. So it is interesting. But I thought you were going to go in a different direction here — I thought it was more about the practicalities of expanding. So let's say you're not at the dozens-of-terabytes limit, whatever your provider has; let's say you're at one terabyte and you just want to expand to two terabytes. That's often really easy: you can do it with a few clicks of a button, without any downtime, with a lot of providers — whereas if you've got local disks, is it a bit more complicated? Yeah, you know what, I think these days RDS also provides options with local NVMes. Wow. Okay. Yeah — I'm double-checking right now — for example the x2idn instances. Yeah, and I think they have local NVMes, several terabytes, up to, I think, actually not many — 4 terabytes. Interesting. So there might be a hybrid approach, where you have EBS volumes and you use local NVMe as a caching layer for both reads and writes. But then what would you do — would you set up some replicas with larger disks and then fail over? Like, how are you managing a migration to larger local disks when you hit 64 terabytes?
Well, no — let's see, you've started with local disks that are smaller. Ah, with local disks — yeah, I think you do a switchover approach, of course. Yeah, so you need a different instance with bigger capacity in terms of disk space. Of course, here again, the elasticity and automation that cloud providers have for network-attached disks is great, but let's also criticize it: EBS volumes have auto-scaling, but only in one direction. For example, if we re-shard, we need to reprovision and then switch over. Or say we didn't have vacuum tuning in place, or we screwed up with long-running transactions or abandoned logical slots, so we accumulated a lot of bloat — say 80% bloat. Okay, we reindexed, repacked, and now we're sitting with a lot of free disk space we won't need during the next year. Why should we pay for it, right? And shrinking is not automated. But, of course, you can provision a new replica with a smaller disk and then switch over. And when I think about switchover, you know, I decided to force myself — and my team as well — to have a mind shift toward self-driving Postgres. We talked about it. And when I think about this particular case: we eliminated a lot of bloat, we want the disk to be smaller, we need to switch over — but a switchover is also a maintenance window. Yeah. What's the shift there? What did you use to think — what was the mindset change that you had? So I think for operations like adding disk space, removing disk space when it's not needed, getting rid of bloat and so on, automation must be much higher. It should be just an approval — from a DBA, or some senior backend engineer, or a CTO if it's a small startup — just an approval: yeah, we need to shrink disk space, we don't want to pay for all those terabytes. And the automation should be very high: repacking, and then, without downtime, we have a smaller disk. But to achieve this right now there are so many moving parts. For example, you can provision a node with a smaller disk — it can be local, it can be an EBS volume, doesn't matter — but then you need to switch over. Without downtime, you need a PgBouncer or PgDog layer with pause/resume support, and then it has to be orchestrated properly. RDS Proxy, for example, doesn't support pause/resume, so you must have some small downtime. Yeah. And usually people say, oh, it's just 30 seconds. Well, I disagree. Why should we lose anything? This is just a routine operation — why should we show errors to customers? Let's raise the bar and have pure zero-downtime everything.
And auto-scaling — well, maybe not auto-scaling; auto-scaling means it makes the decision itself, which is too much. Let's step back: I can make the decision myself, but I want full automation, right? And we don't have it. We have it for increasing disk space, which is good for EBS volumes — no switchover needed, so it's zero-downtime. You can say "add one terabyte"; this is what people do all the time. And I think there is a checkbox so that auto-scaling in RDS can decide to add more disk space itself, right? Which is good. Yeah, like if you get within 10%, for example — but yeah, only up, as you said. Yeah, at least we avoid downtime. I also saw in some places people use a trick: put a file of some gigabytes filled with zeros on the disk, so if we run out of disk space, you can delete it. You can delete the file. Oh, no. Yeah, just something sitting there — we can invent some funny name for this approach — but it's just for emergencies. It's like reserved connections with max_connections, a few connections reserved for admin. So: reserved disk space you can quickly delete to buy yourself some time to increase disk space.
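A sketch of that trick — the path and size are arbitrary examples; put the file on the same filesystem as the data directory you're protecting:

```bash
# Pre-allocate an emergency "ballast" file. fallocate reserves the space instantly;
# dd if=/dev/zero ... would literally fill it with zeros, as described in the episode.
fallocate -l 10G /var/lib/postgresql/ballast_delete_me_in_emergency

# When the disk fills up and Postgres is about to stop, free the space in one command
# and use the time you just bought to grow the volume properly:
# rm /var/lib/postgresql/ballast_delete_me_in_emergency
```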
Yeah. On the disk space thing, the only thing I think people sometimes get caught out by is having alerts early enough, because sometimes you need quite a lot of spare disk space in order to free disk space. To do a repack, for example, you need at least the size of the table you're repacking free in order to do the operation. With indexes. Yes. So either start with your smallest ones, which is not going to make the most difference, or try and set that alert quite early. Yeah. But yeah — is there anything else you wanted to make sure we talked about?

Yeah, well, I think it's a good idea to understand some numbers, right? We didn't talk about latencies — what latency is normal? The very old rule was: you look at monitoring, and if it's an SSD — it can be an EBS volume, and the best volumes are usually NVMe these days with most modern instance families — the very rough old rule was one millisecond. I already have a feeling, like the discussion after a previous episode where I shared some old rule and someone disagreed with it, that these rules might be outdated by now. So if the old rule was one millisecond, these days maybe we should go lower, right — half a millisecond. If it's a local disk, it should be even lower. Yeah. This is the point where we think it's okay. If it's more — well, back in those days we thought up to 5 to 10 milliseconds was okay, but these days that's not okay: 10 milliseconds is definitely slow for SSDs, and for NVMe in particular. So this is the latency at which you should start worrying, right? And basically, in monitoring, we should track usage, saturation risks, and latency as well — this is like the USE method or the four golden signals, right? So we track these things, and also errors, and we check them and understand where we are right now and whether we should start worrying already. Yeah, simple. It's actually simple.
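If track_io_timing is on, one rough way to get an average read latency per database from Postgres itself — keeping in mind the caveat above that OS page-cache hits are included:

```bash
# blk_read_time is cumulative milliseconds spent reading blocks into shared buffers;
# dividing by blks_read gives a rough average per 8 KiB block read.
psql <<'SQL'
SELECT datname,
       round((blk_read_time / nullif(blks_read, 0))::numeric, 3) AS avg_read_ms
FROM pg_stat_database
WHERE blks_read > 0
ORDER BY avg_read_ms DESC;
SQL
```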
And my recommendation is also to know your theoretical limits based on the docs — as I said, it's not trivial. But another recommendation: if you use some particular setup in the cloud, always test it to understand the actual limits, and if they don't match the theoretical, advertised limits, you should understand why. And testing is easy. I usually prefer fio, a simple program, and I like the snippets GCP provides — if you just search for "SSD disk GCP performance", you will see a bunch of snippets. The only warning: I managed to destroy things several times — I destroyed PGDATA — because some of those snippets use direct I/O. It was always non-production, but still, I made mistakes. If you test your disk capabilities with fio using direct I/O and you point it at the volume that is used for PGDATA, forget about your PGDATA. And this is a good way to get silent corruption as well, because Postgres might even keep working for some time, until it touches the areas you wrote to.
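A minimal fio invocation in that spirit — the paths and sizes are examples, and, per the warning above, the test file must live on a scratch volume, never the PGDATA one:

```bash
# 8 KiB random reads with direct I/O (bypassing the page cache), similar block size to Postgres.
fio --name=pg-randread --filename=/mnt/scratch/fio-testfile --size=8G \
    --rw=randread --bs=8k --direct=1 --ioengine=libaio --iodepth=64 \
    --numjobs=4 --runtime=60 --time_based --group_reporting
# Repeat with --rw=randwrite (and different --bs / --iodepth) to map out IOPS and
# throughput ceilings, then compare against the provider's advertised limits.
```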
Yeah, so those are practical pieces of advice. Given pgbench stress — we've talked in the past about stress testing with pgbench — is that actually a benefit in this case? Could we use it, since what we're going to do is essentially a stress test at this point? Right, but pgbench tests everything, including Postgres. In our methodology, let's split everything into pieces and study them separately where possible. So disk I/O should be understood separately from Postgres. Many times, by the way, we started with "oh, let's pgbench" — but we're talking about disks here, so let's forget about Postgres completely for now. Right, so try and isolate. Not completely, actually — we usually keep in mind that pages are 8 kilobytes. Yeah. Well, I was thinking about managed providers — how would you test on RDS, what the...? That's a tricky question, right? That's a tricky question. I think pgbench would be a good solution there. pgbench, yes, but you can also try to guess which instance it is — well, the instance is easy to guess — but which disks are there, what IOPS and so on, and then you can provision the same EC2 instance and the disk you guessed. But again, as I said, one day I discovered they use RAID, so there's a stripe there, and if you try to do the same you'll probably end up with a different setup. That's an issue. Also, I know Cloud SQL has, for bigger customers — Enterprise Plus or something, I don't remember — caching with local NVMes. Yes, yes. It's good, but it's already tricky to reproduce and test. Yeah. Right, so I think it's tricky to test disks for RDS — but that's yet another reason to think about who controls your database, and why you cannot connect to your own database using SSH and see what's happening under the hood. Yeah, probably a good place to end it. Yeah, let's do it. Alright, nice one, Nikolay, thanks so much. Thank you, see you next week. Bye-bye.
