Postgres FM - Disks
Episode Date: August 29, 2025

Nik and Michael discuss disks in relation to Postgres — why they matter, how saturation can happen, some modern nuances, and how to prepare to avoid issues.

Here are some links to things they mentioned:

Nik's tweet demonstrating a NOTIFY hot spot https://x.com/samokhvalov/status/1959468091035009245
Postgres LISTEN/NOTIFY does not scale (blog post by Recall.ai) https://www.recall.ai/blog/postgres-listen-notify-does-not-scale
track_io_timing https://www.postgresql.org/docs/current/runtime-config-statistics.html#GUC-TRACK-IO-TIMING
pg_test_timing https://www.postgresql.org/docs/current/pgtesttiming.html
PlanetScale for Postgres https://planetscale.com/blog/planetscale-for-postgres
Out of disk episode https://postgres.fm/episodes/out-of-disk
100TB episode https://postgres.fm/episodes/to-100tb-and-beyond
Latency Numbers Every Programmer Should Know https://gist.github.com/jboner/2841832
Fio https://github.com/axboe/fio

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard.
I'm joined as usual by Nik, founder of Postgres.AI.
Hey, Nik.
Hi, Michael. How are you?
I am good. How are you?
Very good.
Great. And what are we talking about this week?
Discs.
If you imagine the regular database icon, or, how to say it, picture how we usually visualize a database on various diagrams, it consists of disks, right?
Yeah, like three.
Like I'm thinking of cylinder.
Sometimes a cylinder, yeah, with like normally three layers.
Yeah, three or four.
And obviously databases and disks, they are close to each other, right?
But my first question, why do we keep calling them disks?
Hmm.
Like outdated term, you mean?
Yeah, obviously.
I don't know.
What does the D in SSD stand for?
Yeah, actually, sometimes we call them logical volumes, storage volumes, something like this. And in the cloud context, especially EBS volumes, right? We talk about them like that. But in all cases it's still acceptable to say disks. But disks don't look like disks anymore, right? They are rectangular, microchips instead of rotational devices.
Yeah, makes sense.
In most cases, not in all cases. Rotational devices can still be seen in the world, but not often if we talk about OLTP databases, because it's not okay to use rotational devices if you want good latency. But yeah, so, disks. Because databases require good disks and they depend on them heavily. In most cases, not in all. Sometimes it's fully cached, so we don't care.
If it's cached, right.
Yeah, I was going to ask you about that, because I think even in the fully cached state, if we've got a lot of writes, for example, we might still want really good disks. There are things where we're still writing out to disk, and we want that to be fast, not just reading from it.
But we are not writing to disk. If we move to the Postgres context, we don't write to the disk except WAL, right?
Yes.
Yeah, and that's it.
Well, yeah, I agree it can be expensive if a lot of data is written.
So, yeah, you're right, because we need to write our tuples, and if it's a full-page write after a checkpoint, we need to write the whole page, an eight-kilobyte page.
Yes. And we need an fsync before the commit is finalized, so definitely it goes to disk. But the data in terms of tables and indexes is written only to memory, and it's normally left for the checkpointer to later write it first to the page cache, and then the page cache can use pdflush or something to write it further to disk.
But, yeah, in terms of fsync, right, latency is important.
It affects commit time.
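As a rough way to see how much of commit time goes into flushing WAL, here is a minimal sketch (Python with psycopg2, placeholder DSN) querying pg_stat_wal, which tracks WAL sync counts and, if track_wal_io_timing is on, sync times, on Postgres 14 to 17; on newer versions some of these counters may have moved elsewhere, so adjust accordingly.

```python
# Rough check of average WAL fsync time, which directly bounds commit latency.
# Requires track_wal_io_timing = on, otherwise wal_sync_time stays at zero.
import psycopg2

conn = psycopg2.connect("dbname=postgres")  # placeholder DSN; adjust for your setup
with conn, conn.cursor() as cur:
    cur.execute("""
        select wal_sync,                                                  -- number of WAL fsync calls
               wal_sync_time,                                             -- total time spent syncing, in ms
               round((wal_sync_time / nullif(wal_sync, 0))::numeric, 3)   -- average ms per fsync
        from pg_stat_wal
    """)
    wal_sync, total_ms, avg_ms = cur.fetchone()
    print(f"fsync calls: {wal_sync}, total: {total_ms} ms, average: {avg_ms} ms per sync")
conn.close()
```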
By the way, I just had a case.
It's slightly off-topic, but I published a tweet and LinkedIn post about LISTEN/NOTIFY.
I added them to the list of deprecated stuff.
It's not deprecated, right?
But you're saying you recommend not using it at scale?
Yeah, well, if...
Or possibly at all.
Yes. My Postgres vision deviates from the official vision in some cases.
For example, the official documentation says don't set statement_timeout globally, because blah, blah, blah.
And I don't agree with this.
In OLTP, it's a good idea to set it globally to some value and override it locally when needed.
And here, this NOTIFY, I just see, like, we should just abandon it completely until it's fully redesigned, because there is a global lock. And one of our customers, Recall.ai, published a great post about this because they had outages. And it's related to the topic we discussed in an interesting way.
To reproduce it, I used a bigger machine, and the issue is with NOTIFY at commit time: it takes a global lock to serialize NOTIFY events. A global lock, like on the database, an exclusive lock. Insane. And if the commit is fast, everything is fine. But if in the same transaction you write something, the commit writes WAL, it waits a little bit, right? In this case, contention starts because of that lock.
So if you have a lot of commits which are writing something to WAL, meaning they need fsync and they need to wait on disk, if the disk is slow and you use NOTIFY, this doesn't scale.
Performance will be terrible very soon. At some concurrency level you will have issues, and you will see commits spanning many milliseconds, then dozens of milliseconds, then up to seconds, and eventually the system will be down. Anyway, this is related to slow disks. You are right: if write latency is bad, we might have issues.
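A minimal sketch of reproducing the effect Nik describes, using Python and psycopg2 against a disposable database (the table and channel names are made up): many sessions that both write WAL and NOTIFY in the same transaction serialize at commit time, so average commit latency with NOTIFY should climb well above the plain case once the disk makes commits non-trivial.

```python
# Compare commit latency with and without NOTIFY under concurrency.
# Run only against a throwaway database; names below are hypothetical.
import threading, time, statistics
import psycopg2

N_THREADS, N_COMMITS = 16, 200
latencies, lock = [], threading.Lock()

def worker(use_notify: bool):
    conn = psycopg2.connect("dbname=postgres")      # placeholder DSN
    with conn.cursor() as cur:
        for _ in range(N_COMMITS):
            cur.execute("insert into notify_bench(payload) values ('x')")
            if use_notify:
                cur.execute("notify bench_channel, 'x'")
            t0 = time.perf_counter()
            conn.commit()                            # WAL fsync (+ global NOTIFY lock if used)
            with lock:
                latencies.append(time.perf_counter() - t0)
    conn.close()

def run(use_notify: bool) -> float:
    latencies.clear()
    threads = [threading.Thread(target=worker, args=(use_notify,)) for _ in range(N_THREADS)]
    for t in threads: t.start()
    for t in threads: t.join()
    return statistics.mean(latencies) * 1000         # average commit time, ms

setup = psycopg2.connect("dbname=postgres")
setup.autocommit = True
setup.cursor().execute("create table if not exists notify_bench(payload text)")
print(f"plain commits:  {run(False):.2f} ms avg")
print(f"NOTIFY commits: {run(True):.2f} ms avg")
```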
Yeah, but you're right too that the majority of the time we care about the quality of our disks, it's when our data isn't fully in memory and we're worrying about reading things either from disk or even from the operating system. It's hard to tell from Postgres sometimes where it's coming from. But we have a...
It's impossible to tell in Postgres unless you have the pg_stat_kcache extension. That's why, since BUFFERS is already default in Postgres 18, again, I advertise to all people who develop systems with Postgres: if possible, include the extensions pg_wait_sampling and pg_stat_kcache.
And pg_stat_kcache can show it.
Yeah.
Yeah.
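For reference, a hedged sketch of the kind of thing pg_stat_kcache can show, combining it with pg_stat_statements to separate true physical reads from page cache reads; column names differ between extension versions (older releases expose reads/writes rather than exec_reads/exec_writes), so treat this as a starting point rather than a copy-paste query.

```python
# Which queries actually hit the disk (per pg_stat_kcache) versus merely
# reading outside shared_buffers (per pg_stat_statements)?
import psycopg2

SQL = """
select s.queryid,
       left(s.query, 60)      as query,
       s.shared_blks_read     as blocks_read_from_os_or_disk,
       k.exec_reads / 8192    as blocks_read_from_disk   -- exec_reads is in bytes in recent versions
from pg_stat_statements s
join pg_stat_kcache() k using (queryid, dbid, userid)
order by k.exec_reads desc
limit 10
"""

with psycopg2.connect("dbname=postgres") as conn, conn.cursor() as cur:  # placeholder DSN
    cur.execute(SQL)
    for row in cur.fetchall():
        print(row)
```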
I think you're right that it's impossible to be certain without those. But, for example, with I/O timings, which is another thing that people might want to consider having on, obviously with a bit of overhead...
track_io_timing, you mean?
track_io_timing gives you an indication. Like, if you're seeing not too many reads from either the disk or the operating system, and the I/O timings are bad, you've got a clue that it's coming from disk.
Yeah, indirectly we can guess that this time was spent there. And "not too many" is a good point, because sometimes it's fully cached in the page cache, we see reads, and since there are so many of them, the I/O timing is spent reading from the page cache to the buffer pool and the disk is not involved, if volumes are huge. But if volumes are not huge and still significant time is spent, very likely it's from disk.
Exactly, yeah. This is something we added to our product just as a tip, but it doesn't come up that often, because it relies on track_io_timing and most people don't have that on. So we actually just used the buffers, like shared read, and then the timing, the total time of the operation.
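A sketch of that heuristic, assuming pg_stat_statements is installed and using illustrative thresholds only: divide total execution time by shared_blks_read and judge whether the per-block cost looks like page cache or like a network-attached disk.

```python
# Without track_io_timing, combine shared_blks_read with total execution time
# from pg_stat_statements and guess whether reads came from page cache or disk.
import psycopg2

SQL = """
select queryid,
       left(query, 60)                                as query,
       shared_blks_read,
       total_exec_time,                               -- ms (PG 13+; total_time in older versions)
       total_exec_time / nullif(shared_blks_read, 0)  as ms_per_block_read
from pg_stat_statements
where shared_blks_read > 0
order by total_exec_time desc
limit 10
"""

with psycopg2.connect("dbname=postgres") as conn, conn.cursor() as cur:  # placeholder DSN
    cur.execute(SQL)
    for qid, query, blks, total_ms, ms_per_blk in cur.fetchall():
        # Well under ~0.01 ms per 8 kB block smells like page cache; approaching
        # ~1 ms per block smells like a network-attached SSD. Rough guide only.
        print(f"{qid}  {ms_per_blk:.4f} ms/block over {blks} blocks  {query}")
```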
By default, it's not on.
Yeah.
In big systems, we have it on. Like, I never saw big problems on modern hardware, at least Intel and ARM, Graviton 2 on Amazon. I just see it's working well. There is a utility you can use to check your infrastructure and understand if it's worth enabling, but my default recommendation is to enable it. Of course, there might be an observer effect, and it can be double-checked if you want to be serious with this change, but I just see we enable it.
Yeah, it's all to do with the performance of the system clock checks. And I think, for example, the setups I've seen with really bad performance there are dev systems running Postgres inside Docker and things like that, which still have really slow system clock lookups. But most people aren't doing that with production Postgres databases. And I haven't seen any of the cloud providers have slow clocks. I think it's pg_test_timing or something like that, to double-check? Yeah, you can run it really easily.
But what if it's managed Postgres? You cannot run it there. In this case, you need to understand what type of instance is behind that managed Postgres instance. Take the same instance in the cloud. For example, if it's RDS, from the RDS instance name you can easily understand which EC2 instance it is, right?
Yeah.
You can install it there. Well, the operating system matters also, right?
There are some, yeah, yeah. There are some tricks you can do, like do things that would call the system clock a lot, like nested loop type things, or counts, like aggregation, things like that, trying to get lots of loops.
Ah, you're talking about testing at a higher level, at the Postgres level.
Yeah.
Oh, that's a good idea. Yeah, a lot of nested loops. And you test with this, without this, completely, running it like 100 times, taking the average, for example, and comparing averages, and from this you can guess. Yeah, it's a good test, by the way.
I think the first time I saw that was from Lukas Fittl. I think he must have done a Five minutes of Postgres episode on this kind of thing, so I'll link that up.
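A sketch of that comparison, with an arbitrary query and run count: EXPLAIN (ANALYZE, TIMING ON) calls the system clock for every row of every plan node, so the delta versus TIMING OFF gives a feel for the same clock overhead that pg_test_timing measures at the OS level; if it is tiny, enabling track_io_timing is very unlikely to hurt.

```python
# Estimate clock-call overhead at the Postgres level by comparing
# EXPLAIN (ANALYZE, TIMING ON) with TIMING OFF over many runs.
import statistics
import psycopg2

RUNS = 100

def avg_exec_ms(cur, timing: str) -> float:
    query = f"explain (analyze, timing {timing}) select count(*) from generate_series(1, 100000)"
    samples = []
    for _ in range(RUNS):
        cur.execute(query)
        plan_lines = cur.fetchall()
        samples.append(float(plan_lines[-1][0].split()[2]))  # last line: 'Execution Time: 1.234 ms'
    return statistics.mean(samples)

with psycopg2.connect("dbname=postgres") as conn:   # placeholder DSN
    with conn.cursor() as cur:
        off = avg_exec_ms(cur, "off")
        on = avg_exec_ms(cur, "on")
        print(f"timing off: {off:.2f} ms   timing on: {on:.2f} ms   clock overhead: {on - off:+.2f} ms")
```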
Yeah, I'm glad we touched on this because, again, our default recommendation is to have it enabled. It's super helpful in pg_stat_statements analysis and in EXPLAIN ANALYZE plans. And, yeah, track_io_timing, if possible, should be enabled.
And this is related to disks directly, of course.
Yeah.
Although, strictly speaking, it's not timing of disks. It's timing of reading from the page cache to the buffer pool. So it might include pure memory timing as well.
That's why, yes, it does. Yeah, yeah.
That's why your comment about large or not large volumes is important. But honestly, if you are a backend engineer, for example, listening to this episode, I can easily imagine that in one month you will forget about this nuance and will think about track_io_timing as only about disks, right? And it's okay, because it's really a super narrow topic to memorize.
I guess this is moving the topic on a tiny bit, but if you're on a managed Postgres setup, which a lot of backend engineers working with Postgres are, you don't have control over the disks. You're probably not going to migrate provider just for quality of disks. Maybe you would, but it would have to be really bad, and you'd have to be in a setup that really was hammering them, maybe a super write-heavy workload, or huge data volumes that you can't afford to have enough memory for. You know, those kinds of edge cases where you're really hammering things.
Well, there are two big areas where things can be bad. Bad means saturation, right?
Yeah, yeah.
We can saturate disk space, so to speak, run out of disk space. And we can saturate disk I/O. Both happen quite often. And managed Postgres providers are not all equal, and clouds are not all equal. They manage disk capacities quite differently. For example, at Google, at GCP, I know regular PD-SSD, quite old stuff. They have a maximum of 1,200 mebibytes per second, separately for reads and separately for writes, speaking of throughput, and they have 100,000 or 120,000 IOPS maximum, right? And I know from past discussions with Google engineers that the actual capacity is bigger, but it was not sustainable, so it was not guaranteed all the time. They could raise the bar, but it would not be guaranteed, so they decided to choose the guaranteed bar for us.
Makes sense, yeah.
But basically we're not using the full possible capacity. We could use more, right? But we cannot, so they throttle it.
Okay, interesting.
Artificially, to have guaranteed capacity for this disk I/O.
Interesting.
I guess the subtlety I was missing was not when you're at the maximum. So, in between tiers, imagine you're in a much smaller setup. I see a lot of people just upgrading to the next level up within that cloud provider to get more IOPS. You know, if you're on Aurora, just scaling up a little bit instead of switching from Aurora to Google Cloud. But you're right, when you're at the last, or second to last, level is when people start to worry, isn't it? When you're at the last level, you can't just scale up on that cloud provider anymore. So yeah, really good point.
And also
at Google, for example, let's say... I know these rules there are artificial. So this throttling, what I just told you, it also can be throttled additionally if we have not many CPUs, vCPUs. So the maximum possible throughput is achieved only if you have 32 vCPUs or more, I remember. If it's less... Also, it can depend on the family, I think, the instance family. So, so complex rules.
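To illustrate why Nik wants providers to "draw the line", here is a toy calculation of an effective IOPS ceiling as the minimum of several caps; all the constants are placeholders loosely based on the figures mentioned in this episode, not any provider's current documentation.

```python
# Toy "draw the line" calculation: the effective ceiling is the minimum of
# several caps. All constants below are illustrative assumptions.
def effective_iops_limit(disk_size_gb: int, vcpus: int) -> int:
    per_gb_iops = 30                                  # assumed: IOPS scale with provisioned size
    size_based = disk_size_gb * per_gb_iops
    vcpu_based = 100_000 if vcpus >= 32 else 15_000 * max(vcpus // 4, 1)  # assumed vCPU tiers
    per_disk_max = 100_000                            # "100,000 or 120,000" per the episode
    return min(size_based, vcpu_based, per_disk_max)

for size_gb, vcpus in [(500, 8), (1_000, 16), (3_000, 32), (10_000, 64)]:
    print(f"{size_gb:>6} GB, {vcpus:>2} vCPUs -> ~{effective_iops_limit(size_gb, vcpus):,} IOPS ceiling")
```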
On Amazon, AWS, EBS volumes, okay, there is gp2, gp3, io1, you choose between them, and also there is provisioned IOPS. Really complex, right?
And you haven't even mentioned burst IOPS, yeah.
Yeah, yeah. So hitting the IOPS limit is really easy, actually. If you...
When do you...
Yeah, well, the times I see people hitting it is when they're doing a massive migration.
No, no, that's really...
Okay, when do you...
Just, like, just growing. Yeah, the project just grows, and then latency, database latency, becomes worse. Why? We check and we see...
Well, if you have experience, capabilities, like, to look at graphs, you can easily identify some plateau. It's not a full, not an ideal plateau, usually there are some small spikes, but you feel: oh, we are hitting the ceiling here. You're checking disk I/O...
A performance cliff?
It's not a cliff, no, it's a wall instead of a cliff. A cliff, it's when... This is an important distinction. A cliff is when everything was okay, okay, okay, and then suddenly slightly more load or something and you're completely down, or down drastically, 50-plus percent.
Okay, yeah.
Here we have a wall, and everything is okay, okay, okay, and then slightly not okay, slightly not okay, you know, and then more is coming and we start queuing processing, right, accumulating active processes. So in a performance cliff, if you raise load slowly, there is an acute drop in capability to process the workload. In the case of hitting the ceiling in terms of saturation of disk I/O or CPU, it's different. You grow your load slowly and then, as you grow further, things become worse, worse, worse. It's not acute. It's slightly more, more, more, and things become very bad only if you grow a lot further, right? So it's not an acute drop. It's like hitting a wall. It feels like hitting a wall.
You know, like, imagine many lines in a store. For example, we have several cashiers, eight, for example. And then normally lines should be one or two people, zero or one people only. This is ideal throughput, everything is good. We haven't saturated them. Once we saturate, we see lines are accumulating. And latency, meaning how much we spend to process each customer, starts to grow, but it doesn't grow acutely, boom, no. A performance cliff is, for example, if we talk about cash only, no cards involved, and suddenly we only have remains of cash for change in all lines, right? And the cashiers suddenly, all of them, say, okay, do you have change? I have change. Okay, we're processing. And then suddenly we're out of cash to give change. This is an acute performance cliff. They say, okay, we cannot work anymore. Boom. Right? We need to wait until someone goes somewhere, and this is like 15 minutes of waiting, basically. This is an important distinction between a performance cliff and hitting the wall or ceiling.
Okay, I haven't heard that stricter definition before. It sounds to me like you're describing the difference between a blackout and a brownout. Have you heard of a brownout? So a blackout is kind of like your database can't accept writes anymore, or even selects, like, no reads, everything is down. A brownout would be: it's still working, but people are seeing spinning loaders and maybe it loads after 30 seconds, or maybe some people are hitting timeouts and some people aren't, and there's the queuing issue like in the supermarket you talked about. Performance is severely degraded, but it's not completely offline, it's still working at least for some people. So it feels like that's the kind of distinction.
Yeah, and brown can become dark if you keep loading a lot. If saturation happened at some workload level but you give it 10x, of course it will be a blackout, because of context switching and so on, but it's different. For a performance cliff it happens very quickly.
Yeah. I think I'm also biased by the cases that I've seen, which are more acute, because they are bulk loads or backfills, where they are running at a much higher rate than they would normally be. They're consuming IOPS at a much higher rate than they normally would, so they hit it really fast, and it's like running at the wall extremely fast. But I guess if you approach the wall slowly, it's not going to hurt quite as much. Yeah. Okay, I think I understand.
Yeah, back to disks. Definitely we should check disk I/O usage and saturation risks.
So you mean, like, monitor for it and alert when we're close to our limits? Yeah, yeah.
Yeah, and also it might be interesting... For example, I remember, I don't know right now, but many years ago, on RDS, I remember we asked: okay, small system maybe, we need 10,000 IOPS. But we see saturation at 2,500 somehow. Oh, there is RAID actually, we have four disks, and that's like, okay, okay. So there are interesting nuances there. But also, understanding your limits is super important. And I think clouds could do a better job explaining where the limits are. Because right now, you need to do a lot of legwork to figure out what your advertised limit is.
For example, as I said, at GCP, you need to understand how many vCPUs you have. Also the disk size, I forget, like 10 terabytes, I think, is when you achieve the maximum, or one terabyte, my memory fools me a little bit. So you need to take into account many factors to understand: oh, our theoretical limit is this. And then ideally you should test it to see that it can be achieved. Testing is also interesting because, of course, it depends on the block size you're using, and also it depends on whether you're testing through the page cache or direct I/O, right? So, directly writing to the device.
And then you go to the graphs in monitoring and see some disk I/O in terms of IOPS and throughput, separately for reads and writes, and then you think, okay, let's draw a line here. This is our limit. So what I'm saying is, they should draw the line. Clouds should draw the line. They know all these damned rules, right, which are really complex. So this should be automated. This line should be automated: okay, with this, this, this, and this, we give you this. This is your line in terms of the capabilities of your disk. And here you are, okay, at 50%. Right now it's like a whole day of work for someone to understand all the details, double-check them, and then correct mistakes. Even if you know all the nuances, you still return to this topic and, oh, I forgot this, redo.
Yeah, when you mentioned
the terabytes thing, I was working with somebody a while back who weren't using the disk space they already had. Let's say they had a one-terabyte disk and they only had a couple of hundred gigabytes, but they expanded their disks to a few terabytes so that they would get more provisioned IOPS, because that was the way of doing it. So is that what you're talking about, that you need a certain size?
Yeah, so the rule for throttling is so multi-factor, you need to read a lot of docs. With GCP and AWS, I have pages which I read many, many times per year, carefully, trying to remember.
Oh, this rule I forgot again.
Why isn't this automated?
Someone can say, okay, these limits depend on block sizes. Okay, but if it's RDS, the block size is already chosen. Postgres uses 8 kilobytes; if it's ext4, it's 4 kilobytes there. Everything is already defined.
So we can talk about limits for throughput quite well, right?
So yeah, this is like, I think, lack of automation here.
Also, you mentioned the number of vCPUs; I guess they have all the settings, right? They have all the knowledge and they define these rules.
Yeah. So give me this usage level and an understanding of how far from saturation I am, because it's so important. But no, in reality, we wait until that plateau I mentioned, and only then we go and do something about it and raise the bar. There should be alerts even: your database is spending 80-plus percent of its disk I/O capacity. Prepare to upgrade, you know, add more.
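A sketch of such an alert, reading actual IOPS from /proc/diskstats on a self-managed box and warning above 80% of a provisioned limit; the device name and limit are assumptions for your environment.

```python
# Watch real IOPS on the data disk and warn above 80% of the provisioned limit.
import time

DEVICE = "nvme1n1"           # assumption: your data disk's device name
PROVISIONED_IOPS = 16_000    # assumption: whatever your volume is provisioned for
INTERVAL = 10                # seconds between samples

def completed_ios(device: str) -> int:
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[3]) + int(fields[7])   # reads completed + writes completed
    raise ValueError(f"device {device} not found in /proc/diskstats")

prev = completed_ios(DEVICE)
while True:
    time.sleep(INTERVAL)
    cur = completed_ios(DEVICE)
    iops = (cur - prev) / INTERVAL
    prev = cur
    utilization = iops / PROVISIONED_IOPS
    if utilization > 0.8:
        print(f"WARNING: {iops:.0f} IOPS, {utilization:.0%} of the provisioned limit")
```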
Yeah, well, I was going to say,
sometimes there are perverse incentives here
where they're not incentivized to help you improve your performance
so that you upgrade.
But in this case,
the incentives should be aligned.
If they let you know earlier
that upgrading might help prevent an issue,
you're going to be paying more up front.
So the incentives are aligned.
Yeah.
At the same time, these complaints we are currently expressing, they all remind me of the complaints of a guy who is sitting on an airplane and saying that there is no legroom and so on. You're sitting in the air, flying 30,000 feet above the ground, and it's magic, right? So these EBS volumes, PD-SSD, other newer disks on GCP, or NVMes, they are great. I mean, snapshots, elasticity of everything, it's great, right? We just want even more.
It's good that you're being positive about them, but I feel like I hear quite a lot of people saying that one of the cases for self-hosting is still that you get better disks. Actually, I think a lot of the time with the cloud you're paying for hardware that might be a bit on the older side, and you have no control over that. So, yeah, I'm interested in your take on that as somebody who has historically been, you know, pro self-managing, or, you know, some hybrid version.
So, you know, I love clones and snapshots. That's why, actually, EBS volumes and what RDS has, even if there is lazy load involved, and when we restore from a snapshot it's actually getting data from S3, it still feels like magic and great, and it's very good for reproducing incidents and so on. And the snapshots are cheap because they are stored in S3. At GCP it's the same, although there is lazy load there as well, although their documentation still doesn't admit it. But just looking at the price, we understand the snapshots of Google Cloud disks are stored in GCS, their S3 analog. It's great.
But also, if we think about a cluster of three nodes, or four, five, six, up to ten nodes and more, some people have more, the database is basically copied to all replicas, and on replicas it's stored on disk, and this becomes more and more expensive over time, right? So it can be significant. It can be even more than compute sometimes. That's the point. If we have a large database but the working set is not that large, we can have much smaller memory and thus a much smaller compute instance. We had these cases, for example, a lot of time series data, and we have a much bigger disk than you would expect. And then all replicas need to have the same disk, and this disk, if it's an EBS volume, becomes very expensive and contributes to costs so much. So then you think, why not use local disks? Well, we used local disks for benchmarks. It was an i3 instance, years ago, seven years ago maybe. I started liking them because they're always included in the price of the EC2 instance, right? And it's super fast, basically one order of magnitude faster in terms of IOPS. They can give you a million IOPS these days, and throughput of three gigabytes per second.
Well, and the resiliency, like, if you've already got replicas provisioned for failovers, you don't need the resiliency that the cloud...
The point is they're ephemeral. So if a restart happens, you might lose this data. But if a restart happens, we have replicas.
Yes, that's what I mean. So that doesn't actually matter. In fact, this reminds me a lot of the PlanetScale stuff, the PlanetScale Postgres, I think they call it Metal. They've got two products, but the Metal one has the local disks, and this is a lot of the things...
You can have local ephemeral NVMe on virtual machines too, of course at smaller sizes; Metal is the maximum.
Yeah, yeah. Sorry, all I meant was they're doing a lot of their publicity, a lot of their blog posts and things, that are relevant to this discussion. You don't have to use their services, and you could also do it at a much smaller scale.
Yeah, and it's such a big cost saving, and it brings so much more disk I/O capacity. Amazing.
And latency reduction, right?
Like, because the systems are just closer together.
Yeah, yeah, yeah.
So it can handle workloads much better, in terms of OLTP workloads. There are two caveats: the ephemeral property, and also limits in terms of... we didn't touch the disk space topic yet.
Yeah, yeah, yeah.
We have a whole separate episode on that, but yeah, we should still touch on it.
Right.
And on AWS, I like local disks much more because they are usually bigger and so on. Each disk is bigger, and the summarized, aggregated disk volume is also bigger. On GCP, I think, first of all, local disks are somehow still, I think, 375 gigabytes only, which looks old. But you can stack a lot of them, I think up to 70, 72, or how many?
Terabytes.
Yeah, quite a lot. But in this case you need to, like, maybe go with metal, the maximum, take a whole machine, basically, right? But it's possible, and these 72 terabytes will be your hard limit, hard stop.
And it's not that bad.
Most people will be fine.
Yeah, yeah.
It's okay, I mean, to have this limit, but it's a hard limit.
Yeah, the hard limits are the interesting thing. So you're saying, let's say we start on small machines and they only have a set amount, and we suddenly realize we're at 80 or 90 percent capacity, right?
But at the same time, an EBS volume has a limit of 64 terabytes, and PD-SSD on GCP has the same limit, 64 terabytes. And RDS and Google Cloud SQL, they also have hard stops at 64 terabytes. Aurora has 128, double that size. And that's it, right? So these are hard stops. And I think in 2025 this is not a lot of data anymore. 50, 100 terabytes, we had an episode about it, it's already achievable for bigger startups. So RDS, I don't know, I think they should solve it soon. And I think Cloud SQL, Google Cloud SQL, they should solve it soon, but to my knowledge they haven't solved it yet. So if you approach this, it's a hard stop, and basically you need to go to self-managed, maybe, right? And there you can combine multiple EBS volumes.
Most people that we've talked to that do this shard at that point. That's different, though.
Yeah, that's why I think for PlanetScale it's easier to choose local disks and deal with those hard limits in size as well, because if there is rebalancing, if it's zero-downtime rebalancing, you can just make sure no shards will reach that limit. That's it. It's good.
Yeah, they have that for Vitess, for MySQL, but they don't have that for Postgres.
Well, not yet. They're building it. They announced it, right?
Yeah, well, they announced building it.
I think lots of people are announcing building sharding at the moment.
Well, I see Multigres already has some code. I even commented in a couple of places, proposing some improvements.
Yeah, well, I know they all have some code, right? Like, PgDog's got some code.
PgDog you can already test, yeah. I think Multigres also will have some at some point.
All I mean is that you can shard in other ways, right, without these solutions. Like, Notion talked about doing it, Figma have done it. So-called application-side sharding, as I...
Yeah, but they...
And they did it without leaving RDS in those cases. So it is interesting. But I thought you were going to go in a different direction here.
Like, I thought it was more about the practicalities of expanding.
So let's say you're not at the dozens of terabytes limit, whatever your provider has.
Let's say you're at one terabyte and you just want to expand to two terabytes.
That's often really easy.
You know, you can do it with a few clicks of a button without any downtime in a lot of providers, whereas if you've got local disks, is it a bit more complicated?
Yeah, you know what, I think these days RDS also provides options with local NVMes.
Wow. Okay.
Yeah, I'm double-checking right now, it's an instance, for example, x2idn. Yeah, and I think it has local NVMes, several terabytes, up to, I think... not many actually, 4 terabytes.
Interesting.
So there might be a hybrid approach, when you have EBS volumes and you use local NVMe as a caching layer for both reads and writes.
But then what would you do? Would you set up some replicas with larger disks and then fail over to the... Like, how are you managing a migration to larger local disks when you hit 64 terabytes?
Well, no.
Let's say you've started with local disks that are smaller...
Ah, with local disks. Yeah, I think you'd do the switchover approach, of course. Yeah, so you need a different instance with bigger capacity in terms of disk space. Of course, here again, the elasticity and automation of network-attached disks that cloud providers have is great. But let's also criticize it. So, an EBS volume has auto-scaling, but only in one direction. For example, if we re-sharded, right, we need to reprovision and then switch over. Or if we saw we didn't have vacuum tuning in place, or we screwed up in terms of long-running transactions or abandoned logical slots, so we accumulated a lot of bloat, and say we have 80% bloat. Okay, we reindexed, repacked. Now we sit with a lot of free disk space. We don't need it during the next year. Why should we pay for it, right? And shrinking is not automated. But, of course, you can provision a new replica with a smaller disk and then switch over.
And when I think about switchover, you know, I decided to force myself, and my team as well, to have a mind shift to self-driving Postgres. We talked about it. And when I think about this particular case: we eliminated a lot of bloat, we want the disk to be smaller, we need to switch over. But a switchover also, it's a maintenance window.
Yeah. Because what's the shift there? What did you used to think? Say again, what was the mindset change that you had?
So I think for operations like adding disk space, removing disk space when not needed, getting rid of bloat and so on, automation must be much higher.
So it should be, like, just an approval from a DBA or some senior backend engineer, or a CTO if it's a small startup: yeah, we need to shrink disk space, we don't want to pay for all those terabytes. And the automation should be very high: repacking, and then without downtime we have a smaller disk. But to achieve this right now, there are so many moving parts. For example, you can provision a node with a smaller disk, it can be local, it can be an EBS volume, doesn't matter, but then you need to switch over. Without downtime, you need a PgBouncer or PgDog layer with pause/resume support, and then it needs to be orchestrated properly. RDS Proxy, for example, doesn't support pause/resume, so you must have some small downtime.
Yeah.
And usually people say, oh, it's just 30 seconds. Well, I disagree. Why should we lose anything? This is just some routine operation. Why should we show errors to customers? Let's raise the bar and have pure zero downtime for everything. And auto-scaling... it can be auto-scaling, but auto-scaling means it makes the decision itself, which is too much. Let's step back. I can make the decision myself, but I want full automation, right? And we don't have it. We have it for increasing disk space, which is good for EBS volumes. Which is good: we don't need to have a switchover, so it will be zero downtime. You can say, add one terabyte. This is what people do all the time. And I think there is a checkbox for auto-scaling, so RDS can decide to add more disk space itself, right? Which is good.
Yeah, like if you get within 10%, for example. But yeah, only up, as you said.
Yeah, at least we will avoid downtime.
I also saw in some places there's a trick: people put some file, some gigabytes filled with zeros, so if you are out of disk space, you can delete it. You can delete the file.
Oh, no.
Yeah, just something sitting there. We can invent some funny name for this approach. But it's just for emergencies, like reserved connections: for max_connections, three connections are reserved for admins. So, reserved disk space you can quickly delete and buy yourself some time to increase disk space.
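A minimal sketch of that reserved-space trick: keep a ballast file of real zeros near (but not inside) the data directory and delete it in an emergency to buy time; the path and size are assumptions.

```python
# Ballast file: real zeros (not a sparse file), so the space is truly allocated.
import os

BALLAST_PATH = "/var/lib/postgresql/ballast.bin"   # assumption: outside PGDATA, same filesystem
BALLAST_SIZE = 5 * 1024**3                          # 5 GiB
CHUNK = 1024**2

def create_ballast(path: str, size: int) -> None:
    zeros = b"\0" * CHUNK
    with open(path, "wb") as f:
        written = 0
        while written < size:
            f.write(zeros)
            written += CHUNK
        f.flush()
        os.fsync(f.fileno())

def release_ballast(path: str) -> None:
    os.remove(path)   # emergency only: frees space instantly, then grow the disk properly

if not os.path.exists(BALLAST_PATH):
    create_ballast(BALLAST_PATH, BALLAST_SIZE)
```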
Yeah. On the disk space thing, the only thing I think people sometimes get caught out by is having alerts early enough. Sometimes you need quite a lot of spare disk space in order to save disk space. To do a repack, for example, you need at least the size of the table you're repacking free in order to do the operation.
With indexes.
Yes. So, well, either start with your smallest ones, which is not going to make the most difference, or try and set that alert quite early. Yeah. But yeah. Um, is there anything else you wanted to make sure we talked
about? Yeah, well, I think it's a good idea to understand some numbers, right? So, our very old rule was about latencies also. We didn't talk about latencies. What latency is normal? The very old rule was: you look at monitoring, if it's an SSD, it can be an EBS volume, and the EBS volume is usually also NVMe these days with most modern instance families, and you just use a very rough old rule, one millisecond. I know, I already have a feeling we'll have a discussion about a previous episode where I shared some old rule and someone disagreed with it. Yeah, rules might already be outdated. So if it was one millisecond, these days maybe we should go lower, right, half a millisecond. If it's a local disk, it should be even lower, right? Yeah. This is the point where we think it's okay. If it's more... well, back in those days we thought up to 5 to 10 milliseconds was okay, but these days that is not okay. 10 milliseconds is definitely slow these days for SSDs, and NVMes specifically. So this is the latency at which you should start worrying, right?
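One way to sanity-check those thresholds against what Postgres itself sees, assuming track_io_timing is on: pg_stat_database accumulates blks_read and blk_read_time, so their ratio approximates average read latency from Postgres's point of view (page cache hits are included, so it tends to understate true disk latency).

```python
# Average read latency as seen by Postgres (requires track_io_timing = on,
# otherwise blk_read_time stays at zero). Placeholder DSN.
import psycopg2

SQL = """
select datname,
       blks_read,
       blk_read_time,                                 -- ms, cumulative
       blk_read_time / nullif(blks_read, 0) as avg_read_ms
from pg_stat_database
where datname is not null and blks_read > 0
order by blk_read_time desc
"""

with psycopg2.connect("dbname=postgres") as conn, conn.cursor() as cur:
    cur.execute(SQL)
    for datname, blks_read, total_ms, avg_ms in cur.fetchall():
        verdict = "ok" if avg_ms < 0.5 else "worth investigating"   # ~0.5-1 ms rule of thumb from the episode
        print(f"{datname}: {avg_ms:.3f} ms per read ({blks_read} reads) -> {verdict}")
```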
So, basically, in monitoring we should control usage, saturation risks, and latency as well. This is like the regular USE method, or the four golden signals maybe, right? So we control these things, and also errors. Yeah, we check these things and understand where we are right now and whether we should start worrying already.
Yeah. Simple.
It's actually simple. And my recommendation is also to know your theoretical limits based on the docs. As I said, it's not trivial.
But also, a recommendation: if you use some particular setup in the cloud, always test it to understand the actual limits. And if they don't match the theoretical, advertised limits, you should understand why. And testing is easy. I usually prefer fio, a simple program. I like the snippets GCP provides; they have snippets if you just search for SSD disk GCP performance, you will see a bunch of snippets. The only warning: I managed to destroy things several times. I destroyed PGDATA, because, you know, some of those snippets use direct I/O, and, it was always non-production, but still, I made mistakes. If you try to test your disk capabilities with fio using direct I/O, and you use the volume which is used for PGDATA, forget about your PGDATA. And this is a good way, for example, to get silent corruption as well, because Postgres might even work for some time, until you reach the point where it touches the areas you wrote to. Yeah, so these are practical pieces of advice.
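A sketch of driving fio for an 8 kB random-read test in the spirit of those GCP snippets, repeating the warning above: point it at a throwaway file on a scratch mount, never at the device or filesystem holding PGDATA, especially with direct I/O. Paths, sizes, and the JSON field names (which can vary across fio versions) are assumptions.

```python
# 8 kB random-read test with fio, parsed from its JSON output.
import json
import subprocess

TEST_FILE = "/mnt/scratch/fio-test.bin"   # assumption: must NOT be on the PGDATA volume

cmd = [
    "fio",
    "--name=pg-like-randread",
    f"--filename={TEST_FILE}",
    "--rw=randread",
    "--bs=8k",                 # Postgres page size
    "--direct=1",              # bypass the page cache to measure the device itself
    "--iodepth=16",
    "--size=4G",
    "--runtime=60",
    "--time_based",
    "--group_reporting",
    "--output-format=json",
]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
read = json.loads(out)["jobs"][0]["read"]
print(f"IOPS: {read['iops']:.0f}   mean latency: {read['lat_ns']['mean'] / 1e6:.3f} ms")
```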
We've talked in the past about pgbench stress tests. Is pgbench actually a benefit in this case? Could we use it, because what we're going to do is kind of a stress test at this point?
Right, but pgbench tests everything, including Postgres. In our methodology, let's split everything into pieces and study them quite well, if possible, separately. So disk I/O should be understood separately from Postgres. Many times, by the way, we started with, oh, let's pgbench... We talk about disks here. Let's forget about Postgres completely for now.
Right. So try and isolate.
Not completely, actually. We usually keep in mind that pages are 8 kilobytes.
Yeah.
Well, I was thinking, on managed providers, like, how would you test on RDS, what the...
That's a tricky question, right? That's a tricky question.
I think pgbench would be a good solution there.
pgbench, yes, but you can try to guess which instance... Well, the instance is easy to guess, but which disks are there, the IOPS and so on. You can try, and then you can provision the same instance, the EC2 instance and disk you guessed. But again, as I said, one day I discovered they use a RAID, so there's a stripe there, and if you want to do the same, probably you'll have a different setup. That's an issue. Also, with those... like, I know Cloud SQL has it for bigger customers, I don't remember, Enterprise Plus or something, they also have caching with local NVMes.
Yes, yes.
Yeah, it's good, but to reproduce it is already tricky to test, yeah. Right, so, yeah, I think it's tricky how to test disks for RDS. But that's yet another reason to think about who controls your database, yeah, and why you cannot connect to your own database using SSH and see what's happening under the hood.
Yeah. Probably a good place to end it?
Yeah, let's do it.
All right. Nice one, Nikolay. Thanks so much.
Thank you. See you next week. Bye-bye.