Postgres FM - Memory
Episode Date: June 16, 2023

Nikolay and Michael discuss memory in PostgreSQL: how it works, important settings, and how to go about tuning them.

Here are links to a few things we mentioned:

Resource Consumption (PostgreSQL docs) https://www.postgresql.org/docs/current/runtime-config-resource.html
Andres Freund tweet about shared_buffers https://twitter.com/AndresFreundTec/status/1438912583554113537
Henrietta (Hettie) Dombrovskaya https://hdombrovskaya.wordpress.com/about-the-author/
annotated.conf (by Josh Berkus) https://github.com/jberkus/annotated.conf
Our episode about checkpoint tuning https://postgres.fm/episodes/wal-and-checkpoint-tuning
Our episode about BUFFERS https://postgres.fm/episodes/buffers-by-default
Analyzing the Limits of Connection Scalability in Postgres (blog post by Andres Freund) https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/analyzing-the-limits-of-connection-scalability-in-postgres/ba-p/1757266#memory-usage
Tuning memory parameters for Aurora PostgreSQL https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.BestPractices.Tuning-memory-parameters.html
RDS for PostgreSQL memory https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL.Tuning.concepts.html#PostgreSQL.Tuning.concepts.memory
EDB guide by Vik Fearing https://www.enterprisedb.com/postgres-tutorials/introduction-postgresql-performance-tuning-and-optimization#resourceusage
pg_stat_kcache https://github.com/powa-team/pg_stat_kcache
pg_buffercache https://www.postgresql.org/docs/current/pgbuffercache.html
Process and Memory Architecture chapter (by Hironobu Suzuki) https://www.interdb.jp/pg/pgsql02.html
PostgreSQL 14 Internals PDF book by Egor Rogov (pages 37, 184) https://edu.postgrespro.com/postgresql_internals-14_en.pdf
src/backend/storage/buffer/README https://github.com/postgres/postgres/blob/master/src/backend/storage/buffer/README
pg_backend_memory_contexts (PostgreSQL 14+) https://www.postgresql.org/docs/current/view-pg-backend-memory-contexts.html
pg_stat_io (coming in PostgreSQL 16) https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-IO-VIEW
pg_prewarm https://www.postgresql.org/docs/current/pgprewarm.html
Configuring work_mem blog post https://www.pgmustard.com/blog/work-mem

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by Nikolay Samokhvalov, founder of Postgres.ai, and Michael Christofides, founder of pgMustard, with special thanks to Jessie Draws for the amazing artwork.
Transcript
Hi, this is Nikolay, and my co-host is Michael. Hi, Michael.
Hello, Nikolay.
This is Postgres FM, and today we talk about memory management: shared_buffers, work_mem, effective_cache_size, maintenance_work_mem, autovacuum_work_mem, and so on, right?
Yeah, absolutely. And just before the show, actually, you were telling me you had a joke about this.
I forgot it.
Yeah, nice.
Okay. Let's try to be shallow because it's easier, but very wide.
I mean, we will try to cover as wide a range as possible in half an hour, and we can give tips for people who want to dive deeper; we can recommend some great materials and resources, such as articles, books, and so on, and source code. So, where do you want to start?
My experience is very biased, because I make a tool for optimizing individual queries, so these system-wide things don't come up as often in the conversations I'm having. But my understanding is that a lot of people still don't tune these, and there still seems to be disagreement in the community about what is a sensible setting for things like shared_buffers, or what is a sensible setting for things like work_mem. So some of these extremely high-level but extremely important settings have some debate around them, with very experienced people disagreeing about exactly what the docs should say or exactly what equation to use to set them. There seem to be some differences of opinion between people more on the theory side versus people more in practice, and depending on the database, how big or small it is, that can make a huge difference to what the advice should be. So yeah, I think there's a good high-level overview we can do. I'd probably like to start with maybe shared_buffers. It feels like there are a few things that Postgres uses shared memory for, but that's the biggest, right? Or normally the biggest.
Yeah, I agree. This is probably number one in terms of memory management to consider.
But there is also an underlying thing that should probably go before it, and that underlying thing is called the page cache, the operating system's file cache. So Postgres always works through it.
Postgres never reads files from disk, from file system.
It always goes through page cache.
So the first cache...
First of all, why cache?
Why do we need cache?
Because SSDs are quite fast, right? You can do a lot of operations per second, and latency is quite good, sometimes below one millisecond. Of course, memory is much better, definitely. So despite the rise of new external storage devices, memory is still several orders of magnitude faster.
And we have a lot of memory lately.
Like we can easily have a server with hundreds of gigabytes
and sometimes already terabytes of memory.
And memory is super fast.
And of course, it makes sense to keep a lot of data in memory.
So final latency of query execution is better for our end users.
Well, our end users here are not the end users of the systems themselves. From the database's perspective, our end users are application nodes, or sometimes humans, but very often not humans.
And we want, of course, query execution to be very fast.
Performance is critical for databases, so that's why we want cache.
And we always have page cache underlying before even going to shared buffers.
We have it.
And usually, when we think about it, we don't know how much it will be, because Linux, for example, uses for the page cache everything it has which is not used by others, right? All free memory becomes our page cache memory, and this is good, this is fully transparent. And only then we have the buffer pool, which is controlled by shared_buffers. It's allocated at startup time and cannot be changed while the server is running, and this is a big limitation, one of the biggest things I would like to see changed in Postgres, but it's very hard. I mean, it would be great to change shared_buffers without a restart, but right now it's not possible. So if you want to adjust shared_buffers, the size of the buffer pool, you need to restart your Postgres. And since it's allocated up front, we need to think about how big it should be. But thinking about it, we should consider the underlying cache, the page cache. So Postgres basically has two levels of cache: the page cache and the buffer pool.
And the rule of thumb right now, 25%. I remember when we started to have
more than a couple of gigabytes of memory
on most machines,
there was a debate.
At some point the rule was 25%, as long as that doesn't exceed 8 gigabytes,
but now it's okay to allocate 100, 200 gigabytes,
and it's normal.
And I should notice that 25% is not optimal.
Depending on your workload, you probably want to allocate 80%, for example, but it's risky.
Of course, I would like to allocate 90% to shared buffers.
Linux would use less memory for the page cache.
But it's okay because shared buffers are smarter than page cache.
For example, in some cases, Postgres decides to use only a very limited area of shared buffers. If you perform a sequential scan on a large table, the page cache doesn't know that it's a sequential scan, Linux doesn't know it, and the scan will spam your page cache. But in the buffer pool you have a ring buffer, so only a limited area, 256 kilobytes, if I remember correctly. So basically, this process will use only this limited area to read the whole table, and it won't cause eviction of useful data that is needed by other backends, right?
So the buffer pool is smarter than the page cache, so we could allocate 80 or 90%, but it's risky, because we can hit an out-of-memory state sooner.
Yeah, so lots to go on. I think 25% does seem to be... I think it's even mentioned in the Postgres docs, and it's mentioned in a lot of guides. It's interesting you say it's not optimal, but there probably is a workload for which it is: if, for example, your tables and indexes are almost exactly 25 percent of your RAM, if they fit exactly in shared_buffers, maybe that is optimal. I'm just saying there possibly is one.
Yeah, I agree.
The problems we've got are that the default is so low that it's unlikely to be a good starting point.
25% probably gives you a better starting point for experimentation,
but then to experiment, you need restarts,
and you also need to be able to replay your workload and things like that.
So it's a tricky one.
And before we move on, though, I did really want to say
there's a really good tweet from Andres Freund that I thought,
I couldn't believe he managed to summarize this in basically two tweets.
But the reason your statement that it's probably not optimal is a good one is that it assumes your data doesn't fit in RAM, and for a lot of people, it will.
A lot of people, if you're working on a small database,
your data might well fit in less than 25%,
in which case you're probably going to be fine.
It's probably going to be like...
Or it's 10% off, yeah.
Yeah, exactly.
But the people that are going to have problems with this, the people that are going to need to listen to this or look into some of these links, are probably people that have got much larger data sets than will fit easily into RAM, or at least it would be extremely expensive to get a server where they would fit into RAM. And then you've got the problem of the Postgres cache, shared_buffers, versus the operating system cache, and I think Andres does a great job of explaining the fact that you can have the same data cached in both, and if you have the same data duplicated, that's inefficient for your cache hit rate.
That's why 50% is probably not a good idea at all, right? So you need to move this slider either to one side or to the other side.
Exactly.
And I would prefer to move it to 80 or 90%, but it's risky.
I've seen people more commonly starting to point to maybe closer to 40% at the higher end and towards 10% at the lower end, but 80 or 90% just sounds wild to me when you consider what else we want to use memory for.
From my personal experience: I spent some time studying various cases and running a lot of experiments, trying to develop a fully automated methodology. For each particular database we take its workload (how we do that is a different topic) and we just run a lot of experiments to see where our optimal point is. Sometimes we have a local optimum below 50% and a slightly better optimum after it, when we pass 50 and go closer to 100.
But again, it's risky.
We can discuss why.
But eventually I got disappointed in this area completely, absolutely disappointed, because the things you do for query performance have a bigger impact. Right now I prefer to say: let's just put 25%.
We know probably we are not optimal, but we are on the safe side.
And let's first focus on different settings which are very influential, and on query optimization. If we do need to squeeze a lot out of performance, for example if we have 10,000 Postgres nodes and we need to save on memory and we need really optimal performance, then it would probably make sense to find a better setting than 25%. But in my opinion, if you start with 25% for OLTP workloads and then say we are going to optimize it, maybe 40% is better, that's premature optimization, because most likely you have different things to tune first. And not even to tune: to optimize queries and the workload, because you lack indexes and so on. You need partitioning
You need partitioning
to have data locality,
to use the buffer pool better,
more efficiently.
And if you focus on those areas, maybe when you return and think about 25% versus 40%, which will probably give you a 10% benefit, you think: later, later, later, and you keep 25% for longer.
This is my current understanding of this topic.
And again, I would like to thank Henrietta Dombrovskaya. I always forget, I'm bad with names as usual, but I would like to thank her, because it was like a cold shower for me when I presented our approach to Postgres config tuning with experiments and so on in New York, maybe five years ago. And she said: you can optimize a lot, but then developers or ORMs write bad queries and all your optimization vanishes, because queries lack indexes or are written poorly and so on.
And at the time my reaction was like: no, no, config optimization is super important, let's do it. But then I switched completely, and now we work on query optimization as well, with experiments and thin cloning and so on. The change happened there.
So I'd say shared_buffers optimization is probably the number 10 item on the list of things you want to optimize. Put 25% and go elsewhere first.
Yeah. But given our topic is memory management...
Well, we have other memory settings we need to consider, and they are also very interesting.
But shared_buffers is, of course, number one in terms of Postgres configuration, because it's hard to change, since a restart is needed, and it's influential if you don't change it at all, because the original Postgres default is 128 megabytes, right? And there is also a different default dictated by the apt or yum packages, which is probably half a gigabyte, I don't remember, maybe 256. But these days such settings would fit a Raspberry Pi, and not even the latest models, because the latest models have like eight gigs of memory. So this default setting of shared_buffers in the Postgres config is so outdated that it can be used only on teapots. It's Postgres-on-a-teapot style. So it should always be tuned, but the 25% rule is good enough.
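A minimal sketch of applying that rule of thumb, assuming a hypothetical 64 GB server; shared_buffers is allocated at startup, so a restart is required:

```sql
-- Hypothetical example: ~25% of a 64 GB server.
ALTER SYSTEM SET shared_buffers = '16GB';
-- Restart Postgres (pg_ctl restart / systemctl restart ...), then verify:
SHOW shared_buffers;
```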
The one exception: for a lot of people these days, their default config is now the default config chosen by their cloud provider. And because of the reason you said, this is one of the few settings that I see almost... well, every provider I've checked has tuned this. Not always to people's taste, but yeah, this is one they change.
And they use 25%.
Not all of them.
Not?
No. Well, I haven't checked all of them recently by starting new instances,
but I was talking to a customer of Heroku just recently,
and they're having a real ordeal trying...
Heroku is...
It's a very old solution.
Yeah, but lots of people are still on it.
Yes, I know, I have customers there as well. And they haven't tuned much, they haven't changed much. They don't even have pg_repack.
I know, I know, but they have changed it, but not
to 25%, which is interesting, because that
means it's a deliberate decision not going with
what the docs say. Like, I think the easiest,
the least thought... Because this decision
was made before it was written into
docs. Maybe.
So probably they still apply the old rule: 25%, but not more than 8 gigs, so capped at a maximum of 8 gigabytes. And this was also a recommendation in Josh Berkus's old annotated postgresql.conf.
The annotated postgresql.conf.
It was, yes, and unfortunately it's lagging quite a bit for new versions, but 25% is the current rule of thumb.
That's it.
Let's keep it.
It should be number 10.
Let's agree to put 25%
and forget about it for a while.
Cool.
So we have 25% that went to shared_buffers. And if, in a vacuum, nothing else is running and no other processes are using memory, 75% will go to the page cache.
And we have effective_cache_size, which is an additional but very important setting, which tells the Postgres planner how big both of our caches are together. Well, not together exactly. It's interesting.
Normally people recommend setting effective_cache_size as the total memory minus some area.
No, not minus. It includes shared_buffers.
For example, if we indeed decided to exceed 50%, I remember several years ago we discussed it on hackers: say 60% goes to shared_buffers and other, non-Postgres processes use, say, 10%, so we have 30% left for the page cache. Should effective_cache_size then be the sum of the two? The sum seems wrong, right? Because we have duplicated caching. So either the maximum or the sum, it's interesting. In this case, yeah, I already forgot.
But in general, if we decide to start with 25% for the buffer pool, I would just put 75% for effective_cache_size as a starting point. But what I would like to say here: we can put 200, 500, 1000% of our available RAM, and since it's not an allocation, Postgres will still work. And this helps us on lower environments where we don't have enough memory but we want the planner to behave similar to production, actually exactly like production: we use the same planner settings and work_mem as on production, and effective_cache_size is one of the most important planner settings. So we can fool Postgres this way. The planner doesn't look at the real available RAM, real available CPU or disk; it doesn't know anything about them and doesn't even look at shared_buffers. It just looks at effective_cache_size.
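A rough sketch of that idea, assuming the 25%/75% split on a hypothetical 64 GB production server; since effective_cache_size is only a planner hint, a smaller test environment can carry the same value:

```sql
-- Hypothetical 64 GB production host: ~75% for effective_cache_size.
ALTER SYSTEM SET effective_cache_size = '48GB';
SELECT pg_reload_conf();

-- On a smaller environment, the same value can be set (even above real RAM)
-- so the planner makes the same choices as on production:
SET effective_cache_size = '48GB';
```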
It's interesting, I still remember your tweet about that. I think you did a poll, and I got it wrong: I thought it would take shared_buffers into account in the planner, but I was wrong. That was a few years ago now, but yeah, good one. I feel like we should move on.
From my experience, at least, work_mem is hugely important here.
Is that the next thing you'd want to talk about?
Or where would you go next?
Yeah, so brief summary.
shared_buffers is more of a physical setting, which will affect the timing of query execution. Of course, if the data is there, good. If it's not there, bad: we need to go to the page cache, and for Postgres it will already count as a buffer read, but that read might be disk-related or maybe not involve the disk at all. Postgres doesn't know; that's why we need pg_stat_kcache to understand. But effective_cache_size is for the planner, for how the planner chooses. The planner thinks: okay, we expect that many gigabytes of RAM to be involved in caching, so it's either more optimistic that data will be in memory, or it expects some real data reads to happen, meaning heavier I/O operations.
So this is a very high-level, logical thing; only the planner uses it to decide what to do. But shared_buffers, again, will affect timing, not the planner's behavior.
We should mention our episode about checkpoint tuning, because some buffers get changed, and that can happen during a SELECT as well if hint bits are updated on the fly. A buffer is, like, in virtual memory, a block read from disk, a page, yeah, a page of 8 kilobytes. We have huge pages as well, right? But that's a different topic. So a buffer is eight kibibytes, and if we changed at least something inside this page, it's considered dirty, and then the checkpointer, or the background writer first, then the checkpointer, and sometimes the backend itself, needs to clean it, to synchronize it with disk, so there will be less stress next time we need to perform recovery. And we discussed this topic, right? In the checkpoint tuning episode.
But what we didn't discuss probably is that page cache also needs tuning sometimes.
Because if we accumulate a lot of dirty pages in page cache,
it's not directly related to Postgres,
but it's related to overall performance of our database.
If we allow the operating system to accumulate a lot of dirty pages in the underlying page cache, we might have performance issues as well, because we want this process to be smoother. So you probably want to tune the behavior of the page cache, adjusting kernel settings with sysctl, so the amount of dirty pages in the page cache doesn't grow a lot and the operating system flushes dirty buffers (with pdflush and so on) more often. But this is slightly outside of Postgres, although it sometimes affects how we work with memory.
So again, Postgres relies on page cache a lot,
so this topic should be studied as well.
Okay, now what's left? work_mem and maintenance_work_mem, these two guys, first of all, right?
Yeah, and when you mentioned the previous episode,
I thought you were actually going to mention the buffers episode,
and I think there's also something to be said there before we move on, which is when people are doing estimations for their cache hit rate, for example,
or if you're looking at a single query plan with buffers, which hopefully you are, you'll see hits and reads.
But the thing you mentioned here that's relevant is that those reads aren't necessarily from disk; they could be from the OS page cache, and we can't tell.
And pg_stat_statements cannot tell, but pg_stat_kcache can, and unfortunately we don't have anything for EXPLAIN. It would be great to have some extension which would extend EXPLAIN (ANALYZE, BUFFERS) so we would see real disk I/O. But real disk I/O can be measured if you look at /proc/<pid>/io for our process ID. You just cat /proc/<pid>/io and you see reads and writes, and this is real disk I/O. So you can see it if you have access to Linux; of course, not on RDS. So sometimes you can cheat a little bit and check what's happening there before you run EXPLAIN (ANALYZE, BUFFERS), and again after it.
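A sketch of that trick, assuming shell access to the database host (so not RDS); the /proc path is standard Linux, the rest is illustrative:

```sql
-- Find this session's backend process ID:
SELECT pg_backend_pid();
-- Then, on the host, before and after running EXPLAIN (ANALYZE, BUFFERS):
--   cat /proc/<that_pid>/io
-- read_bytes / write_bytes there reflect actual disk I/O, while the "read"
-- buffer counts in the plan may have been served from the OS page cache.
```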
Nice.
Right, work_mem. So work_mem, we discussed it briefly recently as well. The difficulty with work_mem is that nobody knows how much we will need in advance. Each query can consume multiple times work_mem, but sometimes only a fraction of it. work_mem defines the amount of memory each backend can use to process a query, for operations like hash joins, ordering, grouping, and so on. And if we have multiple such operations in one query, we might use multiple times work_mem. And if we reach work_mem and need more, then temporary files will be created and query execution will slow down significantly.
And that's because Postgres is no longer doing, for example, that sort in memory. It's doing that sort on disk.
An external sort.
Yeah, exactly. And it's not even just per operation; as we discussed in a recent episode on parallelism, it can be per backend, per operation. So it can multiply out quickly.
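A minimal sketch of spotting such a spill; the table and column names are hypothetical:

```sql
-- With a small work_mem, a big sort may spill to temporary files:
SET work_mem = '4MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders ORDER BY created_at;
-- Look for "Sort Method: external merge  Disk: ..." in the output.

-- Raising work_mem for this session can bring the sort back into memory:
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders ORDER BY created_at;
-- Now you would hope to see "Sort Method: quicksort  Memory: ..." instead.
```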
Yeah, and there is also the hash_mem_multiplier setting, so it's very tricky to understand how much memory to expect, and usually our approach to tuning is very reactive. Of course, the default for work_mem, again, is for teapots only, usually. But it's very related to max_connections, how many CPUs, and how many connections for backends you expect. Starting from Postgres 14, we might allow a lot of idle connections. And you mentioned Andres Freund: there are a couple of tweets and a couple of blog posts related to memory consumption by backends, and also max_connections, and whether it's okay to have a lot of idle connections. Starting from Postgres 14, we have an optimization; it turned out that memory management was not the number one problem, the number one problem was how work with snapshots was organized. So these are very good in-depth blog posts, these two. But it's very tricky to understand how much memory is used by a process, first of all: even if you have all the access and you have ps, top, sar, all those things, it's hard to understand in Linux how much memory is really consumed by each process. Also, we cannot limit it reliably. We cannot say, okay, 100 megabytes for each backend, that's it. Unfortunately, if we say work_mem is 100 megabytes, the backend is still allowed to use multiple times that because of this behavior. work_mem defines the limit for each operation, not for the backend.
And also we have parallel workers, so together this becomes quite a tricky topic. Our approach is usually very reactive and iterative. We first start from some safe work_mem setting. A safe approach is like: okay, we have max_connections, say 300 or 500, for our 96-core or 244-core server, a big one. Then we say, okay, we should be on the safe side: we know 25% is shared_buffers, and we know some backends will use maintenance_work_mem, for index creation or autovacuum, and we allocated like four gigs or two gigs for that. We can discuss this also.
We should discuss this.
And we also have a maximum of, say, 10 such workers, so, okay, 40 gigs. We have maybe half a terabyte, 700 gigs. So 25% goes there, 40 gigs for maintenance workers, like index creation and index maintenance, because it's needed all the time, and autovacuum: say we have 10 autovacuum workers and one or two workers rebuilding indexes. Okay, the math is good here. So what's left? Ignoring the page cache for a while, we have like 300 gigs, for example. We say, okay, we have 300 max_connections and 300 gigs; it means we can give up to one gigabyte to each backend, right? Okay, that's good. Then let's allow a multiplier of two or three, because we know some queries can have multiple operations like sorts or hash joins and so on. In this case we say, okay, a quite safe setting is 100 megabytes: times two or three that's 200 or 300 megabytes, so we don't get close to one gigabyte per backend. It means we can use 100 megabytes for work_mem. Let's start with it. And we start with it.
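The back-of-the-envelope arithmetic from that example, written out; the numbers are the illustrative ones from the episode:

```sql
-- RAM ~ 700 GB
--   shared_buffers (25%)                                  ~ 175 GB
--   maintenance workers (autovacuum + index builds, ~10 x 4 GB) ~ 40 GB
--   leave a large share for the OS page cache
--   rough budget for backends                             ~ 300 GB
-- max_connections = 300  =>  ~1 GB ceiling per backend
-- allow a 2-3x multiplier per query (several sort/hash operations)
--   =>  work_mem ~ 100 MB stays well under that ceiling
ALTER SYSTEM SET work_mem = '100MB';
SELECT pg_reload_conf();
```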
And then we observe. First we see that most backends don't reach it even once, but some backends need more, so it's imbalanced. We also see the page cache is quite huge; what's left is mostly page cache. We understand: okay, our average work_mem consumption is quite low, but sometimes we need more. And so we start to raise it. We know on average we don't reach it at all, but some backends need it. We have two options here: either set it locally in sessions, if we control the application code, we can say SET work_mem for a particular query, because we know it's very heavy and we don't want this query to use temporary files; or we can just raise work_mem globally, after observing our workload for quite a long time and seeing that it's quite safe. And it's not uncommon to see that people already have settings that are not safe: if you take max_connections, take work_mem, apply a multiplier of two or three, and consider shared_buffers and maintenance_work_mem, you see that we don't have that much memory. But we know we don't reach the limits.
And this is what I was alluding to at the beginning of the episode.
You say it's not safe, but you mean it's not safe in theory,
not safe in practice.
For particular database.
Yeah, exactly, with the current workload.
So it's an interesting topic.
I think what you said there is actually a really good point that sometimes gets missed: it can be set at different levels. And if you've got one reporting query that runs once a month, for example, you could set a higher number for that.
A gigabyte, five gigabytes, especially if it's just a single backend.
Yeah, if you know it's running at a time that's relatively quiet, for example, and it's not going to be contending with lots of other things. So lots of things are possible here.
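For illustration, work_mem can be set at several levels; the role, database, and query here are hypothetical:

```sql
-- Per role (e.g. a user that runs heavy monthly reports):
ALTER ROLE reporting_user SET work_mem = '1GB';

-- Per database:
ALTER DATABASE analytics SET work_mem = '256MB';

-- Per transaction, only for one heavy query:
BEGIN;
SET LOCAL work_mem = '2GB';
-- ... run the heavy report here ...
COMMIT;
```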
Smaller servers than the ones Nikolay is talking about are available. But even for very small services, even for extremely small instances, I very rarely see one where a setting of less than 16 megabytes, four times the default, is sensible, as far as I've seen.
And quite often even these small queries, even these transactional workloads, do spill over for me. If you start logging temporary files, which I think is very good advice, although I saw in one of the AWS docs that you shared with me, which I'll link up, that they advise only turning it on temporarily, so I was going to ask if that was the case for you as well. But logging temporary files is extremely useful, and if you see lots of them being logged in the 4 to 10 megabyte range, which is not uncommon for transactional workloads, you've probably not changed the setting and it's well worth looking at.
And we have aggregated statistics for temporary files, the size and number of files, in pg_stat_database for each database. This is a good thing to monitor as well, and to have alerts on and so on. But as for logging, I prefer logging all temporary file occurrences on a tuned server. If it's not tuned and every query execution produces them, it will be a disaster. But normally we've already raised shared_buffers and raised work_mem, and since we raised work_mem, only particular queries are experiencing temporary files, so it's okay to log them all. I mean, to set zero as the size threshold for logging.
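A small sketch of both ideas: log every temporary file (threshold 0) and watch the aggregated per-database counters:

```sql
-- Log every temporary file created by queries (0 = no size threshold):
ALTER SYSTEM SET log_temp_files = 0;
SELECT pg_reload_conf();

-- Aggregated temporary file statistics per database:
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database
ORDER BY temp_bytes DESC;
```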
But I would say I see issues with work_mem way more often than I see issues with shared_buffers, probably because of my bias, but also because cloud providers don't tend to tune it.
Because it's this scary setting. It's dangerous.
They don't even use that formula you were talking about, the formula that's shared in lots of places. I particularly like an EDB guide by Vik Fearing; I'll share it in the links as well. But that formula, they could apply it, right? They could set a sensible setting.
The annotated postgresql.conf also has it.
It does, yeah. But my experience is they don't. They start at 4 megabytes still.
Even in, well, so maybe I'm wrong,
maybe I'm a bit out of date, but a lot of them
that I was checking didn't change.
And you see temporary files in plans a lot, right? And the first advice: raise work_mem. Because I often see 4 megabytes.
First advice, check your indexes.
Check your indexes. Well...
Well, first advice, tune the query. Like, if you're doing a massive sort and like, you know, like...
In this case, second advice, apply partitioning.
Well, first advice, do less. Yeah.
The first advice is always buffers. Because if you focus on buffers... all database performance optimization is about doing less I/O.
Right.
And that's the key for good latencies,
timings and so on.
Less IO.
Well, we're possibly talking about the one exception to that in today's topic, which is memory. It's not reported in buffers.
If we're doing a sort in memory, is that I.O.?
I think no.
Well, I/O can go to internal memory or external memory. Is the page cache considered external? But let's forget about it.
It's not reported as buffers, crucially, if it's done in memory.
So that's an interesting...
But maybe that's an implementation detail.
Right.
Regardless, if you do one thing as a result of this episode, please check what your current work_mem setting is. And if it's 4 megabytes, start checking your temporary files.
Yeah, probably your cloud provider already tuned shared_buffers, not to an optimal state, but a good enough state. But work_mem, they probably didn't touch it; it's still 4 megabytes. And for modern workloads and data volumes, that's a tiny size, again a teapot size. But if you raise it a lot and max_connections is huge, with a lot of idle connections, probably you will run out of memory. So it should also be done carefully, in multiple steps, observing, and so on. But again, I agree with you, query tuning and partitioning are also important here, because partitioning gives data locality. And you mentioned the case when our database can be put fully into shared_buffers.
when our database can be put fully to shared buffers.
That's great.
But if we partitioned, we can say, okay, this is our total database size, but this is our
work set size.
Only fresh partitions we work intensively with.
And BufferPool has a page cache where both have quite good, simple mechanisms to understand
what's needed, what's not.
Usage counters and so on.
And this can be studied using a good extension called pg_buffercache. You can see what's inside your buffer pool right now. It doesn't hurt to run it, but it will take some time because it will scan the whole buffer pool; if it's huge, it will take a while. It will give you statistics like: this relation or this index is present with this many blocks, and you can aggregate it. It's an official contrib module shipped with Postgres, and the documentation has basic examples, so you can understand what's inside the buffer pool right now. And if you apply partitioning, these things might start improving, because instead of mixing all the new data into a single huge table and its indexes, it will be more efficient.
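A sketch along the lines of the examples in the pg_buffercache documentation: which relations in the current database occupy the most buffers.

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Top relations by number of buffers currently in the buffer pool:
SELECT c.relname,
       count(*)                        AS buffers,
       pg_size_pretty(count(*) * 8192) AS cached_size
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```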
Yeah. In fact, Timescale give this recommendation in their docs, I think. So they actually talk
about setting the chunk interval so that at least your latest chunk, which is their word for partition, fits in shared buffers.
So that's another way of thinking about it.
Like you could be choosing your partition size or frequency based on what you have available memory-wise.
So yes, a cool flipped way of thinking about it.
Yeah.
You said you wanted to talk about something.
I've forgotten what it was,
but we needed to make sure we covered one more thing.
Yes.
Oh, yeah.
Let's cover that.
So, autovacuum_work_mem is set to minus zero, meaning that it's inherited from maintenance_work_mem.
Minus one.
By default.
Minus one, yeah. autovacuum_work_mem is minus one by default. In most cases, I see people don't touch it. So it means that maintenance_work_mem covers both operations like index creation or reindexing, and also autovacuuming.
And usually... I don't remember the default, but it's also very low. Usually you want to raise it to at least one gigabyte if you have a lot of memory, or half a gigabyte. But interestingly, a couple of years ago I ran new experiments trying to understand how beneficial it is to go to higher values, and for index creation particularly I didn't see big benefits beyond half a gigabyte or one gigabyte: the differences between one gigabyte and five gigabytes in my experiments were low. But don't trust me blindly, it's always good to experiment, right? So you create your index with one setting, then you restart the server to flush shared_buffers and everything, and also don't forget to flush the page cache: you echo 3 into /proc/sys/vm/drop_caches; you can find on the internet how to flush the page cache in Linux. It's easy, it's just a single line.
Then you start the experiment from scratch again with a higher value of maintenance_work_mem, and you compare duration, for example, and you can compare how much I/O happened, and so on. It's a good experiment to run, a simple exercise, and you will see whether it's beneficial to raise it for your particular Postgres version, your database, your indexes, and so on.
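A rough sketch of that experiment, assuming a test box with superuser and shell access; the table and index names are hypothetical, and the cache-dropping step happens outside Postgres:

```sql
-- 1) Restart Postgres, then drop the OS page cache on the host:
--      sync && echo 3 > /proc/sys/vm/drop_caches
-- 2) Build the index under the first setting and note the duration:
SET maintenance_work_mem = '1GB';
\timing on                           -- psql meta-command to show durations
CREATE INDEX idx_orders_created_at ON orders (created_at);
-- 3) Drop the index, repeat steps 1-2 with a higher setting, and compare:
DROP INDEX idx_orders_created_at;
SET maintenance_work_mem = '5GB';
CREATE INDEX idx_orders_created_at ON orders (created_at);
```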
For autovacuum, I think it makes more sense to raise it, but again, you can experiment with vacuum too; you just need to have the same number of dead tuples for each experiment run, so it's interesting how to build that experiment. If our listeners have different experience, please comment on Twitter, YouTube, anywhere. I would like to hear if some of you found it beneficial for index maintenance to raise maintenance_work_mem to much more than one gigabyte, for example. But again, since autovacuum_work_mem is set to minus one, we need to take maintenance_work_mem, because it's what autovacuum workers use, and multiply it by the number of autovacuum workers. And my recommendation is that three workers is not enough; usually, if you have a lot of cores, raise the number of workers to like 25% of the cores available, for example. And then you can understand how much memory can be consumed by autovacuum workers as well, right? So it can be a lot.
The limit is per worker, right? The limit we talked about earlier, each worker can consume that.
Exactly, but only once, right? Not multiple times; in this case it's only once, as I understand it. So my preference, especially if we have a lot of partitions, is more workers and maybe not such a big maintenance_work_mem. And for index maintenance, maybe just a single backend recreating indexes, maybe sometimes two or three, but not many.
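A small sketch of that worst-case arithmetic; the concrete numbers are illustrative:

```sql
-- With autovacuum_work_mem = -1, each autovacuum worker falls back to
-- maintenance_work_mem, so the worst case is roughly
--   maintenance_work_mem * autovacuum_max_workers
-- e.g. 1 GB * 10 workers = ~10 GB, on top of shared_buffers and the backends.
SHOW autovacuum_work_mem;
SHOW maintenance_work_mem;
SHOW autovacuum_max_workers;
```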
Nice. I feel like we should wrap up soon, but there are a few links that I found really useful, one in particular we haven't managed to mention yet, so maybe we should do a quick-fire round of good resources.
First of all, both internals books, or online books, one is a PDF and one is just a website, are very interesting: from Suzuki and from Rogov. They cover many aspects quite deeply. So if you want to go inside, and I think in this case almost everyone should go inside, even backend developers could benefit from understanding how Postgres works: the process model, no threads, shared buffers always used to work with any relations, and so on, work_mem, maintenance_work_mem. This is the number one thing.
Yeah, before you move on, I wanted to give a special shout-out to the Rogov book. I was expecting, given it's translated, for it not to be as clear and well-written as it is. And I looked up just before this who translated it, and that's Lyudmila Mantrova.
So shout out to them, because it's fantastically well-written, even the English version.
So I'll share a link to that.
Yeah, I use this book as a reference, and I actually honestly don't realize which language I use because I speak
both and I found both
versions are good enough. It means that
translation is really good. I agree.
But before the internals books, we should mention the documentation, of course, and also the README file in the backend... in the source code, yeah, about the buffer pool, src/backend/storage/buffer/README, because it's also quite good plain English, explaining some details of how the buffer pool is organized, about pins and so on, like about locks.
But also interesting for folks who are using newer versions of Postgres: in Postgres 14, we have the pg_backend_memory_contexts system view, which gives aggregated stats for memory usage. I never used it in production somehow, because I still have mostly older versions. Well, some are already on Postgres 14, but I'd need some incident to start using it. It's an interesting thing to keep in mind that now we have a good observability tool.
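For example (PostgreSQL 14+), a quick look at the largest memory contexts of the backend you are connected to:

```sql
SELECT name, parent, total_bytes, used_bytes
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 10;
```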
And in Postgres 16, pg_stat_io is added.
Coming soon.
Yeah.
Right.
Also aggregated statistics for how many operations happened, timing and so on.
Also very interesting.
So, these, and pg_buffercache, are enough to go quite deep. And also pg_prewarm, by the way: if you want to restore the state of the buffer pool, you can check pg_buffercache plus pg_prewarm, and there's also automatic pre-warming, which appeared some versions ago; there is a recipe. So you have a restart and you want to restore the state of the buffer pool so query performance is good again: it's also possible. I never used it, honestly, but I know it is possible.
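A small sketch of manual pre-warming; the relation name is hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Load a relation into the buffer pool manually (default mode 'buffer'):
SELECT pg_prewarm('orders');

-- For automatic pre-warming across restarts, add to postgresql.conf:
--   shared_preload_libraries = 'pg_prewarm'
-- The autoprewarm worker then periodically dumps the buffer pool contents
-- and reloads them after a restart.
```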
And knowing what is possible is sometimes more important than knowing all the details, right?
I've heard people using it for experimentation, but not for other things.
Yeah, this is the opposite approach.
Instead of starting cold, you probably want to start from the same warmed-up state.
Makes sense.
It's also worth mentioning that if you are on ZFS and you use, for example, Database Lab, you can use ZFS on production too. In this case you have branched Postgres and you can use a smaller buffer pool. Of course, this cache is not as smart as the buffer pool, but it's good to have multiple branched Postgres instances utilizing a single underlying cache instead of the regular page cache. It's called ARC, the ZFS ARC. All databases are writable, running on a single host, but they use a single cache, ARC, A-R-C, from ZFS. And it's great to see that shared blocks are cached there, so you start a new branch and you're ready quite fast. The data is not in the buffer pool, and we need to use a smaller buffer pool when running multiple Postgres instances on a single machine, but it's already quite fast because it's cached in this underlying ARC. So it's also an interesting approach, quite underappreciated, I think. But some folks do run ZFS on production, and it's interesting.
I think that's enough for today.
What do you reckon?
Right. Yeah, that's it.
And I like your idea: raise work_mem, as a single recommendation.
Consider raising it, but be careful.
Yeah, exactly.
In fact, I've got a blog post on that.
I can't believe I forgot to say, but I'll share that as well.
Good.
Right.
Thanks so much, Nikolay.
And I won't see you next week, because even though last week we were talking about such a good streak of weeks in a row, I'll be missing the next two.
Yeah, you... it's called a slacker, right?
Yeah, absolutely. But we will have episodes no matter what.
Wonderful. Well, I look forward to coming back from holiday, seeing what they're on, and being a listener.
Surprise.
Good.
Yeah.
Exciting.
Thank you.
Take care.
And yeah, catch you soon.
Bye.
Bye.