Postgres FM - Memory

Episode Date: June 16, 2023

Nikolay and Michael discuss memory in PostgreSQL — how it works, important settings, and how to go about tuning them.

Here are links to a few things we mentioned:

Resource Consumption (PostgreSQL docs) https://www.postgresql.org/docs/current/runtime-config-resource.html
Andres Freund tweet about shared_buffers https://twitter.com/AndresFreundTec/status/1438912583554113537
Henrietta (Hettie) Dombrovskaya https://hdombrovskaya.wordpress.com/about-the-author/
annotated.conf (by Josh Berkus) https://github.com/jberkus/annotated.conf
Our episode about checkpoint tuning https://postgres.fm/episodes/wal-and-checkpoint-tuning
Our episode about BUFFERS https://postgres.fm/episodes/buffers-by-default
Analyzing the Limits of Connection Scalability in Postgres (blog post by Andres Freund) https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/analyzing-the-limits-of-connection-scalability-in-postgres/ba-p/1757266#memory-usage
Tuning memory parameters for Aurora PostgreSQL https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.BestPractices.Tuning-memory-parameters.html
RDS for PostgreSQL memory https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL.Tuning.concepts.html#PostgreSQL.Tuning.concepts.memory
EDB guide by Vik Fearing https://www.enterprisedb.com/postgres-tutorials/introduction-postgresql-performance-tuning-and-optimization#resourceusage
pg_stat_kcache https://github.com/powa-team/pg_stat_kcache
pg_buffercache https://www.postgresql.org/docs/current/pgbuffercache.html
Process and Memory Architecture chapter (by Hironobu Suzuki) https://www.interdb.jp/pg/pgsql02.html
PostgreSQL 14 Internals PDF book by Egor Rogov (pages 37, 184) https://edu.postgrespro.com/postgresql_internals-14_en.pdf
src/backend/storage/buffer/README https://github.com/postgres/postgres/blob/master/src/backend/storage/buffer/README
pg_backend_memory_contexts (PostgreSQL 14+) https://www.postgresql.org/docs/current/view-pg-backend-memory-contexts.html
pg_stat_io (coming in PostgreSQL 16) https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-IO-VIEW
pg_prewarm https://www.postgresql.org/docs/current/pgprewarm.html
Configuring work_mem blog post https://www.pgmustard.com/blog/work-mem

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

If you would like to share this episode, here's a good link (and thank you!)

~~~

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hi, this is Nikolai and my co-host is Michael. Hi, Michael. Hello, Nikolai. This is Postgres FM and today we talk about memory management, shared buffers, workmem, effective cache size, maintenance workmem, auto-vacuum workmem, and so on, right? Yeah, absolutely. And just before the show, actually, you were telling me you had a joke about this. I forgot it. Yeah, nice. Okay. Let's try to be shallow because it's easier, but very wide.
Starting point is 00:00:35 I mean, we will try to cover as wide as possible in half an hour and we can give tips where to if people want to dive deeper we can recommend some great materials great resources such as articles books and so on source code so where do you want to start my experience is very biased because i make a tool for optimizing individual queries. So these system-wide things don't come up as often in the conversations I'm having. But my understanding is that a lot of people still don't, and there seems to still be disagreement in the community about what is a sensible setting for things like shared buffers? What is a sensible setting for things like work mem? So some of these extremely high level but extremely important settings have some debate around them and very experienced people disagreeing with
Starting point is 00:01:34 exactly what the doc should say or exactly what equation to use to set these and there seems to be some like differences of opinion from people more in the theory versus people more in practice very different if there's uh depending on the the database like how big or small it is can make a huge difference to what the advice should be so yeah i think there's a good high level overview we can do so i'd probably like to start with maybe shared buffers it feels like there's there's a few things that postgres uses shared memory for but but that's the biggest, right? Or normally the biggest. Yeah, I agree. This is probably number one in terms of memory management to consider. But there is also underlying thing that probably should go before it. And underlying thing is
Starting point is 00:02:19 called page cache or relational system file cache. So Postgres always works through it. Postgres never reads files from disk, from file system. It always goes through page cache. So the first cache... First of all, why cache? Why do we need cache? Because SSD is quite fast right it's like you can do a lot of operations per second latency is quite good sometimes below one millisecond it's quite good of course memory is
Starting point is 00:02:54 much better definitely so of course we still despite of the rise of new external storage devices memory is still several orders of magnitude faster. And we have a lot of memory lately. Like we can easily have a server with hundreds of gigabytes and sometimes already terabytes of memory. And memory is super fast. And of course, it makes sense to keep a lot of data in memory. So final latency of query execution is better for our end users.
Starting point is 00:03:33 Well, our end users are not end users of systems we usually use. Our end users from database perspective are some application nodes or sometimes humans, but very often not humans. And we want, of course, query execution to be very fast. Performance is critical for databases, so that's why we want cache. And we always have page cache underlying before even going to shared buffers. We have it. And usually when we think about it, we don't know how much it will be originally because like linux for example linux uses for page cache everything it has which is not used by others
Starting point is 00:04:13 right, all free memory becomes our page cache memory, and this is good, this is fully transparent. And only then we have the buffer pool, which is controlled by shared_buffers, and it's allocated at startup time and it cannot be changed while we run the server, and this is a big limitation Postgres has, one of the biggest items I would like to be changed in Postgres, but it's very hard. I mean, it would be great to change shared buffers without restart, but right now it's not possible. So if you want to adjust shared buffers, the size of the buffer pool, you need to restart your Postgres. And since it's allocated, we need to think how big should it be. But thinking about it, we should consider
Starting point is 00:04:57 underlying cache, the page cache. So Postgres basically has two levels of cache, page cache and buffer pool. And the rule of thumb right now, 25%. I remember when we started to have more than a couple of gigabytes of memory on most machines, there was a debate. At some point, the rule was 25 if it doesn't exceed 8 gigabytes, but now it's okay to allocate 100, 200 gigabytes,
Starting point is 00:05:26 and it's normal. And I should notice that 25% is not optimal. Depending on your workload, you probably want to allocate 80%, for example, but it's risky. Of course, I would like to allocate 90% to shared buffers. Linux would use less memory for the page cache. But it's okay because shared buffers are smarter than page cache. For example, in some cases, Postgres decides to use a very limited area of shared buffers. For example, if you perform sequential scan on a large table,
Starting point is 00:06:06 page cache doesn't know that it's a sequential scan. Linux doesn't know it. And it will spam your cache, page cache. It will spam it. But buffer pool, you have a ring. So only limited area, I don't remember, 256k. I don't remember. So basically, this process will use only this limited area to read whole table, and it won't cause eviction of useful data that is needed by other backends, right?
Starting point is 00:06:46 So buffer pool is smarter than page cache, so we would allocate 80 or 90%, but it's risky because we can be we can hit out of memory state sooner yeah so lots to go on i think uh 25 is does seem to be i think it's even mentioned in the postgres docs it's mentioned a lot of guides it's interesting you say it's not optimal but like it may not be like there's a there's a there probably is a workload like if for example your tables and indexes are almost exactly about 25 percent of your ram like if they fit exactly in shared buffers maybe that is an opt but i'm just saying like there is possibly yeah yeah i agree the problems we've got are that the default is so low that it's unlikely to be a good starting point. 25% probably gives you a better starting point for experimentation,
Starting point is 00:07:35 but then to experiment, you need restarts, and you also need to be able to replay your workload and things like that. So it's a tricky one. And before we move on, though, I did really want to say there's a really good tweet from Andres Freund that I thought, I couldn't believe he managed to summarize this in basically two tweets. But the reason that your statement is a good one that is probably not optimal is that assuming your data doesn't fit in RAM, which is like a lot of people, it will.
Starting point is 00:08:06 A lot of people, if you're working on a small database, your data might well fit in less than 25%, in which case you're probably going to be fine. It's probably going to be like... Or it's 10% off, yeah. Yeah, exactly. But the people that are going to have problems with this, the people that are going to need to listen to this or look into some of these links are probably people that have got
Starting point is 00:08:28 much larger data sets than will fit easily into RAM, or at least it'll be extremely expensive to get a server where they would fit into RAM, and then you've got the problem of the Postgres, yeah, the shared buffers, the Postgres cache versus the operating system cache, and I think Andres does a great job of simplifying the fact that you can have it saved in both, and if you have the same data duplicated, that's inefficient for your cache hit rate. That's why 50% probably is not a good idea at all, right? So you need to move this slider either to one side or to a different side. Exactly. And I would prefer to move it to 90, to 80, 90, but it's risky. I've seen more commonly people like
Starting point is 00:09:14 starting to point to maybe closer to 40 at the higher end and towards 10 at the lower end but 80 or 90 it just sounds wild to me when you consider what else we want to use memory for. My personal experience, I spent some time studying various cases and making a lot of experiments, trying to develop fully automated methodology. So each particular database, we take workload. Somehow it's a different topic. And we just run a lot of experiments and see where is our optimal point. Sometimes we have local optimum before 50% and slightly better optimum after,
Starting point is 00:09:54 like we pass 50 and go closer to 100. But again, it's risky. We can discuss why. But eventually I got disappointed in this area completely, like absolutely disappointed, because query performance, things you do, hits more. Right now I prefer to say let's just put 25%. We know probably we are not optimal, but we are on the safe side.
Starting point is 00:10:18 And let's first focus on different settings which are very influential and query optimization. And if we do need to squeeze a lot from performance, like for example, we have 10,000 Postgres nodes and we need to save on memory, we need really optimal performance, then probably it would make sense to find better setting than 25%. But in my opinion, if you start with 25% for LTP workloads and then say we are going to optimize it and say maybe 40% is better, you have premature optimization because most likely you have different things to tune first.
Starting point is 00:11:01 And not tuning, but query optimization, workload optimization: you lack indexes and so on. You need partitioning to have data locality, to use the buffer pool better, more efficiently. And if you focus on those areas, maybe when you return and think about 25 or 40%,
Starting point is 00:11:18 which will probably give you 10% benefit, you think like later, later, later, and you keep 25% for longer. This is my current understanding of this topic. And again, like I would like to thank Henrietta Dombrovskaya. I always forget, I'm bad with names as usual, but I would like to thank her because it was like a cold shower for me when I presented our approach for Postgres config tuning with experiments and so on in New York,
Starting point is 00:11:48 maybe five years ago. And she said, you can optimize a lot, but then developers or ORMs write bad queries and all your optimization vanishes because queries lack indexes or are written poorly and so on. And again, my experience was like, no, no, config optimization is super important, let's do it. But then I switched completely and now we work on query optimization as well, with experiments and thin cloning and so on. The change happened there.
Starting point is 00:12:20 So I say like shared buffers optimization is probably number 10 item you want to optimize. Put 25% and go elsewhere first. Yeah. But given our topic is memory management. Well, we have other memory settings we need to consider, and they are also very interesting. But shared buffers, of course, number one in terms of Postgres configuration, because it's hard to change because restart is needed. And it's influential if you don't change it at all,
Starting point is 00:12:53 because default, original Postgres default is 128 megabytes, right? And there is also different default dictated by apt or yum packages, which probably half of gigabyte, I don't remember, maybe 256. But these days we can use these settings on Raspberry PI and also not on the latest models because latest models have like eight gigs of memory.
Starting point is 00:13:20 So this default setting of shared buffers in the Postgres config is so outdated that it can be used only on teapots. So it's teapot-Postgres style. So it should always be tuned, but the 25% rule is good enough. The one exception is a lot of people these days, their default config is now the default config chosen by their cloud provider. And this is because of the reason you said, this is one of the few settings that I see almost, well, I've seen every one I've checked has tuned this. Not always to people's taste, but yeah, this is one people do change.
Starting point is 00:13:58 And they use 25%. Not all of them. No? No. Well, I haven't checked all of them recently and started new instances, but I was talking to a customer of Heroku just recently, and they're having a real ordeal trying... Heroku is...
Starting point is 00:14:17 It's a very old solution. Yeah, but lots of people are still on it. Yes, I know. I have customers as well. And they have tuned shared buffers, they have changed shared buffers. They don't even have pg_repack. I know, I know, but they have changed it, but not to 25%, which is interesting, because that
Starting point is 00:14:35 means it's a deliberate decision not going with what the docs say. Like, I think the easiest, the least thought... Because this decision was made before it was written into docs. Maybe. So probably they still apply 25%, if not more than 8 gigs, the lesser of 8 gigabytes and 25%. And this was also a recommendation in the old Josh Berkus
Starting point is 00:15:02 PostgreSQL Conf annotated. Annotated PostgreSQL Conf. It was, yes, and it's unfortunately quite like lagging for new versions, but 25% is current rule of thumb. That's it.
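For illustration, a minimal sketch of that 25% rule of thumb, assuming a hypothetical server with 128 GB of RAM (the numbers are examples, not a recommendation for any specific setup):

    SHOW shared_buffers;                        -- the built-in default is only 128MB
    ALTER SYSTEM SET shared_buffers = '32GB';   -- roughly 25% of the hypothetical 128 GB
    -- shared_buffers is allocated at startup, so a full restart is required;
    -- a configuration reload is not enough for this setting.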
Starting point is 00:15:16 Let's keep it. It should be number 10. Let's agree to put 25% and forget about it for a while. Cool. So we have 25% went to shared buffers. And if we leave in vacuum, nothing else,
Starting point is 00:15:29 and processes are not using memory, 75% will go to the page cache. And we have effective_cache_size, which is an additional but very important setting, which tells the Postgres planner how big both of our caches are together. Not together.
Starting point is 00:15:50 It's interesting. Normally people recommend setting effective cache size as the total memory minus some area. No, not minus. It includes shared buffers. For example, if we indeed decided to exceed 50%, and, as I remember it was discussed several years ago in hackers, other processes, non-Postgres processes, use, say, 10%, so we have 30% left for page cache, it means that, okay, 60% shared buffers, 30% page cache, effective cache size should be either the sum of them.
Starting point is 00:16:43 The sum of them is wrong, right? Because we have duplicated caching. So either the maximum or the sum. It's interesting. In this case, yeah, I already forgot. But in general, like, if we decide to start with 25% for the buffer pool, I would just put 75% for effective_cache_size as a starting point. But what I would like to say here: we can put 200, 500, 1000% of our RAM available.
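As a rough sketch of that 75% starting point, again on a hypothetical 128 GB server; effective_cache_size is only a planner hint, so a reload is enough:

    -- hint to the planner: memory likely available for caching
    -- (shared buffers plus OS page cache), ~75% of RAM here
    ALTER SYSTEM SET effective_cache_size = '96GB';
    SELECT pg_reload_conf();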
Starting point is 00:17:16 And since it's not an allocation, Postgres will work. And this helps us on lower environments where we don't have enough memory, but we want the planner to behave similar to production, actually exactly like production: we use planner settings and work_mem the same as on production, and effective_cache_size is one of the most important planner settings, so we can fool Postgres with this. And the planner doesn't look at real available RAM, real available CPU, disk, it doesn't know anything about it, and doesn't even look at shared buffers, it just looks at effective_cache_size. And it's interesting, I still remember your tweet about that, I think you did a poll and I got it wrong, I thought it would take shared
Starting point is 00:17:57 buffers into account in the planner but i was wrong so yeah that's that was a few years ago now but yeah good one i feel like we should move on. From my experience, at least, work mem is hugely important here. Is that the next thing you'd want to talk about? Or where would you go next? Yeah, so brief summary. Shared buffers is more like physical setting,
Starting point is 00:18:19 which will affect timing of query execution. Of course, if data is there, good. If it's not there, bad, so we need to inspect the page cache again, and for Postgres it will already count as a buffer read, reading of a buffer, but then it might be disk-related, or maybe not, no disk at all, so Postgres doesn't know. That's why we need pg_stat_kcache to understand. But effective_cache_size is for the planner, how the planner chooses: the planner thinks, okay, we expect that many gigabytes of RAM involved in caching, so it's more optimistic that, like, it will be in memory, or it expects some data reads
Starting point is 00:19:02 happening, so more heavy operations, I/O operations. So this is a very high-level thing, like, logical, only the planner decides what to do. But shared buffers, again, it will affect timing, but not the behavior of the buffer pool. We should mention our episode about checkpointer tuning, because when some buffers are changed, and it can happen during SELECT as well if hint bits are updated on the fly, then the buffer... see, a buffer is, like, in virtual memory, it's a block read from disk, a page, yeah, it's a page of 8 kilobytes. We have huge pages as well, right, so it's a different topic,
Starting point is 00:19:47 but a buffer is eight kibibytes, and if we changed at least something inside this page, it's considered dirty, and then the checkpointer, or the background writer first, then the checkpointer, and sometimes the backend itself, they need to clean it, so, like, synchronize with disk, so there is less stress next time we need to perform recovery. And we discussed this topic, right? So we need checkpointer tuning. But what we didn't discuss, probably, is that the page cache also needs tuning sometimes. Because if we accumulate a lot of dirty pages in the page cache, it's not directly related to Postgres,
Starting point is 00:20:30 but it's related to overall performance of our database. If we allow the operating system to accumulate a lot of dirty pages in the underlying page cache, we might have performance issues as well, because we want this process to be smoother, and probably you want to tune the behavior of the page cache, adjusting with sysctl, adjusting kernel settings, so the page cache doesn't grow a lot and the operating system flushes dirty buffers, with pdflush and so on, like, more often. But it's slightly outside of Postgres, but it sometimes affects how we work with memory.
Starting point is 00:21:06 So again, Postgres relies on page cache a lot, so this topic should be studied as well. Okay, now what's left? WorkMem, maintenance WorkMem, these two guys, first of all, right? Yeah, and when you mentioned the previous episode, I thought you were actually going to mention the buffers episode, and I think there's also something to be said there before we move on, which is when people are doing estimations for their cache hit rate, for example, or if you're looking at a single query plan with buffers, which hopefully you are, you'll see hits and reads.
Starting point is 00:21:41 But the thing you mentioned here that's relevant is that those reads aren't necessarily from disk, they could be from the OS page cache, and we can't tell. And pg_stat_statements cannot tell, but pg_stat_kcache can, and unfortunately we don't have anything for explain. It would be
Starting point is 00:22:00 great to have some extension which would extend the explain analyze buffers, so we would see real disk I/O. But real disk I/O can be measured if you look at /proc/<pid>/io for our process ID. You just cat /proc/<pid>/io and you see reads and writes, and this is real disk. So you can see it if you have access to Linux. Of course, not on RDS. So sometimes you can cheat a little bit
Starting point is 00:22:29 and check what's happening before you run EXPLAIN (ANALYZE, BUFFERS) and after it. Nice. Right. work_mem. So work_mem, we discussed it briefly recently as well. So the difficulty with work_mem: nobody knows how much we will need in advance. And each query can consume multiple times work_mem, but sometimes only a fraction of work_mem.
Starting point is 00:22:53 So WorkMem, this defines the amount of memory each backend needs to process query for operations like hash join, ordering, grouping, and so on. And if we have multiple of such operations for one query, we might use multiple times of workmem. And if we reach workmem and we need more, then temporary files will be created and query execution will slow down significantly. And that's because Postgres is no longer doing, for example, that sort in memory.
Starting point is 00:23:32 It's doing that sort on disk. An external sort. Yeah, exactly. And it's not even just per operation, and we did an episode recently on parallelism, it can be per backend, per operation. So it can multiply out quickly. Yeah, also there is the hash_mem_multiplier setting, so it's very tricky to understand how much memory you will expect, and usually our approach for tuning is very reactive. Of course,
Starting point is 00:24:00 the default for work_mem, again, is for teapots only, usually. But it's very related to max_connections and how many CPUs and how many connections for backends you expect. Starting from Postgres 14, we might allow a lot of idle connections. And you mentioned Andres Freund, a couple of tweets and a couple of blog posts related to memory consumption by backends and also max_connections and whether it is okay to have a lot of idle connections. Starting from Postgres 14, we have an optimization. It turned out that memory management was not the number one problem. The number one problem was how work with snapshots is organized. So these are very good in-depth blog posts, these two. But it's very tricky to understand how much memory is used, first of all, by a process, even if you have all accesses and you see like ps, top,
Starting point is 00:24:53 everything you you have sar all of things but it's hard to understand in linux how much memory is really consumed by each process but, we cannot limit it reliably. We cannot say, okay, 100 megabytes for each backend, that's it. Unfortunately, because if we say workmem 100 megabytes, still backend is allowed to use multiple times because of this behavior. Workmem defines the limit for each operation, not for backend.
Starting point is 00:25:23 And also, we have parallel workers. So together this becomes quite a tricky topic. And our approach usually is very reactive and iterative. So we first start from some safe work_mem setting. The safe approach is like, okay, we have max_connections, say 300 or 500 for our 96- or 244-core server, a big one, and then we say, okay, we should be on the safe side. We know 25% is shared buffers, we know some backends will have maintenance_work_mem to be used for index creation or autovacuum. And we allocated like four gigs or two gigs.
Starting point is 00:26:09 We can discuss this also. We should discuss this. And we also have maximum 10 workers of like, okay, 40 gigs. We have like maybe half of terabyte, 700 gigs. So 25% there, 40 gigs for maintenance workers, like index creation, index maintenance, because it's needed all the time. And auto vacuum, like we have 10 workers auto vacuum, one or two workers rebuilding indexes. Okay, we know. Math is good here. So what's left? We have ignoring page cache. We have like 300 gigs, for example. For a while we ignore
Starting point is 00:26:49 page cache and we say, okay, we have 300 max connections, 300 gigs. It means like we can give up to one gigabyte to each backend, right? Okay, that's good. So let's have some multiplier, two or three, because we know some queries can have multiple operations like sort or hash join and so on. In this case, we say, okay, our quite safe setting is 100 megabytes times two, three, it will be 200, 300 megabytes. We don't approach, we don't reach one gigabyte per backend. It means we can use 100 megabytes for work_mem.
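A sketch of the same back-of-the-envelope math, with the hypothetical numbers from this example (about 300 GB left over, max_connections = 300, and a 2-3x multiplier for queries with several sort or hash operations):

    -- ~300 GB / 300 connections ≈ 1 GB per backend at full saturation;
    -- divided by a 2-3x per-query multiplier, ~100 MB is a cautious start
    ALTER SYSTEM SET work_mem = '100MB';
    SELECT pg_reload_conf();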
Starting point is 00:27:26 Let's start with it. And we start with it. And we see, first we see most backends don't reach it even once. But some backends see that they need more. And this is imbalanced. And we also see page cache is quite huge. What's left is mostly page cache. We understand, okay, our average work_mem consumption is quite low, but sometimes we need more. And so
Starting point is 00:27:53 we start to raise it. We know on average we don't reach it at all, but some backends need it. We can have two options here, either to set locally in sessions, if we control application code, we can say set workmem right for particular query because we know it's very heavy and we don't want this query to use temporary files. Or we can just raise workmem globally. We're observing our workload for quite a long time. We see it's quite safe. And it's not uncommon to see that people are having settings not safe already. If you take max connections, take workmem, have multiplier two or three, and consider shared buffers and maintenance workmem, you see that we don't have so much memory. But we know we don't reach the limits. And this is what I was alluding to at the beginning of the episode.
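Those two options could look roughly like this, with a hypothetical heavy reporting query; SET LOCAL keeps the change scoped to one transaction:

    -- option 1: raise it only around a known-heavy query
    BEGIN;
    SET LOCAL work_mem = '1GB';
    -- ... run the heavy reporting query here ...
    COMMIT;

    -- option 2: raise it globally, in careful steps, after observing the workload
    ALTER SYSTEM SET work_mem = '200MB';
    SELECT pg_reload_conf();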
Starting point is 00:28:46 You say it's not safe, but you mean it's not safe in theory, not safe in practice. For particular database. Yeah, exactly, with the current workload. So it's an interesting topic. I think what you said there is actually a really good point that sometimes gets missed is it can be set at different levels. And if you've got one
Starting point is 00:29:05 reporting query that runs once a month, for example, you could set a higher number for that, a gigabyte, five gigabytes, especially if it's just a single backend. Yeah, if you know it's running at a time that's relatively quiet, for example, and it's not going to be like contending with lots of other things. So lots of things are possible here. Smaller servers than the ones Nikolay is talking about are available. But even for very small services, even for extremely small instances, I very rarely see one where a setting less than 16 megabytes, or four times the default, is ever sensible as far as I've seen. And quite often even these small queries, even these transactional
Starting point is 00:29:46 workloads do spill over for me like if you you see if you start logging temporary files which I think is very good advice although I saw in one of the one of the AWS docs that you shared with me that I'll link up that they advise only turning it on temporarily so I was going to ask you if that was the case for you as well. But logging temporary files is extremely useful. And if you see lots of ones being logged in the 4 to 10 megabyte range, it's not uncommon for transactional workloads. You've probably not changed the setting and it's well worth looking at. And we have aggregated statistics for temporary files,
Starting point is 00:30:22 the size and number of files, in pg_stat_database, for each database. This is a good thing to monitor as well, and have alerts and so on. But as for logging, I prefer logging all temporary file occurrences on a tuned server. If it's not tuned and each query execution involves it, it will be a disaster. But normally we already raised shared buffers, raised work_mem, and since we raised work_mem, only particular queries are experiencing temporary files. So it's okay to log them all. So I mean to set zero as the threshold of size to be logged. But I would say I see issues with work_mem way more often than I see issues with shared buffers, probably because of my bias, but also because cloud providers don't tend to tune it.
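A sketch of both ideas for spotting those work_mem problems: log every temporary file (threshold 0) and watch the aggregated counters in pg_stat_database:

    ALTER SYSTEM SET log_temp_files = 0;   -- threshold in kB; 0 = log every temp file
    SELECT pg_reload_conf();

    -- per-database temporary file statistics, useful for monitoring and alerts
    SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_bytes
    FROM pg_stat_database
    ORDER BY temp_bytes DESC;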
Starting point is 00:31:09 Because it's this scary setting. It's dangerous. They don't even use that formula you were talking about, that formula that's shared in lots of places. I particularly like an EDB guide by Vik Fearing. I'll share it in the links as well. But that formula, they could apply it, right? They could set a sensible setting. Annotated Postgres Conf also has it.
Starting point is 00:31:26 This is, yeah. But my experience is they don't. They start at 4 megabytes still. Even in, well, so maybe I'm wrong, maybe I'm a bit out of date, but a lot of them that I was checking didn't change. And you see temporary files in plans a lot, right? And first advice, raise
Starting point is 00:31:42 work_mem. Because I often see 4 megabytes. First advice, check your indexes. Check your indexes. Well... Well, first advice, tune the query. Like, if you're doing a massive sort and, like, you know, like... In this case, second advice, apply partitioning. Well, first advice, do less. Yeah. First advice is always buffers. Because if you focus on buffers,
Starting point is 00:32:06 you're like databases, all database performance optimization is about to do less IO. Right. And that's the key for good latencies, timings and so on. Less IO. This is, well,
Starting point is 00:32:21 we're possibly talking about the one exception to that in today's topic, which is memory. It's not reporting buffers. If we're doing a sort in memory, is that I/O? I think no. Your I/O can be internal memory or external memory. Is the page cache considered as external? But let's forget about it.
Starting point is 00:32:41 It's not reported as buffers, crucially, if it's done in memory. So that's an interesting... But maybe that's an implementation detail. Right. Regardless, if you do one thing as a result of this episode, please check what your current work_mem setting is. And if it's 4 megabytes, start checking your temporary...
Starting point is 00:33:03 Yeah, probably your cloud provider already tuned shared buffers, not to an optimal state, but a good enough state. But work_mem, they probably didn't touch it. It's still 4 megabytes. And for modern workloads and data volumes, it's a tiny size, a teapot size, and so on. But if you raise it a lot and max_connections is huge, a lot of
Starting point is 00:33:25 idle connections, probably you will be out of memory. So it should be also done carefully, in multiple steps, observing, and so on. But again, like I agree with you, query tuning and partitioning is also important here because it gives data locality. And you mentioned when our database can be put fully into shared buffers. That's great. But if we partitioned, we can say, okay, this is our total database size, but this is our working set size. Only fresh partitions we work intensively with.
Starting point is 00:33:57 And the buffer pool and the page cache both have quite good, simple mechanisms to understand what's needed, what's not. Usage counters and so on. And it can be studied using a good extension called pg_buffercache. You can study what's inside your buffer pool right now. It doesn't hurt to run it. It will take some time because it will scan the whole buffer pool. If it's huge, it will take some time.
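A minimal example of that kind of inspection, assuming the pg_buffercache extension is installed; it aggregates the buffer pool contents per relation in the current database:

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    SELECT c.relname,
           count(*) AS buffers,
           pg_size_pretty(count(*) * 8192) AS cached
    FROM pg_buffercache b
    JOIN pg_class c
      ON b.relfilenode = pg_relation_filenode(c.oid)
     AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                               WHERE datname = current_database()))
    GROUP BY c.relname
    ORDER BY buffers DESC
    LIMIT 20;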
Starting point is 00:34:20 But it will give you statistics like: this relation, this index is present with these, like, blocks, and you can aggregate it. And the documentation, it's an official contrib module shipped with Postgres, and the documentation has basic examples, and you can understand what's inside the buffer pool right now, and this is useful. So if you apply partitioning, these things might start improving, because instead of mixing all the new data in a single table and indexes, it will start improving a lot. It will be more efficient. Yeah. In fact, Timescale gives this recommendation in their docs, I think. So they actually talk about setting the chunk interval so that at least your latest chunk, which is their word for partition, fits in shared buffers.
Starting point is 00:35:11 So that's another way of thinking about it. Like you could be choosing your partition size or frequency based on what you have available memory-wise. So yes, cool flip way of thinking about it. Yeah. You said you wanted to talk about something. I've forgotten what it was, but we needed to make sure we covered one more thing. Yes.
Starting point is 00:35:33 Oh, yeah. Let's cover that. So, autovacuum workmem is set to minus zero, meaning that it's inherited from maintenance workmem. Minus one. By default. Minus one. Yeah.
Starting point is 00:35:44 autovacuum_work_mem is minus one by default. In most cases, I see people don't touch it. So it means that maintenance_work_mem is both about operations like index creation or reindexing, and also autovacuuming. And usually, I don't remember the default, it's also very low. Usually you want to raise it, at least like one gigabyte if you have a lot of memory, or half of a gigabyte.
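As a sketch, with hypothetical values in the range mentioned here; remember that with autovacuum_work_mem left at -1, every autovacuum worker may use up to maintenance_work_mem, so the worst case multiplies by autovacuum_max_workers:

    ALTER SYSTEM SET maintenance_work_mem = '1GB';     -- index creation and similar maintenance
    ALTER SYSTEM SET autovacuum_work_mem  = '512MB';   -- default -1 means "inherit maintenance_work_mem"
    SELECT pg_reload_conf();
    -- rough autovacuum budget ≈ autovacuum_work_mem * autovacuum_max_workers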
Starting point is 00:36:08 But interesting, I made a couple of years ago, I made new experiments trying to understand how beneficial it is to go to higher values. And for index creation particularly, I didn't see big benefits after half of or one gigabyte difference between one gigabyte and five gigabytes in my experiments were low but don't trust me blindly it's good to experiment always right so you create your index with one setting then you restart server to flush the set of shared buffers and everything and also don't forget to flush page cache you say echo
Starting point is 00:36:44 three to some page cache you can find it in internet how to flush page cache. You say echo free to some page cache. You can find it in internet how to flush page cache in Linux. It's easy. It's just single line. And then you start experiment from scratch again with higher value of maintenance workmem. And you compare duration, for example.
Starting point is 00:36:58 And you can compare how much IO happened, for example, and so on. It's a good experiment to run in simple exercise, and you will see is it beneficial to raise it for your particular Postgres version of your database and your indexes and so on. For autovacuum, I think it makes more sense to raise it, but again, you can experiment in vacuum,
Starting point is 00:37:20 but you need to have the same number of dead tuples for each experiment run. So it's interesting how to build this experiment. If our listeners have different experience, please comment on Twitter, YouTube, anywhere. I would like to hear if some of you found it beneficial for index maintenance to raise maintenance_work_mem much more than one gigabyte, for example. But again, since autovacuum_work_mem is set to minus one, we need to take maintenance_work_mem, because it's used by autovacuum workers, and multiply by the number of autovacuum workers. And my recommendation is
Starting point is 00:37:58 three is not enough. Usually, if you have a lot of cores, raise it to like 25% of cores available, for example, and then you can understand how much memory can be consumed by autovacuum workers as well, right? So it can be a lot. The limit is per worker, right, the limit we talked about earlier, each worker can consume that? Exactly, but only once, right, not multiple times. Only once in this case, as I understand. So my preference, especially if we have a lot of partitions, my preference is more workers and maybe not so big maintenance_work_mem. And for index maintenance, maybe just a single backend recreating indexes, maybe sometimes two, maybe three, but not many. Nice, I feel like we should wrap up
Starting point is 00:38:50 soon, but there's a few links that I found really useful, one in particular we haven't managed to mention yet so maybe we should do a quick fire round of good resources First of all, both internal books or online books, one is PDF, both internal books or online books,
Starting point is 00:39:05 one is a PDF, one is just a website, are very interesting, from Suzuki and from Rogov. And they cover many aspects quite deeply. So if you want to go inside, I think in this case almost everyone should go inside, and even backend developers could benefit from understanding how Postgres works: process model, no threads, shared buffers always used to work with any relations, and so on,
Starting point is 00:39:32 work_mem, maintenance_work_mem, this is the number one thing. Yeah, before you move on, I wanted to give a special shout out to the Rogov book. I was expecting, given it's translated, for it not to be as clear and well-written as it is, and I looked up just before this who translated it, and that's Lyudmila Mantrova. So shout out to them, because it's fantastically well-written, even the English version. So I'll share a link to that. Yeah, I use this book as a reference, and I actually honestly don't realize which language I use because I speak both and I found both
Starting point is 00:40:07 versions are good enough. It means that translation is really good. I agree. But before the internals we should mention documentation of course and also readme file in the backend I don't remember. In the source code. Yeah, about BufferPool
Starting point is 00:40:24 because it's quite good, also plain English, explaining some details of how the buffer pool is organized, and about pins and so on, like about locks. But also interesting for folks who are using newer versions of Postgres: in Postgres 14, we have the pg_backend_memory_contexts system view, which gives aggregated stats for memory usage. I never used it in production somehow, because I still have mostly older versions. Well, somewhere already Postgres 14, but I need to have some problems, incidents, to start using it. It's an interesting thing to keep in mind that now we have a good observability tool.
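For reference, a minimal query against that view (Postgres 14+); it only shows the memory contexts of the current backend:

    SELECT name, parent, level,
           pg_size_pretty(total_bytes) AS total,
           pg_size_pretty(used_bytes)  AS used
    FROM pg_backend_memory_contexts
    ORDER BY total_bytes DESC
    LIMIT 10;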
Starting point is 00:41:10 And in Postgres 16, pg_stat_io is added. Coming soon. Yeah. Right. Also aggregated statistics for how many operations happened, timing and so on. Also very interesting. So, and pg_buffercache, these are things that are quite deep, like this is enough to go quite deep. And also pg_prewarm, by the way, if you
Starting point is 00:41:31 want to restore the state of the buffer pool, you can check pg_buffercache plus pg_prewarm, and there's also automatic pre-warming, starting a couple of years ago, some versions ago; there is a recipe. So you have a restart, you want to restore the state of the buffer pool, so query performance is good again. It's also possible. I never used it, honestly, but I know this is possible. So knowing what is possible also sometimes is more important than knowing some details, right?
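A minimal pg_prewarm sketch, with a hypothetical table name; it loads a relation into shared buffers so the first queries after a restart start warm (automatic pre-warming is enabled by adding pg_prewarm to shared_preload_libraries):

    CREATE EXTENSION IF NOT EXISTS pg_prewarm;

    -- hypothetical relation names; prewarm the table and one of its indexes
    SELECT pg_prewarm('my_table');
    SELECT pg_prewarm('my_table_pkey');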
Starting point is 00:42:02 I've heard people using it for experimentation, but not for other things. Yeah, this is the opposite approach. Instead of starting cold, you probably want to start from the same warmed-up state. Makes sense. And also worth mentioning that if you are on ZFS and you use, for example, Database Lab, you can use ZFS on production. And in this case, you have branched Postgres, you can use smaller buffer pool.
Starting point is 00:42:28 Of course, in this case, the cache is not so smart as the buffer pool, but it's good to have multiple Postgres versions, like branched Postgres versions, utilizing single underlying cache instead of regular page cache.
Starting point is 00:42:44 It's called Arc, ZFS Arc. And all databases are writable, running on single host, but they use single cache, Arc, A-R-C from ZFS. And it's great to see that shared blocks are cached there. So you start new branch and you're ready quite fast. It's not in the buffer pool. And we need to use smaller buffer pool if you run multiple Postgres on single machine. But it's already quite fast
Starting point is 00:43:11 because it's cached in this underlying arc. So it's also interesting approach, quite underappreciated, I think. But some folks run ZFS on production and it's interesting. I think that's enough for today. What do you reckon? Right. Yeah, that's it for today. What do you reckon? Right.
Starting point is 00:43:26 Yeah, that's it. And I like your idea, raise work mem as a single recommendation. Consider raising it, but be careful. Yeah, exactly. In fact, I've got a blog post on that. I can't believe I forgot to say, but I'll share that as well. Good. Right.
Starting point is 00:43:42 Thanks so much, Nicolai. And I won't see you next week because even though last week we were talking about such a good streak of weeks in a row i'll be missing the next two yeah you it's called slacker right yeah absolutely but we will have episodes no matter what wonderful well i look forward to coming back from a holiday and seeing what they're on and being a listener. Surprise. Good. Yeah.
Starting point is 00:44:08 Exciting. Thank you. Take care. And yeah, catch you soon. Bye. Bye.
