Postgres FM - Parallelism

Episode Date: May 26, 2023

Nikolay and Michael discuss parallelism — both parallel query execution and other parallel operations. Here are links to a few things we mentioned:

- Parallel query (docs)
- Parallelism in PostgreSQL 11 (talk by Thomas Munro)
- Parallelism in PostgreSQL 15 (talk by Thomas Munro)
- Towards Millions TPS (blog post by Alexander Korotkov)
- Memory resource consumption (docs)
- Our episode about index maintenance
- Our episode about partitioning
- Patch to make postgres_fdw parallel-safe (by Swarm64)
- PostgreSQL Parallelism Do's and Don'ts (talk by Sebastian Dressler)
- Increasing max_parallel_workers_per_gather (blog post by Michael)

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc! If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:
- Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres.ai. Hey Nikolay, what are we going to talk about today? About parallelism. Parallelism. Are you trying to say it multiple times at the same time? I did it sequentially. Nice. Do you know that parallelism works well with sequential operations?
Starting point is 00:00:24 This is where it shines: if something needs to be done sequentially, then parallelism can be applied. Yeah, it makes sense. I guess it's like an oxymoron, right? I see what you mean, like a dichotomy: it feels like it shouldn't be the case, but it very much is. It's not my thought, by the way, it's from Thomas Munro, who contributed a lot to the parallelism of various operations. This is one of those topics where we have had multiple requests, not in parallel, I guess. But yeah, so people are keen to know how parallel query works specifically, some pros and cons.
Starting point is 00:00:57 I think talking about parallelism overall makes sense as well, like a bit of an overview. So I'm glad you added those bits in. So yeah, looking forward to this one. Do you want to start with what it is, or a bit of history? Yeah, let's start with history. I must admit, I'm not a big expert on this topic. Of course, I know many things, but I'm far from being an expert. And I don't see the bottom in some places, if you get what I mean: in some places I see I need to dive deeper to see the bottom. Well, I think from previous conversations we've had that sometimes it's a sign that you are an expert if you realize quite how much there is that you don't
Starting point is 00:01:35 know still. But this is one of the topics where I feel my weakness, so, like, disclaimer. Sure, yeah. And I've only looked into this from... like, I've obviously done a bit of reading for this episode, but previously, looking into just performance of queries in general and having to read around things, having to write some things for it as well. So, yeah, mine is only surface level too, compared to the people that have written this. But I think a lot of people don't even necessarily realize that Postgres can run queries in parallel, that it can do certain operations. It's not just can, it does, all the time.
Starting point is 00:02:09 And usually my experience is closer to OLTP. And it makes sense, because Postgres is the best OLTP system. But some people, of course, do some heavy analytical kinds of queries. And, of course, I remember how parallelism appeared, first in 9.6, in 2016. Easy to remember: 9.6, 2016, right? But you need to remember that 9.7 became 10, so
Starting point is 00:02:40 2017. But it was not enabled by default in 2016, in 9.6. Correct. And I remember there were fears that in an OLTP context it's not a good idea to have it, because it's more for analytical queries. Let's enable it dynamically for particular queries only, because of course it has some overhead.
Starting point is 00:02:59 Yep. So for OLTP, we want only single-core operations. But eventually... right now we live with it enabled by default, and we see benefits and so on. Of course, sometimes the application produces suboptimal queries, even sometimes having sequential scans or heavy index scans. And in this case, parallelism works well, improving the speed and so on. Unlike, for example, just-in-time compilation, JIT, which we usually recommend disabling for OLTP, this we usually recommend enabling.
Starting point is 00:03:36 And it's interesting. I think it's a similar conversation, and the difference is that the defaults are better for the parallelism settings. The JIT one, basically, I don't think the cost estimations are quite right. I think maybe we'd be able to leave that one on more if it had a higher cost threshold or had some costing improvements. I think parallelism costings are a bit better, and it tends to not cause as many issues. What about planning time? Planning time is also affected, right? Yeah. Good point. I'm not sure about that; I haven't seen many parallel queries that also explode planning time. So I don't think it's as bad as things like the number of joins. So, yeah, I haven't seen it cause excessive planning time issues. I don't know if you have.
Starting point is 00:04:23 The reason I'm talking about this is because we definitely now live in a world where JIT probably should be disabled for OLTP cases, but parallelism should be enabled, sometimes tuned. And we see it works well. Every release we have improvements, and it works well. And it makes total sense to me. By the way, we needed to mention that we are not going to talk about parallel execution on multiple nodes, like, I don't know, Greenplum-style systems or all those. We talk only about single-node parallelism, inside one machine. Yeah, core PostgreSQL. This is important to distinguish.
Starting point is 00:05:00 Right, and also to frame what we have: we have it only for read operations, plus, at a different level, for some other operations like vacuum (only manual vacuum, explicit vacuum) and pg_restore and so on. But for queries, it's only for reads. We don't have it for writes still. But, for example, parallel COPY would make total sense; it's still under development. Yeah, we do have it for similar things, like CREATE TABLE AS and CREATE INDEX, CREATE MATERIALIZED VIEW, REFRESH MATERIALIZED VIEW. Oh, those are also writes. Yeah, we can consider them write queries too, but DDL queries. For DML queries, it's only for reads. Yeah. So SELECT queries. And this is, like,
Starting point is 00:05:47 since 2016: first non-default, and the next year it already became default in version 10, the former 9.7. Then it was developed better and better. Every year we have a bunch of improvements. Yeah, I think it's interesting, like, a couple of things about those first few
Starting point is 00:06:04 versions. I think it was really interesting: in 9.6, when it was released, it was very limited in scope, but still useful for people. I think it only supported sequential scans and a couple of join operations. And, as you say, it was off by default, and that was controlled by a very important setting, if we're going to be talking about this: max_parallel_workers_per_gather was set to zero. So that meant each query couldn't use any extra workers. And then in version 10, a couple of big improvements.
Starting point is 00:06:36 One was that more scan types were added. So index scans, index-only scans as well for B-tree indexes, and bitmap heap scans too; the heap part of those could be parallelized. Huge, huge improvements. And max_parallel_workers_per_gather was bumped to two, which meant we could get three processes by default on a query that supported parallelism. So on one that suited it, we could get up to three cores working on our query at the same time. Right. And let's maybe mention in this context: I remember in 2016-17 we played with some SELECTs trying to understand, is it worth having it enabled by default, even in 9.6?
Starting point is 00:07:15 And I remember examples where we saw very clearly that the overhead brings some penalty, so it's better not to enable it. But then somehow I lost track, and now we already live in a world where it's enabled everywhere, and we see it's beneficial. And if we check examples, I think it's very doable to find ones where you disable parallelism for some query and you see that it's better. In terms of... in terms of what, by the way? Time? And buffers, maybe, right? Buffers. Well, eventually time. Time is the most important for the end user. The user might care about buffers, but only those which produce the final result which goes to the user, because if it's like a terabyte to download, it's not good. But buffers here are also important, and
Starting point is 00:08:05 overhead, I think, exists in terms of both. So it's good to check both when you analyze and make decisions based on that. But let's maybe mention why parallelism is important at all, philosophically and theoretically and so on. I think I know where you're going with this, but I have some others. Like... yeah, let's do the obvious ones first. We have a lot of CPUs. This is... yeah, and a growing number. Like, in the modern world it's quite common to have a lot of CPUs available, because people often scale up their database size, and especially in the cloud, that comes with lots of CPU. The key problem is that CPU development, processor development,
Starting point is 00:08:53 went in a direction where one core cannot be improved as before. Like we had this... how is it called, the law... Moore's law, I think. Right, right, right. So obviously the makers of CPUs, Intel, AMD, decided to bet on the multiple-core approach, and we now have hundreds of cores, easily, in the cloud. It's relatively cheap, especially on some kinds of spot instances you can play with. And this means that the application needs to be adjusted, because you cannot just say: okay, now we want to benefit from having multiple cores.
Starting point is 00:09:33 Unfortunately, the application needs to be changed to benefit from it. It cannot be done transparently in most cases. Sometimes it can be done in some trivial cases, when work can be parallelized somehow. Well, again, it's a different topic, but some algorithms can be parallelized. I remember even some languages exist for this, like extensions to C and so on.
Starting point is 00:09:56 It's from my university, 20 years ago; I already forgot everything. But obviously, to make Postgres benefit from multiple cores, it needed to be heavily reworked. By the way, I remember also in 9.6 a lot was done to reduce contention in shared buffers. It's also important. So a lot of areas needed to be improved to benefit from multiple cores. And of course, if you have a few users but many more cores (you have only 10 sessions, but you have 96 cores, and three autovacuum workers by default), it sounds like most
Starting point is 00:10:34 of the cores will be underutilized, and it's not good. This is the key. The development of modern hardware dictates how to build software. Yeah, there's an opportunity there, right? Like, especially for things that are very easy to parallelize. By giving it access to three times the workers, we're not going to get a three times improvement, because, as you said, there is some overhead, but we could get close.
Starting point is 00:11:01 Like, it's not uncommon, especially once you bump that number up, to get close to the number of workers times the benefit. So there's huge potential. Not as much as in indexing, you know: whenever you see a blog post that says how we sped up our database 1,000x in one easy step, that's never going to be parallelism. Unless you switch from one CPU to a thousand CPUs and scale linearly, which is very... Theoretically, it can be imagined already. But an interesting example. If you dump pgbench,
Starting point is 00:11:38 and you have pgbench tables... You do some benchmark, you create a one-terabyte database, and obviously all of that terabyte will go to the pgbench_accounts table; the other tables are initially empty. And then you say: okay, I'm going to test the -j option of pg_dump, because pg_dump allows you to benefit from multiple cores. And you say: okay, I have 96 cores, so I have expectations for my pg_dump.
Starting point is 00:12:06 Okay, not 100x, but at least 20x, and I will already be almost happy. And you start dumping and see no benefits at all. Why? Because -j can only help when you have many tables. If all of your database is a single table and the other tables are empty, pg_dump cannot parallelize it. So is that... I thought, was that a practical limitation
Starting point is 00:12:32 because the work hasn't been done, or is it a theoretical limitation? Is there some reason why it couldn't... It's just how parallelization of pg_dump is implemented. Sure, okay, great. At the database level, it cannot parallelize a single table unless it's partitioned. Yeah, it makes sense.
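For illustration, a minimal sketch of the kind of partitioned layout that lets per-table parallelism, like pg_dump -j, spread work across cores; each partition is a separate physical table (column list and partition bounds are hypothetical):

    CREATE TABLE pgbench_accounts (
        aid      integer NOT NULL,
        bid      integer,
        abalance integer,
        filler   character(84)
    ) PARTITION BY RANGE (aid);

    CREATE TABLE pgbench_accounts_1 PARTITION OF pgbench_accounts
        FOR VALUES FROM (1) TO (10000001);
    CREATE TABLE pgbench_accounts_2 PARTITION OF pgbench_accounts
        FOR VALUES FROM (10000001) TO (20000001);
    -- ... more partitions; pg_dump -j can now dump them in parallel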
Starting point is 00:12:47 But it's not really one table anymore. But yeah, it's a nice link back to that. And a couple of years ago, a special option for pgbench was added. So now, when you create a pgbench database with -i, to initialize it, you can say:
Starting point is 00:13:00 I want to have a partitioned pgbench_accounts. And then you can speed up your dump. And this is an interesting example showing that there are many places where parallelization can be tricky. While we're talking about the parallel query side of it, I think it obviously is
Starting point is 00:13:17 worth giving a shout-out to the people that did this. It's a huge amount of work to get it in at all and make it safe, make it performant. Then it did improve a lot in versions 10 and 11, and it kept improving: there were smaller improvements and performance improvements in 13, 14, even 15 recently. So it's something that went in and has continued to get better, which is normally a sign that people are using it and want it to be better. I'd say another
Starting point is 00:13:45 reason it got added (it's the same reason, but a second-order effect) was that this was around the time, I think, that people were starting to think about alternatives to Oracle and SQL Server. So I think a lot of people migrating were probably struggling. This is a theory based on who was working on the feature and which consultancies they were working for, but I don't think this was a passion project for people who were excited about working on this. Yeah, there's demand, there's real need. Exactly, right. And let me finish with the reasons dictated by hardware. It's not only about CPU. I see in many articles people mention CPU:
Starting point is 00:14:26 like, we have many cores, let's do it. But it's not all... Ah, and of course, very important: Postgres is still process-based, not thread-based. This is how Postgres works. So for parallelism, it needs to run multiple processes and then find a way to communicate between them. But I want to highlight, to emphasize, a very important point.
Starting point is 00:14:51 It's not only about CPUs (of course it's about CPUs and multiple cores and so on), it's also about disks. This is often overlooked. When people explain why we need parallelism, they forget to mention that now we are on SSDs, and to get the maximum out of an SSD, you need multiple threads, I mean multiple processes. You cannot get the full speed of an SSD if you read in a single thread; you need like 16 threads. To saturate disk I/O, you need multiple. For example, fio, a very good tool to benchmark disks:
Starting point is 00:15:32 if you use a single-threaded benchmark, you will never reach saturation. You need like 8, 16 threads. And this means that to go full speed, especially if you have heavy queries, you definitely need multiple threads, or multiple processes and communication between processes. So disks, in my opinion, are very important. But couldn't... so here's a question. Couldn't that be an argument for having more connections, like allowing more users, for example? Rather than... like, it's not necessarily
Starting point is 00:16:06 a... It's both. Okay, yeah. Of course, if you have a lot of small queries in parallel, that's a different area. This is what I mentioned about 9.6: there was a big improvement to reduce contention in how work with the buffer pool is organized. I remember great work in a blog post from Alexander Korotkov,
Starting point is 00:16:28 towards 1 million TPS on a single node. It was interesting that it was done around the same time that MySQL did it as well. We've mentioned it a few times, yeah. Right, it's interesting. But it's not related to the processing of a single query; it's related to multi-user, multi-session processing. But parallelism for a single session is needed
Starting point is 00:16:56 when, again, you have many cores and you have much fewer (and this is actually normal), much fewer active sessions right now. If you have more active sessions than cores, it's not a good position already. I mean, usually we have some room. If we don't have this room, if we have 96 cores and more than 96 active sessions, I suspect we're saturated already,
Starting point is 00:17:20 on CPU, most likely CPU. It's not a good position. But usually we have some room, so why not use some of that room to speed up one query, and reduce contention as well? It will be executed faster, and we will have the average number of active sessions going slightly down. It's good. Yeah, I was going to ask, is this because... So there's a good note in the docs. I think actually the Postgres docs on parallelism are great. They're in far more detail than I was expecting when I first looked it up.
Starting point is 00:17:51 I think they really do explain it well. A note that they make is that this isn't just about slow queries, and it isn't just about querying lots of data. It's querying lots of data and returning few of those rows. It's not as likely to make a big difference to your queries if you're having to return all the rows anyway, because the overhead of sending that data over the network is likely to dominate the time. But if we're doing aggregations... It depends. Well, sure, sure. But the big benefit here is in kind of aggregate
Starting point is 00:18:21 queries, or things that are heavy queries that are returning some summary of that information, or only returning a small... Which normally is the only justification for doing a heavy query, right? If you're doing a heavy query and then not returning most of it, it's bad, right? That's why I think buffers-first analysis of a single query matters here at all. You need to pay a lot of attention to buffers
Starting point is 00:18:44 and see what's inside: how many bytes, how many pages were involved in query processing, and compare parallel, non-parallel, and tuned, with this in mind. That's a good point. You return just one row, like an average, an aggregate, but inside it you needed to process a lot of buffers, and this explains why it's so heavy. So I was a bit surprised when you said... like, I get the impression you don't turn off parallelism for some of your... like, let's say you've got a lot of OLTP. But would you, if you had the choice, do you think?
Starting point is 00:19:11 It depends. I would probably do some benchmarks. I would analyze the content of pg_stat_statements, create some... I call it, how do I call it, a simulated workload or something; I already forgot what I call it, sorry, the word is out of my mind. So I would create some test set and just run multiple iterations, checking, probably even just in a single thread, as usual. I do it sometimes in shared environments where I don't care about time a lot; I just see how much data, how many buffers, were involved in processing, without parallelism and with parallelism. It will give me an idea. Of course, it's not directly converted to time, but it will influence the resulting time the most. And here, of course, with parallelism, we can understand, like, we're reading a thousand buffers
Starting point is 00:20:10 using a single process versus three processes, or four workers, or more. It's different, right? So we need to understand that timing will be affected by parallelism. But still, it's interesting to see the overhead and so on. So buffers analysis is important, right? Yes, absolutely. In fact, there are a few tips. So when we're talking about this, we're probably talking about looking at explain plans.
Starting point is 00:20:35 So we get information about parallelism through explain plans. EXPLAIN ANALYZE with BUFFERS, right? Yes, but initially just EXPLAIN. You can see whether the planner is going to use it. Exactly. Not just whether it's going to use it, but how many workers it will request as well. So you can get some information about how much parallelism. It doesn't have to max out the option; it can choose anything between zero and, well, initially, max_parallel_workers_per_gather,
Starting point is 00:21:07 for each Gather or Gather Merge. So yeah, there's that. But then additionally, I wanted to say that VERBOSE is quite important here as well. If you want to see some of the per-process statistics, VERBOSE is quite a useful thing. Yeah, good point.
Starting point is 00:21:24 So let me finish my thought, probably. Yes, sorry. Of course, if we need, for example, to deal with 1,000 buffers... Again, currently, if we forget about vacuum (we should discuss vacuum, CREATE INDEX, and dump/restore separately, which have parallelization, and which we mentioned last time; let's talk about them separately), for normal queries, DML queries, it's only about reads. So, for example, if we need to deal with 1,000 buffers originally fetched from the heap using a sequential scan, or from an index using an index scan, an index-only scan, or also a bitmap scan, we understand: okay, if we have three workers, will it be three times faster? And here it's interesting to see the overhead, and the BUFFERS option will also provide us some additional insight about the overhead of parallel versus non-parallel, and parallel with how many workers.
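As a minimal illustration of that workflow (table name hypothetical, plan output abbreviated), you can compare the same aggregate with and without parallel workers in one session:

    EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
    SELECT avg(abalance) FROM pgbench_accounts;
    -- Finalize Aggregate
    --   ->  Gather (Workers Planned: 2, Workers Launched: 2)
    --         ->  Partial Aggregate
    --               ->  Parallel Seq Scan on public.pgbench_accounts
    -- with VERBOSE, per-worker rows and buffers are shown for each worker

    SET max_parallel_workers_per_gather = 0;  -- this session only
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT avg(abalance) FROM pgbench_accounts;
    RESET max_parallel_workers_per_gather;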
Starting point is 00:22:15 And when the planner chooses a plan, it only takes two kinds of input besides the query itself. First: statistics for all tables involved. And second: planner settings, and work_mem. For planner settings, in pg_settings you have categories; there's a special category for planner settings. There are some settings related to parallelism.
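For reference, a quick way to see the parallelism-related settings (the exact list varies by Postgres version):

    SELECT name, setting, context
    FROM pg_settings
    WHERE name LIKE '%parallel%'
    ORDER BY name;
    -- expect entries like max_parallel_workers, max_parallel_workers_per_gather,
    -- max_parallel_maintenance_workers, min_parallel_table_scan_size,
    -- min_parallel_index_scan_size, parallel_setup_cost, parallel_tuple_cost,
    -- parallel_leader_participation, enable_parallel_append, enable_parallel_hash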
Starting point is 00:22:39 Yeah, actually work_mem is a really good, important point here. If you're going to change your parallelism settings, you also need to consider your work_mem setting at the same time. Why am I saying that? Because most people realize that work_mem can be used multiple times in a query: if you have multiple hashes or multiple sort operations, work_mem can be used multiple times. That's why it's a scary parameter to set, and why people often go relatively conservative, but they want to bump it up so that operations can happen in memory. Once you introduce further parallelism, that multiplies. So each parallel
Starting point is 00:23:16 operation that's using work_mem can use it the number of times that there are workers. So it multiplies quickly. So if you're going to bump that number up, it's something to be aware of. And it might be an argument for reducing it sometimes. Maybe you can get away with a higher work_mem setting for your entire workload if you reduce parallelism. But, like work_mem, if you do have the odd really big query that you need to run, you can set some of these settings locally, right? Oh, like on a session level? So work_mem is a super interesting topic, and we need a couple of minutes to discuss it. But before that, let me mention what I understand about which read queries can be parallelized
Starting point is 00:23:57 in Postgres. First of all, if you want to benefit the most, you need the most recent version, because each version had improvements. Yes, makes sense. Like 10, 11, 12: all of them had improvements. Second, the easiest are sequential scan, index scan, and bitmap scan. They all
Starting point is 00:24:14 can be parallel. And it's quite obvious: well, an index is more complex, but it still can be parallelized. All good, we can benefit a lot. And if we expect only a few users using our 96-core, or maybe 244, AMD EPYC Milan CPU, so many cores, of course,
Starting point is 00:24:36 if we have a beast like this running Postgres 15 and only 10 users doing some analytical queries, we definitely need to reconsider and increase max_parallel_workers_per_gather. There is one setting, max_worker_processes, which requires a restart; all the others don't require a restart. Max workers... max processes... max_worker_processes, yeah. Right.
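A sketch of how these caps relate (values hypothetical): max_worker_processes is the restart-requiring ceiling shared by parallel query workers, logical replication workers, and other background workers, while the rest can be changed with a reload or even per session:

    ALTER SYSTEM SET max_worker_processes = 32;            -- requires restart
    ALTER SYSTEM SET max_parallel_workers = 24;            -- pool for parallel query
    ALTER SYSTEM SET max_parallel_workers_per_gather = 8;  -- per Gather node
    ALTER SYSTEM SET max_parallel_maintenance_workers = 8;
    SELECT pg_reload_conf();  -- applies everything except max_worker_processes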
Starting point is 00:24:58 It's usually increased also for logical replication. But in this case, if you expect heavy queries and only a few users on some beast machine, you need to increase it a lot, and this requires a restart. Everything else can be adjusted later, dynamically, in a session, to experiment and so on. These access method nodes in a plan, which are the leaves of our EXPLAIN plan, are relatively straightforward. But then the interesting ones: aggregates, joins, right? These operations are interesting. And in 10, merge join was supported.
Starting point is 00:25:42 And later, a couple of years later (in 12, maybe, I don't remember exactly), hash join was supported as well. But they have some difficulties, right? And also, I see in the release notes of 16 (which is already in beta, right? Beta was released last week, I guess) that full outer join can also be parallelized, and so on. It's interesting. And if you look: for joins it's possible, for aggregates it's also possible, but things become much more difficult there. And now, memory. Memory, even without parallelism, is so tricky in Postgres: memory management versus what a backend developer expects. I say: okay, I have 600 gigabytes. 25% of that goes to shared buffers, and also
Starting point is 00:26:29 200 gigabytes, probably, or maybe like 300 gigabytes, I'm ready to give to normal query processing. Everything else will be the second, underlying layer of caching, which is the page cache. I want some setting. I want to say:
Starting point is 00:26:48 okay, 200 gigabytes can be used by Postgres. There is no such setting, and it's so hard to understand... Like, if we set up a new server, it's so hard to predict how much memory it will be using. Well, I think a lot of people don't even realize what happens if it goes wrong. So we're talking about out of memory: everything shuts down. That's the failure case. We can enable swapping (some people advise enabling it, or enabling a small swap, for protection), but in general, yes, it's not a good position.
Starting point is 00:27:22 Either the server goes down, or you become very, very slow, so you have a latency spike. It's not good. You definitely don't want the out-of-memory killer to be involved, or to be very slow. So you don't want to be out of memory. But how to properly plan it? Memory planning is a super difficult topic; maybe we should allocate a special episode for it. But in a few words: you take work_mem. Of course, there are maintenance_work_mem and autovacuum_work_mem, which by default is inherited from maintenance_work_mem. So you understand how many workers can build indexes (and index builds can be parallelized as well, which is good). Also, you know how many autovacuum workers you have.
Starting point is 00:28:10 You have autovacuum_max_workers, which is three by default, not enough for modern hardware at all. So you allocate some autovacuum max workers; you increase it. So you can plan this part relatively easily, it's simple. But normal query processing, DML processing, is super complicated. You take work_mem, and then you should understand that work_mem is not what a single session can consume: a session can consume multiples of work_mem, because it's how much memory is allowed
Starting point is 00:28:47 to be consumed by ordering and join operations, hashing operations, and this becomes very complicated. Okay, one query can have multiple joins, so it can consume multiples of work_mem. For safety, you need to take work_mem and multiply, for example, by two or three. But in reality, it's never allocated as a whole; usually it's some part of it. So you need to understand your workload on production to make proper decisions. And I see many servers that are kind of overprovisioned, because people know that most of the queries are relatively light, so they don't reach even one work_mem per query. So it comes to max_connections; people allow quite a lot of connections.
Starting point is 00:29:34 There are many reasons to try to reduce max_connections, especially on older Postgres. But in general, you need to take max_connections, multiply it by work_mem, and also by two or three, because there is some uncertainty about how many megabytes can be used by a single session. But of course, it's worth mentioning that the default work_mem, if I'm not mistaken, is four
Starting point is 00:29:58 megabytes, maybe. It's for a "teapot" Postgres. Yeah, it's too small. Right. So 100 megs is good. Well, yeah, for big ones. But I think if I was setting defaults, I might try and argue for a bump to 16 at least.
Starting point is 00:30:16 I feel like there's almost no... even for tiny instances. Depends on your resources. I know, it depends, I know. We should probably go back to parallelism. Right. So I wanted to draw this picture really quickly, and then we add parallelism to it.
Starting point is 00:30:31 It means that our max_connections, which for example is 300 for our 96-core server... Okay, we have work_mem at 100 megs, and for safety we multiply by two or three. It gives 100 megs times roughly 100; it's 10 gigs already, and max_connections is 300, I said. It's already too much, right?
Starting point is 00:30:55 Way too much. So probably you should reduce work_mem, but in reality you see that this worst case is never reached, real memory consumption is fine, so you try to increase work_mem to avoid temporary files. But parallelism, what does it bring to this picture? It brings more complexity, because it says: okay, now each session out of our 300 or 500 max connections can have multiple parallel workers and consume even more work_mem, right? And for safety, you need to multiply by the expected number of parallel workers on average.
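A rough back-of-the-envelope version of those multipliers (all numbers hypothetical):

    -- worst case = max_connections x work_mem x operations per query x processes per query
    SELECT 300   -- max_connections
         * 100   -- work_mem, in MB
         * 3     -- sorts/hashes that may each use work_mem (safety factor)
         * 3     -- leader plus 2 parallel workers
         AS worst_case_mb;  -- 270000 MB, ~270 GB: why you must observe the real workload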
Starting point is 00:31:31 And that is also hard to predict without observing real production behavior. So you have two multipliers now. Yeah, and that was my point. I think people are often kind of aware of the status quo, but if they change this setting, they might not think to change the other one as well. So yeah, my main point is... I do some consulting, my team does some consulting, and we say: okay, if you want to be super safe, you need to use these multipliers, but then you will end up having a lot of temporary files, unfortunately. Later, in iterations, you can adjust and try to get rid of temporary files,
Starting point is 00:32:13 going maybe into theoretically unsafe space, but in practice you see that your workload is fine. I mean, you still have a lot of free memory, which is used by the page cache; you also don't need to steal all the gigabytes from the page cache. So this is the approach. The theoretically safe approach is very wasteful in terms of memory and leads to latency overhead because of temporary files. So parallelism definitely adds complexity to this picture. Oh, there is also a multiplier for hashing. Yeah, hash_mem_multiplier. But if you're thinking about work_mem, hopefully you're thinking about that one as well, because it's in the docs in the same place and stuff. So yeah, that's two now by default. Yeah, another tip,
Starting point is 00:33:03 before we move off: on parallel query, if you have a query plan that's not parallel when you think it should be, one thing to be aware of is that user-defined functions by default are not marked parallel safe. That catches a lot of people out. So you need to explicitly mark them as parallel safe. And they should actually be parallel safe as well, right? If they are parallel safe, yeah. Also, you might be dealing with a very small table: a tiny table, tiny index.
Starting point is 00:33:35 There are two thresholds, one for tables and one for indexes; below them, a scan cannot be parallelized. You can adjust them and go lower if you think it's worth it, but it depends, again. Yeah, good point. So if you don't change any settings, you'll never see a parallel scan on a tiny table; there's a sketch of both gotchas just below.
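A minimal sketch of both (function name and values hypothetical):

    -- user-defined functions default to PARALLEL UNSAFE; mark them only if they truly are safe
    CREATE OR REPLACE FUNCTION normalize_balance(b integer)
    RETURNS numeric
    LANGUAGE sql IMMUTABLE PARALLEL SAFE
    AS $$ SELECT b / 100.0 $$;

    -- below these sizes, the planner won't consider a parallel scan
    SHOW min_parallel_table_scan_size;   -- default 8MB
    SHOW min_parallel_index_scan_size;   -- default 512kB
    SET min_parallel_table_scan_size = '128kB';  -- lower per session to experiment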
Starting point is 00:34:06 Oh, by the way, you know, CPU is one reason we discussed: CPU development dictates software development. But memory? We have some; it's so affordable now, hundreds of gigabytes. Well, it depends, of course. But if you're a startup which grows really quickly, you have some money, but you cannot be slow. So you sometimes prefer just scaling vertically and putting the whole database, or a major part of it, in memory. It's possible, I mean.
Starting point is 00:34:30 In clouds, it's one of the options we have. And it's quick. Of course, it requires some additional budget. And also, this is very closely connected to partitioning, because when I say put the whole database in memory, it might be not the whole database but the working set. And if you have tables partitioned, you probably have some old partitions which are out of memory, and you have fresh partitions which are in memory,
Starting point is 00:34:59 and you start benefiting from parallelism because you don't touch disk and so on. And work with memory, of course, can be parallelized well, unlike magnetic disks, right? Is that a good transition onto things like parallelism for CREATE INDEX and vacuum? Yeah, for DDL and others. Yeah. So for CREATE INDEX it's definitely a good thing,
Starting point is 00:35:25 because, let's recall, how do you call it, the drama with Postgres 14. It means that when you build... And again, it's related to partitioning. Everything is related to partitioning; everyone should partition tables which exceed 100 gigs. So if you build an index over a large table, then, as we discussed in our episode about index maintenance,
Starting point is 00:35:49 we definitely want to rebuild indexes from time to time, because their health tends to degrade. It degrades slowly in modern Postgres, like 14, 15, but things are much worse in older Postgres versions. So while you rebuild an index, the xmin horizon is held, autovacuum cannot delete freshly dead tuples, and bloat accumulates eventually. An attempt to improve this in Postgres 14 was reverted in 14.4.
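For context, index builds can use parallel workers too, governed by max_parallel_maintenance_workers and helped by a larger maintenance_work_mem; a minimal sketch (names and values hypothetical):

    SET maintenance_work_mem = '2GB';
    SET max_parallel_maintenance_workers = 4;
    -- CONCURRENTLY avoids long locks, but holds the xmin horizon while it runs,
    -- which is exactly why finishing faster matters
    CREATE INDEX CONCURRENTLY idx_accounts_abalance
        ON pgbench_accounts (abalance);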
Starting point is 00:36:22 So it means that you want your index build to go faster, because if you move faster, fewer write transactions from concurrent sessions produce dead tuples. If you move faster (if you have room, of course, if you have power in terms of CPUs and disks), fewer dead tuples
Starting point is 00:37:11 are generated, and less bloat is produced eventually, when autovacuum cleans those dead tuples. Makes sense, right? So you want to go faster, but of course you don't want to saturate your IOPS limits, for example, or disk throughput and so on. And of course, if you rebuild 10 indexes in parallel and those builds are also parallelized, having
Starting point is 00:37:35 more workers (so dozens of backends are working on index maintenance), you are probably also going to saturate your CPUs, which is also not good. So this needs careful planning: probably during a weekend, moving fast enough but not saturating anything. Controlling CPU and disk I/O utilization is the key here. But it's a great feature,
Starting point is 00:38:01 because you can rebuild an index faster, especially on a large table. Then what else? Vacuum. Unfortunately, for vacuum, parallelization works only for indexes, and only for explicit vacuum; autovacuum cannot do it still. Maybe I missed something, but it's only for your manually invoked vacuum runs. But if a table has multiple indexes, which is quite common, the indexes can be processed using additional workers, which is a really good thing. I hope eventually vacuum will be parallelized even for a single table. But, you know, in general, again, partitioning is important, because you can parallelize processing just by using multiple workers if you deal with multiple partitions. Even without this feature of the explicit VACUUM command, which applies only to a specific table's indexes, you don't care: we have many workers, we have a lot of partitions, we can process them using autovacuum, and that's it.
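A minimal sketch of explicit parallel vacuum (the PARALLEL option exists since Postgres 13; table name hypothetical):

    -- up to 4 workers process this table's indexes concurrently;
    -- the heap itself is still processed by a single backend
    VACUUM (PARALLEL 4, VERBOSE) pgbench_accounts;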
Starting point is 00:38:37 Partitioning plus enough workers is the way to parallelize in practice, if you have a lot of cores and a lot of resources. Yeah, and I think the defaults in these areas are still quite low for most people, like in terms of...
Starting point is 00:38:55 "Teapot" settings, for teapot users. Cool. What about ANALYZE, actually? I didn't look it up. Again, I think for a single table,
Starting point is 00:39:09 ANALYZE for a single table is not... I think, as I remember, it's not parallelized. It might be, but the key here is, again, that we usually rely on autovacuum, so we need more workers. My recommendation is to consider, like:
Starting point is 00:39:18 you have a number of cores; take something between 10% up to maybe 30-40% of them, and this is your autovacuum.
Starting point is 00:39:30 Because, if you think about it, any writes just mark xmax, and then we need to really process them. And you need to allocate a significant amount of resources to manage your writes, right? So you definitely need to increase autovacuum_max_workers; three is not enough if you have dozens of cores, or hundreds of cores.
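A sketch of that sizing advice (numbers hypothetical, following the 10-30%-of-cores rule of thumb on a big box; note that this particular setting requires a restart, and that more workers share the same cost budget unless you raise it too):

    ALTER SYSTEM SET autovacuum_max_workers = 16;        -- default of 3 is far too low here
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000; -- otherwise workers split one budget
    -- restart, then verify:
    SHOW autovacuum_max_workers;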
Starting point is 00:39:52 So for ANALYZE, either you rely on your autovacuum workers (and again, partitioning is good there, because analysis of large tables, especially if you globally increase default_statistics_target, like some people do, is not good),
Starting point is 00:40:16 but also there is an important case when you want explicit ANALYZE, especially after pg_upgrade. In this case, you run the vacuumdb CLI tool, for example, with -j. And you're going to have a problem if you have huge tables, because -j, again, applies at the upper level: it takes tables and runs ANALYZE on them.
Starting point is 00:40:50 So if you have partitioning... After pg_upgrade, for example, we have a maintenance window and we can use all cores; we can saturate resources, because we want to minimize the duration of the maintenance window. And for parallelization with vacuumdb -j, you take the number of cores and go full speed. And then you are limited (again, similar to my example with pg_dump) by the largest table size. That's why it's also important to partition, because in this case the physical tables will be much
Starting point is 00:41:05 smaller, and parallelization will work much better. So you need to shrink those tables to pieces. Nice. And we can link up our episode on partitioning as well. Right, it's very related to parallelization. Yeah. On that note, for further reading, or if anybody wants to learn more about this, where would you point them? Oh, by the way, it's not only related to parallelization, but also related to DML, to reads. If you select from many partitions, you need constraint exclusion, partition pruning; you need to deal with as few partitions as possible, and parallel workers might help you there as well, definitely: to read from different partitions and then to aggregate the result. Yeah, I haven't actually looked into that, but it's still using the same worker processes, right? Like, it's still doing the same parallel scans. What I'm trying to say is that if you need to deal with many partitions in your query, multiple workers can be helpful.
Starting point is 00:41:46 Yeah, true. It's a very related topic. Partitioning is very related to parallelism, obviously. Yeah. So, vacuum we covered, CREATE INDEX covered, ANALYZE covered. What else? I think we should leave it there. No, no, dump/restore is important. pg_dump supports -j, but pg_restore also supports -j. And there are improvements;
Starting point is 00:42:04 I don't remember. If you check the release notes, some fixes were done, so in newer versions it works better. Also, there is work with foreign data wrappers and so on; something can be parallelized, but I don't remember the details. I think that postgres_fdw, for example, doesn't support it by default, but you can. There was a fork of it by Swarm64, but I think that was experimental when they got sold, so I don't know any more about that. On that note, there is a really good talk by somebody that used to work
Starting point is 00:42:37 at Swarm64, called Sebastian, that I'll link up. It's "PostgreSQL Parallelism Do's and Don'ts". It's about an hour, I think. Oh, great. I didn't... I don't remember it. Oh, worth mentioning also: I would like to have COPY parallelized. Sometimes you need to deal with a single large table, and pg_restore parallelization applies only at a higher level, between tables. So if you have two tables, you can probably move two times faster, roughly. But if you have a single table, parallel COPY would be an interesting feature. And you can cook it yourself:
Starting point is 00:43:14 dump/restore, dealing with snapshots and so on. Knowing your IDs, you can split your table and so on; it's possible (there's a sketch of this idea below). But also there is CREATE TABLE AS, which is parallelized. And also materialized views: creation with initialization of data, and refreshing materialized views as well. Maybe something else.
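A DIY sketch of that "cook it yourself" idea (table, paths, and ranges hypothetical): several sessions each copy their own primary-key range, optionally sharing one snapshot for consistency:

    -- session 1:
    COPY (SELECT * FROM big_table WHERE id BETWEEN 1 AND 10000000)
        TO '/tmp/part1.csv' CSV;
    -- session 2, running at the same time:
    COPY (SELECT * FROM big_table WHERE id BETWEEN 10000001 AND 20000000)
        TO '/tmp/part2.csv' CSV;
    -- for one consistent view across sessions: call pg_export_snapshot() in a
    -- REPEATABLE READ transaction, and use SET TRANSACTION SNAPSHOT '...' in the others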
Starting point is 00:43:34 I'm sure there's something else; check the release notes, there are a lot of things there. Absolutely. Also, for anybody wanting a recap of the basics, I did also have a blog post for people that want to increase max_parallel_workers_per_gather, and the basics around that. I think we've covered it all here, but that's a shorter summary. There are many nuances there, but yeah, you need to experiment. Wonderful. Well, thank you, Nikolay. Thank you to the people that requested this. And I'll see you next week.
Starting point is 00:44:05 I have a strong feeling we didn't cover everything, but I hope we did a good overview. Yeah, absolutely. And I expect many more features in the future because it's obviously a huge topic and a very important one for modern hardware. Thank you so much. Thank you, all listeners.
Starting point is 00:44:21 See you next time. Bye. See you. Bye.
