Postgres FM - Multi-column indexes

Episode Date: August 22, 2025

Nik and Michael discuss multi-column indexes in Postgres — what they are, how to think about them, and some guidance around using them effectively. Here are some links to things they mentioned:

Multicolumn Indexes (docs) https://www.postgresql.org/docs/current/indexes-multicolumn.html
Our episode on Index-only scans https://postgres.fm/episodes/index-only-scans
Combining Multiple Indexes (docs) https://www.postgresql.org/docs/current/indexes-bitmap-scans.html
Enable BUFFERS with EXPLAIN ANALYZE by default https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c2a4078ebad71999dd451ae7d4358be3c9290b07
“PostgreSQL includes an implementation of the standard btree […] The only limitation is that an index entry cannot exceed approximately one-third of a page” https://www.postgresql.org/docs/current/btree.html
Our episode on HOT updates https://postgres.fm/episodes/hot-updates
Our episode on LIMIT vs Performance https://postgres.fm/episodes/limit-vs-performance

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello, hello, this is Postgres FM. My name is Nik, Postgres.ai, and as usual, my co-host is Michael, pgMustard. Hi, Michael. Hello, Nik. How are you doing? I'm doing great. How are you? I'm good, thank you. I tried to say "I thought", but actually, this is you who thought there are not enough episodes starting with the word "multi". Yeah. So what is the topic today? Tell us. Yeah, I chose multi-column indexes. And I know we've talked about them a bunch. We've done episodes on things like index-only scans, for example, but it felt to me looking back at those that we didn't really get into some of the details around multi-column indexes that I find particularly interesting. I think also
Starting point is 00:00:46 this is still the area. I see people, especially fairly experienced full-stack engineers, back-end engineers struggling with in terms of performance or not realizing how big a difference they can make or how helpful they can be in certain, especially a lot of the workloads I see are web application, just like a lot of reads, like maybe 80, 90% reads, and a lot of trying to optimize performance of those reads, especially with the odd kind of aggregate type query. And index only scans help a bunch, but only once you understand multi-column indexes. And I think there's some details in there that I'd like to discuss with you. So, yeah, thought it was about time we actually had one dedicated to this.
Starting point is 00:01:33 Yeah, great. I guess we will be going slightly beyond just multi-column indexes, right? Well, possibly, yeah, what do you mean? Well, maybe the question is, like, how we can compare them, for example, having just indexes on single column, but many of them. And sometimes I have a case where people intentionally understanding consequences chose to cover every column with index, a single column, I guess.
Starting point is 00:02:01 And, yeah, we usually say it's anti-pattern, right? Well, I mean, yeah, I guess it depends. I've not seen a workload, like, with a set of access patterns where that would make the most sense. Personally, I tend to see tables that have some columns that get queried a lot more than others and the groups of columns that get queried very often together, lots of equality searches amongst, for example,
Starting point is 00:02:28 account ID, user ID, but there may be a range search in terms of date. And that's the kind of query that ends up being so much more efficient with a multi-column index than with a series of single-column indexes. So, yeah, the kind of access patterns I see, I can't imagine living without them. But I'm sure that, well, I'm looking forward to hearing
Starting point is 00:02:50 about the case where it made most sense to... Well, it was a multi-tenant application where the freedom of queries is unpredictable, so every tenant can choose different types of filtering and ordering, and we don't know in advance, and they are not engineers, those tenants, right? So we need to give them some support at least somehow. But index write amplification, I guess, is a killer in this use case if data size grows, right?
Starting point is 00:03:27 Yeah, and given it's a multi-tenant application, are we talking single database per? Separate tables. That makes more sense. Yes. So that starts to make some sense, then at least you're not...
Starting point is 00:03:42 Yeah, there are smaller tables, of course. And this is isolation, complete isolation. So all blocks, and all tuples in these blocks — buffers or pages, however you name it — belong only to specific tenants, so all problems
Starting point is 00:03:59 are, like, localized. Yeah. Anyway, if we think about columns A and B, and we have two indexes, one on each column, when does it make sense to consider a two-column index instead?
Starting point is 00:04:17 This is a key question, I guess. Yeah, yeah, I like that a lot. And I guess we probably should go one step back and say we're probably going to be talking about B-tree indexes for most of this — or about btree_gin indexes? Yeah, well, there's a few the docs mention, and it's quite funny — the word "only" is still in there, I think from the initial commit. So it says currently only the B-tree, GiST, GIN, and BRIN index types support multi-key-column indexes, and, um, "key" is the important word there. But basically
Starting point is 00:04:52 that means key columns, not INCLUDE columns, so there is a subtle difference there. Or can a GIN index not be multi-column? So GIN can, BRIN can, GiST can, and B-tree can. So when it says "only", I think the only
Starting point is 00:05:11 in-built index type that doesn't support multi-column indexes is hash. Yeah, right. So I think "only" is a bit of a dubious word there. Is that it? Is that it? That's it. I can't think of a sixth one.
Starting point is 00:05:29 Okay. Or maybe SP-GiST, but, like, I think that's a subcategory of GiST. Anyway, basically, I think we're probably going to be talking about B-tree. Like, the vast majority of indexes are B-trees, and the vast majority of multi-column indexes I see are B-trees. Have you got any different examples other than, well, yeah, maybe btree_gin? Yeah, there's a problem with GIN I always like to talk about,
Starting point is 00:05:54 that if you have a scalar column, a column of scalar values like timestamp or a big integer, and you need to order by it and limit, like, 125,
Starting point is 00:06:10 then it can be inefficient, because GIN doesn't know how to do that — like, the planner needs to choose to use only B-tree or only GIN — and this is where btree_gin can help, to combine them into a multi-column index, right?
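To illustrate the btree_gin idea being described here — a minimal sketch, assuming a hypothetical posts table with a scalar tenant_id column and a jsonb tags column (the table, columns, and values are illustrative, not from the episode):

    CREATE EXTENSION IF NOT EXISTS btree_gin;

    -- btree_gin provides GIN operator classes for scalar types, so one GIN
    -- index can mix a "B-tree-style" column with a jsonb column:
    CREATE INDEX posts_tenant_tags_gin ON posts USING gin (tenant_id, tags);

    -- Both conditions can now be served by the single hybrid index:
    SELECT *
    FROM posts
    WHERE tenant_id = 42
      AND tags @> '["postgres"]';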
Starting point is 00:06:26 So we can talk about hybrid indexes, when one column, one layer, is GIN and another is B-tree, for example. Yeah, well, and I think that kind of starts to get to why we need or want multi-column indexes in the
Starting point is 00:06:42 first place, because I think a large part of it is ordering or clustering. Like, B-tree is a simple example. Beautifully, it has a single order through the entire structure, right?
Starting point is 00:07:00 A single order, yes, exactly. So the order is really useful for things like enforcing unique constraints. Like, things are placed in an order and you can see, is there one of these already in place? And multi-column just gives you: order by the first of these columns first, and then, for any that are equal, order by the second column
Starting point is 00:07:24 next. And I don't think the docs make this super clear. I think they give a good example, but that could be clearer, I think, that it's single order and order by A, then order by B, and then C, and then D, and however many you list. And that's if they're key columns. Includes is a bit different. That doesn't affect the ordering, and that has some benefits. what we can discuss whether it does I guess in a bit that's one of the main things I wanted to discuss with you actually I think that's a particularly interesting part but that ordering thing is really really important
Starting point is 00:08:01 in terms of then choosing which order to index things in in the first place so I've never I've never come up with or come across a guideline that I'm totally happy with in terms of well it's straightforward We discussed with 20 years ago in some circles. So, by the way, not only ordering, but also filtering. If your filters involve range comparison or like between operator, right? Or in case of gist, it can be something else.
Starting point is 00:08:36 So anyway, it can be not only strict ordering, but also just: is it inside the range or not? Is it more or less, with those operators? But as for the rule, let me this time also take one step back and say, super roughly, at a high level, forgetting about parallel operations and so on: we have three types of access to data. Sequential scan on one side — for lots of data, it's the slowest. Index and index-only scan — let's combine them in this case.
Starting point is 00:09:21 On the other side, the fastest: a single index scan, especially a single index-only scan, is the fastest if you have a lot of data. And between them there is a bitmap scan, which is a combination of a bitmap index scan and a bitmap heap scan. And this is, like, a proposed explanation of why we need a multi-column index in the case of two columns and filtering
Starting point is 00:09:43 or ordering on them. If you have two separate single-column indexes, you likely will end up using either one of them or maybe both of them, but in the form of a bitmap scan. Yes.
Starting point is 00:09:59 Which is not the most efficient. And a multi-column index can bring you the most efficient: either an index scan or even an index-only scan. Right? Yeah. And I think there are exceptions or times where they're pretty much equal in efficiency.
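A minimal sketch of the comparison being made here, using a hypothetical table t(a int, b int) — names and values are illustrative only:

    -- Two single-column indexes: the planner can only combine them via
    -- BitmapAnd (a Bitmap Index Scan on each, then a Bitmap Heap Scan).
    CREATE INDEX t_a_idx ON t (a);
    CREATE INDEX t_b_idx ON t (b);

    -- One two-column index can serve the same query with a plain Index Scan
    -- (or Index Only Scan), which is usually cheaper:
    CREATE INDEX t_a_b_idx ON t (a, b);

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM t WHERE a = 1 AND b = 2;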
Starting point is 00:10:20 Of course. But it's very, very easy to demonstrate with non-overlapping, like, data sets. Like, you quite often use the social media thing, right? If we had an index on user ID and a separate index on created_at or, you know, some kind of timestamp, looking at all of the user's posts in a certain time range — even if we can do the bitmap index scan on user ID just to find all of that user's posts and then a bitmap index scan on all of the posts in that time range,
Starting point is 00:10:54 it's very easy to see how we're having to scan a ton more data in that and then even though we're only looking at they're doing like the bitmap index scan only having to look at the index entries and the bitmap itself will be relatively small if that user doesn't have any posts and then the eventual heap scan will be relatively small and relatively efficient.
Starting point is 00:11:14 The actual initial scans to get the right data are huge compared to having a multi-column index with all of the user's posts next to each other with their created_at timestamps. So it's, yeah, in terms of — we often talk about buffer reads, right? We could create statistics, right? And so the planner would know that this user was recently active, so we need to follow the index on created_at.
Starting point is 00:11:41 And another user is not recently active, so it's better to fetch all entries for this user and then order them in memory and have a top-N sort in memory. But both, with any sufficiently active system, both would still lose massively in terms of buffer reads to the index-only scan or even an index scan on a single multi-column index.
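A sketch of the social-media example, assuming a hypothetical posts(user_id, created_at, ...) table — the point being that one multi-column index keeps one user's posts for a time range physically together in the index:

    CREATE INDEX posts_user_created_idx ON posts (user_id, created_at);

    -- Compare the "Buffers:" lines in the plan with and without this index:
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM posts
    WHERE user_id = 123
      AND created_at >= now() - interval '7 days';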
Starting point is 00:12:05 Yeah, yeah. Anyway, also what matters: random_page_cost matters here, right? Because, in reality, an index scan might still be better or vice versa, but the planner might choose a bitmap scan over an index scan because random_page_cost is 4, for example — the default.
Starting point is 00:12:30 Anyway, let's return. While we're on the topic of outdated things, I actually think there's a relevant part of the docs that I have an issue with that I wanted your opinion on. It's in this area. It says multi-column indexes should be used sparingly. In most situations, an index on a single column is sufficient and saves space and time. Indexes with more than three columns are unlikely to be helpful unless the usage of the table is extremely stylized. I don't remember these words from the docs.
Starting point is 00:13:00 Interesting. They're still there in the latest versions. and to be fair to them, they've existed since pretty much since 7.2. I understand. In reality, it depends on application. In some application, there are absolutely critical queries which must have index scan
Starting point is 00:13:21 and sometimes even index-only scan, with aggressively tuned autovacuum and the visibility map maintained very well, so the index-only scan doesn't degrade. So I think this is, like — I can see why this was written, but I cannot agree with it in all cases, because definitely we have three-column indexes
Starting point is 00:13:41 sometimes because we choose to have the most efficient path specifically to support specific queries which are most important for us. They receive a lot of traffic, so we cannot allow degradation here. Yeah. What about four or five columns? I guess it's rarer.
Starting point is 00:13:58 I don't see many, but I've seen a few. well this let's let's let's a little bit postpone this question because I know you wanted also to touch covering indexes right because maybe this is where we will discuss this but let me return to your previous question yeah how to choose the order and I remember the rule originally what was if it doesn't matter so there are two rules and you need to spend time understanding them I I'm pretty sure users I mean listeners who are most experienced they know this very well but I know a lot of inexperienced listeners our podcast receives so two rules, one rule
Starting point is 00:14:40 and this is maybe the most popular in terms of being mentioned: an index on columns A and B makes a separate index on column A —
Starting point is 00:14:59 a single-column index, like, not needed — you can drop it, right? Because this two-column index can support queries that need only an index on column A, right? And this means that if you have queries
Starting point is 00:15:15 dealing with both A and B in terms of ordering and filtering, and also you have queries that deal with only A, you should put A in the first place. This is rule number one. And there are nuances, of course.
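A small sketch of rule number one, again with a hypothetical table t(a, b):

    CREATE INDEX t_a_b_idx ON t (a, b);

    -- Served by t_a_b_idx using both columns:
    SELECT * FROM t WHERE a = 1 AND b = 2;

    -- Also served by t_a_b_idx, as a leftmost-prefix scan, so a separate
    -- single-column index on (a) is usually redundant:
    SELECT * FROM t WHERE a = 1;

    -- Not served efficiently by t_a_b_idx, because b is not a leading
    -- column -- which is why the column queried on its own should go first:
    SELECT * FROM t WHERE b = 2;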
Starting point is 00:15:32 And the second rule is: if you cannot choose — imagine you have only one query, which involves both columns — how do you choose the order? The rule says: choose the column which will give you the highest
Starting point is 00:15:50 selectivity and put it in the first place, so the scope of your scan narrows down faster. Yeah, I like it, and I think, for single queries, there's another one I've heard: put your equality conditions
Starting point is 00:16:08 in the leftmost positions, and then, if you have a range condition or an inequality condition, like a greater-than or less-than, that needs to come last. So all of your equality conditions first, then a range condition like created_at. That fits with the previous one,
Starting point is 00:16:24 I imagine. But it's really important — let's say you have a query that has, like, WHERE A equals one, B equals two, C equals three, and then D is between 17 and 1. It's way more efficient to have the A, B — again, depending a little bit on selectivity, I'm sure you can come up with examples where it isn't true — but having A, B and C in some order and then D last is really important to making sure that it's as low on buffers as possible in terms of that index scan.
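A sketch of the "equality columns first, range column last" guideline, with hypothetical columns and values:

    CREATE INDEX t_abcd_idx ON t (a, b, c, d);

    -- The scan descends on a, b, c and then walks only the narrow d-range:
    SELECT * FROM t
    WHERE a = 1 AND b = 2 AND c = 3
      AND d BETWEEN 17 AND 42;

    -- With (d, a, b, c) instead, the scan would have to walk the whole
    -- d-range and filter on a, b, c inside it -- many more buffer reads.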
Starting point is 00:17:02 But imagine we have a query, forgetting about ORDER BY and LIMIT. We have a query which selects with two filters. One is some boolean column and another is a time range. So what should go first? Well,
Starting point is 00:17:18 so think about it though because like let's say a 50-50. It's the simplest case, right? We've got 50% true, 50% false. Probably if we have a different thing, it's trickier. It should go, it should be partial index, actually, but yeah.
Starting point is 00:17:34 Maybe, depending if we need to like query the opposite one, sometimes. Maybe it should be, maybe it should be two partial indexes. Or I don't like, I don't think so. I don't think two partial indexes are more efficient than. But do you agree that
Starting point is 00:17:51 filter on date, usually, if we, for example, we know that we usually select narrow, narrow ranges, day or maximum month and we have a lot of data for many years do we agree that the filter on or filter on
Starting point is 00:18:07 timestamp gives you better selectivity in general case I think it has to right? Yeah. What do you mean? Well of course if Boolean column has some like it has two values and most
Starting point is 00:18:26 of columns have one value but you're selecting another one it's a different story like where is deleted and we know we have most rows not soft deleted right or something like this
Starting point is 00:18:39 in this case the Boolean filter can give you better selectivity but it's less but I think a scan let's think about like the buffer reads if you if it's true
Starting point is 00:18:53 and between these dates or between these timestamps our index scan can be perfectly efficient if we are indexed by Boolean first, then time stamp, because we only have to look at the true values between a certain date range. If we mix that, if we do date range, then Boolean values, we discard half the values we read. We read all of the date range and have to discard half the Boolean values. Why should it all the date, all, I don't, I don't get it.
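The two orderings being debated here, sketched on a hypothetical table t(flag boolean, created_at timestamptz), plus the partial-index variant mentioned a moment earlier — which one actually wins should be verified with EXPLAIN (ANALYZE, BUFFERS) on real data:

    -- Boolean first: the scan visits only the entries where flag is true
    -- within the date range.
    CREATE INDEX t_flag_created_idx ON t (flag, created_at);

    -- Timestamp first: the scan reads the whole date range and discards
    -- roughly half the entries on the flag condition.
    CREATE INDEX t_created_flag_idx ON t (created_at, flag);

    -- Partial index, if queries almost always ask for flag = true:
    CREATE INDEX t_created_true_idx ON t (created_at) WHERE flag;

    SELECT * FROM t
    WHERE flag
      AND created_at BETWEEN '2025-01-01' AND '2025-02-01';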
Starting point is 00:19:24 Why? We should, we should, I reckon there will be, double the buffer reads. I forgot to mention my main rule, always doubt and test in reality. Based on my understanding that you'll get double the buffer reads for putting the date first
Starting point is 00:19:41 and the boolean second. So this is what I meant. All of the rules of thumb are good, but they're not perfect, and actually they're slightly contradictory — I would say even "the most selective first". Think about the shared
Starting point is 00:19:58 multi-tenancy thing I prefer account ID before user ID even though user ID is definitely more selective like partly because we're never really well in most cases data locality I mean right data locality is one thing
Starting point is 00:20:15 yeah data locality is one thing but also we're very rarely serving a single query with a single index normally we're serving more than one in that query like type so but then like then the rules start to like not like they start to become a little bit more of an art form and you're trying to think what will they be in future what's most likely what what might we want to
Starting point is 00:20:38 query together like it's quite likely we want to query users in the same account together at some point maybe we want to look at all posts by this account or this team and having that data closer together in the index has benefits than if we do it by user ID and it's let's say it's or no matter how you're doing user ID, whether it's, maybe they'll be clustered, like, teams. Like, maybe if you've got like a big in, maybe they'll be clustered close together, but they won't all be. You might add a team member many years after setting up your account, that kind of thing. So I prefer some breaking of those rules.
Starting point is 00:21:13 There's one more rule, which is foundational. Data Locality, by the way, I would like to think about it, and also partitioning if we have it of course we don't want to scan all partitions right
Starting point is 00:21:28 and actually partitioning serves as type of kind of an index kind of leading like partition keys kind of the leading column or leading columns
Starting point is 00:21:36 I remember I was trying to create an index on the partitioning key in each partition, and it was so strange, because this
Starting point is 00:21:45 partition had only one value, and I was trying to include this as an additional column, and somebody said, we have partitioning, why do you do this? And I realized, oh, partitioning eliminates the need for an additional step in indexes, or in some indexes.
Starting point is 00:22:00 So what I was trying to say: an additional rule is to remember the order of execution of the SELECT clause. Oh, not the SELECT clause, the whole query. Yeah, yeah. So first, like, we should think — for example, let's keep it simple: ORDER BY goes last, right? So if we have ORDER BY created_at DESC LIMIT 25,
Starting point is 00:22:26 it means created_at should be the last column in the multi-column index, right? Yeah, I never thought about that, but yeah, that fits with the range thing I was talking about.
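A sketch of this rule, with the hypothetical posts table again — filter column first, the ORDER BY column last, so the index returns rows already in the right order and the LIMIT can stop early:

    CREATE INDEX posts_user_created_idx ON posts (user_id, created_at);

    -- Scanned backwards, the index yields this user's rows in
    -- created_at DESC order, so no separate sort is needed and the scan
    -- can stop after 25 index entries:
    SELECT * FROM posts
    WHERE user_id = 123
    ORDER BY created_at DESC
    LIMIT 25;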
Starting point is 00:22:42 Yeah, because of the order of execution — if you put created_at in the first place, it won't be helpful. It's a simple rule; like, intuition should tell you this, but maybe not. And yeah. But I agree with you that rules can be a little bit confusing, and it's better to test and learn,
Starting point is 00:23:00 and sometimes doubt and test again. And testing — we should pay attention to buffers. Yeah. This will, and this is going to get better in Postgres 18, right? People are going to get it by default. Yeah. We already have customers migrating to 17. So next year
Starting point is 00:23:21 I expect BUFFERS will become default, and I hope we will see a growing number of — data with buffers, I mean, plans with buffers involved — because it's difficult. Yeah. We definitely will.
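Until that default lands, BUFFERS has to be requested explicitly — for example:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM posts
    WHERE user_id = 123
    ORDER BY created_at DESC
    LIMIT 25;

    -- Look for the "Buffers: shared hit=... read=..." lines under each node.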
Starting point is 00:23:36 Cool. What about covering indexes? Well, yeah, when I saw the commit for adding include to create index. So create index added this option to include certain columns in the leaf pages, so in the tuples that are stored in the leaf pages,
Starting point is 00:23:56 so you can return more data. Like if a query, for example, wants your ORDER BY — like maybe wants to filter by user ID equals this and name equals this, order by this date — the name could be a payload. It doesn't have to be in the index ordering, but it could be there in the index leaf pages. To achieve index-only scans. To achieve index-only scans, yeah, very, very important.
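A sketch of the INCLUDE syntax being described, with a hypothetical title column as the payload:

    -- user_id and created_at drive ordering and filtering; title is stored
    -- only in the leaf pages, purely to enable index-only scans:
    CREATE INDEX posts_user_created_incl_idx
        ON posts (user_id, created_at) INCLUDE (title);

    -- Can be answered from the index alone (given a well-maintained
    -- visibility map):
    SELECT created_at, title
    FROM posts
    WHERE user_id = 123
    ORDER BY created_at DESC;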
Starting point is 00:24:26 That's, I think, the only use case for this. But it's a very important one, important enough to get added. To avoid heap fetches. But that also works — under heavy load, it works only if you tuned autovacuum properly and avoid blockers of the xmin horizon. Yes, yes.
Starting point is 00:24:47 Otherwise, your index-only scans will just slowly degrade to regular index scans, basically. Yeah, so, yeah — but covering index, like, the word "covering" for me, I think, is slightly loaded, because I think an index, for me, is covering for a query if you can serve it with an index-only scan. And in my opinion, we had covering indexes before this, before INCLUDE, because if you had a multi-column index on all of the columns that you need, that was a covering index for that query.
Starting point is 00:25:25 My favorite topic — criticizing the choice of terminology. Important. I agree with you. But I'm only reiterating this because I think it's important for people to understand. Like, I think it could be confusing
Starting point is 00:25:41 if you hear that covering indexes are only the ones where you use INCLUDE, because it's confusing. So, yeah, it got added in Postgres 11, 2018. And I'm interested in, kind of, like, the benefits over — like, why not just add it as another key? Like, what are the downsides to doing that? I spent some time thinking about it.
Starting point is 00:26:06 And I came to the conclusion — when they were added in 2018, I spent some time — and yeah, I came to the conclusion that the key difference is the semantics of uniqueness, right?
Starting point is 00:26:23 So if you want, if you have strict, like, if you add additional column to your multiple column indexes, you change uniqueness semantics,
Starting point is 00:26:32 right? If you want — that's a great point, yeah. An index on this can work well, but you cannot... but if this is a unique index, this is where you should have INCLUDE, but it's quite a rare case, actually.
Starting point is 00:26:40 So let's say we want an index on email — harking back to last week's episode, or whenever it was — we want it to be unique on email, but we want to include the name as a payload, and we want to let people have the same name as each other, even if they...
Starting point is 00:26:59 Yeah. But wait, because it's unique on email already, wouldn't it always be unique on email and name? Oh no, because then, if they had a different name, they could use the same email. Yeah, that's a problem. So we don't want to have uniqueness on pairs of values. If you want uniqueness only on email, and name or something is just, as you said, payload — this is a good point of view on it, you're just putting it there as a passenger. Yeah, that makes perfect sense.
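The uniqueness point in this exchange, sketched on a hypothetical users(email, name) table:

    -- Uniqueness is enforced on email only; name rides along as payload
    -- for index-only scans:
    CREATE UNIQUE INDEX users_email_uq ON users (email) INCLUDE (name);

    -- Making name a key column instead would change the constraint:
    -- the same email could then appear twice with different names.
    -- CREATE UNIQUE INDEX users_email_name_uq ON users (email, name);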
Starting point is 00:27:31 Yeah, I also think — do you remember, is there any size difference? Or — well, yeah, I was going to bring up the exact same thing. I wondered if it also got done for that reason a little bit, because if we've got the structure of a B-tree, once it gets sufficiently large, you've got levels — they're kind of like an org chart, aren't they? Like layers of management, almost. And if you have a multi-column index with keys, so where they're part of the ordering, you need to preserve that information at the different levels,
Starting point is 00:28:05 like for ordering purposes. But because INCLUDE columns aren't needed for ordering, they're not relevant — they only live in the leaf pages at the bottom level of the B-tree. So size-wise there should be a difference. But I think this has changed recently — de-duplication, I think, changes this, not for unique indexes,
Starting point is 00:28:30 but in cases where you can have duplicates, which is quite a lot of indexes, I actually think the de-duplication will normally beat the benefits of not having to store it at multiple levels — obviously depending a little bit on how big this data is, and you're probably not putting huge payloads in this. But yeah, I think there was a size benefit. I think it's less now that we've got the B-tree de-duplication work.
Starting point is 00:28:56 Yeah, obviously the columns you put into the INCLUDE part of CREATE INDEX, you cannot use them in filtering and ordering. They are like passengers. They're only beneficial for index-only scans. But since they are sitting, like, basically in the leaves — do you remember, there is a problem with B-tree, a limitation due to key size, right? So you cannot exceed, like, 2,000-something bytes or something like this.
Starting point is 00:29:25 I guess in this case, for columns used in INCLUDE, there is no such limitation, right? I'm not having to... There must be a limit, right, because it's still going in normal data pages. I'm not sure. I'm not sure. Let's not jump to... Wait, yeah, but I read this in the documentation, and I haven't made note of it, but it is in the docs, I'll find it.
Starting point is 00:29:48 But I must admit, this is a rare beast to meet in the wild, you know. I don't see including the word often. I see, but not often at all. Oh, yeah, I was going to ask. So generally you see people with multi-column indexes, but they're all keys. Is that fair? Yeah, yeah. Well, this is what minds tell you.
Starting point is 00:30:13 INCLUDE is still kind of exotic. It's good to have it, but it's not so obvious when exactly you should use it. Well, we discussed it, but still, in normal practice you say, I will just add it — if it's not a unique index, why should I think about it,
Starting point is 00:30:31 if there are no limitations in the key to be met if it's just some timestamp or number it's easier to add it and that's it so I wanted to touch one more topic here yeah when we should avoid completely putting another column to multi-column index or this type of indexes we just discussed which are called a covering but you don't like it
Starting point is 00:30:56 when should we avoid adding such columns — another column there? For example, we see, oh, there is an obvious benefit. So when you say benefit, like, benefit to a specific set of read queries — we could be faster. Imagine we have some simple case, like a table, some filter we filter by, and it's already working well, but oh, we see there is also a column status. It has several values, and we see most of the time we select 70% of everything, like status equals active or something. But we think, actually, we could shave off — we see some rows are filtered out dynamically. This is what pgMustard always suggests immediately.
Starting point is 00:31:44 Oh, there is, like — the index works inefficiently because later some rows you exclude, right. Rows filtered, or something. Like, how is it called? So, well, yeah, there's a couple of ways — it can be either a filter or an index
Starting point is 00:32:00 recheck — some types of index, and sometimes, like, for example, with bitmaps. I was looking at this today actually, because Postgres 18 changes this a little bit. But exact heap blocks and lossy heap blocks — so that can end up with an index recheck.
Starting point is 00:32:14 But I'm talking about rows removed because we have an additional part of the workload. So a filter. Yeah, filter. And they are filtered out dynamically. And we think, oh, it's not efficient. Let's go fully efficient.
Starting point is 00:32:27 And we start adding, for example, status as an additional column in our multi-column index — or it was a single-column index, but it becomes two-column, right? Or we decided
Starting point is 00:32:56 to use it as a condition and make the index partial, so we say WHERE status equals active, because we know most of the time it's active, or something like this — doesn't matter. So the index starts using this column somehow, right? Any of three ways: another column, INCLUDE, or the WHERE keyword. I know what you're going to talk about. Are we talking — are we going into HOT update territory? Exactly, exactly.
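The three ways just listed, sketched on a hypothetical table t with user_id and status columns:

    -- 1. status as an extra key column:
    CREATE INDEX t_user_status_idx ON t (user_id, status);

    -- 2. status as an INCLUDE payload (no effect on ordering):
    CREATE INDEX t_user_incl_status_idx ON t (user_id) INCLUDE (status);

    -- 3. status in a partial-index predicate:
    CREATE INDEX t_user_active_idx ON t (user_id) WHERE status = 'active';

    -- In all three cases the index now references status, so an UPDATE
    -- that changes status can no longer be a HOT update -- which is where
    -- the conversation goes next.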
Starting point is 00:33:20 So this is what happened to me. I was, like, enthusiastically optimizing some project. It was — I remember the first time. It was very long ago. It was DocSend — later it was acquired by Dropbox. It was in San Francisco. And I was like,
Starting point is 00:33:32 Oh we optimized additionally Like shaved off 20 30% of buffer hits and reads and it's great like it's small optimization but it's helpful and we can pay this price okay we will we need to maintain this index anyway it's slightly
Starting point is 00:33:46 bigger now, but it's still worth it. And then later I see degradation of updates, and you know why, right? Because we had HOT updates, because basically most of the updates updated status. Yeah,
Starting point is 00:34:03 yeah, and now it's participating in an index, so it makes HOT updates not possible at all. And HOT updates, just a reminder — we know there is a new audience usually — so, heap-only tuple updates: this is an update which changes only table pages, pages of table data, and doesn't touch indexes. Because a normal, regular update needs to touch every single index, which produces more dirty pages; checkpointer or bgwriter or even
Starting point is 00:34:39 a backend — they need to flush it to disk first, it doesn't matter which — and then it generates more WAL, and it slows down updates. Like, regular updates are much slower
Starting point is 00:34:51 than HOT updates. So just putting another column into your multi-column index might affect updates. This is worth remembering, if it wasn't indexed already — like, if it was indexed already in some other way, fair enough.
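One rough way to watch for this effect is to compare total versus HOT updates per table in the statistics views — a sketch:

    SELECT relname,
           n_tup_upd,
           n_tup_hot_upd,
           round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
    FROM pg_stat_user_tables
    ORDER BY n_tup_upd DESC
    LIMIT 20;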
Starting point is 00:35:05 Yeah, yeah. It's a really good point. And I think those index efficiency tips are the ones, like, I think people are most surprised by. Like, people look through their query — I'm talking about, like, experienced engineers, but not experienced Postgres folks — they look through a query plan, see index scan, index scan, index scan, maybe there's a lot of loops. Maybe there is a rows removed by filter, but maybe it only says one. But it's like 20,000 loops,
Starting point is 00:35:29 Maybe there is a rose removed by filter, but maybe it only says one. But it's like 20,000 loops. and they don't realize that that's the most inefficient part of their query. Like, just, anyway, so I love those, but normally when it's like index efficiency is, we talk about a percentage like 0.1%, like that or less than 0.1%, great. If it's 70%, we score it much less well. It's a combination of factors, there's choice. But one particular case, which is very common, and when I see it, I immediately recognize the problem.
Starting point is 00:36:02 if I see SELECT blah blah blah WHERE some filters and then ORDER BY created_at DESC LIMIT 100, definitely there are multiple indexes which include updated_at, and there is a trigger — or just some Ruby or Django, doesn't matter, application — which always updates updated_at, so this table doesn't see HOT updates, 100%. And yeah, and how to fix that issue without losing the mechanics of the updated_at timestamp?
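The trigger pattern being described usually looks something like this (a sketch; the ORM-level equivalent simply sets the column on every save):

    CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger AS $$
    BEGIN
        NEW.updated_at := now();
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER posts_set_updated_at
        BEFORE UPDATE ON posts
        FOR EACH ROW EXECUTE FUNCTION set_updated_at();

    -- If updated_at appears in any index, every such UPDATE is non-HOT.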
Starting point is 00:36:37 I don't know. I try sometimes to avoid indexing this. But if you present the values to your users, you really, like, need ORDER BY updated_at DESC, LIMIT something. You need it, right?
Starting point is 00:36:52 I try to say, okay, updated_at is very close to created_at usually. Let's use created_at and then, like, somehow — does it work well? It doesn't. I would have thought in most cases it would, actually, but yeah.
Starting point is 00:37:08 Depends, of course. It depends. Sometimes it's good — like, you basically already have almost everything, and on the fly. Yeah, sometimes there is a very good correlation. But sometimes, if it's a partitioned table, updates are happening all the time, and it's a mess. And updates might happen a month later, for example — this kind of
Starting point is 00:37:30 screws this up. For newer listeners, I'll link up our episodes on HOT updates, and we did one on LIMIT —
Starting point is 00:37:37 that's really good for this. Yeah, yeah. I just wanted to say this is a warning. Yeah, yeah. Multi-column
Starting point is 00:37:44 indexes, I mean, yes — everyone needs them, I can't argue, but we should be careful if we need fast updates
Starting point is 00:37:50 and fewer WAL bytes put to WAL, you know. Yeah, because WAL bytes can be an issue,
Starting point is 00:37:58 and they put pressure on both backups and replication systems. Right. Yes. And this is the old read-versus-write thing, isn't it? And also, like, if you've got a write-heavy application, or if you're not doing many reads at all, you might not want many indexes at all.
Starting point is 00:38:16 Like, it is a trade-off in general, and not just multi-column, right? Sometimes, you know, sometimes here it makes sense to split and have a one-to-one relationship between two tables, and in one you update a lot — and you give up, but you have only, like, three or two columns and that's it, right — and another one has everything else,
Starting point is 00:38:37 and search and so on. But in this case you will deal with the problem that we cannot create an index on two tables — for some queries we will lose a single index scan, we will need to join — back to the denormalization topic again.
Starting point is 00:38:53 yeah yeah I had one more on include I wondered if it might end up with less bloat in some workloads. So when we're updating that column that's been included as a payload, for example, the tuple doesn't need to move in the index. So in the case where we've got them both as key columns, we're having to move the tuple to a different page most likely. If it's like ordered by email, then name, for example,
Starting point is 00:39:24 if their name changes, probably their entry or move page and you could end up with a split. So I was just thinking in terms of updates, I wonder if we'd end up with less index bloat if we used include instead of multiple keys. It seems like a minor advantage, but I wondered if that could help as well. Yeah, it's an interesting point.
Starting point is 00:39:48 I should think about it. Yeah, yeah. But yeah, it feels to me like INCLUDE feels like a scalpel, like a very specific tool that is helpful. Including — it's, yeah, just — the word is "including". Are you sure? No? I'm not sure.
Starting point is 00:40:08 Include? Yeah, it's includes. I know this was — it was a bug in my mind. Sorry. Yeah, include. It's not even "includes", plural. It's "include", singular. Yes.
Starting point is 00:40:18 Yeah. Yeah. But that shows how common, like, shows how commonly you see it. Yeah. Super rare. Including would probably make more sense, but it's just include, yeah. Yeah. So, okay.
Starting point is 00:40:34 Good. Factor naming. I think we're done? Yeah, it was a bag of, like, some pieces of advice. And I think we'll come back to some of the other index types at some point. By the way, I found it just as, like, an interesting tidbit: for GIN and BRIN, because of their structures, the ordering of the columns doesn't actually matter, which I found fascinating.
Starting point is 00:40:57 Say again? GIN indexes and BRIN indexes — the order of the columns. It doesn't matter? Doesn't matter, yeah. Interesting. Okay. I mean, mostly we're worried about B-tree,
Starting point is 00:41:10 so it's not too important. Yeah, 90-plus percent. Yeah. All right. Nice one, Nikolay. Thank you so much. Catch you next week. Bye-bye.
Starting point is 00:41:20 Bye.
