Postgres FM - BRIN indexes
Episode Date: July 21, 2022

Here are links to the two main resources we mentioned: Paul Ramsey's recent blog post on BRIN indexes, and Tomas Vondra's slides on BRIN index improvements. A few other things we mentioned: the B-tree Wikipedia page, pg_repack, and pg_squeeze.

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov and @michristofides. If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by Nikolay Samokhvalov, founder of Postgres.ai, and Michael Christofides, founder of pgMustard, with special thanks to Jessie Draws for the amazing artwork.
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, how are you doing today?
Hello, doing great. Just returned to California from a long trip, a long flight.
But doing great. How are you?
I am also well, just recovering from our hottest day ever in the UK.
Oh yeah, I was tired of movies and tried to watch some news and so on.
And the BBC, 95% of the time, was talking about these weather records, right?
So, yeah.
Wonderful.
So today you chose the topic.
What is it we're going to be talking about and why?
We are going to talk about various kinds of indexes, particularly BRIN indexes, block range indexes, because recently we had a few materials published, and I think there was also a good discussion on Hacker News. So it's a good time to discuss various aspects, and what's in our experience as well, right?
Yeah, absolutely. I'm very interested in this.
And as you mentioned, there was a recent blog post by Paul Ramsey at Crunchy Data.
And I was lucky enough to be able to see a talk at Postgres London by Tomas Vondra
about some of the improvements.
Oh, so you were there.
Yeah, let me repeat the thoughts I expressed last time.
I think all conferences should start recording.
In terms of my personal materials, I'm going only to attend conferences which record.
I think it's very important to present to a wider audience because not everyone can travel.
And it's just much more efficient if you present to an offline audience: you have a good connection to them and follow-up discussions after your talk.
But also if you have recorded material, this is forever.
It's published and you can refer to your previous materials, and you can also see some mistakes you've made and fix them later. I saw how other conferences outside the Postgres community did it over many, many years. And it's also beneficial for the conference itself, because this cloud of materials accumulated over the years attracts even more attention to the conference and increases its value. So I think conferences which don't record shouldn't exist anymore at some point. This is my strong belief. So as I've said, in San Jose I made an announcement that I'm not going to present anymore if the conference is not providing recordings. So I'm glad that you saw Tomas Vondra, right? You were there.
Yes, and it was.
And you saw it. But we have slides, we don't have video, unfortunately, right?
Yes, good point.
He was great and he published his slides.
I'm not actually 100% sure it's the exact slides from Postgres London,
but I think he's given this talk a couple of times,
so it's from one of them.
I hope he will give it one more time online.
I already asked to invite him to Postgres TV Open Talks series,
so I hope we will have a recording.
If he accepts, it would be great.
Wonderful.
Anyone listening, that will be on Postgres TV if we get it.
Right.
So we have Postgres FM, Postgres TV.
Quite easy to remember, right?
And since we discussed it a little,
everyone, please subscribe and help us to grow.
What can help?
Of course, if you subscribe, it can help.
And also, if you share links to Postgres
FM to particular episodes or just to Postgres FM webpage in your social networks and groups
where you discuss Postgres or engineering in general, it would be very helpful for us
to grow. So please do it.
Brilliant. Well, so block range indexes. We have our default index type: if I just create an index in Postgres, it will create a B-tree, which is the default. It's a great default, it's got a lot of benefits, it's brilliant, but we do have other options, and one of those is BRIN.
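For illustration, a minimal sketch of the two options being contrasted here; the table and column names are made up:

```sql
-- Default index type: B-tree
CREATE INDEX events_created_at_btree ON events (created_at);

-- Block range index (BRIN) on the same column
CREATE INDEX events_created_at_brin ON events USING brin (created_at);
```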
Yeah, speaking of B-tree, I think in every database system it's the default. And this is my favorite question, but also a very tricky question when you interview some engineer. You feel it's quite a good engineer, but if you pull this question out of your set of prepared questions, like, what's a B-tree, let's discuss the structure and balancing. In many, many cases, the interview is over, unfortunately.
So I think every engineer should know it, but there are many, many engineers who don't know, unfortunately, what a B-tree is. But B-tree, of course, is the de facto standard for database systems. But it's not enough in many cases. It's not enough in terms of performance, and a B-tree can degrade over time quite a lot.
Or it can be a big one in terms of size occupied.
And also when we talk about size, it's not only disk space occupied, but also part of
buffer pool, the cache state. So if the index is huge, it means your buffer pool needs to keep more pages, more buffers
in the pool.
Right.
I just wanted to provide some remarks about B-tree. Also, a meta remark additionally: we have got very good feedback. Thank you to everyone for the feedback. It's very important for us to hear what you think about our show and ideas. And we also had a couple of guys mention that we do quite basic material, right? Let's do some more hardcore. But I'm 100% sure that there are many, many people who need basic material. So I think we need to continue to talk about some basics, but sometimes trying to go deeper, of course, right? So that's why I talk about B-tree: if we jump straight to BRIN, well, B-tree is the default, so it's important to understand how it works.
Well, and Brin is quite advanced, I would say. I think a lot of people can go a long way using Postgres
and there's not a sensible case for using Brin
or there's no need to know about it for a long time.
And you mentioned already a few of the times
where B-trees can get cumbersome.
And I think the big one here is size.
Just, you know, if you've got a very, very large table and you're indexing a column on it, it can be a very, very large index. And then there's bloat.
Yeah, absolutely.
And then the other thing that got brought up by Paul Ramsey in his blog post, which I hadn't actually considered before, was that there can be a difference in write overhead, with B-tree having a higher write overhead on average than some other index types.
So that's one other downside sometimes.
Yeah, this famous index write amplification, one of the reasons Uber went from Postgres to MySQL, right? So each time we update a row, all indexes should be updated, unless it's a HOT update. So it's a big problem, of course. If we have just one index, it's just one update in addition to the heap update. But if we have 10 indexes, the overhead becomes bigger and bigger. That's why you need to get rid of unused indexes and redundant indexes, right?
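As a hedged sketch of how you might look for those candidates, using the standard statistics view (assuming statistics have been accumulating for a while):

```sql
-- Indexes that have never been used for scans since statistics were last reset
SELECT schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```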
Yeah. But the one thing I hadn't considered was that different index types could have
massively different overheads to one another.
Well, it's very common, for example, to have issues with updating GIN. There are special options, like fastupdate and so on. So GIN updates are very heavy, expensive. But B-tree, it's like a medium level of overhead in terms of updates, but BRIN is definitely light, super low.
Light, super low, right. Do you think it's fair to say that the biggest advantage of BRIN indexes is their size? They're at least an order of magnitude, often two orders of magnitude, smaller than a B-tree index.
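One quick way to see the size difference being described, assuming the two illustrative indexes from the earlier sketch exist:

```sql
SELECT pg_size_pretty(pg_relation_size('events_created_at_btree')) AS btree_size,
       pg_size_pretty(pg_relation_size('events_created_at_brin'))  AS brin_size;
-- On a large table, the BRIN index is typically kilobytes to megabytes,
-- while the B-tree is often gigabytes.
```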
Well, right. So if you don't know about B-trees, you should definitely start with the Wikipedia page as soon as possible, if you consider yourself an engineer, right? But what is a BRIN index? It's quite a simple structure. Like in B-tree pages, we have links to tuples, but not direct links for each tuple: links for ranges. So we describe: this page in the heap has tuples starting from this ID or timestamp or something, starting with this number and going till this number. So everything in between is there. Of course, multiple pages can be referenced in this way. By default, 128 pages can be referenced for one range, but it can be adjusted. What's interesting here is that this index directly dictates that you should think about the physical layout, because a B-tree doesn't directly do that. Of course, we know that if, for example, you have some value or range, and the B-tree says these tuples are located in those heap pages, and for example we have a thousand tuples and each one is in a separate page, a thousand pages, it's called sparse storage. You can run the CLUSTER command and reorganize your heap, so the storage will be according to this index, right? And in this case, probably they will all go to a few pages. So fetching such tuples will be extremely fast compared to this sparse storage.
But a BRIN index directly says: I'm useful only if your tuples in the heap are stored sequentially, according to ID or timestamp. So in this case, for a particular range, only a few pages will be referenced, right?
The way I've heard it described is that it's all about correlation. If they're exactly in order, 100% correlated, it's optimal and you can get really good performance, because the index can be used to filter out the vast majority of the blocks from being scanned, and it only has to look at a small number. But as it gets less correlated, performance degrades, and if there are some exceptions that make the ranges wider than they would ideally be, it can degrade quite quickly.
And overlapping also.
Yes.
Overlapping, right.
So, like, okay.
But yeah, not overlapping,
but the same range can be used many times
because more than 128 pages are referenced for this range,
so it's not good.
But, like, how can we check the physical storage aspects? There is a hidden column called ctid. You cannot create a column called ctid, right, because it's a system name, it's reserved. But what ctid is: it's two numbers. One number is the page number, and the second number is the offset inside the page. The first number is the important one. And there is a trick for how to extract it, for example to count the distinct number of pages for particular rows. It's easy: you can convert it to a point, I mean (x, y), and then you can take only the first argument. In this case, you will get only the page. This is how you can extract only the page number from the ctid. So you can always check your data and see the ctids, or just the page numbers, with this trick.
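The trick just described, as a minimal sketch (the table name is made up):

```sql
-- ctid looks like '(page,offset)'; casting via text to point lets us grab the page number
SELECT ctid,
       (ctid::text::point)[0]::bigint AS page_number
FROM events
LIMIT 10;

-- Count how many distinct heap pages hold a given set of rows
SELECT count(DISTINCT (ctid::text::point)[0]) AS distinct_pages
FROM events
WHERE created_at >= now() - interval '1 hour';
```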
You can check and see this correlation, right? If you have a sequential ID, or a timestamp which is filled by something like now(), or better in this case clock_timestamp(), because now() is generated only at the beginning of the transaction. So if you have some batch insert of a thousand rows, all of them will have the same now() value, but clock_timestamp() will generate a timestamp for each row inserted, right? And you will have a difference in values, right? And in this case, you can just check: select ctid, convert it to a point and get the first argument, and then id or created_at. And you can see if there is correlation. I think you can even apply some functions to prove that the correlation is strong, right? Maybe it's a good exercise.
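A sketch of that kind of check, plus the planner's own view of correlation; names are illustrative:

```sql
-- Correlation between heap page order and the column value;
-- close to 1.0 (or -1.0) means the physical layout follows id
SELECT corr((ctid::text::point)[0], id) AS page_vs_id_correlation
FROM events;

-- Postgres also keeps a correlation estimate in its statistics (after ANALYZE)
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'events' AND attname IN ('id', 'created_at');
```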
But the cases I hear about, where people are actually using these in the real world, tend to be cases where you're pretty sure it's correlated already, because you've been inserting timestamp data in order. You know, maybe it's a sensor reporting data, and you never update it, you never delete it, it's all going in order.
And you can delete. Delete is okay, only updates matter.
Good point, good point. Yep. So you can check for sure, but equally, that's the case where it's most likely to be relevant. Or at least until the latest version, I think those cases were pretty much the only good use case for BRIN, when you had...
Append-only tables.
Exactly.
Yeah. Let me describe once more. As I've said, I'm trying to describe some basics.
Yes. Good luck.
So once you've learned what ctid is, I recommend you to create some table with a surrogate primary key, like a sequentially generated number or a sequence.
And then, for example, we have a row where ID is one.
And we see that it's inside page zero.
Offset zero.
Okay.
But then I recommend you to execute UPDATE this_table SET id = id WHERE id = 1 and see what happens with the ctid.
This is very interesting, because it may be unexpected for you: you will see that the ctid value changes. Sometimes the page is the same, if there is space inside the page, but you see how a tuple, which is a row version, a physical row version, is generated. A new tuple is always generated, even if logically you didn't do anything: id = id means we don't change anything, right? But you see how a tuple is generated. That's why updates can shuffle your data, right? Correlation can get worse over time.
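The little experiment being described, sketched end to end (names are illustrative):

```sql
CREATE TABLE t (id int PRIMARY KEY, payload text);
INSERT INTO t SELECT g, 'row ' || g FROM generate_series(1, 1000) AS g;

SELECT ctid FROM t WHERE id = 1;   -- e.g. (0,1)

-- A "no-op" update still creates a new physical row version
UPDATE t SET id = id WHERE id = 1;

SELECT ctid FROM t WHERE id = 1;   -- a different ctid: same page if there was room, otherwise another page
```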
Yes, let's use an example: I have some really old data, and maybe I found out the sensor was incorrect or had a bug in it, and some of the data needs to be updated. If I update some of that old data, what Postgres will really do is insert new rows at the end of the heap and then mark the...
Or in some pages where there is a space.
Yes, good point.
Yeah, good point.
But not in the same...
Yes.
But then that reduces the correlation of that table.
Exactly.
Brilliant.
Yeah.
Or, like, we have some import of old data we were missing. Not only updates: an insert can be a problem as well, if we insert old data, like a postponed insert, right? Delete is not a problem. I don't see a problem with deletes, but postponed inserts and updates do matter. And also, you know what? Probably a problem with repacking.
If you run pg_repack, pg_repack allows you to achieve the same effect as CLUSTER, but without downtime, without blocking for long. So if some DBA, for example, didn't notice that there is a BRIN index which requires this correlation with physical storage, and decided to perform repacking using pg_repack with clustering according to another column or index, you will get very different physical storage, and the correlation will not be good at all for your BRIN index.
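Roughly what that scenario looks like. CLUSTER is the in-core command; pg_repack has an ordering option that, as far as I understand, achieves a similar layout online (all object names here are illustrative, and the exact pg_repack flags should be checked against its docs):

```sql
-- Rewrites the heap in index order, but holds an exclusive lock for the duration
CLUSTER events USING events_name_idx;

-- The online equivalent with pg_repack (shell command shown as a comment):
--   pg_repack --table=events --order-by=name mydb
-- Either way, the heap is now correlated with "name", not with id/created_at,
-- so a BRIN index on created_at will suffer.
```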
Right, interesting. I thought pg_repack would help. Have you heard of pg_squeeze? That's the one I'm aware of that's an alternative to repack.
I've read about it, checked it, but I mean, I didn't use it yet. It's an interesting idea to use logical decoding instead of the pg_repack approach with, like, substitution of the table. pg_repack's idea with the delta table is really interesting, right? pg_repack writes all changes to some delta table, and then sometimes it's a challenge to apply them, because the changes from this delta table need to be applied in a single transaction.
So if a lot of changes are accumulated,
it can be a problem.
But if you use repacking with the cluster option, using a different index... For example, you may have, I don't know, some name column, right? Alphabetical. And you want to reorganize your heap. You have pages, you need to present the data with pagination ordered by name, and you think this is the most useful use case for you, so you decided to reorganize the heap. So the data in the heap is ordered according to name. So there is correlation with name, not with ID or the created_at timestamp. And your BRIN index in this case will not perform well.
Unless it's a BRIN on name.
Right.
But I've actually thought of a problem with delete as well. Once you've deleted those rows, let's say in the middle of your heap, and it's vacuumed, later new inserts are now going to go in the middle.
Yeah, so now we've got a problem again. But anyway, agreed, this is interesting, because I think it slightly takes us on to the improvements that Tomas was talking about in Postgres 14.
Because we discussed this. Yes, let's do that.
So just to make some conclusion
before we discuss improvements in Postgres 14.
So it means that before Postgres 14,
brin can be used only
if you definitely have append-only pattern for your table,
log-like table.
You log some events or some data from,
like some telemetry data from somewhere
and so on, right?
In this case, BRIN can be useful.
Or it degrades quickly
and you need to re-index regularly.
Like, I guess there are some use cases
if you do update data and then re-index,
I guess you have some benefits still,
but it degrades quickly, and maybe there isn't the benefit, maybe you're better off without the index. Can re-indexing help?
Like, clustering can help, so you need to reorganize the heap.
Good point. Maybe. Yeah, maybe repacking and then re-indexing, or clustering. This is interesting, doing the re-indexing.
But it's light anyway.
Yes.
It's very light.
But they were popular.
That use case is not...
Lots of people have logging tables.
Lots of people do have this, at least for one table.
Yeah, but the problem is I always...
I tried at least three times over the last few years, since BRIN popped up, since it was created. Never decided to go with BRIN, never.
I don't know.
Like I didn't find it useful.
Like, okay, the size of it is quite small.
But on large volumes of data, it always performed worse than B-tree for me, much worse. Maybe because of the cases I had. Both materials, and we will provide links in the description, mentioned that for point searches, like when you need only one row or a couple of rows, BRIN is not attractive. Maybe I was closer to that: I needed only 20 rows for pagination and so on. But when you need to find many, many rows, like thousands of rows, BRIN can outperform B-tree. I just never... Always, when I need to decide what I should use, I do experiments. I'm a huge fan of it, I'm building a company on top of the idea of experimenting. Always, always, always experiment, for learning, for making decisions. The best case is if you can experiment with production data, or something similar to production without PII, or without real data. But I never decided to use BRIN, because I saw it as less performant compared to B-tree, even for append-only tables, only inserts.
I'm the same.
I've only ever seen B-tree outperform BRIN in terms of raw query performance. I think there's a good example in the Crunchy Data blog post where, in lab conditions, BRIN can outperform B-tree, but
in the real world...
Let me give a disclaimer here.
Last time I was like a bad cop
guy who criticized a lot
of various stuff. I'm going to continue.
So I see value in criticism, but I'm going to be polite.
Maybe sometimes not very polite, because I often have some opinion, quite strong opinion.
But I hope nobody will be offended.
And it's just for improving things,
not for damage, right?
And I also can be sometimes very wrong
and I quickly admit it
if I see evidence that I'm wrong.
But in this case,
I want to criticize both materials
from Paul Ramsey and Tomas Vondra.
I cannot understand how we can talk about performance and layout and so on, and provide plans without BUFFERS. It's a huge mistake, simply a huge mistake, because we discussed how BRIN can be better in terms of performance than B-tree, and we talk about some timing which has a lot of fluctuations, right? It's not reliable. At least you need to run it multiple times and take some average if you talk about timing. And also, Paul Ramsey's blog post makes some conclusions based on just single data points. For example, like 10 rows, 100 rows, 1,000 rows, 10,000 rows: okay, we see BRIN is better. I'm not convinced. I'm not convinced because I saw it for huge volumes too, like a million rows. Seriously, a million rows is nothing today.
Test at least a billion rows, right?
Sorry, maybe I'm offensive, but I just want everyone to have better materials.
It's great to discuss performance, but don't do plans with just EXPLAIN ANALYZE. Do plans with EXPLAIN (ANALYZE, BUFFERS). We will see the I/O. I/O is most important. All indexes exist to reduce the number of I/O operations. An index is all about reducing I/O, any index. Instead of making a lot of buffer hits and reads, we want to have a few buffer hits to fetch one or a thousand rows. We don't want to have a million hits and reads when we need only thousands of rows. But when you talk about timing, I don't understand where this timing comes from. When I see buffers, I understand: oh, this timing comes from huge numbers of buffer reads and hits.
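What that looks like in practice, as a sketch; the query and names are illustrative:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM events
WHERE created_at BETWEEN '2022-07-01' AND '2022-07-02';
-- In the output, compare the "Buffers: shared hit=... read=..." lines between
-- the BRIN and B-tree plans, not just the execution time.
```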
Yeah, really good point.
And I think actually there's a couple more.
So the reason I always used to think, until the blog post and until a couple of things that Tomas said, that BRIN couldn't outperform B-tree is that it only has the ability to do bitmap heap scans, or bitmap scans in general.
Oh, yes.
And a B-tree can also do a bitmap scan. The only slight downside for the B-tree is that you have to build the bitmap, so there's slight overhead there, but it's also efficient: it's looking at exactly the blocks it needs.
When you say you need to build it, you mean the executor, like Postgres, not the user?
Yes, sorry, not the user. I was thinking the executor has to build it for the B-tree index, but it doesn't have to for the BRIN index. But with a BRIN index, I think you'd always get false positives back that you have to filter out. You're always going to get some rows on those blocks that you're going to then have to get rid of. So I didn't understand how it could be more efficient, but the main advantage, I always thought, was the size,
so instead of having a multi-gigabyte index that
also then takes up multiple gigabytes in the cache and all those things, you could have an
index that's normally in the kilobytes or even for very large tables in the megabytes. So I always
saw that as the main advantage, rather than performance.
You know, if, for example, we just created a table, filled it with data and forgot to run vacuum, and I think the Crunchy blog post shows this, in this case you will have a plan with a bitmap index scan, right? And in this case, I can imagine that BRIN will be efficient if you have a huge table and you need to find and fetch a lot of rows. It should be there. But I would definitely look, first of all, at the type of operation, like bitmap index scan. And also, I would look at buffer hits.
I've just remembered.
Yeah, I agree.
One last thing that's a bit unfair with the Crunchy Data blog post, and I know it makes it a slightly fairer comparison, but it's unfair for the real world: they also disabled index-only scans, because BRIN doesn't support them. To make it a more apples-to-apples comparison, that's disabled, but in the real world you might sometimes get an index-only scan, especially for data that isn't changing.
Michael, Michael, Michael, you were supposed to be the good cop here. Why? I didn't agree to this. Why?
Okay, yes, but those are all really good points. I'm really happy that we've brought those up. I actually do think they're good content, and I'm glad that they're sparking a discussion here. The other thing, if we're on the more advanced side: I do think BRIN has improved a lot in the last version, and I didn't know that, well, I wasn't aware of how much it had, until Tomas's talk. The one thing I really wanted to bring people's attention to is a new operator class, which I'm aware is slightly advanced. By default, BRIN indexes still behave pretty much exactly as they did in Postgres 13 and 12, I believe, but if you change a setting while you're creating the index, if you create it slightly differently using a new, well, there's a couple of different new operator classes, but the one I'm particularly excited about is min-max-multi.
And my understanding of that, and this might be flawed,
is that instead of only maintaining a single minimum and maximum
for each block range,
it can instead maintain multiple minimum maximums.
Now, the big advantage of that, well, two big advantages: it can support different types of correlation. So if your data is laid out oddly, it may not be the timestamp example, but actually the timestamp example is great: if we inserted some old data, we would now be able to have two min-maxes, or probably not two, probably lots more than that, but we could have the new data and the old data and a big gap in the middle, where Postgres knows it doesn't have any data for that big gap.
So at worst, they degrade much less badly,
but I think it's a huge, huge improvement that could support
a lot of different types of correlation as well.
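A hedged sketch of creating such an index on Postgres 14; the operator class and parameter names below reflect my understanding of the PG14 docs, and the table is illustrative:

```sql
CREATE INDEX sensor_readings_ts_brin
  ON sensor_readings
  USING brin (recorded_at timestamptz_minmax_multi_ops (values_per_range = 32))
  WITH (pages_per_range = 64);
```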
And I'm interested why,
well, I think it could be so useful
that it should be the new default for Brin,
but I do understand that defaults are difficult to change
and you might not want to affect existing users.
To me, it sounds like a game-changing change.
And I'm looking forward to testing it once again.
Unfortunately, I still don't have any real big production with Postgres 14, but once I have it, for me it's a reset of my opinion about BRIN indexes. Because, as I've said, in the past I made decisions not to go with it, and next time I would probably not have spent any time double checking. But now I definitely will double check, for the cases when this correlation is not perfect and we have some, not intensive, but occasional updates or deletes or postponed inserts,
as we discussed.
Yeah, it sounds interesting. And I also see, to be fair, Paul Ramsey's article doesn't describe these improvements, but it talks about the pages_per_range option that was available before already, and you can try to play with this parameter and see how performance in your particular case is affected, negatively or positively. It also mentions the pgstattuple extension, to check the physical layout of pages and see what's happening under the hood. This is very, very good. I mean, reminding people about these capabilities is helpful if you run your own experiments. So, a good thing.
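Both of those, sketched briefly (the table name is illustrative):

```sql
-- pages_per_range is a BRIN storage parameter (default 128)
CREATE INDEX events_created_at_brin32
  ON events USING brin (created_at)
  WITH (pages_per_range = 32);

-- pgstattuple shows the physical state of the heap: live/dead tuples, free space, etc.
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('events');
```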
But Tomas Vondra, discussing the improvements, also shows some good examples. And I noticed these operator classes, it looks like they are available for various data types, right? And UUID is there. This is interesting, because UUID is usually considered random, the physical distribution is awful, right? And BRIN for UUID is not good, because it's not ordered.
I think there are some UUID versions that are ordered...
Yeah, okay, but I'm wondering if that's why they've been included.
Yeah, actually, I now remember we use some in our tool, some almost ordered ones, right? But many versions are not, like UUID v4. I don't remember these things. They don't look ordered at all. And it's interesting to check how the improved BRIN index can perform on this data type. So it should be interesting not only for cases which are not a fully append-only pattern, but also for these kinds of data types.
Yeah.
In person, actually, Tomas made a really good point that this is actually a really good area for new people who want to contribute to Postgres to explore.
He has got a few ideas of improving these further.
And indexing in general is quite an isolated part of the code base.
You don't have to understand everything about Postgres to be able to improve things.
And Brin indexes especially, they're isolated.
They're not the default, so they're not as contentious.
So I think there's a few good reasons why this would be a good area to get involved in. And I think Tomas mentioned being willing to help people out if they do want to get started on that kind of thing.
So that would be awesome.
Good first issue, right?
Yeah, right, the label.
If Postgres used GitHub or GitLab,
this label would be there in this issue.
But no.
Definitely a topic for another day.
Postgres doesn't use them.
Yes.
Oh, yeah.
Yeah, we have several topics like that.
Yeah, good, good.
So I find these two materials useful for reminding you what you can use in your experiments.
But as I've said, I think everyone should experiment
with their own data and queries.
If you need to make a decision, don't trust blog posts blindly. Experiment with your data and your queries. Even if you don't have data, you can generate some, as close as you can to how you think the future will look, and also queries, and then you can start experimenting.
Absolutely. My big piece there is always to make sure you have roughly the right number of rows. Everything else matters a bit, but the raw number of rows, in the right order of magnitude, makes a big, big difference to which query plans you're going to see and to overall performance. So if you do nothing else, please at least insert an appropriate number of rows for testing.
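A small sketch of generating an appropriate volume of ordered test data with generate_series (names and volumes are illustrative; scale them to match your production):

```sql
CREATE TABLE events (
  id         bigint GENERATED ALWAYS AS IDENTITY,
  created_at timestamptz NOT NULL,
  payload    text
);

-- 100 million rows with ascending timestamps, one per second
INSERT INTO events (created_at, payload)
SELECT timestamptz '2022-01-01' + g * interval '1 second', 'test payload'
FROM generate_series(1, 100000000) AS g;

ANALYZE events;
```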
And I need to add here,
on two sides,
number of rows on two sides.
First number,
what number of rows stored in table?
And second,
number of rows you need for your query.
Sometimes there is no limit in query
and then your data is growing
and this is a big mistake many people often do.
So not limiting the result set.
So you need to think how many rows your users will need.
They won't need probably a million rows
if you present these rows on some page on a mobile or web app.
So you need to think about limit and pagination
and so on.
Absolutely.
Right, I think we're done.
Yeah, good. Nothing left, at least in our heads. Maybe some people have some additional thoughts; I would love to see them in the comments or on Twitter, probably.
Yeah. And, well,
I think we probably could talk about it for a bit more
like the new bloom operator class seemed really interesting, and the min-max thing is also configurable.
Oh yeah, but I'm also conscious of time, and I think that's probably verging a little bit advanced. But there are some really cool things happening in BRIN, so if you haven't considered it for a while...
Please do.
Upgrade if you can to Postgres 14.
What else?
Test 15.
Yeah, right.
If you're brave.
Test Postgres 15.
For testing, you don't need to be brave.
It's not production.
Testing can happen elsewhere.
Very true.
I'm not saying upgrade your production to 15 beta 2.
It's probably too brave.
Brave is not quite the right word, is it?
Awesome. Well, thank you everybody
for joining us. And yeah,
send us your feedback. Let us know what you'd like
discussed. And yeah,
thank you, Nikolai. Hope you have a good week.
Thank you, Michael. See you next week.
See you next week. Bye.