Postgres FM - BRIN indexes
Episode Date: July 21, 2022

Here are links to the two main resources we mentioned: Paul Ramsey's recent blog post on BRIN indexes, and Tomas Vondra's slides on BRIN index improvements. A few other things we mentioned: the B-tree Wikipedia page, pg_repack, and pg_squeeze.

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov and @michristofides. If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by Nikolay Samokhvalov, founder of Postgres.ai, and Michael Christofides, founder of pgMustard, with special thanks to Jessie Draws for the amazing artwork.
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, how are you doing today?
Hello, doing great. Just returned to California from a long trip, a long flight.
But doing great. How are you?
I am also well, just recovering from our hottest day ever in the UK.
Oh yeah, I was tired of movies and tried to watch some news and so on.
And the BBC, 95% of the time, was talking about these weather records, right?
So, yeah.
Wonderful.
So today you chose the topic.
What is it we're going to be talking about and why?
We are going to talk about various kinds of indexes, particularly BRIN indexes, block range indexes, because recently we had a few materials published, and I think there was also a good discussion on Hacker News. So it's a good time to discuss various aspects, and what's in our experience as well, right?
Yeah, absolutely. I'm very interested in this.
And as you mentioned, there was a recent blog post by Paul Ramsey at Crunchy Data.
And I was lucky enough to be able to see a talk at Postgres London by Tomas Vondra
about some of the improvements.
Oh, so you were there.
Yeah, let me repeat the thoughts I expressed last time.
I think all conferences should start recording.
In terms of my personal materials, I'm going only to attend conferences which record.
I think it's very important to present to a wider audience because not everyone can travel.
And it's just much more efficient if you present to an offline audience: you have a good connection to them and follow-up discussions after your talk.
But also if you have recorded material, this is forever.
It's published and you can refer to your previous materials, and you can also see some mistakes you've made and fix them later. I saw how other conferences outside the Postgres community did it over many, many years. And it's also beneficial for the conference itself, because this cloud of materials accumulated over the years attracts even more attention to the conference and increases its value. So I think conferences which don't record shouldn't exist anymore at some point. This is my strong belief. So as I've said, in San Jose I made an announcement that I'm not going to present anymore if the conference is not providing recordings. So I'm glad that you saw Tomas Vondra, right? You were there.
Yes, and it was.
And you saw it. But we have slides, we don't have video, unfortunately, right?
Yes, good point.
He was great and he published his slides.
I'm not actually 100% sure it's the exact slides from Postgres London,
but I think he's given this talk a couple of times,
so it's from one of them.
I hope he will give it one more time online.
I already asked to invite him to Postgres TV Open Talks series,
so I hope we will have a recording.
If he accepts, it would be great.
Wonderful.
Anyone listening, that will be on Postgres TV if we get it.
Right.
So we have Postgres FM, Postgres TV.
Quite easy to remember, right?
And since we discussed it a little,
everyone, please subscribe and help us to grow.
What can help?
Of course, if you subscribe, it can help.
And also, if you share links to Postgres
FM to particular episodes or just to Postgres FM webpage in your social networks and groups
where you discuss Postgres or engineering in general, it would be very helpful for us
to grow. So please do it.
Brilliant. Well, so block range indexes. We have our default index type: if I just create an index in Postgres, it will create a B-tree, which is the default. It's a great default, it's got a lot of benefits, it's brilliant, but we do have other options, and one of those is BRIN.
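For illustration, a minimal sketch of the two options being contrasted here; the table and column names are made up:

```sql
-- Default index type: B-tree
CREATE INDEX events_created_at_btree ON events (created_at);

-- Block range index (BRIN) on the same column
CREATE INDEX events_created_at_brin ON events USING brin (created_at);
```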
Yeah, speaking of B-tree, I think in every database system it's the default. And this is my favorite question, but also a very tricky question when you interview some engineer. You feel it's quite a good engineer, but if you pull this question out of your set of prepared questions, like, what's a B-tree, let's discuss the structure and balancing. In many, many cases, the interview is over, unfortunately.
So I think every engineer should know it, but there are many, many engineers who don't know, unfortunately, what a B-tree is. But B-tree, of course, is the de facto standard for database systems. But it's not enough in many cases. It's not enough in terms of performance, and a B-tree can degrade over time quite a lot.
Or it can be a big one in terms of size occupied.
And also when we talk about size, it's not only disk space occupied, but also part of
buffer pool, the cache state. So if the index is huge, it means your buffer pool needs to keep more pages, more buffers
in the pool.
Right.
I just wanted to provide some remarks about B-tree. Also, a meta remark additionally: we have got very good feedback. Thank you to everyone for the feedback. It's very important for us to hear what you think about our show and ideas. And we also had a couple of guys mention that we do quite basic material, right? Let's do some more hardcore. But I'm 100% sure that there are many, many people who need basic material. So I think we need to continue to talk about some basics, but sometimes trying to go deeper, of course, right? So that's why I talk about B-tree: if we jump straight to BRIN, well, B-tree is the default, so it's important to understand how it works.
Well, and Brin is quite advanced, I would say. I think a lot of people can go a long way using Postgres
and there's not a sensible case for using Brin
or there's no need to know about it for a long time.
And you mentioned already a few of the times
where B-trees can get cumbersome.
And I think the big one here is size.
Just, you know, if you've got a very, very large table and you're indexing a column on it, it can be a very, very large index. And then there's bloat.
Yeah, absolutely.
And then the other thing that got brought up by Paul Ramsey in his blog post, which I hadn't actually considered before, was that there can be a difference in write overhead, with B-tree having a higher write overhead on average than some other index types.
So that's one other downside sometimes.
Yeah, this famous index write amplification, one of the reasons Uber went from Postgres to MySQL, right? So each time we update a row, all indexes should be updated, unless it's a HOT update. So it's a big problem, of course. If we have just one index, it's just one update in addition to the heap update. But if we have 10 indexes, the overhead becomes bigger and bigger. That's why you need to get rid of unused indexes and redundant indexes, right?
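As a hedged sketch of how you might look for those candidates, using the standard statistics view (assuming statistics have been accumulating for a while):

```sql
-- Indexes that have never been used for scans since statistics were last reset
SELECT schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```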
Yeah. But the one thing I hadn't considered was that different index types could have
massively different overheads to one another.
Well, it's very common, for example, to have issues with updating GIN. There are special options, like fastupdate and so on. So GIN updates are very heavy, expensive. But B-tree, it's like a medium level of overhead in terms of updates, but BRIN is definitely light, super low.
Light, super low, right. Do you think it's fair to say that the biggest advantage of BRIN indexes is their size? They're at least an order of magnitude, often two orders of magnitude, smaller than a B-tree index.
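One quick way to see the size difference being described, assuming the two illustrative indexes from the earlier sketch exist:

```sql
SELECT pg_size_pretty(pg_relation_size('events_created_at_btree')) AS btree_size,
       pg_size_pretty(pg_relation_size('events_created_at_brin'))  AS brin_size;
-- On a large table, the BRIN index is typically kilobytes to megabytes,
-- while the B-tree is often gigabytes.
```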
Well, right. So if you don't know about B-trees, you should definitely start with the Wikipedia page as soon as possible, if you consider yourself an engineer, right? But what is a BRIN index? It's quite a simple structure. Like in B-tree pages, we have links to tuples, but not direct links for each tuple: links for ranges. So we describe: this page in the heap has tuples starting from this ID or timestamp or something, starting with this number and going till this number. So everything in between is there. Of course, multiple pages can be referenced in this way. By default, 128 pages can be referenced for one range, but it can be adjusted. What's interesting here is that this index directly dictates that you should think about the physical layout, because a B-tree doesn't directly do that. Of course, we know that if, for example, you have some value or range, and the B-tree says these tuples are located in those heap pages, and for example we have a thousand tuples and each one is in a separate page, a thousand pages, it's called sparse storage. You can run the CLUSTER command and reorganize your heap, so the storage will be according to this index, right? And in this case, probably they will all go to a few pages. So fetching such tuples will be extremely fast compared to this sparse storage.
But a BRIN index directly says: I'm useful only if your tuples in the heap are stored sequentially, according to ID or timestamp. So in this case, for a particular range, only a few pages will be referenced, right?
The way I've heard it described is that it's all about correlation. If they're exactly in order, 100% correlated, it's optimal and you can get really good performance, because the index can be used to filter out the vast majority of the blocks from being scanned, and it only has to look at a small number. But as it gets less correlated, performance degrades, and if there are some exceptions that make the ranges wider than they would ideally be, it can degrade quite quickly.
And overlapping also.
Yes.
Overlapping, right.
So, like, okay.
But yeah, not overlapping,
but the same range can be used many times
because more than 128 pages are referenced for this range,
so it's not good.
But, like, how can we check the physical storage aspects? There is a hidden column called ctid. You cannot create a column called ctid, right, because it's a system name, it's reserved. But what ctid is: it's two numbers. One number is the page number, and the second number is the offset inside the page. The first number is the important one. And there is a trick for how to extract it, for example to count the distinct number of pages for particular rows. It's easy: you can convert it to a point, I mean (x, y), and then you can take only the first argument. In this case, you will get only the page. This is how you can extract only the page number from the ctid. So you can always check your data and see the ctids, or just the page numbers, with this trick.
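The trick just described, as a minimal sketch (the table name is made up):

```sql
-- ctid looks like '(page,offset)'; casting via text to point lets us grab the page number
SELECT ctid,
       (ctid::text::point)[0]::bigint AS page_number
FROM events
LIMIT 10;

-- Count how many distinct heap pages hold a given set of rows
SELECT count(DISTINCT (ctid::text::point)[0]) AS distinct_pages
FROM events
WHERE created_at >= now() - interval '1 hour';
```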
You can check and see this correlation, right? If you have a sequential ID, or a timestamp which is filled by something like now(), or better in this case clock_timestamp(), because now() is generated only at the beginning of the transaction. So if you have some batch insert of a thousand rows, all of them will have the same now() value, but clock_timestamp() will generate a timestamp for each row inserted, right? And you will have a difference in values, right? And in this case, you can just check: select ctid, convert it to a point and get the first argument, and then id or created_at. And you can see if there is correlation. I think you can even apply some functions to prove that the correlation is strong, right? Maybe it's a good exercise.
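A sketch of that kind of check, plus the planner's own view of correlation; names are illustrative:

```sql
-- Correlation between heap page order and the column value;
-- close to 1.0 (or -1.0) means the physical layout follows id
SELECT corr((ctid::text::point)[0], id) AS page_vs_id_correlation
FROM events;

-- Postgres also keeps a correlation estimate in its statistics (after ANALYZE)
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'events' AND attname IN ('id', 'created_at');
```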
But the cases I hear about, where people are actually using these in the real world, tend to be cases where you're pretty sure it's correlated already, because you've been inserting timestamp data in order. You know, maybe it's a sensor reporting data, and you never update it, you never delete it, it's all going in order.
And you can delete. Delete is okay, only updates matter.
Good point, good point. Yep. So you can check for sure, but equally, that's the case where it's most likely to be relevant. Or at least until the latest version, I think those cases were pretty much the only good use case for BRIN, when you had...
Append-only tables.
Exactly.
Yeah. Let me describe once more. As I've said, I'm trying to describe some basics.
Yes. Good luck.
So once you've learned what ctid is, I recommend you to create some table with a surrogate primary key, like a sequentially generated number or a sequence.
And then, for example, we have a row where ID is one.
And we see that it's inside page zero.
Offset zero.
Okay.
But then I recommend you to execute UPDATE this_table SET id = id WHERE id = 1 and see what happens with the ctid.
This is very interesting, because it may be unexpected for you: you will see that the ctid value changes. Sometimes the page is the same, if there is space inside the page, but you see how a tuple, which is a row version, a physical row version, is generated. A new tuple is always generated, even if logically you didn't do anything: id = id means we don't change anything, right? But you see how a tuple is generated. That's why updates can shuffle your data, right? Correlation can get worse over time.
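The little experiment being described, sketched end to end (names are illustrative):

```sql
CREATE TABLE t (id int PRIMARY KEY, payload text);
INSERT INTO t SELECT g, 'row ' || g FROM generate_series(1, 1000) AS g;

SELECT ctid FROM t WHERE id = 1;   -- e.g. (0,1)

-- A "no-op" update still creates a new physical row version
UPDATE t SET id = id WHERE id = 1;

SELECT ctid FROM t WHERE id = 1;   -- a different ctid: same page if there was room, otherwise another page
```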
Yes, let's use an example: I have some really old data, and maybe I found out the sensor was incorrect or had a bug in it, and some of the data needs to be updated. If I update some of that old data, what Postgres will really do is insert new rows at the end of the heap and then mark the...
Or in some pages where there is a space.
Yes, good point.
Yeah, good point.
But not in the same...
Yes.
But then that reduces the correlation of that table.
Exactly.
Brilliant.
Yeah.
Or, like, we have some import of old data we were missing. Not only updates: an insert can be a problem as well, if we insert old data, like a postponed insert, right? Delete is not a problem. I don't see a problem with deletes, but postponed inserts and updates do matter. And also, you know what? Probably a problem with repacking.
If you run pg_repack, pg_repack allows you to achieve the same effect as CLUSTER, but without downtime, without blocking for long. So if some DBA, for example, didn't notice that there is a BRIN index which requires this correlation with physical storage, and decided to perform repacking using pg_repack with clustering according to another column or index, you will get very different physical storage, and the correlation will not be good at all for your BRIN index.
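Roughly what that scenario looks like. CLUSTER is the in-core command; pg_repack has an ordering option that, as far as I understand, achieves a similar layout online (all object names here are illustrative, and the exact pg_repack flags should be checked against its docs):

```sql
-- Rewrites the heap in index order, but holds an exclusive lock for the duration
CLUSTER events USING events_name_idx;

-- The online equivalent with pg_repack (shell command shown as a comment):
--   pg_repack --table=events --order-by=name mydb
-- Either way, the heap is now correlated with "name", not with id/created_at,
-- so a BRIN index on created_at will suffer.
```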
Right, interesting. I thought pg_repack would help. Have you heard of pg_squeeze? That's the one I'm aware of that's an alternative to repack.
I've read about it, checked it, but I mean, I didn't use it yet. It's an interesting idea to use logical decoding instead of the pg_repack approach with, like, substitution of the table. pg_repack's idea with the delta table is really interesting, right? pg_repack writes all changes to some delta table, and then sometimes it's a challenge to apply them, because the changes from this delta table need to be applied in a single transaction.
So if a lot of changes are accumulated,
it can be a problem.
But if you use repacking with the cluster option, using a different index... For example, you may have, I don't know, some name column, right? Alphabetical. And you want to reorganize your heap. You have pages, you need to present the data with pagination ordered by name, and you think this is the most useful use case for you, so you decided to reorganize the heap. So the data in the heap is ordered according to name. So there is correlation with name, not with ID or the created_at timestamp. And your BRIN index in this case will not perform well.
Unless it's a BRIN on name.
Right.
But I've actually thought of a problem with delete as well. Once you've deleted those rows, let's say in the middle of your heap, and it's vacuumed, later new inserts are now going to go in the middle.
Yeah, so now we've got a problem again. But anyway, agreed, this is interesting, because I think it slightly takes us on to the improvements that Tomas was talking about in Postgres 14.
Because we discussed this. Yes, let's do that.
So just to make some conclusion
before we discuss improvements in Postgres 14.
So it means that before Postgres 14,
brin can be used only
if you definitely have append-only pattern for your table,
log-like table.
You log some events or some data from,
like some telemetry data from somewhere
and so on, right?
In this case, BRIN can be useful.
Or it degrades quickly
and you need to re-index regularly.
Like, I guess there are some use cases
if you do update data and then re-index,
I guess you have some benefits still,
but it degrades quickly, and maybe there isn't the benefit, maybe you're better off without the index. Can re-indexing help?
Like, clustering can help, so you need to reorganize the heap.
Good point. Maybe. Yeah, maybe repacking and then re-indexing, or clustering. This is interesting, doing the re-indexing.
But it's light anyway.
Yes.
It's very light.
But they were popular.
That use case is not...
Lots of people have logging tables.
Lots of people do have this, at least for one table.
Yeah, but the problem is I always...
I tried at least three times over the last few years, since BRIN popped up, since it was created. Never decided to go with BRIN, never.
I don't know.
Like I didn't find it useful.
Like, okay, the size of it is quite small.
But on large volumes of data, it always performed worse than B-tree for me, much worse. Maybe because of the cases I had. Both materials, and we will provide links in the description, mentioned that for point searches, like when you need only one row or a couple of rows, BRIN is not attractive. Maybe I was closer to that: I needed only 20 rows for pagination and so on. But when you need to find many, many rows, like thousands of rows, BRIN can outperform B-tree. I just never... Always, when I need to decide what I should use, I do experiments. I'm a huge fan of it, I'm building a company on top of the idea of experimenting. Always, always, always experiment, for learning, for making decisions. The best case is if you can experiment with production data, or something similar to production without PII, or without real data. But I never decided to use BRIN, because I saw it as less performant compared to B-tree, even for append-only tables, only inserts.
I'm the same.
I've only ever seen B-tree outperform BRIN in terms of raw query performance. I think there's a good example in the Crunchy Data blog post where, in lab conditions, BRIN can outperform B-tree, but
in the real world...
Let me give a disclaimer here.
Last time I was like a bad cop
guy who criticized a lot
of various stuff. I'm going to continue.
So I see value in criticism, but I'm going to be polite.
Maybe sometimes not very polite, because I often have some opinion, quite strong opinion.
But I hope nobody will be offended.
And it's just for improving things,
not for damage, right?
And I also can be sometimes very wrong
and I quickly admit it
if I see evidence that I'm wrong.
But in this case,
I want to criticize both materials
from Paul Ramsey and Tomas Vondra.
I cannot understand how we can talk about performance and layout and so on, and provide plans without BUFFERS. It's a huge mistake, simply a huge mistake, because we discussed how BRIN can be better in terms of performance than B-tree, and we talk about some timing which has a lot of fluctuations, right? It's not reliable. At least you need to run it multiple times and take some average if you talk about timing. And also, Paul Ramsey's blog post makes some conclusions based on just single data points. For example, like 10 rows, 100 rows, 1,000 rows, 10,000 rows: okay, we see BRIN is better. I'm not convinced. I'm not convinced because I saw it for huge volumes too, like a million rows. Seriously, a million rows is nothing today.
Test at least a billion rows, right?
Sorry, maybe I'm offensive, but I just want everyone to have better materials.
It's great to discuss performance, but don't do plans with just EXPLAIN ANALYZE. Do plans with EXPLAIN (ANALYZE, BUFFERS). We will see the I/O. I/O is most important. All indexes exist to reduce the number of I/O operations. An index is all about reducing I/O, any index. Instead of making a lot of buffer hits and reads, we want to have a few buffer hits to fetch one or a thousand rows. We don't want to have a million hits and reads when we need only thousands of rows. But when you talk about timing, I don't understand where this timing comes from. When I see buffers, I understand: oh, this timing comes from huge numbers of buffer reads and hits.
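What that looks like in practice, as a sketch; the query and names are illustrative:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM events
WHERE created_at BETWEEN '2022-07-01' AND '2022-07-02';
-- In the output, compare the "Buffers: shared hit=... read=..." lines between
-- the BRIN and B-tree plans, not just the execution time.
```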
Yeah, really good point.
And I think actually there's a couple more.
So the reason I always used to think, until the blog post and until a couple of things that Tomas said, that BRIN couldn't outperform B-tree is that it only has the ability to do bitmap heap scans, or bitmap scans in general.
Oh, yes.
And a B-tree can also do a bitmap scan. The only slight downside for the B-tree is that you have to build the bitmap, so there's slight overhead there, but it's also efficient: it's looking at exactly the blocks it needs.
When you say you need to build it, you mean the executor, like Postgres, not the user?
Yes, sorry, not the user. I was thinking the executor has to build it for the B-tree index, but it doesn't have to for the BRIN index. But with a BRIN index, I think you'd always get false positives back that you have to filter out. You're always going to get some rows on those blocks that you're going to then have to get rid of. So I didn't understand how it could be more efficient, but the main advantage, I always thought, was the size,
so instead of having a multi-gigabyte index that
also then takes up multiple gigabytes in the cache and all those things, you could have an
index that's normally in the kilobytes or even for very large tables in the megabytes. So I always
saw that as the main advantage, rather than performance.
You know, if, for example, we just created a table, filled it with data and forgot to run vacuum, and I think the Crunchy blog post shows this, in this case you will have a plan with a bitmap index scan, right? And in this case, I can imagine that BRIN will be efficient if you have a huge table and you need to find and fetch a lot of rows. It should be there. But I would definitely look, first of all, at the type of operation, like bitmap index scan. And also, I would look at buffer hits.
I've just remembered.
Yeah, I agree.
One last thing that's a bit unfair with the Crunchy Data blog post, and I know it makes it a slightly fairer comparison, but it's unfair for the real world: they also disabled index-only scans, because BRIN doesn't support them. To make it a more apples-to-apples comparison, that's disabled, but in the real world you might sometimes get an index-only scan, especially for data that isn't changing.
Michael, Michael, Michael, you were supposed to be the good cop here. Why? I didn't agree to this. Why?
Okay, yes, but those are all really good points. I'm really happy that we've brought those up. I actually do think they're good content, and I'm glad that they're sparking a discussion here. The other thing, if we're on the more advanced side: I do think BRIN has improved a lot in the last version, and I didn't know that, well, I wasn't aware of how much it had, until Tomas's talk. The one thing I really wanted to bring people's attention to is a new operator class, which I'm aware is slightly advanced. By default, BRIN indexes still behave pretty much exactly as they did in Postgres 13 and 12, I believe, but if you change a setting while you're creating the index, if you create it slightly differently using a new, well, there's a couple of different new operator classes, but the one I'm particularly excited about is min-max-multi.
And my understanding of that, and this might be flawed,
is that instead of only maintaining a single minimum and maximum
for each block range,
it can instead maintain multiple minimum maximums.
Now, the big advantage of that, well, two big advantages: it can support different types of correlation. So if your data is laid out oddly, it may not be the timestamp example, but actually the timestamp example is great: if we inserted some old data, we would now be able to have two min-maxes, or probably not two, probably lots more than that, but we could have the new data and the old data and a big gap in the middle, where Postgres knows it doesn't have any data for that big gap.
So at worst, they degrade much less badly,
but I think it's a huge, huge improvement that could support
a lot of different types of correlation as well.
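A hedged sketch of creating such an index on Postgres 14; the operator class and parameter names below reflect my understanding of the PG14 docs, and the table is illustrative:

```sql
CREATE INDEX sensor_readings_ts_brin
  ON sensor_readings
  USING brin (recorded_at timestamptz_minmax_multi_ops (values_per_range = 32))
  WITH (pages_per_range = 64);
```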
And I'm interested why,
well, I think it could be so useful
that it should be the new default for Brin,
but I do understand that defaults are difficult to change
and you might not want to affect existing users.
To me, it sounds like a game-changing change.
And I'm looking forward to testing it once again.
Unfortunately, I still don't have any real big production with Postgres 14, but once I have it, for me it's a reset of my opinion about BRIN indexes. Because, as I've said, in the past I made decisions not to go with it, and next time I would probably not have spent any time double checking. But now I definitely will double check, for the cases when this correlation is not perfect and we have some, not intensive, but occasional updates or deletes or postponed inserts,
as we discussed.
Yeah, it sounds interesting. And I also see, to be fair, Paul Ramsey's article doesn't describe these improvements, but it talks about the pages_per_range option that was available before already, and you can try to play with this parameter and see how performance in your particular case is affected, negatively or positively. It also mentions the pgstattuple extension, to check the physical layout of pages and see what's happening under the hood. This is very, very good. I mean, reminding people about these capabilities is helpful if you run your own experiments. So, a good thing.
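Both of those, sketched briefly (the table name is illustrative):

```sql
-- pages_per_range is a BRIN storage parameter (default 128)
CREATE INDEX events_created_at_brin32
  ON events USING brin (created_at)
  WITH (pages_per_range = 32);

-- pgstattuple shows the physical state of the heap: live/dead tuples, free space, etc.
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('events');
```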
But Tomas Vondra, discussing the improvements, also shows some good examples. And I noticed these operator classes, it looks like they are available for various data types, right? And UUID is there. This is interesting, because UUID is usually considered random, the physical distribution is awful, right? And BRIN for UUID is not good, because it's not ordered.
I think there are some UUID versions that are ordered...
Yeah, okay, but I'm wondering if that's why they've been included.
Yeah, actually, I now remember we use some in our tool, some almost ordered ones, right? But many versions are not, like UUID v4. I don't remember these things. They don't look ordered at all. And it's interesting to check how the improved BRIN index can perform on this data type. So it should be interesting not only for cases which are not a fully append-only pattern, but also for these kinds of data types.
Yeah.
In person, actually, Tomas made a really good point that this is actually a really good area for new people who want to contribute to Postgres to explore.
He has got a few ideas of improving these further.
And indexing in general is quite an isolated part of the code base.
You don't have to understand everything about Postgres to be able to improve things.
And Brin indexes especially, they're isolated.
They're not the default, so they're not as contentious.
So I think there's a few good reasons why this would be a good area to get involved in. And I think Tomas mentioned being willing to help people out if they do want to get started on that kind of thing.
So that would be awesome.
Good first issue, right?
Yeah, right, the label.
If Postgres used GitHub or GitLab,
this label would be there in this issue.
But no.
Definitely a topic for another day.
Postgres doesn't use them.
Yes.
Oh, yeah.
Yeah, we have several topics like that.
Yeah, good, good.
So I find these two materials useful for reminding you what you can use in your experiments.
But as I've said, I think everyone should experiment
with their own data and queries.
If you need to make a decision, don't trust blog posts blindly. Experiment with your data and your queries. Even if you don't have data, you can generate some, as close as you can to how you think the future will look, and also queries, and then you can start experimenting.
Absolutely. My big piece there is always to make sure you have roughly the right number of rows. Everything else matters a bit, but the raw number of rows, in the right order of magnitude, makes a big, big difference to which query plans you're going to see and to overall performance. So if you do nothing else, please at least insert an appropriate number of rows for testing.
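A small sketch of generating an appropriate volume of ordered test data with generate_series (names and volumes are illustrative; scale them to match your production):

```sql
CREATE TABLE events (
  id         bigint GENERATED ALWAYS AS IDENTITY,
  created_at timestamptz NOT NULL,
  payload    text
);

-- 100 million rows with ascending timestamps, one per second
INSERT INTO events (created_at, payload)
SELECT timestamptz '2022-01-01' + g * interval '1 second', 'test payload'
FROM generate_series(1, 100000000) AS g;

ANALYZE events;
```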
And I need to add here,
on two sides,
number of rows on two sides.
First number,
what number of rows stored in table?
And second,
number of rows you need for your query.
Sometimes there is no limit in query
and then your data is growing
and this is a big mistake many people often do.
So not limiting the result set.
So you need to think how many rows your users will need.
They won't need probably a million rows
if you present these rows on some page on a mobile or web app.
So you need to think about limit and pagination
and so on.
Absolutely.
Right, I think we're done.
Yeah, good. Nothing left, at least in our heads. Maybe some people have some additional thoughts; I would love to see them in the comments or on Twitter, probably.
Yeah. And, well,
I think we probably could talk about it for a bit more
like the new bloom operator class seemed really interesting, and the min-max thing is also configurable.
Oh yeah, but I'm also conscious of time, and I think that's probably verging a little bit advanced. But there are some really cool things happening in BRIN, so if you haven't considered it for a while...
Please do.
Upgrade if you can to Postgres 14.
What else?
Test 15.
Yeah, right.
If you're brave.
Test Postgres 15.
For testing, you don't need to be brave.
It's not production.
Testing can happen elsewhere.
Very true.
I'm not saying upgrade your production to 15 beta 2.
It's probably too brave.
Brave is not quite the right word, is it?
Awesome. Well, thank you everybody
for joining us. And yeah,
send us your feedback. Let us know what you'd like
discussed. And yeah,
thank you, Nikolai. Hope you have a good week.
Thank you, Michael. See you next week.
See you next week. Bye.