Postgres FM - Skip scan
Episode Date: September 6, 2024

Michael and Nikolay are joined by Peter Geoghegan, major contributor and committer to Postgres, to discuss adding skip scan support to PostgreSQL over versions 17 and 18.

Here are some links to things they mentioned:

Peter Geoghegan https://postgres.fm/people/peter-geoghegan
Peter's previous (excellent) interview on Postgres TV https://www.youtube.com/watch?v=iAPawr1DxhM
Efficient Search of Multidimensional B-Trees (1995 paper by Harry Leslie, Rohit Jain, Dave Birdsall, and Hedieh Yaghmai) https://vldb.org/conf/1995/P710.PDF
Index Skip Scanning in Oracle https://oracle-base.com/articles/9i/index-skip-scanning
Peter's introductory email to the hackers mailing list about adding skip scan https://www.postgresql.org/message-id/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
Loose Indexscan versus Index Skip Scan (PostgreSQL wiki) https://wiki.postgresql.org/wiki/Loose_indexscan
Tom Lane will be on the Talking Postgres podcast on October 9th https://aka.ms/TalkingPostgres-Ep20-cal
Benoit Tigeot's feedback and repro (originally reported via Slack) https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4597410#gistcomment-4597410
Summary video and blog post about the v17 work by Lukas from pganalyze (not mentioned but great) https://pganalyze.com/blog/5mins-postgres-17-faster-btree-index-scans
Understanding HNSW + filtering (pgvector repo discussion) https://github.com/pgvector/pgvector/issues/259
btree_gin https://www.postgresql.org/docs/current/btree-gin.html

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and I'm joined as usual by Nikolay, founder of Postgres.AI.
Hello, Nikolay.
Hi, Michael. How are you?
Let's not miss the fact that we have a guest today.
We didn't miss it. You've thrown things off by saying, how are you?
And I'm British, so I have to reply, and how are you?
Well, let's blame the US.
This is where I learned to ask.
So I'm super glad we have Peter today.
Hi, Peter.
Hi.
Good to be here.
Thanks for having me on the pod.
We are delighted to have you.
And it's Peter Geoghegan.
Hopefully everybody knows of Peter.
But in case you don't, he's a major contributor and committer to Postgres for many years, the last three or so of them at AWS. So today we're going to be talking about some of
Peter's recent and upcoming work on Postgres related to skip scan implementation, but it's
far more than that. And honestly, the more I learn about this work, the more excited I am about it.
I thought I knew quite a lot, and I've been reading up in the last couple of days, and it's even better than I realized. I'm probably cheating or skipping ahead far too much, but I want people to understand how big this is. This could affect our indexing strategies going forwards; we could be choosing to index Postgres differently. I guess it's obvious in hindsight, but there are some real major differences.
So yeah, I'm super excited about this work.
I want to thank you for having started it
and continuing with it.
I'd be interested to hear though,
what attracted you to this work
and did you realize how powerful it would be
when you first started on it?
Not at first, no.
When I last did a similar interview to this,
I think it was the predecessor podcast,
technically, not this one. At that time, I happened to talk about how, for whatever reason,
I kind of like finding as many different ways as possible of looking at a thing. At the time,
I think I said something like, well, there's a bunch of different perspectives. There's the
data structure perspective, there's the concurrency control perspective, and on down the list. And it was that sort of general mentality that brought me to this work, because I realized, hey, I'm missing something here. This tendency I have, to sort of collect these different ways of looking at the same familiar old topic, was for a long time missing this extra bit, which I think is embodied by the work I'm doing now and its source material, which is a pretty obscure 1995 paper.
It happens to be called Efficient Search of Multidimensional B-Trees.
So you haven't heard of it, but hopefully that's not any indication of its real world relevance.
It has been something that I believe has influenced a bunch of other systems.
So it's in that sense, I think, pretty standard stuff.
But the implications, maybe, as you were saying, Michael, are not fully appreciated.
So maybe we'll start with a general review for those that don't know
what skip scan is.
It's a feature that certainly Oracle and in recent versions,
MySQL and also SQLite have.
They all call it skip scan.
I think in DB2 it's called jump scans or something like that.
So the basic idea is to extend the standard B-tree index access method
to support new types of querying without changing anything about the on-disk contents,
without fundamentally altering anything.
This is possible with multi-column indexes, composite indexes.
The simplest case, and the one that gets the most attention, is where you have an omitted
prefix column in your query. So for example, you do a select from a table where b equals,
let's say, five. Here we would have a composite index on (a, b); a, of course, has been omitted from the query predicate. So traditionally, in Postgres at least, you wouldn't be able to really use the index. You might do a very inefficient full index scan, where you sort of have to grovel through the entire leaf level of the index. And that works, but obviously that's not very efficient. What skip scan will allow you to do here, assuming that there are not terribly many distinct values in column a (as a rule of thumb, I would say hundreds of distinct values total, or perhaps thousands, is about the upper limit), is effectively use that composite index as a collection of what are sometimes called logical sub-indexes.
So, for example, the first value might be 1 in the column A.
So we'll do A equals 1, B equals 5; A equals 2, B equals 5; A equals 3, B equals 5.
And exhaustively, we'll do multiple small index scans that are effectively pretending to be one continuous index scan.
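To make the shape concrete, here is a minimal sketch of that case (the table and values are hypothetical, purely for illustration):

```sql
-- Hypothetical setup for the case being described: a composite index
-- on (a, b), with a query predicate only on b.
CREATE TABLE t (a int, b int);
CREATE INDEX t_a_b_idx ON t (a, b);

-- The leading column a is omitted; traditionally this meant a full
-- index scan or a sequential scan.
SELECT * FROM t WHERE b = 5;

-- What skip scan effectively does, conceptually: one tiny index scan
-- per distinct value of a.
--   SELECT * FROM t WHERE a = 1 AND b = 5;
--   SELECT * FROM t WHERE a = 2 AND b = 5;
--   SELECT * FROM t WHERE a = 3 AND b = 5;
--   ... and so on, exhaustively.
```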
And that will certainly be much more efficient than a sequential scan in most cases. Of course, it's true that you'd be much better off having an index on column b directly. But it's also true that, who says that's going to be possible? You might just not think to do it. Or, in some cases at least, it might not really be feasible. It very much depends on the user's requirements.
This is particularly likely with things like data warehousing use cases, the fact table, where a lot of the querying is sort of ad hoc, and it's not really feasible to have all the indexes you might possibly require ahead of time. So here, it's making a composite index more useful for a wider variety of queries. If you only require adequate performance here, sub-second performance is attainable.
It's not going to give you perhaps sub-millisecond performance,
although that does sometimes happen,
but it's still, for these kinds of use cases, almost as good,
and yet it allows you to avoid creating
a bunch of quasi-redundant indexes.
So in that sense, as you were saying, Michael,
it's kind of like it at least would require
some of the conventional wisdom to be updated.
I wouldn't say it's a radical shift,
but there is a sense in which,
okay, in light of this new capability, we're going to have
to rethink some of the rules of thumb we have for indexing in general. I think that's likely to work
out that way. I would say I'm excited to see how people end up using the feature in practice.
Many things are possible. So that example I gave is a sort of very traditional, simple example. But it could also be that you're not enumerating every possible date in the entire table, but just those dates from a range; the same exact idea still applies. So for the 1st of August, plus whatever the actual thing that appeared in your predicate in the query is (let's say, again, it's b equals 5), then for the 2nd of August, then for the 3rd of August.
So from the query perspective,
you haven't omitted the column.
The column appears in this between range,
but it is effectively the same idea.
It also should be pointed out
that these are all very general techniques.
They can be combined
in all kinds of different ways.
And that's where
it becomes super interesting: you have very, very complicated combinations of these things.
The example query that I used in my initial email to the hackers mailing list, to sort of sell it to people, was taken from the original source paper. And in that instance, I believe, you have an initial omitted prefix column, which is on a column called departments, then you have a range, then you have an IN, and then you have another IN. So it's all kind of working together. And that's where I think it can make a huge difference: when you combine all these techniques together, which you will be able to do freely.
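The column names below are made up, not the actual example from the paper or the email, but they show the shape being described: an omitted leading column plus a range plus two IN lists, all served by one composite index.

```sql
-- Hedged reconstruction of the combined case (hypothetical schema):
CREATE TABLE sales (department text, sale_date date, region text, product int);
CREATE INDEX sales_idx ON sales (department, sale_date, region, product);

SELECT *
FROM sales
WHERE sale_date BETWEEN '1995-06-01' AND '1995-06-30'  -- range
  AND region IN ('NE', 'SW')                           -- IN list
  AND product IN (1, 2, 3);                            -- another IN list
-- department never appears in the WHERE clause; skip scan enumerates
-- its distinct values, and the other techniques combine underneath.
```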
So there, effectively, you have this query that would conventionally require a huge and inefficient index scan, which is just kind of a non-starter.
What you actually would have to do is have a bunch of single column indexes and use some kind of bitmap or something like that.
But then you give up the ability to have the sort order from the index.
You can't do a limit five or something like that.
You can't use a group aggregate directly, et cetera.
Whereas this way, you can pretty well convert what you would think of
as one big query into what you would think of as many thousands
of tiny queries that each
land on one leaf page, again and again. It's sort of a lot of small queries pretending to be one big one. It won't necessarily be apparent to you, looking at EXPLAIN ANALYZE, that that's going on, but it is. So effectively you're rewriting the query as if it was just a bunch of tiny queries, in a way that is kind of fairly intuitively obvious. Like, it wouldn't be very practical to do this, but you could sort of do it by procedurally generating a bunch of these small queries yourself, in principle at least.
And that is the one sort of core clever idea here.
It's surprising how far this one core idea will get you, because it is a fairly simple
idea in the end. But, you know, there's a bunch of things that I haven't even considered yet that
the paper also contemplates. For example, it has a part about not-in predicates. You would think
it would be impossible to use a not-in predicate for an index, right? But again, it's the same
idea. It's like skip scan, except rather than enumerating
every possible value in the index,
it's enumerating every possible value in the index
except these ones.
So it's basically the same thing again and again and again.
So it just buys you an awful lot, one core simple idea,
which I find kind of cool.
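A hypothetical sketch of that NOT IN case (this is the part of the paper not yet implemented, so the query below is purely illustrative):

```sql
-- Same skeleton as skip scan, but the enumeration excludes values:
SELECT * FROM t
WHERE a NOT IN (3, 7)
  AND b = 5;
-- Conceptually: skip through every distinct value of a except 3 and 7,
-- probing (a = <value>, b = 5) for each one.
```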
So that's the big picture.
Yeah, something that I am expecting to get into Postgres 18, and something that builds on work that started with Postgres 17.
So yeah, that's a great explanation. I think the multi-column thing, the complex example you mentioned, is the part that blows my mind and starts to make me think about all this advice we've given. You generally advise people against huge multi-column indexes, because they often then only target a very, very niche set of queries. And you end up accepting that you're going to do a large index scan with a bunch of rows being filtered, which we know is inefficient, but it's currently the best solution.
You lose HOT updates also.
I'm going to continue advising against too many columns.
Well,
okay.
But I can see the flip side. Once you only need fewer indexes, you have fewer to support. Like, let's say you've got five columns that are important here. Sometimes I've seen people index all five in three different orders to support different query patterns, and I feel like that kind of thing will become less important, if we even need it at all anymore.
So that's mind-blowing. But just to go back, your thought process is super interesting here as well. So did you come across this paper as part of widening your horizons and then think of all these implementation ideas? Or was it that you thought, this is an area, this is a blind spot for me, I should seek out papers in this area?
Well,
there's surprisingly little written about the topic, actually, even though, as I say, it's implemented in a bunch of other systems, unlike Postgres. I also think it's probably true that they don't exploit it to its full extent. The actual paper was written by people that were working on
a system called NonStop SQL in the 1990s, which
hasn't quite completely died yet, but it certainly has far
fewer users. It was a very influential system, though.
I think that it's likely that the full generality of it hasn't been exploited
for whatever reason in other systems.
My hope is that my implementation will do just that.
It's kind of weird.
The example complicated query I gave: what I noticed about that is, most of the time, if I changed the order of the columns, not always, but most of the time, it made hardly any difference.
Wow.
Right, which wouldn't always be true. But just given the particulars of the query that I'm talking about, it kind of didn't matter if I swapped them. It made maybe a small difference one way or the other. But that is a bit mind-blowing, that that could ever be true.
Again, it's going to be most compelling in cases where you don't really know what the requirements are ahead of time.
It's very good for ad hoc indexing, and you can always beat it by having the dedicated index, if you want to pay for that, but that's perhaps impractical for a lot of users. And again, I think, you know, on OLTP apps it is still useful, but it's more for avoiding the worst case, for something that you maybe should have anticipated but didn't, for whatever reason. It's good for that, more so than something that you would ideally plan on using. But it is relevant to both. As for why: I don't know. I think it entered my awareness probably five or six years ago, but I didn't fully understand it, and it took me a little while to figure out the right way of doing it. So, having explained all that,
it should also be pointed out that whether or not you actually want to rewrite your query in this manner, well, it all depends, doesn't it? It could be, for example, that the old way of doing it happens to be the most efficient one, because, you know, yes, you could,
in principle,
rewrite it to 100 different small queries, but those small queries would all be hitting the same
few leaf pages anyway. So you actually would want to do maybe not 100, maybe one or two at most
small queries there. How do you know ahead of time which it is? Well, you don't, right?
But you don't have to. So I've had this intuition that if I solved that problem first, if I solved it for IN lists, then after that I would be able to move on to this; that would be a useful foundation. So in Postgres 17 this has been implemented. It'll be available, I think, in three weeks, when we release 17, if it goes as planned.
So that took the existing capability for indexing IN lists and made it so that if you have 50 integers, say, and they're contiguous integers, like 1, 2, 3, 4, 5, through to 50, then, supposing they're all co-located on the same leaf page, you're going to do one index scan and read them all at once.
If, on the other hand, it turns out that they're spread out all over the index
because maybe it's low cardinality,
or maybe it's low cardinality just in that part of the index,
there's no rule that says it has to be one or the other.
It could be at different points in the key space,
low cardinality, others could be high.
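A sketch of the two scenarios just described (hypothetical table; the schema doesn't matter, the access pattern does):

```sql
-- Postgres 17 behavior for a long IN list on an indexed column:
SELECT * FROM t
WHERE a IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);  -- imagine 50 contiguous values
-- If the matching entries are co-located on one leaf page, this is a
-- single index descent that reads them all at once. If they are spread
-- out, it becomes several descents. The scan decides on the fly, per
-- part of the key space, rather than the planner deciding up front.
```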
So making a dynamic choice is very useful, I think,
because you don't have to know.
That's what makes it safe to aggressively apply these optimizations
because if it doesn't work out, well, in any case,
the B-tree scan code is able to sort of combine
the would-be accesses into one on the fly.
So you'll get the same behavior as in previous versions,
if and only if that makes sense.
So that sort of flexibility, that dynamic behavior
seemed like an essential kind of prerequisite.
So it's not actually true; it's just conceptually the case that you're rewriting
the big query to lots of little ones
and then combining the result,
allowing them to pretend that they are one big index scan,
even though they're not.
But the actual degree to which you do that,
how much atomizing actually happens,
is determined on the fly as the scan progresses.
So it's kind of indeterminate how many actual index scans you're doing, but that doesn't matter, if you get what I mean.
Yeah, it's so cool. It doesn't matter to the user; it's like one of those things that, as a user, I don't have to worry about or think about at all. As somebody who's implementing it, though, it feels like it makes it much more complex for you to build. Is that fair, yes or no?
I think there's two ways of looking at it. There's what you said, which is in an immediate sense true.
It is more complicated.
But in another sense, it isn't. Because: complicated to whom?
I mean, the optimizer has to model all these costs, right?
And it turns out that's hard.
And it turns out that statistics are, well, they're statistics.
And you're talking about, like, all these secret hidden correlations; these are multi-column indexes, of course. So we have statistics for each column, but, at least unless you go out of your way to create them, there are no correlated statistics for columns.
There's a generic assumption that these are independent, even though, well, very often they're not.
So having the access path be very forgiving, so that the optimizer gets things quite wrong but it actually doesn't matter, or doesn't matter very much at least, is way easier, at least from the point of view of the query optimizer. And on balance, that probably makes it much, much easier to actually, as a practical matter, make it work. I think that will probably turn out to be really important, although at this point that's pure speculation.
I had this on my list of
things that might come up. The cost models are super interesting, it seems to me. And I haven't checked for sure, but from looking at some examples, it seems like you might not have changed the cost estimations, at least in 17. You might consider that, because you're making a bunch of things more efficient in a bunch of cases, you might try to reduce the costs for those, to encourage those scans when they make sense. I guess my question is: have you changed the cost model, and if not, why not?
Yes, but I'm not surprised that you didn't notice it, because
the main change that I made to the cost model was that now it'll saturate. Because, very simple observation: the most expensive possible index scan ought to be a full index scan. Very simple: if you scan everything, there's a maximum, a pretty tangible maximum cost. At least in this world where this stuff is available, as it will be in 17, it is physically impossible to scan the same leaf page a second time now, no matter how the constants have been specified in the query itself.
Because automatically that just cannot happen.
It's a physical impossibility.
So we know that we can take it to the bank.
And so the cost model saturates.
There's an awful lot of uncertainty, except when it comes to the worst case.
We know for absolute sure that in the very worst case, we'll scan every leaf page once.
So that's where you notice it more.
So you can have an absurdly long IN list, and at a certain point it'll just stop adding additional cost, since it is physically impossible for the cost to keep going up even as you continue to add to the list. So that sort of plays into what I was saying about having some general understanding of what the worst possible case can be, when everything goes wrong, when the statistics are garbage.
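To illustrate the saturation being described (hypothetical query; exact cost numbers will vary):

```sql
-- As the IN list grows, the estimated cost of this scan grows only
-- until it reaches the cost of one full index scan, because in v17 no
-- leaf page can be read twice within a single scan.
EXPLAIN
SELECT * FROM t
WHERE a IN (1, 2, 3, 4, 5);  -- imagine thousands of values here
-- Past some list length, the total cost estimate stops increasing.
```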
That assurance, in the end, makes everything easier rather than making it harder, I feel.
It's also just useful. I like that it seems pessimistic but also safe.
That's it exactly. I think of statistics as being kind of inherently uncertain. It's amazing that optimizers work as well as they do, I sometimes say.
Like, people don't realize, I feel, often when something goes wrong, well, it was next to a miracle that everything was working as well as it was for as long as it was.
And probably things were not technically working optimally.
But you didn't know that, because you were happy with adequate performance. You didn't consider that it was technically 10 times slower than what could in principle have been attained with the same index and such. So, like, I think that if we're honest, we have to admit that there's an awful lot of uncertainty about the cost models.
And that doesn't necessarily matter.
More and more, I'm encouraging the idea, amongst people that are on the hackers list, that it's better to have a kind of, I wouldn't even call it insurance, that's too strong a word, but just to account for variation at runtime, dynamically, rather than doing so statically in the query plan. That just seems to me like something that has very little downside and
potentially quite a lot of upside. It is a more general idea; it doesn't just apply here. It applies, for example, I don't know, you can do some similar things with hash joins. It does get more speculative if you keep going down the line of things that are like that. But why would you trust an uncertain selectivity estimate if you didn't have to? There's no reason why you can't have some kind of way of recovering from a misestimation at runtime, at least in certain cases. I guess the projects I'm working on are related to that whole way of thinking, I feel.
Do you mind if I break the structure a little bit, as usual?
Go ahead.
Honestly, I didn't read the whole discussion.
So I'm going to ask some newbie questions.
First of all, I see there is potential confusion: there is loose index scan as well, right? And this is a very different thing, for distinct values. And we have a wiki page, and we're constantly using this technique. And we also mentioned, like you did, that other database systems have it implemented. Postgres doesn't.
But here we discuss skip scan, and this is very different, I see. But I just googled a little bit, and I found people use the term skip scan in that context as well, right? So if I'm right, some people reaching this point will still not fully understand what we discuss, right?
Oh, that's...
It's possible, right?
So we discuss the problem.
There is a problem.
We have index on columns A, B, but filter is only on B.
Kind of first layer of A is not relevant.
And usually it means this index cannot be applied well.
Right. It's worth pointing out that what I'm talking about only applies when you have at least two index columns, whereas I don't believe that's true with loose index scan, right? You can still usefully skip if there's only one column.
Because there we need distinct values; it's a different task. Here we need all values, but
we have an index which usually we would say is the wrong index, right? But your optimization will make it good, but only if a column has not many values, right? If it has a lot of values, maybe it won't be good, right?
It's certainly not going to be able to help there, because, of course, yeah, technically you can say, okay, there are a million, a lot of, distinct sub-indexes, but that degenerates to a full index scan, so it's the same. By the way, that might still be
the best plan, all things considered. We can't rule that out either. Sometimes that is actually
the best available plan. But overall, yeah, you're probably not going to want to use it unless you
have hundreds, perhaps low thousands, or occasionally maybe
tens of thousands, but after that, it's unlikely.
Will the planner decide, like, will it have some threshold to decide when skip scan should be applied and when it doesn't make sense?
Not as such, because as I sort of touched on already, there is no special skip scan
index node.
I think that's a bad idea, because it's all dynamic, right? It's an additive thing.
Yeah, now I'm catching up.
Sometimes, for all the usual reasons, you might want to have an index-only scan that happens to use the skip scan, or you might want a bitmap index scan, or you might want a parallel index scan, etc., etc. It isn't useful, I think, to have a distinct kind of index path. More choices means more wrong choices, from the point of view of the optimizer. So I feel like that's a distinct advantage of the approach I've taken. Of course, I still have to figure out a way of making it not matter when that's not the right thing to do, and that's not very easy, but I think I've got that part figured out.
We'll see.
Do you mean to, like, minimize regressions at times where...?
Yeah, okay. Because, again, occasionally it's going to be the case that the intrinsically best thing is to just scan the index without trying to skip at all. So you don't want to waste a lot of CPU cycles on a useless enterprise, that is, skipping when you shouldn't be skipping. You really do want to fall back on essentially the same behavior as previous releases there. Now, mind you, overall, hopefully the optimizer just wouldn't pick that way of doing it, because it would be wrong; it wouldn't be the fastest available plan given the available indexes. So that only applies at all when you're already doing something that's basically pretty slow. But overall, yeah, I think I'm going to have to make sure that you don't notice it when it isn't able to do what you would hope it would. That is sort of complicated, for sure.
Just to make sure I've understood: if we could do a scan without skips, if we had the perfect index for a very efficient, let's say, index-only scan, that would have been the index chosen in the first place, because its cost estimate would have been lowest. So already, when we are skipping, we're doing a scan that involves some skips because we didn't have a better index available. Therefore, yeah, okay,
that's really cool.
Yeah. In general, I think it makes more sense to think about these sorts of queries as being on a continuum. One extreme is where there are, let's say, no more than five or ten distinct values, and then the fact that you're doing five different descents of the index is hardly noticeable; it's maybe less than a millisecond in the end anyway. So that will be a very favorable case for skipping. And then you
have less favorable cases, I suppose, where the cardinality goes up. And eventually you get to
the point where you can't do any skipping at all. And that's where you don't want to pay a penalty
for being open to the possibility of skipping, if you like. Also, as I sort of touched on earlier, it could be, for example, that you have an index where 90% of the pages, in respect of the leading column, have three values total, but nevertheless the remaining 10% are all perfectly unique. So now you can't really average the two; it's just two different things. You can't think of it as the average of the two, really. So the fact that it's kind of continuous just makes sense. If the planner were to make a static choice, then how would you account for that variation, where you legitimately need to change your strategy within the same index scan in real time? That's totally possible; that could be the best thing to do. So by not having an opinion until you really have to have an opinion, at the last minute, you ideally get the best of both worlds. That's the thinking,
anyway.
I think it's super smart. And those distributions are quite common, we see them all the time, and that's where statistics most obviously falls flat on its face, because of averages. I know it does some things like most common values and such, but anyway, yeah, completely agree, that makes loads of sense. Sorry, you were going to say?
I was going to give another example of
what you're saying with statistics. I mentioned it briefly earlier: there's this thing, I think it's called the attribute value independence assumption.
It's this generic sort of assumption, made by all cost-based optimizers, that there's no correlation between different columns, which is demonstrably false all the time, but somehow mostly doesn't break that badly. But for something like this, you've got to think: there could be a very strong correlation between columns, or there could be a totally independent relationship between the two, and it's really hard to model that. So if you can just not do that to much of an extent at all and still get useful query plans, that seems ideal.
You know, there's just different types of indexes.
So, you have, for example, in the paper that I talk about, it's data warehousing.
So, you explicitly have different dimensions of a business that are obviously quite independent. But then you could also have something like, I don't know, supplier ID, first column, second column, discount code.
So it might be that different suppliers use the same discount code, coincidentally.
But even then, those two discount codes are completely distinct entities.
So it just doesn't make sense to think of them as being the same at all. You wouldn't expect a query to just supply a predicate that doesn't have a supplier ID; you would expect it to have a supplier ID, or have both. You wouldn't expect it to be omitted, because it just kind of wouldn't make sense. So it's impossible, or very difficult in any case, for the optimizer to understand those kinds of distinctions, even though they're obviously very, very common. So again, ideally it wouldn't have to. Ideally it would just be able to fall back on this sort of idea of, hey, we'll try to work it out at runtime, to the best of our ability. As long as we get the fastest query plan, everyone's happy. It doesn't necessarily have to be that exact, and the cost model in fact usually isn't.
So that's kind of the overarching philosophy of the thing.
And yeah, as usual with your work, it's so simple once you understand it, but I'm super impressed with the thought process to have got there. It feels like yet another feature where, as users, we don't have to think about how it's working or why it's working that much. Maybe with strategies later, in terms of indexing, we can take advantage of it, but we could carry on indexing exactly the way we are and we'll only benefit as a result.
Still, unless I mess up, of course. But hopefully, yeah. I'm not exactly playing to my strengths when I'm talking about the optimizer, you understand, but nevertheless that is kind of part of the philosophy. I'm not really an optimizer person, because I like being successful; that's probably why. But here, I think that, you know, undeniably there is some thought given to those aspects, because it kind of has to be.
What do you mean, successful?
Well, it's just really hard to do anything in the optimizer. That's work few people will do, and thank goodness someone is willing to do it. But, you know, it's a very difficult area to work in, I feel, which does seem to put people off more than perhaps it should. Perhaps that will change at some point in the future, but
you'd have to ask somebody else about that.
Yes, and that somebody else, I believe, is going on another podcast, which I'm excited about.
Oh, great. I think I know what you're talking about. Yes, for everybody else, this is Tom Lane going on the Talking Postgres podcast, which is scheduled to be live, I think, next month, so I'll share a link to that. Exciting times. So,
you've talked about it briefly already,
but another cool thing you've done here is split this work into some useful stuff
on its own for version 17 coming out soon.
I'd be interested to hear how and when you realized you could split it up into the IN-list optimizations, which I think are going to be very useful for a bunch of often-generated queries that people don't have that much control over; relatively long IN lists seem to pop up, at least from what I've seen, from ORMs normally. How did you work that out?
I'm not sure exactly when; I think there probably wasn't any one moment where I realized it. I remember talking about another patch, actually one for loose index scan, and just being very confused about that whole area myself: being very confused about what the difference is, where to draw the boundaries architecturally. And I recall asking, to nobody in particular at the time, how does this relate to IN lists, or what we would internally call scalar array op expressions, equals-any-array expressions? There must be some relationship here. At that time I was aware of the paper, the source material, and that strongly suggested as much. But I kind of thought, you can't really have an opinion about one without having an opinion about the other, right? You want to be able to combine these things arbitrarily. So that was when the initial awareness sort of sunk in.
Then I did some work on VACUUM over a couple of releases, and I came back to B-tree again.
And I think at that point I realized,
okay, this is probably the most useful thing I could do.
And then I kind of got a little bit lucky as well
with the project for 17.
I worked out a couple of things on the optimizer side
that I wasn't necessarily counting on.
That's more subtle.
It's stuff that's related to
whether or not we know
that the scan will return ordered results.
So I realized the relationship
between that question.
So in previous versions, you could do a query like: select from a table where a equals five and b in a long list, order by a, b, limit something. And it would be able to terminate the scan early, because, for reasons that are complicated and not that interesting, the implementation could trust the B-tree code to correctly return the results in order.
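The shape of that query, sketched with hypothetical names:

```sql
-- Because the B-tree code returns rows in (a, b) order, the executor
-- can stop as soon as the LIMIT is satisfied:
SELECT * FROM t
WHERE a = 5
  AND b IN (10, 20, 30, 40, 50)
ORDER BY a, b
LIMIT 10;
```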
I kind of had this idea that that was important, but I didn't know how important it was to everything else, so I kind of got a bit lucky in figuring that out. It was maybe about 2023 before I was even pretty sure that I would do the skip scan thing. But there was some serendipity: there was actually a user complaint, I think it was on Slack.
Oh, yeah, Benoit.
That's right.
I'll link that up as well.
Yeah, Benoit had a complaint. Basically, because of the limitation I just mentioned, the query plan, which was for such a query with a limit, couldn't do it without creating a way of filtering the non-matching rows that required heap access. So, for reasons that are not really fundamental or worth explaining, it did way more heap accesses just to exclude non-matching rows, when in principle it could have done that at the index level, before accessing the heap at all. So for very roundabout, hard-to-explain, and actually not that relevant reasons, that made a huge difference for his use case. I wasn't even thinking about that; because I'm a B-tree person, you know, I wasn't really thinking about avoiding heap accesses. But then
I found, oh, actually, that's by far the main reason to do it. I was thinking initially about just the simple fact that, hey, do you want to do the index descent one time or 100 times? Obviously you want to do it one time, if that's feasible. That's where I started. But then I realized, somewhat later, with some help, that, oh, actually, in some cases at least, like the one he showed, you can completely eliminate a huge number of heap accesses. And that was by far the most sort of flattering case for the 17 work, where I think it was like 8, 20 times faster, something like that. It doesn't really matter what the number is; just way, way faster, because you're fundamentally doing way less I/O. So I didn't necessarily count on that happening at all, and then it kind of did, without my initially realising it, which was nice. Serendipity did play a role; it often does.
Very cool. Nikolay, do you have any questions?
Of course, I have comments and questions. All of them are silly, maybe. Once again, we should avoid this confusion: loose index scan is not skip scan. I saw posts mixing these names, but I double-checked Oracle and SQLite, and they both have skip scan with the same meaning. So we should stick to this, I guess.
Yeah. I mean, I wouldn't mind calling it something
else entirely, but I certainly do think the distinction between loose index scan and skip scan matters.
I mean, I was confused on that point myself.
I mentioned this already.
The important distinction between the two, to me, is this: loose index scan involves an index scan that knows specifically that it's feeding into a group aggregate. So it's only correct because you don't actually logically need to return all of the rows from the index, or indeed from the heap at all. It's only correct because of that high-level context that you're applying in the low-level code.
Whereas in contrast, skip scan isn't like that.
Skip scan is just exploiting information that's readily available to the scan.
It's doing index scan things in a more clever way
without really understanding how that information is being consumed
one level up or two levels up.
So that's how I think of it.
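A minimal sketch of that distinction (hypothetical table, same t as before):

```sql
-- Loose index scan territory: the scan could skip ahead to the next
-- distinct value of a, because the aggregate above it never needs
-- every row.
SELECT DISTINCT a FROM t;
SELECT a, count(*) FROM t GROUP BY a;

-- Skip scan territory: every matching row must be returned; the scan
-- just uses the composite index on (a, b) more cleverly to find them.
SELECT * FROM t WHERE b = 5;
```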
Yeah. You mentioned work on loose index scan also happening.
What do you think about the future of that work?
I think it's a perfectly fine idea.
There's nothing wrong with it.
I did a brief look at that some years back, and my recollection of exactly how it was at the time is a little bit vague. I remember noticing that it had lots of planner stuff, which, you know, was probably a little bit of something that discouraged my involvement, because, again, I like being successful, right? But, I mean, fundamentally it's a perfectly sound idea. It's obviously a good idea that could be implemented with enough effort. And I think that the patch didn't have any major problems; or, at least, maybe it was doing things the wrong way in the planner, but that's kind of above my pay grade. But, you know,
from an indexing point of view, it obviously makes sense. Just thinking about it very sensibly: it's doing potentially an enormous amount of I/O, potentially an enormous amount of extra I/O you don't need to do, logically. It's inarguable that that's a perfectly good idea.
If I haven't started
there or I haven't paid too much attention to it,
it's only because of competing priorities.
I think it's totally reasonable.
Yeah, we write WITH RECURSIVE all the time to implement this, basically at the application level, right?
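For readers who haven't seen it, this is roughly the pattern from the wiki page (hypothetical table tbl and column col):

```sql
-- Loose index scan emulation: one index probe per distinct value.
WITH RECURSIVE d AS (
  (SELECT col FROM tbl ORDER BY col LIMIT 1)
  UNION ALL
  SELECT (SELECT col FROM tbl WHERE col > d.col ORDER BY col LIMIT 1)
  FROM d
  WHERE d.col IS NOT NULL
)
SELECT col FROM d WHERE col IS NOT NULL;
```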
I'm curious if there are any specific benchmarks which could be helpful for both of these patches, to check particular kinds of workloads.
Well, not really, because, as is often the case with things that have some planner component to them, it's perfectly obvious that it's not a quantitative improvement, it's a qualitative improvement. So you name a number, and I'll give you a test case that is faster by that amount. Make up a number; it's no problem, I'll get you a test case that'll do that. I just have to come up with something that has a sufficiently large amount of things that you will be skipping.
Coming up with a number is not really very meaningful. It's just that you're doing it in the efficient way, the obvious sensible way, having not done that before. So there's really no practical limit on how much faster it could be. It's hard to make generalizations about the real world.
So that's kind of the annoying answer
to your question.
By the way, I wanted to ask you about propagation of this idea from B-tree to other types of indexes, GIN first of all. Is it a valid idea at all? Maybe in combination with btree_gin?
It definitely could be, but then the way GIN in particular does multi-column indexes is kind of weird. The way that actually works is, essentially, rather than having a B-tree that has a and b stored together, it has a B-tree where attribute one is stored as (1, whatever the value for a is), and then attribute two is stored as (2, whatever the value for b is). So the column number is in the B-tree itself; it's the most significant key, determining where the entries live. Which is very different. Now, if you said to me,
what about GiST? Then that's very different, because GiST is much more like B-tree, particularly here. I think with GiST, it's totally possible. The way that GiST works... So the source paper I mentioned, Efficient Search of Multidimensional B-Trees, it might be confusing: you might think, well, multidimensional, like GiST? And the answer is yes, sort of.
I'm doing work that's purely enhancing B-tree, but some of the ideas with dimensionality are, at least at a high level, related to things like multidimensional indexing using GiST.
So I think it's entirely possible to do that.
GiST is loosely ordered, so it's kind of inherently doing something a bit like that anyway, right? Where it's not so much finding the exact match in the exact place where it must be, but rather looking in all the places it could be, exhaustively, which is usually only a fairly small number of pages. It doesn't have the same propensity as B-tree to go to exactly the right, the only right, place for that value that you're looking for, directly. That's because of the loose ordering of the index; it's not quite doing it that way.
There's a bounding box in the internal pages which describes what the leaf pages contain.
So you might have to do two or three, even for a point lookup,
you might have to do two or three leaf page accesses, as opposed to just one, which would very often be the case with the B-tree. So it's almost kind of doing that anyway. So it seems entirely plausible that with GiST you could do something similar. I don't think that's possible in the case of GIN, though, because what does that even mean in light of GIN's scheme of storing two different attributes in completely different places? That doesn't really seem compatible with skipping. I don't know, though. I mean, anything's possible.
What if it's btree_gin, where the first column is a scalar column, like organization ID or category or something, and the second column is like JSONB or something?
Maybe. I mean, I think it's probably more fruitful to talk about it in the context of a use case. So there,
I guess, intrinsically you have
a kind of nested structure to JSON, and so
having that capability could be useful if you
had some kind of partitioning scheme or something like that, I suppose.
I mean, JSON, nested structure, it doesn't matter.
It's second column.
I know, I know.
Right?
We put a scalar value, like some number or timestamp, in the first place.
Yeah, I mean, I think that could be possible.
I haven't thought about it too much, though.
You have to talk about it on two levels as well, though. It's like, what would be practically achievable within a year? And what could it be in principle?
Let me tell you.
Yeah.
Okay.
Why do I care so much about this?
All right.
Because I started to reflect on one problem recently,
related to full-text search in Postgres and why people move to Elastic. And why did I? Because I saw a similar situation happening with pgvector. Because, you know, the most popular issue in the pgvector repository right now on GitHub is additional filtering. So people need a global index with vectors, but they also need to be able to sometimes search locally, for a particular project, organization, anything. And this is usually done, in Elastic and others, by what they call metadata search; basically it's additional filters, usually scalars, like numbers or timestamps. And combining B-tree with vector search is not solved yet. That's why people still say Postgres vector search with pgvector doesn't work well for them. Because the usual solution is: let's just create partial indexes. But if you have hundreds or thousands of categories, it doesn't work well.
And then I realized that for full-text search, we had a similar problem for years.
And usually it's solved by the btree_gin extension. And then you are able to create a multi-column index. But this is a barrier for many people. Although this is a contrib module, present everywhere, it's some kind of a barrier. And I saw cases where people even created this extension but somehow were still not using it well; they still have this B-tree versus GIN dilemma for the planner, right? So I'm thinking, if, for example, we use btree_gin more often, we usually will put the scalar value in the first position, because it's more selective by default, maybe. We don't know, right? And if it has a low number of distinct values,
this skip scan approach would make total sense there as well, right? Maybe.
I don't have a lot of domain expertise here, so forgive me; I need to ask questions to clarify what you meant. Are you talking about, like, pushing down a limit for each grouping or something like that?
Limit? I'm not sure about limit and ordering. I'm just thinking.
But not a limit for the entire query, but rather for each individual grouping or whatever; is that what the question relates to?
Imagine we have full-text search, a tsvector and a GIN index, and then we have an additional thing, like, for example, creation time. And we want to search within some window only, right? By default, full-text search in Postgres doesn't support that itself. It will be a different B-tree index on timestamp, like created_at, and we have a global, whole-table GIN, right? And then the planner decides what is cheaper: B-tree and then filter on the fly, or use full-text search and then filter or order by on timestamp. And the btree_gin extension gives you the opportunity to combine them, to have a two-column index.
And then the goal is to find everything, no limits, right? But if we put the scalar value in the first position, queries which don't have this filter will not be good with this index, right? So we put a timestamp as our column a and the tsvector as our column b, using btree_gin to have a two-column index, right? And then I would like to have skip scan for this, because my query might not have the additional filtering, and we have a low number of distinct values for column a, right? Maybe timestamp was not a good example, because it will have a lot of distinct values, right? Let's say it's some category ID or, I don't know, group ID or something. In this case, probably skip scan would help there as well for such queries. Otherwise, I'm forced to put this column in second position and keep the full tsvector in first position.
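A sketch of the setup being described (hypothetical table docs; btree_gin is the real contrib module):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE TABLE docs (category_id int, body text, tsv tsvector);
CREATE INDEX docs_cat_tsv_idx ON docs USING gin (category_id, tsv);

-- With the scalar filter, the composite GIN index serves the query:
SELECT * FROM docs
WHERE category_id = 42
  AND tsv @@ to_tsquery('english', 'skip & scan');

-- The question raised here: how well can the same index serve a query
-- that omits the scalar column entirely?
SELECT * FROM docs
WHERE tsv @@ to_tsquery('english', 'skip & scan');
```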
Okay, this is something to think about.
Additionally, I'm just very curious.
I don't know is the simple answer. In principle, you could. I mean, after all, what I've described, in the end, there's a bunch of details, but it is basically quite simple. At least in the simple case where you've fully omitted a prefix column, you have to be willing to look at every single individual partition, every distinct value, which adds up quite quickly if it's something like a timestamp. Whereas, I think, if you were to kind of bucket it by day or something like that, then it would be more feasible, because then you would have relatively few buckets. And especially if you had not completely omitted the date column, you had a range, a between, say, then it would be, I think, feasible from a data structure point of view, not necessarily in GIN, but in something, to do something like that. I mean, I don't know what the basis of comparison is here; what is Elastic doing? I just don't know. I can't speak to that.
I don't know the internals as well. They just claim that if you have additional filtering, it's not slower at all, and usually it's faster. This is what their documentation says. I'm very curious about the internals as well, because I just feel, even from a user perspective, even the need to create the btree_gin extension is already a barrier. Some people don't do it, and they think, okay, full-text search is weak. And they are going to have the same impression about vector search. And this whole topic of combining these powerful indexes with B-tree indexes is interesting. This direction, I think, can be improved as well, and with skip scan maybe it can be improved further. This is my basic idea. It's maybe not well thought out; I promised
to have silly questions.
It's okay, there are no silly questions. Well, I've got an even sillier answer, because I have a very, very Fisher-Price-level understanding of how pgvector works, which is not going to do much good here. So, you know, I'm not going to say something about a thing I know so little about. It's something that maybe I'll talk to Jonathan Katz about next time I see him. He's my go-to expert on these things.
He's very happy to talk to me about these things.
I'm sure he knows this problem very well already, because, as I said, this is the most active discussion among issues in the repository, and basically everyone needs this. They need a combination of B-tree and vector search somehow. Maybe not only just one B-tree column, I mean not only one scalar, but a bunch of them, because they need to filter additionally.
And this is a challenge.
I also know one project which built some kind of quite universal solution for others. And when they first came to us, they said, we are forced to index every column, for flexibility. We know this is an anti-pattern, but we are forced to. And I'm sure they will be happy with Postgres 18, if it's out with this optimization. This is a super relief for them, because they are struggling with this problem.
It's good to have, as Michael said, only one index on multiple columns, and not think too much about the order.
Yeah.
Especially if, as I say, the important distinction is between getting very slow, totally unacceptable performance, typically a sequential scan of a very large table, versus adequate: sub-second, but not sub-millisecond.
Like, a lot of people are in that zone,
and there, if they have a lot of ad hoc queries,
they can't really predict their indexing needs ahead of time.
I think it makes the biggest difference there,
much less so in OLTP apps, but still to some degree,
it's going to get you acceptable performance
where you didn't know
that you really ought to have had
a certain kind of index.
Now you get adequate-ish performance there too,
which could make all the difference in a pinch.
Yeah, well, actually, an interesting additional question I have. With loose index scan, we have this wiki page describing the technique; we apply this technique using a recursive CTE, right? And there is an expectation that in the future we will stop doing this, that it will be fully automated. While we are still not there, and skip scan is aimed at Postgres 18, it's not even committed, right? Does it make sense to try to implement a similar approach at a higher level ourselves somehow, splitting the query into pieces, like different index scans?
It might.
I would just say, in the case of... If you're able to upgrade to Postgres 17, then, having done that, you know, suppose you have 10 distinct values in your leading column. If you can write an IN list that has those 10 distinct values, and are satisfied that that will work for you, that it'll give you the correct answer not only when you write the query but going forward, if it's sort of static, then you can totally do that. And in fact, I would expect the actual access patterns to be identical to what you would have gotten had you had the skip scan feature. It is basically doing the same thing.
It's just rather than having an array with these constants that come from the query,
you will instead have a magical sort of an array
that the B-tree scan code uses almost the same way.
It's just that it generates its array elements procedurally and on demand,
but it is essentially the same thing at quite a low level.
So it would be not just similar, but identical
if you could make it work within those constraints.
And in particular, there'd be no possible downside
in terms of whether or not it made sense to do a full index scan.
Depending on the distribution of values,
it's going to be adaptive in that same way I described.
With the big caveat, of course, you kind of need to know that
at least for those leading values,
that it's static and predictable.
They won't change and break your query.
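Under those assumptions (hypothetical values, and a leading column whose distinct values are known and stable), the Postgres 17 workaround looks like this:

```sql
-- Manual stand-in for skip scan: spell out the leading column's
-- distinct values yourself. Only safe if that set is static.
SELECT * FROM t
WHERE a IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)  -- all distinct values of a
  AND b = 5;
```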
So I guess our approaches will change with this.
Usually we analyze workload and decide on indexes.
We decide what to put on the first position, second position.
Here, it seems like sometimes we will decide to put first the columns with a lower cardinality of distinct values, or something, right?
Yeah, that'll start to make way more sense than it did.
I don't want to overstate it.
I don't want to say it's a revolution by any means, but there are specific rules of thumb that this will change, for sure.
What you already said yourself, Nikolay, was, oh, put the highest cardinality column first.
Well, that's already kind of suspect, actually.
But it especially is in this world with this capability.
Interesting.
I think it's mostly confined to certain cases. That'll be most true, by far, with your data warehousing use cases, where there is a built-in requirement for a lot of ad hoc querying that you can't really exactly predict ahead of time, where looking at the detail of something rather than just a summary really matters. Then yes, absolutely, that will be true. With OLTP apps, less so.
Cool. Yeah, this is an interesting direction to explore, even if the patch is not yet committed.
Yeah, thank you.
Thank you. We have something on the to-do list.
Yep.
Is there anything else you wanted to make sure you mentioned, Peter, or anything anyone can help you with?
Well, I always welcome input into the work I'm doing. It would be kind of nice if I could get some feedback on the stuff we were just discussing: hey, as a practical matter, how does this alter the general advice that we give about how to index things? I'm not entirely sure how in practice that'll tend to work out anyway.
It is nice that this is a very general thing.
I think that that'll tend to make it possible to use it more often,
because in a certain sense, the optimizer only has to have the right rough idea,
because if it has that, then it will pick an index scan,
and then the index scan itself will figure out what to do.
So I think that's probably going to make it more compelling
than it appears to be in other systems,
where, as far as I can tell, and I might be mistaken, they require the optimizer to generate a distinct index node type, so it's kind of either-or, which could be something that limits the applicability of the optimization. So I feel like it would be nice if I had a better idea of how, practically speaking, this will change things, and where.
I don't think I've got a complete understanding of that myself.
Like what kind of applications, what kind of usage patterns.
That would be nice to have more detail on.
So if anyone wants to help me, that would be an obvious place to start, I would say.
Nice. Thanks so much, Peter. It's been an honor having you on. Great chat.
Thank you both. Thanks, Michael. I really enjoyed it.
Thank you. Have a good week.