Postgres FM - Skip scan

Episode Date: September 6, 2024

Michael and Nikolay are joined by Peter Geoghegan, major contributor and committer to Postgres, to discuss adding skip scan support to PostgreSQL over versions 17 and 18.

Here are some links to things they mentioned:

Peter Geoghegan https://postgres.fm/people/peter-geoghegan
Peter's previous (excellent) interview on Postgres TV https://www.youtube.com/watch?v=iAPawr1DxhM
Efficient Search of Multidimensional B-Trees (1995 paper by Harry Leslie, Rohit Jain, Dave Birdsall, and Hedieh Yaghmai) https://vldb.org/conf/1995/P710.PDF
Index Skip Scanning in Oracle https://oracle-base.com/articles/9i/index-skip-scanning
Peter's introductory email to the hackers mailing list about adding skip scan https://www.postgresql.org/message-id/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
Loose Indexscan versus Index Skip Scan (PostgreSQL wiki) https://wiki.postgresql.org/wiki/Loose_indexscan
Tom Lane will be on the Talking Postgres podcast on October 9th https://aka.ms/TalkingPostgres-Ep20-cal
Benoit Tigeot's feedback and repro (originally reported via Slack) https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4597410#gistcomment-4597410
Summary video and blog post about the v17 work by Lukas from pganalyze (not mentioned but great) https://pganalyze.com/blog/5mins-postgres-17-faster-btree-index-scans
Understanding HNSW + filtering (pgvector repo discussion) https://github.com/pgvector/pgvector/issues/259
btree_gin https://www.postgresql.org/docs/current/btree-gin.html

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and I'm joined as usual by Nikolay, founder of Postgres.AI. Hello, Nikolay. Hi, Michael. How are you? Let's not miss the fact that we have a guest today. We didn't miss it. You've thrown things off by saying, how are you? And I'm British, so I have to reply: and how are you? Well, let's blame the US.
Starting point is 00:00:25 This is where I learned to ask. So I'm super glad we have Peter today. Hi, Peter. Hi. Good to be here. Thanks for having me on the pod. We are delighted to have you. And it's Peter Geoghegan.
Starting point is 00:00:39 Hopefully everybody knows of Peter. But in case you don't, he's a major contributor and committer to Postgres for many years, three or so of the last of which were at AWS. So today we're going to be talking about some of Peter's recent and upcoming work on Postgres related to skip scan implementation, but it's far more than that. And honestly, the more I learn about this work, the more excited I am about it. I thought I knew quite a lot, and I've been reading up in the last couple of days, and it's even better, like I didn't realize. And this, I'm probably cheating or skipping ahead far too much, but I want people to understand how big this is. This could affect our indexing strategies going forwards; we could be choosing to index Postgres differently
Starting point is 00:01:21 and that anyway i guess it's obvious in hindsight, but there's some real major differences. So yeah, I'm super excited about this work. I want to thank you for having started it and continuing with it. I'd be interested to hear though, what attracted you to this work and did you realize how powerful it would be
Starting point is 00:01:38 when you first started on it? Not at first, no. When I last did a similar interview to this, I think it was the predecessor podcast, technically, not this one. At that time, I happened to talk about how, for whatever reason, I kind of like finding as many different ways as possible of looking at a thing. At the time, I think I said something like, well, there's a bunch of different perspectives. There's the data structure perspective, there's the concurrency control perspective and on down the list and it was that sort of general mentality that brought me to this work because i realized hey
Starting point is 00:02:11 i'm missing something here my tendency to sort of collect these different ways of looking at the thing the same familiar old topic this tendency i have i was for for a long time missing this extra bit which i think is embodied by the work i'm doing now and its source material, which is a pretty obscure 1995 paper. It happens to be called Efficient Search of Multidimensional Indexes. So you haven't heard of it, but hopefully that's not any indication of its real world relevance. It has been something that I believe has influenced a bunch of other systems. So it's in that sense, I think, pretty standard stuff. But the implications may be, as you were saying, Michael,
Starting point is 00:02:49 are not fully appreciated. So maybe we'll start with a general review for those that don't know what skip scan is. It's a feature that certainly Oracle and in recent versions, MySQL and also SQLite have. They all call it skip scan. I think in DB2 it's called jump scans or something like that. So the basic idea is to extend the standard Petri index access method
Starting point is 00:03:14 to support new types of querying without changing anything about the on-disk contents, without fundamentally altering anything. This is possible with multi-column indexes, composite indexes. The simplest case, and the one that gets the most attention, is where you have an omitted prefix column in your query. So for example, you do a select from a table where b equals, let's say, five. Here we would have a composite index on a b a of course has been omitted from the query predicate so traditionally in postgres at least you wouldn't be able to really use the index you might do a very inefficient full index scan where you sort
Starting point is 00:03:56 of have to grovel through the entire leaf level of the index and that works but obviously that's not very efficient what skip scan will allow you to do here is, assuming that there are not terribly many distinct values in the column A, as a rule of thumb, I would say hundreds of distinct columns total or perhaps thousands is about the upper limit. Then you can effectively use that composite index as a collection of logical sub-indexes they're sometimes called. So, for example, the first value might be 1 in the column A. So we'll do 1, A equals 1, B equals 5, A equals 2, B equals 5, A equals 3, B equals 5. And exhaustively, we'll do multiple small index scans that are effectively pretending to be one continuous index scan. And that will be certainly much more efficient than a sequential
Starting point is 00:04:45 scan in most cases. Of course, it's true that you'd be much better off having column B directly. But it's also true that who says that that's going to be possible, you might just not think to do it. Or in some cases, at least, it might not really be feasible. It much depends on the requirements for the user. This is particularly likely with things like data warehousing use cases, the fact table, where a lot of the querying is sort of ad hoc. And it's not really feasible to have all the indexes you might possibly require ahead of time. So here, it's making it a composite index, which is more useful for a wider variety of queries. If you only require adequate performance here, sub-second performance is attainable.
Starting point is 00:05:37 It's not going to give you perhaps sub-millisecond performance, although that does sometimes happen, but it's still, for these kinds of use cases, almost as good, and yet it allows you to avoid creating a bunch of quasi-redundant indexes. So in that sense, as you were saying, Michael, it's kind of like it at least would require some of the conventional wisdom to be updated.
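A minimal sketch of the simple case being described here, in SQL. The table, column, and index names are hypothetical, and the comments describe the intended skip scan behaviour rather than output from any released Postgres version.

```sql
-- Hypothetical setup: a composite index whose leading column is omitted
-- from the query predicate.
CREATE TABLE sales (a int, b int, payload text);
CREATE INDEX sales_a_b_idx ON sales (a, b);

-- Traditionally this can only use sales_a_b_idx as an inefficient full
-- index scan (or fall back to a sequential scan), because column a is
-- missing from the WHERE clause.
SELECT * FROM sales WHERE b = 5;

-- Skip scan (the Postgres 18 work) treats the index as a collection of
-- logical sub-indexes, one per distinct value of a, and runs many small
-- descents: a = 1 AND b = 5, a = 2 AND b = 5, a = 3 AND b = 5, ...
-- provided a has only hundreds or low thousands of distinct values.
```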
Starting point is 00:05:59 I wouldn't say it's a radical shift, but there is a sense in which, okay, in light of this new capability, we're going to have to rethink some of the rules of thumb we have for indexing in general. I think that's likely to work out that way. I would say I'm excited to see how people end up using the feature in practice. Many things are possible. So that example I gave is a sort of very traditional, simple example, but it also could be that you have a range on the leading column, so that you're not enumerating every possible date in the entire table, but just those dates from that range, but the same exact idea still. So for the 1st of August, plus whatever the actual thing that appeared in your predicate
Starting point is 00:07:00 in the query is, let's say, again, it's B equals 5. I don't know. Do that for the first of August, for the second of August, for the third of August. So from the query perspective, you haven't omitted the column. The column appears in this between range, but it is effectively the same idea. It also should be pointed out
Starting point is 00:07:19 that these are all very general techniques. They can be combined in all kinds of different ways. And that's where it becomes super interesting, like you have very, very complicated combinations of these things. The example query that I used to sort of, uh, in my initial email to the hackers mailing list to sort of sell it to people was taken from the original source paper, and in that instance it was, I believe, so you have an initial omitted prefix column, which is on a column called departments, then you have a range, then you have
Starting point is 00:07:53 an IN and then you have another IN, so it's all kind of working together, you say, it's like this sort of, and that's where I think it can make a huge difference when you combine all these techniques together, which you can, you will be able to do freely. So there, effectively, you have this query that would conventionally require a huge and inefficient index scan, which is just kind of a non-starter. What you actually would have to do is have a bunch of single column indexes and use some kind of bitmap or something like that. But then you give up the ability to have the sort order from the index. You can't do a limit five or something like that. You can't use a group aggregate directly, et cetera.
Starting point is 00:08:34 Whereas this way, you can pretty well convert what you would think of as one big query into what you would think of as many thousands of tiny queries that each land on one leaf page again and again, and it's sort of a lot of small queries pretending to be one big one. It's not at all apparent to you looking at EXPLAIN ANALYZE that that's going on, but it is. So effectively you're rewriting the query as if it was just a bunch of tiny queries in a way that is kind of fairly intuitively obvious. Like it wouldn't be very practical to do this, but you could sort of do it by procedurally generating a bunch of these small queries yourself in principle, at least. And that is the one sort of core clever idea here.
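To make the "many tiny queries pretending to be one big one" framing concrete, this is the kind of rewrite you could in principle generate by hand, as Peter says; skip scan arrives at the same access pattern internally without you enumerating anything. The values shown are made up.

```sql
-- Conceptual rewrite only: assume the distinct values of a are 1, 2 and 3.
SELECT * FROM sales WHERE a = 1 AND b = 5
UNION ALL
SELECT * FROM sales WHERE a = 2 AND b = 5
UNION ALL
SELECT * FROM sales WHERE a = 3 AND b = 5;
-- Each branch is one cheap descent of sales_a_b_idx that lands on a leaf
-- page directly; skip scan does the equivalent in a single index scan node.
```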
Starting point is 00:09:21 It's surprising how far this one core idea will get you, because it is a fairly simple idea in the end. But, you know, there's a bunch of things that I haven't even considered yet that the paper also contemplates. For example, it has a part about not-in predicates. You would think it would be impossible to use a not-in predicate for an index, right? But again, it's the same idea. It's like skip scan, except rather than enumerating every possible value in the index, it's enumerate every possible value in the index except these ones.
Starting point is 00:09:52 So it's basically the same thing again and again and again. So it just buys you an awful lot, one core simple idea, which I find kind of cool. So that's the big picture. Yeah, something that i am expecting to get into postgres 18 and something that builds on work that started with postgres 17 so yeah that's a great explanation i think the multi-column thing is the like the the complex example you mentioned is the part that blows my mind and starts to make me think you know all this advice we've given
Starting point is 00:10:22 that's you know you generally advise people against huge multi-column indexes because they often only target then like a very, very niche set of queries. And you end up accepting that you're going to do a large index scan with a bunch of rows being filtered, which we know is inefficient, but it's currently the best solution.
Starting point is 00:10:41 You lost hot updates also. I'm going to continue advising against too many columns. Well, okay. But I can see the flip side. This like, once you've got fewer, once you only need fewer indexes,
Starting point is 00:10:55 you don't have to support, like, let's say you've got five columns that are important here. Sometimes I've seen people index all five in three different orders to support different like versions of uh and I feel like that kind of thing will become less important if if we even need it at all anymore so that's mind-blowing but just to go back your thought process is super interesting here as well so did you come across this paper as like part of widening your horizons and then think of all these implementation ideas or was it you
Starting point is 00:11:25 thought this is an area this is a blind spot for me i should seek out papers in this area well there's a surprisingly little written about the topic actually even though as i say it's it's implemented in a bunch of other systems like postgres i also think it's probably true that they don't exploit it to its full extent. The actual paper was written by people that were working on a system called Non-Stop SQL in the 1990s, which hasn't quite completely died yet, but it certainly has far fewer users. It was a very influential system, though. I think that it's likely that the full generality of it hasn't been exploited
Starting point is 00:12:04 for whatever reason in other systems. My hope is that my implementation will do just that. It's kind of weird. The example complicated query I gave, what I noticed about that is most of the time, if I change the order of the columns, not always, but most of the time, it made hardly any difference. Wow. Right. Right, which wouldn't always be true. But just given the particulars of the query that I'm talking about, like, it kind of didn't matter if I swapped. It made maybe a small difference in favor or get like, but that is a bit mind blowing.
Starting point is 00:12:35 Sure, that that could ever be true. Again, it's going to be most compelling in cases where you don't really know what the requirements are ahead of time. It's very good for ad hoc indexing and you can always beat it by having the dedicated index if you want to pay for that, but that's perhaps impractical for a lot of users. And again, I think, you know, in OLTP apps it is still useful, but it's more for avoiding the worst case for something that you maybe should have anticipated but didn't for whatever reason. It's good for that, more so than something that you would ideally plan on using, but it is relevant to both. Um, as for why, I don't know, I think it entered my awareness about probably five or six years ago, but I didn't fully understand it and it took me a little while
Starting point is 00:13:21 to figure out the right way of doing it. So having explained all that, it should also be pointed out that sometimes whether or not you actually want to rewrite your query in this manner, it all depends, doesn't it? It could be, for example, that the old way of doing it happens to be the most efficient one because, you know, yes, you could, in principle, rewrite it to 100 different small queries, but those small queries would all be hitting the same few leaf pages anyway. So you actually would want to do maybe not 100, maybe one or two at most small queries there. How do you know ahead of time, which it is? Well, you don't, right?
Starting point is 00:14:03 But you don't have to. So I've had this intuition that if I solve that problem first, if I solve it for in lists, then after that I would be able to move on to this; that would be a useful foundation. So Postgres 17, uh, this has been implemented, um, it'll be available I think in three weeks when we release 17, if it goes as planned. So that took the existing capability for indexing in lists and made it so that if you have 50 integers, say, and they're contiguous integers, like 1, 2, 3, 4, 5, through to 50,
Starting point is 00:14:43 then suppose they're all co-located on the same leaf page, then you're going to do one index scan and read them all at once. If, on the other hand, it turns out that they're spread out all over the index because maybe it's low cardinality, or maybe it's low cardinality just in that part of the index, there's no rule that says it has to be one or the other. It could be at different points in the key space, low cardinality, others could be high.
Starting point is 00:15:05 So making a dynamic choice is very useful, I think, because you don't have to know. That's what makes it safe to aggressively apply these optimizations because if it doesn't work out, well, in any case, the B-tree scan code is able to sort of combine the would-be accesses into one on the fly. So you'll get the same behavior as in previous versions, if and only if that makes sense.
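A sketch of the Postgres 17 behaviour being described, again with hypothetical names. The same SQL-level IN list can be satisfied by anywhere between one and fifty index descents; the B-tree scan code decides on the fly.

```sql
-- Postgres 17: large IN lists (internally "scalar array op" / = ANY(array))
-- no longer force one primitive index scan per array element.
SELECT *
FROM sales
WHERE a = 5
  AND b IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)  -- imagine this running to 50
ORDER BY a, b
LIMIT 10;
-- If the matching entries are co-located on one leaf page they are read in
-- a single descent; if they are spread out, more descents happen. The choice
-- is made dynamically during the scan rather than fixed in the plan, and the
-- ordered result means the LIMIT can terminate the scan early.
```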
Starting point is 00:15:30 So that sort of flexibility, that dynamic behavior seemed like an essential kind of prerequisite. So it's not actually true. It's conceptually the case that you're rewriting the big query to lots of little ones and then combining the result, allowing them to pretend that they are one big index scan, even though they're not.
Starting point is 00:15:50 But the actual degree to which you do that, how much atomizing actually happens, is determined on the fly as the scan progresses. So it's kind of indeterminate if it's how many actual index scans you're doing, but that doesn't matter, if you get what i mean yeah it's so cool it doesn't matter to the you it's like one of those things that as a user i don't have to worry about it think about it at all as somebody who's implementing it though it feels
Starting point is 00:16:14 like it makes it much more complex for you to to build is that fair yes or no i think it's two ways of looking at it there's what you said, which is in an immediate sense true. It is more complicated. But in another sense, it isn't because complicated to who? I mean, the optimizer has to model all these costs, right? And it turns out that's hard. And it turns out that statistics are, well, they're statistics. And if you're talking about like you have all these secret hidden correlations,
Starting point is 00:16:48 but these are multiple column indexes, of course. So we have statistics for each column, but at least unless you go out of your way to, there's no correlated statistics for columns. There's a generic assumption that these are independent, even though, well, very often they're not. So having the access path be very forgiving, so the optimizer gets things quite wrong but it actually doesn't matter, or doesn't matter very much at least, is way easier, at least from the point of view of the query optimizer, and on balance that probably makes it much, much easier to actually, as a practical matter, make it work
Starting point is 00:17:24 I think that that will probably turn out to be really important, although at this point that's pure speculation. I had this on my list of things that might come up, but ah, the cost models, super interesting. It seems to me, and I haven't checked for sure, but from looking at some examples, like, you might not have changed the cost estimations, at least in 17. You know, you might consider that, because you're making a bunch of things more efficient in a bunch of cases, you might try and reduce the costs for those to encourage those scans when they make sense, but I guess we're already doing that anyway. I guess my question is, have you changed the cost model
Starting point is 00:18:00 and if not why not yes but i'm not surprised that you didn't notice it because the main change that i made to the cost model was that now it'll saturate because very simple observation the most expensive possible index scan ought to be a full index scan you know very simple if you scan everything like there's a maximum pretty tangible maximum cost to, in this hand, at least in this world where this stuff is available, as it will be in 17, where it is physically impossible to scan the same leaf page a second time now, no matter how the constants have been specified in the query itself. Because automatically that just cannot happen. It's a physical impossibility. So we know that we can take it to the bank. And so the cost model saturates.
Starting point is 00:18:48 There's an awful lot of uncertainty, except when it comes to the worst case. We know for absolute sure that in the very worst case, we'll scan every leaf page once. So that's where you notice it more. So you can have an absurdly long in list and you know at a certain point it'll just stop adding additional cost since it is physically impossible for the cost to keep going even as you continue to add this so that sort of plays into what i was saying about having some general understanding of what the worst possible case can be when everything goes wrong when the statistics are garbage that assurance is in the end makes
Starting point is 00:19:26 everything easier rather than making it harder i feel um it's also just useful i like that it seems pessimistic but also safe that's that's it exactly um it's sort of i think of statistics as being kind of inherently just very it's it's amazing that optimizers work as well as they do, I sometimes say. Like, people don't realize, I feel, often when something goes wrong, well, it was next to a miracle that everything was working as well as it was for as long as it was. And probably things were not technically working optimally. But you didn't know that because you were happy with adequate performance you didn't consider that that was technically 10 times slower than what could have in principle have been attained with the same index and such so like i think that if we're honest we have to admit that there's an awful lot of uncertainty about the cost models.
Starting point is 00:20:26 And that doesn't necessarily matter. More and more, I'm encouraging the idea, amongst people that are on the hackers list, that it's better to have a kind of, I wouldn't even call it insurance, that's too strong a word, but just account for variation at runtime dynamically rather than doing so statically in the query plan. That just seems to me like something that has very little downside and potentially quite a lot of upside. It is a more general idea, it doesn't just apply here. It applies, for example, with, I don't know, you can do some similar things with hash joins. It does get more speculative, um, if you keep going down the line of things that are like that.
Starting point is 00:21:07 But why would you trust an uncertain selectivity estimate if you didn't have to? There's no reason why you can't have some kind of way of recovering from a misestimation at runtime, at least in certain cases. I guess that it is related to that whole way of thinking about it. That is the projects I'm working on, in that way, I feel. Do you mind if I break the structure a little bit, as usual? Go ahead. Honestly, I didn't read the whole discussion. So I'm going to ask some newbie questions.
Starting point is 00:21:46 First of all, I see there is potential of, like, there is loose index scan as well, right? And this is a very different thing for distinct values. And we have wiki page and we're constantly
Starting point is 00:21:58 using this technique. And we also mentioned, like you did, other database systems have it implemented. Postgres doesn't. But here we discuss skip scan, and this is very different, I see.
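For contrast, this is the loose index scan emulation from the PostgreSQL wiki page Nikolay mentions: a recursive CTE that hops from one distinct value to the next, useful for DISTINCT over a low-cardinality indexed column. Table and column names are hypothetical.

```sql
-- Emulated loose index scan: distinct values of a, using an index on (a).
WITH RECURSIVE distinct_a AS (
    (SELECT a FROM sales ORDER BY a LIMIT 1)
    UNION ALL
    SELECT (SELECT s.a FROM sales s WHERE s.a > d.a ORDER BY s.a LIMIT 1)
    FROM distinct_a d
    WHERE d.a IS NOT NULL
)
SELECT a FROM distinct_a WHERE a IS NOT NULL;
```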
Starting point is 00:22:12 But I just googled a little bit, and I found people use skip scan in that context as well, right? So if I'm right, some people reaching this point will still not fully understand what we discuss, right? Oh, that's... It's possible, right? So we discuss the problem. There is a problem.
Starting point is 00:22:31 We have index on columns A, B, but filter is only on B. Kind of first layer of A is not relevant. And usually it means this index cannot be applied well. Right. relevant and usually it means this index cannot be applied well right it's worth pointing out that with what i'm talking about it only applies when you have at least two index columns whereas i don't believe that's true with flucid next scan right you can still usefully skip if there's only one column because there we need distinct values it's different task here we need all values but we have index which usually we say you have wrong index,
Starting point is 00:23:06 right? But your optimization will make it good, but only if a column has not many values, right? If it has a lot of values, maybe it won't be good, right? It's certainly not going to be able to help, because of course, yeah, technically
Starting point is 00:23:21 you can say, okay, there's a million, a lot of distinct sub-indexes, but that degenerates to a full index scan, so it's the same. By the way, that might still be the best plan, all things considered. We can't rule that out either. Sometimes that is actually the best available plan. But overall, yeah, you're probably not going to want to use it unless you have hundreds, perhaps low thousands, or occasionally maybe tens of thousands, but after that, it's unlikely. Will the planner decide, like, will it have some threshold to decide when skip scan should
Starting point is 00:23:54 be applied when it already doesn't make sense? Not as such, because as I sort of touched on already, there is no special skip scan index node. I think that's a bad idea because it's all dynamic, right? It's all, it's an additive thing. Yeah, now I'm catching up. Sometimes, for all the usual reasons, you might want to have an index only scan that happens to use the skip scan, or you might want a bitmap index scan, or you might want a parallel index scan, etc., etc., etc. Um, it isn't useful, I think, to have a distinct kind of index path. More choices means more wrong choices from the point of view of the optimizer, so I feel like that's a distinct advantage
Starting point is 00:24:34 of the approach i've taken of course i still have to figure out a way of making it not matter when that's not the right thing to do and that's not very easy but i'm almost uh i think i've got that part figured out we'll see do you mean to like minimize regressions with like at times where yeah okay because again occasionally it's going to be intrinsically the best thing is to just scan the index without trying to skip at all so you don't want to waste a lot of cpu cycles on a useless uh enterprise that is skipping when you shouldn't be skipping you really do want to fall back on essentially the same behavior as previous releases there now mind you overall it's it's hopefully the optimizer just wouldn't pick that way of doing it because it would be wrong it wouldn't be the fastest available plan but
Starting point is 00:25:23 given the available indexes so that only applies at all when you're already doing something that's basically pretty slow. But overall, yeah, I think I'm going to have to make sure that you don't notice it when it isn't able to do what you would hope it would. That is sort of complicated, for sure. Just to make sure I've understood, if we could do a scan without skips, if we had the perfect index for a very efficient, let's say index only scan, that would have been the index
Starting point is 00:25:53 that would have been chosen in the first place because its cost estimate would have been lowest. So already when we are skipping, you're doing a scan that involves some skips. We haven't had that available
Starting point is 00:26:03 and therefore yeah okay that's really cool yeah in general i think it makes more sense to think about these sorts of queries as being on a continuum one extreme is where there's let's say no more than five or ten distinct and then the fact that you're doing five different descents of the index is hardly noticeable it's maybe less than a millisecond in the end anyway. So that will be a very favorable case for skipping. And then you have less favorable cases, I suppose, where the cardinality goes up. And eventually you get to the point where you can't do any skipping at all. And that's where you don't want to pay a penalty for being open to the possibility of skipping, if you like. Also, skipping, as I sort of touched on
Starting point is 00:26:47 earlier, it could be, for example, that you have an index that has 90% of the pages in the index in respect of the leading column have three values total, but nevertheless, the remaining 10% are all perfectly unique. So now you don't have, like, you can't really average the two. It's just two different things. You can't think of it as the average of the two, really. So the fact that it's kind of continuous just makes sense. You wouldn't want to have it be... If the planner were to make a static choice, then how would you account for that variation where you legitimately need to change your strategy for the same index scan in real time like that's totally possible that that would be the the best thing to do so like not having an opinion until you really have to have an opinion at the last minute you ideally get the best of both worlds that's the thinking anyway i think it's super smart and those distributions are quite common like we see them all the time and it's the time where statistics most obviously falls flat on its face because of averages and anyway i know there i know
Starting point is 00:27:54 it does some things too, like most common values and things, but anyway, yeah, completely agree, that makes loads of sense. Sorry, you were going to say? I was going to give another example of what you're saying with statistics I'd mentioned briefly earlier. There's this, I think it's called the attribute value independence assumption. This generic sort of assumption made by all cost-based optimizers that there's no correlation between different columns, which is demonstrably false all the time, but somehow mostly doesn't break that badly. But for something like this, you've got to think, like, there could be a very strong correlation between columns, or there could be a totally different, completely independent relationship between the two, and it's really hard to model that. So if you can just not do that to much of an extent at all and still get useful query plans, that seems ideal. You know, there's just different types of indexes.
Starting point is 00:28:52 So, you have, for example, in the paper that I talk about, it's data warehousing. So, you explicitly have different dimensions of a business that are obviously quite independent. But then you could also have something like, I don't know, supplier ID, first column, second column, discount code. So it might be that different suppliers use the same discount code, coincidentally. But even then, those two discount codes are completely distinct entities. So it just doesn't make sense to think of them as being the same at all. You wouldn't expect a query to just supply a predicate that doesn't have a supplier ID; you would expect the query to have a supplier ID, or have both. You wouldn't expect it to be omitted, because it just kind of wouldn't make sense. So it's impossible, or very difficult in any case, for the optimizer to
Starting point is 00:29:41 understand those kinds of distinctions, even though they're obviously very, very common. So again, ideally it wouldn't have to; ideally it would just be able to fall back on this sort of idea of, hey, we'll try to work it out at runtime to the best of our ability. As long as we get the fastest query plan, everyone's happy. Uh, it doesn't necessarily have to be that exact, and the cost model in fact usually isn't. So that's kind of the, uh, the overarching philosophy of the thing. And yeah, as usual with your work, it's so simple once it, like, once you understand it, but, like, I'm super impressed with the thought process to have got there. It feels like yet another feature where, as users, we don't have to think about how it's working, why it's working, that much. Maybe with strategies later in terms of indexing
Starting point is 00:30:31 we can take advantage of that but we could carry on indexing exactly the way we are and we'll only benefit as a result still unless i mess up of course but hopefully yeah yeah i'm not exactly playing to my strengths when i'm talking about the optimizer you understand but nevertheless that is kind of part of part of the philosophy i'm not really an optimizer person because i like being successful that's probably why um but here i think that you know undeniably there is some thought given to those those aspects because it kind of has to be what do you mean successful well it's just really hard to do anything in the optimizer that's why some people will do um and thank goodness someone is willing
Starting point is 00:31:10 to do it but uh i you know it's a very difficult area to work in i feel which does seem to put people off more than perhaps it should perhaps that will change at some point in the future but you have to ask somebody else about that uh yes uh that somebody else i believe is going on another podcast in which i'm excited to oh great oh i think i know what you're talking about yes for everybody else this is tom lane going on the talking postgres podcast which is scheduled to be live i think next month so i'll share a link to that uh exciting times so we you've talked about it briefly already, but another cool thing you've done here is split this work into some useful stuff
Starting point is 00:31:50 on its own for version 17 coming out soon. I'd be interested to hear like how and when did you realize you could split it up into the in-list optimizations that I think are going to be very useful for a bunch of often generated queries that people don't have that much control over, like long, like relatively long in-lists seem to pop up, at least from what I've seen from like ORMs normally. How did you work that out? I'm not sure exactly when, I think it probably wasn't any one moment where I realized that those
Starting point is 00:32:21 were, I remember talking about another patch, actually one for loose index scan, and just being very confused about that whole area myself, being very confused about what the difference is, where to draw the boundaries architecturally. And I recall asking to nobody in particular at the time, how does this relate to in lists, or what we would call internally scalar array op expressions, equals any array
Starting point is 00:32:46 expressions there must be some relationship here at that time i was aware of the paper the source material and that strongly suggested as much but i just didn't i kind of thought you can't really have an opinion about one without having an opinion about the other right you want to be able to combine these things arbitrarily so that was when the initial awareness sort of sunk in. Then I did some work on Vacuum over a couple of releases, and I came back to B-trees again. And I think at that point I realized, okay, this is probably the most useful thing I could do.
Starting point is 00:33:18 And then I kind of got a little bit lucky as well with the project for 17. I worked out a couple of things on the optimizer side that I wasn't necessarily counting on. That's more subtle. It's stuff that's related to whether or not we know that the scan will return ordered results.
Starting point is 00:33:34 So I realized the relationship between that question. So in previous versions, you would do a query like select from table where A equals five and B in a long list, order by a, b, limit something, and it would not be able to terminate the scan early because, for reasons that are complicated and not that interesting, the implementation couldn't trust the B-tree code to
Starting point is 00:33:59 correctly return things to the results. So I kind of, I had this idea that that was important, but I didn't know how important it was to everything else. So I kind of got a bit lucky in figuring that out. Maybe about 2023, before I even realized that, I was pretty sure that I would do the skip scan thing, but there was some serendipity with, um, there was actually a user complaint, I think it was on Slack. Oh yeah, Benoit. Um, that's right. Yeah, I'll link, I'll link
Starting point is 00:34:32 the query plan, which was for such a query with a limit, it couldn't do it without creating a way of filtering the non-matching rows that required heap access. So for reasons that are not really fundamental or worth explaining, it did way more heap accesses just to exclude non-matching rows, when in principle it could have done that at the index level before accessing the heap at all. So for very roundabout, hard to explain, and actually not that relevant reasons, that made a huge difference for his use case. I wasn't even thinking about, because I'm B-tree, uh, you know, I wasn't really thinking about avoiding heap accesses, but then I found, oh, actually that's by far the main reason to do that. I was thinking initially about just the
Starting point is 00:35:14 simple fact that hey do you want to do the index descent you know one time or 100 times obviously you want to do it one time if that's feasible that's where i started but then i realized not much later but somewhat later with some help from a while oh actually yeah you can in some cases at least completely oh we do all this right about nonsense with restrictions in the planner you can completely some cases like the one he showed you can completely eliminate a huge number and that was by far the best but the most sort of flattering case for the 17 work where i think it was like 8 20 times fast something like that it doesn't really matter what the number is just way way faster because you're you're fundamentally doing way less io so i i didn't necessarily count on on that happening at all
Starting point is 00:36:02 and then it kind of did without my initially realising it, which was nice. Serendipity did play a role; it often does. Very cool. Nikolai, do you have any questions? Of course, I have comments and questions. All of them are silly, maybe. Once again, we should avoid this confusion. Loose index scan is not skip scan, because I saw posts about mixing these names, but I double checked, uh, Oracle and SQLite, and they both have skip scan with the same meaning, so we should stick to this, I guess. So yeah, I mean, I wouldn't mind calling it something
Starting point is 00:36:41 else entirely but i certainly do think the distinction between loose index scan and skip scan, I mean, I was confused on that point myself. I mentioned this already. The important idea to me, the important distinction between the two is, so loose index scan involves a index scan that knows specifically that it's feeding into a group aggregate. So it's only correct because you don't actually logically need to return all of the rows from the index or indeed from the heap at all.
Starting point is 00:37:11 It's only correct because of that high-level context that you're applying in the low-level code. Whereas in contrast, skip scan isn't like that. Skip scan is just exploiting information that's readily available to the scan. It's doing index scan things in a more clever way without really understanding how that information is being consumed one level up or two levels up. So that's how I think of it. Yeah. You mentioned work on loose index scan also happening.
Starting point is 00:37:35 What do you think about the future of that work? I think it's a perfectly fine idea. There's nothing wrong with it. I did a brief look at that some years back and my recollection of exactly how it was at the time is is a little bit vague i remember noticing that had lots of planner stuff which you know that was probably a little bit of something that discouraged my involvement because again i like being successful right yeah it but i mean fundamentally it's it's a perfectly sound idea.
Starting point is 00:38:10 It's obviously a good idea that could be implemented with enough effort. And I think that the patch didn't have any major problems, or at least, maybe it was doing things the wrong way in the planner, but that's kind of above my pay grade. But you know, from an indexing point of view, like, it's obvious, it makes sense. It's like, you know, yeah, just thinking about it very, very sensibly, it's doing potentially an enormous amount of IO, potentially an enormous amount of extra IO you don't need to do logically. It's just, it's not, yeah, I mean, it's not, it's not some, it's inarguable that that's a perfectly good idea.
Starting point is 00:38:54 If I haven't started there or I haven't paid too much attention to it, it's only because of competing priorities. It's not, I think it's totally reasonable. Yeah, we write with recursive all the time to implement this, like basically at application level, right? I'm curious if there are any specific benchmarks
Starting point is 00:39:13 which could be helpful for these both patches to check kind of workloads. Well, not really, because as often is the case with things that have some planner component to them, it's sort of like, it's perfectly obvious that it's not a quantitative improvement, it's a qualitative improvement. So you name a number and I'll give you a test case that is faster by that amount. It doesn't matter, like, make up a number, it's no problem, I'll get you a test case that'll do that. I just have to come up with something that has sufficiently large amounts of things that you will be skipping now.
Starting point is 00:39:48 Coming up with a number, it's not really very meaningful. It's just you're doing it in the efficient way, the obvious sensible way, having not done that. So there's really no practical limit on how much faster it could be. It's hard to make generalizations in the real world. So that's hard to make generalizations in the real world. So that's kind of the annoying answer to your question. By the way, I wanted to ask you about
Starting point is 00:40:11 propagation of this idea from B-tree to other types of indexes, GIN first of all. Is it a valid idea at all? Maybe in combination with btree_gin? It definitely could be, but then the way GIN in particular does multi-column indexes is kind of weird.
Starting point is 00:40:31 The way that actually works is, essentially, rather than having a B-tree that has A and B stored together, let's say it has a B-tree, if and only if you have a multi-column index, it has a B-tree where attribute one is stored with 1, whatever the value for A is, and then attribute two is stored with 2, whatever the value for B is. So the column number is in the B-tree itself; where the entries live, it is the most significant key, which is very different. Now, if you said to me, what about GIST? Then that's very different,
Starting point is 00:41:05 because GIST is much more like B-tree, particularly here. I think with GIST, it's totally possible. The way that GIST works... So the source paper I mentioned, Efficient Search of Multidimensional Indexes, you might be confused. You might think, well, multidimensional like GIST? And the answer is yes, sort of. I'm doing work that's purely enhancing B-trees, but some of the ideas with dimensionality are at least at a high level related to things like multidimensional indexing using GIST. So I think it's entirely possible to do that. GIST is loosely ordered. So it's kind of inherently doing something a bit
Starting point is 00:41:45 like that anyway, right, where it's not so much finding the exact match in the exact place where it must be, but rather looking in all the places it could be, exhaustively, which is usually only a fairly small number of pages. It doesn't have the same propensity as B-tree to go to exactly the right, the only right place for that value that you're looking for directly. It's more, because of the loose ordering of the index, you know, it's not quite doing it that way. There's a bounding box in the internal pages which describes what the leaf pages contain. So you might have to do two or three, even for a point lookup,
Starting point is 00:42:25 you might have to do two or three leaf page accesses as opposed to just one, which would be very often the case with the B-tree. So it's almost kind of doing that anyway. So it seems entirely plausible that with GIST you could do something similar. I don't think that's probably possible in the case of GIN, though, because what does that even mean in light of the scheme with GIN storing two different attributes in completely different places in the tree? It doesn't really seem compatible with that. I don't know, though. I mean, anything's possible. What if it's a btree_gin
Starting point is 00:42:52 when the first column is a scalar column, like organization ID or category or something, and the second column is like JSONB or something? Maybe. I mean, I think it's probably more fruitful to talk about it in the context of a use case. So there, I guess, intrinsically you have a kind of nested structure to JSON, and so having that capability could be useful if you
Starting point is 00:43:24 had some kind of partitioning scheme or something like that, I suppose. I mean, JSON, nested structure, it doesn't matter. It's the second column. I know, I know. Right? We put a scalar value, like some number or timestamp, in the first place. Yeah, I mean, I think that could be possible. I haven't thought about it too much, though.
Starting point is 00:43:46 You have to talk about it in two levels as well, though. It's like, what would be practically achievable within a year? Let me tell you. Could it be? Yeah. Okay. Why I care so much about this? All right.
Starting point is 00:43:58 Because I started to reflect one problem recently related to full-text search in Postgres and why people move to Elastic. And why I did? Because I saw a similar situation happening with PgVector because, you know, the most popular issue in PgVector
Starting point is 00:44:15 repository right now on GitHub is additional filtering. So people need a global index with vectors, but they also need to be able to sometimes search locally, for a particular project, organization, I don't know, anything. And this usually is done, what Elastic and others, they call it metadata search. Basically it's additional filters, usually scalars like numbers or timestamps, something, and B-tree for vector search is not solved yet. That's why people
Starting point is 00:44:47 still can say, like, they say Postgres vector search with pgvector is not working well for us, because the usual solution is let's just create partial indexes, but if you have hundreds or thousands of categories, it doesn't work well. And then I realized that for full-text search, we had a similar problem for years. And usually it's solved by the extension btree_gin. And then you are able to create a multi-column index. But this is a barrier for many people. Although this is a contrib module, it's present everywhere. It's some kind of a barrier, and I saw cases when people even created this extension but somehow
Starting point is 00:45:31 still not using it well, like they still have this problem, the B-tree versus GIN dilemma for the planner, right? So I'm thinking, yeah, and I'm thinking if, for example, if we use btree_gin more often, we usually will put the scalar value in first position because it's more selective by default, maybe. We don't know, right? And if it has a low number of distinct values, this skip scan approach would make total sense there as well, right? Maybe. I don't have a lot of domain expertise here, so forgive me, I need to ask questions to clarify what you meant
Starting point is 00:46:14 are you talking about, like, pushing down a limit for each grouping or something like that? Limit is, like, I'm not sure about limit and ordering. I'm just thinking. But not a limit for the entire query, but rather for each individual grouping or whatever, where you want to, is that what, does that relate to the question? Imagine we have full-text search, tsvector, GIN index, and then we have an additional thing,
Starting point is 00:46:40 like, for example, creation time. And when we want to search within some window only, right? By default, full-text search in Postgres doesn't support it itself. It will be a different B-tree index on timestamp, like a created_at timestamp, and we have a global whole-table GIN, right? And then the planner decides what is cheaper, B-tree and then filter on the fly,
Starting point is 00:47:05 or use full-text search and then filter or order by on timestamp. And the btree_gin extension gives you the opportunity to combine, to have a two-column index. And then the goal is to find everything, no limits, right? But if we put the scalar value in first position, queries which don't have this filter will not be good with this index, right?
Starting point is 00:47:37 So we put a timestamp as our column A and the tsvector as our column B. Use btree_gin for that to have a two-column index, right? And then I would like to have skip scan for this, because if my query doesn't have additional filtering and we have a low number of distinct values for column A, right? Maybe timestamp was not a good example because there will be a lot of distinct values, right?
Starting point is 00:48:04 Let's say it's some category ID or, I don't know, like group ID or something. In this case, probably skip scan will help there as well for such queries. Or I'm forced to put this column to second position and keep full TS vector on first position. Okay, this is something to think about. Additionally, I'm just very curious. I don't know is the simple answer. In principle, you could, I mean, after all, what I've described is in the end, there's a bunch of details, but it is basically quite simple. You have to be willing to do, at least in the simple case where you've omitted, fully omitted a prefix column you have to be willing to look at
Starting point is 00:48:45 every single individual partition, every distinct value, which adds up quite quickly if it's something like a timestamp. Whereas I think if you had to kind of bucket it by day or something like that, then it would be more feasible, because then you would have relatively few buckets. And especially if you had not completely omitted the date column, you had a range of between, say, then it would be, I think, feasible from a data structure point of view, not necessarily in GIN, but in something, to do something like that. I mean, I don't know what the basis of comparison is here, what is Elastic doing? I just don't know. I can't speak to that.
Starting point is 00:49:24 I don't know as well, internals. They just claim if you have additional filtering, it's not slower at all, and usually it's faster. This is what their documentation says. I'm very curious about internals as well, because I just feel this, even from a user perspective, even the need to create the btree_gin extension, it's already a barrier.
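A sketch of the btree_gin setup under discussion, with hypothetical names: the contrib extension lets a scalar column and a tsvector share one multi-column GIN index, so the metadata filter and the full-text predicate can be served together.

```sql
-- btree_gin is a contrib module and has to be installed explicitly,
-- which is the barrier being described.
CREATE EXTENSION IF NOT EXISTS btree_gin;

CREATE TABLE docs (category_id int, body_tsv tsvector);
CREATE INDEX docs_cat_fts_idx ON docs USING gin (category_id, body_tsv);

-- Both predicates can now use the one multi-column GIN index:
SELECT *
FROM docs
WHERE category_id = 42
  AND body_tsv @@ to_tsquery('english', 'skip & scan');
```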
Starting point is 00:49:48 Some people don't do it and they think, okay, full-text search is weak. And they are going to have the same impression about vector search. And this whole topic about the combination of these powerful indexes with B-tree indexes, it's interesting. The direction, I think, can be improved as well, and with skip scan maybe it can be improved further. This is my, like, basic idea; it's maybe not well thought through. I promised to have silly questions, so it's okay. There are no silly questions. Well, I've got an even sillier answer, because I, I have a very, very Fisher-Price level understanding of how pgvector works, which is not going to do much good here. So, you know, I'm not going to say something about what I think I know so little about. It's something that maybe I'll talk to Jonathan Katz about next time I see him.
Starting point is 00:50:40 He's my go-to expert on these things. He's very happy to talk to me about I'm sure he knows already this problem very well because as I said this is the most active discussion among issues on repository and basically all people need this they need
Starting point is 00:50:56 a combination of B-tree and vector search somehow, maybe not only just one B-tree column, I mean not only one scalar but a bunch of them, because they need to filter additionally. And this is a challenge. I also know one project where they built some kind of quite universal solution for others. And when they first came to us, they said, we are forced to index every column, for flexibility.
Starting point is 00:51:27 We know this is anti-pattern, but we are forced. And I'm sure they will be happy when Postgres 18, if it's out with this optimization. This is super relief for them because they are struggling with this problem. It's good to have, as Michael said, only one index on multiple columns and don't think too much about order.
Starting point is 00:51:52 Especially if, as I say, the important distinction is between getting very slow, totally unacceptable performance, typically a sequential scan of a very large table, versus adequate, sub-second, but not sub-millisecond. Like, a lot of people are in that zone, and there, if they have a lot of ad hoc queries,
Starting point is 00:52:15 they can't really predict the needs of indexing needs ahead of time. I think it makes the biggest difference there, much less so in OLTP apps, but still to some degree, it's going to get you acceptable performance where you didn't know that you really ought to have had a certain kind of index. Now you get adequate-ish performance there too,
Starting point is 00:52:35 which could make all the difference in a pinch. Yeah, well, actually, interesting additional question I have. Like with loose index scan, we have this wiki page describing this technique, we apply this technique using recursive CTE, right? And there is like expectation in the future
Starting point is 00:52:52 we will stop doing this, it will be fully automated. Does it make any sense to think about implementing, while we are still not there and not on Postgres 18, it's not even committed, right? Does it make sense to try to implement a similar approach at a higher level ourselves somehow,
Starting point is 00:53:10 splitting the query to pieces, like different index scans. It might. I would just say, in the case of... If you're able to upgrade to Postgres 17, and having done that, you're able to also, you know, suppose you have 10 distinct values in your leading column. If you can write an in list that has those 10 distinct values and are satisfied that that will work for you, that'll give you the correct answer, not only when you write the query, but going forward, if it's sort of static,
Starting point is 00:53:44 then you can totally do that. And in fact, I would expect the actual access patterns to be identical to what you would have gotten had you had the skip scan feature or anything. It is basically doing the same thing. It's just rather than having an array with these constants that come from the query, you will instead have a magical sort of an array that the B-tree scan code uses almost the same way. It's just that it generates its array elements procedurally and on demand,
Starting point is 00:54:13 but it is essentially the same thing at quite a low level. So it would be not just similar, but identical if you could make it work within those constraints. And in particular, there'd be no possible downside in terms of whether or not it made sense to do a full index scan. Depending on the distribution of values, it's going to be adaptive in that same way I described. With the big caveat, of course, you kind of need to know that
Starting point is 00:54:39 at least for those leading values, that it's static and predictable. They won't change and break your query. So I guess our approaches will change with this. Usually we analyze workload and decide on indexes. We decide what to put on the first position, second position. Here, it seems like sometimes we will decide to put columns with lower cardinality of distinct values or something, right?
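To make the Postgres 17 workaround Peter described a moment ago concrete: if the leading column's distinct values are known and stable, you can spell them out yourself and get essentially the same access pattern skip scan would produce. Hypothetical names again, and the caveat stands that the list must stay correct as the data changes.

```sql
-- On Postgres 17, emulate skip scan by enumerating the leading column's
-- known, static set of distinct values explicitly:
SELECT *
FROM sales
WHERE a IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)  -- all current values of a
  AND b = 5;
-- The B-tree scan code treats this much like the planned skip scan would:
-- the per-value descents are combined or kept separate dynamically, so there
-- is no penalty if all the values happen to live on the same few leaf pages.
```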
Starting point is 00:55:08 Yeah, that'll start to make way more sense than it did. I don't want to overstate it. I don't want to say it's like a revolution by any means, but there are specific rules of thumb that it will change, for sure. What you already said yourself, Nikolai, was, oh, put the highest cardinality column first. Well, that's already kind of suspect, actually. But it especially is in this world with this capability. Interesting. I think it's mostly confined.
Starting point is 00:55:37 That'll be most true by far with more of your data warehousing use cases, where there is a built-in requirement for a lot of ad hoc querying that you can't really exactly predict ahead of time, looking at the detail of something rather than just a summary really matters, then yes, absolutely, that will be true. Well, with OLTP, less so. Cool, yeah, this is an interesting direction to explore
Starting point is 00:56:02 even if patch is not yet committed. Yeah, thank you. Thank you. Have something in the to-do list. Yep. Is there anything else you wanted to make sure you mentioned, Peter, or anything anyone can help you with? Well, I always welcome input into the work I'm doing, it would be kind of nice if I could get some feedback on the stuff we were just discussing about, hey, how as a practical matter
Starting point is 00:56:30 does this alter the general advice that we give about how to index things? I'm not entirely sure how in practice that'll tend to work out anyway. It is nice that this is a very general thing. I think that that'll tend to make it possible to use it more often,
Starting point is 00:56:47 because in a certain sense, the optimizer only has to have the right rough idea, because if it has that, then it will pick an index scan, and then the index scan itself will figure out what to do. So I think that's probably going to make it more compelling than it appears to be in other systems, where, as far as I can tell, I might be mistaken, they require the optimizer to generate a distinct index node type, so it's kind of either-or, which seems to be, anyway, could be something that limited the ability of the
Starting point is 00:57:18 optimizations. So I feel like it would be nice if I had a better idea of how, practically speaking, this will change things and where. I don't think I've got a complete understanding of that myself. Like what kind of applications, what kind of usage patterns. That would be nice to have more detail on. So if anyone wants to help me, that would be an obvious place to start, I would say. Nice. Thanks so much, Peter. It's been an honor having you on. Great chat. Thank you both. Thanks, Michael. I really enjoyed it.
Starting point is 00:57:47 Thank you. Have a good week.
