Postgres FM - Under-indexing

Episode Date: October 27, 2023

Nikolay and Michael discuss under-indexing — how to tell if this applies to you, some general and specific cases, as well as some tricky ones.

Here are some links to things they mentioned:

- Indexes (docs) https://www.postgresql.org/docs/current/indexes.html
- random_page_cost https://postgresqlco.nf/doc/en/param/random_page_cost/
- Sequential scans https://www.pgmustard.com/docs/explain/sequential-scan
- pg_stat_user_tables https://pgpedia.info/p/pg_stat_user_tables.html
- pg_stat_statements https://www.postgresql.org/docs/current/pgstatstatements.html
- auto_explain https://www.postgresql.org/docs/current/auto-explain.html
- Our episode on slow queries https://postgres.fm/episodes/slow-queries-and-slow-transactions
- Our episode on query macro analysis https://postgres.fm/episodes/macro-query-analysis-intro
- Running 10 Million PostgreSQL Indexes In Production (And Counting) https://www.heap.io/blog/running-10-million-postgresql-indexes-in-production
- Faceting large result sets in PostgreSQL https://www.cybertec-postgresql.com/en/faceting-large-result-sets/
- Our episode on over-indexing https://postgres.fm/episodes/over-indexing

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by:

- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:

- Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM. My name is Michael, founder of pgMustard, and this is Nikolai, founder of Postgres.ai. Hey Nikolai, what are we talking about today? Hi Michael, you chose the topic under-indexing. You said this is the most popular case you see: like, when people lack indexes; tables lack indexes, not people, of course. Yes, well, last week we discussed over-indexing and the problems that can come about because of that. And then that made me think, wait, why are we covering this case when we haven't covered the case I tend to see more often? And, at least in terms of raw numbers of companies or people, I think more people are in this boat, even if they don't think they are. And I think
Starting point is 00:00:45 that might be an important thing. I think you said you were going to do a poll, you might be about to do a poll, on whether people consider that they are under-indexed or over-indexed. I think a lot of people have added a lot of indexes, but for their trade-offs, either some of those are not necessarily precise or optimal, or there are some duplicates in them. Maybe they have a lot of indexes but they still could afford to have a few more precise ones to really help their workload, right? Well, yeah, I don't have poll results right now. But right, so we think over-indexing comes after; before it, we are in the previous position,
Starting point is 00:01:26 under-indexing, right? So then over-indexing, and then precise indexing, like an evolution of the state of your index set. So yeah, I actually don't know. And the poll, it's not representative; even if a few hundred people answered, still, we don't know. And of course, from my experience,
Starting point is 00:01:46 when people ask us to help, like in consulting practice, of course, we already see a lot of effort to fix problems. And obviously, sometimes too many indexes were created. And obviously, this is the state, not the initial one; it's already some developed situation. So yeah, I agree with you that under-indexing is also an interesting topic, maybe more basic, but worth discussing. Yeah, and I've actually just realized, I probably should have realized this earlier, that it's possible to be
Starting point is 00:02:15 both at the same time, right? Like, you could bring in a consultant, they remove 10 indexes and add two. It's possible that you're missing the perfect indexes, but also have way too many indexes. So maybe it's a bit oversimplistic for me to say. And interesting that if you come to some project, removing indexes is easier. It's a simple task, actually. You check duplicates, you check redundant, unused indexes, right? And this is easier than finding proper indexes when some tables are under-indexed
Starting point is 00:02:55 because you need to analyze workload, and this is more complex research, basically, than just checking for redundant or unused indexes. I mean, those types of research are also not super trivial. They have some complexities. We discussed them. But finding proper indexes may be a more difficult problem. What do you think? Finding them is interesting.
Starting point is 00:03:21 I think there are some pretty simple cases of finding, cases of being under-indexed as well. But cleaning it up, I guess, especially now that we have DROP INDEX CONCURRENTLY, it's quite easy to clean up. There's not much risk to cleaning up indexes; with dropping an index, as long as you're confident it's not being used, you're only seeing upsides. Whereas adding an index, you have to convince people it's worth it as well, maybe. To be honest, I'm not sure we're ever in a situation where we're really having to choose between these two. Normally, it's a case of quite a clear-cut win.
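As a rough sketch of that cleanup side (the index name below is hypothetical), pg_stat_user_indexes shows which indexes have never been scanned, and DROP INDEX CONCURRENTLY removes one without blocking writes:

```sql
-- Indexes that have never been scanned (check on the primary, and
-- remember that stats can be reset and replicas keep their own counters).
SELECT schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

-- Drop without blocking concurrent reads and writes (name is hypothetical):
DROP INDEX CONCURRENTLY IF EXISTS index_orders_on_legacy_status;
```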
Starting point is 00:03:58 But on that, like, before we dive too much into specifics, should we go through a few of the... How would somebody know? What are some signs that you might be in a situation where you are what we're calling under-indexed? Sequential scan. My main first sign. You see a sequential scan. And actually, if you see it, maybe the first thing you need to do
Starting point is 00:04:21 is not just go and create an index, but first you need to check your Postgres settings, because I still see many managed Postgres services, service providers, that do a bad job pre-configuring Postgres clusters. I see random_page_cost being 4 in so many places still. Yeah. They give you an SSD disk and keep random_page_cost at 4,
Starting point is 00:04:45 which is for magnetic disks. And this is the first thing you need to check if you have sequential scans more often than expected: you expect an index scan, but you see a sequential scan. Even if you don't have the index yet, it's worth first checking this thing, and only then
Starting point is 00:05:02 start creating indexes, because otherwise you might see you created an index but it's still not used, because if random_page_cost is 4, it means that Postgres will prefer sequential scans more often, right? Yeah, I thought this... I was actually quite surprised when somebody I know was migrating from a provider that tuned it a little, they tuned it down to 2.0, not as low as you might suggest, but they were moving to a provider who didn't suggest copying across the config and had the setting as 4.0, and saw no issues at all. So I do have a fresh case. Let's drop some names. Crunchy Bridge. Yeah.
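A minimal sketch of the check being described; the 1.1 value is a commonly suggested setting for SSD storage, not an official default:

```sql
-- Check the current value (the stock default is 4, tuned for spinning disks)
SHOW random_page_cost;

-- Lower it for SSD/NVMe storage; 1.1 is a common recommendation,
-- adjust for your hardware, then reload the configuration.
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
```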
Starting point is 00:05:45 Crunchy Bridge. 4.0. I said great, like super cool that they provide a real superuser, unlike most of the others. But they also keep random_page_cost at 4. What's happening there? I don't know. I think they make a good point that, in a way, it's not that important, right? Like, it's important for some subtle cases and for some cases where we're scanning a lot of rows.
Starting point is 00:06:08 Edge cases, probably, right? Yeah, but for the standard, like, primary key lookups, it doesn't make a difference. I agree, but edge cases happen more often than you would like to have them. Yeah, and as a managed provider, I think... Slightly more complex query and you already have this problem. I encounter
Starting point is 00:06:28 this problem very often. We start optimizing and this is like... I myself forget about it. That's why I'm raising this as the first thing. Because I found myself in this position, spending some time and then realizing,
Starting point is 00:06:43 oh, random_page_cost is four. So many times already. I can count like 10 times maybe during the last couple of years. So it's a lot already. And we should fix it first and then move forward, I think. Cool. Going back to what you mentioned, sequential scans: do you mean, let's say in the statistics we're looking at, like in the system views, we're looking at the percentage of sequential scans done on a table? Or do you mean looking at a single execution plan of a slow query and it's using a sequential scan? Well, this touches the topic of lacking the tools to analyze workload more holistically, because if you don't have
Starting point is 00:07:26 auto_explain enabled, pg_stat_statements doesn't have plans. And you might guess that there is a sequential scan looking at buffer numbers, right? Compare them to calls and think, oh, so many buffers, maybe a sequential scan there. But of course, I'm mostly talking about EXPLAIN and analysis of a single query. But we can also look up, is it pg_stat_user_tables? We can look at the proportion of queries that use sequential scans versus index scans. That could be, like, if you're under-indexed, you might see a decent percentage.
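A sketch of that table-level check against pg_stat_user_tables; thresholds and filtering are up to you, and small lookup tables will naturally show high percentages:

```sql
-- Share of sequential scans per table; high percentages on large tables
-- are a clue (not proof) that indexes may be missing.
SELECT relname,
       seq_scan,
       coalesce(idx_scan, 0) AS idx_scan,
       n_live_tup,
       round(100.0 * seq_scan
             / nullif(seq_scan + coalesce(idx_scan, 0), 0), 1) AS seq_scan_pct
FROM pg_stat_user_tables
ORDER BY seq_scan_pct DESC NULLS LAST, n_live_tup DESC;
```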
Starting point is 00:08:01 Actually, you're right. Yeah, at table level, not at query level, but at table level. Exactly. That's one of the starting points that I would look at for seeing if somebody, thinking about it overall, just hasn't added that many indexes. That's going to be a real telltale sign there. But I was thinking even higher-level things, like maybe they're saturated on IOPS, or maybe they have high CPU all the time.
Starting point is 00:08:26 There might be some system-wide effects of doing a lot of these table scans as well, right? Right. So it's a good point. Analysis of pg_stat_user_tables, it's a good idea. But it's also worth remembering that some queries do need a sequential scan. For example, a dump, or something reading the whole table, select count star from... oh, select count star is already an index scan, maybe an index-only scan, right? Yeah, but you're right.
Starting point is 00:08:56 This is actually probably worth mentioning quite early on that, I mean, I actually haven't looked at the percentage recently, and it will depend on the size of the table, the number of columns and things. But if you're scanning more than... I'm going to put out a number there. This is wrong, but it's roughly in the right area. If you're scanning more than 20% of a table, chances are a sequential scan is going to be faster, or at least as fast as an index scan, and less expensive. It depends. It depends. Yeah, so count, if it's
Starting point is 00:09:29 even a whole-table scan, I mean, you need all rows, count all rows. An index-only scan will be much faster than counting with a sequential scan, obviously. So it very much depends. That's why, actually, honestly, I forgot about pg_stat_user_tables, because I don't
Starting point is 00:09:45 use it too often. I usually prefer switching to the next layer. This is a very top level. Of course, if you go from top to bottom, it's great. You start with looking at tables. Each table participates in many queries, but I prefer going to the second layer, queries, so pg_stat_statements. And there we don't have a counter related to plans, and each query might have multiple plans depending on parameters.
Starting point is 00:10:15 So auto_explain is good. And this is, like... anyway, let's maybe skip it. We discussed this query analysis in other episodes, but the bottom line: sometimes you do need a sequential scan, for example, if you dump. It's inevitable. But in many cases, if it's a sequential scan, you probably need an index, unless, as you said, we need to read too much of the table. So the words cardinality and selectivity, these two words are important to understand. If selectivity is bad, it means that cardinality is high and we read too many rows. A sequential scan maybe is fine. Yeah.
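For reference, a session-level sketch of enabling auto_explain as mentioned above; in production you would typically load it via shared_preload_libraries or session_preload_libraries in postgresql.conf, and the threshold here is arbitrary:

```sql
-- Session-level demo; superuser required for LOAD. In production these
-- settings would normally live in postgresql.conf and apply to all sessions.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '500ms';  -- arbitrary threshold
SET auto_explain.log_analyze = on;
SET auto_explain.log_buffers = on;            -- only meaningful with log_analyze
```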
Starting point is 00:10:53 And the other exception is tiny tables. If you've got less than 100 rows in a table, like a lookup table, and you see like a high percentage... How many rows did you mention? I said 100, but let's say 10. Let's use a number that you can't fight. 10,000 or just 10? 10 rows.
Starting point is 00:11:08 Like we're talking about a little lookup table, for example, and you see a hundred percent of scans of that table are sequential scans. Right. Not a problem. Like, just not an issue. Right, right. But at a thousand rows, maybe an index scan already will be better. Slightly.
Starting point is 00:11:24 Like a very small difference between them. And actually, all of the things I mentioned, like high IOPS, high CPU, are not a definite sign that you're under-indexed. They could just be, like, system load. It could be you have perfect indexing and still have those issues, but they could be a sign.
Starting point is 00:11:41 Like you said, a high number of sequential scans isn't a sign that you necessarily are under-indexed, but it might be. It's a clue rather than a piece of evidence. Right, but looking at a single plan, if you see a sequential scan when you know that you just read one row in a billion-row table,
Starting point is 00:12:00 this is a strong sign you need an index. Exactly. If it's an important query, if it's a query you repeat over and over, or like if it's not one you're running like once ever and then never needing again. Trade-off, right? Exactly.
Starting point is 00:12:16 But yeah, I've got a couple of others that might be useful to think about. So we've got the high IOPS, high CPU, low index scan ratio. Also, like, a low cache hit ratio could be a sign: if you're using indexes nicely, chances are you are going to see more things being cached. And then a couple of other more obvious ones are slow deletes; if you're seeing those, there's a decent chance you're missing an index, and
Starting point is 00:12:47 You dive into some specific topic. I mean, it's not very uncommon, but before we consider this, how about mentioning that you might have an index scan, but it's still underperforming, and the sign of it is if you see a Filter in the plan: some rows were filtered out because the index was created for different queries, basically. It works, but finds too many rows, like extra rows. Yeah, exactly. And this takes us back to the first thing you mentioned, which was sequential scans. So if we're looking at a query that we think is slow, or is maybe one of our most common queries, or we're sorting pg_stat_statements by total time and we're just looking at the execution of one of those,
Starting point is 00:13:28 and we see either a sequential scan with a large filter or, as you were mentioning, an index scan with a large filter, then there's a chance that's not optimal for our workload, and we'd be better off with either an index, in the case of a sequential scan, or, better, a more precise index for that access pattern. Right, right. So I also don't like filtering, but there's something, again, you might say I'm diving too much into the weeds here, but there
Starting point is 00:13:57 is something slightly different about these two. With a sequential scan it can be really obvious: the rows removed by filter tends to be quite a large number. Not always the exact number you're filtering, because chances are, if you're doing a large sequential scan on a modern version of Postgres, it will be parallelized, at least. So the number you're seeing will probably be three times lower than the actual number being filtered, because you have to multiply it by the number of, um, right, total threads, or... I don't quite know how to describe that. Number of workers plus the main. Yeah, but plus the main one, right?
Starting point is 00:14:29 So what do you call those three combined? That's confusing. If you see two parallel workers, it means three workers already, right? Yeah, yeah, yes. So that number can be already big though, and it sticks out in plans. Whereas sometimes with inefficient index scans,
Starting point is 00:14:45 we might be in a nested loop, for example, with thousands of loops. And that number removed by, like, rows removed by filter, could say just one or two or five. But inside the loop, right. But you have to multiply it by that number of loops to realize what the proportion is.
Starting point is 00:15:02 And if it's the difference between rows removed by filter and rows returned by that node, the actual rows returned, that is the giveaway there. And if you multiply it by the number of loops, then you get a really good idea of how inefficient this is being, and therefore, oh, could we think of a better way? But the key takeaway, you're probably not going to remember all of that if you're listening to this for the first time, but please, please, please don't just see an index scan in your query plan and move on, assuming it's the most efficient thing possible. So sequential scan is one sign; filtering in, like, not our index, someone else's index, is a different case. But there is also: you see an index scan, maybe no filtering, but also a top-N sort or something, like in-memory sorting.
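Before moving on to sorting, here is a hypothetical illustration of the filter-and-loops arithmetic just described; all table, column, and index names, and the plan numbers, are invented:

```sql
-- All names are hypothetical.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.*
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.region = 'EU' AND o.status = 'shipped';

-- Imagine the inner side of the nested loop looks like this (numbers invented):
--   ->  Index Scan using index_orders_on_customer_id on orders o
--         (actual rows=2 loops=1000)
--         Index Cond: (customer_id = c.id)
--         Filter: (status = 'shipped')
--         Rows Removed by Filter: 48
--
-- rows and "Rows Removed by Filter" are per-loop averages: roughly
-- (2 + 48) * 1000 = 50,000 rows fetched to return about 2,000.
-- That ratio suggests a more precise index, e.g. on (customer_id, status).
```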
Starting point is 00:15:54 Yeah. Because you could do better if the index combines a couple of columns: one, for example, is to support filtering, the WHERE clause, right, and the second is the ORDER BY. For example, filter by date and then order by name, or filter by name and order by date. If you combine both columns properly, in the proper order, in the same single index, it will be just one single index scan and that's it, no in-memory sorting. So this is yet another sign that probably you need to think about a different index, right? Yeah, absolutely, in addition to filters, filters with large numbers. Yeah, yeah, and that's another good point, actually. If you consider the general access patterns of OLTP systems, the ones that we tend to see more of,
Starting point is 00:16:51 I think people start with single-column indexes when they're learning about indexing. And there's only so efficient you can get with those. Maybe that trade-off is perfect for your system, but chances are that you're going to have some access patterns that return a couple of different columns. And for those, you're never going to be able to compete on efficiency with single-column indexes, even though Postgres does have some optimizations
Starting point is 00:17:14 around that, than you are with a precisely ordered multi-column index. And the key takeaway for me there, that I still see people getting confused about, is that order really matters. Which column you put first really matters. Which column you put second really matters. And onward. Let's consider this simple example, which is very, very common. You filter by city, for example. You filter by city and order by, I don't know, by ID in descending order. So say you have two indexes: one on the city ID, or city name if you put the name, and the second on ID, the primary key.
Starting point is 00:17:53 These two indexes cannot work together well here. So Postgres will decide which index to choose, and then perform either additional filtering, if it chose the index on ID, or, if it chose the index on city ID, a top-N sort in memory. And in this case you just need a single two-column index on city and ID; the opposite order, ID and then city, would be pretty useless for that query.
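A sketch of the example being discussed, with hypothetical table and column names:

```sql
-- Access pattern: filter by city, newest rows first (names are hypothetical)
SELECT *
FROM users
WHERE city_id = 5
ORDER BY id DESC
LIMIT 20;

-- With separate indexes on (city_id) and (id), Postgres picks one and then
-- either filters or sorts in memory. One two-column index serves both the
-- WHERE and the ORDER BY in a single index scan (scanned backwards for DESC):
CREATE INDEX CONCURRENTLY index_users_on_city_id_and_id ON users (city_id, id);
-- The reversed column order (id, city_id) would be mostly useless here.
```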
Starting point is 00:18:39 Right, right. It will be useless because ORDER BY is executed after WHERE; WHERE comes before it. So we need the city ID to be in the first place in this two-column index. Also, if you create this index, you can drop the index on city ID. Yes. So here we can refer to what people forget. Back to the previous episode. The previous episode, about
Starting point is 00:18:57 having too many indexes. Good. So we discussed basic things. Now we move to the area you started to raise when you said foreign keys and so on. But I would say there are many, many cases, and foreign keys is one of them, when we need this index, but not too often. And this is difficult because you need to decide. Because if you delete from a table and want it executed fast, but you do it only once per year, maybe you don't need to keep the indexes which support this delete.
Starting point is 00:19:33 Because the indexes occupy disk space. They slow down updates and inserts, as we discussed. And they even slow down selects, as we discussed. I mean planning time. Or not only planning time: if you reach 16 locks, execution is slowed down as well. So maybe you don't need those indexes. So every time you need to decide: okay, I have this report, for example, or this delete with foreign keys, a cascaded delete, right, but it's executed once per what, once per week or month or year? Do I need those indexes? Or maybe I need to create indexes and then drop them, I don't know, when we do some cleanup, some temporary indexes.
Starting point is 00:20:20 So it's interesting, because there is a trade-off here, and this is the price you pay for indexes. And I'm again echoing our previous episode. But let's consider particular examples of some cases. Foreign key is an interesting one, right? Yeah, I don't think I said foreign key. I love that you jumped straight to it. I think I was giving specific examples. To support deletes.
Starting point is 00:20:45 Okay. Yeah, yeah, yeah. Well, maybe the mistake I made was saying slow deletes because what I should have said is slow queries. It's the sign that you're missing an index. The first sign is like the web page is slow for users. These are like the – I know we're back to basics, but slow deletes are a subset of slow queries in general, right?
Starting point is 00:21:06 It's the slow query. And the slow delete is one I mentioned partly because I think it doesn't fit into the other categories, because we can see a slow delete without a sequential scan. We don't see the sequential scan in the execution plan. So it was more than... You mean a cascaded delete. When we have a delete on some table, it works using some index scan.
Starting point is 00:21:31 It finds rows to delete using some index scan, but it also needs to propagate the deletion to referencing tables. So we have a foreign key pointing to us. And in this case, if we don't have a special index on those tables, Postgres will need to perform a sequential scan on that table to find all rows which reference the row we delete in the main table. And this is a bad situation. Many DBA kits, toolkits, include a special report to find
Starting point is 00:21:59 such under-indexed foreign keys. Usually on one side of a foreign key we have a primary key, right? That's okay. But on the other side, we sometimes don't have an index. And if we think about it, this is the area where a reactive approach doesn't work well, because probably you will live quite long without encountering this plan at all. And then someone decides to delete something. For example, delete some user and all data this user produced. Very common case. And suddenly, if you don't use some asynchronous approaches, you want to delete everything, maybe it's not a good idea,
Starting point is 00:22:39 but it's a different story. In this case, you didn't delete anyone during a couple of months of your project, then suddenly you try to delete and it works very slowly. So proactive analysis of such situations using that report
Starting point is 00:22:55 and creation of indexes proactively here might be a good idea. But it depends again. It depends a lot on your use case. I understand it. I was just gonna say that I think it surprises some people that Postgres doesn't index foreign keys by default. So you can define a referencing column, but, like, with the primary key, we get the index implicitly behind the scenes to guarantee uniqueness, and it means we can use
Starting point is 00:23:23 it for lookups as well. But foreign keys, we don't get indexed by default, but some frameworks do. So, like... oh, what are you gonna say? Well, indexing foreign keys sounds strange to me, because, again, we have two ends of a foreign key. So yeah, indexing one end, the index on the referencing column. Yeah. But Django, for example, if I add a second table with a... well, it's not called that in the Django world, but if I add this into my model, I'll get an index on the original table added for me. And I think it's quite a sensible default for an ORM, too.
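A sketch of the two ends being described, with hypothetical tables; the primary key side is indexed implicitly, the referencing side is not unless you (or a framework like Django) add an index:

```sql
-- Hypothetical schema
CREATE TABLE users (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY   -- PK side: indexed implicitly
);

CREATE TABLE orders (
    id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_id bigint NOT NULL REFERENCES users (id)        -- FK side: no index by default
);

-- Without this, DELETE FROM users (or an ON DELETE CASCADE) has to
-- sequentially scan orders to find the referencing rows:
CREATE INDEX index_orders_on_user_id ON orders (user_id);
```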
Starting point is 00:24:05 Yeah, yeah. But I think this is also a case when, again, like, this is a case when we need some proactive analysis and understanding of trade-offs. There are many such cases, a lot of such cases. For example, we have monthly reports, right? And we didn't prepare indexes for such reports. Or, I don't know, like, for example,
Starting point is 00:24:28 we have very high flexibility in our UI, allowing people to filter on too many things. Like, in the extreme case, when users are working with data, we allow them to order by
Starting point is 00:24:44 anything, to filter by any column, combinations of them. And this is like the worst situation when we try to understand which indexes we need. And for example, we analyzed last week, we're doing fine,
Starting point is 00:24:59 but tomorrow somebody will filter on something new which wasn't used before. And this is not covered by any indexes we have. Right. So I'm trying to say we have situations when it's hard to predict all possible cases, and things like ORMs or GraphQL, they give power to users, but this power needs to be supported by indexes. And you can say, like, there is a very high probability that if you take some project
Starting point is 00:25:35 which says we give full freedom, full flexibility to our users, you can say most likely this project is under-indexed, because you can find some situations when we don't have an index and we have a sequential scan, for example. What to do in this case? So, I mean, we're getting close to the point where we're saying what to do when you're almost optimally indexed and you've got a decision to make. I personally think that most people aren't in that situation, and the overhead of adding another one is going to be worthwhile. But as we've discussed in the previous episode, if you take that to its logical conclusion, you're going to end up in a situation where you have probably more indexes than is optimal. But maybe you're dealing with more complex situations than I am.
Starting point is 00:26:33 But I think a lot of applications are quite simple. Like, a lot of applications only have a certain number of access patterns to each table. And actually, once you've got a few indexes that are really precisely for certain features, you're probably going to be fine until you add a new feature, or until you scale beyond a point where, like we talked about in the partitioning episode, you have to make a bigger change. So yeah, I think adding a new feature is a good case where you should reconsider. But maybe I misunderstood what you were saying. There was a little bit of openness. Right.
Starting point is 00:27:08 So let me repeat. If you have some application with strict usage patterns, you understand exactly how users can use it, which queries they can produce under the hood. This is okay. You can build a proper set. You can find a very good set of indexes and then just slightly adjust, revisit it from time to time. But if you build some platform on top of Postgres, for example, the extreme case like a no-code platform,
Starting point is 00:27:42 which provides a very flexible way for users or even developers, and you build some REST API, GraphQL or something, and you give, like, full freedom: filter by anything, order by anything, all columns are possible, like, build a table with 100 columns and use all of them to filter and order by, and do it on the client side from the browser or something. This is an extreme situation, and you don't know in advance, you don't know which indexes you will need. This is a bad situation. And obviously you are under-indexed,
Starting point is 00:28:16 because you cannot create an index on each column and then two-column indexes on each combination. It's a bad idea. Yeah, well, this reminds me, I think when we tweeted probably the last episode, it must have been the last episode, somebody replied saying, I came across a client recently that had 50 indexes on this specific table, is that a new record? And then somebody else replied with one of the Heap Analytics blog posts. So Heap are kind of in that situation, right? They let people dump, I guess, analytics data right into Postgres, and they don't know what people are going to want to look up and order by and sort. So that's effectively
Starting point is 00:28:56 a custom analytics product on top of Postgres, right? So I would probably direct them at this, right? But it was not a many-columns situation at Heap. This is an old but very interesting article: they created a lot, like how many, hundreds of thousands, millions of indexes. Yeah, but partial ones. So they indexed some rows, not the whole table, some rows of it. And they also decided to go with CREATE INDEX without CONCURRENTLY. They explained why, I don't remember. It's an interesting article, but it's a special case. I think it's because you can't do more than one CREATE INDEX CONCURRENTLY at the same time.
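For reference, a partial index only covers rows matching its predicate, which is roughly how that approach scales to so many indexes; the names and predicate here are hypothetical:

```sql
-- One narrow, partial index per customer/event definition; rows outside
-- the predicate don't appear in the index at all, keeping each one small.
CREATE INDEX index_events_customer_123_signup_click
    ON events (occurred_at)
    WHERE customer_id = 123 AND event_type = 'signup_click';
```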
Starting point is 00:29:40 Yeah, maybe because of that, right. Yeah. Which, when you're doing millions, is a problem. Right, right. Interesting, yeah. But it's not what I'm talking about, when we have many columns and we need to expose this table via REST API or GraphQL and allow end users to choose the filtering and ordering. This is bad. I mean, this is a hard problem to solve. I'm not as familiar with that as it sounds like you are, but there was a blog post not that long ago from Cybertec, I think, talking about a new, um,
Starting point is 00:30:21 a new extension, I believe, that was to do with this kind of thing. I think it was designed, or the example they gave was, for like an online shop that wanted to let you filter by any category, like you want to pick men's t-shirts and then color, and at each stage they wanted to give you a count of how many there were in those categories. And it was like a new algorithm for getting those counts quickly on any custom data set. So I'm guessing there's something in that. It's called faceted search, right? That's it, yeah. Right, right. Is that related? Well, yes, but if you're building an e-commerce system and need this, it's a solvable problem. I'm talking about a platform, like a platform for people who don't want to dive
Starting point is 00:31:15 into index details. For example, think about Supabase, for example, right? But maybe even more like people who don't understand databases and don't understand indexes, and they just need to visually create tables and then build some mobile app or something. This is not good. You probably still need to think how to restrict the freedom and think about indexes. Without understanding indexes, you probably won't achieve good performance if you have a lot of data. So you're saying, as a platform provider, how much should you index things by default? And how much should you let the users have power to do that kind of thing?
Starting point is 00:31:56 Yeah, so maybe you need to restrict the paths. And maybe if you have GraphQL, maybe you need to restrict the possibilities it provides. I mean, this is just a strict trade-off, right? Either you restrict things and try and give a certain amount of performance as a minimum, or you accept poor performance, or you have something in between. Like, I've seen one product, for example, land on saying: if a user is changing a dashboard and it takes more than a few seconds to load, say to the user, could you restrict it a little bit more, like could you filter down to a smaller date range or something else, to give it a chance of doing a faster scan. So there are, like, user-facing things you can do, I
Starting point is 00:32:42 think, in those situations. But we're in strict trade-off land, aren't we? Right, right. I'm just trying to say, if you want to avoid the indexing topic, maybe you won't solve the problem. You need to understand indexes, and you need to make some decisions, probably leading to some restrictions. Yeah, I think so. Yeah, good.
Starting point is 00:33:02 And what else? Like how to build indexes faster, maybe? Honestly, I think this is probably, I think this is enough. There are so many articles I see that go into kind of the different index types, and they go into partial indexing and indexes on expressions.
Starting point is 00:33:16 It gets complicated fast, but honestly, the basics are, well, I guess they're the basics for a reason, but I think you can get a long, long way just with single- and multi-column B-tree indexes. But yeah, obviously everything else could be left for another day.
Starting point is 00:33:30 Right, or a GIN index for some more complex data, like JSON or arrays. But again, let's finish with a reminder that if you decided to add an index, also consider the cost, the price you are going to pay for it. It's not only the storage, it's a negative effect on existing queries. It will be small, but in some cases it's significant, especially if you had HOT updates, heap-only tuple updates,
Starting point is 00:33:59 and then you create an index and lose them. Updates can be significantly slower, and in a heavily loaded system it may be very noticeable. And if you're in an even moderately loaded system, you probably don't want to just create an index. You probably want to do CREATE INDEX CONCURRENTLY. Right, but we're going to discuss this
Starting point is 00:34:18 next time, how to speed up and do it right and so on. That's how to slow it down, but how to keep it safe. Right. Two times slower. Okay, good. Nice one. Thanks so much, Nikolai. Thanks, everyone. Catch you next week.
Starting point is 00:34:34 Bye.
