Postgres FM - LIMIT vs Performance

Episode Date: May 3, 2024

Nikolay and Michael discuss LIMIT in Postgres — what it does, how it can help with performance, and an interesting example where adding it can actually hurt performance(!)

Here are some links to things they mentioned:

LIMIT considered harmful in PostgreSQL (Twitter thread by Christophe Pettus) https://twitter.com/Xof/status/1413542818673577987
LIMIT and OFFSET (docs) https://www.postgresql.org/docs/current/queries-limit.html
No OFFSET (by Markus Winand) https://use-the-index-luke.com/no-offset
LIMIT clause (docs) https://www.postgresql.org/docs/current/sql-select.html#SQL-LIMIT

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres.AI. Hey Nikolay, what are we talking about today? Hi Michael. It's your proposal actually, LIMIT, right? But before that, let's mention a little bit what our companies are doing, because I think last time we did it 90 episodes ago.
Starting point is 00:00:22 Right. Sure. Yeah. So pgMustard. By the way, I wanted to tell you yesterday, but forgot. During some session, people asked me about pgMustard suddenly. Cool. And asked, can I recommend it?
Starting point is 00:00:37 What did you say? I said, yes, of course. And became shy. I should recommend it before they ask, right? It's good that they ask. We discussed some plan, and obviously we discussed how important it is
Starting point is 00:00:53 to use buffers, our last week's topic. If you haven't listened to it, please do. I mean, not you, but our listeners. Because buffers are super important and there are many aspects. And they actually found pgMustard is a
Starting point is 00:01:11 good thing to visualize and suggest. It's not my idea, yes, that's the idea from my customer. Yeah, yeah, yeah. So I personally think visualization is not the strongest pgMustard feature. Strongest is the list of recommendations based on heuristics, which is quite a good thing. And I know you improve it all the time, already many years. So it's great. Yeah.
Starting point is 00:01:37 Yeah. And what about me? My company is currently creating a bot. Check out our blog post, a few months old already, my blog post where I described how we built it. And currently we are rebuilding it fully,
Starting point is 00:01:54 it will be interesting, the release will be soon. Yeah, it's quite interesting, we are bringing it to the web from Slack. I think it will live in Slack still, but it will be in the web too. It's just
Starting point is 00:02:10 easier to use it there. And we started using different models, including Gemini 1.5 with a 1 million token context window. So, yeah, we changed our approach right now, and it's kind of interesting.
Starting point is 00:02:28 It knows already more than 1 million documents about Postgres. It's huge. So, yeah, that's probably enough about Postgres.AI and pgMustard. And let's talk about LIMIT. Nice, yeah. That's cool to hear. Maybe we'll hear more about that another time, about the differences between the models.
Starting point is 00:02:42 Yes, so I suggested this topic. And it came out of a tweet a long time ago that I found myself recommending over and over again to people, because it was a phenomenon I saw and most people didn't understand, or didn't intuitively understand, including myself. So this was a tweet from Christophe Pettus, and the headline was "limit considered harmful" or something along those lines. And it was a really good kind of initial tweet for a thread. I was like, what?
Starting point is 00:03:11 Limit's good. There's so many good things about limit. It's really helpful for so many performance reasons. What's harmful about it? And he goes on to describe one specific case where it can cause plan flips
Starting point is 00:03:24 and really, really slow queries in cases where you'd expect it to be helpful in performance. So that was the trigger for this topic, but it might be nice to discuss all the positives of using limit, or at least when we should be using it, whether we should be using it, that kind of thing. Yeah, well, first of all, I guess we will be fighting a little bit in this episode because we have different opinions. Let's start. I mean, I cannot be on the side of Christophe Pettus in that tweet. I think there is huge potential, because limit is what every one of us is using all the time, right? The limit is everywhere. So if there are some cases
Starting point is 00:04:07 when it's harmful, it's interesting. But my regular approach is if you don't limit, you're in danger. Limitless queries are harmful because you don't know if you tested,
Starting point is 00:04:21 you know, my favorite topic, testing and experiments and how to test, and the idea that if you test on small databases, you miss things. It's bad testing. It's an anti-pattern, unless you know what you're doing and do it before you test on full-size databases. So this is exactly when it's not even about performance. It's just that if you test on a small database, small tables, and some queries showing some results on a page,
Starting point is 00:04:52 they don't have limit. And we know we all usually start with it. If we prototype something, first we do it like, forget about pagination, let's show everything here. And then you test, it works well, like 1,000 rows on one page is not a problem, probably, depending on how heavy they are in markup. But then the project grows and some users, for example, have 10,000 posts or, I don't know, comments, and you show all of them on one page and you have problems.
Starting point is 00:05:24 And this is a postponed problem if you start a new project, right? So limitless queries can be harmful quite easily, and this is a straightforward idea. If you limit, at least you tell yourself what to expect. Okay, I expect 15 rows, for example. Okay, we know it's good. We know how many buffers we should expect in a very efficient query: 15 times the width of a row, plus some internal buffer operations, continuing the topic we had last week. And that's it. If you don't have limit, you don't know what to expect. Maybe it's a billion rows. Yeah. So I agree.
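For illustration, roughly the kind of check being described here, with hypothetical table and column names: compare the buffer counts that EXPLAIN (ANALYZE, BUFFERS) reports against the expectation of about limit-times-row-width plus a few index pages.

```sql
-- Hypothetical example: a query expected to return at most 15 rows.
-- With BUFFERS, the plan shows how many 8 kB pages were touched;
-- for an efficient plan this should stay close to the handful of
-- index and heap pages needed for those 15 rows.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders
WHERE user_id = 123
ORDER BY created_at DESC
LIMIT 15;
```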
Starting point is 00:06:06 And I think pagination is quite a good place to start because when people think, like if you look it up, limit in the docs, for example, you're also going to learn about offset. And it's a really common concept. And more offset. Well, yeah.
Starting point is 00:06:19 So I'm a big fan of, I'm going to say friend of the podcast, Markus Winand, previous guest. Yeah. I've always wanted to say friend of the podcast. So maybe I finally got to say it. It was a great episode. Thank you. And yeah, it was, so he has a page on his website called No Offset.
Starting point is 00:06:37 And I just used the No Offset hashtag on Twitter. I use it all the time, because when we discuss, when somebody mentions offset, and this is not an expert who is doing offset on purpose, knowing why. For example, with index-only scans, before you fetch the rows
Starting point is 00:06:57 you deal with a regular index scan. There you probably want offset sometimes, but very rarely actually. So I always use this hashtag, no offset. It's good. I'm sure most people listening know exactly what these things do, but just for clarity, when we use limit, we might add like limit 10 to the end of our query,
Starting point is 00:07:19 limit 20, limit some number, and that will, as the name suggests, limit the result set to that many. Limit zero. Do you use limit zero? No, but I've seen quite a lot of offset zero. I use limit zero in two cases. When I,
Starting point is 00:07:38 for example, create a table, I select specific columns limit zero. For example, sometimes I just bootstrap some table, I don't want to repeat all the data types. Sometimes it's like you're dealing with CSV or something. And also when I do tests, for example, presence of a column: you can just select that column from the table limit zero, not caring about anything. And yeah, kind of an assert.
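For illustration, a minimal sketch of those two LIMIT 0 tricks, with hypothetical table and column names:

```sql
-- Bootstrap an empty table with the same column definitions as an
-- existing one (handy before a CSV import), without copying any rows.
CREATE TABLE orders_staging AS
SELECT id, user_id, created_at
FROM orders
LIMIT 0;

-- Cheap "assert": this succeeds only if the column exists,
-- and never returns any rows.
SELECT created_at
FROM orders
LIMIT 0;
```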
Starting point is 00:07:57 Yeah, I've seen that really cool use case. I've seen offset zero used a few times. So limit limits to the first n results. I've never thought of using zero for that. It's cool. Offset will jump you forward that many before you start limiting.
Starting point is 00:08:21 So you could offset. You take the first 10, throw those away, and then limit will take you the next 10. So if you do limit 10, offset 10, you get the second set of 10. Yeah, and offset zero, it's either somewhere auto-generated, right? Because it was offset N where N is zero. Or it's also some performance trick, some hack, right?
Starting point is 00:08:44 Yes. I already forgot... yeah, I think it's an optimization fence, it is or was an optimization fence, so it can trip the planner into not doing certain optimizations, which I guess we'll get onto a little bit. But that trick, or it was mitigated, well, something in the docs made me think it might have been. I think it still works, because I remember reading a hackers thread, or it was a thread somewhere in the mailing list. Yeah, so many times it may be used, but it was so long ago.
Starting point is 00:09:16 That's why I'm wondering if it's still actual. So I read a line in the docs in preparation that offset zero is the same as omitting the offset clause, but I don't think that's quite true because of the planner implications. But in terms of the results, that's true. In the results... Semantically, it's true, but physical execution might be affected by the presence of offset zero.
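As a rough illustration of that old trick (behavior can vary by version, and the table and column names here are hypothetical): wrapping part of a query in a subquery with OFFSET 0 has historically acted as an optimization fence, keeping the planner from flattening the subquery into the outer query.

```sql
-- OFFSET 0 returns exactly the same rows, but has historically kept
-- the subquery as a separate planning step instead of being pulled up
-- into the outer query, which people have used to force an evaluation order.
SELECT *
FROM (
    SELECT *
    FROM orders
    WHERE user_id = 123
    OFFSET 0
) AS sub
WHERE created_at > now() - interval '30 days';
```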
Starting point is 00:09:37 Yeah. So the reason I brought this up here was because we were talking about pagination, right? Both common ways of paginating use limit. One is the kind of crude way of doing it, and a lot of ORMs, a lot of tools, implemented it this way,
Starting point is 00:10:01 at least initially, which was: for the first set of, let's say, 20 rows that you want, just add limit 20. For the next set, they just did offset 20 limit 20, and then the next one offset 40 limit 20. And the problem with that is performance degrades linearly as you go, and if you want to get to page 20 or page 100, that can be a very inefficient query. And Markus wrote a great post about why we should instead be doing keyset pagination for predictable performance and efficiency. The explanation is simple. It's just because internally it fetches everything up to that point
Starting point is 00:10:40 and discards it. It's a lot of inefficiency. It grows with the depth of your pagination, right? Yeah. So what we were talking about in terms of good is: where the ID is above, or where some parameter is greater than the last result you fetched, and then still limiting by 20 or 21, depending on exactly how you want to do it.
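A sketch of the two approaches being contrasted, with hypothetical table and column names:

```sql
-- Crude OFFSET pagination: page N skips and discards all earlier rows,
-- so the cost grows with the page number.
SELECT id, user_id, created_at
FROM orders
ORDER BY created_at DESC, id DESC
LIMIT 20 OFFSET 40;   -- page 3

-- Keyset pagination: remember the last row of the previous page and
-- continue from there; with a matching index the cost stays roughly
-- constant no matter how deep you paginate.
SELECT id, user_id, created_at
FROM orders
WHERE (created_at, id) < ('2024-04-30 12:34:56+00', 10042)  -- last row seen
ORDER BY created_at DESC, id DESC
LIMIT 20;
```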
Starting point is 00:11:04 Something like that. So limit one can be a problem, or limit 10 can be a problem, right? Let's talk about this problem maybe, right? Because I hope I already explained why limit is good in terms of performance: because if you limit, you understand yourself what to expect, and you know what is efficient and what's not if you look at buffer operation numbers. I also think there are a couple of even simpler cases. Like, chances are you don't want... if your query could return a hundred thousand rows, do you really want all of them? You might only be interested in the top 10, or maybe your user might only care about the most recent 10, and returning
Starting point is 00:11:52 all 100,000, even if the query is quite fast on the server side, sending all that data over the wire is inefficient. So there is, I think, a kind of more fundamental reason that limit is good for returning a subset of rows, exactly what it's designed for. It's close to our topic about delete and batching, by the way. Someone asked in the YouTube comments, how about updates? And I explained it's very similar but with additional things. So if you don't batch and send the whole result set in one shot, it's not only a lot of memory and so on, but resiliency, reliability:
Starting point is 00:12:30 what if it fails at the end of processing? You lose everything, right? And the retry will be huge again. So it's better to go in smaller steps with retry logic. And limit, of course, is the key tool to split into batches. So pagination, batching, we need it.
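A minimal sketch of that batching pattern, assuming a hypothetical orders table and an application loop around the statement:

```sql
-- Delete (or update) in small batches so each statement is short,
-- holds locks briefly, and is cheap to retry if it fails.
-- The application repeats this until it affects zero rows.
WITH batch AS (
    SELECT id
    FROM orders
    WHERE created_at < now() - interval '5 years'
    ORDER BY id
    LIMIT 1000
)
DELETE FROM orders
WHERE id IN (SELECT id FROM batch);
```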
Starting point is 00:12:52 And also, in general, common sense: if I limit, again, I know how much I want. This is my request. I want 15, for example, or 25. And if you scan many more parts of the database than these 15 records or 25 records
Starting point is 00:13:10 plus additional internal index pages and so on, then I'm not considering this as efficient work, right? So it sets a simple expectation and metric for efficient versus not efficient comparison of queries, right? Yeah, I think that makes tons of sense. Like in application code, for example, you are suggesting that you're kind of communicating with future developers what the expectation is here. I think this is where we're going to start to differ in opinion, though, because I think this is exactly...
Starting point is 00:13:46 Let's say on an ad hoc query, I've sat next to people working on production systems that were quite scared, and saw them do this and didn't think this was a problem, I actually was learning from them at the time. They were only expecting a single row back, and they added a limit one at the end of the query. For example, a primary key lookup, right? Yeah, I think this might have even been like... I can't remember exactly what they were doing, but they definitely added... Primary key or unique key. It cannot hurt because they are...
Starting point is 00:14:18 It's like if you don't trust Postgres unique keys, you add limit one just as an assert, basically. You think, okay, I must check, it should fail or something if it returns more, something like that. Yeah, I actually can't remember exactly why. All I remembered was, oh, that's quite a clever way of actually not having a runaway... One second, I remember, I did a different thing. I did limit two on purpose. So if something goes wrong and I have two rows, I don't remember where, but somewhere I should have an error or exception, catch it and process it. So limit two was where it should be unique, so I expect one row, but I add limit two on purpose to check later that it's not two.
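A small sketch of that trick, with hypothetical table and column names: fetch with LIMIT 2 where exactly one row is expected, and treat a second returned row as an application-level error.

```sql
-- Expecting exactly one row per email. LIMIT 2 keeps the query cheap
-- while still letting the application detect a violated expectation:
-- if two rows come back, raise an error instead of silently using one.
SELECT id, email
FROM users
WHERE email = 'alice@example.com'
LIMIT 2;
```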
Starting point is 00:14:53 Yeah, I like it. But yeah, so anyway, in my opinion, I would add the limit as a must for everything. I think the fear in this case was a runaway sequential scan across a huge table that might start to eat resources and cause problems.
Starting point is 00:15:26 I think this was a very, very on-fire server. I'm thinking now and I think maybe I would just even consider this as a rule, mandatory like limit everywhere and the only thing that we cannot limit is, for example, select count star from something. This basically
Starting point is 00:15:42 deals with all rows, returns just one number, and we cannot limit there, unfortunately. There, okay. Well, I mean, you can add limit. You're just still going to get the count. Yeah, yeah, yeah. Limit will be useless in this case.
Starting point is 00:15:54 Yeah, definitely. The interesting thing, though, and I was looking back at, just for fun, could I find a version of the Postgres docs that didn't have limit in it? I'm sure it's as old as time, but it was in an old version of the doc set. I got back to 7.1 because that's the oldest one online, and there was a really... I got way more than I deserved looking back
Starting point is 00:16:18 at that, and it said: as of Postgres 7.0, the query optimizer takes LIMIT into account when generating a query plan, which I thought might be a nice segue. So before 7.0, there wouldn't have been any danger in adding limit to your query. I disagree. I object. Mind your language. This language is wrong. "Any danger in adding limit." There is no danger in adding limit. Let's already jump to the core of the topic. There is no danger in adding limit. There is danger in having bad statistics, or forming expectations based on indexes which cannot support your order by and your way of filtering. And limit. Yes. By the way, when you describe this case,
Starting point is 00:17:08 focus on order by as well, because without order by, this case is impossible. It's not about limit. It's about order by plus limit, which forms expectations. Like, I mean, gives planner the ability to use some index. Right? Yeah.
Starting point is 00:17:22 So let's talk about, well, for anybody thinking we've jumped, I shared some examples with Nikolay before the call. So that... I saw this example before, long ago, I saw it live. Yeah, well, that's good. So let's back up and try and explain the problem, or the edge case, or whatever you want to call it. So exactly the same query, if you run it with and without limit some number, can be dramatically slower with the limit added than it is without the limit at all, which I found really counterintuitive. And yes, it relies on some specific cases, but those specific things are not as rare in the real world.
Starting point is 00:18:06 Yeah, so that person sitting in front of that console, hoping that limit some number is going to make it safer and guarantee faster execution. Same logic as expected, yes, I agree. They can actually shoot themselves in the foot and make things loads worse for themselves. And I actually think there might be other advice you could follow, like I'd much rather they added additional where clauses or other things than an additional limit. So that's potentially controversial, I'd love your take on it. But let's try and explain how this can happen. So the simplest case is to filter by a relatively rare condition and then order by something else that you have indexed, so for example created_at. So if we think about, let's say, a software-as-a-service business, and maybe you have a customer that cancelled a few years ago and you want to look at their most recent event or something like that.
Starting point is 00:19:12 And so you're looking at: where company = A, order by created_at descending, limit one, something like that. Or limit 10. Or limit 10, exactly. It can happen, the problem can happen not only with one, it can be some small number. Yeah, it could be a big number or relatively small, the key is that it's small enough to cause the flip. Yeah, well, Christophe made a good point that I haven't tested yet, that he thinks it's worse around the point of the actual number of rows. Like there are some cases where, let's say that company had a thousand events, the limit thousand would actually be problematic. So the problem comes from not knowing. Like, Postgres assumes things are randomly distributed unless it knows better, unless it has statistics to suggest otherwise. And if it thinks that the table that contains these events is going to be,
Starting point is 00:20:10 it's going to have events from any company in any order. Because it knows it only has to fetch one or ten events, it now has this optimization where it might be quicker for it to scan backwards through... You missed a very important thing. Let me add things. So we talk about, for example, in case you took some SaaS system, some users, and users have some activities, let's say, like comments or, I don't know, orders, anything.
Starting point is 00:20:39 So we have an orders table, for example. This is what you showed, if I'm not mistaken. I remember. Something like that. So, orders, and we have user_id, we have created_at, we have the id of the order. Enough, right? And the key here is we want to select all orders for this specific user in a chronological order, like order by created_at desc, so in reverse order, the newest first, the latest first, right? And then the key here is not to forget: our developers, or we are developers, we don't know about multicolumn indexes.
Starting point is 00:21:20 So we created an index on created_at, we created an index on user_id, and on id. Okay, a simplifying assumption: we like to index all columns. Even worse, when we do it, we always use only single-column indexes, right? This is quite a classic example. I can imagine it happens in the life of every engineer during five years of development, for example. Definitely, at least once it should happen,
Starting point is 00:21:50 I would say. And to 99% of developers. It happened with me many times, with my products, my databases, and also with others when I observed them. Actually, recently we had a similar case, maybe a few months ago, I remember. So the key is Postgres needs to choose: okay, I have a filter on user_id and I have order by created_at desc limit one or limit 10. And I have two indexes. So I have two opportunities here to save in terms of I/O operations. Of course, Postgres should avoid a sequential scan here. It can be one user of many, of millions, right? So it can be a huge table, so we don't want to scan whole tables. So we need to choose only one index, or two of them and maybe combine with a bitmap scan and so on. But there are two opportunities here: filtering by user_id, and also order by created_at desc limit one or 10.
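To make the setup concrete, a minimal sketch of the kind of schema and query being described, with hypothetical names and only single-column indexes:

```sql
-- Classic single-column-index setup: one index per column, and no
-- multicolumn index covering the filter plus the sort.
CREATE TABLE orders (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_id    bigint NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX orders_user_id_idx    ON orders (user_id);
CREATE INDEX orders_created_at_idx ON orders (created_at);

-- The query that forces the planner to choose between the two indexes:
-- filter on user_id, order by created_at, small LIMIT.
SELECT *
FROM orders
WHERE user_id = 123
ORDER BY created_at DESC
LIMIT 10;
```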
Starting point is 00:22:49 And the key here is order by. Without order by, limit will always be fast, because order by is what makes the planner consider the index on created_at. Yeah, we need both the equality check and the order by to make it a choice for the planner: it either has to filter on the equality condition with one of the
Starting point is 00:23:11 indexes, take the data and sort it and return it in order, or it has to choose the index that's ordered and take the first rows it reaches that match the equality condition. It doesn't necessarily have to be an equality, it can be a between, one or the other. Sorry, it doesn't have to be an order by as well, it could be different; it has to be two separate conditions, and those have to... They fight between each other, right?
Starting point is 00:23:38 Yes. How do they fight? They fight based on costs, based on statistics and what Planner expects in terms of cost. And the key also, like, important point here is that planner has two costs. Startup cost, which can be very small. It means
Starting point is 00:23:53 first row can be returned very fast. And also the full cost. Full cost is the most important when planner, like, the full cost is what planner uses to make the final decision choosing between a variety of plans,
Starting point is 00:24:09 right? Full cost. So in our case, the index on user_id: if we use just it, it means the planner thinks, okay, I will fetch all rows for this user, but then I need to order them in memory. It means, yeah, the startup
Starting point is 00:24:26 cost cannot be close to zero, because the ordering and so on will take some time, and only once it's done can we return rows. I assume in this case the startup cost and the final full cost are very close to each other, but far from zero, probably, right? But the other option, order by created_at desc limit one or 10: this option means that we immediately use the created_at index. We start fetching rows from there, but we need to filter out irrelevant rows owned by other users, right?
Starting point is 00:25:04 We only need our user_id. It means we start returning rows quite quickly, but only if we find our user_id quite quickly. There is uncertainty here, but for the planner it's quite certain, because it has some expectations about distribution. You mentioned it, right? So probably the planner can think, okay, I will be lucky, because this user, maybe it's a real user, maybe it's still super active, by the way, because we start from the very freshest rows. The latest row will be considered first, and if our user_id is there, it means we already found one row in the first step, right? And then go, go, go. But if this user has been super inactive for the last year, for example, in reality, we need to scan the whole year to reach the first row of that user to return. And here is exactly where the plan can be very wrong.
Starting point is 00:25:58 I mean, it simply doesn't know. It doesn't know. There is no correlation in statistics between user_id and created_at. And also, the planner doesn't know,
Starting point is 00:26:12 like, who is recently active, who is not. Planner doesn't know. So, it can be very hard to figure out
Starting point is 00:26:19 Yeah. So, there are multiple... So, I completely agree, that's why you brought up multi-column indexes. In the case that I shared with you and the case we just discussed, a multi-column index on the user_id, or whatever it was, and then created_at, ordered, would make this query efficient and predictable all the time. And so if this was an application... For me, it's not that important. I'm feeling a lack of an index here. Okay, yes.
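A sketch of the fix being described, against the same hypothetical orders table: one multicolumn index that supports both the filter and the ordering.

```sql
-- One index serving both the equality filter and the ORDER BY:
-- the planner can walk this index for user 123 in created_at order
-- and stop after 10 rows, regardless of how active the user is.
CREATE INDEX orders_user_id_created_at_idx
    ON orders (user_id, created_at DESC);

SELECT *
FROM orders
WHERE user_id = 123
ORDER BY created_at DESC
LIMIT 10;
```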
Starting point is 00:26:46 Yes, yes. But if this was an application query, and this was showing up in pg_stat_statements as a problem, and we needed to optimize it and it was one of our biggest concerns, and we didn't mind the overhead of adding another index, great, we can add this index, remove the one on user_id, whatever we had, and it would be fast all the time. But I was talking about a case where we're sitting in front of a console on production, trying to stop ourselves from running a large query or a slow query,
Starting point is 00:27:14 and adding that limit even though we didn't think it would make a difference, even though we think we're only going to get one or two rows returned. If we add it, we're taking a risk that it could actually make things worse, because we don't... like, if we don't check. And this is actually maybe what I would recommend, I don't know about you, but if you add explain before the query and just check what index it's going to use, does it make
Starting point is 00:27:36 sense? Like, if we saw it was going to do a backward scan on created_at, maybe that would be a sign that it's a bad idea. The tricky part is this is an issue because of a cost underestimation, so it's always going to look like a cheap query in cost. The reason it's being chosen, if it's a bad plan, if it was a slow plan, is because the planner is thinking it will be cheap, because it thinks it can abort early. So costs won't give you a clue here, you'd have to be attuned to which index it is scanning, and is that enough? Is that a good idea for this query?
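Roughly what that check looks like in practice, still with the hypothetical orders table; plain EXPLAIN only plans the query, it does not execute it, so it is safe to run first.

```sql
-- Plan only, nothing is executed.
-- If the plan shows something like
--   "Index Scan Backward using orders_created_at_idx"
-- with user_id as a Filter, that's the warning sign that the LIMIT has
-- pushed the planner toward scanning by time and hoping the user's rows
-- show up early.
EXPLAIN
SELECT *
FROM orders
WHERE user_id = 123
ORDER BY created_at DESC
LIMIT 1;
```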
Starting point is 00:28:12 So I'm talking about production cases, like where you're doing ad hoc queries, not so much application queries where you should optimize. Yeah, for example, we need to return our table and allow users to order by all columns in both directions. It's a nightmare to support in terms of performance. Yeah. And if you have many of such tables, or maybe you allow users to create tables, any tables,
Starting point is 00:28:39 and you want to hide this performance optimization from them, some kind of website builder or mobile app builder, and you can say, okay, users, developers, your users are developers of a website or mobile app. And you say, I allow you to create tables, I will take care of performance. In tables, you're allowed to create columns, I will take care of performance.
Starting point is 00:29:03 And you can put some element on your UI for your users, millions of users, and you can allow them to order by any column in any direction. It's a super nightmare. An absolute nightmare to support, because obviously you want to create an index on every column, and also you think, okay, I'm smart, I know about keyset pagination. You get your pagination, you then allow this, and this is exactly where you can encounter it, right? Because of this limit. So I wanted to emphasize again, like I described, one index has a very low startup cost:
Starting point is 00:29:39 it's the created_at index. And why is it low? Because Postgres doesn't know, for an arbitrary user. We probably want this query to work not for one user, but for an arbitrary user, right? Right. So in this case, Postgres doesn't know, for this particular user, how fresh the activity or orders are, right? But it assumes something. It can be slightly more, slightly less, it depends,
Starting point is 00:30:08 but it assumes something. For the second row, it's two times more; for 10 rows, 10 times more. And this is our... but once we've found them, we don't need to do anything else. If we traverse the created_at index and just filter out irrelevant other users, right? We just need to find our user's rows.
Starting point is 00:30:28 Once we've found our 1 or 10 rows, we are happy. This is our final cost. And then if we don't use limit, our final cost is our expectation for this user. These statistics are probably present, probably, unless we don't have statistics at all. Postgres expects, for this user_id, I expect this number of rows, and this forms our total cost. And probably if it expects a lot of rows, the total cost will be high and it won't use this approach. In this case, it switches to the more reliable approach: I will fetch all the rows and do an in-memory sort. So the key here, if we limit,
Starting point is 00:31:11 this is probably why Christophe mentioned that the boundary is this limit. It's not the actual number of rows, but statistics, the Postgres expectation for the number of rows for this user_id, exactly. So when our limit goes down below it, right, below this number, our traversal with filtering out irrelevant users becomes more attractive, because the total cost starts looking good there.
Starting point is 00:31:43 But it's a trick, right? Because the problem here, like, it's a trick. And, of course, it's a super interesting phenomenon. I would say this is a phenomenon. It's not a harmful situation, because there are many cases when limit, like, doesn't help. For example, we don't have indexes at all. We say, well, no. No, that can really help. It can help. It can help. Yeah.
Starting point is 00:32:08 It can help, yeah, because filtering out... it can help. It's easy to imagine when limit is useless. Actually, that's another time I use limit. So the docs are quite... In the general case,
Starting point is 00:32:26 it improves performance because you will stop earlier. But in an edge case, when our record is the latest in the scan, it doesn't help. Here we deal with a similar situation. In general, it's good, but we have some edge case
Starting point is 00:32:40 which not only doesn't help, but it decreases performance. But it's an edge case and it's a phenomenon. It's not a harmful limit, especially because you cannot do this without order by. And especially because the problem here is not limit, but the lack of a proper index. And again, we're talking about a one-off query rather than an ongoing optimization challenge.
Starting point is 00:33:09 I'm talking about it's just an interesting phenomenon that, and I don't think most people expect this, that by adding limit, you can make the query slower. And that's true. It can be a lot slower. So I just don't want people to use that as their safety mechanism. When you understand what's happening inside, adding limit
Starting point is 00:33:28 is clear why it's wrong. It's clear. You just force the planner to choose the wrong index here. Right? That's it. But of course, for people who don't dive inside the execution and planning process,
Starting point is 00:33:44 they, of course, are very surprised and think limit is bad. But remove order by, and you won't see this effect, right? But the docs, and this is what I wanted to say, the docs quite often when you're reading about limit
Starting point is 00:33:57 will say it's really sensible to use order by with limit, because otherwise you're getting things in an undefined... Again, one second. Again, if we talk about an arbitrary user_id, what if this user_id is super active and created a thousand rows, a thousand orders, right, today? In this case, the planner was right. So that's why I say this is not even an edge case, it's a corner case. We have several things in the corner here.
Starting point is 00:34:26 And observing a very strange effect, a phenomenon, of course, it makes sense. But create proper indexes, that's it. But at least let's correct this harmful title. Let's say: limit one with order by considered harmful.
Starting point is 00:34:42 Order by limit. It's not just order by, though. It could be another condition that you add. Lack of multi-column indexes, order by limit considered harmful. This is the full truth. This is the truth, actually, here. Not just limit.
Starting point is 00:34:58 I was hoping you'd understand my point of view on this. It feels like you haven't. I do think that in a lot of real-world cases people don't have perfect stats, or they have data that's correlated strongly in certain ways, for example user id and account id being extremely tightly coupled, you know, the age-old city and country codes. And Postgres assumes that all of these stats are uncorrelated by default. And if you want, you can create statistics for these things. But in a lot of cases, we have a lot of skewed distributions in databases that we haven't created extra statistics for.
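For reference, a sketch of the extended statistics feature being alluded to, with hypothetical table and column names; an ANALYZE afterwards lets the planner pick the new statistics up.

```sql
-- Tell the planner that city and country are functionally dependent,
-- so it stops multiplying their selectivities as if they were independent.
CREATE STATISTICS addresses_city_country_stats (dependencies)
    ON city, country
    FROM addresses;

ANALYZE addresses;
```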
Starting point is 00:35:37 And most of the time, the planner is great and can really work things out nicely. But by having this limit optimization, it can flip plans that we're not expecting to flip. And the reason I found it so powerful is we're using it exactly in the cases where we're most scared of doing a big scan on production. If that's what you're really scared of doing, this is something that can actually cause it. Michael, why can you not agree with all this,
Starting point is 00:36:05 this concept, that it is harmful? Because imagine we don't have an index on user_id at all. In this case, the created_at index shows bad performance because the user
Starting point is 00:36:20 became inactive last year. We need to scan the whole year. But at least we have some approach to scan, right? It adds some order of scanning. Without that, we only have a sequential scan, right? That's it. So the problem is we don't have a proper index. Now, okay, it turns out we have an index on user_id.
Starting point is 00:36:47 It happened to be helping, but what if some user has 50% of all records, millions of records? Bear with me. This is why I'm suggesting, instead of running the query on production by adding a limit one or something to make it feel safe, instead of adding limit one, just put one word, explain, in front of the query,
Starting point is 00:37:11 not explain analyze, not explain analyze buffers, just explain your query without the limit and see what the query plan is before you run it. If you're scared of what it's going to do in production, do that instead of adding the limit and you'll see what indexes it's using, whether you're missing one. But I cannot not add limit because I don't want my page to blow up because of many, many rows returned.
Starting point is 00:37:33 Well, add limit one and put explain in front just to check what it's going to do. Because that's what's really going to check. That's what's going to actually keep you safe. Yeah, you're checking what you're actually scared of instead of assuming you're going to be safe. It's like, I understand your point that it's almost never going to be a problem, but imagine if a safety on a gun, for example, like, almost always kept you safe.
Starting point is 00:37:59 It's like a safety feature that's almost always good, but sometimes it's going to actually shoot you when you press that safety catch, not when you pull the trigger, but when you press the catch. Again, in my world, I add limit one to limit things. If it doesn't work, it means I don't have an index. If there is some index which would help me when I don't have the limit, I don't care, because I do need limit one anyway. And maybe you'll hit some statement timeout, or you'll cancel your query quickly, or there will be some other safety mechanism in place that will stop you from hurting yourself.
Starting point is 00:38:37 But still, if my limit behavior is not good, I'm looking at which index will help. And I'm not considering the index on user_id as helpful here, because I don't want to deal with a lot of records in memory, because I need only 15 records. So that's why I jump straight to multi-column indexes here. And as we discussed... And I agree.
Starting point is 00:39:00 Yeah, yeah, yeah. I agree, that's how to optimize this query, but we're not talking about optimizing the query. We're talking about: should you use limit as a safety feature when you don't need it? Why might I not? Let's say you know there's only going to be one row. Should you add limit one? How do you know it's only one row?
Starting point is 00:39:23 Unique index? If you have unique index, this problem won't happen. Let's say a company could set up multiple things, but they generally set up one. Like how many websites are you going to have for each organization? You don't have a unique index on it. People can add a second. Okay.
Starting point is 00:39:41 But almost everybody else... Or emails, for example, right? Yeah, email addresses. Yeah.
Starting point is 00:39:51 It's probably going to be less than 10. Like you don't have any, but maybe you're not certain it's going to be like one, but you're pretty sure it's going to be one. But, uh, do I have up-to-date stats or not? Because stats will say, they will form an expectation for Postgres:
Starting point is 00:40:08 if it's a low number of rows, it will prefer. I think I'm struggling because it's a simplified example, but if you've got lots of joins going on, there could be all sorts of other complexities. Joins are tricky, by the way. The same problem is becoming much more tricky,
Starting point is 00:40:24 but I don't think limit is a problem. I think the problem is a lack of proper indexes, and sometimes you cannot create them. Imagine user_id is present in one table, but created_at is present in a different table. So are you saying you've never sat in front of a database where you don't know
Starting point is 00:40:39 all of the indexes in the system? You don't know the state of stats, and wanted to run a query, and not been scared that it might take a long time? Well, when I do this, I add limit. I understand that if I lack a proper index that would help, I might be dealing with a sequential scan or the wrong index being scanned almost fully, right? I understand that.
Starting point is 00:41:02 Yeah. If you don't understand that, that's strange, right? Well, but then, so this is my, that's kind of my takeaway: instead of adding limit one and hoping that I'm okay... Let me give you an example I saw before what we discuss today. So we have full text search, for example, and you have rare words and we have very frequent words present in, like, the majority of records in a tsvector. Yeah, that's a nice example. Right. And then we have created_at or id and we want to order by.
Starting point is 00:41:37 We want to find all matching documents and return only the 15 latest, for example. In this case, Postgres needs to choose, must choose, between ordering by the B-tree and then applying a filter with the @@ sign, right, tsvector @@ tsquery, or it might decide, okay, I need to extract all matching documents and then order in memory and then limit. Same problem, completely the same problem. And I saw terrible cases when I thought, wow, why do you do this? Why do you think this word is super frequent? So instead of using GIN, you decide to use the B-tree to order by, limit 15,
Starting point is 00:42:28 for example, and then you end up scanning almost everything, finding my one or two records only at the end of the scan. It's absolutely the same situation, but the problem to me is not limit there at all.
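A sketch of that full text search trade-off, with hypothetical table and index names: the planner has to pick between the GIN index (match first, then sort in memory) and the B-tree on created_at (scan newest-first, filter each row).

```sql
CREATE TABLE documents (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    body_tsv   tsvector NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX documents_body_tsv_idx   ON documents USING gin (body_tsv);
CREATE INDEX documents_created_at_idx ON documents (created_at);

-- If the planner misjudges how common the search term is, it may pick
-- the created_at B-tree, walking newest-first and hoping to find 15
-- matches quickly, and end up scanning nearly the whole table when the
-- matching documents are rare or old.
SELECT id, created_at
FROM documents
WHERE body_tsv @@ to_tsquery('english', 'rareword')
ORDER BY created_at DESC
LIMIT 15;
```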
Starting point is 00:42:43 I put limit, I see the problem, I understand limit didn't help, I understand why: because we need to combine both things in one single index, and if we want only one single index scan, unfortunately this is impossible there, right, in this particular case. Because for that you need to use a RUM index, and it can be not available, on RDS for example it's not present. Or, like, work out something like a middle ground: take way more rows than you want and then sort them, that kind of thing, right? So, I mean, we sit on the edge here, we might be lucky, but we are not. And what else, like, what can I say? It's not a problem of limit, it's a problem of a lack of an index, I think. Let's try and wrap up. And I think I agree with you that to solve the
Starting point is 00:43:33 performance issue, you don't get rid of the limit, you add a good index, or you get your stats right, like you work out your stats correlation, maybe create some statistics there, maybe you just need to run analyze on that table or maybe globally. There's lots of ways of solving it so that that query doesn't run slowly again in future. I just wanted to publicize and share that. And by the way, I wanted to shout out Franck Pachot, who got in touch recently when I shared this Twitter thread again, to say, I think we agreed that it's an optimization problem. And yeah, he would completely be agreeing with you right now. I just found it really interesting and quite counterintuitive.
Starting point is 00:44:13 Probably, it sounds like, more than you did, that it could, just by adding it, make a query slower. Right. Even though that's a sign of other issues. So I wanted to talk about that. It's an interesting phenomenon, for sure. Adding a limit is not a performance optimization tool. The performance optimization tool is order by limit
Starting point is 00:44:32 and a proper index, like, together. And adding a limit, I should not expect quick execution. I just form expectations for myself and I can then say, okay, this is not what I want, and find a better index, which is sometimes not possible, for example, GIN versus B-tree and so
Starting point is 00:44:53 on. But by the way, to be fair, with GIN statistics already, Postgres is quite often right. So it knows when to use the B-tree and quickly find proper results. And it knows when it's better to find everything and order in memory. So it's not that bad, but also not ideal, not just a single index scan.
Starting point is 00:45:16 Yeah. Yeah. Cool. Okay. Okay. Thanks so much, Nikolai. Take care. Thank you.
Starting point is 00:45:22 Bye-bye.
