Postgres FM - LIMIT vs Performance
Episode Date: May 3, 2024

Nikolay and Michael discuss LIMIT in Postgres — what it does, how it can help with performance, and an interesting example where adding it can actually hurt performance(!)

Here are some links to things they mentioned:

LIMIT considered harmful in PostgreSQL (Twitter thread by Christophe Pettus) https://twitter.com/Xof/status/1413542818673577987
LIMIT and OFFSET (docs) https://www.postgresql.org/docs/current/queries-limit.html
No OFFSET (by Markus Winand) https://use-the-index-luke.com/no-offset
LIMIT clause (docs) https://www.postgresql.org/docs/current/sql-select.html#SQL-LIMIT

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard.
This is my co-host Nikolay, founder of Postgres.AI.
Hey Nikolay, what are we talking about today?
Hi Michael.
It's your proposal actually, Limit, right?
But before that, let's mention a little bit what our companies are doing,
because I think last time we did it 90 episodes ago.
Right.
Sure.
Yeah.
So, pgMustard.
By the way, I wanted to tell you yesterday, but forgot.
During some session, people suddenly asked me about pgMustard.
Cool.
And asked, can I recommend?
What did you say?
I said, yes, of course. And became shy.
I should recommend it before they ask, right?
It's good that they ask.
We discussed some plan
and they wanted,
obviously...
we discussed how important it is
to use buffers,
our last week's
topic.
If you haven't listened to it,
please do. I mean not you,
but our listeners.
Because buffers
are super important and there are many aspects. And they actually found pgMustard there, as a
good thing to visualize and suggest. It's not my idea; yes, that's the idea from my customer.
Yeah, yeah, yeah. So I personally think visualization is not the strongest pgMustard feature.
The strongest is the list of recommendations based on heuristics,
which is quite a good thing.
And I know you improve it all the time, for many years already.
So it's great.
Yeah.
And what about me?
My company is currently creating a bot.
Check out our blog post, my blog post,
a few months old,
where I described how we built it.
And currently we are rebuilding it fully.
It will be interesting;
the release will be soon. Yeah, it's
quite interesting. We are
bringing it to the web
from Slack. I think it will live in Slack still,
but it will also be
on the web. It's just
easier to use it there.
And we started using different
models, including Gemini 1.5
with a 1 million token
context window.
So, yeah, we changed our
approach right now, and it's kind of
interesting.
It already knows more than 1 million documents about Postgres.
It's huge.
So, yeah, that's probably enough about Postgres.AI and pgMustard.
And let's talk about Limit.
Nice, yeah.
That's cool to hear.
Maybe we'll hear more about that another time,
about the differences between the models.
Yes, so I suggested this topic.
And it came out of a tweet a long time ago that I found myself recommending over and over again to people, because it was a phenomenon I saw that most
people didn't understand, or didn't intuitively understand, including myself. So this was a
tweet from Christophe Pettus, and the headline was "LIMIT considered harmful",
or something along those lines.
And it was a really good
kind of initial tweet for a thread.
I was like, what?
Limit's good.
There's so many good things about limit.
It's really helpful
for so many performance reasons.
What's harmful about it?
And he goes on to describe
one specific case
where it can cause plan flips
and really, really slow queries in cases where
you'd expect it to be helpful in performance. So that was the trigger for this topic, but it might
be nice to discuss all the positives of using limit, or at least when we should be using it,
whether we should be using it, that kind of thing. Yeah, well, first of all, I guess we will be fighting a little bit
in this episode, because we have different opinions. Let's start. I mean, I cannot be on the side
of Christophe Pettus in that tweet. I think there is huge potential here, because limit is what
every one of us is using all the time, right? Limit is everywhere.
So if there are some cases
when it's harmful,
it's interesting.
But my regular approach is
if you don't limit,
you're in danger.
Limitless queries are harmful
because you don't know...
If you tested...
you know, my favorite topic,
testing and experiments
and how to test, and the idea that if you test on small databases, you miss things.
It's bad testing.
It's an anti-pattern,
unless you know what you're doing and do it before you test on full-size databases.
So this is exactly the case when it's not even about performance. It's just that if you test on a small database, small tables,
then some queries showing some results on a page,
they don't have a limit.
And we all know we usually start with that.
If we prototype something, first we do it like,
forget about pagination, let's show everything here.
But then you test, it works well; like, 1,000 rows on one page is not a problem, probably,
depending on how heavy they are in markup.
But then the project grows and some users, for example, have 10,000 posts or, I don't know,
comments, and you show all of them on one page and you have problems.
And this is a postponed problem if you start a new
project, right? So limitless queries can be harmful quite easily, and this is a straightforward idea.
If you limit, you know, at least you tell yourself what to expect. Okay, I expect 15 rows,
for example. Okay, we know it's good. We know how many buffers we should expect in a very
efficient query: 15 times the width of a row, plus some internal buffer operations,
continuing the topic we had last week. And that's it. If you don't have a limit,
you don't know what to expect. Maybe it's a billion rows.
Yeah. So I agree.
And I think pagination is quite a good place to start
because when people think,
like if you look it up,
limit in the docs, for example,
you're also going to learn about offset.
And it's a really common concept.
And more offset.
Well, yeah.
So I'm a big fan of,
I'm going to say friend of the podcast,
Markus Winand, previous guest.
Yeah. I've always wanted to say friend of the podcast.
So maybe I finally got to say it.
It was a great episode.
Thank you.
And yeah, it was, so he has a page on his website called No Offset.
And I just, I just used the No Offset hashtag on Twitter.
I use it all the time, because when we discuss... when somebody mentions
offset,
and this is not an expert who is using
offset on purpose,
knowing why,
for example, index-only scans before you
fetch... when you deal with a regular index
scan, there you probably want
offset sometimes, but very rarely,
actually. So I always use this hashtag, no offset.
It's good.
I'm sure most people listening know exactly what these things do,
but just for clarity, when we use limit,
we might add like limit 10 to the end of our query,
limit 20, limit some number,
and that will, as the name suggests, limit the result
set to that many.
Limit zero.
Do you use limit zero?
No, but I've seen quite a lot of offset zero.
I use limit zero
in two cases. When I,
for example, create a table, I select
specific columns limit zero.
For example, sometimes just
bootstrap some table.
I don't want to repeat all data types.
Sometimes it's like you're dealing with CSV or something.
And also when I do tests, for example for the presence of a column:
you can just select that column from the table with limit zero,
not caring about anything else.
And yeah, kind of assert.
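A quick sketch of those two LIMIT 0 tricks; the table and column names here are made up for illustration:

  -- Bootstrap an empty table with the same column names and types,
  -- without spelling out all the data types again:
  CREATE TABLE orders_copy AS
    SELECT id, user_id, created_at
    FROM orders
    LIMIT 0;

  -- Cheap assertion that a column exists: this errors out if the column
  -- is missing, and returns zero rows otherwise.
  SELECT created_at FROM orders LIMIT 0;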
Yeah, I've seen that really cool use case.
I've seen offset zero used a few times.
So limit limits the first n results.
I've never thought of using zero for that.
It's cool.
Offset will jump you forward that many before you start limiting.
So you could offset:
take the first 10, throw those away, and then limit will give you the next 10.
So if you do limit 10, offset 10,
you get the second set of 10.
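For example, a minimal illustration of those semantics, with a hypothetical table name:

  -- First page: rows 1-10 of the sorted result.
  SELECT * FROM events ORDER BY id LIMIT 10;

  -- Second page: rows 11-20. The first 10 rows are still read, then discarded.
  SELECT * FROM events ORDER BY id LIMIT 10 OFFSET 10;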
Yeah, and offset zero:
it's either somehow generated, right,
because it was offset N where N happened to be zero.
Or it's also some performance trick, some hack, right?
Yes.
I already forgot...
Yeah, I think it is, or was, an optimization fence, so it can trip the
planner into not doing certain optimizations, which I guess we'll get onto a little bit. But
that trick... or was it mitigated? Well, something in the docs made me think it might have been.
I think it still works, because I remember reading it in a hackers thread,
or it was in a thread somewhere in the mailing list.
Yeah, it may have been used so many times, but it was so long ago.
That's why I'm wondering if it's still current.
So I read a line in the docs in preparation that
the offset zero is the same
as omitting the offset clause, but I
don't think that's quite true because of
the planner implications. But in terms of the results,
that's true. In the results...
Semantically, it's true, but
physical execution might be
affected with presence
of offset zero.
Yeah. So the reason I brought this up was because we were talking about
pagination, right?
Both common ways of paginating use limit.
One is the kind of crude way of doing it,
and a lot of ORMs, a lot of tools, implemented it this way,
at least initially: for the first set of, let's say, 20
rows that you want, just add limit 20; for the next set they just did offset 20 limit 20, and then the next
one offset 40 limit 20. And the problem with that is that performance degrades linearly as you
go, and if you want to get to page 20 or page 100, that can be a very inefficient query.
And Marcus wrote a great post about why we should instead be doing
key set pagination for predictable performance and efficiency.
The explanation is simple.
It's just because internally it fetches everything up to that point
and discards it.
It's a lot of inefficiency,
and it grows with the depth of your pagination, right?
Yeah.
So what we were talking about in terms of good pagination
is: where ID is above,
or where some parameter is greater than, the last result you fetched,
and then still limiting by 20 or 21,
depending on exactly how you
want to do it, something like that.
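A minimal sketch of that keyset (no-offset) approach, with made-up table and column names and $1/$2 standing for the sort keys of the previous page's last row:

  -- Keyset pagination: continue strictly after the last row already shown.
  -- The cost stays flat no matter how deep you page.
  SELECT *
  FROM events
  WHERE (created_at, id) < ($1, $2)
  ORDER BY created_at DESC, id DESC
  LIMIT 20;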
So limit one can be a problem, or limit 10 can be a problem,
right? Let's talk about this problem, maybe. Because I hope I already explained why limit
is good in terms of performance: when you limit, you understand yourself what to expect, and you know what is efficient and what's not if you look at
buffer operation numbers.
I also think there are a couple of cases, like even simpler cases. Chances are you don't want everything. Like, if your query could return a hundred thousand rows, do
you really want all of them? You might only be
interested in the top 10, or maybe your user might only care about the most recent 10, and returning
all 100,000, even if the query is quite fast on the server side... sending all that data over the
wire is inefficient. So I think there's a kind of almost more fundamental
reason that limit is good: for returning a subset of rows, exactly what it's designed for.
It's close to our topic
about delete and batching, by the way. Someone asked in the YouTube comments, how about updates? And I
explained it's very similar, but with additional things. So if you don't batch and you send the whole
result set in one shot,
it's not only a lot of memory and so on,
but resiliency, reliability:
what if it fails at the end of processing?
You lose everything, right?
And the retry will be huge again.
So it's better to go in smaller steps with retry logic.
And limit, of course, is
the key tool to
split into batches. So pagination,
batching: we need it.
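A common shape for that kind of batching, sketched with hypothetical names. Postgres has no DELETE ... LIMIT, so the LIMIT goes into a subquery, and the application repeats the statement (with retry logic) until it affects zero rows:

  -- Delete old rows in small batches instead of one huge statement.
  DELETE FROM events
  WHERE id IN (
    SELECT id
    FROM events
    WHERE created_at < now() - interval '1 year'
    ORDER BY id
    LIMIT 1000
  );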
And also, in general, common
sense, if I limit again,
I know how much I want.
This is my request. I want
15, for example, or 2,500.
And if you scan
many more parts of the database
than these 15 records, or 2,500 records,
plus additional internal pages, like index pages and so on,
then I'm not considering this efficient work, right?
So it sets a simple expectation and a metric
for comparing efficient versus inefficient queries, right?
Yeah, I think that makes tons of sense. Like in application code, for example,
you are suggesting that you're kind of communicating with future developers what
the expectation is here. I think this is where we're going to start to differ in opinion, though,
because I think this is exactly...
Let's say, on an ad hoc query: I've sat next to people working on production systems who were quite scared, and I saw them do this and didn't think it was a problem; actually, I was learning from them at the time.
They were only expecting a single row back, and they added a limit one at the end of the query.
For example, primary key lookup, right?
Yeah, I think this might have even been like...
I can't remember exactly what they were doing,
but they definitely added...
Primary key or unique key.
It cannot hurt because they are...
It's like if you don't trust Postgres unique keys,
you add limit one just as an assert, basically.
You think, okay, I must
check; it should fail or something if it returns more, something like that. Yeah, I
actually can't remember exactly why. All I remembered was, oh, that's quite a clever
way of actually not having a runaway... One second, I remember: I did a different thing. I did limit two on purpose.
So if something goes wrong and I have two rows,
I don't remember where,
but somewhere I should have error or exception,
catch it and process.
So limit two was where it should be unique, so I expect one row,
but I add limit two on purpose
to check later that it's not two.
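A small sketch of that trick, with hypothetical names: the query expects one row, asks for two, and the application treats two returned rows as a violated assumption.

  -- Expecting exactly one row per email. LIMIT 2 caps the work while still
  -- letting the application detect duplicates (two rows returned = error).
  SELECT id, email
  FROM users
  WHERE email = 'someone@example.com'
  LIMIT 2;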
Yeah, I like it. But yeah, so anyway, in my opinion, limit... I would add it as a must for everything.
I think the fear in this case was a runaway sequential scan across a huge table that might
start to eat resources and cause...
I think this was a very, very on-fire
server.
maybe I would just even
consider this as a rule, mandatory
like limit everywhere and the only thing
that we cannot limit is, for example, select
count star from something.
This basically
deals with all rows, returns just
one number,
and we cannot limit there, unfortunately.
There, okay.
Well, I mean, you can add limit.
You're just still going to get the count.
Yeah, yeah, yeah.
Limit will be useless in this case.
Yeah, definitely.
The interesting thing, though,
and I was looking back at,
just for fun,
could I find a version of the Postgres docs
that didn't have limit in it?
I'm sure it's as old as time, but it was in an old version of the doc set. I got back to 7.1, because that's
the oldest one online, and there was... I got way more than I deserved looking back
at that, and it said: as of Postgres 7.0, the query optimizer takes LIMIT into account when generating a query plan. Which I thought might be a nice segue.
So before 7.0, there wouldn't have been any danger in adding limit to your query.
I disagree. I object. Mind your language. This language is wrong.
Any danger in adding limit. There is no danger in adding limit.
Let's already jump to the core of the topic.
There is no danger in adding limit.
There is danger in having bad statistics, or in forming expectations based on indexes which cannot support your order by, your way of filtering,
and your limit. Yes. By the way, when you describe this case,
focus on order by as well,
because without order by, this case is impossible.
It's not about limit.
It's about order by plus limit,
which forms expectations.
Like, I mean, gives planner the ability to use some index.
Right?
Yeah.
So let's talk about... well, for anybody thinking we've jumped ahead, I shared some
examples with Nikolay before the call, so that...
I saw this example before, like long ago. I saw it live, I just... yeah.
Well, that's good. So let's back up and try to explain
the problem, or the edge case, or whatever you want to call it. So exactly the same query, if you run it with and without limit some number,
can be dramatically slower with limit the number added
than it is without the limit at all, which I found really counterintuitive.
And yes, it relies on some specific cases,
but those specific things are not as rare in the real world
Yeah. So that person sitting in front of that console, hoping that limit some number is going
to make it safer and guarantee faster execution...
Same logic as expected, yes, I agree.
...can actually
shoot themselves in the foot and make things loads worse for themselves. And I actually think there might be other advice you could follow. Like, I'd much rather they
added additional where clauses, or other things, than an additional limit. So that's
potentially controversial; I'd love your take on it. But let's try to explain how this can happen. So the simplest case is to filter by a relatively rare condition and then order by
something else that you have indexed, so for example created_at. So if we think about,
let's say, a software-as-a-service business: maybe you have a customer that cancelled a few
years ago, and you want to look at, like, their most recent event, or something like that.
So you're looking at: where company = A, order by created_at descending, limit one, something like that.
Or limit 10.
Or limit 10, exactly. It can happen... the problem can happen not only with one;
it can be some small number.
Yeah, it could be a big number, or relatively small. The key is that
it's small enough to cause the flip.
Yeah. Well, Christophe made a good point, that I haven't tested yet, that
he thinks it's worse around the point of the actual number of rows. Like, there are some cases where, let's say,
that company had a thousand events: then limit one thousand would actually be the one that would
be problematic. So the problem comes from not knowing. Like, Postgres assumes things are
randomly distributed unless it knows better, unless it has statistics to suggest otherwise. And if it thinks that the table that contains these events is going to
have events from any company in any order, then, because it knows it only has to fetch
one or ten events, it now has this optimization where it might be quicker for it
to scan backwards through...
You missed a very important thing.
Let me add things.
So we talk about, for example,
in case you took some SaaS system, some users,
and users have some activities, let's say, like comments or,
I don't know, orders, anything.
So we have orders table, for example.
This is what you showed, if I'm not mistaken.
I remember. Something like that. So orders: we have user_id, we have created_at, we have the ID of the order. Enough, right?
And the key here is: we want to select all orders for this specific user in chronological order,
like order by created_at desc, so in reverse order, the newest first, the latest
first, right?
And then the key here is, let's not forget, our developers, or we as developers, don't
know about multicolumn indexes.
So we created an index on created_at, we created an index on user ID, and on ID.
Okay, let's make a simplifying assumption:
we like to index all columns.
Even worse,
when we do it, we always use only single-column indexes, right?
This is quite a classic example.
I can imagine it happens in the life of every engineer during
five years, for example, of development. Definitely, at least once it should happen,
I would say. And 99% of developers. It happened with me many times and with my products,
my databases, and also with others when I observed them. Actually, recently we had a similar case, maybe a few months ago, I remember.
So the key is, Postgres needs to choose: okay, I have a filter on user ID, and I have
order by created_at desc limit one or limit 10. And I have two indexes. So I have two opportunities
here to save in terms of I/O operations. Of course, Postgres should avoid
a sequential scan here. It can be one user of many, of millions, right? So it can be a huge table, and we
don't want to scan whole tables. So we need to choose only one index, or two of them and maybe
combine them with a bitmap scan and so on. But there are two opportunities here: filtering by user ID, and also order by created_at desc limit one or 10.
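A sketch of the setup being described, with illustrative names (not taken from the episode):

  -- Hypothetical schema: single-column indexes only, no multicolumn index.
  CREATE TABLE orders (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_id    bigint NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
  );
  CREATE INDEX orders_user_id_idx    ON orders (user_id);
  CREATE INDEX orders_created_at_idx ON orders (created_at);

  -- The query in question: the latest orders for one user.
  SELECT *
  FROM orders
  WHERE user_id = 123
  ORDER BY created_at DESC
  LIMIT 10;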
And the key here is order by.
Without order by, limit
will always be fast,
because order by is the key
to considering the index on created_at.
Yeah, we need both the equality check
and the order by to make it
a choice for the planner: it either has to filter on the equality condition with one of the
indexes, take the data, sort it, and return it in order, or it has to choose the index that's ordered
and take the first rows it reaches that match the equality condition.
It doesn't necessarily have to be an equality; it can be a between,
one or the other.
Sorry, it doesn't
always have to be an order by either, it could be
something different; it has to be two separate conditions,
and those have to...
They fight between each other, right?
Yes. How do they fight? They fight
based on costs, based on
statistics and what Planner expects
in terms of cost. And the
key also, like, important point
here is that planner has two costs.
Startup cost, which
can be very small. It means
first row can be returned very
fast. And also the full cost.
Full cost is the
most important:
the full cost is
what the planner uses to
make the final decision
choosing between a variety of plans,
right? Full cost.
So in our case, the index on
user ID: if we use just
it, it means the planner
thinks, okay, I will fetch
all rows for this user,
but then I need to order
them in memory. It means, yeah, the startup
cost cannot be close to zero, because the ordering and so on will take
some time, and only once it's done can we return all those rows. I assume, in this case,
the startup cost and the final full cost are very close to each other, but far from zero, probably, right? But the other option,
order by created_at desc limit one or 10:
this option means that we immediately use the created_at index.
We start fetching rows from there,
but we need to filter out irrelevant rows
owned by other users, right?
We only need our user ID. It means we start returning
rows quite quickly, but only if we find our user ID quite quickly. There is uncertainty here, but
for the planner, it's quite certain, because it has some expectations about the distribution.
You mentioned it, right? So probably the planner can think, okay, I will be lucky, because this user, maybe it's still super active, by the way, because we start from the very freshest rows.
The latest row will be considered first, and if our user ID is there, it means we already found one row in the first step, right? And then we go, go, go. But if this user became super inactive last year, for example,
in reality, we need to scan the whole year
to reach the first row of that user to return.
And here is exactly where the planner can be very wrong.
I mean, it simply doesn't know.
It doesn't know.
There is no correlation in the statistics
between user IDs and created_at.
And also, the planner doesn't know
who is recently active and who is not.
The planner doesn't know.
So it can be very hard for it to figure out.
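Roughly, these are the two plan shapes the planner is choosing between; an illustrative sketch using the hypothetical names from above, not actual output from the episode:

  EXPLAIN
  SELECT * FROM orders
  WHERE user_id = 123
  ORDER BY created_at DESC
  LIMIT 10;

  -- Depending on statistics, the plan is either a Limit over
  --   Index Scan Backward using orders_created_at_idx ... Filter: (user_id = 123)
  -- (tiny startup cost, but total work depends on how soon this user's rows appear),
  -- or a Limit over a top-N Sort fed by
  --   Index Scan using orders_user_id_idx ... Index Cond: (user_id = 123)
  -- (reads all of the user's rows first, then sorts and cuts to 10).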
Yeah.
So, there are multiple...
So, I completely agree,
and you brought up multi-column indexes. In the case that I shared with you and the case we just discussed, a multi-column index on the user ID, or whatever it was, and then created_at, ordered, would make this query efficient and predictable all the time.
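For this shape of query, that would be something like the following, again with the illustrative names used above:

  -- One multicolumn index supports both the filter and the ordering,
  -- so the LIMIT can stop after the first matching rows.
  -- (A plain (user_id, created_at) index also works, scanned backwards.)
  CREATE INDEX orders_user_id_created_at_idx
    ON orders (user_id, created_at DESC);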
And so if this was an application...
For me, it's not that important; I'm seeing a lack of an index here. Okay, yes.
Yes, yes. But if this was an application query, and this was showing up in pg_stat_statements as a
problem, and we needed to optimize it, and it was one of our biggest concerns, and we didn't mind about
the overhead of adding another index: great, we can add this index, remove the one on user ID,
whatever we had, and it would be fast all the time. But I was talking about a case where
we're sitting in front of a console
on production, trying to
stop ourselves from running
a large query or a slow query
and adding that limit
even though we didn't think it would make a difference, even though
we think we're only going to get one or two rows returned.
If we add it,
we're taking a risk that it could actually
make things worse, because we don't...
Like, if we don't check... and this is actually maybe what I would recommend, I don't know about you,
but if you add explain before the query and just check what index it's going to use: does it make
sense? Like, if we saw it was going to do a backward scan on created_at, maybe that would be a sign
that it's a bad idea. The tricky part is that this is an issue
because of a cost underestimation, so it's always going to look like a cheap query in terms of cost.
The reason it's being chosen, if it's a bad plan, if it's a slow plan, is because the planner is
thinking it will be cheap; it's because it thinks it can abort early. So costs won't give you a clue
here. You'd have to be attuned to which index it is scanning,
and is that enough?
Is that a good idea for this query?
So I'm talking about production cases,
like where you're doing ad hoc queries,
not so much application queries where you should optimize.
Yeah, for example, we need to return our table
and allow users to order by all columns in both directions.
It's a nightmare to support in terms of performance.
Yeah.
And if you have many of such tables, or maybe you allow users to create tables, any tables,
and you want to hide this performance optimization from them,
some builder, a website builder
or mobile app builder, and you can say, okay, users, developers, your users are developers
of a website or mobile app.
And you say, I allow you to create tables.
I will take care of performance.
In tables, you're allowed to create columns.
I will take care of performance.
And you can put some element on your UI for your users, millions of users,
and you can allow them to order by any column in any direction.
It's a super nightmare.
Absolutely nightmare to support because obviously you want to create index
on every column and also you think, okay, I'm smart.
I know about keyset pagination. You add keyset
pagination, you then allow all this... This is exactly where you can encounter it, right? Because of this
limit. So I wanted to emphasize again: as I described, one index has a very low startup cost,
the created_at index. And why is it low? Because Postgres doesn't know, for an arbitrary user.
We probably want this query to work not for one user,
but for an arbitrary user, right?
Right.
So in this case, if Postgres doesn't know,
for this particular user, how fresh the activity or the orders are,
it assumes something.
It can be slightly more, slightly less, it depends,
but it assumes something.
10 rows, 10 times more.
And this is our, like, but once we found,
we don't need to do anything.
If we go with the traverse created at index
and just filter out irrelevant other users, right?
We just need to find our users.
Once we found our n1 or n10 rows, we are happy.
This is our final cost.
And then, if we don't use limit, our final cost is our expectation for this user.
These statistics are probably present, unless we don't have statistics
at all. Postgres expects, for this user ID, some number of rows, and this forms our
total cost. And probably, if it expects a lot of rows, the total cost will be high and it won't start
using this approach. In this case, it switches to the more reliable approach: I will fetch all the rows and do an in-memory sort.
So the key here, if we limit...
this is probably why Christophe mentioned
that the boundary is this limit:
it's not the actual number of rows,
but the statistics,
Postgres's expectation of the number of rows for this user ID, exactly.
So when our limit goes down below this number,
our traversal with filtering out irrelevant users becomes more attractive,
because the total cost starts looking good there.
But it's a trick, right? Because the problem here... like, it's a trick.
And, of course, it's a super interesting phenomenon.
I would say this is a phenomenon.
It's not a harmful situation because there are many cases when limit, like, doesn't help.
For example, we don't have indexes at all.
We say, well, no...
No, that can really help.
It can help, yeah, because of the filtering out, it can help.
It's easy to imagine when limit is
useless.
Actually, that's another time
I use limit.
So the docs are quite...
In the general case,
it improves performance, because you will stop earlier.
But in an edge case,
when our record
is the latest in the scan,
it doesn't help.
Here we deal with a similar situation.
In general, it's good,
but we have some edge case
which not only doesn't help,
but decreases performance.
But it's an edge case, and it's a phenomenon.
It's not a harmful limit, especially because you cannot do this without order by,
and especially because the problem here is not limit,
but the lack of a proper index, so to speak.
And again, we're talking about a one-off query
rather than an ongoing optimization challenge.
I'm talking about it's just an interesting phenomenon that,
and I don't think most people expect this,
that by adding limit, you can make the query slower.
And that's true.
It can be a lot slower.
So I just don't want people to use that as their safety mechanism.
When you understand
what's happening inside, it's clear
why adding limit goes wrong here.
It's clear. You just
force the planner
to choose the wrong index here.
Right? That's it.
But of course, for people who don't
dive inside the execution
and planning process,
they, of course, are very surprised
and think limit is bad.
But remove order by,
and you won't see this effect, right?
But the docs,
and this is what I wanted to say,
the docs quite often
when you're reading about limit
will say it's really sensible
to use order by with limit
because otherwise you're getting things
in an unpredictable order...
Again, one second.
Again, like, if we talk about an arbitrary user ID: what if this user ID is super active and created
a thousand rows, a thousand orders, today? In this case, the planner was right. So that's
why I say this is not even an edge case, it's a corner case. We have several things in the corner here.
And observing very strange effect,
phenomenon, of course, it makes sense.
But create proper indexes, that's it.
But at least let's correct
this harmful
title. Let's say
limit one with order by
considered harmful.
Order by limit.
It's not just order by, though.
It could be another condition that you add.
Lack of multi-column indexes,
order by limit considered harmful.
This is the full truth.
This is the truth, actually, here.
Not just limit.
I was hoping you'd understand my point of view on this.
It feels like you haven't.
I do think that in a lot of
real-world cases, people don't have perfect stats, or they have data that's strongly correlated
in certain ways, for example user ID and account ID being extremely tightly coupled, you
know, the age-old city and country codes. And Postgres assumes that all of these stats are uncorrelated by default.
And if you want, you can create statistics for these things.
But in a lot of cases, we have a lot of skewed distributions in databases that we haven't created extra statistics for.
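For the correlated-columns case, extended statistics are the tool being referred to here; a sketch with hypothetical table and column names:

  -- Tell the planner these columns are functionally dependent,
  -- instead of letting it assume they are independent.
  CREATE STATISTICS orders_user_account_stats (dependencies)
    ON user_id, account_id FROM orders;
  ANALYZE orders;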
And most of the time, the planner is great and can really work things out nicely.
But by having this limit optimization,
it can flip plans that we're not expecting to flip. And the reason I found it so powerful
is we're using it exactly in the cases
where we're most scared of doing a big scan on production.
If that's what you're really scared of doing,
this is something that can actually cause exactly that.
Michael, why can't you agree with this whole
concept, that it's harmful? Because imagine
we don't have an index
on user ID at all.
In this case,
created_at
shows bad performance, because
the user
became inactive last year.
We need to scan the whole year.
But at least we have some approach to scan, right?
It adds some order of scanning.
Without that, we only have sequential scan, right?
That's it.
So the problem is we don't have proper index.
Now, okay, it turns out we have index on user ID.
It happened to be helping,
but what if some user has 50% of all the records,
millions of records?
Bear with me.
This is why I'm suggesting, instead of running the query on production
with a limit one or something added to make it feel safe...
instead of adding limit one, add just one word, explain, in front of the query,
just one word, explain in front of the query,
not explain analyze, not explain analyze buffers,
just explain your query without the limit
and see what the query plan is before you run it.
If you're scared of what it's going to do in production,
do that instead of adding the limit
and you'll see what indexes it's using, whether you're missing one.
But I cannot not add limit because I don't want my page to blow up
because of many, many rows returned.
Well, add limit one and put explain in front just to check what it's going to do.
Because that's what's really going to check.
That's what's going to actually keep you safe.
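A sketch of that safety check, reusing the hypothetical names from earlier:

  -- Plain EXPLAIN plans the query without executing it, so it is cheap to run.
  -- Compare the plan with and without the LIMIT you intend to add:
  EXPLAIN SELECT * FROM orders WHERE user_id = 123 ORDER BY created_at DESC;
  EXPLAIN SELECT * FROM orders WHERE user_id = 123 ORDER BY created_at DESC LIMIT 1;

  -- If the LIMIT flips the plan to a backward scan of the created_at index
  -- with a filter on user_id, that's the risky case discussed here.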
Yeah, you're checking what you're actually scared of
instead of assuming you're going to be safe.
It's like, I understand your point that it's almost never going to be a problem,
but imagine if a safety on a gun, for example,
like, almost always kept you safe.
It's like a safety feature that's almost always good,
but, like, sometimes it's going to actually shoot you when you press that safety catch, not when you pull the trigger, but
when you press the catch. Again, in my world, I add limit one to limit things. If it doesn't
work, it means I don't have an index. If there is some index which would help me when I don't have a limit, I don't care, because I do need limit one anyway.
And maybe you'll hit some statement timeout
or you'll cancel your query quickly
or there will be some other safety mechanism in place
that will stop you from hurting yourself.
But still, if my limited query's behavior is not good,
I'm looking at which index will help.
And I'm not considering the index on user ID as helpful here,
because I don't want to deal with a lot of records in memory
when I need only 15 records.
So that's why I jump straight to multi-column indexes here.
And as we discussed...
And I agree.
Yeah, yeah, yeah.
I agree that to optimize this query,
but we're not talking about optimizing the query.
We're talking about should you use limit as a safety feature when you don't need it?
Why might I not?
Let's say you know there's only going to be one row.
Should you add limit one?
How do you know it's only one row?
Unique index?
If you have unique index, this problem won't happen.
Let's say a company could set up multiple things,
but they generally set up one.
Like how many websites are you going to have for each organization?
You don't have a unique index on it.
People can add a second.
Okay.
But almost everybody else...
Or emails, for example, right?
Yeah, email addresses.
Yeah.
It's probably going to be less than 10.
Like, you don't have any...
Maybe you're not certain it's going to be exactly one,
but you're pretty sure it's going to be one.
But am I having up-to-date stats or not?
Because the stats will
form an expectation for Postgres:
if it's a low number of rows,
it will prefer...
I think I'm struggling
because it's a simplified example,
but if you've got lots of joins going on,
there could be all sorts of other complexities.
Joins are tricky, by the way.
The same problem becomes much more tricky,
but I don't think limit is a problem.
I think the problem is a lack of proper
indexes, and sometimes you cannot create them.
Imagine user ID is
present in one table, but created_at
is present in a different table.
So are you saying you've never sat in front
of a database where you don't know
all of the indexes in the system?
You don't know the state of stats and wanted to run
a query and not been scared that it might take a long time.
Well, when I do this, I add limit.
I understand that I might lack a proper index that would help.
I might be dealing with a sequential scan
or the wrong index being scanned almost fully, right?
I understand that.
Yeah.
If you don't understand that, that's strange,
right? Well, but then... so this is kind of my takeaway: instead of adding limit one and
hoping that I'm okay...
Let me give you an example I saw before, related to what we're discussing today. So we
have full text search, for example, and we have rare words and very frequent words present in, like, the majority of records in the tsvector.
Yeah, that's a nice example.
Right.
And then we have created that or ID and we want to order by.
We want to find all matching documents and return only 15 latest, for example. In this case, Postgres need to choose, must choose between ordering by B3 and then applying
a filter to add signs, right, TS vector, add TS query, or it might decide, okay, I need
to choose, I need to extract all documents and then order by memory and then limit.
Same problem, completely same problem.
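A sketch of that situation, with hypothetical table, column, and index names:

  -- Assumed setup: a GIN index for full text search and a B-tree on the timestamp.
  CREATE INDEX docs_fts_idx        ON docs USING gin (tsv);
  CREATE INDEX docs_created_at_idx ON docs (created_at);

  -- "The 15 latest matching documents": the planner must pick between the GIN index
  -- (find all matches, then sort in memory and limit) and walking the created_at
  -- B-tree backwards while filtering each row against the tsquery.
  SELECT id, created_at
  FROM docs
  WHERE tsv @@ to_tsquery('english', 'some & words')
  ORDER BY created_at DESC
  LIMIT 15;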
And I saw terrible cases when I thought, wow, why do you do this?
Why do you think this word is super frequent?
So instead of using GIN, you decide to use the B-tree to order
by, limit 15,
for example, and then
you end up scanning
almost everything, finding
my one or two records only
at the end of the scan. It's absolutely the same
situation, but
the problem to me is not limit there
at all. I put a limit,
I see the problem, I understand limit didn't
help, and I understand why: because we need to combine both things in one single index, and if we have
only one single index scan, unfortunately, this is impossible there, right, in this particular case,
because for that you need to use a RUM index, and it can be unavailable; on RDS, for example, it's not present. Or you
work out something like a middle ground: take way more rows than you want and then sort
them, that kind of thing, right? So, I mean, we sit on the edge here. We might be lucky,
but we are not. And what else can I say? It's not a problem of limit, it's a problem of
a lack of an index, I think. Let's try to wrap up. And I think I agree with you that to solve the
performance issue you don't get rid of the limit: you add a good index, or you sort out your stats, like
you work out your stats correlation, maybe create some statistics there, maybe you just need
to run analyze on that table, or maybe
globally. There are lots of ways of solving it so that that query doesn't run slowly again in
the future. I just wanted to publicize and share that. And by the way, I wanted to shout out Franck Pachot,
who got in touch recently when I shared this Twitter thread again, to say, I think we agreed that it's an optimization problem.
And yeah, he would completely be agreeing with you right now.
I just found it really interesting and quite counterintuitive,
probably more than you did: that just by adding it, you could make a query slower.
Right.
Even though that's the sign of other issues.
So I wanted to talk about that.
It's an interesting phenomenon, for sure.
Adding a limit is not a performance
optimization tool. The performance
optimization tool is order by plus limit
and a proper index, together.
And adding a limit,
I should
not expect
quick execution. I just
form expectations for myself and I can
then say, okay, this is not what
I want, and find a better index, which is sometimes not possible, for example GIN versus B-tree and so
on. But by the way, to be fair, with GIN statistics already, Postgres is quite often right. So it
knows when to use the B-tree and quickly find the proper results.
And it knows when it's better
to find everything
and order in memory.
So it's not that bad,
but also not ideal,
not just a single index scan.
Yeah.
Yeah.
Cool.
Okay.
Okay.
Thanks so much, Nikolay.
Take care.
Thank you.
Bye-bye.