Postgres FM - Performance cliffs
Episode Date: April 4, 2025

Nikolay and Michael are joined by Tomas Vondra to discuss single query performance cliffs — what they are, why they happen, some things we can do to make them less likely or less severe, and some potential improvements to Postgres that could help.

Here are some links to things they mentioned:

Tomas Vondra https://postgres.fm/people/tomas-vondra
Where do performance cliffs come from? (Talk by Tomas) https://www.youtube.com/watch?v=UzdAelm-QSY
Where do performance cliffs come from? (Slides) https://vondra.me/pdf/performance-cliffs-posette-2024.pdf
Increase the number of fast-path lock slots (committed for Postgres 18) https://www.postgresql.org/message-id/flat/E1ss4gX-000IvX-63%40gemulon.postgresql.org
San Francisco Bay Area Postgres meet-up with Tomas on 8th April (online) https://www.meetup.com/postgresql-1/events/306484787
Our episode on Extended Statistics https://postgres.fm/episodes/extended-statistics
Logging plan of the currently running query (proposed patch by Rafael Thofehrn Castro and Atsushi Torikoshi) https://commitfest.postgresql.org/patch/5330
Our episode with Peter Geoghegan on Skip Scan https://postgres.fm/episodes/skip-scan
Index Prefetching patch that Tomas is collaborating with Peter Geoghegan on https://commitfest.postgresql.org/patch/4351
A generalized join algorithm, G-Join (paper by Goetz Graefe) https://dl.gi.de/server/api/core/bitstreams/ce8e3fab-0bac-45fc-a6d4-66edaa52d574/content
Smooth Scan: Robust Access Path Selection without Cardinality Estimation (paper by R. Borovica, S. Idreos, A. Ailamaki, M. Zukowski, C. Fraser) https://stratos.seas.harvard.edu/sites/g/files/omnuum4611/files/stratos/files/smoothscan.pdf
Just-in-Time Compilation (JIT) https://www.postgresql.org/docs/current/jit.html
Notes from a pgconf.dev unconference session in 2024 about JIT (discusses issues) https://wiki.postgresql.org/wiki/PGConf.dev_2024_Developer_Unconference#JIT_compilation
Implementing an alternative JIT provider for PostgreSQL (by Xing Guo) https://higuoxing.com/archives/implementing-jit-provider-for-pgsql
Tomas’ Office Hours https://vondra.me/posts/office-hours-experiment

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM,
a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard,
and as usual, I'm joined by my co-host, Nikolay,
founder of Postgres.ai.
Hey, Nikolay.
Hi, Michael.
Hello, and today we are excited to have a special guest,
Tomas Vondra, who is a PostgreSQL major contributor
and committer, now working at Microsoft
on their Postgres team.
Welcome to the show, Tomas.
Hello, everyone.
Right, so a few months ago, you gave an excellent talk about where performance cliffs come from,
and we have been excited to talk to you about this. So perhaps as a starting point, could
you let us know a little bit like how you got interested in this as a topic or why you
chose to talk about it there?
So I think I started thinking about performance cliffs
when I actually started working on Postgres
or started using Postgres.
Because performance cliffs are something people often
run into and have to investigate changes,
sudden changes in behavior after small changes in the query.
Like a single additional row in the query could kind of
like flip the behavior in a way that is completely like unexpected and unpredictable. And it's
causing problems for like robustness and so on. And I actually started with Postgres as
like a user or like a developer using Postgres. And one of the things I was responsible for was investigating performance problems.
So that's how I actually came to work on performance cliffs.
But after I started working on development of Postgres, you also see the other side of
the performance cliffs, which is why it actually happens in practice,
because the database needs to do some decisions.
So it's also an engineering problem,
which is interesting for me as a developer committer.
I've been working on the planner some of the time,
and that's where some of the performance cliffs actually happen,
because of decisions made either during planning or then during the execution.
So I've been reading some papers, and that's actually where I learned that there even is,
like, a name for this, which is the performance cliff.
And there is a whole area of research for, like, making robust decisions, where the performance cliffs
do not happen at all or not as often. So that's why I'm interested in this.
Nice. And yeah, I really like the term performance cliff. It's so intuitive in that,
from the example you were talking about,
kind of adding one additional row.
So if we go back a few steps,
if we're adding one row incrementally up until the cliff,
we might get a certain gradient,
might even be relatively horizontal.
And then all of a sudden there's a massive spike
and that's like the cliff.
I'm probably thinking of it as a spike,
but maybe a cliff would be a drop.
So yeah, I really like that as a phrase. You mentioned kind of planner versus execution time cliffs. And I guess we're talking about individual execution more than kind of system level cliffs,
mostly here. What, from your experience as a user, or maybe since you've been looking at trying to prioritize which things to work on, where do you see the majority of
cliffs coming from: plan time, execution time, some mix?
So that's a good question. I don't know. Because I am inherently biased, right?
The fact that a lot of performance cliffs,
which, for example, happen during planning,
means that you can't actually see a lot of cliffs
that happen later, right?
So I don't have very good data to explain that,
or to say where most of the cliffs happen.
But I would say that the sooner the decision needs to be made in query planning and query execution,
the less information you actually have
to make a good, robust decision.
And the more likely it is that it will be wrong, right?
Because for example, we do planning based on statistics and those are like inherently
inaccurate, right?
There are a lot of details that we remove from the statistics to keep them small.
And therefore, those are the things that can actually cause problems. And then there is, of course, the thing that even the cost model itself is extremely simplified.
So I would say that the main problem is that we are making a lot of decisions early in
the query execution. And because of that, we are kind of like fixed in a certain
execution path. And that may not be the optimal one, right? So that's where I think the performance
cliffs come from.
I wanted to ask, like, there's also a challenge, not just about planner versus executor, but about
their behavior in a single session, when there is no multi-user
situation, versus various contention situations, when we have many, many sessions
working in parallel and all of them have some issue. For
example, the infamous lightweight lock manager issue happens when, during planning, the
planner needs to lock all the indexes for the table, and if we exceed 16,
fast-path locking becomes not available for some relations. And we cannot see it when we're just checking the single query,
but it becomes a performance
cliff immediately when we have
thousands of transactions per second doing the same thing,
and planning becomes a problem, right?
I'm curious how you think this should be understood,
like with what tools, and earlier,
not when you have already fallen from that cliff.
But before, and this is the trickiest part, how to diagnose cliffs which are going to happen
soon in our system.
I understand the example that you've been describing,
but I think that's a slightly different performance cliff,
and not what I usually mean by the term.
Because you are talking about resource exhaustion.
You have a resource which is like a lock queue or
the buffer for the fast-path locks,
which was increased or like made larger in Postgres 18.
And it's configurable now, right? I think it's maybe configurable.
Yes. It's tied to the expected number of locks per transaction. So it is configurable in this way, yes. So it's not fixed size like before.
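As a small aside for readers who want to see this in practice, here is a hedged sketch: pg_locks really does expose a fastpath column, but the workload that exhausts the per-backend slots (for example, a query over a heavily partitioned table) is up to you to supply.

```sql
-- Count how many locks each backend took via the fast path versus the shared
-- lock table (the slower, contended path that kicks in once the slots run out).
SELECT pid, fastpath, count(*)
FROM pg_locks
WHERE pid <> pg_backend_pid()   -- ignore this monitoring session itself
GROUP BY pid, fastpath
ORDER BY pid, fastpath;
```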
But I think what I mean by performance cliffs, and what I describe in my talk, is mostly performance
cliffs that happen because of, like, a binary decision somewhere in the processing, right? And the
locking bottleneck is not that, right? I mean, like, yes, you do have the case where you
can either fit the fast-path lock or not, right? And if not, you have to do something
much more expensive. That's a problem, but it's not the thing that I had
in mind when thinking about performance cliffs.
Now I remember when I was checking your talk, this is exactly what I thought, like we have
different views on this term. And I grabbed this term from various blog posts where people talked about multi-exact SLRU
buffer problem, like it was also a cliff.
Sub-transactions, also calling this performance cliff, like saying sub-transactions are cursed,
don't use them at all because there is a cliff that is hidden there and so on.
So this definition is very different from yours.
I understand.
Yeah, I agree with you.
Yeah, I don't feel like I'm the authority to define what exactly that means. I'm trying to
more explain like what I've been describing, like the cases that I've been discussing in my talk.
Because also the reasons why that happens are slightly different, right? For the performance cliffs, as I understand them
or as I discuss them,
it's more like an engineering question
because you have multiple algorithms
that you can choose from.
Each of those algorithms is kind of optimal,
best suited for certain type of conditions
or selectivity ranges or something like that.
And therefore, we make the decision based on the assumption that you know the correct
answer.
And I think the bottlenecks that you've been describing are more about, say, changes in
the hardware available, right? Because years ago, when we initially sized the number
of fast-path locks, I think like 16 was probably fine, right?
It's like, it was like reasonable, right?
Like you probably didn't have that many tables
or like connections and it wasn't such a problem.
But nowadays we have many more CPUs,
many more cores available usually.
And we also have things like partitioning
which can completely explode the number of things
that you need to lock.
So I think it's more like a consequence
of changes in Postgres that leads to hitting this bottleneck.
I'm not saying it's not a valid problem. It definitely is. We need to improve that.
But it's a slightly different thing.
Right. Yeah, I agree. Yeah.
The topic's already huge if we only limit it to single session, single query, performance
cliffs.
So if we think about that, just those, you talked about both reducing kind of the frequency
of these, so making them less likely to happen, but also reducing the severity of them.
So kind of accepting they will happen, make them less of a cliff or less of a drop when
they do.
Can you talk us through some things from a user perspective that we can do to reduce the severity
and or to reduce the likelihood of running into them?
That's a good question. I think it probably depends on each individual case of the performance cliff.
And in some cases, you probably can't do anything.
Right?
But it's just good to know that this is what's happening.
In the talk, which you can see on YouTube
(and I'm actually invited to give the talk at the San Francisco meetup in like a week),
the first example that I gave is about the number of elements in the
IN condition, right?
Where you have column, IN, and then an array of elements.
And there is like a flip, because at an exact point we switch from
one algorithm to the other, like from a linear search to a hash.
But that's hard-coded.
There's nothing the user can do about that.
The one thing you could do is to pick the right size of the list, right?
But that's not a very practical thing.
But I think in many cases, just knowing what's causing the problem
can be a very useful thing because you can say like,
okay, this is what's happening, I can't do anything about that.
I have quantified like the impact
and we just have to live with that for now.
I can imagine like patches doing something
like improving that by adjusting the point
where exactly that flips
by doing some better estimation.
But at this point, people can't do anything about that.
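To make the IN-list example concrete, here is a minimal sketch; the table, data, and list sizes are made up, and the exact element count at which the executor switches is hard-coded and version-dependent.

```sql
-- Hypothetical setup just to have something to query.
CREATE TABLE events (id int, status text);
INSERT INTO events SELECT g, (g % 100)::text FROM generate_series(1, 1000000) g;
ANALYZE events;

-- Short list: the IN condition is evaluated as a simple linear search of the array.
EXPLAIN (ANALYZE, COSTS OFF)
SELECT count(*) FROM events WHERE status IN ('1', '2', '3');

-- Longer list: past a hard-coded number of elements the executor switches to a
-- hashed lookup for the array. The plan shape looks the same; the flip shows up
-- as a sudden change in execution time rather than in the plan, which is exactly
-- why it surprises people.
EXPLAIN (ANALYZE, COSTS OFF)
SELECT count(*) FROM events
WHERE status IN ('1','2','3','4','5','6','7','8','9','10','11','12');
```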
For some of the cases in planning,
you can do the usual thing,
trying to convince the planner to do the right decision.
There are different ways to do that.
I don't know what's the best solution to that.
Things I really liked were... I think, proactively, people can adjust
some of the cost parameters.
some of the cost parameters.
Like we still have random page cost as four as the default,
even on most cloud providers.
So I feel like things like that,
when we've mostly got SSDs these days, just feel like
they're going to be more likely to push plan flips at the wrong times.
You know, some of these... if there are other cost signals that are massively out as
well (I'm not aware of any), that feels like a footgun, or something that if
people change proactively, or reduce proactively, they'd be less likely to encounter this kind of thing.
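For example, a hedged sketch of that tuning; 1.1 is just a commonly used value for SSD or cloud storage, not an official recommendation.

```sql
-- Lowering random_page_cost moves the planner's flip point between sequential
-- and index scans closer to where it actually sits on SSD-backed storage.
SHOW random_page_cost;                      -- 4 by default
ALTER SYSTEM SET random_page_cost = 1.1;    -- cluster-wide, needs a config reload
SELECT pg_reload_conf();

-- Or just per session while experimenting with a particular query:
SET random_page_cost = 1.1;
```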
And also, you did a lot of the work on extended statistics,
and that feels like another tool that used well could start to teach the planner
about certain relationships and make cardinality errors far less likely.
So, and again, maybe then less likely to have plan flips at the worst times.
When we do have the plan flip,
it's more likely to be closer to the optimal point.
So we're less likely to have a severe cliff at least.
What do you think?
I think you are absolutely right that like
tuning some of the parameters for like random page cost,
for example, and something like that,
is exactly one of the things you can do
to convince the planner to make the correct decisions
like more often, right?
Because you are essentially moving the flip point
between the costs closer to the actual flip point, right?
Between the execution costs,
like measured in time, for example.
So that's definitely true.
I think most of the time we actually make the correct
decision for, like, regular queries,
because the selectivity where we would have a problem
is kind of like in the middle, right?
Somewhere.
And usually queries have, like, very low or very high selectivity, or something.
So in between, it's not very common.
It can happen, but I don't really have a perfect solution to that.
Extended statistics definitely can help, but it's still the, you know,
the very early decision, right?
It's going to help with improving that in some cases.
And I know you had a whole podcast about that.
The problem with the extended statistics
is that it still is like an extremely simplified
representation, an approximation of the data.
You know, like, that's the whole point of having statistics: to have a very compact
data set, which you can use to do the planning quickly, right?
Which means that it definitely is missing some of the details.
And that can be fatal, right?
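For concreteness, here is a minimal sketch of extended statistics, assuming a hypothetical addresses table where city and zip_code are strongly correlated (the classic functional-dependency example; names and values are made up).

```sql
-- Without extended statistics the planner multiplies the two selectivities as
-- if the columns were independent, which can underestimate the row count badly
-- and push the plan over a flip point.
CREATE STATISTICS city_zip_stats (dependencies, ndistinct)
  ON city, zip_code FROM addresses;
ANALYZE addresses;

EXPLAIN SELECT * FROM addresses
WHERE city = 'Prague' AND zip_code = '11000';
```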
So I think what we need to focus on in the future
is kind of like making the execution part a bit more robust,
like moving some of the decisions to that,
instead of expecting the decisions in planning to be perfectly correct all the time
to kind of like allow the execution to actually recover from mistakes.
Change its mind on the fly, right? Change the execution path?
Kind of, right? Or like, yes, there are different approaches to that. Some
are kind of, like, difficult to do correctly, like replanning, for example, right? Or, a
typical problem for a planner is, like, a sequence of nested loops, right? When you have underestimated the join size
or like a condition selectivity,
then you pick a nested loop
and then it explodes and takes forever.
So there are approaches which kind of
blend two different paths through the join
and either do like a hash join or like a nested
loop, right?
But there are, like, practical issues with what you can do, limiting the behavior of the optimizer,
right?
Because if you want to do this, then for example, you cannot emit any tuples from that join
until you know that that's the correct approach,
right?
Because then you wouldn't be actually able to flip to the other and so on.
And I see another challenge, it's like user experience: we all are very scared about the
probability of plan flips sometimes, right?
And it's dangerous, especially when you do an upgrade, a major upgrade; plan
flips are not uncommon.
And I know, for example, AWS RDS Aurora has an extension, which is not open
source, to freeze plans, so no flips are possible.
With what you describe, if I see some plan in EXPLAIN, but then during execution the plan
can be changed, it makes my head think, like, how can I control it?
I'm not controlling it, I just understand what's happening, right? Because EXPLAIN will lie in this case, right?
Hmm.
How would... or maybe you see something that could help, to see, like,
two plans in EXPLAIN, or three plans, and we know that...
But we already can show this kind of stuff in the plan, right?
Where when you have a sub plan,
we can either have, like, a regular subplan
or a hashed subplan, I think.
So we kind of like can show like alternative plans.
I don't think that would be a problem.
Obviously that wouldn't tell you
like what exactly happened during execution. It
would need some additional improvements. But yes, it adds ambiguity to the explain plan.
We would tell you: these are the possibilities. You don't know what's going to happen, and it could actually differ between executions
of the same query even, right?
But that's the price for higher robustness, right?
Right.
I'm just thinking observability tools,
many people want them to be improved,
like to track plans; pg_stat_statements is not enough to have plans.
One normalized query can have multiple plans.
In the past, there were attempts to have pg_stat_plans
or something.
There are also extensions which allow
you to see the plan for an ongoing query.
I think if this direction is going to become the norm, to alter the plan on the fly, then observability
tooling needs improvements for sure, because we need to understand what's happening.
And auto_explain as well, right?
Probably it should report the initial plan and then the additional plan, or something like this alternation. So I
just feel a bigger need to improve the observability part of Postgres as well.
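As a rough sketch of the kind of tooling already available for catching plan flips after the fact (the thresholds below are arbitrary examples, not recommendations):

```sql
-- auto_explain logs the actual plan of any statement slower than the threshold,
-- which is often how an unexpected plan change is first noticed in production.
LOAD 'auto_explain';                           -- or add it to shared_preload_libraries
SET auto_explain.log_min_duration = '500ms';
SET auto_explain.log_analyze = on;             -- include actual row counts and timings
SET auto_explain.log_buffers = on;
```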
Yes, I have nothing against observability. And I think... I haven't tried pg_stat_plans for a while,
but I think it still exists.
It's still available, right?
I think it's very useful.
I think there's also a patch to actually add ability
to watch a plan as it's being executed, right?
Which can actually help.
That would be a tremendous benefit
when investigating queries that never complete,
where you can't actually get like explain analyze and so on,
or just watch the execution of a query in another backend.
There is a patch for that.
It's definitely not going to be in Postgres 18,
but I think Rafael actually is working on that.
I hope I didn't get the name wrong.
Yeah, but again, I don't think that's necessarily
about the performance cliffs only, right?
That's about like monitoring in general.
This concept of robustness is really interesting.
I don't think I had come across it as a defined term
until reading some of the papers that you linked to
at the end of your talk.
And I found it fascinating.
I think it aligns well with my experience
of what people really care about in performance.
I think
correct me if I'm wrong, but I got the impression it's really about trying to avoid
really bad worst case situations at the cost of potentially slightly worse average execution time, for example, if we want to optimize based on time. So this idea that on average, we might perform slightly worse,
but our worst case will be nowhere near as bad as the,
so the P99 would be way better or way less severe,
but the P50 might be a bit higher.
Is that about right?
Yeah, I think that's exactly right.
You're exchanging the, you know,
worse average performance for having to do
fewer investigations of performance issues.
And I think that's perfectly reasonable trade off. I mean,
I do like working on like micro optimizations and micro benchmarks and everything.
That's great. But in the end, someone needs to be using the database, right? I mean, like,
that's the point. And, I don't know, when people had, like, one database,
and they cared about that only, that was fine. As soon as you have, like, a larger fleet of
database servers that are actually used by people or by applications
or something, you definitely will value robustness, like not having issues, even if the peak
performance is maybe 25% lower or something like that. Because it's much easier to spend a bit more on the hardware
than having to deal, having to run a large team
investigating and fixing issues
and firefighting all the time.
That just doesn't work at scale.
And I think it's also what some of the cloud hosting
providers, like, realize, right? So it's a bit against my
nature, because I definitely do want to do like the best
performance possible, right. And I've been benchmarking a lot and
like doing like performance related patches for a long time.
But at some point, I think it's kind of pointless, right?
Because people will not be actually able to use that in production anyway.
Yeah.
I mean, there are definitely, it feels like there are cases and there are people trying
to get the best of both worlds, right?
Trying to get a faster than we currently have solution that is also more robust.
Like we had Peter Geoghegan on to talk about the work he's been doing, like in the direction of skip
scan. But that work seems to make some decisions post-planning, during execution, as to whether it's going
to skip or continue trying to skip
or just run through the index sequentially,
depending on certain statistics it gathers along the way.
That strikes me as work in this direction
that in the real world doesn't seem to give up too much
on the optimal path either.
So I think you are talking about
the multi-dimensional access method that Peter was talking about.
And I think it's actually a win-win situation in this case, right?
Because he actually can make better decisions based on the information that he has
at the planning stage, right?
So he actually is not sacrificing anything.
He actually, he's improving the behavior in many cases.
So I think it's amazing what he's doing with indexes.
And I've been grateful to actually collaborate with him
on the index prefetching patch.
So, and his comments were very useful.
And I think it's actually Peter Geoghegan who pointed out,
like, a quote from Goetz Graefe, who is one of the
authors of the papers, who said that choice is confusion.
By adding more different ways to execute a query to the optimizer, you are just
giving it more chances to make a mistake. So I think the robustness,
like the main parts of the features that improve robustness,
need to be in the execution path.
It's like adjusting based on the actual data, right?
Yeah. So we've got probably more than this,
but three main times we're having to make,
or the planners having to make decisions
that are really impactful for plan flips.
We've got scan type.
So we've got kind of sequential scan,
index scan, bitmap scan, index only scan,
perhaps some others
if you've got extensions installed.
And then we've got join algorithm.
And then we've also got join order.
Based on your talk,
I've read a bit about the first two,
but nothing about the last one
in terms of execution time optimizations.
Is there anything around that that you'd be wanting to be looking into?
So for the first two parts that you mentioned, there are actually, you know,
papers proposing, like, on-average-slower, but more robust, execution paths.
And I think it would be interesting to have some of those
as extensions in Postgres.
Because we do have the custom scan interface
and we can actually inject a different path.
So that would be fine.
For the join order search, I'm not sure. I'm aware of, like, some issues in that,
but I think for the typical queries the algorithm actually works quite fine,
but there are some examples of queries where that actually, you know, blows up,
right? And I don't know if that's what you've been pointing at, but, like, a star join, for example,
is like a great example, right?
Because we kind of, like, end up exploring n-factorial different plans.
And we can actually improve that.
I actually posted like a proof of concept patch for improving this case. But there's a lot of work that needs to be
done on that. But that could actually help tremendously.
Because I think in another talk, I compared like performance for
different kinds of workloads, since like Postgres 8, I think.
And on like a star join with, I mean, OLTP star join,
which means like you select one row from
or a small number of rows from the fact table,
and then have, like, 20 dimension tables joined to it.
So for that, we didn't actually improve
throughput at all since Postgres 8, right?
Which is, like, crazy, compared to the, you know, one or two orders of magnitude
better throughput for the other like OLTP workloads.
I think that's a clear bottleneck there.
And it's exactly because, you know, the join order search is not great.
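The query shape being described looks roughly like this; table and column names are hypothetical, and only three of the "like 20" dimensions are shown.

```sql
-- OLTP-style star join: fetch one fact row plus attributes from many dimensions.
-- With many such joins, the planner's exhaustive join-order search blows up.
SELECT f.*, d1.name, d2.name, d3.name
FROM fact f
JOIN dim1 d1 ON d1.id = f.dim1_id
JOIN dim2 d2 ON d2.id = f.dim2_id
JOIN dim3 d3 ON d3.id = f.dim3_id
WHERE f.id = 42;
```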
We do have something we call genetic optimization,
but I think almost no one actually
uses that because the default configuration parameters
actually force us to do a different thing.
So we don't actually use the genetic optimizer at all. Well, when we have 20 tables joined, it's used, right? Because the threshold...
No, no, no, no, no, no. Because first we split that at 8 into smaller join searches, right?
And we do the regular algorithm for that.
I see.
So you would actually need to change the defaults, and only then would the genetic algorithm be used.
I didn't know this. Okay.
Yeah, I didn't know that either, until I started doing the benchmarks to compare the different versions.
I was wondering why the genetic optimizer is not actually visible anywhere.
And it's because of this.
So people don't use that.
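These are the defaults being referred to; raising the collapse limits is shown only to illustrate how GEQO could ever be reached, not as a recommendation.

```sql
SHOW geqo;                 -- on
SHOW geqo_threshold;       -- 12: GEQO kicks in at this many FROM items
SHOW join_collapse_limit;  -- 8: explicit JOINs are collapsed into lists of at most 8
SHOW from_collapse_limit;  -- 8: same for subqueries pulled up into the FROM list
-- Because the join lists are capped at 8, the planner rarely sees 12+ items at
-- once, so the genetic optimizer effectively never runs with default settings.
SET join_collapse_limit = 20;
SET from_collapse_limit = 20;
```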
When you enabled it,
did it improve on later versions versus version eight?
I think it did.
But I don't remember the details.
I haven't like included that in the charts at all,
for example.
That's a good question.
I don't know.
Thinking about, like, next steps for Postgres,
what would you be most excited to see people working on
or what are you looking to work on next
that you're excited about?
So I definitely would like to see someone working on
the robustness patches, for either the smooth scan (I do have,
like, a proof of concept patch for that) or the G-Join, I think they call it, which is like
a robust algorithm for joins. So I think that would be, like, extremely useful. And I'm 99%
sure it could be done as an extension.
So that makes it like much easier to hack on
without breaking everything else.
And it also makes it like immediately available for users,
which I think is, that's very useful.
And then I think there are actually some smaller patches
where people could improve individual performance
cliffs.
If I go back to the example that I used in the talk, I can imagine people experimenting
with choosing the flip, the number at which we flip from a linear search to a hash search, right?
So I can imagine people doing some runtime measurements and doing that.
I think another problem for robustness is just-in-time compilation for us, right?
Which adds a completely different level of robustness issues, like unpredictable behavior.
We just switched it off.
But that's great.
Yes.
I think that's the default recommendation, but it's also quite sad, right?
Because just-in-time compilation can be really helpful for some queries.
But it's a robustness decision, right?
Like Nikolay often does, and I've seen people as well,
choosing potentially worse on-average performance
on certain queries, to not get those outliers.
So it's exactly about robustness.
Yes, I think that's a great example,
showcasing the decisions people make, right?
And they almost always choose better robustness if that's a production workload, right?
I think that's fine, right?
And I think we could have like a, I don't know if you had a session about just-in-time
compilation, but I think there's a whole set
of things people could do to improve that.
It could be better planning, or it could be using a different library for JIT.
Because a lot of the problems that we have are actually due to LLVM not being a well-suited library for the kind of JIT that we do for
the use case.
Yeah, interesting.
We haven't done an episode on it.
I think you're right, it would be a good one to do.
Who would you recommend we talk to about that?
Is it Andres or who is it?
I think Andres is the person who knows the most about that, because I think he's
the one implementing that.
So you can also like complain to him and he's the right person, right?
But I think there are actually other people who submitted some patches for, like, improvements.
I think, like two years ago, there was a blog post about someone actually implementing, like, a custom JIT.
And that might be another good guest, I think.
I completely forgot about that. That's a good idea.
And the other thing is, I think it's based solely around, like, a cost limit.
That feels very crude to me.
Yes, as I say, the planning is very, very simplistic. It essentially makes the
decision for the whole query at once, right? I mean, the fact that it's based on a
cost limit, that's roughly fine, but then crossing the cost limit for the whole
query can also enable just-in-time compilation in parts of the query where
that's not really useful, right?
So that's definitely one of the things that we could improve.
I mean, like it's not a perfect solution for all the cases,
but it would be, I think, a better approach.
But I'm sure, like, there are reasons
why it was done like this, right?
So I don't want to, like, make a definitive statement
about solutions.
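For reference, these are the JIT knobs being discussed, with their stock defaults in recent Postgres versions; turning JIT off entirely is the blunt robustness choice mentioned above.

```sql
SHOW jit;                        -- on by default since Postgres 12
SHOW jit_above_cost;             -- 100000: JIT only for queries estimated above this cost
SHOW jit_inline_above_cost;      -- 500000
SHOW jit_optimize_above_cost;    -- 500000
-- The decision is made per query from the total estimated cost, which is the
-- "crude" part: a single threshold for the whole query, not per expression.
ALTER SYSTEM SET jit = off;      -- the common "robustness over peak performance" choice
SELECT pg_reload_conf();
```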
Yeah, would be an interesting discussion.
Was there anything you want, anything we should
have asked about that we didn't? No, I think we discussed a lot of the different parts
that I mentioned in the talk. So I think that's fine. It's a really excellent talk
and we definitely didn't discuss it all and it comes across really well with examples.
So I will include the links to the talk and the slides as well.
And I'll include a link to the San Francisco meetup.
So if anybody's got questions themselves, I guess you're attending live so they
can ask them at the meetup as well.
Yeah.
Or they can just like send me an email or something.
I also do something I call office hours
where people can just let me know
what they would like to talk about.
And it could be, like, a patch they are considering, or whatever.
And we can definitely, like, schedule a call or something.
That's also what I do for the community, yes.
Yeah, that's awesome.
I did actually want to say thank you
for everything you're doing,
because you're also doing some mentoring and some patch ideas for new potential committers.
And I think that kind of work is very, very valued and very, very valuable. So thank you.
Yeah.
Awesome. Well, thanks so much for joining us, Tomas. And yeah, I hope you have a great week.
Yep. Bye.
Bye bye.