Postgres FM - Getting started with benchmarking
Episode Date: August 16, 2024

Michael and Nikolay are joined by Melanie Plageman, database internals engineer at Microsoft and major contributor and committer to PostgreSQL, to discuss getting started with benchmarking: how it differs for users and developers of Postgres, how and when it comes up during development, some tools and lessons, as well as what she's working on at the moment.

Here are some links to things they mentioned:

Melanie Plageman https://postgres.fm/people/melanie-plageman
Melanie's Introduction to Benchmarking With pgbench talk slides https://postgresql.us/events/pgconfnyc2023/schedule/session/1410-introduction-to-benchmarking-with-pgbench/#slides
Melanie's Visualizing Postgres I/O Performance for Development talk recording https://www.youtube.com/watch?v=CxyPZHG5beI
Melanie's Visualizing Postgres I/O Performance for Development talk slides https://speakerdeck.com/melanieplageman/o-performance-for-development
pgbench https://www.postgresql.org/docs/current/pgbench.html
Mark Callaghan's blog https://smalldatum.blogspot.com
Some of Tomas Vondra's blog posts https://www.2ndquadrant.com/en/blog/author/tomas-vondra
Some of Andres Freund's blog posts https://www.citusdata.com/blog/authors/andres-freund/
An example of Alexander Lakhin's work https://www.postgresql.org/message-id/b32bed1b-0746-9b20-1472-4bdc9ca66d52%40gmail.com
Simplifying the TPC Benchmark C, an OLTP Workload (talk by Mark Wong) https://www.youtube.com/watch?v=qi0I74urLoY
Matplotlib https://matplotlib.org
pandas https://pandas.pydata.org
pg_stat_io https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-IO-VIEW
pg_stat_io commit https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a9c70b46dbe152e094f137f7e6ba9cd3a638ee25
pg_buffercache https://www.postgresql.org/docs/current/pgbuffercache.html
tmpfs https://docs.kernel.org/filesystems/tmpfs.html
Eager page freeze criteria mailing list thread https://www.postgresql.org/message-id/CAAKRu_b3tpbdRPUPh1Q5h35gXhY%3DspH2ssNsEsJ9sDfw6%3DPEAg%40mail.gmail.com
The path to using AIO in postgres (talk by Andres Freund) https://www.youtube.com/watch?v=qX50xrHwQa4
Improve dead tuple storage for lazy vacuum (Masahiko Sawada) https://www.postgresql.org/message-id/flat/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA@mail.gmail.com

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and as usual, I'm joined by Nikolay, founder of Postgres.AI.
Hello, Nikolay.
Hi, Michael.
Today is no normal episode, though, because we are delighted to be joined by Melanie Plageman,
who is a database internals engineer working at Microsoft and a major contributor and committer to PostgreSQL.
It's a real honor to have you on the show. Melanie, welcome.
Thank you. I'm excited to be here.
We're excited to have you. There are so many things we could have chosen as a topic today,
but off the back of a couple of talks you've given at Postgres conferences, we've opted to go
for the topic of getting started with benchmarking. Would you mind giving us a little background about
why it's a topic you're particularly interested in?
Yeah, sure. So since I started working at Microsoft a few years ago, a lot of the work
that our team does is performance related. So I would say as a Postgres engineer, even for features
that are not performance related, you have to make sure that you understand their performance impact.
But we tend to work on things like AIO and other features where you wouldn't do them if they didn't have a positive performance impact.
So assessing that is a huge part of my job.
And it's an area where every little detail matters.
And the things that don't matter are kind of counterintuitive.
And it's just, it's kind of an art, and it's so hard to learn.
It's like a dark art, you know, and I think I've been really blessed in that I've
gotten to talk to and learn from people that know how to do it, like Andres Freund and
other people at Microsoft and in the community, but not everyone gets to learn from people
that have, you know, 15 years or 20 years of experience.
So I like talking about it and trying to see if I can help people get started
and demystify what I can about it.
Yeah, that's a good point.
So you think it can be better, right?
And people can understand benchmarks better, right?
What are the tools, do you think, for that? To find solutions, to better understand bottlenecks faster, and so on.
Misleading results, we have them all the time: we think we found something, but it's wrong, and then one day later you realize it's completely the opposite, sometimes, right?
So you think it can be improved, right?
So I think one of the things that users have going for them is that they have a baseline understanding of how you should configure your database.
And some of the mistakes that maybe I made when beginning performance work, you might not make.
So I remember the first set of benchmarks that I did, I didn't configure shared_buffers at all. This was years ago, so I give myself a little bit of a pass, but basically I was looking at it and I was saying, this is supposed to be an improvement, and it looks like it's actually hurting performance.
And I was showing Andres, and he was like, well, tell me how you configured it. And, you know, I didn't change any configuration. And he was like, okay, we'll just throw all of this away.
And so some of the basic stuff around benchmarking, I think that's something that developers who maybe don't have as much user experience have to learn, but users have that.
I think one of the things that I see happen more than you would think is that there's some unexplained performance impact. Like, a lot of times you develop a change and you want to see whether it improves performance.
And then when it doesn't, or it has an effect that you can't explain, you have to start investigating.
And one of the things that people almost never do, at least like performance engineers, they don't look at the logs.
And it's like, maybe because it's a development branch, right?
Like maybe you actually had a bug and it's crashing, or, you know, in the logs there's something that's actually going wrong, and that's what's slowing it down.
And it's such a basic thing that you don't think about.
You're looking at iostat and looking at these more advanced tools when you really just need to, like, look at the log, you know?
Or it was fast because it didn't do any work, right?
Yeah. And looking at logs, you can notice it, yeah.
Good point. Sometimes even some companies, we had one, I won't name it, but sometimes some benchmarks are published and they claim something, and if you have 10-plus years of experience, you quickly have some ideas of what's wrong with those benchmarks.
For example, full-text search in Postgres is bad, it shows O(N). You think, how come? It's not possible. Like, you quickly realize it's not possible, and you look at this and realize, oh yeah, they just didn't build a GIN index at all, right?
But if you don't have experience, it's easy to make errors, right?
To have errors, mistakes.
And my idea, it would be good
if something would help people to brainstorm what's wrong
or what else to look at as early as possible.
So we need some brainstorming mechanism, right? What do you think?
Yeah, almost like a checklist or that kind of thing.
I think that's one of the reasons why, when I started out with development, I thought, you know, with Postgres there's not enough basic resources, like for performance work, that give you a checklist.
But then I think I realized over time, I sort of moved towards thinking you almost need that one-on-one interaction with someone who has more experience, because it's just so individual. And so it's just so hard to give
general advice to people. So I think that there are basic questions you can ask yourself.
There's a big difference between... I think a lot of users, when they do benchmarking, a common case is they're thinking about upgrading. And so they're like, let me just see if it's slower or faster, whatever, right?
In that case, using pgbench for that is not necessarily going to help you, right? Because first of all, benchmarking what you actually want to benchmark with pgbench or any benchmarking tool is hard. And second of all, what if that's not your workload at all?
So let's say that in Postgres there were changes to the way that vacuum works, and you run a benchmark for less than, you know, a few minutes. Well, did you even end up having anything be vacuumed? You almost have to think about, what do I actually care about?
So if you're a user who's worried about upgrading, I guess you guys would know better than me, but you probably have options like, maybe not in production, running your real workload on a newer version and seeing how it is.
And like, that's the ideal scenario. But if that's not an option for you, then I think what you have to do is really think about, what are the highest-risk areas for me? Like, do I have one critical report I have to run, or, you know, I need to make sure that I'm able to reach this TPS at this time of day, or whatever it is.
And then you have to have a way to reduce that. Like, you almost have to have a minimal repro for the performance question that you're asking and be able to test it.
And I would say even then, I don't know if you're going to get a perfect answer outside of an actual production environment.
The use case that users have for benchmarking,
I haven't heard of one where I think it's going to work out for them,
but I don't know.
Maybe you can tell me what people want to use it for.
Let me shamelessly ask you your opinion about the approach I ended up having for upgrades with our customers, big and small, doesn't matter.
So I was a big fan of replaying workload, like, five years ago, but we don't have tooling for it. And mirroring, like, PgCat I think has mirroring, but it's hard. It should be in between clients and server.
And it's very critical infrastructure.
Usually, PgBouncer is there.
And you cannot just quickly replace it with PgCat and so on, and overhead and so on.
And mirroring also is a tricky thing, actually.
So what I ended up doing, I split it into two things.
First thing is general load testing with pgbench.
And usually we have it already, it's done by hackers and so on, like performance farms or something. And we can check it additionally with some synthetic workload. pgbench, or not pgbench, sysbench, doesn't matter.
doesn't matter. So this is one thing. And usually we rely on community results,
or we can extend them as well. But as for our workload, for our project,
instead of replaying the whole thing,
we think, okay, we forget about lock manager,
buffer pool, and so on, all those components.
And we care only about plan flips.
We want just to check the planner behavior.
And we just grab query examples, like 100 of them, which are most critical in terms of total time or calls or something, from pg_stat_statements.
Sometimes we collect multiple examples for one normalized query in pg_stat_statements.
And then this is our testing set.
And then we don't care about the machine we use, actually.
It can be a smaller machine.
We just have an exact clone of the production database, to have the same data and pg_statistic, and then we also adjust settings: all planner settings and work_mem, which is not a planner setting but also affects planner behavior.
And then we just reproduce plans, upgrade, reproduce plans again, and compare and see if some plans changed. And we also use BUFFERS, of course, and we see costs and buffers, we usually look at those. We don't care about timing, because it can be a smaller machine, cache state can be different. And if some plan flip occurs, we quickly find it.
And it can happen on a small machine, in a shared environment, basically. pgbench is not needed for a plan flip check, right? And that's it.
Yeah, I think, if I understand correctly, at some point you used pgbench, and then you realized that it wasn't actually going to help you with this particular case. It's super hard to reproduce an actual workload.
Yeah, so I support that.
I think really, honestly, what you're doing,
which is looking for plan flips and starting with PG stat statements,
is probably the best thing that you could do
to try to make sure that when you upgrade,
that it's going to work out for you.
Because I think that's one of the only things
that you can
realistically reproduce, because like the planner doesn't take into account other things that are
happening. Each plan is planned in isolation with the statistics and the data that you're talking
about. It's not looking at overall system load. I mean, maybe we should do things like that,
but planner doesn't do that right now. So it's actually possible to do that.
So I think that's a really good sort of precautionary thing that you can do to see if the next version of Postgres
is going to cause you problems.
So yes, I sign off on that.
That makes sense as a strategy.
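For anyone who wants to try the plan-flip check described above, here is a minimal sketch of the idea in Python. The connection string, file names, and use of psycopg2 are assumptions, not part of the workflow described in the episode; the real setup (production clone, matching pg_statistic, planner settings) is up to you, and you may want EXPLAIN (ANALYZE, BUFFERS) instead of plain EXPLAIN if you also care about buffer numbers.

```python
# Minimal sketch: capture EXPLAIN plans for a set of queries before and after
# an upgrade, then diff them to spot plan flips. Assumes queries.sql contains
# one query per line (collected from pg_stat_statements) and that the target
# database is a clone of production with the same data and statistics.
import json
import psycopg2

DSN = "dbname=clone user=postgres"   # hypothetical connection string
QUERIES_FILE = "queries.sql"         # hypothetical file of sample queries


def capture_plans(outfile):
    plans = {}
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        for i, query in enumerate(open(QUERIES_FILE)):
            query = query.strip()
            if not query:
                continue
            cur.execute("EXPLAIN (FORMAT JSON) " + query)
            val = cur.fetchone()[0]
            # EXPLAIN output may arrive as a string; parse it if so.
            plans[i] = json.loads(val) if isinstance(val, str) else val
    json.dump(plans, open(outfile, "w"), indent=2)


def shape(plan_json):
    """Reduce an EXPLAIN JSON plan to its tree of node types."""
    def walk(node):
        return (node.get("Node Type"),
                tuple(walk(child) for child in node.get("Plans", [])))
    return walk(plan_json[0]["Plan"])


def compare(before_file, after_file):
    before = json.load(open(before_file))
    after = json.load(open(after_file))
    for key, old_plan in before.items():
        new_plan = after.get(key)
        # Compare only plan shape (node types), ignoring costs and rows.
        if new_plan is None or shape(old_plan) != shape(new_plan):
            print(f"plan flip for query #{key}")


# Usage: run capture_plans("plans_before.json") against the old version,
# upgrade (or point DSN at the new version), run capture_plans("plans_after.json"),
# then compare("plans_before.json", "plans_after.json").
```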
Before that, we were using pgbench, and we took the most frequent and the most time-consuming queries from pg_stat_statements, and we called it crafted workload.
We tried to find some parameters or examples of queries and put them into pgbench using -f, and then the @ sign to balance somehow, like to have some balanced workload, and then ignore some errors if they happen, like foreign key violations or something.
I cannot say it worked well.
It helped, but then I just realized,
why do we care about buffer pool and so on?
Let's just split this.
So you were doing that before upgrading
to see if you would get the same?
I stopped doing this because it's hard, actually.
It's time-consuming.
And if we care about plan flips, that's it.
We have a recipe.
If we care about improvements in some components of Postgres,
we can research them using some simple workloads like PGBench.
That's it.
Right.
Or specific workloads.
You really can't use pgbench to replicate real workloads.
So, for example, if I want to analyze a patch that I'm doing and I want to replicate some scenario, I might run multiple instances of pgbench at the same time.
So I run one that's doing a select query and another one that's doing a transactional workload.
And you can combine different pgbenches and you can sequence them and that kind of thing, but ultimately each pgbench is going to do the same thing, no matter what kind of custom script you provide.
There are different variables, and you can do things like have variables in the script and then, you know, interpolate random numbers and things like that. But in a real system, you'll have some kind of work happening, and then another kind of work happening that's different work, interspersed, or at unpredictable intervals.
And with pgbench, you know, during a pgbench run, all of the workers are going to be doing the same thing, whatever's in the pgbench script. So it's very, very hard to replicate realistic scenarios with it.
And there's lots of creative things you can do. Like you can have some if-statement logic where you're like, okay, if this worker is this worker number, then do this other thing. But at that point, I think it's not useful if you're trying to understand what the performance impact is going to be for your real-world workload.
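As a rough illustration of the "multiple pgbench instances with custom scripts" approach mentioned above, here is a minimal sketch. The database name, script contents, client counts, and durations are made-up placeholders; it just shows launching two pgbench processes concurrently, one read-only and one transactional, each with its own -f script using pgbench variables and random().

```python
# Minimal sketch: run two pgbench instances concurrently against the same
# database, each with its own custom script. Assumes pgbench is on PATH and
# the standard pgbench tables already exist (pgbench -i has been run).
import subprocess
import tempfile

DB = "bench"  # hypothetical database name

READ_ONLY_SCRIPT = """
\\set aid random(1, 1000000)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
"""

WRITE_SCRIPT = """
\\set aid random(1, 1000000)
\\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
END;
"""


def run(script_text, clients, seconds):
    # Write the custom script to a temporary file and launch pgbench on it.
    f = tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False)
    f.write(script_text)
    f.close()
    return subprocess.Popen(
        ["pgbench", "-n", "-f", f.name, "-c", str(clients),
         "-j", str(clients), "-T", str(seconds), DB])


# Launch both workloads at the same time, then wait for them to finish.
procs = [run(READ_ONLY_SCRIPT, clients=8, seconds=60),
         run(WRITE_SCRIPT, clients=4, seconds=60)]
for p in procs:
    p.wait()
```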
Like, I can see users maybe using it to figure out, to understand a particular GUC or something like that.
Or maybe if you're thinking about changing your hardware, you know, using a different SKU from the same vendor or something like that, you can try to understand how it might affect general Postgres workloads, and then it could be helpful.
But pgbench is mainly a developer tool, and I mean, that's what it's marketed at. Not that it's marketed, but I don't think it's billed to anyone as a tool for users to understand Postgres performance of their workloads.
But I mean, all the other benchmarks that are out there,
I think PG Bench kind of gets lumped in with them.
And it's not really useful for any of the things
other than development, in my opinion.
It's very hard to make it useful for, like, comparing whatever, Postgres to MySQL, or that kind of thing.
I mean, because you can't use it.
Different hardware options, for example, it can be useful.
Yeah, you can use it for that for sure.
Yeah, but it definitely serves a different purpose than the TPC benchmarks, for example, which allow you to compare across databases.
And then, of course, we also have a community of people that are doing
performance testing from version to version of Postgres. That's different than just testing one
patch and saying, what is this patch doing? They're like sort of looking for broad themes
and regressions across versions. And you can use PG Bench for that. But ultimately, like,
I would say it's easier or sort of more meaningful to do something
like run the TPC benchmarks when you're looking to see if Postgres has a regression from one version
to another for the purpose of community work.
You mentioned we've got a community of people doing that. I've seen a couple, but I don't know of many. Do you actually have any names or things I could look into?
So I think the benchmarker that I follow closely, and that many people follow, is Mark Callaghan.
Yeah, great.
Yeah, so he's great, and he actually came to PGConf.dev this year and I got to meet him. I was starstruck when I met him, you know. And he's, like, super nice.
I like his blog's name, Small Datum.
Yeah, yeah. And he's great because, I mean, he's an engineer who has a ton of experience in doing performance work for a long time, mainly in the MySQL community.
And then, with benchmarking, describing your methodology is so important.
So he does that in great detail.
And because he's active and he's doing these benchmarks like all the time, he's providing
a lot of really useful information for the community around, you know, how he does version
to version of Postgres, different types of benchmarks, different on different size machines and that kind of thing. So he's great. And if you want to learn more about
benchmarking, I would say his blog posts are fairly advanced. So like, if you're getting
started, it might not be the best resource. I don't know. But I still am kind of like, okay,
let me read through this. And you know, he's getting down to looking at how does your CPU
frequency governor affect your benchmark results, right?
So it's definitely at the point of... it's definitely past configuring shared_buffers, for sure.
So he's great.
And then I would say within the Postgres community, I mean, there's developers like Tomas Vondra and, of course, Andres, who do benchmarks when they're developing.
And then usually they'll look at the holistic release.
But most of those results are just getting posted on hackers
and they're not...
Because it takes so much time to do what Mark does
and sort of describe your methodology,
like double check every result,
investigate every discrepancy,
and then publish it and sort of be accountable for it.
Because people will come after you.
They'll be like, no, that is not a regression or whatever.
You have to be ready to defend your work.
And it's almost like a full-time job in itself.
So a lot of the benchmarks that get done within the community are people doing them and then
just saying, hey, I found this.
Can anyone else look into it?
So, you know, there's also Alexander Lakhin, who runs TPC-DS and I think TPC-H also, every release, and does a lot of investigation.
And he's great because he does so much investigation of bugs, and reporting of bugs, and reproducing very hard to reproduce bugs.
So if he finds some sort of regression, performance regression, he'll also bisect it
down to the commit that actually caused it. And as a developer, it's nice to have someone doing
that level of analysis. And so again, he's not necessarily publishing a methodology post and all
of that.
Then there's some people that maintain the kits, because if you want to implement the TPC benchmarks, it's not like you just run them. It's more involved than that. So there are people that maintain different kits, the bash scripts and stuff. And I think Mark Wong just did a new one for TPC-DS. So he has a lot of database internals experience, and then he was
involved with the TPC council for a while. And so he kind of is an expert on just the TPC benchmarks.
And so he tends to put together these kits to try to help people because it's not that
straightforward. And then I know there's some ongoing work with the Postgres Performance Farm initiative, but it's really hard, right, to
agree upon what is a useful benchmark in general to run. So you can find some regression or
whatever. And it's like, in order to actually prove that that's correct or valid is a lot of
work on the part of everyone else. So I think in some ways, publishing benchmarks,
you can't just say I set up this farm and I ran all these benchmarks and now I expect people to
go look at them. As a benchmarker, you're kind of accountable for the analysis and trying to find
out if your results are valid.
Practically useful.
Practically useful.
I found one day... I found, we found, my team found a bottleneck in WalSender for logical replication.
It was so easy to reproduce.
You just create some transactions with deletes and roll them back.
And at some point, very quickly, you reach 100% of CPU for WalSender.
And it led to bad conclusions, that in production we won't be able to use logical replication for many things. But it turned out that this kind of workload doesn't happen in the wild at all, because it's a very specific kind of workload. So I guess this bottleneck, maybe it's not worth fixing right now, right?
I mean, it's interesting.
Yeah, you can definitely report things like that on the hackers mailing list or on performance mailing list
and have a discussion about, you know,
if it's kind of the right thing to focus on
and also, you know, if it's reproducible.
I had some discussions with some hackers about this and didn't meet understanding that it's worth fixing.
And then I realized in production we don't see such a bottleneck at all, and I just postponed this research for the future, maybe.
What do you think about observability tools
to be used in pgbench benchmarks?
What extensions or additional tools and so on,
what to look at?
Yeah.
To avoid observer effect as well, right?
Yeah, I mean, I tend to not worry too much
about the observer effect,
depending on what tools I'm using,
because, I mean, as long as you don't have, like, log_min_duration_statement set to zero or something like that. But if you're comparing to another version of Postgres and they're using the same observability tools, it's sort of the price you have to pay, some overhead.
But every person that does performance work is different. But when I started, I made the
mistake of trying to generalize and make some of my tools useful to other people, and no one wants that. So I learned that the hard way.
But personally, I find that it's very hard to look at aggregated numbers at the end of a benchmark run and make sense of them. And that might be what you present to other people, because they don't want to look at every detailed graph that you produced or chart that you produced.
But while I'm doing the work, I can't do it at all without visual representations and charts and things like that.
For the actual charting, I use Matplotlib and pandas and that kind of thing.
But I have all sorts of scripts
that do things, because I use a lot of data input sources. So, I mean, it depends on the patch, but I'll query pg_stat_io or, depending on what the patch is, pg_stat_activity or pg_stat_all_tables, and then gather certain information every two seconds or every one second.
And then I have scripts that take this output, and then I'm able to basically make it CSVs and then load it into pandas DataFrames.
So, for example, I've been working on vacuum recently. And one of the things that you consider when you're doing vacuum performance work is how often vacuum work is being done.
So you might want to look at the wait events and look at, you know, whether autovacuum workers are waiting on the vacuum delay wait event a lot and not actually doing the vacuuming work, and whether you have the right autovacuum settings to actually
be able to observe the thing you're trying to observe.
So, you know, just gathering that information and then plotting it correctly and that kind of thing.
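As a minimal sketch of that kind of sampling script (assuming Postgres 16+ for pg_stat_io, psycopg2 for the connection, and made-up column choices and intervals), it could look something like this; the same pattern works for pg_stat_activity wait events or pg_stat_all_tables.

```python
# Minimal sketch: sample pg_stat_io every couple of seconds during a benchmark
# run, write the samples to a CSV, then load them into pandas for plotting.
import csv
import time
import psycopg2
import pandas as pd
import matplotlib.pyplot as plt

DSN = "dbname=bench user=postgres"   # hypothetical connection string
INTERVAL_SECONDS = 2
DURATION_SECONDS = 120
OUTFILE = "pg_stat_io_samples.csv"

COLUMNS = ["sampled_at", "backend_type", "object", "context",
           "reads", "writes", "extends", "hits", "evictions"]
QUERY = """
SELECT clock_timestamp() AS sampled_at, backend_type, object, context,
       reads, writes, extends, hits, evictions
FROM pg_stat_io
"""

conn = psycopg2.connect(DSN)
conn.autocommit = True  # each sample should see fresh stats, not a cached snapshot
cur = conn.cursor()

with open(OUTFILE, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(COLUMNS)
    deadline = time.time() + DURATION_SECONDS
    while time.time() < deadline:
        cur.execute(QUERY)
        writer.writerows(cur.fetchall())
        time.sleep(INTERVAL_SECONDS)

cur.close()
conn.close()

# Afterwards, load and plot, e.g. reads by backend type over time.
df = pd.read_csv(OUTFILE, parse_dates=["sampled_at"])
pivot = df.pivot_table(index="sampled_at", columns="backend_type",
                       values="reads", aggfunc="sum")
pivot.plot()
plt.ylabel("cumulative reads (pg_stat_io)")
plt.savefig("pg_stat_io_reads.png")
```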
Another thing: because Postgres uses buffered IO right now, I tend to use external tools for IO observability, like iostat, because you can't actually tell with, you know, reads and writes in the Postgres statistics, that could be a read or write to the kernel buffer cache.
You could be reading from there, and that's obviously going to be quite different than actual reads from disk.
So I use iostat input, depending on what I'm doing.
And then also there's different files.
Well, I use Linux, so there's different CPU statistics
or memory
usage statistics that you can get. And so a lot of what my scripts do is just query whatever file
somewhere in the file system that has that information over time at different intervals
so that I can see, okay, what's the memory utilization or what kinds of effects is this patch having on my resource utilization.
So that's a source that I use.
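For instance, a tiny sketch of that pattern, sampling /proc/meminfo on Linux at an interval alongside a benchmark run (the fields and interval here are just examples):

```python
# Minimal sketch: poll a /proc file at a fixed interval and keep a time series
# of a few fields, so it can be lined up against the benchmark timeline later.
import csv
import time

FIELDS = ["MemFree", "Cached", "Dirty", "Writeback"]  # example fields
INTERVAL_SECONDS = 1
DURATION_SECONDS = 120


def read_meminfo():
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in FIELDS:
                values[key] = int(rest.strip().split()[0])  # value in kB
    return values


with open("meminfo_samples.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["elapsed_s"] + FIELDS)
    start = time.time()
    while time.time() - start < DURATION_SECONDS:
        sample = read_meminfo()
        writer.writerow([round(time.time() - start, 1)] +
                        [sample.get(k) for k in FIELDS])
        time.sleep(INTERVAL_SECONDS)
```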
And then I also use some different Postgres extensions,
again, depending on the patch.
So pg_buffercache, again, I just query it at intervals and then see, in terms of the shared buffers that I have, how many are pinned.
And it helps me to see how the patch is affecting the behavior of shared buffers and utilization there.
So there's basically all of the statistics views are fair game.
All of the output from pgbench itself.
So I parse that.
There's progress output, which, you know, pgbench does some averaging and that kind of thing. So I also parse the execution reports that come out, which... they call it a transaction, very confusing, it's literally just an execution of the script by a worker.
So you might get an average TPS, but because of how it's calculated, that might not actually show you the P99 latency of an individual execution of one of the scripts.
So I typically parse all the execution reports. There's one line for each execution, and then I do
various analysis of that and plot that. So that's an input. And there's always new sources. And
it's funny, I'll take some analysis challenges and say, I haven't figured out why there's this
weird spike in IO at this one particular time. And then he's like, well, here's this other input
source you could use to investigate this. And I was like, I have 55 input sources. How is it
possible there's something else
that I could look at?
But I guess everything is caused by something.
And so it's possible to find out what it is,
I guess, if you look hard enough.
Right.
Don't you feel like it would be so good
to have P99, P95 in pgbench?
It would be so good.
It would.
It would work well, right?
I'm also using progress all the time,
like 10 seconds, 30 seconds.
It's so great to see how caches are filled and so on,
but lack of percentiles.
Yeah.
So if you do get the execution reports, you can just then read them in with some Python tooling.
Execution reports, is that -r, or...?
It's -l, --log, and then you can provide a log prefix. Yeah. So that gives you the actual time that each execution of the script
took. But this is so slow, no?
Well, I mean, it's not free.
Yeah, that's true.
Not free at all.
But I think if you wanted the P99 latency,
I'm sure there's statistical methods for getting it without recording every execution,
but you would need something like that.
Yeah, do you use an actual file there, or some RAM disk, or something?
I use tmpfs, and then when it's over, I copy it somewhere to a disk, yeah.
Good idea, I will add it to my toolset.
Yeah. At first I was pretty apprehensive about doing it, because it seemed like a lot of work, but once you actually have the execution reports, with the statistics module in Python,
you can just give it the quantile you want,
and then it just calculates it.
It's two lines of code or three or four or whatever.
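For reference, a small sketch of that (assuming pgbench was run with -l so the per-transaction logs are in files named pgbench_log.*, where the third field of each line is the latency in microseconds):

```python
# Minimal sketch: read pgbench per-transaction log files (-l / --log) and
# compute latency percentiles with Python's statistics module.
import glob
import statistics

latencies_ms = []
for path in glob.glob("pgbench_log.*"):   # adjust if --log-prefix was used
    with open(path) as f:
        for line in f:
            fields = line.split()
            # pgbench log format: client_id transaction_no time script_no ...
            # where "time" is the transaction latency in microseconds.
            if fields[2] == "skipped":
                continue
            latencies_ms.append(int(fields[2]) / 1000.0)

cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
print(f"executions: {len(latencies_ms)}")
print(f"p50: {statistics.median(latencies_ms):.2f} ms")
print(f"p95: {cuts[94]:.2f} ms")
print(f"p99: {cuts[98]:.2f} ms")
```

Writing the log files to tmpfs, as mentioned above, keeps the logging overhead down, and they can be copied to disk once the run is over.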
Yeah, and you mentioned pg_stat_io.
This was great when I first saw it.
Well, this should have existed for many years already,
and we discussed it with Michael.
And we here in PostgresFM,
we often discuss lack of buffers in plans, for example, right?
In individual query plans.
Buffers is like, we try to advertise
it's so important to have buffers in execution plans.
And this is a macro level, like high level.
So good to have it, right?
But unfortunately, I cannot say I see it often,
because it's only there since Postgres 16.
We don't have it in most production systems yet.
But maybe next year I will start seeing it in production, right?
With our customers.
Thank you for doing this.
My question is how you came up with this idea originally.
Well, I can't take credit for it.
So Andres has done a lot of performance work
for years and years.
So when I joined Microsoft,
one of the first projects he suggested
was to add some view
that would improve observability of IO.
And at that time, we didn't have the shared memory stats system.
So the patch was much bigger at that point
because of the way the stats collector worked.
So I did an early version of it.
And then I used it in my own development
and I ended up working on some of the AIO work
and found it really useful for that.
And then I ended up doing some review
of the shared memory stats patch. And after that, it made the pg_stat_io patch much smaller and
simpler. So then I went back to it. But I honestly, like for me, I just looked at it as this
helps me as a developer doing benchmarking.
And so it's cool that users like it because I don't have that perspective, really.
Obviously, it's useful for benchmarking itself.
But I'm wondering about the other way around.
Did you have to look at its overhead?
And how did you go about that?
Yeah, so it was much easier, like I said, once we had shared memory stats, because it didn't need any extra infrastructure. It's basically just another data structure, a couple of other data structures, and then everything works the way that it does.
Now, we already had some counting for different kinds of reads and writes, because that's how, for EXPLAIN ANALYZE and that kind of thing, we had those counters.
And then there were some reads and writes in pg_statio_all_tables, one of the views that has some IO things.
And then there was some for pg_stat_statements.
So there were some places where we already had some IO counting.
And then the other places where we didn't, there wasn't too much risk because if you're adding some
timing for whatever background writer or check pointer, there are processes that it's okay to
add a tiny bit of overhead. So you didn't have to think too much about it. I think the hard thing was actually finding all of the places that we could possibly do IO for the types of IO that we were representing,
and then thinking about how to categorize it, and especially for the different buffer access
strategies. So that's one of the things that was completely not surfaced to users: how buffer access
strategies work and how they reuse buffers.
So for copy and vacuum and things like that, literally you have no information about that.
So I think just figuring out how to represent all of that, that was probably the least performance-focused work of that patch, in terms of the performance impact of it, just because there was a lot of instrumentation already, it just was not complete, if that makes sense.
So you're saying we're already paying a lot of the overhead. Like, if we're running Postgres 15, for example, we're doing a lot of that counting anyway, so storing it, or also displaying it in an extra view, there's very little added in actual instrumentation?
Okay, cool.
Yeah, I mean, we had, for backends that were doing queries, counting of reads and writes and hits in most places. So the places that it was sort of net new were mostly for types of processes where we don't mind a little bit of overhead if there was any, and it would be so small. I mean, it's measuring IO, right? So if you're doing IO, the overhead of that will cover any sort of overhead of instrumentation, unless it's hits.
Actually, it also counts hits, and I found in some cases, for example, monitoring systems try to avoid storing this metric at all, for example, from pg_stat_statements.
But I had some incidents where we lacked hit numbers. But pg_stat_io has hits, right? I'm looking at the documentation just to confirm.
Yeah, it does. This might be kind of overhead of counting.
This might be interesting because it doesn't have real IO.
I mean, in terms of disk.
Yeah, it actually was already being counted.
We had a hit somewhere else, one of the other views,
or maybe it was for pg_stat_statements. I'm not sure.
So that was already kind of being counted. And when you go to get something from shared buffers, there's blocking and all sorts
of other things that happen that are expensive enough that it was pretty negligible. Nice.
Yeah, okay. But are there some benchmarks proving that pg_stat_io didn't add overhead, or, like, just...
Oh, no. You want to see my receipts?
Yeah, yeah. I'm just curious. It's just a general feeling that it doesn't bring overhead or there
is solid proof of it in terms of benchmarks.
That's a good point. It's mostly based on a feeling.
I'm asking just to, like, I'm curious, is it worth checking this, for example?
Because maybe we can strip it out somehow, or compare 15 versus 16 in various cases and see if...
I would look at those benchmarks, but I guess my gut feeling would be that, because when track_io_timing is on we already were doing a lot of the measurements, it wouldn't be much different.
Actually, I can answer myself.
We have benchmarks with insane numbers.
Oh, you do?
Okay.
No, no.
I mean, 15.
We compare all versions and we try to reach as many TPS as possible.
I achieved, how many, Michael? 4 million?
Nearly, 3.75, I think.
Yeah, just select-only pgbench, right?
And we didn't see any degradation for version 16 at all.
It's already some kind of proof that there is no overhead here.
And we have wait events and flame graphs and so on.
We would notice it.
But Nikolay, you wouldn't have had track_io_timing on?
I want to check it. Maybe.
No way. Yeah, if you did, you should repeat it again, maybe you get 4 million.
Okay, okay, right now. Good point.
But yeah, this is great, it's super interesting. Like, how do you choose which things to benchmark and which things not to, then?
Yeah, so I think developing intuition around that is... So I actually became a software engineer, like, the first thing that I worked on professionally was Greenplum, which is a fork of Postgres. So almost all of my software engineering experience comes from being a Postgres engineer. I was pretty new to software engineering,
right? So I think a lot of software engineers develop intuition around like, you know,
if you take a lock every millisecond versus every 30 seconds, like which one of those is okay or
whatever, right? Like it obviously depends on the context, but I think there's a lot of sort of
performance intuition that people develop just from being software engineers. And that's sort of like the first step of, you know, thinking about, okay, well, is it okay for me to take an exclusive lock here? So if you find a performance question and you say, is it okay that I do X, then you have to think about, okay, how would I benchmark this? And in a lot of cases maybe not benchmark: how would I profile this, how would I evaluate or micro-benchmark this, or something like that.
So, for example, in 17 I worked on some refactoring of HOT pruning. That code, adding any sort of additional instructions, you know, potentially has overhead. So of course, if you're doing actual HOT pruning, then you have IO from the WAL and you have IO from writing out the dirty buffer and that kind of thing.
And so you have to think about: is the thing that I'm changing actually going to matter in the context that it's in?
And then I think evaluate from there.
And also how important would it be if there was a difference, right?
So like on every select query, if it's a little bit more expensive, that's obviously bad, even if it's only a tiny bit more expensive.
So Heikki and I did a lot of micro-benchmarking around those changes to make sure that they seemed right. And then you have to design the benchmark so that it's actually exercising
what you think might be the bottleneck, which is its own challenge. And then like right now,
I'm working on something where it has to do with improving eager freezing so that you freeze more data, vacuum freezes more data sooner.
And there's a bunch of different pieces to it, but one of them is a new sort of responsibility that background writer would have for maintaining a new data structure that contains some new statistics. And one of the questions I got in review was like, have you profiled this to make sure that it doesn't have an overhead? And my
first reaction was like, but it's background writer. No one cares about background writer's performance, you know. It's kind of different than on every select query, where you're
going to have some overhead. But again, it was doing something that was like
more, I'd say heavyweight, like it's doing some calculations that could take time and doing them
while holding an exclusive lock. And so even though it's background writer, like it's worth
proving that, you know, the number of extra instructions, for example, is tolerable.
And that's a judgment call. But what you want is a paper trail showing that you did your due diligence, which is why it was funny when you were asking me about pg_stat_io's performance impact, because in that case I kind of thought, you know, it's negligible compared to the fact that you're already doing IO, first of all, and second of all, we had a lot of that instrumentation there.
But for most other things, I haven't had a patch, other than that, probably in the last couple of years where there wasn't a discussion of what the performance impact could be, and then benchmarking or profiling.
Nice.
Yeah, Michael, it was on.
Huh?
track_io_timing was on.
Well, that's good news for Melanie.
There's potential for 4 million.
Yeah, and we didn't see a difference between 15 and 16.
So, I mean...
Wow, wow, wow.
Yeah, I already asked our robot to double-check with track_io_timing off, so we'll see this as well. Good, interesting. I hope we'll have more TPS.
I'm curious what you are working on right now, and the direction of future work, ideas, and so on.
Yeah, so I mean, I started a project last release that had to do with
reducing the amount of WAL emitted by vacuum, and that was kind of a refactoring, but it spun out a lot of other projects. And I think I sort of wanted to take a break from vacuum this release, but it didn't work out that way. So I'm working on a couple of different things around vacuum.
And some of them are leftover pieces around combining WAL records; vacuum used to emit up to six WAL records per block. So we've combined a few of them and there's more to do. So that's one of the things. But probably the more exciting work is around the freezing work.
So that basically was born out of... Peter Geoghegan, a few years ago, I think it was 16, did some work to have us do more eager freezing.
So one of the things we hear a lot from users is they have, say, insert-only tables and none of the data gets frozen.
And then all of a sudden they have anti-wraparound vacuums and then eventually their system becomes read-only.
But even before that, anti-wraparound vacuums, if you don't have any actual pruning or vacuuming to do, can be expensive from an IO perspective.
And so it can slow your whole system down. And
so being more aggressive about freezing data sooner was something that people had been asking
about. And so Peter worked on that. And it's really difficult to find a heuristic that works
for determining whether or not to freeze something. Because if you have a table where you're doing a lot of updates,
like a standard pgbench built-in workload,
then it's a waste to freeze things
because freezing emits wall and it dirties the page
and that kind of thing.
So finding a heuristic that freezes the stuff you want to freeze
and doesn't freeze the stuff you don't want to freeze
has been a journey.
And one of the parts that's really hard is that you have to try to come up with all of the
workloads that are important, the best case and worst case, especially the worst case workloads
that users might have and think about how it would perform. And that's like, so hard. So that took a lot of time. And
I spent a lot of time improving my different benchmarking setups and coming up with these
different workloads and evaluating them, looking at them. So that's one thing I've been working on.
And then last release, I also... so there's the new read stream API that does vectored IO, does larger reads, that Thomas Munro did last release, and I worked on the different users. So for sequential scans. And I had some additional users that didn't go into 17, because there are performance effects that you have to sort of analyze really closely, and they weren't ready, and one of them was...
Was the word "analyze" deliberate there, was it?
Yeah... no, no, analyze went in, actually. That was Bilal's work. So analyze went in, but bitmap heap scan and vacuum didn't.
Got it.
So vacuum is an interesting one, because streaming vacuum sounds great,
but vacuum uses a ring buffer to sort of keep it from having too much of a negative effect on shared buffers.
So you don't wash out all of the things that are resident in shared buffers from your workload.
It reuses buffers, which means that it has to write WAL, typically.
And there's a lot of calculation around how big the ring is, such that the WAL writer can help you, or help vacuum, to write out the WAL.
So vacuum is not doing all of it.
And if you do IO, and if you do reads in a different size, that means that you're dirtying and needing to clean buffers at a different rate. And that actually was affecting how many individual WAL writes and syncs we would do.
So in a given fsync, you can have different amounts of data, but each fsync is a system call and has overhead, right? So you want to amortize that cost over a large enough amount of data and also have the WAL writer help to do some of the work.
So by doing bigger IOs, we were actually making the performance worse in some cases, because of these interactions with additional WAL syncs.
So Thomas Munro has been working on, I think he's calling it streaming write-behind.
It's kind of a way of thinking about, for the purposes eventually of AIO, how backends and maintenance processes, and writes in general in Postgres, can make sure that, if they're doing larger writes, they're actually being conscious of when they should do those writes, based on the WAL that's required that you emit.
So if you're doing reads, you don't have to really think about that. But if you're writing out data, the WAL associated with those dirty buffers has to be flushed first. So you have to think about when do I want to do the writes, because you need to think about when you actually want to do the WAL writes.
So he's been working on that. And it's a tough problem. A lot of benchmarking,
a lot of thinking about it. And buffer access strategy is really complicated.
And those are used for large selects.
Those are used for, you know, copy, vacuum.
So he's doing some work around that.
And I'll probably sort of jump in after some of this vacuum stuff is done and try to work on some of the AIO work that's coming up. And I think that ultimately, like just between those things,
that'll probably be most of what I end up doing on my own.
But I would like to see... for pg_stat_io, there's been a patch to add WAL-related IO statistics to it
that's been around for about a release.
And basically, we haven't come to an agreement on what the right thing to do is. Our thought was that you can configure, you know, your block size for Postgres, you can configure it for WAL, and that they would be in units of the WAL block size. But that actually isn't always the unit that we do reads and writes in, exactly; it usually is.
So now we're actually talking about rejiggering pg_stat_io, especially in light of vectored IO, to change it so that it's not looking at, you know, if you have a thousand writes, and then you have the block size, and then you multiply them to get the number of bytes. Do we change it to represent it differently and put it just in the number of bytes? And how do you sort of represent that, do you want the number of writes to be the number of system calls?
So we were just thinking about how to actually represent it. But I would love to see WAL stats go into pg_stat_io in 18 also. So we'll see.
And there's also a bunch of other exciting things going on, like Masahiko Sawada's working on
parallelizing the first and third phases of vacuum. So he did that really exciting work in 17 on the TID store, and making it so that...
I think that's my favorite feature from 17.
Can you explain it? Yeah, I don't know that one.
Yeah, so, you know, one of people's biggest complaints about vacuum is when they have to do multiple rounds of index vacuuming, because maintenance_work_mem, even if you set it to a high value, you might end up filling it up with dead TIDs. And then you have to do a round of index vacuuming.
And if your indexes are very big, you can imagine that that ends up really affecting your vacuum performance. So what he did was introduce a new data structure
that organized the dead TIDs in a way that was much, much, much more efficient.
And that made it so that you're using way less memory for the actual dead TID storage, but then also you can end up sort of changing it on the fly. Basically, you're not tied to the same restrictions that we had around the size of maintenance_work_mem as before.
So most people are never going to need to do multiple passes of index vacuuming. So I think that people with very large indexes and
large tables are going to see their vacuum performance be a lot better.
I have big customers with big partitioned tables. We've talked about partitioning for years, and it's hard, and they will benefit. So vacuuming will be faster, basically, right?
Yeah. I mean, if you don't have this problem
with the multiple index vacuuming passes,
then maybe not.
But it's something we hear a lot from sophisticated customers.
And also, I remember, in Postgres 17... before that, only up to one gigabyte could be used.
Yeah, that restriction is lifted now.
Is it related to this work?
Yes, it is related, yeah.
I suspected so. Good, yeah.
So that's really cool, and I think it's something
that's sort of hard to explain unless you've had the problem.
I think if you haven't had this problem,
you're not really thinking about it.
But for people that do have this problem,
I think it's going to make a big difference.
Yeah, so he's sort of building on that work and there's a lot of things that we can do with vacuum because of having, so this read stream API
will make it somewhat easier to do parallelization of the first and third phases of vacuuming. So
he's working on that as well. And then I think, you know, the dreams of having global indexes involve a step where you persist the dead TIDs to disk, because otherwise you wouldn't be able to. Basically, there's discussion of being able
to split up the phases of vacuuming and be able to vacuum indexes at some
point and come back to it and do it later. Like vacuum, just do the first stage of vacuuming.
And if you want to do the phases of vacuum separately, you have to have the dead TIDs that you need, basically what index entries you need to delete; you need to save that somewhere that's not in memory. And so there's some modifications to the TID store that they can make, and probably make it easy to persist.
And I don't know that he's planning on working on that now, but Dilip had done some work on
dead TID storage. And so I think there's a lot of exciting things around vacuum that'll be happening.
And there's also been discussion, which I know has happened in the past, but new discussion, about autovacuum scheduling. And not just autovacuum scheduling, but cost-based delay, and how to change that to be more adaptive, and to use statistics that are collected while vacuuming to decrease the delay adaptively if you're not able to finish the vacuum appropriately, and things like that. So we'll see.
We talked to Peter Geoghegan in the past as well, and I remember the deduplication work. So you both sound not like people who dream of getting rid of vacuum completely,
but you sound like there is a big potential for improvements and it's
happening.
And so it's going to stay,
but with improvements,
this is like,
I'm trying to simplify the undo people.
I mean,
they still,
the undo people.
Right.
I think that there has been a lot more work on vacuum this release than there has been in 16 and 17.
Yeah, yeah, that's great. That's great. Many people need it. I mean, both instances.
I like this kind of work. I think Peter Geoghegan's done a lot of work in the last few releases, and you and others, that focuses on things that will help people, where almost everybody will benefit without having to change anything. Transparent. And those are so powerful, they're so easily forgotten or easily ignored, but everybody benefits.
It gets better for almost everybody without them, A, noticing, and, B, having to do anything.
Just so powerful,
those kinds of changes.
So thank you for working on those and encouraging others to as well.
I appreciate it a lot.
Yeah.
The dream is that we get rid of all of those autovacuum GUCs, because
that's terrible.
Like, there's, I don't know, the fact that there are so many blog posts.
I mean, yeah, like, they should all be gone.
Your auto-vacuum configuration should be basically nothing.
I mean, the system should figure out what to do.
We often discuss, like, outdated defaults, but this is a radical position, right?
Yeah, I mean, it's not going to happen overnight, but I think my vision, and a lot of people I think want this... I think the cost-based delay system is really hard to understand, because it's not intuitive. The relationship, like scale factor and all of that,
it's just like, what is that?
I mean, I understand why it happened
from a developer perspective,
but I think that getting vacuum to do the right thing
is like our job as engineers
and shouldn't be users' jobs, basically.
So we'll see.
Well, we don't do clips on this show, but if we did, I think that would be one.
Also, the number of workers: three workers. You have, like, almost 200 cores, and you have only three workers by default. This is insane, right?
Yeah. Well, actually, I will take it back. I would say that the one configuration that I think users should be able to provide is the max number of workers, because that's your system preference, right?
I mean, maybe not the number of workers, the minimum number, but the maximum number.
You should be able to keep the system from having 50 processes just doing vacuuming work
if you want.
I think that should be...
Because this is the second phase of writes, we need it, right? Because if you just deleted something or updated something, the work is not done until it's vacuumed. You need more workers anyway.
Also, Postgres doesn't know how many cores, right?
It doesn't know, it doesn't take that into account.
Yeah, I mean... but, like, people are allowed to use their servers for other things at the same time.
And maybe you don't, but like we sort of, we don't assume that we have access to everything, I guess.
Yeah. Makes sense.
Thank you for coming. It was very interesting.
Thank you for having me.
And for this work. Again, thank you for pg_stat_io, your work, and so on, and benchmarks.
We will keep an eye on it. And good luck.
Thank you. Yeah, thanks so much. Thanks. Bye.