Postgres FM - pg_ash
Episode Date: February 20, 2026

Nik and Michael discuss pg_ash — a new tool (not extension!) from Nik that samples and stores wait events from pg_stat_activity.

Here are some links to things they mentioned:

- pg_ash: https://github.com/NikolayS/pg_ash
- pg_wait_sampling: https://github.com/postgrespro/pg_wait_sampling
- Amazon RDS Performance Insights: https://aws.amazon.com/rds/performance-insights
- Our episode on wait events: https://postgres.fm/episodes/wait-events
- pg-flight-recorder: https://github.com/dventimisupabase/pg-flight-recorder
- pg_profile: https://github.com/zubkov-andrei/pg_profile
- pg_cron: https://github.com/citusdata/pg_cron

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
- Michael Christofides, founder of pgMustard
- Nikolay Samokhvalov, founder of Postgres.ai

With credit to Jessie Draws for the elephant artwork.
Transcript
Hello, hello. This is Postgres FM. My name is Nik, Postgres.AI. And as usual, with me is Michael, pgMustard. Hello, Michael. Hello, Nik.
How are you? What's new?
I'm good. Not as much new as you, I think. You wrote a blog post recently. You've been working on...
Oh, I saw. Yeah, little improvements here and there. Doing some research in the background as to what people want and need. But yeah, you've got some much more exciting news, I think.
It's small, but maybe good. I don't know.
I'm experimenting a lot lately with AI as well,
constantly learning.
It's crazy times.
People reconsider workflows, obviously, right?
How we work.
So it's easier right now to ship some stuff.
And all the ideas you had over a long time,
it's quite easy right now to implement,
if you just shift from coding to engineering
and organize design
and verification of all the details, like benchmarks,
and focus more on architecture and so on.
It's a great time to have helpers.
And honestly, Opus 4.6 is quite good.
Even better than 4.5.
So yeah.
And others are also good.
I use other models for reviews too, including Gemini and GPT, of course.
Yeah.
Anyway, I wanted to discuss this small tool I just created over the weekend.
And today I'm releasing version 1.2 and publishing it finally.
I have good feedback from a couple of folks, not only from my team, but also external
ones.
And it's working on our production already.
And yeah, I think it gives me the same feeling as I had when I proposed transaction_timeout to Andrey to implement during our Postgres hacking sessions on YouTube.
Because the feeling is: why doesn't this thing exist yet?
Yeah, do you want to say what it is?
Yeah, pg_ash: active session history.
I believe, with all the new setups and growing existing setups of Postgres, millions of
database clusters, that wait event analysis has huge potential and is heavily underappreciated.
And Performance Insights on RDS is great.
Other tools like pganalyze, all tools are great.
But there is huge potential for various tools, and for people to understand what it is.
And I think a lot of non-experts don't understand what it is.
And they are going to explore it soon, I hope, because it's a great tool to troubleshoot database issues.
You know, like in Brendan Gregg's books and presentations: he says if somebody asks him for a single Linux performance analysis tool in the console, it would be iostat.
Because iostat shows disk metrics, right? Latency, throughput, queue size. And it also shows a little bit about CPU as well.
If you have one tool to quickly understand as much as possible,
this is it.
Of course, he also mentions sar, which is great, and so on.
But the same thing here.
If you choose just one approach or tool which will give you as much understanding as possible
of problems during troubleshooting, like why the database is slow,
this is going to be wait event analysis, or active session history analysis.
We had a separate episode about it, maybe a couple of them.
And I have a huge feeling inside me
that it's hugely underappreciated.
When I talk to people who are not Postgres experts:
if they are on RDS, this is the default tool there.
But outside of RDS, they don't know what it is.
And even with RDS Performance Insights, or Cloud SQL Query Insights
(or however it's called now),
there's still a huge lack of use of it all the time.
I think it should be central in all troubleshooting.
It should be the starting point.
Okay, so why?
Okay, go on.
Yes.
So I've been a big fan of pg_wait_sampling for many years.
pg_wait_sampling is an extension
which provides very precise sampling, 10 milliseconds, internally.
But you need an additional tool, like monitoring, to export that data for analysis.
It's hard to analyze it inside, because you still need persistent storage.
The history it keeps doesn't cover a lot.
It's just some tail, and you need to export it.
But it gives you 10-millisecond sampling.
It's a very precise tool.
The problem with pg_wait_sampling,
and we had it for a long time in our monitoring stack,
but slowly we moved our attention away from it,
and implemented what we call lazy sampling,
where every 15 seconds our monitoring tool collects the data,
is availability: a lot of customers don't have pg_wait_sampling.
Among managed platforms, it's present only on Cloud SQL.
I wish it was installed everywhere.
But no.
Unfortunately, that's reality.
So we shifted to lazy sampling.
And it's working; it's enough if the database is huge.
15-second sampling is already good enough.
And we cannot do one-second sampling, because it's too high a frequency.
Observer effect: it's too much.
Because we are external, pulling this data from outside,
connecting to an instance on Supabase or RDS or anywhere.
There is a feeling that this approach, lazy, infrequent,
is not okay.
We need something internal to sample, and then we can export it.
This was the number one reason why I thought we need something else.
Reason number two: there are more and more cases when databases are small,
and it's hard to justify having full-fledged monitoring for them,
because you need to pay basically at least 100 bucks for a monitoring setup.
But if your database is tiny and you pay 50 bucks, it's like some pet project.
There's no way you will decide to spend a few hundred bucks per month for this pet project.
But you still need this, because if something happened a couple of hours ago,
you need to troubleshoot, and Postgres internally has everything, right?
It just doesn't have memory.
So wait events: there are two columns, wait_event_type and wait_event,
in pg_stat_activity.
But nobody samples them by default.
And the third reason was
understanding that current monitoring tools are going to change,
to be better, to provide information in better form for LLMs during troubleshooting.
This was reason number three.
That's how I came to the idea: okay, let's implement something.
And this something must
be low footprint in terms of storage and observer effect. It must work everywhere. And it must
be LLM-friendly in terms of how information is provided. Not a lot of JSON, not needing
to write long prompts all the time, but some compact form you can feed to your AI agent
and ask it to explain what's happening here and where to dig further. And this is how
pg_ash was created. Because of the first requirement, I couldn't create it as an extension.
I couldn't. So I called it an anti-extension. And remember, we discussed pg_index_pilot,
which is automation for recreation of indexes: it was also created in the same fashion.
And also, as a sneak peek: we talked with some Supabase folks, and they seem to
understand this as well, and there is also a new project called pg-flight-recorder.
I hope we will discuss it separately, maybe. It is actually very similar to my pg_ash,
but it covers more.
So we are thinking about how to maybe unite our efforts.
It's an ongoing discussion.
But yeah, it samples pg_stat_statements, pg_stat_io, and so on, all the stuff,
SLRU.
Oh, interesting.
So there was an older project called pg_profile. Andrei
Zubkov created it, and it has existed for quite some time.
It's great.
It's AWR from Oracle, but for Postgres, right?
It also records all the statistics inside, but it's an extension.
And since it's an extension, it's not available on most managed Postgres platforms.
That's why I like the idea of an anti-extension: it's just pure SQL and PL/pgSQL.
So installation is just a case of running an SQL script that, you know,
creates a schema and functions.
Exactly.
Yeah.
So I think this...
Which you could bundle as an extension, right?
But the problem then becomes... well, let me be transparent.
Why not?
Yesterday we discussed it.
And there is an idea that, yes, it could also be an extension, not only a non-extension.
It's just a bunch of SQL.
There is an idea.
Okay, this could be.
I honestly have the thought that maybe Postgres could have some concept of
packages or something.
There was Trusted Language Extensions.
Yes, exactly.
Yes.
That's it.
Exactly.
So David Ventimiglia
(I hope I pronounced the last name right;
I apologize, David,
if I got it wrong)
from Supabase said...
and he's the one developing pg-flight-recorder?
Yeah.
Yes.
So he said
this is the direction he's thinking:
towards packages,
TLE extensions, right?
Because TLE, where it's available
(I think it's available on Supabase and on RDS, right?)
is a straightforward concept.
But in some cases, we need to think about all platforms.
And in this case, we want maximum flexibility.
So I'm thinking now, in future versions,
to package it both ways: pure SQL installation or TLE installation.
Why not?
Yeah, absolutely.
It's just a bit sad that
the idea of having an extension framework was to make distribution easier,
and now we've gone full circle,
and it makes it harder, because everyone's on managed service providers that pick which extensions are available.
So yeah, it's just a bit of a shame.
This sentiment you just shared: it's so deep in me.
I remember 2005, '06, '07, when it was like: Postgres is extensible.
Then the extensions concept was created by Dimitri Fontaine, right?
Oh, really?
Yeah.
He created it; extensions are his thing.
I hope I'm not mistaken.
I can hallucinate often.
This happens because I'm overloaded with information.
And then everyone was excited:
Postgres is so extensible.
But we should blame managed Postgres platforms,
starting with Heroku and RDS and all the others,
for why extensions became an unextendable concept.
Because it's in their hands to decide which extensions to add.
And they have their own reasons, obviously:
every time they add some extension,
it means they need to support it
and take some responsibility for whether
it's secure and so on. I understand that. But extensions became unextendable.
And every time I thought, oh, I want to implement this for Postgres, every time I decided:
no way will I make this an extension. Of course, the concept of Trusted Language
Extensions is great, but it's still not adopted everywhere. It's limited, right?
Yeah. So this is not an extension. It could be, but it's not to start with.
And it installs a couple of tables, some functions.
What does it do exactly?
Yeah, let's talk about what exactly it does.
The idea is: let's just sample,
collect wait event samples from pg_stat_activity with query IDs,
and have some history.
It's quite simple.
So the problem is, we need some helper
to invoke the function which will do the sampling.
And I don't want any Lambda functions or anything.
I want everything inside Postgres,
but I cannot build a background worker or anything,
because I don't want to create an extension.
pg_cron was chosen as a dependency,
and pg_cron is present everywhere.
And I remember I was driving at that time.
I was brainstorming with ChatGPT.
Unfortunately, Claude's
voice abilities are not as good as ChatGPT's.
So I was brainstorming,
and we both agreed
that pg_cron was going to be a problem, because, as any cron, it has only one-minute precision.
I was driving.
I reached my destination.
I was starting to doubt.
I always doubt when working with AI, and with humans, actually.
It's not only about AI.
Humans also can make mistakes.
So I was in doubt, and I said, let's verify.
I checked the documentation.
Check with me.
And thankfully, starting with pg_cron 1.5, per-second precision is available.
Okay.
And this was a moment, okay, I'm going to build this because I need per second precision.
Is one second the minimum, though?
It's the minimum, yeah.
But you're going to ask why.
I was going to ask why one second is the default, and now it makes sense.
Yes.
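For context, the scheduling call looks roughly like this. This is a sketch: the job name and the `pg_ash.sample()` function are hypothetical (the tool's actual function names are not given in the episode), and the exact interval-string format should be checked against the pg_cron docs for your version.

```sql
-- Requires pg_cron >= 1.5, which added second-level (sub-minute) schedules.
-- 'pg-ash-sampler' and pg_ash.sample() are illustrative names only.
SELECT cron.schedule(
  'pg-ash-sampler',
  '1 second',                 -- sub-minute interval syntax, pg_cron 1.5+
  $$SELECT pg_ash.sample()$$
);
```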
Then I was thinking: okay, we are going to write a lot.
We need to write the query ID and the wait event.
So how do we optimize storage?
And there are two big problems.
First, how much we are going to spend in terms of bytes if we're just writing.
And second, we know Postgres, MVCC, and bloat issues, right?
So if you implement some ring buffer for your storage... anyway, let's jump to the solution.
I knew the proper solution from Skype: SkyTools, PgQ, three partitions and rotation, right?
So I just created partitions: yesterday, today, and tomorrow, logically.
So yesterday is a read-only partition, fully filled:
24 times 60 times 60,
86,400 records, right?
Actually, no.
We need to think: do we store one row per
wait event pair, or do we store it differently?
Let's return to this point.
So we have the current partition, today, and tomorrow is truncated,
because TRUNCATE is super efficient, as we all know, right?
No deletes, no updates: inserts and TRUNCATE, that's it.
So this is how we implemented it.
And right now, in the current version,
we have visibility only for yesterday and today.
And you can export it to your monitoring tool, as we are going to implement
compatibility with this extension (sorry, anti-extension) in our monitoring stack.
But there is also a plan
(maybe by the time you listen to this podcast, I'll have already implemented it)
to have roll-up storage for the longer term.
I plan to have one year with a precision of maybe one hour or one minute.
It depends.
It's not a lot.
So this is how we solved MVCC:
just borrowed an idea from good old Skype from 20 years ago.
I still think it's a great idea.
So no bloat.
So that solves the bloat issue.
Yeah, exactly.
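A minimal sketch of that PgQ-style rotation, with hypothetical table names (the real pg_ash schema may differ): three identical tables take turns playing the yesterday / today / tomorrow roles, and the table entering the "tomorrow" role is emptied with TRUNCATE rather than DELETE.

```sql
-- Three physical tables whose logical roles rotate once a day.
CREATE TABLE ash_part_a (ts integer, dbid oid, data smallint[]);
CREATE TABLE ash_part_b (LIKE ash_part_a);
CREATE TABLE ash_part_c (LIKE ash_part_a);

-- Daily rotation step: whichever table becomes "tomorrow" is truncated.
-- TRUNCATE is near-instant and leaves no bloat behind, unlike DELETE,
-- which would create dead tuples for VACUUM to clean up later.
TRUNCATE ash_part_c;
```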
And second, how to optimize storage.
We had several ideas.
So we decided (I actually decided, but it was backed by a lot of benchmarks,
many iterations):
we chose the idea to store one row per second.
So it's a timestamp
and a database ID, because... oh, not one row.
If you have multiple logical databases,
there will be as many rows as there are databases currently active,
in terms of workload, at this very second.
So at this very second...
One row per database per second.
But only those which
have active sessions right now.
Okay.
Yeah.
I thought about it.
If you have a thousand databases, it's insane.
But it's unlikely all of them have active sessions right now.
It might happen if you have a huge machine
and a lot of things happening.
But still, if you have huge machines,
you can maybe afford 1,000 rows per second.
Maybe not; it depends.
There is room for improvement here.
But anyway, right now it's one row per second
if only one database is active right now.
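The raw material for one sample could be gathered with a query along these lines. This is a sketch, not pg_ash's actual sampler; note that `query_id` in `pg_stat_activity` requires Postgres 14+ with `compute_query_id` enabled (or pg_stat_statements installed).

```sql
-- One second's worth of raw data: active sessions grouped by database
-- and wait event; idle backends and this session itself are excluded.
-- A NULL wait event means the backend is running on CPU, not waiting.
SELECT datid,
       coalesce(wait_event_type, 'CPU') AS wait_event_type,
       coalesce(wait_event, 'CPU')      AS wait_event,
       count(*)                         AS sessions,
       array_agg(query_id)              AS query_ids
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
GROUP BY 1, 2, 3;
```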
And then we encode everything.
But how?
The first thought was about JSON or JSONB.
Then I proposed correlated arrays, and benchmarks proved that in this case they're better than JSONB.
Two arrays, two columns.
And then I proposed a single encoded array, and benchmarks proved it's better still.
So we encode the wait event and the number of sessions. The wait event, for example
LWLock / LockManager, our favorite, right?
And then a number: how many active sessions are present, for example, five.
And then five query IDs.
So seven numbers, and then the next wait event, and so on.
So we have a bunch of numbers.
I wanted to use two byte integers because it should be enough.
So we had also, we created two dictionaries, one to encode all weight events.
Actually, if you write every time if you write LW log manager, it's a lot of bytes.
Why?
We can encode it and to support, as in recent couple of major versions, there is a PG weight event, weight events dictionary, right, in PostGGGGS.
Yeah.
Which is just a static list of them.
But I wanted to support older versions (this thing supports Postgres 14+), so that's why we created our own dictionary, to backport that.
Yeah. One thing I didn't understand: you mentioned not using eight-byte
integers. Does that include query IDs? Because query IDs are eight bytes by default, right? Yeah, so query IDs are
eight bytes, and that's why we have two dictionaries: one to encode wait events and a second
to encode query IDs, to map them. Interesting. That's it. And yeah, we also... I don't
remember exactly, but we somehow solved the problem that the dictionary could
grow without limits. We solved this. So it cannot grow without limits. We keep it short.
Ah, I know how we solved it. I remember, because I proposed this solution. By the way, I forgot to say: I created it using three AIs plus me. So a team of four worked on it.
One AI was the engineer, one was focusing on benchmarks, and another was focusing just on quality, documentation, and reviews.
The benchmark engineer also focused on reviews,
but with a specific goal:
storage efficiency and performance,
low observer effect,
and so on.
And we worked many iterations,
like crazy.
And I did it all.
The whole thing was coded from Telegram,
so I didn't touch anything.
But you reviewed it, right?
I reviewed what?
I didn't review the code.
Really?
I reviewed only the results of benchmarks.
I cross-checked between the two AI engineers.
I cross-checked results.
Like, they work.
I trust this code, actually.
It works in my production already.
Yeah.
Why should I look at this code if I know that it was very thoroughly tested and benchmarked
and reviewed many times by AI?
and reviewed many times by AI?
Well, yeah, that's the worry for me.
But, yeah, anyway, it's interesting.
I understand your question.
But in this case, I think it's quite solid.
I actually checked that the code follows my style.
Wait, you just said you didn't check the code.
I checked pieces of the code, just to be sure.
Before I decided not to look into the code,
I made sure it was written according to our style guide.
And I like what I see.
So the style is correct: how we write PL/pgSQL.
I have mine very well established already,
because PL/pgSQL has been the most popular language for me for the last 10 years.
I wrote a lot in it myself.
So when I had made sure it was producing good code, I didn't see the final version of everything,
but I know it was thoroughly tested in many aspects.
Yeah.
So, back to storage.
This is how we encode, and how we solved the unbounded growth of the query ID dictionary.
My idea, actually.
I'm excited, because you can engineer solutions, thinking about algorithms and data structures,
and you don't need to code everything.
AI is coding. So I decided to have three tables, one per day, that's it. And the same
rotation: we have a dictionary for today, we have a dictionary for tomorrow. The AI raised a doubt,
because a query ID might not match, but we don't care: if the same query ID receives a different
encoded ID tomorrow, it doesn't matter, because we don't work with this data directly. We work
with it through interface functions, which are exposed to the
user, and the user doesn't see the encoded numbers at all.
Yeah, and once you look them up, it's the same query ID anyway.
Okay, yeah, that makes sense.
Great.
And it cannot grow unbounded anymore, right?
Yeah.
Yeah.
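A sketch of such a daily dictionary, with hypothetical names: an 8-byte `queryid` maps to a 2-byte code, and because the dictionary tables rotate just like the sample tables, truncating with `RESTART IDENTITY` restarts the codes each day.

```sql
-- Per-day query-ID dictionary: 8-byte queryid -> 2-byte code.
CREATE TABLE ash_qid_dict_a (
  code    smallint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  queryid bigint NOT NULL UNIQUE
);

-- Rotation: the dictionary entering the "tomorrow" role starts empty,
-- and code numbering restarts from 1 -- bounded growth by construction.
TRUNCATE ash_qid_dict_a RESTART IDENTITY;
```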
And I guess for the roll-up approach, we will also need a separate dictionary.
That's it.
Which might be bigger. But why? Because I want to have history for one year at least, or maybe half a year.
Good. Okay, fine.
But it will already be compact,
compacted, and not all query IDs will go there, only the most important ones, which
are most popular or participating in spikes. Yeah. So, to make sure I've understood: we've got three
main tables, only two of which will contain data at any time, a day's worth of data each, like all of
yesterday's data. Per second. And then, yeah, and then today's data until now. So it's like
one and a bit days of data at any point in time.
And that's fine, because the main point of this is: something was slow recently.
Like, we had an incident 10 minutes ago, or an hour ago, or over the weekend.
In fact, over the weekend is interesting.
But there was something fairly recent we want to look into.
And normally that's within a couple of days.
Yeah, I think maybe we should allow configuring how many days of raw data are stored.
Anyway, I know this approach with three partitions in rotation is from PgQ, as I said.
You cannot see the day before yesterday with it, until we implement the roll-up approach for longer-term storage.
And I forgot to mention how much data it is.
Per one row, if there are five backends active, it's roughly 100 bytes only.
I also forgot to mention that, knowing about alignment padding,
I decided to also encode timestamps.
So it's 4 bytes instead of 8:
a 4-byte Unix timestamp, but shifted.
It starts at January 1st of this year.
So it will be enough almost until the end of the century.
And the database ID is 4 bytes.
It was already:
it's an OID, so 4 bytes.
8 bytes, plus the encoded data: roughly 100 bytes
if you have five active sessions.
And, for example, if you have a more loaded machine,
say, 50 active sessions on average,
it will produce only 30 to 50 megabytes per day,
which is acceptable for us.
Absolutely.
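The arithmetic behind that estimate, as a quick sanity check (the roughly 100-byte and 500-byte row sizes are the episode's ballpark figures, not measured values):

```sql
-- 86,400 rows per day per active database; row size grows with session count.
-- ~100 bytes/row at ~5 active sessions, ~500 bytes/row at ~50 sessions.
SELECT pg_size_pretty(86400 * 100::bigint) AS per_day_5_sessions,   -- ~8 MB
       pg_size_pretty(86400 * 500::bigint) AS per_day_50_sessions;  -- ~41 MB
```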
I also forgot to mention that you cannot use it on replicas.
This is a downside.
We forgot to mention that in the README, but I thought about it before we implemented it.
It's obvious, because we need to write.
But it's fine, because
we observe more and more even bigger clusters coming to our consulting which don't have replicas.
Or only HA replicas.
Yeah, this is a good point.
Actually, yes: they recognize HA is needed, but they are okay to live with one node.
Somehow, for quite a long time,
they live with it.
It's so unusual.
I still think the classic three-node setup is much better.
But reality is showing a different picture, because cloud resources, as we discussed multiple times,
are so reliable already that people are okay with it.
And also Postgres performance:
you don't need read replicas for quite some time, and they add complexity.
So that's why I thought: okay, this is only for the primary.
It's enough.
And I'm satisfied with results like 50 megabytes per day for a loaded cluster.
It's great.
And I'm thinking maybe we'll keep seven days of raw data, or maybe 14 days of
raw data.
Yeah, I could easily see that.
At least the weekend thing makes sense to me.
You come in on a Monday, and it turns out there was some blip on Saturday.
It would be a shame if that data's gone.
I agree.
Yeah.
Yeah.
And we have a bunch of functions which allow you to see what's happening:
give me an overview of the last few hours, for example, or the previous day.
Which top wait events happened, and which query IDs are participating there?
Yeah.
And it joins with pg_stat_statements, if it's available, to present some high-level,
macro-level stats for each query ID which is participating in some wait events.
We also implemented visualization with bars.
It can be a monochrome visualization, or even colorful in psql if you do some trick.
And I enjoy it.
You see basically Performance Insights right inside psql.
It's quite fun.
And you can, for example, overview the last 24 hours, then zoom in to a specific area.
You can understand which queries are involved.
And of course it requires some typing work.
But since it's organized quite straightforwardly, you can let your LLM do it,
if you connect using some read-only user role.
Yeah.
Let it troubleshoot: it can quickly find the bottlenecks.
Is it an I/O-bound workload,
so you need to increase disks or give more memory?
Or is there heavyweight lock contention,
and which queries participate?
It's quite good for this kind of troubleshooting.
Yeah, definitely macro-level stuff.
But I saw there were even some micro-level abilities as well,
like looking at a specific query's waits by its query ID.
Yeah. For example, for a specific query ID,
you can ask which wait events it had.
It mimics a little bit
what we have in full-fledged monitoring,
because we have all of this there.
And here, just in textual form,
you can do the same kind of thing:
for this query ID,
what were the wait events during the last hour, or the previous day?
And for this wait event type,
or a specific wait event,
what are the top five query IDs
and their macro-level characteristics?
It's still macro level, actually,
because a lot of aggregation is happening here.
But you can jump to micro level, a specific query ID, and see plans and start working with this.
So there is a bridge here.
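Usage might look something like this. The function names below are purely illustrative, since the actual pg_ash interface isn't spelled out in the episode:

```sql
-- Overview of the last 24 hours: top wait events and participating query IDs
-- (pg_ash.overview and pg_ash.query_waits are hypothetical names).
SELECT * FROM pg_ash.overview(now() - interval '24 hours', now());

-- Drill down: wait-event profile for one query ID over the last hour.
SELECT * FROM pg_ash.query_waits(-870652361914627020, now() - interval '1 hour');
```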
Right.
Yeah, there is a bridge.
So anyway, it's early days.
I'm releasing it this week.
Let's see how it works.
Is it useful or not?
I just wanted to add that this thing feels to me as... oh yeah, one more thing.
So this is poor man's monitoring, self-monitoring, right?
I don't think it is, but yeah, sure.
It's quite straightforward.
This trick with pg_cron... I forgot to mention, by the way, that if we sample every second,
pg_cron by default writes logs to a table.
Oh, yeah, yeah.
Yeah, so I think pg_cron can be improved here.
For example, I would turn logging off for this specific job.
I don't need logging here.
I just need to see errors, and I don't need errors to be stored with full ACID guarantees.
The Postgres log itself would be enough for me to troubleshoot what's happening.
Anyway, there's potential for pg_cron to be slightly improved for these high-frequency
jobs.
But what I wanted to say: this is a self-contained, self-monitoring system, right?
So even without an LLM, you can write some analysis inside Postgres, so it could produce
some reports and analyze what's happening.
Maybe it's time to add more memory
or increase shared_buffers, for example,
because we see a lot of IO / DataFileRead, or something, right?
Or maybe we see a lot of time spent on heavyweight locks,
so we need to signal that the workload must be redesigned
to avoid contention on heavyweight locks, right?
But there is a trade-off:
should we monitor inside or outside?
And I think there are pros and cons.
We discussed it already, right?
We did.
Yeah, that's...
I had one last question, though.
I think, by the way, the reason I disagree is:
I think it's actually just another tool, right?
It's another tool in the tool belt.
If you self-manage, or use a provider that doesn't have wait sampling,
this is a valuable tool in its own right,
regardless of monitoring.
Especially when you mentioned the overhead from monitoring externally:
you'd only get, let's say, 15-second frequency, or something of that order of magnitude.
Getting one-second frequency is an upgrade,
not a poor man's solution.
Yeah.
It's not 10 milliseconds, though.
No, true.
Good point. So the last thing I wanted to ask, because you always ask our guests this: on the license
front, you've chosen a really permissive license, Apache 2.0. Why?
Why not? It can be embeddable into some stuff. And I even think, honestly (let me be transparent here), it will
contribute to my business, because if we have this everywhere, it's easier for us to explain which
problems exist and what should be done about them, and to improve health and support people.
Yeah, yeah.
So right now we are in a not-quite-good position, because on Cloud SQL (and this is one of the not-so-many
but quite great things they have) there is pg_wait_sampling. Super great.
But RDS has their own proprietary tool, Performance Insights, which is great if you're
only on AWS.
But if you think about interoperability, how to say it, you want to be able to work with
any Postgres. Performance Insights (Database Insights, as I call it right now) is some proprietary
API, and it's not transparent. And it's also an external dependency. I always think: okay, how good is it?
Is it trustworthy? Maybe there are also bugs. I don't see the code, right? I cannot know.
Maybe there's something there. RDS quality is great, let's agree. The quality is great.
But still, sometimes you doubt how exactly it works. This is fully transparent,
works everywhere, right? And that's it.
Yeah, nice. All right.
Thanks so much,
Nikolay.
Yeah, thank you.
So yeah,
everyone is welcome to try and
contribute,
fork, anything.
Yeah, I will.
I'll stick it in the show notes,
obviously.
Make it nice and easy.
Thank you.
Yeah, good.
All right,
take care.
Catch you in a bit.
Bye.
