Postgres FM - Monitoring from Postgres AI
Episode Date: July 25, 2025
Nikolay talks to Michael about Postgres AI's new monitoring tool — what it is, how it's different to other tools, and some of the thinking behind it.
Here are some links to things they mentioned:
postgres_ai monitoring https://gitlab.com/postgres-ai/postgres_ai
DB Lab 4.0 announcement https://github.com/postgres-ai/database-lab-engine/releases/tag/v4.0.0
pganalyze https://pganalyze.com
postgres-checkup https://gitlab.com/postgres-ai/postgres-checkup
Percona Monitoring and Management (PMM) https://github.com/percona/pmm
pgwatch https://github.com/cybertec-postgresql/pgwatch
pgwatch Postgres AI Edition https://gitlab.com/postgres-ai/pgwatch2
libpg_query https://github.com/pganalyze/libpg_query
The Four Golden Signals https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals
logerrors https://github.com/munakoiso/logerrors
~~~
What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!
~~~
Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai
With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard, and I'm joined as usual by Nikolay, founder of Postgres.AI.
Hey, Nikolay, how are you doing?
Hi, Michael.
I'm doing great.
How are you?
Good.
Thank you.
And this week is launch week, or was, maybe, if people are listening in the future.
Yeah.
I can tell you, I can assure you, people are listening to us in the future because we are not live. Yeah, of course. So yes, so launch week for your company. Your first, can I say? Yeah, for the first time. I'm excited. Yeah, so we decided
enough observing how others do it, first of all Supabase. They just had a launch week before this, I think number 15 already, so many. And since our team has grown and we had stuff accumulated to be released, instead of releasing it earlier, I said, let's just accumulate a little bit and release it in one week.
It's like a sprint, five days in a row.
And we actually, we are releasing more. I hope, by the time people are listening, we have released more than five things.
So we also have the concept of extra bonus releases.
And we started on Monday,
we started with DBLab version four,
which actually took two years to develop.
Yeah, that's a long time.
Yeah, finally we polished it
and many users already installed it.
I mean, release candidates.
We needed to have five release candidates.
That's a lot of polish.
That's more than Postgres.
Yeah, well, because a lot of new problems appeared because we changed a lot there.
Yeah, but it's another story.
Today we talk about our second big release,
which is called Postgres AI Monitoring.
Yeah, so this is one you're doing,
so we're recording on Monday.
This is one you're doing tomorrow.
So what is it?
What are you up to?
Yes, we are doing this tomorrow.
Everything is ready already.
So I'm pretty confident we are doing this.
Yeah, and only a few people know that we are releasing this kind of monitoring.
And it's called Postgres AI monitoring, but right now there is little AI there. And this is intentional. Let me explain this first, because of course we had a short chat before recording,
and you noticed this like discrepancy, right?
So we use AI a lot, like every day.
It's, for example, in my team,
it's prohibited to do anything without AI first.
It's prohibited to write code, yeah.
It's already so.
Well, you can write code,
but only if your AI tools failed.
And first, people were resistant to this.
I would be. Yeah. Yeah. Some people said, like, what the heck, I don't want it, it's bad, it's poor quality. But in my opinion, the Anthropic release of Claude 4 a couple of months ago was a game-changer. And quality is already
very acceptable in many cases. Not always, sometimes, especially if you are expert in
some narrow topic, you understand problems which were never discussed publicly. In this case, of course, you will see AI will be wrong 100%.
And we have these cases.
For example, a perfect example is our guests last week.
They experienced some problem which was not covered by public discussions.
So obviously in such cases, if you just apply AI in a straightforward way, it will hallucinate
some bullshit.
But if you apply AI to brainstorm some ideas and ground it, pre-ground and post-ground it, with reliable materials like source code, documentation, information from reliable sources, and with experiments for post-grounding.
In this case, it's different.
So you can explore behavior, reproduce it, and this is the path we chose.
Not just take this problem, give it to AI, and have it create, I don't know, change requests somewhere.
This won't work.
An ungrounded approach won't work.
So this is one point. Another point is that for six, seven years I kept saying I would avoid building a monitoring system.
For example, when I met with Lukas Fittl in San Francisco one day a few years ago, I told him this, like, you're building a great monitoring system, pganalyze.
We built some checkups because we need, for our customers, to find problems and then to start exploring them, experimenting. The key strength of the Postgres AI team is experiments.
That's why we built DBLab, to simplify specific types of experiments and benchmarks and so on.
And I mentioned checkups, and somehow pganalyze has had checkup sections since then, but it's a different story.
And I said, I'm not going to build a monitoring system, but I was pushed somehow to this point where it's inevitable that we need to do it, because no other system, and not only open source systems, no other system is providing the abilities we need.
We need several specific abilities with our clients, we always show it.
And some of them is for troubleshooting.
So like very high level analysis of Postgres components, and then you
can understand where you can dive deeper.
And actually, I remember great talks from Percona, from Zaitsev, and so on.
And I saw some bits of good stuff here and there, like Percona Monitoring and Management, or how is it called?
PMM, yeah.
Yeah, yeah, yeah.
And things like the USE method and the four golden signals, approaches, methodologies, some great pieces there.
But then you see something is really missing.
Like, for example, proper query analysis.
We talked about it so many times.
Comprehensive analysis using pg_stat_statements, pg_stat_kcache, pg_wait_sampling, top-down, single query, with query texts, with proper characteristics, and we didn't see any system like this.
And a few years ago we started maintaining the so-called pgwatch2 Postgres AI Edition, which was a fork of pgwatch2 with our metrics and dashboards.
That's it.
But then Pavlo Golub decided to rewrite pgwatch, right?
So we started lagging in terms of this vehicle we used, pgwatch itself.
So time has come and we realized, okay, and we experimented a lot.
We actually started with InfluxDB.
It was insane.
We eliminated it.
We used TimescaleDB.
And then we had a big thinking process.
Should we stay with pure SQL and TimescaleDB, or should we use PromQL?
And we decided to use PromQL, but with some nuances.
We still have Postgres inside.
So we have both Prometheus and Postgres.
So the idea was we need to build a new monitoring system, which will be part of our process.
And in consulting, we had maybe 30 clients over the last year and a half. We had a few dozen clients, right?
And for all of them, we always installed our pgwatch2 Postgres AI Edition, and also used the postgres-checkup tool, an old tool we always used to generate reports in static form, like Markdown or HTML, PDF. And it was not super convenient. So this time we said, okay, we finally do monitoring, but the main part
of it will be open source.
So monitoring itself is open source, but it's a part of bigger, bigger picture
where we perform health checks and help clients proactively using AI, of course.
Right.
So I'm going to ask you the question I always ask.
When you say open source, what do you mean?
What license?
The license is Apache 2.0.
As I remember, pgwatch is under a BSD or Postgres license.
One of these two, but our dashboards are Apache 2.0, so anyone can use them. The majority of that is...
So we couldn't use a so-called traditional Postgres exporter for Prometheus,
because it's like so much stuff and it's really slow to put pieces there. So basically, we have
our own exporter inside, I mean metrics and
dashboards, right? And it's Prometheus. Why Prometheus? Because we just see that in many cases,
well, several reasons.
One of the key reasons is that we had
already two, actually three companies where we,
in consulting mode, we contributed a lot implementing our
approach to their own monitoring system.
So this is why we have confidence this monitoring system is great.
And usually bigger companies, they have some stack.
Some people use Datadog, but you know, if you have a hundred terabytes of data or say a thousand clusters, like microservices, Datadog is super expensive.
Some people still keep paying.
But it's not only expensive.
It has blind spots in terms of Postgres monitoring.
A lot of them. For example, pg_stat_statements: they have some, not all metrics.
And for essential metrics, they, for example, can give you per call but not per second,
or per second but not per call.
There is no completeness there.
If you check carefully, so they differentiate in one dimension but not another.
For another metric, it's vice versa.
So it lacks many things.
And also how dashboards are organized.
Oh, I don't understand why people put tuple stats on the front page.
It's not the main metric to observe.
Not the main metric.
It's not the number one thing you want to see when you have an incident and need to troubleshoot.
You know? And people
don't put queries per second because Postgres doesn't have queries per second.
But we can extract it from pg_stat_statements, with some asterisk that it's approximate, because some things can be lost. Sometimes we see queries per second lower than transactions per second, but that cannot
be so.
Right?
Yeah.
In fact, that's because we don't see the full picture, but in most cases it's vice versa.
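A minimal sketch of this approximation, assuming the pg_stat_statements extension is installed (the snapshot-diff workflow here is our illustration, not the tool's exact queries):

    -- Approximate QPS: sum of calls across tracked statements (undercounts
    -- anything that was evicted because of pg_stat_statements.max).
    SELECT now() AS captured_at, sum(calls) AS total_calls
    FROM pg_stat_statements;

    -- TPS from the cumulative commit/rollback counters:
    SELECT now() AS captured_at,
           sum(xact_commit + xact_rollback) AS total_xacts
    FROM pg_stat_database;

    -- Run each query twice, some seconds apart, then:
    --   QPS ~= (total_calls_2 - total_calls_1) / elapsed_seconds
    --   TPS ~= (total_xacts_2 - total_xacts_1) / elapsed_seconds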
So this monitoring is fully open source, and it can be used in pieces if your organization already has something Prometheus-based.
Which, let's say, yeah, makes sense, but lacks a systematic approach for dashboards.
This is where we can, I mean, you can take this and improve, take pieces, maybe contribute
back.
Also, also good idea to contribute back always.
Yeah. And I still think, although I personally invested, like, more than five years already into this. The first few years, I wanted to say, no, no, I'm not going to do it. But then I was forced, well, life sometimes forces you to do something, right? And now, like, I surrendered to this idea, like, okay, I'm doing this, right?
And yeah, and not only I, we have really super experienced members in our team,
who also spent a lot of time thinking and contributing to this. Like, it's not only my own,
my own effort, right? So we have several very good team members
who contributed a lot.
So we decided to use PromQL also because of this, because it's more compatible with enterprise systems.
Yeah, you mentioned using Prometheus.
But also Prometheus.
But also Postgres, because in Prometheus,
there is a problem with query texts.
In query analysis, you do want it, you have pg_stat_statements, but in Prometheus, it's not normal to get the query text and store it. It's very inefficient.
So you need secondary storage, and we chose Postgres, obviously.
Basically, we have two data stores. The main one is Prometheus, and the second one is...
Actually, we are debating right now.
We probably will... Instead of Prometheus, it will be Mimir.
We also thought about VictoriaMetrics, so it's still...
In the next few releases, we will switch to something else.
And for query texts, it's Postgres, but it's a small one, because by default, pg_stat_statements for each node keeps 5,000 statements.
And do we track all of them?
I need to check.
Maybe not.
So, one of the things we noticed, we had cases with our previous monitoring, that sometimes
people have millions of objects in the database.
And this really requires optimization.
And we made this optimization for really large cases.
I remember our checkup actually also had cases with 100,000 indexes.
In the case of monitoring, pgwatch2, it was the previous version, basically, right? Our Postgres AI edition, we had cases with more than one million indexes and hundreds of thousands of tables. Bloat analysis was not working. But, like, it's
and this is interesting and separate question, how to perform index maintenance in such situation,
for example. But it's also, it's also a weakness of pg_stat_statements, right?
Because often, well, I don't know if this is the case always, but often when you see that number of objects, it's because somebody's doing per-tenant, but by schema, so a lot of the tables are the same dozen or several dozen tables and indexes. And then pg_stat_statements kind of falls flat on its face because it sees the objects as unique and it tracks them all individually. So you can't look at...
It's flooded.
Yeah. And 5,000 is not enough.
It's not aggregating properly.
Exactly. Yeah.
Yeah. We just, we discussed this recently also.
There is a problem with temporary tables, for example, sometimes.
Yeah, it's like the same schema, same table, but somehow you end up having two different queries in pg_stat_statements.
I would personally prefer some control how aggregation is done.
Right.
There is the compute query ID mechanism, right? The compute_query_id mechanism.
So basically it would be great to define a rule, for example: ignore namespace, ignore schema name, right?
Yeah, wow.
Yeah, interesting idea. Yeah, maybe.
Because the object, the other table names are probably, for most people,
the table names would stay the same. It's the schema name that would change. Yeah.
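As an illustration of why this matters (hypothetical schema and table names, assuming pg_stat_statements is installed and compute_query_id is on or auto): identical statements against same-named tables in different schemas get separate entries, because jumbling is based on the resolved objects, not the text.

    CREATE SCHEMA tenant_a;
    CREATE SCHEMA tenant_b;
    CREATE TABLE tenant_a.orders (id int);
    CREATE TABLE tenant_b.orders (id int);

    SELECT count(*) FROM tenant_a.orders;
    SELECT count(*) FROM tenant_b.orders;

    -- Two entries, one per schema, even though the query text differs only
    -- in the schema name:
    SELECT queryid, calls, query
    FROM pg_stat_statements
    WHERE query ILIKE '%orders%';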
When analyzing logs, I still prefer... I know, like, since Postgres 16 we have query ID in the logs. We need to enable it for auto_explain as well. But I still often prefer Lukas Fittl's, already mentioning him for the second time, his great library, libpg_query, and to compute fingerprints.
I know it works purely based on syntax, no OIDs from pg_class or something. So I know how it will aggregate.
And it's mostly often, like, often it's enough.
Do you know how it can handle that?
The different schemas, same object name?
Well, no, I think it distinguishes them.
If the query has fully qualified names involving the schema name, I think it will produce different fingerprints.
But I guess it's easier to tweak it than to try to change the pg_stat_statements approach and so on.
I don't know, I just feel, yeah, it would be great to have some more flexible, some default part, but some exotic paths to tune it, how we aggregate, right?
Because otherwise you end up, like, where you have 5,000 slots in pg_stat_statements by default, and then maybe five different queries, each one repeated 1,000 times, right?
But just with some differences.
You know, the IN-list problem, I don't know if it was fixed already.
I know, for example, libpg_query, regardless of the number of members in the IN list, will produce the same fingerprint value.
Nice.
Okay.
But I don't know if pg_stat_statements got fixed, but yeah, you're right.
If there's, like, a difference in having one item in the IN list versus two versus three, they were all considered unique.
Yeah.
Yeah.
Some people implemented some regexps to convert it before calculating something. Usually you want, if you have an IN list, regardless of the number of members in the list, it should be considered the same.
Semantically it's the same, two members, three members, it doesn't matter.
But pg_stat_statements, that was, until... like, I thought I saw some discussion. My memory really tricks me sometimes.
Maybe it was fixed.
There's definitely been discussions.
I just don't know if there's any commit.
My gut feeling is it might be in a very recent version, but I'm not a hundred percent sure without checking. But, like, for example, ANY: if you transform it into a query with ANY, you're going to get the same entry in pg_stat_statements every time.
So why should it be different with IN lists?
Like, it doesn't make sense to... yeah.
Although they could have different plans, like at a certain point they could flip the plan, but that's true of any query.
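A small sketch of the workaround being described (hypothetical table name; whether plain IN lists get merged depends on your Postgres version, so treat the first part as the historical behavior):

    CREATE TABLE items (id int PRIMARY KEY);

    -- IN lists of different lengths have historically produced separate
    -- pg_stat_statements entries:
    SELECT * FROM items WHERE id IN (1, 2);
    SELECT * FROM items WHERE id IN (1, 2, 3);

    -- Passing one array keeps a single entry regardless of how many values
    -- are inside, because the whole array is one constant/parameter:
    SELECT * FROM items WHERE id = ANY('{1,2}'::int[]);
    SELECT * FROM items WHERE id = ANY('{1,2,3}'::int[]);

    SELECT queryid, calls, query
    FROM pg_stat_statements
    WHERE query ILIKE '%FROM items%';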
How do they work with NOT IN and NULL, right?
Remember this danger, right?
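For anyone who hasn't been bitten by it, a quick sketch of that NOT IN danger (illustrative values only): a NULL anywhere in the list makes NOT IN return no rows, while the NOT EXISTS form behaves as people usually intend.

    -- 42 <> NULL evaluates to NULL, so the whole NOT IN is never true:
    SELECT 1 WHERE 42 NOT IN (1, 2, NULL);   -- returns no rows

    -- The same intent expressed with NOT EXISTS is NULL-safe:
    SELECT 1 WHERE NOT EXISTS (
      SELECT 1 FROM (VALUES (1), (2), (NULL)) AS v(x) WHERE v.x = 42
    );                                        -- returns one row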
So, yeah, so yeah, let's discuss.
So, so yeah, open source, Apache 2.0, open source components, pgwatch version 3 as, like, the engine for this.
So we just built on that.
pgwatch.
pgwatch 3, version 3, which was written by Pavlo Golub, CYBERTEC.
Great work.
A lot of optimizations, a lot of things like how we control, they call it sinks, right?
Like how data streams are controlled and so on.
Yeah. And what else?
Like very fresh Grafana, obviously, right? Yeah. And we have several interesting tricks there. So where do these dashboards shine? If we talk about only dashboards, as I said, it's a part of a bigger vision, but just talking about dashboards and manual use of them.
First dashboard is the troubleshooting dashboard.
You have just one minute and you need to understand where to dig further.
We use four golden signals there.
So we have throughput, latencies, errors.
What else?
Saturation, right?
I didn't see that one.
Dashboard, OK, dashboards.
Dashboard number one.
Active session history?
Yeah.
No, active session history, this is where it starts.
But it's a single node performance overview, number one.
Active session history, in our opinion, we became convinced after many years of observing how RDS does it, others are also convinced: wait event analysis is the king in monitoring.
If you want one single graph or how to say chart, to understand quickly performance and
what's wrong, this is what you need to look at. That's why we put it on the very first place
on the very first dashboard.
But it's only very high level, like 30,000 feet view.
So only wait event types, not wait events.
No query IDs, not particular wait events,
only wait event types.
So it's very, like, is it CPU or IO or something?
And CPU we mark with an asterisk, because it's not only CPU. We know sometimes some parts of the code are still not covered by wait events. So technically it can be CPU or an unknown wait event. And CPU is green. As everywhere. As in Performance Insights, PASH Viewer, anywhere.
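For context, a minimal sampling query in this spirit (our sketch, not necessarily how the postgres_ai exporter collects it), with a NULL wait_event_type folded into the "CPU or unknown" bucket mentioned above:

    -- One sample of active client backends grouped by wait event type.
    -- A collector would run this roughly every second and store the counts.
    SELECT coalesce(wait_event_type, 'CPU*') AS wait_event_type,
           count(*)                          AS active_sessions
    FROM pg_stat_activity
    WHERE state = 'active'
      AND backend_type = 'client backend'
      AND pid <> pg_backend_pid()
    GROUP BY 1
    ORDER BY 2 DESC;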
Yeah, so then we have very high level pieces to understand the workload and various Postgres components.
And we follow the four golden signals there.
So latency, traffic, saturation... traffic is throughput, right? Saturation, errors.
Latencies we extract from pg_stat_statements.
I think it's a great idea actually to have it in pg_stat_database. Probably there are reasons why it's still not there.
You can extract it if you have PgBouncer.
Sometimes I just check PgBouncer logs manually to double check that I understand latency
properly.
Yeah, but yeah, from pg_stat_statements, good enough.
Of course, again, if you have a situation like we just discussed, and you see only the tip of the iceberg in those pg_stat_statements.max entries of pg_stat_statements, 5,000 by default, then it will be a wrong value.
So I think, yeah, we understand this, right?
So it's like best effort.
Kind of an edge case as well, right?
Yeah.
It's not many people that have hundreds of thousands of objects.
Yeah.
Yeah.
And throughput, we show it in various ways, like calls per second, transactions per second.
So these are the key ones.
And we didn't put tuples per second, as you can find in Datadog and some other places.
Maybe we should, but I'm not convinced this super low level throughput metric should be on the very front. Maybe, actually, because I'm contradicting myself. I'm always discussing logical replication. I say on modern Intel and AMD processors, a single slot, a single publication and subscription, can saturate the subscriber side, a single-threaded process, at levels like 1,000, 2,000, 3,000 tuples per second, depending on various factors.
So maybe tuples per second for writes,
maybe you should put it actually, good idea maybe.
But this is very low level throughput.
So in a database you can understand throughput at various levels. Transactions, queries, tuples next. Is tuples next, or maybe rows? Volume? Rows and tuples are different. Well, buffers. Actually, this is what we do have. We have hits and reads and writes, dirtied buffers.
All those buffers, we put it right on the front page,
but only high level.
We don't distinguish by query ID.
This is the job for different dashboard.
So you quickly can understand what's happening here.
And of course, we translate it to bytes, right?
Yeah.
So how many bytes per second of buffer reads is your buffer pool doing right now? And did we have spikes, right?
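Roughly, the conversion works like this (a sketch with the default 8 KiB block size assumed, not the dashboard's exact metric definitions):

    -- Cluster-wide buffer traffic from pg_stat_statements, in bytes.
    -- Diff two snapshots and divide by elapsed seconds for bytes/second.
    SELECT now() AS captured_at,
           sum(shared_blks_hit)     * 8192 AS hit_bytes,
           sum(shared_blks_read)    * 8192 AS read_bytes,
           sum(shared_blks_dirtied) * 8192 AS dirtied_bytes,
           sum(shared_blks_written) * 8192 AS written_bytes
    FROM pg_stat_statements;
    -- current_setting('block_size')::int can replace 8192 if the cluster
    -- was built with a non-default block size.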
So yeah. And actually, yeah, errors. So if there are errors, we will see it in the transaction analysis, the number of rollbacks. We currently don't work with logs, but it's on the roadmap, of course, to extract errors from logs, aggregate them, and so on.
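Until then, one simple proxy (our sketch, not necessarily what the dashboards query) is the cumulative rollback counters in pg_stat_database:

    -- Rollbacks per database; diff two snapshots for rollbacks per second.
    SELECT datname,
           xact_commit,
           xact_rollback,
           round(100.0 * xact_rollback
                 / nullif(xact_commit + xact_rollback, 0), 2) AS rollback_pct
    FROM pg_stat_database
    WHERE datname IS NOT NULL
    ORDER BY xact_rollback DESC;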
I still have hope that the tiny but very useful extension called logerrors will one day make it into the engine and we will have some pg_stat_errors or something.
pg_stat_errors, actually.
pg_stat_errors.
It would be great to have each error code and how many times it occurred, right? A basic cumulative metric, and we could start observing them in a normal way, exporting and then, yeah, alerting. Speaking of alerts, it's also on the roadmap, definitely.
Definitely.
Yeah. We will have, we have a very careful approach, because talking to different
customers, we see very different expectations, very like opposite sometimes.
So, yeah, I will discuss that.
Yeah.
In the past.
Well, I think it's interesting.
I think alerting is fascinating and like balancing risks of false positives
versus risks of false negatives.
Like I think it's a really hard problem.
It's a hard problem.
Configuration, like, people always just end up making everything configurable, and then picking defaults is hard. Anyway, I'm not envious of people that make monitoring tools and are trying to make it simpler.
I think we will avoid the single default set, a set of defaults, and we will just give some, like, basically, a questionnaire.
What matters for you more, more like this or that?
And based on that, we adjust the setting right away.
Like when people first install it, there's like the old school wizards that you go through
and answer a bunch of questions and then...
Interesting, yeah.
Old school Typeform, you know.
I'm joking, I'm joking.
Typeform is already old school.
Because I see, you know, like, dental clinics use Typeform already, you know.
It's not cool anymore.
It's old, ancient.
I like... actually, I love Typeform.
I love the fact that you can use only keyboard,
no mouse involved, navigate all the questions, answer them.
So anyway, yeah, alerts is a separate topic.
It's in roadmap, not right now.
We also paid very careful attention to the metrics we display.
We had many debates about this, really. And yeah, for example, we just talked about buffer reads per second. Okay, we translate them to bytes. So okay, megabytes, gigabytes. We use a lowercase i in the middle, MiB, GiB, to emphasize it's mebibytes, gibibytes. So if we display it on the Y axis, is it megabytes or gigabytes per second?
And I don't see consistency in various monitoring systems. So I managed to convince the team we should display megabytes per second. And to convince them I just said,
okay, actually, you know what? We have all these metrics differentiated by time, so if you display a raw metric, it will be constantly growing. It's a cumulative metric, it's just a number which always grows until you reset it. So since you differentiate it by time, it becomes, like, already spiky or some kind of normal behavior, right? So it should maybe be per second. But also, I said, notice we provide additional differentiation on the right side. Almost always you can find a chart which is displaying the same metric per call, average per call.
So flipping it: instead of per second, per call, per query, kind of.
So if it's, for example, execution time per call, it's the average execution part of the latency.
Right?
Yeah, but what does execution per second mean?
No, no, no.
Total...
That one you can't do.
Total exec time per second, you can do.
This is the key metric.
This is the central metric in our approach.
It's measured in seconds per second.
Seconds per second?
We discussed this.
Yes.
I'm glad we're repeating this topic.
This is key.
Yeah, me too.
I've clearly forgotten.
Seconds per second is a key topic, a key metric.
My favorite metric.
It shows you how many seconds database spends to execute your queries per each second.
And then do you have to divide it by cores to, like, understand... to, like, say... No. No. Wait, wait, wait. They're two different things. We have some metric, for example, the number of buffers, which is also always growing, across all queries. Here it's top level. We just summarize all of them, all 5,000, right? Or we have total exec time, right? These are two raw metrics.
They are always growing, growing, growing.
You can let Grafana, basically PromQL using irate, you can let it differentiate by time.
We don't use mean_exec_time, because we can have it from different places.
Or you can divide it by calls.
If you divide by calls, you have basically the same value as mean_exec_time, but it's not Postgres that calculated it, it's PromQL that calculated it, right?
But if you divide it by time, it's differentiated by time.
You will see how it changes over time.
Every second we spend, for example, 10 seconds: give us at least 10 cores for this workload.
That's very rough.
It isn't fair, because this time can include waiting on lock acquisition, right?
Or IO.
It's not true CPU user or system time or something.
But roughly, it gives you a good understanding.
What kind of, like, the scale of your workload. If it's 10, don't run it on a two-core machine.
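A minimal sketch of that "seconds per second" idea using two pg_stat_statements snapshots; the real dashboards do this with PromQL rate()/irate(), so this plain-SQL version is only for illustration (the column name total_exec_time assumes Postgres 13 or newer):

    -- Snapshot the two ever-growing raw counters...
    CREATE TEMP TABLE snap1 AS
      SELECT now() AS ts, sum(total_exec_time) AS total_ms, sum(calls) AS calls
      FROM pg_stat_statements;

    SELECT pg_sleep(60);

    CREATE TEMP TABLE snap2 AS
      SELECT now() AS ts, sum(total_exec_time) AS total_ms, sum(calls) AS calls
      FROM pg_stat_statements;

    -- ...then differentiate by time ("seconds of DB time per second", a rough
    -- lower bound on cores needed) and by calls (average latency per call).
    SELECT (s2.total_ms - s1.total_ms) / 1000.0
             / extract(epoch FROM s2.ts - s1.ts)   AS db_seconds_per_second,
           (s2.total_ms - s1.total_ms)
             / nullif(s2.calls - s1.calls, 0)      AS avg_latency_ms
    FROM snap1 s1, snap2 s2;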
Yeah. Well, more to the point, is it growing over time, or are we reaching some... you know, when are the spikes, that kind of thing? Yeah. Interesting.
And if you divide it by calls, you have the average... yeah. So that's why we say the metric, slash s, per second.
And if it's seconds per second, we transparently say s/s, seconds per second.
Why not?
Well, I just think it hurts my head even thinking about it. I think there's sometimes a case in products for being accurate and sometimes a case for being simple, or, like, easy to understand at first glance.
And it depends who you're building for, right?
Like, accuracy is extremely helpful, especially for deep experts.
They want to know precisely what happened all the time.
But anything like that could be distracting for folks who don't necessarily know what they're doing.
That's a great point.
Yeah, so, you've answered pretty much all the questions I had already, except for: who is your... I know you yourselves are your ideal user, but, like, who is the ideal user for this product?
Honestly, 99% ourselves. We do it for our own work, for consulting, but we see it as beneficial to publish it as open source, so first of all our clients could verify what we are collecting, right? Even Datadog publishes exporters, right? So it's fair. But for us, these dashboards, it's not the final product.
By the time we publish this release, it's already obvious.
We already included this in a workflow where metric collection is part of big step number one; it's just the first half of step number one.
Dashboards complete step number one.
And then the API is exporting dashboards in the form of static tables.
Actually, we didn't get to it, but our dashboard with query analysis includes not only graph-view charts, but also a table view, like our postgres-checkup had.
And table view is super beneficial
because you can compare for each query ID, you can
see many different metrics in one row.
And you can understand much better looking at various compact form of representation.
It's for experts, it's not for non-experts.
Like regular backend engineer needs some time to understand how to work with this.
It's possible, we did it, but it really needs some time.
But since I said this is big step number one, these dashboards and charts and tables and everything.
The second step is analysis by experts, and experts can be humans or LLMs. For efficient analysis, you need to transform raw observations; we call it transforming observations into conclusions. You have some transformations where it's quite simple, like we did with buffers. Okay, 1,000 buffers. Even for an expert it takes some time to understand whether it's big or not. But then we multiply it by 8 KiB, and we have 8,000 KiB, roughly 8 MiB.
Not exactly, but roughly.
And we say, oh, okay, this I can feel already, I know how long it takes to get it from disk,
or how much of memory it takes, I already feel it. And my point, my hypothesis in what we are doing is that both experts and LLMs work better
with this transformed raw data.
If you present it better... this is what we do in the second stage.
In the second stage, we transform very raw observations into conclusions, but we don't add thinking there. Maybe a little bit, you know, a little bit. Not a debatable type of thinking, quite straightforward logic, you know, like we can calculate a percentage, for example.
This query ID takes, like, 25% of buffer hits. It's very noticeable.
One fourth of the whole hits traffic, you know?
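That kind of percentage is mechanical to compute, something like this sketch (not the exact report query):

    -- Share of shared-buffer hits per query, to spot the "one fourth of all
    -- hit traffic" kind of outlier.
    SELECT queryid,
           shared_blks_hit,
           round(100.0 * shared_blks_hit
                 / nullif(sum(shared_blks_hit) OVER (), 0), 1) AS pct_of_all_hits,
           left(query, 60) AS query_preview
    FROM pg_stat_statements
    ORDER BY shared_blks_hit DESC
    LIMIT 10;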
So yeah, like it sounds like kind of arithmetic that's difficult to do on the fly accurately
when you're in the middle of an outage or something.
It includes everything like arithmetic, like formulas. It also can include some representation transformations,
like put these numbers closer to each other.
Just by putting this, you highlight
that they have some correlation or something.
Also, take specific charts.
This is already like methodology we develop over many years,
how we approach
analysis of various problems. But still, we don't put conclusions there, because we just observe. We just slightly transform what we see into a better form for consumption by experts, still, and LLMs. And the final thing is recommendations and understanding
problems. Not only recommendations, first understanding problems.
So we see problems and we already start thinking.
This is already sometimes if you take two DBAs, they have two different opinions.
What they see.
Here is already tricky part.
But our idea is, okay, we have brainstorm phase, collection of hypotheses. And then since we build very great
lab environments, we have thin clones, we have thick clones, we can have different tests,
and we can verify ideas on thin or thick clones, right? Or maybe in synthetic form: sometimes, just before you send a bug report or raise some discussion on hackers, you need to have a minimal viable representation of your problem in synthetic form, so as not to share personal data from production, right?
Reproducible test case.
Yeah, it should be minimal. It's also a special type of art in database operations to extract from production
some minimal stuff and inject it.
And here is where LLM can be very helpful to iterate much faster.
Yeah.
I find this difficult for blog posts as well, like coming up with an example that you can share. Anyway, yeah, it's an art, and very impressive, and takes way longer than it looks.
Yeah, well, at least it does for me. It should take less, and I hope AI... I don't hope, AI is already improving this, but I think we are on the edge of massive improvements in this area.
I really hope so. I've tried it for a few... I've tried using even Claude 4, with limited success.
Yeah, I know it. I know, some problems you just give up on. And you know, we spent the whole of last year experimenting so much, we had credits, we still have credits from Google Cloud, experimenting, yeah, yeah, Gemini.
Gemini excels in some parts of the work, but not in all.
And of course, Claude is also not the final answer to everything.
But I see, there is a substantial improvement with the Claude 4 release, substantial, significant.
So it gives me much more level of excitement of how we can move forward.
So anyway, this is our whole system.
And right now, almost everywhere it's a human, because we don't trust it yet.
And it will take a few years to start trusting more. Maybe LLMs should be... well, they haven't stalled, they are improving, right?
And the knowledge base is growing also, right? So our knowledge base is growing, which we use when we work with LLMs to improve the quality of answers. And yeah, so
this is how I see it as a component in the beginning.
It can be used directly if you're an expert, or you can spend time understanding some documents,
like to dive deeper to details, and then not becoming expert, but already to start understanding
how things work.
And then we transform it to form, which is easier to move faster, basically, not to jump
between various pieces.
For specific analysis, we compile multiple reports to a small report, which is for this
specific analysis.
For example, our checkup has a report called memory analysis.
Memory analysis: if you want to tune memory, not only do you need to tune shared_buffers, maintenance_work_mem, work_mem, max_connections, but also check what's happening with your number of autovacuum workers.
And also remember what changed in Postgres 17. Before Postgres 17, even if you raised maintenance_work_mem beyond 1 GB and put it at, like, 8 GB, thinking, okay, I'm going to create or recreate indexes in basically one session.
But autovacuum workers, you have them, like, you also raise autovacuum_max_workers, you raise it to 10, for example, and you think, okay, I know that... well, mostly people don't know, and I also didn't know until recently, but before 17, if autovacuum_work_mem is -1, it means it inherits from maintenance_work_mem, but even if you raised maintenance_work_mem beyond one gigabyte, autovacuum still could not use more than one gigabyte. This was changed in Postgres 17; it can use more than one gigabyte right now. So if you tuned it before and you then upgrade, you can suddenly run out of memory, if you performed autovacuum tuning the way we always say: raise the number of workers, give more memory to workers to better handle indexes. Now you upgrade to Postgres 17, and the mathematics changes.
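A back-of-the-envelope sketch of the kind of arithmetic such a report automates; the worst-case formula here is a simplified assumption, not the actual checkup logic:

    -- Settings that drive the rough worst-case memory picture.  Before
    -- Postgres 17, autovacuum's memory for dead-tuple storage was capped at
    -- 1 GB even if maintenance_work_mem / autovacuum_work_mem was set higher;
    -- from 17 on, the full value can be used, so identical settings can mean
    -- much more memory after an upgrade.
    SELECT current_setting('server_version_num')::int     AS version_num,
           current_setting('shared_buffers')              AS shared_buffers,
           current_setting('max_connections')::int        AS max_connections,
           pg_size_pretty(current_setting('max_connections')::int
                          * pg_size_bytes(current_setting('work_mem')))
                                                          AS worst_case_work_mem,
           current_setting('autovacuum_max_workers')::int AS autovacuum_max_workers,
           current_setting('maintenance_work_mem')        AS maintenance_work_mem,
           current_setting('autovacuum_work_mem')         AS autovacuum_work_mem;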
So we have a report which collects all the pieces and involves the server version and knowledge about these changes.
It requires some expert work.
And we should not live in the world where this work is done
manually. Right.
So it should be done like in automated fashion, collecting pieces, analyzing things, like not to miss some important parts of the knowledge.
Yeah.
So we'll have them there.
Definitely.
Well, not definitely.
Like you can get version in, well, you can get version.
I'm intrigued as to how much you'll end up using LLM on this journey because
things like the version information you can get directly from PostgreSQL.
You can run a query and find out what version it is.
You don't have to work that out or it's not something where like a language
model is going to be beneficial.
Yeah.
And there's some... like, I don't do monitoring tools, but giving performance advice is something you have to learn to try to get good at.
And I thought we would end up wanting to write bespoke advice in lots of different cases, but it's amazing how you can word things so that they cover different cases, or conditionally
show different pieces of the advice in different scenarios and piece together the full response
without risk of hallucinations.
I agree with you.
That's why in the first two steps we almost don't involve LLMs, because we don't want them to... like, I cannot let an LLM sum two numbers.
I'd better use a regular calculator, right?
Yes.
Exactly.
That's why... some things, like, there are some things previous generations of computing were fantastic at.
Yeah.
That's why jumping from observations to conclusions, we basically don't involve LLM.
We involve our methodologies we developed, how we analyze things and we just transform.
We don't need LLM there.
But then, when we have problems, first of all, we need to understand which problems look most critical.
This, I believe, can be done at least preliminarily.
LLM can highlight what's the most critical.
And then, most important, start hypothesizing, trying to understand the root cause, trying to prove it, building an experiment to prove it, and then how to solve it, building an experiment for how to solve it, proving that it works. Like, drive this very hard work of root cause analysis and finding mitigation.
You know, this can be done by LLMs
because it can search similar results
on our knowledge base and internet, right?
Find, like, ideas, and using thin and thick clones and the frameworks we built with Postgres AI, it can start experimenting to check and to prove, oh, this is what's happening here. Of course, the problem we discussed last week with the Metronome guys, it's like, it's...
Yeah.
...it will take time for us to reach a point where our system will perform full-fledged
root cause analysis and experimentation part. But I would say it's like one or two years only.
We already had a lot of pieces built on this road, you know.
Well, I look forward to discussing it again then.
Yeah. Yeah.
Nice. Nice one, Nikolay. Looking forward to it. Good luck with the rest of your launch week as well.
Thank you.
In the past.
In the past.
Back to the future.
Good.
Thank you so much.
Nice one.
Take care.
Thank you.
See you soon.
Bye.
Bye.