Postgres FM - Self-driving Postgres
Episode Date: August 15, 2025

Nikolay and Michael discuss self-driving Postgres — what it could mean, using self-driving cars as a reference, and ideas for things to build and optimize for in this area.

Here are some links to things they mentioned:

Nikolay's blog post on Self-driving Postgres https://postgres.ai/blog/20250725-self-driving-postgres
SAE J3016 levels of driving automation https://www.sae.org/news/2019/01/sae-updates-j3016-automated-driving-graphic
Oracle Autonomous Database https://www.oracle.com/uk/autonomous-database/
Self-Driving Database Management Systems (2017 paper) https://db.cs.cmu.edu/papers/2017/p42-pavlo-cidr17.pdf
PGTune https://pgtune.leopard.in.ua/
pg_index_pilot https://gitlab.com/postgres-ai/pg_index_pilot/
[Vibe] Hacking Postgres with Andrey, Kirk, Nik – index bloat, btree page merge https://www.youtube.com/watch?v=D1PEdDcvZTw

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and this is Nick, founder of Postgres AI.
Hey, Nick, how's it going?
Going great. I'm very glad to see you.
How are you?
Likewise. I'm good, thank you.
And you chose the topic this week. What are we talking about?
Yeah, I think it's very interesting to discuss the level of automation we have overall. You know my position against managed Postgres, and in this case it will probably be the opposite: saying that what we have is not enough — in terms of managed Postgres, and also in terms of Kubernetes operators and other automation projects the Postgres ecosystem has right now. So, why was I thinking about it?
Imagine: in 2011, Heroku was started — Heroku Postgres was started. In 2013, RDS Postgres was released — in November, I think, and at re:Invent, I guess, right?
So then this was a foundation of growth of interest, I think.
Like some people say it's because of JSON or something.
I agree with those arguments, but I think the central reason why Postgres started to grow in 2014, 2015, and up to now is that before that, backend engineers and developers were always complaining how difficult it is to set up Postgres and configure it, and backups, and replication — they just didn't want to deal with it. And RDS, and Heroku before it, brought automation for basic things, right? And this, I think, simplified the lives of a lot of engineers. That's great. In 2020, Supabase was released, and I think a new wave of audience was brought to the Postgres ecosystem — frontend people, actually.
Because now it's not only Postgres that's automated — it's very well automated. Other components are very well automated too: the REST API, the realtime component, the authentication component. So you immediately start working on the frontend, forgetting about the backend. So I admire Supabase for bringing a lot of frontend folks to Postgres.
That's great.
But at some point they need to learn SQL — I'm pretty sure. So this is fine. And now we have AI builders. And this is the new wave of users — basically, sometimes not even humans anymore. We hear from Supabase and Neon that a lot of the clusters created these days are created by AI, at the request of Cursor or something — vibe coding. So many, many clusters. Many of them are small, and maybe they won't go anywhere, because they're just experiments, prototyping, and so on. But some clusters grow, and they lack attention. With RDS, there was a big shift: we talked about startup teams who don't have a DBA and are fine with it until some point — and this is where Postgres professional services catch them quite often, right? But now we're talking about
even a complete lack of a backend engineering team — for example, with Supabase — or even, somehow, a complete lack of an engineering team at all, right? Only folks who understand the product and try to vibe code it. Sometimes with security breaches: recently some app was storing data in Firebase, right — Google Firebase — and it was not secure at all. Five million registered users; it was a big scandal. So, yeah. Anyway, this is the security part of the topic.
So what I feel is a demand for much higher automation than just RDS or Supabase. Some new level of automation should be present.
And if you look at the enterprise sector, there is Oracle with this idea of an autonomous database — a self-driving database — for many years, right? On one hand. And on the other hand, there are academic papers, like the one from Carnegie Mellon University, Andy Pavlo, from 2017, which discusses what self-driving database management systems are.
And there is a question: if you think about zillions of Postgres clusters, which should be highly automated — and when experts look at them, everything should already be transparent and obvious, how to fix things and move on — what is this? What is self-driving Postgres? I was thinking. And to answer that, I performed several waves of research — of course, with deep research from Claude and OpenAI's ChatGPT, right, the latest models. I've paid everyone a lot of bucks already. So I was thinking: what could it be for Postgres? To answer that, I performed research looking at Oracle, first of all. You know, deep research is when they perform Google or Bing searches, analyze hundreds of sources, and then write some kind of report, like a student would. It might have issues, of course, this report, but at least it gives you a lot of links, and some summaries. So my question was: after all those years of building autonomous Oracle, what do people really like, and what do they like less, right?
What did you find?
What did they say?
Yeah.
And one more comment. In 2013 or '14, I think, I was attending an Autonomous Oracle webinar. And I was completely shocked: they promised autonomous Oracle, but they only talked about clustering logs — organizing better log analysis from hundreds or thousands of sources, not only the database. And I was like: where is the autonomous Oracle here? And for more than 10 years since then, I was thinking it was kind of stupid. Now I've changed my mind, and I hope you and our audience will understand why. So, what I found is that people appreciate self-patching a lot. It's like minor releases: if a new release comes out with security patches, it's not a headache at all. And this is kind of automated in RDS — you can just define
a maintenance window. Yeah, with some caveats — minor versions only, I think. Have they done major versions now? — Major versions, yeah: automation of major version upgrades. And here we definitely have something to discuss — I mean, we discussed it already. And my team — we had very good recent cases where our customers had zero-downtime upgrades, and we're very happy. I hope some blog posts are coming. — Zero downtime is very different from fully autonomous, though. Like, very, very different from fully autonomous or self-driving. And major upgrades are very different from...
Let's take one more step back: what is a self-driving car? — Yeah, great. There are six levels defined by SAE, a kind of standard. So there are six levels, from zero to five. Zero means not autonomous at all — manual, a regular car. And five is fully autonomous.
And looking at the first few levels, I realized an interesting thing. They talk not about each feature in particular, but about combinations of features. For example, level one could be either adaptive cruise control — maintaining speed, but safely, right — or maintaining the lane, but not both. If it's both, it's already level two, right? And there are several levels. And, for example, this Carnegie Mellon paper — Andy Pavlo's paper from 2017 — discusses how to map this to database management systems. Well, a little bit — not much, in my opinion, but a little bit, in short. — This paper — I looked at it in advance; you shared these car things, I'll link them up
in the show notes as well. It struck me that level four to level five — the last step — is a huge jump. It's like: here are loads of features in the car that will help the driver, and then at level five, suddenly, the driver does nothing. That feels to me like a potentially huge chasm — maybe there are a hundred more levels in between four and five that we're going to need to break down at some point. It felt like a very hand-wavy way of saying: we already have level four features, so we're very close to having level five. And I was unclear, for example, whether in a level five car a human could still take control if needed, or whether there's absolutely no way of doing that. That feels to me like a level that wasn't defined. And maybe there are other levels... — I believe there will be, yeah. Yeah, let me
explain how I see it. And I think if you ask several people, they will answer differently. I also heard the Kubernetes ecosystem tries to map it as well, and some operators claim they have very high automation, but many people say they don't, and so on. So, in terms of cars, let me walk through it. Level two is both of those options together, for example, and this is what Tesla Autopilot does. I use it a lot: you just turn it on, but you must sit and — officially — you must keep your hands on the wheel and be ready to take control at any second, basically, right? But still, it's great: it maintains lane and speed, and you just relax and spend much less effort.
And I think we can think about this in databases as well.
The next level is level number three. At level three, everything is automated, but you still need to be ready to take control. And this is what, for example, Tesla Full Self-Driving is. Well — not quite everything is automated; it's under your supervision.
I've just pulled it up. No, no, no — it's not quite that. Level three — it says here, for example, it's a "traffic jam chauffeur". So it can handle basic traffic conditions, like a traffic jam, but it doesn't account for all weather conditions, for example, or a bunch of other potential things. So there are limitations. — Exactly. And you still need to take control if needed. Basically, you need to be ready to take control.
This is level three, I agree. So there are limits, but it can bring you fully automatically from point to point — this is what Tesla Full Self-Driving does. You sit in the driver's seat, and from point to point you can basically enjoy full automation of the whole route, right? But if some bad condition occurs, then you need to take control and fix things. This is —
Yeah. And, for example, we can map this to a full major upgrade — with zero downtime and so on. By the way, when we think about autonomy, we also bring in additional features, like zero downtime: the upgrade could be in-place, with downtime, but somehow our mind wants some good features in addition to autonomy. It's a natural desire to have good stuff, you know. But you can imagine, for example, that we have the whole thing automated, and in many circumstances — in many cases — it will work. But in some edge cases it won't, and you will need to take control and make some decisions before proceeding, or even postpone the whole procedure. This is very similar to Tesla Full Self-Driving, and I've experienced it: it really can drive you from point to point. But, for example, my property has some roads inside it, and it won't be able to drive there at all — because it's not a proper road anymore, you know.
So there is another level — four. It's also conditional, but there — my perception, again, might be wrong — you can go to the back seat and sit there; you are allowed to relax completely, fully. But again, it will work only under some conditions. For example, there is Waymo. I've tried it multiple times in San Francisco; it's amazing. It's a Jaguar — you go to the back seat and everything is fine. But we've all seen those YouTube videos: multiple Waymo cars just create a traffic jam themselves — basically a deadlock, right? — So if they sense that they can't drive, or they spot something they can't deal with, they'll stop as a safety precaution. But then what do you do? You have to get out. — Yeah, in this case, yes. In
the case of Waymo, you're a passenger, yes. So you can imagine, for example, if we map it to major upgrades: imagine there is a procedure developed, and there is a vendor who can intervene and take control sometimes, if allowed. But the passengers in this case — those who asked for full major upgrades to be performed — are passengers: they cannot make decisions. And this is good for them, because their minds are spent on product development, for example, right? And thanks to millions of miles of experiments and real-life experience for these cars: if something goes wrong, safety first — it will just abandon the trip. I mean, postpone it, cancel, and another car will come later, right? So this is the approach. But the whole thing is encapsulated — like a black box for you, right? You don't go down and make decisions according to some decision diagram, right? But it also has limitations, and I think Waymo is a perfect example of level four, because it works in San Francisco in some areas, but you cannot ride to San Jose — usually a drive of slightly more than one hour — because it's outside of coverage. And this can happen here as well, with major upgrades: if there are some extensions, for example, it's outside of coverage — "we don't support this kind of upgrade, because this extension (I don't know, Citus or TimescaleDB) requires an additional approach, and we don't have it covered here."
Right, so this is what I think here: we can map it — and why not? But my main insight, looking at the deep research on feedback from Oracle users, DBAs, and engineers: they say upgrades are great, both minor and major. Security is great — security controls, automatic procedures to level up security; these kinds of maintenance things are great. But when we talk about smart things, like index advising and so on, it's a hit-and-miss situation. And here I had an aha moment, because I was thinking: actually, if you look at what all these people try to do, they try to invent configuration tuning automated with machine learning and AI. This is what Andy Pavlo was doing with OtterTune. OtterTune was eventually shut down, but by the end of OtterTune's life, I had already noticed the shift — which happened to Postgres earlier — of attention moving to query tuning and optimization, like the creation of indexes. And I see pganalyze doing a great job, not resisting LLMs, which is also great. And some teams go inside Postgres and try to make the planner smarter.
And with configuration, for me it's pretty straightforward — it's the Pareto principle. Take PGTune (pgtune.leopard.in.ua): a very simple configuration, 80% of the job done in one percent of the time — really, really fast. It's just a heuristic-based, rule-based approach, for OLTP or anything. And it's good enough for many cases — we don't need any machine learning and so on.
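As an illustration of what this rule-based approach produces, here is a hedged sketch in PGTune's spirit — the values are invented examples for a hypothetical 16 GB OLTP box, not recommendations from the tool or from this episode:

```sql
-- Heuristic, PGTune-style settings (illustrative values only).
ALTER SYSTEM SET shared_buffers = '4GB';         -- ~25% of RAM; requires a restart
ALTER SYSTEM SET effective_cache_size = '12GB';  -- ~75% of RAM; planner hint, no allocation
ALTER SYSTEM SET work_mem = '16MB';              -- per sort/hash operation, per backend
ALTER SYSTEM SET random_page_cost = 1.1;         -- assumes SSD storage

SELECT pg_reload_conf();  -- applies reloadable settings; shared_buffers waits for restart
```

The point is that a handful of RAM- and storage-based rules like these get most of the benefit without any machine learning.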
And even — when you say "for many" — I think it's also about time: it will be good enough for a while. — Yes, yes, I agree. But then you need to tune.
And my way of tuning is to conduct experiments. You know this very well — how to make experiments faster, cheaper, reproducible, and so on. These kinds of things.
Is it worth it, though? Because I think many people would count level four as self-driving, but there are still enough caveats that a lot of the benefits don't materialize. Let's go back to cars briefly: I'm really excited and optimistic about self-driving cars. I love the idea of being able to get on with something else while something drives me. I don't mind that it might not be the fastest way, or that it might not drive completely optimally. It might not even pick the best route.
Not the safest? — What about the safest? — No — probably the safest, but not the fastest, sorry. It's probably safer than me, but maybe it would be a bit more gentle. You know, maybe it wouldn't take the yellow light when a human driver would — that kind of thing. But I love that you can just watch a movie, or chat to a friend easily, or play a board game in the back. You can do whatever you want. You get so much time back — especially in America, where a lot of time is spent driving. It makes so much sense to me that self-driving cars are a huge unlock
for a lot of people — but largely only at stages four and five, at that highest level. Cruise control is great, but I still have to concentrate; I still have to be watching the road. I don't actually gain that much. And if we go back to Postgres, I feel like a lot of the automation features are great, but we still have to concentrate. We still need the person; we still need the DBA. And as long as we still need the driver, and as long as we still need failsafes down to humans, all I see is a gradual need for fewer humans per server — maybe this is where the driving analogy breaks down a little bit. Like, maybe the DBA team for a company will be smaller on average compared to how it was in the past, and I think we've already seen that over time.
But I'm struggling with that last step, until we get to those levels — which feels to me like a long way off. Especially given the experience you're describing with Oracle, and the experience we saw with very smart people trying to automate a lot of this stuff with a lot of AI — and not even LLM stuff, right? A lot of the research in this area has been machine learning and other longer-researched AI methodologies that have lots of real-world use cases. And even there, we've seen mixed results. In the experience I've had talking to customers that used OtterTune, for example, I feel like the constraints were not as clear as in driving. Or the slightly different use cases, or the performance trade-offs that different people have in different cases, are subtle enough that you can't set the exact same guardrails for everybody. And at that point it breaks down enough that... oh, sorry, one more addition:
Performance cliffs are so real that if you change one thing and it looks like it's going to be great, as soon as you hit a cliff it's then a disaster. And then recovering from those disasters is actually a real problem. And I feel like troubleshooting disasters is also a problem; root cause analysis is a problem. And arguably they get harder when you involve automation, because the more it's automated, the less people actually know what was changed, when, and why.
I can't argue with you here — you must be an expert. But if you have automation, you move much faster. With a high level of automation: take Cursor, give it a lot of the pieces together, explain how you approach the methodology of analysis — this is what the expert needs to bring — and then you move much faster. But this is how I moved to this area completely, right?
Yeah, okay. So people say that in Oracle this works and that doesn't. What works? Quite simple things, as I said: upgrades, maintenance, security stuff. Well, not simple, but boring, you know. Of course, replication and backups: for me, HA and DR are like auto-steering — maintaining speed and maintaining lane, you know, the basics. Cars must do this, so a database must be good at HA and DR. If we look closer, actually, there are issues with both HA and DR which will prevent us from reaching a very high level of automation, but we can dive into this later. Anyway: the boring stuff lacks automation. And remember I mentioned that levels one and two talk about combinations of features — so if we start analyzing each feature individually, we cannot apply the same classification, because the classification talks about combinations of features.
Yeah, sure.
Coming back, actually — just to make sure I understand — what has Oracle done in terms of security? Can you give some example automation features? — Well, I know little. I would rather say what we should do in Postgres. This is the last topic I would pick — I'm ready to discuss it now because it's on the roadmap, but right now we're focusing on
different areas. I can just speculate on this: identify potential threats — checking permissions, roles. For example, I know organizations which use the same superuser for all human DB engineers, and this is quite easy to identify. Or take an organization going through an IPO process — I had a couple of them on consulting contracts. Before an IPO you have an audit, right? And during the audit they ask specific questions — some of them quite silly, I would say, but some of them good enough. And if you just inspect your pg_hba.conf, inspect your user model, inspect how multi-tenancy is organized — we had an episode about that, right — these kinds of things can be analyzed automatically, and so on. I don't know the details about Oracle; I just saw feedback that engineers really appreciate this stuff, and they appreciate less the automated configuration and
automated creation of indexes. — Did they appreciate it less, or is it more that when it gets it wrong, it's more painful? — That's what I mean: mixed results. Mixed results, you know — there's a lack of trust in some minds. Yes... I don't know. Yeah.
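As an aside, the role-audit checks Nikolay sketched a moment ago — for example, spotting one shared superuser used by all engineers — can start as plain catalog queries. This is a hypothetical illustration, not code from any tool mentioned in the episode:

```sql
-- Which roles are superusers, and can they log in directly?
SELECT rolname, rolsuper, rolcanlogin
FROM pg_roles
WHERE rolsuper;

-- Who is connected as a superuser right now? A single superuser role
-- shared by many humans would show up here with many sessions.
SELECT a.usename, count(*) AS sessions
FROM pg_stat_activity a
JOIN pg_roles r ON r.rolname = a.usename
WHERE r.rolsuper
GROUP BY a.usename;
```

Checks like these are cheap, read-only, and easy to put into periodic automated analysis.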
I, for example, catch up with customers from time to time, just to hear what they're doing, what they like about products, and what they don't. And I was speaking to a customer of mine that did try to use OtterTune for a while. It'd be interesting — maybe we should invite somebody from that team on to discuss what happened there. Like, why did it shut down? — I can tell you why: people don't need configuration tuning that much.
Okay, well, I also think there might be other issues. — I'll tell you: a big need for configuration tuning exists only if you have, say, 10,000-plus clusters. Then you can say, okay, we're going to save 5 to 10 percent of the money just with tuning, or the workload will be reduced. And — we know this very well — one really bad plan can screw up all the efforts of configuration tuning.
Well, yes, but I think it's worse than that. In addition to it not being needed that often — and therefore subscription-based models not working that well — this customer was telling me they moved from a mental state of "when something goes wrong, let's dive into what happened" straight to "when something goes wrong now: what did OtterTune change?" And that was a real shift. It became a trust issue — but one based on the fact that it had made changes in the past that made things worse. So it's not distrust for no reason: every now and again, when you change something, you hit a performance cliff, and it's unexpected. Or maybe it's not even always a performance cliff — maybe it has another unintended consequence that you care about more. Probably not in a lot of these cases, but they talked about, for example, making the mistake of letting it configure some parameters that even affected durability and things like that. So depending on what you allow it to change, there might be unintended side effects. And putting guardrails around what you will and won't let it do is actually harder than it sounds, I think.
It's very hard to perform the enterprise approach to making a change — it's extremely complex. I'm very grateful that seven years ago I was working with Chewy: they were preparing for an IPO, and I remember the CTO was ex-Oracle, and the discussions we had — and the resistance they had to any change I proposed — taught me this enterprise approach, you know. I'm very grateful; it was a great experience for me. And I realized: actually, if you want to be serious about changes, any small change should be very thoroughly tested — experiments, experiments. All risks must be analyzed, and then there should be a plan to mitigate if a risk materializes, right? And AI almost doesn't help here, you know — this is a framework you need to build without AI first.
I don't necessarily agree
I think AI could really help with these things
when I say AI I'm including machine learning
and not just the latest LLM stuff
I just think we need to define
constraints really clearly
and define what we care about really clearly
and make it really clear we care more about reliability
and durability than we do about performance
so that's almost always true
and I think that might be closer to the core reason why these performance tuning tools haven't yet succeeded: we haven't yet nailed the reliability and durability stuff. So that would be my theory as to why they didn't necessarily succeed.
Because even if they did help performance almost all the time, if they ever hurt reliability, that's not a trade-off most organizations are willing to make. And that's a difficult thing to tell a tool that's trying to optimize for better performance.
Yeah.
Yeah, I agree.
And durability has issues. There is a good article from Sugu, just published, about synchronous replication issues. And we also know the very good talk by Alexander Kukushkin about issues with synchronous replication. So durability is a must-have. And the targets — I agree with you: durability, availability, reliability must be number one, before performance. By the way, I also remember
from that research that people appreciate automated analysis and control of costs. And this can initially be quite simple. I remember actually talking to one huge organization; their database director told me: you know what, it's cool stuff, what you're showing in terms of experiments and performance tuning — query tuning experiments with DBLab and so on. But the number one problem we have is abandoned instances, and how to stop doing that and losing a lot of money. In big organizations this is a very big problem. And yeah, cloud providers still don't offer good tools; it still takes a lot of effort to understand the costs — the structure of spending, right? In practice, it's usually realized too late, when they're no longer interested. Right. So, anyway — back to my aha moment. Yeah, sure: people from academia — really great people,
great minds — they try to build really cool stuff: let's have automated parameter tuning, automated indexing, or even let's go inside the planner and create an adaptive query optimizer. — I saw even more extreme: in the paper, they were talking about choosing whether tables should be row-oriented or column-oriented based on the workload they're observing. — So they try to attack really cool areas, right? This is great, and it has always been so: academia tends to attack things that are really detached from reality.
Meanwhile, I realized that we had already implemented automated reindexing with multiple teams, and this is what people really need. And lately, in consulting, I realized we almost always say "you need automated reindexing", but we didn't have a polished solution. We have multiple ways to do it, and I always said: take this tool — someone from our team developed it as a side project — and polish it. And it's only about btree reindexing, and this and that. Oh, and also estimates: bloat estimates might be off. We know this very well — they are just estimates, not exact numbers like real index sizes measured on a clone after a fresh rebuild, right? And I realized: actually, this problem is not solved. And it's a boring problem. And we can solve it.
That's why the number one thing we're going to release right now — it's about to be released — is what we call pg_index_pilot. And there is a good roadmap inside it; it's a whole project. And it's going to be really simple. A guy who works part-time with me right now, Maxim Boguk — one of the most experienced Postgres DBAs I know, much more experienced than I am — created what I basically consider the prototype. I said, let's fork it, and then we started iterating on it. The idea is simple — I call it the "Boguk number". You take the index size and divide it by the number of live tuples in the table (let's forget about partial indexes for a while), and we have some ratio, right? When we've just created the index, let's consider this ratio perfect. And let's consider only tables which exceed, say, a million rows. Checking this number costs nothing — you can put it into monitoring for all indexes. It's super fast to get, because these aggregates are already stored, right? Yeah — you get it immediately from the system catalogs. And then, over time, you see degradation of this
number. Why? Because some pages inside the index are not full, right? They're half empty, and so on — they're sparse. So it means at some point you say: oh, it's time to reindex.
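A minimal sketch of this ratio as a catalog query — my own illustration of the idea described here, not code from pg_index_pilot — computing index size per live tuple for tables over a million rows:

```sql
-- "Boguk number" sketch: index bytes per live tuple in the parent table.
-- Cheap to compute: sizes and n_live_tup come straight from catalogs/statistics.
SELECT
    i.indexrelid::regclass AS index_name,
    pg_relation_size(i.indexrelid) AS index_bytes,
    s.n_live_tup,
    round(pg_relation_size(i.indexrelid)::numeric
          / nullif(s.n_live_tup, 0), 2) AS bytes_per_live_tuple
FROM pg_index i
JOIN pg_stat_user_tables s ON s.relid = i.indrelid
WHERE s.n_live_tup > 1000000   -- only large tables, as discussed
ORDER BY bytes_per_live_tuple DESC NULLS LAST;
```

When `bytes_per_live_tuple` drifts well above the baseline captured right after the index was built or rebuilt, the index has grown sparse and becomes a reindex candidate.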
And the best... well, there are pros and cons of this approach compared to, say, the traditional one everyone is using, based on bloat estimates. A couple of big cons of this approach: it required some effort from us to get away from superuser — and we did it. And, oh, one more thing: on purpose, we decided this thing is going to live inside the database — self-driving, inside the database. We don't need external means, like something installed onto the instance, or lambdas. We don't need anything; it will all be inside.
This means it's running inside PL/pgSQL code. We know that since Postgres 11, stored procedures have had transaction control, right? So we can go. But we need REINDEX CONCURRENTLY, right? And REINDEX CONCURRENTLY cannot be wrapped inside a transaction block. So, unfortunately, we need something like dblink, right? And it's a challenge to do this properly on RDS, for example, because with dblink you need to expose a password, and you don't want to store it in plain text. So I remembered a very old trick I used many years ago: dblink over postgres_fdw. And this is how we do it right now.
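A rough sketch of that trick — all names here are hypothetical, and real setup details vary (especially on RDS): a postgres_fdw foreign server plus a user mapping keeps the password in the catalog, and dblink accepts the foreign server's name in place of a connection string, giving a separate session where REINDEX CONCURRENTLY can run outside the caller's transaction.

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE EXTENSION IF NOT EXISTS dblink;

-- Loopback server pointing at the same database (hypothetical names).
CREATE SERVER index_pilot_loopback
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'localhost', port '5432', dbname 'mydb');

-- Credentials live in the user mapping, not in SQL text or application code.
CREATE USER MAPPING FOR CURRENT_USER
  SERVER index_pilot_loopback
  OPTIONS (user 'index_pilot', password 'secret');

-- dblink can take a foreign server name instead of a connection string
-- (the caller needs USAGE on the server), so no password appears here.
-- The command runs in its own session, outside any transaction block,
-- which is exactly what REINDEX CONCURRENTLY requires.
SELECT dblink_exec('index_pilot_loopback',
                   'REINDEX INDEX CONCURRENTLY my_index');
```

The design choice is that everything stays inside the database: the PL/pgSQL driver code and the out-of-transaction reindex both run with no external schedulers or lambdas.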
And there is another limitation for this thing to start working: you need a baseline, right? A baseline means you need to reindex everything first, or bring this data from a clone — an idea we're going to implement very soon, to avoid the full reindex, because sometimes our customers have 10-plus-terabyte databases, and reindexing everything is not cool: it will take forever, and so on. There's also a big impact when you reindex a lot quite quickly. But a good benefit of this approach: it's not only about btree. You can take care of GIN, GiST, and even HNSW and others, if they degrade. Basically, we measure a kind of storage efficiency for an index. It works super
very well. And I think I believe into this simple approach. I think we are going to have it. And I think also like I talked about this last couple of weeks ago with Andrean Kirk on Hockey PostGus Hiking on PostGus TV. And we started doing something mind blowing. And Ray just said, let's just implement merge. Because you know B3 and B3 implementation in Postgres, it can, it has only split. It cannot merge pages.
And since Andrey's Ph.D. work is in the area of indexes, it was great to have the ability to start this work. And I'm looking forward to it, like everyone.
Wait. So if, for example, an area of the index starts to get sparse because we've deleted a bunch of data, let's say we've deleted some historic data and we're not going to insert historic data back into that part of the index, it could proactively fix that? Like, it's kind of self-healing.
Exactly. Wow, cool. So our project would be archived at some point. I hope so. I'm not an expert in Oracle. I have never been an expert in Oracle, and not a user either; the last version I used was in 2001 or 2002, it was 8i. It was so good. But people say Oracle doesn't have this need to re-index. SQL Server, I heard, does have this need.
Over time, indexes decline.
We talked about it so much.
I said, this is like a mantra.
Like, everyone needs re-indexing at some point.
And we know the work of Peter Geoghegan and Anastasia Lubennikova in 13 and 14;
they implemented deduplication and other improvements.
That's great.
But still, there is no merge.
So pages cannot be merged.
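To make concrete why a split-only B-tree accumulates sparse pages, here is a deliberately crude model, nothing like the real nbtree code (real splits, VACUUM, and bottom-up deletion are far more subtle): inserts create new pages when the current one fills, but deletions only shrink entry counts, so the page count never goes back down.

```python
# Crude model of split-without-merge behavior. Each "page" is just a
# count of live entries; inserts start a fresh page when the last one
# is full, deletes only decrement counts -- pages are never merged.
PAGE_CAPACITY = 100

def insert(pages: list[int]) -> None:
    # Rightmost insert: when the last page is full, "split" by starting
    # a new page; nothing ever merges pages back together.
    if pages[-1] == PAGE_CAPACITY:
        pages.append(0)
    pages[-1] += 1

def avg_fill(pages: list[int]) -> float:
    return sum(pages) / (len(pages) * PAGE_CAPACITY)

pages = [0]
for _ in range(10_000):
    insert(pages)
fill_after_inserts = avg_fill(pages)  # append-only load packs pages

# Delete 90% of entries (e.g. dropping historic data). Entry counts
# shrink, but the number of pages stays the same -- sparse pages linger
# until a REINDEX rebuilds the structure.
pages = [p // 10 for p in pages]
fill_after_deletes = avg_fill(pages)

print(len(pages), fill_after_inserts, fill_after_deletes)
```

A page-merge operation, the thing being proposed here, would let the second number recover without a full rebuild.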
And there was also some work, I think,
from Peter Geoghegan and others to help avoid page splits in more cases.
Yeah, not just the de-duplication work, but yeah, the bottom-up deletion.
So, yeah, kind of avoiding them getting into this state was one thing.
Helping them heal when they do is another.
It also strikes me as, like, a lot of this stuff could live in core, right?
We already have some automation features, right?
We already have auto vacuum, which does about three or four different jobs.
So we have some groundwork here already.
We have tools like pg_repack and pg_squeeze,
not in core, but there are ideas about moving more of their features into core. So we do have some of this in core, and some of it in popular extensions that are supported by many clouds. So it feels like the project is already naturally going in this direction, maybe slowly, and maybe not in all the areas you'd want. But what does the end goal look like here? Is being in core the ideal? Well, in my opinion, we just need to
solve a very complex problem. I know that in some big companies' managed Postgres services, sometimes one experienced DBA is responsible for a hundred thousand clusters. It's insane. But we need to be prepared to be responsible for a million clusters, because builders will bring us a lot of clusters; the times are changing really fast. So Postgres needs not to lose this game. By the way, right now if you check Hacker News trends: it was growing, not only in job postings, as I usually mention, but everything, all discussions on Hacker News. It was growing until the beginning of last year, 2024, and then there is a slight decline. And I think Postgres right now has a huge challenge.
It needs to be much more automated.
If things need to be in core, it should be in core.
But not everything should be in core.
We know autofailover is still not in core, right? Patroni, right? But Patroni — I just asked Kukushkin: has he considered automating zero-downtime switchover? Because this is what people want and expect from a highly automated system. He said no, and I started making a joke, because I said Patroni lives outside of Postgres because it's not the job of Postgres to do HA autofailover. And now I expect you, the Patroni maintainer, to tell me that automatic zero-downtime switchover is not the job of Patroni, so I need to implement another layer on top of it, right? This is what we actually already do for zero-downtime upgrades.
We do this, and by the way, I can share this already: on our consulting page we mention GitLab, but also Supabase, and also Gadget.dev. These companies took this from us, and I have official quotes from their managers, so I can share this news. I think we developed really great automation for zero-downtime upgrades, which are not only zero-downtime but also, of course, zero data loss — and reversible, reversible without data loss as well. So these are perfect properties, but it requires a lot of orchestration. Fortunately, since Postgres 17, things are improving and fewer pieces need to be automated. So let me go back and finally explain my aha moment and the vision
I have right now.
I realized that boring stuff needs to have a much higher level of automation.
This is one.
Second, I realized that this is exactly what we at Postgres.AI are doing, because with consulting, people bring us these topics. And I realized also that in every area, if we think about automation of a feature, we can apply a simplified approach to classification.
So if every single step must be executed manually — a CLI call, like a pg_dump or pg_upgrade call, or SQL snippets run by an engineer — this is manual, right?
Then there are bigger pieces: say the whole procedure consists of two or three big pieces, and they are combined, and the engineer only makes the choice whether to proceed between pieces. For example, our major upgrade consists of two huge pieces: physical-to-logical plus upgrade, bundled for specific reasons, and switchover. Two big steps, and inside them there is a high level of automation — you just call a playbook and it executes, right? In this case it can already be considered level one, say, or maybe two, I don't know.
And then, if we can fully relax and go to the passenger seat and just say, okay, I approve that you need to do everything, but you will do it yourself — I mean, Postgres or an additional system will do everything itself, like a full major upgrade with switchover — in that case it's, say, the next level, and you're in the passenger seat. But there are limitations: if it encounters some problem, it will stop, revert, postpone, and you need to approve things.
The highest level: the system itself decides, oh, it's time to upgrade, and then it schedules it — oh, we have a low-activity time on the weekend. You are notified about it, and you can probably block it, but it moves by itself.
This is the highest level of automation.
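The classification sketched here maps loosely onto SAE J3016's driving-automation levels (linked in the show notes). One way to write it down as code — the level names, the boolean criteria, and the boundaries are my own paraphrase of the discussion, not an established taxonomy:

```python
from enum import IntEnum

class AutomationLevel(IntEnum):
    """Rough levels of database-operation automation, by analogy with
    SAE J3016 driving levels. Labels paraphrase the episode; they are
    not a standard."""
    MANUAL = 0      # every step is a manual CLI call / SQL snippet
    ASSISTED = 1    # a few big automated pieces (playbooks); an
                    # engineer approves the transition between them
    SUPERVISED = 2  # one approval up front, then fully automatic;
                    # stops, reverts, postpones on any problem
    AUTONOMOUS = 3  # the system decides *when* to act, e.g. schedules
                    # the upgrade for a low-activity window; the human
                    # is only notified and may veto

def classify(steps_manual: bool, approval_between_steps: bool,
             self_scheduling: bool) -> AutomationLevel:
    if steps_manual:
        return AutomationLevel.MANUAL
    if approval_between_steps:
        return AutomationLevel.ASSISTED
    if not self_scheduling:
        return AutomationLevel.SUPERVISED
    return AutomationLevel.AUTONOMOUS

# The major-upgrade example from the episode: two big playbook pieces
# with a human approving in between -> ASSISTED.
print(classify(False, True, False).name)
```

The point of making the levels explicit is that each feature area (re-indexing, upgrades, partitioning) can be graded and pushed up the scale independently.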
And back to re-indexing, which I chose as the lowest-hanging fruit.
Everyone needs it.
Nobody among managed Postgres providers has it — maybe after our episode, people from RDS and Cloud SQL (I know some of them are listening to us) will rush into implementing this.
I just see everyone needs it.
Nobody has it in terms of managed Postgres.
Nobody offers it.
So I'm aiming for the highest level of automation for pg_index_pilot right away.
It will decide when to re-index, it will re-index, and it will control activity levels so as not to saturate disks, and so on.
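"Controlling activity levels so as not to saturate disks" usually means pacing the work. A minimal sketch of one common pattern — sleep between items in proportion to how long each one took, the same "cost delay" idea autovacuum uses. The function names, the generator shape, and the 0.33 duty cycle are illustrative, not pg_index_pilot's API:

```python
import time

def paced(work_items, duty_cycle: float = 0.33):
    """Yield work items one at a time, sleeping between them so that
    active work is roughly `duty_cycle` of wall-clock time (0.33 means
    sleep about twice as long as each item took). Illustrative only.
    """
    for item in work_items:
        start = time.monotonic()
        yield item  # the caller does the actual work between yields
        elapsed = time.monotonic() - start
        # Sleep so that elapsed / (elapsed + sleep) ~= duty_cycle.
        time.sleep(elapsed * (1.0 - duty_cycle) / duty_cycle)

# Usage sketch: reindex_one() would issue REINDEX CONCURRENTLY for one
# index over dblink; here we just simulate the work with a short sleep.
def reindex_one(index_name: str) -> None:
    time.sleep(0.01)  # stand-in for the real re-index

for name in paced(["idx_a", "idx_b", "idx_c"]):
    reindex_one(name)
```

Measuring the actual elapsed time of each step (rather than sleeping a fixed amount) makes the throttle adapt automatically when the storage is slow.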
You can check the roadmap — I explained it in the README of this product, and it's open source, because I actually truly believe that at some point this project won't be needed, if Andrey's idea for merge works. Maybe, I don't know; this is a dream, right? Then forget about index bloat.
Yeah. I guess it will be super hard. I tried to research why it hasn't been done, and I didn't see why. Maybe it's scary — this is the basics,
like the foundation of the whole of Postgres, of any cluster, right? The B-tree.
Yeah, well, I guess: when would it be done?
Would it be done by a background worker, like autovacuum?
Or at what stage does it make sense to do it?
Well, it should be synchronous.
Oh, I know this is a good question.
I'm going to ask Andrey about this.
Yeah.
Because split is synchronous.
Yeah.
I also think the thing you brought up just at the end there is super interesting.
Like controlling the rate so as not to saturate disks.
This is an interesting tradeoff. Again, if you're on Aurora Postgres and you're paying some amount per IO, you probably don't want your indexes rebuilt constantly, all the time. What you really want is to fix the root cause. It's probably an application issue, like: why are you updating the same row thousands of times a second? What's the root cause of the bloat?
So we just had a case of updating multiple times — a queue-like workload. This can be identified.
Yes, yes.
Yeah.
So I'm excited to see what you build.
I think you're right that more automation in Postgres would be good.
More automation in extensions around Postgres would be good.
But a lot of the issues I see are kind of application source issues.
And even if we make Postgres completely self-driving, there is still this kind of application-level issue that will be...
A mistake there can put all your efforts into the ground.
Yeah, I agree.
That's why — I envision, I identified 25 areas in my blog post, yeah. During our launch week, maybe we will adjust those areas; of course, it's not a final list. We are targeting three right now, and I hope we will expand this list to five or six next year. Once we have quite good building blocks, we will think about the whole system, but the central part of all these building blocks is a new monitoring system
which is good both for AI and humans.
And this is what we are already actively building.
We replaced all the observability tooling used in our consulting with the new monitoring system,
which is called postgres_ai monitoring.
We talked about it separately, right?
And this is going to be the source of insights into why things are wrong.
And then — we cannot self-drive the application yet, right?
Although in some cases it might be possible.
Because if, for example, Supabase — they control not only Postgres but also the REST API level, right?
So some things might be done there.
But in the general case, we cannot.
So in this case, we just need to advise, and so on.
But eventually, things like serverless — like Vercel workloads — I see they can live together with the database eventually.
In that case, we can discuss full self-driving of something, right?
But we are very far from that, I think.
Yeah.
And we have limited resources.
I wanted to admit that I'm targeting very narrow topics right now.
So very boring, but really doable and we see demand.
Things like partitioning, help with partitioning.
It's a nightmare for an arbitrary engineer to do it.
We say, oh, declarative partitioning.
But, for example, partitions are not created automatically.
Okay, there is pg_partman,
but you need to combine these pieces somehow — the level of automation there is terrible.
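Part of what pg_partman automates — pre-creating future partitions — is mechanical. A sketch of generating the DDL for upcoming monthly range partitions; the parent table and the `parent_YYYY_MM` naming scheme are hypothetical, and real tooling must also handle retention, indexes, and the default partition:

```python
from datetime import date

def month_bounds(start: date, months_ahead: int):
    """Yield (first_day, first_day_of_next_month) pairs."""
    y, m = start.year, start.month
    for _ in range(months_ahead):
        ny, nm = (y + 1, 1) if m == 12 else (y, m + 1)
        yield date(y, m, 1), date(ny, nm, 1)
        y, m = ny, nm

def partition_ddl(parent: str, start: date, months_ahead: int):
    """DDL for pre-creating monthly range partitions of `parent`
    (hypothetical naming scheme: parent_YYYY_MM)."""
    for lo, hi in month_bounds(start, months_ahead):
        name = f"{parent}_{lo.year}_{lo.month:02d}"
        yield (f"CREATE TABLE {name} PARTITION OF {parent} "
               f"FOR VALUES FROM ('{lo}') TO ('{hi}');")

for stmt in partition_ddl("events", date(2025, 12, 1), 3):
    print(stmt)
```

The automation gap is exactly that someone has to run something like this on a schedule; declarative partitioning in core gives you the syntax but not the scheduling.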
Yeah, but these are the kinds of things — a couple of things.
I'd love to see more of this go into core.
I think there have been improvements to partitioning in core, many over the last few years, and continuing.
And the other thing is: I think there's a lot of focus on fewer people.
With this automation, a lot of the focus is on "we can do more with fewer humans."
And I'd love to see more effort into what would happen if we didn't try to reduce the humans,
but instead tried to increase the reliability — you know what I mean.
And I think, too, there's a lot of talk about how we cut costs, and not enough...
Yeah, we already have, I think, five companies who are very fast-growing AI companies,
very well known already. Yeah. And I see a very different approach to Postgres. For them,
it's a natural choice: let's start with it. But then: oh, we need to partition
terabyte-size tables, let's estimate the work. Oh, that much work? Okay, maybe we should migrate
to a different database system. They are moving really fast, and they're not attached to
Postgres, basically. So I'm scared for Postgres. That's why I'm saying there should be a massive shift in automation right now, and the resources of my team are not enough.
So that's why I'm talking about it. I'm going to put my whole focus on it. During consulting, we also choose only the paths that are highly automated, fully automated. And I'm looking for more use cases, maybe more partners who think similarly, right? This is a call to the Postgres ecosystem. It's not enough right now. Postgres might lose this game.
Again, Postgres won multiple challenges — object-oriented, NoSQL — but AI... I think everyone thinks about AI only in terms of pgvector or storing vectors. It's not that; not everyone means vectors. People build some app, it needs a database, maybe with vectors, maybe without vectors, but what they expect is a higher level of automation, easier scaling. So I'm super happy to see new innovations in the area of sharding, right?
Yeah, I think that's actually — I do think that's more important, partly just for the marketing story.
Than partitioning?
For those, yeah. Because if you think about partitioning — for example, when we had our 100th episode and spoke to Notion: 100 terabytes. Yeah. Notion skipped partitioning and went straight to sharding. They decided that with all of the partitioning downsides — and there are quite a few limitations for their multi-tenant architecture — it actually made more sense to shard instead of partition. And it felt
like it wasn't one of those things where they were moving super quick and made a rash decision.
It felt like a really considered engineering decision. And I actually think it was the right call.
And I wouldn't be surprised if, with more of these companies coming in and building fairly seamless sharding that doesn't add too much — doesn't add any complexity at the application level — people were tempted into that, even if it wasn't for good engineering reasons, just for the marketing reason of "I don't have to think about scaling this, and I don't have to deal with terabyte tables."
You know this normal distribution meme, right?
Yeah, yeah, yeah.
Yeah, usually it's unexpected what's on the right side, where the expert sits.
So I don't know which camp I'm in.
I was thinking Postgres on one node is cool enough.
Then I was thinking it's not cool enough because of performance cliffs.
Now I'm thinking maybe it depends, because in some cases
it's much safer to avoid all performance cliffs, rather than just allowing 100-plus thousand transactions per second on one node.
And — how is it called — resiliency, right? If one node is down, only part of the system is affected; it's just one shard.
Maybe, yes. But at the same time, it's so cool to see projects which require only one node. They are isolated;
they don't need a whole bunch of shards and clusters — a cluster of clusters; the "cluster" term is
heavily overloaded, right? Yeah. And then you think: oh, see the power of having just one single
node. Sometimes without replicas — I see projects without replicas, because cloud times change,
right? And I think I'm open, you know. I know my perception changes over the years; sometimes it's a
pendulum. So it depends, you know — the regular answer, the normal answer from consultants: it depends.
But sharding — it's really great to see that it's coming from multiple teams; there will be competition.
So, yeah, here the future looks bright. But who will be helping to choose the proper schema for sharding?
Well, yeah.
And to rebalance properly?
Well, yeah.
So I think those who build this are thinking about that as well — automation, choosing the time for rebalancing, fully automated.
Well, we talked to Sugu.
Yeah, he said it's inevitable that you're going to have to change a sharding scheme at some point.
So designing for that up front seems really important.
Vitess handles it.
So, yeah, interesting times ahead.
I'm a bit worried about the complexity of these systems.
Personally, I quite like — well, starting simple. But also I like the idea that if I'm a Postgres user, I can still understand the system, roughly what's going on, even if I'm a full-stack developer, even if I do the front end and the back end. I know it is already very complex; I know there are already a lot of internals you kind of need to know about to make a performant system. But I hope we can hold on to that for a while.
Well, I truly believe that what my team does is going to help, because we observe many problems, and every time I'm saying: guys, we need to write this down somehow, and show it with an experiment, so users
understand what's happening.
Write some how-to for next time, right?
And when we were writing how-tos recently,
we wrote them both for humans and AI.
So next time some RCA — root cause analysis — is happening,
if you have our how-tos injected into your Cursor, for example,
it's going to take into account the situations
which are written down: how to reproduce them,
how to troubleshoot them, right?
So I think something else is coming,
I'm not going to spoil it, but RCA and troubleshooting is one thing we will attack early.
We are preparing pieces for this, you know, and one of the key indicators of success here
is that non-experts understand what's happening, you know, because...
Yeah, well, that's the area that I specialize in.
You know, I'm actually betting a little bit on humans staying in the loop for quite a long time, and that there will — well, not always, but for a long time — be categories of issue that we still need somebody to look into. Yeah. And I kind of feel like the median level of Postgres knowledge for the people having to look into those issues is probably going to go down, based on all the trends you're talking about. Maybe it depends: if we only have a few experts shared between lots of companies, maybe that's not true. But if we have a lot of individuals starting with vibe coding, or doing full-stack — kind of single-person companies running the whole product start to finish — those guys have to know a lot about a lot of things. They can't know all of that.
Have you seen the numbers Supabase shared — how many clusters they register, and how fast it's growing? Yeah, can you imagine the average level of Postgres knowledge there?
Yeah. But the idea — my vision — is that with AI we collect knowledge pieces, we experiment, we automate experimentation, and so on. But then, obviously — this is what I see with our customers — some human is needed to explain things properly to other humans, to answer questions properly,
you know, to build trust and confidence, and so on.
Yeah, but my question — I guess the question then is: where do the tools live?
Can the end user use a tool to get help?
Or does the Supabase team use a tool to get help?
Or the consultant that the Supabase team employs?
That's my answer to "who's the tool for?"
I'm just saying that it depends who you count as the user, in terms of how much Postgres knowledge they have.
I'm talking about the end user.
Yeah.
For example, with DBLab, we already went down this path.
We moved from a couple of guys answering all the questions
backend engineers have about how a plan works and what to do about it,
which index to create.
Now, with DBLab, backend engineers experiment themselves.
And only if something is unclear do they call an expert for help.
But like 90% of questions are answered by backend engineers
without involving Postgres experts, you know, and the expertise in backend engineers' minds
grows as well. I'm just thinking this approach we had for query optimization can be applied on a
grander scale to many other areas. Yeah. All right, probably enough for today. Thank you so much.
We went much deeper into specific areas than I expected, and I enjoyed it a lot. Thank you so much.
Nice. Catch you next week.