Postgres FM - Self-driving Postgres
Episode Date: August 15, 2025

Nikolay and Michael discuss self-driving Postgres — what it could mean, using self-driving cars as a reference, and ideas for things to build and optimize for in this area.

Here are some links to things they mentioned:

Nikolay's blog post on Self-driving Postgres https://postgres.ai/blog/20250725-self-driving-postgres
SAE J3016 levels of driving automation https://www.sae.org/news/2019/01/sae-updates-j3016-automated-driving-graphic
Oracle Autonomous Database https://www.oracle.com/uk/autonomous-database/
Self-Driving Database Management Systems (2017 paper) https://db.cs.cmu.edu/papers/2017/p42-pavlo-cidr17.pdf
PGTune https://pgtune.leopard.in.ua/
pg_index_pilot https://gitlab.com/postgres-ai/pg_index_pilot/
[Vibe] Hacking Postgres with Andrey, Kirk, Nik – index bloat, btree page merge https://www.youtube.com/watch?v=D1PEdDcvZTw

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and this is Nick, founder of Postgres AI.
Hey, Nick, how's it going?
Going great. I'm very glad to see you.
How are you?
Likewise. I'm good, thank you.
And you chose the topic this week. What are we talking about?
Yeah, I think it's very interesting to discuss the level of automation we have overall. You know my position against managed Postgres, and in this case it will probably be the opposite: saying that what we have is not enough — in terms of managed Postgres, and also in terms of Kubernetes operators and other automation projects the Postgres ecosystem has right now. So, why was I thinking about it?
Imagine: in 2011, Heroku was started — Heroku Postgres was started. In 2013, RDS Postgres was released — in November, I think, and at re:Invent, I guess, right?
So then this was a foundation of growth of interest, I think.
Like some people say it's because of JSON or something.
I agree with those arguments, but I think the central reason why Postgres started to grow in 2014, 2015, and up to now is that before that, backend engineers and developers were always complaining how difficult it is to set up Postgres and configure it, and backups, and replication — they just didn't want to deal with it. And RDS, and Heroku before it, brought automation for basic things, right? And this, I think, simplified the lives of a lot of engineers. That's great. In 2020, Supabase was released, and I think a new wave of audience was brought to the Postgres ecosystem — frontend people, actually.
Because now it's not only Postgres that's automated — it's very well automated. Other components are very well automated too: the REST API, the realtime component, the authentication component. So you immediately start working on the frontend, forgetting about the backend. So I admire Supabase for bringing a lot of frontend folks to Postgres.
That's great.
But at some point they need to learn SQL — I'm pretty sure. So this is fine. And now we have AI builders. And this is the new wave of users — basically, sometimes not even humans anymore. We hear from Supabase and Neon that a lot of the clusters created these days are created by AI, at the request of Cursor or something — vibe coding. So many, many clusters. Many of them are small, and maybe they won't go anywhere, because they're just experiments, prototyping, and so on. But some clusters grow, and they lack attention. With RDS, there was a big shift: we talked about startup teams who don't have a DBA and are fine with it until some point — and this is where Postgres professional services catch them quite often, right? But now we're talking about
even a complete lack of a backend engineering team — for example, with Supabase — or even, somehow, a complete lack of an engineering team at all, right? Only folks who understand the product and try to vibe code it. Sometimes with security breaches: recently some app was storing data in Firebase, right — Google Firebase — and it was not secure at all. Five million registered users; it was a big scandal. So, yeah. Anyway, this is the security part of the topic.
So what I feel is a demand for much higher automation than just RDS or Supabase. Some new level of automation should be present.
And if you look at the enterprise sector, there is Oracle with this idea of an autonomous database — a self-driving database — for many years, right? On one hand. And on the other hand, there are academic papers, like the one from Carnegie Mellon University, Andy Pavlo, from 2017, which discusses what self-driving database management systems are.
And there is a question: if you think about zillions of Postgres clusters, which should be highly automated — and when experts look at them, everything should already be transparent and obvious, how to fix things and move on — what is this? What is self-driving Postgres? I was thinking. And to answer that, I performed several waves of research — of course, with deep research from Claude and OpenAI's ChatGPT, right, the latest models. I've paid everyone a lot of bucks already. So I was thinking: what could it be for Postgres? To answer that, I performed research looking at Oracle, first of all. You know, deep research is when they perform Google or Bing searches, analyze hundreds of sources, and then write some kind of report, like a student would. It might have issues, of course, this report, but at least it gives you a lot of links, and some summaries. So my question was: after all those years of building autonomous Oracle, what do people really like, and what do they like less, right?
What did you find?
What did they say?
Yeah.
And one more comment. In 2013 or '14, I think, I was attending an Autonomous Oracle webinar. And I was completely shocked: they promised autonomous Oracle, but they only talked about clustering logs — organizing better log analysis from hundreds or thousands of sources, not only the database. And I was like: where is the autonomous Oracle here? And for more than 10 years since then, I was thinking it was kind of stupid. Now I've changed my mind, and I hope you and our audience will understand why. So, what I found is that people appreciate self-patching a lot. It's like minor releases: if a new release comes out with security patches, it's not a headache at all. And this is kind of automated in RDS — you can just define
a maintenance window. Yeah, with some caveats — minor versions only, I think. Have they done major versions now? — Major versions, yeah: automation of major version upgrades. And here we definitely have something to discuss — I mean, we discussed it already. And my team — we had very good recent cases where our customers had zero-downtime upgrades, and we're very happy. I hope some blog posts are coming. — Zero downtime is very different from fully autonomous, though. Like, very, very different from fully autonomous or self-driving. And major upgrades are very different from...
Let's take one more step back: what is a self-driving car? — Yeah, great. There are six levels defined by SAE, a kind of standard. So there are six levels, from zero to five. Zero means not autonomous at all — manual, a regular car. And five is fully autonomous.
And looking at the first few levels, I realized an interesting thing. They talk not about each feature in particular, but about combinations of features. For example, level one could be either adaptive cruise control — maintaining speed, but safely, right — or maintaining the lane, but not both. If it's both, it's already level two, right? And there are several levels. And, for example, this Carnegie Mellon paper — Andy Pavlo's paper from 2017 — discusses how to map this to database management systems. Well, a little bit — not much, in my opinion, but a little bit, in short. — This paper — I looked at it in advance; you shared these car things, I'll link them up
in the show notes as well. It struck me that level four to level five — the last step — is a huge jump. It's like: here are loads of features in the car that will help the driver, and then at level five, suddenly, the driver does nothing. That feels to me like a potentially huge chasm — maybe there are a hundred more levels in between four and five that we're going to need to break down at some point. It felt like a very hand-wavy way of saying: we already have level four features, so we're very close to having level five. And I was unclear, for example, whether in a level five car a human could still take control if needed, or whether there's absolutely no way of doing that. That feels to me like a level that wasn't defined. And maybe there are other levels... — I believe there will be, yeah. Yeah, let me
explain how I see it. And I think if you ask several people, they will answer differently. I also heard the Kubernetes ecosystem tries to map it as well, and some operators claim they have very high automation, but many people say they don't, and so on. So, in terms of cars, let me walk through it. Level two is both of those options together, for example, and this is what Tesla Autopilot does. I use it a lot: you just turn it on, but you must sit and — officially — you must keep your hands on the wheel and be ready to take control at any second, basically, right? But still, it's great: it maintains lane and speed, and you just relax and spend much less effort.
And I think we can think about this in databases as well.
The next level is level number three. At level three, everything is automated, but you still need to be ready to take control. And this is what, for example, Tesla Full Self-Driving is. Well — not quite everything is automated; it's under your supervision.
I've just pulled it up. No, no, no — it's not quite that. Level three — it says here, for example, it's a "traffic jam chauffeur". So it can handle basic traffic conditions, like a traffic jam, but it doesn't account for all weather conditions, for example, or a bunch of other potential things. So there are limitations. — Exactly. And you still need to take control if needed. Basically, you need to be ready to take control.
This is level three, I agree. So there are limits, but it can bring you fully automatically from point to point — this is what Tesla Full Self-Driving does. You sit in the driver's seat, and from point to point you can basically enjoy full automation of the whole route, right? But if some bad condition occurs, then you need to take control and fix things. This is —
Yeah. And, for example, we can map this to a full major upgrade — with zero downtime and so on. By the way, when we think about autonomy, we also bring in additional features, like zero downtime: the upgrade could be in-place, with downtime, but somehow our mind wants some good features in addition to autonomy. It's a natural desire to have good stuff, you know. But you can imagine, for example, that we have the whole thing automated, and in many circumstances — in many cases — it will work. But in some edge cases it won't, and you will need to take control and make some decisions before proceeding, or even postpone the whole procedure. This is very similar to Tesla Full Self-Driving, and I've experienced it: it really can drive you from point to point. But, for example, my property has some roads inside it, and it won't be able to drive there at all — because it's not a proper road anymore, you know.
So there is another level — four. It's also conditional, but there — my perception, again, might be wrong — you can go to the back seat and sit there; you are allowed to relax completely, fully. But again, it will work only under some conditions. For example, there is Waymo. I've tried it multiple times in San Francisco; it's amazing. It's a Jaguar — you go to the back seat and everything is fine. But we've all seen those YouTube videos: multiple Waymo cars just create a traffic jam themselves — basically a deadlock, right? — So if they sense that they can't drive, or they spot something they can't deal with, they'll stop as a safety precaution. But then what do you do? You have to get out. — Yeah, in this case, yes. In
the case of Waymo, you're a passenger, yes. So you can imagine, for example, if we map it to major upgrades: imagine there is a procedure developed, and there is a vendor who can intervene and take control sometimes, if allowed. But the passengers in this case — those who asked for full major upgrades to be performed — are passengers: they cannot make decisions. And this is good for them, because their minds are spent on product development, for example, right? And thanks to millions of miles of experiments and real-life experience for these cars: if something goes wrong, safety first — it will just abandon the trip. I mean, postpone it, cancel, and another car will come later, right? So this is the approach. But the whole thing is encapsulated — like a black box for you, right? You don't go down and make decisions according to some decision diagram, right? But it also has limitations, and I think Waymo is a perfect example of level four, because it works in San Francisco in some areas, but you cannot ride to San Jose — usually a drive of slightly more than one hour — because it's outside of coverage. And this can happen here as well, with major upgrades: if there are some extensions, for example, it's outside of coverage — "we don't support this kind of upgrade, because this extension (I don't know, Citus or TimescaleDB) requires an additional approach, and we don't have it covered here."
Right, so this is what I think here: we can map it — and why not? But my main insight, looking at the deep research on feedback from Oracle users, DBAs, and engineers: they say upgrades are great, both minor and major. Security is great — security controls, automatic procedures to level up security; these kinds of maintenance things are great. But when we talk about smart things, like index advising and so on, it's a hit-and-miss situation. And here I had an aha moment, because I was thinking: actually, if you look at what all these people try to do, they try to invent configuration tuning automated with machine learning and AI. This is what Andy Pavlo was doing with OtterTune. OtterTune was eventually shut down, but by the end of OtterTune's life, I had already noticed the shift — which happened to Postgres earlier — of attention moving to query tuning and optimization, like the creation of indexes. And I see pganalyze doing a great job, not resisting LLMs, which is also great. And some teams go inside Postgres and try to make the planner smarter.
And with configuration, for me it's pretty straightforward — it's the Pareto principle. Take PGTune (pgtune.leopard.in.ua): a very simple configuration, 80% of the job done in one percent of the time — really, really fast. It's just a heuristic-based, rule-based approach, for OLTP or anything. And it's good enough for many cases — we don't need any machine learning and so on.
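As an illustration of what this rule-based approach produces, here is a hedged sketch in PGTune's spirit — the values are invented examples for a hypothetical 16 GB OLTP box, not recommendations from the tool or from this episode:

```sql
-- Heuristic, PGTune-style settings (illustrative values only).
ALTER SYSTEM SET shared_buffers = '4GB';         -- ~25% of RAM; requires a restart
ALTER SYSTEM SET effective_cache_size = '12GB';  -- ~75% of RAM; planner hint, no allocation
ALTER SYSTEM SET work_mem = '16MB';              -- per sort/hash operation, per backend
ALTER SYSTEM SET random_page_cost = 1.1;         -- assumes SSD storage

SELECT pg_reload_conf();  -- applies reloadable settings; shared_buffers waits for restart
```

The point is that a handful of RAM- and storage-based rules like these get most of the benefit without any machine learning.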
And even — when you say "for many" — I think it's also about time: it will be good enough for a while. — Yes, yes, I agree. But then you need to tune.
And my way of tuning is to conduct experiments. You know this very well — how to make experiments faster, cheaper, reproducible, and so on. These kinds of things.
Is it worth it, though? Because I think many people would count level four as self-driving, but there are still enough caveats that a lot of the benefits don't materialize. Let's go back to cars briefly: I'm really excited and optimistic about self-driving cars. I love the idea of being able to get on with something else while something drives me. I don't mind that it might not be the fastest way, or that it might not drive completely optimally. It might not even pick the best route.
Not the safest? — What about the safest? — No — probably the safest, but not the fastest, sorry. It's probably safer than me, but maybe it would be a bit more gentle. You know, maybe it wouldn't take the yellow light when a human driver would — that kind of thing. But I love that you can just watch a movie, or chat to a friend easily, or play a board game in the back. You can do whatever you want. You get so much time back — especially in America, where a lot of time is spent driving. It makes so much sense to me that self-driving cars are a huge unlock
for a lot of people — but largely only at stages four and five, at that highest level. Cruise control is great, but I still have to concentrate; I still have to be watching the road. I don't actually gain that much. And if we go back to Postgres, I feel like a lot of the automation features are great, but we still have to concentrate. We still need the person; we still need the DBA. And as long as we still need the driver, and as long as we still need failsafes down to humans, all I see is a gradual need for fewer humans per server — maybe this is where the driving analogy breaks down a little bit. Like, maybe the DBA team for a company will be smaller on average compared to how it was in the past, and I think we've already seen that over time.
But I'm struggling with that last step, until we get to those levels — which feels to me like a long way off. Especially given the experience you're describing with Oracle, and the experience we saw with very smart people trying to automate a lot of this stuff with a lot of AI — and not even LLM stuff, right? A lot of the research in this area has been machine learning and other longer-researched AI methodologies that have lots of real-world use cases. And even there, we've seen mixed results. In the experience I've had talking to customers that used OtterTune, for example, I feel like the constraints were not as clear as in driving. Or the slightly different use cases, or the performance trade-offs that different people have in different cases, are subtle enough that you can't set the exact same guardrails for everybody. And at that point it breaks down enough that... oh, sorry, one more addition:
Performance cliffs are so real that if you change one thing and it looks like it's going to be great, as soon as you hit a cliff it's then a disaster. And then recovering from those disasters is actually a real problem. And I feel like troubleshooting disasters is also a problem; root cause analysis is a problem. And arguably they get harder when you involve automation, because the more it's automated, the less people actually know what was changed, when, and why.
I can't argue with you here — you must be an expert. But if you have automation, you move much faster. With a high level of automation: take Cursor, give it a lot of the pieces together, explain how you approach the methodology of analysis — this is what the expert needs to bring — and then you move much faster. But this is how I moved to this area completely, right?
Yeah, okay. So people say that in Oracle this works and that doesn't. What works? Quite simple things, as I said: upgrades, maintenance, security stuff. Well, not simple, but boring, you know. Of course, replication and backups: for me, HA and DR are like auto-steering — maintaining speed and maintaining lane, you know, the basics. Cars must do this, so a database must be good at HA and DR. If we look closer, actually, there are issues with both HA and DR which will prevent us from reaching a very high level of automation, but we can dive into this later. Anyway: the boring stuff lacks automation. And remember I mentioned that levels one and two talk about combinations of features — so if we start analyzing each feature individually, we cannot apply the same classification, because the classification talks about combinations of features.
Yeah, sure.
Coming back, actually — just to make sure I understand — what has Oracle done in terms of security? Can you give some example automation features? — Well, I know little. I would rather say what we should do in Postgres. This is the last topic I would pick — I'm ready to discuss it now because it's on the roadmap, but right now we're focusing on
different areas. I can just speculate on this: identify potential threats — checking permissions, roles. For example, I know organizations which use the same superuser for all human DB engineers, and this is quite easy to identify. Or take an organization going through an IPO process — I had a couple of them on consulting contracts. Before an IPO you have an audit, right? And during the audit they ask specific questions — some of them quite silly, I would say, but some of them good enough. And if you just inspect your pg_hba.conf, inspect your user model, inspect how multi-tenancy is organized — we had an episode about that, right — these kinds of things can be analyzed automatically, and so on. I don't know the details about Oracle; I just saw feedback that engineers really appreciate this stuff, and they appreciate less the automated configuration and
automated creation of indexes. — Did they appreciate it less, or is it more that when it gets it wrong, it's more painful? — That's what I mean: mixed results. Mixed results, you know — there's a lack of trust in some minds. Yes... I don't know. Yeah.
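As an aside, the role-audit checks Nikolay sketched a moment ago — for example, spotting one shared superuser used by all engineers — can start as plain catalog queries. This is a hypothetical illustration, not code from any tool mentioned in the episode:

```sql
-- Which roles are superusers, and can they log in directly?
SELECT rolname, rolsuper, rolcanlogin
FROM pg_roles
WHERE rolsuper;

-- Who is connected as a superuser right now? A single superuser role
-- shared by many humans would show up here with many sessions.
SELECT a.usename, count(*) AS sessions
FROM pg_stat_activity a
JOIN pg_roles r ON r.rolname = a.usename
WHERE r.rolsuper
GROUP BY a.usename;
```

Checks like these are cheap, read-only, and easy to put into periodic automated analysis.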
I, for example, catch up with customers from time to time, just to hear what they're doing, what they like about products, and what they don't. And I was speaking to a customer of mine that did try to use OtterTune for a while. It'd be interesting — maybe we should invite somebody from that team on to discuss what happened there. Like, why did it shut down? — I can tell you why: people don't need configuration tuning that much.
Okay, well, I also think there might be other issues. — I'll tell you: a big need for configuration tuning exists only if you have, say, 10,000-plus clusters. Then you can say, okay, we're going to save 5 to 10 percent of the money just with tuning, or the workload will be reduced. And — we know this very well — one really bad plan can screw up all the efforts of configuration tuning.
Well, yes, but I think it's worse than that. In addition to it not being needed that often — and therefore subscription-based models not working that well — this customer was telling me they moved from a mental state of "when something goes wrong, let's dive into what happened" straight to "when something goes wrong now: what did OtterTune change?" And that was a real shift. It became a trust issue — but one based on the fact that it had made changes in the past that made things worse. So it's not distrust for no reason: every now and again, when you change something, you hit a performance cliff, and it's unexpected. Or maybe it's not even always a performance cliff — maybe it has another unintended consequence that you care about more. Probably not in a lot of these cases, but they talked about, for example, making the mistake of letting it configure some parameters that even affected durability and things like that. So depending on what you allow it to change, there might be unintended side effects. And putting guardrails around what you will and won't let it do is actually harder than it sounds, I think.
It's very hard to perform the enterprise approach to making a change — it's extremely complex. I'm very grateful that seven years ago I was working with Chewy: they were preparing for an IPO, and I remember the CTO was ex-Oracle, and the discussions we had — and the resistance they had to any change I proposed — taught me this enterprise approach, you know. I'm very grateful; it was a great experience for me. And I realized: actually, if you want to be serious about changes, any small change should be very thoroughly tested — experiments, experiments. All risks must be analyzed, and then there should be a plan to mitigate if a risk materializes, right? And AI almost doesn't help here, you know — this is a framework you need to build without AI first.
I don't necessarily agree
I think AI could really help with these things
when I say AI I'm including machine learning
and not just the latest LLM stuff
I just think we need to define
constraints really clearly
and define what we care about really clearly
and make it really clear we care more about reliability
and durability than we do about performance
so that's almost always true
and I think that might be closer to the core reason why these performance tuning tools haven't yet succeeded: we haven't yet nailed the reliability and durability stuff. So that would be my theory as to why they didn't necessarily succeed.
Because even if they did help performance almost all the time, if they ever hurt reliability, that's not a trade-off most organizations are willing to make. And that's a difficult thing to tell a tool that's trying to optimize for better performance.
Yeah.
Yeah, I agree.
And durability has issues. There is a good article from Sugu, just published, about synchronous replication issues. And we also know the very good talk by Alexander Kukushkin about issues with synchronous replication. So durability is a must-have. And the targets — I agree with you: durability, availability, reliability must be number one, before performance. By the way, I also remember
from that research that people appreciate automated analysis and control of costs. And this can initially be quite simple. I remember actually talking to one huge organization; their database director told me: you know what, it's cool stuff, what you're showing in terms of experiments and performance tuning — query tuning experiments with DBLab and so on. But the number one problem we have is abandoned instances, and how to stop doing that and losing a lot of money. In big organizations this is a very big problem. And yeah, cloud providers still don't offer good tools; it still takes a lot of effort to understand the costs — the structure of spending, right? In practice, it's usually realized too late, when they're no longer interested. Right. So, anyway — back to my aha moment. Yeah, sure: people from academia — really great people,
great minds — they try to build really cool stuff: let's have automated parameter tuning, automated indexing, or even let's go inside the planner and create an adaptive query optimizer. — I saw even more extreme: in the paper, they were talking about choosing whether tables should be row-oriented or column-oriented based on the workload they're observing. — So they try to attack really cool areas, right? This is great, and it has always been so: academia tends to attack things that are really detached from reality.
Meanwhile, I realized that we had already implemented automated reindexing with multiple teams, and this is what people really need. And lately, in consulting, I realized we almost always say "you need automated reindexing", but we didn't have a polished solution. We have multiple ways to do it, and I always said: take this tool — someone from our team developed it as a side project — and polish it. And it's only about btree reindexing, and this and that. Oh, and also estimates: bloat estimates might be off. We know this very well — they are just estimates, not exact numbers like real index sizes measured on a clone after a fresh rebuild, right? And I realized: actually, this problem is not solved. And it's a boring problem. And we can solve it.
That's why the number one thing we're going to release right now — it's about to be released — is what we call pg_index_pilot. And there is a good roadmap inside it; it's a whole project. And it's going to be really simple. A guy who works part-time with me right now, Maxim Boguk — one of the most experienced Postgres DBAs I know, much more experienced than I am — created what I basically consider the prototype. I said, let's fork it, and then we started iterating on it. The idea is simple — I call it the "Boguk number". You take the index size and divide it by the number of live tuples in the table (let's forget about partial indexes for a while), and we have some ratio, right? When we've just created the index, let's consider this ratio perfect. And let's consider only tables which exceed, say, a million rows. Checking this number costs nothing — you can put it into monitoring for all indexes. It's super fast to get, because these aggregates are already stored, right? Yeah — you get it immediately from the system catalogs. And then, over time, you see degradation of this
number. Why? Because some pages inside the index are not full, right? They're half empty, and so on — they're sparse. So it means at some point you say: oh, it's time to reindex.
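A minimal sketch of this ratio as a catalog query — my own illustration of the idea described here, not code from pg_index_pilot — computing index size per live tuple for tables over a million rows:

```sql
-- "Boguk number" sketch: index bytes per live tuple in the parent table.
-- Cheap to compute: sizes and n_live_tup come straight from catalogs/statistics.
SELECT
    i.indexrelid::regclass AS index_name,
    pg_relation_size(i.indexrelid) AS index_bytes,
    s.n_live_tup,
    round(pg_relation_size(i.indexrelid)::numeric
          / nullif(s.n_live_tup, 0), 2) AS bytes_per_live_tuple
FROM pg_index i
JOIN pg_stat_user_tables s ON s.relid = i.indrelid
WHERE s.n_live_tup > 1000000   -- only large tables, as discussed
ORDER BY bytes_per_live_tuple DESC NULLS LAST;
```

When `bytes_per_live_tuple` drifts well above the baseline captured right after the index was built or rebuilt, the index has grown sparse and becomes a reindex candidate.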
And the best... well, there are pros and cons of this approach compared to, say, the traditional one everyone is using, based on bloat estimates. A couple of big cons of this approach: it required some effort from us to get away from superuser — and we did it. And, oh, one more thing: on purpose, we decided this thing is going to live inside the database — self-driving, inside the database. We don't need external means, like something installed onto the instance, or lambdas. We don't need anything; it will all be inside.
This means it's running inside PL/pgSQL code. We know that since Postgres 11, stored procedures have had transaction control, right? So we can go. But we need REINDEX CONCURRENTLY, right? And REINDEX CONCURRENTLY cannot be wrapped inside a transaction block. So, unfortunately, we need something like dblink, right? And it's a challenge to do this properly on RDS, for example, because with dblink you need to expose a password, and you don't want to store it in plain text. So I remembered a very old trick I used many years ago: dblink over postgres_fdw. And this is how we do it right now.
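A rough sketch of that trick — all names here are hypothetical, and real setup details vary (especially on RDS): a postgres_fdw foreign server plus a user mapping keeps the password in the catalog, and dblink accepts the foreign server's name in place of a connection string, giving a separate session where REINDEX CONCURRENTLY can run outside the caller's transaction.

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE EXTENSION IF NOT EXISTS dblink;

-- Loopback server pointing at the same database (hypothetical names).
CREATE SERVER index_pilot_loopback
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'localhost', port '5432', dbname 'mydb');

-- Credentials live in the user mapping, not in SQL text or application code.
CREATE USER MAPPING FOR CURRENT_USER
  SERVER index_pilot_loopback
  OPTIONS (user 'index_pilot', password 'secret');

-- dblink can take a foreign server name instead of a connection string
-- (the caller needs USAGE on the server), so no password appears here.
-- The command runs in its own session, outside any transaction block,
-- which is exactly what REINDEX CONCURRENTLY requires.
SELECT dblink_exec('index_pilot_loopback',
                   'REINDEX INDEX CONCURRENTLY my_index');
```

The design choice is that everything stays inside the database: the PL/pgSQL driver code and the out-of-transaction reindex both run with no external schedulers or lambdas.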
And there is another limitation for this thing to start working: you need a baseline, right? A baseline means you need to reindex everything first, or bring this data from a clone — an idea we're going to implement very soon, to avoid the full reindex, because sometimes our customers have 10-plus-terabyte databases, and reindexing everything is not cool: it will take forever, and so on. There's also a big impact when you reindex a lot quite quickly. But a good benefit of this approach: it's not only about btree. You can take care of GIN, GiST, and even HNSW and others, if they degrade. Basically, we measure a kind of storage efficiency for an index. It works super
very well. And I think I believe into this simple approach. I think we are going to have it. And I think also like I talked about this last couple of weeks ago with Andrean Kirk on Hockey PostGus Hiking on PostGus TV. And we started doing something mind blowing. And Ray just said, let's just implement merge. Because you know B3 and B3 implementation in Postgres, it can, it has only split. It cannot merge pages.
And since Andrey's Ph.D. work is in the area of indexes, it was great to have the ability to start this work. And I'm looking forward to it, like everyone.
Wait. So if, for example, an area of the index starts to get sparse because we've deleted a bunch of data, let's say we've deleted some historic data and we're not going to insert historic data back into that part of the index, it could proactively fix that? Like, it's kind of self-healing.
Exactly. Wow, cool. So our project would be archived at some point. I hope so. I'm not an expert in Oracle. I have never been an expert in Oracle, and not a user either; the last version I used was in 2001 or 2002, it was 8i. It was so good. But people say Oracle doesn't have this need to re-index. SQL Server, I heard, does have this need.
Over time, indexes decline.
We talked about it so much.
I said, this is like a mantra.
Like, everyone needs re-indexing at some point.
And we know the work of Peter Geoghegan and Anastasia Lubennikova in 13 and 14;
they implemented deduplication and other improvements.
That's great.
But still, there is no merge.
So pages cannot be merged.
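To make concrete why a split-only B-tree accumulates sparse pages, here is a deliberately crude model, nothing like the real nbtree code (real splits, VACUUM, and bottom-up deletion are far more subtle): inserts create new pages when the current one fills, but deletions only shrink entry counts, so the page count never goes back down.

```python
# Crude model of split-without-merge behavior. Each "page" is just a
# count of live entries; inserts start a fresh page when the last one
# is full, deletes only decrement counts -- pages are never merged.
PAGE_CAPACITY = 100

def insert(pages: list[int]) -> None:
    # Rightmost insert: when the last page is full, "split" by starting
    # a new page; nothing ever merges pages back together.
    if pages[-1] == PAGE_CAPACITY:
        pages.append(0)
    pages[-1] += 1

def avg_fill(pages: list[int]) -> float:
    return sum(pages) / (len(pages) * PAGE_CAPACITY)

pages = [0]
for _ in range(10_000):
    insert(pages)
fill_after_inserts = avg_fill(pages)  # append-only load packs pages

# Delete 90% of entries (e.g. dropping historic data). Entry counts
# shrink, but the number of pages stays the same -- sparse pages linger
# until a REINDEX rebuilds the structure.
pages = [p // 10 for p in pages]
fill_after_deletes = avg_fill(pages)

print(len(pages), fill_after_inserts, fill_after_deletes)
```

A page-merge operation, the thing being proposed here, would let the second number recover without a full rebuild.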
And there was also some work, I think,
from Peter Geoghegan and others to help avoid page splits in more cases.
Yeah, not just the de-duplication work, but yeah, the bottom-up deletion.
So, yeah, kind of avoiding them getting into this state was one thing.
Helping them heal when they do is another.
It also strikes me as, like, a lot of this stuff could live in core, right?
We already have some automation features, right?
We already have auto vacuum, which does about three or four different jobs.
So we have some groundwork here already.
We have tools like pg_repack and pg_squeeze,
not in core, but there are ideas about moving more of their features into core. So we do have some of this in core, and some of it in popular extensions that are supported by many clouds. So it feels like the project is already naturally going in this direction, maybe slowly, and maybe not in all the areas you'd want. But what does the end goal look like here? Is being in core the ideal? Well, in my opinion, we just need to
solve a very complex problem. I know that in some big companies' managed Postgres services, sometimes one experienced DBA is responsible for a hundred thousand clusters. It's insane. But we need to be prepared to be responsible for a million clusters, because builders will bring us a lot of clusters; the times are changing really fast. So Postgres needs not to lose this game. By the way, right now if you check Hacker News trends: it was growing, not only in job postings, as I usually mention, but everything, all discussions on Hacker News. It was growing until the beginning of last year, 2024, and then there is a slight decline. And I think Postgres right now has a huge challenge.
It needs to be much more automated.
If things need to be in core, it should be in core.
But not everything should be in core.
We know autofailover is still not in core, right? Patroni, right? But Patroni — I just asked Kukushkin: has he considered automating zero-downtime switchover? Because this is what people want and expect from a highly automated system. He said no, and I started making a joke, because I said Patroni lives outside of Postgres because it's not the job of Postgres to do HA autofailover. And now I expect you, the Patroni maintainer, to tell me that automatic zero-downtime switchover is not the job of Patroni, so I need to implement another layer on top of it, right? This is what we actually already do for zero-downtime upgrades.
We do this, and by the way, I can share this already: on our consulting page we mention GitLab, but also Supabase, and also Gadget.dev. These companies took this from us, and I have official quotes from their managers, so I can share this news. I think we developed really great automation for zero-downtime upgrades, which are not only zero-downtime but also, of course, zero data loss — and reversible, reversible without data loss as well. So these are perfect properties, but it requires a lot of orchestration. Fortunately, since Postgres 17, things are improving and fewer pieces need to be automated. So let me go back and finally explain my aha moment and the vision
I have right now.
I realized that boring stuff needs to have a much higher level of automation.
This is one.
Second, I realized that this is exactly what we at Postgres.AI are doing, because with consulting, people bring us these topics. And I realized also that in every area, if we think about automation of a feature, we can apply a simplified approach to classification.
So if every single step must be executed manually — a CLI call, like a pg_dump or pg_upgrade call, or SQL snippets run by an engineer — this is manual, right?
Then there are bigger pieces: say the whole procedure consists of two or three big pieces, and they are combined, and the engineer only makes the choice whether to proceed between pieces. For example, our major upgrade consists of two huge pieces: physical-to-logical plus upgrade, bundled for specific reasons, and switchover. Two big steps, and inside them there is a high level of automation — you just call a playbook and it executes, right? In this case it can already be considered level one, say, or maybe two, I don't know.
And then, if we can fully relax and go to the passenger seat and just say, okay, I approve that you need to do everything, but you will do it yourself — I mean, Postgres or an additional system will do everything itself, like a full major upgrade with switchover — in that case it's, say, the next level, and you're in the passenger seat. But there are limitations: if it encounters some problem, it will stop, revert, postpone, and you need to approve things.
The highest level: the system itself decides, oh, it's time to upgrade, and then it schedules it — oh, we have a low-activity time on the weekend. You are notified about it, and you can probably block it, but it moves by itself.
This is the highest level of automation.
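The classification sketched here maps loosely onto SAE J3016's driving-automation levels (linked in the show notes). One way to write it down as code — the level names, the boolean criteria, and the boundaries are my own paraphrase of the discussion, not an established taxonomy:

```python
from enum import IntEnum

class AutomationLevel(IntEnum):
    """Rough levels of database-operation automation, by analogy with
    SAE J3016 driving levels. Labels paraphrase the episode; they are
    not a standard."""
    MANUAL = 0      # every step is a manual CLI call / SQL snippet
    ASSISTED = 1    # a few big automated pieces (playbooks); an
                    # engineer approves the transition between them
    SUPERVISED = 2  # one approval up front, then fully automatic;
                    # stops, reverts, postpones on any problem
    AUTONOMOUS = 3  # the system decides *when* to act, e.g. schedules
                    # the upgrade for a low-activity window; the human
                    # is only notified and may veto

def classify(steps_manual: bool, approval_between_steps: bool,
             self_scheduling: bool) -> AutomationLevel:
    if steps_manual:
        return AutomationLevel.MANUAL
    if approval_between_steps:
        return AutomationLevel.ASSISTED
    if not self_scheduling:
        return AutomationLevel.SUPERVISED
    return AutomationLevel.AUTONOMOUS

# The major-upgrade example from the episode: two big playbook pieces
# with a human approving in between -> ASSISTED.
print(classify(False, True, False).name)
```

The point of making the levels explicit is that each feature area (re-indexing, upgrades, partitioning) can be graded and pushed up the scale independently.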
And back to re-indexing, which I chose as the lowest-hanging fruit.
Everyone needs it.
Nobody among managed Postgres providers has it — maybe after our episode, people from RDS and Cloud SQL (I know some of them are listening to us) will rush into implementing this.
I just see everyone needs it.
Nobody has it in terms of managed Postgres.
Nobody offers it.
So I'm aiming for the highest level of automation for pg_index_pilot right away.
It will decide when to re-index, it will re-index, and it will control activity levels so as not to saturate disks, and so on.
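"Controlling activity levels so as not to saturate disks" usually means pacing the work. A minimal sketch of one common pattern — sleep between items in proportion to how long each one took, the same "cost delay" idea autovacuum uses. The function names, the generator shape, and the 0.33 duty cycle are illustrative, not pg_index_pilot's API:

```python
import time

def paced(work_items, duty_cycle: float = 0.33):
    """Yield work items one at a time, sleeping between them so that
    active work is roughly `duty_cycle` of wall-clock time (0.33 means
    sleep about twice as long as each item took). Illustrative only.
    """
    for item in work_items:
        start = time.monotonic()
        yield item  # the caller does the actual work between yields
        elapsed = time.monotonic() - start
        # Sleep so that elapsed / (elapsed + sleep) ~= duty_cycle.
        time.sleep(elapsed * (1.0 - duty_cycle) / duty_cycle)

# Usage sketch: reindex_one() would issue REINDEX CONCURRENTLY for one
# index over dblink; here we just simulate the work with a short sleep.
def reindex_one(index_name: str) -> None:
    time.sleep(0.01)  # stand-in for the real re-index

for name in paced(["idx_a", "idx_b", "idx_c"]):
    reindex_one(name)
```

Measuring the actual elapsed time of each step (rather than sleeping a fixed amount) makes the throttle adapt automatically when the storage is slow.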
You can check the roadmap — I explained it in the README of this product, and it's open source, because I actually truly believe that at some point this project won't be needed, if Andrey's idea for merge works. Maybe, I don't know; this is a dream, right? Then forget about index bloat.
Yeah. I guess it will be super hard. I tried to research why it hasn't been done, and I didn't see why. Maybe it's scary — this is the basics,
like the foundation of the whole of Postgres, of any cluster, right? The B-tree.
Yeah, well, I guess: when would it be done?
Would it be done by a background worker, like autovacuum?
Or at what stage does it make sense to do it?
Well, it should be synchronous.
Oh, I know this is a good question.
I'm going to ask Andrey about this.
Yeah.
Because split is synchronous.
Yeah.
I also think the thing you brought up just at the end there is super interesting.
Like controlling the rate so as not to saturate disks.
This is an interesting tradeoff. Again, if you're on Aurora Postgres and you're paying some amount per IO, you probably don't want your indexes rebuilt constantly, all the time. What you really want is to fix the root cause. It's probably an application issue, like: why are you updating the same row thousands of times a second? What's the root cause of the bloat?
So we just had a case of updating multiple times — a queue-like workload. This can be identified.
Yes, yes.
Yeah.
So I'm excited to see what you build.
I think you're right that more automation in Postgres would be good.
More automation in extensions around Postgres would be good.
But a lot of the issues I see are kind of application source issues.
And even if we make Postgres completely self-driving, there is still this kind of application-level issue that will be...
A mistake there can put all your efforts into the ground.
Yeah, I agree.
That's why — I envision, I identified 25 areas in my blog post, yeah. During our launch week, maybe we will adjust those areas; of course, it's not a final list. We are targeting three right now, and I hope we will expand this list to five or six next year. Once we have quite good building blocks, we will think about the whole system, but the central part of all these building blocks is a new monitoring system
which is good both for AI and humans.
And this is what we are already actively building.
We replaced all the observability tooling used in our consulting with the new monitoring system,
which is called postgres_ai monitoring.
We talked about it separately, right?
And this is going to be the source of insights into why things are wrong.
And then — we cannot self-drive the application yet, right?
Although in some cases it might be possible.
Because if, for example, Supabase — they control not only Postgres but also the REST API level, right?
So some things might be done there.
But in the general case, we cannot.
So in this case, we just need to advise, and so on.
But eventually, things like serverless — like Vercel workloads — I see they can live together with the database eventually.
In that case, we can discuss full self-driving of something, right?
But we are very far from that, I think.
Yeah.
And we have limited resources.
I wanted to admit that I'm targeting very narrow topics right now.
So very boring, but really doable and we see demand.
Things like partitioning, help with partitioning.
It's a nightmare for an arbitrary engineer to do it.
We say, oh, declarative partitioning.
But, for example, partitions are not created automatically.
Okay, there is pg_partman,
but you need to combine these pieces somehow — the level of automation there is terrible.
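Part of what pg_partman automates — pre-creating future partitions — is mechanical. A sketch of generating the DDL for upcoming monthly range partitions; the parent table and the `parent_YYYY_MM` naming scheme are hypothetical, and real tooling must also handle retention, indexes, and the default partition:

```python
from datetime import date

def month_bounds(start: date, months_ahead: int):
    """Yield (first_day, first_day_of_next_month) pairs."""
    y, m = start.year, start.month
    for _ in range(months_ahead):
        ny, nm = (y + 1, 1) if m == 12 else (y, m + 1)
        yield date(y, m, 1), date(ny, nm, 1)
        y, m = ny, nm

def partition_ddl(parent: str, start: date, months_ahead: int):
    """DDL for pre-creating monthly range partitions of `parent`
    (hypothetical naming scheme: parent_YYYY_MM)."""
    for lo, hi in month_bounds(start, months_ahead):
        name = f"{parent}_{lo.year}_{lo.month:02d}"
        yield (f"CREATE TABLE {name} PARTITION OF {parent} "
               f"FOR VALUES FROM ('{lo}') TO ('{hi}');")

for stmt in partition_ddl("events", date(2025, 12, 1), 3):
    print(stmt)
```

The automation gap is exactly that someone has to run something like this on a schedule; declarative partitioning in core gives you the syntax but not the scheduling.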
Yeah, but these are the kinds of things — a couple of things.
I'd love to see more of this go into core.
I think there have been improvements to partitioning in core, many over the last few years, and continuing.
And the other thing is: I think there's a lot of focus on fewer people.
With this automation, a lot of the focus is on "we can do more with fewer humans."
And I'd love to see more effort into what would happen if we didn't try to reduce the humans,
but instead tried to increase the reliability — you know what I mean.
And I think, too, there's a lot of talk about how we cut costs, and not enough...
Yeah, we already have, I think, five companies who are very fast-growing AI companies,
very well known already. Yeah. And I see a very different approach to Postgres. For them,
it's a natural choice: let's start with it. But then: oh, we need to partition
terabyte-size tables, let's estimate the work. Oh, that much work? Okay, maybe we should migrate
to a different database system. They are moving really fast, and they're not attached to
Postgres, basically. So I'm scared for Postgres. That's why I'm saying there should be a massive shift in automation right now, and the resources of my team are not enough.
So that's why I'm talking about it. I'm going to put my whole focus on it. During consulting, we also choose only the paths that are highly automated, fully automated. And I'm looking for more use cases, maybe more partners who think similarly, right? This is a call to the Postgres ecosystem. It's not enough right now. Postgres might lose this game.
Again, Postgres won multiple challenges — object-oriented, NoSQL — but AI... I think everyone thinks about AI only in terms of pgvector or storing vectors. It's not that; not everyone means vectors. People build some app, it needs a database, maybe with vectors, maybe without vectors, but what they expect is a higher level of automation, easier scaling. So I'm super happy to see new innovations in the area of sharding, right?
Yeah, I think that's actually — I do think that's more important, partly just for the marketing story.
Than partitioning?
For those, yeah. Because if you think about partitioning — for example, when we had our 100th episode and spoke to Notion: 100 terabytes. Yeah. Notion skipped partitioning and went straight to sharding. They decided that with all of the partitioning downsides — and there are quite a few limitations for their multi-tenant architecture — it actually made more sense to shard instead of partition. And it felt
like it wasn't one of those things where they were moving super quick and made a rash decision.
It felt like a really considered engineering decision. And I actually think it was the right call.
And I wouldn't be surprised if, with more of these companies coming in and building fairly seamless sharding that doesn't add too much — doesn't add any complexity at the application level — people were tempted into that, even if it wasn't for good engineering reasons, just for the marketing reason of "I don't have to think about scaling this, and I don't have to deal with terabyte tables."
You know this normal distribution meme, right?
Yeah, yeah, yeah.
Yeah, usually it's unexpected what's on the right side, where the expert sits.
So I don't know which camp I'm in.
I was thinking Postgres on one node is cool enough.
Then I was thinking it's not cool enough because of performance cliffs.
Now I'm thinking maybe it depends, because in some cases
it's much safer to avoid all performance cliffs, rather than just allowing 100-plus thousand transactions per second on one node.
And — how is it called — resiliency, right? If one node is down, only part of the system is affected; it's just one shard.
Maybe, yes. But at the same time, it's so cool to see projects which require only one node. They are isolated;
they don't need a whole bunch of shards and clusters — a cluster of clusters; the "cluster" term is
heavily overloaded, right? Yeah. And then you think: oh, see the power of having just one single
node. Sometimes without replicas — I see projects without replicas, because cloud times change,
right? And I think I'm open, you know. I know my perception changes over the years; sometimes it's a
pendulum. So it depends, you know — the regular answer, the normal answer from consultants: it depends.
But sharding — it's really great to see that it's coming from multiple teams; there will be competition.
So, yeah, here the future looks bright. But who will be helping to choose the proper schema for sharding?
Well, yeah.
And to rebalance properly?
Well, yeah.
So I think those who build this are thinking about that as well — automation, choosing the time for rebalancing, fully automated.
Well, we talked to Sugu.
Yeah, he said it's inevitable that you're going to have to change a sharding scheme at some point.
So designing for that up front seems really important.
Vitess handles it.
So, yeah, interesting times ahead.
I'm a bit worried about the complexity of these systems.
Personally, I quite like — well, starting simple. But also I like the idea that if I'm a Postgres user, I can still understand the system, roughly what's going on, even if I'm a full-stack developer, even if I do the front end and the back end. I know it is already very complex; I know there are already a lot of internals you kind of need to know about to make a performant system. But I hope we can hold on to that for a while.
Well, I truly believe that what my team does is going to help, because we observe many problems, and every time I'm saying: guys, we need to write this down somehow, and show it with an experiment, so users
understand what's happening.
Write some how-to for next time, right?
And when we were writing how-tos recently,
we wrote them both for humans and AI.
So next time some RCA — root cause analysis — is happening,
if you have our how-tos injected into your Cursor, for example,
it's going to take into account the situations
which are written down: how to reproduce them,
how to troubleshoot them, right?
So I think something else is coming,
I'm not going to spoil it, but RCA and troubleshooting is one thing we will attack early.
We are preparing pieces for this, you know, and one of the key indicators of success here
is that non-experts understand what's happening, you know, because...
Yeah, well, that's the area that I specialize in.
You know, I'm actually betting a little bit on humans staying in the loop for quite a long time, and that there will — well, not always, but for a long time — be categories of issue that we still need somebody to look into. Yeah. And I kind of feel like the median level of Postgres knowledge for the people having to look into those issues is probably going to go down, based on all the trends you're talking about. Maybe it depends: if we only have a few experts shared between lots of companies, maybe that's not true. But if we have a lot of individuals starting with vibe coding, or doing full-stack — kind of single-person companies running the whole product start to finish — those guys have to know a lot about a lot of things. They can't know all of that.
Have you seen the numbers Supabase shared — how many clusters they register, and how fast it's growing? Yeah, can you imagine the average level of Postgres knowledge there?
Yeah. But the idea — my vision — is that with AI we collect knowledge pieces, we experiment, we automate experimentation, and so on. But then, obviously — this is what I see with our customers — some human is needed to explain things properly to other humans, to answer questions properly,
you know, to build trust and confidence, and so on.
Yeah, but my question — I guess the question then is: where do the tools live?
Can the end user use a tool to get help?
Or does the Supabase team use a tool to get help?
Or the consultant that the Supabase team employs?
That's my answer to "who's the tool for?"
I'm just saying that it depends who you count as the user, in terms of how much Postgres knowledge they have.
I'm talking about the end user.
Yeah.
For example, with DBLab, we already went down this path.
We moved from a couple of guys answering all the questions
backend engineers have about how a plan works and what to do about it,
which index to create.
Now, with DBLab, backend engineers experiment themselves.
And only if something is unclear do they call an expert for help.
But like 90% of questions are answered by backend engineers
without involving Postgres experts, you know, and the expertise in backend engineers' minds
grows as well. I'm just thinking this approach we had for query optimization can be applied on a
grander scale to many other areas. Yeah. All right, probably enough for today. Thank you so much.
We went much deeper into specific areas than I expected, and I enjoyed it a lot. Thank you so much.
Nice. Catch you next week.