Postgres FM - Schema design checklist

Episode Date: April 17, 2026

Nik and Michael discuss a list of things to check when designing new schema in Postgres. Here are some links to things they mentioned:

- Use BIGINT in Postgres (blog post by Ryan Lambert) https://blog.rustprooflabs.com/2021/06/postgres-bigint-by-default
- Postgres 18 and UUIDv7 (blog post by Gwen Shapira) https://www.thenile.dev/blog/uuidv7
- How to use UUID (how-to guide by Nik) https://postgres.ai/docs/postgres-howtos/schema-design/data-types/how-to-use-uuid
- Our episode on constraints https://postgres.fm/episodes/constraints
- Our episode on NULLs https://postgres.fm/episodes/nulls-the-good-the-bad-the-ugly-and-the-unknown
- MultiXact member space exhaustion episode (with Metronome) https://postgres.fm/episodes/multixact-member-space-exhaustion
- Our Column Tetris episode https://postgres.fm/episodes/column-tetris
- Saving Space Basically for Free (blog post by James Coleman from Braintree) https://medium.com/paypal-tech/postgresql-at-scale-saving-space-basically-for-free-d94483d9ed9a
- Over-indexing episode https://postgres.fm/episodes/over-indexing
- Under-indexing episode https://postgres.fm/episodes/under-indexing
- Gadget's use of Postgres https://postgres.fm/episodes/gadgets-use-of-postgres
- Partitioning episode https://postgres.fm/episodes/partitioning
- RLS vs performance episode https://postgres.fm/episodes/rls-vs-performance

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
- Michael Christofides, founder of pgMustard
- Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
- Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello, hello. This is Postgres FM. My name is Nik, Postgres.AI. And as usual with me, Michael, pgMustard. Hi, Michael. Hi, Nick. So the topic... I'm doing great. How are you?
Starting point is 00:00:14 I'm good, thank you. What's the topic? What's the topic? The topic I chose, and everyone already saw it on the cover you created. So the topic is five or so, maybe six, things you should check when you, say, start a new schema, or you already have some schema and your project is growing and you want to be better prepared for bigger scale, performance-wise. So what, especially if you use AI to design and improve
Starting point is 00:00:45 your schema and your project is growing right so what should you look at to check your schema health makes sense just on the topic in general I do think there are something things that are better suited to asking for help from AI and not. Do you think like schema design is one of them or actually is it one of those ones that maybe we should still do it a little bit more ourselves at the moment? Well, as you know, I'm very, very, very intensive user of AI. Yeah. Honestly, like I do everything through AI right now. And I'm 100% schema design is a topic where I can help a lot. But it also matters which questions you ask. If you just say, I want this and don't pay attention, you will have low quality schema in many cases still. That's why,
Starting point is 00:01:40 as I said, let's have this episode because it might be quite basic for many advanced users of Postgres. But this checklist we are going to present today, it can be good thing to keep in mind always when you design schema just to ask right questions and revisit. Also, I'm not sure how exactly you use AI. I use it always like I work with people, developers. When you just ask something and think they, those engineers, humans or AI, will solve all your problems and make all the decisions. You're affording full responsibility to their shoulders.
Starting point is 00:02:20 Then you have big negative surprises always. So it's better to keep requests, as detailed as possible when you work with AI and also make some iterations to review and improve. And when making iterations, like first, like I want some schema, like, I don't know, like LinkedIn social media schema or something, right? It will give you immediately something, but then you want some details to be checked. And this is exactly when this checklist can be very helpful. So you can revisit and think and imagine when you will have. gigabytes and terabytes of data, data, what will happen? So to avoid painful refactoring in the future
Starting point is 00:03:05 and outages maybe or performance degradation, the price you can pay in the beginning is much lower than the price you pay later. Right. So I never, I never work with AI in one shot. I always make iterations to improve in any question and to revisit and sometimes combine multiple LLMs. right so this is exactly when you when you can use this list of questions so first the first topic is the data type of primary key yeah and I feel like the answer to this has changed my or my opinion on what the best default for this has changed actually in the last few years just off the back of learning more about timestamp ordered uIDs right this is this used to be an age old debate right uID V
Starting point is 00:03:56 versus integer or big integer, really. And most of the kind of blog posts I would see are like trying to just encourage people to use big int over int. But now that we have first class support for time-ordered UIDs, I don't see many reasons not to go with that. What do you think? Well, the size, eight, is double? Yeah. Yeah, your ID is a 16 bytes. So it's quite a big price to pay.
Starting point is 00:04:24 in each row you need to have eight more bytes. In many cases, it's fine. Well, indexes will be also bigger. Indexes double, yeah. It might make sense if volumes are huge. But anyway, to keep things simple, the rule to remember, red flags, if you are prepared for really large volumes of data, it must not be integer four.
Starting point is 00:04:51 Yeah. It should be integer. 8. And if in terms of UID path, data type is always UID for both version 4 and version 7, but you can check default how it's generated. And in modern PostGus, there is a function, I think in 18, only in the latest version. I think so. It didn't quite make 17, did it? So if you're on Postgres 18, so it's better to use UUID version 7, but also remember trade-offs because there is timestamp there and values will be ordered like with integer or big integer so there is a nuance here but performance-wise it's much better than than your
Starting point is 00:05:36 version 4 there are many articles about it and you just you you just need to check okay are we going to go with integer then it's in 8 or big int right are we going to go with uID then default generation default should be version 7 for Postgres 18. For older version it's also not an excuse to stick with version 4. You can write your own function and ask AI to write your own function. Or generate them outside of the database. I slightly prefer this option less because if you have multiple application nodes, who knows what will happen with their clocks and since you ID version 7 includes timestamp inside it,
Starting point is 00:06:24 Order might be not, I don't know, like I prefer to leave it on database shoulders to have, like on the primary single source of or truth. On the other side, the UID by design is generated, it serves purpose to work on multiple, like, undistributive systems, right? So maybe it's not a big deal. Generate on on client's side. All I mean is if you're on an old Postgres and like saves you creating your own function or whatever. Postgres waited a couple of years to bring your ID version. officially. Although again, you can create even SQL function,
Starting point is 00:06:58 even not Pilipage Scale function, simple SQL function. And if you ask your AI to search for, for example, probably it will find my page how to do it, or other people pages. But in other libraries, from Google,
Starting point is 00:07:11 not JS libraries, everything, they have it for a couple of years already, so for longer. They implemented before standard was finalized. Yeah. So huge red flags to look out for, are if it's using the integer data type, which is the same as int for or serial. I think AI sometimes uses, it will define it as serial.
Starting point is 00:07:32 So if you just see that word without big serial, you might hit that limit at $2 billion or whatever the exact upper limit. Yeah. And avoid the UID version 4 if you want good performance on large volumes of data. Yeah. That's it, actually. Let's keep it simple because we could discuss more advanced situations. I think we should do a whole episode someday on.
Starting point is 00:07:53 migrations, like for people getting close to that two billion number, that would be a good topic. I can share. I had a lot of experience with different approaches. Yeah. Let's do that. Good. So how to redefine primary. I can actually have an article about it, how to, yeah, covering multiple paths, so we could talk about this. Number two is constraints. Topic is constraints. It's, it's, it's other constraints. Yeah, yeah. Yeah. Other. beyond primary key. So I would just ask, like, first thing, are we using constraints well, like to myself or AI and so on?
Starting point is 00:08:33 And because Postgres has six types of constraints and it's quite rich tooling, and it's much more convenient to have them earlier, not because adding them later is a big pain. Right now we have quite good support of two-step constraints. definitions. First, I think in PostGrist 19, it will be possible also if not null constraints, which is interesting. But anyway, check constraints, foreign keys, unique constraints. It's possible to create first in almost finished state and then finalize it to avoid long-lasting
Starting point is 00:09:12 locks. But this is not the reason why I prefer to think about it in the very beginning. The main reason is the quality of data. This is why constraints exist in the first place, right? So if you introduce constraints early in your project, data quality is good. Introducing them later. And database side as well. I mean, it just... Oh, yeah.
Starting point is 00:09:38 Yeah. I don't trust constraints outside. I'm like with age, I became less radical. My points of view became less radical. with it. So I just observe without fighting with backend engineers who try to implement constraints
Starting point is 00:09:55 on application side. I just have a note. They are not reliable because who knows which other layers of types of applications, especially these days with AI, it's so easy to add something else implemented in different framework or even language.
Starting point is 00:10:11 It's super easy right now. And you will need to re-implement those constraints. Or you will just forget and miss them and data quality they will suffer. Right? Yeah. So specifically, are we, something I think I'd like to see more of,
Starting point is 00:10:25 or at least I see a lot of people not doing, is columns that you're not expecting to put any null values in or you don't want to have any null values in, actually explicitly making them not null. Like, if you don't specify, then columns are nullable. And I think the deep, I think most people, or at least most AIs I see, are creating, like by default, they don't put not,
Starting point is 00:10:47 on every column. Yeah, there is an opinion that not null should be default. Anyway, it's hard to have a default, but inside your project, you can say, okay, unless it's explicitly specified, let's make columns not null. Exactly, yeah. And avoid troubles with null we talked about in a special episode about NALs, right? Yeah, well, and I guess in this specific case, there is also, well, with some constraints, mostly we care about this.
Starting point is 00:11:17 We're doing it for quality reasons. But there is like a tiny amount of performance optimization that comes out of them. I guess a bit more for unique constraints. Like if the database knows for sure this is going to be, this is unique, it can do some optimizations. So yeah, anyway, but yeah, good point about, I like that about not knows.
Starting point is 00:11:37 Yeah, yeah, unique is interesting stuff. And also, I don't think foreign keys are underused in post-Gos ecosystem. Days when everyone suggested that foreign keys are super bad. For me, it's 20 years ago. I remember these topics. Let's avoid foreign keys completely. Although we know that foreign keys can bite badly,
Starting point is 00:11:56 we had episode with Metronome company about issues basically with multi-exact, multi-exact related issues, which can be caused when you use foreign keys a lot. But anyway, I still think foreign keys is a good thing to have, data quality. But I think what is usually underused is check constraints. They are great. They could be used much more often. For example, a classic example, when I say, like, when I see in schema in Amdata type, I think, okay, later, if we want to change this list of possible values, how will our migration?
Starting point is 00:12:37 Especially we're moving, any? Yeah. So check constraints are slightly like more flexibly. in this like it's it's and well also removing is a is a topic right but check constraint you can introduce in two stage again you can drop out the old one and you can first create a new one in not valid state and then validated in separate transaction but I just see it like quite useful quite convenient tool for me and you see it in schema right away like this this column should be like for example in
Starting point is 00:13:13 range or it should be this list or something. So I wish it was used more often. Yeah. What do you see in going back to foreign keys quickly, do you see a lot of the LLMs doing a lot of things on update cascade or on delete set null or things like that? Do you see any cascade stuff popping in there by default or not when it should be or when you think it should be? It's a good question. I actually don't remember having issues with this question. I know deferred constraints are can might be an issue. I remember a case when PG repack, it was a company called Miro, maybe you know it, right? They had issues with
Starting point is 00:13:57 PG repack and deferred constraints. There is an article, old article from them about this. But I don't remember particular issues here. And I think this leads us actually to the question for our particular projects what's better and what I if I had any doubts with AI I would just make some experiments based on the product we are developing I would say okay you know like for example in one year how many of entities in which table entity table we expect all other tables let's fill it with fake data right run vacuum analyze to have good state and then just explore plans with PigeMastard, for example, right? Let's think what kind of workload we might have, according to user stories in our specification of the product and so on, and then collect plans
Starting point is 00:14:53 and see, right? And then include also modifying queries and collect those plans and think how it will work. It's so easy these days to make these experiments before we finalize all the decisions, and then you will see it in action and it's great. So collect all the plans. find like weak spots where we have suboptimal plans and how we can deal with this. If for example we have entity and another and one too many and it's many so many, right? And then we want delete to be supported in UI where people as we know expect 200 milliseconds and less than one second let's say, right? And then this delete of one row triggers propagating according to foreign key triggers deletion of one million rows,
Starting point is 00:15:41 It cannot be, cannot meet our requirements, right? Now we need to think about asynchronous deletion, maybe if using some event queue system or something, right? Yeah, yeah. And this will pop up if you prepare a good experiment here. Right. Without experiment, it's hard to say. You need to answer questions,
Starting point is 00:16:02 how big are the tables and how, like, this relationship between tables on foreign keys will be in the worst case, for example. Can user create, like, million of entities and then we need to delete this user for example. This is a great question and with AI you can iterate so fast these days. So great. Collect plans and then think and decide. Okay.
Starting point is 00:16:24 It's a good good question but I don't see it as simple question. No, no, no. It was more, I just asked in case you'd seen there was a chance that LLMs were throwing it in on every foreign key and that was maybe a bad, bad default. But it sounds like they're not. I may be a wrong person to ask about this because I don't, I design sometimes these days. I have a few applications designed from scratch with AI, but I mostly deal with things which other people already created. And I just see how AI can help.
Starting point is 00:16:55 So I'm not sure will this like propagation of deletion, for example, will be supported in how exactly it will be implemented. But it's definitely worth thinking in the context of latencies. and what should be offloaded and what tools will be using to offload in this, to background jobs, for example. Yeah. Anyway. Yeah, it's just a thing that you need to,
Starting point is 00:17:21 instead of thinking, looking at code, I would think looking at dynamic results, experiments. Yeah, and I'm thinking, like, if we're talking specifically about schema design when you expect to hit scale, I might default to not using on X cascade just by default because of those, because at scale it can become problematic. So that would probably be if you had to come up with a best practice for if you're
Starting point is 00:17:54 designing at scale, maybe lean in that direction. That's interesting. That's interesting because this was my position for a long time. But it's not about AI building or something. This was my position. Let's let's like it's dangerous because who knows how many rows we will need to delete. It's like it's about the delete, right? Unlimited delete, this is how it feels.
Starting point is 00:18:17 Let's design without it. But then I see in my consulting practice, I see cases where really large projects have it. And it's just not one project. Many projects I see. And they keep it and somehow live with it. They at some point implement non-cascaded, like asynchronous deletion. but by default, I still rely on cascaded deletes somehow, and they are fine. This is interesting.
Starting point is 00:18:41 So reality shows it's not that bad. But I agree with your way of thinking, and this is my default way of thinking as well. This is like, Posgis can survive very terrible things these days much better than like 20 years ago. And I know. The only time I see that being a like horrendous issue, even small to medium-sized projects, is if people haven't indexed the the second part of foreign key. Exactly, the referencing color. There are two ends of foreign key.
Starting point is 00:19:12 One is always indexed because it's primary key. The other one, it's your job to index it. Not to forget. Yeah, yeah, yeah. We have an old checkup, actually. We have a special report just for this case. Perfect. Good, but it's all good questions,
Starting point is 00:19:31 but I feel we can spend a lot in the constraints area. I think we had constraints episodes. We did have an episode, yeah. Let's look that up and move on. Yeah, let's move on. Some entertaining topic, column Tetris, which is sometimes it might bring some quite interesting
Starting point is 00:19:49 and good saving. In many cases, it won't, but I just see people enjoy this topic, and I decided to include it. So the idea is when you design new schema, the price to pay to reorder columns is very low. You can reorder it. Easy. And the problem is that Posgos physically, the colon order matters.
Starting point is 00:20:14 And due to padding alignment, if integer 2, for example, is followed by integer 8, you will see a gap 6 bytes of zeros in every row. And this is bad, right? It's like a waste of storage and also memory, which is not only disk, it will pollute with those zeros, it will pollute memory. And sometimes you can be so unlucky. I saw cases like 30% of HRO is like zeros. And Postgres doesn't have automatic reordering of columns.
Starting point is 00:20:48 It could actually, but it doesn't have it. So the order you define in, let's say you do create table, the order you put those columns in, that's the order they end up on disk. And every new column you add goes at the end. That makes sense, right? But even if it could be slotted in, let's say you add a new Boolean type and it could, I don't actually know if that would work, but it always goes on the end.
Starting point is 00:21:13 Bullion is one bite. Yeah. So in Boolean, seven beats are wasted already. If you go, if you zoom in, like under microscope. Unless you put Boolean, Boolean, Boolean, Boolean next to each other, right? No, no, no, no. Bullion is one bit. Oh, bit.
Starting point is 00:21:30 Yeah, good point. Good point. Sorry. But it's stored as a one byte. Seven bits already wasted. But if you have Boolean and then times 10, for example. For example, first column, as we all agree, should be ID. Primary key.
Starting point is 00:21:49 I'm joking. It's not always, but it's very popular. ID is the first column. Integer 8 occupies whole 8 byte word. And then we have Boolean and then timestamp. Not only you wasted 7 bits inside Boolean. it's already by design like your postgres doesn't have like less than one bite but then you waste seven bytes so seven beats first and seven bytes because of this alignment padding and your
Starting point is 00:22:18 bullion which is like by sense it's only one bit suddenly occupies eight bytes it's a huge waste of resources right yeah but the big bigger picture here is that people usually realize that waste is that waste is significant only when it's a huge table, like 100 million rows, right? And then, oh, so many spaces wasted because we didn't play this column tetris initially. But this is like chicken or egg problem, like maybe inverted. Anyway, so you realize it later, but you could do it earlier, but you didn't think about it because when you have only 1,000 rows, for example, it costs you nothing. But again, with AI, it's so easy.
Starting point is 00:23:02 you just say let's play column tatters in postgres AI should know it I haven't tried but I'm pretty sure it should know it because there are many articles about it Oh and we did an episode Yeah starting from Stack Overflow where this topic was very well described So AI should know it
Starting point is 00:23:20 This topic When you create new table you think many rows will be stored Just let's apply column tetris to this Boom It will reorder columns never tried again yeah what do you think like I feel like there's a
Starting point is 00:23:36 I don't know if I'd still want to have I'd still want to do it myself just because I feel like there's also a natural order for columns so I would want to put all the eight white ones together and 16 bite ones and make sure they're all with similar and then all of the yeah exactly ah similar
Starting point is 00:23:57 but but equally like imagine if it's like a user's table and I could put the I could put certain information about that user that makes sense to go together together and then information about something like maybe like there all the IDs that maybe the foreign keys like their team ID or their organization all of those kind of like referencing columns put those together so I know the order doesn't matter I know you could just specify column order but you know when you do select star or you just this is what i i i was trying to say like are you trying to tell me that i use you're
Starting point is 00:24:36 using select star yeah and here we go so i'm joking actually i'm using select star all the time but i also all the time i see articles how bad it is to use select star right but it's convenient i know yeah but but also if you if you like if you manually work it's inconvenient to have uh all order and select star doesn't work well. But application shouldn't use select star. You're right. And even I'm even thinking about examples like the system views in Postgres, like you do select star from PG stat statements or something. But that's, that's not even the table, right? That's the view and then they can define the order. So it's, it doesn't actually matter. It's yeah, it's interesting topic. I played column tetras a few times heavily
Starting point is 00:25:23 and then it was really inconvenient or like this is ugly table. It looks ugly like this order it's ugly. But at the same time, it's your problem because you use Select Star, which is considered like not a good practice. It's practiced for exploring and so on. Okay, explore with AI, ask to provide better order or something. I don't know. But anyway, if you talk about, if it focused on performance, again, I saw cases where it was significant waste. One of the first articles I saw in it was by braintree, and I think they reported an average of about 10% on some of their largest tables. It depends. Obviously, it completely depends. But that was quite surprising.
Starting point is 00:26:02 I know you mentioned like an extreme case where you saw 30%. But I was quite surprised that in an actual production case, it was as high as 10%. And when you're talking about hundreds of gigabytes or terabytes, that's actually a decent amount. Yeah, yeah, exactly. The problem also, of course, in existing projects which were developed during many years, usually it's not so simple because the schema is evolving. during many years. And you cannot insert your new column in the middle.
Starting point is 00:26:32 It's always, as you said at the end. So sometimes we do need to, for example, to change the order using some redesign thing. By the way, have you heard that Postgis 19 will have repack and also repack concurrently, which is basically vacuum concurrently? We had an episode with author of Pidge's quiz. Is a vacuum full concurrently? Exactly, exactly, not vacuum.
Starting point is 00:27:00 But the decision was made to name it repack, concurrently. I think that's sensible, getting away from people thinking it's vacuum-related, yeah. I haven't looked into details, but I might, like, naturally, this is the point where you could reorder comments, because it's based on logical replication. Yeah, I suspect if they do that, it will be a different keyword again. Who knows? Again, I haven't looked into details. there are long discussions on Preciousal hackers mailing list.
Starting point is 00:27:31 It just feels naturally because it's based on logical replication. This is where you can reorder columns and play Tetris for existing large tables. I hope it will be implemented at some point. Maybe it's already partially implemented or something. You basically, before you start filling the new table, you can reorder it because it's logical. It should be possible. Anyway, this is column tetris, entertaining topic. let's move on to more serious topic called indexes, index chaos.
Starting point is 00:28:01 If you just use AI and if you don't spend enough time for things, as I said, like experimenting with real data benchmarks, you might end up with underindexed situation or over-induced situation. And usually people are scared to have under-indexed. In my practice, I see over-induced very often, sometimes too extreme. I guess for new projects I would be surprised to see them over-indexed. Oh, it's possible. It's possible still.
Starting point is 00:28:32 Is that as a result of AI or something? Well, the problem here is that, again, everything depends on your prompt. It depends on angles. If you say we will need to order by arbitrary column, right? Remember our episode with CTO of Gadget, Harry, right? So any column we can order. Naturally, AI might decide to put index on every column, right? But is it really what you want?
Starting point is 00:29:04 But that's what Gadget decided to do as well, right? Right, but with understanding consequences, right? But in reality, in reality, will it be used if one year later you see that 90% of those indexes are unused? And the gadget has special case. They have many applications. to support. If you design your own only one application,
Starting point is 00:29:27 this is where I would just stop and think with AI again, with some experiments. I would ask questions. This all depends on workload, right? Which queries we will have? And I will just think which queries we will really need
Starting point is 00:29:42 and which we won't need and then choose the proper minimal set of indexes. Yeah, I like that. I think there's also like some rules of thumb we can give people in terms of like how many is where it starts to get dangerous for other reasons but the kind of one place where it's not even and it depends is overlapping and I see more and more where people are adding indexes based on AI suggestions
Starting point is 00:30:13 they're adding indexes they've effectively already got or they're adding ones that overlap with indexes they've already got and therefore only one of them is needed it That it won't always show up in your stats. It won't always show up as an unused index because it is still using the smaller index. It will show up in post-GICI stats because it's called in our case redundant index in post-GICI mentoring systems and check-up. I never heard the term overlapping. I see people talk about duplicate indexes, which is like trivial case. Exact same definition.
Starting point is 00:30:49 Right? Yeah. But we call it redundant indexes. It's not easy topic. might be might think but we solved it in many like over years we have quite good detection approach how to reliably detect redundant indexes and as you say this is true in many cases all of those indexes are used and it's quite like there is a big fear oh i'm going to drop this it's used should i drop it yeah but the reports we have quite reliable and of course
Starting point is 00:31:23 course I wish we had very native standard algorithm to disable indexes for quite for some time like without involving hyperg in hypothetical indexes which this came up supported our last episode was on what's missing in post-crest and I think we got a nice comment I've forgotten who from but saying one of their miss features that would be make this temporary hide index from the planner there are there are extensions for it hyper pg supports it but we want some mechanistic is beyond index setting and index index is valid to false. We want this.
Starting point is 00:31:59 Yeah, yeah, yeah, I agree. And this is a good thing to probably work on for those who want to start hacking progress. It's a good thing to officially implement something. Anyway, questions to ask is they all start here should start from understanding workload. So we cannot think about indexes without understanding workload. So first we need to generate some fake data and then start thinking about usage patterns. And without AI, it's really hard work.
Starting point is 00:32:30 It's a lot of time. With the AI, it becomes easier. If you have user stories defined already, you understand what kind of patterns we will have. And you can start imagining. It will not be 100% accurate, but it will be good enough, much better than without it or with manual work. So you can iterate here and see and then with queries in hand and some like lab already developed. I see it should take one or two hours for like medium size project in terms of complexity. And then you can start iterating collecting plans and think
Starting point is 00:33:10 which indexes you can have, playing. But also this, whilst this might be the most important in terms of performance that we've talked about so far, it's also the easiest to change later, right? We can... Yeah, yeah. We can set up a certain set of indexes and later decide we want a slightly different set of indexes and that's not super painful in a lot, in most cases, especially if there's no partitioning involved.
Starting point is 00:33:35 So that for me feels like, yeah, it's important. It's really important. But if we don't get it right on day one, it's much, much easier to fix than the primary key being the wrong data type or not having the right constraints in place, that kind of thing. Or even the column order, yeah. So the point is you should build a wind tunnel for your project, right? Like a test lab, yeah.
Starting point is 00:34:00 Test lab, yeah, database lab. And then you should put this workload, like wind, through it and see how the profile of your database behaves under this wind. Yeah, a lot of people do that with production. They just see which are their slow queries. But with AI, it's easier to run massive experiments before you finalize all the decisions. So my point is: do it. And then the questions to ask are: are there any indexes we don't need, in terms of being redundant or unused,
Starting point is 00:34:32 or what indexes are missing because we have bad plans? And this, again, pgMustard will visualize, and so on. And you have an API already, right? Yeah, it's quite nice. You can connect your AI to pgMustard, for example, and ask it to use pgMustard to explain what's happening and find bad plans and missing indexes with this lab environment. Great.
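A common starting point for the "unused indexes" question is the cumulative statistics views; a sketch (run it on the primary, and keep in mind when stats were last reset):

```sql
-- Indexes that have never been scanned, biggest first:
SELECT schemaname,
       relname,
       indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```

Zero scans is a hint, not proof: an index may still back a unique constraint or only be used on replicas, so check before dropping.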
Starting point is 00:34:55 Going just to the extreme, how many indexes on a table would you automatically think, oh wow, that's a lot. Maybe it's justified, but I'm already thinking that's too many. These rules of thumb are quite weak in my opinion, but we have them anyway. For example, if the volume of indexes exceeds the volume of data, some bell is already ringing, right? I like that, yeah. That one.
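That "indexes bigger than the data" bell can be checked directly; a rough sketch:

```sql
-- Compare heap size vs total index size per table, biggest index sets first:
SELECT relname,
       pg_size_pretty(pg_table_size(oid))   AS heap_size,
       pg_size_pretty(pg_indexes_size(oid)) AS index_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_indexes_size(oid) DESC
LIMIT 20;
```

Any table where index_size comfortably exceeds heap_size is worth a closer look, though as discussed below, GIN indexes and similar can make this legitimate.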
Starting point is 00:35:23 Or, before Postgres 18, with this nasty LWLock / lock manager problem, it means I don't want more than 15 indexes per table. Yeah, right? Because of the 16-relation limit for fast-path locking. It's so simple: a primary key lookup, which quite likely will have quite high QPS, will suffer because of locking. So there are a couple of these kinds of
Starting point is 00:35:48 rules, right? But they are not strict, and again, I consider them weak, but I think it's still helpful, right? If you've got a schema designed by AI and it's got 20 indexes on one table, maybe consider: is that smart? Yeah, yeah, yeah. But it happens. Index data exceeding heap data, it happens, especially if you're doing GIN indexes. Oh yes. If it's just B-trees... Yeah, yeah. Okay, great. So the final topic: we couldn't decide which one to choose, we had two ideas, one is RLS and another is partitioning. Let's touch both, maybe. Yeah, I think so. So from an AI perspective, when you're asking it for a new schema, do you see partitioning
Starting point is 00:36:42 come up? Does it over-partition? It won't come up unless you start saying: I'm going to have a lot of data here, I want good performance, I want billions of rows stored in this table, and it should be maintainable. Then it will naturally come to the idea of partitioning.
Starting point is 00:37:01 Partitioning will be interesting to combine with UUID version 7 or 8, it does matter. It's possible, and again, I have a recipe for this in my set of how-tos. And there is also the point that partitioning you cannot
Starting point is 00:37:18 develop without understanding the workload, again. It's similar to indexes. How will it behave? Who knows? Maybe you will end up with queries that need to scan all partitions, which is terrible in most cases. And then planner behavior. Anyway, the questions to ask: you set requirements, I want a lot of rows and I want this to be well maintainable, help me. What problems, like deleting old data, are you thinking of? Like archiving? No, creating indexes, for example, or reindexing. Okay, yeah.
Starting point is 00:37:49 Vacuum. Vacuum, yeah, yeah. These are the problems biting us much more often and more badly. We talk about this a lot as well, right? I know, I know. Direct performance benefits from partitioning are good, but there are also the problems like when creation of an index, or a vacuum, not VACUUM FULL, but a regular full-table vacuum, will take hours or half a day.
Starting point is 00:38:15 It's already so painful, especially if it's to prevent transaction wraparound, or if it's, again, index creation holding the xmin horizon. Like, anyway, let's not go too deep. The question to ask: will we survive N number of rows, because we expect a lot of data, right? And just out of interest, is it pretty much always range partitioning based on a timestamp type?
Starting point is 00:38:41 Is that pretty important? Again, I'm the wrong person to ask. Okay, yeah, fair enough. In many cases, I just still use TimescaleDB if it's time-based. Makes sense. And I'm happy because there is compression there, it's great. But I also saw cases when it was decided, not by AI, to use list partitioning and full control over
Starting point is 00:39:02 partitions, with a special table created. It was also working well and served the needs. But the bottom line is, you should think about the workload and then push your AI or something to think about how it will behave if you have that much data. Yeah. Partitioning is inevitable. And then experiments again.
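The kind of lab experiment being described could be sketched like this; all names are made up, and the generate_series row count can be scaled up toward a billion on a throwaway machine:

```sql
-- Throwaway experiment: declarative range partitioning by timestamp.
CREATE TABLE events (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    created_at timestamptz NOT NULL,
    payload    text,
    PRIMARY KEY (id, created_at)  -- the partition key must be part of the PK
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2026_01 PARTITION OF events
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE events_2026_02 PARTITION OF events
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');

-- Generate fake data: one row per second across both partitions.
INSERT INTO events (created_at, payload)
SELECT timestamptz '2026-01-01' + interval '1 second' * g,
       'payload ' || g
FROM generate_series(0, 4000000) AS g;

-- Check that a high-frequency query is pruned at planning time:
EXPLAIN SELECT count(*) FROM events
WHERE created_at >= '2026-02-01' AND created_at < '2026-02-02';
-- The plan should only touch events_2026_02.
```

If the EXPLAIN output shows scans of all partitions, the query shape or partition key needs rethinking before any production rollout.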
Starting point is 00:39:21 Even experiment with a billion rows. It won't take a lot. Just allocate a machine and do it. It's just so easy, right? Half an hour of waiting to generate it, or more, even if it's one hour. I don't understand why people don't do it all the time. They come to us with questions, which I'm grateful for. Thank you.
Starting point is 00:39:41 With money and so on. But it's so easy just to experiment more often these days. Yeah, you can even leave it running overnight. Yeah, exactly,
Starting point is 00:39:53 with some plan, with some plan. Yeah, I'm still on the fence. I love partitioning for all the reasons you mentioned, but I think so many projects won't ever need it that I understand not doing it by default with a new project, unless you know it's a new
Starting point is 00:40:09 feature for an existing system that you know is going to get heavy amounts of data really quickly. There are just so many trade-offs. For example, with indexes: being able to create indexes concurrently, or drop indexes concurrently. On a partitioned table, you can concurrently create an index on each partition and then, once it exists on each partition, create it on the parent. These things can be automated, and of course I would like everything to be much better automated, there is big potential here. But also every
Starting point is 00:40:44 every Postgres version gets a lot of improvements, in the last maybe 10 years already, so it's not like it's frozen. Yeah, but also, just being extra careful that every high-frequency query you have contains the partition key and is getting pruned at planning time. There are a few gotchas. Yeah, there are. I have a series of articles about this, and I went quite deep there, and it was crazy. But that's my hesitation. Creation of foreign keys, for example, working with TimescaleDB sometimes.
Starting point is 00:41:24 Often we just abandon the idea of having a foreign key because it's not supported. If you do partitioning yourself, you will deal with, if you have large tables, adding foreign keys between them. If partitioning is involved, it's an art, I would say. GitLab, they have migration_helpers.rb. This is a great source of the experience of many people involved, and I can't stop recommending how great it is, because it's open source. So it's a great place to look at.
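As a side note, the usual pattern for adding a foreign key to a large table without a long blocking validation scan, the kind of thing such migration helpers automate, is the two-step NOT VALID approach; table and constraint names here are made up:

```sql
-- Step 1: add the constraint without validating existing rows.
-- This takes only a brief lock; new writes are checked immediately.
ALTER TABLE orders
  ADD CONSTRAINT orders_customer_id_fkey
  FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;

-- Step 2: validate existing rows separately, under a weaker lock:
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_id_fkey;
```

With partitioned tables on either side there are extra wrinkles, which is part of why it becomes "an art".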
Starting point is 00:41:56 And also they have great documentation about it. But anyway, it's a thing to decide: if you don't expect billions of rows, don't do it, of course. Yeah. Yeah. At the same time, we also have very sad cases when people come to us for consulting, we identify problems, we say partitioning is needed, right? Sometimes we go and spend a lot of effort because the project is huge or there are many databases to
Starting point is 00:42:21 take care of, and then finally it's implemented. Great. But there are many hidden dangers to explore. Midjourney, for example: they came to us and we helped, but we didn't know at that time about the lightweight lock / lock manager problem, and they hit it badly, and there are articles about it. But then some projects just evaluate how much effort it is to deal with partitioning when tables are huge, and they start talking, okay, maybe we should migrate to MongoDB. Yeah, or even sharding, right?
Starting point is 00:43:01 Like instead. I think when we talked to Notion, Arka, he mentioned that because they keep their shards relatively small, they've opted to never partition at all. Yeah, the need for partitioning diminishes. I agree with this, but it doesn't vanish completely. Maybe? Not maybe. As you mentioned, it depends how small you keep your shards, right?
Starting point is 00:43:25 Yeah, yeah. Like Sugu, for example, with Multigres, he talked about having lots of smaller databases being much easier. Yeah, yeah, yeah, maybe you're right, okay. What about RLS? You mentioned you wanted to touch on that. Yeah, finally, RLS. I just noticed, Supabase is very heavily promoting RLS.
Starting point is 00:43:47 I have a lot of articles, including how to avoid problems. But if you just use AI, I noticed it often adds RLS. Oh, really? Yeah, this is just how it works. I have had a couple of cases where it suddenly brought up: we are going to use RLS, it's multi-tenancy here. Let's do it.
Starting point is 00:44:08 Okay, let's do it. But let's also pause and think about performance, and avoid problems like current_setting() inside a loop, where your SELECT count over a million rows suddenly becomes super slow. So the question to ask AI: again, a benchmark is ideal here. With RLS, without RLS, what is the overhead of having RLS in our particular case?
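The current_setting() pitfall mentioned here, and one common fix, look roughly like this; the policy and setting names are made up for illustration:

```sql
-- Slow pattern: the STABLE function call can be re-evaluated per row:
CREATE POLICY tenant_isolation ON orders
  USING (tenant_id = current_setting('app.tenant_id')::int);

-- Common fix: wrap it in a scalar subquery so it runs as an InitPlan,
-- evaluated once per statement instead of once per row:
CREATE POLICY tenant_isolation_fast ON orders
  USING (tenant_id = (SELECT current_setting('app.tenant_id')::int));
```

This is one of the standard RLS optimization tricks; whether it matters in a given case is exactly what the benchmark should show.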
Starting point is 00:44:51 Benchmark, and if you see it's not okay, then let's optimize, because there are tricks for how to optimize it. So if I noticed AI generated a schema with RLS, I would ask the question: what will be the performance impact? Let's benchmark it, and if it's bad, let's optimize. That's it. Yeah, great. That's actually a useful checklist, five or six things to ask your AI when you design a schema. Yeah, or your colleague even. Yeah, also. Or your own schema, which you have had for 10 years already. Why not? Yeah, exactly. Or even a new table. If you're doing a new feature, it still makes sense. All of these make sense, right? I agree. Cool. Thank you for listening. Yeah. See you soon. Catch you soon. Bye-bye.
