Postgres FM - Data model trade-offs
Episode Date: November 3, 2023

Nikolay and Michael discuss data model trade-offs — some cases where you might want to (gasp) de-normalize, and which PostgreSQL internals are at play.

Here are some links to things they mentioned:

Our episode on JSON: https://postgres.fm/episodes/json
PostgreSQL limits: https://www.postgresql.org/docs/current/limits.html
Boyce–Codd normal form: https://en.wikipedia.org/wiki/Boyce–Codd_normal_form
Our episode on over-indexing: https://postgres.fm/episodes/over-indexing
Heap-Only Tuples (HOT) optimisation: https://www.postgresql.org/docs/current/storage-hot.html
The Surprising Impact of Medium-Size Texts on PostgreSQL Performance: https://hakibenita.com/sql-medium-text-performance
Query planner settings (collapse limits and Genetic Query Optimizer): https://www.postgresql.org/docs/current/runtime-config-query.html
WITH Queries (Common Table Expressions): https://www.postgresql.org/docs/current/queries-with.html
Our episode on benchmarking: https://postgres.fm/episodes/benchmarking
The tests and benchmarks Nikolay mentioned: https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello, hello. This is PostgresFM episode number 70. My name is Nikolay and together with me is
Michael. Michael, hello. Hello, Nikolay. What do you think we should discuss today?
Well, this week was your choice, right? I think you've picked a nice topic that was a listener
suggestion. So I think we've simplified it to data model trade-offs.
Is that a fair summary?
Right, right.
Purely relational.
Let's emphasize that we are going probably to keep JSON, JSONB columns as a separate topic.
We've had it already.
They specified it with respect to PostgreSQL internals, almost as a... Internals, okay.
So that's quite a nice way of looking at it.
It's what could come up as a result of Postgres-specific implementation details
that should make us consider our data model decisions.
Right, so Postgres allows a lot of columns to be present in a single table.
How many? 250?
A lot, yeah. I'm not actually sure.
I think it's more.
There is some limit.
Oh, okay.
Poor preparation.
Anyway, you can have hundreds of columns in a single table,
and you can have indexes on those columns.
Or you can go and implement Boyce–Codd normal form, right?
And so on. I mean, there are many directions to move, and of course everyone wants very good
performance, compact storage, and so on. Is it so? Maybe not everyone.
Well, I think most people care about performance on some level,
but I think it's easy to list three or four things
that almost everybody cares about more,
like durability, for example, or everything in ACID.
You know, the reason people pick Postgres
and pick old databases and trust old databases
more than some of the newer ones
is because they care about those things more than
performance, I would argue, anyway.
Well, right. But if the question is related to
internals, I guess we should think about performance, first
of all, and, and performance in terms of various aspects of it,
like read time, write time, various overhead we can have.
And so to me, if we simplify the equation a lot, it's a question of how wide our table should be,
how many columns we can afford. Should we normalize versus denormalize? Should we have a lot of columns in one table, or fewer columns and more joins when needed?
These are not exactly the same questions, actually, right?
Because, of course, denormalization done on purpose, intentionally, means that we extend the same
table, putting more columns there, although those columns should be somewhere else, for example.
But sometimes it's a very different thing. If we just follow the normal theoretical approach, normalizing our model, we can easily end up having a lot of columns
in a single table anyway, because if you think about some social media application
or a mobile application, if it's a big, growing project, we have some entities which are central to our model.
For example, user table.
Or for example, if you have kind of organizations or projects or something, these tables are
going to be central.
A lot of other tables will have a foreign key pointing to this central table.
And eventually this central table is going to grow in terms of width.
You will add more and more columns because you need to track something.
Sometimes just because of denormalization to avoid extra join
and maybe aggregates, like count, right?
Sometimes just because the complexity grows
and you just need to keep more properties.
And each property in a pure relational approach, it's just a single column.
New property, new column, right?
Attribute.
So, I mean, I think the obvious question is, why is that a problem?
Why is a wide table a problem?
Good question.
Well, obviously, there are strong trade-offs here.
On one hand, extending our table in terms of width is not a problem.
We know how to do it with zero downtime very nicely. Postgres 11 brought us defaults for existing rows without a table rewrite.
A "virtual default", I call it. It's not the official name, but this is actually what happens: instead of
rewriting the whole table when you add a column with a default, it's stored virtually for all old records, right?
And your ALTER is very fast.
So we know how to deploy changes with zero downtime,
with proper testing, all good.
Then we create indexes.
We create them concurrently, because obviously more columns
means more indexes, since some columns you want to be present
in indexes because you filter, order by them, and so on.
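A rough sketch of what that looks like in practice, with a hypothetical table and column:

```sql
-- Postgres 11+: adding a column with a non-volatile default does not rewrite
-- the table; the default is stored in the catalog for pre-existing rows.
ALTER TABLE users ADD COLUMN is_active boolean NOT NULL DEFAULT true;

-- Build the index without blocking writes (cannot run inside a transaction block).
CREATE INDEX CONCURRENTLY idx_users_is_active ON users (is_active);
```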
But the trade-off is: on one hand, it's good to have a lot of things in one table, because you can achieve a single index scan or single index-only scan for many queries, because if you need to join data
need to join data
from multiple
tables, you cannot
have single index scan.
You can't have a single scan
full stop.
Yeah, because
an index cannot be defined on multiple
tables, and the only
way to do it is to define
a materialized view and deal with all the issues it brings,
because it will get stale quickly. It will need
maintenance and proper use.
If you have multiple tables,
let's consider a simple example: a one-to-one relationship. Somehow you ended up having,
for example, a users table
and a users_extra_properties table.
And somehow you decided on that.
And this may not be a good idea, because you cannot, for example,
filter or order by, with just an index scan, using columns from both tables.
For example, you want to search by email, lowercased,
or using the case-insensitive citext data type.
And at the same time, you want to use some extra field in this search, or order by it.
And in this case, Postgres will need to choose one of its three join algorithms and apply, like, a nested loop or merge join. If it's email, probably it will be a nested
loop or something like that, right? So it won't look super cool, because super cool is, like,
index scan, index-only scan. And I'm not saying joins are bad, because joins in Postgres are
quite optimized, three algorithms. In many cases, they work
really well, but you
cannot beat the performance of index
or, more important,
index-only scan, right?
So to have that, you need to bring
those columns in the main table,
users, for example.
And in this case, you can define
a multi-column index,
and your query will be super fast.
You will have just a few buffer hits to find those one or a few rows,
and it will be below one millisecond for sure.
Right?
And that's great.
That's it.
End of story.
Let's have a lot of columns.
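For example, something along these lines, with hypothetical columns, keeps the lookup to one index:

```sql
-- Both the filter column and the extra attribute live in one multi-column index,
-- so the query below can be served by an index-only scan.
CREATE INDEX CONCURRENTLY idx_users_email_plan ON users (email, plan);

SELECT email, plan FROM users WHERE email = 'alice@example.com';
```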
I don't think that's it.
No, no, no.
Sometimes I say: okay, even if we performed normalization like we learned from theory, at least to third normal form, which is a very good situation already.
And then I say, okay, maybe we should go back and denormalize just to have index only scan and avoid joins.
But this is just one part of the story.
The other part of the story: there are many issues you will have
when you have a lot of columns and indexes.
And this relates a lot to a previous episode we had recently,
over-indexing, two weeks ago, right?
I think, is there an even more fundamental problem
with the design of Postgres?
We store data as rows, like in tuples in pages, right?
And because we're doing that, if we've got really wide rows, we reduce the number of tuples per page that we can store.
So we might get that index-only scan, but if we're frequently only
reading from a few, like a couple of those columns, and maybe they're really narrow
columns, then those being on a separate table from the wide ones that we're infrequently accessing
is an argument against the wide table for read performance. So even if we only focus on read
performance, I think there are some arguments to have splits like that.
Yeah, makes sense.
Right.
But again, you can compare for your particular case. You can conduct a few experiments and compare exactly with EXPLAIN (ANALYZE, BUFFERS).
You will see how many buffers, how much I/O, each approach gives you.
And I suspect, not in all cases, but in many cases,
for reads a wide single table will still be good.
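A minimal way to run that comparison yourself; the tables and the user_counters side table are hypothetical:

```sql
-- Wide-table design: one lookup, no join.
EXPLAIN (ANALYZE, BUFFERS)
SELECT email, post_count FROM users WHERE id = 42;

-- Split design: same data via a join to a narrow side table.
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.email, c.post_count
FROM users u
JOIN user_counters c ON c.user_id = u.id
WHERE u.id = 42;
```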
Even if it's toasted, right?
If we exceed roughly two kilobytes for a row, it will be toasted and so on.
It actually means an implicit join, right?
Because to find those chunks, it will need to perform extra lookups.
But buffer hits or reads will show this as well.
The problem, the second side of this coin, is that as we extend the number of columns we
have in the table,
we have two big issues coming.
They may not come in an acute form.
They can come slowly, increasing over time and accumulating.
So, two problems.
The first problem is how writes work.
I mean updates, actual updates,
how updates are working.
An update in Postgres means that the whole tuple is copied, right?
And this means that if we have a lot of columns,
which we, like you said, access not often,
during update, we need to copy them.
So this is the first part of the first
problem. The second part of the first problem is indexes. More columns usually means more indexes,
because again, to work with columns, you need additional indexes. And if you update a row, or if you insert a row (it doesn't affect deletes;
again, I've made this mistake a few times, deletes don't matter here), inserts and updates will
need to change all the indexes you have. You have many columns, you have many indexes. To update just one column,
or to insert a row, you need to update all these indexes.
There is a heap-only tuple update optimization, but
it doesn't work often. I mean, it doesn't work always. It requires
conditions to be met, and it's not always happening.
So if we have a lot of columns, we need to write more.
We need to update indexes often.
And this makes our writes very heavy, right?
And it gives a lot of work to autovacuum, for example.
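One way to see how much of this is happening on a given table (a standard statistics view; the table name is hypothetical):

```sql
-- n_tup_hot_upd vs n_tup_upd shows how often updates avoid touching indexes;
-- n_dead_tup hints at the cleanup work queued up for autovacuum.
SELECT relname, n_tup_upd, n_tup_hot_upd, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'users';
```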
Simple example.
If you decided to put in a counter to avoid
slow aggregates. For example, we talked about users: okay, how many comments has this user left
in various places of our system, or how many posts has this user created? I'm speaking in terms of some abstract social media application.
If you put, like, a post count as a column in the users table,
you will need to update it quite often.
And all columns will be copied always.
And all indexes will need to be updated
when this update is not hot.
Or if it's an insert, but okay,
here we talk only about updates. So
if an update is not HOT, all indexes will be updated. It's called index write amplification, one of the
reasons Uber decided to go back to MySQL. So what does it mean? It means that probably we need to think carefully.
If this is a central table specifically,
which will be used in many, many queries,
and also if it's a central table,
it also will be often used as just single index scan
just to find the current user, for example.
Who is working?
We need to find the current user.
Of course, you can cache it.
But still, you need to find this user, right?
And then you need to update it.
So probably it's wise to keep this counter in a separate table,
a one-to-one relationship, like user_extra_data or something.
And the count is there.
And it will be a narrow table:
only some numbers and a foreign key to the main table, and then also a unique
key to maintain this one-to-one relationship, maybe it's the primary key as
well. So you just update this counter, and maybe there are some other counters or
timestamps. Another example here is the "last
something at" columns. For example,
last_updated_at,
last_commented_at, last
created-some-post-at. Some
timestamp, it's usually 8 bytes,
timestamp with time zone
or without. Usually you
should prefer the one with time zone.
And it's like 8 bytes.
And keeping it separately and just having an extra join when needed,
maybe it's a wise approach because the main table remains untouched
if you need to update it often, right?
And the index write amplification is lower because there are only a few indexes there,
probably just one index, actually.
You probably don't need to index those extra columns like counters and timestamps.
Sometimes you do, of course, but in many cases you don't.
And in this case, it's just one or a few indexes to be updated.
Index write amplification is not high.
It's not a big problem anymore.
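A sketch of that kind of narrow side table, with hypothetical names:

```sql
-- One-to-one side table: hot counters and timestamps live here, so frequent
-- updates copy a tiny tuple and touch only this table's primary key index.
CREATE TABLE user_counters (
    user_id           bigint PRIMARY KEY REFERENCES users (id),
    post_count        integer NOT NULL DEFAULT 0,
    comment_count     integer NOT NULL DEFAULT 0,
    last_commented_at timestamptz
);

UPDATE user_counters SET post_count = post_count + 1 WHERE user_id = 42;
```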
And we avoid a lot of bloat on the main table, right? A lot of additional work on the main table, as well as the write amplification. We
also don't have to free up or reuse that space in the heap. Presumably
there's an impact on the write-ahead log, the amount of work we're doing in the write-ahead log, as well?
Well, yes, it is so, but we also have full-page writes, right?
A full-page write means we need to write the full page anyway the first time a page is changed,
but subsequent writes until the next checkpoint are smaller.
If you need to update a very narrow table,
tuples are small.
In this case, of course,
the amount of data written to the WAL becomes smaller.
So you're right.
It's another good reason.
So if we write less WAL data,
it means less pressure on backups and replication systems.
Right.
Exactly.
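On Postgres 13+ you can see that difference per statement; the query here is hypothetical:

```sql
-- The WAL option reports WAL records and bytes generated by the statement.
EXPLAIN (ANALYZE, BUFFERS, WAL)
UPDATE user_counters SET post_count = post_count + 1 WHERE user_id = 42;
```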
But again, you need to join.
So probably we end up with this approach.
If this is something like counter or timestamp or something
which is very often updated, but we don't need index on it,
it's good to keep it in a separate table.
And one other condition: if it's not, like,
highly correlated to something else, if we don't need statistics, is it called multivariate statistics,
if we don't need to define those.
I guess all the single-table features we
have, this applies to as well, right? Oh, and I thought of one other thing.
I remember reading a really good blog post by Haki Benita on this topic.
So we've got the frequently updated columns that we should consider or that it might make sense to move out into a separate table.
And we have these large or medium sized texts.
Now, for large ones, more than, I think you already mentioned, the two
kilobyte number, Postgres already handles those for us using TOAST. Awesome. But it kind of shows
that there is a point where it becomes beneficial to do this join, and we get some other features in
there as well. But there's this weird case, like below two kilobytes, where there's an argument that medium-sized texts, you'd consider moving those out as well.
If you've got comments or something that are like one kilobyte on average, having those in your pages means you're only going to have a maximum of, what would it be, eight tuples per page.
Yeah, exactly, tuples.
Well, I'm also switching back and forth,
tuples or tuples.
If it's pages of eight kilobytes and rows of two kilobytes, it's like four,
roughly four per page.
Yeah, yeah, yeah.
So it's an interesting point.
Another interesting point is vice versa.
If some column is, like, very rarely used, very rarely.
Yeah.
But it needs index.
Then also probably it should go to this extra table.
For example, token, activation token.
Interesting, yeah.
If you put the activation token in the users table, it becomes like a forever passenger, not doing useful work anymore at all.
But you keep having it in all updates. You copy, copy, copy it in all the row versions, in tuples.
And returning that data to the client.
And bloating relations. And yeah, if you do need it one time,
although it requires... Usually, a token requires index because you find the row,
like using token, you find who this token belongs to, right? The owner of this token. So you need
the index. But in this case, you need it just once per lifetime of user
or maybe when a user resets password or something.
It's so rare.
So keeping it in a separate table, even if it's with index, it's a good idea.
And, yeah, it compacts the storage and memory representation,
fewer bytes used in shared buffers, because you probably deal with it
once, then you forget about it and never read it into shared buffers anymore until
a password reset is required one more time.
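For instance, with hypothetical names, the token and its index can live apart from the hot users table:

```sql
-- Rarely used but indexed data kept out of the frequently updated table.
CREATE TABLE user_activation_tokens (
    user_id    bigint PRIMARY KEY REFERENCES users (id),
    token      uuid NOT NULL UNIQUE,   -- indexed, but looked up very rarely
    created_at timestamptz NOT NULL DEFAULT now()
);

-- The once-per-lifetime lookup: who owns this token?
SELECT user_id
FROM user_activation_tokens
WHERE token = '8f14e45f-ceea-467f-a9f3-7c1d2b3a4c5d'::uuid;
```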
That's a really great point.
I've got another question for you.
If you've got something really large, if you're, like, chucking some HTML into Postgres
for some reason that makes sense for your use case, and maybe it's potentially megabytes or something of data, now we
could rely on TOAST for this, but we could proactively put it in a separate table,
and it would still use TOAST, right? But to avoid, like, it's almost the developer experience: if
somebody's accidentally running
select stars, or isn't aware of the impact of fetching that data and does a select star, if
you're having to support those kinds of queries, not having it be possible to fetch, you know,
megabytes and megabytes of data from the server by mistake, when they probably don't need it, might be an interesting idea.
Definitely.
If you look, if it's some
project which was developed over many
years, if you look at these
central tables,
especially, like, users, like I said, or projects,
organizations,
and so on, they are used
a lot. And of course you can check
the TOAST level with various queries. It's
easy.
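One simple check, with a hypothetical table name: pg_table_size includes TOAST, while pg_relation_size is the main heap only, so a big gap means a lot of toasted data:

```sql
SELECT c.relname,
       pg_size_pretty(pg_table_size(c.oid))    AS table_size,  -- heap + TOAST + maps
       pg_size_pretty(pg_relation_size(c.oid)) AS heap_only
FROM pg_class c
WHERE c.relname = 'users';
```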
And if they have started to be toasted, of course you select all columns, including those which you don't
actually need. Usually you need this table often, but you need only a few columns, to check that this
user is still active, right? That this user is not banned,
for example, or that the project is still active, is visible. It's just a few of the most important columns
which are required often. Others can be offloaded. But again, there is a trade-off here:
if they are offloaded, but with indexes, you need to join.
And the Postgres planner, choosing a join method and indexes,
will sometimes be less performant than just a single index scan, right?
So it's an art.
When you build something, you need to predict future,
and this comes with some experience.
You need to think, okay, these columns, will we need them often?
Will we need indexes on them? And based on the answers here, we can decide to offload them to
this extra table, right? So the second problem is over-indexing. We discussed it. If you have a lot
of columns, you have many indexes, you quite often have single index scan or index-only scan.
This is the perfect plan you always want to see
because it's usually either sub-millisecond or few milliseconds only,
depending on the number of rows and hardware.
So if you have this, you start having a lot of indexes,
and not only is it an index write
amplification problem, it's also the problem we discussed: exceeding 16, 15
indexes actually. Even if your query deals only with a single table, if it has
many indexes and you don't use prepared statements, planning happens each time, this planning will be slower, and you exceed this number of locks: 16, the
table plus 15 indexes is already 16. The next index, the 17th relation, will be locked with fastpath false, so without
the fast path in the lock manager, and it will be much slower than with the fast path. It means that if it's a central
table, and the QPS for queries involving this table, especially simple queries, they can be very
frequent queries, high QPS, in this case contention for simple queries might hit performance a lot. So you will see LWLock: LockManager
in wait events in pg_stat_activity,
or, like, on RDS Performance Insights,
Cloud SQL has it as well,
Timescale, I think, has it as well.
Wait event analysis will show you waiting
on the lock manager lightweight lock.
And this will happen with very simple queries, primary key lookups or some index scans, finding
single row, just because you have a lot of indexes and queries are very frequent.
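A couple of standard views make this visible; these queries are just a sketch:

```sql
-- Relation-level locks held by each backend; fastpath = false is the slow path
-- through the shared lock manager.
SELECT pid, relation::regclass AS rel, mode, fastpath
FROM pg_locks
WHERE locktype = 'relation';

-- Wait events for active sessions: look for LWLock / LockManager
-- (called lock_manager before Postgres 13).
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2
ORDER BY count(*) DESC;
```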
So offloading might help again.
If you think, okay, this table is going to be central in my system, and these columns I do need, most of them will be indexed,
but I won't exceed this magic 16 number.
I mean, for indexes, it's 15.
And in this case, I will be able to have thousands,
maybe tens of thousands of queries per second to this table,
because I need it, right?
Not everyone needs such performance, of course, right?
But designing for such performance means you need to avoid
having more than 15 indexes per table.
This should be a rule for tables queried at high frequency.
Or you can just use prepared statements.
It won't eliminate this problem completely, but it decreases it,
because you eliminate planning time for primary key lookups.
Planning time might be even bigger than execution time,
so you shave off a lot of milliseconds, microseconds, from this duration. And contention can still happen, but at higher QPS.
So you postpone this problem.
If you do need to have more than 15 indexes.
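A minimal sketch of that, with a hypothetical query; once a generic plan is cached, repeated executions skip planning and the planning-time locks on every index:

```sql
PREPARE get_user (bigint) AS
    SELECT id, email FROM users WHERE id = $1;

-- After several executions Postgres may switch to a cached generic plan.
EXECUTE get_user(42);
```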
But having fewer indexes is good, because the lock manager won't have contention if you deal only with a single table.
We can refer people to the over-indexing
episode. Right. I've got a theoretical question around that, though, not something I've
ever seen become an issue as a result of this, but if we follow this pattern in a few places in our
database, and we've split things that are logically one table into multiple tables, we're going to have more joins. And there is a,
there's a couple of limits.
Is it from_collapse_limit and join_collapse_limit?
And I think,
is it eight?
Yeah.
Okay.
I knew it was.
Yeah.
And that we could quite quickly reach eight relations.
So then,
yeah.
So you might,
if you're doing this,
you might also want to just check that any of your complex queries
that involve a lot of joins don't flip plans
or you're not getting some strange, I think it switches to,
is it called the genetic query optimizer, something like that?
Genetic query optimizer, yeah, right.
I'm checking, 12.
I always keep forgetting. The genetic query optimizer threshold is 12.
So if you reach 12 relations, by the way, here we mean tables, not indexes,
because previously we counted both tables and indexes when we talked about locking.
Here we talk about tables. And yes, if you join 12 or more tables,
it means that genetic query optimization starts happening.
It's hard to understand this thing, how it works, right?
And maybe you are good, maybe you are not.
So I, in general, I, of course,
aim to have good performance
and joins are fine,
but usually I try to avoid
joining too many
tables. And how to do it?
Denormalization. In this case, yes, we need
wide tables. It's worth saying that
what I just described,
it's about very high
TPS, high QPS.
So if you have hundreds of queries per second, good hardware,
you probably will never see this lock manager contention.
But in this case, of course,
denormalization is one way to reduce the number of relations
participating in query.
Another way, of course, sometimes we do it:
sometimes we just, okay, we read something,
and then we read something else in a different query.
It's also possible.
But, yeah.
Or do you know, like, if we use with materialized keyword,
meaning we intentionally put the optimization fence there,
it means that these steps will be performed,
like, this threshold is considered separately, right?
So if you, I mean, if first step was six tables,
second step is another six tables,
plus previous results, of course.
I don't know what happens with join_collapse_limit there. I think it's counted separately. These subqueries, even within a single statement, these subqueries,
physically, they're separate things, right? Because we have a materialization step, right? And we just, in memory, we have
temporary
tables, so to speak. It's not a table,
but just some data, right?
It's used like
a table.
We can join with it, right?
In this case,
splitting to multiple steps
using CTE might help as well.
I guess.
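A sketch with hypothetical tables, for Postgres 12+ where MATERIALIZED is explicit: the CTE is computed first and acts as an optimization fence, so the outer join search space stays smaller:

```sql
WITH recent_orders AS MATERIALIZED (
    SELECT o.user_id, sum(o.total) AS recent_total
    FROM orders o
    WHERE o.created_at > now() - interval '7 days'
    GROUP BY o.user_id
)
SELECT u.id, u.email, r.recent_total
FROM users u
JOIN recent_orders r ON r.user_id = u.id;
```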
Actually, I see some overload in terms of the number of tables used
in queries quite often.
But usually we find some ways to simplify them.
Sometimes a lot of tables participate almost for nothing there.
It was just convenience.
Let's join this, let's join that.
But then you ask, why?
If you want good performance, maybe you should just
make it lighter, not so heavy.
In this case,
you just reduce
the number and it quickly
becomes below 12.
But
I don't know. Maybe genetic optimizer
is a good thing.
I think only a few people in the world know exactly whether it's good or not.
It's like some mystery, right?
Yeah, well, one thing to bear in mind is that unlike previous topics we've had,
this is a configurable limit.
You can change this number with the caveat that you keep an eye on
planning time, basically the reason these limits are in place. And you can do it dynamically.
You can change it for one query and change it back before another query, same with the join and from collapse
limits. But it's not the same with the threshold I described, the fast-path-locking-related one. 16 is a
hard-coded constant.
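The settings are real; the values below are just illustrative:

```sql
-- Defaults: join_collapse_limit and from_collapse_limit are 8, geqo_threshold is 12.
SHOW join_collapse_limit;
SHOW geqo_threshold;

BEGIN;
SET LOCAL join_collapse_limit = 16;  -- applies only to this transaction
SET LOCAL from_collapse_limit = 16;
-- run the many-join query here
COMMIT;
```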
Is there anything else you want to cover? I've got one more thing on the denormalization
front.
I wanted to mention that what I described, we have recent benchmarks. Thanks to my co-workers,
especially Vitaly and Dmitry,
they did good benchmarks.
And let's put links in the
description. Very detailed.
Everyone can repeat them
as usual. So these benchmarks
show the effect of having
extra indexes, for example.
And yeah, actually, until recently,
I never thought that having extra indexes
affects planning time significantly,
like 5%, 10% of overhead in terms of time.
And I also never thought it's related to this topic,
actually, we discussed today.
More columns means more indexes, and with more indexes,
we reach this threshold of 16 faster. So it's interesting.
I must admit, it's new to me. This is what I say. Cool. Yeah.
On the last question I had: the case I see most commonly for denormalization is around, for example, a
software-as-a-service company wanting to partition a large table by account ID, or your equivalent.
You need to have that on the table in order to be able to do so, right? So, like,
it could be not only for read performance,
but also in preparation for something like partitioning.
Right. Partition key propagation, right? We need to propagate this partitioning approach to
dependent tables. This is quite obvious, I think, right? And again, like if we partition, we need to make sure all our queries,
or at least most of them, don't deal with more than one or just a couple of partitions.
Otherwise, we can have another amplification.
It's not index write amplification, but planning-time, locking-related amplification. Because if you have
partitioning, and your query deals with multiple partitions, and each partition has,
for example, five indexes, then if you deal with four partitions you already have 20
locks, plus the tables, so 24, right, locks to be acquired. And it definitely goes above
this threshold of 16, so the fast path won't be there for some locks,
and the locking will be slow.
So you do need prepared statements at least.
Or you need to just make sure constraint exclusion is working
and your plans target just one or a couple of partitions always.
A better one, right?
Of course.
Awesome.
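To make the partitioning point concrete, a sketch with hypothetical tables; the partition key has to be a column of the table (and part of its primary key), which is why account_id gets denormalized onto the rows:

```sql
CREATE TABLE events (
    id         bigint NOT NULL,
    account_id bigint NOT NULL,
    payload    text,
    PRIMARY KEY (account_id, id)
) PARTITION BY HASH (account_id);

CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- With the partition key in the WHERE clause, the planner prunes to one partition.
SELECT * FROM events WHERE account_id = 42 AND id = 1001;
```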
And if it joins also.
Of course, if you join often,
main table has many indexes,
extra table has many indexes.
When you just select from it,
the planner needs to lock them all.
I mean, at planning time,
we need to lock all of them, actually,
with the lowest possible heavyweight lock,
access share lock,
just to prevent DDL from happening while we are reading.
But all indexes, even those which are not needed,
will be locked. This is the key. We discussed it.
So if you split to two
tables, but need
to join often, and this
join query needs to be
very frequent, it's not worth it
to split.
We've actually added a relation.
Yeah.
We added a relation, plus one, okay.
And we also need to perform join now.
So it's worse.
So if you offload something, you need to be very careful.
You need to, like, offload columns which are rarely used,
or columns which need to be very frequently updated,
columns which shouldn't be indexed, or are indexed but rarely used.
I guess there are
some maybe
design patterns here
of developing
what we discussed
as a kind of raw material.
We need to polish it, I guess,
before saying this is...
And the whole
point of this episode
was it's a trade-off, right?
Like there are going to be cases
where one of these makes sense
and there are going to be cases
where the opposite makes sense.
Yeah, it's better if it's a serious system
that will need to handle a lot of queries per second.
It's worth checking, prototyping,
filling with like millions of rows, hundreds of millions of rows, and just testing, testing, testing various scenarios.
With concurrent selects, it's really important, right? That was your tip last time.
Well, it depends. You might have only single-session experiments and already understand a lot: if you use BUFFERS, you see the I/O,
that's it. But if you need to check the behavior of the buffer pool or the lock manager, in this case,
of course, you need full-fledged benchmarking, multi-session experiments, and you need to be
alone on that machine. But we also had an episode about experimenting and benchmarking.
So here I see value in both approaches, just different value.
Wonderful.
Was there anything else you wanted to make sure we talked about?
I guess that's it.
It's not all, far from being all, far from being polished.
But I encourage everyone to think about their particular case with experiments. I don't call them benchmarks because
EXPLAIN, ANALYZE, BUFFERS, it's also like a performance-related experiment. You can
check plans, but it doesn't feel like a benchmark, right? But just from this
single thing you can do many good optimizations. So experimenting is the key to understanding your particular case.
I hope we highlighted the directions of analysis, right? This is probably the main value of this
episode. Okay, if you liked it, please like, subscribe, and share.
I'm talking to our listeners.
I thought you were talking to me.
Thanks so much, Nikolai.
I've noticed when you upload to YouTube
you don't hit like. I usually fix it myself
and like our video because
of it.
Thank you so much.
Take care. Bye.