Postgres FM - Denormalization
Episode Date: November 8, 2024

Nikolay and Michael discuss denormalization in Postgres — when and why to denormalize things, and a couple of specific cases Nikolay came across recently.

Here are some links to things they mentioned:

Denormalization https://en.wikipedia.org/wiki/Denormalization
Our episode on materialized views https://postgres.fm/episodes/materialized-views
Our episode on data model trade-offs https://postgres.fm/episodes/data-model-trade-offs
Our episode with Markus Winand https://postgres.fm/episodes/modern-sql
Universal Relation Data Modelling Considered Harmful (blog post by Michael Stonebraker and Álvaro Hernández) https://www.enterprisedb.com/blog/universal-relation-data-modelling-considered-harmful
Boyce–Codd normal form https://en.wikipedia.org/wiki/Boyce%E2%80%93Codd_normal_form
Our episode on slow count https://postgres.fm/episodes/slow-count
pg_ivm https://github.com/sraoss/pg_ivm
denorm https://github.com/rivethealth/denorm
Postgres Materialized Views, The Timescale Way (blog post by David Kohn) https://www.timescale.com/blog/materialized-views-the-timescale-way/
PgQ https://github.com/pgq/pgq
Databases, Types, and the Relational Model (by C.J. Date and Hugh Darwen) https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello, hello, this is Postgres FM. I'm Nikolay from Postgres.ai and as usual my co-host is Michael from pgMustard.
Hi Michael, how are you doing?
Hello Nikolay, I am good, thank you. How are you?
I'm very good. So, our topic today: we are going to discuss denormalization,
but at a fairly high level, the basics, and the pros and cons of applying it as a methodology.
So we already had a couple of episodes.
We discussed materialized views.
It's a very close topic to this one.
What else?
We discussed schema design.
Yeah, we called it data model trade-offs.
I'll link up those two.
Yeah, I think data model trade-offs was probably the closest to this, because naturally it came up as a big topic. And the third one it came
up in was a good chat I had with Markus Winand, and he made some great points around normalization
and denormalization, or at least normalization. I'll link that up as well. Maybe
I need to re-listen to that one. Yeah, it was my choice for this week. And I'm just thinking that it's important to get back to this topic, at some maybe slightly different angles.
Because I keep seeing issues with this and cases of this.
And I have two different customers who have consulting contracts with Postgres.ai, and I just observed
the need for denormalization, or at least the need to consider applying denormalization, to solve
some performance issues.
But before we discuss performance and reasons, let me say this.
If you don't understand normalization, don't use denormalization. You need first to learn
normalization and feel it and use it. Because otherwise, with relational databases, it can be
tricky to understand particular cases and limitations and consequences of applying
denormalization. I'm not talking about non-relational databases,
but we can maybe mention the work by Stonebraker
in the EDB blog discussing MongoDB particularly,
or maybe DynamoDB there as well,
I don't remember exactly:
the problems which happen to MongoDB and, I think, document databases in general,
with big JSONs and the tendency to have a denormalized design by default, and issues
later, both flexibility-wise and performance-wise.
Was that the blog post with Alvaro? Yeah.
So it also had the word harmful,
which I inherited and then got a big backlash
on my sub-transactions-related blog post
on Hacker News.
Yeah.
So let's keep that aside
and consider we talk about only Postgres
and maybe slightly wider relational databases.
First thing you need to understand is normalization: normal forms, at least the third, fourth, and Boyce–Codd normal forms, right?
Functional dependencies.
It's a very interesting thing to understand.
It's not only theory, although it feels like theory and always felt like theory to me.
But when you design your schema, if you understand these things,
if you learned and studied them,
you feel much better thinking about how to design the schema:
tables and columns and foreign keys between them, right?
Yeah, and in the real world, at least in the last three to five years, I think in the vast majority of systems I've seen, this is something that backend engineers, full-stack engineers,
generally do look up on.
They do know it's important,
and they do some thinking and upfront design work,
get second opinions and things. So I actually haven't come up...
I have come across some cases of, you know,
like, EAV, entity-attribute-value,
like, nonsense,
or just things that haven't been fully thought through,
but they've been really rare
compared to people that have done things pretty
well from the start.
EAV is radical normalization, right?
Which leads to bad performance
usually. Yeah, that's
a good point. But I think
many backend engineers look
at this topic, normalization versus
denormalization in general,
through the lens of an ORM.
They have different terminology, right?
Sometimes it's hard to understand each other talking.
And also, I think today,
the most powerful LLMs do a great job designing,
suggesting designs.
For example, we integrated this and it produces Mermaid diagrams.
So if you ask to design, I don't know, like Twitter-like or Facebook-like,
YouTube-like schema or e-commerce, and then you iterate, adding tables,
it generally tries to apply common sense in terms of normalization as well
that makes sense
but again before starting to use
denormalization it's good to understand
normalization first several forms
Then,
first thing: let's say sometimes it's
convenient to have denormalization.
I maybe cannot think about a particular case right now,
but it happens.
I know it happens.
Sometimes you just think,
there is functional dependency here, right?
But I don't go and normalize
and move this data to a separate table, for example,
because I know, for example,
this functional dependency might vanish in the future.
It might change.
I just know my field,
and this functional dependency is only for now.
It feels so, but it might change.
In this case, it might be beneficial
and provide more flexibility
if you don't apply the next step of normalization in this situation.
Yeah, it sounds abstract.
And yeah, I cannot imagine some...
I can think of one example.
Imagine if you're an early stage startup without product market fit and you're adding features quite quickly.
A simple example would be: you're a blogging platform and you think adding tags would be a good idea. So instead of adding a whole extra schema around that,
you could have a single column where you let people add a few tags.
And you know that might not be how you want it to be eventually,
but you also might remove that feature in a couple of weeks if people hate it or something.
So, like, you might, for convenience, go with the one-column version where you don't normalize that.
Well, this particular case, I remember my talk in
Maryland in 2008, the first time I came to the US, speaking about how good
Postgres is for Web 2.0. It's been like 16 years, right? And
this was exactly my example: tags.
And I talked about HStore.
Or today we can talk about Array.
Back at that time, there was no GIN at all.
GIN was not created yet, so we used only GiST, and hstore or ltree in some cases. It depends.
But in the case of tags, it's natural
to think, okay, I'm going to put this into a different table, and then you need an additional table to
have a many-to-many relationship. And this is EAV, exactly. This is what is going to have bad
performance if you have a lot of entries in the main table,
like your objects, and also a lot of tags
and a lot of relationships between them.
It's really...
This join can be problematic sometimes.
But we will probably return to this schema,
like a three-table schema, later.
What I wanted to emphasize, I cannot
just not comment here, it's about performance. These days I would consider arrays with tags,
or maybe arrays with IDs. We know foreign keys don't support this: one row, and you have tags, tag IDs, right? And then, like, multiple foreign keys
to different entries in the tags table.
But we can live without the foreign key here.
For the sake of having good performance,
we can just have these IDs,
or just the tags as is, like a text array, right?
So an int8 (bigint) array or a text array in this case,
using GIN, should be good. JSONB, I feel, would add an additional
layer of something here which we don't really need, because we just need IDs or a bunch of phrases,
texts, right?
So I don't know.
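(For illustration, a minimal sketch of this approach: tags kept as an array column with a GIN index instead of a three-table many-to-many design. Table and column names are hypothetical.)

```sql
-- Hypothetical schema: tags stored directly on the posts table as an array,
-- instead of a separate tags table plus a many-to-many join table (EAV-style).
CREATE TABLE posts (
  id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  title text NOT NULL,
  tags  text[] NOT NULL DEFAULT '{}'  -- or int8[] of tag IDs; no FK support for array elements
);

-- GIN index so tag lookups don't scan the whole table.
CREATE INDEX posts_tags_gin ON posts USING gin (tags);

-- Find posts carrying a given tag (can use the GIN index).
SELECT id, title
FROM posts
WHERE tags @> ARRAY['postgres'];
```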
Do you think this is denormalization?
I think, strictly speaking, yeah.
Well, let's not just say we are breaking first normal form here. Because, of course, first normal form
says that each value, basically each cell of
a table, should be atomic, right? It should not be a table, for example,
or, in this case, an array or JSON. We are breaking it, yeah, strictly speaking. But in that
sense, every time we use an array we are already breaking
normal form. In what sense is it denormalization? If we are repeating values,
for example, later we find a typo in some tag and need to adjust it, this is a functional
dependency, of course, right? And we will need to update a lot of rows here.
Of course, it's better to store IDs and have tag values in a separate table, right?
So in this case, yeah, we can even join in an interesting way, maybe with a lateral join or unnest here.
It's interesting, but it's a separate topic. So if we don't store the text values, I don't see the denormalization aspect here, right? We just treat
the first normal form in an interesting way. This discussion happened 20 years ago, I think. I remember some of it from Chris Date, right? From
some works. Like, are we breaking the first normal form in such cases or not? But if we
store just IDs, well, it's quite well normalized and we don't have an EAV, so we can use GIN. Search is much faster if we need some tags; this query
is going to be much more efficient, in my opinion. But foreign keys, unfortunately, this is what we
lose in such an approach, right? So flexibility is one thing.
A different thing, and this is the most interesting, of course: performance. We already started discussing performance. I would highlight two particular cases; these are exactly the two particular
customers I observed over the last couple of weeks, two particular situations. Let's start from
maybe the simpler one: slow count, right? Did we have an episode about it? I'm sure we did. Yeah. Yeah. So, slow count, or in a broader view, slow aggregates,
right? So there are approaches to solve it using, like, index-only scans. We also had an episode, I
think. We're covering more and more things, which is great. But if we see that it's not enough, what else?
We need to start denormalizing in some form.
And obviously, we could use a materialized view which has all the counts,
but it's very asynchronous.
Like pre-calculated.
Yeah, pre-calculated.
You cannot do it often.
For example, we have a huge table, like
10 billion rows, and we insert one more row, and we say, okay, our
materialized view is just CREATE MATERIALIZED VIEW, blah blah blah, name, AS SELECT count(*) FROM
the table. Interesting, right? Is it a good idea to refresh it often? Well, maybe no, we should do it
not often. In this case, it will be lagging and showing not quite relevant information, a kind of
eventual consistency. Yes, it will reflect the next insert eventually, unless it gets deleted, right?
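(As a rough sketch of what that looks like; the table and view names here are hypothetical.)

```sql
-- Hypothetical large table.
CREATE TABLE events (
  id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  created_at timestamptz NOT NULL DEFAULT now()
);

-- Pre-calculated count, as discussed: cheap to read, expensive to refresh.
CREATE MATERIALIZED VIEW events_count AS
SELECT count(*) AS total
FROM events;

-- Each refresh re-runs the full count(*) over the whole table:
REFRESH MATERIALIZED VIEW events_count;

-- Readers get the pre-calculated (possibly stale) value instantly:
SELECT total FROM events_count;
```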
But it will be lagging, and some people can notice it. Not good. So what else?
Like, we can consider something synchronous. Yeah, and also, when you refresh a materialized view, it's scanning
everything. It could skip some old data, understanding it hasn't changed, and just count
only the fresh part of the table. Maybe we
have partitioning there, I don't know. But Postgres doesn't support this, only additional
extensions like pg_ivm. Or what was the name of the other extension? I forgot, I just found it
recently when you mentioned it. Was it denorm? Yeah. It was interesting. I think it's Python and external to Postgres,
because pg_ivm is an extension.
So obviously you don't have it on many managed Postgres offerings.
But if it's external or, for example, you implement it yourself,
it's quite possible using maybe triggers or something,
maybe some queue mechanism involved externally
to make it asynchronous and incrementally updated. In this case it can be good, and maybe more
resilient, updating faster and having more up-to-date data, right? So the problem with materialized views is that it's a heavy, very rough
tool. In most cases when I deal with them I say let's try to avoid them, get rid of them, right?
In this case particularly as well. By the way, we could also build our own kind of
materialization mechanism using logical decoding, right?
Logical replication even.
For example, accumulating some events through...
It can also be an external queue mechanism,
but also using logical replication.
If you batch and then update,
not on each insert,
but after, like, 100 inserts or updates and so on,
deletes also.
And then you reflect this change, and nothing is lost, because it's quite a Postgres-centric approach, right? Because Postgres guarantees
nothing is lost. What do you think? Well, I like it. I remember reading a good blog post by Timescale
on how continuous aggregates work, and I remember thinking
I wouldn't want to implement that myself. Like, I remember thinking,
keeping it synchronous, or, like, synchronous enough, is just quite painful. So I do admire people
that have done that, and yeah, I can see the argument for it. But I also think, at this point,
once you're considering that, you probably also should be considering the synchronous approaches like triggers, or some other mechanism in Postgres which is not yet developed,
which could support some kind of asynchronous way to update things. Oh, I'd love it if it was in core,
that would be amazing. Yeah, in core, exactly. This would be great, but it's maybe a hard task.
And also not that high up the priority list for most people, I think.
Yeah.
But it would give a lot of interesting things,
interesting capabilities to develop
interesting, well-performing and responsive things.
I mean, not responsive, but data is coming in
and you reflect it within a few seconds.
It's great.
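(A minimal sketch of this batched, queue-style idea; all names are hypothetical, and in practice the deltas could be fed by a trigger, by logical decoding, or by an external queue.)

```sql
-- Counter table holding the current total.
CREATE TABLE events_counter (
  id    int PRIMARY KEY DEFAULT 1,
  total bigint NOT NULL DEFAULT 0
);
INSERT INTO events_counter (id, total) VALUES (1, 0);

-- Append-only delta queue; writers insert +1 / -1 here instead of
-- updating the counter row directly.
CREATE TABLE events_counter_deltas (
  id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  delta int NOT NULL
);

-- Periodic job: fold all accumulated deltas into the counter in one batch.
WITH consumed AS (
  DELETE FROM events_counter_deltas
  RETURNING delta
)
UPDATE events_counter
SET total = total + coalesce((SELECT sum(delta) FROM consumed), 0)
WHERE id = 1;
```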
By the way, when you mentioned performance, I looked up the
definition of denormalization on Wikipedia. And it said
denormalization is a strategy on a previously normalized database
to increase performance. So they're explicitly saying that.
Right, but we can assume that in our heads sometimes we normalize, and then we
move back and denormalize, and then go and deploy it right away in production.
Oh yeah. I thought there was an interesting part there that says it
has to be previously normalized, but I thought it was also interesting that the only reason
they gave was performance.
Well, maybe I'm wrong in thinking it can be not only for performance reasons.
Of course, in my experience, I did it for performance reasons.
Let's be honest here.
I just wanted to cover maybe cases which can happen.
You mentioned a couple of cases that came up recently.
Yeah, well, before we move to the next one
I wanted to emphasize: if we decide we are not going to use a materialized view, what else?
We have a table, we insert
but we need the count very fast
Imagine we have a trigger and we
update that count on each insert
or update
or delete
Update and delete, yeah. Delete; update
maybe not, depending. Of course, if you have a soft-delete approach, then you need to reflect updates,
because they might be doing soft deletion, right? Well, and if it's only a count, yeah, sure.
But if it's a different kind of aggregate, then you might have to worry about...
Averages, sum, and so on.
Yeah, yeah, min, max.
Well, min, max probably is fast enough, thanks to index anyway.
Yeah, by the way, of course, there are ongoing discussions about having column store in Postgres all the time,
which might be good here, but maybe not. It
depends on the particular case, because if your main part of data needs to be row store,
you still need to deal with some replication inside your engine or outside of your engine.
So replication means like maybe it's triggers, maybe it's logical replication involved,
who knows. But yeah, it can be interesting.
But again, we have a main table and a trigger
which, on each insert, increments this counter
in this additional table.
What do you think will happen?
Is it good?
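(For concreteness, a minimal sketch of such a synchronous trigger, assuming hypothetical events and events_counter tables.)

```sql
CREATE TABLE events (
  id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  payload text
);

-- Single counter row, maintained synchronously by the trigger below.
CREATE TABLE events_counter (
  id    int PRIMARY KEY DEFAULT 1,
  total bigint NOT NULL DEFAULT 0
);
INSERT INTO events_counter (id, total) VALUES (1, 0);

CREATE FUNCTION events_counter_trg() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE events_counter SET total = total + 1 WHERE id = 1;
  ELSIF TG_OP = 'DELETE' THEN
    UPDATE events_counter SET total = total - 1 WHERE id = 1;
  END IF;
  RETURN NULL;  -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER events_counter_maintain
AFTER INSERT OR DELETE ON events
FOR EACH ROW EXECUTE FUNCTION events_counter_trg();
```

Every writer now has to update that same single counter row, which is where the trouble starts.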
Like hotspots.
We've talked, yeah.
Hotspots.
Yeah.
At sufficient scale.
And I guess all of this is only relevant at sufficient
scale, isn't it? Yeah, yeah. So this synchronous approach can be super dangerous because of
locking: heavy locking, and even without heavy locking, heavy locks. If we have foreign keys
there, for example, then it's going to have multixact IDs involved, and even SELECTs can degrade.
I think you can spread out the hotspot.
You don't have to have a single record that gets updated.
You could have a series and that series could grow
as your concurrency grows.
Right, right.
There's, like, yeah, like batches in place, right?
Then you just sum all of it, and this is your total count, right?
That's good, yeah, yeah.
This is actually what we did a few times.
We just had batches, and I don't remember particularly how.
Basically, some kind of partitioning of your count inside one table, right?
You partition those counts into, like, buckets or batches, how to say, and then you
increment them, sometimes collapsing them, maybe, and so on. Yeah. Also, vacuum is an issue here if you update
that table very often, and one row can be super heavy in terms of real disk space
because of dead tuples, right? Yeah. So Postgres MVCC is not good in this particular case. Well, I think
you'd ideally be aiming for those to be HOT updates. Like, there's no reason, I think, to index that column,
so I think you'd be aiming for those to be HOT updates. Good point, yeah, good point. And then
hopefully you avoid vacuum issues at all. Yeah, it's great. Like, an index-less table, yeah.
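(A minimal sketch of the bucketed idea; the bucket count and names are arbitrary. There is deliberately no index on the table, so the counter updates can stay HOT.)

```sql
-- Spread the hotspot across N rows instead of one; no index at all,
-- so every update can be a HOT update.
CREATE TABLE events_counter_buckets (
  bucket int    NOT NULL,   -- 0..15
  total  bigint NOT NULL DEFAULT 0
);
INSERT INTO events_counter_buckets (bucket)
SELECT g FROM generate_series(0, 15) g;

-- Each writer picks a bucket (here randomly), so concurrent writers
-- rarely collide on the same row.
UPDATE events_counter_buckets
SET total = total + 1
WHERE bucket = floor(random() * 16)::int;

-- The real count is the sum over all buckets (16 rows, trivial to scan).
SELECT sum(total) FROM events_counter_buckets;
```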
Actually, this is maybe the case when a primary key is definitely against our goals; good performance goes against the primary key here anyway. But yeah, I mean, this is a concept from theory as well.
At the same time, when we learn relational theory and normalization, we also learn that
tables without primary keys
are basically also breaking the rules, right?
It's interesting.
Isn't there like a saying, something like,
first you have to learn all the rules,
so then you know how and when it's safe to break them?
Yeah, yeah.
Well, the whole denormalization idea:
you need to learn how to do it right
and then how to break it right
to have good performance.
So let's, we don't have a lot of time,
so let's discuss this most interesting maybe case.
Back to EAV maybe, but not exactly.
In my projects, I did it several times.
Imagine you have social media, for example.
You have a lot of users, say a few million users,
and you have a lot of objects, say dozens of millions of objects.
It can be posts or comments, anything.
Maybe it's not only like posts.
There is also something, some concept, like kind of blog organization or some data
source where posts exist, right? And people can relate to those
channels, right? Let's say channels. And for example,
they can subscribe, or they can have permissions to view them or
not to view them, different things.
And then you just need to display the freshest hundred entries, posts or something, right?
Relevant to them.
Relevant to this person, yeah, only from subscriptions.
Or only that data which is allowed to view, to access, to edit or anything. Or even worse,
if you want to show 100 last updated or last accessed, last changed, different kinds of
timestamps can happen here. And it's interesting because sometimes timestamps belong to the
objects itself or posts, for example, creation time or modification time. But some of timestamps can
belong to this particular relationship between users and those channels or posts themselves.
For example, last accessed, right? It's definitely for each person, it's different
in respect to particular post, right? One user accessed at one day,
another user at the same day but different time, right?
So timestamp is different.
So this particular task is very common.
This is a common pattern.
Well, not a pattern; "pattern" maybe should be applied to solutions, right?
But usually people just do selects, joins, and that's it.
And at some point, performance becomes
terrible. It happened first in one of my social media projects, and I
was super upset. Like, I thought it's really similar to
graph issues, like working with graphs. For example,
return the first circle of connections, second circle of
connections, and so on,
and try to find some connections, like in LinkedIn, right?
I remember many, many years ago, we were trying to solve it in relational database.
It's quite difficult.
We didn't have recursive CTE at the time and lateral joins and so on.
So it's a hard problem to solve.
And the query has several joins and filtering and ORDER BY, right?
But some filter columns exist in one table, and different columns exist in a different
table. When filters and ORDER BY are on different columns in different tables, it means you cannot
have the ideal situation, a single index scan, or even an index-only scan, which is very good.
Right. And then you need to rely on one of the three join algorithms Postgres implements.
And I can assure you there will be edge cases where the statistics and the planner have no idea how to execute this well.
So this is the age old issue of which is more selective.
Like,
is the planner going to be better off going through an index on the ORDER BY
until it finds enough posts, or whatever the example is, to satisfy the limit?
Or is it going to be cheaper to look at all posts that satisfy the other conditions
and then order those, assuming they're a smaller set? And unless the statistics are good,
you could end up doing the wrong one, you know, the more expensive one, and have a
very slow query if it's huge. Exactly. Imagine we order by creation time, and we just subscribe to channels, channels have posts. So we have a users table, we join with channels, which have subscriptions and posts, and then we order by creation time of posts and limit 100, right? ORDER BY created_at DESC LIMIT 100. So in this case, indeed, as you say, Postgres needs to choose
what to do. Like, is it a good idea to extract all subscriptions and then order by in memory
and find 100? Or is it better to use an index on created_at, go backwards on this index, and try to
find posts which can be displayed for this particular user, meaning that they are among the subscriptions?
Actually, both paths are bad.
They might be good in some cases,
but at larger scale, there will definitely be cases
where both paths are performing really bad,
dealing with a lot of buffer operations and bad timing, right?
Execution timing.
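(For illustration, the kind of query being discussed might look like this; the schema is hypothetical. The filter lives in subscriptions while the ORDER BY column lives in posts, so no single index can serve both.)

```sql
SELECT p.*
FROM posts p
JOIN subscriptions s ON s.channel_id = p.channel_id
WHERE s.user_id = 12345
ORDER BY p.created_at DESC
LIMIT 100;
-- The planner must either walk an index on posts.created_at backwards,
-- discarding posts from channels this user is not subscribed to,
-- or fetch all posts from subscribed channels and sort them in memory.
```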
So how would you solve this particular problem?
What's the solution?
Yeah, well, denormalization is one of the ways, right?
So, like, storing a duplicate of this data in, like, a materialized view? Is that what you're saying? Yeah, for example, we take the creation time
and we just propagate it to the...
For example, if it's a relationship
to particular items, posts,
we can just propagate it to this table
which represents this relationship.
And then we can even have the ideal situation,
a single index scan.
Not an index-only scan, because we usually need to
join and bring the actual data, which maybe is not in the index, but it might be an index-only
scan for this particular table, right, which holds the relationship. This is one of the ways. Of
course, if you have channels, well, it's more difficult, because we order by creation time of posts, not channels,
right?
So it's, yeah.
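(A minimal sketch of that propagation, assuming a hypothetical per-user relationship table into which the post's creation time is copied on insert.)

```sql
-- Relationship table with the post's creation time denormalized into it.
CREATE TABLE user_feed_items (
  user_id    bigint NOT NULL,
  post_id    bigint NOT NULL,
  created_at timestamptz NOT NULL,  -- copied from posts.created_at by the app or a trigger
  PRIMARY KEY (user_id, post_id)
);

-- Index matching the access pattern: newest items for one user.
CREATE INDEX user_feed_items_user_created
  ON user_feed_items (user_id, created_at DESC);

-- Now the hot part is a single index scan on one table; we join to posts
-- only for the final 100 rows.
SELECT p.*
FROM user_feed_items f
JOIN posts p ON p.id = f.post_id
WHERE f.user_id = 12345
ORDER BY f.created_at DESC
LIMIT 100;
```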
Yeah, I think some of these cases, you mentioned the one where different users have different
timestamps, for example, if it's like last viewed, I think that gets really complicated
in terms of...
No, vice versa.
I think it's good, because if we order by last access, for example, a last-access timestamp on the channel,
we just deal with channels; we can propagate it to the posts themselves. Well, we cannot. Yeah,
I agree with you, it really becomes complicated. So there are cases where it's really hard to apply
this denormalization, but there are cases where it's easy. If we forget about channels and, for example, think about the relationship between users and these objects, it can be,
for example, permissions, or it can be, I don't know, last-access timestamps, which should be
stored neither in the users table nor in the objects table. They should be stored in, like, the table in between,
right? Like, a many-to-many relationship. So in this case, usually,
by default, it's good: we just use this table, we have an index, we quickly find what we need, 100
objects for this particular user that should be displayed. But in real life we usually need additional filtering, involving data inside the objects table, right?
It can be, well, soft delete, for example; we mentioned soft delete.
So, like, a deleted_at timestamp. If it's filled, if it's not null, then this object should not be shown in this result set, right?
But we can propagate when we delete.
We can propagate to all entries this object has, to all users, right?
We can propagate to this table.
Of course, it can be a lot of rows.
This depends on situation.
I would not propagate it synchronously
if it's more than like 1,000 rows,
because delete will be super heavy.
Soft delete, it's update actually, right?
Yeah.
But if we know a limited number of users can access each object, like not more than 1,000, say, or 10,000,
we can do that. Or we can do it asynchronously. Again, there is some need, I would say, not for incremental materialized views, but maybe for asynchronous triggers. In Oracle there
is PRAGMA AUTONOMOUS_TRANSACTION; it's not exactly the same, but yeah. And there are folks who are using
PgQ, for example. In Cloud SQL it's available, right? One of the good things about Cloud SQL is the availability
of PgQ, so you can implement asynchronous triggers there yourself.
So if some object is updated to be soft-deleted, we propagate that to all rows of this last-accessed or
last-viewed or something table, right? In this case, we can have, again, a single
index scan, or an index-only scan. But it might be more difficult than that, right?
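(One possible sketch of that asynchronous propagation; all names are hypothetical, the queue could just as well be PgQ or something external, and the per-user table is assumed to carry a boolean deleted flag.)

```sql
-- When an object is soft-deleted, enqueue the fact instead of updating
-- every per-user row in the same transaction.
CREATE TABLE soft_delete_queue (
  object_id   bigint PRIMARY KEY,
  enqueued_at timestamptz NOT NULL DEFAULT now()
);

-- A background worker drains the queue in batches and propagates the flag
-- to the denormalized per-user table (user_feed_items with a "deleted" column).
WITH batch AS (
  DELETE FROM soft_delete_queue
  WHERE object_id IN (
    SELECT object_id
    FROM soft_delete_queue
    ORDER BY enqueued_at
    LIMIT 100
    FOR UPDATE SKIP LOCKED
  )
  RETURNING object_id
)
UPDATE user_feed_items f
SET deleted = true
FROM batch b
WHERE f.post_id = b.object_id;
```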
There is another way; it's not about denormalization, and this is how we solved it.
And not only we: I know GitLab also has a to-do list and they apply the same recipe,
which I think one day we will blog about. Forget about denormalization and just use a huge recursive CTE with some tricks,
some algorithm actually implementing this newsfeed pattern,
to solve it in a very interesting way.
Oh, is it like the skip scan recursive CTE?
It's not skip scan, it's like advanced skip scan.
So you know you need to return 100 rows.
You have 100 slots to fill,
and then you start working with channels,
filling these slots,
and each time you work with the next channel,
you think, okay, this channel has a fresh post,
for example, has a fresh object,
and this is the place where we replace some
object we keep in memory; we replace it and put it into the slot. And at some point you finish
and return. But it requires some effort to implement.
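(The full recursive CTE they mention is more involved than this; a simplified way to express the same bound, at most 100 candidates per subscribed channel and then a global top 100, is a lateral join over a hypothetical schema.)

```sql
-- Take at most the newest 100 posts from each subscribed channel,
-- then merge and keep the overall newest 100.
-- An index on posts (channel_id, created_at DESC) keeps each inner scan cheap.
SELECT c.*
FROM subscriptions s
CROSS JOIN LATERAL (
  SELECT id, channel_id, created_at
  FROM posts
  WHERE channel_id = s.channel_id
  ORDER BY created_at DESC
  LIMIT 100
) c
WHERE s.user_id = 12345
ORDER BY c.created_at DESC
LIMIT 100;
```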
Yeah, I can imagine. That sounds really cool. It reminds me a little bit of one of the latest features in pgvector. Have you seen this? In 0.8, I think, it added being able to continue a scan.
Like if you...
It didn't help us, we checked yesterday.
Oh, no.
But the idea is the same, right?
If you don't yet get enough results in your limit, go back and continue the scan.
It's good if you have a quite uniform database.
In our case, we are speaking about Postgres AI, for example: we have more than 1 million
documents, and more than 90% of them are mailing list entries, emails, right?
And when, for example, you ask for a rare source category in this huge set, this approach doesn't help.
It still encounters so many mailing list entries that it cannot find that source.
So basically, we are considering... I think we're currently already using partial indexes there.
And I'm seriously thinking we should move to separate tables, because we have only limited categories,
and in our use cases
we very often deal with only one source.
You could maybe partition on source.
Yeah.
Well, yeah.
Actually, it's also a good idea, maybe, yeah.
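(Rough sketches of both options; the table, column, and source values are hypothetical.)

```sql
-- Hypothetical documents table with a "source" category column.
CREATE TABLE documents (
  id         bigint PRIMARY KEY,
  source     text NOT NULL,
  created_at timestamptz NOT NULL,
  body       text
);

-- Option 1: a partial index per rare source, so queries for that source
-- don't wade through millions of mailing-list rows. (The same WHERE clause
-- could be attached to a vector index in the pgvector case discussed above.)
CREATE INDEX documents_blog_created_at
  ON documents (created_at DESC)
  WHERE source = 'blog';

-- Option 2, Michael's suggestion: LIST-partition by source, so each
-- category effectively lives in its own table.
CREATE TABLE documents_partitioned (
  id         bigint NOT NULL,
  source     text NOT NULL,
  created_at timestamptz NOT NULL,
  body       text,
  PRIMARY KEY (id, source)
) PARTITION BY LIST (source);

CREATE TABLE documents_mailing_list
  PARTITION OF documents_partitioned FOR VALUES IN ('mailing_list');
CREATE TABLE documents_blog
  PARTITION OF documents_partitioned FOR VALUES IN ('blog');
```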
So we deviated here
from the original denormalization topic.
And I think it's another story how to implement
a newsfeed better without denormalization, but with a huge recursive CTE.
My summary is that, first of all, denormalization should be applied only if you know normalization,
and it can bring new performance dangers if you don't think about concurrency and various
issues with Postgres MVCC, locking, the lock manager, for example, and so on. So it's an interesting
thing, still quite useful to consider in many cases, but it should be tested well, as usual.
I think it's one of those sharp tools,
right?
Like,
sometimes advanced craftsmen need sharp tools,
but be careful with a sharp one.
Yeah.
Yeah.
Something like this.
Agree.
Wonderful.
Thanks so much,
Nikolay.
Catch you next week.
Sounds good.
Have a good week.
You too.
Bye.
Bye.