Postgres FM - Denormalization
Episode Date: November 8, 2024

Nikolay and Michael discuss denormalization in Postgres — when and why to denormalize things, and a couple of specific cases Nikolay came across recently.

Here are some links to things they mentioned:

Denormalization https://en.wikipedia.org/wiki/Denormalization
Our episode on materialized views https://postgres.fm/episodes/materialized-views
Our episode on data model trade-offs https://postgres.fm/episodes/data-model-trade-offs
Our episode with Markus Winand https://postgres.fm/episodes/modern-sql
Universal Relation Data Modelling Considered Harmful (blog post by Michael Stonebraker and Álvaro Hernández) https://www.enterprisedb.com/blog/universal-relation-data-modelling-considered-harmful
Boyce–Codd normal form https://en.wikipedia.org/wiki/Boyce%E2%80%93Codd_normal_form
Our episode on slow count https://postgres.fm/episodes/slow-count
pg_ivm https://github.com/sraoss/pg_ivm
denorm https://github.com/rivethealth/denorm
Postgres Materialized Views, The Timescale Way (blog post by David Kohn) https://www.timescale.com/blog/materialized-views-the-timescale-way/
PgQ https://github.com/pgq/pgq
Databases, Types, and the Relational Model (by C.J. Date and Hugh Darwen) https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello, hello, this is Postgres FM. I'm Nikolay from Postgres.ai and as usual my co-host is Michael from pgMustard.
Hi Michael, how are you doing?
Hello Nikolay, I am good, thank you. How are you?
I'm very good. So, our topic today: we are going to discuss denormalization,
but at a fairly high level, the basics, and the pros and cons of applying it as a methodology.
So we already had a couple of episodes.
We discussed materialized views.
It's a very close topic to this one.
What else?
We discussed schema design.
Yeah, we called it data model trade-offs.
I'll link up those two.
Yeah, I think data model trade-offs was probably the closest to this, because naturally it came up as a big topic. And the third one it came
up in was a good chat I had with Markus Winand, and he made some great points around normalization
and denormalization, or at least normalization. I'll link that up as well. Maybe
I need to re-listen to that one. Yeah, it was my choice for this week. And I'm just thinking that it's important to get back to this topic, at some maybe slightly different angles.
Because I keep seeing issues with this and cases of this.
And I have two different customers who have consulting contracts with Postgres.ai, and I just observed
the need for denormalization, or at least the need to consider applying denormalization, to solve
some performance issues.
But before we discuss performance and reasons, let me say this.
If you don't understand normalization, don't use denormalization. You need first to learn
normalization and feel it and use it. Because otherwise, with relational databases, it can be
tricky to understand particular cases and limitations and consequences of applying
denormalization. I'm not talking about non-relational databases,
but we can maybe mention the work by Stonebraker
in the EDB blog discussing MongoDB particularly,
or maybe DynamoDB there as well,
I don't remember exactly:
the problems which happen to MongoDB and, I think, document databases in general,
with big JSONs and the tendency to have a denormalized design by default, and issues
later, both flexibility-wise and performance-wise.
Was that the blog post with Alvaro? Yeah.
So it also had the word harmful,
which I inherited and then got a big backlash
on my sub-transactions-related blog post
on Hacker News.
Yeah.
So let's keep that aside
and consider we talk about only Postgres
and maybe slightly wider relational databases.
First thing you need to understand is normalization: normal forms, at least the third, fourth, and Boyce–Codd normal forms, right?
Functional dependencies.
It's a very interesting thing to understand.
It's not only theory, although it feels like theory and always felt like theory to me.
But when you design your schema, if you understand these things,
if you learned and studied them,
you feel much better thinking about how to design the schema:
tables and columns and foreign keys between them, right?
Yeah, and in the real world, at least in the last three to five years, I think in the vast majority of systems I've seen, this is something that backend engineers, full-stack engineers,
generally do look up on.
They do know it's important,
and they do some thinking and upfront design work,
get second opinions and things. So I actually haven't come up...
I have come across some cases of, you know,
like, EAV, entity-attribute-value,
like, nonsense,
or just things that haven't been fully thought through,
but they've been really rare
compared to people that have done things pretty
well from the start.
EAV is radical normalization, right?
Which leads to bad performance
usually. Yeah, that's
a good point. But I think
many backend engineers look
at this topic, normalization versus
denormalization in general,
through the lens of an ORM.
They have different terminology, right?
Sometimes it's hard to understand each other talking.
And also, I think today,
the most powerful LLMs do a great job designing,
suggesting designs.
For example, we integrated this and it produces Mermaid diagrams.
So if you ask to design, I don't know, like Twitter-like or Facebook-like,
YouTube-like schema or e-commerce, and then you iterate, adding tables,
it generally tries to apply common sense in terms of normalization as well
that makes sense
but again before starting to use
denormalization it's good to understand
normalization first several forms
Then,
first thing: let's say sometimes it's
convenient to have denormalization.
I maybe cannot think about a particular case right now,
but it happens.
I know it happens.
Sometimes you just think,
there is functional dependency here, right?
But I don't go and normalize
and move this data to a separate table, for example,
because I know, for example,
this functional dependency might vanish in the future.
It might change.
I just know my field,
and this functional dependency is only for now.
It feels so, but it might change.
In this case, it might be beneficial
and provide more flexibility
if you don't apply the next step of normalization in this situation.
Yeah, it sounds abstract.
And yeah, I cannot imagine some...
I can think of one example.
Imagine if you're an early stage startup without product market fit and you're adding features quite quickly.
A simple example would be: you're a blogging platform and you think adding tags would be a good idea. So instead of adding a whole extra schema around that,
you could have a single column where you let people add a few tags.
And you know that might not be how you want it to be eventually,
but you also might remove that feature in a couple of weeks if people hate it or something.
So, like, you might, for convenience, go with the one-column version where you don't normalize that.
Well, this particular case, I remember my talk in
Maryland in 2008, the first time I came to the US, speaking about how good
Postgres is for Web 2.0. It's been like 16 years, right? And
this was exactly my example: tags.
And I talked about HStore.
Or today we can talk about Array.
Back at that time, there was no GIN at all.
GIN was not created yet, so we used only GiST, and hstore or ltree in some cases. It depends.
But in the case of tags, it's natural
to think, okay, I'm going to put this into a different table, and then you need an additional table to
have a many-to-many relationship. And this is EAV, exactly. This is what is going to have bad
performance if you have a lot of entries in the main table,
like your objects, and also a lot of tags
and a lot of relationships between them.
It's really...
This join can be problematic sometimes.
But we will probably return to this schema,
like a three-table schema, later.
What I wanted to emphasize, I cannot
just not comment here, it's about performance. These days I would consider arrays with tags,
or maybe arrays with IDs. We know foreign keys don't support this: one row, and you have tags, tag IDs, right? And then, like, multiple foreign keys
to different entries in the tags table.
But we can live without the foreign key here.
For the sake of having good performance,
we can just have these IDs,
or just the tags as is, like a text array, right?
So an int8 (bigint) array or a text array in this case,
using GIN, should be good. JSONB, I feel, would add an additional
layer of something here which we don't really need, because we just need IDs or a bunch of phrases,
texts, right?
So I don't know.
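(For illustration, a minimal sketch of this approach: tags kept as an array column with a GIN index instead of a three-table many-to-many design. Table and column names are hypothetical.)

```sql
-- Hypothetical schema: tags stored directly on the posts table as an array,
-- instead of a separate tags table plus a many-to-many join table (EAV-style).
CREATE TABLE posts (
  id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  title text NOT NULL,
  tags  text[] NOT NULL DEFAULT '{}'  -- or int8[] of tag IDs; no FK support for array elements
);

-- GIN index so tag lookups don't scan the whole table.
CREATE INDEX posts_tags_gin ON posts USING gin (tags);

-- Find posts carrying a given tag (can use the GIN index).
SELECT id, title
FROM posts
WHERE tags @> ARRAY['postgres'];
```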
Do you think this is denormalization?
I think, strictly speaking, yeah.
Well, let's not just say we are breaking first normal form here. Because, of course, first normal form
says that each value, basically each cell of
a table, should be atomic, right? It should not be a table, for example,
or, in this case, an array or JSON. We are breaking it, yeah, strictly speaking. But in that
sense, every time we use an array we are already breaking
normal form. In what sense is it denormalization? If we are repeating values,
for example, later we find a typo in some tag and need to adjust it, this is a functional
dependency, of course, right? And we will need to update a lot of rows here.
Of course, it's better to store IDs and have tag values in a separate table, right?
So in this case, yeah, we can even join in an interesting way, maybe with a lateral join or unnest here.
It's interesting, but it's a separate topic. So if we don't store the text values, I don't see the denormalization aspect here, right? We just treat
the first normal form in an interesting way. This discussion happened 20 years ago, I think. I remember some of it from Chris Date, right? From
some works. Like, are we breaking the first normal form in such cases or not? But if we
store just IDs, well, it's quite well normalized and we don't have an EAV, so we can use GIN. Search is much faster if we need some tags; this query
is going to be much more efficient, in my opinion. But foreign keys, unfortunately, this is what we
lose in such an approach, right? So flexibility is one thing.
A different thing, and this is the most interesting, of course: performance. We already started discussing performance. I would highlight two particular cases; these are exactly the two particular
customers I observed over the last couple of weeks, two particular situations. Let's start from
maybe the simpler one: slow count, right? Did we have an episode about it? I'm sure we did. Yeah. Yeah. So, slow count, or in a broader view, slow aggregates,
right? So there are approaches to solve it using, like, index-only scans. We also had an episode, I
think. We're covering more and more things, which is great. But if we see that it's not enough, what else?
We need to start denormalizing in some form.
And obviously, we could use a materialized view which has all the counts,
but it's very asynchronous.
Like pre-calculated.
Yeah, pre-calculated.
You cannot do it often.
For example, we have a huge table, like
10 billion rows, and we insert one more row, and we say, okay, our
materialized view is just CREATE MATERIALIZED VIEW, blah blah blah, name, AS SELECT count(*) FROM
the table. Interesting, right? Is it a good idea to refresh it often? Well, maybe no, we should do it
not often. In this case, it will be lagging and showing not quite relevant information, a kind of
eventual consistency. Yes, it will reflect the next insert eventually, unless it gets deleted, right?
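(As a rough sketch of what that looks like; the table and view names here are hypothetical.)

```sql
-- Hypothetical large table.
CREATE TABLE events (
  id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  created_at timestamptz NOT NULL DEFAULT now()
);

-- Pre-calculated count, as discussed: cheap to read, expensive to refresh.
CREATE MATERIALIZED VIEW events_count AS
SELECT count(*) AS total
FROM events;

-- Each refresh re-runs the full count(*) over the whole table:
REFRESH MATERIALIZED VIEW events_count;

-- Readers get the pre-calculated (possibly stale) value instantly:
SELECT total FROM events_count;
```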
But it will be lagging, and some people can notice it. Not good. So what else?
Like, we can consider something synchronous. Yeah, and also, when you refresh a materialized view, it's scanning
everything. It could skip some old data, understanding it hasn't changed, and just count
only the fresh part of the table. Maybe we
have partitioning there, I don't know. But Postgres doesn't support this, only additional
extensions like pg_ivm. Or what was the name of the other extension? I forgot, I just found it
recently when you mentioned it. Was it denorm? Yeah. It was interesting. I think it's Python and external to Postgres,
because pg_ivm is an extension.
So obviously you don't have it on many managed Postgres offerings.
But if it's external or, for example, you implement it yourself,
it's quite possible using maybe triggers or something,
maybe some queue mechanism involved externally
to make it asynchronous and incrementally updated. In this case it can be good, and maybe more
resilient, updating faster and having more up-to-date data, right? So the problem with materialized views is that it's a heavy, very rough
tool. In most cases when I deal with them I say let's try to avoid them, get rid of them, right?
In this case particularly as well. By the way, we could also build our own kind of
materialization mechanism using logical decoding, right?
Logical replication even.
For example, accumulating some events through...
It can also be an external queue mechanism,
but also using logical replication.
If you batch and then update,
not on each insert,
but after, like, 100 inserts or updates and so on,
deletes also.
And then you reflect this change, and nothing is lost, because it's quite a Postgres-centric approach, right? Because Postgres guarantees
nothing is lost. What do you think? Well, I like it. I remember reading a good blog post by Timescale
on how continuous aggregates work, and I remember thinking
I wouldn't want to implement that myself. Like, I remember thinking,
keeping it synchronous, or, like, synchronous enough, is just quite painful. So I do admire people
that have done that, and yeah, I can see the argument for it. But I also think, at this point,
once you're considering that, you probably also should be considering the synchronous approaches like triggers, or some other mechanism in Postgres which is not yet developed,
which could support some kind of asynchronous way to update things. Oh, I'd love it if it was in core,
that would be amazing. Yeah, in core, exactly. This would be great, but it's maybe a hard task.
And also not that high up the priority list for most people, I think.
Yeah.
But it would give a lot of interesting things,
interesting capabilities to develop
interesting, well-performing and responsive things.
I mean, not responsive, but data is coming in
and you reflect it within a few seconds.
It's great.
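(A minimal sketch of this batched, queue-style idea; all names are hypothetical, and in practice the deltas could be fed by a trigger, by logical decoding, or by an external queue.)

```sql
-- Counter table holding the current total.
CREATE TABLE events_counter (
  id    int PRIMARY KEY DEFAULT 1,
  total bigint NOT NULL DEFAULT 0
);
INSERT INTO events_counter (id, total) VALUES (1, 0);

-- Append-only delta queue; writers insert +1 / -1 here instead of
-- updating the counter row directly.
CREATE TABLE events_counter_deltas (
  id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  delta int NOT NULL
);

-- Periodic job: fold all accumulated deltas into the counter in one batch.
WITH consumed AS (
  DELETE FROM events_counter_deltas
  RETURNING delta
)
UPDATE events_counter
SET total = total + coalesce((SELECT sum(delta) FROM consumed), 0)
WHERE id = 1;
```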
By the way, when you mentioned performance, I looked up the
definition of denormalization on Wikipedia. And it said
denormalization is a strategy on a previously normalized database
to increase performance. So they're explicitly saying that.
Right, but we can assume that in our heads sometimes we normalize, and then we
move back and denormalize, and then go and deploy it right away in production.
Oh yeah. I thought there was an interesting part there that says it
has to be previously normalized, but I thought it was also interesting that the only reason
they gave was performance.
Well, maybe I'm wrong in thinking it can be not only for performance reasons.
Of course, in my experience, I did it for performance reasons.
Let's be honest here.
I just wanted to cover maybe cases which can happen.
You mentioned a couple of cases that came up recently.
Yeah, well, before we move to the next one
I wanted to emphasize: if we decide we are not going to use a materialized view, what else?
We have a table, we insert
but we need the count very fast
Imagine we have a trigger and we
update that count on each insert
or update
or delete
Update and delete, yeah. Delete; update
maybe not, depending. Of course, if you have a soft-delete approach, then you need to reflect updates,
because they might be doing soft deletion, right? Well, and if it's only a count, yeah, sure.
But if it's a different kind of aggregate, then you might have to worry about...
Averages, sum, and so on.
Yeah, yeah, min, max.
Well, min, max probably is fast enough, thanks to index anyway.
Yeah, by the way, of course, there are ongoing discussions about having column store in Postgres all the time,
which might be good here, but maybe not. It
depends on the particular case, because if your main part of data needs to be row store,
you still need to deal with some replication inside your engine or outside of your engine.
So replication means like maybe it's triggers, maybe it's logical replication involved,
who knows. But yeah, it can be interesting.
But again, we have a main table and a trigger
which, on each insert, increments this counter
in this additional table.
What do you think will happen?
Is it good?
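(For concreteness, a minimal sketch of such a synchronous trigger, assuming hypothetical events and events_counter tables.)

```sql
CREATE TABLE events (
  id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  payload text
);

-- Single counter row, maintained synchronously by the trigger below.
CREATE TABLE events_counter (
  id    int PRIMARY KEY DEFAULT 1,
  total bigint NOT NULL DEFAULT 0
);
INSERT INTO events_counter (id, total) VALUES (1, 0);

CREATE FUNCTION events_counter_trg() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE events_counter SET total = total + 1 WHERE id = 1;
  ELSIF TG_OP = 'DELETE' THEN
    UPDATE events_counter SET total = total - 1 WHERE id = 1;
  END IF;
  RETURN NULL;  -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER events_counter_maintain
AFTER INSERT OR DELETE ON events
FOR EACH ROW EXECUTE FUNCTION events_counter_trg();
```

Every writer now has to update that same single counter row, which is where the trouble starts.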
Like hotspots.
We've talked, yeah.
Hotspots.
Yeah.
At sufficient scale.
And I guess all of this is only relevant at sufficient
scale, isn't it? Yeah, yeah. So this synchronous approach can be super dangerous because of
locking: heavy locking, and even without heavy locking, heavy locks. If we have foreign keys
there, for example, then it's going to have multixact IDs involved, and even SELECTs can degrade.
I think you can spread out the hotspot.
You don't have to have a single record that gets updated.
You could have a series and that series could grow
as your concurrency grows.
Right, right.
There's, like, yeah, like batches in place, right?
Then you just sum all of it, and this is your total count, right?
That's good, yeah, yeah.
This is actually what we did a few times.
We just had batches, and I don't remember particularly how.
Basically, some kind of partitioning of your count inside one table, right?
You partition those counts into, like, buckets or batches, how to say, and then you
increment them, sometimes collapsing them, maybe, and so on. Yeah. Also, vacuum is an issue here if you update
that table very often, and one row can be super heavy in terms of real disk space
because of dead tuples, right? Yeah. So Postgres MVCC is not good in this particular case. Well, I think
you'd ideally be aiming for those to be HOT updates. Like, there's no reason, I think, to index that column,
so I think you'd be aiming for those to be HOT updates. Good point, yeah, good point. And then
hopefully you avoid vacuum issues at all. Yeah, it's great. Like, an index-less table, yeah.
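(A minimal sketch of the bucketed idea; the bucket count and names are arbitrary. There is deliberately no index on the table, so the counter updates can stay HOT.)

```sql
-- Spread the hotspot across N rows instead of one; no index at all,
-- so every update can be a HOT update.
CREATE TABLE events_counter_buckets (
  bucket int    NOT NULL,   -- 0..15
  total  bigint NOT NULL DEFAULT 0
);
INSERT INTO events_counter_buckets (bucket)
SELECT g FROM generate_series(0, 15) g;

-- Each writer picks a bucket (here randomly), so concurrent writers
-- rarely collide on the same row.
UPDATE events_counter_buckets
SET total = total + 1
WHERE bucket = floor(random() * 16)::int;

-- The real count is the sum over all buckets (16 rows, trivial to scan).
SELECT sum(total) FROM events_counter_buckets;
```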
Actually, this is maybe the case when a primary key is definitely against our goals; good performance goes against the primary key here anyway. But yeah, I mean, this is a concept from theory as well.
At the same time, when we learn relational theory and normalization, we also learn that
tables without primary keys
are basically also breaking the rules, right?
It's interesting.
Isn't there like a saying, something like,
first you have to learn all the rules,
so then you know how and when it's safe to break them?
Yeah, yeah.
Well, the whole denormalization idea:
you need to learn how to do it right
and then how to break it right
to have good performance.
So let's, we don't have a lot of time,
so let's discuss this most interesting maybe case.
Back to EAV maybe, but not exactly.
In my projects, I did it several times.
Imagine you have social media, for example.
You have a lot of users, say a few million users,
and you have a lot of objects, say dozens of millions of objects.
It can be posts or comments, anything.
Maybe it's not only like posts.
There is also something, some concept, like kind of blog organization or some data
source where posts exist, right? And people can relate to those
channels, right? Let's say channels. And for example,
they can subscribe, or they can have permissions to view them or
not to view them, different things.
And then you just need to display the freshest hundred entries, posts or something, right?
Relevant to them.
Relevant to this person, yeah, only from subscriptions.
Or only that data which is allowed to view, to access, to edit or anything. Or even worse,
if you want to show 100 last updated or last accessed, last changed, different kinds of
timestamps can happen here. And it's interesting because sometimes timestamps belong to the
objects itself or posts, for example, creation time or modification time. But some of timestamps can
belong to this particular relationship between users and those channels or posts themselves.
For example, last accessed, right? It's definitely for each person, it's different
in respect to particular post, right? One user accessed at one day,
another user at the same day but different time, right?
So timestamp is different.
So this particular task is very common.
This is a common pattern.
Well, not a pattern; "pattern" maybe should be applied to solutions, right?
But usually people just do selects, joins, and that's it.
And at some point, performance becomes
terrible. It happened first in one of my social media projects, and I
was super upset. Like, I thought it's really similar to
graph issues, like working with graphs. For example,
return the first circle of connections, second circle of
connections, and so on,
and try to find some connections, like in LinkedIn, right?
I remember many, many years ago, we were trying to solve it in relational database.
It's quite difficult.
We didn't have recursive CTE at the time and lateral joins and so on.
So it's a hard problem to solve.
And the query has several joins and filtering and ORDER BY, right?
But some filter columns exist in one table, and different columns exist in a different
table. When filters and ORDER BY are on different columns in different tables, it means you cannot
have the ideal situation, a single index scan, or even an index-only scan, which is very good.
Right. And then you need to rely on one of the three join algorithms Postgres implements.
And I can assure you there will be edge cases where the statistics and the planner have no idea how to execute this well.
So this is the age old issue of which is more selective.
Like,
is the planner going to be better off going through an index on the ORDER BY
until it finds enough posts, or whatever the example is, to satisfy the limit?
Or is it going to be cheaper to look at all posts that satisfy the other conditions
and then order those, assuming they're a smaller set? And unless the statistics are good,
you could end up doing the wrong one, you know, the more expensive one, and have a
very slow query if it's huge. Exactly. Imagine we order by creation time, and we just subscribe to channels, channels have posts. So we have a users table, we join with channels, which have subscriptions and posts, and then we order by creation time of posts and limit 100, right? ORDER BY created_at DESC LIMIT 100. So in this case, indeed, as you say, Postgres needs to choose
what to do. Like, is it a good idea to extract all subscriptions and then order by in memory
and find 100? Or is it better to use an index on created_at, go backwards on this index, and try to
find posts which can be displayed for this particular user, meaning that they are among the subscriptions?
Actually, both paths are bad.
They might be good in some cases,
but at larger scale, there will definitely be cases
where both paths are performing really bad,
dealing with a lot of buffer operations and bad timing, right?
Execution timing.
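(For illustration, the kind of query being discussed might look like this; the schema is hypothetical. The filter lives in subscriptions while the ORDER BY column lives in posts, so no single index can serve both.)

```sql
SELECT p.*
FROM posts p
JOIN subscriptions s ON s.channel_id = p.channel_id
WHERE s.user_id = 12345
ORDER BY p.created_at DESC
LIMIT 100;
-- The planner must either walk an index on posts.created_at backwards,
-- discarding posts from channels this user is not subscribed to,
-- or fetch all posts from subscribed channels and sort them in memory.
```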
So how would you solve this particular problem?
What's the solution?
Yeah, well, denormalization is one of the ways, right?
So, like, storing a duplicate of this data in, like, a materialized view? Is that what you're saying? Yeah, for example, we take the creation time
and we just propagate it to the...
For example, if it's a relationship
to particular items, posts,
we can just propagate it to this table
which represents this relationship.
And then we can even have the ideal situation,
a single index scan.
Not an index-only scan, because we usually need to
join and bring the actual data, which maybe is not in the index, but it might be an index-only
scan for this particular table, right, which holds the relationship. This is one of the ways. Of
course, if you have channels, well, it's more difficult, because we order by creation time of posts, not channels,
right?
So it's, yeah.
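(A minimal sketch of that propagation, assuming a hypothetical per-user relationship table into which the post's creation time is copied on insert.)

```sql
-- Relationship table with the post's creation time denormalized into it.
CREATE TABLE user_feed_items (
  user_id    bigint NOT NULL,
  post_id    bigint NOT NULL,
  created_at timestamptz NOT NULL,  -- copied from posts.created_at by the app or a trigger
  PRIMARY KEY (user_id, post_id)
);

-- Index matching the access pattern: newest items for one user.
CREATE INDEX user_feed_items_user_created
  ON user_feed_items (user_id, created_at DESC);

-- Now the hot part is a single index scan on one table; we join to posts
-- only for the final 100 rows.
SELECT p.*
FROM user_feed_items f
JOIN posts p ON p.id = f.post_id
WHERE f.user_id = 12345
ORDER BY f.created_at DESC
LIMIT 100;
```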
Yeah, I think some of these cases, you mentioned the one where different users have different
timestamps, for example, if it's like last viewed, I think that gets really complicated
in terms of...
No, vice versa.
I think it's good, because if we order by last access, for example, a last-access timestamp on the channel,
we just deal with channels; we can propagate it to the posts themselves. Well, we cannot. Yeah,
I agree with you, it really becomes complicated. So there are cases where it's really hard to apply
this denormalization, but there are cases where it's easy. If we forget about channels and, for example, think about the relationship between users and these objects, it can be,
for example, permissions, or it can be, I don't know, last-access timestamps, which should be
stored neither in the users table nor in the objects table. They should be stored in, like, the table in between,
right? Like, a many-to-many relationship. So in this case, usually,
by default, it's good: we just use this table, we have an index, we quickly find what we need, 100
objects for this particular user that should be displayed. But in real life we usually need additional filtering, involving data inside the objects table, right?
It can be, well, soft delete, for example; we mentioned soft delete.
So, like, a deleted_at timestamp. If it's filled, if it's not null, then this object should not be shown in this result set, right?
But we can propagate when we delete.
We can propagate to all entries this object has, to all users, right?
We can propagate to this table.
Of course, it can be a lot of rows.
This depends on situation.
I would not propagate it synchronously
if it's more than like 1,000 rows,
because delete will be super heavy.
Soft delete, it's update actually, right?
Yeah.
But if we know a limited number of users can access each object, like not more than 1,000, say, or 10,000,
we can do that. Or we can do it asynchronously. Again, there is some need, I would say, not for incremental materialized views, but maybe for asynchronous triggers. In Oracle there
is PRAGMA AUTONOMOUS_TRANSACTION; it's not exactly the same, but yeah. And there are folks who are using
PgQ, for example. In Cloud SQL it's available, right? One of the good things about Cloud SQL is the availability
of PgQ, so you can implement asynchronous triggers there yourself.
So if some object is updated to be soft-deleted, we propagate that to all rows of this last-accessed or
last-viewed or something table, right? In this case, we can have, again, a single
index scan, or an index-only scan. But it might be more difficult than that, right?
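(One possible sketch of that asynchronous propagation; all names are hypothetical, the queue could just as well be PgQ or something external, and the per-user table is assumed to carry a boolean deleted flag.)

```sql
-- When an object is soft-deleted, enqueue the fact instead of updating
-- every per-user row in the same transaction.
CREATE TABLE soft_delete_queue (
  object_id   bigint PRIMARY KEY,
  enqueued_at timestamptz NOT NULL DEFAULT now()
);

-- A background worker drains the queue in batches and propagates the flag
-- to the denormalized per-user table (user_feed_items with a "deleted" column).
WITH batch AS (
  DELETE FROM soft_delete_queue
  WHERE object_id IN (
    SELECT object_id
    FROM soft_delete_queue
    ORDER BY enqueued_at
    LIMIT 100
    FOR UPDATE SKIP LOCKED
  )
  RETURNING object_id
)
UPDATE user_feed_items f
SET deleted = true
FROM batch b
WHERE f.post_id = b.object_id;
```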
There is another way; it's not about denormalization, and this is how we solved it.
And not only we: I know GitLab also has a to-do list and they apply the same recipe,
which I think one day we will blog about. Forget about denormalization and just use a huge recursive CTE with some tricks,
some algorithm actually implementing this newsfeed pattern,
to solve it in a very interesting way.
Oh, is it like the skip scan recursive CTE?
It's not skip scan, it's like advanced skip scan.
So you know you need to return 100 rows.
You have 100 slots to fill,
and then you start working with channels,
filling these slots,
and each time you work with the next channel,
you think, okay, this channel has a fresh post,
for example, has a fresh object,
and this is the place where we replace some
object we keep in memory; we replace it and put it into the slot. And at some point you finish
and return. But it requires some effort to implement.
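(The full recursive CTE they mention is more involved than this; a simplified way to express the same bound, at most 100 candidates per subscribed channel and then a global top 100, is a lateral join over a hypothetical schema.)

```sql
-- Take at most the newest 100 posts from each subscribed channel,
-- then merge and keep the overall newest 100.
-- An index on posts (channel_id, created_at DESC) keeps each inner scan cheap.
SELECT c.*
FROM subscriptions s
CROSS JOIN LATERAL (
  SELECT id, channel_id, created_at
  FROM posts
  WHERE channel_id = s.channel_id
  ORDER BY created_at DESC
  LIMIT 100
) c
WHERE s.user_id = 12345
ORDER BY c.created_at DESC
LIMIT 100;
```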
Yeah, I can imagine. That sounds really cool. It reminds me a little bit of one of the latest features in pgvector. Have you seen this? In 0.8, I think, it added being able to continue a scan.
Like if you...
It didn't help us, we checked yesterday.
Oh, no.
But the idea is the same, right?
If you don't yet get enough results in your limit, go back and continue the scan.
It's good if you have a quite uniform database.
In our case, we are speaking about Postgres AI, for example: we have more than 1 million
documents, and more than 90% of them are mailing list entries, emails, right?
And when, for example, you ask for a rare source category in this huge set, this approach doesn't help.
It still encounters so many mailing list entries that it cannot find that source.
So basically, we are considering... I think we're currently already using partial indexes there.
And I'm seriously thinking we should move to separate tables, because we have only limited categories,
and in our use cases
we very often deal with only one source.
You could maybe partition on source.
Yeah.
Well, yeah.
Actually, it's also a good idea, maybe, yeah.
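(Rough sketches of both options; the table, column, and source values are hypothetical.)

```sql
-- Hypothetical documents table with a "source" category column.
CREATE TABLE documents (
  id         bigint PRIMARY KEY,
  source     text NOT NULL,
  created_at timestamptz NOT NULL,
  body       text
);

-- Option 1: a partial index per rare source, so queries for that source
-- don't wade through millions of mailing-list rows. (The same WHERE clause
-- could be attached to a vector index in the pgvector case discussed above.)
CREATE INDEX documents_blog_created_at
  ON documents (created_at DESC)
  WHERE source = 'blog';

-- Option 2, Michael's suggestion: LIST-partition by source, so each
-- category effectively lives in its own table.
CREATE TABLE documents_partitioned (
  id         bigint NOT NULL,
  source     text NOT NULL,
  created_at timestamptz NOT NULL,
  body       text,
  PRIMARY KEY (id, source)
) PARTITION BY LIST (source);

CREATE TABLE documents_mailing_list
  PARTITION OF documents_partitioned FOR VALUES IN ('mailing_list');
CREATE TABLE documents_blog
  PARTITION OF documents_partitioned FOR VALUES IN ('blog');
```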
So we deviated here
from the original denormalization topic.
And I think it's another story how to implement
a newsfeed better without denormalization, but with a huge recursive CTE.
My summary is that, first of all, denormalization should be applied only if you know normalization,
and it can bring new performance dangers if you don't think about concurrency and various
issues with Postgres MVCC, locking, the lock manager, for example, and so on. So it's an interesting
thing, still quite useful to consider in many cases, but it should be tested well, as usual.
I think it's one of those sharp tools,
right?
Like,
sometimes advanced craftsmen need sharp tools,
but be careful with a sharp one.
Yeah.
Yeah.
Something like this.
Agree.
Wonderful.
Thanks so much,
Nikolay.
Catch you next week.
Sounds good.
Have a good week.
You too.
Bye.
Bye.