Postgres FM - Data model trade-offs
Episode Date: November 3, 2023

Nikolay and Michael discuss data model trade-offs — some cases where you might want to (gasp) de-normalize, and which PostgreSQL internals are at play.

Here are some links to things they mentioned:

Our episode on JSON: https://postgres.fm/episodes/json
PostgreSQL limits: https://www.postgresql.org/docs/current/limits.html
Boyce–Codd normal form: https://en.wikipedia.org/wiki/Boyce–Codd_normal_form
Our episode on over-indexing: https://postgres.fm/episodes/over-indexing
Heap-Only Tuples (HOT) optimisation: https://www.postgresql.org/docs/current/storage-hot.html
The Surprising Impact of Medium-Size Texts on PostgreSQL Performance: https://hakibenita.com/sql-medium-text-performance
Query planner settings (collapse limits and Genetic Query Optimizer): https://www.postgresql.org/docs/current/runtime-config-query.html
WITH Queries (Common Table Expressions): https://www.postgresql.org/docs/current/queries-with.html
Our episode on benchmarking: https://postgres.fm/episodes/benchmarking
The tests and benchmarks Nikolay mentioned: https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello, hello. This is PostgresFM episode number 70. My name is Nikolay and together with me is
Michael. Michael, hello. Hello, Nikolay. What do you think we should discuss today?
Well, this week was your choice, right? I think you've picked a nice topic that was a listener
suggestion. So I think we've simplified it to data model trade-offs.
Is that a fair summary?
Right, right.
Purely relational.
Let's emphasize that we are going probably to keep JSON, JSONB columns as a separate topic.
We've had it already.
They specified it with respect to PostgreSQL internals, almost as a... Internals, okay.
So that's quite a nice way of looking at it.
It's what could come up as a result of Postgres-specific implementation details
that should make us consider our data model decisions.
Right, so Postgres allows a lot of columns to be present in a single table.
How many? 250?
A lot, yeah. I'm not actually sure.
I think it's more.
There is some limit.
Oh, okay.
Poor preparation.
Anyway, you can have hundreds of columns in a single table,
and you can have indexes on those columns.
Or you can go and implement Boyce–Codd normal form, right?
And so on. I mean, there are many directions to move, and of course everyone wants very good
performance, compact storage, and so on. Is it so? Maybe not everyone.
Well, I think most people care about performance on some level,
but I think it's easy to list three or four things
that almost everybody cares about more,
like durability, for example, or everything in ACID.
You know, the reason people pick Postgres
and pick old databases and trust old databases
more than some of the newer ones
is because they care about those things more than
performance, I would argue, anyway.
Well, right. But if the question is related to
internals, I guess we should think about performance, first
of all, and, and performance in terms of various aspects of it,
like read time, write time, various overhead we can have.
And so to me, if we simplify the equation a lot, it's a question of how wide our table should be,
how many columns we can afford. Should we normalize versus denormalize? Should we have a lot of columns in one table, or fewer columns and more joins when needed?
These are not exactly the same questions, actually, right?
Because, of course, denormalization done on purpose, intentionally, means that we extend the same
table, putting more columns there, although those columns should be somewhere else, for example.
But sometimes it's a very different thing. If we just follow the normal theoretical approach, normalizing our model, we can easily end up having a lot of columns
in a single table anyway, because if you think about some social media application
or a mobile application, if it's a big, growing project, we have some entities which are central to our model.
For example, user table.
Or for example, if you have kind of organizations or projects or something, these tables are
going to be central.
A lot of other tables will have a foreign key pointing to this central table.
And eventually this central table is going to grow in terms of width.
You will add more and more columns because you need to track something.
Sometimes just because of denormalization to avoid extra join
and maybe aggregates, like count, right?
Sometimes just because the complexity grows
and you just need to keep more properties.
And each property in a pure relational approach, it's just a single column.
New property, new column, right?
Attribute.
So, I mean, I think the obvious question is, why is that a problem?
Why is a wide table a problem?
Good question.
Well, obviously, there are strong trade-offs here.
On one hand, extending our table in terms of width is not a problem.
We know how to do it with zero downtime very nicely. Postgres 11 brought us defaults for existing rows without a table rewrite.
A "virtual default", I call it. It's not the official name, but this is actually what happens: instead of
rewriting the whole table when you add a column with a default, it's stored virtually for all old records, right?
And your ALTER is very fast.
So we know how to deploy changes with zero downtime,
with proper testing, all good.
Then we create indexes.
We create them concurrently, because obviously more columns
means more indexes, since some columns you want to be present
in indexes because you filter, order by them, and so on.
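A rough sketch of what that looks like in practice, with a hypothetical table and column:

```sql
-- Postgres 11+: adding a column with a non-volatile default does not rewrite
-- the table; the default is stored in the catalog for pre-existing rows.
ALTER TABLE users ADD COLUMN is_active boolean NOT NULL DEFAULT true;

-- Build the index without blocking writes (cannot run inside a transaction block).
CREATE INDEX CONCURRENTLY idx_users_is_active ON users (is_active);
```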
But the trade-off is: on one hand, it's good to have a lot of things in one table, because you can achieve a single index scan or single index-only scan for many queries, because if you need to join data
need to join data
from multiple
tables, you cannot
have single index scan.
You can't have a single scan
full stop.
Yeah, because
an index cannot be defined on multiple
tables, and the only
way to do it is to define
a materialized view and deal with all the issues it brings,
because it will get stale quickly. It will need
maintenance and proper use.
If you have multiple tables,
let's consider a simple example: a one-to-one relationship. Somehow you ended up having,
for example, a users table
and a users_extra_properties table.
And somehow you decided on that.
And this may not be a good idea, because you cannot, for example,
filter or order by, with just an index scan, using columns from both tables.
For example, you want to search by email, lowercased,
or using the case-insensitive citext data type.
And at the same time, you want to use some extra field in this search, or order by it.
And in this case, Postgres will need to choose one of its three join algorithms and apply, like, a nested loop or merge join. If it's email, probably it will be a nested
loop or something like that, right? So it won't look super cool, because super cool is, like,
index scan, index-only scan. And I'm not saying joins are bad, because joins in Postgres are
quite optimized, three algorithms. In many cases, they work
really well, but you
cannot beat the performance of index
or, more important,
index-only scan, right?
So to have that, you need to bring
those columns in the main table,
users, for example.
And in this case, you can define
a multi-column index,
and your query will be super fast.
You will have just a few buffer hits to find those one or a few rows,
and it will be below one millisecond for sure.
Right?
And that's great.
That's it.
End of story.
Let's have a lot of columns.
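For example, something along these lines, with hypothetical columns, keeps the lookup to one index:

```sql
-- Both the filter column and the extra attribute live in one multi-column index,
-- so the query below can be served by an index-only scan.
CREATE INDEX CONCURRENTLY idx_users_email_plan ON users (email, plan);

SELECT email, plan FROM users WHERE email = 'alice@example.com';
```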
I don't think that's it.
No, no, no.
Sometimes I say: okay, even if we performed normalization like we learned from theory, at least to third normal form, which is a very good situation already.
And then I say, okay, maybe we should go back and denormalize just to have index only scan and avoid joins.
But this is just one part of the story.
The other part of the story: there are many issues you will have
when you have a lot of columns and indexes.
And this relates a lot to a previous episode we had recently,
over-indexing, two weeks ago, right?
I think, is there an even more fundamental problem
with the design of Postgres?
We store data as rows, like in tuples in pages, right?
And because we're doing that, if we've got really wide rows, we reduce the number of tuples per page that we can store.
So we might get that index-only scan, but if we're frequently only
reading from a few, like a couple of those columns, and maybe they're really narrow
columns, then those being on a separate table from the wide ones that we're infrequently accessing
is an argument against the wide table for read performance. So even if we only focus on read
performance, I think there are some arguments to have splits like that.
Yeah, makes sense.
Right.
But again, you can compare for your particular case. You can conduct a few experiments and compare exactly with EXPLAIN (ANALYZE, BUFFERS).
You will see how many buffers, how much I/O, each approach gives you.
And I suspect, not in all cases, but in many cases,
for reads a wide single table will still be good.
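A minimal way to run that comparison yourself; the tables and the user_counters side table are hypothetical:

```sql
-- Wide-table design: one lookup, no join.
EXPLAIN (ANALYZE, BUFFERS)
SELECT email, post_count FROM users WHERE id = 42;

-- Split design: same data via a join to a narrow side table.
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.email, c.post_count
FROM users u
JOIN user_counters c ON c.user_id = u.id
WHERE u.id = 42;
```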
Even if it's toasted, right?
If we exceed roughly two kilobytes for a row, it will be toasted and so on.
It actually means an implicit join, right?
Because to find those chunks, it will need to perform extra lookups.
But buffer hits or reads will show this as well.
The problem, the second side of this coin, is that as we extend the number of columns we
have in the table,
we have two big issues coming.
They may not come in an acute form.
They can come slowly, increasing over time and accumulating.
So, two problems.
The first problem is how writes work.
I mean updates, actual updates,
how updates are working.
An update in Postgres means that the whole tuple is copied, right?
And this means that if we have a lot of columns,
which we, like you said, access not often,
during update, we need to copy them.
So this is the first part of the first
problem. The second part of the first problem is indexes. More columns usually means more indexes,
because again, to work with columns, you need additional indexes. And if you update a row, or if you insert a row (it doesn't affect deletes;
again, I've made this mistake a few times, deletes don't matter here), inserts and updates will
need to change all the indexes you have. You have many columns, you have many indexes. To update just one column,
or to insert a row, you need to update all these indexes.
There is a heap-only tuple update optimization, but
it doesn't work often. I mean, it doesn't work always. It requires
conditions to be met, and it's not always happening.
So if we have a lot of columns, we need to write more.
We need to update indexes often.
And this makes our writes very heavy, right?
And it gives a lot of work to autovacuum, for example.
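One way to see how much of this is happening on a given table (a standard statistics view; the table name is hypothetical):

```sql
-- n_tup_hot_upd vs n_tup_upd shows how often updates avoid touching indexes;
-- n_dead_tup hints at the cleanup work queued up for autovacuum.
SELECT relname, n_tup_upd, n_tup_hot_upd, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'users';
```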
Simple example.
If you decided to put in a counter to avoid
slow aggregates. For example, we talked about users: okay, how many comments has this user left
in various places of our system, or how many posts has this user created? I'm speaking in terms of some abstract social media application.
If you put, like, a post count as a column in the users table,
you will need to update it quite often.
And all columns will be copied always.
And all indexes will need to be updated
when this update is not hot.
Or if it's an insert, but okay,
here we talk only about updates. So
if an update is not HOT, all indexes will be updated. It's called index write amplification, one of the
reasons Uber decided to go back to MySQL. So what does it mean? It means that probably we need to think carefully.
If this is a central table specifically,
which will be used in many, many queries,
and also if it's a central table,
it also will be often used as just single index scan
just to find the current user, for example.
Who is working?
We need to find the current user.
Of course, you can cache it.
But still, you need to find this user, right?
And then you need to update it.
So probably it's wise to keep this counter in a separate table,
a one-to-one relationship, like user_extra_data or something.
And the count is there.
And it will be a narrow table:
only some numbers and a foreign key to the main table, and then also a unique
key to maintain this one-to-one relationship, maybe it's the primary key as
well. So you just update this counter, and maybe there are some other counters or
timestamps. Another example here is the "last
something at" columns. For example,
last_updated_at,
last_commented_at, last
created-some-post-at. Some
timestamp, it's usually 8 bytes,
timestamp with time zone
or without. Usually you
should prefer the one with time zone.
And it's like 8 bytes.
And keeping it separately and just having an extra join when needed,
maybe it's a wise approach because the main table remains untouched
if you need to update it often, right?
And the index write amplification is lower because there are only a few indexes there,
probably just one index, actually.
You probably don't need to index those extra columns like counters and timestamps.
Sometimes you do, of course, but in many cases you don't.
And in this case, it's just one or a few indexes to be updated.
Index write amplification is not high.
It's not a big problem anymore.
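A sketch of that kind of narrow side table, with hypothetical names:

```sql
-- One-to-one side table: hot counters and timestamps live here, so frequent
-- updates copy a tiny tuple and touch only this table's primary key index.
CREATE TABLE user_counters (
    user_id           bigint PRIMARY KEY REFERENCES users (id),
    post_count        integer NOT NULL DEFAULT 0,
    comment_count     integer NOT NULL DEFAULT 0,
    last_commented_at timestamptz
);

UPDATE user_counters SET post_count = post_count + 1 WHERE user_id = 42;
```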
And we avoid a lot of bloat on the main table, right? A lot of additional work on the main table, as well as the write amplification. We
also don't have to free up or reuse that space in the heap. Presumably
there's an impact on the write-ahead log, the amount of work we're doing in the write-ahead log, as well?
Well, yes, it is so, but we also have full-page writes, right?
A full-page write means we need to write the full page anyway the first time a page is changed,
but subsequent writes until the next checkpoint are smaller.
If you need to update a very narrow table,
tuples are small.
In this case, of course,
the amount of data written to the WAL becomes smaller.
So you're right.
It's another good reason.
So if we write less WAL data,
it means less pressure on backups and replication systems.
Right.
Exactly.
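On Postgres 13+ you can see that difference per statement; the query here is hypothetical:

```sql
-- The WAL option reports WAL records and bytes generated by the statement.
EXPLAIN (ANALYZE, BUFFERS, WAL)
UPDATE user_counters SET post_count = post_count + 1 WHERE user_id = 42;
```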
But again, you need to join.
So probably we end up with this approach.
If this is something like counter or timestamp or something
which is very often updated, but we don't need index on it,
it's good to keep it in a separate table.
And one other condition: if it's not, like,
highly correlated to something else, if we don't need statistics, is it called multivariate statistics,
if we don't need to define those.
I guess all the single-table features we
have, this applies to as well, right? Oh, and I thought of one other thing.
I remember reading a really good blog post by Haki Benita on this topic.
So we've got the frequently updated columns that we should consider or that it might make sense to move out into a separate table.
And we have these large or medium sized texts.
Now, for large ones, more than, I think you already mentioned, the two
kilobyte number, Postgres already handles those for us using TOAST. Awesome. But it kind of shows
that there is a point where it becomes beneficial to do this join, and we get some other features in
there as well. But there's this weird case, like below two kilobytes, where there's an argument that medium-sized texts, you'd consider moving those out as well.
If you've got comments or something that are like one kilobyte on average, having those in your pages means you're only going to have a maximum of, what would it be, eight tuples per page.
Yeah, exactly, tuples.
Well, I'm also switching back and forth,
tuples or tuples.
If it's pages of eight kilobytes and rows of two kilobytes, it's like four,
roughly four per page.
Yeah, yeah, yeah.
So it's an interesting point.
Another interesting point is vice versa.
If some column is, like, very rarely used, very rarely.
Yeah.
But it needs index.
Then also probably it should go to this extra table.
For example, token, activation token.
Interesting, yeah.
If you put the activation token in the users table, it becomes like a forever passenger, not doing useful work anymore at all.
But you keep having it in all updates. You copy, copy, copy it in all the row versions, in tuples.
And returning that data to the client.
And bloating relations. And yeah, if you do need it one time,
although it requires... Usually, a token requires index because you find the row,
like using token, you find who this token belongs to, right? The owner of this token. So you need
the index. But in this case, you need it just once per lifetime of user
or maybe when a user resets password or something.
It's so rare.
So keeping it in a separate table, even if it's with index, it's a good idea.
And, yeah, it compacts the storage and memory representation,
fewer bytes used in shared buffers, because you probably deal with it
once, then you forget about it and never read it into shared buffers anymore until
a password reset is required one more time.
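For instance, with hypothetical names, the token and its index can live apart from the hot users table:

```sql
-- Rarely used but indexed data kept out of the frequently updated table.
CREATE TABLE user_activation_tokens (
    user_id    bigint PRIMARY KEY REFERENCES users (id),
    token      uuid NOT NULL UNIQUE,   -- indexed, but looked up very rarely
    created_at timestamptz NOT NULL DEFAULT now()
);

-- The once-per-lifetime lookup: who owns this token?
SELECT user_id
FROM user_activation_tokens
WHERE token = '8f14e45f-ceea-467f-a9f3-7c1d2b3a4c5d'::uuid;
```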
That's a really great point.
I've got another question for you.
If you've got something really large, if you're, like, chucking some HTML into Postgres
for some reason that makes sense for your use case, and maybe it's potentially megabytes or something of data, now we
could rely on TOAST for this, but we could proactively put it in a separate table,
and it would still use TOAST, right? But to avoid, like, it's almost the developer experience: if
somebody's accidentally running
select stars, or isn't aware of the impact of fetching that data and does a select star, if
you're having to support those kinds of queries, not having it be possible to fetch, you know,
megabytes and megabytes of data from the server by mistake, when they probably don't need it, might be an interesting idea.
Definitely.
If you look, if it's some
project which was developed over many
years, if you look at these
central tables,
especially, like, users, like I said, or projects,
organizations,
and so on, they are used
a lot. And of course you can check
the TOAST level with various queries. It's
easy.
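One simple check, with a hypothetical table name: pg_table_size includes TOAST, while pg_relation_size is the main heap only, so a big gap means a lot of toasted data:

```sql
SELECT c.relname,
       pg_size_pretty(pg_table_size(c.oid))    AS table_size,  -- heap + TOAST + maps
       pg_size_pretty(pg_relation_size(c.oid)) AS heap_only
FROM pg_class c
WHERE c.relname = 'users';
```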
And if they have started to be toasted, of course you select all columns, including those which you don't
actually need. Usually you need this table often, but you need only a few columns, to check that this
user is still active, right? That this user is not banned,
for example, or that the project is still active, is visible. It's just a few of the most important columns
which are required often. Others can be offloaded. But again, there is a trade-off here:
if they are offloaded, but with indexes, you need to join.
And the Postgres planner, choosing a join method and indexes,
will sometimes be less performant than just a single index scan, right?
So it's an art.
When you build something, you need to predict future,
and this comes with some experience.
You need to think, okay, these columns, will we need them often?
Will we need indexes on them? And based on the answers here, we can decide to offload them to
this extra table, right? So the second problem is over-indexing. We discussed it. If you have a lot
of columns, you have many indexes, you quite often have single index scan or index-only scan.
This is the perfect plan you always want to see
because it's usually either sub-millisecond or few milliseconds only,
depending on the number of rows and hardware.
So if you have this, you start having a lot of indexes,
and not only is it an index write
amplification problem, it's also the problem we discussed: exceeding 16, 15
indexes actually. Even if your query deals only with a single table, if it has
many indexes and you don't use prepared statements, planning happens each time, this planning will be slower, and you exceed this number of locks: 16, the
table plus 15 indexes is already 16. The next index, the 17th relation, will be locked with fastpath false, so without
the fast path in the lock manager, and it will be much slower than with the fast path. It means that if it's a central
table, and the QPS for queries involving this table, especially simple queries, they can be very
frequent queries, high QPS, in this case contention for simple queries might hit performance a lot. So you will see LWLock: LockManager
in wait events in pg_stat_activity,
or, like, on RDS Performance Insights,
Cloud SQL has it as well,
Timescale, I think, has it as well.
Wait event analysis will show you waiting
on the lock manager lightweight lock.
And this will happen with very simple queries, primary key lookups or some index scans, finding
single row, just because you have a lot of indexes and queries are very frequent.
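A couple of standard views make this visible; these queries are just a sketch:

```sql
-- Relation-level locks held by each backend; fastpath = false is the slow path
-- through the shared lock manager.
SELECT pid, relation::regclass AS rel, mode, fastpath
FROM pg_locks
WHERE locktype = 'relation';

-- Wait events for active sessions: look for LWLock / LockManager
-- (called lock_manager before Postgres 13).
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2
ORDER BY count(*) DESC;
```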
So offloading might help again.
If you think, okay, this table is going to be central in my system, and these columns I do need, most of them will be indexed,
but I won't exceed this magic 16 number.
I mean, for indexes, it's 15.
And in this case, I will be able to have thousands,
maybe tens of thousands of queries per second to this table,
because I need it, right?
Not everyone needs such performance, of course, right?
But designing for such performance means you need to avoid
having more than 15 indexes per table.
This should be a rule for tables queried at high frequency.
Or you can just use prepared statements.
It won't eliminate this problem completely, but it decreases it,
because you eliminate planning time for primary key lookups.
Planning time might be even bigger than execution time,
so you shave off a lot of milliseconds, microseconds, from this duration. And contention can still happen, but at higher QPS.
So you postpone this problem.
If you do need to have more than 15 indexes.
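A minimal sketch of that, with a hypothetical query; once a generic plan is cached, repeated executions skip planning and the planning-time locks on every index:

```sql
PREPARE get_user (bigint) AS
    SELECT id, email FROM users WHERE id = $1;

-- After several executions Postgres may switch to a cached generic plan.
EXECUTE get_user(42);
```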
But having fewer indexes is good, because the lock manager won't have contention if you deal only with a single table.
We can refer people to the over-indexing
episode. Right. I've got a theoretical question around that, though, not something I've
ever seen become an issue as a result of this, but if we follow this pattern in a few places in our
database, and we've split things that are logically one table into multiple tables, we're going to have more joins. And there is a,
there's a couple of limits.
Is it from_collapse_limit and join_collapse_limit?
And I think,
is it eight?
Yeah.
Okay.
I knew it was.
Yeah.
And that we could quite quickly reach eight relations.
So then,
yeah.
So you might,
if you're doing this,
you might also want to just check that any of your complex queries
that involve a lot of joins don't flip plans
or you're not getting some strange, I think it switches to,
is it called the genetic query optimizer, something like that?
Genetic query optimizer, yeah, right.
I'm checking, 12.
I always keep forgetting. The genetic query optimizer threshold is 12.
So if you reach 12 relations, by the way, here we mean tables, not indexes,
because previously we counted both tables and indexes when we talked about locking.
Here we talk about tables. And yes, if you join 12 or more tables,
it means that genetic query optimization starts happening.
It's hard to understand this thing, how it works, right?
And maybe you are good, maybe you are not.
So I, in general, I, of course,
aim to have good performance
and joins are fine,
but usually I try to avoid
joining too many
tables. And how to do it?
Denormalization. In this case, yes, we need
wide tables. It's worth saying that
what I just described,
it's about very high
TPS, high QPS.
So if you have hundreds of queries per second, good hardware,
you probably will never see this lock manager contention.
But in this case, of course,
denormalization is one way to reduce the number of relations
participating in query.
Another way, of course, sometimes we do it:
sometimes we just, okay, we read something,
and then we read something else in a different query.
It's also possible.
But, yeah.
Or do you know, like, if we use with materialized keyword,
meaning we intentionally put the optimization fence there,
it means that these steps will be performed,
like, this threshold is considered separately, right?
So if you, I mean, if first step was six tables,
second step is another six tables,
plus previous results, of course.
I don't know what happens with join_collapse_limit there. I think it's counted separately. These subqueries, even within a single statement, these subqueries,
physically, they're separate things, right? Because we have a materialization step, right? And we just, in memory, we have
temporary
tables, so to speak. It's not a table,
but just some data, right?
It's used like
a table.
We can join with it, right?
In this case,
splitting to multiple steps
using CTE might help as well.
I guess.
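A sketch with hypothetical tables, for Postgres 12+ where MATERIALIZED is explicit: the CTE is computed first and acts as an optimization fence, so the outer join search space stays smaller:

```sql
WITH recent_orders AS MATERIALIZED (
    SELECT o.user_id, sum(o.total) AS recent_total
    FROM orders o
    WHERE o.created_at > now() - interval '7 days'
    GROUP BY o.user_id
)
SELECT u.id, u.email, r.recent_total
FROM users u
JOIN recent_orders r ON r.user_id = u.id;
```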
Actually, I see some overload in terms of the number of tables used
in queries quite often.
But usually we find some ways to simplify them.
Sometimes a lot of tables participate almost for nothing there.
It was just convenience.
Let's join this, let's join that.
But then you ask, why?
If you want good performance, maybe you should just
make it lighter, not so heavy.
In this case,
you just reduce
the number and it quickly
becomes below 12.
But
I don't know. Maybe genetic optimizer
is a good thing.
I think only a few people in the world know exactly whether it's good or not.
It's like some mystery, right?
Yeah, well, one thing to bear in mind is that unlike previous topics we've had,
this is a configurable limit.
You can change this number with the caveat that you keep an eye on
planning time, basically the reason these limits are in place. And you can do it dynamically.
You can change it for one query and change it back before another query, same with the join and from collapse
limits. But it's not the same with the threshold I described, the fast-path-locking-related one. 16 is a
hard-coded constant.
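The settings are real; the values below are just illustrative:

```sql
-- Defaults: join_collapse_limit and from_collapse_limit are 8, geqo_threshold is 12.
SHOW join_collapse_limit;
SHOW geqo_threshold;

BEGIN;
SET LOCAL join_collapse_limit = 16;  -- applies only to this transaction
SET LOCAL from_collapse_limit = 16;
-- run the many-join query here
COMMIT;
```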
Is there anything else you want to cover? I've got one more thing on the denormalization
front.
I wanted to mention that what I described, we have recent benchmarks. Thanks to my co-workers,
especially Vitaly and Dmitry,
they did good benchmarks.
And let's put links in the
description. Very detailed.
Everyone can repeat them
as usual. So these benchmarks
show the effect of having
extra indexes, for example.
And yeah, actually, until recently,
I never thought that having extra indexes
affects planning time significantly,
like 5%, 10% of overhead in terms of time.
And I also never thought it's related to this topic,
actually, we discussed today.
More columns means more indexes, and with more indexes,
we reach this threshold of 16 faster. So it's interesting.
I must admit, it's new to me. This is what I say. Cool. Yeah.
On the last question I had: the case I see most commonly for denormalization is around, for example, a
software-as-a-service company wanting to partition a large table by account ID, or your equivalent.
You need to have that on the table in order to be able to do so, right? So, like,
it could be not only for read performance,
but also in preparation for something like partitioning.
Right. Partition key propagation, right? We need to propagate this partitioning approach to
dependent tables. This is quite obvious, I think, right? And again, like if we partition, we need to make sure all our queries,
or at least most of them, don't deal with more than one or just a couple of partitions.
Otherwise, we can have another amplification.
It's not index write amplification, but planning-time, locking-related amplification. Because if you have
partitioning, and your query deals with multiple partitions, and each partition has,
for example, five indexes, then if you deal with four partitions you already have 20
locks, plus the tables, so 24, right, locks to be acquired. And it definitely goes above
this threshold of 16, so the fast path won't be there for some locks,
and the locking will be slow.
So you do need prepared statements at least.
Or you need to just make sure constraint exclusion is working
and your plans target just one or a couple of partitions always.
A better one, right?
Of course.
Awesome.
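To make the partitioning point concrete, a sketch with hypothetical tables; the partition key has to be a column of the table (and part of its primary key), which is why account_id gets denormalized onto the rows:

```sql
CREATE TABLE events (
    id         bigint NOT NULL,
    account_id bigint NOT NULL,
    payload    text,
    PRIMARY KEY (account_id, id)
) PARTITION BY HASH (account_id);

CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- With the partition key in the WHERE clause, the planner prunes to one partition.
SELECT * FROM events WHERE account_id = 42 AND id = 1001;
```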
And if it joins also.
Of course, if you join often,
main table has many indexes,
extra table has many indexes.
When you just select from it,
the planner needs to lock them all.
I mean, at planning time,
we need to lock all of them, actually,
with the lowest possible heavyweight lock,
access share lock,
just to prevent DDL from happening while we are reading.
But all indexes, even those which are not needed,
will be locked. This is the key. We discussed it.
So if you split to two
tables, but need
to join often, and this
join query needs to be
very frequent, it's not worth it
to split.
We've actually added a relation.
Yeah.
We added a relation, plus one, okay.
And we also need to perform join now.
So it's worse.
So if you offload something, you need to be very careful.
You need to, like, offload columns which are rarely used,
or columns which need to be very frequently updated,
columns which shouldn't be indexed, or are indexed but rarely used.
I guess there are
some maybe
design patterns here
of developing
what we discussed
as a kind of raw material.
We need to polish it, I guess,
before saying this is...
And the whole
point of this episode
was it's a trade-off, right?
Like there are going to be cases
where one of these makes sense
and there are going to be cases
where the opposite makes sense.
Yeah, it's better if it's a serious system
that will need to handle a lot of queries per second.
It's worth checking, prototyping,
filling with like millions of rows, hundreds of millions of rows, and just testing, testing, testing various scenarios.
With concurrent selects, it's really important, right? That was your tip last time.
Well, it depends. You might have only single-session experiments and already understand a lot: if you use BUFFERS, you see the I/O,
that's it. But if you need to check the behavior of the buffer pool or the lock manager, in this case,
of course, you need full-fledged benchmarking, multi-session experiments, and you need to be
alone on that machine. But we also had an episode about experimenting and benchmarking.
So here I see value in both approaches, just different value.
Wonderful.
Was there anything else you wanted to make sure we talked about?
I guess that's it.
It's not all, far from being all, far from being polished.
But I encourage everyone to think about their particular case with experiments. I don't call them benchmarks because
EXPLAIN, ANALYZE, BUFFERS, it's also like a performance-related experiment. You can
check plans, but it doesn't feel like a benchmark, right? But just from this
single thing you can do many good optimizations. So experimenting is the key to understanding your particular case.
I hope we highlighted the directions of analysis, right? This is probably the main value of this
episode. Okay, if you liked it, please like, subscribe, and share.
I'm talking to our listeners.
I thought you were talking to me.
Thanks so much, Nikolai.
I've noticed when you upload to YouTube
you don't hit like. I usually fix it myself
and like our video because
of it.
Thank you so much.
Take care. Bye.