Postgres FM - Schema design checklist
Episode Date: April 17, 2026

Nik and Michael discuss a list of things to check when designing new schema in Postgres. Here are some links to things they mentioned:

Use BIGINT in Postgres (blog post by Ryan Lambert) https://blog.rustprooflabs.com/2021/06/postgres-bigint-by-default
Postgres 18 and UUIDv7 (blog post by Gwen Shapira) https://www.thenile.dev/blog/uuidv7
How to use UUID (how-to guide by Nik) https://postgres.ai/docs/postgres-howtos/schema-design/data-types/how-to-use-uuid
Our episode on constraints https://postgres.fm/episodes/constraints
Our episode on NULLs https://postgres.fm/episodes/nulls-the-good-the-bad-the-ugly-and-the-unknown
MultiXact member space exhaustion episode (with Metronome) https://postgres.fm/episodes/multixact-member-space-exhaustion
Our Column Tetris episode https://postgres.fm/episodes/column-tetris
Saving Space Basically for Free (blog post by James Coleman from Braintree) https://medium.com/paypal-tech/postgresql-at-scale-saving-space-basically-for-free-d94483d9ed9a
Over-indexing episode https://postgres.fm/episodes/over-indexing
Under-indexing episode https://postgres.fm/episodes/under-indexing
Gadget's use of Postgres https://postgres.fm/episodes/gadgets-use-of-postgres
Partitioning episode https://postgres.fm/episodes/partitioning
RLS vs performance episode https://postgres.fm/episodes/rls-vs-performance

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello, hello.
This is Postgres FM.
My name is Nik, Postgres.ai.
And as usual with me, Michael, pgMustard.
Hi, Michael.
Hi, Nik.
So the topic, I'm doing great.
How are you?
I'm good, thank you.
What's the topic?
What's the topic?
The topic I chose, and everyone already saw it on the cover you created.
So the topic is five or so, maybe six, things you should check when you start a new schema, or when you already have some schema and your project is growing and you want to be better prepared for bigger scale, performance-wise.
Especially if you use AI to design and improve your schema and your project is growing, right?
So what should you look at to check your schema health?
Makes sense. Just on the topic in general, I do think there are some things that are better suited to asking AI for help and some that aren't. Do you think schema design is one of them, or is it actually one of those where maybe we should still do it a little bit more ourselves at the moment?
Well, as you know, I'm a very, very, very intensive user of AI.
Yeah. Honestly, I do everything through AI right now. And I'm 100% sure schema design is a topic where it can help a lot. But it also matters which questions you ask. If you just say "I want this" and don't pay attention, you will still end up with a low-quality schema in many cases. That's why, as I said, let's have this episode, because it might be quite basic for many advanced users of Postgres. But this checklist we are going to present today can be a good thing to keep in mind whenever you design a schema, just to ask the right questions and revisit.
Also, I'm not sure how exactly you use AI.
I use it the way I work with people, with developers.
When you just ask something and think they, those engineers, humans or AI, will solve all your problems and make all the decisions, you're offloading full responsibility onto their shoulders.
Then you always get big negative surprises.
So it's better to keep requests as detailed as possible when you work with AI, and also make some iterations to review and improve.
And when making iterations, first you say, I want some schema, like, I don't know, a LinkedIn-style social media schema or something, right? It will immediately give you something, but then you want some details to be checked. And this is exactly where this checklist can be very helpful. So you can revisit and think and imagine what will happen when you have gigabytes and terabytes of data. To avoid painful refactoring in the future, and maybe outages or performance degradation, the price you pay in the beginning is much lower than the price you pay later. Right. So I never work with AI in one shot. I always make iterations to improve on any question and to revisit, and sometimes I combine multiple LLMs.
Right, so this is exactly where you can use this list of questions. So first,
the first topic is the data type of the primary key.
Yeah. And I feel like the answer to this has changed, or my opinion on what the best default is has changed, actually, in the last few years, just off the back of learning more about timestamp-ordered UUIDs, right? This used to be an age-old debate, right: UUID versus integer or big integer, really.
And most of the blog posts I would see were trying to encourage people to use bigint over int.
But now that we have first-class support for time-ordered UUIDs, I don't see many reasons not to go with that.
What do you think?
Well, the size. Eight is doubled, right?
Yeah.
Yeah, a UUID is 16 bytes.
So it's quite a big price to pay: in each row you need to have eight more bytes.
In many cases, it's fine.
Well, indexes will also be bigger.
Indexes double, yeah.
It might make sense if volumes are huge.
But anyway, to keep things simple, the rule to remember, the red flag: if you are preparing for really large volumes of data, it must not be integer 4 (int4).
Yeah.
It should be integer 8 (int8). And in terms of the UUID path, the data type is always uuid, for both version 4 and version 7, but you can check the default, how it's generated. And in modern Postgres there is a function for it, I think in 18 only, in the latest version.
I think in 18, only in the latest version. I think so. It didn't quite make 17, did it?
So if you're on Postgres 18, it's better to use UUID version 7, but also remember the trade-offs, because there is a timestamp in there and values will be ordered, like with integer or big integer. So there is a nuance here, but performance-wise it's much better than UUID version 4; there are many articles about it. And you just need to check: okay, are we going to go with integer? Then it's int8, bigint, right. Are we going to go with UUID? Then the default generation should be version 7 on Postgres 18. For older versions, that's also not an excuse to stick with version 4.
You can write your own function, or ask AI to write it for you.
Or generate them outside of the database.
I like this option slightly less, because if you have multiple application nodes, who knows what will happen with their clocks, and since UUID version 7 includes a timestamp inside it, the order might be off. I don't know, I prefer to leave it on the database's shoulders, to have it on the primary, a single source of truth.
On the other side, UUID by design is generated to serve the purpose of working in distributed systems, right?
So maybe it's not a big deal to generate on the client's side.
All I mean is, if you're on an old Postgres, it saves you creating your own function or whatever.
Postgres waited a couple of years to bring UUID version 7 in officially.
Although, again, you can create even a SQL function, not even a PL/pgSQL function, a simple SQL function. And if you ask your AI to search for it, it will probably find my how-to page on this, or other people's pages.
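One widely shared approach along those lines (a sketch, not necessarily the exact function from that how-to) overlays a millisecond timestamp onto a random UUID and flips the version bits to 7:

    create or replace function uuid_generate_v7() returns uuid
    language sql
    volatile
    as $$
      -- start from a random v4 UUID (it already has the right variant bits),
      -- overlay a 48-bit millisecond timestamp onto the first 6 bytes,
      -- then set the version nibble from 4 (0100) to 7 (0111)
      select encode(
        set_bit(
          set_bit(
            overlay(uuid_send(gen_random_uuid())
                    placing substring(int8send(floor(extract(epoch from clock_timestamp()) * 1000)::bigint) from 3)
                    from 1 for 6),
            52, 1),
          53, 1),
        'hex')::uuid;
    $$;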
But in other libraries, from Google's to Node.js libraries, everything, they have had it for a couple of years already, or longer.
They implemented it before the standard was finalized.
Yeah.
So, huge red flags to look out for are if it's using the integer data type, which is the same as int4, or serial.
I think AI sometimes uses that; it will define it as serial.
So if you just see that word without the "big", bigserial, you might hit that limit at 2 billion, or whatever the exact upper limit is.
Yeah.
And avoid UUID version 4 if you want good performance on large volumes of data.
Yeah.
That's it, actually.
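To make those red flags concrete, a minimal sketch of the two safer defaults discussed here (uuidv7() assumes Postgres 18+; table and column names are just illustrative):

    -- Option 1: plain surrogate key; the red flag is int4/serial, so use bigint:
    CREATE TABLE users (
        id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        email text NOT NULL
    );

    -- Option 2: UUID key; on Postgres 18+, the built-in uuidv7() produces
    -- time-ordered values, which behave much better in b-tree indexes than v4:
    CREATE TABLE sessions (
        id      uuid   DEFAULT uuidv7() PRIMARY KEY,
        user_id bigint NOT NULL REFERENCES users (id)
    );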
Let's keep it simple, because we could discuss more advanced situations.
I think we should do a whole episode someday on migrations, like for people getting close to that two billion number. That would be a good topic.
I can share, I had a lot of experience with different approaches.
Yeah, let's do that. Good.
So, how to redefine a primary key. I actually have an article about it, covering multiple paths, so we could talk about this. Number two, the topic is constraints.
It's other constraints, yeah. Other constraints,
beyond the primary key.
So I would just ask, as the first thing, are we using constraints well? To myself, or to AI, and so on.
Because Postgres has six types of constraints and quite rich tooling, and it's much more convenient to have them early, and not only because adding them later is a big pain. Right now we have quite good support for two-step constraint definitions. I think in Postgres 19 it will also be possible for NOT NULL constraints, which is interesting. But anyway, check constraints, foreign keys, unique constraints: it's possible to create them first in an almost-finished state and then finalize them, to avoid long-lasting locks. But this is not the reason why I prefer to think about it at the very beginning.
The main reason is the quality of data.
This is why constraints exist in the first place, right?
So if you introduce constraints early in your project, data quality is good.
Introducing them later.
And database side as well.
I mean, it just...
Oh, yeah.
Yeah.
I don't trust constraints outside the database.
With age, I became less radical, my points of view became less radical.
So I just observe, without fighting with backend engineers who try to implement constraints on the application side.
I just have a note: they are not reliable, because who knows which other layers or types of applications there are. Especially these days with AI, it's so easy to add something else implemented in a different framework or even language.
It's super easy right now. And you will need to re-implement those constraints.
Or you will just forget and miss them, and data quality will suffer.
Right?
Yeah.
So specifically, something I think I'd like to see more of, or at least I see a lot of people not doing, is: columns that you're not expecting to put any null values in, or you don't want to have any null values in, actually explicitly making them NOT NULL.
If you don't specify, then columns are nullable.
And I think most people, or at least most AIs I see, are creating columns without it; by default, they don't put NOT NULL on every column.
Yeah, there is an opinion that NOT NULL should be the default.
Anyway, it's hard to change the default now, but inside your project you can say, okay, unless it's explicitly specified otherwise, let's make columns NOT NULL.
Exactly, yeah.
And avoid the troubles with nulls we talked about in a special episode about NULLs, right?
Yeah, well, and I guess in this specific case, with some constraints, mostly we care about this for quality reasons.
But there is also a tiny amount of performance optimization that comes out of them.
I guess a bit more for unique constraints: if the database knows for sure this is unique, it can do some optimizations.
So yeah, anyway, good point, I like that about NOT NULLs.
Yeah, yeah, unique is interesting stuff.
And also, I don't think foreign keys are underused in the Postgres ecosystem anymore.
The days when everyone suggested that foreign keys are super bad, for me, that's 20 years ago.
I remember those topics: let's avoid foreign keys completely.
Although we know that foreign keys can bite badly. We had an episode with the Metronome company about issues with MultiXacts, MultiXact-related issues, which can be caused when you use foreign keys a lot.
But anyway, I still think foreign keys are a good thing to have, for data quality.
But I think what is usually underused is check constraints.
They are great.
They could be used much more often.
For example, a classic example: when I see an enum data type in a schema, I think, okay, later, if we want to change this list of possible values, what will our migration look like? Especially removing values, right?
Yeah.
So check constraints are slightly more flexible in this sense. And, well, removing values is a topic of its own, right? But a check constraint you can introduce in two stages again: you can drop the old one, and you can first create a new one in the NOT VALID state and then validate it in a separate transaction. I just see it as quite a useful, quite convenient tool for me, and you see it in the schema right away: this column should be, for example, in a range, or it should be from this list, or something.
So I wish it was used more often.
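For instance, a minimal sketch of a check constraint used instead of an enum, plus the two-step NOT VALID / VALIDATE pattern just described (names and values are illustrative):

    -- A check constraint is visible in the schema and easier to evolve than an enum:
    CREATE TABLE orders (
        id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        status text   NOT NULL,
        CONSTRAINT orders_status_check
            CHECK (status IN ('pending', 'paid', 'shipped', 'cancelled'))
    );

    -- Changing the allowed list later, in two steps to avoid a long-lasting lock
    -- while existing rows are checked:
    ALTER TABLE orders DROP CONSTRAINT orders_status_check;
    ALTER TABLE orders ADD CONSTRAINT orders_status_check
        CHECK (status IN ('pending', 'paid', 'shipped', 'cancelled', 'refunded'))
        NOT VALID;
    -- ...and in a separate transaction:
    ALTER TABLE orders VALIDATE CONSTRAINT orders_status_check;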
Yeah. Going back to foreign keys quickly, do you see a lot of the LLMs doing things like ON UPDATE CASCADE or ON DELETE SET NULL or things like that? Do you see any cascade stuff popping in there by default, or not when it should be, or when you think it should be?
It's a good question. I actually don't remember having issues with this. I know deferred constraints can be an issue. I remember a case with pg_repack, at a company called Miro, maybe you know it, right? They had issues with pg_repack and deferred constraints. There is an old article from them about this.
But I don't remember particular issues here. And I think this actually leads us to the question: for our particular project, what's better? And if I had any doubts, with AI I would just run some experiments based on the product we are developing. I would say, okay, for example, in one year, how many entities do we expect in this entity table, and in all the other tables? Let's fill them with fake data, right, run VACUUM ANALYZE to have a good state, and then just explore plans, with pgMustard, for example, right? Let's think about what kind of workload we might have, according to the user stories in our specification of the product and so on, and then collect plans and see, right? And then also include modifying queries, collect those plans, and think about how it will work. It's so easy these days to make these experiments before we finalize all the decisions, and then you will see it in action, and it's great. So collect all the plans, find weak spots where we have suboptimal plans, and see how we can deal with them.
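A minimal sketch of that kind of experiment (tables, row counts, and the cascading foreign key are made up purely for illustration):

    -- Hypothetical parent/child tables for the experiment.
    CREATE TABLE accounts (
        id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name text NOT NULL
    );
    CREATE TABLE items (
        id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        account_id bigint NOT NULL REFERENCES accounts (id) ON DELETE CASCADE,
        payload    text
    );

    -- Fill with fake data: 10k accounts, 1M items spread across them.
    INSERT INTO accounts (name)
    SELECT 'account_' || g FROM generate_series(1, 10000) AS g;

    INSERT INTO items (account_id, payload)
    SELECT (random() * 9999)::int + 1, repeat('x', 100)
    FROM generate_series(1, 1000000);

    VACUUM ANALYZE accounts, items;

    -- Now check what "delete one account" really costs with the cascade in place:
    BEGIN;
    EXPLAIN (ANALYZE, BUFFERS) DELETE FROM accounts WHERE id = 1;
    ROLLBACK;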
If, for example, we have one entity and another, one-to-many, and the "many" side is really many rows, right? And then we want delete to be supported in a UI, where people, as we know, expect 200 milliseconds, or less than one second, let's say, right? And then this delete of one row triggers, propagating according to the foreign key, the deletion of one million rows. It cannot meet our requirements, right?
Now we need to think about asynchronous deletion, maybe using some event queue system or something, right?
Yeah, yeah.
And this will pop up if you prepare a good experiment here.
Right.
Without experiment, it's hard to say.
You need to answer questions: how big are the tables, and what this relationship between tables on foreign keys will look like in the worst case, for example.
Can a user create, like, a million entities, and then we need to delete this user, for example?
This is a great question and with AI you can iterate so fast these days.
So great.
Collect plans and then think and decide.
Okay.
It's a good question, but I don't see it as a simple question.
No, no, no.
It was more, I just asked in case you'd seen there was a chance that LLMs were throwing
it in on every foreign key and that was maybe a bad, bad default.
But it sounds like they're not.
I may be the wrong person to ask about this, because I only design sometimes these days.
I have a few applications designed from scratch with AI, but I mostly deal with things which other people already created.
And I just see how AI can help.
So I'm not sure whether this propagation of deletion, for example, will be supported, or how exactly it will be implemented.
But it's definitely worth thinking about in the context of latencies, and what should be offloaded, and what tools we will be using to offload this to background jobs, for example.
Yeah.
Anyway.
Yeah, it's just a thing where, instead of thinking by looking at code, I would think by looking at dynamic results, at experiments.
Yeah, and I'm thinking, if we're talking specifically about schema design when you expect to hit scale, I might default to not using the ON ... CASCADE actions, just by default, because at scale they can become problematic.
So if you had to come up with a best practice for designing at scale, maybe lean in that direction.
That's interesting.
That's interesting, because this was my position for a long time.
And it's not about AI building or something, this was just my position: it's dangerous, because who knows how many rows we will need to delete.
It's about the delete, right? Unlimited delete, this is how it feels.
Let's design without it.
But then in my consulting practice I see cases where really large projects have it.
And it's not just one project. Many projects I see.
And they keep it and somehow live with it.
At some point they implement non-cascaded, asynchronous deletion for some cases,
but by default they still rely on cascaded deletes somehow, and they are fine.
This is interesting.
So reality shows it's not that bad.
But I agree with your way of thinking, and this is my default way of thinking as well.
Postgres can survive very terrible things these days, much better than, like, 20 years ago.
And I know.
The only time I see that being a horrendous issue, even on small to medium-sized projects, is if people haven't indexed the second part of the foreign key.
Exactly, the referencing column.
There are two ends of a foreign key.
One is always indexed, because it's the primary key.
The other one, it's your job to index it.
Not to forget.
Yeah, yeah, yeah.
We have it in our checkup, actually.
We have a special report just for this case.
Perfect.
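For example (continuing the illustrative accounts/items sketch from earlier), the referencing column gets no index automatically:

    -- The referenced side (accounts.id) is covered by its primary key;
    -- the referencing side needs its own index, created explicitly:
    CREATE INDEX CONCURRENTLY ON items (account_id);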
Good. These are all good questions, but I feel we could spend a long time in the constraints area.
I think we had a constraints episode.
We did have an episode, yeah.
Let's look that up and move on.
Yeah, let's move on.
An entertaining topic: column tetris. Sometimes it can bring quite interesting and good savings.
In many cases it won't, but I just see people enjoy this topic, and I decided to include it.
So the idea is, when you design a new schema, the price to pay to reorder columns is very low.
You can reorder them. Easy.
And the problem is that in Postgres, physically, the column order matters.
Due to alignment padding, if an integer 2, for example, is followed by an integer 8, you will see a gap of 6 bytes of zeros in every row.
And this is bad, right?
It's a waste of storage and also memory: it's not only disk, it will pollute memory with those zeros as well.
And sometimes you can be so unlucky.
I saw cases where, like, 30% of a row was just zeros.
And Postgres doesn't have automatic reordering of columns.
It could, actually, but it doesn't have it.
So the order you define in, let's say you do create table,
the order you put those columns in,
that's the order they end up on disk.
And every new column you add goes at the end.
That makes sense, right?
But even if it could be slotted in, let's say you add a new Boolean type and it could,
I don't actually know if that would work, but it always goes on the end.
A boolean is one byte.
Yeah.
So in a boolean, seven bits are wasted already.
If you zoom in, like, under a microscope.
Unless you put boolean, boolean, boolean, boolean next to each other, right?
No, no, no, no.
A boolean is one bit.
Oh, a bit.
Yeah, good point.
Good point.
Sorry.
But it's stored as one byte.
Seven bits already wasted.
But if you have a boolean and then a timestamp, for example.
For example, the first column, as we all agree, should be the ID.
The primary key.
I'm joking, it's not always, but it's very popular: ID is the first column.
Integer 8 occupies a whole 8-byte word.
And then we have a boolean and then a timestamp.
Not only have you wasted 7 bits inside the boolean, which is already by design, since Postgres doesn't have anything smaller than one byte, but then you waste seven more bytes. So seven bits first, and then seven bytes because of this alignment padding. And your boolean, which by meaning is only one bit, suddenly occupies eight bytes. It's a huge waste of resources, right?
Yeah.
But the bigger picture here is that people usually realize that the waste is significant only when it's a huge table, like 100 million rows, right?
And then, oh, so much space wasted because we didn't play this column tetris initially.
But this is like a chicken-or-egg problem, maybe inverted.
Anyway, you realize it later, but you could have done it earlier; you didn't think about it because when you have only 1,000 rows, for example, it costs you nothing.
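A minimal sketch of the padding effect being described (per-row sizes are approximate, names are illustrative, and you can compare row sizes with pg_column_size on inserted rows):

    -- boolean (1 byte) followed by an 8-byte-aligned column wastes 7 bytes each time:
    CREATE TABLE wasteful (
        id         bigint,       -- 8 bytes
        is_active  boolean,      -- 1 byte + 7 bytes padding
        created_at timestamptz,  -- 8 bytes, must start on an 8-byte boundary
        is_admin   boolean,      -- 1 byte + 7 bytes padding
        updated_at timestamptz   -- 8 bytes
    );                           -- ~40 bytes of column data per row

    -- "Column tetris": 8-byte columns first, small ones at the end:
    CREATE TABLE compact (
        id         bigint,
        created_at timestamptz,
        updated_at timestamptz,
        is_active  boolean,
        is_admin   boolean
    );                           -- ~32 bytes of column data per row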
But again, with AI, it's so easy.
You just say: let's play column tetris in Postgres.
AI should know it.
I haven't tried, but I'm pretty sure it should know it, because there are many articles about it.
Oh, and we did an episode.
Yeah, starting from the Stack Overflow answer where this topic was very well described.
So AI should know this topic.
When you create a new table and you think many rows will be stored, just say: let's apply column tetris to this. Boom. It will reorder columns.
I never tried, again.
Yeah. What do you think?
I feel like, I don't know if I'd still want to do it myself, just because I feel like there's also a natural order for columns. So I would want to put all the eight-byte ones together, and the 16-byte ones, and make sure they're all with similar ones, and then all of the...
Yeah, exactly, with similar ones.
But equally, imagine it's a users table: I could put certain information about that user that makes sense to go together, and then information about, say, all the IDs that are maybe the foreign keys, like their team ID or their organization, all of those kind of referencing columns, put those together. I know the order doesn't matter, I know you could just specify column order, but you know, when you do select star, or you just...
This is what I was trying to say. Are you trying to tell me that you're using select star?
Yeah, and here we go.
I'm joking. Actually, I'm using select star all the time. But I also see articles all the time about how bad it is to use select star, right? But it's convenient, I know.
Yeah, but also, if you work manually, it's inconvenient to have an odd order, and select star doesn't work well. But an application shouldn't use select star.
You're right. And I'm even thinking about examples like the system views in Postgres, like when you do select star from pg_stat_statements or something. But that's not even a table, right? That's a view, and then they can define the order. So it doesn't actually matter.
Yeah, it's an interesting topic. I played column tetris a few times heavily, and then it was really inconvenient, like, this is an ugly table, it looks ugly, this order is ugly. But at the same time, it's your problem, because you use select star, which is considered not a good practice. It's a practice for exploring and so on. Okay, explore with AI, ask it to provide a better order or something, I don't know. But anyway, if we focus on performance, again, I saw cases where it was a significant waste.
One of the first articles I saw on it was by Braintree, and I think they reported an average of about 10% on some of their largest tables.
It depends. Obviously, it completely depends.
But that was quite surprising.
I know you mentioned like an extreme case where you saw 30%.
But I was quite surprised that in an actual production case, it was as high as 10%.
And when you're talking about hundreds of gigabytes or terabytes, that's actually a decent amount.
Yeah, yeah, exactly.
The problem also, of course, in existing projects which were developed over many years: usually it's not so simple, because the schema has been evolving over many years.
And you cannot insert your new column in the middle.
It always goes, as you said, at the end.
So sometimes we do need, for example, to change the order using some redesign process.
By the way, have you heard that Postgres 19 will have REPACK, and also REPACK CONCURRENTLY, which is basically vacuum concurrently?
We had an episode with the author of pg_squeeze.
Is it VACUUM FULL CONCURRENTLY?
Exactly, exactly, not vacuum.
But the decision was made to name it REPACK CONCURRENTLY.
I think that's sensible, getting away from people thinking it's vacuum-related, yeah.
I haven't looked into the details, but, like, naturally, this is the point where you could reorder columns, because it's based on logical replication.
Yeah, I suspect if they do that, it will be a different keyword again.
Who knows?
Again, I haven't looked into the details.
There are long discussions on the pgsql-hackers mailing list.
It just feels natural, because it's based on logical replication.
This is where you can reorder columns and play tetris for existing large tables.
I hope it will be implemented at some point.
Maybe it's already partially implemented or something.
Basically, before you start filling the new table, you can reorder it, because it's logical.
It should be possible.
Anyway, this is column tetris, an entertaining topic.
Let's move on to a more serious topic called indexes: index chaos.
If you just use AI, and if you don't spend enough time on things like, as I said, experimenting with realistic data and benchmarks, you might end up in an under-indexed situation or an over-indexed situation.
And usually people are scared of being under-indexed.
In my practice, I see over-indexed very often, sometimes to an extreme.
I guess for new projects I would be surprised to see them over-indexed.
Oh, it's possible.
It's possible still.
Is that as a result of AI or something?
Well, the problem here is that, again, everything depends on your prompt.
It depends on angles.
If you say we will need to order by arbitrary column, right?
Remember our episode with CTO of Gadget, Harry, right?
So any column we can order.
Naturally, AI might decide to put index on every column, right?
But is it really what you want?
But that's what Gadget decided to do as well, right?
Right, but with understanding consequences, right?
But in reality, will they be used, if one year later you see that 90% of those indexes are unused?
And Gadget is a special case.
They have many applications to support. If you design only your own single application,
this is where I would just stop and think
with AI again, with some experiments.
I would ask questions.
This all depends on workload,
right? Which queries we will
have?
And I will just think
which queries we will really need
and which we won't need
and then choose
the proper minimal set of indexes.
Yeah, I like that.
I think there are also some rules of thumb we can give people in terms of how many indexes is where it starts to get dangerous, for other reasons. But the one place where it's not even an "it depends" is overlapping indexes. I see more and more people adding indexes based on AI suggestions: they're adding indexes they've effectively already got, or they're adding ones that overlap with indexes they've already got, and therefore only one of them is needed.
And that won't always show up in your stats. It won't always show up as an unused index, because the planner may still be using it, even though the smaller index would do.
It will show up in Postgres.ai stats, because in our case it's called a redundant index, in Postgres.ai monitoring systems and in checkup.
I never heard the term overlapping.
I see people talk about duplicate indexes, which is the trivial case: the exact same definition.
Right?
Yeah.
But we call them redundant indexes.
It's not as easy a topic as one might think, but we solved it over many years; we have quite a good detection approach for reliably detecting redundant indexes. And as you say, in many cases all of those indexes are used, and there is a big fear: oh, I'm going to drop this, it's used, should I drop it? But the reports we have are quite reliable. And of course, I wish we had a native, standard mechanism to disable indexes for some time, without involving HypoPG and hypothetical indexes. This came up in our last episode, which was on what's missing in Postgres, and I think we got a nice comment, I've forgotten who from, saying one of their missing features would be exactly this: temporarily hide an index from the planner. There are extensions for it, HypoPG supports it, but we want some mechanism beyond hacking pg_index and setting indisvalid to false.
We want this.
Yeah, yeah, yeah, I agree.
And this is probably a good thing to work on for those who want to start hacking on Postgres.
It would be a good thing to officially implement something.
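In the meantime, the hypothetical-index route mentioned above looks roughly like this (assuming the hypopg extension is available; the table and column are from the illustrative sketches above):

    -- Ask the planner "would you use this index?" without building it:
    CREATE EXTENSION IF NOT EXISTS hypopg;

    SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (status)');

    -- Hypothetical indexes are visible to EXPLAIN (without ANALYZE):
    EXPLAIN SELECT * FROM orders WHERE status = 'pending';

    -- Drop all hypothetical indexes for this session:
    SELECT hypopg_reset();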
Anyway, the questions to ask here should all start from understanding the workload.
We cannot think about indexes without understanding the workload.
So first we need to generate some fake data and then start thinking about usage patterns.
And without AI, it's really hard work.
It's a lot of time.
With the AI, it becomes easier.
If you have user stories defined already, you understand what kind of patterns we will have.
And you can start imagining.
It will not be 100% accurate, but it will be good enough, much better than without it
or with manual work. So you can iterate here and see and then with queries in hand and some
like lab already developed. I see it should take one or two hours for like medium size
project in terms of complexity. And then you can start iterating collecting plans and think
which indexes you can have, playing. But also this, whilst this might be the most important
in terms of performance that we've talked about so far, it's also the easiest to
change later, right?
We can...
Yeah, yeah.
We can set up a certain set of indexes and later decide we want a slightly different set of
indexes and that's not super painful in a lot, in most cases, especially if there's no
partitioning involved.
So that for me feels like, yeah, it's important.
It's really important.
But if we don't get it right on day one, it's much, much easier to fix than the primary key being the wrong data type, or not having the right constraints in place, that kind of thing.
Or even the column order, yeah.
So the point is you should build like wind tunnel for your project, right?
Like test lab, yeah.
Test lab, yeah, database lab.
And then you should apply this workload, like wind, and see how the profile of your database behaves under this wind.
Yeah, a lot of people do that with production. They just see which are their slow queries.
It helps that, with AI, it's easier to run massive experiments before you've finalized all the decisions.
So my point is: do it.
And then the questions to ask are: are there any indexes we don't need, in the sense that they are redundant or unused, and which indexes are missing because we have bad plans?
And here again, pgMustard will visualize it and so on. And you have an API already, right?
Yeah, it's quite nice.
You can connect your AI to pgMustard, for example, and ask it to use pgMustard to explain what's happening and find problems and missing indexes with this lab environment.
Great.
Going just like to the extreme, how many indexes on a table would you automatically just think,
oh wow, that's like a lot.
Maybe it's justified, but I'm already thinking that's too many.
These rules of thumb are quite weak in my opinion, but we have them anyway.
For example, if the volume of indexes exceeds the volume of data, some bell is already ringing, right?
I like that, yeah.
That one.
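That rule of thumb is easy to check with a quick query, for example something along these lines:

    -- Tables where total index size already exceeds heap size:
    SELECT relid::regclass AS table_name,
           pg_size_pretty(pg_table_size(relid))   AS heap_size,
           pg_size_pretty(pg_indexes_size(relid)) AS index_size
    FROM pg_stat_user_tables
    WHERE pg_indexes_size(relid) > pg_table_size(relid)
    ORDER BY pg_indexes_size(relid) DESC;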
Or, before Postgres 18, with this nasty lightweight lock, lock manager problem, it means I don't want more than 15 indexes per table.
Yeah, right?
Because of the 16-relation limit for fast-path locking.
It's so simple: primary key lookups, which quite likely will have quite high QPS, will suffer because of locking.
So there are a couple of rules of this kind, right? But they are not strict, and again, I consider them weak, but I think they're still helpful, right?
If you've got a schema designed by AI and it's got 20 indexes on one table, maybe consider: is that smart?
Yeah, yeah, yeah. But it happens.
Index data exceeding heap data. It happens.
Especially if you're doing GIN indexes.
Oh yes. If it's just B-trees...
Yeah, yeah. Okay, great. So, the final topic. We couldn't decide which one to choose, we had two ideas: one is RLS and another is partitioning.
Let's touch both, maybe.
Yeah, I think so.
So, from an AI perspective, when you're asking it for a new schema, do you see partitioning come up? Does it over-partition?
It won't come up unless you start saying: I'm going to have a lot of data here, and I want good performance, I want billions of rows stored in this table, and it should be maintainable. Then it will naturally come to the idea of having partitioning.
Partitioning will be interesting to combine with UUID version 7, or 8, doesn't matter. It's possible; again, I have a recipe for this in my set of how-tos. And there is also the question that partitioning is something you cannot design without understanding the workload, again. It's similar to indexes. How will it behave? Who knows. Maybe you will end up with queries that need to scan all partitions, which is terrible in most cases. And then planner behavior. Anyway, the questions to ask: you set the requirements, I want a lot of rows and I want good maintainability of this, help me. Which problems? Are you thinking of deleting old data, like archiving?
No, creating indexes, for example. Or reindexing.
Okay, yeah.
Vacuum.
Vacuum, yeah, yeah.
These are the problems biting us much more often, and more badly, than just, I don't know...
We talk about this a lot as well, right?
I know, I know.
Direct performance benefits from partitioning are good, but also the problem is when the creation of an index, or a full vacuum, not VACUUM FULL, but a regular vacuum of the full table, will take hours or half a day.
It's already so painful, especially if it's needed to prevent transaction ID wraparound, or if it's, again, index creation, which means holding the xmin horizon.
Anyway, let's not go too deep.
The question to ask: will we survive N rows, because we expect a lot of data, right? That's what I want.
And just out of interest, is it pretty much always range partitioning based on a timestamp type?
Is that pretty important?
Again, I'm the wrong person to ask.
Okay, yeah, fair enough.
In many cases, I just still use TimescaleDB if it's time-based.
Makes sense.
And I'm happy, because there is compression there, it's great.
But I also saw cases where it was decided, not by AI, to use list partitioning and full control over partitions, with a special table created.
It was also working well and served the needs.
But on the surface, you should think about the workload, and then push your AI or whatever to think about how it will behave if you have that much data.
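A minimal sketch of time-based declarative range partitioning (monthly ranges and all names are just illustrative; uuidv7() assumes Postgres 18+):

    CREATE TABLE events (
        id         uuid        NOT NULL DEFAULT uuidv7(),
        created_at timestamptz NOT NULL,
        payload    jsonb,
        PRIMARY KEY (id, created_at)    -- the partition key must be part of the PK
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2026_04 PARTITION OF events
        FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
    CREATE TABLE events_2026_05 PARTITION OF events
        FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');

    -- High-frequency queries should include the partition key so pruning kicks in:
    EXPLAIN SELECT * FROM events
    WHERE created_at >= '2026-05-01' AND created_at < '2026-05-02';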
Yeah.
Partitioning is inevitable.
And then, experiments again.
Even an experiment with a billion rows won't take a lot.
Just allocate a machine and do it.
It's just so easy, right?
Half an hour of waiting to generate it, or even if it's one hour.
I don't understand why people don't do it all the time.
They come to us with questions, which I'm grateful for.
Thank you.
With money and so on.
But it's so easy to just experiment more often these days.
Yeah, you can even leave it running overnight.
Yeah, exactly. With some plan, with some plan.
Yeah, I'm still on the fence. I love partitioning for all the reasons you mentioned, but I think so many projects won't ever need it that I understand not doing it by default with a new project, unless it's, like, a new feature for an existing system that you know is going to get heavy amounts of data really quickly.
There are just so many trade-offs, it's still a little bit like, which trade-offs? So, for example, indexes: being able to create indexes concurrently, drop indexes concurrently. You can create an index concurrently on each partition, and then...
Yeah, it exists on each partition. These things can be automated, and of course I'd like everything to be much better automated; there is big potential here. But also, every Postgres version has gotten a lot of improvements in the last maybe 10 years already, so it's not like it's frozen.
Yeah, but also just being extra careful that every high-frequency query you have contains the partition key and is getting pruned at planning time. There are a few gotchas there.
Yeah, there are.
I have a series of articles about this, and I went quite deep there, and it was crazy.
But that's my hesitation.
Creation of foreign keys, for example. Working with TimescaleDB sometimes, often we just abandon the idea of having a foreign key, because it's not supported.
If you do partitioning yourself, you will deal with this if you have large tables: adding a foreign key between them, when partitioning is involved, is an art, I would say.
In GitLab's master branch they have migration_helpers.rb.
This is a great source of the experience of many people involved.
And I can't stop recommending it; it's great because it's open source.
So it's a great place to look.
And they also have great documentation about it.
But anyway, it's a thing to decide: if you don't expect billions of rows, don't do it, of course.
Yeah.
Yeah.
At the same time, we also have very sad cases when people come to us for consulting, we identify problems, we say partitioning is needed, right? Sometimes we go and spend a lot of effort, because the project is huge or there are many databases to take care of, and then finally it's implemented. Great. But there are many hidden dangers to explore. Midjourney, for example: they came to us, we helped, and forgot about the lightweight lock, not forgot, we didn't know about it at that time, and they hit it. The lightweight lock, lock manager problem was hit badly, and there are articles about it.
But then some projects just evaluate how much effort it is to deal with partitioning when tables are huge, and start saying, okay, maybe we should migrate to MongoDB.
Yeah, or even sharding, right? Like, instead. I think when we talked to Notion, Arka, he mentioned that because they keep their shards relatively small, they've opted to never partition at all.
Yeah, the need for partitioning diminishes. I agree with this, but it doesn't vanish completely.
What, maybe?
Not maybe.
You mentioned it depends how small you keep your shards, right?
Yeah, yeah.
Like Sugu, for example, with Multigres, he talked about having lots of smaller databases being much easier.
Yeah, yeah, yeah, maybe you're right, okay.
What about RLS?
You mentioned you wanted to touch on that.
Yeah, finally, RLS. I just noticed, for example, that Supabase is very heavily promoting RLS.
They have a lot of articles, including how to avoid problems.
But if you just use AI, I noticed it often adds RLS.
Oh, really?
Yeah, this is just how it works.
I have had a couple of cases where it suddenly brought it in: we are going to use RLS, it's multi-tenancy here, let's do it.
Okay, let's do it.
But let's also pause and think about performance and avoid problems like current_setting() being called in a per-row loop, where a select count over a million rows suddenly becomes super slow.
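A sketch of the kind of trick being referred to, assuming a hypothetical tenant_id column and an app.tenant_id setting that the application sets per connection:

    CREATE TABLE documents (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        tenant_id bigint NOT NULL,
        body      text
    );
    ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

    -- Naive policy: current_setting() may be evaluated for every row:
    CREATE POLICY tenant_isolation ON documents
        USING (tenant_id = current_setting('app.tenant_id')::bigint);

    -- Common optimization: wrap it in a scalar subquery so it runs once per
    -- query (as an InitPlan) instead of once per row:
    DROP POLICY tenant_isolation ON documents;
    CREATE POLICY tenant_isolation ON documents
        USING (tenant_id = (SELECT current_setting('app.tenant_id')::bigint));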
So the question to ask AI, again, a benchmark is ideal here: with RLS, without RLS, what is the overhead of having RLS in our particular case? Benchmark it, and if you see it's not okay, then let's optimize, because there are tricks for how to optimize it. So if I notice AI generates a schema with RLS, I would ask: what will the performance impact be? Let's benchmark it, and if it's bad, let's optimize. That's it.
Yeah, great. I think that's actually a useful checklist: five or six things to ask your AI when you design a schema.
Yeah, or your colleague even.
Yeah, also. Or about your own schema, which you have had for 10 years already.
Why not? Yeah, exactly. Or even a new table. If you're doing a new feature, it still makes sense.
All of these make sense, right? I agree. Cool. Thank you for listening. Yeah. See you soon.
Catch you soon. Bye-bye.
