Postgres FM - Column Tetris
Episode Date: December 6, 2024

Nikolay and Michael discuss "Column Tetris" — what it is, why it matters, how to order columns for new tables, and how to re-organise existing ones.

Here are some links to things they mentioned:

"Column Tetris" by Erwin Brandstetter on Stack Overflow https://stackoverflow.com/questions/2966524/calculating-and-saving-space-in-postgresql/7431468#7431468
Data Types https://www.postgresql.org/docs/current/datatype.html
OrioleDB beta7 benchmarks https://www.orioledb.com/blog/orioledb-beta7-benchmarks
pg_hexedit https://github.com/petergeoghegan/pg_hexedit
Saving Space Basically for Free (blog post by James Coleman from Braintree) https://medium.com/paypal-tech/postgresql-at-scale-saving-space-basically-for-free-d94483d9ed9a
Ordering Table Columns (GitLab) https://docs.gitlab.com/ee/development/database/ordering_table_columns.html
postgres_dba alignment padding query https://github.com/NikolayS/postgres_dba/blob/master/sql/p1_alignment_padding.sql
Good explanation from Marco Slot of how alignment was used to fix a recent issue https://x.com/marcoslot/status/1858132850383421570
pg_repack feature request discussion https://github.com/reorg/pg_repack/issues/101
Our episode on bloat (with Chelsea Dole) https://postgres.fm/episodes/bloat
Optimizing table layout for maximum efficiency (blog post by Renato Massaro) https://r.ena.to/blog/optimizing-postgres-table-layout-for-maximum-efficiency

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
 Transcript
                                         Hello, hello, this is PostgresFM, episode number, again, I don't remember, 126, 7?
                                         
                                         Yeah.
                                         
                                         Great, 6, more than 2 years.
                                         
                                         And my name is Nikolai, Postgres AI as usual.
                                         
                                         My co-host is Michael, PgMustard. Hi, Michael.
                                         
                                         Hello, Nikolai.
                                         
                                         How are you doing today?
                                         
                                         Yeah, good. How are you?
                                         
    
                                         Good, I'm great as well.
                                         
                                         So let's play some Tetris, Column Tetris.
                                         
                                         I mean, talk about this game.
                                         
                                         It's not a very fun game, is it?
                                         
                                         I like Tetris, but Column Tetris not so much.
                                         
                                         Yeah, well, I think you can know zero about it,
                                         
                                         then you usually get excited about it,
                                         
                                         and then you kind of think, okay, it exists.
                                         
    
                                         But at some point you stop playing too much.
                                         
                                         Let's unwrap it.
                                         
                                         What is it?
                                         
                                         So columns can have different sizes.
                                         
                                         Data types have different sizes.
                                         
Yes, data types have different sizes. And most folks say, for example, integer, bigint, smallint, but I prefer saying int4, int8, int2, because then you see exactly how many bytes it occupies. Timestamps, regardless of with time zone or without time zone, all occupy eight bytes.
                                         
    
                                         UUID, I think 16, right?
                                         
                                         It's huge.
                                         
                                         Or eight as well.
                                         
                                         I think 16.
                                         
                                         I always forget.
                                         
                                         I actually can't remember.
                                         
                                         Yeah.
                                         
Boolean, which should take only one bit, takes a whole eight bits, one byte, right?
                                         
It's a very wasteful way to store your flags, your true/false values. What else? Those are the big ones, aren't they? I guess you've got the varying-length ones, all the numeric ones. In fact, the documentation is great at listing what they all are, but yeah, the variable-length ones, I think, are the other big category.

Yeah, varlena types, like text, and varchar, and JSON, arrays, JSONB, of course, all these guys, they are variable length, and that's a different story. They are not fixed-size data types, right? But for the data types which have fixed size, it's interesting to see that sometimes, just by reordering columns,
                                         
    
                                         you can save disk space and get better performance.
                                         
                                         Yeah, I would say most people, including myself,
                                         
                                         when they first learn this, find it very unexpected.
                                         
                                         Yeah, I agree.
                                         
                                         Same.
                                         
                                         It was long ago, but yeah, same.
                                         
Yeah, mine was probably more recent, but yeah. So, I don't know if you want to get into any of the details as to why. I don't fully understand the reasons why.

What I was reading, let's just say, is that it's something about the efficiency of how the CPU works: on modern systems, it's better for performance to read and write in units of 8 bytes. But it was a surprise to me to realize, you know, we talked about this before recording, that alignment happens not only to eight bytes, but sometimes also to four bytes and to two bytes.

Yeah.

But let's have an example. A very typical example: you have a four-byte column, it's just the primary key, say id integer, which we don't recommend.

Yeah, we don't recommend it.

And I always... you know, it's interesting: you see something, you learn something, usually by making some mistakes, going through some pain in production and so on, and then you start recommending something. But still, people use integer for primary keys, for example Supabase, and OrioleDB, they just had a great release, it was on top of Hacker News. By the way, congratulations, it looks great, the benchmarks look great. Beta 7, right?

Yeah.

And I admire Alexander Korotkov a lot. He helped me personally many times.
                                         
And for example, recently we had a discussion about track_planning, and we didn't mention that there is a good thread discussing sampling to mitigate the overhead of pg_stat_statements. I discussed it with Korotkov some time ago: I dropped him a line and he immediately commented on that thread. So it's cool. I mean, great. But he
                                         
used integer for primary keys in the benchmark examples. But, you know, when a table is named, I don't know, users, can we have 2.1 billion insert attempts to such a table?
                                         
                                         Maybe yes, I don't know.
                                         
                                         It depends on the size of your project.
                                         
But anyway, we don't recommend using int4 primary keys, as you mentioned. I stand by this point, but I'm less aggressive about judging people when they use int4 primary keys.
                                         
So a lot of the argument for an int4 primary key versus int8 would be saving space. But due to this alignment padding issue, if you're going to follow it with a created_at timestamp, for example...

Yes, exactly. Super common.

...if that's the order when you create your table, you're not saving space. Other than, well, we'll get to this, other than maybe in some indexes, you're not saving space by using the smaller data type. So I think it's really fascinating, and as I said, for me it was unexpected.
                                         
Not because I knew better, but because in most other formats, when you create things, you see their length grow as you add things together, and you don't consider that the lower-level structure is going to be grouped into little pages. I guess the one exception is when you're creating, maybe, some PDFs or something. Let's say you wanted to print out the documentation as PDFs: you probably would want to start new chapters on new pages, for example. There's some formatting, when you make books and things, where you don't want a chapter to start at the end of a page. So it's that kind of thing, right? It's shifting data to start at a new point, to make things easier for the reader.

Yeah, when you use Google Docs, for example, you insert a page break.
                                         
                                         Yeah.
                                         
                                         Similar.
                                         
                                         A good analogy, I agree.
                                         
                                         So back to example again,
                                         
    
if it's an int4 primary key, and then created_at timestamp, and then, for example, I don't know, org_id or group_id, also int4, or int8, it doesn't matter: at storage level, what happens between id and created_at, between the first and the second columns? Postgres will just fill it up with four zero bytes. And we can see it using pg_hexedit on Linux. If you
                                         
    
have Ubuntu or something, you can have this graphical interface which works on top of the pageinspect extension. You can see the zeros with your own eyes, right? It's from Peter Geoghegan.

Is that right? Yeah, yeah, I've seen him using it.

Yeah, I used it a few times. But I remember it required Ubuntu. I'm not sure, maybe it's possible to run it on macOS right now, so I used it in virtual machines, which is a little bit of overhead.
                                         
                                         But these zeros are that padding, yes?
                                         
Yes. Since the second column needs 8-byte alignment, Postgres adds four zeros, so the first column also basically takes 8 bytes instead of 4. So the first two columns together take 16 bytes instead of 12, right? And it means that if you use an int8, or bigint, primary key, it would be the same in terms of storage here, as you said.
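The rule behind those four zero bytes can be sketched in a few lines. This is a simplified model of Postgres's padding behavior, not its source code; the (size, alignment) pairs below match the documented sizes for int4 and an 8-byte timestamp:

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of `alignment` (the padding rule)."""
    return (offset + alignment - 1) // alignment * alignment

def data_size(columns):
    """Bytes used by a row's fixed-size columns, including alignment padding.

    `columns` is a list of (size, alignment) pairs, e.g. int4 -> (4, 4),
    an 8-byte timestamp -> (8, 8). Variable-length types are out of scope.
    """
    offset = 0
    for size, alignment in columns:
        offset = align_up(offset, alignment)  # insert zero padding if needed
        offset += size
    return offset

# id int4 followed by created_at timestamptz: 4 bytes of padding appear
print(data_size([(4, 4), (8, 8)]))  # 16, not 12
# reversed order: no padding at all
print(data_size([(8, 8), (4, 4)]))  # 12
```

The same two functions reproduce the "16 bytes instead of 12" figure from the discussion above.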
                                         
And we do recommend using int8, because who knows, maybe you will reach 2.1 billion for your sequence, which is generating the numbers. Once you reach that... the sequence itself is always eight bytes, that's okay, but if the column you insert into is four bytes, padded with zeros anyway, you will end up having a problem, because you cannot insert into it anymore. I mean, values larger than 2.1 billion won't be possible to insert, because the capacity of four bytes is 4.2 billion and we have a signed int. And it's a bad situation, because big surgery will be needed on this table. It's really not an easy topic, and we touched on it a few times in the past. Okay, but what I didn't realize until very recently is that padding can happen also to 4 bytes and to 2 bytes.
                                         
So if you have, for example, int2, int4, int2, the first int2 is going to be padded to four bytes, then four bytes for the int4, then two bytes. So overall it will be 10 bytes. This was a surprise to me, because I thought they would take only eight bytes.
                                         
    
It's not so: padding happens at different levels, eight bytes, four bytes, and also two bytes, as you said. The same thing if you use boolean: you can see that, similarly, it's going to be padded to 2, 4, or 8 bytes, depending on the subsequent column's data type.
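Using the same simplified (size, alignment) model as before (a sketch, not Postgres internals), both claims are easy to check: the boolean's padding tracks whatever its successor needs, and the int2/int4/int2 row really is 10 bytes:

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of `alignment`."""
    return (offset + alignment - 1) // alignment * alignment

def data_size(columns):
    """Sum fixed-size columns with alignment padding; columns are (size, alignment)."""
    offset = 0
    for size, alignment in columns:
        offset = align_up(offset, alignment) + size
    return offset

BOOL, INT2, INT4, INT8 = (1, 1), (2, 2), (4, 4), (8, 8)

# A boolean is padded to whatever its successor requires:
print(data_size([BOOL, INT2]))  # 4  (bool padded to 2 bytes)
print(data_size([BOOL, INT4]))  # 8  (bool padded to 4 bytes)
print(data_size([BOOL, INT8]))  # 16 (bool padded to 8 bytes)

# The int2, int4, int2 example: 10 bytes, not the "raw" 8
print(data_size([INT2, INT4, INT2]))  # 10
```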
                                         
Exactly, always the next one. And also, to make the picture complete: take the first example we used, an int4 primary key, a created_at timestamp taking eight bytes, and then, for example, I don't know, org_id or something, also four bytes. The padding for the first column is going to bring it up to 8 bytes, then we have 8 bytes, and at the end there will also be padding, not for this tuple, but before the next tuple. So overall we end up with three 8-byte words. That's 24 bytes, right? And just by moving the third column to the second position, we go down from 24 bytes to 16 bytes. Also, there are 24 bytes for the tuple header, which is actually 23, but one byte is not used, as I remember, and it's always padded to 24 bytes. So overall we will have 48 bytes per tuple instead of 40. And this is the amount of saving per tuple we can achieve if
                                         
    
we just pack the two- and four-byte data types together into one 8-byte position. And this may seem to contradict our recommendation to use int8 primary keys, but again, the reason for int8 primary keys is to avoid the risk of surgery on the table. If you know you will never reach the 2.1 billion value for your sequence, never ever, it's okay. I could even use a 2-byte integer, smallint, if I know I won't reach 32,767, right? I always forget the exact number. It requires careful planning, so you need to think about the future and all the possibilities, but my rule of thumb is: just use int8, and that's it.
                                         
                                         But back to this additional
                                         
                                         example, this is
                                         
                                         kind of significant saving, right?
                                         
                                         From 48
                                         
                                         to 40 bytes.
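The 48-versus-40 arithmetic above can be reproduced with the same simplified model, adding the 24-byte tuple header and the final pad to an 8-byte boundary before the next tuple. This is a sketch under those assumptions; it ignores null bitmaps and variable-length columns:

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of `alignment`."""
    return (offset + alignment - 1) // alignment * alignment

def heap_tuple_size(columns, header=24):
    """Approximate on-page size of one tuple: the header plus aligned columns,
    with the whole tuple padded out to an 8-byte boundary at the end."""
    offset = 0
    for size, alignment in columns:
        offset = align_up(offset, alignment) + size
    return header + align_up(offset, 8)

INT4, TIMESTAMPTZ = (4, 4), (8, 8)

# id int4, created_at timestamptz, org_id int4: 24 bytes of data after padding
print(heap_tuple_size([INT4, TIMESTAMPTZ, INT4]))  # 48

# moving org_id next to id packs both int4s into one 8-byte word
print(heap_tuple_size([TIMESTAMPTZ, INT4, INT4]))  # 40
```

That 8-byte difference per tuple is exactly the saving discussed above, before any index effects.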
                                         
It's kind of 20%, right?

Yeah, as a percentage. But I think where this becomes meaningful is when people have huge tables, because we're talking about only a small number of bytes, and in this day and age even RAM's not that expensive. It's expensive at scale, but we can get quite powerful machines for not very much money these days, and loads of storage for not very much money.

Well, RAM too, because this is not only about storage: it's spamming RAM with zeros.

I am a huge fan of doing things efficiently, but I think it's at scale where the investment in this kind of thing starts to pay off. And one of my favorite blog posts on this topic is by the team at Braintree. Once they discovered this phenomenon, by going back through their largest tables, they calculated it was about a 10% saving on disk space, from where they were before to a more optimal setup. So I believe that, and I don't see any reason why that wouldn't be true for a lot of organizations that hadn't done this deliberately in the past, where people hadn't been conscious of it.

So, is 10% a lot? By some benchmarks it's noticeable.

It's noticeable, yeah. But sometimes when we talk about performance we're talking about orders of magnitude, and I don't think we're talking about that here in most cases.

Still, you know, there are stages of impression here, and my current impression is: well, it's an important problem,
    
but it's worth thinking about when you first design a table which will obviously be large in the future. For example, GitLab: they have public documentation, and there is a Column Tetris page there. I know it's a rule there to always check this for a new table. And maybe there is also some automation, I'm not sure.
                                         
But we have bloat as well, in different dimensions. This dimension is kind of like programmed bloat at the column level, but at the row level we have regular bloat. And 10% is probably... For example, if I see an extra 10% of bloat in both indexes and tables, it's not triggering action from me, exactly right. But if you design a table from scratch, of course it's worth thinking about it.

Yes, mostly because it's so cheap to think about. And I guess we'll talk about this in a moment: it's not a difficult equation to work out how you should order things.

And it can be automated. We discussed this also before recording. I have an analysis script in my postgres_dba toolkit which will suggest it. It's easy: you just start from the largest columns, or from the smallest, it doesn't matter, or maybe it matters a little bit. So actually, start from one-byte columns, then two-byte columns, and so on, or vice versa, start from the biggest size and go down to the smallest. It will reorganize the columns so that you have fewer cases where you see zeros in the physical storage. But don't you think it makes the column order weird? For example, id can end up not in the first position, and it's so strange to deal
with such tables.

Yeah, I haven't had to do this optimization myself a lot. I think I would start large, at least the eights first, or at least a group of eights first, because I'd want my id in first place. You know, if you're doing select star on a table, if you're just doing some testing, just having that id and those first few useful columns first is beneficial for me as a user.
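The largest-first reordering described here can be sketched as a greedy sort, again using the simplified (name, size, alignment) model. The real postgres_dba query reads the system catalogs; this is just the heuristic, and the table below is a hypothetical example:

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of `alignment`."""
    return (offset + alignment - 1) // alignment * alignment

def data_size(columns):
    """Padded byte count for columns given as (name, size, alignment)."""
    offset = 0
    for _, size, alignment in columns:
        offset = align_up(offset, alignment) + size
    return offset

def column_tetris(columns):
    """Greedy heuristic: sort by alignment, then size, descending.

    Puts 8-byte columns first and packs 4-, 2-, and 1-byte columns together.
    Python's sort is stable, so original order is kept within each group.
    """
    return sorted(columns, key=lambda col: (col[2], col[1]), reverse=True)

table = [
    ("id", 4, 4),          # int4
    ("created_at", 8, 8),  # timestamptz
    ("org_id", 4, 4),      # int4
    ("is_active", 1, 1),   # boolean
]

print(data_size(table))                 # 21 (4 bytes of padding)
print(data_size(column_tetris(table)))  # 17 (no padding)
print([name for name, _, _ in column_tetris(table)])
```

Note how id is no longer the first column after reordering, which is exactly the "weird column order" trade-off raised above.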
                                         
                                         But also, I was reading the Braintree post,
                                         
                                         and they've done some optimizations around...
                                         
                                         They've suggested putting the ID first
                                         
                                         because it's going to be read the most.
                                         
                                         So in terms of unpacking the row,
                                         
                                         apparently there's some optimization around that as well.
                                         
    
                                         Okay.
                                         
                                         So they also recommended some other super minor optimizations
                                         
                                         because they've built tooling around this.
                                         
                                         Like, for example, within your,
                                         
                                         let's say you've got a bunch of eight byte columns.
                                         
                                         Within those, you're probably better off putting your not null columns first
                                         
                                         because they're more likely to have data in them.
                                         
                                         And then putting your default
                                         
    
                                         columns a bit after that because then we'll almost always have data in them so like there's some
                                         
                                         these like probably right at the edge in terms of optimization but now we're not talking about
                                         
                                         storage space we're talking about interesting like we're talking about um read performance
                                         
                                         so it's and and actually io is a good point because if your data is taking up less space,
                                         
                                         some of your queries are going to also
                                         
                                         require less IO.
                                         
                                         And that's faster, cheaper,
                                         
                                         all good things.
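The IO point can be put in rough numbers: smaller rows mean more tuples per 8 KB page, so the same scan touches fewer pages. A back-of-the-envelope sketch, with assumed overheads and ignoring fill factor, bloat, and TOAST:

```python
# Back-of-the-envelope: smaller rows -> more tuples per 8 KB page -> fewer
# pages to read. Assumed numbers: 24-byte page header, ~28 bytes of
# per-tuple overhead (heap tuple header + line pointer); fill factor,
# bloat, and TOAST are ignored, so this is only a rough sketch.

PAGE_BYTES = 8192
PAGE_HEADER = 24
TUPLE_OVERHEAD = 28

def pages_needed(n_rows, data_bytes_per_row):
    rows_per_page = (PAGE_BYTES - PAGE_HEADER) // (TUPLE_OVERHEAD + data_bytes_per_row)
    return -(-n_rows // rows_per_page)  # ceiling division

n = 100_000_000
before = pages_needed(n, 32)  # padded row layout
after = pages_needed(n, 24)   # same columns after Column Tetris
print(before, after, f"{(before - after) / before:.1%} fewer pages")
```

For this hypothetical table, saving 8 bytes of padding per row cuts the heap by roughly the same fraction as the row shrinks.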
                                         
    
                                         Right. Well,
                                         
                                         I didn't see that.
                                         
                                         Did they show some benchmarks
                                         
                                         to prove that there's
                                         
                                         a reason, I mean,
                                         
                                         foundation for these decisions
                                         
                                         to reorder columns? It's interesting. I need to check that. I didn't see it, but I think
                                         you're probably right, and I need to adjust my recommendation logic and put eight-byte columns before
                                         
                                         16-byte columns, because ID should be there, or created_at should be there, and so on.
                                         Although if it's the UUID type, it's again going to be 16 bytes. So yeah, I don't know.
                                         Anyway, this is the topic. I think it's super easy not to know about it, obviously, but it's also
                                         super easy to overestimate the benefit and pay too much attention to it. But it's interesting
                                         what happened recently with Postgres versions, you know, 17.1, 16.5, and so on. A couple of weeks
                                         ago they were released, and they caused, for example, TimescaleDB to not work with them. And what happened: in some structures,
                                         
    
                                         there was an additional Boolean field which was added,
                                         and the physical layout of those structures changed in a minor version,
                                         causing some extensions like TimescaleDB to stop working.
                                         And yeah, it was not a good situation, and I think a lack of testing. But
                                         more interesting: the next week they deployed new releases, 17.2, 16.6, and others,
                                         and they fixed this problem. And actually, originally the Boolean value was added to fix some problem, right?
                                         
                                         It was a fix.
                                         
                                         Because in minor versions, unlike some other systems, Postgres doesn't release
                                         
    
                                         new features.
                                         
                                         It's only about fixes, right?
                                         
                                         But this fix caused incompatibility in terms of physical layout.
                                         
                                         It's bad.
                                         
                                         And how did the next releases fix it?
                                         This Boolean was moved to a different location, in terms of order,
                                         where some padding happened, where some zeros were present, right?
                                         Filling one of those zeros.
                                         
    
                                         So Column Tetris, as I've tweeted, Column Tetris well played, right?
                                         
                                         So, I mean, Postgres hackers themselves played Column Tetris a few weeks ago, fixing some problem, fixing a bug which was another fix of another bug, right?
                                         
                                         That was interesting to observe.
                                         
                                         I think it's worth discussing. We've discussed a bit why you would do this: save on disk space,
                                         less IO, better caching, better use of RAM. But I think when is also useful to discuss.
                                         Obviously we can do this when we're creating a new table; we can think about column order then.
                                         That's normally when we're adding the most columns. But I think also sometimes, when
                                         
                                         you're adding a new feature or you're doing refactoring you might be adding a few columns
                                         
    
                                         at the same time, and I think it's worth pointing out that that's another time where you can make
                                         use of this. But equally, I think we're moving to a world where there's more and more emphasis
                                         on incremental improvements, and I know for my own projects I tend to add one
                                         column at a time, like I tend to be doing a migration. Oh yeah, like five releases
                                         per day? Yeah, I'm not doing five per day, but I keep things small,
                                         and they tend to not be multi-column.
                                         
                                         Well, in software engineering, this makes sense completely
                                         
                                         because if you mix many things in one big change,
                                         
    
                                         making them depend on each other, it's bad.
                                         
                                         So we try to split it into small steps and release them sooner, because if
                                         we have a problem with one of the steps, it doesn't block the others from deployment, right?
                                         This means that we probably indeed add columns one by one, normally, or maybe other people have
                                         different experience. Or even in pairs, and then I guess the order matters too. But in this
                                         context, or the previous context, what do you think: first of all, when we run CREATE TABLE
                                         with many, many columns, why doesn't Postgres... You were going to ask: if it doesn't,
                                         why doesn't it do it by default itself? Yeah. Then it's unexpected in a different way, like
                                         we were saying: I want my ID to come first.
                                         
    
                                         No, no, no, no, no.
                                         
                                         Imagine we had a logical order and physical order.
                                         
                                         Instead of the attnum column in pg_attribute,
                                         
                                         we would have two, right?
                                         
                                         And we know what to present to the user, the logical order.
                                         
                                         And we know physical because at creation time,
                                         
                                         Postgres would have some algorithm to
                                         
                                         reorganize it, right? And I think it's possible. I'm not sure if it was discussed; it should
                                         have been discussed a few times, but obviously it didn't happen yet.
                                         
                                         But further, let's dream a little bit more. And based on this case with the new minor releases,
                                         
                                         which played Column Tetris,
                                         
                                         when you have an existing table,
                                         
                                         imagine we have an existing table and we have some padding happening there,
                                         
                                         and we add a Boolean column, and the table is huge.
                                         
                                         We could add it, I mean, Postgres could add it probably
                                         
                                         to some different position, not increasing
                                         
    
                                         size at all, right?
                                         
                                         Well, I don't know what needs to happen then. What happens to the data files?
                                         Well, for existing rows it's different.
                                         
                                         If we need a full rewrite, we do it, right?
                                         
                                         But since Postgres 11, if you define a default for a column, it will
                                         be virtual, so without a physical rewrite. It will just say, okay, until this moment, I think
                                         based on xmax, xmin, or something, right? Until this transaction ID, all old rows virtually
                                         have this value, even though we don't write it physically. It was a great optimization.
                                         
    
                                         And actually, what I discussed here is also in the same area of possible optimizations Postgres
                                         could have, right? So only future rows will have it; old rows had zeros there already, as padding.
                                         Okay, but would the padding look different? Well, there is a specific additional place in Postgres,
                                         I forget the name, but in Postgres 11 it was created for defaults, right?
                                         Saying all rows kind of have this value for this column, which is new.
                                         All future rows will have it in the normal way.
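The fast-default mechanism being described (added in PostgreSQL 11, stored as atthasmissing/attmissingval in pg_attribute) can be modeled in miniature. This is an illustrative sketch of the semantics, not the actual implementation: old rows are never rewritten, and the remembered default is substituted at read time.

```python
# Miniature model of PostgreSQL 11's "fast default" behavior
# (atthasmissing / attmissingval in pg_attribute). Illustrative only.

class Table:
    def __init__(self, columns):
        self.columns = list(columns)
        self.missing = {}   # column -> value old rows "virtually" have
        self.rows = []      # physically stored values only

    def insert(self, **values):
        self.rows.append(values)

    def add_column_with_default(self, name, default):
        # No table rewrite: just remember the missing value.
        self.columns.append(name)
        self.missing[name] = default

    def select(self):
        # Rows written before the ALTER get the default filled in on read.
        return [{c: r.get(c, self.missing.get(c)) for c in self.columns}
                for r in self.rows]

t = Table(["id"])
t.insert(id=1)                             # written before the ALTER
t.add_column_with_default("active", True)  # instant, no rewrite
t.insert(id=2, active=False)               # written after the ALTER
print(t.select())  # row 1 shows active=True without ever storing it
```

The proposed padding optimization is analogous: old rows would keep their zero bytes, and only new rows would physically carry the column in the reclaimed slot.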
                                         
    
                                         Same here.
                                         
                                         We have padding zeros, right?
                                         
                                         In the future, this column, say it's called fired, right? Boolean, true or false, or null,
                                         three-valued logic. We can have it in some position where we had zeros. Old rows keep their
                                         zeros; all new rows will have the real value. That's it. It's good, right? I mean, it would
                                         be beautiful. It'd be lovely. You know, in real Tetris, if it's getting quite fast and
                                         you're starting to get some of these gaps, every now and again you get a shape that looks
                                         exactly like a gap you've got right down near the bottom, and you want to just pick it up and pop it in.
                                         
                                         It would feel like that, wouldn't it?
                                         
                                         Similar, yeah. Another good analogy, right? So I think there is some sense in this optimization.
                                         
                                         I don't know, maybe I should discuss it with Andrey and Kirk, we're probably slowly returning to our normal hacking sessions on Postgres TV.
                                         
                                         We have actually already quite a long list of ideas.
                                         
                                         Yeah.
                                         
                                         There's two other things I wanted to discuss with you on this.
                                         
                                         One was doing it with existing data.
                                         
    
                                         So we find out that this is a phenomenon, or we've inherited a system that wasn't set up optimally, or it's just evolved over time.
                                         
                                         We've been adding columns as we've been developing new features as the system evolved.
                                         
                                         And we decide it is worth it for us to, or like we want to look into how would we reorganize the table into a different column ordering.
                                         
                                         How would you go about doing that? Or how do you go about doing that? It's the same complexity of surgery
                                         
                                         as for int4 to int8 primary key conversion.
                                         
                                         I wish pg_repack would do it,
                                         
                                         but we had a discussion
                                         
                                         and I think I had some very quick and dirty prototype
                                         
    
                                         to reorganize column order
                                         
                                         when you are repacking a table using pg_repack. It would be a good moment, because it rewrites
                                         everything. But there are some doubts and fears, and I think we just had a lack of attention
                                         from very experienced hackers there, so it didn't go anywhere. Somebody in that pull request
                                         should pick it up and try it again, because the idea is simple.
                                         
                                         pg_repack is rewriting the whole table.
                                         
                                         If we need to physically reorganize the order of columns, we also need to rewrite the table, right?
                                         
                                         And let's just change the column order at this moment.
                                         
    
                                         But this is, again, something which is not ready. You cannot use it in production.
                                         I don't recommend pg_repack for this, although there is a quick and dirty prototype.
                                         
                                         How I would do it: I would just apply the new-table approach, same as for int4 to int8.
                                         
                                         But it requires some additional effort.
                                         
                                         It's similar to pg_repack.
                                         
    
                                         You need some kind of delta table.
                                         
                                         You need a trigger, and you need to write changes to that delta table.
                                         
                                         And then you create a copy.
                                         
                                         While you create a copy, this trigger should write all the changes in this delta.
                                         
                                         Then copy is created.
                                         
                                         There are interesting tricks there to minimize the duration of transactions, as I remember.
                                         
                                         It was long ago; since I last used and developed this, I think maybe five years have passed.
                                         
                                         We had a very serious workflow developed for a big company originally.
                                         
    
                                         And we used it several times, this approach, under very heavy loads and very mission-critical systems where Postgres was in the center.
                                         
                                         And then you just have some transaction to switch.
                                         
                                         You should be very careful.
                                         
                                         Foreign keys will be the biggest problem, actually, in terms of switch.
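The new-table workflow described above, copy the table in the new column order while a trigger logs concurrent changes into a delta table, then replay and switch, can be sketched as a toy model. Python dicts stand in for tables here; a real implementation uses SQL triggers, careful locking, and a final switch transaction, and every name below is illustrative.

```python
# Toy model of the "new table + delta + trigger" rewrite described above.

SIZES = {"id": 8, "created_at": 8, "flag": 1}  # bytes per column

def physical_order(row):
    """Rewrite a row with columns ordered largest-first."""
    return {c: row[c] for c in sorted(row, key=lambda c: -SIZES[c])}

source = {1: {"flag": True, "id": 1, "created_at": "2024-01-01"}}
delta = []  # the "trigger" appends every change made during the copy

# Step 1: bulk-copy existing rows into the new table, reordered.
new_table = {k: physical_order(v) for k, v in source.items()}

# Step 2: a concurrent write arrives mid-copy; the trigger logs it.
delta.append(("insert", 2, {"flag": False, "id": 2, "created_at": "2024-01-02"}))

# Step 3: replay the delta, then atomically switch to the new table.
for op, key, row in delta:
    new_table[key] = physical_order(row)
source = new_table
print(list(source[1]))  # columns now come out largest-first
```

The delta log is what keeps the long-running copy safe under concurrent writes; the switch itself only needs a short final transaction.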
                                         
                                         Well, I had an alternative.
                                         
                                         Logical replication.
                                         
                                         Yeah.
                                         
                                         Yeah.
                                         
    
                                         And I think if you've only just discovered this and you have several large tables that
                                         
                                         you want to do it on all at the same time. I realize it's heavy-handed if it's not your
                                         
                                         largest table or if you've got tons of data that is already pretty optimal.
                                         
                                         But if you're doing one big retroactive pass, what's the downside?
                                         Of logical? Well, as always, the downsides of using logical replication under heavy loads:
                                         it's tricky sometimes. And in this case we need a publication for one table, so we
                                         need to work carefully with retries and so on, because some locks will be needed. They won't block other
                                         sessions, but you can just fail.
                                         
                                         For example, if you have long-running,
                                         
    
                                         as usual, long-running auto-vacuum process,
                                         
                                         which is processing your table
                                         
                                         to prevent transaction ID wraparound,
                                         
                                         you won't be able to get the lock for a long time.
                                         I mean, while it's running, you'll fail.
                                         There are some nuances here, right?
                                         But I think it's obviously a valid approach, and combining all these thoughts, I think
                                         pg_squeeze from Cybertec, which is an alternative to pg_repack, is probably the tool where this
                                         should be. Maybe it's already supported there; I checked and couldn't see it in the documentation,
                                         but it makes sense: getting rid of bloat, let's also reorganize the table a little bit and get
                                         rid of padding, which can also be considered a kind of bloat, right?
                                         
                                         Yeah, I think so, depending on your definition. Vertical bloat,
                                         yeah. Like, if you rebuilt your table now, how much smaller would it be?
                                         That's kind of how I think of bloat: the delta between how big it is now
                                         and how big it would be. Well, you're talking about dump/restore; you can check it like that,
                                         or with VACUUM FULL, but it won't remove padding zeros. No. So
                                         therefore maybe it doesn't count in that strict definition, but if you allow for a column reordering
                                         in between, then suddenly it counts.
                                         
                                         It would be a good feature for pg_squeeze.
                                         
                                         And unlike pg_repack, I think people should not have fears.
                                         Because if you use logical replication and you reorganize your table's column order, I don't
                                         see problems. Same. Yeah, because pg_repack works by substituting relfilenodes; it's
                                         basically substituting files. It's kind of a hack, right? And it sounds scary, while pg_squeeze
                                         is using an official API, basically logical decoding, which is good. Because sometimes pg_repack
                                         is not available if it's a managed Postgres offering and they forgot to add pg_repack.
                                         
                                         I don't know why they would do that, but actually I think Supabase doesn't have pg_repack.
                                         Interesting. Yeah, I think so. I checked last week: a new client came, they are on Supabase,
                                         and I checked and didn't see pg_repack among the supported extensions. I wonder how Supabase
                                         clients deal with bloat, or maybe they don't care yet.
                                         
                                         But by the way, again, congratulations on OrioleDB.
                                         It's great. OrioleDB is a super innovative thing
                                         which has high chances to be widely used, because I think it's under the Apache license,
                                         and the plan is to have it as an extension for regular Postgres, which is great. But it's not
                                         there yet, because some changes need to be done in Postgres core. But there are chances it
                                         will be done. And it's not recommended for production use yet. Yeah, of course, because
                                         basically it requires patching the original Postgres. I think it's more than that; there's
                                         still other work. Yeah, it requires many, many years to build a reliable database, and it can
                                         be considered a kind of new database, because it's heavily modified Postgres, right? The
                                         storage layer is heavily modified there. Yeah, I can't wait to see... You know, we discussed
                                         with Korotkov that branching can be native in Postgres. This could be huge. Yeah, there's a
                                         lot of promise there, I think. But anyway, back to pg_squeeze. I think it's a good idea, if
                                         it would do it. You have many ideas today. Those are separate episodes.
                                         
Yeah, back to practical stuff. There's one more thing. So yeah, I think those are two ways of doing it. And I also wondered, at what kind of volume are you generally seeing this? What kind of table sizes are we talking about before this starts to make any sense, normally?
                                         
You know, it's hard to say, because, again, I told you: even if I have a 1TB table, which is already above the threshold where partitioning should be used, and I see an additional 10% of bloat, I don't care. So if I see an additional 10% of padding, I don't care. 10%, because engineering time is very expensive, right?

Yeah. Yeah, I know.

It's not only about storage. People say it's about storage costs, but I'm mostly concerned not about storage, although storage also matters, because it affects backup sizes and replication sizes and so on. What matters more for me is the state of memory, and this is like additional spam in the buffer pool and page cache. It's not good. So at 10% I wouldn't be bothered, but at 20% I'm already kind of in warning mode, right? At 30-40%, it's already worth doing it, right? And by the way, it's interesting, you know,
                                         
like I always tell people about bloat. Let's connect topics in a very weird way. So bloat, regular bloat: we have dead tuples, they get cleaned by autovacuum or vacuum, and we have gaps, like slots for tuples that are not filled, and a page holds only a few tuples and many empty spaces. This is called bloat, right? So over time Postgres will probably put new tuples there, but maybe not, because new tuples arrive at a different pace, and we end up using more space than we could. And dump/restore would help, or pg_repack, or pg_squeeze, or VACUUM FULL, which is not good in production, and so on.

So how do we understand the bloat level? We use some scripts, some queries, or tools and monitoring systems use some queries. These queries sometimes are heavy, but not as heavy as a full table scan would be, right? Still, they are light, but they are estimates. Yes. And you see a bloat
                                         
    
level of 40%, but it's not 40%. You dump/restore and you still see 40%. How come? And it's interesting: if you take one of those examples we discussed, for example boolean, int8, boolean, int8, boolean, int8, repeated like 10 times.

And check the bloat level using one of those estimated bloat queries,

which actually all originated from one work.

So you will see some bloat.

I can demonstrate bloat like 50%.

It's insane, right?
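The boolean/int8 example is easy to check with back-of-the-envelope arithmetic. Here's a rough sketch, not real Postgres code: it assumes the usual 8-byte alignment for int8 on 64-bit builds and ignores the tuple header, null bitmap, and line pointers, so the absolute numbers are illustrative only.

```python
# Simplified model of how Postgres lays out fixed-width columns in a row:
# each column is placed at the next offset that satisfies its alignment.

def padded_row_size(columns):
    """Lay out (size, alignment) columns in order, adding alignment padding."""
    offset = 0
    for size, align in columns:
        offset += (-offset) % align  # pad up to the column's alignment boundary
        offset += size
    return offset

BOOL = (1, 1)   # boolean: 1 byte of data, 1-byte alignment
INT8 = (8, 8)   # int8 / bigint: 8 bytes of data, 8-byte alignment

interleaved = [BOOL, INT8] * 10        # boolean, int8, boolean, int8, ...
reordered = [INT8] * 10 + [BOOL] * 10  # all int8 first, booleans packed last

print(padded_row_size(interleaved))  # 160 bytes: each boolean drags 7 bytes of padding
print(padded_row_size(reordered))    # 90 bytes: same columns, no padding between them
```

Under this model the interleaved layout wastes 70 of 160 bytes, roughly the 40-50% "bloat" those estimate queries report, even though it's padding, not bloat.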
                                         
And you dump/restore, you VACUUM FULL, you pg_repack,

and this bloat cannot be eliminated, because it's not bloat. I think it's very related to the mistakes, the errors in these scripts, errors related to padding.

That could be fixed, right?
                                         
That could be fixed. I tried. I didn't have enough time a few years ago, and the complexity was high. I think first I need to make sure my analysis script in postgres_dba works literally right. I need to check it for the 4-byte and 2-byte padding cases, and if it's right, I think jumping from there to those estimated bloat queries, I think it's possible.

It should be, right?

You need to know the column order for that.

I know it from practice.

This is how I can easily demonstrate that you should not blindly trust those estimated
                                         
                                         bloat queries.
                                         
                                         Cool.
                                         
Last thing: I think posts on this normally don't mention indexes, but one good one recently did, by Renato Massaro.

And I don't think it's as important as the heap, for perhaps obvious reasons, but it did somewhat surprise me that there are some surprising alignment padding issues related to indexes as well.
                                         
    
                                         And I thought that was fascinating.
                                         
Yeah, but usually we cannot change it. In most cases, changing the column order in indexes just because of this is not a good idea, because column order matters in terms of query performance.
                                         
                                         Yes, so I completely agree.
                                         
I think there are some rare cases, if you're only doing equality-type lookups, where you do have some flexibility on at least a couple of the columns, if you're always including them in all queries, that kind of thing. But the reason I brought it up was, I've seen, twice now, including the person who wrote this article, people spend time on something that then turned out to be not worth it. So they had spent time reducing, like, they didn't need a timestamp, they really only needed the date, for example.

So they thought: I've got this index, they only had a single-column index on this timestamp, and I'm going to create a B-tree, I'm going to create the index again, but only on the date, like the truncated part of it.

And they spent development time thinking about this, doing this.
                                         
Oh, you're saying, well, date, how much does it take? Four bytes? Or is it actually padded to eight in Postgres? Yeah. You're saying, like, having a two-column index of four and four bytes is kind of the same as a single-column four-byte one, or what?

A single column of four bytes is the same as a single column of eight bytes. And that's really surprising. Yeah. Cool. So that time was wasted for them. Even though it wasn't a multi-column index, the padding meant that they didn't gain anything.

So guys, have more two-column indexes if it's about four-byte columns, right? Yeah, maybe. Yeah. Or maybe covering indexes as well. I don't know.
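To make that arithmetic concrete, here's a rough sketch of B-tree index tuple sizing. It's simplified: it assumes an 8-byte index tuple header and the usual 8-byte MAXALIGN rounding on 64-bit builds, and it ignores the per-tuple line pointer on the page, so treat it as an illustration rather than an exact accounting.

```python
# Why swapping a timestamp index key (8 bytes) for a date key (4 bytes)
# may save nothing: the whole index tuple is rounded up to MAXALIGN.

INDEX_TUPLE_HEADER = 8  # assumed fixed per-tuple header size
MAXALIGN = 8            # typical alignment on 64-bit builds

def index_tuple_size(key_bytes):
    """Key bytes plus the fixed header, rounded up to MAXALIGN."""
    raw = INDEX_TUPLE_HEADER + key_bytes
    return raw + (-raw) % MAXALIGN  # round up to the next 8-byte boundary

print(index_tuple_size(8))      # timestamp key (8 bytes)    -> 16
print(index_tuple_size(4))      # date key (4 bytes)         -> 16, no saving
print(index_tuple_size(4 + 4))  # two 4-byte keys together   -> 16, same size again
```

Under these assumptions, a single 4-byte key, a single 8-byte key, and two 4-byte keys all produce the same 16-byte index tuple, which is why the second 4-byte column can ride along "for free".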
                                         
Covering indexes may be different.

You know about INCLUDE?
                                         
                                         Yeah.
                                         
    
If it's a single-column, four-byte index, and we have another, like, passenger, right?

Passenger.
                                         
                                         Yeah, but it only gets stored in the leaf pages. Yeah, so it's different.
                                         
                                         Yeah.
                                         
                                         Okay.
                                         
                                         Yeah, that's interesting.
                                         
                                         And I didn't go there ever.
                                         
    
It's good that this article raises this topic. So yeah, I think it's a good thing to be aware of, but probably not something you should act on: reorder your indexes based on performance reasons, not based on storage size. That would be my recommendation.

I agree, yeah. But yeah, a really good point that I hadn't realized until I read this article. Okay, good stuff. Thank you for choosing this topic.
                                         
You're welcome. We finally got to it. We've brought it up a few times.

Good. Good. And yeah. Okay. Have a good week. Take care. Bye. Have a great week. Bye.
                                         
