Postgres FM - Append-only tables
Episode Date: November 15, 2024

Nikolay and Michael discuss append-only tables in Postgres — what they are, some unique challenges they bring, and some options for compressing / removing / offloading the data eventually.

Here are some links to things they mentioned:

- Append-only: https://en.wikipedia.org/wiki/Append-only
- Our episode on BRIN indexes: https://postgres.fm/episodes/brin-indexes
- Tips to Improve Insert Performance: https://www.timescale.com/blog/13-tips-to-improve-postgresql-insert-performance/
- Our episode on WAL and checkpoint tuning: https://postgres.fm/episodes/wal-and-checkpoint-tuning
- autovacuum_vacuum_insert_scale_factor: https://www.postgresql.org/docs/current/runtime-config-autovacuum.html#GUC-AUTOVACUUM-VACUUM-INSERT-SCALE-FACTOR
- Our episode about compression: https://postgres.fm/episodes/compression
- New project announced recently called BemiDB: https://x.com/samokhvalov/status/1854726389603484145
- pg_tier: https://github.com/tembo-io/pg_tier

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:

- Michael Christofides, founder of pgMustard
- Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:

- Jessie Draws for the elephant artwork
 Transcript
                                         Hello, hello, this is Postgres FM. I'm Nikolay, Postgres.AI, and my co-host is Michael, pgMustard. Hi, Michael.
                                         
                                         Hello, Nikolay.
                                         
                                         And guess what we are going to discuss today?
                                         
                                         I'll guess. Is it append-only tables?
                                         
                                         Exactly.
                                         
                                         Ah, I got it right.
                                         
                                         I was surprised to hear that we haven't discussed it in the past.
                                         
                                         I'm sure we touched on it many times, right?
                                         
    
                                         Yeah, for sure it's come up in episodes, but me too.
                                         
                                         When I saw it in our listener-suggested topics,
                                         
                                         I did have a few searches on our site
                                         
                                         just to make sure we hadn't talked about it already as a whole episode,
                                         
                                         and yeah, agreed.
                                         
                                         But it's not something I see all the time. Like, it is relatively common to have an events table, but even then I mostly see append-mostly tables rather than...

                                         Oh yeah, we need to distinguish.

                                         Yeah, I was thinking we were going to discuss both. Yeah.
                                         
    
                                         Do you feel we need a whole episode for append-mostly tables?
                                         
                                         No?
                                         
                                         Well, actually, I don't think this is super complex. I think it's quite nice because, almost by definition,
                                         
                                         if we do accept that it's append only,
                                         
                                         we've got quite a narrow scope
                                         
                                         and there's only a few things to consider.
                                         
                                         Maybe it gets a bit more complicated with append-mostly, but yeah, we can definitely cover that today. I think it's still not that complicated.

                                         Yeah. Where do we start? Like, a definition?

                                         Go for it.

                                         Yeah, why not? Well, if we just insert, that's append-only. That's it.

                                         Insert and select, I would say.

                                         Yeah, no updates, no deletes. Select is allowed, right?

                                         Yeah, insert and select. These are the only two things we can allow ourselves from DML, data manipulation language. And that's it. We just select and insert. This is append-only. If we have occasional deletes and updates, it's append-mostly, or insert-mostly, I don't know how to say it.
                                         
                                         Yeah, I like that.
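To make the definition concrete, here is a minimal sketch of an append-only table; the table and column names are made up for illustration:

```sql
-- Hypothetical append-only table: rows are only ever inserted and read,
-- never updated or deleted.
CREATE TABLE events (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    created_at timestamptz NOT NULL DEFAULT now(),
    payload    jsonb NOT NULL
);

-- The only DML this table ever sees:
INSERT INTO events (payload) VALUES ('{"type": "page_view"}');
SELECT count(*) FROM events WHERE created_at >= now() - interval '1 hour';
```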
                                         
    
                                         And why do we care about this particular use case?
                                         
                                         Because it has characteristics, right?
                                         
                                         It usually has...
                                         
                                         If it's append-only, for example,
                                         
                                         we don't have to worry about bloat in those tables anymore.
                                         
                                         No bloat, right?
                                         
                                         It's good.
                                         
                                         And usually we talk about huge volumes of data.
                                         
    
                                         And at some point we think,
                                         
                                         oh, we need to compress it,
                                         
                                         we need to offload it maybe to cheaper storage, or just
                                         
                                         clean up, because this old data is not needed anymore in raw form. Sometimes it's aggregated, and then the raw form can just be removed from the database. Yeah, or we just, I don't know, need bottomless storage. And usually we want inserts to happen very fast
                                         
                                         because volume is huge,
                                         
                                         so we need to make sure performance of inserts is good.
                                         
                                         Did I miss anything?
                                         
    
                                         No, I think those things aren't necessarily always true
                                         
                                         for append-only tables, but they correlate.
                                         
                                         A lot of the use cases
                                         
                                         for very, very fast-growing data,
                                         
                                         and by definition, append-only means
                                         
                                         it's never going to decrease in size.
                                         
                                         It's only going to keep getting larger and larger.
                                         
                                         Unless you clean it up.
                                         
    
                                         Well, there are specific cases.
                                         
                                         For example, imagine you...
                                         
                                         We discussed many times
                                         
                                         the topic of slow count.
                                         
                                         And if you can allow, like, asynchronous calculation of count, maybe it's, like, materialized or something, I don't know. So the idea is: instead of updating the count somewhere on each insert into the original table, you can accumulate operations in an intermediate table, and then it's append-only.
                                         
    
                                         So you register events in some table, and then you process a chunk and reflect it in the count in the final storage.
                                         
                                         And then you can delete it or better drop this partition or truncate or something.
                                         
                                         In this case, it's append only, but it grows, grows, and then the size drops.
                                         
                                         It happens also, right?
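A rough sketch of the pattern described here, with hypothetical names: inserts only ever append deltas to a buffer table, and a periodic job folds them into the final count. The `DELETE ... RETURNING` form consumes only the rows visible at the start of the job, so concurrent inserts are untouched:

```sql
-- Append-only buffer: the hot insert path only ever appends here.
CREATE TABLE counter_events (
    counter_name text NOT NULL,
    delta        int  NOT NULL DEFAULT 1
);

-- Final storage for the aggregated counts.
CREATE TABLE counters (
    counter_name text PRIMARY KEY,
    value        bigint NOT NULL DEFAULT 0
);

-- Periodic job: consume the accumulated deltas and fold them in.
WITH consumed AS (
    DELETE FROM counter_events RETURNING counter_name, delta
)
UPDATE counters c
SET value = c.value + s.total
FROM (SELECT counter_name, sum(delta) AS total
      FROM consumed
      GROUP BY counter_name) s
WHERE c.counter_name = s.counter_name;
```

At large scale, the buffer would more likely be partitioned so that processed chunks can be dropped or truncated instead of deleted.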
                                         
                                         Yeah, I think dropping partitions
                                         
                                         definitely pushes the definition of append-only,
                                         
                                         but it's the thing that makes sense to do
                                         
                                         in a lot of cases at huge scale.
                                         
    
                                         But yeah, is it still append-only if we're dropping partitions?

                                         Yes, this is how we should do it.

                                         No, I know, but do you see what I mean? You mentioned deleting data.

                                         Well, again, it's append-only: we drop a partition, we never delete, we never update. It's append-only. But if we don't need, like, last year's data, and we already processed it somehow and made all the calculations we need, we can get rid of the raw data. We just drop the partition. It's the best we can do, instead of cleaning up somehow using deletes. I think we need to discuss it, because I did it many times and participated in huge projects in very large companies. The idea: let's offload all archive data. It was e-commerce. Old orders: let's offload them to cheaper storage for longer-term retention, and then we need to delete them in the original place, the original database, and it was not partitioned.
                                         
    
                                         Deletes, it was a project for a couple of months.
                                         
                                         Because downtime is not acceptable.
                                         
                                         It costs a lot of dollars.
                                         
                                         E-commerce guys know very well.
                                         
                                         They can calculate exactly how much each second of downtime costs the company.
                                         
                                         So if I had a partitioned table there, it would be magic.
                                         
                                         And it's append-only.
                                         
    
                                         That particular table was not append-only, right?
                                         
                                         But it can happen with append-only.
                                         
                                         For example, we have audit log.
                                         
                                         Some actions are stored in an append-only table, but we have a policy to store only two years of data. Then I would prefer to drop a partition with the old data.
                                         
                                         That's it.
                                         
                                         So cleaning up is a very important topic for append-only tables.
                                         
    
                                         This is what I was trying to say.
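A sketch of that retention policy with declarative range partitioning (all names illustrative): old data is removed by dropping a whole partition, a fast metadata-level operation that leaves no dead tuples behind, instead of running a long DELETE:

```sql
-- Audit log partitioned by month; writes only ever go to the
-- current partition, so every partition stays append-only.
CREATE TABLE audit_log (
    logged_at timestamptz NOT NULL,
    actor     text NOT NULL,
    action    text NOT NULL
) PARTITION BY RANGE (logged_at);

CREATE TABLE audit_log_2024_11 PARTITION OF audit_log
    FOR VALUES FROM ('2024-11-01') TO ('2024-12-01');

-- Retention job: once a partition falls outside the two-year window,
-- drop it (assuming such an old partition exists).
DROP TABLE IF EXISTS audit_log_2022_10;
```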
                                         
                                         Yeah, I completely agree.
                                         
                                         I think there are other benefits to partitioning with append-only or append-mostly as well, because if we do have the occasional update or delete, by having partitions...

                                         That's bad, yeah.

                                         Well, partitioning helps with that as well, right?

                                         So let's zoom back out, maybe. We've got inserts and selects.
                                         
                                         So, if we're talking about a very, very high volume,
                                         
    
                                         we might have to worry about insert performance
                                         
                                         and select performance.
                                         
                                         We can also have, you know, sorry for interrupting,
                                         
                                         we can also have copy.
                                         
                                         Yeah, sure.
                                         
                                         In both ways, right?
                                         
                                         And I guess that's about to come up if we're talking about...
                                         
                                         Also copying out a whole table, you know, like a dump, which reads everything.
                                         
    
                                         Yeah, but it's kind of select.
                                         
                                         So, yeah, inserts or selects, no updates, deletes, and so on, yeah.
                                         
                                         So, use cases, you wanted to discuss use cases, right?
                                         
                                         Or even, I was actually thinking of diving straight to performance.
                                         
                                         I think there's a few things that we don't have to worry about,
                                         
                                         a few things that we can then optimize for.
                                         
                                         If we're having to insert extremely high volumes,
                                         
                                         which sometimes these use cases do lend themselves towards.
                                         
    
                                         I think IoT, for example: Internet of Things. If sensors are sending information and we're logging every second across thousands or tens of thousands of sensors, that can end up being a lot of data. So inserting can be a bottleneck, and you might make design decisions for those tables that you wouldn't make if you had a different type of table or a different type of data.
                                         
                                         So there's that side of things.
                                         
                                         But then there's also the read side of things.
                                         
                                         I think, you know, and I think those things maybe sometimes play off against each other.
                                         
    
                                         So the fact we've got append-only, we have some benefits: index-only scans, for example, become even better. I know you often talk about always trying to get index-only scans, but in a table where the data is often changing, that can be a losing battle, or a battle that's not always worth fighting. Or it's maybe not always worth including as many columns in the index, for example. There are different trade-offs for append-only versus...

                                         Let's unwrap everything here. You mentioned so many things in, like, just a minute, right? So first, let's talk about performance of inserts. I would say the ideal situation is we don't have indexes and we don't have triggers, including foreign key triggers, because a foreign key in Postgres is internally implemented via a system trigger, and this trigger is going to slow down inserts, especially if you need to insert a lot of rows. Even just a few foreign keys can multiply the duration of a massive insert. So, ideally, we should get rid of foreign keys and keep as few indexes as possible for this particular case.
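A sketch of that ingest-first extreme, with hypothetical names: no primary key, no foreign keys, no indexes, so each insert only writes the heap and WAL, with nothing else to maintain:

```sql
-- Bare ingest table: no PK, no FKs, no indexes, no triggers.
CREATE TABLE sensor_readings (
    sensor_id int NOT NULL,
    read_at   timestamptz NOT NULL,
    value     double precision NOT NULL
);

-- Multi-row inserts amortize per-statement overhead:
INSERT INTO sensor_readings (sensor_id, read_at, value)
VALUES (1, now(), 20.5),
       (2, now(), 19.8);

-- Bulk loads are typically fastest via COPY:
-- COPY sensor_readings FROM '/path/to/readings.csv' WITH (FORMAT csv);
```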
                                         
                                         I remember in some cases I decided to go without primary keys, you know, breaking relational
                                         
                                         model and so on.
                                         
                                         There's no relational model in Postgres, or in any relational database that implements the SQL data model, which has NULL and breaks the relational model completely anyway.
                                         
    
                                         But this is a different topic, side note.
                                         
                                         Anyway, so it's not good to be without primary keys, but sometimes you think,
                                         
                                         oh, I just need to dump this to some table reliably. So we have ACID, so Postgres guarantees,
                                         
                                         it's stored, it's saved, it's replicated, it's backed up. But even one index, sometimes you think, oh, it slows me down.
                                         
                                         And I remember I decided to live without a primary key.
                                         
                                         And it was a weird case, but it was just some archive,
                                         
                                         maybe just for audit purposes.
                                         
                                         And I decided to use BRIN at that time.
                                         
    
                                         BRIN is actually a good idea to consider if we have append-only, because of the physical layout: rows don't move.
                                         don't move.
                                         
                                         If we have a row, it's a tuple, it's
                                         
                                         saved in some block,
                                         
                                         it's there, right?
                                         
                                         So this is exactly when BRIN indexes work well. And we had an episode, one of our first episodes, I remember. It was BRIN indexes. BRIN is block range index, right?
                                         
                                         Yeah.
                                         
                                         Yeah.
                                         
                                         So it's very lightweight.
                                         
                                         It speeds up select performance,
                                         
                                         not as good as other indexes,
                                         
    
                                         especially B-tree,
                                         
                                         but it's still good, right?
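For an append-only table where rows arrive in timestamp order, a BRIN index might look like this (table and index names are illustrative). BRIN stores only a min/max summary per range of blocks, so it stays tiny compared to a B-tree, and it works precisely because appended rows never move:

```sql
-- Hypothetical append-only table; rows land in timestamp order.
CREATE TABLE readings (
    read_at timestamptz NOT NULL,
    value   double precision NOT NULL
);

-- BRIN keeps only min/max of read_at per block range: very small,
-- cheap to maintain on insert.
CREATE INDEX readings_read_at_brin ON readings USING brin (read_at);

-- Range predicates can then skip whole block ranges:
SELECT avg(value)
FROM readings
WHERE read_at >= now() - interval '1 day';
```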
                                         
                                         Or we might consider hash indexes also, right? Because they might be more lightweight than B-tree sometimes.

                                         Well, for equality, for example, right? Yeah. But when it comes to append-only, I think you make a really good point: each index we have slows down the inserts, so the fewer the better, possibly none. If, let's say, it's a table we're never reading from, or it's an audit log that we only ever have to read from extremely rarely, we might consider one or even zero indexes on that.
                                         
    
                                         Maybe not an audit log, because maybe that's not one
                                         
                                         you would actually be writing an insane volume to.
                                         
                                         But I read a Timescale blog post. Timescale have to worry about this kind of thing, since they've designed for these kinds of time-series workloads. They've written a good blog post on optimizing inserts, and they list all the same things as you and go further: as well as foreign key constraints, basically other constraints can add overhead too. So, for example...

                                         Of course, it checks the unique constraint.

                                         Yeah.

                                         Check constraints, but unique constraints, yeah, that's an additional index check.

                                         Whoa.

                                         Yeah.
                                         
    
                                         So basically deciding, for each constraint, whether you really need it and what value having it adds, makes sense.
                                         
                                         That being said,
                                         
                                         I must say, like, in most cases, I prefer having a primary key.
                                         
                                         Because it's like the center of consistency,
                                         
                                         of data consistency, right?
                                         
                                         So it's good to have.
                                         
    
                                         But it depends, right?
                                         
                                         It's good that you mentioned Timescale, but I think, yeah, we will return to Timescale. My question to you: a tricky question, but I think you already know. And I must admit, when we started the podcast two years ago, I didn't realize it fully; now I realize it much better. So we have an index. What operations does it slow down? You said it slows down inserts; this is for sure. Does it slow down updates? Well, yes, right? And there's a mechanism, HOT updates, which deals with it in a limited number of cases. Does it slow down deletes? Well, maybe no, because during a delete, the index is not updated.
                                         
    
                                         Postgres only updates xmax, as we discussed a couple of times.
                                         
                                         Does it slow down selects?
                                         
                                         What do you think?
                                         
                                         So we've talked about how having a lot of them can.
                                         
                                         Yeah, yeah.
                                         
                                         It slows down selects, especially if we have a lot of them and a high frequency of selects, and this is about planning time. And the lock manager takes locks during planning: all indexes are locked. It's some overhead to keep in mind in very heavily loaded systems.
                                         
                                         Yeah, but in general, I would minimize the number of indexes and try not to use foreign keys. In many cases we can

                                         imagine they exist, maybe have routine checks that referential integrity is fine, right, but drop them

                                         intentionally, because in this case we want good insert performance. And as usual, I would like to remind that when I say all this,
                                         
                                         in many cases when I deal with a new system,

                                         I have these principles, but I never trust myself.
                                         
                                         I always check again.
                                         
                                         Checking should be considered,

                                         like, sometimes you spend time there, right?

                                         But it's worth doing experiments.
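If foreign keys are dropped for insert performance, the "routine checks" mentioned here could be a periodic query for orphaned rows; a minimal sketch, with hypothetical table and column names:

```sql
-- Look for "events" rows that reference a non-existent user.
-- Run this from a scheduled job instead of paying the FK check on every insert.
SELECT e.id, e.user_id
FROM events e
LEFT JOIN users u ON u.id = e.user_id
WHERE u.id IS NULL
LIMIT 100;  -- a non-empty result means referential integrity is broken
```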
                                         
                                         Well, and I would say we're talking about extremely high volumes here. If you can, I would much

                                         rather normally have a primary key, have some foreign keys if they make sense, and have a unique key if I

                                         need it, and then test. Like, can I get better disks if I need to?

                                         Like, are there other ways I can improve, ways I can cope with higher write performance instead of...
                                         
                                         Perform checkpoint tuning if you expect huge volumes to land in storage.
                                         
                                         Yeah, so maybe pay for it in other ways. It's only at the absolute...
                                         
    
                                         Have more memory, bigger buffer pool.
                                         
                                         Exactly.
                                         
                                         Make sure backends don't write all the time.
                                         
                                         It depends, right?
                                         
                                         So the checkpointer is not going crazy.
                                         
                                         It's not too frequent.
                                         
                                         And so, yeah, there's a lot of stuff here.
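A sketch of the checkpoint tuning being described; these values are illustrative starting points for a write-heavy system, not universal recommendations:

```sql
-- Spread checkpoints out so heavy insert volume doesn't trigger them constantly.
ALTER SYSTEM SET max_wal_size = '16GB';         -- default is 1GB
ALTER SYSTEM SET checkpoint_timeout = '30min';  -- default is 5min
SELECT pg_reload_conf();
```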
                                         
                                         And if we think about SELECT now, what's the number one problem usually?
                                         
    
                                         I think, like, I'm still wondering how come we lived so many years

                                         until, I think, Postgres 12, when autovacuum_vacuum_insert_scale_factor was added.
                                         
                                         I think Darafei initiated it.
                                         
                                         Darafei. Version 13. I looked it up. Yeah,
                                         
                                         version 13. Okay, it's very recent compared to my experience with Postgres. So strange.
                                         
                                         What does it add? Originally, Postgres autovacuum, which also maintains Postgres statistics,
                                         
                                         which is important for good query performance,
                                         
                                         including selects, right? Originally, it was triggered only after, say, 10% by default,

                                         10 or 20% of rows are changed. There is some formula, not very complex, but roughly: after 10 or 20% of rows changed (changed means deleted or updated), it triggers, but not after inserts. And only in Postgres 13

                                         was a specific parameter added, I think by default it's also 20% or 10%, which tells autovacuum to run and process a table after 10 or 20% of rows were added.
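Roughly, autovacuum triggers when changed rows exceed threshold + scale_factor × reltuples; the relevant defaults can be inspected with:

```sql
SELECT name, setting
FROM pg_settings
WHERE name IN (
  'autovacuum_vacuum_scale_factor',         -- 0.2: vacuum after ~20% of rows updated/deleted
  'autovacuum_vacuum_insert_scale_factor',  -- 0.2: vacuum after ~20% of rows inserted (Postgres 13+)
  'autovacuum_analyze_scale_factor'         -- 0.1: analyze after ~10% of rows changed
);
```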
                                         
                                         Yeah, I looked this up,
                                         
                                         and it's like there's three jobs, right, of AutoVacuum.
                                         
                                         There's roughly removing dead tuples.
                                         
                                         Four jobs, actually.
                                         
                                         Freezing, and analyzing statistics. Removing dead tuples, freezing, and analyzing statistics.
                                         
    
                                         Removing dead tuples, maintaining visibility maps.
                                         
                                         Maintaining visibility maps, of course, yeah.
                                         
                                         Four goals.
                                         
                                         Maybe actually more, but these four come to mind quickly.
                                         
                                         Yeah, and if you're only doing inserts, you don't need the removing dead tuples job.
                                         
                                         But that isn't the only thing vacuum's doing. So this then enables the visibility map and the freezing to happen.

                                         Well, regardless of inserts, it will happen.

                                         Well, we can insert into a different table.

                                         Yeah, okay, good point. And

                                         autovacuum will see that xmin, or xmax, or both... xmin, right?
                                         
                                         Yeah.
                                         
                                         If xmin is far in the past, we have a risk of wraparound, so it's time to freeze this table.
                                         
                                         Yeah.
                                         
                                         We can have zero operations, like, a table can be left unchanged for a long time,

                                         but at some point autovacuum decides, okay, it's time to freeze.

                                         But you're right, the visibility map, like, that would not get updated.
                                         
                                         Visibility map is huge.
                                         
    
                                         You mentioned index-only scans, the performance of aggregates, counts, right?
                                         
                                         So we do want to keep it up to date.
                                         
                                         I think the default is not enough, as usual, with autovacuum.
                                         
                                         We must tune it.
                                         
                                         And even cloud providers, their defaults are not enough.
                                         
                                         We must tune it and go down to one percent

                                         or smaller, and make sure autovacuum maintains statistics and visibility maps more often,

                                         so the performance of selects, including index-only scans, stays good, right?

                                         Performance, yes, stays good.
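Going down to one percent can be done globally or per table; a per-table sketch (the table name is hypothetical):

```sql
-- Vacuum and analyze this table after ~1% of rows change or are inserted,
-- instead of the 10-20% defaults.
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor        = 0.01,
  autovacuum_vacuum_insert_scale_factor = 0.01,
  autovacuum_analyze_scale_factor       = 0.01
);
```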
                                         
    
                                         Another reason to partition as well, so you can keep those small. Yeah, that makes sense.
                                         
                                         I was going to say, it's 20%.
                                         
                                         So it is quite high still, as you say.
                                         
                                         Would you ever switch to...
                                         
                                         I cannot imagine any OLTP system, any website, any mobile app,
                                         
                                         which would be okay with Postgres AutoVacuum defaults.
                                         
                                         This, like,
                                         
                                         ah!
                                         
    
                                         That's it. I don't know why they are so. They are so for
                                         
                                         what? We have
                                         
                                         so many beautiful websites working,
                                         
                                         huge systems
                                         
                                         working with Postgres.
                                         
                                         It's so cool to see that
                                         
                                         Postgres handles
                                         
                                         such big workloads. But these defaults...
                                         
    
                                         Well, and the strange thing is this one, for example: if we did reduce it to one percent,

                                         it would add overhead on small systems, sure. If you've only got 100 rows, it runs vacuum every row

                                         for a while, you know. But who's running a small system that can't handle a vacuum of a hundred-row table every

                                         row?

                                         Also, with append-only specifically, when some page is already

                                         processed, it's marked all-visible, all-frozen or whatever, and vacuum just skips it.
                                         
                                         Yeah, so it wouldn't even be much. There were many optimizations in this area, to not do work which can be skipped.

                                         So it's doing a good job skipping, and it has for many years already.
                                         
                                         So I think, like, I never saw any system, and I saw maybe already hundreds of them,

                                         different sizes, websites, like, OLTP, right?

                                         I didn't see any time we decided, oh, you know what, we need to increase the scale factor.

                                         I don't remember, at least, at all. We can throttle it if we like, we can balance work, many workers and so on, but deciding, let's make the work of autovacuum less frequent:

                                         zero cases I had. Maybe my experience is not enough, maybe one day I will see such a system.
                                         
                                         I've not seen one either.

                                         Enough rage about defaults, my usual fun I have with Postgres.

                                         Let's talk about partitioning maybe, right? Why do we want it?
                                         
                                         I see several ideas
                                         
                                         here, and
                                         
    
                                         TimescaleDB is definitely, for an

                                         append-only database, a good thing to have
                                         
                                         in many senses.
                                         
                                         But unfortunately, it's not available in managed
                                         
                                         offerings, except
                                         
                                         their own, Timescale Cloud,
                                         
                                         right? And some others, but
                                         
                                         some others usually choose Apache 2.0 version,
                                         
    
                                         which doesn't have compression, right?
                                         
                                         It doesn't have a lot of their good features, yeah.
                                         
                                         Yeah, so partitioning is good.
                                         
                                         Again, there's some rule, an empirical rule, we say, like many people say,

                                         not only I: let's consider partitioning if a table exceeds

                                         100 gigabytes, or has chances to exceed 100 gigabytes. Partitioning adds complexity; it's not

                                         as well automated as in, say, Oracle, but it's a very important tool to consider. And there are many factors here. First, for example, you might say, okay, I have a partition
                                         
                                         where I insert and then many partitions where it's like my archive. And as we decided, we want
                                         
    
                                         a very low number of indexes in the main partition, which is receiving inserts, and few constraints like foreign keys and so on.
                                         
                                         But there is no such problem in all archive partitions, right?
                                         
                                         We might have more indexes there and constraints and so on.
                                         
                                         This is one thing.
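The layout being described, a lean hot partition and more heavily indexed archive partitions, might look like this (all names hypothetical):

```sql
CREATE TABLE events (
  id         bigint GENERATED ALWAYS AS IDENTITY,
  created_at timestamptz NOT NULL,
  payload    jsonb
) PARTITION BY RANGE (created_at);

-- Hot partition receiving inserts: keep indexes and constraints minimal.
CREATE TABLE events_2024_11 PARTITION OF events
  FOR VALUES FROM ('2024-11-01') TO ('2024-12-01');

-- Archive partition: extra indexes are cheap here, since it no longer
-- receives writes.
CREATE TABLE events_2024_10 PARTITION OF events
  FOR VALUES FROM ('2024-10-01') TO ('2024-11-01');
CREATE INDEX ON events_2024_10 (created_at);
```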
                                         
                                         Second thing: autovacuum.
                                         
                                         If occasional deletes or updates are happening, the block which contains the row,

                                         that tuple basically, right, it's out

                                         of visibility, like, it's marked not all-visible anymore and not all-frozen anymore, so

                                         vacuum needs to process it. And it's good to have data locality, so archive data is in some

                                         partitions and fresh data is in particular partitions. So autovacuum is focusing on fresh data in fresh partitions.
                                         
                                         It reduces the number of blocks it needs to deal with, right?
                                         
                                         Because old data is rarely touched, so autovacuum visits it very rarely, right?
                                         
                                         This is another reason.
                                         
                                         Cleanup is another reason as well, right?
                                         
                                         I think cleanup's the biggest reason.
                                         
                                         Maybe, yeah.
                                         
    
                                         I think maintenance,
                                         
                                         partitioning helps so much with maintenance.
                                         
                                         It does have other benefits for sure,
                                         
                                         but it helps so much with maintenance
                                         
                                         that I can't help but feel like that's the biggest one.
                                         
                                         And I actually have started to say, I think I must have stolen this from somebody else because it's too clever for me.
                                         
                                         But partitioning based on how you want to eventually delete data makes sense.
                                         
                                         So if you want to eventually delete old data by date, partition by date. If you're a B2B SaaS and you eventually want to delete data

                                         based on a customer quitting the service, you probably want to partition based on...
                                         
                                         Or both, two levels of partitioning also.
                                         
                                         Yeah, exactly. But that being the, like a guiding principle for how you partition,
                                         
                                         because it makes that deletion or dropping so easy.
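With time-based partitioning, "deleting" a month of old data becomes a metadata operation rather than millions of row deletes (partition names hypothetical):

```sql
-- Detach first (CONCURRENTLY needs Postgres 14+), then drop or archive.
ALTER TABLE events DETACH PARTITION events_2023_01 CONCURRENTLY;
DROP TABLE events_2023_01;  -- instant, and creates no dead tuples
```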
                                         
                                         What will you do with data, right?
                                         
                                         And as I said, I participated in projects
                                         
                                         where delete was a big issue.
                                         
                                         And of course, with partitioning,
                                         
    
                                         it's very different and it's good.
                                         
                                         Deletes can be a problem.
                                         
                                         Postgres deletes, like, if you have a terabyte,

                                         10-terabyte table and you need to delete 20% of it,

                                         it's a big headache, because

                                         you need to make sure vacuum will be okay, that autovacuum will catch up all the time.

                                         You need, again, to pay attention to the checkpointer, and you need to find a way how to delete

                                         so deletes don't degrade. This was my problem.
                                         
    
                                         So I created beautiful queries,
                                         
                                         but they degraded over time because of dead tuple accumulation
                                         
                                         and bloat accumulation as well.
                                         
                                         So I needed to adjust them and so on.
                                         
                                         So there are many problems with delete
                                         
                                         and it takes time to delete millions of rows.
                                         
                                         If you rush with it, you can put the system down or have
                                         
                                         degradation of performance
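The batched approach Nikolay describes might be sketched like this; batch size, predicate, and table name are all hypothetical and need testing against the real workload:

```sql
-- Delete a small batch; the application repeats this, sleeping between
-- batches, until zero rows are affected, so autovacuum and the
-- checkpointer can keep up.
WITH batch AS (
  SELECT ctid
  FROM events
  WHERE created_at < now() - interval '2 years'
  LIMIT 5000
)
DELETE FROM events
WHERE ctid IN (SELECT ctid FROM batch);
```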
                                         
    
                                         Well, yeah.

                                         And it can really affect even your

                                         select performance. So,

                                         you mentioned...

                                         BRIN is probably the one where it gets,

                                         it used to at least get, affected the most,

                                         with the default way of creating

                                         a BRIN index. If you have

                                         a row inserted way back in an

                                         old block, if you don't have partitioning, it goes miles away and you get some real scattered data, and

                                         BRIN performance can end up effectively looking like sequential scans.

                                         So all indexes degrade.
                                         
                                         B-tree degrades very quickly if you perform deletes and updates,
                                         
                                         and you need to rebuild it.
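Rebuilding a degraded B-tree without blocking writes is where small partitions pay off; a sketch with a hypothetical index name:

```sql
-- Available since Postgres 12; the smaller the partition, the faster this
-- runs and the shorter the xmin horizon is held back.
REINDEX INDEX CONCURRENTLY events_2024_10_created_at_idx;
```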
                                         
                                         And rebuilding is better with partitioning,
                                         
                                         because the smaller partitions are, the faster rebuilding is,
                                         
                                         and the xmin horizon is not held for long, right?
                                         
    
                                         So auto-vacuum is not affected in the whole database, right?
                                         
                                         Yeah.
                                         
                                         So it's better.
                                         
                                         Building and rebuilding indexes, vacuum itself, maintenance

                                         tasks are good if you have smaller physical tables, so partitioning is great, right?

                                         Yes. On the BRIN thing,

                                         I'll link up the old episode we did, but the minmax-multi opclass, I think, makes a big difference,

                                         especially since it handles loads of outliers. So I do think that's easier. And if you are able to keep on top of autovacuum, I guess the B-tree stuff doesn't degrade

                                         that quickly. So I feel like these things aren't as big problems anymore. But yeah, often in these
                                         
    
                                         cases, if you're dealing with high volume, like many, many thousands of queries per second, just

                                         extreme volume, anything you can do to help

                                         on the performance front will be helpful.

                                         Yeah. And as usual, when we touch partitioning: the state of caches and the buffer pool. For example, if you have archived data which is touched rarely,

                                         those blocks are evicted from the buffer pool, and cache efficiency might grow. The hit rate ratio might be better.
                                         
                                         But yeah,
                                         
                                         I agree with you.
                                         
                                         So partitioning is good in many
                                         
    
                                         senses. It comes with
                                         
                                         a price of overhead and maintenance as well.

                                         But it's worth having.
                                         
                                         But imagine, like, all
                                         
                                         this said, we moved slightly
                                         
                                         from append-only to append-mostly,
                                         
                                         right? But let's move back to append-only.
                                         
                                         Imagine we have many partitions where data is not changed, archive.
                                         
    
                                         Indexes are created, all frozen, all visible.
                                         
                                         It's a beautiful state of data, right?
                                         
                                         So all index-only scans are working well.
                                         
                                         And that's it.
                                         
                                         Maintenance is not needed.
                                         
                                         Auto-vacuum is not needed there, and so on.
                                         
                                         However, what if we have 100 terabytes of data,
                                         
                                         and this is a heavily loaded cluster,
                                         
    
                                         we have many replicas.
                                         
                                         The data is not changed, which is good.
                                         
                                         It's evicted from buffer pool,
                                         
                                         but we still need to keep it on the main storage.
                                         
                                         And at some point we think: oh, we pay a big

                                         price, because this data is replicated, and it increases the volume of backups, full backups, if we consider them,

                                         right? So this legacy, it's a lot. And if we have, for example, five replicas and one primary,

                                         we need to store the same data six times,
                                         
    
                                         and nobody is reading it.
                                         
                                         Like, people read it occasionally.
                                         
                                         At some point, you think it's not efficient.
                                         
                                         And you think, I would rather store it somewhere else,

                                         not on the main disks, not on the fast SSDs or, I don't know, NVMes I have,

                                         or cloud storage, which is also expensive,

                                         right? So this leads to two ideas. The first idea is it would be good to compress it,
                                         
                                         right? Again, the full version of TimescaleDB is doing a great job, and their

                                         blog posts about compression are great. I especially like one,

                                         I remember, the first big one, which explained the algorithms, and it's basically a kind of columnar

                                         compression, although we still have row storage. It's great. And also, I think the second topic here,
                                         
                                         which opens up naturally, is, I know Aurora now offers it, right? And Neon and Timescale

                                         as well, cloud only: the bottomless approach, when old partitions are offloaded to S3 or object

                                         storage, GCS on Google Cloud, or Blob Storage on Azure, or how it's called, I don't remember.
                                         
And now even Hetzner has it. They just recently released it, which is big news. I like it because I like their prices.

And I've worked with them since, I think, 2006 or so, in a few companies. When you're bootstrapping and you have a small startup, Hetzner is like number one in terms of budget and the hardware they can offer. So they just recently released S3-compatible object storage, right? So we can have normal backups and so on.
                                         
But what to do with old partitions? It's a natural way of thinking. We don't want to keep them on these expensive disks we have, having multiple copies of everything. So offloading them somehow implicitly, transparently, in the background to S3, or to S3-compatible MinIO or something if you have self-managed Postgres, it would be great, right? And so we have it in Timescale Cloud. I think Neon also does it, right? I don't know.

Bottomless, bottomless. Like, Neon stores data on S3 originally anyway.

So the idea is we want to have a petabyte-size cluster, but not pay for lots of disks and the headache that comes with them.
                                         
And for append-only, it's very natural to decide: okay, we want to store data forever, not to clean up, but we want cheap storage here. So S3 is a good idea to consider. And it has tiers also, right? It can be slow to retrieve, but that's okay because it's rare.

Well, it depends what you mean by slow. I think there can be performance advantages. When some of this data is fresh, we might want to retrieve it row by row. Like, if you're looking at some audit logs, you might want to look at some recent ones, and you might want all the information about them. But if you're looking at data from two years ago, there's probably a higher chance that you're looking at it in aggregate, you know: on average, how many audits of this type were we having in 2022 versus 2023? And I think, actually, the types of queries that happen on older data tend to be these aggregate ones that often perform better once the data is column-store compressed. You know, these file formats often suit that kind of query.
                                         
    
I know what you mean. Postgres lacks compression, good compression, it lacks it, yeah. And TimescaleDB compression, I have seen how good it is. It can be 20 times smaller. And indeed, they even support data changes for compressed data, which is great, right?
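For a concrete picture, TimescaleDB's columnar compression is enabled per hypertable. The sketch below uses TimescaleDB's documented settings and policy function, but the table and column names (metrics, device_id, ts) are made up for illustration:

```sql
-- Hedged sketch: enabling TimescaleDB compression on a hypothetical
-- hypertable "metrics". The segmentby/orderby choices control how
-- similar values end up adjacent, which is what makes compression good.
ALTER TABLE metrics SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id',
  timescaledb.compress_orderby   = 'ts DESC'
);

-- Automatically compress chunks once they are older than 7 days
SELECT add_compression_policy('metrics', INTERVAL '7 days');
```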
                                         
I have seen a project or two come about lately, I think open sourcing some of this stuff, or at least putting it under the Postgres license. Is it pg_parquet that allows you to...?
                                         
    
Yeah, but it's different. It's for analytics. And actually, for analytics we also might want to consider append-only tables, obviously. But there is a new wave of this, and I know many people and companies are looking at it. pg_duckdb... no, DuckDB.

Yeah, DuckDB.
                                         
    
Right.

And the idea is: let's marry it with Postgres. And there are a few projects. I looked at a few recently, and one of them was just released maybe last week. I remember they use a logical copy from the original data tables, regular tables, to tables which are stored, I think, in Parquet format on object storage, and then DuckDB is used as the processing engine for analytics. But I remember, I think, Alvaro commented on Twitter that he's not going to consider it until it works with, basically, CDC, logical replication or something, because right now it's only a full refresh. It's not serious yet, but they will do it, I think. Also, I think the Neon guys looked at DuckDB, I saw some activity, and Hydra, right? They also looked at DuckDB.
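As a rough illustration of why DuckDB plus Parquet suits the aggregate-style queries on old data discussed earlier, here is a sketch in DuckDB SQL; the bucket, file layout, and column names are made up:

```sql
-- Hedged sketch: DuckDB querying Parquet files sitting on S3.
-- Bucket and schema are illustrative, not from any real project.
INSTALL httpfs;
LOAD httpfs;

SELECT date_trunc('month', created_at) AS month,
       count(*)                        AS audit_events
FROM read_parquet('s3://my-archive/audit_logs/*.parquet')
GROUP BY month
ORDER BY month;
```

Column-oriented Parquet lets this kind of aggregate read only the columns it touches, which is the performance advantage being described.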
                                         
Yeah, but I understand that most of the marketing at the moment is around analytics use cases, but I don't see why it couldn't work for append-only...

Data types.

Might be true.

I looked at a couple of extensions, because I have a couple of customers with such a need, to offload old data. And Parquet, this format, supports only a mapping of data types. It might be tricky if you have some complex data types, as I remember. And when I looked at some extensions, it didn't work well.
                                         
    
And I think right now, like, I have plans to look at Tembo's extension, which is called pg_tier, for tiered storage. Right. And the idea is that with this extension we can have old partitions on object storage. It's a great idea, so if it works, it's great. I just haven't looked at it yet. If somebody from Tembo is watching or listening to us, please let us know how this project works already. Is it already at production or beta stage? I'm very curious. And is it worth trying? And what are the limitations, for example? Maybe we should actually have a separate episode, because I think this extension might bring the bottomless idea to the Postgres ecosystem for everyone, right?
                                         
    
It's great. It's open source, unlike what other companies do. So kudos to Tembo for making this an open source extension.

And is it transparent? Like, do I have to... Because I think with some of the DuckDB stuff, I would have to be... I'm not sure if I have to write...
                                         
    
I don't know. It might be semi-transparent. For example, if you need to make some transitions with old partitions, I'm okay with that. And I can even create new partitions in the background and move all the data from one old partition to this kind of new old partition, which already has different storage. This is doable. So it's not a big deal, as I see it.
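The semi-transparent flow just described can be sketched in plain Postgres SQL; the table, partition, and tablespace names are hypothetical, with a cold_storage tablespace standing in for whatever cheaper disks are available:

```sql
-- Hedged sketch: rebuild an old partition on cheaper storage and swap it in.
-- Assumes a range-partitioned table "events" and a tablespace "cold_storage"
-- created on slow/cheap disks.

ALTER TABLE events DETACH PARTITION events_2022_01;

-- New "old" partition with the same shape, on the cheap tablespace
CREATE TABLE events_2022_01_cold
  (LIKE events_2022_01 INCLUDING ALL)
  TABLESPACE cold_storage;

-- Copy the data over in the background
INSERT INTO events_2022_01_cold
SELECT * FROM events_2022_01;

-- Attach the replacement in place of the original
ALTER TABLE events ATTACH PARTITION events_2022_01_cold
  FOR VALUES FROM ('2022-01-01') TO ('2022-02-01');

DROP TABLE events_2022_01;
```

For a plain tablespace move without the copy, ALTER TABLE events_2022_01 SET TABLESPACE cold_storage achieves a similar result, at the cost of holding a heavier lock while it runs.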
                                         
    
I think what Timescale Cloud has, for example, is transparent, and they have something interesting, as I remember, still in a very rough form. They have something also with the planner, some algorithms to decide when to bring this data into caches and so on. So it's interesting. But the idea is: we want partitioning to evict blocks with old data from memory, that's one thing. But then we think we want to evict them from our disks as well, because disks are also expensive, right? Let's evict them. And this is an alternative idea to cleanup and to compression. Or compression goes well with offloading as well, I don't know. The Parquet format is definitely good with compression, compressing data in terms of column storage, right? So if it's time series, it's good.
                                         
    
So there are interesting new directions of development in the Postgres ecosystem here. And I think we mentioned a few projects, both commercial and open source, which is great. So if someone wants to store a petabyte of data, or is preparing for it, and doesn't want to delete everything, I think there are ideas to consider.
                                         
Well, we didn't mention that with partitioning you can also use some kind of foreign data wrapper and store the data on a very cheap, also Postgres, cluster.

It can be Postgres, it can be not Postgres as well, right?

But, for example, we can consider a cheap Postgres cluster with very slow disks, HDDs, for a good price. And because of that, I inspected the foreign data wrapper, the postgres_fdw code, in terms of how it works with... like, basically, you need 2PC, right? You need two-phase commit to have a reliable commit, because it's already a distributed system. But without 2PC, there are risks of inconsistency, for example.

But not if it's append-only; you're never going to change that data by the time you're pushing it to a different server.
                                         
Right. I remember I inspected the code and found some edge cases, maybe even corner cases. And for inserts it's also possible: you wrote to one place, you're trying to write to another place, this one is already committed, here it's not committed. But I remember the code was quite smart in reducing the probability of some of these cases, though it's not 100%. As I remember, that was like five years ago, so I haven't revisited it since.
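For reference, the two-phase commit primitives being discussed do exist in plain Postgres SQL; the table name and transaction identifier below are made up. postgres_fdw itself commits remote transactions without these, which is the consistency gap being described:

```sql
-- Hedged sketch: two-phase commit building blocks in Postgres.
-- Requires max_prepared_transactions > 0 on the server.
BEGIN;
INSERT INTO logs_local VALUES (1, now(), 'example row');
PREPARE TRANSACTION 'txn_42';   -- phase 1: persist transaction state, keep locks

-- Later, once every participant has prepared successfully:
COMMIT PREPARED 'txn_42';       -- phase 2: make it durable everywhere
-- (or, on failure anywhere: ROLLBACK PREPARED 'txn_42';)
```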
                                         
Tell me if I'm wrong, but you're saying new partitions would be on the local Postgres, and it would be the old ones that we would move to the second one?

Yeah. If we don't have inserting transactions which deal with multiple partitions, there is no problem at all. And the old ones we can move to foreign tables, with postgres_fdw, to a different cluster. So you're right: no inserts happening, no problem.
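That pattern, old partitions living on a second, cheaper cluster, can be sketched with postgres_fdw; the server, table, and credential names below are all illustrative:

```sql
-- Hedged sketch: attach a remote table as an old partition via postgres_fdw.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER archive_server
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'archive.internal', dbname 'logs_archive');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER archive_server
  OPTIONS (user 'archive_rw', password 'secret');

-- The remote cluster holds a plain table "logs_2022" with matching columns.
CREATE FOREIGN TABLE logs_2022
  PARTITION OF logs
  FOR VALUES FROM ('2022-01-01') TO ('2023-01-01')
  SERVER archive_server
  OPTIONS (table_name 'logs_2022');
```

Queries on the parent table then route reads for the old range to the remote cluster transparently, which is what makes this safe for select-only partitions.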
                                         
                                         Cool.
                                         
                                         Select only.
                                         
                                         I never considered using FDW for partitioning.
                                         
Or, like, you know, for the old partitions.
                                         
                                         It's natural.
                                         
    
                                         It's natural.
                                         
                                         This is how many folks were going to have clustered Postgres, right?
                                         
                                         I mean, or sharding.
                                         
                                         Kind of sharded, yeah.
                                         
                                         Sharding, yeah.
                                         
                                         But this path has issues.
                                         
                                         Yeah.
                                         
                                         It's a different story.
                                         
    
                                         Here be dragons, is it?
                                         
                                         Or what's that old phrase?
                                         
                                         I don't know.
                                         
Okay, we discussed many things. I feel we might be missing something, right? But I think it's enough for a general overview of what's happening with append-only tables. And yeah, I think it's an interesting topic, actually. Many people need it; many companies store logs and so on in Postgres.
                                         
And I'm looking forward to the future where Postgres, at some point, will have better compression and a bottomless feature. And as I like to say: imagine if TimescaleDB were Postgres-licensed. Okay. This, I think, should happen at some point. I know there are people who don't like this idea, but they've developed such good stuff that Postgres would benefit from it.
                                         
    
I know some people would not be happy with these words, I know, but it's just... yeah, it feels natural to have these features in Postgres itself.

Some of them, for sure. But I do think some of it's tricky, like the bottomless stuff, for example. Where would it even go?

There is an extension for it already, right? Yeah, okay. S3 compatibility is a standard; even Google Cloud, GCS, is S3-compatible. Everyone does it. So, no, I don't see... I see the difficulty, but it's manageable.

Fair. The biggest question is business-wise, I guess.
                                         
So, TimescaleDB.

I understand. License, yeah. I understand that business-wise it's not going to happen in the near future, but it would be so great to have good compression and bottomless in Postgres itself, even if it's an extension. I'm okay with extensions. I don't like extensions in general, because I like, you know, everyone to have these features out of the box, but in this case it's okay to have extensions.

Cool. Extensions are great sometimes, yeah.
                                         
Good. Okay, have a good week. I know we have more than 100 suggestions in our doc.

Yeah.

We read them. Keep posting. This was one of the suggestions, right?

Yeah.

So you chose it. Thank you.

Yeah, good. Thank you to our audience for your patience in reaching this point. Okay, see you next time.

Absolutely. Take care. See you next week. Bye.
                                         
    
                                         Bye-bye.
                                         
