Postgres FM - Append-only tables
Episode Date: November 15, 2024

Nikolay and Michael discuss append-only tables in Postgres — what they are, some unique challenges they bring, and some options for compressing / removing / offloading the data eventually.

Here are some links to things they mentioned:

- Append-only: https://en.wikipedia.org/wiki/Append-only
- Our episode on BRIN indexes: https://postgres.fm/episodes/brin-indexes
- Tips to Improve Insert Performance: https://www.timescale.com/blog/13-tips-to-improve-postgresql-insert-performance/
- Our episode on WAL and checkpoint tuning: https://postgres.fm/episodes/wal-and-checkpoint-tuning
- autovacuum_vacuum_insert_scale_factor: https://www.postgresql.org/docs/current/runtime-config-autovacuum.html#GUC-AUTOVACUUM-VACUUM-INSERT-SCALE-FACTOR
- Our episode about compression: https://postgres.fm/episodes/compression
- New project announced recently called BemiDB: https://x.com/samokhvalov/status/1854726389603484145
- pg_tier: https://github.com/tembo-io/pg_tier

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:

- Michael Christofides, founder of pgMustard
- Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:

- Jessie Draws for the elephant artwork
 Transcript
                                         Hello, hello, this is Postgres FM. I'm Nikolay, Postgres.AI, and my co-host is Michael, pgMustard. Hi, Michael.
                                         
                                         Hello, Nikolay.
                                         
                                         And guess what we are going to discuss today?
                                         
                                         I'll guess. Is it append-only tables?
                                         
                                         Exactly.
                                         
                                         Ah, I got it right.
                                         
                                         I was surprised to hear that we haven't discussed it in the past.
                                         
                                         I'm sure we touched on it many times, right?
                                         
    
                                         Yeah, for sure it's come up in episodes, but me too.
                                         
                                         When I saw it in our listener-suggested topics,
                                         
                                         I did have a few searches on our site
                                         
                                         just to make sure we hadn't talked about it already as a whole episode,
                                         
                                         and yeah, agreed.
                                         
                                         But it's not something I see all the time. Like, it is relatively common to have an events table, but even then I mostly see append-mostly tables rather than...

                                         Oh yeah, we need to distinguish.

                                         Yeah, I was thinking we were going to discuss both. Yeah.
                                         
    
                                         Do you feel we need a whole episode for append-mostly tables?
                                         
                                         No?
                                         
                                         Well, actually, I don't think this is super complex. I think it's quite nice because, almost by definition,
                                         
                                         if we do accept that it's append only,
                                         
                                         we've got quite a narrow scope
                                         
                                         and there's only a few things to consider.
                                         
                                         Maybe it gets a bit more complicated with append-mostly, but yeah, we can definitely cover that today. I think it's still not that complicated.

                                         Yeah. Where do we start? Like, a definition?

                                         Go for it.

                                         Yeah, why not? Well, if we just insert, that's append-only. That's it.

                                         Insert and select, I would say.

                                         Yeah, no updates, no deletes. Select is allowed, right?

                                         Yeah, insert and select. These are the only two things we can allow ourselves from DML, data manipulation language. And that's it. We just select and insert. This is append-only. If we have occasional deletes and updates, it's append-mostly, or insert-mostly, I don't know how to say it.
                                         
                                         Yeah, I like that.
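To make the definition concrete, here is a minimal sketch of an append-only table; the table and column names are made up for illustration:

```sql
-- Hypothetical append-only table: rows are only ever inserted and read,
-- never updated or deleted.
CREATE TABLE events (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    created_at timestamptz NOT NULL DEFAULT now(),
    payload    jsonb NOT NULL
);

-- The only DML this table ever sees:
INSERT INTO events (payload) VALUES ('{"type": "page_view"}');
SELECT count(*) FROM events WHERE created_at >= now() - interval '1 hour';
```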
                                         
    
                                         And why do we care about this particular use case?
                                         
                                         Because it has characteristics, right?
                                         
                                         It usually has...
                                         
                                         If it's append-only, for example,
                                         
                                         we don't have to worry about bloat in those tables anymore.
                                         
                                         No bloat, right?
                                         
                                         It's good.
                                         
                                         And usually we talk about huge volumes of data.
                                         
    
                                         And at some point we think,
                                         
                                         oh, we need to compress it,
                                         
                                         we need to offload it maybe to cheaper storage, or just
                                         
                                         clean up, because this old data is not needed anymore in raw form. Sometimes it's aggregated, and then the raw form can just be removed from the database. Yeah, or we just, I don't know, need bottomless storage. And usually we want inserts to happen very fast
                                         
                                         because volume is huge,
                                         
                                         so we need to make sure performance of inserts is good.
                                         
                                         Did I miss anything?
                                         
    
                                         No, I think those things aren't necessarily always true
                                         
                                         for append-only tables, but they correlate.
                                         
                                         A lot of the use cases
                                         
                                         for very, very fast-growing data,
                                         
                                         and by definition, append-only means
                                         
                                         it's never going to decrease in size.
                                         
                                         It's only going to keep getting larger and larger.
                                         
                                         Unless you clean it up.
                                         
    
                                         Well, there are specific cases.
                                         
                                         For example, imagine you...
                                         
                                         We discussed many times
                                         
                                         the topic of slow count.
                                         
                                         And if you can allow, like, asynchronous calculation of count, maybe it's, like, materialized or something, I don't know. So the idea is: instead of updating the count somewhere on each insert into the original table, you can accumulate operations in an intermediate table, and then it's append-only.
                                         
    
                                         So you register events in some table, and then you process a chunk and reflect it in the count in the final storage.
                                         
                                         And then you can delete it or better drop this partition or truncate or something.
                                         
                                         In this case, it's append only, but it grows, grows, and then the size drops.
                                         
                                         It happens also, right?
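A rough sketch of the pattern described here, with hypothetical names: inserts only ever append deltas to a buffer table, and a periodic job folds them into the final count. The `DELETE ... RETURNING` form consumes only the rows visible at the start of the job, so concurrent inserts are untouched:

```sql
-- Append-only buffer: the hot insert path only ever appends here.
CREATE TABLE counter_events (
    counter_name text NOT NULL,
    delta        int  NOT NULL DEFAULT 1
);

-- Final storage for the aggregated counts.
CREATE TABLE counters (
    counter_name text PRIMARY KEY,
    value        bigint NOT NULL DEFAULT 0
);

-- Periodic job: consume the accumulated deltas and fold them in.
WITH consumed AS (
    DELETE FROM counter_events RETURNING counter_name, delta
)
UPDATE counters c
SET value = c.value + s.total
FROM (SELECT counter_name, sum(delta) AS total
      FROM consumed
      GROUP BY counter_name) s
WHERE c.counter_name = s.counter_name;
```

At large scale, the buffer would more likely be partitioned so that processed chunks can be dropped or truncated instead of deleted.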
                                         
                                         Yeah, I think dropping partitions
                                         
                                         definitely pushes the definition of append-only,
                                         
                                         but it's the thing that makes sense to do
                                         
                                         in a lot of cases at huge scale.
                                         
    
                                         But yeah, is it still append-only if we're dropping partitions?

                                         Yes, this is how we should do it.

                                         No, I know, but do you see what I mean? You mentioned deleting data.

                                         Well, again, it's append-only: we drop a partition, we never delete, we never update. It's append-only. But if we don't need, like, last year's data, and we already processed it somehow and made all the calculations we need, we can get rid of the raw data. We just drop the partition. It's the best we can do, instead of cleaning up somehow using deletes. I think we need to discuss it, because I did it many times and participated in huge projects in very large companies. The idea: let's offload all archive data. It was e-commerce. Old orders: let's offload them to cheaper storage for longer-term retention, and then we need to delete them in the original place, the original database, and it was not partitioned.
                                         
    
                                         Deletes, it was a project for a couple of months.
                                         
                                         Because downtime is not acceptable.
                                         
                                         It costs a lot of dollars.
                                         
                                         E-commerce guys know very well.
                                         
                                         They can calculate exactly how much each second of downtime costs the company.
                                         
                                         So if I had a partitioned table there, it would be magic.
                                         
                                         And it's append-only.
                                         
    
                                         That particular table was not append-only, right?
                                         
                                         But it can happen with append-only.
                                         
                                         For example, we have audit log.
                                         
                                         Some actions are stored in an append-only table, but we have a policy to store only two years of data. Then I would prefer to drop a partition with the old data.
                                         
                                         That's it.
                                         
                                         So cleaning up is a very important topic for append-only tables.
                                         
    
                                         This is what I was trying to say.
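A sketch of that retention policy with declarative range partitioning (all names illustrative): old data is removed by dropping a whole partition, a fast metadata-level operation that leaves no dead tuples behind, instead of running a long DELETE:

```sql
-- Audit log partitioned by month; writes only ever go to the
-- current partition, so every partition stays append-only.
CREATE TABLE audit_log (
    logged_at timestamptz NOT NULL,
    actor     text NOT NULL,
    action    text NOT NULL
) PARTITION BY RANGE (logged_at);

CREATE TABLE audit_log_2024_11 PARTITION OF audit_log
    FOR VALUES FROM ('2024-11-01') TO ('2024-12-01');

-- Retention job: once a partition falls outside the two-year window,
-- drop it (assuming such an old partition exists).
DROP TABLE IF EXISTS audit_log_2022_10;
```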
                                         
                                         Yeah, I completely agree.
                                         
                                         I think there are other benefits to partitioning with append-only or append-mostly as well, because if we do have the occasional update or delete, by having partitions...

                                         That's bad, yeah.

                                         Well, partitioning helps with that as well, right?

                                         So let's zoom back out, maybe. We've got inserts and selects.
                                         
                                         So, if we're talking about a very, very high volume,
                                         
    
                                         we might have to worry about insert performance
                                         
                                         and select performance.
                                         
                                         We can also have, you know, sorry for interrupting,
                                         
                                         we can also have copy.
                                         
                                         Yeah, sure.
                                         
                                         In both ways, right?
                                         
                                         And I guess that's about to come up if we're talking about...
                                         
                                         Also copying out a whole table, you know, like a dump, which reads everything.
                                         
    
                                         Yeah, but it's kind of select.
                                         
                                         So, yeah, inserts or selects, no updates, deletes, and so on, yeah.
                                         
                                         So, use cases, you wanted to discuss use cases, right?
                                         
                                         Or even, I was actually thinking of diving straight to performance.
                                         
                                         I think there's a few things that we don't have to worry about,
                                         
                                         a few things that we can then optimize for.
                                         
                                         If we're having to insert extremely high volumes,
                                         
                                         which sometimes these use cases do lend themselves towards.
                                         
    
                                         I think IoT, for example: Internet of Things. If sensors are sending information and we're logging every second across thousands or tens of thousands of sensors, that can end up being a lot of data. So inserting can be a bottleneck, and you might make design decisions for those tables that you wouldn't make if you had a different type of table or a different type of data.
                                         
                                         So there's that side of things.
                                         
                                         But then there's also the read side of things.
                                         
                                         I think, you know, and I think those things maybe sometimes play off against each other.
                                         
    
                                         So the fact we've got append-only, we have some benefits: index-only scans, for example, become even better. I know you often talk about always trying to get index-only scans, but in a table where the data is often changing, that can be a losing battle, or a battle that's not always worth fighting. Or it's maybe not always worth including as many columns in the index, for example. There are different trade-offs for append-only versus...

                                         Let's unwrap everything here. You mentioned so many things in, like, just a minute, right? So first, let's talk about performance of inserts. I would say the ideal situation is we don't have indexes and we don't have triggers, including foreign key triggers, because a foreign key in Postgres is internally implemented via a system trigger, and this trigger is going to slow down inserts, especially if you need to insert a lot of rows. Even just a few foreign keys can multiply the duration of a massive insert. So, ideally, we should get rid of foreign keys and keep as few indexes as possible for this particular case.
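A sketch of that ingest-first extreme, with hypothetical names: no primary key, no foreign keys, no indexes, so each insert only writes the heap and WAL, with nothing else to maintain:

```sql
-- Bare ingest table: no PK, no FKs, no indexes, no triggers.
CREATE TABLE sensor_readings (
    sensor_id int NOT NULL,
    read_at   timestamptz NOT NULL,
    value     double precision NOT NULL
);

-- Multi-row inserts amortize per-statement overhead:
INSERT INTO sensor_readings (sensor_id, read_at, value)
VALUES (1, now(), 20.5),
       (2, now(), 19.8);

-- Bulk loads are typically fastest via COPY:
-- COPY sensor_readings FROM '/path/to/readings.csv' WITH (FORMAT csv);
```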
                                         
                                         I remember in some cases I decided to go without primary keys, you know, breaking relational
                                         
                                         model and so on.
                                         
                                         There's no relational model in Postgres, or in any relational database that implements the SQL data model, which has NULL and breaks the relational model completely anyway.
                                         
    
                                         But this is a different topic, side note.
                                         
                                         Anyway, so it's not good to be without primary keys, but sometimes you think,
                                         
                                         oh, I just need to dump this to some table reliably. So we have ACID, so Postgres guarantees,
                                         
                                         it's stored, it's saved, it's replicated, it's backed up. But even one index, sometimes you think, oh, it slows me down.
                                         
                                         And I remember I decided to live without a primary key.
                                         
                                         And it was a weird case, but it was just some archive,
                                         
                                         maybe just for audit purposes.
                                         
                                         And I decided to use BRIN at that time.
                                         
    
                                         BRIN is actually a good idea to consider if we have append-only, because of the physical layout: rows don't move.
                                         don't move.
                                         
                                         If we have a row, it's a tuple, it's
                                         
                                         saved in some block,
                                         
                                         it's there, right?
                                         
                                         So this is exactly when BRIN indexes work well. And we had an episode, one of our first episodes, I remember. It was BRIN indexes. BRIN is block range index, right?
                                         
                                         Yeah.
                                         
                                         Yeah.
                                         
                                         So it's very lightweight.
                                         
                                         It speeds up select performance,
                                         
                                         not as good as other indexes,
                                         
    
                                         especially B-tree,
                                         
                                         but it's still good, right?
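For an append-only table where rows arrive in timestamp order, a BRIN index might look like this (table and index names are illustrative). BRIN stores only a min/max summary per range of blocks, so it stays tiny compared to a B-tree, and it works precisely because appended rows never move:

```sql
-- Hypothetical append-only table; rows land in timestamp order.
CREATE TABLE readings (
    read_at timestamptz NOT NULL,
    value   double precision NOT NULL
);

-- BRIN keeps only min/max of read_at per block range: very small,
-- cheap to maintain on insert.
CREATE INDEX readings_read_at_brin ON readings USING brin (read_at);

-- Range predicates can then skip whole block ranges:
SELECT avg(value)
FROM readings
WHERE read_at >= now() - interval '1 day';
```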
                                         
                                         Or we might consider hash indexes also, right? Because they might be more lightweight than B-tree sometimes.

                                         Well, for equality, for example, right? Yeah. But when it comes to append-only, I think you make a really good point: each index we have slows down the inserts, so the fewer the better, possibly none. If, let's say, it's a table we're never reading from, or it's an audit log that we only ever have to read from extremely rarely, we might consider one or even zero indexes on that.
                                         
    
                                         Maybe not an audit log, because maybe that's not one
                                         
                                         you would actually be writing an insane volume to.
                                         
                                         But I read a Timescale blog post. Timescale have to worry about this kind of thing, since they've designed for these kinds of time-series workloads. They've written a good blog post on optimizing inserts, and they list all the same things as you and go further: as well as foreign key constraints, basically other constraints can add overhead too. So, for example...

                                         Of course, it checks the unique constraint.

                                         Yeah.

                                         Check constraints, but unique constraints, yeah, that's an additional index check.

                                         Whoa.

                                         Yeah.
                                         
    
                                         So basically deciding, for each constraint, whether you really need it and what value having it adds, makes sense.
                                         
                                         That being said,
                                         
                                         I must say, like, in most cases, I prefer having a primary key.
                                         
                                         Because it's like the center of consistency,
                                         
                                         of data consistency, right?
                                         
                                         So it's good to have.
                                         
    
                                         But it depends, right?
                                         
                                         It's good that you mentioned Timescale, but I think, yeah, we will return to Timescale. My question to you: a tricky question, but I think you already know. And I must admit, when we started the podcast two years ago, I didn't realize it fully; now I realize it much better. So we have an index. What operations does it slow down? You said it slows down inserts; this is for sure. Does it slow down updates? Well, yes, right? And there's a mechanism, HOT updates, which deals with it in a limited number of cases. Does it slow down deletes? Well, maybe no, because during a delete, the index is not updated.
                                         
    
                                         Postgres only updates xmax, as we discussed a couple of times.
                                         
                                         Does it slow down selects?
                                         
                                         What do you think?
                                         
                                         So we've talked about how having a lot of them can.
                                         
                                         Yeah, yeah.
                                         
                                         It slows down selects, especially if we have a lot of them and a high frequency of selects, and this is about planning time. And the lock manager takes locks during planning: all indexes are locked. It's some overhead to keep in mind in very heavily loaded systems.
                                         
                                         Yeah, but in general, I would minimize the number of indexes and try not to use foreign keys. In many cases we can

                                         imagine they exist, maybe have routine checks that referential integrity is fine, right, but drop them

                                         intentionally, because in this case we want good insert performance. And as usual, I would like to remind that when I say all this,
                                         
                                         in many cases when I deal with a new system,

                                         I have these principles, but I never trust myself.
                                         
                                         I always check again.
                                         
                                         Checking should be considered,

                                         like, sometimes you spend time there, right?

                                         But it's worth doing experiments.
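If foreign keys are dropped for insert performance, the "routine checks" mentioned here could be a periodic query for orphaned rows; a minimal sketch, with hypothetical table and column names:

```sql
-- Look for "events" rows that reference a non-existent user.
-- Run this from a scheduled job instead of paying the FK check on every insert.
SELECT e.id, e.user_id
FROM events e
LEFT JOIN users u ON u.id = e.user_id
WHERE u.id IS NULL
LIMIT 100;  -- a non-empty result means referential integrity is broken
```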
                                         
                                         Well, and I would say we're talking about extremely high volumes here. If you can, I would much

                                         rather normally have a primary key, have some foreign keys if they make sense, and have a unique key if I

                                         need it, and then test. Like, can I get better disks if I need to?

                                         Like, are there other ways I can improve, ways I can cope with higher write performance instead of...
                                         
                                         Perform checkpoint tuning if you expect huge volumes to land in storage.
                                         
                                         Yeah, so maybe pay for it in other ways. It's only at the absolute...
                                         
    
                                         Have more memory, bigger buffer pool.
                                         
                                         Exactly.
                                         
                                         Make sure backends don't write all the time.
                                         
                                         It depends, right?
                                         
                                         So the checkpointer is not going crazy.
                                         
                                         It's not too frequent.
                                         
                                         And so, yeah, there's a lot of stuff here.
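A sketch of the checkpoint tuning being described; these values are illustrative starting points for a write-heavy system, not universal recommendations:

```sql
-- Spread checkpoints out so heavy insert volume doesn't trigger them constantly.
ALTER SYSTEM SET max_wal_size = '16GB';         -- default is 1GB
ALTER SYSTEM SET checkpoint_timeout = '30min';  -- default is 5min
SELECT pg_reload_conf();
```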
                                         
                                         And if we think about SELECT now, what's the number one problem usually?
                                         
    
                                         I think, like, I'm still wondering how come we lived so many years

                                         until, I think, Postgres 12, when autovacuum_vacuum_insert_scale_factor was added.
                                         
                                         I think Darafei initiated it.
                                         
                                         Darafei. Version 13. I looked it up. Yeah,
                                         
                                         version 13. Okay, it's very recent compared to my experience with Postgres. So strange.
                                         
                                         What does it add? Originally, Postgres autovacuum, which also maintains Postgres statistics,
                                         
                                         which is important for good query performance,
                                         
                                         including selects, right? Originally, it was triggered only after, say, 10% by default,

                                         10 or 20% of rows are changed. There is some formula, not very complex, but roughly: after 10 or 20% of rows changed (changed means deleted or updated), it triggers, but not after inserts. And only in Postgres 13

                                         was a specific parameter added, I think by default it's also 20% or 10%, which tells autovacuum to run and process a table after 10 or 20% of rows were added.
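Roughly, autovacuum triggers when changed rows exceed threshold + scale_factor × reltuples; the relevant defaults can be inspected with:

```sql
SELECT name, setting
FROM pg_settings
WHERE name IN (
  'autovacuum_vacuum_scale_factor',         -- 0.2: vacuum after ~20% of rows updated/deleted
  'autovacuum_vacuum_insert_scale_factor',  -- 0.2: vacuum after ~20% of rows inserted (Postgres 13+)
  'autovacuum_analyze_scale_factor'         -- 0.1: analyze after ~10% of rows changed
);
```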
                                         
                                         Yeah, I looked this up,
                                         
                                         and it's like there's three jobs, right, of AutoVacuum.
                                         
                                         There's roughly removing dead tuples.
                                         
                                         Four jobs, actually.
                                         
                                         Freezing, and analyzing statistics. Removing dead tuples, freezing, and analyzing statistics.
                                         
    
                                         Removing dead tuples, maintaining visibility maps.
                                         
                                         Maintaining visibility maps, of course, yeah.
                                         
                                         Four goals.
                                         
                                         Maybe actually more, but these four come to mind quickly.
                                         
                                         Yeah, and if you're only doing inserts, you don't need the removing dead tuples job.
                                         
                                         But that isn't the only thing vacuum's doing. So this then enables the visibility map and the freezing to happen.

                                         Well, regardless of inserts, it will happen.

                                         Well, we can insert into a different table.

                                         Yeah, okay, good point. And

                                         autovacuum will see that xmin, or xmax, or both... xmin, right?
                                         
                                         Yeah.
                                         
                                         If xmin is far in the past, we have a risk of wraparound, so it's time to freeze this table.
                                         
                                         Yeah.
                                         
                                         We can have zero operations, like, a table can be left unchanged for a long time,

                                         but at some point autovacuum decides, okay, it's time to freeze.

                                         But you're right, the visibility map, like, that would not get updated.
                                         
                                         Visibility map is huge.
                                         
    
                                         You mentioned index-only scans, the performance of aggregates, counts, right?
                                         
                                         So we do want to keep it up to date.
                                         
                                         I think the default is not enough, as usual, with autovacuum.
                                         
                                         We must tune it.
                                         
                                         And even cloud providers, their defaults are not enough.
                                         
                                         We must tune it and go down to one percent

                                         or smaller, and make sure autovacuum maintains statistics and visibility maps more often,

                                         so the performance of selects, including index-only scans, stays good, right?

                                         Performance, yes, stays good.
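Going down to one percent can be done globally or per table; a per-table sketch (the table name is hypothetical):

```sql
-- Vacuum and analyze this table after ~1% of rows change or are inserted,
-- instead of the 10-20% defaults.
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor        = 0.01,
  autovacuum_vacuum_insert_scale_factor = 0.01,
  autovacuum_analyze_scale_factor       = 0.01
);
```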
                                         
    
                                         Another reason to partition as well, so you can keep those small. Yeah, that makes sense.
                                         
                                         I was going to say, it's 20%.
                                         
                                         So it is quite high still, as you say.
                                         
                                         Would you ever switch to...
                                         
                                         I cannot imagine any OLTP system, any website, any mobile app,
                                         
                                         which would be okay with Postgres AutoVacuum defaults.
                                         
                                         This, like,
                                         
                                         ah!
                                         
    
                                         That's it. I don't know why they are so. They are so for
                                         
                                         what? We have
                                         
                                         so many beautiful websites working,
                                         
                                         huge systems
                                         
                                         working with Postgres.
                                         
                                         It's so cool to see that
                                         
                                         Postgres handles
                                         
                                         such big workloads. But these defaults...
                                         
    
                                         Well, and the strange thing is this one, for example: if we did reduce it to one percent,

                                         it would add overhead on small systems, sure. If you've only got 100 rows, it runs vacuum every row

                                         for a while, you know. But who's running a small system that can't handle a vacuum of a hundred-row table every

                                         row?

                                         Also, with append-only specifically, when some page is already

                                         processed, it's marked all-visible, all-frozen or whatever, and vacuum just skips it.
                                         
                                         Yeah, so it wouldn't even be much. There were many optimizations in this area, to not do work which can be skipped.

                                         So it's doing a good job skipping, and it has for many years already.
                                         
                                         So I think, like, I never saw any system, and I saw maybe already hundreds of them,

                                         different sizes, websites, like, OLTP, right?

                                         I didn't see any time we decided, oh, you know what, we need to increase the scale factor.

                                         I don't remember, at least, at all. We can throttle it if we like, we can balance work, many workers and so on, but deciding, let's make the work of autovacuum less frequent:

                                         zero cases I had. Maybe my experience is not enough, maybe one day I will see such a system.
                                         
                                         I've not seen one either.

                                         Enough rage about defaults, my usual fun I have with Postgres.

                                         Let's talk about partitioning maybe, right? Why do we want it?
                                         
                                         I see several ideas
                                         
                                         here, and
                                         
    
                                         TimescaleDB is definitely, for an

                                         append-only database, a good thing to have
                                         
                                         in many senses.
                                         
                                         But unfortunately, it's not available in managed
                                         
                                         offerings, except
                                         
                                         their own, Timescale Cloud,
                                         
                                         right? And some others, but
                                         
                                         some others usually choose Apache 2.0 version,
                                         
    
                                         which doesn't have compression, right?
                                         
                                         It doesn't have a lot of their good features, yeah.
                                         
                                         Yeah, so partitioning is good.
                                         
                                         Again, there's some rule, an empirical rule, we say, like many people say,

                                         not only I: let's consider partitioning if a table exceeds

                                         100 gigabytes, or has chances to exceed 100 gigabytes. Partitioning adds complexity; it's not

                                         as well automated as in, say, Oracle, but it's a very important tool to consider. And there are many factors here. First, for example, you might say, okay, I have a partition
                                         
                                         where I insert and then many partitions where it's like my archive. And as we decided, we want
                                         
    
                                         a very low number of indexes in the main partition, which is receiving inserts, and few constraints like foreign keys and so on.
                                         
                                         But there is no such problem in all archive partitions, right?
                                         
                                         We might have more indexes there and constraints and so on.
                                         
                                         This is one thing.
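The layout being described, a lean hot partition and more heavily indexed archive partitions, might look like this (all names hypothetical):

```sql
CREATE TABLE events (
  id         bigint GENERATED ALWAYS AS IDENTITY,
  created_at timestamptz NOT NULL,
  payload    jsonb
) PARTITION BY RANGE (created_at);

-- Hot partition receiving inserts: keep indexes and constraints minimal.
CREATE TABLE events_2024_11 PARTITION OF events
  FOR VALUES FROM ('2024-11-01') TO ('2024-12-01');

-- Archive partition: extra indexes are cheap here, since it no longer
-- receives writes.
CREATE TABLE events_2024_10 PARTITION OF events
  FOR VALUES FROM ('2024-10-01') TO ('2024-11-01');
CREATE INDEX ON events_2024_10 (created_at);
```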
                                         
                                         Second thing: autovacuum.
                                         
                                         If occasional deletes or updates are happening, the block which contains the row,

                                         that tuple basically, right, it's out

                                         of visibility, like, it's marked not all-visible anymore and not all-frozen anymore, so

                                         vacuum needs to process it. And it's good to have data locality, so archive data is in some

                                         partitions and fresh data is in particular partitions. So autovacuum is focusing on fresh data in fresh partitions.
                                         
                                         It reduces the number of blocks it needs to deal with, right?
                                         
                                         Because old data is rarely touched, so autovacuum visits it very rarely, right?
                                         
                                         This is another reason.
                                         
                                         Cleanup is another reason as well, right?
                                         
                                         I think cleanup's the biggest reason.
                                         
                                         Maybe, yeah.
                                         
    
                                         I think maintenance,
                                         
                                         partitioning helps so much with maintenance.
                                         
                                         It does have other benefits for sure,
                                         
                                         but it helps so much with maintenance
                                         
                                         that I can't help but feel like that's the biggest one.
                                         
                                         And I actually have started to say, I think I must have stolen this from somebody else because it's too clever for me.
                                         
                                         But partitioning based on how you want to eventually delete data makes sense.
                                         
                                         So if you want to eventually delete old data by date, partition by date. If you're a B2B SaaS and you eventually want to delete data

                                         based on a customer quitting the service, you probably want to partition based on...
                                         
                                         Or both, two levels of partitioning also.
                                         
                                         Yeah, exactly. But that being the, like a guiding principle for how you partition,
                                         
                                         because it makes that deletion or dropping so easy.
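With time-based partitioning, "deleting" a month of old data becomes a metadata operation rather than millions of row deletes (partition names hypothetical):

```sql
-- Detach first (CONCURRENTLY needs Postgres 14+), then drop or archive.
ALTER TABLE events DETACH PARTITION events_2023_01 CONCURRENTLY;
DROP TABLE events_2023_01;  -- instant, and creates no dead tuples
```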
                                         
                                         What will you do with data, right?
                                         
                                         And as I said, I participated in projects
                                         
                                         where delete was a big issue.
                                         
                                         And of course, with partitioning,
                                         
    
                                         it's very different and it's good.
                                         
                                         Deletes can be a problem.
                                         
                                         Postgres deletes, like, if you have a terabyte,

                                         10-terabyte table and you need to delete 20% of it,

                                         it's a big headache, because

                                         you need to make sure vacuum will be okay, that autovacuum will catch up all the time.

                                         You need, again, to pay attention to the checkpointer, and you need to find a way how to delete

                                         so deletes don't degrade. This was my problem.
                                         
    
                                         So I created beautiful queries,
                                         
                                         but they degraded over time because of dead tuple accumulation
                                         
                                         and bloat accumulation as well.
                                         
                                         So I needed to adjust them and so on.
                                         
                                         So there are many problems with delete
                                         
                                         and it takes time to delete millions of rows.
                                         
                                         If you rush with it, you can put the system down or have
                                         
                                         degradation of performance
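The batched approach Nikolay describes might be sketched like this; batch size, predicate, and table name are all hypothetical and need testing against the real workload:

```sql
-- Delete a small batch; the application repeats this, sleeping between
-- batches, until zero rows are affected, so autovacuum and the
-- checkpointer can keep up.
WITH batch AS (
  SELECT ctid
  FROM events
  WHERE created_at < now() - interval '2 years'
  LIMIT 5000
)
DELETE FROM events
WHERE ctid IN (SELECT ctid FROM batch);
```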
                                         
    
                                         Well, yeah.

                                         And it can really affect even your

                                         select performance. So,

                                         you mentioned...

                                         BRIN is probably the one where it gets,

                                         it used to at least get, affected the most,

                                         with the default way of creating

                                         a BRIN index. If you have

                                         a row inserted way back in an

                                         old block, if you don't have partitioning, it goes miles away and you get some real scattered data, and

                                         BRIN performance can end up effectively looking like sequential scans.

                                         So all indexes degrade.
                                         
                                         B-tree degrades very quickly if you perform deletes and updates,
                                         
                                         and you need to rebuild it.
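Rebuilding a degraded B-tree without blocking writes is where small partitions pay off; a sketch with a hypothetical index name:

```sql
-- Available since Postgres 12; the smaller the partition, the faster this
-- runs and the shorter the xmin horizon is held back.
REINDEX INDEX CONCURRENTLY events_2024_10_created_at_idx;
```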
                                         
                                         And rebuilding is better with partitioning,
                                         
                                         because the smaller partitions are, the faster rebuilding is,
                                         
                                         and the xmin horizon is not held for long, right?
                                         
    
                                         So auto-vacuum is not affected in the whole database, right?
                                         
                                         Yeah.
                                         
                                         So it's better.
                                         
                                         Building and rebuilding indexes, vacuum itself, maintenance

                                         tasks are good if you have smaller physical tables, so partitioning is great, right?

                                         Yes. On the BRIN thing,

                                         I'll link up the old episode we did, but the minmax-multi opclass, I think, makes a big difference,

                                         especially since it handles loads of outliers. So I do think that's easier. And if you are able to keep on top of autovacuum, I guess the B-tree stuff doesn't degrade

                                         that quickly. So I feel like these things aren't as big problems anymore. But yeah, often in these
                                         
    
                                         cases, if you're dealing with high volume, like many, many thousands of queries per second, just

                                         extreme volume, anything you can do to help

                                         on the performance front will be helpful.

                                         Yeah. And as usual, when we touch partitioning: the state of caches and the buffer pool. For example, if you have archived data which is touched rarely,

                                         those blocks are evicted from the buffer pool, and cache efficiency might grow. The hit rate ratio might be better.
                                         
                                         But yeah,
                                         
                                         I agree with you.
                                         
                                         So partitioning is good in many
                                         
    
                                         senses. It comes with
                                         
                                         a price of overhead and maintenance as well.

                                         But it's worth having.
                                         
                                         But imagine, like, all
                                         
                                         this said, we moved slightly
                                         
                                         from append-only to append-mostly,
                                         
                                         right? But let's move back to append-only.
                                         
                                         Imagine we have many partitions where data is not changed, archive.
                                         
    
                                         Indexes are created, all frozen, all visible.
                                         
                                         It's a beautiful state of data, right?
                                         
                                         So all index-only scans are working well.
                                         
                                         And that's it.
                                         
                                         Maintenance is not needed.
                                         
                                         Auto-vacuum is not needed there, and so on.
                                         
                                         However, what if we have 100 terabytes of data,
                                         
                                         and this is a heavily loaded cluster,
                                         
    
                                         we have many replicas.
                                         
                                         The data is not changed, which is good.
                                         
                                         It's evicted from buffer pool,
                                         
                                         but we still need to keep it on the main storage.
                                         
                                         And at some point we think: oh, we pay a big

                                         price, because this data is replicated, and it increases the volume of backups, full backups, if we consider them,

                                         right? So this legacy, it's a lot. And if we have, for example, five replicas and one primary,

                                         we need to store the same data six times,
                                         
    
                                         and nobody is reading it.
                                         
                                         Like, people read it occasionally.
                                         
                                         At some point, you think it's not efficient.
                                         
                                         And you think, I would rather store it somewhere else,

                                         not on the main disks, not on the fast SSDs or, I don't know, NVMes I have,

                                         or cloud storage, which is also expensive,

                                         right? So this leads to two ideas. The first idea is it would be good to compress it,
                                         
                                         right? Again, the full version of TimescaleDB is doing a great job, and their

                                         blog posts about compression are great. I especially like one,

                                         I remember, the first big one, which explained the algorithms, and it's basically a kind of columnar

                                         compression, although we still have row storage. It's great. And also, I think the second topic here,
                                         
                                         which opens up naturally, is, I know Aurora now offers it, right? And Neon and Timescale

                                         as well, cloud only: the bottomless approach, when old partitions are offloaded to S3 or object

                                         storage, GCS on Google Cloud, or Blob Storage on Azure, or how it's called, I don't remember.
                                         
And now even Hetzner has it. They just recently released it, which is big news. I like it because I like their prices.

And I've worked with them since, I think, 2006 or so, in a few companies. When you're bootstrapping and you have a small startup, Hetzner is like number one in terms of budget and the hardware they can offer. So they just recently released S3-compatible object storage, right? So we can have normal backups and so on.
                                         
But what to do with old partitions? It's a natural way of thinking. We don't want to keep them on these expensive disks we have, having multiple copies of everything. So offloading them somehow implicitly, transparently, in the background to S3, or to S3-compatible MinIO or something if you have self-managed Postgres, it would be great, right? And so we have it in Timescale Cloud. I think Neon also does it, right? I don't know.

Bottomless, bottomless. Like, Neon stores data on S3 originally anyway.

So the idea is we want to have a petabyte-size cluster, but not pay for lots of disks and the headache that comes with them.
                                         
And for append-only, it's very natural to decide: okay, we want to store data forever, not to clean up, but we want cheap storage here. So S3 is a good idea to consider. And it has tiers also, right? It can be slow to retrieve, but that's okay because it's rare.

Well, it depends what you mean by slow. I think there can be performance advantages. When some of this data is fresh, we might want to retrieve it row by row. Like, if you're looking at some audit logs, you might want to look at some recent ones, and you might want all the information about them. But if you're looking at data from two years ago, there's probably a higher chance that you're looking at it in aggregate, you know: on average, how many audits of this type were we having in 2022 versus 2023? And I think, actually, the types of queries that happen on older data tend to be these aggregate ones that often perform better once the data is column-store compressed. You know, these file formats often suit that kind of query.
                                         
    
I know what you mean. Postgres lacks compression, good compression, it lacks it, yeah. And TimescaleDB compression, I have seen how good it is. It can be 20 times smaller. And indeed, they even support data changes for compressed data, which is great, right?
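For a concrete picture, TimescaleDB's columnar compression is enabled per hypertable. The sketch below uses TimescaleDB's documented settings and policy function, but the table and column names (metrics, device_id, ts) are made up for illustration:

```sql
-- Hedged sketch: enabling TimescaleDB compression on a hypothetical
-- hypertable "metrics". The segmentby/orderby choices control how
-- similar values end up adjacent, which is what makes compression good.
ALTER TABLE metrics SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id',
  timescaledb.compress_orderby   = 'ts DESC'
);

-- Automatically compress chunks once they are older than 7 days
SELECT add_compression_policy('metrics', INTERVAL '7 days');
```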
                                         
I have seen a project or two come about lately, I think open sourcing some of this stuff, or at least putting it under the Postgres license. Is it pg_parquet that allows you to...?
                                         
    
Yeah, but it's different. It's for analytics. And actually, for analytics we also might want to consider append-only tables, obviously. But there is a new wave of this, and I know many people and companies are looking at it. pg_duckdb... no, DuckDB.

Yeah, DuckDB.
                                         
    
Right.

And the idea is: let's marry it with Postgres. And there are a few projects. I looked at a few recently, and one of them was just released maybe last week. I remember they use a logical copy from the original data tables, regular tables, to tables which are stored, I think, in Parquet format on object storage, and then DuckDB is used as the processing engine for analytics. But I remember, I think, Alvaro commented on Twitter that he's not going to consider it until it works with, basically, CDC, logical replication or something, because right now it's only a full refresh. It's not serious yet, but they will do it, I think. Also, I think the Neon guys looked at DuckDB, I saw some activity, and Hydra, right? They also looked at DuckDB.
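As a rough illustration of why DuckDB plus Parquet suits the aggregate-style queries on old data discussed earlier, here is a sketch in DuckDB SQL; the bucket, file layout, and column names are made up:

```sql
-- Hedged sketch: DuckDB querying Parquet files sitting on S3.
-- Bucket and schema are illustrative, not from any real project.
INSTALL httpfs;
LOAD httpfs;

SELECT date_trunc('month', created_at) AS month,
       count(*)                        AS audit_events
FROM read_parquet('s3://my-archive/audit_logs/*.parquet')
GROUP BY month
ORDER BY month;
```

Column-oriented Parquet lets this kind of aggregate read only the columns it touches, which is the performance advantage being described.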
                                         
Yeah, but I understand that most of the marketing at the moment is around analytics use cases, but I don't see why it couldn't work for append-only...

Data types.

Might be true.

I looked at a couple of extensions, because I have a couple of customers with such a need, to offload old data. And Parquet, this format, supports only a mapping of data types. It might be tricky if you have some complex data types, as I remember. And when I looked at some extensions, it didn't work well.
                                         
    
And I think right now, like, I have plans to look at Tembo's extension, which is called pg_tier, for tiered storage. Right. And the idea is that with this extension we can have old partitions on object storage. It's a great idea, so if it works, it's great. I just haven't looked at it yet. If somebody from Tembo is watching or listening to us, please let us know how this project works already. Is it already at production or beta stage? I'm very curious. And is it worth trying? And what are the limitations, for example? Maybe we should actually have a separate episode, because I think this extension might bring the bottomless idea to the Postgres ecosystem for everyone, right?
                                         
    
It's great. It's open source, unlike what other companies do. So kudos to Tembo for making this an open source extension.

And is it transparent? Like, do I have to... Because I think with some of the DuckDB stuff, I would have to be... I'm not sure if I have to write...
                                         
    
I don't know. It might be semi-transparent. For example, if you need to make some transitions with old partitions, I'm okay with that. And I can even create new partitions in the background and move all the data from one old partition to this kind of new old partition, which already has different storage. This is doable. So it's not a big deal, as I see it.
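The semi-transparent flow just described can be sketched in plain Postgres SQL; the table, partition, and tablespace names are hypothetical, with a cold_storage tablespace standing in for whatever cheaper disks are available:

```sql
-- Hedged sketch: rebuild an old partition on cheaper storage and swap it in.
-- Assumes a range-partitioned table "events" and a tablespace "cold_storage"
-- created on slow/cheap disks.

ALTER TABLE events DETACH PARTITION events_2022_01;

-- New "old" partition with the same shape, on the cheap tablespace
CREATE TABLE events_2022_01_cold
  (LIKE events_2022_01 INCLUDING ALL)
  TABLESPACE cold_storage;

-- Copy the data over in the background
INSERT INTO events_2022_01_cold
SELECT * FROM events_2022_01;

-- Attach the replacement in place of the original
ALTER TABLE events ATTACH PARTITION events_2022_01_cold
  FOR VALUES FROM ('2022-01-01') TO ('2022-02-01');

DROP TABLE events_2022_01;
```

For a plain tablespace move without the copy, ALTER TABLE events_2022_01 SET TABLESPACE cold_storage achieves a similar result, at the cost of holding a heavier lock while it runs.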
                                         
    
I think what Timescale Cloud has, for example, is transparent, and they have something interesting, as I remember, still in a very rough form. They have something also with the planner, some algorithms to decide when to bring this data into caches and so on. So it's interesting. But the idea is: we want partitioning to evict blocks with old data from memory, that's one thing. But then we think we want to evict them from our disks as well, because disks are also expensive, right? Let's evict them. And this is an alternative idea to cleanup and to compression. Or compression goes well with offloading as well, I don't know. The Parquet format is definitely good with compression, compressing data in terms of column storage, right? So if it's time series, it's good.
                                         
    
So there are interesting new directions of development in the Postgres ecosystem here. And I think we mentioned a few projects, both commercial and open source, which is great. So if someone wants to store a petabyte of data, or is preparing for it, and doesn't want to delete everything, I think there are ideas to consider.
                                         
Well, we didn't mention that with partitioning you can also use some kind of foreign data wrapper and store the data on a very cheap, also Postgres, cluster.

It can be Postgres, it can be not Postgres as well, right?

But, for example, we can consider a cheap Postgres cluster with very slow disks, HDDs, for a good price. And because of that, I inspected the foreign data wrapper, the postgres_fdw code, in terms of how it works with... like, basically, you need 2PC, right? You need two-phase commit to have a reliable commit, because it's already a distributed system. But without 2PC, there are risks of inconsistency, for example.

But not if it's append-only; you're never going to change that data by the time you're pushing it to a different server.
                                         
Right. I remember I inspected the code and found some edge cases, maybe even corner cases. And for inserts it's also possible: you wrote to one place, you're trying to write to another place, this one is already committed, here it's not committed. But I remember the code was quite smart in reducing the probability of some of these cases, though it's not 100%. As I remember, that was like five years ago, so I haven't revisited it since.
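For reference, the two-phase commit primitives being discussed do exist in plain Postgres SQL; the table name and transaction identifier below are made up. postgres_fdw itself commits remote transactions without these, which is the consistency gap being described:

```sql
-- Hedged sketch: two-phase commit building blocks in Postgres.
-- Requires max_prepared_transactions > 0 on the server.
BEGIN;
INSERT INTO logs_local VALUES (1, now(), 'example row');
PREPARE TRANSACTION 'txn_42';   -- phase 1: persist transaction state, keep locks

-- Later, once every participant has prepared successfully:
COMMIT PREPARED 'txn_42';       -- phase 2: make it durable everywhere
-- (or, on failure anywhere: ROLLBACK PREPARED 'txn_42';)
```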
                                         
Tell me if I'm wrong, but you're saying new partitions would be on the local Postgres, and it would be the old ones that we would move to the second one?

Yeah. If we don't have inserting transactions which deal with multiple partitions, there is no problem at all. And the old ones we can move to foreign tables, with postgres_fdw, to a different cluster. So you're right: no inserts happening, no problem.
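That pattern, old partitions living on a second, cheaper cluster, can be sketched with postgres_fdw; the server, table, and credential names below are all illustrative:

```sql
-- Hedged sketch: attach a remote table as an old partition via postgres_fdw.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER archive_server
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'archive.internal', dbname 'logs_archive');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER archive_server
  OPTIONS (user 'archive_rw', password 'secret');

-- The remote cluster holds a plain table "logs_2022" with matching columns.
CREATE FOREIGN TABLE logs_2022
  PARTITION OF logs
  FOR VALUES FROM ('2022-01-01') TO ('2023-01-01')
  SERVER archive_server
  OPTIONS (table_name 'logs_2022');
```

Queries on the parent table then route reads for the old range to the remote cluster transparently, which is what makes this safe for select-only partitions.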
                                         
                                         Cool.
                                         
                                         Select only.
                                         
                                         I never considered using FDW for partitioning.
                                         
Or, like, you know, for the old partitions.
                                         
                                         It's natural.
                                         
    
                                         It's natural.
                                         
                                         This is how many folks were going to have clustered Postgres, right?
                                         
                                         I mean, or sharding.
                                         
                                         Kind of sharded, yeah.
                                         
                                         Sharding, yeah.
                                         
                                         But this path has issues.
                                         
                                         Yeah.
                                         
                                         It's a different story.
                                         
    
                                         Here be dragons, is it?
                                         
                                         Or what's that old phrase?
                                         
                                         I don't know.
                                         
Okay, we discussed many things. I feel we might be missing something, right? But I think it's enough for a general overview of what's happening with append-only tables. And yeah, I think it's an interesting topic, actually. Many people need it; many companies store logs and so on in Postgres.
                                         
And I'm looking forward to the future where Postgres, at some point, will have better compression and a bottomless feature. And as I like to say: imagine if TimescaleDB were Postgres-licensed. Okay. This, I think, should happen at some point. I know there are people who don't like this idea, but they've developed such good stuff that Postgres would benefit from it.
                                         
    
I know some people would not be happy with these words, I know, but it's just... yeah, it feels natural to have these features in Postgres itself.

Some of them, for sure. But I do think some of it's tricky, like the bottomless stuff, for example. Where would it even go?

There is an extension for it already, right? Yeah, okay. S3 compatibility is a standard; even Google Cloud, GCS, is S3-compatible. Everyone does it. So, no, I don't see... I see the difficulty, but it's manageable.

Fair. The biggest question is business-wise, I guess.
                                         
So, TimescaleDB.

I understand. License, yeah. I understand that business-wise it's not going to happen in the near future, but it would be so great to have good compression and bottomless in Postgres itself, even if it's an extension. I'm okay with extensions. I don't like extensions in general, because I like, you know, everyone to have these features out of the box, but in this case it's okay to have extensions.

Cool. Extensions are great sometimes, yeah.
                                         
Good. Okay, have a good week. I know we have more than 100 suggestions in our doc.

Yeah.

We read them. Keep posting. This was one of the suggestions, right?

Yeah.

So you chose it. Thank you.

Yeah, good. Thank you to our audience for your patience in reaching this point. Okay, see you next time.

Absolutely. Take care. See you next week. Bye.
                                         
    
                                         Bye-bye.
                                         
