Postgres FM - Materialized views

Episode Date: November 25, 2022

Apologies, Michael's audio is not great in this, we'll hopefully be back to normal next week!

Here are links to a few things we mentioned:

- Materialized views (docs)
- Refresh materialized view (docs)
- Timescale blog post
- PlanetScale Boost (content warning: MySQL)
- Incremental Materialized Views with pg_ivm (video by Lukas Fittl)
- Articles on how to do your own incremental updates(?)
- Materialize (company)
- Materialize talk
- Incremental View Maintenance (Postgres wiki)
- Implementing Incremental View Maintenance (mailing list thread)

------------------------

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:

- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:

- Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I'm Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai. Hey Nikolay, what are we talking about today? Hi Michael, let's talk about materialized views, skipping regular views. Or do you want to talk about them both? Well, I was most keen to talk about materialized views, but it's difficult to define them, I guess, without mentioning views at some point. And the rule system. Yeah.
Starting point is 00:00:26 Regular views are also interesting. They also have interesting things to discuss, right? So, yeah, let's focus on materialized views, but briefly touch on regular views. And as usual, let's start from very simple things, definitions for those who are not really familiar with the views system in Postgres, and then try to dive into deeper topics, and maybe the future a little bit. Sounds good. So I guess let's go back. Let's think
Starting point is 00:00:56 about tables relations in general. Yeah. We have ways of storing data in Postgres. We have tables, which hopefully everybody's familiar with, and then views, which are kind of a virtual table. So we can set up a query that queries tables, or yeah, let's stick to that simple version and call that a view, but that doesn't store anything. That's like, it's a view into the data. But our queries, if we're querying a lot of data in the tables, we're still going to be querying a lot of
Starting point is 00:01:30 data via the views. Whereas materialized views are... Let's rephrase about the whole this area is based on the fact that any query returns kind of a table.
Starting point is 00:01:47 What is the table? It's a good question because, for example, in theory, all rows in the relation, all tuples in the relation should be distinguishable. In Postgres, it's like primary key is not required, so they can be not distinguishable. But if you also remember about hidden columns, CTID, Xmax, Xmin, we discussed a couple of times, they are distinguishable again. But result sets of queries don't have those invisible columns. So it's kind of a lot of nuances here.
Starting point is 00:02:20 But in simplified point of view, any query returns a table, right? Which means we could substitute some table in some query with another query and vice versa. We could take some subquery and say, we can put a name for it and store it as another query and just use this alias here, right? And this is actually a view. And in Postgres, views are implemented using a rule system, which is a very, very old thing. Originally, Postgres and maybe Ingress had many decades ago. And this rule system allows users to define some rules saying what to do
Starting point is 00:03:03 at planning stage, not an execution stage as triggers, but at planning stage. So instead of this, do that. So we can say, when you see select from view one, just do this subselect from there and then do what you want. And this rule system allows you to define views. And also, good thing, thing by the way it was my bachelor's thesis very long ago updatable views so you can even insert if views are quite simple and Postgres can propagate change so you can not only select from them but you can insert to them update delete and it will work and by the way some intermediate software like Postgres, and for example, if you're a Supabase user, you are using Postgres. It encourages you to define API using views,
Starting point is 00:03:53 and it relies on the fact that you can change data, not only read data there. And it's quite interesting. This is views, right? Based on rule system, not approaching execution step of query execution, just doing everything on the planning stage. So quite good, quite interesting. What's the main problem with views? Dependency hell. If you need to change underlying table or underlying views, if you have hierarchy, multiple layers,
Starting point is 00:04:28 you usually need to recreate your views so it's quite can be painful especially if your system is under constant high load heavy load yeah and and i think the the benefits of them are extremely different to the benefits of materialized so do you like when do you see views used most often well originally I remember from my bachelor thesis like 20 plus years ago, originally one of the key ideas is that this is helpful to control access. So you can restrict number of columns you provide, you expose via views.
Starting point is 00:05:01 You can also restrict number of rows. So you can do it in both dimensions of table columns and rows. And then you can have very different, you can provide, for example, writable access and view, but no direct access to table at all. And this is actually what Postgres is doing. You're supposed to have a separate schema and there you define views and nobody is working like users of your system don't work directly with table they work with it
Starting point is 00:05:32 via views it's helpful to manage access also views are helpful sometimes just shortcuts like I'll ask for your queries why not? Yeah, usability better names but you can give objects different names to make them you know, if you if you
Starting point is 00:05:50 regret a name and don't want to actually do a big migration, I've seen it used for that as well. Right? Yeah, the main one seems to be security. They add complexity to system. And especially if again, if you have several layers of views so you will end up dealing with like if your schema is changing often you will you will notice overhead of maintaining your views constantly but what else like where are our views helpful actually reporting like you have reports so you just store it and that's it. But it doesn't help with speed anyhow.
Starting point is 00:06:25 Exactly. It doesn't store any statistics. It doesn't help you. You cannot create index on view, nothing like that, because it's very ephemeral. It's like just a query stored under some name. That's it. This is view. Can it ever hurt performance in terms of push down for like predicates and things like that?
Starting point is 00:06:44 I'm not so sure. Like in some cases, of course, if you have a view, you have indirectly, you have less flexible ways to optimize your query. But like in most cases, Postgres Planner can choose what to do. Like there is this joint collapse and another collapse. From collapse? Yeah, from collapse setting. So it basically says that Postgres compares different paths easily and views are just there already substituted at optimizations time. So it's just there already
Starting point is 00:07:20 this query which defines the view, it's already there, right? So basically, I don't see a big problem. Main problem only about maintenance of schema changes, for me, from my experience. I might be wrong, of course, as usual. My experience is it's not the only one, right? Yeah, get in touch if you've had other problems. So materialized views, another idea. I would say it's opposite idea. Instead of just storing the view definition,
Starting point is 00:07:49 let's store the data and view definition as well, right? Yeah, importantly. But it's storing the data, and at the moment that's at a point in time. So when the materialized view was created. So it's that query run at that time. So that, and again, let's probably want to catch me
Starting point is 00:08:09 before I go into the benefits too quickly. But the idea is you can run an expensive like computation of data, or you can do some work once and then read from it a lot of times instead of having to do that expensive work multiple times. Right. Actually, you can define just you,
Starting point is 00:08:28 but don't have data in it originally. There is additional comment populate. What's the benefit of that? I couldn't work it out. Well, I don't know. Some cases, we can invent some cases synthetically. For example, you restore from dump and you don't need it yet, but you already restored the original data.
Starting point is 00:08:52 Because materialized view, it's the data which is derived. You always can refresh it and have an up-to-date version. So you restore from dump and you don't want to have it there yet somehow. Just to consider dump as already restored fully because all original data already there yeah so you can speed up your restore right and later in separate step it can be done using pg restore pg dump pair of utilities and pg restore has dash uppercase and downcase l pair of options. So one option provides the list of objects the dump has, which if dump is in the custom directory format.
Starting point is 00:09:32 And another option allows you to filter out everything you don't want or filter in what you want to restore. So you can restore materialized view, but don't restore data. Do it later separately. Actually, I did it. materialized view, but don't restore data and do it later separately. This is actually I did it, but I did it because I had issues with by the way, maybe it's still there. So if you have multiple materialized views and also multiple layers of them,
Starting point is 00:10:03 your hierarchy, in some cases, restore doesn't work because it doesn't know the order. And it says I cannot populate data yet for this materialized view because I depend on another materialized view, it's empty. Or maybe even not created at all, I don't remember details there. So I remember I wrote a script with like infinite loop. And if refresh materialized view fails, it just skipped, and another attempt is done later. So it was brute force, recursive refresh of all materialized use to fill all layers of it.
Starting point is 00:10:32 It worked very well. So, materialized use stores data. I agree with you that it's helpful to avoid, for example, aggregates, one of the examples, because we all know that counting is quite slow in Postgres. Postgres is a row store, so it's not column store,
Starting point is 00:10:49 unlike some analytical database systems. And of course, if you have billion rows, count will be slow. Even if you have index-only scan, it will still be slow. So you can pre-calculate it for particular groups, for example, and have quite small materialized you and you can build indexes on it. Any custom indexes, right? Yeah. I think a lot of people don't realize that. It's really cool. Yeah. It's because it's a table, actually. I would like everyone to understand that instead
Starting point is 00:11:18 of materialized you, you have very good, powerful tool in Postgres. Create table as select. Yes. It's considered as a one-time materialized you without refresh capability it's the same right yes but with refresh capability i think it becomes even more powerful right i would argue let's like i i mean i agree in many cases but in some cases i don't agree and i will tell you why in a moment awesome so create table as select gives you powerful ability to create some derived tables based on one or several other tables or materials to use actually and then create indexes and work with it
Starting point is 00:11:58 very good for experimenting it's also very good and internally it's good it doesn't spam the buffer pool because the ring is used there so only small small amount of buffers relatively small i don't remember always forget the size of it but of course the operational system page cache it doesn't acknowledge what data you are getting from disk so it will be spammed with this action. But the buffer pool will remain good. I mean, it won't evict useful buffers from your system, which is good, creatable select. And you can create indexes.
Starting point is 00:12:38 And I would like to point to a very important thing I realized not very long ago. Very simple. I like to say silly things. By the way, I want to apologize for the last episode because I said inserts cannot be had, heap only tuples, and deletes cannot be had. Well, inserts cannot be had. It doesn't make sense. But deletes, they even don't change indexes at all because indexes don't have visibility in four. So it was a wrong idea even to raise.
Starting point is 00:13:07 Delete, as we also discussed, it just puts value of transaction ID to X max in the heap, in table pages, and doesn't do anything with indexes at all. And so it cannot be hard because it's not needed to be hard at all. It's already hard, actually. Like new statement,
Starting point is 00:13:26 deletes are always hot already because they don't change indexes. Of course, indexes are changed later by vacuum activities, but it's asynchronously, which is good. So back to our topic, the idea that you can create table as select create a materialized view, which currently we can consider kind of equivalence. We will see differences a little bit later. You can create indexes. And this is super cool because if join is involved, you cannot create an index on two tables at once. Index always belongs to some table. But sometimes, for example, you want to order by two values which are present in two different tables.
Starting point is 00:14:08 Or you want to order by a filter. And you could do it using single index scan. Both order by limit and order by. And where clause uses one column, order by and limit another column. We could combine it in an index, like two-column index, for example. But we cannot because these columns are in different tables, right? So we cannot create an index for two tables. But when you join and create a table, select and join,
Starting point is 00:14:37 or you create materialized view, you finally can have this index, even if you don't do anything with data except just joining it. Yeah. Right? I mean, yeah, that's awesome. I've not seen anyone doing that. That's such a cool idea. I think everyone doing it just implicitly.
Starting point is 00:14:55 Oh, interesting. So, in fact, this actually takes us quite quickly to one of the downsides. That's great, but you are... It's a minor downside, I guess, for most people, but that is taking up space on disk, in cache, in caches and things like that. Well, of course, it does take space. And create table as select would take space, right?
Starting point is 00:15:17 Yeah. So it's a minor downside for that benefit, but it's the reason why... But the speed benefit can be many many orders of magnitude because when you avoid join and you can have single index scan and maybe even index only scan compared to two index scans in two different tables and then one of three algorithms to join it like it's very it's like game changer could be in terms of performance yeah i completely agree the the other like let's i think while i'm talking about downsides the main before downsides just
Starting point is 00:15:52 like you can instead of putting query which will solve your like like you have some goal to do something with data and you have a query you want to improve it with the speed of it. And okay, I will basically, what is this create table or select or create materialized queue? It's kind of cache. Instead of considering it as a query cache, you could, for example, store more, have more indexes on this derived table or materialized queue, and have more, for future, you can support more queries with this cache. Like, cache not the final result, cache data, right? So, as we know, query cache is kind of,
Starting point is 00:16:39 it's good until it's not, right? For example, my SQL removed query cache from engine at some point. It feels like one of those things where there are use cases that it can hugely benefit. Like if your data is not changing much or if you have past data that it's okay. You're pushing us too fast to the main problem with this. Okay.
Starting point is 00:17:03 Which also one the two biggest problems in computer science, as we know, right? Cache and validation. How to maintain the state of it. Of course, when we say create table and select, or we say create material and populate data, of course, the data is frozen, right? Yeah, but I mean, if we're talking about a...
Starting point is 00:17:23 Some people have analytics databases from static data that isn't changing at all. And that, for that, there's no downside. Well, maybe some cases, right. But I mostly observe cases where people want to maintain the state of analytical database with very small lag. Yeah. Or even they want to have it in the same system, so in Postgres and so on, so like HTAB approach.
Starting point is 00:17:53 Hybrid, right? Hybrid transaction. So yeah, the main problem is how to update it and what Postgres offers. If you forget about my idea of create table select, which I still think is quite good and it can beat materialized view in some cases, I will explain why.
Starting point is 00:18:10 I keep this intrigue state. So we can say refresh materialized view, right? And importantly, I didn't realize this was added so early, but refresh materialized view concurrently as well.
Starting point is 00:18:28 Right. What's the difference? Difference is one will lead to bloat, another will not, right? So the big difference, and I looked it up. Like developer's point of view, it's so good. Our queries are working. The base point of view, or bloating materialized views.
Starting point is 00:18:49 Who knew it can happen as well? So if we refresh a materialized view without concurrently, we block selects on the materialized view while it's refreshing, right? And that's offered, like bear in mind, we're normally doing this on a slightly slow query
Starting point is 00:19:03 because that's the benefit of materialized. it's like vacuum full basically right but yeah data will be fresh our goal is achieved yes but with concurrently it's like a i guess the equivalent would be a pg repack or something uh or or create index concurrently sorry, re-index concurrently would be the equivalent. So with concurrently, we don't block selects on the past state of the materialized view, and it replaces it once it's finished. Is that right? Yeah, that's right. But the main point to remember here is that we can refresh it only fully.
Starting point is 00:19:44 We cannot refresh part of it, unfortunately. Yeah. that we can refresh it only fully. We cannot refresh part of it, unfortunately. Yeah. If we use regular materialized view, like original Postgres provides. Because just refresh materialized view concurrently, that's it. There are no options to say that I know 90% of my data hasn't changed because it's very old data. So I probably could do it much faster. So you refresh everything and it can take a lot of time but concurrent selects are working. It's good. But we accumulate bloat of course.
Starting point is 00:20:15 And indexes, if we follow my approach, let's not use materialized view, let's define materialized view with some underlying data, maybe raw data, and support more queries, more kinds of queries with it. It means that probably you have multiple indexes there. Oh, we forgot to say that to support concurrent update, refresh. We need a unique index. Exactly. On one of the columns, right?
Starting point is 00:20:40 At least one. So already one index is there. But if you follow my approach, let's have raw data and support more queries, you probably have multiple indexes there, and their health will degrade as well. So all problems we have with tables which receive significant numbers of updates, deletes,
Starting point is 00:21:02 we'll have this here as well. So repacking or something. And here idea, like why just not to recreate it if we still need to populate it fully, right? Why? And maybe it can be just recreate materialized view with a different name and then quickly swap it in single transaction or just this create table select it's like if if it's recreated every time we want to refresh what's the point of having materialized you just maybe convenience because definition of it remembers the query right so it like database system stores the definition like it's better maybe that's it in some In some cases, I would seriously consider recreation of it every time,
Starting point is 00:21:50 just not to deal with bloat in data part of it and in indexes we have on it. But of course, if we do it every minute, for example, I would check the bloat of system catalogs as well, because if you have a lot of DDLs, you consider DDL is very, very often thing to happen. You might end up having very bloated PG class and so on. So it's also worth remembering. And PG repack cannot repack system catalogs, you need to do vacuum full. Fortunately's usually very fast but you don't want to to be like to have 99 of blood or something because it will affect basically all queries
Starting point is 00:22:32 because planning will be slowed down and so on many things so bottom line refreshment is good but it's very very rough tool one of our customers enjoyed using them until some point and then said, it's like having a huge hammer and just applying it to very small things. Like it's very rough tool, very rough. So it saves you sometimes, but then like if you only small portions of your data is changing, maybe you will be not satisfied with this approach, refresh it fully, always. And here also, by the way, could materialized view be updatable?
Starting point is 00:23:15 Do you mean increment? What do you mean? Updatable means like in the same, we discussed it for regular views, you can insert it. I don't see how that would make sense. Well, it might make sense at some point. But maybe it's too exotic. And I suspect other systems have it, but I don't remember. Let's skip it.
Starting point is 00:23:37 Just a question to know. Just an entertaining question. But as for the main topic, what would we need? We would need to be able to update only parts of the materialized view. And there is a big theory described in, not big, but some theory described in the post-Gus Wiki page, and the discussion is happening already many, many years, maybe decades, at least one decade, I would say, to have incrementally maintained views, incrementally refreshed, maintained views. So we want to avoid full refresh, right?
Starting point is 00:24:15 We want to adjust only part of it. And I already brought this, that materialized view, it's like create table select, but with rule system from views inherited like we just remember this you can use create table as select and then maintain the state using triggers or some other things and have incremental
Starting point is 00:24:35 data if you understand the logic probably it's a good way to go right just to maintain from your application the question to this approach will be, do you lose data? And do you slow down your writes on the baseline, base tables? Exactly, what are the trade-offs? It's actually quite an interesting time to talk about.
Starting point is 00:25:03 We could talk about the future and incrementally updating materialized views in postgres but i think it's also worth talking about a couple of existing tools and systems that have partially tackled it or tackled it for their own specific case the reason um well this was a requested topic so thank you to the person who requested it but there was also an announcement, not in the Postgres space, but by a company called PlanetScale that we've talked about briefly here before that announced a boost feature that looks very similar to Materialize Views,
Starting point is 00:25:37 but with incremental updates. And boost is easy to achieve if you had a join or multiple joins with a lot of various kinds of index scans. And then suddenly you have index only. I'm talking Postgres terms. Then suddenly you have single index only scan. Of course, it will be like 10,000 speed improvement or 100,000 even sometimes.
Starting point is 00:26:01 They definitely went for a click-baity title on it. But it's kind of what people sometimes want, right? They've got a slow query and they want a quick fix. I'm not concerned about the additional space usually, but sometimes I do. But the price of maintenance, how maintenance is organized. Do we have load issues in Postgres if we do this? The other one I wanted to talk about as an existing time
Starting point is 00:26:28 scale have their continuous aggregates feature as well and we don't need to discuss them in depth here but there's a good blog post that they've done actually covering some of the topics we've just discussed about views, materialized views and continuous aggregates and they have a solution that it's designed for time series
Starting point is 00:26:44 workloads where you're probably not going to have loads of data in the past changing, but it can cope with that. And it's been, they're being improved. So that's an interesting take. If you, if that's something you want and need and you're okay with the trade-offs now exists and it's kind of, it's in this area,
Starting point is 00:27:03 topic of incrementally updating. If you're a timescale user, you should go and check continuous aggregates. This is one of the key features, I would say. I agree. But there is also a project, and there is discussion happening almost four years already in hackers.
Starting point is 00:27:21 Unfortunately, I don't know details, but I just see it. Implementing incremental view maintenance started in December 2018 from Yugo Nagata. And the same author created PGIVM project, which is an extension extension okay open source extension it's quite good interesting work and this is attempt to follow this approach like let's have incremental updates automatic partial updates of you but as i understand like i've spent some time understanding how pg ivm works first of all there are restrictions of course on queries that that can be used to define such views. And also, as I understand, there are two approaches. First is synchronous propagation of data change, which will slow down our original insert or update or delete, right?
Starting point is 00:28:17 But still, it's good to have it automated. Yeah, I think of that as immediately updating materialized views. So as soon as there's change... Synchronously, immediately, right. It means that at commit time, like, our transactions which changed the original table will slow down, of course. But it's good that we don't need to write a trigger to do it. And second way is to have refresh i i didn't get it maybe like it's still full refresh like before or no like i didn't get this part i think they use triggers right okay so it's uh if if i updated only one row out of billion will uh second option will propagate change in very fast compared to refresh materialized view
Starting point is 00:29:06 i didn't realize pg ivm had two options i thought they might be talking more philosophically about there being two options and they that was my understanding but i definitely might there's a good video by lucas fitter on this as well we mentioned his channel last time i'll link up that video that's like five keep mentioning this channel and unlike us our podcast it's only five minutes always very much faster for for those poor dog owners who work with dogs only five minutes yeah maybe once with little legs yeah right so why i think well still like this is great but two big questions to this direction as a whole. First, can we do it asynchronously, but like propagated always, but asynchronously, like first option, but asynchronously, I don't want my original transaction to slow down and second this is usually needed when we have a lot of data and when we have a lot of data we always should use partitioning i want my materialist you also partitioned maybe right so
Starting point is 00:30:14 what about this like for example we partitioned by months and we have very like five years ago data january five years ago for example it can change, but so infrequently. So I would refresh it very less often. I don't want my original tables to be partitioned, but materialized to use a single thing. It means that if I refresh it fully, it's a very slow query and it will affect vacuuming and will affect whole database, leading to bloat in all tables. If I update it partially, well, I'm fine, actually, maybe fine to have. Well, vacuum will also need to work here as well, right?
Starting point is 00:30:55 And if it's a single huge table, all problems we have with huge tables which are not partitioned will hit this materialized view as well, even if we have partial incremental view maintenance. Yeah. So if I understand you, we're kind of talking about two extremes and whether there's a middle ground. So there's the extreme of having to refresh everything each time, which we currently have in Postgres. There's the other extreme of synchronous, immediate, trigger-based updates so that our materialized view is updated on each transaction, but that comes with write overhead? Is there a middle
Starting point is 00:31:33 ground option where we can... There is perfect solution. It's not middle ground. It's perfect solution. Perfect solution should propagate data asynchronously, not slowing down original transaction and not losing data. With some small lag, ideally also observable. So I would put this lag to some, I don't know, monitoring. And second, the result should be partitioned. This is where you lost me a little bit.
Starting point is 00:31:58 So my understanding of materialized views is normally that they're hugely smaller than the original data. So I mean, how many, how big are these materialized views? I don't agree. I saw so big materialized views. And even if, like, again, the idea of big result, big materialized view comes from my idea. Let's not create materialized view for each query.
Starting point is 00:32:22 Let's create it for like to avoid joints for example yeah right and then run a lot of various queries on top
Starting point is 00:32:28 of it quite good idea it's like denormalization automated denormalization basically
Starting point is 00:32:33 but even if people just use it for a final query just cache the result of query
Starting point is 00:32:39 I still saw huge materialized use huge interesting it happens
Starting point is 00:32:44 oh great sounds good if it's have it huge it should be partitioned yeah still saw huge mutualized use. Huge. Interesting. It happens. Oh, great. Sounds good. If it's huge, it should be partitioned to benefit from it. But the main key, the key, I would highlight the
Starting point is 00:32:55 key point here. Postgres, unfortunately, doesn't have this Oracle has pragma autonomous, the ability to
Starting point is 00:33:03 initiate transaction which will be detached from a trigger for example so i want to to have insert but a database will guarantee it will be finished but i already it's not my problem already i finish my transaction and another transaction is happening in different backend for example you can do it with db-link or something, emulate it, but it's kind of, it looks not clean solution. And like currently I see what, like ideal system, what I would build.
Starting point is 00:33:31 I would use tables. I would store definition, maybe not in Postgres, maybe inside. I would maybe partition result. But first thing I would solve, of course, I would propagate changes in incremental fashion, not refreshing whole view. And by the way, there are articles explaining how to do it.
Starting point is 00:33:50 And there are also tools for various languages which help you to do it. But what first thing I would do, I would propagate changes asynchronously through probably something like Kafka, I don't know. I can implement it in Postgres, but it requires additional knowledge. You need to know how to deal with bloat. It's kind of Q inside Postgres. Maybe we should cover this topic separately. It's possible. But if it's in Kafka, it's also good, or some other system which doesn't lose data, allows me to control the lag and just delivers this signal of change separately so I can update my target. And the project I sent you the link, this Materialized company called Materialized, they solve it as a commercial solution.
Starting point is 00:34:38 And as I understand this, they use Kafka and Dbizio. Everyone can do it, but maintaining Kafka requires additional efforts. So that's why maybe I would do it in Postgres itself. It depends on the system. If you have many, many databases, having Kafka as something additional to Postgres, it would
Starting point is 00:34:58 be justified. But in this case, imagine it's an ideal world. The target is good. I mean, it doesn't suffer from having huge tables and very long vacuuming. Indexes are usually smaller because it's partitioned. Index maintenance there can be applied in regular fashion. And changes are propagated atomically, like every small change is propagated, guaranteed with small delay. And for example, we have a lot propagated, guaranteed with small delay.
Starting point is 00:35:27 And, for example, we have a lot of posts, comments, like social media. Or, for example, Twitter. There's a master don't think this. People migrated. It's signing on Postgres as well, right? It is, yeah. We have tweets, for example, answers, replies, likes, everything. And we can have counters very, very, very good here, like using this approach. Like, so I wonder why we go,
Starting point is 00:35:52 development goes this direction, which will not work well for very large volumes of data. Well, I see the use cases where it will be useful, this incremental view maintenance. But why these two big problems ignored? And first of all, this ability to propagate changes asynchronously. I think this problem should be solved first to allow people to define triggers which will be asynchronous for example or something. Maybe it should be
Starting point is 00:36:18 done outside of database. It's not a closed topic for me. I don't know the ideal answer here. Maybe Kafka should be here. Or maybe it should be done in Post me. I don't know the ideal answer here. Maybe Kafka should be here. Or maybe it should be done in Postgres. I don't know. Well, if anybody out there has done a lot of thinking and reading about this, let us know what your thoughts are.
Starting point is 00:36:37 Also, I'm really grateful to the people that are working on this, on the PGI VM thing, because they are looking to contribute this to Postgres. They are looking to get it committed upstream. And I think that's a really nice use of the extension model where you write an extension as a proof of concept. What do you think? Yeah, I agree. I agree. I don't want to discourage and just to criticize. I just see that problems that should be solved also. I solved it a couple of times in different cases, not as open source, just solved it. And it was good. But maybe I remember also discussions, this implementing this autonomous pragma autonomous in Postgres, unfortunately, also didn't finish. So the need of it is there.
Starting point is 00:37:27 Incremental human maintenance probably will be good for mid-size projects. But for larger scale, like dozens of terabytes and hundreds of thousands of transactions per second, multiple standby nodes and so on,
Starting point is 00:37:40 we do need additional thing to be solved as well. I feel it. And currently it's solvable with Kafka or something. Maybe it should be. Maybe this regular discussion, should it be inside the engine or should it be outside?
Starting point is 00:37:56 I don't have exact answer for this. But it's a super interesting topic. I think in general development of materialized use can be helpful for many, many, because data volumes are growing and people want aggregates run faster and so on and so on. Yeah. Yeah, exactly.
Starting point is 00:38:15 More and more products need to have a dashboard, have some, you know, a lot of even transactional products are expected to do a little bit of analytics now as well, aren't they? So it makes a lot of sense. Well, actually, maybe it's not that bad. Maybe I would right now consider PGI VM for a particular project. I would just measure, benchmark how my writes will slow down because of this synchronous propagation approach. Also, it requires an index, right? So additional index, which also will
Starting point is 00:38:45 slightly slow down our writes. And if it's acceptable and if it will remain acceptable in a few years when my data volumes will grow, then I will just use it. Why not? It's also a problem of this very rough
Starting point is 00:39:02 approach, like let's update as a whole. Update is only part of it. It good yeah keeps everything in postgres at the start too well good luck to them and thank you did we cover everything you wanted to here yeah maybe some bottom line uh check benchmark and consider it building your own solution based on just create, table, select, some triggers maybe, maybe Kafka or some asynchronous queue mechanism you have already, if it's reliable, of course. And you can propagate and support your own data structures and queries maybe. Well, like, I mean, if existing solutions don't suit well for your needs, maybe you just need to write your own thing. It's not that difficult, actually. You have rules, you have triggers, so just do it.
Starting point is 00:39:52 Triggers, yeah, for sure. Triggers here, yeah, I think about triggers. But asynchronous nature would be good to have here. Yep. I would just warn, refresh materialized you is good as like quick and dirty solution. It will be dirty because it will hurt when data grows and everything grows and like it will be problematic for large volumes of data.
Starting point is 00:40:16 Even completely. It can get people a long way right there at the beginning. Well yeah, I'm talking about again, I'm talking about dozens of terabytes and dozens or even hundreds of thousands of transactions per second. It's like really big systems. And also materialized companies, they do interesting things. It's commercial, but interesting.
Starting point is 00:40:36 And there is a talk, there are a couple of talks from them. I've checked them. Very interesting. Just to understand how it's implemented, how the vision and so on let's also link their materials we'll do materialized materials yeah good well yeah i hope a few people got some ideas of things they can use these for as well even in the current state
Starting point is 00:40:58 and yeah i look forward to speaking with you next week good see you later bye
