Postgres FM - Materialized views
Episode Date: November 25, 2022

Apologies, Michael's audio is not great in this, we'll hopefully be back to normal next week!

Here are links to a few things we mentioned:

Materialized views (docs)
Refresh materialized view (docs)
Timescale blog post
PlanetScale Boost (content warning: MySQL)
Incremental Materialized Views with pg_ivm (video by Lukas Fittl)
Articles on how to do your own incremental updates(?)
Materialize (company)
Materialize talk
Incremental View Maintenance (Postgres wiki)
Implementing Incremental View Maintenance (mailing list thread)

------------------------

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:

Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:

Jessie Draws for the amazing artwork
Transcript
Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL.
I'm Michael, founder of pgMustard, and this is my co-host Nikolai, founder of Postgres.ai.
Hey Nikolai, what are we talking about today?
Hi Michael, let's talk about materialized views, skipping regular views.
Or do you want to talk about them both?
Well, I was most keen to talk about materialized views, but it's difficult to define them, I guess, without mentioning views at some point.
And the rule system.
Yeah.
Regular views are also interesting.
They have also interesting things to discuss, right?
So, yeah, let's focus on materialized views,
but briefly touch regular views.
And as usual, let's start from very simple things,
definitions for those who are not really familiar with the view system in Postgres, and then try to dive into deeper topics, and maybe the future a little bit.
Sounds good. So I guess let's go back. Let's think about tables, relations in general.
Yeah.
We have ways of storing data in Postgres. We have tables, which hopefully everybody's familiar with,
and then views, which are kind of a virtual table.
So we can set up a query that queries tables,
or yeah, let's stick to that simple version
and call that a view, but that doesn't store anything.
That's like, it's a view into the data.
But our queries, if we're querying a lot of data
in the tables, we're still going to be querying a lot of
data via the views. Whereas
materialized views are...
Let me rephrase: this whole area is based on the fact that any query returns kind of a table.
What is the table?
It's a good question because, for example, in theory,
all rows in the relation, all tuples in the relation should be distinguishable.
In Postgres, a primary key is not required, so they can be indistinguishable.
But if you also remember the hidden columns, ctid, xmax, xmin, which we discussed a couple of times, they are distinguishable again.
But result sets of queries don't have those invisible columns.
So there are a lot of nuances here.
But from a simplified point of view, any query returns a table, right?
Which means we could substitute some table in some query with another query, and vice versa.
We could take some subquery, put a name on it, store it as another query, and just use this alias here, right?
And this is actually a view. And in Postgres, views are implemented using a rule system, which is a very, very old thing that Postgres, and maybe Ingres before it, had many decades ago. This rule system allows users to define rules saying what to do at the planning stage, not at the execution stage like triggers: instead of this, do that. So we can say, when you see a select from view one, just run this subselect instead, and then do what you want. And this rule system allows you to define views.
And also, a good thing, which by the way was my bachelor's thesis very long ago: updatable views. If views are quite simple, Postgres can propagate changes, so you can not only select from them, but you can insert into them, update, delete,
and it will work. And by the way, some intermediate software like PostgREST, and for example, if you're a Supabase user, you are using PostgREST, encourages you to define an API using views, and it relies on the fact that you can change data there, not only read it. And it's quite interesting.
This is views, right? Based on the rule system, never reaching the execution step of query processing, just doing everything at the planning stage.
So quite good, quite interesting.
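For illustration, here is a minimal sketch of what was just described, with hypothetical table and view names: a view that restricts both columns and rows, and, because it is simple enough, is automatically updatable:

```sql
-- Hypothetical example: a view exposing only some columns and rows.
CREATE TABLE users (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email      text NOT NULL,
    is_deleted boolean NOT NULL DEFAULT false
);

CREATE VIEW active_users AS
    SELECT id, email
    FROM users
    WHERE NOT is_deleted;

-- Simple single-table views are automatically updatable,
-- so writes propagate to the underlying table:
INSERT INTO active_users (email) VALUES ('alice@example.com');
UPDATE active_users SET email = 'new@example.com' WHERE id = 1;
DELETE FROM active_users WHERE id = 1;
```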
What's the main problem with views?
Dependency hell.
If you need to change an underlying table or underlying views, if you have a hierarchy, multiple layers, you usually need to recreate your views, so it can be quite painful, especially if your system is under constant heavy load.
Yeah. And I think the benefits of them are extremely different to the benefits of materialized views. So when do you see views used most often?
Well, originally, I remember from my bachelor's thesis,
like 20 plus years ago,
originally one of the key ideas is that
this is helpful to control access.
So you can restrict the number of columns you provide, you expose, via views.
You can also restrict the number of rows.
So you can do it in both dimensions of a table, columns and rows. And then you can provide, for example, writable access via a view but no direct access to the table at all. And this is actually what PostgREST is doing: you're supposed to have a separate schema where you define views, and users of your system don't work directly with tables, they work with them via views.
It's helpful to manage access.
Also, views are helpful sometimes as just shortcuts, like aliases for your queries. Why not?
Yeah, usability, better names. You can give objects different names if you regret a name and don't want to actually do a big migration; I've seen them used for that as well. Right? But the main one seems to be security.
They add complexity to the system. Especially, again, if you have several layers of views: if your schema is changing often, you will notice the overhead of maintaining your views constantly. But what else, where are views helpful? Actually, reporting: you have reports, so you just store the query, and that's it. But it doesn't help with speed anyhow.
Exactly.
It doesn't store any statistics.
It doesn't help you.
You cannot create an index on a view, nothing like that, because it's very ephemeral.
It's like just a query stored under some name.
That's it.
This is view.
Can it ever hurt performance in terms of push down for like predicates and things like that?
I'm not so sure.
Like in some cases, of course, if you have a view,
you have indirectly, you have less flexible ways to optimize your query.
But in most cases, the Postgres planner can choose what to do.
Like there is this join_collapse_limit and another collapse setting.
from_collapse_limit?
Yeah, the from_collapse_limit setting. So basically, Postgres compares different paths easily, and views are just substituted at optimization time. The query which defines the view is already there, right? So basically, I don't see a big problem.
The main problem is only about maintenance during schema changes, for me, from my experience.
I might be wrong, of course, as usual.
In my experience, it's not the only one, right?
Yeah, get in touch if you've had other problems.
So materialized views, another idea.
I would say it's the opposite idea.
Instead of just storing the view definition,
let's store the data and view definition as well, right?
Yeah, importantly.
But it's storing the data,
and at the moment that's at a point in time.
So when the materialized view was created.
So it's that query run at that time.
So that, and again, you'll probably want to catch me before I go into the benefits too quickly.
But the idea is you can run an expensive
like computation of data,
or you can do some work once
and then read from it a lot of times
instead of having to do that expensive work multiple times.
Right.
Actually, you can define just the view, but not have data in it originally.
There is an additional option for that, WITH NO DATA, and you populate it later.
What's the benefit of that?
I couldn't work it out.
Well, I don't know.
Some cases, we can invent some cases synthetically.
For example, you restore from dump and you don't need it yet,
but you already restored the original data.
Because materialized view, it's the data which is derived.
You always can refresh it and have an up-to-date version.
So you restore from dump and you don't want to have it there yet somehow.
Just to consider the dump as already restored fully, because all the original data is already there.
Yeah, so you can speed up your restore, right?
And later, in a separate step. It can be done using the pg_dump / pg_restore pair of utilities: pg_restore has a pair of options, lowercase and uppercase L. One option lists the objects the dump has, if the dump is in the custom or directory format. And the other option allows you to filter out everything you don't want, or filter in what you want to restore.
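A hedged sketch of that workflow, with hypothetical file and database names (the list file can be edited with any text tool):

```bash
# List the contents of a custom-format dump:
pg_restore -l dump.custom > objects.list

# Comment out the matview population entries (';' marks a comment):
sed -i '/MATERIALIZED VIEW DATA/s/^/;/' objects.list

# Restore everything except the materialized view data:
pg_restore -L objects.list -d mydb dump.custom

# Populate materialized views later, in a separate step:
psql -d mydb -c 'REFRESH MATERIALIZED VIEW my_view;'
```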
So you can restore the materialized view definition, but not restore its data, and do it later separately.
This is actually what I did, but I did it because I had issues with, by the way, maybe it's still there.
So if you have multiple materialized views and also multiple layers of them,
a hierarchy, in some cases restore doesn't work because it doesn't know the order.
And it says: I cannot populate data yet for this materialized view, because it depends on another materialized view which is empty.
Or maybe even not created at all, I don't remember details there.
So I remember I wrote a script with like infinite loop.
And if a refresh of a materialized view fails, it's just skipped, and another attempt is done later.
So it was a brute-force, recursive refresh of all materialized views, to fill all layers of them.
It worked very well.
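A rough sketch of such a brute-force loop, done inside Postgres rather than as a shell script; the view names come from pg_matviews, everything else is an assumption:

```sql
DO $$
DECLARE
    mv     record;
    done   text[] := '{}';
    failed int;
BEGIN
    LOOP
        failed := 0;
        FOR mv IN
            SELECT schemaname, matviewname
            FROM pg_matviews
            WHERE schemaname || '.' || matviewname <> ALL (done)
        LOOP
            BEGIN
                EXECUTE format('REFRESH MATERIALIZED VIEW %I.%I',
                               mv.schemaname, mv.matviewname);
                done := done || (mv.schemaname || '.' || mv.matviewname);
            EXCEPTION WHEN OTHERS THEN
                -- likely depends on a matview that isn't populated yet;
                -- skip it and retry on the next pass
                failed := failed + 1;
            END;
        END LOOP;
        -- beware: this loops forever if a refresh fails permanently
        EXIT WHEN failed = 0;
    END LOOP;
END $$;
```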
So, a materialized view stores data.
I agree with you that it's
helpful to avoid, for example,
aggregates, one of the
examples, because we all know that
counting is quite slow in Postgres.
Postgres is a row store, so it's not column store,
unlike some analytical database systems.
And of course, if you have billion rows, count will be slow.
Even if you have index-only scan, it will still be slow.
So you can pre-calculate it for particular groups, for example,
and have a quite small materialized view, and you can build indexes on it. Any custom indexes, right?
Yeah. I think a lot of people don't realize that. It's really cool.
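For example, a sketch with hypothetical names: pre-aggregate slow counts into a small materialized view, then index it like any table. The unique index will also come in handy later, for concurrent refresh:

```sql
CREATE MATERIALIZED VIEW daily_order_counts AS
    SELECT order_date, status, count(*) AS orders
    FROM orders
    GROUP BY order_date, status;

-- Any custom indexes work, because it's backed by a real table:
CREATE UNIQUE INDEX ON daily_order_counts (order_date, status);
```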
Yeah. It's because it's a table, actually. I would like everyone to understand that instead of a materialized view, you have a very good, powerful tool in Postgres: CREATE TABLE AS SELECT.
Yes.
It can be considered a one-time materialized view without refresh capability. It's the same, right?
Yes, but with refresh capability, I think it becomes even more powerful, right?
I would argue. I mean, I agree in many cases, but in some cases I don't agree, and I will tell you why in a moment.
Awesome.
So CREATE TABLE AS SELECT gives you the powerful ability to create derived tables based on one or several other tables, or materialized views actually, and then create indexes and work with it. Very good for experimenting. It's also good internally: it doesn't spam the buffer pool, because a ring buffer is used there, so only a relatively small number of buffers; I always forget the size of it. Of course, the operating system page cache doesn't know what data you are getting from disk, so it will be spammed by this action. But the buffer pool will remain good. I mean, it won't evict useful buffers from your system, which is good for CREATE TABLE AS SELECT. And you can create indexes.
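The one-time equivalent with a plain table, again with hypothetical names, might look like this:

```sql
CREATE TABLE report_2022 AS
    SELECT customer_id, sum(amount) AS total
    FROM payments
    WHERE paid_at >= '2022-01-01'
    GROUP BY customer_id;

-- A derived table is a normal table: index and query it freely.
CREATE INDEX ON report_2022 (total);
```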
And I would like to point to a very important thing I realized not very long ago.
Very simple.
I like to say silly things.
By the way, I want to apologize for the last episode, because I said inserts cannot be HOT, heap-only tuples, and deletes cannot be HOT.
Well, "inserts cannot be HOT" doesn't make sense.
But deletes, they don't even change indexes at all, because indexes don't have visibility info.
So it was a wrong idea even to raise.
A delete, as we also discussed, just puts the value of the transaction ID into xmax in the heap, in table pages, and doesn't do anything with indexes at all.
And so it cannot be HOT, because it doesn't need to be HOT at all. It's already "HOT", actually. So my new statement: deletes are always "HOT" already, because they don't change indexes. Of course, indexes are changed later by vacuum activity, but that's asynchronous, which is good.
So back to our topic,
the idea that you can use CREATE TABLE AS SELECT or create a materialized view, which currently we can consider kind of equivalent.
We will see differences a little bit later.
You can create indexes.
And this is super cool because if join is involved, you cannot create an index on two tables at once.
Index always belongs to some table.
But sometimes, for example, you want to order by two values which are present in two different tables.
Or you want to order by one column and filter by another.
And you could do it using a single index scan: the WHERE clause uses one column, and ORDER BY with LIMIT uses another column.
We could combine them in an index, like a two-column index, for example.
But we cannot because these columns are in different tables, right?
So we cannot create an index for two tables.
But when you do a CREATE TABLE AS SELECT with a join, or you create a materialized view, you finally can have this index,
even if you don't do anything with data except just joining it.
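A sketch of that trick, assuming hypothetical orders and customers tables: the two-column index is impossible on the base tables, but trivial on the materialized join:

```sql
CREATE MATERIALIZED VIEW orders_denorm AS
    SELECT o.id, o.created_at, c.country
    FROM orders o
    JOIN customers c ON c.id = o.customer_id;

-- One column comes from each base table; only possible after
-- materializing the join:
CREATE INDEX ON orders_denorm (country, created_at);

-- Can now be a single index scan instead of two scans plus a join:
SELECT id
FROM orders_denorm
WHERE country = 'FR'
ORDER BY created_at DESC
LIMIT 10;
```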
Yeah.
Right?
I mean, yeah, that's awesome.
I've not seen anyone doing that.
That's such a cool idea.
I think everyone is doing it, just implicitly.
Oh, interesting.
So, in fact, this actually takes us quite quickly to one of the downsides.
That's great, but you are...
It's a minor downside, I guess, for most people,
but that is taking up space on disk, in caches, and things like that.
Well, of course, it does take space.
And create table as select would take space, right?
Yeah.
So it's a minor downside for that benefit,
but it's the reason why...
But the speed benefit can be many orders of magnitude, because you avoid the join and you can have a single index scan, maybe even an index-only scan, compared to two index scans on two different tables and then one of three algorithms to join them. It can be a game changer in terms of performance.
Yeah, I completely agree.
The other... I think, while I'm talking about downsides... Before downsides, just one thing: you have some goal to do something with data, you have a query, and you want to improve the speed of it.
And okay, what is this CREATE TABLE AS SELECT or CREATE MATERIALIZED VIEW, basically? It's a kind of cache.
Instead of considering it as a query cache, you could, for example, store more, have more indexes on this derived table or materialized view, and, for the future, you can support more queries with this cache.
Like, cache not the final result, cache data, right?
So, as we know, a query cache is kind of, it's good until it's not, right?
For example, MySQL removed the query cache from its engine at some point.
It feels like one of those things where there are use cases
that it can hugely benefit.
Like if your data is not changing much
or if you have past data that it's okay.
You're pushing us too fast to the main problem with this.
Okay.
Which is also one of the two biggest problems in computer science, as we know, right?
Cache invalidation.
How to maintain the state of it.
Of course, when we say CREATE TABLE AS SELECT, or we create a materialized view and populate data,
of course, the data is frozen, right?
Yeah, but I mean, if we're talking about a...
Some people have analytics databases from static data that isn't changing at all.
And that, for that, there's no downside.
Well, maybe some cases, right.
But I mostly observe cases where people want to maintain the state of analytical database with very small lag.
Yeah.
Or even they want to have it in the same system,
so in Postgres and so on,
so like an HTAP approach.
Hybrid, right?
Hybrid transactional/analytical processing.
So yeah, the main problem is how to update it
and what Postgres offers.
If you forget about my idea of CREATE TABLE AS SELECT, which I still think is quite good, and it can beat a materialized view in some cases, I will explain why. I'll keep the intrigue.
So we can say
refresh materialized
view, right?
And importantly,
I didn't realize this
was added so early, but refresh materialized view concurrently as well.
Right.
What's the difference?
The difference is one will lead to bloat, and the other will not, right?
So the big difference, and I looked it up.
From the developer's point of view, it's so good. Our queries are working.
From the DBA's point of view: bloated materialized views.
Who knew it can happen as well?
So if we refresh a materialized view without concurrently,
we block selects on the materialized view
while it's refreshing, right?
And that's awful, like, bear in mind, we're normally doing this on a slightly slow query, because that's the benefit of materializing.
It's like VACUUM FULL, basically, right? But yeah, data will be fresh. Our goal is achieved.
Yes, but with concurrently, I guess the equivalent would be pg_repack or something, or CREATE INDEX CONCURRENTLY... sorry, REINDEX CONCURRENTLY would be the equivalent.
So with concurrently, we don't block selects on the past state of the materialized view,
and it replaces it once it's finished.
Is that right?
Yeah, that's right.
But the main point to remember here is that we can refresh it only fully.
We cannot refresh part of it, unfortunately.
Yeah.
If we use regular materialized view, like original Postgres provides.
Because just refresh materialized view concurrently, that's it.
There are no options to say that I know 90% of my data hasn't changed because it's very old data.
So I probably could do it much faster. No: you refresh everything, and it can take a lot of time, but concurrent selects keep working. It's good.
But we accumulate bloat of course.
And indexes: if we follow my approach, let's not use a materialized view per query, let's define a materialized view with some underlying data, maybe raw data, and support more queries, more kinds of queries, with it. It means that you probably have multiple indexes there.
Oh, we forgot to say that to support concurrent refresh...
We need a unique index.
Exactly.
On one of the columns, right?
At least one.
So already one index is there.
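Continuing the earlier hypothetical daily_order_counts example, the two refresh modes look like this:

```sql
-- Plain refresh: rewrites the matview but blocks concurrent SELECTs.
REFRESH MATERIALIZED VIEW daily_order_counts;

-- Concurrent refresh: readers keep working against the old contents,
-- but it requires a unique index, and it leaves bloat behind, since it
-- applies the diff as deletes and inserts instead of rewriting.
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_order_counts;
```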
But if you follow my approach,
let's have raw data and support more queries,
you probably have multiple indexes there,
and their health will degrade as well.
So all problems we have with tables
which receive significant numbers of updates and deletes, we'll have here as well.
So repacking or something.
And here is an idea: why not just recreate it, if we still need to populate it fully anyway, right? Maybe we can just recreate the materialized view with a different name and then quickly swap them in a single transaction, or just use CREATE TABLE AS SELECT.
If it's recreated every time we want to refresh, what's the point of having a materialized view? Just maybe convenience, because its definition remembers the query, right? The database system stores the definition. It's better, maybe. That's it.
In some cases, I would seriously consider recreating it every time, just not to deal with bloat in the data part of it and in the indexes we have on it.
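A sketch of the rebuild-and-swap approach with plain tables (hypothetical names); DDL is transactional in Postgres, so readers see the switch atomically:

```sql
BEGIN;

CREATE TABLE report_new AS
    SELECT customer_id, sum(amount) AS total
    FROM payments
    GROUP BY customer_id;

CREATE INDEX ON report_new (total);

-- note: DROP/RENAME take brief AccessExclusive locks
DROP TABLE IF EXISTS report;
ALTER TABLE report_new RENAME TO report;

COMMIT;
```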
But of course, if we do it every minute, for example,
I would check the bloat of system catalogs as well,
because if you have a lot of DDLs,
if DDL is a very frequent thing to happen, you might end up having a very bloated pg_class and so on. So it's also worth remembering. And pg_repack cannot repack system catalogs, you need to do VACUUM FULL there. Fortunately, it's usually very fast, but you don't want to have 99% bloat or something, because it will affect basically all queries, since planning will be slowed down, and so on, many things.
So, bottom line: refresh is good,
but it's a very, very rough tool. One of our customers enjoyed using them until some point, and then said it's like having a huge hammer and applying it to very small things.
Like, it's a very rough tool, very rough.
So it saves you sometimes, but then, if only small portions of your data are changing, maybe you will not be satisfied with this approach of always refreshing it fully.
And here also, by the way, could materialized view be updatable?
Do you mean increment? What do you mean?
Updatable means, like, the same as we discussed for regular views: you can insert into it.
I don't see how that would make sense.
Well, it might make sense at some point.
But maybe it's too exotic.
And I suspect other systems have it, but I don't remember.
Let's skip it.
Just a question to know.
Just an entertaining question.
But as for the main topic, what would we need? We would need to
be able to update only parts of the materialized view. And there is some theory described on the Postgres wiki page, and the discussion has been happening for many, many years, at least one decade, I would say, to have incrementally maintained, incrementally refreshed views.
So we want to avoid full refresh, right?
We want to adjust only part of it.
And I already brought this up: a materialized view is like CREATE TABLE AS SELECT, but with the rule system inherited from views. Remember this: you can use CREATE TABLE AS SELECT and then maintain the state using triggers or some other things, and have incremental updates, if you understand the logic. Probably it's a good way to go, right? Just maintain it from your application.
The question for this approach will be: do you lose data? And do you slow down your writes on the base tables?
Exactly, what are the trade-offs?
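A minimal sketch of the hand-rolled, trigger-based route, with hypothetical names; this is the synchronous flavor, so it answers "do you lose data?" with no, at the price of slowing down writes:

```sql
CREATE TABLE order_counts (
    customer_id bigint PRIMARY KEY,
    orders      bigint NOT NULL
);

CREATE FUNCTION bump_order_count() RETURNS trigger AS $$
BEGIN
    -- upsert the per-customer counter inside the writing transaction
    INSERT INTO order_counts AS c (customer_id, orders)
    VALUES (NEW.customer_id, 1)
    ON CONFLICT (customer_id)
    DO UPDATE SET orders = c.orders + 1;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_count_trg
AFTER INSERT ON orders
FOR EACH ROW EXECUTE FUNCTION bump_order_count();
```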
It's actually quite an interesting time to talk about this.
We could talk about the future and incrementally updating materialized views in Postgres, but I think it's also worth talking about a couple of existing tools and systems that have partially tackled it, or tackled it for their own specific case.
The reason... well, this was a requested topic, so thank you to the person who requested it. But there was also an announcement, not in the Postgres space,
but by a company called PlanetScale
that we've talked about briefly here before
that announced a boost feature
that looks very similar to materialized views,
but with incremental updates.
And a boost is easy to achieve if you had a join, or multiple joins, with various kinds of index scans, and then suddenly you have, I'm talking in Postgres terms, a single index-only scan. Of course, it will be like a 10,000x speed improvement, or even 100,000x sometimes.
They definitely went for a click-baity title on it.
But it's kind of what people sometimes want, right?
They've got a slow query and they want a quick fix.
I'm not concerned about the additional space usually, but sometimes I am.
But the price of maintenance, how maintenance is organized.
Do we have load issues in Postgres if we do this?
The other one I wanted to talk about as an existing tool: Timescale have their continuous aggregates feature as well, and we don't need to discuss them in depth here, but there's a good blog post that they've done, actually covering some of the topics we've just discussed about views, materialized views, and continuous aggregates. They have a solution that's designed for time-series workloads, where you're probably not going to have loads of data in the past changing, but it can cope with that. And they're being improved.
So that's an interesting take.
If that's something you want and need, and you're okay with the trade-offs, it now exists, and it's in this area, this topic of incrementally updating.
If you're a timescale user,
you should go and check continuous aggregates.
This is one of the key features, I would say.
I agree.
But there is also a project,
and there is discussion happening
for almost four years already in pgsql-hackers.
Unfortunately, I don't know the details, but I just see it.
The "Implementing Incremental View Maintenance" thread was started in December 2018 by Yugo Nagata.
And the same author created the pg_ivm project, which is an extension, an open-source extension. It's quite good, interesting work.
And this is an attempt to follow this approach: let's have incremental updates, automatic partial updates of the view. As I understand, like, I've spent some time understanding how pg_ivm works. First of all, there are restrictions, of course, on the queries that can be used to define such views.
And also, as I understand, there are two approaches.
First is synchronous propagation of data change, which will slow down our original insert or update or delete, right?
But still, it's good to have it automated.
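Based on my reading of the pg_ivm README, usage looks roughly like this; treat the exact create_immv signature as an assumption that may vary between versions:

```sql
CREATE EXTENSION pg_ivm;

-- Creates an incrementally maintainable materialized view; pg_ivm
-- installs triggers so base-table changes update it synchronously.
SELECT create_immv(
    'customer_order_counts',
    'SELECT customer_id, count(*) AS orders FROM orders GROUP BY customer_id'
);
```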
Yeah, I think of that as immediately updating materialized views.
So as soon as there's change... Synchronously, immediately, right. It means that at commit time, like,
our transactions which changed the original table will slow down, of course. But it's good that we
don't need to write a trigger to do it ourselves. And the second way is to have a refresh. I didn't get it, maybe it's still a full refresh like before, or not, I didn't get this part.
I think they use triggers, right?
Okay. So if I updated only one row out of a billion, the second option will propagate the change very fast compared to REFRESH MATERIALIZED VIEW.
I didn't realize pg_ivm had two options. I thought they might be talking more philosophically about there being two options.
That was my understanding, but I definitely might be wrong. There's a good video by Lukas Fittl on this as well; we mentioned his channel last time. I'll link up that video.
We keep mentioning this channel. Unlike our podcast, it's only five minutes, always. Much faster.
For those poor dog owners who walk dogs, only five minutes.
Yeah, maybe ones with little legs. Right.
So why, I think... Well, still, this is great, but I have two big questions for this direction as a whole. First, can we do it asynchronously? Propagated always, but asynchronously, like the first option but asynchronous: I don't want my original transaction to slow down. And second, this is usually needed when we have a lot of data, and when we have a lot of data we should always use partitioning. I want my materialized view also partitioned, maybe, right?
So what about this: for example, we partitioned by month, and we have data from five years ago, January five years ago, for example. It can change, but very infrequently. So I would refresh it much less often. I don't want my original tables to be partitioned but the materialized view to be a single thing.
It means that if I refresh it fully, it's a very slow query and it will affect vacuuming
and will affect whole database, leading to bloat in all tables.
If I update it partially, well, I'm fine, actually, maybe fine to have.
Well, vacuum will also need to work here as well, right?
And if it's a single huge table, all problems we have with huge tables which are not partitioned
will hit this materialized view as well, even if we have partial incremental view maintenance.
Yeah. So if I understand you, we're kind of talking about two extremes
and whether there's a middle ground.
So there's the extreme of having to refresh everything each time,
which we currently have in Postgres.
There's the other extreme of synchronous, immediate, trigger-based updates so that our materialized
view is updated on each transaction, but that comes with write overhead. Is there a middle-ground option where we can...
There is a perfect solution. It's not middle ground, it's a perfect solution. The perfect solution should propagate data asynchronously, not slowing down the original transaction, and not losing data.
With some small lag,
ideally also observable.
So I would put this lag to some,
I don't know, monitoring.
And second, the result should be partitioned.
This is where you lost me a little bit.
So my understanding of materialized views
is normally that they're hugely smaller
than the original data.
So I mean, how big are these materialized views?
I don't agree. I've seen very big materialized views.
And again, the idea of a big result, a big materialized view, comes from my idea: let's not create a materialized view for each query. Let's create it, for example, to avoid joins,
Yeah.
Right. And then run a lot of various queries on top of it. Quite a good idea. It's like denormalization, automated denormalization, basically. But even if people just use it for a final query, just to cache the result of a query, I still saw huge materialized views. Huge.
Interesting.
It happens.
Oh, great.
Sounds good.
If it's huge, it should be partitioned
to benefit from it.
But I would highlight the key point here. Postgres, unfortunately, doesn't have what Oracle has, PRAGMA AUTONOMOUS_TRANSACTION: the ability to initiate a transaction which will be detached, from a trigger for example. So I want to have an insert, and the database will guarantee it will be finished, but it's not my problem anymore; I finish my transaction and another transaction happens in a different backend. You can emulate it with dblink or something, but it doesn't look like a clean solution.
And currently I see what ideal system I would build. I would use tables. I would store the definition, maybe not in Postgres, maybe inside. I would maybe partition the result. But the first thing I would solve, of course: I would propagate changes in an incremental fashion, not refreshing the whole view.
And by the way, there are articles explaining how to do it.
And there are also tools for various languages which help you to do it.
But the first thing I would do: I would propagate changes asynchronously, through probably something like Kafka, I don't know. It can be implemented in Postgres, but it requires additional knowledge; you need to know how to deal with bloat. It's kind of a queue inside Postgres. Maybe we should cover this topic separately. It's possible. But if it's in Kafka, it's also good,
or some other system which doesn't lose data, allows me to control the lag and just delivers this signal of change separately so I can update
my target.
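A hedged sketch of such a queue inside Postgres, with hypothetical names: writers enqueue cheaply, and a separate worker drains batches and applies them to the target (reusing the counter table from the earlier trigger sketch):

```sql
CREATE TABLE change_queue (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id bigint NOT NULL,
    delta       int    NOT NULL
);

-- Worker loop body, run periodically in its own connection:
WITH batch AS (
    DELETE FROM change_queue
    WHERE id IN (
        SELECT id FROM change_queue
        ORDER BY id
        LIMIT 1000
        FOR UPDATE SKIP LOCKED
    )
    RETURNING customer_id, delta
)
UPDATE order_counts c
SET orders = c.orders + s.delta
FROM (
    SELECT customer_id, sum(delta) AS delta
    FROM batch
    GROUP BY customer_id
) s
WHERE c.customer_id = s.customer_id;
-- (assumes counter rows already exist; a real version would upsert,
--  and would monitor queue length as the observable lag)
```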
And the project I sent you the link to, this company called Materialize, they solve it as a commercial solution.
And as I understand it, they use Kafka and Debezium.
Everyone can do it, but maintaining Kafka
requires additional efforts.
So that's why maybe I would do it in Postgres
itself. It depends on the system.
If you have many, many databases,
having Kafka as something
additional to Postgres, it would
be justified.
But in this case, imagine it's an ideal
world. The target is good.
I mean, it doesn't suffer from having huge tables and very long vacuuming.
Indexes are usually smaller because it's partitioned.
Index maintenance there can be applied in regular fashion.
And changes are propagated atomically, like every small change is propagated, guaranteed, with a small delay.
And, for example, we have a lot of posts, comments, like social media.
Or, for example, Twitter.
There's Mastodon, right? People migrated. It's sitting on Postgres as well, right?
It is, yeah.
We have tweets, for example, answers, replies, likes, everything. And we can have counters working very, very well here, using this approach. So I wonder why development goes in this direction, which will not work well for very large volumes of data. Well, I see the use cases where it will be useful, this incremental view maintenance. But why are these two big problems ignored? First of all, the ability to propagate changes asynchronously.
I think this problem should be solved first, to allow people to define triggers which will be asynchronous, for example, or something. Maybe it should be done outside of the database.
It's not a closed topic for me. I don't know the ideal answer here. Maybe Kafka should be here. Or maybe it should be done in Postgres. I don't know.
Well, if anybody out there has done a lot of thinking and reading about this,
let us know what your thoughts are.
Also, I'm really grateful to the people that are working on this,
on the pg_ivm thing,
because they are looking to contribute this to Postgres.
They are looking to get it committed upstream. And I think that's a really nice use of the extension model where you write an extension as a proof of concept.
What do you think?
Yeah, I agree. I agree. I don't want to discourage, just to criticize. I just see problems that should be solved also. I solved this a couple of times in different cases, not as open source, just solved it. And it was good. But I remember also discussions about implementing this PRAGMA AUTONOMOUS_TRANSACTION in Postgres, which unfortunately also didn't finish. So the need for it is there.
Incremental view maintenance probably will be good for mid-size projects. But for larger scale, like dozens of terabytes and hundreds of thousands of transactions per second, multiple standby nodes and so on, we do need an additional thing to be solved as well.
I feel it.
And currently it's solvable with Kafka or
something. Maybe it should be. It's the regular discussion: should it be inside the engine, or should it be outside? I don't have an exact answer for this. But it's a super interesting topic. I think in general, development of materialized views can be helpful for many, many people, because data volumes are growing, and people want aggregates to run faster, and so on and so on.
Yeah.
Yeah, exactly.
More and more products need to have a dashboard; you know, a lot of even transactional products are expected to do a little bit of analytics now as well, aren't they?
So it makes a lot of sense.
Well, actually, maybe it's not that bad.
Maybe I would right now consider pg_ivm for a particular project.
I would just measure, benchmark how my writes will slow down because of this synchronous propagation approach.
Also, it requires an index, right?
So additional index, which also will
slightly slow down our writes.
And if it's acceptable
and if it will remain acceptable
in a few years when my data
volumes will grow, then
I will just use it. Why not?
It also addresses the problem of this very rough approach of updating it as a whole: it updates only part of it. It's good.
Yeah, and it keeps everything in Postgres at the start, too. Well, good luck to them, and thank you. Did we
cover everything you wanted to here?
Yeah. Maybe some bottom line: check, benchmark, and consider building your own solution based on just CREATE TABLE AS SELECT, some triggers maybe, maybe Kafka or some asynchronous queue mechanism you have already, if it's reliable, of course. And you can propagate changes and support your own data structures and queries, maybe.
Well, I mean, if existing solutions don't suit your needs well, maybe you just need to write your own thing.
It's not that difficult, actually.
You have rules, you have triggers, so just do it.
Triggers, yeah, for sure.
Triggers here, yeah, I think about triggers.
But asynchronous nature would be good to have here.
Yep.
I would just warn: REFRESH MATERIALIZED VIEW is good as a quick and dirty solution.
It will be dirty because it will hurt when data grows and everything grows, and it will be problematic for large volumes of data, eventually.
It can get people a long way right there at the beginning.
Well yeah, I'm talking about again, I'm talking about dozens of terabytes and dozens or even hundreds of thousands
of transactions per second.
It's like really big systems.
And also the Materialize company,
they do interesting things.
It's commercial, but interesting.
And there is a talk,
there are a couple of talks from them.
I've checked them.
Very interesting.
Just to understand how it's implemented, the vision, and so on. Let's also link their materials.
We'll do. Materialize materials.
Yeah, good. Well, yeah, I hope a few people got some ideas of things they can use these for as well, even in the current state. And yeah, I look forward to speaking with you next week.
Good. See you later. Bye.