Postgres FM - Real-time analytics
Episode Date: February 17, 2023

Nikolay and Michael discuss real-time analytics — what it means, what the options are, and some tips if you're trying to implement it within Postgres.

Here are links to a few things we mentioned:
Loose index scan / skip scan with recursive CTE (wiki)
Zheap (wiki)
cstore_fdw (now part of Citus)
Timescale compression docs
Hydra founders interview (on Postgres TV)
Materialised views episode
pg_ivm
Timescale continuous aggregates docs
ClickHouse
Snowflake
Replication episode
Timescale bottomless storage on S3 (blog post)
pg_partman
Querying Postgres from DuckDB (blog post)
Heap blog (filter by "Engineering")
Incremental View Maintenance (wiki)
PostgreSQL HyperLogLog
Faster counting (by Joe Nelson on the Citus blog)

------------------------

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Hi Michael, let's talk about one more controversial topic.
Can we run analytical queries in Postgres and what to expect from it?
Yeah, right. So like real-time analytics workloads,
maybe in combination with transactional
workloads, sometimes people say hybrid, but I think it's a fascinating topic. And luckily,
another listener request. So thank you. And so it's taken us a while to get to this one,
but really appreciate it.
Hybrid is just a term OLTP guys invented to pull everything to their side. I mean, it's still OLTP, but let's run aggregate queries in a regular OLTP database
and try to achieve better performance, right?
Let's try to get rid of analytical databases
like Vertica or Snowflake and so on.
This is HTAP.
Because vice versa, it works less often, right? Like analytical systems for OLTP, it usually doesn't work, because the latencies they offer are awful.
Well, we're diving straight into the deep part of the conversation, but I have seen some go after trying to do both, like SingleStore and a few others. I was going to bring this up later on, but I do see some analytical providers like ClickHouse starting to add features that allow them to mitigate some of the trade-offs.
Do they already have updates and deletes? I haven't checked, but do they work okay? Do they already have good isolation for sessions?
I mean, isolation control, snapshot isolation, and so on.
Like analytical data systems were not designed to be ACID compatible.
They don't need it.
They need to sacrifice some aspects for the sake of processing huge data volumes.
So I doubt it would be so easy. For example, Vertica: I cannot imagine someone runs Vertica to handle, I don't know, social media or e-commerce kinds of workloads. I mean, user-facing workloads, when customers expect all pages to work below 100 milliseconds,
right?
Better one millisecond.
On average, it should be below one millisecond on server side.
And I cannot expect that someone can run Vertica for that
or Snowflake.
It's strange.
So that's why I say HTAP is OLTP with an attempt to improve the performance of analytical queries. You don't agree, obviously, right?
Well, I think this is a really interesting time, and maybe that's true at the moment, but I think there's an interesting, almost a race, between OLTP systems adding the ability to do more analytical work and the more analytical side of things trying to do more transactional processing. At the moment, most people either have two separate systems, extracting the data from the transactional side and putting it into the analytical side, or they try to get by as long as possible with the transactional side, optimizing for some, maybe smaller, amount of analytical workload.
But I wanted to start by thinking, if you think of real-time analytics, what do you think that includes?
Or what's the first thing that it makes you think of?
My bachelor thesis in university was about ETL and how to clean up data, to get rid of dirt, to have a unified schema for all data from various sources. And then my first job was in a company called Program Bank. We programmed software systems for banks using Delphi and Oracle. It was an interesting time, more than 20 years ago already. Banks had nightly operations and batch processing during weekends. And then later, when I started to fly a lot, Europe to US and back and forth,
since I have bank accounts on both continents, I noticed that even 20 years later, some banks still have nightly operations. So you cannot access your bank account during your daytime in Europe when it's night in the US, and so on. So when I hear real-time analytics, I imagine that this approach should go away, and systems should be accessible always. And even if you need to provide some reports, like monthly reports or something, it should be done asynchronously at least, or better, in the same system: okay, we have an OLTP system and we know how to handle analytical queries which process, like, a month of data. So this is what real-time analytics means to me: not like banks do it. Some banks, anyway; there are new banks which already don't have this approach and they follow good practices.
Yeah, well, I mean, some of those sound like they're not even showing you the transactional data,
never mind the kind of aggregated data.
But for me, this makes me think of, let's say you're looking at a software as a service dashboard
and you want to look at your total of some number or some average of some number,
but it includes up-to-today and up-to-this-minute information. So you might be looking at page views on a website, but it includes the last hour or the last few minutes, and if somebody visits your website now, they start showing up in that dashboard as well. There's not really a delay.
There is no delay. And it also doesn't go down, the view count, unless some bots are involved.
And also if you open two tabs of the same page you have at least similar numbers, right?
You know the joke: MongoDB reached 1 million downloads, but according to one of the nodes it's 999,999 downloads and according to the other one it's 1,000,002 downloads, right? You know this one?
I hadn't heard that joke before, no.
It's an eventual consistency joke. So yes, I agree. And this is historically a very difficult task for a row store like Postgres, with the MVCC implementation Postgres has, with all that tuple, or tuple, I forgot the pronunciation, processing using the vacuum process, and bloat issues, and index amplification and so on.
So historically, Postgres is known for very slow SELECT count(*) or SELECT count(1),
doesn't matter in this case.
But it was improved.
Some aggregate queries can benefit from index-only scans.
Some can benefit from loose index scans.
But unfortunately, Postgres doesn't have an automatic loose index scan.
You need to program it using recursive CTE.
So you need to tell Postgres how to skip certain subtrees in the B-tree.
But many aggregate queries can be improved.
So it's possible to achieve with Postgres.
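(For reference, a rough sketch of that recursive CTE trick, assuming a hypothetical orders table with a B-tree index on customer_id; it emulates a "skip scan" for SELECT DISTINCT customer_id by jumping from one value to the next through the index:)

-- Hypothetical table: orders(customer_id int, ...), indexed on customer_id.
WITH RECURSIVE skip AS (
  (SELECT customer_id FROM orders ORDER BY customer_id LIMIT 1)
  UNION ALL
  SELECT (SELECT o.customer_id
          FROM orders o
          WHERE o.customer_id > s.customer_id
          ORDER BY o.customer_id
          LIMIT 1)
  FROM skip s
  WHERE s.customer_id IS NOT NULL
)
SELECT customer_id FROM skip WHERE customer_id IS NOT NULL;
-- Each step is a tiny index scan, so this stays fast even on a huge table.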
Yeah, you bring up a good point about, I mean, the design decisions that make Postgres so good at concurrently doing inserts, updates, deletes.
The design decisions that make that really efficient make running queries on, let's say, one column across all customers, if you're trying to sum those or average those or do some aggregation, fundamentally difficult, whereas a lot of these analytical systems are designed for the exact opposite. And as such, we've got almost a hard trade-off here, right? Like, I don't see an obvious way of having the best of both worlds.
It's possible with storage engines, and there's been ongoing work for many years already, and it looks like sometime soon we will have something. There are projects, I always mix them up, zedstore, zheap, I don't remember, attempts to have a pluggable storage engine which will provide a column store.
There was also a project from Citus called cstore_fdw maybe, like ColumnStore.
But it was using FDW.
So it's not like native storage.
It's not like in MySQL when you say
this table needs to use this storage engine
and that's it.
But in general,
a strategic decision in the Postgres project was made to have multiple storage engines, like to have an API for that and so on. So in the future, we can expect that it will be easy to define some table as column-store based. Let's talk about differences. You raised a good question about when one works better than the other. If we have OLTP, we usually need to extract user data with all columns of that user,
including like email, last name, and so on and so on. And of course, if these columns are stored
ideally in the same data page, Postgres has usually like eight kilobyte data pages.
It's just one buffer read or hit.
Speaking of buffers, right?
Our favorite topic.
But column store, like ClickHouse,
they store each column of a table in a separate file.
In this case, you need to deal with many files
and then extract this data
and then assemble it in memory to provide the row. So for OLTP queries dealing with a single row, or, if you have pagination, like 25 rows ordered by some column, it becomes less efficient to extract the data. But for queries like, let's calculate the sum of some orders or some aggregates for a month, you deal with just a single file. And if data is stored in time order and you need just one month, you deal with fewer pages. It's all about I/O as usual,
right? And the amount of memory you need to allocate and work with.
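(To make the contrast concrete, a small sketch with a hypothetical orders table: the first query is the OLTP shape that a row store like Postgres answers with a couple of buffer reads via an index; the second is the analytical shape where a column store only needs to read two columns:)

-- OLTP shape: fetch one full row; with an index on id this is one or two page reads.
SELECT * FROM orders WHERE id = 12345;

-- Analytical shape: two columns across many rows; a column store reads only those
-- two column files, while a row store has to read every page of the table.
SELECT date_trunc('month', created_at) AS month, sum(order_total)
FROM orders
WHERE created_at >= '2023-01-01' AND created_at < '2023-02-01'
GROUP BY 1;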
Yeah, and while we're talking about systems that allow for you to pick your trade-off
maybe on a per-table basis,
we've also, I think, probably should mention that there are extensions attempting to do some level of this kind of thing as well, the most obvious one being Timescale.
It's not exactly column store, I don't think,
but I think it works in a similar way. It's very clever. But the other thing that they take advantage of, and a lot of the other analytical systems...
You mean compression? In Timescale, for time series, which uses the knowledge about what is stored in a column and then applies this to compress it, right?
Yeah, and that's exactly what I was going to say. Because data within a column is very likely to be similar across rows,
it compresses extremely well.
And another thing that can reduce I/O for those analytical queries is that there's just less data to read.
I encourage everyone to read their articles,
especially the one explaining the details about compression approaches for time series in Timescale. I read it from time to time and keep recommending it. It's a brilliant article. And in general, their work is brilliant for these types of data.
But I wanted to mention also Hydra, which is a very new project aiming to build an analytical type of Postgres, like a Postgres flavor. And they also work in this area. They took this cstore from Citus, the column store via FDW, and are improving it and so on. And we had an episode on Postgres TV, an interview with the founders. It was one month ago or so, but it's still getting a lot of good feedback; they shared that they keep getting personal feedback. That's great. I mean, it's an interesting project and I encourage checking out this interview as well.
So yeah, by default, I think all people should start from Postgres, right? And then we don't want to be like banks, right? We don't want to have ETL every night. It doesn't scale well, this process. I mean, at some point it will take more and more hours, and you don't have so much time, and it's hard to manage this process.
And also customers suffer
because they want up-to-date info.
No, you need to wait until tomorrow
to get this info.
No, it's not good.
And customers can be internal in this case
because, for example,
business owner might want to see
some reports right now,
but we will tell them to return tomorrow.
No, no.
So we start with Postgres and then we have two options, basically.
The first is to take something separate for analytics, and this is our default option still.
OLTP Postgres DBAs usually recommend, and I usually do the same, don't use Postgres for analytics.
This is a mantra, actually.
We use this mantra all the time, because the problem is not only it's less efficient for heavy queries with aggregates,
it also affects the health of your Postgres.
Because if you run it on the primary node, and you have a very long query lasting half an hour, for example, and during that time other queries delete tuples in any table, autovacuum cannot delete those dead tuples until your long query finishes, because you hold the xmin horizon, right?
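(As a side note, a quick way to spot sessions holding back the xmin horizon is a query against pg_stat_activity along these lines:)

SELECT pid, state, backend_xmin, xact_start,
       now() - xact_start AS xact_age,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;
-- The oldest backend_xmin is what prevents vacuum from cleaning up recently dead tuples.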
So, yeah, we've talked about this a few times, and I think this might be the one time I disagree with you, because we're talking about real-time analytics here. I think if we're talking about things that update, that are user-facing, people are going to have problems way before half an hour in terms of user experience. So I think we're talking more in the region of: the customers I see doing this kind of optimization work within Postgres, it tends to be that they're trying to get something to load within, ideally, you know, a few hundred milliseconds, or five seconds, kind of 10 seconds as an upper limit of how long somebody will wait.
Maybe if they're changing some drop downs in a UI, and then they want to refresh their data.
If we're talking about that being able to finish within a few seconds, I think that can still live
within Postgres for a while. Maybe it doesn't scale forever, but it's smaller sizes.
I definitely don't see that.
You are right.
You are right, but only until some point.
Okay, so you basically say real-time analytics
means that we have a requirement
to have statement timeout 15 seconds, for example,
for all user-facing queries.
That's great.
I agree.
But to achieve that,
you probably will face challenges
when you have select count or select sum
or anything lasting too long.
And then you will implement some denormalization, asynchronously, via some queuing mechanism through Kafka or something, to pre-process some data. For example, on an hourly basis. And those queries you won't limit, right? And those queries will hold your xmin horizon. Backend queries, how to name them, or something: not user-facing, some batch processing queries, something like that.
And to keep everything in Postgres, right, you need to have some pre-processing,
asynchronous and parallel,
queued and so on.
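(Coming back to the per-query budget idea, one hedged way to enforce it, assuming hypothetical roles app_web for user-facing traffic and app_batch for background jobs:)

-- User-facing connections get a strict budget; batch connections get a looser one.
ALTER ROLE app_web SET statement_timeout = '15s';
ALTER ROLE app_batch SET statement_timeout = '30min';
-- Also avoid batch sessions sitting idle in transaction and holding the xmin horizon:
ALTER ROLE app_batch SET idle_in_transaction_session_timeout = '5min';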
And we've talked about,
we should probably give a shout out to a previous episode we did
on materialized views,
which are one kind of pre-processing tool.
Probably going to give them
a couple more shout outs in this episode.
pg_ivm?
I was going to say Timescale, actually, with their continuous aggregates.
Timescale is great.
Timescale is great.
Again.
Well, the continuous aggregates are kind of this pre-processing, right,
where you have this continuous aggregation of new data coming in
and then that getting filed away.
So I think there are some ways of avoiding what you're talking about,
but, yeah, only up to a point.
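(For reference, a sketch of what a continuous aggregate looks like, following Timescale's documented syntax; page_views here is a hypothetical hypertable:)

CREATE MATERIALIZED VIEW page_views_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', viewed_at) AS bucket,
       page_id,
       count(*) AS views
FROM page_views
GROUP BY bucket, page_id;

-- Keep the most recent hours refreshed automatically in the background.
SELECT add_continuous_aggregate_policy('page_views_hourly',
  start_offset      => INTERVAL '3 hours',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '15 minutes');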
Continuous aggregates in Timescale, and this is a part of their open source offering. So that's great. And I know cases where it works very well. I also made a joke today: someone from Timescale offered me to register in Timescale Cloud. First of all, of course I'm registered. And I need to say that I said, no, don't go there, because there is no bottom in Timescale Cloud. It was a joke about their bottomless feature; we probably also discussed it. Of course, go and register on Timescale Cloud, it's great, a lot of very new approaches, very interesting approaches. And I think Timescale is one of the companies which drive the Postgres ecosystem forward using
innovation.
That's great.
Right.
So if we want to keep everything in Postgres, we will have these types of challenges: long
running queries.
Sometimes we can avoid them.
If we choose smaller batches, if we process this, if we implement this asynchronous process
more often, every minute,
for example. But it's also worth mentioning that it's hard to engineer, because Postgres doesn't have what Oracle provides, PRAGMA AUTONOMOUS_TRANSACTION, something like: I want to detach my auxiliary transaction for background processing. For example, we inserted some data and we want to increment some counter, and we would like to do it asynchronously, not blocking the main insert or the main update. And to do that, you need to engineer it using some Kafka or, I don't know, RabbitMQ (I cannot recommend RabbitMQ) or Sidekiq or something.
By the way, there is a new project someone released a couple of days ago,
which is considered as a replacement for PGQ.
But it's another story. Queues in Postgres, we need to discuss it one day.
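(For the "increment a counter asynchronously" idea, a common in-Postgres pattern, sketched here with hypothetical tables, is a plain jobs table drained by a background worker using SKIP LOCKED:)

-- Hypothetical queue table, written in the same transaction as the user-facing insert.
CREATE TABLE counter_jobs (
  id         bigserial PRIMARY KEY,
  account_id bigint NOT NULL,
  delta      int    NOT NULL
);

-- Background worker, each iteration a short transaction so the xmin horizon moves on quickly:
BEGIN;
DELETE FROM counter_jobs
WHERE id IN (
  SELECT id FROM counter_jobs
  ORDER BY id
  LIMIT 100
  FOR UPDATE SKIP LOCKED
)
RETURNING account_id, delta;
-- (in application code) apply the returned deltas to the aggregated counters table, then:
COMMIT;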
But if you manage to engineer it,
probably you can keep your transactions below some duration. But if you don't, you have trouble, and you cannot even offload some long-running SELECTs to standby nodes, because either you need to be a single user on those standby nodes, since other users will notice big lags while your query lasts, or you need to turn on hot_standby_feedback, report to the primary, and have all the bloat issues again.
So what to do?
Back to my idea that we have two options.
One option is find a way to remain in Postgres using some optimization,
some of these column store
approaches, some innovation, or to use a regular approach, but with continuous ETL,
basically, which is called CDC, change data capture. So you take something, you can take,
for example, ClickHouse, and use it as a secondary database right from your application.
There might be problems with consistency
because probably Postgres knows that you made
a thousand sales today,
but ClickHouse thinks it's slightly less or more.
Usually, Postgres is considered as a primary source of truth
being an OLTP system.
And then we have a logical replication
to Snowflake, Vertica, or ClickHouse or anything else.
There are ways to do it. But you need to deal with logical
replication. This is challenging. It's possible, but it's
challenging. I must admit it's improving; every year Postgres
receives improvements in this area. And there are tools,
including commercial tools.
Open source is Debezium, but there are commercial tools.
I cannot recommend Qlik.
Qlik is bad.
Former Attunity, I think, right?
It's just a wrong implementation of the Postgres plugin.
But, for example, Fivetran is quite good
if you consider commercial systems.
And we have a whole episode on this, right?
We have, right; it's a different story, replication. And so our option for real-time analytics is to improve queries. Let's talk about the ways to improve them.
Well, another thing I was going to mention that I think is
important here is: generally, once we have performance issues, or once we have a decent amount of data we're doing analytics on, we probably are in the realm of needing or wanting to use partitioning of some description.
Oh, partitioning is huge.
Partitioning is a must have if you need to deal with big data volumes. Rule of thumb is you need partitioning for a table which exceeds,
for example, 100 gigabytes.
Yeah, which is a lot, but also not that much these days, right?
Right.
And partitioning is needed in many, many, many aspects,
including data locality.
So all data lives in particular pages,
data blocks on disk,
and they are all frozen,
very rarely touched.
So AutoVacuum just skips them very quickly.
Yeah.
And also indexes for those old partitions
are in like almost frozen shape,
quite efficient,
maybe rebuilt,
not degrading.
And these pages are barely present in our buffer pool
and operating system cache. Because if everything is mixed and each page has a couple of new tuples and a lot of old tuples, it blows up your shared buffers. This is one of the biggest reasons
to have partitioning. I just recently revisited this
topic several times with
a couple of customers.
And also,
in Postgres 14, I think, REINDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY got an optimization not to hold the xmin horizon. It was so great: you reindex, and even if you have a one-terabyte or five-terabyte table, you can perform index maintenance, because we all need to rebuild indexes from time to time; even if it's not Postgres, in all OLTP systems they degrade, their health degrades. So you don't hold the xmin horizon, so it doesn't matter if it takes hours.
But it
had a bug and in
14.4 it was
reverted. So it means if you have
a huge table you need to rebuild
indexes while holding the xmin horizon.
It's the same problem as we just discussed
with inability of autovacuum
to delete tuples which became
dead recently.
So that's why you need to have partitioning,
and everything improves, including index behavior and so on.
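(For reference, a minimal declarative partitioning sketch with a hypothetical events table and monthly range partitions:)

-- Parent table; rows are routed to partitions by created_at.
CREATE TABLE events (
  id         bigserial,
  created_at timestamptz NOT NULL,
  payload    jsonb
) PARTITION BY RANGE (created_at);

-- One partition per month; pg_partman or a simple scheduled job can create these ahead of time.
CREATE TABLE events_2023_02 PARTITION OF events
  FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');

-- Old partitions can later be detached without heavy locking (Postgres 14+):
ALTER TABLE events DETACH PARTITION events_2023_02 CONCURRENTLY;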
Yeah, well, I think there's another good reason,
and that's I think it encourages good practice around maintenance
and around retention.
I think a lot of
people, before they have partitioning, don't even consider deleting old data, or how long they should be holding certain data for. And a lot of the time it is partitioned on time, or at least one aspect of it is time. So I think it does encourage you to think about how many of these do we need to keep around and how easily can we get rid of them, that kind of thing.
But yeah, maybe that's not so much about analytics.
Here, again, Timescale pops up. They do a very good job of achieving HTAP for Postgres for time series.
Because, again, this bottomless I mentioned,
this is exactly how they implemented it, and I like it.
So for partitions with archived data, you can choose S3 as storage, right?
Yeah, smart.
And for hot partitions, it's on an EBS volume in AWS, for example, if it's AWS.
And for your application, everything is kind of transparent, so you don't notice.
Of course, archived data is
slower, and everything is...
You can see it even in EXPLAIN ANALYZE plans,
Yeah.
that S3 was involved.
Really? That's cool. I haven't seen that.
Yeah, it's a fresh feature,
I think, bottomless.
But this works only in cloud, unfortunately,
as I understand.
Okay, it makes sense. That's a good point that they automate the partitioning for you once you choose certain things. But the normal way of doing this, the extension that seems to be most popular, is pg_partman, right?
Yeah. I must confess I never used it myself. A lot of folks around me did, but I didn't.
But I implemented partitioning myself before partitioning became declarative.
So I have some experience with partitioning and with declarative partitioning as well.
But with partitioning, the general advice is: you need to choose as fresh a major version of Postgres as possible, because, again, a lot of progress was made during the last five to seven years, like the ability to detach a partition concurrently and so on, in a non-blocking way. But I also wanted to mention that the same thing Timescale did, you probably can do yourself using FDW, if you convert your old partitions to foreign tables using postgres_fdw, for example.
You can store them on the remote server
with slower disk handling and so on.
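(A rough sketch of that idea with hypothetical names; foreign tables can be attached as partitions, with some limitations, such as no local indexes on the foreign side:)

-- On the main server: point at a cheaper "archive" Postgres instance.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER archive_srv FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'archive.internal', dbname 'archive');
CREATE USER MAPPING FOR CURRENT_USER SERVER archive_srv
  OPTIONS (user 'archive_user', password 'secret');

-- An old month lives remotely but is still queried through the same parent table.
CREATE FOREIGN TABLE events_2022_01
  PARTITION OF events FOR VALUES FROM ('2022-01-01') TO ('2022-02-01')
  SERVER archive_srv OPTIONS (table_name 'events_2022_01');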
Yeah.
You just reminded me actually of a whole different,
it's a different slant on that,
but did you read the post by DuckDB?
They're an analytics database
and they've done some incredible work. I didn't fully understand it, but you can now have your data in Postgres but query it through DuckDB, kind of like an FDW but in reverse, and get some ridiculous performance improvements on the same queries.
But when we use FDW, I'm afraid we will break our requirement, we won't meet our requirement to be below 15 seconds, at some point, at some data volumes. I'm afraid; my gut tells me, you know. But maybe I'm wrong. I actually don't follow all the new stuff happening in the Postgres ecosystem.
It was so, like, at least five years ago; definitely 10 or 15 years ago.
Right now, I'm feeling I cannot catch up with everything.
It's too much.
Too many things are happening.
I highly recommend this post by DuckDB.
I will link to it as well.
But some people I really respect are singing their praises at the moment.
So I'm looking forward to understanding that one a bit better. And the other one I wanted to make sure we mentioned: I don't feel like we should talk about Postgres and real-time analytics without mentioning Heap. They do quite a lot of blog posts related to Postgres, and, I could be wrong, I could be a bit out of date here, I think they're running Heap, which is an analytics product, pretty much exclusively on Postgres.
Maybe with Citus, I think.
I didn't see fresh articles from them.
But I think it was four or five years ago when I learned about it, and I remember they used ZFS with L2ARC, with data stored on EBS volumes and with local NVMe disks. It was the time when Nitro in AWS was not released yet, I think, and EBS volumes were much slower than local NVMe disks, which are ephemeral. So if the machine reboots, probably you lose data there.
So what they did, as I remember: of course, partitioning was involved, I'm
quite sure.
A lot of nodes.
And I think also sharding was involved.
But they stored data on EBS volumes, which are quite a lot cheaper. And the L2ARC cache in ZFS was on local NVMes. So it's an interesting system, but I never tried to implement this approach with ZFS myself, despite the fact that in my product, Database Lab Engine, we use ZFS. But we use only the regular ARC in memory, like an alternative to the operating system cache; I never played with L2ARC. I cannot tell whether this works well, maybe, but it's an interesting idea.
Yeah. I suspect we could talk about this for quite a while. Is there anything else you wanted to make sure we covered?
Well, there's also materialized views, a very important topic. And this is also not fully solved.
So I think engineers should know what Postgres can do and provide out of the box and what Postgres cannot do.
In this area of real-time analytics, you should be ready for degradation and you should experiment with larger volumes, definitely.
You should tune AutoVacuum to be quite aggressive to keep visibility maps up to date.
And this is a must-have thing.
So you benefit from index-only scans.
For example, you loaded data, you vacuumed once.
The visibility map is fresh. Some aggregate works quite fast because it doesn't consult heap pages: heap fetches is zero, because the visibility map is very fresh.
But then a lot of updates, deletes are coming and it degrades.
And if autovacuum is not aggressive enough, by default, it's not at all.
You will see heap fetches in plans growing and performance degrading.
So you need to tune autovacuum.
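(A hedged example of per-table tuning in that spirit; the numbers are illustrative, not recommendations from the episode:)

-- Make autovacuum visit this table much more often than the defaults,
-- so the visibility map stays fresh and index-only scans keep heap fetches near zero.
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.01,          -- default is 0.2
  autovacuum_analyze_scale_factor = 0.01,         -- default is 0.1
  autovacuum_vacuum_insert_scale_factor = 0.05    -- helps insert-only tables (Postgres 13+)
);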
But three things Postgres doesn't have.
One, I mentioned this: asynchronously detached sessions with transactions, which will help us denormalize data, right?
Second, partitioning is not fully automated, in the sense that nobody will create partitions for you automatically, right? Or remove old ones.
You need to engineer
additional pieces here. Of course, that's
where timescale
shines again, because it does this.
Or use PG Partman.
Or similar.
Right, right, right.
And third thing is
that, again,
materialized views can be refreshed only as a whole.
It can be done concurrently, so ongoing queries keep working, but it is still refreshed as a whole.
And it also can get bloat and so on and so on.
There are no incrementally refreshed materialized views.
This is one of the biggest ideas the Postgres project has, and there are projects around it.
There is a wiki page that discusses how to implement it yourself; it's possible, but some engineering effort is definitely needed.
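(For context, the built-in behaviour looks like this; a sketch with hypothetical names, where the unique index is what makes the CONCURRENTLY variant possible:)

CREATE MATERIALIZED VIEW daily_sales AS
SELECT order_date, sum(total) AS revenue
FROM orders
GROUP BY order_date;

-- Required for REFRESH ... CONCURRENTLY:
CREATE UNIQUE INDEX ON daily_sales (order_date);

-- Recomputes the whole view, but readers are not blocked while it runs.
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;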
But if you manage to have incrementally updated materialized views and asynchronous workers
which denormalize your data, and also implement automatic partitioning, you can achieve real-time analytics in Postgres at the scale of a few terabytes, or dozens of terabytes at least.
Also, I forgot, I wanted to
mention one extension. I think
it was developed at Citus back in the days before Microsoft as well, called HyperLogLog. It helps with count distinct, right?
HyperLogLog, right.
It's like one of the options for providing a good estimate for it, right?
Yes.
Also, I never used it myself in my projects, but I recommend it.
Some folks used it.
It doesn't solve all problems, but in some cases, it's useful.
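(A minimal sketch of how the extension is typically used, assuming the postgresql-hll extension and a hypothetical events table:)

CREATE EXTENSION IF NOT EXISTS hll;

-- Tiny pre-aggregated rollup of approximate distinct users per day.
CREATE TABLE daily_uniques AS
SELECT created_at::date AS day,
       hll_add_agg(hll_hash_bigint(user_id)) AS users_hll
FROM events
GROUP BY 1;

-- Estimated distinct users per day; several days can be combined with hll_union_agg.
SELECT day, hll_cardinality(users_hll) FROM daily_uniques;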
Yeah, there's an excellent post on the Citus blog that we actually linked to within PGMustard
for when we spot slow counts.
They did a really good post listing a bunch of ways that you can speed up counts,
whether you need an exact number or when an estimate can be good enough.
Yeah, I remember a good old trick with estimates: if your autovacuum is tuned aggressively, so its analyze part maintains row statistics, you can implement some function where, if you provide some query, instead of returning data it returns a row estimate; it extracts it from the plan and says, okay, the row estimate is roughly this.
But you need to be friends with your statistics and autovacuum to rely on this information.
Sometimes it's very off, depending on filters and so on.
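(A sketch of that trick, close to the well-known count_estimate function from the Postgres wiki: it parses the planner's row estimate out of EXPLAIN output instead of counting rows:)

CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS bigint AS $$
DECLARE
  rec  record;
  rows bigint;
BEGIN
  -- Ask the planner, not the table: read "rows=..." from the top plan node.
  FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
    rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
    EXIT WHEN rows IS NOT NULL;
  END LOOP;
  RETURN rows;
END;
$$ LANGUAGE plpgsql;

-- Usage: SELECT count_estimate('SELECT 1 FROM orders WHERE status = ''paid''');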
But okay, I think despite a lot of name dropping we did, right, we provided some principles
as well, what to expect. And I think definitely
it's achievable at the scale of terabytes or dozens of terabytes, and even thousands or maybe dozens of thousands of transactions per second are possible to handle. But it's still not as good as the performance of a column store for aggregates, unless you use one of the projects we mentioned.
Yeah, exactly.
There is a limit.
And if you're building an analytics product where people need analytics as their main thing, there's a good chance that you won't want to use Postgres for that data, right?
Right, right.
But yeah, awesome.
Thank you so much, Nikolay. Thanks, everybody. Thanks for the suggestion as well, keep those coming.
We have enough, actually. Oh yeah, we have a lot. Let's pause a little bit, we need to process what we have. I'm joking, I'm joking.
We'll take care. Have a good one. Bye-bye.
Thank you.