Postgres FM - Real-time analytics
Episode Date: February 17, 2023

Nikolay and Michael discuss real-time analytics — what it means, what the options are, and some tips if you're trying to implement it within Postgres.

Here are links to a few things we mentioned:
Loose index scan / skip scan with recursive CTE (wiki)
Zheap (wiki)
cstore_fdw (now part of Citus)
Timescale compression docs
Hydra founders interview (on Postgres TV)
Materialised views episode
pg_ivm
Timescale continuous aggregates docs
ClickHouse
Snowflake
Replication episode
Timescale bottomless storage on S3 (blog post)
pg_partman
Querying Postgres from DuckDB (blog post)
Heap blog (filter by "Engineering")
Incremental View Maintenance (wiki)
PostgreSQL HyperLogLog
Faster counting (by Joe Nelson on the Citus blog)

------------------------

What did you like or not like? What should we discuss next time? Let us know by tweeting us on @samokhvalov / @michristofides / @PostgresFM, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and this is my co-host Nikolay, founder of Postgres.ai.
Hey Nikolay, what are we talking about today?
Hi Michael, let's talk about one more controversial topic.
Can we run analytical queries in Postgres and what to expect from it?
Yeah, right. So like real-time analytics workloads,
maybe in combination with transactional
workloads, sometimes people say hybrid, but I think it's a fascinating topic. And luckily,
another listener request. So thank you. And so it's taken us a while to get to this one,
but really appreciate it.
Hybrid is just a term OLTP guys invented to pull everything to their side. I mean, it's still OLTP, but let's run aggregate queries in a regular OLTP database
and try to achieve better performance, right?
Let's try to get rid of analytical databases
like Vertica or Snowflake and so on.
This is HTAP.
Because vice versa, it works less often, right? Like analytical systems for OLTP, it usually doesn't work, because the latencies they offer are awful.
Well, we're diving straight into the deep part of the conversation, but I have seen some go after trying to do both, like SingleStore and a few others. I was going to bring this up later on, but I do see some analytical providers like ClickHouse starting to add features that allow them to mitigate some of the trade-offs.
Do they already have updates and deletes? I haven't checked, but do they work okay? Do they already have good isolation for sessions?
I mean, isolation control, snapshot isolation, and so on.
Like analytical data systems were not designed to be ACID compatible.
They don't need it.
They need to sacrifice some aspects for the sake of processing huge data volumes.
So I doubt it would be so easy. For example, Vertica: I cannot imagine someone runs Vertica to handle, I don't know, social media or e-commerce kinds of workloads. I mean, user-facing workloads, when customers expect all pages to work below 100 milliseconds,
right?
Better one millisecond.
On average, it should be below one millisecond on server side.
And I cannot expect that someone can run Vertica for that
or Snowflake.
It's strange.
So that's why I say HTAP is OLTP with an attempt to improve the performance of analytical queries. You don't agree, obviously, right?
Well, I think this is a really interesting time, and maybe that's true at the moment, but I think there's an interesting, almost a race, between OLTP systems adding the ability to do more analytical work and the more analytical side of things trying to do more transactional processing. At the moment, most people either have two separate systems, extracting the data from the transactional side and putting it into the analytical side, or they try to get by as long as possible with the transactional side, optimizing for some, maybe smaller, amount of analytical workload.
But I wanted to start by thinking, if you think of real-time analytics, what do you think that includes?
Or what's the first thing that it makes you think of?
My bachelor thesis in university was about ETL and how to clean up data, to get rid of dirt, to have a unified schema for all data from various sources. And then my first job was in a company called Program Bank. We programmed software systems for banks using Delphi and Oracle. It was an interesting time, more than 20 years ago already. Banks had nightly operations and batch processing during weekends. And then later, when I started to fly a lot, Europe to US and back and forth,
since I have bank accounts on both continents, I noticed that even 20 years later, some banks still have nightly operations. So you cannot access your bank account during your daytime in Europe when it's night in the US, and so on. So when I hear real-time analytics, I imagine that this approach should go away, and systems should be accessible always. And even if you need to provide some reports, like monthly reports or something, it should be done asynchronously at least, or better, in the same system: okay, we have an OLTP system and we know how to handle analytical queries which process, like, a month of data. So this is what real-time analytics means to me: not like banks do it. Some banks, anyway; there are new banks which already don't have this approach and they follow good practices.
Yeah, well, I mean, some of those sound like they're not even showing you the transactional data,
never mind the kind of aggregated data.
But for me, this makes me think of, let's say you're looking at a software as a service dashboard
and you want to look at your total of some number or some average of some number,
but it includes up-to-today and up-to-this-minute information. So you might be looking at page views on a website, but it includes the last hour or the last few minutes, and if somebody visits your website now, they start showing up in that dashboard as well. There's not really a delay.
There is no delay. And it also doesn't go down, the view count, unless some bots are involved.
And also if you open two tabs of the same page you have at least similar numbers, right?
You know the joke: MongoDB reached 1 million downloads, but according to one of the nodes it's 999,999 downloads and according to the other one it's 1,000,002 downloads, right? You know this one?
I hadn't heard that joke before, no.
It's an eventual consistency joke. So yes, I agree. And this is historically a very difficult task for a row store like Postgres, with the MVCC implementation Postgres has, with all that tuple, or tuple, I forgot the pronunciation, processing using the vacuum process, and bloat issues, and index amplification and so on.
So historically, Postgres is known for very slow SELECT count(*) or SELECT count(1),
doesn't matter in this case.
But it was improved.
Some aggregate queries can benefit from index-only scans.
Some can benefit from loose index scans.
But unfortunately, Postgres doesn't have an automatic loose index scan.
You need to program it using recursive CTE.
So you need to tell Postgres how to skip certain subtrees in the B-tree.
But many aggregate queries can be improved.
So it's possible to achieve with Postgres.
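(For reference, a rough sketch of that recursive CTE trick, assuming a hypothetical orders table with a B-tree index on customer_id; it emulates a "skip scan" for SELECT DISTINCT customer_id by jumping from one value to the next through the index:)

-- Hypothetical table: orders(customer_id int, ...), indexed on customer_id.
WITH RECURSIVE skip AS (
  (SELECT customer_id FROM orders ORDER BY customer_id LIMIT 1)
  UNION ALL
  SELECT (SELECT o.customer_id
          FROM orders o
          WHERE o.customer_id > s.customer_id
          ORDER BY o.customer_id
          LIMIT 1)
  FROM skip s
  WHERE s.customer_id IS NOT NULL
)
SELECT customer_id FROM skip WHERE customer_id IS NOT NULL;
-- Each step is a tiny index scan, so this stays fast even on a huge table.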
Yeah, you bring up a good point about, I mean, the design decisions that make Postgres so good at concurrently doing inserts, updates, deletes.
The design decisions that make that really efficient make running queries on, let's say, one column across all customers, if you're trying to sum those or average those or do some aggregation, fundamentally difficult, whereas a lot of these analytical systems are designed for the exact opposite. And as such, we've got almost a hard trade-off here, right? Like, I don't see an obvious way of having the best of both worlds.
It's possible with storage engines, and there's been ongoing work for many years already, and it looks like sometime soon we will have something. There are projects, I always mix them up, zedstore, zheap, I don't remember, attempts to have a pluggable storage engine which will provide a column store.
There was also a project from Citus called cstore_fdw maybe, like ColumnStore.
But it was using FDW.
So it's not like native storage.
It's not like in MySQL when you say
this table needs to use this storage engine
and that's it.
But in general,
a strategic decision in the Postgres project was made to have multiple storage engines, like to have an API for that and so on. So in the future, we can expect that it will be easy to define some table as column-store based. Let's talk about differences. You raised a good question about when one works better than the other. If we have OLTP, we usually need to extract user data with all columns of that user,
including like email, last name, and so on and so on. And of course, if these columns are stored
ideally in the same data page, Postgres has usually like eight kilobyte data pages.
It's just one buffer read or hit.
Speaking of buffers, right?
Our favorite topic.
But column store, like ClickHouse,
they store each column of a table in a separate file.
In this case, you need to deal with many files
and then extract this data
and then assemble it in memory to provide the row. So for OLTP queries dealing with a single row, or, if you have pagination, like 25 rows ordered by some column, it becomes less efficient to extract the data. But for queries like, let's calculate the sum of some orders or some aggregates for a month, you deal with just a single file. And if data is stored in time order and you need just one month, you deal with fewer pages. It's all about I/O as usual,
right? And the amount of memory you need to allocate and work with.
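(To make the contrast concrete, a small sketch with a hypothetical orders table: the first query is the OLTP shape that a row store like Postgres answers with a couple of buffer reads via an index; the second is the analytical shape where a column store only needs to read two columns:)

-- OLTP shape: fetch one full row; with an index on id this is one or two page reads.
SELECT * FROM orders WHERE id = 12345;

-- Analytical shape: two columns across many rows; a column store reads only those
-- two column files, while a row store has to read every page of the table.
SELECT date_trunc('month', created_at) AS month, sum(order_total)
FROM orders
WHERE created_at >= '2023-01-01' AND created_at < '2023-02-01'
GROUP BY 1;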
Yeah, and while we're talking about systems that allow for you to pick your trade-off
maybe on a per-table basis,
we've also, I think, probably should mention that there are extensions attempting to do some level of this kind of thing as well, the most obvious one being Timescale.
It's not exactly column store, I don't think,
but I think it works in a similar way. It's very clever. But the other thing that they take advantage of, and a lot of the other analytical systems...
You mean compression? In Timescale, for time series, which uses the knowledge about what is stored in a column and then applies this to compress it, right?
Yeah, and that's exactly what I was going to say. Because data within a column is very likely to be similar across rows,
it compresses extremely well.
And another thing that can reduce I/O for those analytical queries is that there's just less data to read.
I encourage everyone to read their articles,
especially the one explaining the details about compression approaches for time series in Timescale. I read it from time to time and keep recommending it. It's a brilliant article. And in general, their work is brilliant for these types of data.
But I wanted to mention also Hydra, which is a very new project aiming to build an analytical type of Postgres, like a Postgres flavor. And they also work in this area. They took this cstore from Citus, the column store via FDW, and are improving it and so on. And we had an episode on Postgres TV, an interview with the founders. It was one month ago or so, but it's still getting a lot of good feedback; they shared that they keep getting personal feedback. That's great. I mean, it's an interesting project and I encourage checking out this interview as well.
So yeah, by default, I think all people should start from Postgres, right? And then we don't want to be like banks, right? We don't want to have ETL every night. It doesn't scale well, this process. I mean, at some point it will take more and more hours, and you don't have so much time, and it's hard to manage this process.
And also customers suffer
because they want up-to-date info.
No, you need to wait until tomorrow
to get this info.
No, it's not good.
And customers can be internal in this case
because, for example,
business owner might want to see
some reports right now,
but we will tell them to return tomorrow.
No, no.
So we start with Postgres and then we have two options, basically.
The first is to take something separate for analytics, and this is our default option still.
OLTP Postgres DBAs usually recommend, and I usually do the same, don't use Postgres for analytics.
This is a mantra, actually.
We use this mantra all the time, because the problem is not only it's less efficient for heavy queries with aggregates,
it also affects the health of your Postgres.
Because if you run it on the primary node, and you have a very long query lasting half an hour, for example, and during that time other queries delete tuples in any table, autovacuum cannot delete those dead tuples until your long query finishes, because you hold the xmin horizon, right?
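(As a side note, a quick way to spot sessions holding back the xmin horizon is a query against pg_stat_activity along these lines:)

SELECT pid, state, backend_xmin, xact_start,
       now() - xact_start AS xact_age,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;
-- The oldest backend_xmin is what prevents vacuum from cleaning up recently dead tuples.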
So, yeah, we've talked about this a few times, and I think this might be the one time I disagree with you, because we're talking about real-time analytics here. I think if we're talking about things that update, that are user-facing, people are going to have problems way before half an hour in terms of user experience. So I think we're talking more in the region of: the customers I see doing this kind of optimization work within Postgres, it tends to be that they're trying to get something to load within, ideally, you know, a few hundred milliseconds, or five seconds, kind of 10 seconds as an upper limit of how long somebody will wait.
Maybe if they're changing some drop downs in a UI, and then they want to refresh their data.
If we're talking about that being able to finish within a few seconds, I think that can still live
within Postgres for a while. Maybe it doesn't scale forever, but it's smaller sizes.
I definitely don't see that.
You are right.
You are right, but only until some point.
Okay, so you basically say real-time analytics
means that we have a requirement
to have statement timeout 15 seconds, for example,
for all user-facing queries.
That's great.
I agree.
But to achieve that,
you probably will face challenges
when you have select count or select sum
or anything lasting too long.
And then you will implement some denormalization, asynchronously, via some queuing mechanism through Kafka or something, to pre-process some data. For example, on an hourly basis. And those queries you won't limit, right? And those queries will hold your xmin horizon. Backend queries, how to name them, or something: not user-facing, some batch processing queries, something like that.
And to keep everything in Postgres, right, you need to have some pre-processing,
asynchronous and parallel,
queued and so on.
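(Coming back to the per-query budget idea, one hedged way to enforce it, assuming hypothetical roles app_web for user-facing traffic and app_batch for background jobs:)

-- User-facing connections get a strict budget; batch connections get a looser one.
ALTER ROLE app_web SET statement_timeout = '15s';
ALTER ROLE app_batch SET statement_timeout = '30min';
-- Also avoid batch sessions sitting idle in transaction and holding the xmin horizon:
ALTER ROLE app_batch SET idle_in_transaction_session_timeout = '5min';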
And we've talked about,
we should probably give a shout out to a previous episode we did
on materialized views,
which are one kind of pre-processing tool.
Probably going to give them
a couple more shout outs in this episode.
pg_ivm?
I was going to say Timescale, actually, with their continuous aggregates.
Timescale is great.
Timescale is great.
Again.
Well, the continuous aggregates are kind of this pre-processing, right,
where you have this continuous aggregation of new data coming in
and then that getting filed away.
So I think there are some ways of avoiding what you're talking about,
but, yeah, only up to a point.
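(For reference, a sketch of what a continuous aggregate looks like, following Timescale's documented syntax; page_views here is a hypothetical hypertable:)

CREATE MATERIALIZED VIEW page_views_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', viewed_at) AS bucket,
       page_id,
       count(*) AS views
FROM page_views
GROUP BY bucket, page_id;

-- Keep the most recent hours refreshed automatically in the background.
SELECT add_continuous_aggregate_policy('page_views_hourly',
  start_offset      => INTERVAL '3 hours',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '15 minutes');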
Continuous aggregates in Timescale, and this is a part of their open source offering. So that's great. And I know cases where it works very well. I also made a joke today: someone from Timescale offered me to register in Timescale Cloud. First of all, of course I'm registered. And I need to say that I said, no, don't go there, because there is no bottom in Timescale Cloud. It was a joke about their bottomless feature; we probably also discussed it. Of course, go and register on Timescale Cloud, it's great, a lot of very new approaches, very interesting approaches. And I think Timescale is one of the companies which drive the Postgres ecosystem forward using
innovation.
That's great.
Right.
So if we want to keep everything in Postgres, we will have these types of challenges: long
running queries.
Sometimes we can avoid them.
If we choose smaller batches, if we process this, if we implement this asynchronous process
more often, every minute,
for example. But it's also worth mentioning that it's hard to engineer, because Postgres doesn't have what Oracle provides, PRAGMA AUTONOMOUS_TRANSACTION, something like: I want to detach my auxiliary transaction for background processing. For example, we inserted some data and we want to increment some counter, and we would like to do it asynchronously, not blocking the main insert or the main update. And to do that, you need to engineer it using some Kafka or, I don't know, RabbitMQ (I cannot recommend RabbitMQ) or Sidekiq or something.
By the way, there is a new project someone released a couple of days ago,
which is considered as a replacement for PGQ.
But it's another story. Queues in Postgres, we need to discuss it one day.
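(For the "increment a counter asynchronously" idea, a common in-Postgres pattern, sketched here with hypothetical tables, is a plain jobs table drained by a background worker using SKIP LOCKED:)

-- Hypothetical queue table, written in the same transaction as the user-facing insert.
CREATE TABLE counter_jobs (
  id         bigserial PRIMARY KEY,
  account_id bigint NOT NULL,
  delta      int    NOT NULL
);

-- Background worker, each iteration a short transaction so the xmin horizon moves on quickly:
BEGIN;
DELETE FROM counter_jobs
WHERE id IN (
  SELECT id FROM counter_jobs
  ORDER BY id
  LIMIT 100
  FOR UPDATE SKIP LOCKED
)
RETURNING account_id, delta;
-- (in application code) apply the returned deltas to the aggregated counters table, then:
COMMIT;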
But if you manage to engineer it,
probably you can keep your transactions below some duration. But if you don't, you have trouble, and you cannot even offload some long-running SELECTs to standby nodes, because either you need to be a single user on those standby nodes, since other users will notice big lags while your query lasts, or you need to turn on hot_standby_feedback, report to the primary, and have all the bloat issues again.
So what to do?
Back to my idea that we have two options.
One option is find a way to remain in Postgres using some optimization,
some of these column store
approaches, some innovation, or to use a regular approach, but with continuous ETL,
basically, which is called CDC, change data capture. So you take something, you can take,
for example, ClickHouse, and use it as a secondary database right from your application.
There might be problems with consistency
because probably Postgres knows that you made
a thousand sales today,
but ClickHouse thinks it's slightly less or more.
Usually, Postgres is considered as a primary source of truth
being an OLTP system.
And then we have a logical replication
to Snowflake, Vertica, or ClickHouse or anything else.
There are ways to do it. But you need to deal with logical
replication. This is challenging. It's possible, but it's
challenging. I must admit it's improving; every year Postgres
receives improvements in this area. And there are tools,
including commercial tools.
Open source is Debezium, but there are commercial tools.
I cannot recommend Qlik.
Qlik is bad.
Former Attunity, I think, right?
It's just a wrong implementation of the Postgres plugin.
But, for example, Fivetran is quite good
if you consider commercial systems.
And we have a whole episode on this, right?
We have, right; it's a different story, replication. And so our option for real-time analytics is to improve queries. Let's talk about the ways to improve them.
Well, another thing I was going to mention that I think is
important here is: generally, once we have performance issues, or once we have a decent amount of data we're doing analytics on, we probably are in the realm of needing or wanting to use partitioning of some description.
Oh, partitioning is huge.
Partitioning is a must have if you need to deal with big data volumes. Rule of thumb is you need partitioning for a table which exceeds,
for example, 100 gigabytes.
Yeah, which is a lot, but also not that much these days, right?
Right.
And partitioning is needed in many, many, many aspects,
including data locality.
So all data lives in particular pages,
data blocks on disk,
and they are all frozen,
very rarely touched.
So AutoVacuum just skips them very quickly.
Yeah.
And also indexes for those old partitions
are in like almost frozen shape,
quite efficient,
maybe rebuilt,
not degrading.
And these pages are barely present in our buffer pool
and operating system cache. Because if everything is mixed and each page has a couple of new tuples and a lot of old tuples, it blows up your shared buffers. This is one of the biggest reasons
to have partitioning. I just recently revisited this
topic several times with
a couple of customers.
And also,
in Postgres 14, I think, REINDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY got an optimization not to hold the xmin horizon. It was so great: you reindex, and even if you have a one-terabyte or five-terabyte table, you can perform index maintenance, because we all need to rebuild indexes from time to time; even if it's not Postgres, in all OLTP systems they degrade, their health degrades. So you don't hold the xmin horizon, so it doesn't matter if it takes hours.
But it
had a bug and in
14.4 it was
reverted. So it means if you have
a huge table you need to rebuild
indexes while holding the xmin horizon.
It's the same problem as we just discussed
with inability of autovacuum
to delete tuples which became
dead recently.
So that's why you need to have partitioning,
and everything improves, including index behavior and so on.
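(For reference, a minimal declarative partitioning sketch with a hypothetical events table and monthly range partitions:)

-- Parent table; rows are routed to partitions by created_at.
CREATE TABLE events (
  id         bigserial,
  created_at timestamptz NOT NULL,
  payload    jsonb
) PARTITION BY RANGE (created_at);

-- One partition per month; pg_partman or a simple scheduled job can create these ahead of time.
CREATE TABLE events_2023_02 PARTITION OF events
  FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');

-- Old partitions can later be detached without heavy locking (Postgres 14+):
ALTER TABLE events DETACH PARTITION events_2023_02 CONCURRENTLY;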
Yeah, well, I think there's another good reason,
and that's I think it encourages good practice around maintenance
and around retention.
I think a lot of
people, before they have partitioning, don't even consider deleting old data, or how long they should be holding certain data for. And a lot of the time it is partitioned on time, or at least one aspect of it is time. So I think it does encourage you to think about how many of these do we need to keep around and how easily can we get rid of them, that kind of thing.
But yeah, maybe that's not so much about analytics.
Here, again, Timescale pops up. They do a very good job of achieving HTAP for Postgres for time series.
Because, again, this bottomless I mentioned,
this is exactly how they implemented it, and I like it.
So for partitions with archived data, you can choose S3 as storage, right?
Yeah, smart.
And for hot partitions, it's on an EBS volume in AWS, for example, if it's AWS.
And for your application, everything is kind of transparent, so you don't notice.
Of course, archived data is
slower, and everything is...
You can see it even in EXPLAIN ANALYZE plans,
Yeah.
that S3 was involved.
Really? That's cool. I haven't seen that.
Yeah, it's a fresh feature,
I think, bottomless.
But this works only in cloud, unfortunately,
as I understand.
Okay, it makes sense. That's a good point that they automate the partitioning for you once you choose certain things. But the normal way of doing this, the extension that seems to be most popular, is pg_partman, right?
Yeah. I must confess I never used it myself. A lot of folks around me did, but I didn't.
But I implemented partitioning myself before partitioning became declarative.
So I have some experience with partitioning and with declarative partitioning as well.
But with partitioning, the general advice is: you need to choose as fresh a major version of Postgres as possible, because, again, a lot of progress was made during the last five to seven years, like the ability to detach a partition concurrently and so on, in a non-blocking way. But I also wanted to mention that the same thing Timescale did, you probably can do yourself using FDW, if you convert your old partitions to foreign tables using postgres_fdw, for example.
You can store them on the remote server
with slower disk handling and so on.
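(A rough sketch of that idea with hypothetical names; foreign tables can be attached as partitions, with some limitations, such as no local indexes on the foreign side:)

-- On the main server: point at a cheaper "archive" Postgres instance.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER archive_srv FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'archive.internal', dbname 'archive');
CREATE USER MAPPING FOR CURRENT_USER SERVER archive_srv
  OPTIONS (user 'archive_user', password 'secret');

-- An old month lives remotely but is still queried through the same parent table.
CREATE FOREIGN TABLE events_2022_01
  PARTITION OF events FOR VALUES FROM ('2022-01-01') TO ('2022-02-01')
  SERVER archive_srv OPTIONS (table_name 'events_2022_01');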
Yeah.
You just reminded me actually of a whole different,
it's a different slant on that,
but did you read the post by DuckDB?
They're an analytics database
and they've done some incredible work. I didn't fully understand it, but you can now have your data in Postgres but query it through DuckDB, kind of like an FDW but in reverse, and get some ridiculous performance improvements on the same queries.
But when we use FDW, I'm afraid we will break our requirement, we won't meet our requirement to be below 15 seconds, at some point, at some data volumes. I'm afraid; my gut tells me, you know. But maybe I'm wrong. I actually don't follow all the new stuff happening in the Postgres ecosystem.
It was so, like, at least five years ago; definitely 10 or 15 years ago.
Right now, I'm feeling I cannot catch up with everything.
It's too much.
Too many things are happening.
I highly recommend this post by DuckDB.
I will link to it as well.
But some people I really respect are singing their praises at the moment.
So I'm looking forward to understanding that one a bit better. And the other one I wanted to make sure we mentioned: I don't feel like we should talk about Postgres and real-time analytics without mentioning Heap. They do quite a lot of blog posts related to Postgres, and, I could be wrong, I could be a bit out of date here, I think they're running Heap, which is an analytics product, pretty much exclusively on Postgres.
Maybe with Citus, I think.
I didn't see fresh articles from them.
But I think it was four or five years ago when I learned about it, and I remember they used ZFS with L2ARC, with data stored on EBS volumes and with local NVMe disks. It was the time when Nitro in AWS was not released yet, I think, and EBS volumes were much slower than local NVMe disks, which are ephemeral. So if the machine reboots, probably you lose data there.
So what they did, as I remember: of course, partitioning was involved, I'm
quite sure.
A lot of nodes.
And I think also sharding was involved.
But they stored data on EBS volumes, which are quite a lot cheaper. And the L2ARC cache in ZFS was on local NVMes. So it's an interesting system, but I never tried to implement this approach with ZFS myself, despite the fact that in my product, Database Lab Engine, we use ZFS. But we use only the regular ARC in memory, like an alternative to the operating system cache; I never played with L2ARC. I cannot tell whether this works well, maybe, but it's an interesting idea.
Yeah. I suspect we could talk about this for quite a while. Is there anything else you wanted to make sure we covered?
Well, there's also materialized views, a very important topic. And this is also not fully solved.
So I think engineers should know what Postgres can do and provide out of the box and what Postgres cannot do.
In this area of real-time analytics, you should be ready for degradation and you should experiment with larger volumes, definitely.
You should tune AutoVacuum to be quite aggressive to keep visibility maps up to date.
And this is a must-have thing.
So you benefit from index-only scans.
For example, you loaded data, you vacuumed once.
The visibility map is fresh. Some aggregate works quite fast because it doesn't consult heap pages: heap fetches is zero, because the visibility map is very fresh.
But then a lot of updates, deletes are coming and it degrades.
And if autovacuum is not aggressive enough, by default, it's not at all.
You will see heap fetches in plans growing and performance degrading.
So you need to tune autovacuum.
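(A hedged example of per-table tuning in that spirit; the numbers are illustrative, not recommendations from the episode:)

-- Make autovacuum visit this table much more often than the defaults,
-- so the visibility map stays fresh and index-only scans keep heap fetches near zero.
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.01,          -- default is 0.2
  autovacuum_analyze_scale_factor = 0.01,         -- default is 0.1
  autovacuum_vacuum_insert_scale_factor = 0.05    -- helps insert-only tables (Postgres 13+)
);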
But three things Postgres doesn't have.
One, I mentioned this: asynchronously detached sessions with transactions, which will help us denormalize data, right?
Second, partitioning is not fully automated, in the sense that nobody will create partitions for you automatically, right? Or remove old ones.
You need to engineer
additional pieces here. Of course, that's
where timescale
shines again, because it does this.
Or use PG Partman.
Or similar.
Right, right, right.
And third thing is
that, again,
materialized views can be refreshed only as a whole.
It can be done concurrently, so ongoing queries keep working, but it is still refreshed as a whole.
And it also can get bloat and so on and so on.
There are no incrementally refreshed materialized views.
This is one of the biggest ideas the Postgres project has, and there are projects around it.
There is a wiki page that discusses how to implement it yourself; it's possible, but some engineering effort is definitely needed.
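(For context, the built-in behaviour looks like this; a sketch with hypothetical names, where the unique index is what makes the CONCURRENTLY variant possible:)

CREATE MATERIALIZED VIEW daily_sales AS
SELECT order_date, sum(total) AS revenue
FROM orders
GROUP BY order_date;

-- Required for REFRESH ... CONCURRENTLY:
CREATE UNIQUE INDEX ON daily_sales (order_date);

-- Recomputes the whole view, but readers are not blocked while it runs.
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;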
But if you manage to have incrementally updated materialized views and asynchronous workers
which denormalize your data, and also implement automatic partitioning, you can achieve real-time analytics in Postgres at the scale of a few terabytes, or dozens of terabytes at least.
Also, I forgot, I wanted to
mention one extension. I think
it was developed at Citus back in the days before Microsoft as well, called HyperLogLog. It helps with count distinct, right?
HyperLogLog, right.
It's like one of the options for providing a good estimate for it, right?
Yes.
Also, I never used it myself in my projects, but I recommend it.
Some folks used it.
It doesn't solve all problems, but in some cases, it's useful.
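(A minimal sketch of how the extension is typically used, assuming the postgresql-hll extension and a hypothetical events table:)

CREATE EXTENSION IF NOT EXISTS hll;

-- Tiny pre-aggregated rollup of approximate distinct users per day.
CREATE TABLE daily_uniques AS
SELECT created_at::date AS day,
       hll_add_agg(hll_hash_bigint(user_id)) AS users_hll
FROM events
GROUP BY 1;

-- Estimated distinct users per day; several days can be combined with hll_union_agg.
SELECT day, hll_cardinality(users_hll) FROM daily_uniques;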
Yeah, there's an excellent post on the Citus blog that we actually linked to within PGMustard
for when we spot slow counts.
They did a really good post listing a bunch of ways that you can speed up counts,
whether you need an exact number or when an estimate can be good enough.
Yeah, I remember a good old trick with estimates: if your autovacuum is tuned aggressively, so its analyze part maintains row statistics, you can implement some function where, if you provide some query, instead of returning data it returns a row estimate; it extracts it from the plan and says, okay, the row estimate is roughly this.
But you need to be friends with your statistics and autovacuum to rely on this information.
Sometimes it's very off, depending on filters and so on.
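(A sketch of that trick, close to the well-known count_estimate function from the Postgres wiki: it parses the planner's row estimate out of EXPLAIN output instead of counting rows:)

CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS bigint AS $$
DECLARE
  rec  record;
  rows bigint;
BEGIN
  -- Ask the planner, not the table: read "rows=..." from the top plan node.
  FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
    rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
    EXIT WHEN rows IS NOT NULL;
  END LOOP;
  RETURN rows;
END;
$$ LANGUAGE plpgsql;

-- Usage: SELECT count_estimate('SELECT 1 FROM orders WHERE status = ''paid''');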
But okay, I think despite a lot of name dropping we did, right, we provided some principles
as well, what to expect. And I think definitely
it's achievable at the scale of terabytes or dozens of terabytes, and even thousands or maybe dozens of thousands of transactions per second are possible to handle. But it's still not as good as the performance of a column store for aggregates, unless you use one of the projects we mentioned.
Yeah, exactly.
There is a limit.
And if you're building an analytics product where people need analytics as their main thing, there's a good chance that you won't want to use Postgres for that data, right?
Right, right.
But yeah, awesome.
Thank you so much, Nikolay. Thanks, everybody. Thanks for the suggestion as well, keep those coming.
We have enough, actually. Oh yeah, we have a lot. Let's pause a little bit, we need to process what we have. I'm joking, I'm joking.
We'll take care. Have a good one. Bye-bye.
Thank you.