Postgres FM - Companion databases

Episode Date: November 17, 2023

Nikolay and Michael discuss companion databases: when and why you might want to add another database management system to your stack (or not), and some specifics for analytics, time series, search, and vectors.

Here are some links to things they mentioned:

Heap were using Postgres + Citus for analytics as of 2022 https://www.heap.io/blog/juggling-state-machines-incident-response-and-data-soup-a-glimpse-into-heaps-engineering-culture
Heap recently moved their core analytics to SingleStore (we only spotted this after recording 🤦‍♂️) https://www.heap.io/blog/heaps-next-generation-data-platform
PostHog moved their analytics from Postgres to ClickHouse https://posthog.com/blog/clickhouse-vs-postgres
Timescale https://www.timescale.com/
Citus https://www.citusdata.com/
Hydra https://www.hydra.so/
Our episode on real-time analytics https://postgres.fm/episodes/real-time-analytics
Our episode on queues https://postgres.fm/episodes/queues-in-postgres
Our episode on logical replication https://postgres.fm/episodes/logical-replication
Full Text Search in PostgreSQL (docs) https://www.postgresql.org/docs/current/textsearch.html
ZomboDB https://www.zombodb.com/
ParadeDB https://www.paradedb.com/
PeerDB https://www.peerdb.io/
RUM indexes https://github.com/postgrespro/rum
Debezium https://debezium.io/
pgmq https://github.com/tembo-io/pgmq
Tembo https://tembo.io/
pgvector vs Pinecone (Supabase blog post) https://supabase.com/blog/pgvector-vs-pinecone
Jepsen analyses https://jepsen.io/analyses

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork

Transcript
Hello, hello, this is Postgres FM, your favorite Postgres podcast, one of many. My name is Nikolay, and as usual, I have Michael here. Hi, Michael. Hi, Nikolay. So today we chose, you chose actually, one of the topics suggested by our users, so we can blame them, not us. Tell us which topic we're going to discuss today. Yeah, this is nice. I like picking user suggestions or listener suggestions
because I know at least one person will be interested in it. They suggested discussing companion databases, which is a term I hadn't heard before but made complete sense. So this is the concept of having a different database management system for a particular workload. So, like, Dave mentioned Redis, but we've discussed queuing in the past, we've discussed analytics workloads. There's loads more that we haven't discussed, like vector databases, which are all the rage these days, aren't they? Document databases, even search-specific databases. So, yeah, I thought this was an interesting topic.
I think there's probably some generic pros and cons, but also some details in the specifics that might be of interest. But I was, yeah, super keen to hear your thoughts and experiences on when some of these things make sense, when they don't, and whether we can come up with some rules of thumb. Well, first of all, the rule of thumb for me is: if you can avoid it, avoid it, right? But I'm okay with other systems as well. And obviously, there are pros and cons in each case, sometimes easy to see.
For example, VictoriaMetrics is much better in terms of compression and working with time series than Postgres, and even than Timescale, if you check benchmarks. So it's obvious. We had a discussion with their CTO about how many gigabytes one billion rows should take. My impression: for Postgres, a row store with very light compression (only for TOAST), it's roughly a terabyte; for them it's gigabytes, so several orders of magnitude. I'm not sure about gigabytes, maybe dozens of gigabytes, but definitely not a terabyte, and this alone already highlights a big difference. Of course, Timescale is also good at compression, but not as good as VictoriaMetrics, as I understand. I'm not going to blame Timescale, it's great technology. So obviously, this single example alone says that sometimes you probably want to consider different databases. For example, ClickHouse. Obviously, it has great capabilities to handle workloads like logs or a stream of events at a very large scale. You can do it with Postgres, but you will hit some difficulties, obviously. Again, if you can avoid it, I would prefer avoiding it if you already chose Postgres, because the main problem will be synchronization. Well, two big problems: synchronization, and management of the additional system, tuning and so on, requiring some expertise. Would you count, yeah, the expertise, would you count that in the latter?
Like, I'm thinking there's a certain amount of educational overhead as well, you know, team expertise, which can make sense in larger companies or larger teams. But yeah, do you count that in the management overhead? I guess it is. Right. Yeah. Well, when I say management, it's not only that. You can find maybe a managed version of some system, so you won't be worrying too much about backups, although I think you should, even for a managed system. Yeah, and there's stuff like provisioning, replication, failover, and so on. But still, most likely it will be a different language, a dialect of SQL or maybe not SQL at all. It will add complexity to the application, definitely. So 'management' is probably not the right word here. Dealing with something else requires additional effort, not in terms of management, but in terms of just working with a new system.
Right? So this is serious. If you add an additional system, you need to add a lot of stuff additionally. Yeah, I like it. I think you're already starting to extract some of the things I was hoping to in terms of the trade-offs.
So, we've already touched on the analytics one, which I think is particularly interesting, because I think there are quite strict trade-offs there. Some of the specialized systems contain features we just don't have, even with the Postgres extensions. But yeah, you mentioned we do have some quite deep Postgres extensions, with Timescale and Citus, that offer compression. And if you look at some examples, like I was, it's really interesting to see, for example, Heap, a dedicated analytics product, I believe still using Postgres with Citus and making it work even at extreme scale. And yet I also read a very similar write-up from PostHog, also an analytics product, who are much newer than Heap, so I'm guessing smaller, but maybe not, and they've already bitten the bullet and moved their analytics workloads to ClickHouse for the benefits. And when you're an analytics company, it can make a ton of sense, right, to pay that migration cost. But I think the interesting learning for me from that was that they did scale to a certain point using just Postgres.
So I think, I like your word. Without trade-offs, you mean? I don't know for sure, but my guess is without, yes. I remember they used ZFS and L2ARC. It was interesting. So the trade-offs are synchronization of data and additional requirements for your development and administration activities. So how about discussing several particular workloads and checking these trade-offs, starting with analytical workloads maybe? What do you
think? Yeah, sounds good. So, you mentioned those as trade-offs; I saw them as, kind of, downsides, right? They're like the cons of having a separate companion database. But I think we haven't mentioned some of the pros as much. So, some of them: we mentioned more advanced features, we mentioned performance, like they can be better performance-wise in certain cases. I think there are also a couple of others, like they can be easier to scale out. I think with Postgres we have really good scale-up options, but we don't necessarily have as easy scale-out options, where some of these dedicated databases do. And then there's one more, which you brought up a few times in previous episodes, which is reducing the load: if we've got a Postgres primary that's already fairly maxed out, or we're pushing the limits of our managed service provider, for example, we probably don't want to be adding additional workloads to it, so anything we can do to split out services could be beneficial as well. So those feel to me like the pros of going down this path of adding a companion database. But then the cons are, as you said, learning another system, another system to maintain, and those, yeah, synchronization or managing transactions,
I guess, in some sense. Right. So, if you mentioned Citus, then we should mention Hydra as well, because it's a new thing, and probably we will soon hear about some use cases, I hope. And talking about analytical workloads, I would distinguish two big, different cases.
The first is when you have an OLTP system as your main system, for example, a web or mobile app, and additionally you need analytics. For example, e-commerce: you need an analytical system to analyze what's happening with your e-commerce, and a lot of stuff, maybe for BI and so on. You need a lot of stuff there. And you can technically build it on top of Postgres. There are systems, I don't know, like Odoo and some ERP systems, also things that can be considered analytical, built on top of Pentaho from the past. But probably you would prefer, in this case, a companion system like Snowflake or something. In the past, it would have been Vertica or something. And there is a very different case, when the analytical workload is your primary workload. You are just building some analytical tool, and this is what you deliver. In this case, Postgres can be your main system, or it can be, I don't know, non-existent at all in this approach. So speaking of the first case, when there is OLTP in the center and analytics is an additional thing,
obviously, by default, growing projects might choose running analytical workloads just on replicas, right? Yeah. We discussed this as quite a bad idea. I mean, it's not a bad idea at the very beginning, because you don't see a lot of TPS, and data volumes, a lot of bytes. But eventually you will hit a different trade-off which doesn't have a good solution: should hot_standby_feedback be on or off? Yeah, right. And as a reminder, if it's on, then running long statements, long transactions actually, on replicas with hot_standby_feedback turned on will affect the primary workload, and autovacuum won't be able to delete freshly dead tuples. If it's off, then it makes your replica a single-user replica, because it accumulates lag while you run your long transaction, and other people are asking, like, why is there lag?
And it depends a little bit on what you need. With analytics, there can be cases where you don't need completely real-time analytics data; it's okay to have that. Fair, yes, fair. But then it's still not fully healthy when lag is accumulated. How do you monitor it?
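A sketch of how this can be checked with standard Postgres functions and views (the queries are generic, not anything specific from the episode):

```sql
-- On the replica: is hot_standby_feedback on, and how far behind is replay?
SHOW hot_standby_feedback;
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;

-- On the primary: per-standby replication lag in bytes
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```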
Is it caused by a normal workload, somebody running a 30-minute report or something? Or is it already an incident and we need to catch up? And anyway, other users, from my experience, will expect this replica not to be lagging. And there is a good solution to this problem. So both on and off for hot_standby_feedback are bad, but the good solution is branching. For example, if you have branching available, you can create your branch, analyze everything there, and then destroy this branch; or cloning, right? But it's not available everywhere. So this is problem number one. And problem number two is what we mentioned already: Postgres compression options, and not only compression, it's a row store. So, lack of column store. Which is, yeah.
I don't even think sharding should be an immediate concern. Maybe, of course, like, for example, ClickHouse. By the way, it's interesting: analytical workloads and time-series workloads sometimes are quite close to each other, right? Yeah. So do you consider ClickHouse an analytical database system or a time-series database system? I always think of it as analytics. But originally it was developed for a web analytics system, similar to Google Analytics. The primary type of data is time series, a series of events. Why is it called ClickHouse? It's a series of clicks, web clicks, with a lot of characteristics for each click. A click means a request, an HTTP request or something. So it's time series, basically, also, right? Yeah, I see what you mean. But equally, I think they also have vector features now.
It's like, just because it has... How do you define Postgres? Is Postgres an analytics database as well? I think it's difficult to put these things in one category, right? But I've seen ClickHouse used for analytics stuff. And yeah, it's mostly time series data, but so is so much data in this world. How much data do you collect that isn't time series? We know Timescale's point of view: everything is time series. We know this, right? But yeah, I think even a human is time series. It has a birthday. So all humans are time series data.
And everything you have has a created_at timestamp, and maybe updated and deleted. I don't know. Okay, so then we should also consider not only Citus and Hydra, but also TimescaleDB as options to stay with Postgres only. But in this case, if you have a strong OLTP system and analytical workloads which you know will involve long-running queries or transactions, I would suggest having two separate clusters and connecting them through logical replication, to avoid this hot_standby_feedback problem. And if we take it to the extreme at the other end,
the small side of things, I think you can start even without any of those. Even with partitioning, we get a lot of the benefits straight away. And I've seen people use roll-up tables quite effectively. So if you can aggregate the data into a separate table, for example, that can massively improve limited analytics workloads.
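For instance, a minimal roll-up sketch, refreshed periodically; the events and events_daily tables are hypothetical:

```sql
CREATE TABLE events_daily (
  day        date   NOT NULL,
  event_type text   NOT NULL,
  total      bigint NOT NULL,
  PRIMARY KEY (day, event_type)
);

-- Recompute today's bucket from the raw events table and upsert it
INSERT INTO events_daily (day, event_type, total)
SELECT created_at::date, event_type, count(*)
FROM events
WHERE created_at >= date_trunc('day', now())
GROUP BY 1, 2
ON CONFLICT (day, event_type) DO UPDATE SET total = EXCLUDED.total;
```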
Or, like you mentioned, the case where you have an OLTP workload and you need to provide some analytics over it. If you can use roll-up tables, you don't even need any of those extensions for quite a while. So I guess what we're trying to say is that it's not only a single Postgres cluster versus Postgres plus a companion VictoriaMetrics, for example, or, I don't know, Snowflake. Quite an expensive option, but very popular, obviously. We can consider a lot of options while staying inside the Postgres ecosystem. And, for example, if you can reach good query performance, probably it's fine to run it on
a replica of a single cluster. This is, I think, what we should call the HTAP case, when we have OLTP workloads and analytical queries which don't take 30 minutes. Several seconds or so. Several seconds, well, up to one minute. It's okay for autovacuum not to be able to delete dead tuples for some minutes, maybe 10 minutes. It's okay, it doesn't delete them immediately anyway; it will come back later. And in this case, Hydra is looking
interesting, because you can define specific tables as column store and run analytical workloads on them. You may have some data stored twice inside the same cluster, originally as row store and additionally as column store, and it's interesting, right? Yeah. Also, AlloyDB is interesting, but it's different. As I understand it, the main storage there is row store, but in memory they also build, like, a 90-degrees, how is it called, I know it only in Russian, sorry, a transposed version of the table, so it additionally becomes a column store, and you can run aggregates much faster. But it's not open source; Hydra is open source, so I would root more for Hydra in this case. So, interesting and new options to explore before you think, okay, enough,
I'm going to attach Snowflake. But sometimes you have business requirements. For example, if it's a large company, they build a whole department for analytics, and these guys know how to work with a particular analytical system, and they say, okay, we need the data there, in Snowflake, that's it. In this case, no options, right? We don't say, let's try Postgres. So it depends, there are many questions here. I would still encourage people to consider it as an option, and to mention to the team that it might be an option. But if you've already got the expertise, that does limit the cons. You know, if you don't have some of those downsides that we discussed, then it is an easier trade-off to make, especially if you have some of the pros,
especially if you know about a feature in Snowflake that you get out of the box that you just don't get out of the box in Postgres, or that would take additional engineering effort. And in a lot of cases, it's very easy for us to brush over those. I don't know these systems intimately, so I'm sure there are features I've never heard of that some people find extremely useful. We've got episodes on queuing and on analytics already, which I'll link up in the show notes for anybody that missed those, but we don't have one on full-text search yet. I think that will be an interesting episode someday. But Elasticsearch is a huge product and has so many features, and I'm sure support must... Before we go to this topic, let's finish with analytics, and maybe time series as well. I think full-text search is definitely a new category to discuss. But speaking of analytical workloads: if you have already decided to have an additional database, or it's a requirement, for example, a lot of budget is allocated for Snowflake and you cannot beat that.
It's already allocated, it should be spent on Snowflake, and people are hired. So, okay. Then there will be a question of synchronization. And this is interesting, because you will probably need to deal with logical decoding or logical replication. And probably you will need an additional tool. I mean, it's possible to build synchronization yourself, but if the company already spends money on Snowflake, probably they will decide to spend money on tools like, I don't know, Fivetran, for example, because there is a lot of stuff to solve in synchronization. And again, it's related to logical replication and decoding, and in this case it's logical replication between two different systems.
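On the Postgres side, this usually starts with a publication or a logical replication slot; a minimal sketch, assuming wal_level = logical and hypothetical table and slot names:

```sql
-- Publish the tables you want downstream; a Postgres subscriber would then
-- run CREATE SUBSCRIPTION pointing at this publication
CREATE PUBLICATION analytics_pub FOR TABLE orders, order_items;

-- External CDC tools typically create a logical replication slot instead,
-- e.g. using the built-in pgoutput plugin (or wal2json, etc.)
SELECT * FROM pg_create_logical_replication_slot('analytics_sync', 'pgoutput');
```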
So the problem of avoiding an initial full resynchronization is no joke there. And if the tool is not right, for example if you choose poorly, it will produce duplicates on the analytical database side, because the usual analytical databases, such as Snowflake, BigQuery, what else, Redshift or something, don't have unique keys. And this is by design. And in this case, if the synchronization tool, which is built on top of logical decoding, is implemented in the wrong way, you will have some duplicates when replication switches from the initial full sync of a snapshot of data to CDC, change data capture. And this is a problem. It's a solvable problem; the tool just needs to be fixed if you have duplicates. Obviously, the tool might be relying on the presence of unique keys on the subscriber side, but it should not, because such tools' primary goal is pipes from OLTP to analytical database systems, and analytical database systems don't have unique keys.
If this tool allows duplicates, it's a bad tool, don't use it. Or tell them: fix it right now or we won't be paying you, or so. So the duplicates problem is solvable, just with careful orchestration of this process of switching from the initial load to CDC. But actually, that's it. I think today we have quite good capabilities on the Postgres side, so data can be moved from an OLTP system to an analytical system with very, very small lag. And it's amazing. I mean, you can have lag below one minute, definitely, and with a lot of TPS, if everything is done right. Great. Also if you use some cloud replication tools like Fivetran. I had a bad experience once.
It was kind of my fault. It was a Postgres database on RDS, which obviously allows logical replication to anywhere: a logical replication slot can be created, and a logical replication connection from outside is possible. I was checking Fivetran from RDS to Snowflake. Snowflake was also on AWS, but at that time, I think it was the early days of Fivetran; now they are much, much better. So I was seeing very bad performance, lags and so on, in my synthetic tests: you take pgbench and you see the lag accumulating very, very fast. And it turned out that the Fivetran servers were in Google Cloud, so, latency. It doesn't make sense. With these software-as-a-service, platform-as-a-service, anything-as-a-service offerings,
you need to make sure network connectivity is good. I mean, not distant; latencies, throughput, everything you need to check. And right now, I know Fivetran have servers in AWS, so you just need to choose the proper settings. And in this case, everything should be in the same region, right? Well, unless you need a multi-region setup. In this case, performance should be good, and the delays, the replication lags, should be very low. And I think it's good. If your company needs Snowflake, for example, it's good. Or ClickHouse, managed ClickHouse, I don't know, there is such a thing, right? So probably you need managed ClickHouse, and you need to stream changes from Postgres to ClickHouse. It's possible with some open source tooling. But I think it's great. I mean, why not?
And also think about the progress. Even five, ten years ago, it feels like this would have been a dream for many people. The most things I heard about were extract, transform, load, kind of nightly processes into an analytics database, so your data was, like, a day old normally. Yeah, right. But that was the norm even just, well, maybe 10 years ago; now maybe I'm a bit older than I realize. Many banks still use this nightly load approach, nightly refresh, and it's terrible. It should be redesigned. But of course, the cost of changes is high, so they keep old systems working. If you wake up at night and try to use something and the application says it's down for a couple of hours, it's insane. It should not be so. But I would like to emphasize, maybe for
banking systems, some banks use Postgres. Transactions are reliable, it's open source, quite good, robust, and maybe the cost of ownership is low, because there's no Oracle-level licensing. But then you need additional systems, like in a banking system, or maybe e-commerce. E-commerce might be a good example here, because obviously they need good analytics systems. So maybe this is where a combination of Postgres for order processing and so on, and something like Snowflake or Vertica for analysis, is the way to go. Cool. I would be keen to get your thoughts
on some of the other use cases for... Yeah, let's discuss. What? Full-text search. Yeah, I think that's a great... Yeah, a good starting point. Yeah, I think there are a lot of benchmarks saying that Elastic is better than Postgres at full-text search, right? But unlike analytical workloads, where you usually reorganize data in some form for analysis and so on, and usually synchronize some data there and analyze it, in the case of full-text search you, I think, want the search to be a part of your main OLTP workload, right? Because a user, for example, works with some pages or, I don't know, buttons, and then the user decides to search for something. And for the user, it's a part of everything in this OLTP mobile or web app, right?
So it's a part of... I think it really depends. And this is a difference from the analytical case we just considered; this is maybe the main difference, because full-text search usually is a part of the main workload, OLTP, no? I think it can really depend. But yeah, I think you're probably, well, you're probably right. But equally, user-facing analytics, people might expect that to be up to date immediately as well. So I think there are cases for both of these; like, it depends how much that data is updating. How much of it is user-generated data versus, like, system data? I do think there are interesting cases on both sides. But the full-text stuff: I don't trust most benchmarks I see. I see a lot of benchmarks showing whichever vendor is doing the benchmark just miraculously comes out on top. Yeah, exactly.
And I do think Postgres has, until relatively recently, suffered a little bit from not having that marketing machine behind it, people super incentivized to do the marketing efforts. But now we do have a lot of commercial entities doing benchmarks that show, you know, Timescale doing various things and Supabase doing various things. I do think those two make a good effort to make things fair, and I'm sure most companies try to make things fair, but they know their system way better than they know their competitors' systems, so there's always going to be bias in those. So, if I could only encourage one thing for people from this: if you are comparing Postgres to a companion database,
don't trust the benchmarks from those two systems. Do your own benchmarking. Test your own kind of workload. So, actually, you're right. And in this case, probably Postgres full-text search is good, but it will also require a lot of effort. I mean, in this case, if it's a part of the main OLTP workload, then you have two options. For example, we deal with Elastic, and it also will require some effort to use. But obviously, that system is targeting only full-text search. Great. Now also vector search, I'm sure they have it, right? Everyone has. But if you choose to use Postgres full-text search, it has a lot of capabilities, but it will also require an effort to set it up properly: the dictionaries, the various ways to manipulate things,
like how to transform text to tsvector, for example, and a query to a tsquery properly. And the query language is probably, I'd say, not as powerful as in Elastic, but it will be possible to combine it in a single query with other Postgres data; this is the power, right? And the data is here, in the same database, so this is a super big pro of using Postgres full-text search. And I must say it's quite advanced; you need many days to understand all the details which are possible. But looking at the benchmarks, I just looked at one before our podcast. I personally haven't touched full-text search for a few years, but I know it's not progressing fast now, because, you know, the same people who were driving progress for full-text search, from Russia, founded Postgres Pro and have a different focus, different challenges. So I think full-text search
should get a new wave of attention, to be developed further. Many things to improve, because it's a very wide topic. But before, a few years ago, I touched it a lot and implemented a lot of things myself using full-text search. And I must say, of course, I was trying to avoid Elastic all the time because of this: the data is here, I don't need to synchronize it, no dealing with lags, no transformations except text to tsvector and query to tsquery. And it's great.
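For reference, a minimal sketch of that setup, with a hypothetical documents table, a generated tsvector column, and a GIN index:

```sql
ALTER TABLE documents
  ADD COLUMN tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;

CREATE INDEX documents_tsv_idx ON documents USING gin (tsv);

-- The text goes through to_tsvector, the search string through a tsquery parser
SELECT id, ts_rank(tsv, q) AS rank
FROM documents, websearch_to_tsquery('english', 'companion databases') AS q
WHERE tsv @@ q
ORDER BY rank DESC
LIMIT 10;
```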
But obviously, still, Elastic offers a lot of stuff. Some people say: I want to benefit from both. And there is the ZomboDB project, right? Yeah, really impressive. Yes, and I'm impressed with the technology. I mentioned the benchmark this morning, Elastic versus Postgres. Usually, when I have a few minutes with any benchmark, I quickly find downsides in the Postgres setup. You remember, recently some people said they were going to beat Postgres full-text search. It was, how is it called? ParadeDB, I think. ParadeDB. And they just forgot to create an index, a GIN index. It's not a good idea to compare without an index, obviously. Okay, it was clear from the curves they published.
But the benchmark of Postgres versus Elastic, I checked right before our podcast. I tried to quickly find problems with the Postgres setup, and I didn't find any. They tuned it properly, and it was titled something like: Postgres full-text search, of course, loses to Elastic in terms of performance. So maybe there is still an opportunity to tune Postgres there, but maybe in many cases it's quite a valid statement. Maybe Elastic handles full-text search workloads better. Okay. So if you decide to synchronize, look at ZomboDB, right? Yes. So, ZomboDB is really interesting. I don't know exactly how it works, but my understanding is that it is backed by Elasticsearch, but you query it through Postgres. So it's like an index access method, I believe.
So you're using the Elasticsearch index from within Postgres, so you get all those same benefits of being able to join it with other data in your system. And it handles some of that transaction boundary stuff for you, which sounds like it would otherwise be a nightmare. So, yeah, it's Eric Ridge, isn't it, and his team behind that. And it's open source as well, right? I think you can patch it. Yeah, that's great. And this reminds me, actually, that in the previous category I forgot to mention PeerDB, which is a new company, YC-backed.
There are many Postgres-related companies which are YC-backed now, and PeerDB solves the problem of synchronizing data from Postgres to data warehouses, analytical systems and so on, and they promise to do it much better than others. So it's also worth looking at them, but they just started recently, so it's a very young company. I met with the founder recently here in San Diego, and it was a very good discussion, I enjoyed it. And also, people who attack a particular area have a lot of great ideas, so I'm glad to see the Postgres ecosystem expanding in terms of the various additional tools people need. All right, so, okay. So, Postgres to analytical system: I mentioned many names, but also look at PeerDB, which is a new player,
and Postgres to Elastic: ZomboDB, right? Great. So I would try to convince everyone to stay within Postgres only to solve full-text search problems. Also trigram search. But I know how GIN sometimes behaves. Maybe you won't reach those levels of workload, but, I don't know, tuning it also requires effort. So maybe two systems here is not a bad idea. And again, maybe in some cases, right? Maybe it's a scale thing. For ages, you might be fine on just Postgres, and then at some point of scale, you want to switch. It might be one of those trade-offs again. Maybe, but still, to me, being purely a Postgres guy, of course, full-text search in Postgres is great. A lot of options. My biggest concern is actually not related to full-text search itself, but to the lack of ability to have a single index scan which handles full-text search but also orders by time, or reverse order. That was the idea behind RUM indexes, which, like, still,
I think this would be great to improve and solve, but it's a hard topic. And I'm not sure Elastic solves this problem well. Right? Yeah, I don't know enough about it; it feels to me like an extra dimension of the data, and this might be where some of the work around dimensionality, indexing around that, might be good, but again, maybe with the downside of some missing data, like maybe accidentally missing some post because of the accuracy. But this is a good thing
to solve. And it's not only about full-text search, it's also about the semantic search that pgvector can provide. For example, I want fresh data to come first, data which meets my requirements in terms of full-text search or vector search, but I want fresh first. It's a natural desire, you know, from social media applications, for example. Yeah. Cool. So what else? Queue-like workloads. We discussed them quite well some time ago, and I would like to
emphasize that I don't see people using logical decoding or replication from Postgres to, I don't know, some Kafka. Oh, why don't I see it? I do see it: Debezium, right? The Debezium project, open source, it's exactly what it's needed for. But now we have this new pgmq system, right, from Tembo.io, a new company, which is worth looking at. I haven't tested it myself yet. And also here, I would like to mention this problem of synchronization. So synchronization should not be done with LISTEN/NOTIFY, because of the many restrictions this mechanism has. Logical decoding is great for synchronization; it will require some effort. But there is also an approach, maybe for queue-like workloads, which is more standard, when you have something in Postgres and then you want to insert or update it somewhere else. There is two-phase commit, which you probably want to avoid because it's slow. But the microservices folks invented a lot of patterns, like object-oriented programmers did a few decades ago. So there is the so-called transactional outbox pattern, which I find quite good to learn, and it's simple, but it helps you avoid problems. So for queue-like workloads, if you offload workloads from Postgres, maybe this is the way to go.
For example, Rails people prefer Sidekiq. Some Rails people prefer Postgres; there are many libraries, like Delayed Job. But some people prefer to offload tasks to Sidekiq. Which is Redis. Exactly. This is what I wanted, because the original request was about Redis, right? Yeah, true. So, I know we're about to finish because we don't have a lot of time, but I needed to lead us to this point. Okay. So Postgres, Rails code, and Sidekiq with Redis behind it: it's some kind of popular combination, I think. But I think many people maybe have issues with synchronization, probably without noticing. If they do it directly from Rails, not involving any two-phase commit and so on,
you inserted into Postgres but didn't insert into Sidekiq, or you updated... I mean, you need to see how synchronization works. And this transactional outbox is quite simple, it's not rocket science at all. You just have a special table, you propagate events into it, and then consumers offload them to Sidekiq reliably.
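A minimal sketch of such an outbox; the orders table and the event shape here are purely hypothetical:

```sql
CREATE TABLE outbox (
  id         bigserial   PRIMARY KEY,
  topic      text        NOT NULL,
  payload    jsonb       NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

-- The business change and the event are written in the same transaction
BEGIN;
INSERT INTO orders (customer_id, total) VALUES (42, 100.00);
INSERT INTO outbox (topic, payload)
VALUES ('order_created', jsonb_build_object('customer_id', 42, 'total', 100.00));
COMMIT;

-- A relay process then reads the outbox and enqueues jobs into Sidekiq/Redis,
-- deleting or marking rows only after successful delivery.
```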
And in Sidekiq, you already have this processing; it's not our problem anymore, it's Redis's problem to process those events, right, not to lose them and so on. But I think in this particular case, Redis probably is needed for such kinds of workloads, like event processing. Let's look at Postgres. Usually, if you look deeper, there are two things to check, as we discussed in the queue episode, the too-long-didn't-read version. First, check how old events are cleaned up; if they delete or update intensively, it's bad. It should be either TRUNCATE or dropping a partition, to avoid bloat and autovacuum issues. And second, if you need insane scale, I mean multiple workers, you should have SELECT FOR UPDATE
SKIP LOCKED, sorry. These are the two things for proper performance long term, I mean, performance which is not degrading over time; a small sketch of the worker side is below. Good? That's it? Anything else? I think so.
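A minimal sketch of that consumer pattern, with a hypothetical jobs table:

```sql
-- Each worker claims one pending job without blocking the others
WITH next_job AS (
  SELECT id
  FROM jobs
  WHERE status = 'pending'
  ORDER BY id
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status = 'processing', started_at = now()
FROM next_job
WHERE jobs.id = next_job.id
RETURNING jobs.*;

-- Old, processed events are best removed by dropping or truncating whole
-- time-based partitions rather than by DELETE, to avoid bloat.
```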
Yeah, last shout-out to the one that's probably the most in flux at the moment, which is the vector database specialism. I actually think this might be a case where we don't know how the dust is going to settle yet. There's so much progress, there's so much money involved, so many companies very well funded; competition is high. Yeah. And also there still seem to be huge leaps forward, like the most recent release of pgvector, the numbers that they released in their 0.5 update, because they added the HNSW index type or algorithm, I don't even know how to describe it, but that was such a huge leap forward in a couple of the ways that you might want to optimize things. So who knows what's next? Who knows what progress there is down the line there? Yeah. So I want to say the important thing. First of all, I use it already daily,
pgvector. Nice. But I haven't reached that scale. We haven't reached the scale where we need a good index, because we only have fewer than 100,000 documents yet. We will have a lot more, and in this case, I mean, I haven't done my own benchmarks, my team hasn't done it yet; we will, because eventually we'll have many, many millions, I hope, maybe dozens of millions of documents. It's, by the way, Postgres-related. But of course we use this HNSW, which stands for Hierarchical Navigable Small Worlds. Weird name, by the way. I think some guy, or a bunch of guys, from Russia and from my own university invented it, like, five years ago or so. Maybe the term is not from them, but the algorithm everyone tries to implement is from there. I was surprised. So I want to say that there is a very popular benchmark which compares various vector databases versus, like, Postgres and Elastic and so on, every other already existing system which just added some kind of plugins, extensions, capabilities. And I must say this comparison still lacks the HNSW capabilities of pgvector, because it benchmarked an older version of pgvector. That's why pgvector loses
drastically there. So don't trust those benchmarks; it became better. And don't rush into using, for example... Supabase had an article. Let's touch it. Paul also wrote to me that pgvector with these HNSW indexes is a lot faster than Pinecone, so look at their blog post, the Supabase blog post with an additional benchmark. And I hope that the very popular vector benchmark will also be updated soon, right, because it's not fair, you know. So Postgres is good here, actually. And yeah, it looks really good. But please, again, don't trust any benchmarks, people; check for yourself. I do trust the Supabase people. I've read that exact post, and it seems great. They even spend more on the Pinecone instance than they're spending on the Supabase instances. It seems as fair as you could get, but how are we to know that they're great at tuning Pinecone servers, and they've done, you know... It's really difficult. Yeah, I cannot keep it in: as a small spoiler, our bot will be capable of running benchmarks, conducting benchmarks, and I think we should do it as soon as possible. This particular case: let's check pgvector with various options, various indexes. We will do it. And it will be super easy for everyone to reproduce; we're just chatting with the bot. On Postgres, but not for comparison. Yeah, just Postgres.
We focus only on Postgres. But you can get all the same metrics and compare and see TPS, latencies, all the details, a lot of details. So then think about how to improve, and not only you will be thinking, but... Okay, enough. So vector workloads are good in Postgres; just try it, something like the sketch below. Don't trust others.
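A minimal sketch, assuming pgvector 0.5 or newer and a hypothetical table of embeddings:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_embeddings (
  id        bigserial PRIMARY KEY,
  doc_id    bigint NOT NULL,
  embedding vector(1536) NOT NULL
);

-- HNSW index using cosine distance
CREATE INDEX ON doc_embeddings USING hnsw (embedding vector_cosine_ops);

-- Ten nearest neighbours for a query embedding passed as a parameter
SELECT doc_id
FROM doc_embeddings
ORDER BY embedding <=> $1
LIMIT 10;
```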
Because others want to build new systems. They will be saying Postgres is bad; it's obvious. But it's not... If I trust the VictoriaMetrics CTO and I see the numbers myself, or ClickHouse, or Elastic, I know the technology is super good. Here, I don't trust it yet. Let's give Postgres a good chance before moving on. Do you know what, that's a really good point, and something I forgot to list in the pros for keeping a workload on Postgres. And this is the wording exactly from the Supabase blog post,
but they've said Postgres is battle-tested and robust, whereas most specialized databases haven't had time to demonstrate that reliability. It doesn't mean they're not going to be super reliable in time. It takes 10 years at least. Exactly, 10 years. Yeah. So, yeah, a really good note. And obviously... Backups, bugs, ACID, a lot of stuff. Yeah. It takes a lot of time. Even if you change a part of Postgres, you need a lot of time for it to become stable.
So there are systems that, each time I touch them, demonstrate issues very quickly. But yeah. Like, there's that group that does, is it Jepsen? They're doing serious testing of systems, distributed systems particularly. Yeah, but even, like, finding bugs in non-distributed systems, transaction isolation levels, and things like that. Some really impressive people doing some really impressive work, but they won't have done that on newer databases, and people wouldn't have had a chance to fix some of those issues yet. Yeah, it's a really good point, and I don't mind us being a little bit biased. But as we see from
the full-text search and analytical use cases, it's possible to beat Postgres for particular workloads, for sure. Yeah, yeah, yeah. Good point. Cool. Thanks so much, Nikolay. Thanks, everybody. And keep the suggestions coming. Yeah, thank you. Bye-bye.
