Postgres FM - When not to use Postgres

Episode Date: September 5, 2025

Nik and Michael discuss when not to use Postgres — specifically use cases where it still makes sense to store data in another system. Here are some links to things they mentioned:

Just use ... Postgres (blog post by Ethan McCue) https://mccue.dev/pages/8-16-24-just-use-postgres
Just Use Postgres for Everything (blog post by Stephan Schmidt) https://www.amazingcto.com/postgres-for-everything
Real-time analytics episode https://postgres.fm/episodes/real-time-analytics
Crunchy Data Joins Snowflake https://www.crunchydata.com/blog/crunchy-data-joins-snowflake
Two sizes fit most: PostgreSQL and Clickhouse (blog post by Sid Sijbrandij) https://about.gitlab.com/blog/two-sizes-fit-most-postgresql-and-clickhouse
pg_duckdb episode https://postgres.fm/episodes/pg_duckdb
Cloudberry https://github.com/apache/cloudberry
Time-series considerations episode https://postgres.fm/episodes/time-series-considerations
Queues in Postgres episode https://postgres.fm/episodes/queues-in-postgres
Large Objects https://www.postgresql.org/docs/current/largeobjects.html
PGlite https://pglite.dev
ParadeDB https://www.paradedb.com
ZomboDB https://github.com/zombodb/zombodb
turbopuffer https://turbopuffer.com
HNSW vs. DiskANN (blog post by Haziqa Sajid) https://www.tigerdata.com/learn/hnsw-vs-diskann
SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search (paper) https://www.microsoft.com/en-us/research/wp-content/uploads/2021/11/SPANN_finalversion1.pdf
Amazon S3 Vectors https://aws.amazon.com/s3/features/vectors
Iterative Index Scans added to pgvector in 0.8.0 https://github.com/pgvector/pgvector/issues/678
S3 FDW from Supabase https://github.com/supabase/wrappers/tree/main/wrappers/src/fdw/s3_fdw

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello, hello, this is Postgres FM. As usual, I'm Nik, Postgres.AI, and as usual, my co-host is Michael, pgMustard. Hi, Michael. How are you? Hello, Nik. I'm good. How are you? Very good. So the topic you chose is to talk about beyond Postgres: when we should avoid using Postgres, right? Yeah, you put a shout out on a few social networks asking people what kind of questions they'd like us to answer, and we had lots of good suggestions, as we've had for many years now, and one of them was particularly good and I thought was worth a whole episode, which was, yeah, when not to use Postgres. I think there's a growing trend of, or a few
Starting point is 00:00:47 popular blog posts of people saying you should you should consider Postgres for most workloads these days and I think it is still an interesting topic to discuss are there cases where it doesn't make sense? If so, what are those and what, like, when does it make sense not to use Postgres? And I thought, I was interested in your take on some of these as well. Yeah. Well, classic example is analytics, of course. Yeah. Do you want to list a few and then discuss them in a bit more detail or just want to... Yeah, let's create a frame of this result, right? So analytics, embedded databases. Yeah. So like, and I think analytics, Yeah, so analytics embedded, I think storing large objects, there are some cases where it makes sense.
Starting point is 00:01:38 Exactly, especially larger ones, like videos, like very large objects, 100%. And just let's agree, in every area we can discuss pros and cons of using Postgres because some people will definitely have opinion that it's not an excuse to avoid postgres. Let me add then. With this in mind, let me add. add than topic like ML data sets and pipelines. Yeah. Right? Yeah.
Starting point is 00:02:08 Machine learning and big data to, yeah. I think anything where there's specialized databases, like search, vector databases. Vectors, exactly. Let's talk about vectors separately. It's worth it. And then one more, which I think, well, actually, I had two more on my list.
Starting point is 00:02:27 One was potentially controversial. I wondered if there's a case if you're at extreme OTP Right heavy Very very right heavy And you Like let's say you've got
Starting point is 00:02:43 Institutional Experience with Vitesse I think sticking with that for the moment Makes a lot of sense So like when not to use Postgres I wondered if like a new project came along Starting with Vitesse while we get these Sharded Postgres
Starting point is 00:02:59 like OTP sharding, Postgres solutions up and running. I think at the moment, maybe it still makes sense to not use Postgres there. Let's add then time serious. Yeah. Let's discuss this area. Time serious. Yeah. And also data that can be compressed really well.
Starting point is 00:03:19 It's, this topic is close to analytics. True. Yeah. And then I had one more. One more, but I guess it's kind of what I said just now. I think if you or your organization have tons of experience with another database, the argument for using Postgres for your next project is weaker. I'm not sure it's like when not to use Postgres.
Starting point is 00:03:41 I think I could think of lots of counter-examples. Well, this is orthogonal discussion. You can say if we already have a huge contract with Oracle, we already signed for next five years. It's not wise to start using Postgres and those money will be spent for us. thing, right? There are many such reasons, right? Should we stick to technical ones then?
Starting point is 00:04:05 Yeah, yeah. Like area, like types of usage, area. Q-like workloads, I would have. Yeah, interesting. Yeah, yeah, yeah. The last one is like kind of Kafka territory. Or there are others, of course. Yeah.
Starting point is 00:04:23 All right, should we start with analytics then? I feel like that. I know we did a whole episode. Kind of a whole episode. Yeah, so Roastore is not good for analytics and select count will always be slow
Starting point is 00:04:36 you need denormalization or estimates, estimates will be slow, not slow, not too rough and too wrong sometimes yeah, it sucks. Yeah, but I think there's a scale, like I think we
Starting point is 00:04:53 talked about this before, but there's a scale up to which you'll be fine on Postgres. You could achieve better performance elsewhere, but if you have hybrid, a lot of systems now are hybrid, right? Like, they have to be transactional, but they have to provide some analytics dashboard for users or something, like, but they still want real-time data. They still want transactional processing, maybe 90 to 95, maybe even 99% of the workload transactional, and only, like, there's a few analytical queries from time to time.
Starting point is 00:05:22 I still think those make a ton of sense on Postgres. Counts everyone needs, right? you know pure LTP applications they need to show counts or understand pagination and like on silk have counts if you think about social media you need to show how many likes comments and so on reposts so yeah and it's and like to implement it in purely in posgous a good way for generalization would be needed yeah and also it like it can it There are so many rakes lying around this, you can easily step on them and have a hotspot. Like, you know, this classic hotspot when accounting system, like tracking balances,
Starting point is 00:06:10 and let's say we have like whole balance and all transactions update a single row. So this is where it can be an issue. But at the same time, there are many attempts to do better and some attempts led to companies being acquired, right? I mean, crunchy. There are many new waves of aiming to solve this problem, better analytics for PostGus. I see two big trends right now. One trend is how Sid,
Starting point is 00:06:46 founder of GitLab recently had a post saying that PostGas and Clickhouse is a great couple of database systems. I don't remember exact title of that post, but the idea is that it's great. They go together very well. Yeah. We also had Psi, who founder of PeerDB, which was acquired by Click House, and I met with him last week. And we talked about, like, again, the same basically idea that Click House and Postgreaves are great together. And this is one direction.
Starting point is 00:07:22 Another direction is saying Click House is very different. and not even maintaining maintaining is absolutely different but also using it it requires different mindset and skills so it's better to choose for example DuckDB keep
Starting point is 00:07:40 yeah so do everything inside postgres but in a smarter way this is what multiple companies worked on recently and one of them Crunchy was acquired by Snowflake and we had it we did an episode on PG DuckDB as well
Starting point is 00:07:57 so a slightly different approach on that. But yeah, the crunchy one's interesting because all the queries go through Postgres, but a lot of the data is stored in, like, iceberg or like some other file format. Yeah, exactly. Getting the column last side. Yeah.
Starting point is 00:08:13 So, yeah, this definitely feels like one of those ones. There's also a third, by the way, there's a third option, which is these massively parallel databases. Well, I was at a, you spoke to Syla last week. I was at a UK event this week, and there was presentation from the successors of the Green Plum project, the kind of the open source successors, which is called Cloudbury. It looks like really interesting work.
Starting point is 00:08:40 But that's another way of doing some analytics from within Postgres, kind of. Yeah, and from previous experience, from the past, I remember cases when Postgres and Green Plum was combined in one project, and it was great, and it was some bank, even quite big. bank and yeah but somehow i stopped looking at green plant for quite long already i don't know there are also of course commercial databases i remember vertica there is snowflakes super popular it's like major player in this area by the way i would distinguish two areas of analytics one is internal needs for company we need to understand how business is doing a lot of stuff We need a lot of reports.
Starting point is 00:09:25 And another need is we need to show our users some counts, like I said, on social media. So two big areas, I think, also. Yeah, good point. In the first case, users are internal, and the second case, users are external. I'm pretty sure there are a lot of mixed cases, additional cases. But I personally like these two directions. Of course, there are others. There is also redshift.
Starting point is 00:09:53 on AIWs. Yeah. Also originally based on Postgres, yeah. Yeah, yeah. So there are many options here. So yeah, is the short version at sufficient scale? It probably doesn't make sense to be using Postgres at this point for analytics. But like that, that level is quite high.
Starting point is 00:10:13 Yeah. But also I see cases when companies go to Snowflake, then try to escape it. Come back. Yeah. Okay. So going to Snowflake, it's like going to Oracle, in my opinion. You mean like in terms of financial? In terms of vendor lock-in?
Starting point is 00:10:32 Yeah. Because it's just purely commercial offering. There are, of course, many tempting things there, features, performance. Yeah, yeah. Integrations. Nice product to use as well, yeah, developer-friendly. Yeah, well, users love it. I agree.
Starting point is 00:10:52 But if we try to remain in more open source and vendor lock in less, then it's like it should be excluded. Even Click House, like Click House is open source itself. Yep. Right. You mentioned the time series being quite close to this. I feel like we should jump to that next. What do you reckon?
Starting point is 00:11:14 Well, timescale DB is great, but it's also kind of enderlocking because it's not open. yeah yeah so because of their license other cloud providers can't provide timescale as a service easily or at least not the not the version with lots of nice features yeah in timescale cloud I had a recent case where we saw limitations again very badly like create database doesn't work and moreover lack of observability tooling like again like I keep promoting on this podcast if guys who build platforms listen to us, you must add PG weight sampling
Starting point is 00:11:55 unless you are RDS, okay, but even in case of RDS we talked about this it's great to have it in SQL context and be able to combine weight event analysis with regular PGStatement's analysis and PG
Starting point is 00:12:11 Stat K Cash additional very good observability point. Because I had the case when guys just compared everything so worse performance worked closely with time scale but in case of RDS you see performance insights understand where time like where we wait right case of time scale only rare collection of samples from Pugist activity is possible it's sometimes good enough but it's quite rough tool to analyze performance so yes such things are lacking
Starting point is 00:12:48 and unfortunately more and more I come into conclusion that when I recommend timescale to customers it contradicts with the idea they want to stay on managed service yeah yeah because they're down to a single choice yeah yeah that being said the timescale cloud even offered me like some bounty if I convince someone to go to them and this is great like I love loyalty but I need to be fair. Some pieces, big pieces are missing, unfortunately. Yep.
Starting point is 00:13:25 And again, again, Postgres, even without time scale, can be used for good time series, like workloads up to a certain point. We're just talking about at very high scale, right, where all the features like compression, like continuous aggregates, like automatic partition. Straight to the point. Straight to the point. Yeah.
Starting point is 00:13:46 Oh, by the way, for time series, Yes, Clickhouse also is still a good option, and there is also Victoria Metrics, right? Well, and I learned just yesterday about even Cloudbury have incrementally updated in materialized views. I need to look into it, but that's quite cool. And if you're, like, maybe that would be a good thing about this. Wouldn't it be great to have in Postgres something like update materialize, you were, and you just define the scope and also concurrently? we should do a whole episode.
Starting point is 00:14:20 I think there are several projects that have started to look into incrementally update immaterialized views and I think they're more complicated than I've... It's like one of those topics the more you learn about it the harder you realize it is.
Starting point is 00:14:32 Right now in position where most, not everyone, but most of our customers are on managed postgres so it's really hard for me to look at extensions which are not available
Starting point is 00:14:43 on RDS CloudSQL and others. I understand. I'm just thinking like I think it's worth learning from the extensions as to what would be needed in core. Like, how did, what did they try? What was difficult about that? And it's not just extensions, right? There were whole companies that have been built on the premise of, is it called?
Starting point is 00:15:02 Is it material, like, or like, what was the thing? Yeah, yeah, yeah. I haven't heard for a few years from them, what's, like, I'm curious what's happening there. Lack of autonomous transactions will be an issue, right? Yeah. Or, or Q like, Q like. tool inside Posgos, so asynchronously update would be propagated through Q-like. If everyone had PGQ, like CloudSQL has, from Skype, developed 20 years ago.
Starting point is 00:15:34 In this case, implementing incrementally asynchronously updated materialized users would be easy. Well, yeah, A-Sync and Sync is, anyway, this is an interesting topic. Yeah, yeah, yeah, yeah. And we already, basically, we just touched Workloads, you like workloads, it's still hard. Blow it is an issue, right? We discussed it, I think. I think, well, we've discussed there are solutions, right?
Starting point is 00:15:57 I actually think Q's is one of the ones that I was going to fight you hardest on. Like, I think there were ways to do it badly within Postgres. And again, at extreme scale, I think it wouldn't be smart to put it in, especially not in any way. Skype had extreme scale. Yeah, well, yeah, okay. 20 years ago, one billion users was a target, one billion users. good point so maybe actually that of all the ones that we added to potentially be on the list
Starting point is 00:16:24 that would be one where I think if you manage it well like PGQ at scrap did with partitions actually I think you that's not an excuse to not use Postgres if that's the title yeah benefits are serious problem is like I when I recommend the Q like workloads inside Postgres I just say like I understanding whole complexity and hidden issues that you just need a high level of understanding of what's happening to have it yeah but if you have it it would be great it will be great it's just not a small project unfortunately usually yeah good point and this recent case with notify as well because it's also sometimes used in such workloads yeah yeah as a reminder notify exclusive
Starting point is 00:17:15 log on database, serializing all netifies makes it basically not scalable. Yeah. Yeah. All right. Anyway, like, what's the answer? Like, if you was, if you needed to create a project and you would need to think about analytics, like, think about like, okay, we will have terabytes of data very soon, fast growing. What do we choose for analytics? What do we choose for, Q-like workloads for time series. What are the choices you would make?
Starting point is 00:17:50 I think it does depend a lot. Like you already said with analytics are we talking about a bank that is doing nightly loads of data and only cares about internal reporting or are we talking about a user-facing web app that has
Starting point is 00:18:06 to do, or like a social media app that has to do counts and various like aggregations that are user-facing. Well, you need both. You need to think about both and what architecture choice
Starting point is 00:18:21 would you make in the beginning? Yeah. For target of terabytes of data in one year, for example. 10 terabytes in one year, what would you do? Would you choose stay in Postgres? I'm a big fan of simplicity.
Starting point is 00:18:36 I think I would stick with Postgres for as long, like, until it was painful. Okay. A re-engineer then, yeah. Yeah. And I know that would be painful, but that would be my preferred option. I think then I am tempted. I think it's quite new at the moment, but I am tempted by the work that Crunchy started
Starting point is 00:18:56 on moving data out to iSpoke format and still querying it from Postgres. Like, I like the, I like that I can keep queries going from the same place. But not possible on CloudSQL or RDS, right? Not yet, right? But I think it's quite early. Like, if I was starting a project today, I would hope that those, they caught up by then and if not then a click house like a whatever peer d is called now within click house like having that go out to an analytic system like click house makes a load of sense to me
Starting point is 00:19:24 what do you think yeah i would choose uh self-managed postgous 100% and i would use times scale db full fledged got it okay yeah and then i would i would consider dark db path as well additionally at some point and q workloads I would engineer perfectly and squeeze as much as POSGUS can. And what else we touched? So again, all of
Starting point is 00:19:53 these are actually sticking with Postgres and then just at some point in the future you're going to have to think about sharding if you get that in the future. Only one reason would make me do it. So I wouldn't go in these three areas we just discussed
Starting point is 00:20:09 I wouldn't go away from POSGUS. Although I understand very customers who we have and why they say we need click house or something in my company i would at click house only if there is a strong team which needs and this is their choice and i delegate and then i i don't i'm not involved in this decision but while this choice is still mine i would stick to progress and just make it work better yeah and can scale until ipo yeah i i saw it several times with several companies so yeah cool makes sense so i wonder if this next one's going to be the first one where we both would would and don't use postgres which is the storing large
Starting point is 00:21:00 objects like large files well definitely i i yeah last time i i tried to store a picture inside Postgres was probably 2006 or seven when it was just exploring you know like oh this is working okay but no I even don't know how
Starting point is 00:21:20 what will happen you know like this like this piece of postgres I touch super rarely you know yeah I think the one exception is text based stuff stuff that like you might want to query it
Starting point is 00:21:35 but even then like you probably want to be doing PDFs, but you know, it's like some representation of the PDF, not the actual PDF, but like the text extracted from, you know, like, it's going to be Or it can be marked down and then we have Pandock or something which converts to both to HTML and PDF, this is what we do with our checkups. Originally, it's in Markdown.
Starting point is 00:22:01 Yeah. And by the way, and another possible exception, I think it's almost not worth discussing is if you only need to store five pictures like maybe you know but i just don't i don't see many cases like that yeah yeah still cool all right still yeah we recently implemented attachments in our own system for pictures and various like archives of logs or something which customers upload or we upload pdf as well sometimes and of course we store them in gcs and in secure manner not in post goes 100% no yeah yeah i even don't don't think there yeah it's just exercise for those who probably don't have other tasks yeah all right another one that i think is maybe in
Starting point is 00:22:55 between these two like advanced search i still think there are cases like there are search use cases where people that i respect and trust would still choose elastic search over postgres even By the way, sorry for interrupting. Sorry for interrupting. I just realized you talked about blobs. Yeah, but we also can. So, and this can be an issue at upgrades I've heard, right? Major upgrades. I don't see them at all. So I mean, I mean, you can try to store it in postgres, but then you have some operational issues additionally. Because not only I don't see them often, people who design some procedures and tools, they also don't. see them often and it's kind of exotic to keep to have blobs yeah so when you when you say major
Starting point is 00:23:44 upgrades are we talking about like the speed of the initial sink or we're talking about like pg dump if we go in the dump restore route then actually it's all having to get dumped out yeah yeah i just remember some notes about it i saw maybe on rdios specifically about large objects maybe i'm wrong actually I just remember that I'm actually using it, not for super large objects, but several kilobytes like we store query plans in the in JSON format or text format. Text format ones don't tend to get that massive, but JSON format ones can be hundreds of, well, we've actually seen a couple that were tens of megabytes, I think one or two that
Starting point is 00:24:28 are in the hundreds of big. We need to amend this part of episode. You are talking about Varlena types. Yeah. There is a special thing, large object facility, a special chapter. Yes, sorry, that's a different, yeah. Yeah. Warlena, everyone uses jasons, large texts, everyone.
Starting point is 00:24:54 Even bite arrays. Okay, you mean there's a specific issue with the thing called large object? I cannot say I don't touch large jasons. Of course, I touch them a lot. We have them a lot. And, yeah, we talk about how they toast in Postgres. Yeah, yeah. Large jasons, large XML sometimes, right?
Starting point is 00:25:14 Texts, of course, large text, everything. For example, our Rack system for AI system has a really large text, chunks of source code, or mailing these discussions, kilobytes. And you put all of those in Postgres. Of course, because we need to parse them and also full text search and vectors. It's everything in podcast right now.
Starting point is 00:25:33 Well, you don't, you could vector, you could store the vectors in Postgres without storing the text in Postgres, but the full text search makes a lot of sense. Yeah, I understand you, but we do everything in Postgres even, even vectorization. It's maybe, it doesn't scale well if you need to deal with billions of vectors, but millions, it's fine. Yeah, makes sense. So what I was talking about is like, it's like a low create function, these things. Yeah, I've not used that. you're saying you don't see it yeah yeah this is what i don't see yeah hopefully everyone got the memo and no one's using it yeah hello from bite so yeah i don't use those and uh again
Starting point is 00:26:18 last time i touched that it was so long ago actual blobs all oh underscore hello put hello get these functions cool have no idea and i suspect something will be broken you start using them some operations like upgrade maybe well not not broken but you will need to take care of them like in seem like some side effects like table spaces can have you know it's not used also often if in cloud in cloud context we don't use often the table spaces but table spaces might might be headache when you do some migrations move your database from place to place or upgrade and so on yeah yeah okay yeah good one what else have we talked we haven't talked about embedded databases yet
Starting point is 00:27:13 on the kind of tiny scale of things i'm yeah i'm not an expert in an embedded databases i've heard school light is good yeah yeah in this category we do now have pg light um looks like a very interesting project yeah but i think at the moment unless i was doing some syncing between Postgres. Like, this had a really good reason. I'd probably still default to SQ Lite if, or SQLite, however they,
Starting point is 00:27:42 however their community pronounces it. But actually, I was going to include in this topic, like even browser local storage, for example. If you're, if you're wanting to do stuff, client side in like a browser app or web app, still makes sense to use the local storage there.
Starting point is 00:28:03 index DB or whatever. So there are like a few embedded cases where I don't think it makes sense to use Postgres or if you're going to try maybe maybe PG light. And can you remind me PG light? What does it like that what does it do? And is it related somehow to WebAssembly? I think yes, right? I think it must be right.
Starting point is 00:28:29 But I don't know. I don't know enough about it. Yeah, it's complete. I'm checking it's a complete awesome build of Postgres that it's under 3 megabytes exept. That's impressive. Yes, it's a cool project. But I guess there's an argument to say it's not actually Postgres.
Starting point is 00:28:44 Like it talks and behaves like Postgres, but it's kind of its own thing. Well, we can talk about many post-Gus variants like this, including Aurora and so on. Some of them more, more Postgres, some of them less. Yeah. But if the topic is like when, when not to use Postgres
Starting point is 00:29:05 yeah I guess Aurora I don't know if I count Aurora as that or not anyway I'm not sure I understand what you mean if the only solution was use Aurora let's say that was the there was like a one of these cases
Starting point is 00:29:23 it turns out Aurora was the absolute best for and way better than native Postgres I think I would count that as when not to use Postgres because it's It's kind of Postgres compatible, like, or Cockroach, like, or any of these kind of compatible, or Yulobai. Yeah, I'd say, Al-a-B-B, yeah, you're right, it's kind of a scale. It's hard to, like, draw a line where it, where it is and isn't.
Starting point is 00:29:48 It's a spectrum. It's a spectrum. Yeah, yeah. Cool. Okay. But, yeah, it feels like that's an easy one, right? If you're, if you've got little devices or little, you know, little sensors. Yeah, default choice is.
Starting point is 00:30:02 cool light already and i like the idea of pg light and i know superbase used it used it for database dot build project which i like a lot with like merging this with i and write in browser you can create pet projects and maybe like explore and it's it's very it's very creative tool to think about to like to bootstrap some new projects in how it could look like it has the AR diagram and you can iterate with AI it's great and there PG light works really well
Starting point is 00:30:37 and I'm sure I'm sure they already created this ability to deploy to sync what you build to real postgres in super by source somewhere right well I think that was the main aim of the of the company
Starting point is 00:30:53 behind PG light called like electric SQL or something replication yeah exactly um so it's the whole premise was local if you heard of local first development so yeah the idea like apps like linear the task management tool they they're like lightning fast because they do everything locally and then sync very thick client very thick yes basically like git like git clone when you write when you type git clone and executed it's basically a whole repository. It can live
Starting point is 00:31:29 on your machine. Distributed fashion, right? And it has to handle mergers. Yeah. You can miss there. It's great. I'm curious if we could explore better branching in that area because we already very close to implementing
Starting point is 00:31:46 synchronization between two DB lab engines. Yeah. But it's a different story. Yeah. I like the idea. So it might be a foundation for PG light I mean might be foundation for more apps which will leave in browser but then be synchronized with real posgous right yeah or even desktop apps like it doesn't have to be browser based well I guess the desktop apps built on top of using electron right oh no yeah good point good point and and and then if you don't have internet connection you still can work yeah like offline mode that's great I like the idea actually I like the idea
Starting point is 00:32:29 so you have Postgres mirror it reminds it reminds me a multi multi master replication this is
Starting point is 00:32:41 complexity all the same problems like with merging and conflicts yeah but at the same time recent Postgres has this ability
Starting point is 00:32:51 to create subscription and avoid loops of replication Yeah, true. So origin is something you can say. I want to replicate on the data which doesn't have origin, which means it was born here.
Starting point is 00:33:06 Local origin, basically, but it means no origin there. Somehow terminology is strange a little bit as usual in Postgres, right? But it's a great ability to break the loops, infinite loops. Yeah, it fixes one of the problems, but it doesn't, yeah. It picks all the time, conflicts. Yeah, exactly. And like last right win type things, yeah. But if you use, if you need to have a very good like service site application and so on,
Starting point is 00:33:36 you choose progress, but then you have this very thick clients and you need to choose database for them and you choose SQLite. Then you need to synchronize between them somehow. It may be even worse. Even harder. I think that's the different data types and models. Yeah. Yeah.
Starting point is 00:33:56 Cool. What about the specialized workloads like, well, vectors, and I was going to bring up search as well. I think search is slightly easier. Let's make sure. I don't actually have, I haven't written an app or been involved in an application that is heavily reliant on, like, very advanced search features. But the people I speak to that have swear by how good elastic search is
Starting point is 00:34:23 and this is also this is also what I see I touch Elastic only usually working with some logs application logs Postgres logs through Kibana so ELK or how is it called
Starting point is 00:34:37 this stack but I also see many customers use Elastic and like it and the shift from Fullsearch search and Postgres there okay their choice and I know limitations of Postgres 90 full search
Starting point is 00:34:52 yeah I'm also I don't understand paradeb and I haven't seen the benchmarks the benchmark I saw was only the in the beginning when they didn't create index on TS vector they made a really interesting hire recently I saw do you remember Zombo yes yes do you remember so yeah because there is money yeah but where are benchmarks so I don't understand what's there because I don't see benchmarks I tried recently, because their CEO founder, Philip, approached me. Nice.
Starting point is 00:35:30 And maybe asking, like, yeah, not maybe asking for, I guess, like, to spread the world. But I cannot spread the world if I don't see numbers. Yeah, yeah. If it's about performance company, show numbers, I might be missing something because, like, well, the other, the other product in this space for Postgres was an extension called, ZomboDB, which synchronized, which kept a Elasticsearch index maintained, but the data coming from Postgres originally. So I thought that was a really fascinating way of having both, a bit like when we talked
Starting point is 00:36:10 about analytics, like having the interface be from Postgres, but the actual query being run on something that isn't Postgres. So that was fascinating. And it was the founder of that ZomboDB that really. recently joined parade. So that seems interesting as like this this this whole story seems interesting. I don't understand it because I cannot find numbers at the same time I see everyone mentions them, a lot of blog posts, a lot of GitHub stars, a lot of like a lot of noise, but there are the numbers and benchmarks. So they removed it after initial ones. Yeah, well it would be so
Starting point is 00:36:50 that if you were to try and stay within Postgres, they seem like. like the obvious thing to try, but I still see people choosing Elasticsearch, and I'm not sure why. Yeah, yeah. Yeah, yeah. If, please, if someone listening to us can share benchmarks showing how Paradeeb is behaving under load, some like, I don't know, some number of rows, some number of queries, like latencies, buffers used ideally, right?
Starting point is 00:37:18 I would appreciate it because I'm still, I'm just, I'm stuck in. question. What is this? Cool. What about vectors? Vectors, I have a picture for you I saw yesterday near my house. These guys definitely, yeah, those who can see YouTube, please check this out. So these guys are definitely experts in vector storage. This is Nikolai's joking. It's like a removal company that are called vector moving and storage.
Starting point is 00:37:49 I saw storage as well. I thought it's funny, yeah. So we have TurboPassfer, right? Well, again, not Postgres, right? Not Postgres at all, and not open source at all. And not free at all, you know, free will. And data is in us three, and this new type of index is being used there. I already forgot the name. But, yeah, so not HNSW.
Starting point is 00:38:21 Oh, interesting, yeah. Yeah, yeah. I don't know. So HNSW doesn't scale beyond a few million rows. Disk on and we had the timescale tiger data guys who developed an advanced version of that. In my perception, I don't see what scale is 2 billion rows at all. And Tobu Pfeffer says they scale 2 billion rows, but as I understand, it's a multi-tenant approach. So every, like, it's not a one.
Starting point is 00:38:53 set of billion vectors. I also don't understand that, but I see some development. The plan scale for my SQL, they implemented the same index. And this development started, I think, at Microsoft and maybe in China, actually. Microsoft in China, which this is what I saw. And interesting, they choose Posgis for prototyping. So this area is worth additional research. I started it and didn't have time, unfortunately. But it's a very interesting direction. what's happening with vector search because I think Postgres is losing right now
Starting point is 00:39:28 well you say losing I think it's losing 100% well bear with me I think there are a lot of use cases that don't need the scale that you're talking about and a lot of those are fine on
Starting point is 00:39:44 on Postgres with PGVector but you're probably talking about the ones that then succeed and do really well and scale they hear hit a limit relatively quickly, or like within the first couple of years? It's really hard to maintain huge HNSW index and latency-wise. It's not good. TurboPyfer, I'm not fully sold on that idea that let's store everything on S3.
Starting point is 00:40:09 Speaking on S3, a few weeks ago, they released S3 vectors. AWS released S3 vectors, and this might become mainstream. so S3 itself right now supports vector indexes have you heard about this no I think this might become mainstream but if if big change doesn't happen in postgres ecosystem it will be worse and the case with full tech search and elastic it will be worse and how it's called like alarmist I am today right Well, this is the point of the episode, right? It's like, it's almost by design that we're talking about the weaknesses.
Starting point is 00:40:55 I was feeling so good saying, like, I would choose Posgis for this, this, and this. I can rely. But here, since we have 1.1 or 1.2 million, million vectors in our reg system for PostGos knowledge, knowledge. Is that mostly because of the mailing list? Yeah. Mailing list, I think, 70%. But we also have a lot of pieces of stuff.
Starting point is 00:41:19 source code of various versions and not only postgis, PG Bouncer and so on, and documentation, it's a lot of stuff. Also, block posts. And I feel not well thinking about how to add more. And we are going to add more. We're going to do 10x at some point. Of course, we will check what Tiger Data has,
Starting point is 00:41:40 but at the same time, I'm feeling not well in terms of. What's the main issue latency? Is it query latency? the main okay yeah latency index size and x build time all these things interesting ability to have additional filter which is on the shens w still legs right yeah maybe i do remember seeing an update on the pg vector repo but i can't remember i feel like they had a something to address this but i can't remember what
Starting point is 00:42:11 i haven't touched this topic for several months i might be already lagging in terms of updates it's a very hard topic of course very young as well right Yeah, and not something I'm experiencing again. It's more something I'm observing. So you're definitely way ahead of me on this. Yeah. So I know just companies, you have several customers who are on Postgreas, but they choose Tobopifer additionally.
Starting point is 00:42:36 And linear, you mentioned, for example. Cursor and linear, they also chose turbopifer. Notion chose TurboPyfer to store vectors. I'm just checking the website. They have some cool customers on this list, yeah. Yeah. And several more companies, which are not mentioned here, we also mentioned, and they are our customers in terms of post-guess consulting.
Starting point is 00:42:54 And I was super surprised to see that something is like massive migration of vectors is happening. Some moving company called to Biphyfer helped them move to their vectors to S3. But yeah, it's interesting and they use some different index, which is like younger idea. It's based on clustering and centroid, so vectors, so it's like A&N is implemented differently, not a graph-like as in HNSW, but basically quickly understand which centroids of clusters are closer to our vector and then work with those clusters. Quite simple idea, actually, but I guess there are many complexities in implementation. there yeah well and it's cheap right being on S3 it oh yes and slow but they have a turbo piper I guess they have additional layer to cache on regular disks closer to database so there is caching layer of course yeah but but it's much cheaper
Starting point is 00:44:05 much yeah actually this is another area if you have hundreds of terabytes of data tiering of storage and Postgres is still not fully solved problem right unless you shard right and this is super explicit to shard
Starting point is 00:44:25 and keep everything on disks especially if there is some archive data which you touch very rarely I would prefer to have it on S3 and time scale cloud tagger data they solved it in their solution we also had attempt to solve it from
Starting point is 00:44:40 Tembo which is not PostGus company anymore. Yeah, right. PGT, right, it was called. But this is, I think, this should be like more and more needed over time. Well, and it's a side effect of the, like the crunch data approach of putting things in iceberg. That also solves the problem, right? You can archive the data from Postgres at that point. So the, it's a similar solution, isn't it? So I guess these days, I would explore a three vectors at this point. If I needed to, maybe I will actually. Well, you are going a need to it sounds like well yeah yeah postgis i infrastructure mostly is on cloud sequel or not cloud google cloud google cloud no no no it was wrong google cloud so one level up yeah yeah yeah
Starting point is 00:45:29 yeah but a three is a double yes so it's but it's it's it's really interesting it should be cheap should be interesting to explore yeah and it's a big challenge big challenge to post ecosystem yeah or maybe opportunity if somebody creates a foreign data wrappers also yeah it's actually why not it should be it's a good project by the way right so interface there is foreign data wrapper to s3 already right i think so i think super i don't know i'll check should be just extended to have vector functions in my opinion okay enough it's like Kind of brainstorm mode already. Thank you so much.
Starting point is 00:46:14 See you next week. Thanks and catch you next week. Yeah, bye-bye.
