Software Huddle - Postgres for Search + Analytics with Philippe Noël
Episode Date: June 25, 2024

ParadeDB is Postgres for search and analytics. As Postgres continues to rise in popularity, the "Just Use Postgres" movement is getting stronger and stronger. Yet there are still things that standard Postgres doesn't do well, and advanced search and analytics functionality is near the top of the list. The ParadeDB team provides a pair of Postgres extensions. The first, pg_search, brings a more performant and full-featured search experience to Postgres. It uses Tantivy (think: Lucene but Rust) as the search engine and provides advanced ranking and querying functionality. The second, pg_lakehouse, allows you to perform large analytical queries over object store data. Together, these provide compelling new features wrapped in a familiar operational package.

Philippe Noël is one of the founders of ParadeDB. In this episode, we talk about why these extensions were needed, why the "Just Use Postgres" movement exists, and where ParadeDB fits in your architecture.

Follow Philippe: https://x.com/philippemnoel
Follow Alex: https://x.com/alexbdebrie
Follow Sean: https://x.com/seanfalconer
Check Out ParadeDB: https://www.paradedb.com/

Timestamps
01:50:18 Intro
04:30:23 Where does search on Postgres fall down?
05:33:09 BM25 and TF-IDF
07:23:03 Postgres Tipping Point
10:05:08 Tantivy
11:50:14 Tantivy vs Lucene
13:07:06 vs ZomboDB
15:35:21 Just Use Postgres for Everything?
17:57:17 Developing a Postgres Extension
19:26:03 Arvid's Problem
20:27:08 Postgres and Log Data
23:28:01 Separate OLTP and Search Instances
28:32:01 Search Nodes vs OLTP Nodes
30:02:12 ParadeDB Analytics
35:27:05 Hosted Service
39:03:15 Stumbling upon the Idea
39:51:22 Community
41:01:15 Getting Started with ParadeDB
Transcript
What's the Postgres community like in terms of like, hey, we're building a new extension?
Like, was that hard to like break in and get to know people?
Or is it pretty welcoming?
Or what's that feel like?
It was not super hard to break in, but it does take some time to get accepted.
You know, they're smart people.
They're hardworking people.
They're very passionate people.
Many that have been doing this for longer than I've been alive.
Where does search on Postgres fall down? When it comes to doing things at slightly higher scale
and in a slightly more, let's say, like globalized way,
the Postgres full text search doesn't use BM25
as its ranking algorithm.
BM25 is the state of the art ranking algorithm.
What does it look like to develop a Postgres extension?
Hey folks, this is Alex.
Today we're talking about Postgres and search. I think a lot of times online you hear people asking, hey, how do you handle search?
And everyone just says, just use Postgres, just use Postgres, right?
And it always struck me as a little bit off just because like, hey, you know, there are things Postgres is good at.
Search is a different and hard problem. I'm sure it works in some cases, and I'm sure it doesn't work in others.
It came up this week on Twitter, and I found out about Philippe Noël,
who is one of the founders of ParadeDB, which is basically making search work better in Postgres,
right? Which includes both like adding some features and capabilities that native Postgres
search doesn't have, but also improving the performance using, you know, Tantivy, which is
like a more modern library for search. So I thought this was really interesting talking about
like sort of the upsides,
the downsides,
when you should use Postgres search,
when you should use something like Parade
and then even like when you should reach out
to something more advanced like Elastic.
I thought it was really great.
So yeah, check it out.
If you have any questions,
if you have any comments,
if you want guests on here,
feel free to reach out to me or Sean.
Please leave us a review if you like the show.
And with that, let's get to the episode. Philippe, welcome to the show.
Thanks for having me. It's good to be here.
Yeah. So you are one of the co-founders of ParadeDB, which is Postgres for Search and
Analytics. Can you give us a little bit more on what that means and what you're up to at Parade?
Sure, sure. So Postgres is a pretty famous, the most loved relational
database nowadays. And it's good at a lot of things, but it's never been too shining on doing
search type workloads and analytics workloads. Overwhelmingly over the years, people have used
a tool called Elasticsearch, and it's starting to show its age. And we think people should be
able to do that from the database they love. So what we do is we augment Postgres with those capabilities so they can stay and keep using Postgres for everything.
Yep, awesome. Yeah, I'm excited to dig into that, some of the technical stuff, but also applying it and how you should think about some of this stuff. Just to give folks some background, the way we got connected here is a couple days ago, there's a guy named Arvid Kahl, who's kind of in the bootstrapper community. He had this company called FeedbackPanda that he grew and then sold, and he's shared a lot of wisdom since. He also has a new application called Podscan, where he's basically taking all the podcasts that exist, taking their transcripts, indexing them, and making them searchable, filterable, all that sort of stuff, so you can look up all sorts of stuff there. So you can imagine he's got a lot of data. He says
he has about like 500 gigabytes of podcast data, like mostly like transcripts is what that is.
And he's been trying to do search on it. He's been trying a few different solutions. I think he
started with Meilisearch. Now lately, some people said, hey, just do it in MySQL,
and that should work.
And that's been pretty rough for him.
So he's just sort of asking for advice, like, hey, what should I be using? What are some options here? And I think there are two paths that you see. One is the just-use-Postgres, just-use-MySQL path: use your OLTP database for this sort of stuff. And then there's the no, no, no, you need Elasticsearch, you need a 20-node cluster with S3 and all kinds of other things tied in there. So it was really interesting to see you acknowledging there are some issues, like Postgres can get you some of the way with bog-standard search, but not all the way there, and you're filling in that gap with Parade.
Did I give that background right? Is that roughly where we started off here?
Yeah, that's pretty right. I think I got tagged six or seven times on that Twitter thread, which is nice. We've become the face of search in Postgres,
and people are starting to recognize it. But yeah, I mean, there's good and bad
trade-offs to every solution.
Yeah, exactly.
So maybe start off with
where does search on Postgres fall down?
It does do some things well,
but where do people see problems
when they try and do some search in Postgres?
Yeah, so the first thing...
I mean, first of all,
I encourage people to stay with the default
search functionalities of Postgres for a very long
time.
It's quite good. So for
example, you can do basic
fuzzy matching and so on quite well
with Postgres. Performance is not
bad with things like TSVector and the indices
you can find.
It's not bad. There's a few things, though.
I would say the main places where it falls down
is when it comes to doing things at slightly higher scale
and in a slightly more, let's say like globalized way.
So for one, the Postgres full-text search
doesn't use BM25 as its ranking algorithm.
BM25 is the state-of-the-art ranking algorithm
for full-text search.
It's what Elastic uses. It's what Meilisearch
uses that you mentioned. There's a lot of other
companies that I can name that offer
similar products and they're all based on that.
So just hold on for a second. So BM25, I heard that come up a few times. Is that comparable to TF-IDF, or is it like an implementation of it? Like, how does it compare to TF-IDF? I'm just trying to figure out where BM25 fits.
It's a scoring algorithm for sparse vector results. So it's in the same type of category, if you will. It's different, right? It's applied differently, but it's a similar type of algorithm.
Okay. Okay. Sounds good. Thank you.
And so that's a pretty big one.
If you want, like, for relevancy of results,
you'll get much more relevant results if you use this.
So relevancy is where it starts to fall down when you have bigger corpuses.
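The BM25 idea Philippe describes can be sketched in a few lines. This is a toy scoring function over a hand-tokenized corpus, not ParadeDB's or Tantivy's actual implementation; the k1 and b defaults are just the commonly cited ones.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Toy BM25: score one tokenized document against a query.

    corpus is a list of tokenized documents. BM25 extends TF-IDF with
    term-frequency saturation (k1) and document-length normalization (b).
    """
    n = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))   # smoothed IDF
        tf = doc_terms.count(term)                        # raw term frequency
        norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / (tf + norm)        # saturating TF weight
    return score

corpus = [
    ["postgres", "search"],
    ["postgres", "analytics"],
    ["rust", "search", "engine"],
]
print(bm25_score(["search"], corpus[0], corpus))
```

The saturation term is the practical difference from plain TF-IDF: repeating a term fifty times in a document no longer scores fifty times higher, and long documents don't dominate just by being long.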
The second one I would say is
performance is not bad, but it starts to balloon.
Like, in terms of indexing time,
in terms of search time,
when you have pretty big corpus of data,
like we frequently hear people come to us
and tell us that their searches take several seconds, right?
And like in today's world,
I mean, like even in like 10 years ago as well,
like that was just not acceptable, right?
And so that's a big one.
And then the last one I would say is
Postgres full-text search is meant to be pretty basic, right? It's fine if you have a simple workload, like some English data you want to search. But if you want to do faceted search, where you bucket your search results based on different properties, or if your data is in French or in Russian or in Mandarin Chinese or Afrikaans or whatever, you can't really
tokenize that very well in Postgres today. So those are areas where it starts to fall down.
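The tokenization gap is easy to demonstrate. A whitespace-based tokenizer (roughly what a naive full-text setup assumes) works for English but produces a single useless token for languages like Chinese that don't delimit words with spaces. This is a toy illustration, not Postgres's actual tsvector parser.

```python
def whitespace_tokenize(text):
    # Naive tokenizer: split on whitespace, strip punctuation, lowercase.
    return [t.strip(".,!?").lower() for t in text.split() if t.strip(".,!?")]

print(whitespace_tokenize("The quick brown fox jumps"))
# -> ['the', 'quick', 'brown', 'fox', 'jumps']

print(whitespace_tokenize("我喜欢吃苹果"))
# -> ['我喜欢吃苹果']  (one "token": the entire sentence, so nothing is searchable)
```

Dedicated engines ship per-language analyzers (stemmers, CJK segmenters, and so on), which is the kind of breadth being pointed at here.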
So really like performance and breadth, I would say. So if you don't need, you know,
very big performance and you have a pretty narrow use case, you'll do great. And you can do that
for a while.
Gotcha. And so when you say not much breadth, is that like, hey, I have a bunch of users, and I want to give search on first name, last name, email address? Postgres is going to nail that; I don't go out to Elasticsearch for that. But then, what is it about the size of the documents, or the size of the total data set? When is it like, oh, now I need something more? Is there some sort of tipping point that indicates I should move to something else?
Yeah, I think, in my opinion, it depends on the complexity of the queries people will run,
right? So I actually know of some quite large companies, like public companies that run
their searches on Postgres TS vector, and it works quite well.
So it just depends what people are looking for, right?
So if I'm looking for, I don't know, like colors, right?
Like I want to search like various colors.
Even if you have a lot of data,
it's gonna make a pretty simple query,
it's gonna work well, right?
But once you start to do things where you wanna offer
more flexibility around user error, right?
Even if the user makes typos or asks for something that's quite complex.
Or even, you know, oftentimes users will search for things they don't fully understand how
to search for, right?
Like, you know, if you want to search for, like, I don't have a good example, like some
sort of computer, right?
But you're actually looking for an iPad or whatnot.
Like things that are going to be relevant,
but are actually like pretty different
to what you're searching for.
It's going to be quite difficult.
So the simpler the workload, I think,
from the user perspective,
the more you can do with Postgres.
Yeah, gotcha.
And that last example you said,
is that going to be something that'll be accomplished
by, you know, the BM25 index?
Or that last example of saying,
hey, you're searching for something similar, but you didn't even type it correctly? Or is that going to be more in the vector stuff, like the similarity search we've seen with embeddings and stuff lately?
Yeah, the last example is more of a similarity example. That's perhaps a bad one. But full text is very good when you want to search for a point concept, you know? If I search for your name, it has no similarity in meaning to my name or whatever, right? So there's a bit of both. You can do more today with basic full-text Postgres thanks to pgvector, and it's now supported on all the major Postgres databases, so that also enables people to stay even longer on it. But the truth is, for any data-intensive product
where search is a key component,
and that's increasingly a large number of them,
you just need something better.
You'll see. People know.
We don't have to convince them.
They know when they start to need something else.
They're very, very aware of it.
Yeah, exactly.
And then, okay, so you talked about the performance aspect of it,
you know, that you've done significantly better there.
And how much of that is, hey, the BM25 index is that much better?
How much of it is like, hey, you have this custom stuff you've written in Rust that is an extension in Postgres?
Like, what's accounting for the performance gains there?
Yeah, so that's a good question.
A lot of the performance gains,
the credit does not belong to us, to be honest.
So we integrate a search library called Tantivy,
which is a Rust implementation of Apache Lucene.
So Apache Lucene is the state-of-the-art search library
that people have used for decades now,
maybe close to two decades.
Elastic is based on it.
Most of the dedicated search engines
are based on some variant of it.
Tantivy is a re-implementation in Rust
that's now adopted and used by many, many companies.
The original company is called QuickWit.
They're also offering a product based on it
specifically for observability.
There's a lot of other companies that are based on it
and we're the same one.
So the community, and by the community I mean overwhelmingly the folks behind Tantivy and Quickwit, are the primary people that deserve the credit for that.
We get a lot of performance benefits there.
And then what we do ourselves is
we sort of glue the pieces in a smart way to make it so there's a lot of
benefits you get as well from users just doing the right thing, right? There's a lot of value to be
added in just like constructing your product in such a way that nudges the users. So they just
always know the right behavior or the right action to take, right? The right index, how to employ the
index on their table, for example, right?
So that they do it in the correct way, and then you get a lot
from that as well.
Yeah, and then even comparing
Tantivy versus Lucene, which, like, Lucene is just a Java library, is that right?
That's right, yeah.
Okay. Comparing those,
like, is that,
I mean, is that just Java versus Rust? Or is it like, hey, we have a better sense of this space now, and maybe Lucene has a lot of cruft from 20, 30 years of being built up and adding all these different features? What accounts for even that speed difference between Tantivy and Lucene?
It's both, right? So Java is famously slow, with the JVM being quite bloated and doing garbage collection at random moments.
Rust is very performant.
It does not do any of that garbage collection.
Well, yeah, it doesn't do the JVM.
And so you get a lot from that just off of the bat.
And the folks behind Tantivy, if you look at their project, you'll see they say Tantivy is inspired by Apache Lucene, but it is not a re-implementation of Apache Lucene. So there are a lot of similarities, because Lucene is very good and they've gotten a lot of things right. But as you say, over 20 years, some things have become kind of too far committed to change. And so they chose to make some slightly different design decisions, and there's a lot of performance that gets gained there as well.
Yep, cool. One thing that I saw when I was researching Parade, and I think you all even mentioned it on your post a little bit, is ZomboDB. How do you all compare, just in approach to solving this problem, with ZomboDB?
Yeah, so Zombo was kind of like the pioneer of what we're
doing, right? And I know the main author,
Eric, he's
quite the man. He's a pioneer
in the space. The tool that we used
to build our
work was also written by
the creator of ZomboDB, because he wanted
to build ZomboDB better himself.
But for context, ZomboDB,
for those that don't know, is also a Postgres extension that offers similar capabilities. But what it does is it allows you to operate and manage an Elasticsearch cluster from your Postgres database. So the ideological difference between us and Zombo is that Zombo says: you have Elastic, you have Postgres, and we make dealing with the two of them together
as easy as possible.
What we tell you is you should never
need to have an Elasticsearch cluster.
You should just be able to use Postgres.
So in that way,
we kind of see ourselves as a bit of a successor
to what Eric has done
with Zombo. He would be the first one to say that himself.
But
Zombo definitely paved the way.
Yeah, that's interesting.
Yeah, because, man, of all the infrastructure I've ever used, Elasticsearch is the scariest piece. Even if something else is sort of managing it, you're still on the hook. Even with the managed Elasticsearch providers, it's hard.
Exactly, exactly.
The problems are still there when people come to us. And that's why also, for example, our performance, right? Our search results, thanks to Tantivy, are about 25 to 50 percent faster than Elastic. Our indexing time is five times faster than Elastic on our own benchmarks, on a terabyte of data. So performance is great, right? People come and they use it and they're like, wow,
but that is never the reason that brings them in, right?
Every time people, they don't come to us and say,
Elastic is not fast enough.
They say, Elastic is so incredibly heavy and difficult to operate, right?
And Postgres is so easy to operate.
And by the way, I already operate it, right?
So they come to us and they just want to have less mental burden rather than performance.
And I'm convinced even if we were slower than Elastic, people would still use the work that we do. The fact that we're faster is just an added bonus, the cherry on top of the sundae.
Yep, yep. And so I think this is part of the whole "just use Postgres" movement, which you see a lot. What accounts for that? What makes Postgres so good at this? And I guess, why is there so much energy around that movement right now?
Yeah, well, I have a bit of a nuanced take on this, which is: I love the Postgres-for-everything movement. I know most of the people that work on it, I think they're great people, and we're a part of it. I will say, I think we're still at a point where you should use Postgres for many things. I don't think you should use Postgres for everything yet, but I do think one day it will be possible, and we're working to make that happen.
But
the reason, I mean, there's kind of two reasons.
One is
Postgres is very extensible.
A lot of people compare Postgres to Linux
nowadays, right? Where the core Postgres
foundation has become more than just
a relational database. It's just a data,
like an open-source
data store that can be taken in so many
different ways. And so just like we'll
have people make Linux distributions or Linux
projects, and for various...
You have the security Linux with Alpine, and you have
the enterprise Linux with Red Hat, and blah, blah,
blah. Postgres is a
little bit like that. So I think their open-source community
has built its trust over the last 30 years. Another reason I would say is there are some issues
that have happened with MySQL over the last six, seven years that have made people lose
trust with it. MySQL was part of, I believe, Sun and eventually under Oracle. And it's
not been as open as possible. And data, you know, people care a lot about the sort of underlying
infrastructure behind data being open, especially because of how important data is to everything.
And the third reason that I would say is when you pick, when you build a product,
like every software, if we're being really, really reductive, every
software is a wrapper around the database,
right? Like most software, you just build things on top of your database to give experiences,
right? So it's kind of the first thing you pick. And because it's the first thing you pick,
and you want things to be as simple as possible, it's the best place to extend, right?
Yeah, very cool. So that extensibility part: ParadeDB is a Postgres extension, which means people just have their Postgres instance, they install this extension, and now they get these BM25 indexes with queryability and stuff like that. What does it look like to develop a Postgres extension? Are there just certain hooks that you register and hook into, like, when this happens, call me? What does that look like?
Yeah, I think you summarized it in a pretty eloquent way. The core Postgres team,
they've done a lot of work to provide hooks around various functionalities.
Even from an extension, the interfaces are very well defined and you can call hooks
into the core functionalities of the database.
And so you can hook at the storage layer,
you can hook at the query layer,
you can hook at the plan layer,
which is the level between like
when someone writes SQL
and the SQL actually gets executed
on the data that are stored.
And that gives us the ability
to go and do a lot of things, right?
So we can say in our full-text search, for example, we hook at the indexing layer and
we say, hey, we want to index the data in a different way to be used with BM25 search,
right?
And we can provide custom syntax to do that.
We also have an analytics extension where we hook at the query layer before even the
query hits the data store.
And we say, hey, we want to process that query
a different way than the traditional Postgres
because we know this is analytics data
and analytics data can be,
the numbers can be crunched in a more efficient way
than what Postgres does normally.
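The hook mechanism Philippe describes can be caricatured like this. This is a Python analogy, not the real Postgres C API (where extensions install function pointers such as a planner hook); every name below is made up for illustration.

```python
# Toy analogy of Postgres extension hooks -- NOT the real C API.
# In Postgres, an extension sets a function pointer the core calls at a
# well-defined point; here we model that as callbacks run before planning.

class ToyDatabase:
    def __init__(self):
        self.planner_hooks = []  # extensions append callables here

    def register_planner_hook(self, hook):
        self.planner_hooks.append(hook)

    def execute(self, query):
        # Give each hook a chance to claim or rewrite the query
        # before falling through to the core executor.
        for hook in self.planner_hooks:
            handled, query = hook(query)
            if handled:
                return f"extension handled: {query}"
        return f"core executed: {query}"

def analytics_hook(query):
    # A pretend analytics extension: claim aggregate-style queries,
    # pass everything else through untouched.
    return ("GROUP BY" in query), query

db = ToyDatabase()
db.register_planner_hook(analytics_hook)
print(db.execute("SELECT count(*) FROM logs GROUP BY day"))  # extension handled
print(db.execute("SELECT * FROM users WHERE id = 1"))        # core executed
```

The point of the sketch: the core stays unmodified, and an extension only intercepts the cases it knows how to do better, which is how one database can host both a search index and an analytics query path.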
Yeah, yeah, very cool.
Okay, yeah, I want to talk about the analytics part too,
but just like even closing up the search part,
let's go to Arvid's problem, right?
Where he has 500 gigs of podcast data,
you know, continuing to increase
as new transcripts come in.
And, but, you know, doing like sort of
high volume traffic on that.
Is that something, hey, does that feel like
a good fit for Parade?
Is that like, hey, that's getting too large for it?
Like either in document size or total corpus size, right?
How would you think about that problem
starting to approach it?
No, I think Arvid is a perfect ICP
for the work that we do.
I mean, he is not in this case
because he's a MySQL user, right?
And we don't touch MySQL.
But if you were in Postgres,
this is exactly who it's for, right?
500 gigs of data increasing fast
is too much to do in Postgres' basic search.
It's just not going to work very well, right?
And that's the point where you would need
something like Elastic.
But as he says, right, he's already tried Meilisearch
and he's already annoyed with that data movement piece.
He wishes he would not have to do that.
Well, that's exactly where ParadeDB comes in, right? That's exactly what we offer.
Yep, absolutely. All right, that's great. Is there
a point where you would say, hey, you know, even Postgres with Parade is going to have trouble
with that? And I'm thinking of the people that use like Elastic for log data, and it's just like a
ton of data coming in. Maybe now they're pushing multiple, multiple terabytes. Is that still like,
hey, Postgres can handle that and do search on that? Or is it like, nah, that's sort of a different problem
at this point?
So that's what I meant when I said I don't think Postgres can do everything today. This is a good question. Postgres cannot handle that well today.
Like, we can't handle that well today.
So the specific use case that we focus on
is, I would say, application or backend search, right?
So you want to expose an end user experience, right?
You're building Facebook, you want people to search for, you know, people on the platform. You're building ZocDoc, you want people to search for the corpus of medical professionals that you have, right? Those experiences are too big for the basic Postgres full-text search,
but they're big enough that we're not like writing petabytes of logs like every week,
right? Today, ParadeDB is focused solely on single nodes.
So we don't do distributed searching.
Elastic obviously offers that.
When you're starting to do logs at scale,
you will want to do that.
This is not where we see our competitive advantage
because people use Postgres to build their application, right?
Not to build their observability suite.
And so the benefit of being Postgres sort of diminishes. And the folks, for example, that wrote Tantivy, they created a product called Quickwit, which is specifically designed for petabyte-scale log search over S3. So Tantivy itself is able to do this; ParadeDB just chooses not to focus on that.
Yep, yep, absolutely. You mentioned it's focused on single node right now.
How do extensions work with some of the distributed Postgres tools out there? I don't know them super well, but like Yugabyte or Citus or different things like that. Would a Parade extension work with those, or is there just too much of a conflict between going from single node to distributed that it wouldn't work?
So we make our work as Postgres-compatible as possible.
Our goal is to be true Postgres, right?
Because that's what people want.
So the answer to your question is, yes, it does work.
It depends on how much these people modify Postgres,
not on how much Parade modifies Postgres.
So for Citus, it does work, right?
Citus is an extension.
In fact, we plan to one day offer
a distributed version of Postgres.
And working with Citus is probably going to be
how we offer that.
If you take the folks like Neon
that have modified Postgres pretty heavily,
but have tried to remain like real Postgres shops,
it works, but there are some things
that need to be done to integrate it properly
from our side that we haven't done yet.
And if you take the folks like Yugabyte that have gone pretty far into the deep end, in my opinion, of Postgres, I will be honest with you. I do not know if it works with Yugabyte. I suspect it does not. And maybe one day, if it makes sense, we do the work to make
it work. But there's going to need to be some of their effort poured into that as well.
Yeah, exactly. Okay, cool.
One other thing I saw you mention on Twitter
that was interesting is for some
customers, you recommend having separate
OLTP
and search instances
of Postgres. Can you tell me what that
pattern looks like?
Yeah, so that's a very important point.
It depends. So when people
hear extensions, overwhelmingly, they think of exactly what you described, which is I'll install
this into my primary database, right? But the truth is, the search workload requires different
hardware and different tuning of the database to be optimal, right? And it's very important
to isolate systems. So that if something goes wrong with your transactional database,
it doesn't take down your search and vice versa. If something goes wrong in your search process,
it doesn't take down your transactional database. So the problem that exists with
Meilisearch and Elastic that Arvid was describing is in the data movement: you are forced to do a transformation. It's called ETL, right? Extract, transform, load.
Because we go from Postgres to a NoSQL database, right?
That transformation is where the pain exists.
Because if you change your schema, it breaks.
If you upgrade versions, it breaks.
It requires compute.
It's slow, things like that.
That's where the problem lies.
So the way in which we get sort of the best of both worlds for ParadeDB customers
today is we say, hey, what we are is pure Postgres. You should still, if you have a lot of data,
co-locate it separately from your transactional, but you can connect it via the Postgres replication
logs, similar to our high availability read replica solution. So it's sort of like isolated,
but tightly coupled type of workload.
You don't have any of the pain point of the big heavy ETL that Arvid complained about, but you have all of the benefits of the isolation, right?
And that is our suggested deployment flow for larger workloads.
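The pattern Philippe describes rests on vanilla Postgres logical replication. A sketch of what that setup might look like, with hypothetical host, database, and table names (ParadeDB's actual recommended configuration may differ):

```shell
# On the primary (OLTP) instance: publish the tables you want searchable.
psql -h oltp-primary -d app -c \
  "CREATE PUBLICATION search_pub FOR TABLE documents, transcripts;"

# On the ParadeDB instance: subscribe to the publication, then build the
# search indexes there only, so the primary never carries the indexing load.
psql -h paradedb-node -d app -c \
  "CREATE SUBSCRIPTION search_sub
     CONNECTION 'host=oltp-primary dbname=app user=replicator'
     PUBLICATION search_pub;"
```

Note that logical replication requires the table schema to already exist on the subscriber, and it replicates rows, not indexes, which is exactly why the search-only indexes can live solely on the ParadeDB side.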
Someone like Arvid, this is what I would recommend that he does. If you're a small company, like
a startup, right, or like, you know,
a small shop, having it
all in the same database is going to be fine because your
volume of data is lower, but
overwhelmingly, this is not how people
use us, despite being an extension.
Like, an extension is more of a convenience of
fitting in and development
experience rather than
necessarily how you should install it.
Okay, that's super interesting. So it's basically just another read replica, but on this read replica I've installed an extension and added some indexes that are only on my read replica, not on my primary instance.
Exactly, exactly. And that is the best solution, right? When people think of the Postgres-for-everything movement, oftentimes they're like, I'll just merge everything into one database. Well, you don't want to do that, right? You want to separate them; you want them to be close. But if you go and order a dish at a restaurant, you want each part of your dish put together in its own place, not mushed into one thing, right? It's the same thing.
Yeah. Interesting.
Do you have, like, are most people running a couple read replicas of those or even read replicas to like your, your main search instance, which is like sort of a read replica itself?
Like what's the topology look like for that generally?
So the main topology is they will replicate from the high-availability
Postgres, like from a read replica or from the primary into ParadeDB.
And then depending on their preferred SLAs and uptime guarantees,
ParadeDB itself might have read replicas or high availability solution,
or it might be a single instance.
It depends on your maturity as an organization
and what you're looking to guarantee.
And for most people, is that syncing,
is that like asynchronous replication?
They're probably not like doing that,
that synchronously from their HA Postgres.
It is.
So that's also something you get to decide.
So the way they do it is there's always a lag, right?
There's a small lag between your primary Postgres
and your read replicas. In today's world, this is very fast, right? High-quality replication solutions lag very little behind the primary. That's exactly the same case for Postgres, so for Parade, excuse me. So it is synchronous in that it goes over in real time, but we offer two types of guarantees for that.
We offer strong consistency or weak consistency,
depending on what you want.
And so today, for most customers, if you go and write data into ParadeDB, your transaction is not going to commit until the index is updated. So transactions are slowed down slightly, right? What you get with this is that if the data has been sent over logical replication, you know that if you make a search query, it will contain that data. The downside is that it slows down ingestion slightly. And so we also offer the ability for enterprise customers to do weak consistency, which basically says: just dump as many logs as you want in there. It is possible that indexing lags behind by a few seconds, but if you want to ingest large amounts of data, that's a trade-off that makes sense for you.
Yep, absolutely. You mentioned that search requires a different set of resources, a different profile, than your normal OLTP stuff. What do those nodes look like? Are they higher memory or higher CPU? How do they vary from your primary OLTP HA nodes?
Yeah, that's a good question. Oftentimes they're higher memory, in my opinion, but it depends. I wish I could give you a really good answer, but to be honest, customers do a lot of different things, so it's hard to give a really clear one. Usually people will put in more memory, and depending on the frequency of search, they might want more network bandwidth too. It depends.
Gotcha.
Yeah, I wasn't sure if there was like,
hey, search type workloads are just way more
compute intensive or memory intensive or whatever.
So I didn't know if that was a rule,
but it probably just depends a lot on specific use cases
and things like that.
I think as we service more and more people,
there's a rule that will emerge.
And if you ask me again in six months,
I'll be able to provide a better answer.
Tweet at me in October.
Yeah, that'd be great. We'll figure this out; I'm going to set a reminder for myself. Cool. But yeah, let's switch to the analytics aspects of ParadeDB, because you're both search and analytics. When I was looking at it, it seemed like kind of a unique analytics setup. So tell me how ParadeDB analytics works and where it's fixing gaps in the Postgres analytics ecosystem.
Yeah, so to answer that, I'll tell you why we built analytics in the first place. We first built search; that was our first product. And as we did, we talked to a lot of customers, and they started requesting analytics. It turns out a lot of the workloads people use Elasticsearch for today are a combination of search and analytics in the same queries. Like, you want to search for some results and then bucket them into categories, for example. There are not a lot of tools that do this very well. Most search engines focus squarely on search, and Elastic is one of the rare ones that does both quite well. That's one reason people still use it, despite most people disliking operating it.
So all this to say, we offer two types of analytics that are kind of inspired by Elastic. The first is that you can do analytics directly on your search queries, so, like the example I just mentioned, you can bucket search results. The other analytics offering is that we allow you to do fast analytics over data stored in object storage, like S3, for instance. More and more nowadays, it's common for people to want analytical data that's stored in a data lake or data warehouse to power some sort of user-facing experience, and user-facing experiences should be built against Postgres, not against S3 or Snowflake, right? That analytics extension, which is called pg_lakehouse, is there to offer that. So you can join tables that are in Postgres with data that is in the cloud, in S3, and you can use that to power, say, a recommender system; that's a big one people build, for example. And that's a workload pretty closely tied to the idea of search and recommendation.
Yep, cool. Okay, so you mentioned you don't want to be building against S3 or Snowflake directly, but this is still reaching out to an object store like S3. What sort of user-facing latency, what response time, should I expect on something that's reaching out to S3 and then doing some filtering or something like that?
That's a good question. Very little. Well, it depends; you have to build it properly, right? So what we recommend when people use this is that they have their S3 data in the same AWS region as their ParadeDB instance. A typical larger-scale or mid-market customer that wants to use ParadeDB will have one AWS RDS instance, let's say, or one GCP Cloud SQL cluster. They'll be in one region, like us-east-1 or us-west-1. The data they run analytics on is going to be in an S3 bucket or a Google Cloud Storage bucket in that region as well, and they'll deploy ParadeDB in that region in a Kubernetes cluster. So there's no internet egress or anything like that. When they run their queries, the queries go from the ParadeDB instance over to S3, and fetching the data is actually very quick. Now, it does add latency; pg_lakehouse is very fast, so the latency of reaching out to S3 is an important part of the total as well. It might be about two times slower versus running the queries on data stored locally in Postgres, but it's really fast. You can see our published benchmarks: you can aggregate 100 million rows in a few milliseconds. And from the perspective of the end user, whether it takes two milliseconds or four milliseconds, you can't tell, right? So it's fine.
Yeah, wow. I'm surprised, again, that it can read from S3 that quickly. That's pretty amazing. And then for those types of instances, especially if I'm reading large amounts of data... I guess it's using DataFusion under the hood, is that right?
Yes. It's using DataFusion, but things will change. We may be looping in some DuckDB, but you heard it here first; it's not released yet.
Okay, sounds good. That's cool stuff. Anyway, whatever it's doing, it's sort of farming the query out, but on that same Postgres, your ParadeDB instance, it's reaching out to S3 and pulling data back. So am I going to need a pretty beefy memory-and-disk instance there, whether it's doing processing in memory or spilling to disk, to handle big result sets? Or is it pretty smart about filtering on S3 if I have it structured well with hierarchical keys and stuff like that? How should I think about that?
Both DataFusion and DuckDB, which are the main tools we use to do this, are very smart about it. It's like every analytics engine: the more memory you have, the more you can keep in memory without writing to disk, and the less you have to write to and read from disk, the faster things are going to be. So obviously, the more memory, the better. That being said, the people who built those tools (again, we don't deserve any credit for that; we're just users of them) are very good at what they do, and the tools work quite well. So even with smaller instances, the results are quite phenomenal.
Yeah, very cool. Okay, and then you mentioned some changes you're making, and it sounds like a hosted service is in the works. Where are y'all at? Is the hosted service available for folks right now? Where is ParadeDB at?
Yeah, so that's a good question. Our search work has been deployed at terabyte scale with some large organizations, two public organizations as well, which is very exciting. ParadeDB has been deployed about 60,000 times. So it's out there; people use it. We're not calling ourselves v1, and we're not going to call ourselves v1 for a while. I think people have very, very high expectations before data systems do that, and we want to live up to those.
Our analytics work is slightly newer.
It's also working quite well.
It's also deployed heavily.
We're launching the next version next week, which will be like a significant stability milestone as well for it.
But those are kind of coming together.
We're working on the cloud offering.
Our cloud offering is not a typical one.
We do not host the instances for you.
As I was describing to you in that setup before, the core value prop is having everything in your own cloud, living in the same environment, right? So our cloud offering is essentially bring-your-own-AWS-account, where we handle all the management in the form of a control plane, but the actual ParadeDB instance gets deployed in a region of your choice in AWS.
We expect that to ship by the end of July
and to coincide with the next release
on search and analytics.
And so by the end of July, if everything goes well, you'll be able to go and just choose a region where you want ParadeDB to run, and within two or three minutes have everything wired up and be able to do really high-quality search and analytics without any data leaving your region. That will be the main moment where I think we start to be a real product for real enterprises, beyond the adventurous ones that do a lot of work themselves to make it work.
Yep. One thing that's kind of nice about where you're situated in the stack (not to say that stability and data integrity aren't important) is that you're secondary, you're downstream, right? You're not the primary OLTP store. I think that makes it a little lower-risk for someone to adopt, especially if they're having pain with search. It's like, hey, we can put this out there, even a pre-v1 type thing, and the downside of messing up search is different from the downside of messing up your transactional order data. It can really solve that pain point. And given that it's that sort of downstream async replication, you can add it in, it's going to backfill that data, I imagine, and be good to go.
So exactly, exactly. And that's why we've had pretty big organizations that have already adopted it, as I said, up to Fortune 500 companies, which is quite crazy. Even we didn't believe it when they reached out at first, to be honest. But that's also part of the vision, right? There are amazing people who have built amazing tools for OLTP stacks today: all the big cloud providers, other companies, the Supabases and Neons of the world. Those people do great work, and we want to work alongside them, not against them. And customers don't want to move their OLTP data. They're happy with where it is; it's very sensitive, it's very risky, and they shouldn't have to touch anything. So we wanted to be there and say, hey, here's a small add-on; if things work out, you're happy, and the risk is very low. Worst case, you just get rid of it. And yeah, it's been well received in that place.
Yeah, very cool. How did you, I guess, stumble upon this idea? Was it a pain point that you specifically had, or how did you get down this road?
Me and my co-founder spent some time doing software consulting in Europe at the beginning of 2023, and as we did, we started interoperating a lot of Postgres and Elastic and other vector databases and so on, and felt like, I don't know, there should be a better way to do things, right? But that was just the seedling, to be honest. The real way was just talking to a lot of people. I have talked to everyone who would talk to me, and to everyone listening to this: if you have thoughts on this and you want to talk to me, please reach out, I will talk to you. We have learned so much from our users, and there's still so much for us to learn. And that's kind of how we've gotten to where we are today.
Yep, yep.
What's the Postgres community like in terms of like,
hey, we're building a new extension?
Was that hard to break in and get to know people?
Or is it pretty welcoming?
Or what's that feel like?
The people are great. It was not super hard to break in, but it does take some time to get accepted, I think. They're smart people, they're hard-working people, they're very passionate people, many of whom have been doing this for longer than I've been alive. And so when you come around, you have to prove that you're worth the attention, right? People were welcoming in that we could always quickly find out where to get started. But it was only once we started putting out really good work, and people saw the work was good and were impressed and happy with it, that we really started to feel welcome in the community. And now it's been about ten or eleven months. We feel like we have an increasingly warm seat in the Postgres community, and we're happy to be a part of it and to be big proponents of it. So I hope more people join it.
Yeah, for sure.
How big is the Parade team right now?
We're four people.
Four? Okay, very cool.
Yeah, it's a small group.
Yep.
And if someone's looking to get started with ParadeDB, do you recommend, hey, just go install Postgres, install this extension, and get going? Or should they reach out to you and try to get some consulting set up? How do you recommend getting going?
We're open source; you can find our repo. We publish a Docker image, and we also publish our extensions: 38 versions, prepackaged for Debian, Ubuntu, and Red Hat Linux, on multiple Postgres versions, multiple architectures, and multiple OS versions. So for most people out there, it's literally two lines of code, a curl command and an install command, and you're good to get going.
And then we have documentation where you can poke around.
We have a Slack community that we link in our readme
that people can join and say hello,
and we're always very responsive to help.
And if you need something more than that,
then you can always message me or message anyone else,
and we're happy to talk one-on-one.
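The two-command install Philippe mentions is distribution-specific, so it is not reproduced here, but once the extension is in place, trying it out is plain SQL. The table and data below are hypothetical; the `@@@` full-text operator is pg_search's documented query syntax, though the index-creation syntax has changed across versions, so check the docs:

```sql
CREATE EXTENSION pg_search;

-- Toy data to search over (hypothetical)
CREATE TABLE products (id SERIAL PRIMARY KEY, description TEXT);
INSERT INTO products (description)
VALUES ('running shoes'), ('wool socks'), ('leather boots');

-- Build a BM25 index, then query with pg_search's @@@ operator
CREATE INDEX products_search_idx ON products
    USING bm25 (id, description)
    WITH (key_field = 'id');  -- verify against your pg_search version

SELECT id, description FROM products WHERE description @@@ 'shoes';
```

From there, the documentation and Slack community mentioned above cover ranking, fuzzy matching, and the rest of the query surface.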
Awesome. Philippe, this has been great. I love this. I think it's so cool what you're doing, and I think it's really needed. As someone who's struggled with Elasticsearch before, and some of the frustrations there, this is pretty cool stuff. We'll have links to your Twitter and the website and everything in the show notes.
Anything else, like if people want to find you
or anything else you want to shout out before we head out?
I mean, I think those are the main places.
You can find me on Twitter as well.
I think if you Google my name, I should come up.
But yeah, please, I do mean this.
Please reach out.
Anyone, everyone.
I always love talking to people.
I always have something to learn.
So please reach out.
Awesome.
Philippe, thanks for coming.
Thanks for having me, man.