Software Huddle - Postgres for Search + Analytics with Philippe Noël
Episode Date: June 25, 2024

ParadeDB is Postgres for search and analytics. As Postgres continues to rise in popularity, the "Just Use Postgres" movement is getting stronger and stronger. Yet there are still things that standard Postgres doesn't do well, and advanced search and analytics functionality is near the top of the list. The ParadeDB team provides a pair of Postgres extensions. The first, pg_search, brings a more performant and full-featured search experience to Postgres. It uses Tantivy (think: Lucene but Rust) as the search engine and provides advanced ranking and querying functionality. The second, pg_lakehouse, allows you to perform large analytical queries over object store data. Together, these provide compelling new features wrapped in a familiar operational package.

Philippe Noël is one of the founders of ParadeDB. In this episode, we talk about why these extensions were needed, why the "Just Use Postgres" movement exists, and where ParadeDB fits in your architecture.

Follow Philippe: https://x.com/philippemnoel
Follow Alex: https://x.com/alexbdebrie
Follow Sean: https://x.com/seanfalconer
Check Out ParadeDB: https://www.paradedb.com/

Timestamps
01:50:18 Intro
04:30:23 Where does search on Postgres fall down?
05:33:09 BM25 and TF-IDF
07:23:03 Postgres Tipping Point
10:05:08 Tantivy
11:50:14 Tantivy vs Lucene
13:07:06 vs ZomboDB
15:35:21 Just Use Postgres for Everything?
17:57:17 Developing a Postgres Extension
19:26:03 Arvid's Problem
20:27:08 Postgres and Log Data
23:28:01 Separate OLTP and Search Instances
28:32:01 Search Nodes vs OLTP Nodes
30:02:12 ParadeDB Analytics
35:27:05 Hosted Service
39:03:15 Stumbling upon the Idea
39:51:22 Community
41:01:15 Getting Started with ParadeDB
Transcript
What's the Postgres community like in terms of like, hey, we're building a new extension?
Like, was that hard to like break in and get to know people?
Or is it pretty welcoming?
Or what's that feel like?
It was not super hard to break in, but it does take some time to get accepted.
You know, they're smart people.
They're hardworking people.
They're very passionate people.
Many that have been doing this for longer than I've been alive.
Where does search on Postgres fall down? When it comes to doing things at slightly higher scale
and in a slightly more, let's say, like globalized way,
the Postgres full text search doesn't use BM25
as its ranking algorithm.
BM25 is the state of the art ranking algorithm.
What does it look like to develop a Postgres extension?
Hey folks, this is Alex.
Today we're talking about Postgres and search. I think a lot of times online you hear people asking, hey, how do you handle search?
And everyone just says, just use Postgres, just use Postgres, right?
And it always struck me as a little bit off just because like, hey, you know, there are things Postgres is good at.
Search is a different and hard problem. I'm sure it works in some cases, and I'm sure it doesn't work in others.
It came up this week on Twitter, and I found out about Philippe Noël,
who is one of the founders of ParadeDB, which is basically making search work better in Postgres,
right? Which includes both like adding some features and capabilities that native Postgres
search doesn't have, but also improving the performance using, you know, Tantivy, which is
like a more modern library for search. So I thought this was really interesting talking about
like sort of the upsides,
the downsides,
when you should use Postgres search,
when you should use something like Parade
and then even like when you should reach out
to something more advanced like Elastic.
I thought it was really great.
So yeah, check it out.
If you have any questions,
if you have any comments,
if you want guests on here,
feel free to reach out to me or Sean.
Please leave us a review if you like the show.
And with that, let's get to the episode. Philippe, welcome to the show.
Thanks for having me. It's good to be here.
Yeah. So you are one of the co-founders of ParadeDB, which is Postgres for Search and
Analytics. Can you give us a little bit more on what that means and what you're up to at Parade?
Sure, sure. So Postgres is a pretty famous, the most loved relational
database nowadays. And it's good at a lot of things, but it's never been too shining on doing
search type workloads and analytics workloads. Overwhelmingly over the years, people have used
a tool called Elasticsearch, and it's starting to show its age. And we think people should be
able to do that from the database they love. So what we do is we augment Postgres with those capabilities so they can stay and keep using Postgres for everything.
Yep, awesome. Yeah, I'm excited to dig into that, some of the technical stuff, but also applying it and how you should think about some of this stuff. Just to give folks some background, the way we got connected here is a couple days ago, there's a guy named Arvid Kahl, who's kind of in the bootstrapper community. He had this company called FeedbackPanda that he grew and then sold, and he's shared a lot of wisdom since. He also has a new application called Podscan, where he's basically taking all the podcasts that exist, taking their transcripts, indexing them, and making them searchable, filterable, all that sort of stuff, so you can look up all sorts of stuff there. So you can imagine he's got a lot of data. He says
he has about like 500 gigabytes of podcast data, like mostly like transcripts is what that is.
And he's been trying to do search on it. He's been trying a few different solutions. I think he
started with Meilisearch. Now lately, some people said, hey, just do it in MySQL,
and that should work.
And that's been pretty rough for him.
So he's just sort of asking for advice, like, hey, what should I be using? What are some options here? And I think there are two paths that you see. One is the just-use-Postgres, just-use-MySQL path: use your OLTP database for this sort of stuff. And then there's the no, no, no, you need Elasticsearch, you need a 20-node cluster with S3 and all kinds of other things tied in there. So it was really interesting to see you acknowledging there are some issues, like Postgres can get you some of the way with bog-standard search, but not all the way there, and you're filling in that gap with Parade.
Did I give that background right? Is that roughly where we started off here?
Yeah, that's pretty right. I think I got tagged six or seven times on that Twitter thread, which is nice. We've become the face of search in Postgres,
and people are starting to recognize it. But yeah, I mean, there's good and bad
trade-offs to every solution.
Yeah, exactly.
So maybe start off with
where does search on Postgres fall down?
It does do some things well,
but where do people see problems
when they try and do some search in Postgres?
Yeah, so the first thing...
I mean, first of all,
I encourage people to stay with the default
search functionalities of Postgres for a very long
time.
It's quite good. So for
example, you can do basic
fuzzy matching and so on quite well
with Postgres. Performance is not
bad with things like TSVector and the indices
you can find.
It's not bad. There's a few things, though.
I would say the main places where it falls down
is when it comes to doing things at slightly higher scale
and in a slightly more, let's say like globalized way.
So for one, the Postgres full-text search
doesn't use BM25 as its ranking algorithm.
BM25 is the state-of-the-art ranking algorithm
for full-text search.
It's what Elastic uses. It's what Meilisearch
uses that you mentioned. There's a lot of other
companies that I can name that offer
similar products and they're all based on that.
So just hold on for a second. So BM25, I heard that come up a few times. Is that comparable to TF-IDF, or is it like an implementation of it? Like, how does it compare to TF-IDF? I'm just trying to figure out where BM25 fits.
It's a scoring algorithm for sparse vector results. So it's in the same type of category, if you will. It's different, right? It's applied differently, but it's a similar type of algorithm.
Okay. Okay. Sounds good. Thank you.
And so that's a pretty big one.
If you want, like, for relevancy of results,
you'll get much more relevant results if you use this.
So relevancy is where it starts to fall down when you have bigger corpuses.
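The BM25 idea Philippe describes can be sketched in a few lines. This is a toy scoring function over a hand-tokenized corpus, not ParadeDB's or Tantivy's actual implementation; the k1 and b defaults are just the commonly cited ones.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Toy BM25: score one tokenized document against a query.

    corpus is a list of tokenized documents. BM25 extends TF-IDF with
    term-frequency saturation (k1) and document-length normalization (b).
    """
    n = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))   # smoothed IDF
        tf = doc_terms.count(term)                        # raw term frequency
        norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / (tf + norm)        # saturating TF weight
    return score

corpus = [
    ["postgres", "search"],
    ["postgres", "analytics"],
    ["rust", "search", "engine"],
]
print(bm25_score(["search"], corpus[0], corpus))
```

The saturation term is the practical difference from plain TF-IDF: repeating a term fifty times in a document no longer scores fifty times higher, and long documents don't dominate just by being long.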
The second one I would say is
performance is not bad, but it starts to balloon.
Like, in terms of indexing time,
in terms of search time,
when you have pretty big corpus of data,
like we frequently hear people come to us
and tell us that their searches take several seconds, right?
And like in today's world,
I mean, like even in like 10 years ago as well,
like that was just not acceptable, right?
And so that's a big one.
And then the last one I would say is
Postgres full-text search is meant to be pretty basic, right? It's fine if you have a simple workload, like some English data you want to search. But if you want to do faceted search, where you bucket your search results based on different properties, or if your data is in French or in Russian or in Mandarin Chinese or Afrikaans or whatever, you can't really
tokenize that very well in Postgres today. So those are areas where it starts to fall down.
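The tokenization gap is easy to demonstrate. A whitespace-based tokenizer (roughly what a naive full-text setup assumes) works for English but produces a single useless token for languages like Chinese that don't delimit words with spaces. This is a toy illustration, not Postgres's actual tsvector parser.

```python
def whitespace_tokenize(text):
    # Naive tokenizer: split on whitespace, strip punctuation, lowercase.
    return [t.strip(".,!?").lower() for t in text.split() if t.strip(".,!?")]

print(whitespace_tokenize("The quick brown fox jumps"))
# -> ['the', 'quick', 'brown', 'fox', 'jumps']

print(whitespace_tokenize("我喜欢吃苹果"))
# -> ['我喜欢吃苹果']  (one "token": the entire sentence, so nothing is searchable)
```

Dedicated engines ship per-language analyzers (stemmers, CJK segmenters, and so on), which is the kind of breadth being pointed at here.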
So really like performance and breadth, I would say. So if you don't need, you know,
very big performance and you have a pretty narrow use case, you'll do great. And you can do that
for a while.
Gotcha. And so when you say not much breadth, is that like, hey, I have a bunch of users, and I want to give search on first name, last name, email address? Postgres is going to nail that; I don't go out to Elasticsearch for that. But then, what is it about the size of the documents, or the size of the total data set? When is it like, oh, now I need something more? Is there some sort of tipping point that indicates I should move to something else?
Yeah, I think, in my opinion, it depends on the complexity of the queries people will run,
right? So I actually know of some quite large companies, like public companies that run
their searches on Postgres TS vector, and it works quite well.
So it just depends what people are looking for, right?
So if I'm looking for, I don't know, like colors, right?
Like I want to search like various colors.
Even if you have a lot of data,
it's gonna make a pretty simple query,
it's gonna work well, right?
But once you start to do things where you wanna offer
more flexibility around user error, right?
Even if the user makes typos or asks for something that's quite complex.
Or even, you know, oftentimes users will search for things they don't fully understand how
to search for, right?
Like, you know, if you want to search for, like, I don't have a good example, like some
sort of computer, right?
But you're actually looking for an iPad or whatnot.
Like things that are going to be relevant,
but are actually like pretty different
to what you're searching for.
It's going to be quite difficult.
So the simpler the workload, I think,
from the user perspective,
the more you can do with Postgres.
Yeah, gotcha.
And that last example you said,
is that going to be something that'll be accomplished
by, you know, the BM25 index?
Or that last example of saying,
hey, you're searching for something similar, but you didn't even type it correctly? Or is that going to be more in the vector stuff, like the similarity search we've seen with embeddings and stuff lately?
Yeah, the last example is more of a similarity example. That's perhaps a bad one. But full text is very good when you want to search for a point concept, you know? If I search for your name, it has no similarity in meaning to my name or whatever, right? So there's a bit of both. You can do more today with basic full-text Postgres thanks to pgvector, and it's now supported on all the major Postgres databases, so that also enables people to stay even longer on it. But the truth is, for any data-intensive product
where search is a key component,
and that's increasingly a large number of them,
you just need something better.
You'll see. People know.
We don't have to convince them.
They know when they start to need something else.
They're very, very aware of it.
Yeah, exactly.
And then, okay, so you talked about the performance aspect of it,
you know, that you've done significantly better there.
And how much of that is, hey, the BM25 index is that much better?
How much of it is like, hey, you have this custom stuff you've written in Rust that is an extension in Postgres?
Like, what's accounting for the performance gains there?
Yeah, so that's a good question.
A lot of the performance gains,
the credit does not belong to us, to be honest.
So we integrate a search library called Tantivy,
which is a Rust implementation of Apache Lucene.
So Apache Lucene is the state-of-the-art search library
that people have used for decades now,
maybe close to two decades.
Elastic is based on it.
Most of the dedicated search engines
are based on some variant of it.
Tantivy is a re-implementation in Rust
that's now adopted and used by many, many companies.
The original company is called QuickWit.
They're also offering a product based on it
specifically for observability.
There's a lot of other companies that are based on it
and we're the same one.
So the community, and by the community I mean overwhelmingly the folks behind Tantivy and Quickwit, are the primary people that deserve the credit for that.
We get a lot of performance benefits there.
And then what we do ourselves is
we sort of glue the pieces in a smart way to make it so there's a lot of
benefits you get as well from users just doing the right thing, right? There's a lot of value to be
added in just like constructing your product in such a way that nudges the users. So they just
always know the right behavior or the right action to take, right? The right index, how to employ the
index on their table, for example, right?
So that they do it in the correct way, and then you get a lot
from that as well.
Yeah, and then even comparing
Tantivy versus Lucene, which, like, Lucene is just a Java library, is that right?
That's right, yeah.
Okay. Comparing those,
like, is that,
I mean, is that just Java versus Rust? Or is it like, hey, we have a better sense of this space now, and maybe Lucene has a lot of cruft from 20, 30 years of being built up and adding all these different features? What accounts for even that speed difference between Tantivy and Lucene?
It's both, right? So Java is famously slow, with the JVM being quite bloated and doing garbage collection at random moments.
Rust is very performant.
It does not do any of that garbage collection.
Well, yeah, it doesn't do the JVM.
And so you get a lot from that just off of the bat.
And the folks behind Tantivy, if you look at their project, you'll see they say Tantivy is inspired by Apache Lucene, but it is not a re-implementation of Apache Lucene. So there are a lot of similarities, because Lucene is very good and they've gotten a lot of things right. But as you say, over 20 years, some things have become kind of too far committed to change. And so they chose to make some slightly different design decisions, and there's a lot of performance that gets gained there as well.
Yep, cool. One thing that I saw when I was researching Parade, and I think you all even mentioned it on your post a little bit, is ZomboDB. How do you all compare, just in approach to solving this problem, with ZomboDB?
Yeah, so Zombo was kind of like the pioneer of what we're
doing, right? And I know the main author,
Eric, he's
quite the man. He's a pioneer
in the space. The tool that we used
to build our
work was also written by
the creator of ZomboDB, because he wanted
to build ZomboDB better himself.
But for context, ZomboDB,
for those that don't know, is also a Postgres extension that offers similar capabilities. But what it does is it allows you to operate and manage an Elasticsearch cluster from your Postgres database. So the ideological difference between us and Zombo is that Zombo says: you have Elastic, you have Postgres, and we make dealing with the two of them together
as easy as possible.
What we tell you is you should never
need to have an Elasticsearch cluster.
You should just be able to use Postgres.
So in that way,
we kind of see ourselves as a bit of a successor
to what Eric has done
with Zombo. He would be the first one to say that himself.
But
Zombo definitely paved the way.
Yeah, that's interesting.
Yeah, because, man, of all the infrastructure I've ever used, Elasticsearch is the scariest piece. Even if something else is sort of managing it, you're still on the hook. Even with the managed Elasticsearch providers, it's hard.
Exactly, exactly.
The problems are still there when people come to us. And that's why also, for example, our performance, right? Our search results, thanks to Tantivy, are about 25 to 50 percent faster than Elastic. Our indexing time is five times faster than Elastic on our own benchmarks, on a terabyte of data. So performance is great, right? People come and they use it and they're like, wow,
but that is never the reason that brings them in, right?
Every time people, they don't come to us and say,
Elastic is not fast enough.
They say, Elastic is so incredibly heavy and difficult to operate, right?
And Postgres is so easy to operate.
And by the way, I already operate it, right?
So they come to us and they just want to have less mental burden rather than performance.
And I'm convinced even if we were slower than Elastic, people would still use the work that we do. The fact that we're faster is just an added bonus, the cherry on top of the sundae.
Yep, yep. And so I think this is part of the whole "just use Postgres" movement, which you see a lot. What accounts for that? What makes Postgres so good at this? And I guess, why is there so much energy around that movement right now?
Yeah, well, I have a bit of a nuanced take on this, which is: I love the Postgres-for-everything movement. I know most of the people that work on it, I think they're great people, and we're a part of it. I will say, I think we're still at a point where you should use Postgres for many things. I don't think you should use Postgres for everything yet, but I do think one day it will be possible, and we're working to make that happen.
But
the reason, I mean, there's kind of two reasons.
One is
Postgres is very extensible.
A lot of people compare Postgres to Linux
nowadays, right? Where the core Postgres
foundation has become more than just
a relational database. It's just a data,
like an open-source
data store that can be taken in so many
different ways. And so just like we'll
have people make Linux distributions or Linux
projects, and for various...
You have the security Linux with Alpine, and you have
the enterprise Linux with Red Hat, and blah, blah,
blah. Postgres is a
little bit like that. So I think their open-source community
has built its trust over the last 30 years. Another reason I would say is there are some issues
that have happened with MySQL over the last six, seven years that have made people lose
trust with it. MySQL was part of, I believe, Sun and eventually under Oracle. And it's
not been as open as possible. And data, you know, people care a lot about the sort of underlying
infrastructure behind data being open, especially because of how important data is to everything.
And the third reason that I would say is when you pick, when you build a product,
like every software, if we're being really, really reductive, every
software is a wrapper around the database,
right? Like most software, you just build things on top of your database to give experiences,
right? So it's kind of the first thing you pick. And because it's the first thing you pick,
and you want things to be as simple as possible, it's the best place to extend, right?
Yeah, very cool. So that extensibility part: ParadeDB is a Postgres extension, which means people just have their Postgres instance, they install this extension, and now they get these BM25 indexes with queryability and stuff like that. What does it look like to develop a Postgres extension? Are there just certain hooks that you register and hook into, like, when this happens, call me? What does that look like?
Yeah, I think you summarized it in a pretty eloquent way. The core Postgres team,
they've done a lot of work to provide hooks around various functionalities.
Even from an extension, the interfaces are very well defined and you can call hooks
into the core functionalities of the database.
And so you can hook at the storage layer,
you can hook at the query layer,
you can hook at the plan layer,
which is the level between like
when someone writes SQL
and the SQL actually gets executed
on the data that are stored.
And that gives us the ability
to go and do a lot of things, right?
So we can say in our full-text search, for example, we hook at the indexing layer and
we say, hey, we want to index the data in a different way to be used with BM25 search,
right?
And we can provide custom syntax to do that.
We also have an analytics extension where we hook at the query layer before even the
query hits the data store.
And we say, hey, we want to process that query
a different way than the traditional Postgres
because we know this is analytics data
and analytics data can be,
the numbers can be crunched in a more efficient way
than what Postgres does normally.
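The hook mechanism Philippe describes can be caricatured like this. This is a Python analogy, not the real Postgres C API (where extensions install function pointers such as a planner hook); every name below is made up for illustration.

```python
# Toy analogy of Postgres extension hooks -- NOT the real C API.
# In Postgres, an extension sets a function pointer the core calls at a
# well-defined point; here we model that as callbacks run before planning.

class ToyDatabase:
    def __init__(self):
        self.planner_hooks = []  # extensions append callables here

    def register_planner_hook(self, hook):
        self.planner_hooks.append(hook)

    def execute(self, query):
        # Give each hook a chance to claim or rewrite the query
        # before falling through to the core executor.
        for hook in self.planner_hooks:
            handled, query = hook(query)
            if handled:
                return f"extension handled: {query}"
        return f"core executed: {query}"

def analytics_hook(query):
    # A pretend analytics extension: claim aggregate-style queries,
    # pass everything else through untouched.
    return ("GROUP BY" in query), query

db = ToyDatabase()
db.register_planner_hook(analytics_hook)
print(db.execute("SELECT count(*) FROM logs GROUP BY day"))  # extension handled
print(db.execute("SELECT * FROM users WHERE id = 1"))        # core executed
```

The point of the sketch: the core stays unmodified, and an extension only intercepts the cases it knows how to do better, which is how one database can host both a search index and an analytics query path.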
Yeah, yeah, very cool.
Okay, yeah, I want to talk about the analytics part too,
but just like even closing up the search part,
let's go to Arvid's problem, right?
Where he has 500 gigs of podcast data,
you know, continuing to increase
as new transcripts come in.
And, but, you know, doing like sort of
high volume traffic on that.
Is that something, hey, does that feel like
a good fit for Parade?
Is that like, hey, that's getting too large for it?
Like either in document size or total corpus size, right?
How would you think about that problem
starting to approach it?
No, I think Arvid is a perfect ICP
for the work that we do.
I mean, he is not in this case
because he's a MySQL user, right?
And we don't touch MySQL.
But if you were in Postgres,
this is exactly who it's for, right?
500 gigs of data increasing fast
is too much to do in Postgres' basic search.
It's just not going to work very well, right?
And that's the point where you would need
something like Elastic.
But as he says, right, he's already tried Meilisearch
and he's already annoyed with that data movement piece.
He wishes he would not have to do that.
Well, that's exactly where ParadeDB comes in, right? That's exactly what we offer.
Yep, absolutely. All right, that's great. Is there
a point where you would say, hey, you know, even Postgres with Parade is going to have trouble
with that? And I'm thinking of the people that use like Elastic for log data, and it's just like a
ton of data coming in. Maybe now they're pushing multiple, multiple terabytes. Is that still like,
hey, Postgres can handle that and do search on that? Or is it like, nah, that's sort of a different problem
at this point?
So that's what I meant when I said I don't think Postgres can do everything today. This is a good question. Postgres cannot handle that well today.
Like, we can't handle that well today.
So the specific use case that we focus on
is, I would say, application or backend search, right?
So you want to expose an end user experience, right?
You're building Facebook, you want people to search for, you know, people on the platform. You're building ZocDoc, you want people to search for the corpus of medical professionals that you have, right? Those experiences are too big for the basic Postgres full-text search,
but they're big enough that we're not like writing petabytes of logs like every week,
right? Today, ParadeDB is focused solely on single nodes.
So we don't do distributed searching.
Elastic obviously offers that.
When you're starting to do logs at scale,
you will want to do that.
This is not where we see our competitive advantage
because people use Postgres to build their application, right?
Not to build their observability suite.
And so the benefit of being Postgres sort of diminishes. And the folks, for example, that wrote Tantivy, they created a product called Quickwit, which is specifically designed for petabyte-scale log search over S3. So Tantivy itself is able to do this; ParadeDB just chooses not to focus on that.
Yep, yep, absolutely. You mentioned it's focused on single node right now.
How do extensions work with some of the distributed Postgres tools out there? I don't know them super well, but like Yugabyte or Citus or different things like that. Would a Parade extension work with those, or is there just too much of a conflict between going from single node to distributed that it wouldn't work?
So we make our work as Postgres-compatible as possible.
Our goal is to be true Postgres, right?
Because that's what people want.
So the answer to your question is, yes, it does work.
It depends on how much these people modify Postgres,
not on how much Parade modifies Postgres.
So for Citus, it does work, right?
Citus is an extension.
In fact, we plan to one day offer
a distributed version of Postgres.
And working with Citus is probably going to be
how we offer that.
If you take the folks like Neon
that have modified Postgres pretty heavily,
but have tried to remain like real Postgres shops,
it works, but there are some things
that need to be done to integrate it properly
from our side that we haven't done yet.
And if you take the folks like Yugabyte that have gone pretty far into the deep end, in my opinion, of Postgres, I will be honest with you. I do not know if it works with Yugabyte. I suspect it does not. And maybe one day, if it makes sense, we do the work to make
it work. But there's going to need to be some of their effort poured into that as well.
Yeah, exactly. Okay, cool.
One other thing I saw you mention on Twitter
that was interesting is for some
customers, you recommend having separate
OLTP
and search instances
of Postgres. Can you tell me what that
pattern looks like?
Yeah, so that's a very important point.
It depends. So when people
hear extensions, overwhelmingly, they think of exactly what you described, which is I'll install
this into my primary database, right? But the truth is, the search workload requires different
hardware and different tuning of the database to be optimal, right? And it's very important
to isolate systems. So that if something goes wrong with your transactional database,
it doesn't take down your search and vice versa. If something goes wrong in your search process,
it doesn't take down your transactional database. So the problem that exists with
Meilisearch and Elastic that Arvid was describing is in the data movement: you are forced to do a transformation. It's called ETL, right? Extract, transform, load.
Because we go from Postgres to a NoSQL database, right?
That transformation is where the pain exists.
Because if you change your schema, it breaks.
If you upgrade versions, it breaks.
It requires compute.
It's slow, things like that.
That's where the problem lies.
So the way in which we get sort of the best of both worlds for ParadeDB customers
today is we say, hey, what we are is pure Postgres. You should still, if you have a lot of data,
co-locate it separately from your transactional, but you can connect it via the Postgres replication
logs, similar to our high availability read replica solution. So it's sort of like isolated,
but tightly coupled type of workload.
You don't have any of the pain point of the big heavy ETL that Arvid complained about, but you have all of the benefits of the isolation, right?
And that is our suggested deployment flow for larger workloads.
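The pattern Philippe describes rests on vanilla Postgres logical replication. A sketch of what that setup might look like, with hypothetical host, database, and table names (ParadeDB's actual recommended configuration may differ):

```shell
# On the primary (OLTP) instance: publish the tables you want searchable.
psql -h oltp-primary -d app -c \
  "CREATE PUBLICATION search_pub FOR TABLE documents, transcripts;"

# On the ParadeDB instance: subscribe to the publication, then build the
# search indexes there only, so the primary never carries the indexing load.
psql -h paradedb-node -d app -c \
  "CREATE SUBSCRIPTION search_sub
     CONNECTION 'host=oltp-primary dbname=app user=replicator'
     PUBLICATION search_pub;"
```

Note that logical replication requires the table schema to already exist on the subscriber, and it replicates rows, not indexes, which is exactly why the search-only indexes can live solely on the ParadeDB side.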
Someone like Arvid, this is what I would recommend that he does. If you're a small company, like
a startup, right, or like, you know,
a small shop, having it
all in the same database is going to be fine because your
volume of data is lower, but
overwhelmingly, this is not how people
use us, despite being an extension.
Like, an extension is more of a convenience of
fitting in and development
experience rather than
necessarily how you should install it.
Okay, that's super interesting. So it's basically just another read replica, but on this read replica I've installed an extension and added some indexes that are only on my read replica, not on my primary instance.
Exactly, exactly. And that is the best solution, right? When people think of the Postgres-for-everything movement, oftentimes they're like, I'll just merge everything into one database. Well, you don't want to do that, right? You want to separate them; you want them to be close. But if you go and order a dish at a restaurant, you want each part of your dish put together in its own place, not mushed into one thing, right? It's the same thing.
Yeah. Interesting.
Do you have, like, are most people running a couple read replicas of those or even read replicas to like your, your main search instance, which is like sort of a read replica itself?
Like what's the topology look like for that generally?
So the main topology is they will replicate from the high-availability
Postgres, like from a read replica or from the primary into ParadeDB.
And then depending on their preferred SLAs and uptime guarantees,
ParadeDB itself might have read replicas or high availability solution,
or it might be a single instance.
It depends on your maturity as an organization
and what you're looking to guarantee.
And for most people, is that syncing,
is that like asynchronous replication?
They're probably not like doing that,
that synchronously from their HA Postgres.
It is.
So that's also something you get to decide.
So the way they do it is there's always a lag, right?
There's a small lag between your primary Postgres
and your read replicas. In today's world, this is very fast, right? High-quality replication solutions lag very little behind the primary. That's exactly the same case for Postgres, so for Parade, excuse me. So it is synchronous in that it goes over in real time, but we offer two types of guarantees for that.
We offer strong consistency or weak consistency,
depending on what you want.
And so today, for most customers, if you go and write data into ParadeDB, your transaction is not going to commit until the index is updated. So transactions are slowed down slightly, right? What you get with this is that if the data has been sent over logical replication, you know that if you make a search query, it will contain that data. The downside is that it slows down ingestion slightly. And so we also offer the ability for enterprise customers to do weak consistency, which basically says: just dump as many logs as you want in there. It is possible that indexing lags behind by a few seconds, but if you want to ingest large amounts of data, that's a trade-off that makes sense for you.
Yep, absolutely. You mentioned that search requires a different set of resources, a different profile, than your normal OLTP stuff. What do those nodes look like? Are they higher memory or higher CPU? How do they vary from your primary OLTP HA nodes?
Yeah, that's a good question. Oftentimes they're higher memory, in my opinion, but it depends. I wish I could give you a really good answer, but to be honest, customers do a lot of different things, so it's hard to give a really clear one. Usually people will put in more memory, and depending on the frequency of search, they might want more network bandwidth too. It depends.
Gotcha.
Yeah, I wasn't sure if there was like,
hey, search type workloads are just way more
compute intensive or memory intensive or whatever.
So I didn't know if that was a rule,
but it probably just depends a lot on specific use cases
and things like that.
I think as we service more and more people,
there's a rule that will emerge.
And if you ask me again in six months,
I'll be able to provide a better answer.
Tweet at me in October.
Yeah, that'd be great. We'll figure this out; I'm going to set a reminder for myself. Cool. But yeah, let's switch to the analytics aspects of ParadeDB, because you're both search and analytics. When I was looking at it, it seemed like kind of a unique analytics setup. So tell me how ParadeDB analytics works and where it's fixing gaps in the Postgres analytics ecosystem.
Yeah, so to answer that, I'll tell you why we built analytics in the first place. We first built search; that was our first product. And as we did, we talked to a lot of customers, and they started requesting analytics. It turns out a lot of the workloads people use Elasticsearch for today are a combination of search and analytics in the same queries. Like, you want to search for some results and then bucket them into categories, for example. There are not a lot of tools that do this very well. Most search engines focus squarely on search, and Elastic is one of the rare ones that does both quite well. That's one reason people still use it, despite most people disliking operating it.
So all this to say, we offer two types of analytics that are kind of inspired by Elastic. The first is that you can do analytics directly on your search queries, so, like the example I just mentioned, you can bucket search results. The other analytics offering is that we allow you to do fast analytics over data stored in object storage, like S3, for instance. More and more nowadays, it's common for people to want analytical data that's stored in a data lake or data warehouse to power some sort of user-facing experience, and user-facing experiences should be built against Postgres, not against S3 or Snowflake, right? That analytics extension, which is called pg_lakehouse, is there to offer that. So you can join tables that are in Postgres with data that is in the cloud, in S3, and you can use that to power, say, a recommender system; that's a big one people build, for example. And that's a workload pretty closely tied to the idea of search and recommendation.
Yep, cool. Okay, so you mentioned you don't want to be building against S3 or Snowflake directly, but this is still reaching out to an object store like S3. What sort of user-facing latency, what response time, should I expect on something that's reaching out to S3 and then doing some filtering or something like that?
That's a good question. Very little. Well, it depends; you have to build it properly, right? So what we recommend when people use this is that they have their S3 data in the same AWS region as their ParadeDB instance. A typical larger-scale or mid-market customer that wants to use ParadeDB will have one AWS RDS instance, let's say, or one GCP Cloud SQL cluster. They'll be in one region, like us-east-1 or us-west-1. The data they run analytics on is going to be in an S3 bucket or a Google Cloud Storage bucket in that region as well, and they'll deploy ParadeDB in that region in a Kubernetes cluster. So there's no internet egress or anything like that. When they run their queries, the queries go from the ParadeDB instance over to S3, and fetching the data is actually very quick. Now, it does add latency; pg_lakehouse is very fast, so the latency of reaching out to S3 is an important part of the total as well. It might be about two times slower versus running the queries on data stored locally in Postgres, but it's really fast. You can see our published benchmarks: you can aggregate 100 million rows in a few milliseconds. And from the perspective of the end user, whether it takes two milliseconds or four milliseconds, you can't tell, right? So it's fine.
Yeah, wow. I'm surprised, again, that it can read from S3 that quickly. That's pretty amazing. And then for those types of instances, especially if I'm reading large amounts of data... I guess it's using DataFusion under the hood, is that right?
Yes. It's using DataFusion, but things will change. We may be looping in some DuckDB, but you heard it here first; it's not released yet.
Okay, sounds good. That's cool stuff. Anyway, whatever it's doing, it's sort of farming the query out, but on that same Postgres, your ParadeDB instance, it's reaching out to S3 and pulling data back. So am I going to need a pretty beefy memory-and-disk instance there, whether it's doing processing in memory or spilling to disk, to handle big result sets? Or is it pretty smart about filtering on S3 if I have it structured well with hierarchical keys and stuff like that? How should I think about that?
Both DataFusion and DuckDB, which are the main tools we use to do this, are very smart about it. It's like every analytics engine: the more memory you have, the more you can keep in memory without writing to disk, and the less you have to write to and read from disk, the faster things are going to be. So obviously, the more memory, the better. That being said, the people who built those tools (again, we don't deserve any credit for that; we're just users of them) are very good at what they do, and the tools work quite well. So even with smaller instances, the results are quite phenomenal.
Yeah, very cool. Okay, and then you mentioned some changes you're making, and it sounds like a hosted service is in the works. Where are y'all at? Is the hosted service available for folks right now? Where is ParadeDB at?
Yeah, so that's a good question. Our search work has been deployed at terabyte scale with some large organizations, two public organizations as well, which is very exciting. ParadeDB has been deployed about 60,000 times. So it's out there; people use it. We're not calling ourselves v1, and we're not going to call ourselves v1 for a while. I think people have very, very high expectations before data systems do that, and we want to live up to those.
Our analytics work is slightly newer.
It's also working quite well.
It's also deployed heavily.
We're launching the next version next week, which will be like a significant stability milestone as well for it.
But those are kind of coming together.
We're working on the cloud offering.
Our cloud offering is not a typical one.
We do not host the instances for you.
As I was describing to you in that setup before, the core value prop is having everything in your own cloud, living in the same environment, right? So our cloud offering is essentially bring-your-own-AWS-account, where we handle all the management in the form of a control plane, but the actual ParadeDB instance gets deployed in a region of your choice in AWS.
We expect that to ship by the end of July
and to coincide with the next release
on search and analytics.
And so by the end of July, if everything goes well, you'll be able to go and just choose a region where you want ParadeDB to run, and within two or three minutes have everything wired up and be able to do really high-quality search and analytics without any data leaving your region. That will be the main moment where I think we start to be a real product for real enterprises, beyond the adventurous ones that do a lot of work themselves to make it work.
Yep. One thing that's kind of nice about where you're situated in the stack (not to say that stability and data integrity aren't important) is that you're secondary, you're downstream, right? You're not the primary OLTP store. I think that makes it a little lower-risk for someone to adopt, especially if they're having pain with search. It's like, hey, we can put this out there, even a pre-v1 type thing, and the downside of messing up search is different from the downside of messing up your transactional order data. It can really solve that pain point. And given that it's that sort of downstream async replication, you can add it in, it's going to backfill that data, I imagine, and be good to go.
So exactly, exactly. And that's why we've had pretty big organizations that have already adopted it, as I said, up to Fortune 500 companies, which is quite crazy. Even we didn't believe it when they reached out at first, to be honest. But that's also part of the vision, right? There are amazing people who have built amazing tools for OLTP stacks today: all the big cloud providers, other companies, the Supabases and Neons of the world. Those people do great work, and we want to work alongside them, not against them. And customers don't want to move their OLTP data. They're happy with where it is; it's very sensitive, it's very risky, and they shouldn't have to touch anything. So we wanted to be there and say, hey, here's a small add-on; if things work out, you're happy, and the risk is very low. Worst case, you just get rid of it. And yeah, it's been well received in that place.
Yeah, very cool. How did you, I guess, stumble upon this idea? Was it a pain point that you specifically had, or how did you get down this road?
Me and my co-founder spent some time doing software consulting in Europe at the beginning of 2023, and as we did, we started interoperating a lot of Postgres and Elastic and other vector databases and so on, and felt like, I don't know, there should be a better way to do things, right? But that was just the seedling, to be honest. The real way was just talking to a lot of people. I have talked to everyone who would talk to me, and to everyone listening to this: if you have thoughts on this and you want to talk to me, please reach out, I will talk to you. We have learned so much from our users, and there's still so much for us to learn. And that's kind of how we've gotten to where we are today.
Yep, yep.
What's the Postgres community like in terms of like,
hey, we're building a new extension?
Was that hard to break in and get to know people?
Or is it pretty welcoming?
Or what's that feel like?
The people are great. It was not super hard to break in, but it does take some time to get accepted, I think. They're smart people, they're hard-working people, they're very passionate people, many of whom have been doing this for longer than I've been alive. And so when you come around, you have to prove that you're worth the attention, right? People were welcoming in that we could always quickly find out where to get started. But it was only once we started putting out really good work, and people saw the work was good and were impressed and happy with it, that we really started to feel welcome in the community. And now it's been about ten or eleven months. We feel like we have an increasingly warm seat in the Postgres community, and we're happy to be a part of it and to be big proponents of it. So I hope more people join it.
Yeah, for sure.
How big is the Parade team right now?
We're four people.
Four? Okay, very cool.
Yeah, it's a small group.
Yep.
And if someone's looking to get started with ParadeDB, do you recommend, hey, just go install Postgres, install this extension, and get going? Or should they reach out to you and try to get some consulting set up? How do you recommend getting going?
We're open source; you can find our repo. We publish a Docker image, and we also publish our extensions: 38 versions, prepackaged for Debian, Ubuntu, and Red Hat Linux, on multiple Postgres versions, multiple architectures, and multiple OS versions. So for most people out there, it's literally two lines of code, a curl command and an install command, and you're good to get going.
And then we have documentation where you can poke around.
We have a Slack community that we link in our readme
that people can join and say hello,
and we're always very responsive to help.
And if you need something more than that,
then you can always message me or message anyone else,
and we're happy to talk one-on-one.
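The two-command install Philippe mentions is distribution-specific, so it is not reproduced here, but once the extension is in place, trying it out is plain SQL. The table and data below are hypothetical; the `@@@` full-text operator is pg_search's documented query syntax, though the index-creation syntax has changed across versions, so check the docs:

```sql
CREATE EXTENSION pg_search;

-- Toy data to search over (hypothetical)
CREATE TABLE products (id SERIAL PRIMARY KEY, description TEXT);
INSERT INTO products (description)
VALUES ('running shoes'), ('wool socks'), ('leather boots');

-- Build a BM25 index, then query with pg_search's @@@ operator
CREATE INDEX products_search_idx ON products
    USING bm25 (id, description)
    WITH (key_field = 'id');  -- verify against your pg_search version

SELECT id, description FROM products WHERE description @@@ 'shoes';
```

From there, the documentation and Slack community mentioned above cover ranking, fuzzy matching, and the rest of the query surface.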
Awesome. Philippe, this has been great. I love this. I think it's so cool what you're doing, and I think it's really needed. As someone who's struggled with Elasticsearch before, and some of the frustrations there, this is pretty cool stuff. We'll have links to your Twitter and the website and everything in the show notes.
Anything else, like if people want to find you
or anything else you want to shout out before we head out?
I mean, I think those are the main places.
You can find me on Twitter as well.
I think if you Google my name, I should come up.
But yeah, please, I do mean this.
Please reach out.
Anyone, everyone.
I always love talking to people.
I always have something to learn.
So please reach out.
Awesome.
Philippe, thanks for coming.
Thanks for having me, man.