Postgres FM - Companion databases
Episode Date: November 17, 2023

Nikolay and Michael discuss companion databases — when and why you might want to add another database management system to your stack (or not), and some specifics for analytics, time series, search, and vectors.

Here are some links to things they mentioned:
- Heap were using Postgres + Citus for analytics as of 2022: https://www.heap.io/blog/juggling-state-machines-incident-response-and-data-soup-a-glimpse-into-heaps-engineering-culture
- Heap recently moved their core analytics to SingleStore (we only spotted this after recording 🤦‍♂️): https://www.heap.io/blog/heaps-next-generation-data-platform
- PostHog moved their analytics from Postgres to ClickHouse: https://posthog.com/blog/clickhouse-vs-postgres
- Timescale: https://www.timescale.com/
- Citus: https://www.citusdata.com/
- Hydra: https://www.hydra.so/
- Our episode on real-time analytics: https://postgres.fm/episodes/real-time-analytics
- Our episode on queues: https://postgres.fm/episodes/queues-in-postgres
- Our episode on logical replication: https://postgres.fm/episodes/logical-replication
- Full Text Search in PostgreSQL (docs): https://www.postgresql.org/docs/current/textsearch.html
- ZomboDB: https://www.zombodb.com/
- ParadeDB: https://www.paradedb.com/
- PeerDB: https://www.peerdb.io/
- RUM indexes: https://github.com/postgrespro/rum
- Debezium: https://debezium.io/
- pgmq: https://github.com/tembo-io/pgmq
- Tembo: https://tembo.io/
- pgvector vs Pinecone (Supabase blog post): https://supabase.com/blog/pgvector-vs-pinecone
- Jepsen analyses: https://jepsen.io/analyses

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:
- Jessie Draws for the amazing artwork
Transcript
Hello, hello, this is PostgresFM, your favorite Postgres podcast, one of many.
And my name is Nikolai, and as usual, I have Michael here. Hi, Michael.
Hi, Nikolai.
So today we chose, you chose actually, one of the topics suggested by our users,
so we can blame them, not us.
And tell us which topic we're going to discuss today.
Yeah, this is nice.
I like picking user suggestions or listener suggestions
because I know at least one person will be interested in it.
They suggested discussing companion databases,
which is a term I hadn't heard before but made complete sense.
So this is the concept of having a different database management system
for a particular workload.
So like Dave mentioned Redis, but we've discussed in the past queuing, we've discussed analytics
workloads. There's loads more that we haven't discussed, like vector databases are all the
rage these days, aren't they? Document databases, even search-specific databases. So, yeah, I thought this was an interesting topic.
I think there's probably some generic pros and cons,
but also some details in the specifics that might be of interest.
But I was, yeah, super keen to hear your thoughts and experiences
with when some of these things make sense, when they don't,
and whether we can come up with some rules of thumb.
Well, first of all, the rule of thumb for me is: if you can avoid it, avoid it, right?
But I'm okay with other systems as well.
And obviously, pros and cons in each case, sometimes easy to see.
For example, VictoriaMetrics is much better in terms of compression and working with time series than Postgres and even Timescale, if you check benchmarks.
So it's obvious.
We had a discussion with their CTO about how many gigabytes one billion rows should take.
My impression: for Postgres, a row store with very light compression only for TOAST, it's roughly a terabyte; for them it's gigabytes, so several orders of magnitude less.
I'm not sure about single gigabytes, maybe dozens of gigabytes, but definitely not a terabyte.
And this alone already highlights a big difference. Of course, Timescale is also good at compression, but not as good as VictoriaMetrics, as I understand. I'm not going to blame Timescale, it's great technology.
So this single example says that sometimes you probably want to consider different databases. For example, ClickHouse. Obviously, it has great capabilities to handle workloads like logs or streams of events and so on at a very large scale.
You can do it with Postgres, but you will hit some difficulties, obviously.
Again, if you can avoid it, I would prefer avoiding it if you already chose Postgres, because there will be two big problems: synchronization, and the management and tuning of an additional system requiring some expertise.
Would you count, yeah, the expertise, would you count that in the latter?
Like I'm thinking there's like a certain amount of educational overhead as well,
you know, team expertise that can make sense in larger companies or larger teams.
But yeah, do you count that in the management overhead?
I guess it is.
Right.
Yeah.
Well, when I say management, it's not only that. You can maybe find a managed version of some system, so you won't be worrying too much about backups, although I think you should even for a managed system. And there's stuff like provisioning, replication, failover, and so on. But still, most likely it will be a different language, a dialect of SQL, or maybe not SQL at all.
It will add complexity to the application, definitely.
So maybe "management" is not the right word here.
Dealing with something else requires additional effort, not only in terms of management, but in terms of just working with a new system.
Right?
So this is serious.
If you add an additional system,
you need to add a lot of stuff additionally.
Yeah, I like it.
I think you're already starting to extract
some of the things I was hoping to
in terms of the trade-offs again.
So we've already touched on the analytics one,
I think is particularly interesting
because I think there are quite strict trade-offs there.
Some of the specialized systems contain features
we just don't have even with the Postgres extensions.
But yeah, you mentioned we do have some quite deep Postgres extensions
with Timescale and Citus that offer compression.
And if you look at some,
like I was looking for examples, and it's really interesting to see, for example,
Heap Analytics, a dedicated analytics product, I believe still using Postgres with Citus and
making it work even at extreme scale. And yet I also read a very similar write-up from PostHog,
also an analytics product, who are much newer than Heap, so I'm guessing smaller, but maybe not. And they've already bitten the bullet and moved their analytics workloads to ClickHouse for the benefits. When you're an analytics company, that can make a ton of sense, right, to pay that migration cost. But I think the interesting learning for me from that was that they did scale to a certain point using just Postgres.
So I think that I like your word.
Without Citus, you mean?
I don't know for sure, but my guess is without, yes.
I remember they used ZFS and ARC2.
It was interesting.
So the trade-offs are synchronization of data and additional requirements on your development and administration activities. So how about discussing several particular workloads and checking these trade-offs, starting with analytical workloads maybe? What do you think?
Yeah, sounds good. So you mentioned those as trade-offs; I see them as kind of the downsides, the cons of having a separate companion database. But I think we haven't mentioned some of the pros as much.
Some of them we've mentioned: more advanced features, and performance, like they can be better performance-wise in certain cases. I think there are also a couple of others. They can be easier to scale out: I think with Postgres we have really good scale-up options, but we don't necessarily have as easy scale-out options, whereas some of these dedicated databases do.
And then there's one more, which you brought up a few times in previous episodes, which is reducing the load on the primary. If we've got a Postgres primary that's already fairly maxed out, or we're pushing the limits of our managed service provider, for example, we probably don't want to be adding additional workloads to it. So anything we can do to split out services could be beneficial as well.
So those feel to me like the pros of going down this path of adding a companion database.
But then the cons are, as you said,
like learning another system, another system to maintain,
and those, yeah, synchronization or managing transactions,
I guess, in some sense.
Right.
So if you mentioned Citus,
then we should mention Hydra as well
because it's a new thing, and we will probably soon hear some use cases, I hope.
And talking about analytical workloads, I would distinguish two big, different cases.
First is when you have an OLTP system as your main system, for example a web or mobile app, and additionally you need analytics.
For example, e-commerce: you need an analytical system to analyze what's happening with your e-commerce, and a lot of stuff, maybe for BI and so on.
A lot of stuff. You need a lot of stuff there.
And you can technically build it on top of Postgres. There are systems built on top of Postgres, I don't know, like Odoo and some ERP systems, that can be considered analytical; and Pentaho from the past.
But probably you would prefer, in this case, a companion system like Snowflake or something.
In the past, it would have been Vertica or something.
And there is a very different case when analytical workload is your primary workload.
You are just building some analytical tool, and this is what you deliver.
In this case, Postgres can be your main system, or it can be, I don't know, non-existent at all in this approach.
So speaking of the first case, when there is OLTP at the center and analytics is an additional thing, obviously by default, growing projects might choose running analytical workloads just on replicas, right?
Yeah.
We discussed this as quite a bad idea.
I mean, it's not a bad idea at the very beginning, because you don't have a lot of TPS or a lot of bytes of data volume. But eventually you will hit a different trade-off which doesn't have a good solution: should hot_standby_feedback be on or off?
Yeah.
Right. And as a reminder, if it's on, then running long-running queries, statements, actually transactions, on replicas with hot_standby_feedback turned on will affect the primary workloads, and autovacuum won't be able to delete freshly dead tuples.
If it's off, then it makes your replica a single-user replica, because it accumulates a lag while you run your long transaction.
And it depends a little bit if you need...
With analytics, there can be cases where you don't need
completely real-time analytics data.
It's okay to have...
Fair, yes, fair.
But then it's still not fully healthy when lag is accumulated.
How to monitor it?
Is it caused by a normal workload?
Somebody running a 30-minute report or something?
Or it's already an incident and then we need to catch up?
And then anyway, other users, from my experience, will expect this replica not to be lagging.
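For reference, a rough sketch of how you can check both sides of that trade-off, assuming a standard primary plus streaming replica setup:

```sql
-- On the replica: is hot_standby_feedback on or off?
SHOW hot_standby_feedback;

-- On the primary: replay lag per standby, in bytes
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On the replica: approximate lag as a time interval
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```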
And there is a good solution to this problem.
So both on and off for hot_standby_feedback are bad, but the good solution is branching.
For example, if you have branching available, you can create your branch and analyze everything there, and then destroy this branch. Or cloning, right?
But it's not available everywhere.
So this is the problem number one.
And the problem number two is what we mentioned already,
Postgres compression options; and actually not only compression, it's a row store, so there's a lack of column store.
Which is, yeah.
I don't even think sharding should be like immediate.
Maybe, of course, like, for example, ClickHouse.
By the way, it's interesting, like analytical workloads and time-series workloads sometimes are quite close to each other, right?
Yeah.
So do you consider ClickHouse an analytical database system or a time-series database system?
I always think of it as analytics.
But originally it was developed for a web analytics system, similar to Google Analytics.
But the primary type of data there is time series, a series of events.
Why is it called ClickHouse?
It's a series of clicks, web clicks,
with a lot of characteristics of each click.
Click means like request, HTTP request or something.
So it's time series, basically, also, right?
Yeah, I see what you mean.
But equally, I think they also have vector features now.
It's like, just because it has...
How do you define Postgres?
Is Postgres an analytics database as well?
I think it's difficult to categorize these things
in one category, right?
But I've seen ClickHouse used for analytics stuff.
And yeah, it's mostly time series data, but so is so much data in this world. How much data do you collect that isn't time series?
We know that, from Timescale's point of view, everything is time series.
We know this, right?
But yeah, I think even
a human is time series.
It has a birthday.
So all humans
are time series data.
And everything
you have, it has created at timestamp
and maybe updated and deleted.
I don't know.
Okay, so then we should also consider not only Citus and Hydra but also TimescaleDB as options to stay with Postgres only.
But in this case, if you have a strong OLTP system and analytical workloads which you know will involve long-running queries or transactions, I would suggest having two separate clusters and connecting them through logical replication, to avoid this hot_standby_feedback problem.
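As a minimal sketch of that two-cluster setup (the table names and connection string here are made up), built-in logical replication looks roughly like this:

```sql
-- On the OLTP cluster (publisher): publish only the tables the analytics side needs
CREATE PUBLICATION analytics_pub FOR TABLE orders, order_items;

-- On the separate analytics cluster (subscriber), with matching table definitions
CREATE SUBSCRIPTION analytics_sub
  CONNECTION 'host=oltp-primary.example.com dbname=shop user=replicator'
  PUBLICATION analytics_pub;
```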
And if we take it to the extreme at the other end,
the small side of things,
I think you can start even without any of those.
Even with partitioning,
we get a lot of the benefits straight away.
And I think I've seen people use roll-up tables quite effectively.
So if you can aggregate the data in a separate table, for example, that can massively improve very limited analytics workloads.
Or, like the case you mentioned, where you have an OLTP workload and you need to provide some analytics over it.
If you can use roll-up tables,
you don't even need any of those extensions for quite a while.
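A minimal sketch of that roll-up idea, with hypothetical table and column names, and a periodic job (cron, pg_cron, whatever you have) doing the aggregation:

```sql
-- Small aggregate table that analytics queries read instead of the raw events
CREATE TABLE page_views_daily (
  day        date,
  page_id    bigint,
  view_count bigint,
  PRIMARY KEY (day, page_id)
);

-- Re-run periodically; the upsert keeps it idempotent for the current day
INSERT INTO page_views_daily (day, page_id, view_count)
SELECT date_trunc('day', created_at)::date, page_id, count(*)
FROM page_views
WHERE created_at >= date_trunc('day', now())
GROUP BY 1, 2
ON CONFLICT (day, page_id)
DO UPDATE SET view_count = EXCLUDED.view_count;
```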
So I guess what we're trying to say is that it's not only a single Postgres cluster versus Postgres plus a companion VictoriaMetrics, for example, or, I don't know, Snowflake (quite an expensive option, but very popular, obviously). We can consider a lot of options while staying inside the Postgres ecosystem.
And for example,
if you
can reach good
query performance, probably
it's fine to run it on
a single cluster replica.
This is, I think, what we should call an HTAP case, when we have OLTP workloads and analytical queries which don't take 30 minutes.
Several seconds or so.
Several seconds, well, up to one minute.
It's okay for autovacuum not to be able to delete dead tuples for some minutes, maybe 10 minutes.
It's okay. It doesn't delete them immediately anyway; it will come back later.
And in this case, Hydra is looking interesting, because you can define specific tables as column store and run analytical workloads on them. You may have some data stored twice inside the same cluster, originally as row store and additionally as column store. It's interesting, right?
Yeah, yeah. AlloyDB is also interesting, but it's different. As I understand, the main storage is row store, but in memory they also build a, how is it called, I only know the term in Russian, sorry, a transposed table, rotated 90 degrees, so it additionally becomes a column store, and you can run aggregates much faster.
But it's not open source; Hydra is open source.
So I would root more for Hydra in this case.
So, interesting and new options to explore before you think, okay, enough, I'm going to attach Snowflake.
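For a rough idea of the Hydra approach, assuming it exposes the columnar table access method it inherits from Citus columnar (table names here are hypothetical):

```sql
-- The OLTP table stays a regular row-store table
CREATE TABLE events (
  id         bigint GENERATED ALWAYS AS IDENTITY,
  user_id    bigint,
  event_type text,
  created_at timestamptz NOT NULL DEFAULT now()
);

-- A second copy kept as column store for analytical scans
CREATE TABLE events_columnar (LIKE events) USING columnar;

-- Periodically append new rows to the columnar copy
INSERT INTO events_columnar
SELECT * FROM events
WHERE created_at >= now() - interval '1 day';

-- Aggregates over the columnar copy read far fewer bytes per column
SELECT event_type, count(*) FROM events_columnar GROUP BY 1;
```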
But sometimes you have business requirements.
For example, if it's a large company,
they build a whole department for analytics
and these guys know how to work with a particular analytical system, and they say, okay, we need data there, in Snowflake, that's it.
In this case, no options, right? We don't say, let's try Postgres.
So it depends; there are many, many questions here.
I would still encourage people to consider it as an option, and to mention to the team that it might be an option. But if you've already got the expertise, that does limit the cons.
You know, if you don't have some of those downsides that we discussed, then it is an easier trade-off to make, especially if you have some of the pros, especially if you know about a feature in Snowflake that you get out of the box that you just don't get out of the box in Postgres, or that would take additional engineering effort.
And in a lot of cases, it's very easy for us to brush over those. I don't know these systems intimately, so I'm sure there are features I've never heard of that some people find extremely useful in some cases. We've got episodes on queuing and on analytics already that I'll link up in the show notes for anybody that missed those, but we don't have one on full-text search yet.
I think that will be an interesting episode someday.
But Elasticsearch is a huge product and has so many features,
and I'm sure support must...
Before we go to this topic, let's finish with analytics.
And maybe time series as well.
I think full-text search is definitely a new category to discuss.
But speaking of analytical workloads: if you already decided to have an additional database, or it's a requirement, for example, a lot of budget is allocated for Snowflake and you cannot beat that.
It's already allocated. It should be spent on Snowflake. And people are hired.
So, okay.
And there will be a question of synchronization.
And this is interesting
because you will need to deal
with probably logical decoding
or logical replication.
And probably you will need an additional tool.
I mean, it's possible to build synchronization yourself, but if the company already spends money on Snowflake, probably they will decide to spend money on tools like, I don't know, Fivetran, for example.
Because there is a lot of stuff to solve there in synchronization, and it's related to logical replication and decoding.
And in this case, it's logical replication between two different systems.
So the problem of avoiding repeated initial full resynchronization is no joke there.
And if the tool is not right, it will produce duplicates on the analytical database side, because the usual analytical databases, such as Snowflake, BigQuery, or, what else, Redshift or something, don't have unique keys.
And this is by design.
And in this case, if the synchronization tool which you build on top of logical decoding is implemented in the wrong way, you will have some duplicates when replication switches from the initial full sync of a snapshot of data to CDC, change data capture.
And this is a problem. It's a solvable problem; the tool just needs to be fixed if you have duplicates.
Obviously, the tool might be relying on the presence of unique keys on the subscriber side, but it should not do that, because such tools' primary goal is pipes from OLTP to analytical database systems, and analytical database systems don't have unique keys.
If a tool allows duplicates, it's a bad tool, don't use it. Or tell them: fix it right now or we won't be paying you, or so.
So the duplicate problem is solvable, just with careful orchestration of this process of switching from initial load to CDC.
But actually, that's it. I think today we have quite good capabilities on the Postgres side, so data can be moved from an OLTP system to an analytical system with very, very small lag.
And it's amazing. I mean, you can have lag below one minute, definitely, and with a lot of TPS, if everything is done right. Great. Also, if you use some cloud replication tools like Fivetran.
I had bad experience.
It was
kind of my fault.
So it was Postgres database
RDS,
which obviously allows logical replication
to anywhere.
Logical replication slot is possible to create
and logical replication connection from outside
is possible.
I was checking Fivetran from RDS to Snowflake.
Snowflake was also on AWS, but at that time, I think it was the early days of Fivetran. Now they are much, much better.
So I was seeing very bad performance, lags and so on, in my synthetic tests: you take pgbench and you see the lag accumulating very, very fast. And it turned out that the Fivetran servers were in Google Cloud, so latency. It doesn't make sense. So with these software-as-a-service, platform-as-a-service, anything-as-a-service offerings:
You need to make sure network connectivity is good.
I mean, not just distance: latencies, throughput, everything needs to be checked.
And right now, I know Fivetran have servers in AWS.
So you just need to choose proper settings.
And in this case, everything should be in the same region, right?
In this case, well, unless you need a multi-region setup.
In this case, performance should be good and delays,
lags of replication should be very low.
And I think it's good if your company needs Snowflake, for example, it's good.
Or ClickHouse, managed ClickHouse.
I don't know, there is such a thing, right?
So probably you need to manage ClickHouse
and you need to stream changes from Postgres to ClickHouse.
It's possible with some open source tooling.
But I think it's great. I mean, why not?
And also, think about the progress.
Like even five, 10 years ago,
it feels like this would have been a dream for many people.
Most things I heard about were extract, transform, load: nightly processes into an analytics database, so your data was normally, you know, a day old.
Yeah.
Right. But that was the norm even just, well, maybe 10 years ago. Now, maybe I'm a bit older than I realize.
Many banks still use this nightly load approach, nightly refresh, and it's terrible. It should be redesigned. But of course, the cost of changes
are high, so they keep old
systems working.
If you wake up at night and try to use something
or the application says
down for a couple of hours, it's insane.
It should not be so.
But I would
like to emphasize, maybe
banking systems, some banks
use Postgres.
Transactions are reliable, open
source, quite
good, robust,
maybe the cost of ownership is low because there's no licensing like at Oracle's level.
But then you need additional systems. Like for a banking system, or maybe e-commerce; e-commerce might be a good example here, because obviously they need good analytics systems.
So maybe this is where a combination of Postgres for order processing and so on, and something like Snowflake or Vertica or something for analysis, is the way to go.
Cool.
I would be keen to get your thoughts
on some of the other use cases for...
Yeah, let's discuss.
What?
Full-text search.
Yeah, I think that's a great...
Yeah, good starting point.
Yeah, I think there are a lot of benchmarks
saying that Elastic is better
than Postgres at full-text search, right?
But unlike analytical workloads, which usually
reorganize data in some form for analysis and so on, and usually you synchronize some data there
and analyze, in the case of full-text search, you, I think, want the search to be a part of your main OLTP workload, right? Because user, for example, works with some pages
or, I don't know, buttons,
and then the user decides to search something.
And for the user, it's a part of everything in this OLTP system or web app, right?
So it's a part of-
I think it really depends.
And this is a difference from the analytical case we just considered; this is maybe the main difference, because full-text search usually is a part of the main workload, OLTP, no?
I think it can really depend.
But yeah, I think you're probably, well, you're probably right.
But equally, user-facing analytics,
people might expect to be up to date immediately as well.
So I think there are cases for both of these to be, like, it depends how much that data is updating.
Like, how much is it user-generated data versus, like, system data?
Or, like, I do think there's interesting cases on both sides.
But on the full-text stuff, I don't trust most benchmarks I see. I see a lot of benchmarks showing that whichever vendor is doing the benchmark just miraculously comes out on top.
Yeah, exactly.
And I do think Postgres has, until relatively recently, suffered a little bit from not having that marketing machine behind it.
People super incentivized to do the marketing efforts.
But now we do have a lot of commercial entities doing benchmarks that show, you know, Timescale doing various things and Supabase doing various things.
I do think those two make a good effort to make things fair, and I'm sure most companies try to make things fair, but they know their own system way better than they know their competitors' systems. So there's always going to be bias in those.
So, if I could only encourage one thing for people from this: if you are comparing Postgres to a companion database, don't trust the benchmarks from those two systems.
Do your own benchmarking.
Test your own kind of workload.
So, actually, you're right.
And in this case, probably Postgres full-text search is good, but it will also require a lot of effort.
I mean, in this case, if it's a part of the main OLTP workload, then you have two options.
For example, we deal with Elastic, and it also will require some effort to use.
But obviously, that system targets full-text search specifically.
Great.
Now they also have vector search, I'm sure, right? Everyone has.
But if you choose to use Postgres full-text search, it has a lot of capabilities, but it will also require an effort to set up properly: the dictionaries, various ways of manipulating things, like how to transform text to a tsvector, for example, and a query to a tsquery properly. And the query language is probably, I would say, not as powerful as Elastic's, but it will be possible to combine it in a single query with other Postgres data. This is the power, right? And the data is here in the same database. So this is a super big pro for using Postgres full-text search.
And I must say it's quite advanced. You require many days to understand all the details of what is possible.
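A minimal sketch of that setup, with a hypothetical articles table, a stored tsvector, a GIN index, and a query combining full-text search with ordinary SQL predicates:

```sql
ALTER TABLE articles
  ADD COLUMN tsv tsvector
  GENERATED ALWAYS AS (
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
  ) STORED;

CREATE INDEX articles_tsv_idx ON articles USING gin (tsv);

-- websearch_to_tsquery turns user input into a tsquery; combine freely with other predicates
SELECT id, title
FROM articles
WHERE tsv @@ websearch_to_tsquery('english', 'postgres full text search')
  AND published_at > now() - interval '30 days'
ORDER BY ts_rank(tsv, websearch_to_tsquery('english', 'postgres full text search')) DESC
LIMIT 20;
```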
But about the benchmarks: I just looked at one before our podcast.
I personally haven't touched full-text search for a few years,
but I know it's not progressing fast now,
because, you know,
the same people who were driving progress
for full-text search from Russia,
they founded Postgres Pro
and had different focus,
different challenges.
So I think full-text search
would have a new
wave of attention to be developed
further. Many
things to improve
because it's a very wide topic.
But before, a few years
ago, I touched it a lot and implemented
a lot of things myself using
full-text search. And I must
say, of course, I was trying to avoid Elastic all the time
because of this.
Data is here.
I don't need to synchronize it.
No dealing with lags.
No transformations, except text to tsvector and query to tsquery.
And it's great.
But obviously, still, Elastic offers a lot of stuff.
Some people say, I want to benefit from both.
And there is the ZomboDB project, right?
Yeah, really impressive.
Yes, and I'm impressed with the technology.
I mentioned the benchmark this morning, Elastic versus Postgres.
Usually, when I have a few minutes with any benchmark, I quickly find downsides in the Postgres setup.
You remember, recently some people said they are going to beat Postgres full-text search.
It was, how is it called?
ParadeDB, I think.
ParadeDB.
And they just forgot to create an index, a GIN index.
It's not a good idea to compare indexless, obviously.
Okay, it was clear from the curves they published.
But the Postgres versus Elastic benchmark I checked right before our podcast: I tried to quickly find problems with the Postgres setup, and I didn't find any.
They tuned it properly, and the conclusion was named something like: Postgres full-text search, of course, loses to Elastic in terms of performance.
So maybe there is still an opportunity to tune Postgres there, but maybe in many cases it's quite a valid statement.
Maybe Elastic handles full-text search workloads better.
Okay.
So if you decide to synchronize, look at ZomboDB, right?
Yes.
So ZomboDB is really interesting.
I don't know exactly how it works, but my understanding was that it is backed by Elasticsearch, but you query it through Postgres.
So it's like an index access method, I believe.
So you're using the Elasticsearch index from within Postgres. So you get all those same benefits of being able to join it with other data in your system.
And it handles some of that transaction boundary stuff for you, which sounds like it would be a nightmare.
So, yeah, it's Eric Ridge, isn't it, and his team behind that.
And it's open source as well, right?
I think you can patch it.
Yeah, that's great.
And this reminds me, actually, that in the previous category,
I forgot to mention PeerDB, which is a new company, YC-backed.
There are many Postgres-related companies which are YC-backed now, and PeerDB solves the problem of synchronizing data from Postgres to data warehouses, analytical systems and so on.
And they promise to do it much better than others.
So also worth looking at them, but they just started recently, so it's a very young company.
I've met with the founder recently here in San Diego, and it was a very good discussion, I enjoyed it. And people who attack a particular area, they have a lot of great ideas. So I'm glad to see the Postgres ecosystem expanding in terms of the various additional tools people need.
All right, so, okay.
So, Postgres to an analytical system: I mentioned many names, but also look at PeerDB, which is a new player.
And Postgres to Elastic: ZomboDB, right?
Great.
So I would try to convince everyone to stay within Postgres only to solve full-text search problems.
Also, trigram search.
But I know how GIN sometimes behaves. Maybe you won't reach those levels of workload, but, I don't know, tuning it also requires effort. So maybe two systems here is not a bad idea, in some cases, right?
Maybe it's a scale thing.
For ages, you might be fine on just Postgres,
and then at some point in scale, you want to switch.
It might be one of those trade-offs again.
Maybe, but still, to me, being purely a Postgres guy, of course, full-text search in Postgres is great.
A lot of options.
My biggest concern is actually not related to full-text search itself, but the lack of ability to have a single index scan, which will work on full-text search, but also order by time or reverse order.
That was the idea behind RUM indexes, which I think would still be great to improve and finish, but this is a hard topic.
And I'm not sure Elastic solves this problem well.
Right?
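For reference, the rum extension's documentation shows roughly this pattern: one RUM index covering both the tsvector and a timestamp, so a full-text match can be ordered by time distance in a single index scan. Table and column names here are hypothetical, and whether this works well for a given workload still needs testing:

```sql
CREATE EXTENSION rum;

CREATE TABLE posts (
  id        bigint PRIMARY KEY,
  tsv       tsvector,
  posted_at timestamptz
);

-- The timestamp is attached to the tsvector entries as an "addon"
CREATE INDEX posts_rum_idx ON posts
  USING rum (tsv rum_tsvector_addon_ops, posted_at)
  WITH (attach = 'posted_at', to = 'tsv');

-- Full-text match, ordered by closeness to a reference time
SELECT id, posted_at
FROM posts
WHERE tsv @@ to_tsquery('english', 'postgres & search')
ORDER BY posted_at <=> '2023-11-17 00:00:00'
LIMIT 10;
```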
Yeah, I don't know enough about,
like it feels to me like an extra dimension of data.
It feels to me like this might be where some of the
work around
dimensionality indexing around that
might be good, but again, maybe with the downside of some missing data, like maybe accidentally missing some posts because of the accuracy.
But this is a good thing to solve. And it's not only about full-text search, it's also about semantic search that pgvector can provide.
For example, I want fresh data
to come first, which meets
my requirements in terms of
full-text search or vector search, but I want
fresh first. It's natural
desire, you know, from
social media applications, for example.
Yeah. Cool.
So what else?
Queue-like workloads, we discussed them
quite well some time
ago, and I would like to
emphasize that I don't
see people use logical
decoding or replication
from Postgres
to, I don't know, some Kafka.
Oh, why do I say I don't see it? I do see it: Debezium, right?
The Debezium open source project, it's exactly what's needed for that. But now we have this new pgmq system, right, from Tembo.io, a new company, which is worth looking at. I haven't tested it myself yet.
And also here, I would like to mention this problem of synchronization, which by default is...
So synchronization should not be done with LISTEN/NOTIFY, because of the many restrictions this mechanism has.
Logical decoding is great for synchronization; it will require some effort.
But there is also a more standard approach, maybe, for queue-like workloads, when you have something in Postgres and then you want to insert or update it somewhere else.
There is two-phase commit, which you probably want to avoid because it's slow.
But in the microservices world, they invented a lot of patterns, like object-oriented programmers did a few decades ago.
So there is the so-called transactional outbox pattern, which I find quite good to learn; it's simple, but it helps you avoid problems.
So for queue-like workloads, if you offload workloads from Postgres, maybe this is the way to go.
For example, Rails people prefer Sidekiq.
Some Rails people prefer Postgres; there are many libraries, like Delayed Job.
But some people prefer to offload tasks to Sidekiq.
Which is Redis.
Exactly. This is what I wanted because the original request was about Redis, right?
Yeah, true. So this is... I know we're about to finish because we don't have a lot of time,
but I needed to lead us to this point. Okay. So Postgres, Rails code and Sidekiq with Redis behind.
It's like some kind of popular combination, I think.
But I think many people maybe have issues with synchronization.
Probably without noticing it.
If they do it directly from Rails, not involving any two-phase commit and so on, you inserted into Postgres but didn't insert into Sidekiq, or you updated one and not the other. I mean, you need to look at how the synchronization works.
And this transactional outbox is quite simple.
It's not rocket science at all.
You just have a special table, you write events to propagate there, and then consumers offload them to Sidekiq reliably.
And in Sidekiq, you already have the processing. It's not our problem anymore, it's Redis' problem to process those events, right? Not to lose them and so on and so on.
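A minimal sketch of the transactional outbox pattern (the orders table and payloads are made up; the relay that pushes rows to Sidekiq or Redis is whatever process you run outside Postgres):

```sql
CREATE TABLE outbox (
  id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  topic      text NOT NULL,
  payload    jsonb NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

-- The business change and the event are committed atomically
BEGIN;
INSERT INTO orders (customer_id, total) VALUES (42, 99.90);
INSERT INTO outbox (topic, payload) VALUES ('order_created', '{"customer_id": 42}');
COMMIT;

-- A relay process then polls the outbox in batches, pushes events to the external
-- queue, and deletes (or marks) the rows it has delivered.
```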
But I think in this particular case, Redis probably is needed for such kind of workloads
like event processing. But let's look at Postgres. Usually, if you look deeper, there are two things to check, as we discussed in the queues episode; the TL;DR version.
First, check how old events are cleaned up. If they delete or update intensively, it's bad; it should be either TRUNCATE or dropping a partition, to avoid bloat and autovacuum issues.
And second, if you need insane scale, I mean multiple workers, they should use SELECT FOR UPDATE.
SKIP LOCKED.
Sorry.
These are two things for proper performance long term.
I mean, not degrading.
Performance which is not degrading over time.
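As a sketch of those two points, with a hypothetical jobs table: workers grab batches with SELECT FOR UPDATE SKIP LOCKED, and old data goes away by dropping time-based partitions instead of DELETE:

```sql
-- Worker loop core: claim a batch without blocking other workers
WITH next_jobs AS (
  SELECT id
  FROM jobs
  WHERE status = 'pending'
  ORDER BY created_at
  LIMIT 10
  FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status = 'processing', picked_at = now()
FROM next_jobs
WHERE jobs.id = next_jobs.id
RETURNING jobs.id, jobs.payload;

-- Cleanup of old events, assuming jobs is partitioned by day
ALTER TABLE jobs DETACH PARTITION jobs_2023_11_01;
DROP TABLE jobs_2023_11_01;
```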
Good?
That's it?
Anything else? I think so.
Yeah, last shout out to the one that's probably the most in flux at the moment,
which is the vector database
specialism. I actually think this might be a case where we don't know how the dust is going to settle yet. There's so much progress, so much money involved, so many very well funded companies; competition is high.
Yeah.
And also, there still seem to be huge leaps forward. Like the most recent release of pgvector,
the numbers that they released in their 0.5 update,
because they added the HNSW index type or algorithm,
I don't even know how to describe it,
but that was such a huge leap forward in a couple of ways
that you might want to optimize things.
So who knows what's next?
Like who knows what progress there is down the line there?
Yeah.
So I want to say the important thing.
For example, first of all, I already use it daily: pgvector.
Nice.
But I haven't reached that scale.
We haven't reached the scale where we need a good index, because we only have less than 100,000 documents yet. We will have a lot more, and in this case, I mean, I haven't done my own benchmarks, and my team hasn't done them yet. We will, because eventually we'll have many, many millions, I hope maybe dozens of millions, of documents. It's Postgres-related, by the way.
But of course, we use this HNSW, which stands for Hierarchical Navigable Small World. A weird name, by the way. I think a guy, or a bunch of guys, from Russia, from my own university, invented it like five years ago or so. Maybe the term is not from them, but the algorithm everyone tries to implement is from there. I was surprised.
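A toy pgvector sketch (real embeddings have hundreds or thousands of dimensions; three are used here only so the example runs as-is):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  body      text,
  embedding vector(3)
);

-- HNSW index, available since pgvector 0.5.0; vector_cosine_ops pairs with <=>
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

INSERT INTO documents (body, embedding) VALUES
  ('first doc',  '[0.1, 0.2, 0.3]'),
  ('second doc', '[0.9, 0.1, 0.0]');

-- Nearest neighbours by cosine distance
SELECT id, body
FROM documents
ORDER BY embedding <=> '[0.1, 0.25, 0.3]'
LIMIT 5;
```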
So I want to say that there is a very popular benchmark which compares various vector databases against Postgres, Elastic and so on: every already-existing system which just added some kind of plugins, extensions, capabilities.
And I must say this comparison still lacks the HNSW capabilities of pgvector, because it benchmarked an older version of pgvector. That's why pgvector loses drastically there. So don't trust those benchmarks; it became better. And don't rush into... For example, Supabase had an article, let's touch on it. Paul also just wrote me that pgvector with these HNSW indexes is a lot faster than Pinecone. So look at their blog post, the Supabase blog post with an additional benchmark. And I hope that the very popular vector benchmark will also be updated soon, because right now it's not fair, you know. So Postgres is good here, actually.
And yeah, it looks really good. But please, again, don't trust any benchmarks, people; check for yourself. I do trust the Supabase people. I've read that exact post, and it seems great: they even spend more on the Pinecone instance than they're spending on the Supabase instances. It seems as fair as you could get. But how are we to know that they're great at tuning Pinecone servers? You know, it's really difficult.
Yeah. I cannot keep it in, so as a small spoiler: our bot will be capable of running benchmarks, conducting benchmarks, and I think we should do it as soon as possible. In this particular case, let's check pgvector with various options, various indexes.
We will do it.
And it will be super easy for everyone to reproduce.
Just by chatting with the bot.
On Postgres, but not for comparison.
Yeah, just Postgres.
We focus only on Postgres.
But you can get all the same metrics and compare and see TPS latencies, all the
details, a lot of details.
So then think how to improve, and not only you will be thinking, but...
Okay, enough.
So Postgres is good at vector workloads.
Just try to use them.
Don't trust others.
Because others want to build new systems.
They will be saying Postgres is
bad. It's obvious.
But it's not...
If I trust the VictoriaMetrics CTO and I see the numbers myself, or with ClickHouse, or Elastic, I know that technology is super good. Here, I don't trust it yet.
Let's give Postgres a good chance
before moving on.
Do you know what?
That's a really good point and something I forgot to list
in the pros for keeping a workload on Postgres.
And, to steal the wording exactly from the Supabase blog post, they've said Postgres is battle-tested and robust, whereas most specialized databases haven't had time to demonstrate that reliability.
It doesn't mean they're not going to be super reliable in time.
It takes 10 years at least.
Exactly.
10 years.
It can't.
Yeah.
So, yeah, really good note to it.
And obviously...
Backups, bugs, ACID, a lot of stuff.
Yeah.
It takes a lot of time.
Even if you change a part of Postgres, you need a lot of time for it to become stable.
So there are systems that, each time I touch them, demonstrate issues very quickly.
But yeah.
Like there's that group that does, is it Jepsen?
Like they're doing serious testing of systems.
Distributed systems particularly.
Yeah, but even like finding bugs in non-distributed systems,
transaction isolation levels, and things like
that, some really impressive
people doing some really impressive work, but they won't have done
that on newer databases, and
people wouldn't have had a chance to
fix some of those issues yet.
Yeah, it's a really good point, and I don't mind
us being a little bit biased.
But as we see from
full-text search and analytical use cases,
it's possible to beat Postgres
for particular workloads, for sure.
Yeah, yeah, yeah.
Good point.
Cool. Thanks so much, Nikolai.
Thanks, everybody.
And keep the suggestions coming.
Yeah, thank you.
Bye-bye.