Postgres FM - PgDog update
Episode Date: January 23, 2026

Nik and Michael are joined by Lev Kokotov for an update on all things PgDog. Here are some links to things they mentioned:

Lev Kokotov https://postgres.fm/people/lev-kokotov
PgDog https://github.com/pgdogdev/pgdog
Our first PgDog episode (March 2025) https://postgres.fm/episodes/pgdog
Sharding pgvector (blog post by Lev) https://pgdog.dev/blog/sharding-pgvector
Prepared statements and partitioned table lock explosion (series by Nik) https://postgres.ai/blog/20251028-postgres-marathon-2-009

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
Jessie Draws for the elephant artwork
Transcript
Hello, hello. This is Postgres FM, your favorite podcast about Postgres.
And as usual, my name is Nik, Postgres.AI.
And Michael is here as usual as well, from pgMustard. Hi, Michael.
Hello, Nick.
And we have a great guest for the second time, Lev, PgDog.
Hello, Lev. Thank you for coming.
Hello. So thank you for having me for a second time.
Yeah, I'm sure there is a lot new to discuss.
So yeah, I think it's a great idea to see what has changed since the last time we talked, because it was around your first release, I guess, right? So where to start, where to start?
Yeah, we're in a different year, 2026, so it's been, what, almost eight months since we last chatted? Something like that, seven months?
It was a lot of... yeah, now we're in January. I think 10 months.
10 months. Oh boy. I mean, I've been working full time on this stuff, so I have a whole list of things that have been developed since then. I think the biggest thing is PgDog is very much real and in production with real companies and real customers.
You know, last time we chatted it was, like, you know, a thing, but now it's a real thing. So that's the big update. It's in prod. Most of the code, I think, I would comfortably say is stable and performant, which,
obviously I would say that, but there's real, real usage behind this, which is really cool.
Let's see, there's a long list of features that have been developed since then.
Before we go, before we go on to those, I wondered of the people using it in production,
what are the primary drivers?
Is it sharding or is it like an upgraded pooler?
You know, I feel like there are a few selling points to PgDog that we talked about last time,
and I wondered which ones were driving, like the primary adoption?
That's a great question. One of our early adopters, they went straight for sharding, which to me was like, wow, you're brave, sir.
And they went straight for it and they replaced Citus with it. So that was really cool.
Can you name some names or no?
They're on my website. One of their logos, I'll let you guess which one. I'm still trying to close them as an official customer.
So soon I'll be able to say their name, but their logo is there. And yeah,
they're pushing, I think, over 100K QPS per shard.
And they have 12 shards, I think.
Yeah, yeah, it's kind of crazy.
They're using direct-to-shard queries, so that's, you know, why it worked so early, because
they didn't use any of the fancy like aggregates and sorting across multiple shards.
That's the real tricky part, although we've made some progress there, but, you know, it's still not 100% compatible with how Postgres does it.
So on its way, though.
But other use cases, you know, load balancing is a huge, huge thing.
Like people love having the ability to talk to multiple replicas without actually manually creating a connection pool for each.
You know, there's health checks and failover.
One of the big use cases is we're deployed alongside, you know, Aurora or RDS, where Patroni is basically not used.
And it can detect primary replica failover and it'll move the traffic to the new primary, like,
within like a second or so, which is really great.
And then of course, when one of the replicas starts failing,
it'll ban it and then use the remaining replicas for queries.
So you get that, you know, that amazing, like, four nines of uptime for Postgres,
which is- I have questions about all these features.
All right.
Before that, I have an underlying question.
10 months, is it really 10 months or 10 years?
Because you use AI.
Do you use AI to develop?
I do.
Yes, Claude, Claude, right?
Yeah, I use Claude Opus specifically.
Claude Code, Opus 4.5 right now, right?
So this is what you use.
That's right, yeah.
For only certain areas of the code, not all of it.
There are some areas where I don't trust it at all, but some areas, like,
like we're building like a dashboard like with all the metrics and stuff.
Not critical paths, right?
Yeah, critical paths are still organic code, if you will.
Organic code.
H.I.
Exactly.
This is the first time I hear this.
It's hilarious.
Yeah.
Homegrown organic code, but I don't write JavaScript anymore, for example.
I don't write CSS anymore.
Like that kind of stuff.
So you use Claude Code for these additional pieces of software, not the super critical paths.
And I'm sure you ask it to deduplicate and reuse the code as much as possible, recycle existing pieces. Okay.
It's crazy how much that machine loves to generate copy-pasted code. It's allergic to
function reuse. You really have to tell it, like, hey...
Yeah, you know, I have a special additional
codebase review process which is basically targeting this duplication issue.
Yeah.
But it's possible to achieve.
Yeah.
If you don't do that, by default, you will get a lot of duplicated code.
Yeah, that's great to hear.
Okay.
I see, by the way, great logos on the website.
I missed that.
I haven't visited the front page for a while.
I see Coinbase, Circleback.
We use Circleback at work.
Some other names.
That's great.
Congrats.
Thank you.
Hard at work.
Yeah.
So what is your current pace, how often do releases happen? How do you ship? And do you still write a lot of code? I guess it's not like an everyday new version, it's less, right?
Yep. Not every day. Our release schedule is every Thursday. I'm trying to be very diligent about it. Whatever code that's in main is usually stable, but I give it a week before I tag a release. And then
every Thursday, I compile release notes and ship it out.
And then you get it on GitHub, and there's a Docker image that gets built for both, you know, x86 and ARM.
So you can pretty much run it anywhere, including on your Mac if you wanted to.
And we have a Helm chart if you're using Kubernetes; our Helm chart is getting a lot of good usage and contributions from a bunch of people.
So, you know, I'm not a Kubernetes expert, but they are.
So they're helping me along.
So if you're running Kubernetes, you should be able to deploy PgDog in, like, 30 seconds, honestly.
but otherwise, you know, it's just a container, run it anywhere.
Right. And you use libpg_query for query parsing, to build ASTs, and you recently optimized it, right? Your optimization is not merged to upstream yet, but can you tell us a little bit more about it? Also, specifically, I think my question is also about, is it worth optimizing? Like, how many microseconds do we save on a query? Should we worry about those microseconds at all?
Yeah, you should. It was, oh, so first of all, full disclosure, I told Claude to do it.
Yeah. And then it did it. I checked it. I was very open about it. I put in the PR. I know how
open source projects hate this kind of stuff. Like, you know, 3,000 lines of boilerplate.
Like, I was up front: look, guys, this is Claude-generated. I reviewed it. It works okay. But, like, give it a few months before seriously looking at it. But the idea is, in libpg_query there's a translation layer between the C library and a bunch of other bindings. Notably, we use the Rust binding, and the translation layer is in Protobuf. So every single time you call parse or fingerprint,
every time you get the AST back, basically, it goes through a serialization and deserialization
process with Protobuf inside the same process. And that's really slow, actually. So I just replaced
that, only for the Rust bindings, with direct C-to-Rust and Rust-to-C mappings of the Postgres AST, and that improved the translation part of it by, honestly, I'm not even exaggerating, like 5x. The throughput went from, I think, a thousand per millisecond or per second to like five or 10K or something like that. The benchmarks are in the PR, but it's huge. And I think another important factor is CPU utilization. For us, we're a
network proxy, so our CPU budget has to be pretty reasonable. If you can decrease it by 5x... and most of our CPU comes from query parsing. Everything else is pretty straightforward stuff, just I/O back and forth and atomics. So for us, it's huge.
So this optimization is just that you reduced the number of times the query is copied between various buffers?
Exactly. And Protobuf serialization is probably one of the fastest in the industry, but it's still very, very slow compared to just copying a C struct in memory with memcpy.
It sounds to me like shorter queries should suffer less from this, while huge CTEs, you know, multiple megabytes in size, should suffer even more, right?
Yeah, yeah, definitely.
I haven't measured how much faster slow queries or long queries get. I just took a pretty sizable query, like six hundred lines, and used that as my benchmark, because a lot of people who use PgDog do like to write long and complex SQL statements for the database.
So that's what I went with and that's where latency will really matter.
So yeah, for small ones, maybe the gain is smaller, but doing Protobuf serialization and deserialization in memory is a big waste of time when you can just map a C struct directly with a few allocations.
Right, right.
I also saw there was an optimization with, like, SIMD instructions directly, and the vector calculations.
So is this just cool stuff, or did you see some real benefit?
Well, the vector one is kind of funny.
There's a blog post about it, but basically I tried to shard a pgvector index among multiple Postgres instances. And to do that, I implemented one of their algorithms, IVFFlat, inside PgDog. And then it's able to basically separate the vectors between multiple databases and then search them using IVFFlat.
And for that, I needed SIMD to calculate, I think I'm supporting,
we're supporting cosine distance only, just as a POC just to see if anyone even cares about this stuff.
And that's where the SIMD optimization comes from,
just to calculate the cosine distance between the input and one of the centroids,
because we store the centroids basically in a config file.
Everything in PgDog, as you know, is config-driven, because I don't trust the disk to do anything. Eventually, we'll move away to something more dynamic. Right now, you pre-calculate your centroids in advance, and then you store them, and then you can basically query one shard or multiple shards to get your probes, the number of probes, if you will. And then underneath, each Postgres database has its own IVFFlat or HNSW index.
So it's an approximation on top of an approximation, which is fun, because you can scale pgvector into presumably billions of vectors, like others announced, right?
And we said it's not true, because it's not a single instance.
So yeah.
Yeah.
One billion vectors on a single instance, I haven't seen this.
No, no, it's not possible with pgvector, but with sharding it is.
Right.
So if you have a sharding schema defined, you can store a lot.
Right. So it's a great, great feature.
Is anyone using it, or was it just a showcase?
I don't think anyone.
I think I got an email from somebody who's like,
this is really cool.
And I'm like, well, this is how you need to use it.
And they're like, cool.
And I haven't heard back from them.
But yeah, it's more like a, yeah.
Speaking of performance, let's talk about two-phase commit.
There is support for it, right?
Native support.
Can you describe what your experience and view is
on using two-phase commit in larger systems?
Yeah.
Oh, we have a dog.
Um,
We have a dog.
And we have a doc on two-phase commit as well.
Okay.
Just for those who are listening on the audio version,
we just saw a dog actually walking in the background.
That's great.
Big dog.
She is the reason why PgDog is named PgDog.
Right.
You explained it to us last time.
Right.
So what about two-phase commit and the performance overhead it has and so on?
I'll be completely honest with you.
I haven't benchmarked.
it very much. Nobody really talked to me about using it yet in production, so I don't know
how much the Postgres internals themselves create an overhead. I just know from the proxy side,
it's two extra statements to execute, and also the additional complexity of managing
phase one, phase two of the commit. So if phase one fails, you need to roll back, and if phase
two fails, you need to roll forward, or commit. And you also need to make PgDog crash-resistant,
which we haven't done yet.
So there's still work to do there to make it, I think, production grade.
But from our point of view, it was just an extra couple statements to execute.
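For reference, this is roughly what those extra statements look like at the SQL level on each shard; the table and transaction identifier here are made up, and the shards need max_prepared_transactions set above zero for this to work at all:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;  -- the sharded write
PREPARE TRANSACTION 'pgdog_tx_42';  -- phase one: this shard promises it can commit

-- Phase two, once every shard has prepared successfully:
COMMIT PREPARED 'pgdog_tx_42';

-- If any shard failed phase one, roll the others back instead:
-- ROLLBACK PREPARED 'pgdog_tx_42';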
Yeah, okay, I understood.
So this is just like also like some foundation for future work maybe, right?
Definitely, yeah.
Some feedback from production systems.
Okay, I see.
Another thing I noticed, also a fun, fun feature.
You created an ID generator, right?
Which is, like, Postgres 18 brings UUID version 7,
but it's 16 bytes.
It's quite a lot of space spent for a single ID value.
Yours is 8 bytes, right?
Yeah, bigint, yeah.
It's an 8-byte integer, but it behaves similarly, so there is a timestamp component in it.
This is a cool feature, and it's going to serve us until when?
70 years, right?
70 years left.
It's not a lot, but for many systems it should be enough.
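For reference, the size difference is easy to check in SQL; this assumes Postgres 18, where the uuidv7() function is available:

SELECT pg_column_size(uuidv7())              AS uuid_bytes,    -- 16
       pg_column_size(1234567890123::bigint) AS bigint_bytes;  -- 8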
What inspired you to build this?
I have a potential production deployment, and there was big usage of omnisharded tables, or
replicated tables or mirror tables or whatever you want to call it. Basically the data in the
table is the same on all shards and they needed to generate the primary key for that row.
And you couldn't rely on a sequence, because even if you used two-phase commit, which
they originally thought they would do, there's still a chance that the sequence numbers on each
shard will be different, especially because sequences are consumed, whether the transaction succeeds or fails.
So the ID could be different on each shard.
So you needed an external ID generation service, and the first thought was about using a third database, or another database, to use its sequence, pull the value in, and then insert it into the query, which, you know, creates a single point of failure.
But it's much simpler to actually be able to generate a unique bigint inside your proxy, inside PgDog, which is not that hard to do.
Couldn't sequences just have some extra offset added? Like, for example, if we say we don't need more than 100 billion per shard, we could say on shard number one we always add 100 billion, on the second shard we add 200 billion, why not?
They need to be the same on every shard. This is the table that's replicated.
Okay, I like it, I think it's interesting. So I guess it's because we don't have unique constraints across shards,
right?
Yeah, that's right.
Well, we don't need them, basically,
because the ID generation is guaranteed to be unique.
Before.
Yeah, before.
I mean, all good database engineers will tell you you should still have a unique index,
even for UUIDs, which are theoretically guaranteed to be unique,
but even in this case, you don't necessarily need one.
So the idea is, again, so you insert the same ID into all the shards
and you need to be able to generate it quickly and efficiently.
And that's why this is, like, a first of its kind: PgDog executes a Postgres SQL function.
Okay.
Because we're using the Postgres parser, right.
So why not, you know, just hook into the query execution.
One step towards Citus.
And having the whole of Postgres in your middleware, no?
I mean, you heard it here first, on Postgres FM.
No, I don't know.
I think it's a good start, but it was so simple to implement.
You extract, like, you find the function call, and then you generate a unique ID inside the pooler.
And the way you do that is, well, you get the time stamp.
You kind of assume your clock is reasonably accurate.
Then you have an internal sequence number, so you can generate multiple IDs per millisecond.
So the clock tick is millisecond.
And then there's another unique number, call it node ID or PgDog node ID.
So you can have multiple PgDog containers in front of the same deployment,
and it will guarantee that none of them will hit a duplicate.
By design, yeah, no collisions.
Exactly. So, you know, the 70 years is a little bit arbitrary. We can reduce any one of these components and increase the timestamp component. So we can extend it by just reducing the number of PgDog nodes that you can deploy. Right now it's 1,024, which, if you're running that many, you're probably Facebook. But you can reduce that by half and give that extra bit to the timestamp. And then you can also reduce the number of per-millisecond generations. Right now it's 4 million. You can generate 4 million per second, per millisecond... It's a lot. Um, yeah, probably reduce that. Actually no, let me double check. I think it's probably 4,000 per millisecond. That sounds more reasonable. Four million per second would make sense then, yeah. That's right. Yeah.
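As a sketch of the layout being described: 41 bits of milliseconds covers roughly 70 years, 10 bits fit 1,024 node IDs, and 12 bits give 4,096 IDs per millisecond per node, all packed into one bigint. The function name, epoch, and exact field widths here are illustrative assumptions, not PgDog's actual definition:

-- Hypothetical Snowflake-style layout: [41 bits of milliseconds][10 bits node ID][12 bits sequence]
CREATE FUNCTION demo_next_id(node_id int, seq int) RETURNS bigint AS $$
  SELECT ((((extract(epoch FROM clock_timestamp()) * 1000)::bigint) - 1735689600000) << 22)  -- ms since 2025-01-01 UTC (made-up epoch)
       | (((node_id & 1023)::bigint) << 12)                                                  -- which generator produced it
       | ((seq & 4095)::bigint);                                                             -- per-millisecond counter
$$ LANGUAGE sql;

SELECT demo_next_id(1, 0);  -- a time-ordered, unique 8-byte ID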
This is where documentation or a good documentation comes in handy. That's a big change from
last time as well. Yep. The biggest thing I'm proud of is adding an icon next to every single link.
just, you know, being very attention-to-detail, I suppose.
But, yeah, like, every single comment I get from people is like, well, is this documented, is this documented, anytime I talk about a feature. And I'm like, good point.
Now it is.
Thank you.
So if it's not in the docs, I want to say it doesn't exist, but that's not true.
There's a backlog of documentation to be done.
Yeah.
And I should help with this.
I would love for you guys to help out with anything.
Not I, AI.
Oh, yeah.
Maybe I as well.
What will help with AI.
Yeah, I'm looking at PgDog from time to time and definitely looking for some use case.
And as we all know, or maybe not all know, my team benchmarked it.
So I'm looking forward to revisiting that benchmark to see how that optimization related to the Protobuf helped.
I'm very curious.
This time we are going to test not with trivial queries like a simple SELECT, that's it, but with some heavy ones. That's why I mentioned I expect some benefits for heavier queries, bigger in text. Yeah. So yeah, yeah.
Documentation, I think AI should help with it. This is my point.
No, no way, man. Because they all sound the same. They all look the same.
You could give it a couple of tips, like, sound original. But if you write text using AI, it all just reads like it was written by the same author.
But at least the reference part... The usual problem with reference documentation, which lists various functions, config parameters, everything, is that it usually lags. And to maintain it, I think AI is great.
Yeah, I mean, I don't disagree. For me, writing documentation is kind of like a ritual. You know, I go through the things that we built and I'm like, oh, document it. And then I look at what I wrote and I'm like, wait a minute, this is wrong. I'm putting myself into the user's shoes. Like, I'm reading the docs and I'm like, okay, this is what it looks like.
It's an additional point where you can verify the whole picture and see if it all makes sense or needs to be adjusted.
Yeah, I agree. Also demos, right? Like, when you think how to
show what was built, sometimes you end up changing everything. Yep. That has happened before.
Because if you simply cannot explain it, it means it probably should be done differently, right?
Exactly.
Exactly.
So let's talk maybe about the hard stuff, like prepared statements, what happened to them.
Like, how PgDog supports prepared statements in transaction mode and so on.
Can you tell us?
Funny.
You know, we can do sharding, but prepared statements is where we really have issues.
Yeah, let's talk about it.
Sure, they work, they're fine, everything's great. It works.
I think you save them, right?
Yeah, they're cached. It's a global cache, so if multiple clients send the same statement, the same statement on the server side is executed, so pretty high cache hits.
Implicit or explicit? Like, protocol-level or the PREPARE keyword?
Protocol level right now, and partial support for explicit,
you know, since we have the parser.
For the explicit ones, I need to differentiate on the number of parameters.
Sometimes you can pass the data types in the prepared statements.
Sometimes you don't.
I just need to add a little bit more support to the parser there to extract the differentiation between otherwise identical prepared statements that expect different data types for some reason.
So protocol ones work.
I guarantee it.
If they don't, I swear they do.
But the other kind is still a work in progress.
Yeah.
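For context, the explicit kind are SQL-level PREPARE statements; the parameter type list is exactly the part that has to be used to tell otherwise identical statements apart (the statement and table names here are made up):

PREPARE find_user (int) AS SELECT * FROM users WHERE id = $1;
PREPARE find_user_by_email (text) AS SELECT * FROM users WHERE email = $1;
EXECUTE find_user(42);
DEALLOCATE find_user;
-- Protocol-level prepared statements are the Parse/Bind/Execute messages that drivers send automatically.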
And have you seen any issues with prepared statements used together with partitioning, or do your clients not use partitioning at all because they have sharding?
Why use partitioning?
I think we use them with partitioning.
I haven't seen any issues.
Like, if the query, if the prepared statement's plan is wrong, Postgres will return an error, and we evict it from the cache and then prepare it again.
That simple, right? Because the first few calls are not cached yet with prepared statements, because first it uses a custom plan, for the first five times. There is a hard-coded constant in the Postgres source code. And with partitioning it's much more complicated. I have a series of blog posts about this. It's super complicated. Yeah, I'm curious about this, maybe just nobody has encountered issues yet, but yeah, I'm curious about this.
No clue.
Yeah, let me know what I need to fix.
Yeah, I understand.
Very practical.
I understand that.
Yeah, but also, Nik, I think, like,
this feels like the kind of thing that actually gets a little bit better with sharding.
Like, you just reduce the number, you reduce the QPS, you reduce the load on each.
Like, it, everything that could cause issues around that is,
however many shards you have times less bad.
or less likely to hit the cliff?
Absolutely. Sharding reduces the need for partitioning, hands down.
But maybe it doesn't fully eliminate it, because partitioning,
if it's time-based, still makes sense within one shard.
We know cases of really large databases,
which combine both approaches.
Yeah, for maintenance reasons, it still makes complete sense, right?
Yeah, so yeah, it's tricky.
And if you want prepared statements, oh, it's, like... it's good?
Oh, it's, like, super complex.
So what's, what doesn't work?
With partitioning, compared to without partitioning. So partitioning brings the plan cache decision not at planning time, but at execution time. And this increases complexity, because if you just deal with a planning-time cache, it's quite simple. So if it's not partitioned, only the planner can decide, cache or not, right? That's it. If it's partitioned, the decision point is shifted to execution time.
And there is a point when basically you are without the cache and you cannot avoid it. Because when the cache is being built, at that execution, the sixth execution or fifth execution, at that point the executor doesn't have this cache, because it's being built. So we are without the cache. And in high-QPS cases, when the frequency of calls is super high, there is basically a moment when we must live without the cache, and it might be super non-performant. So performance can be terrible at that point.
Oh, I see.
Which doesn't happen in the unpartitioned case at all. This is, like... I already start forgetting the details, because there are so many nuances if you don't live there, if you only visit it like I did for a couple of weeks and then... But I have several blog posts about this problem, with demonstrations.
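A quick way to reproduce the behaviour described above, assuming a hypothetical partitioned table called measurements (the threshold of five custom plans is the hard-coded constant mentioned, and plan_cache_mode defaults to auto):

PREPARE q (int) AS SELECT * FROM measurements WHERE device_id = $1;
EXECUTE q(1);  -- executions 1 through 5 each build a custom plan for the concrete value
EXECUTE q(1);
EXPLAIN (ANALYZE) EXECUTE q(1);
-- From the sixth execution the planner may switch to a cached generic plan; on a partitioned
-- table, look for "Subplans Removed" in the output, meaning pruning now happens at run time.
SET plan_cache_mode = force_custom_plan;  -- one way to opt out of generic plans per session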
See, this is the prime example of why being a proxy is the best place to be for Postgres sharding.
Because to me, that problem is completely irrelevant, right?
Because I'm just passing statements through, right?
I think, unless I'm, because like they will work or, or they won't.
But from a performance perspective, like, if you have trouble with partitioned tables and prepared statements, you just don't use prepared statements, right?
Yeah, yeah.
Yeah, with partitioning, we, like, we must... So partitioning also brings the problem of an explosion of AccessShareLocks, because...
Okay.
Yeah, I'm talking about this in particular, but yeah.
So if you have a lot of partitions, we have the plan cache, that's one idea, but we also have routing to a specific partition, which is called constraint exclusion, right? We need to route to a specific partition. This is another idea. And at planning time, if the planner knows where to route, this is the normal case when you don't have prepared statements, it just routes to the specific partition, all good, no problem, we lock only one partition with all its indexes. But if we postpone it to execution time, it needs to lock all the partitions with all their indexes, and it's terrible.
With sharding, I don't know, you just...
Well, you reduce the QPS.
Yeah, this already helps. And the lock manager is not a problem anymore.
Or you just need to use Postgres 18 and raise max_locks_per_transaction.
In this case, the lock manager is also fine.
So yeah, anyway, this is an interesting topic.
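A sketch of how to observe that lock fan-out on the same hypothetical measurements table; with a generic plan, every partition and its indexes are locked at executor startup, and Postgres 18 sizes its fast-path lock slots from max_locks_per_transaction, which is why raising it helps:

SET plan_cache_mode = force_generic_plan;
PREPARE q (int) AS SELECT * FROM measurements WHERE device_id = $1;

BEGIN;
EXECUTE q(1);
SELECT count(*) FROM pg_locks
WHERE pid = pg_backend_pid() AND mode = 'AccessShareLock';
-- With many partitions this counts a lock for every partition and index, not just the one touched.
COMMIT;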
And I'm curious if some people combine sharding and partitioning.
I think they do sometimes. They do sometimes, and it's just in really big setups.
Yeah. I mean, as the guy who is trying to build sharding: you should always do sharding unless you can't, but, like, that's, you know.
I think there's a maintenance argument. Imagine if you've got like a logs table per shard and you want to delete old logs after a set, you've got a retention period. You probably still want partitioning to make those deletes really efficient. Like I think there's still a use case for it.
but it's not so much for performance reasons, but it's for maintenance reasons. That'd be my guess.
Yeah, to build indexes, to have vacuum working more efficiently.
Yeah, but I think those are just... I still categorize those slightly differently. I know they are maintenance, but they're kind of still performance reasons, right? You want them because you don't want the table to be too big. I'm talking about doing, like, deletes efficiently, being able to, instead of having to do batch deletes, you can...
Exactly. It's, exactly.
It's called not data retention, but there is some additional term... yeah, yeah, the opposite of it, yeah.
Yeah, there could be, like, a legal requirement for data retention. I think that's the only reason why people would ever delete data, right?
Well, the other reason is performance. So if you eliminate the performance argument, like if you make performance just free, because, you know, sharding makes that... I mean, obviously that's not the case, but yeah.
And if you compare sharding and partitioning, with partitioning we are often willing to have one active partition where we write, and all others are archived.
In case of sharding, it's strange, right?
Because a lot of hardware is not used.
Like if you have only one shard read, write, and others like archive, it's strange,
simply strange, right?
So you want to distribute rights, usually.
You want it to be equivalent.
You want like a perfect, like, all shards have the same load.
And that's why you'd usually use like hash-based partitioning,
if you can. Or, again, the way you design your schema for sharding is you make sure that each shard roughly has the same load. If you have a hot shard, I mean, you're going to have a hot shard problem sooner rather than later.
Yeah, exactly.
Especially for, like, multi-tenant, what I call multi-tenant B2B, basically, where your sharding key is, like, a customer ID, if you will, and then one of your customers grows big and you need a second sharding tier or, if you will, a second partitioning...
Does PgDog support resharding without downtime?
Oh yeah, yeah, of course.
How does it technically work?
I mean, it's good. It's been tested, so that's kind of cool. The idea is, I mean, I don't want to say simple, but I think practical. For resharding, this is where the contrast between how this works and how, like, NoSQL databases work is a little bit more apparent. With NoSQL, you kind of add another machine, and then it kind of rebalances all the data using the existing machines. The PgDog approach is you create a brand new, empty cluster, and then you copy all the data out into the new cluster. And this design decision is actually not because of a particular limitation, it's intentional.
Because you want to reduce the load on the existing cluster during resharding, because usually when you start resharding, it's already under load, it's already underperforming, it's already having problems. So you don't want to add additional load to it with resharding. So you create a
brand new cluster, you move the data over and we are able to use replicas since Postgres 16
to copy data out with logical replication. That was, that was a huge thing. That was, I think,
one of the game changers that made this possible is logical replication slots on replicas.
Why? Why does it matter? Because you can create 16 replicas and parallelize your copying.
Why does it matter? Because of disk I/O, or why does it matter? Like, you don't want to load your own primary, or what?
Well, you definitely don't. Your primary is already shot. Like, when you're sharding, your primary is already having a really, really bad time. That's the main reason behind sharding, you want to scale your writes. So adding additional replication load to an already, like, underwater database is bad. And especially when you want to create, let's say you go from one shard to 16 shards, an extreme example but realistic, you're going to add 16 replication streams to your primary. That's like 16 times streaming the same data. The thing is going to fall over, I guarantee it.
But the overhead is amplified here, so the same work is done on different CPUs, vCPUs, I guess.
Yeah, yeah, decoded and re-encoded.
Yeah, yeah, yeah.
So you want to shard this load, right?
Exactly.
Exactly.
Like we're able to use one replica per table, basically, for resharding.
So if you have like, say, a thousand tables,
you're able to shard your syncing stream and basically split it into 16 different machines
or 32 different machines and sync it in parallel, which I think is pretty huge.
It's all the same LSN, right?
Because they're replicas, right?
You can synchronize at the end and make sure that you're at the right point, and, like, pause the live traffic and then wait for the logical stream to catch up and cut over to the new cluster. Like, all of that, the pieces are there, you just need a button to make it work. We haven't built the button yet, but we built all the levers you need to push, you know.
So when you say you create a cluster, you mean create a Postgres cluster, basically one new primary?
That's right.
And when you say you copy, it means that there is already something in PgDog that copies, you just need to use it? It's not on the shoulders of the user, right?
No, no, we hook into the replication protocol to stream the rows, and we use COPY, binary COPY, underneath, and we shard the copy stream.
All right, this is the option since Postgres 16, or 14 I think, binary...
Oh, it's been a minute, yeah.
Yeah, you initiate a logical replica, I guess, with this option, binary equals true, for the initialization part. Interesting.
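PgDog speaks the replication protocol itself, but the plain-SQL equivalent of the same primitives looks roughly like this; the object names are invented, and logical slots on physical replicas need Postgres 16 or newer:

-- On the primary of the old cluster (the publication definition reaches physical replicas through WAL):
CREATE PUBLICATION reshard_pub FOR ALL TABLES;

-- On each new shard, point the subscription at a replica of the old cluster:
CREATE SUBSCRIPTION reshard_sub
  CONNECTION 'host=old-cluster-replica-1 dbname=app user=replicator'
  PUBLICATION reshard_pub
  WITH (binary = true, copy_data = true);  -- binary format for the initial copy and the stream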
Yeah, there is also a trick to convert a physical replica to logical, which is official since Postgres 17, I think. It's called pg_create_logical... pg_createsubscriber, a CLI tool.
Oh, interesting. I'm using, like, pgoutput, which streams everything in text format.
Yeah, yeah. But yes, right, but still there are additional things here. You can convert a physical replica to a logical replica, and we could do it before, but now it's official.
Oh, pg_createsubscriber. Unfortunately, we won't be able to use it on most managed services, because they don't expose this.
They should, actually. Yeah, they should expose this in the API, I hope they will. But in the case of self-managed, it's good, right?
Yeah, with self-managed you have a lot more options, for sure. I think the original goal was to make this work in managed environments, because that's where the...
Right. Yeah, you already mentioned Aurora, support for Aurora. That's great.
Yeah. And these replicas, so you create a logical replica by copying data and performing CDC,
right from a slot on a replica, which is supported since recently, right?
I'm curious, when you initialize a logical replica from this physical standby, does it have hot_standby_feedback on or off? Does it report to the primary that it still needs some data? Because if it does, the primary's health can be affected. If you eventually switch over, then it doesn't matter, but if you roll back... So I think, if this physical replica is not used by anyone else, I would keep hot_standby_feedback off.
To let it lag.
I would raise the max.
There are a couple of settings which allow a replica to lag.
If hot_standby_feedback is off, it's allowed to lag.
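The settings in question, sketched for a dedicated reshard-source replica; values are illustrative:

ALTER SYSTEM SET hot_standby_feedback = off;        -- don't make the primary hold back vacuum for this replica
ALTER SYSTEM SET max_standby_streaming_delay = -1;  -- never cancel replica queries; the replica is allowed to lag instead
SELECT pg_reload_conf();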
There are many, many nuances here in this process.
I'm curious how you do it.
But, okay, you're using a logical replica here.
Yeah, that's a good question.
Actually, my recommendation is to create a dedicated replica for this, or a dedicated set of replicas, but without hot_standby_feedback. I would do it; if it's a temporary replica, why not?
And, you know, also, how is this replica created, right? It's an interesting topic. Do you integrate with some backup tools, or is that just out of scope for PgDog?
I feel like at the moment it is, but maybe in six months it won't be.
Like right now I'm relying on people using like some kind of reliable way to create replicas.
If you use RDS, you just click a button.
Again, like if you're using like any of the managed services, you just click a button, you get a replica.
It's using disk snapshots.
If you're self-hosting, like if you have ZFS, presumably you'll have your snapshots somewhere.
You can restore from a snapshot and resynchronized pretty quickly.
But even then, it takes a while to sync the ZFS snapshot on disk.
Tested that one.
It takes minutes.
sometimes hours.
Oh, well, if you have like a 2 terabyte file system,
You mean the first send/receive?
Yeah, between the hosts?
Yeah, I understand.
Yeah.
It's surprisingly limited by, I think, actually, network.
I needed to, like, compress it and then do something else to it and...
I wrote a blog post about it on our old blog at PostgresML.
Oh my God.
I know.
Thank you.
but the website went down
so my shame went down with it as well
the blog post is gone
I think it was a good blog post, but I learned how to use
compression tools and ZFS and
you know big deal
Just out of interest,
are people generally using this with a managed service,
or is it quite split between
managed service and self-managed?
Right now it's split.
A lot of managed service usage,
especially, like, failover
for Aurora is a big deal.
Like failover detection, because you don't run Patroni typically.
So it's able to, like, query the pg_is_standby...
Sorry, what was the name of the function?
pg_is_in_recovery.
pg_is_in_recovery, thank you.
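That check is a one-liner, roughly what a health probe runs against each host:

SELECT pg_is_in_recovery();  -- false on the primary, true on a replica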
In my opinion.
Stop confusing people.
The people who know, know,
and the people who don't know don't need to know.
Yeah, and I'm sure you know,
Aurora loves to fail over on a perfectly good day.
So it helps a lot with uptime.
But one of our biggest sharded deployments
is literally on-prem, on-prem.
Like they rent their own data center and everything.
So it varies.
Like that's the beauty of it, like operating
on the application network layer.
You really don't care that much where you're deployed.
So based on this, almost a year since you started this,
what's your current vision for your
like, customer who needs PgDog? And what's your vision for future steps, like, where do you move?
Yeah,
this is a really interesting question, because sharding is always going to be the North Star, because it explains to people, like, this is a scaling product, ultimately. You can start anywhere. I already feel like it's a better connection pooler than PgBouncer. I already feel like the load balancing aspect of it is a game changer. And sharding, as much as it is a work in progress, is the North Star for us. That's how we want this project to evolve. We want to make Postgres as scalable as Cassandra or Scylla or Elasticsearch. I don't know what people think about those particular databases, but I think they're, like, horizontally scalable, that we can all agree. So we want Postgres to be like that as well. We're building, it hasn't been released yet, we're building an enterprise edition of PgDog, which adds, you know, the classic things like monitoring, understanding how everything works, like a control plane, something that we can sell as a product to people. You know, so you don't have to go through the many, many steps required to set up a sharding layer network proxy.
So that should come out within the next few months, hopefully.
Will it include automated sharding with AI?
Like, press a button and your database is sharded already.
You want automatic sharding with AI as part of our feature set?
I think we just want more consulting clients.
I don't think we'll ever get automatic sharding with AI, but I don't know.
Right now my idea for that is like,
dump the... use pg_dump, get the schema out,
pipe it into ChatGPT and ask it what to do.
With the prompt.
And test.
Yeah, exactly.
And test, yeah.
I don't know.
A human being is able to come up with a sharding key really quickly.
Like I talk to like a,
like a senior database person at a company or an app
engineer and they're like, oh, my sharding key is, like, they already know it's org ID or customer
ID or like they instinctively know how the data is organized and they even know what the edge
cases are. You know, like, I don't think you need AI for that one unless you literally have,
you've never seen this database before. But yeah, that's, that's the way I think about it.
Interesting.
On the vision stuff, how closely are you paying attention to the others doing sharding in the space, and how do you differ in terms of what you think your North Star is versus what you think theirs is, based on what they're saying?
I'll be completely honest with you, I literally never look at them. And it's not because, like, I think I'm cool or something, it's literally, I just don't have the time. Like, I have something in my mind, I'm building it. What the other guys are doing is usually more of a distraction than a learning point to me personally. So I have my vision, I have my North Star, and I'm going to follow it until I hit a wall or something. And so far I haven't, so
that's how I think about it.
Great.
A good note to finish this podcast on, I guess.
Is there anything left that you wanted to mention that we haven't asked about?
Well, let's see.
I'm going to do a post on Hacker News soon with all the stuff that we've done.
But I'm just going to blab with you.
It's kind of boring.
We added a bunch of support for aggregate functions.
We do query rewriting now, which is really cool.
We take the AST and we add whatever we need to that query to execute it.
One thing that I think is really, really phenomenal is we support sharding key updates.
That's something I think Citus never got around to doing: changing the actual value of the sharding key and moving the rows between the shards.
Wow.
That one's really fun.
I don't know how often it's going to be used, but anytime somebody asks about sharding,
they ask about that feature, so it must be useful.
You know, we support multi-tuple inserts, we rewrite insert statements, we separate the tuples and we send them individually to each shard.
You know, all of that happens transparently, which is really cool.
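As an illustration of what that rewrite amounts to, with an invented table where org_id is the sharding key:

-- The client sends one statement:
INSERT INTO users (id, org_id, email) VALUES
  (1, 10, 'a@example.com'),
  (2, 77, 'b@example.com');

-- If org_id values 10 and 77 hash to different shards, the proxy forwards one rewritten statement to each:
-- shard 0:
INSERT INTO users (id, org_id, email) VALUES (1, 10, 'a@example.com');
-- shard 1:
INSERT INTO users (id, org_id, email) VALUES (2, 77, 'b@example.com');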
We're just, we're focusing on usability.
Like the idea is you just deploy this and you don't change your app.
Like don't rewrite your queries, don't rewrite your code and just make it work.
But, you know, obviously, we're not there yet, but we're getting closer.
So every week there's something new that comes out.
And it comes from the field, right?
Yep.
It comes from people asking me, hey, is this supported?
I'm like, yes.
Almost.
Next week.
Thank you for the suggestion.
Cool.
Yeah.
Production driven development.
It is.
I really enjoy it because every single feature is immediately used.
It's great.
Cool.
Great.
Thank you.
Thank you for the update.
It's really good to see the project is growing.
Excited to test it once again soon.
That'll be great.
And I still have to review and merge your PR.
It's okay.
But for, I believe, 10 months since the last time we talked, something like this, it's good. Well, yeah, really good to see you again, Lev. Take care. Thank you so much.
Yeah, thanks so much for having me, guys.
Bye-bye.
