Postgres FM - PgDog
Episode Date: March 28, 2025
Nikolay and Michael are joined by Lev Kokotov to discuss PgDog — including whether or when sharding is needed, the origin story (via PgCat), what's already supported, and what's coming next.
Here are some links to things they mentioned:
Lev Kokotov https://postgres.fm/people/lev-kokotov
PgDog https://github.com/pgdogdev/pgdog
PgCat https://github.com/postgresml/pgcat
Adopting PgCat (Instacart blog post) https://www.instacart.com/company/how-its-made/adopting-pgcat-a-nextgen-postgres-proxy
PgDog discussion on Hacker News https://news.ycombinator.com/item?id=43364668
Citus https://github.com/citusdata/citus
Sharding & IDs at Instagram (blog post) https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c
Sharding pgvector (blog post by Lev) https://pgdog.dev/blog/sharding-pgvector
~~~
What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!
~~~
Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai
With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello, hello, this is Postgres FM. My name is Nikolay, Postgres.AI, and as usual, my co-host is Michael,
pgMustard. Hi, Michael. Hello, Nikolay. And we have a super interesting guest today, Lev Kokotov. Hi,
thank you for coming. Thanks. Hi, Nikolay. Glad to be here.
Yeah, and the reason we invited you is when I saw PgCat,
it was interesting, like at that time I had a spike of interest in sharding solutions.
And I saw PgCat was started as a connection pooler
alternative to PgBouncer.
Actually, at the same time, a few other teams started
some new projects, but then sharding was, okay, sharding was done in quite a straightforward way, just with comments.
It was great, interesting.
But I'm not a cat guy.
I'm a dog guy.
And once I saw PgDog started, focused on sharding,
it obviously attracted much more attention from me.
So that's why we invited you, actually.
Awesome. Yeah. Let's talk about, I don't know, like maybe sharding itself, and probably let's
start with the idea that not everyone needs it, because on a single Postgres cluster you
can grow and reach a multi-billion valuation, go to IPO, I have examples.
But obviously sometimes it's really tough, right?
Tough job to maintain a huge monolith.
So what's your take in general about the idea
that you don't need sharding?
Oh yeah, well, that's a good one.
I agree in principle.
So, you know, Postgres can be pushed quite far. And we
pushed it very, very far at Instacart, pre-IPO, I should add. But we IPO'd as a massively sharded
database. And we absolutely had to, because a lot of things started to run quite slow.
We had a lot of writes. Instacart was... you'd think most apps are 90% read,
10% write. I don't know if that was the case for us. I think we were a little bit more like
maybe like 80-20, maybe 70-30, but we were doing a lot of writes.
Hold on. 80-20 means 80 writes or?
No, no, no.
Reads, as usual reads, right?
Yeah, yeah, reads. Yeah, yeah. I'm thinking like 90% typical workloads, 90% reads, 10% writes.
Or even more sometimes.
Social media is definitely more reads, right?
But here's different.
Yeah, it was a little bit different, not by much, but even that small percentage, like
10%, 15% writes, was enough to push an r5.24xlarge, I don't know, like 192 cores, almost a terabyte of
RAM, over the edge.
Sounds like RDS.
Actually, I'm super curious, it's off topic, but I'm super curious about the so-called Instacart
zero-downtime upgrade approach, because we believe at Postgres.AI, we believe it leads to
corruption, but I hope we will discuss it at a different time.
Let's hold on to this.
Yeah, happy to talk about it on another podcast. I was one of the guys who worked on it.
Consider you invited. Zero-downtime upgrades made on managed Postgres,
which doesn't allow changing recovery_target_lsn.
Yep, I did it myself.
We had zero downtime.
Yeah, it's a super interesting topic.
Maybe in a few weeks, I'm really looking forward to it.
Sounds good.
So yeah, I understand that writes,
we can scale reads easily because it's just
a replica until some point.
But writes, we have only the primary.
That's, and you need to go either to services, microservices, or sharding, or a combination of them,
right? There are several routes, actually two major routes here. Yeah, Instacart actually sharded, what
we call functionally sharded. I don't know if that's a real term in the industry, but we just
took tables out of our main database and put them into a different database, you know, functional
sharding, air quotes, vertical split, vertical partitioning.
Exactly.
And that happened even before I joined the company.
So that was way before, you know, IPO talks and all that stuff.
So you can run out of capacity on a single machine
quite quickly if you use it a lot.
Sorry, that's a little bit of a tautology,
but that's just the case.
It depends on your workload.
We did a lot of machine learning.
So we wrote a lot of bulk data, like, daily.
We would have hundreds of gigabytes of data
that were completely new data every day, basically.
So a lot of writes were happening,
and sharding was the only way basically.
We were running out of... Our vacuum was always behind, and when vacuum is behind, everyone gets a little bit scared.
You know, performance is worse as well. You know, we were getting a lot of lock contention on the WAL.
That's, you know, that happens a lot. You know, the WAL is single-threaded to this day, and that's totally fine, but
you end up, you know, when you write like hundreds of thousands of transactions per second. Well, single-threaded is if you
use the WAL writer, but backends can write to the WAL themselves if you turn it
off, this setting. Yeah, you're right, you're absolutely right,
but there's still a lock, there's still a lock, yeah, they have to grab a lock. We see
in Performance Insights, if it's on RDS, like, queries are spending time on
LWLock.
That's the guy.
And that's the guy who takes your website offline on Sunday every morning, like a clockwork.
WAL write contention, of course.
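For reference, a minimal sketch of that kind of wait event check, using pg_stat_activity (in practice you would sample it repeatedly; wait event names as in recent Postgres versions):

```sql
-- Sessions currently waiting on WAL-related lightweight locks (e.g. WALWrite, WALInsert)
SELECT wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event LIKE 'WAL%'
GROUP BY wait_event
ORDER BY count(*) DESC;
```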
And I would say traditional optimization would be, let's write less, but I assume only in Postgres 13
WAL-related metrics were added to pg_stat_statements and EXPLAIN. So before that it was very difficult to
understand which queries, well, you can, based on the wait event analysis,
identify queries which are WAL-, well, write-intensive, right?
And maybe optimize them, but it's still a limited approach, right?
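For example, something like this surfaces the most WAL-heavy statements, assuming pg_stat_statements is installed on Postgres 13 or newer:

```sql
-- Top statements by WAL generated (wal_bytes, wal_fpi, wal_records exist since Postgres 13)
SELECT left(query, 60) AS query,
       calls,
       wal_bytes,
       wal_fpi,      -- full-page images, often the hidden cost after checkpoints
       wal_records
FROM pg_stat_statements
ORDER BY wal_bytes DESC
LIMIT 10;
```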
Well, it's limited in its success in a company because when you tell somebody like, hey,
can you write less to your database, people are like, I'm sorry, but is that your job?
Like, you can write less, but we have...
WAL writes are our problem. For a query,
consider a query which writes to a table, and two situations.
This table has a couple of indexes versus 20 indexes.
And when we insert something, every index gets some insert.
And it amplifies the amount of WAL written for the same insert, for each index we have, right?
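A rough way to see that amplification from psql, as a sketch with a hypothetical orders table; note that the LSN is instance-wide, so concurrent activity on a busy server gets counted too:

```sql
SELECT pg_current_wal_insert_lsn() AS lsn_before \gset
INSERT INTO orders (customer_id, total) VALUES (1, 9.99);
-- WAL bytes generated since the first call; grows with every extra index on the table
SELECT pg_wal_lsn_diff(pg_current_wal_insert_lsn(), :'lsn_before') AS wal_bytes;
```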
It's really important to know that every single index has
an important business use case.
And that's why it was put there in the first place.
So, like, all of these are invariants.
Like when you say, like, I have 20 indexes on a table
and it's write amplification, like, that's by design.
And when you want to write more data to that table,
because your business is growing,
like us as a database engineering team,
we're like, all right, we're going to make it happen
because we can't go to like 25 individual teams
and tell them like, hey, can you fix your query?
They're going to be like, I don't have the time.
I'm trying to sell groceries here.
You know, like I have other concerns.
But anyway, I'm trying to say, I agree with you that
without sharding, we are limited, like,
we need to squeeze, squeeze, squeeze, but it's very limited. And at some point, you cannot say we can
grow 2x anymore, right? That's right. And that 2x is actually the low bound. What most engineering
leaders expect is like 3, 5, 6x. Yeah, 10, to have that runway. If you don't have that runway in your
system, it's a red flag for most. And it goes to CTO level, basically. The decision is like,
should we migrate off Postgres to I don't know where. Right. So exactly. I've seen this so
many times, including sitting in the CTO position myself, like 15 years ago. I know this pain, sharding.
It's really funny.
You're just kind of sitting there and be like,
hey guys, I need 10x capacity.
And your database engineer is like, well, it's the WAL.
There's nothing I can do.
It's the WAL.
It's like, there's just like a disconnect
between these two people.
You need to like, you just need to make this thing scale.
Like that's the end.
And a lot of these are out of control.
But let's, like, we can also enable compression for full-page
writes.
We can also partition so writes become more local.
I don't know, you're skeptical.
Because all of that, partitioning is great.
I would never speak badly against partitioning.
Not every partitioning.
Sorry for interrupting.
Not every partitioning. If, for example, people say,
okay, we are going to spread all customers
evenly among partitions,
this is not going to help us in terms of WAL writes.
But if you say we will have a hot partition,
like, receiving most writes,
and kind of archive partitions where autovacuum
and everything has, like, stabilized. In this case it's super beneficial for WAL writes and write-
intensive workload optimization, right? Absolutely, but then you're gonna end up
having to read those cold partitions anyway. It's a temporary band-aid, which
we did. Again, absolutely we partitioned a lot of tables and that helped us with a
lot of things, but at the end of the day, you just need more compute.
It's the bitter lesson of databases.
I don't know if you're familiar with the bitter lesson in AI where you just need more machines
to solve the problem.
It's the same thing in databases.
You just need more compute to just be able to read more stuff and write more stuff.
Yeah.
I'm still pulling us back intentionally, because we are having conversations about,
like, scaling problems, and an interesting insight I got yesterday, talking to a very experienced
Postgres expert, is that partitioning not only helps with, like, data locality, write locality
and full-page writes and WAL amount as well, but also with backups, surprisingly,
if we do incremental backups with WAL-G, pgBackRest,
with the new incremental API, I still
need to learn about this more.
Or snapshots, cloud snapshots of EBS volumes, and RDS
also relies on it, as I understand.
Imagine writes are spread out everywhere,
and you constantly change a lot of blocks, versus you change
only specific blocks, and some blocks are not changed.
It's less pressure on backups as well.
It's interesting, and it affects DR and RPO, RTO.
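To make the hot-versus-archive idea above concrete, a minimal sketch with a hypothetical events table, range-partitioned by time:

```sql
-- The newest partition takes nearly all writes; older partitions go quiet,
-- so autovacuum, full-page writes and incremental backups touch far fewer blocks.
CREATE TABLE events (
    id         bigint,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');  -- effectively archive
CREATE TABLE events_2025_03 PARTITION OF events
    FOR VALUES FROM ('2025-03-01') TO ('2025-04-01');  -- hot
```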
But I just want to test my understanding. If we go to sharding blindly and miss some
optimizations... I remember I was dealing with Mongo, maybe 10-plus years
ago, and I saw each node of Mongo could handle much less workload than one node of Postgres.
Maybe jumping to sharding too early
is premature optimization, and you will
not be in good shape in terms of how much money
you spend on compute nodes,
because you missed a lot of optimization steps.
What do you think?
Oh yeah, absolutely.
If your first step is like, my database is broken,
I need to shard, you missed a lot.
Of course, you should look at all the possible
optimizations. And that's really important to keep you, like, on a
single machine. I think the benchmark for me, for when you
need to start thinking about sharding, is when you've tried like
three or four or five different things. And then you're like a
person or two working on it full time. And at the end of the day,
they're like, I ran out of ideas.
Do you have some numbers, like, for modern hardware,
like how much WAL can be written per day, for example,
like a terabyte, five terabytes, where it's already at the edge?
Right?
Yeah, for sure.
I can give you the exact number.
We benchmarked Postgres in the best possible scenario,
actually.
It was on EC2, it had NVMe drives, it had ext4, RAID
zero. So we did not want durability, we wanted performance. That thing could write four or
five gigabytes per second using Bonnie++. Now Postgres was able to write about 300 megabytes
per second of just, like, whatever. I think we were using COPY, like the fastest possible
way to dump data into PG. And it's just, it's the nature of the beast, right?
You have to like process it,
you have to put it into a table.
I'm sure there's a lot of like checks and balances
going between and again, lock contention
and all of that stuff.
So you can't necessarily squeeze out everything
out of your machine if you have the capacity.
So ultimately you need to split the processes themselves between multiple machines once you reach that number. And then, you know,
for everything else, if you have, again, if you have write amplification, things are getting
a little bit more tricky. And so I think that's usually the number: once you're
writing, like, you know, 200 megabytes per second of just WAL, and you can see that in RDS,
there's a graph for that, just
WAL writes. 200 megabytes of WAL per second; each WAL file is 16 megabytes, but on RDS it's 64,
they changed it. So how many WAL files is that, like, it's a lot, right? It's a lot, and that's the
scale you're looking at. Ten-plus per second is already too much.
Yeah, I agree.
Yeah.
And so that's the red line.
Right?
If you're at the red line and you're thinking about sharding, we're going to have a bad
time.
And that's okay.
It's okay to have a bad time.
These lessons are learned when you're having a bad time.
I had a really bad time.
Yeah.
And I pulled the calculator just to check 200 megs per second, it gives you 17 terabytes
per day.
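The arithmetic behind that, as a quick check (runnable as plain SQL):

```sql
-- 200 MB/s of WAL with the default 16 MB segment size
SELECT 200.0 / 16                  AS wal_files_per_second,  -- about 12.5
       200.0 * 86400 / 1024 / 1024 AS terabytes_per_day;     -- about 16.5
```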
That's mild, honestly, for what we did at Instacart.
We had, like, petabytes of WAL in S3 after, like, a week.
And that's just, again, it's the nature of the beast.
From a single cluster or multiple clusters?
Single cluster.
That's insane.
It is.
It's called selling groceries.
What can I tell you?
We had like, you know, hundreds of gigabytes
of machine learning data that would come in every day
because the model was retrained on a nightly basis
and all the embeddings have changed, right?
Again, the use cases are almost like,
I mean, they're interesting, but they're kind of off topic, I guess, because they're just
like, you just write a lot of data.
That's what you do for a living, right?
That's what Postgres is for.
That's what databases are for.
So the numbers are like, well, this is the number and that's when you should split up.
Yeah.
So let's probably, like, just for more sane situations, let's say 10 terabytes per day
or something of WAL data, it's already too much.
If you approach that, soon it's already a wall
you will be hitting.
Several walls inside Postgres you will be hitting.
Including lightweight locks.
Yeah.
Yeah.
Several walls.
Yeah.
I also wanted to say, lightweight locks,
WAL-related lightweight locks, WALWrite, right?
I don't know.
There are several of them.
Yeah, but it's a...
Yeah, this is the scale to be scared about.
It's hard to put a number on it because you can push that
further and who knows what's going to happen to your app.
Maybe that's as high as you'll go.
And not everyone has local NVMe's because in cloud environments, they are ephemeral.
It's kind of exotic and risky.
It is. It's definitely risky.
I love those because when you click Reboot on the machine, it wipes the encryption key and all your data is gone.
So you better be careful which button you press.
There's like a hot restart, which is safe,
and there's the actual restart,
which will delete all your data.
So, you know, stay on RDS if you can.
But no, I don't agree here, because you just,
you should just use Patroni and multiple replicas,
at least three nodes, and you'll be fine to lose one node.
You lose one, something random, bad, anyway.
That's off topic, yeah.
Okay, thank you.
For me, it's enough to understand
that there are cases definitely
when you do need to split vertically
and splitting vertically usually is limited as well.
It's hard sometimes and at some point you need, we need sharding.
Yeah, so it's enough to explain that definitely
there are cases where you need to scale Postgres
beyond one primary.
It can be vertical or horizontal
and the result is sharding, right?
That's right.
Good.
Yeah, this is great explanation.
Well, sharding is definitely needed, but not for
everyone maybe, but for many.
I mean, yeah, I hope not for everyone. That would be, I mean, you know, we solved the
problem once and for all. Actually, last week, Michael and I had discussion about snapshots
and I said that Cloud SQL has a hard limit of 64 terabytes, because this is the limit for GCP persistent
disks, including PD-SSD.
And I said RDS obviously allows you to grow further, beyond 64, but it doesn't.
After we recorded, I checked: same limitation, 64 terabytes. It's another reason to have multiple primaries and
scale beyond one cluster. Well, you'll hit that sooner on Postgres because of the 32-terabyte table
limit, and actually I did hit that limit once, and that was a scary error. The only thing that it
says is like there's no more bytes to be allocated or something
like that in some kind of database file.
I'm like, what?
And then you're done.
There's nothing you can do.
You have to move data out of that table.
You have to take the database offline, obviously.
Even if it's a single table.
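For reference, that limit follows from 32-bit block numbers times the page size; a quick check:

```sql
-- Max heap size per table (excluding TOAST and partitions): 2^32 blocks x block_size
SELECT current_setting('block_size')::bigint * 4294967296 / 1024^4 AS max_table_size_tb;  -- 32 with 8 kB pages
```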
Okay.
Okay.
Yeah.
Yeah.
A single table.
Yeah.
Which is actually not uncommon, like, 32 terabytes these days.
Well, it's still not super common as well.
I see only cases like kind of 10 terabytes per table
and it's already when any DBA should scream,
where is partitioning?
DBAs like to scream, but application engineers be like,
I can't, aren't you the DBA?
Partitioning unfortunately requires application code changes.
Yeah, so sharding is needed for big projects, obviously.
Agreed.
What's next? Can we go... I reckon it's time to go back to the origin story of PgDog, and obviously
then PgCat.
It would be great to hear like a little bit of that story, Lev.
Yeah, absolutely.
I'm happy to.
Do you guys care about the name of the project?
I don't know.
Or do you want to know what the inspiration for that came from?
Sure, it's fun.
Yeah.
Okay. Well, when I started PgCat, we just got a new cat
and I really loved that cat.
And I was working on Postgres,
so the two things came together naturally.
You could probably guess the origin story of PgDog now.
I got a dog and then I'm like, look, I love this dog.
What could I call my next project?
Obviously PgDog.
Yeah, so the naming issue is solved.
PgCat came from an idea that was really simple.
Sharding was not even in scope back then.
It was just like, we ran PgBouncer.
PgBouncer could only talk to one database at a time.
It didn't make sense.
You were pooling like just the Postgres instance.
We had a bunch of replicas.
We needed to load balance.
And we needed a load balancing algorithm that was smart.
When a replica went offline because of, again,
hardware issues, scaling issues, whatever,
we needed to remove it from the rotation
without affecting the app.
So we would regularly lose a replica
and then most of the site would go offline
because we had a Ruby gem that would randomize
access to those replicas, and when one of them broke...
It worked okay, but doing this in the application code is really hard. Especially in Ruby,
there's a way to inject exceptions into Ruby threads, like Sidekiq does, and basically that breaks your state.
So we had multiple gems working against each other. And we just needed to do that in a place
where it made more sense. A load balancer is typically outside the application
because you have multiple applications
talking to it anyway.
So I just built a load balancer basically.
It was after actually I left Instacart.
I was just doing it as a site project
just to keep my mind going.
So I built it.
It was really simple, used, you know,
banning logic, which was kind of novel at the time
for Postgres, you know.
If you receive one single error from the database,
it's removed from the rotation. It's very aggressive, but Postgres never,
never throws errors like network-related errors unless it's a serious problem. So that actually
worked pretty well. I talked to a friend of mine at Instacart, I was like, hey, look, I built this
on the side. That looks fun, right? Like we thought about this. And he's like, yeah, all right.
Didn't you quit? Yeah, but now I have some free time to work on this. Right. And he's like, okay. And then he took the code, added a bunch of features that I didn't add.
Cause obviously I didn't have a use case anymore for it.
And he's like, all right, great.
We're going to deploy it.
Right.
And we tried it and they use it and they're, I mean, they put so much work into it.
They wrote a blog post about it.
You probably know about this.
So that went pretty well, you know, so it's working in production.
It's great.
And then I'm like, all right, well, you know, sharding is the next one.
They have a bunch of sharded databases that we sharded, and adding sharding routing
to that would be great because again, it was done in the application layer and
application layer routing, you know, I think we all agree is a little bit iffy,
especially if you have more than one app written in more than one language, like
you have to repeat the same logic across all apps.
So I had to just build a common system.
I knew like there's two sharding schemes at Instacart.
One uses the actually the hashing function
from partitions in Postgres.
I love that because you can actually split data
both at the client and in the server.
So you have multiple ways to move your data around.
And the other one is just a custom one.
Like we use like SHA1, take the last few bytes
and then mod that.
That's just, you know, it's random.
But it was available in multiple systems as well.
The data for that came from Snowflake.
So we could actually shard the data in Snowflake
and then ingest it into the instances directly.
And then on the routing layer in Ruby,
same hashing function, you know,
the sharding key is always available.
So that was good.
So I added both of them, and they're like, great, that's great.
And they chose, I think, the SHA-1 function,
and I think it's working pretty well for them.
So that was fun.
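As a rough illustration of the first scheme, reusing Postgres's own hash-partitioning function so that client, proxy and server agree on placement, here is a sketch with a hypothetical orders table sharded on customer_id:

```sql
-- A 4-way hash layout expressed as Postgres hash partitions
CREATE TABLE orders (customer_id bigint NOT NULL, total numeric)
    PARTITION BY HASH (customer_id);
CREATE TABLE orders_0 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ... orders_1 .. orders_3 with remainders 1..3

-- The same function the server uses to place rows can be called by a client or proxy,
-- so routing logic outside Postgres stays consistent with the data layout:
SELECT satisfies_hash_partition('orders'::regclass, 4, 0, 12345::bigint);  -- true if 12345 belongs to shard 0
```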
Then I started another company that had nothing
to do with any of this.
PostgresML, you might've heard about it.
That came from, you know, the idea that we shouldn't ingest,
you know, hundreds of gigabytes of machine learning data
into Postgres,
why we should just ingest the three megabyte model
and run inference online.
It was okay.
Stayed about two and a half years there.
Didn't work out.
There's a lot of startups too.
Left.
And then I had some free time and I'm like,
well, what do I like to do in my free time, guys?
Writing Postgres poolers.
This is what I do.
This is what I do to, you know, rest on vacation.
I write Postgres poolers.
So I'm like, well, let's do sharding for real this time.
Let's actually take what we built at Instacart,
make it into software because that's what we do, right?
We come up with an idea with a problem.
We find a problem, we find a solution.
We write it in code and we don't have to solve it again
every single time manually for like, you know,
hopefully hundreds, hopefully thousands of use cases
for this, you know, we'll see,
I'm still doing my research, but yeah.
So that's PgDog.
Sharding is, you know, the number one selling point,
everything else is, you know, there's obviously,
it's obviously a pooler, it's obviously a load balancer,
it has all the features that PgCat has,
almost all of them, I'm adding them as I go.
It's a rewrite, it's brand new.
I like new code.
That's what everyone loves to hear.
Hey, you rewrote it from scratch, great.
You know, that code that we battle tested in production
and serving like half a million transactions per second,
well, that's obsolete now, I guess.
You're just gonna take this brand new code base
and check it out.
And it's written in Rust.
Yeah, absolutely.
It's in my kind of dreams to find some spare time to learn it and try it, because it looks like many folks are moving to Rust.
So what do you think?
Well, it took me about a decade to get good at it.
So the sooner you start, the sooner you will get it.
You know, it takes about 10 years to be good at it.
So at least that's what it took me.
Honestly, I was like, that's okay.
Again, if you start today,
eventually you'll get good at it and that's okay.
It's a journey.
You don't have to learn it like immediately.
It's just, it's such a paradigm shift in how things work.
The compiler is very aggressive about checking for things, especially concurrency errors, which is
for multi-threaded asynchronous poolers, very important. I don't have concurrency bugs in
PgDog or PgCat because I'm using Rust. I don't have data races. And that's really important.
And the number of bugs that I ship is considerably lower
because the compiler is like, hey, listen, this variable, you're not using it. And I'm like, oh,
yeah, crap, I'm actually importing the wrong variable here. I'm using the wrong variable.
Good catch, because that was going to be a bug. A lot of things that just in other languages are
not available, Rust makes really nice. So it's really worth, again, you ask anyone who writes Rust
to be like, that's the best thing that ever happened to me
since sliced bread.
I'm like, that's true.
I haven't thought about sliced bread in a while,
but Rust is great.
Cool.
Another, like, side question is the license.
I saw feedback, I joined that feedback.
So PgCat was, I think, on Apache or MIT, I don't remember
exactly, but kind of permissive. And for PgDog, you chose AGPL, right? Can you elaborate a
little bit why?
Yeah. Yeah, of course. Yeah, happy to. The answer is really simple: AGPL is actually
pretty misunderstood.
You can use that code and the application anywhere you want
and never have to share anything back.
As long as you use it internally and you don't provide,
like, you don't use PgDog, like, publicly
as, like, a service for running PgDog.
And even if you do that in that case,
all you have to do is just tell us like,
what did you change and send us patches, right?
Or what do you change?
So like, it's pretty minimal, but it's a red flag for everyone. And that's okay. Yeah, I'm building
a company around it. Building a company around MIT codebase is, I think, probably possible. I've
never done it successfully. Building a company around AGPL, I think, has been done before.
And I think it's probably fine. If it becomes a hurdle, I'm not,
like, I'm not, like, married to it. I just thought AGPL looks cool. I like the
ideas behind it. I like free and open source code. I don't think MIT is
necessarily the original idea behind, you know, free and open source code. I
like MIT because I don't have to think about it. I'm checking PostgresML. PostgresML is MIT, right?
Yeah.
The codebase.
Yeah, it is.
Yeah.
Rust and MIT.
So it's interesting how you decided to change it.
I mean, I agree if it's misunderstood.
Yeah, but it's already so.
Majority of people misunderstood it,
and we cannot change that with a single project.
So it's a reality.
Yeah.
But you don't care if it's okay.
Like you do it.
Yeah.
Yeah.
Because, like, if somebody tells me, like, look, I would love to use your code, but AGPL is
a deal breaker,
I'll be like, well, we'll work something out.
You know, that's, that's not a big deal, you know.
But I think that's a good, it's a good thing to have a good license.
It's important.
You mentioned starting a company around it.
It strikes me that it's going to be tough.
Like, obviously, new code base, the main use cases are at scale.
But normally startups, like the easiest way of getting started is serving smaller companies,
right?
Like it's harder, like going straight to the enterprise
with something that's not yet battle-tested.
Like what's the, you've got a plan though.
It'd be great to hear like, what's the plan?
What's the plan?
It's okay, don't freak out, it's gonna be okay.
Yes, it's not actually that uncommon
to have enterprise startup products.
If the problem is interesting enough,
there's always gonna be somebody's gonna be like,
oh great, somebody's working on it full time.
That'll be amazing.
At this early stage, how this works usually
is I'm looking for design partners.
So it's companies like Instacart who are like,
hey, this is a great idea.
We're gonna try it out.
We're gonna develop it together.
And at the end of the day, it's gonna be in production
because we built it together.
And that's actually good
because you wanna build it with the users.
Like you don't wanna build it by yourself
like, for several years, and then show up and be like,
hey, does anyone need Postgres sharding?
And Nikolay is like, well, I don't know, maybe.
Maybe not, depends.
So what I'm actively looking for right now,
like code base is okay.
There's, I'm sure there's bugs in it,
performance issues and that's totally fine.
I'm just looking for people who'd be like,
this is an interesting idea.
Like I like the idea of post-gust sharding.
I like the way it's done at the pooler.
It's not done as an extension.
It's not done as like some kind of like other thing
that I can't even think of.
Like I like the way it's done.
So I'd like to try it out and help you finish the job.
You know, by deploying it in production,
by benchmarking it, by finding bugs, by reporting bugs,
by even fixing bugs would be great, but not required. That's my job. I can confirm you were very quick reacting to requests. I remember
looking at PgCat, I had the idea of mirroring, right, let's have mirroring to have, like, a kind
of A-B performance testing right in production, and you implemented it, it was great.
I think you were like, you were-
Actually to be perfectly correct,
it was actually Mustafa at Instacart who implemented it
because he had a use case for it.
He was my design partner.
It's a very common request, and the only problem is, like,
you need to have this pooler already in production
to use it. This is the trickiest part, because it's zero to one. It's always tricky, new stuff,
especially in the hot path, is always gonna be hard. But if the problem is there and
if the problem is big enough, I'll find my champion. Yeah, exactly. So, yeah, and
you mentioned PgDog.
Maybe let's move back to technical discussions.
A little bit out of business and license.
So first of all, it's not like PgCat.
It's not explicit sharding, where you command with SQL
comments how to route.
PgDog has automated routing, right? And second thing, there's no Postgres
as, like,
middleware for this, right? So it's just a pooler with routing, right?
Can you explain like architectural decisions here and what do you use and
like what kind of components like, like you got a parser from Postgres to understand queries.
I'm very curious how you're going to automatically route
selects of functions which are writing, for example.
Or select for update, which you cannot route to a physical
standby replica, right?
Or I don't know, something else, like how you are going to,
or you obviously will have some limitations,
already have some limitations, right?
So can you talk about this a little bit?
Yeah, of course.
Well, select for update is actually really simple.
That's a clear intent to write something. So that's an easy one straight to the primary. No problem. That's
an easy one, which I actually should implement now that I'm thinking about it. I'm routing
it to the replica right now. Bug issue incoming. Thank you very much.
The other one is the functions. Obviously, it's impossible to know if a function is writing
or not by just looking at it.
Even if you look at the code, you're not going to know.
So like static analysis, I don't think is necessarily possible.
So for that one, I think it should be pretty easy.
You put it in the config, you have a list of functions that actually write.
This is what Pgpool does, right?
I think so.
Yeah, I'm not sure.
Yeah, but you know, Pgpool does everything, so I'm sure they do it as well.
Exactly. You have to be careful. You can't do everything. People don't believe.
Right.
I know that approach is not perfect because if you add a new function, you have to update the config.
You're always going to forget and you're always going to have issues.
So for that one, I don't have a solution. My theory so far is that it is not as common as
you'd think, but I will probably be proven wrong. And then we'll
figure something out, probably some kind of migration process
that says, if you want to add a function that writes, send it
through the... like, you should be writing migrations and sending
them through the pooler, not some kind of
side channel. And you can probably mark that function as,
like, hey, this function writes, like, put it in a comment or something.
And then PgDog is gonna be like, great, good to know.
You're gonna need persistent storage for that kind of stuff,
which you could probably implement.
Yeah, I agree.
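To make that concrete, here is why parsing alone can't catch it, with a small hypothetical example (the clicks table and function are made up):

```sql
-- To any query parser this call site looks like a plain read-only SELECT,
-- but the function body writes, so it has to be routed to the primary.
CREATE FUNCTION record_click(p_user bigint) RETURNS void
LANGUAGE sql AS $$
    INSERT INTO clicks (user_id, clicked_at) VALUES (p_user, now());
$$;

SELECT record_click(42);  -- on a replica this fails: cannot execute INSERT in a read-only transaction
```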
Some dogs behave like cats sometimes.
Exactly. Yeah, exactly.
People will forget to put that comment in
and then there are gonna be issues,
but that's just software.
When there's people involved,
there's always gonna be...
The more manual stuff you have to do, the more problems there are going to be.
But if you're writing a function by hand in the first place, you know, that's just going
to be part of your review process.
I was reading Hacker News comments and somebody asked about aggregates as well.
I think there's a limitation around those at the moment.
What's the story there?
Well, when I posted the thing, I didn't have support for aggregates at all. But then I'm like,
hey, you know what, let's add some I've been thinking about it for months. I might as well
just do a couple of simple ones. So I added count, a simple one, you just sum the counts across
all shards. Max, min, again, super simple.
I added sum as well. Sum is really just sum everything.
Small comment here, it's MapReduce basically,
like, an analogy for it. It's cool, right?
So it is cool, right?
This is cool, like, you know, it's-
You can scale to billions, trillions and so on, right?
Precisely, yeah, this is like a MapReduce
for those guys is phenomenal, you know?
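A sketch of the idea, not PgDog's actual code: each shard runs the same aggregate locally, and the proxy merges the partial results (collected here into a hypothetical shard_results relation); note that avg has to be recomputed from sum and count:

```sql
-- Map: every shard answers the same query, e.g.
--   SELECT count(*) AS cnt, sum(total) AS sum_total, max(total) AS max_total FROM orders;
-- Reduce: merge the per-shard partial rows
SELECT sum(cnt)                  AS total_count,   -- counts add up
       sum(sum_total)            AS total_sum,     -- sums add up
       max(max_total)            AS overall_max,   -- max of maxes
       sum(sum_total) / sum(cnt) AS overall_avg    -- averaging the per-shard averages would be wrong
FROM shard_results;
```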
I'm actually, okay, so I'm gonna release a blog post in a couple of days. I'm actually sharding and doing MapReduce for pgvector. I don't know if you've heard about this one. Yeah, that one's really fun. pgvector is like a...
No, no, pgvector we know. What do you do with it?
Well, you're going to have to wait till the blog post comes out, but it's really fun. I'm doing both MapReduce and a machine learning algorithm to route queries in the cluster
because like, scale, like searching vectors is a completely different problem than searching
B-trees.
So I don't know how many people will need that solution.
Well, HNSW is terrible if you go beyond one million records, it's a big problem
still.
So that's the number I had in my blog post as well.
I don't know why 1 million just feels like the right number,
but that's exactly what I said.
Anyhow, over a million,
you probably need to shard your pgvector index.
Or use a different approach.
Yeah.
Exactly.
That's great.
And MapReduce, someone mentioned in the past
that PL/Proxy by Skype, very old tech, also looked like
MapReduce, but it required you to use only functions, which is a huge limitation, especially
if you have ORM or GraphQL.
It's a big showstopper.
And also it was Postgres in the middle for routing and for this MapReduce. In your case, this is
more lightweight software in the middle, PgDog, right? And it does some simple, like, arithmetic
operations. Do you plan to define some interface for more advanced operations that the user could define, like, beyond simple sum or count or other aggregates?
I haven't worked much with custom data types, UDFs and all that stuff, so that's going to be a learning curve for me, I'm sure.
I'm sure it's not that hard, but once you add custom functions, you need to add custom logic.
I think that should be pretty straightforward to implement if there's synchronization between how the cluster is working
and PgDog.
This would give full-fledged MapReduce capabilities
to this, right?
Yeah, absolutely.
That opens up interesting perspectives, I suppose.
Absolutely, yeah.
If I find someone who thinks this is cool as well,
we could definitely build it together.
Not only thinks it's cool, but has some production to try, right?
Because it's just...
Absolutely.
Exactly.
Yes, absolutely.
I think that would be pretty terrific.
It will work pretty well.
But yeah, again, there's a lot of interesting things about aggregates that, for example, like
percentiles, notoriously difficult,
basically impossible to solve, I think,
at the sharding level,
because you need to look at the whole data set
to compute it, you can approximate it.
Approximation functions should be like a feature
that we add to, like, PgDog, that says, like,
you know what, I don't care about the exact number,
like average, the simplest one,
you could estimate it.
If you have-
For percentiles, we could define some custom data type
to remember how many members were analyzed and so on.
I don't know.
To bring not just one number from each shard,
but a couple of numbers,
and then it will be possible to understand percentiles.
Maybe, should be.
Maybe, yeah.
Not super difficult.
Feels like a similar, like, HyperLogLog had some similar,
like, I don't know how you do that cross-shard,
but it feels like there might be some secret source
in what they've done already
that could be applied more like cross-shard.
Yeah, HyperLogLog is like a,
it's like a counter,
basically approximates how many members are in a set.
Actually, there's an extension for it in Postgres.
It's pretty fun.
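For example, with the postgresql-hll extension (assuming it's installed), the sketches can be unioned, which is what makes approximate distinct counts mergeable across shards; the events and shard_sketches relations here are hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS hll;

-- Each shard builds a small HyperLogLog sketch of its distinct users
SELECT hll_add_agg(hll_hash_bigint(user_id)) AS users_sketch FROM events;

-- Merging: union the per-shard sketches and read off the estimate
SELECT hll_cardinality(hll_union_agg(users_sketch)) AS approx_distinct_users
FROM shard_sketches;
```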
But yeah, I need a statistician, like a data scientist,
to come and be like, all right, this is how you approximate
percentiles.
And they're like, great.
Do you know Rust?
Because, yeah.
What about PL/Rust? Cool stuff. Yeah. What do you think?
Exactly, yeah.
Well, so what is, like what's next?
What are you looking,
I guess it depends on what people want,
but what does tomorrow look like over the next week?
Well, you're gonna be shocked to hear this,
but I'm an engineer who does sales now.
I literally just like, yeah, as I said,
I'm building a company,
so I'm literally
just sending as many like LinkedIn and emails to whoever I can think of to find design partners
to for people to be like, Hey, I want to help. I want to, I, this problem exists. First of all,
that's the first feedback I need to get. Be like, I would like sharding. I would like it solved.
And I would like to solve it your way. That would be great. Or, you know, just tell me how
you'd like it to be solved.
And if there's an overlap,
and I think there should be overlap, we solve it together.
So that's what I'm doing mostly.
But I'm an engineer,
so I need a safe space from all the social activities.
So I still code.
So that's why the pgvector sharding is coming out,
because I needed to code a little bit.
And I thought the idea would be cool.
We're gonna keep doing that.
I'm gonna keep adding these kinds of features, keep adding tests, benchmarks,
fixing bugs, finding more use cases, like FOR UPDATE, that I forgot.
But yeah, that's the plan.
I have specific question about rebalancing without downtime.
Right?
If one shard is huge, others are smaller, we need to rebalance.
What do you think?
Will this feature first of all be inside open source core offering?
Because we remember in Citus it was not there until the Microsoft decision, as I understand, the Microsoft
decision, or, like, the Citus team's decision, to make everything
open source. And a second question, actually triggered by this question: how do you compare
PgDog to Citus? Yeah, the open source part, I would like it to stay. I don't foresee myself writing
closed-source code, maybe things around, like, deployments and orchestrating the stuff in companies.
My idea, we'll see if it's realistic, is to sell managed deployments of
PgDog, to on-prem deployments, to companies. So probably the actual code
that orchestrates that stuff will probably be proprietary, mostly because
I'm embarrassed how much bash I use. Nobody really wants to know how that sausage is made. Do you follow the Google bash style guide? Yeah, well, somebody taught me to use curly
braces for my bash variables, and ever since then I've been doing that religiously. So I
learn new things every day. But I think the core will stay open source forever. I don't see,
I don't even know, even that data migration part, like, that's a known, there's a known
solution for it. There's no point in building a closed-source
solution that does the same thing, like, that's already been
solved. So might as well just...
It doesn't exist yet, right? This rebalancing feature.
Well, the rebalancing feature is basically, I get it depends on
your sharding key, it depends how you store data on the shard
like Instagram wrote that blog post a long time ago where they pre-partition everything
into smaller shards and that's how Citus does it underneath.
Instead of like, if you say you want three shards, it's not going to build you three
tables, it's going to build 128 tables and move them between the shards.
How?
Well, logical replication became a thing in 10, so Citus decided to use that to move things around.
I think logical replication makes sense up to a point.
You really have to catch the tables at the right time.
Once they get a little bit too big,
logical replication can't catch up anymore.
So that's gonna be an orchestration problem.
I've seen logical kind of-
You can partition it virtually,
like what PeerDB did, like they implemented virtual partitioning for splitting.
If your primary key, or partition key,
partition key in this case, allows to define some ranges,
you can use multiple streams to copy initially
and then even to have CDC, so it's kind of interesting.
That's right.
So yeah.
Well, it's funny that you mentioned that,
because what is it, Postgres 16 allows us to now create
logical replication from the replica, because-
Binary equals true.
Postgres 17.
17?
Or 16.
Maybe 16.
Maybe you are right.
I just looked at the documentation a few days ago,
and I already forget.
Yeah, but it's still binary.
It's good, but it's a whole table.
Maybe we don't need it.
If we rebalance, we don't.
Well, in the case where you split already-existing partitions,
it's fine, but if not, yeah, I'm very interested
to understand design decisions here.
It's going to be interesting.
Because I think for sharding at large scale, this is
one of the key features, to understand this. Yeah, well, so when you go from, like, one big
database to shard it into, like, 12, what we did at Instacart was: we created a replication slot,
we took a snapshot, we restored it 12 different times, deleted the data that's not part of each shard,
synchronized the logical replication, and launched.
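A heavily simplified sketch of that kind of split, using a hypothetical orders table sharded on customer_id; real orchestration needs more care about which changes replicate after the snapshot (row filters or repeated cleanup):

```sql
-- On the original primary: publish changes so the new shards can catch up later
CREATE PUBLICATION shard_split FOR ALL TABLES;

-- Each new shard starts from a snapshot/restore of the whole database, then drops
-- the rows that don't belong to it (here: shard 3 of 12):
DELETE FROM orders WHERE abs(hashtext(customer_id::text)) % 12 <> 3;

-- Catch up on changes made since the snapshot (data already copied, so copy_data = false)
CREATE SUBSCRIPTION shard_3
    CONNECTION 'host=old-primary dbname=app'
    PUBLICATION shard_split
    WITH (copy_data = false);
```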
So deleting data is faster than writing it from-
Copy it as physical replica first, right?
Or copy it logically, like provisioning logical replica,
basically dump restore before binary?
No dump restore.
So RDS, we were on RDS. So EBS. This leads us to
this discussion about upgrades, because recovery_target_lsn doesn't exist. How, like,
let's follow up on this. Like, this is a good bridge to zero-downtime upgrades on RDS.
Yeah, those are fun. Oh yeah. So the second question, yeah, Citus. So the difference is
mostly philosophical in the architecture.
Citus runs inside the database,
which limits it to two things.
First of all, the database host has to allow you
to run Citus, which in a lot of cases is not the case.
Maybe because of AGPL,
maybe because they just don't want you competing
with their internal products.
Again, it's business, it's all fair game.
And then the second one is performance.
When you run something inside Postgres that needs to be massively parallel, you
know, you're limited by the number of processes you can spawn and by number of
connections you can serve.
So PgDog is asynchronous, Tokio, Rust, lightweight, it's not even threaded.
I mean, it's multi-threaded, but it's mostly like asynchronous, like
task-based runtime.
So you can connect, I mean, I'm going to pull out a big number here just for, you know,
for effect, but you could have, like, a million connections
going to PgDog from a single machine,
and that technically should work
because it's using epoll underneath.
But for Postgres, you could probably do like, you know.
You need PgBouncer or something in front of it.
You need PgBouncer
and you need most of those connections to be idle
because concurrency-wise,
Postgres can only do maybe like, you know, two per core. That's the myth, at least.
So again, like Citus, single machine, they have some kind of support for multiple coordinators,
but I think the readme just says, like, please contact us.
In any case, you're in the hands of Microsoft, you need to go to Azure, or you need to do
self-host everything.
You cannot use it on RDS because extensions are required.
In your case, no extensions required.
This is the key.
So you can run it on RDS because the extensions are not needed.
Right?
Exactly.
So you can run it on any Postgres.
I see a lot of guys are trying to develop an extension ecosystem.
I over time became an opponent of the extensions idea, because we have a lot of managed services,
and if you develop an extension, it takes a lot of time to bring that extension to
a managed provider, and, like, it takes years. So if you can do something without extensions, it
might be better in some cases, and sharding maybe is such a case, right?
Mm-hmm, no, absolutely, yeah. Because if you develop an extension, then it needs to get
installed by RDS, like, I don't know if they're gonna, you know, support it,
or all that stuff.
RDS notoriously upgrades like once a year.
We need to wrap up soon, but like a few words
about encryption.
Does PgDog support encryption now already,
because it's super important for,
and it can be a bottleneck, I know.
Odyssey connection pooler was created
because PgBouncer needed two layers of PgBouncer
to handle a lot of... yeah, this is what several companies
did in the past, two layers of PgBouncer,
because of the handshake.
So, TLS handshake.
What's in this area, encryption?
Sure, yeah, it supports TLS.
It's using a library, Tokio.
It's like one of the Rust libraries that implements TLS.
It's completely fine.
You could use TLS.
And my personal favorite,
I finally found a library that implements
SCRAM-SHA-256 authentication.
So now that's finally supported.
Yeah, I saw MD5 is going to be deprecated
in the next few years in Postgres.
It still comes up.
I mean, it's been deprecated for 10 years and people still use it just because it's
really simple to implement.
SCRAM is really hard.
And one more small question about prepared statements.
I know PgCat supported them in transaction pooling mode, right?
Yeah, the implementation wasn't great,
but it does support it, yeah.
PgDog supports them too.
Much better implementation this time.
Okay.
Cool, well, I'm not out of questions.
I'm out of time.
Yes!
But it was absolutely interesting.
Thank you so much.
I'm definitely rooting and going to follow the project.
Best luck to you and your new company.
And maybe Michael has some questions additionally.
I took a microphone for too long this time.
Apologies.
Well, I'll let Lev, is there anything we should have asked
that we didn't?
Oh, no, you guys are, I think you covered it.
Well, really nice to meet you.
Thanks so much for joining us.
And yeah, catch you next week, Nikolay.
Thank you so much.
Bye bye.
Thank you.
Bye bye.