Postgres FM - Multigres
Episode Date: July 11, 2025

Nikolay and Michael are joined by Sugu Sougoumarane to discuss Multigres — a project he's joined Supabase to lead, building an adaptation of Vitess for Postgres!

Here are some links to things they mentioned:

Sugu Sougoumarane https://postgres.fm/people/sugu-sougoumarane
Supabase https://supabase.com
Announcing Multigres https://supabase.com/blog/multigres-vitess-for-postgres
Vitess https://github.com/vitessio/vitess
SPQR https://github.com/pg-sharding/spqr
Citus https://github.com/citusdata/citus
PgDog https://github.com/pgdogdev/pgdog
Myths and Truths about Synchronous Replication in PostgreSQL (talk by Alexander Kukushkin) https://www.youtube.com/watch?v=PFn9qRGzTMc
Consensus algorithms at scale (8-part series by Sugu) https://planetscale.com/blog/consensus-algorithms-at-scale-part-1
A More Flexible Paxos (blog post by Sugu) https://www.sougou.io/a-more-flexible-paxos
libpg_query https://github.com/pganalyze/libpg_query
PL/Proxy https://github.com/plproxy/plproxy
PlanetScale Postgres Benchmarking https://planetscale.com/blog/benchmarking-postgres
MultiXact member exhaustion incidents (blog post by Cosmo Wolfe / Metronome) https://metronome.com/blog/root-cause-analysis-postgresql-multixact-member-exhaustion-incidents-may-2025

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the elephant artwork
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and I'm joined as usual by Nikolay,
founder of Postgres.AI. Hey, Nikolay.
Hi Michael and let's welcome our guest.
Yeah, we are joined by a very special guest, Sugu, who is a co-creator of Vitess,
co-founded PlanetScale, and is now at Supabase working on an exciting project called Multigres.
So welcome Sugu.
Thank you. Glad to be here.
All right, it's our pleasure. So it's my job to ask you a couple of the easy questions
to start off. So what is Multigres and why are you working on it? Multigres is a Vitess adaptation for Postgres.
It's been on my mind for a long time, many years.
And we've even had a few false starts with this project.
And I guess there is a timing for everything.
And finally, the timing has come.
So I'm very excited to get started on
this finally. Yeah, timing is an interesting one. It feels like for many
years I was looking at PlanetScale and Vitess specifically, very jealously,
thinking you can promise the world, you can promise this horizontal
scaling with a relational database for OLTP,
and it's all of the things that people want, and we didn't really have a good answer
for it in Postgres. But all of a sudden, in the last few months, there
are now three, four competing projects all doing it. So why now?
Why is it all happening now?
I would say I think there is a timing for every new idea to gather momentum. Like, the idea may be good, but the timing may
be bad for one reason or another. But once the timing is
good, then it is obvious that it is the right time. And it feels
like now is that time.
Like I would say, for example, PayPal,
there have been previous attempts at moving money online,
but it never worked.
But PayPal just came at the right time
and therefore took off during that time.
There are many such examples where, you know,
some ideas came too early and didn't take off.
But for Postgres, this feels like it's the right time.
That's interesting. Let me reveal something.
We met in 2018, I think, at a Percona conference, and in 2019
we met again at Nutanix, at a meetup.
Nutanix, yes.
The Go meetup, right. Discussions about Vitess for Postgres started.
A couple of times we had a call, I tried to involve a couple of guys, and from my understanding it never worked, because you could not do it yourself, being busy with other things, MySQL-related I guess, and the other guys were looking at the complexity of Vitess and didn't proceed.
Actually, in one case they decided to build from scratch: it was the SPQR project, it's alive, and it was just starting for Postgres back then. Yeah, and it borrowed some ideas, yeah. So other folks were also involved.
And so for me it was disappointing that it didn't work. And at some point I saw a message in Vitess, I think, that we are not going to do it, so don't expect it. I felt so bad, because I was so excited about doing it, and then I realized, my god, you know, yeah.
But now you started, and last week PlanetScale, they started to support Postgres.
So what's happening?
Like, I don't understand.
Just, just right time, right?
Yeah.
Enough companies using Postgres really needed it, so
at least one horse will win.
So, yeah, it's great.
But yeah, it's a long story to this point.
Yeah. Sometimes when there are multiple projects, there are slight differences in philosophy or approach, or trade-offs, willing to trade one thing off in relation to another. And I saw your plan, I really liked that you mentioned building incrementally.
So Vitess is a huge project, lots and lots of features,
but I've heard you talk in the past
about building it quite incrementally while at YouTube.
It didn't start off as complex as it is now, obviously,
and you did it kind of one feature at a time.
And it sounds like that's the plan again with Multigres.
Is that different to some of the other projects
or what do you see as your philosophy
and how it might differ slightly to some of the others?
I think my philosophy is that I would say
I don't want to compromise on what the final project is going to look like.
So any path that deviates me from hitting the target on the final project, I do not want to take.
But any shortcuts that can get us early results that we can do, but only those that ensure that
it doesn't risk what we become eventually, which is a project that should feel native
as if it was for Postgres, by Postgres kind of thing.
I want it to be a pure Postgres project.
In this case, I must ask: are you sure you're still good
with the choice of Go as the language? Because sometimes we see projects which have
really like sub-millisecond latency on average, like 100 microseconds for example on average,
and Go definitely will bring a few hundred microseconds of latency overhead. So usually it's not a big deal, but maybe in some cases
it's some deal, right?
Are you happy with Go?
Yeah, yeah.
I mean, because you were one of the first big Go language
users, building Vitess with it, as we know from various interviews
and so on.
Yeah, so it's still a good choice,
because now there is Rust, right?
Yes. I would say, by the way, when we started, Go compared to where Go is today was a nightmare. Like 10 milliseconds or something round trip is what we were paying for. Those days we had hard disks, by the way.
So that's another three to five milliseconds just within the database.
But things are a lot better now.
And at this point, the way I would put it is like the
trade-offs are in favor of Go.
Let's put it that way, mainly because there is a huge amount of existing code that you can just lift and
port.
And rewriting all of that in Rust is going to just delay us.
And at least in Vitess, it has proven itself to scale for
hundreds of terabytes.
And the latencies that people see are,
they are not affected by a couple of hundred microseconds.
So I think the, and plus there's this inherent acceptance
of this network latency for storage and stuff.
And if you bring the storage local,
then this actually wins out over anything that's there.
This is exactly what I wanted to mention. Yeah, I see PlanetScale right now, they came out with
Postgres support but no Vitess, and I'm very curious how much it will take for them to bring
it and compete with you. But that's not the interesting question. From the past week, my main impression is that the main stake is on local storage.
And this is great because local storage for Postgres, we use it in some places where we
struggle with EBS volumes and so on.
But it's considered like not standard, not safe, blah, blah, blah.
There are companies who use it.
I know.
Like I know myself, right?
And it's great.
Today, for example, with Patroni, and since Postgres 12, we don't need to restart nodes when we have a failover. So if we lose a node, forget about that node, we just fail over, and so on. And with local storage, it's not a big deal.
But now, with your plans to bring local storage, I expect it will become more and more popular, and that's great. So you shave off latency there and keep Go, which brings 200 microseconds...
Good compromise. Effectively, it's a win, because one network hop completely eliminates language-level overheads.
Sounds good. Yeah. Maybe Go will improve additionally, but yeah. Good.
I wanted to go back, Sugu, you mentioned not wanting to compromise on feeling Postgres native.
And that feels to me like a really big statement, coming from Vitess being very MySQL-specific. Saying you want to be Postgres native feels like it adds a lot of work to the project, or, you know, it feels like a lot to me. What does it mean? Is that about compatibility with the protocol? What does it mean to be Postgres native?
There's two answers. One is why do we still think we can bring Vitess if it was built for MySQL, and how do you make it Postgres native?
That's because of Vitess's history. For the longest time, Vitess was built not to be tied to MySQL.
It was built to be a generic SQL 92 compliant database.
That was actually our restriction for a very long time until the MySQL community said,
you need us to, you know, you need to support all these MySQL features.
Otherwise we won't move.
CTEs, right?
Common Table Expressions, WITH, right?
It's, I guess, a SQL:99 feature, not 92.
Yeah, I think the first parser I built was SQL 92, which is the most popular one that
I know of.
So that's answer one.
Answer two is more with the behavior of Postgres.
What we want is to completely mimic the Postgres behavior
right from the beginning.
Basically, in other words, we plan to actually copy
or translate what we can
from the Postgres engine itself,
where that behavior is very specific to Postgres.
And the goal is not compatibility
just at the communication layer, but even internally,
possibly even at the risk of recreating bugs.
In this case, it's very like...
So there is Citus.
In the hands of Microsoft, it got everything open sourced.
So before, resharding was only in the paid version.
Now it's in free version open source.
So it's fully open source.
And they put Postgres in between.
So they don't need to mimic it.
They can use it, right?
And latency overhead is surprisingly low.
We checked it.
Well, it's C, but it's a whole database in between.
But it's sub-millisecond, so it's acceptable as well.
I think it's half a millisecond or so in our experiments with simple select one or
something.
Select, just like that. So don't you think, in comparison,
it's quite a challenging point when you say I'm going to mimic a lot of stuff but
they just use Postgres in between? Yeah, I think there's a difference in architecture
between our approach, between Multigres and
Citus. I think the main difference is it's a single coordinator for Citus and
there is some bottleneck issue with that if you scale to extremely large
workloads like that goes into millions of QPS, hundreds of terabytes. So having that single bottleneck, I think,
would be a problem in the future.
Whereas...
Yeah.
As I understand, you can put multiple,
like you can have multiple coordinator nodes there,
and also you can put a PgBouncer
to mitigate the connection issues.
So it can scale as well.
Oh, that's good.
That's something I didn't know before.
So it's possible then that they may also have something that can viably scale for
OLTP.
Yeah.
So, we are still exploring this and more benchmarks are needed.
Actually, I'm surprised how few comprehensive benchmarks are published for this.
Yeah what I know of Citus is probably what you told me when we met.
So it's about five years old. Yeah, yeah. Another big difference, and this is typically Nikolay's question, is on the license front. I think you've picked about as open
a license as you could possibly pick, which is not the case, I think, for many of the
other projects. So that feels to me like a very Supabase thing to do, and also in line with what Vitess did. And that seems to me like a major advantage in terms of collaborating
with others,
other providers also adopting this
or working with you to make it better.
What's your philosophy on that side of things?
My philosophy is my metric for success is adoption.
Yeah.
And the only way to have a project be adopted
is to have a good license,
a license that people are confident to use.
That has been the case from day one of Vitess.
We actually first launched with BSD license,
which is even more permissive than Apache.
And then when we moved the project to CNCF, they said,
oh, no, no, we need the Apache license, which is why we converted it to that one. So adoption has always been... Yeah. Why did they say that? Do you know? Why does CNCF want Apache?
I think Apache is a pretty good license. They just made it a policy. I mean, had we asked to keep the
BSD license, they would have allowed us, but we didn't feel like it was a problem to move to Apache.
Yeah, and I remember you described,
when you did it at YouTube, you thought about external users.
You need external users for this project to grow.
And I guess at Google, AGPL is not popular at all.
We know.
Oh, banned.
Yeah, banned.
And Citus is AGPL.
Oh, okay.
Yeah.
Also, compared to Citus, I think you have chances to be compatible with RDS and other
managed Postgres, to work on top of them, unlike Citus, which requires an extension and so on.
Correct. This was actually something that we learned very early on.
We made a five-line change in MySQL just to make Vitess work initially,
and it was such a nightmare to keep that binary up,
to keep that build running.
Fork.
Yeah, to keep that fork alive.
So we decided, no, it's like,
we are going to make this work without a single line
of code changed in MySQL.
And that actually is what helped Vitess move forward.
Because people would come in with all kinds of configurations
and say, you know, make it work for this.
So in this case, actually,
we'll probably talk about the consensus part.
That is one part that we think it is worth making a patch
for Postgres, and we're going to work hard
at getting that patch accepted.
But I think what we will do is,
we will also make Multigres work for unpatched Postgres
for those who want it that way,
except they will lose all the cool things
about what consensus can give you.
I'm smiling because we have so many variations
of Postgres these days.
I would expect people coming not only with different configurations of Postgres, but
also with various flavors, like Aurora.
We have a client who just migrated from regular Postgres to AlloyDB.
Hello from Google Cloud.
And they already shared that on the application side. But imagine they could come to you and say, let's support AlloyDB now.
It's almost Postgres, right?
So these things might happen as well.
Don't they claim full compatibility with Postgres?
I thought you were all over that.
Not full, but most of it.
They did interesting stuff in memory, like column storage in memory for tables. It's row storage on disk, but a column store in memory.
But it looks kind of like Postgres, and we actually even had some questions answered by my team unexpectedly, because we don't normally work with AlloyDB, but it looks like Postgres. So I could imagine the request: let's support AlloyDB as well. Anyway,
but my question: I remember a feature of Vitess that it works with RDS and managed MySQL. Has this feature survived?
No, actually later we decided... at least we call it managed versus unmanaged. Managed means that Vitess manages its databases, and unmanaged means that the database is managed by somebody else, with Vitess just acting as a proxy to serve queries. At some point of time, we realized that supporting both
is diluting our efforts.
And that's when we decided, okay, you know,
it's not worth it to try and make this work
with every available version that exists
out there in the world.
And we said, okay, we will do only managed,
which means that we will manage it ourselves.
And if you want, we'll build the tools to migrate out of wherever you are, and we'll make it safe. We'll make it completely transparent. In other words, you deploy Vitess on both, and then we'll migrate your data out without you having to change your application. But then Vitess can be more intentional about its features, more opinionated about how clusters are to be managed.
And we were able to commit to that because at that point,
Vitesse had become mature enough,
people were completely trusting it.
They actually preferred it over previous
other managed solutions.
So it wasn't a problem at that time.
Yeah, it's not a surprise, that's why I asked. Because you talk about local disks, backups,
HA, a lot of stuff, right? And basically...
Yeah, five nines is like what Vitess shoots for. And most big companies that run Vitess do operate at that level of availability.
So what's the plan for Multigres? Going to support not only the managed version, right?
It would be, yes, it would be only managed versions, because I believe that the cluster management section of Vitess will port directly over to Postgres, which means that once it goes live, it will be coming with batteries included
on cluster management, which should hopefully be equal to or better than what is already out there.
So I don't see a reason why we should try to make it work
with everything that exists today.
So it means, like, this is the same as with Citus. It doesn't work with RDS on one hand, but on the other hand, I don't see it as only a sharding solution.
It's everything, which is great.
I mean, it's interesting, super interesting.
A lot of problems will be solved and I expect even more managed services will be created. I
don't know how it will continue in terms of Supabase, because of the very open license and so on,
but also I expect that many people will reconsider their opinion about managed.
We had an episode about this.
This is my usual opinion about managed services, because they hide superuser from you.
They don't provide you that access.
It's hard to troubleshoot problems.
In this case, if those problems are solved and this gives you a new way to run Postgres, so if many problems are solved, it's great.
If you want to self-manage to solve problems, for example.
Yeah, if you may not know, the initial focus of Vitess was actually solving these problems first.
Sharding actually came much later.
Protecting the database, making sure that it survives abusive queries: basically, that's what we built Vitess for initially.
And the counterpart of taking away power from the user,
like you said is one is, well,
we now know exactly how to make sure
that the cluster doesn't go down.
And two, we countered that by building
really, really good metrics.
So when there is an outage,
you can very quickly zero in on a query.
If a query was responsible,
Vitess will have it at the top, like on the top of the list, saying that this is the query
that's killing your database.
So we build some really, really good metrics
which should become available in Multigres, probably from day one.
That's interesting.
I didn't see it, maybe I missed it, in the readme you are writing right now in the project.
There's a last section called observability.
I missed it, I need to revisit.
We're actually building something there as well.
For regular Postgres, I'm very curious.
I will definitely revisit this. Interesting. Okay.
So yeah, great. And yeah. Also quite a big, I feel like this is quite a big difference on the,
at least with Citus in terms of the philosophy or at least the origin story.
I feel like that started much more with OLAP-focused features, in terms of distributed queries, parallelized across multiple shards, and aggregations, and columnar storage, and loads of things that really benefit OLAP workloads.
Whereas this has come from a philosophy of let's not worry about optimizing for
those cross shard queries, this is much more,
those cross-shard queries; this is much more, let's optimize for the single-shard, very short, quick OLTP queries, and let's make sure we protect it against abusive queries, you know. So it feels like the architecture is coming from a very different place of what to optimize for
first and historically that was YouTube's problem. Surviving the onslaught of a huge number of QPS
and making sure that one single QPS doesn't take the rest of the site down.
Yeah, perfect. Makes loads of sense.
So actually before we move on too much from that,
where do you see sharding as becoming necessary?
Like is it just a case of a total number of QPS, or like writes per second type?
We've talked about sharding in the past and talked about kind of
a max that you can scale up to perhaps in terms of writes, in terms of WAL per second, I think, was the metric we ended up discussing.
Are there other reasons or kind of bottlenecks that you see people getting to
that sharding then kind of makes sense as it's now time or you should be
considering at this point?
Well, there is a physical limiting factor, which is the single, if you max out
your single machine, that is your Postgres server,
then that's the end of your scale.
There is nothing more to do beyond that.
And there are a lot of people already hitting those limits
from what I hear.
And the sad part of it is they probably don't realize it.
As soon as that limit is hit,
in order to protect the database,
they actually push back on engineering features, indirectly, saying that, you know,
this data, can you make it smaller?
Can you somehow lower the QPS?
Or could you put it elsewhere?
Let's stop showing this number on front page.
And so, yeah.
Yeah, yeah.
And it affects the entire organization.
It's a very small, it's a very subtle change,
but the entire organization slows down.
Like we experienced that at YouTube
when we were at our limits,
the default answer from a DBA was always no.
We used to even kid: the answer is no, what's your question?
And when we started sharding,
it took us a while to change our answer,
to say that, you know, bring your data,
we can scale as much as you want.
Believe it or not, we went from 16 shards
to 256 in no time.
And the number of features in YouTube exploded
during that time because there
was just no restriction on how much data you wanted to put. And coming back here, reaching the limit of a machine is actually something you should never do. It's very
unhealthy for a large number of reasons. Like even if there is a crash, like how long is it going to take to recover?
Like the thing that we found out is once you can shard,
it actually makes sense to keep your instances
way, way small.
So we used to run like 20 to 50 instances
of MySQL per machine.
And that was a lot healthier than running big ones
for a couple of reasons.
One is if you try to run so many threads within a process,
that itself is a huge overhead for the machine.
And it doesn't do that very efficiently,
whereas it does it better if you run it as smaller instances.
I think it's more of a feeling, but I don't know if there
is proof or whatever.
But, like, Go, for example, wouldn't do well. Go, I think, beyond a certain memory size or beyond a certain number of goroutines, would start to slow down, would not be as efficient as it was before, mainly because the data structures to keep track of those threads and stuff, they are growing bigger, right?
But more importantly, on an outage,
a smaller number of users are affected.
If you have 256 shards and one shard goes down,
it is 1/256th of the outage, right?
And so the site looks a lot healthier, behaves a lot healthier.
There's less panic if a shard goes down. So people are, you know, a lot less stressed managing such instances.
Right, I wanted to mention that
this discussion was with Lev Kokotov of PgDog, which is a competitor as well, a new sharding tool written in Rust.
And we discussed that there is a big limitation in Postgres.
So replication, physical replication, has a limitation because it's a single-threaded process on the standby. If we reach somewhere like 150, 200, 250 megabytes per second, depending on the core and also the structure of tuples and so on, we hit one single CPU at 100%, one process, and it becomes a bottleneck, and the replicas, the standbys, start lagging.
It's a big nightmare, because usually at high scale you have multiple replicas and you offload a lot of read-only queries there, and then you don't know what to do, except, as you described, let's remove this feature and slow down development, and this is not fun at all. So what I'm
trying to do here is trying to move us to discussion
of replication, not physical but logical. I noticed your plans involve heavily logical
replication in Postgres, but we know it's improving every year. So when we started the
discussion five, six years ago, it was much worse. Right now it's much better. Many things are solved, improved,
but many things still are not solved. For example, schema changes are not replicated.
And sequences. There is work in progress, but if it's committed it will be only in Postgres
19, not in 18. So it means like long wait for many people. So what are your plans here? Are you ready to deal with
problems like this? Pure Postgres problems, you know? Yeah, yeah. If you ask me, I think the
Postgres problems are less than what we faced with MySQL.
I wanted to involve physical as well, because of this great talk by Kukushkin, which describes
very bad anomalies when data loss happens and so on.
Let's talk about this.
Yeah, we should talk about both.
I think overall the Postgres design is cleaner is what I would say.
You can feel that from things.
Like the design somewhat supersedes performance,
which I think in my case is a good trade off,
especially for sharded solutions,
because some of these design decisions affect you
only if you are running at, you know,
if you are pushing it really, really hard,
then these design decisions affect you.
But if your instances are small to medium size,
you won't even know and then you benefit from the fact that these designs are good.
So I actually like the approaches that Postgres has taken with respect to the
wall as well as logical replication. And by the way, I think logical replication
theoretically can do better things than what it does now and we should push those limits but yes I think the the
issue about schema not being as part of logical replication it feels like that
is also a theoretically solvable problem except that people haven't gotten to it
I think there are issues about the transactionality of DDLs, which doesn't even exist in MySQL. So at least in Postgres,
it's like it exists in most cases. There are only a few cases where it is not.
And for such things, a proxy layer like Multigres or Vitess, it's not a problem, because you can say, oh, okay.
This particular construct is not transactionally safe, well, then we'll even
prevent you from doing it transactionally because we don't want you to get the wrong impression.
We'll let you do it non-transactionally and we know that it's non-transactional and therefore
we can do something about it, right? Those abilities don't exist previously, but eventually
if it becomes transactional, then we can actually include it in a transaction. Yeah, just for those who are curious
because there is this notion that all DDL in Postgres is transactional. Here we talk about things like CREATE INDEX CONCURRENTLY, because we had a discussion offline about this before recording. So yeah, CREATE INDEX CONCURRENTLY can be an issue, but you obviously have a solution for it. That's great.
The way I would say it is, we have
dealt with much worse with MySQL. So this is much better than what was there then.
Sounds good. Okay, good. And let's talk about physical replication because I saw you are going to use it.
So each shard is going to have standbys.
And we have, yeah, we have quorum commit, right?
So like we don't lose data
because data is on local disks by default, as I understand.
Like if it's a metal instance in the cloud,
so we want to be sure data is written
to at least two places, for example, or three, right?
Configurable, of course.
There's this interesting talk by Kukushkin, he presented it recently at an online conference by Microsoft, describing that
synchronous replication in Postgres is not what you think.
Correct.
What are you going to do about this? Well, I was just chatting with
someone and essentially synchronous replication is theoretically impure when it comes to consensus.
I think it's provable that if you use synchronous replication then you will hit corner cases that you can't handle.
And the most egregious situation is that it can lead to
some level of definitely split brain,
but in some cases it can even lead to downstream issues.
Because it's a leaky abstraction,
it's a leaky implementation,
there are situations where you can see a transaction and think
that it is committed, and later the system may fail and in the recovery may choose not
to propagate that transaction, or may not be able to, and it's going to discard that
transaction and move forward.
But this is the same as with asynchronous replication, it's the same, we're just losing some data, right?
Exactly. It is the same as asynchronous replication.
Yes.
It's not split-brain.
It's just data loss.
It's data loss, correct.
It's data loss.
But for example, if you are running
a logical replication of one of those,
then that logical replication may actually
propagate it into an external
system and now you have corrupted downstream systems that don't match the source.
So those risks exist, and at Vitess scale people see this all the time, for example, and they
have to build defenses against this and it's very, very painful. It's not impossible, but it's very hard to reason about failures
when a system is behaving like this.
So that is the problem with synchronous replication.
And this is the reason why I feel like it may be worth patching Postgres.
Because there is no existing primitive in Postgres
on which you can build a clean consensus system.
I feel like that primitive should be in Postgres.
I now remember from Kukushkin's talk,
there is another case when a transaction on the primary looks like it's not committed, because we wait for a replica, but the replica somehow lost the connection or something.
And the client thinks it's not committed, because the commit was not returned.
But then it suddenly looks committed.
It's like not data loss, it's data un-loss somehow.
Correct.
Like suddenly.
And this is not all right as well.
And when you think about consensus, I think you are very good at describing these things, like consensus and distributed systems.
It feels like if you have two places to write, definitely there will be corner cases where
something will go wrong if you don't use two-phase commit, right?
Correct.
And here we have this.
But when you say you're going to bring something with consensus, it
immediately triggers my memory of how difficult it is and how many attempts were made to bring pure HA to Postgres, just to have auto-failover.
All of them failed.
All of them.
And they say, let's do it outside of Postgres. So here maybe it will be of similar complexity to bring this inside Postgres. Is it possible to build this thing outside, maybe?
It is not possible to build it outside. Because if it was, that is what I would have proposed.
The reason is because building it outside is like putting Band-Aid over the problem.
It will not solve the core problem.
The core problem is you've committed data in one place
and if that data can be lost and there is a gap
when the data can be read by someone,
that is the root cause of the problem.
That is unsolvable.
Even if, later, Raft may choose to honor that transaction or not, that becomes ambiguous, and we don't want ambiguity.
What if we created some extension to commit, like make it extendable to talk to some external stuff, to understand that the commit can be finalized, or something.
I don't know, consensus.
Correct, correct.
So essentially, if you reason through this, your answer will become a two-phase
system.
Yeah.
Without a two-phase commit.
Which scares me.
But as I told you, two-phase commit in the OLTP world, in the Postgres OLTP world, is considered really,
really slow and the rule is let's just avoid it. I see your enthusiasm and I think I couldn't find good
benchmarks. Zero. Published. This is not two-phase commit by the way. This is two-phase synchronization.
I understand. It's not two-phase commit, it's like more communication happens. I understand this.
So two-phase synchronization, the network overhead is exactly the same as full sync,
because the transaction completes on the first sync. Later it sends an acknowledgement saying
that yes, I'm happy you can commit it, but the transaction completes on the first sync,
so it will be no worse than full sync. Yeah, compared to the current situation, when primary commit happens, but there is a lock
which is being held until...
Correct, it is the same cost.
Yeah, we wait until the standby.
And for users, it looks like, when the lock is released, the client thinks, okay, the commit happened.
But the problem with this design: if, for example, the standby restarts,
lock is automatically released and commit is here
and it's unexpected.
This is data un-loss, right?
So you are saying we can redesign this,
network cost will be the same, but it will be pure.
Yeah, that's great.
I like this.
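A minimal sketch of the ordering described above: the transaction only becomes visible once the data is known to exist on another node, and the follow-up acknowledgement happens off the commit path. All names here (replicate, finalize, follower) are invented for illustration; this is not the Multigres design or any Postgres API.

```go
// Illustrative pseudocode of a "two-phase synchronization" commit ordering.
// Everything here is a sketch; types and function names are hypothetical.
package main

import "fmt"

type follower struct{ name string }

// replicate ships the WAL record and returns once the follower has persisted it.
func (f *follower) replicate(walRecord string) error {
	fmt.Printf("%s persisted %q\n", f.name, walRecord)
	return nil
}

// finalize tells the follower that the record is now considered committed.
func (f *follower) finalize(walRecord string) {
	fmt.Printf("%s marked %q as committed\n", f.name, walRecord)
}

// commit makes the transaction visible only after the data exists elsewhere.
// The client-visible cost is one round trip to a follower, the same as full
// synchronous replication; the second phase is off the commit path.
func commit(walRecord string, f *follower) error {
	// Phase 1: synchronize. The record must reach another node before the
	// primary treats the transaction as committed or lets anyone read it.
	if err := f.replicate(walRecord); err != nil {
		return err // not durable elsewhere, so the commit is never exposed
	}

	// The transaction completes here and the client gets its acknowledgement.
	fmt.Printf("primary commits %q and acknowledges the client\n", walRecord)

	// Phase 2: acknowledgement to the follower that it may finalize the commit.
	// In a real system this would be asynchronous, after the client returns.
	f.finalize(walRecord)
	return nil
}

func main() {
	_ = commit("INSERT INTO t VALUES (1)", &follower{name: "standby-1"})
}
```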
I'm just thinking, will it be acceptable? Because bringing auto-failover was not acceptable. There was another attempt last year from someone with great enthusiasm: let's bring auto-failover inside Postgres. Actually, maybe you know this guy, it was Konstantin Osipov, who built the Tarantool database system, it's like in-memory. He was at MySQL working on performance after Zaitsev; Zaitsev was ex-MySQL when Osipov was at MySQL. So Konstantin came to Postgres saying, let's build this. Great enthusiasm, but it's extremely hard to convince people that such a big thing should be in core. So if you say it's not a big thing, that's already something.
So I can, it's a,
I'll probably have to explain it in a bigger blog.
But essentially, now that I've studied the problem well enough,
the reason why it's hard to implement consensus in Postgres with the WAL is because they are trying to make Raft work with the WAL, and there are limitations about how commits work within Postgres that mismatch with how Raft wants commits to be processed. And that mismatch, so far, I have not found a way to work around. But a
variation of Raft can be made to work.
Interesting.
The way the, I don't know if you know about my blog series that I wrote when I was at
PlanetScale. It's an eight-part blog series about generalized consensus.
People think that Raft is the only way to do consensus,
but it is one of a thousand ways to do consensus.
So that blog series explains the rules you must follow
if you have to build a consensus system,
and if you follow those rules,
you will get all the properties
that are required by a consensus system.
So this one that I have, the design that I have in mind, follows those rules.
And I am able to prove to myself that it will work, but it's not Raft.
It's going to be similar to Raft.
I think we can make Raft also work, but
that may require changes to the WAL, which I don't want to do. So this system I want to implement without changes to the WAL, as possibly a plugin.
Well, now I understand another reason why you cannot take Patroni: not only because it's Python versus Go, but also because you need another version of the consensus algorithm.
Correct.
And among those hundreds of thousands, millions of ways to...
By the way, Patroni can take this and use it, because it's very close to how full sync works.
Good. Okay.
I was just thinking, watching Alexander Kukushkin's talk, he said a couple of things that were interesting.
One is that he was surprised that this hasn't happened upstream.
So you definitely have an ally in Kukushkin in terms of trying to get this upstreamed.
But also that he thinks every cloud provider has had to patch Postgres to,
in order to offer their own high availability products with Postgres,
each one has had to patch it and they are having to, or you, you mentioned
earlier today, how painful it is to maintain even a small patch on something.
I don't think it's every one. I think it's Microsoft for sure, knowing where Kukushkin works, but maybe more, not every.
Yeah, all I mean is that there are a growing number of committers working for hyperscalers and hosting providers, so I suspect you might have more optimism for consensus, or at least a few
allies in terms of getting something committed upstream. So I personally think there might be
growing chance of this happening,
even if it hasn't in the past for some reason.
Yeah, I feel like, also, being new to the Postgres community,
I am feeling a little shy about proposing this upfront.
So what I'm thinking of doing is at least show it working,
show it working at scale,
have people gain confidence
that this is actually efficient and performant and safe. So I also plan to... I don't know if you've heard of FlexPaxos, which is actually, in my opinion, a better way to handle durability,
because today's cloud environments are a lot more complex and a
simple majority-based quorum is actually very hard to configure if your needs are different,
which actually FlexPaxos does handle.
It's actually something I'm a co-inventor of, of some sort.
And there's a blog post.
I only heard the name that's it. Can you explain a little bit?
Oh sure, yeah. So actually, let me explain the reason why. So FlexPaxos
was published a few years ago, about seven years ago or so. And you'll see my name mentioned there, you know, which I feel very proud of.
And the blog series that I wrote is actually a refinement of FlexPaxos.
And that actually explains better why these things are important.
The reason why it's important is because people think of consensus as either a
bunch of nodes agreeing on a value, right?
That's what you commonly hear,
or you think of like reaching majority,
reaching quorum is important.
But the true reason for consensus is just durability.
Is when you say, when you ask for a commit
and the system says, yes, I have it,
you don't want the system to lose it.
you don't want the system to lose it.
So instead of defining quorum and all those things,
define the problem as it is and solve it
the way it was asked for is,
how do you solve the problem of durability
in a transactional system?
And the simple answer to that is
make sure your data is elsewhere.
Yeah.
I love how simple you make it.
Yeah, if you make sure your data is elsewhere,
if there is a failure,
your challenge is to find out where the data is
and continue from where it went.
And that is all that consensus is about.
And then all you have to do is have rules
to make sure that these properties are preserved.
And Raft is only just one way to do this.
So if you look at this problem,
if you approach this problem this way,
you could ask for something like,
I just want my data to go across availability zones.
As long as it's in a different availability zone,
I'm happy, right?
Or you can say, I want the data to be across regions,
or I want at least two other nodes to have it, right?
So that's your durability requirement.
But you could say, I want two other nodes to have it,
but I want to run seven nodes in the system or 20 nodes.
It sounds outrageous, but it is actually very practical. At YouTube we had 70 replicas, but the data had to be in only one other node for it to be durable, and we were able to
run this at scale. The trade-off is that when you do a failover,
you have a wild goose chase looking for the transaction
that went elsewhere, but you find it and then you continue.
So that is basically the principle
of this consensus system,
and that's what I want to bring to Multigres,
while making sure that the people that
want simpler majority-based quorums to also work
using those same primitives.
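As a rough illustration of this durability-policy framing, a quorum rule decoupled from the number of replicas, here is a toy sketch. The types (Ack, DurabilityPolicy, CrossZone, AtLeastN) are hypothetical names made up for this example, not the FlexPaxos paper's terminology or the Multigres API.

```go
// A toy sketch of durability expressed as a policy, independent of replica count.
// All names here are hypothetical and exist only for illustration.
package main

import "fmt"

// Ack is an acknowledgement from one replica that it has persisted the data.
type Ack struct {
	Node string
	Zone string // availability zone of the acking replica
}

// DurabilityPolicy decides whether a set of acks makes a commit durable.
type DurabilityPolicy interface {
	Satisfied(primaryZone string, acks []Ack) bool
}

// CrossZone: the data must exist in at least one zone other than the primary's.
type CrossZone struct{}

func (CrossZone) Satisfied(primaryZone string, acks []Ack) bool {
	for _, a := range acks {
		if a.Zone != primaryZone {
			return true
		}
	}
	return false
}

// AtLeastN: the data must exist on at least N other nodes, no matter how many
// replicas the cluster runs (the "70 replicas, 1 ack" case mentioned above).
type AtLeastN struct{ N int }

func (p AtLeastN) Satisfied(_ string, acks []Ack) bool {
	return len(acks) >= p.N
}

func main() {
	acks := []Ack{{Node: "replica-17", Zone: "us-east-1b"}}
	policies := map[string]DurabilityPolicy{
		"cross-zone":       CrossZone{},
		"at-least-2-nodes": AtLeastN{N: 2},
	}
	for name, p := range policies {
		fmt.Printf("%-16s satisfied: %v\n", name, p.Satisfied("us-east-1a", acks))
	}
}
```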
Just quickly, to clarify, when you say the wild goose chase,
is that because it was one of 70,
but different transactions could have
gone to different ones of the 70?
Or it's always the same one, but you have
to know which one that is.
No, it could be any one of them. But the way we ran it, at any given point of time there's only one primary,
which means that there is only one transaction that you have to chase down.
Ah, the latest one. The latest one, yes. Yeah, makes sense.
Yeah.
There was a time when we found that transaction in a different country.
So we had to bring it back home and then continue.
It happened once in, whatever, the 10 years that we ran it.
It's interesting that talking about sharding,
we need to discuss these things, which are not sharding per se.
So it's about HA inside each shard.
It's actually like what I would call healthy database
principles, which is, I think, somewhat more important than sharding.
Yeah.
Yeah.
Yeah.
It is, it is true that it is to do with it being a distributed system, right?
Like, and that is because it's sharded, no?
Uh, I think they are orthogonal.
Okay.
Yeah.
I think sharding,
like you can do sharding on anything.
You can do sharding on RDS.
Somebody asked me, what about Neon?
I said, you can do sharding on Neon too.
It's like you put a proxy in front and then it does a sharding.
But the problem with sharding is it is not just a proxy.
That's what people think of it when they first think of the problem
because they haven't looked ahead.
Once you have sharded, you have to evolve.
You start with four shards, then you have to go to eight shards.
And the evolution is not linear this way.
Actually, it's an exponential growth because 4, 8, 16.
But at some point of time, it changes
because your sharding scheme itself will not scale.
Like if you, for example, are in a multi-tenant workload
and you say shard by tenant, at some point of time, a single tenant is going to be
so big that they won't fit in an instance.
And that we have seen.
And at that time, we have to change the sharding scheme.
So how do you change the sharding scheme?
Slack had to go through this,
where they had a tenant-based sharding scheme, and a single tenant just became too big.
They couldn't even fit one tenant in one shard.
So they had to change their sharding scheme
to be user-based.
They actually talk about it in one of their presentations.
And Vitesse has the tools to do these changes
without actually incurring any kind of downtime,
which again, Multigres will have. I keep talking about Vitess, but these are all things that Multigres will have, which means that you are future-proofed when it comes to this.
And these are extremely difficult problems to solve.
Because when you're talking about changing the sharding
scheme, you are basically looking
at a full crisscross replication of data.
And across data centers.
Yeah, and also, like, I know Vitess version three, right? It was when you basically created a new planner, right, to deal with arbitrary queries
and understand how to route it properly and where to execute
it.
Is it a single shard or it's global or it's different shards and so on?
Are you going to do the same with Postgres?
I think yes, right?
So that's the part that I'm still on the fence about.
By the way, the V3 now has become Gen 4.
It's actually much better than what it was when I built it.
The problem with V3 is that it is still not a full query engine.
It doesn't support the full query set yet.
It supports like 90% of it, I would say,
but not everything.
On the temptation side, there's the Postgres engine
that supports everything.
So I'm still debating how do we bring the two together.
If it was possible to do a simple Git merge,
I would do it.
But obviously, this one is in C, that was in Go.
And the part that I'm trying to figure out
is how much of the sharding bias exists in the current engine
in Vitess.
If we brought the Postgres engine as is,
without that sharding bias, would this engine
work well for a sharded system?
So this looks like Citus already, right?
If you bring a whole Postgres.
There's a library, libpg_query, by Lukas Fittl, which basically takes the parser part of Postgres and brings it out, and there is a Go version of it as well. So you can build on top of it.
Oh, libpg_query, you said?
Yeah, yeah, yeah, I will send it to you.
So many, many systems use it when we need to parse.
Yeah, one day I told the SPQR guys about this, and eventually I think they ended up using it. I think Lev uses it, PgDog also uses it, the Rust version.
Is it like 100% Postgres compatible?
Well, it's Postgres source, it's based on Postgres source code. So the parser is fully brought over, but it's not the whole of Postgres. So maybe you should consider this. If thinking about
parsing, I mean queries and so on, but I'm very curious, I also noticed you mentioned
routing, read-only queries routed to replicas automatically, and this concerns me a lot, because many Postgres developers, I mean users, they use PL/pgSQL functions, or PL/Python functions, anything, which are
writing data.
And the standard way to call a function is SELECT, SELECT function name.
So understanding that this function is actually writing data is not trivial.
And we know pgpool, which, all my life, I just avoid.
I touched it a few times, decided not to use it at all, because it tries to do a lot of stuff at once.
And I always considered, like, no, I'm not going to use this tool. So pgpool solves it by saying, okay, let's build a list of functions which are actually writing, or something like this. So it's a patch approach, you know, a workaround approach. So this is going to be a huge challenge, I think, for automatic routing. It's a huge one.
Yeah, I think this is the reason why I think it is important to have the full Postgres
functional engine in Multigres, because then these things will work as intended, is my hope.
What we will have to do is add our own sharded understanding to these functions and figure,
oh, what does
it mean to call this function, right? If this function is going to call out to a
different shard, then that interpretation has to happen at the higher level. But
if that function is going to be accessing something within shards, then
push the whole thing down, just push the whole SELECT along with the
function down and let the individual
Postgres instance do it.
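To make the routing problem raised above more concrete: a plain read and a SELECT that calls a writing function look identical at the parse-tree level. A small sketch using the Go bindings for libpg_query mentioned earlier; the module version in the import path and the register_view function are assumptions made purely for illustration.

```go
// Sketch only: both statements below parse as plain SELECTs, so the parser
// alone cannot tell a read from a SELECT that calls a data-writing function.
// Assumes the Go bindings for libpg_query (github.com/pganalyze/pg_query_go);
// the version in the import path may differ, and register_view() is a made-up
// PL/pgSQL function used purely for illustration.
package main

import (
	"fmt"

	pg_query "github.com/pganalyze/pg_query_go/v5"
)

func main() {
	statements := []string{
		"SELECT id, title FROM videos WHERE id = 1", // clearly read-only
		"SELECT register_view(1)",                   // may INSERT inside the function
	}
	for _, sql := range statements {
		tree, err := pg_query.ParseToJSON(sql)
		if err != nil {
			panic(err)
		}
		// Both trees have a SelectStmt at the top; whether the function writes
		// is only known from the function's definition, not from the query text.
		fmt.Printf("%s\n  parse tree: %s\n\n", sql, tree)
	}
}
```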
Yeah, but how to understand it? A function can contain another function, and so on. It can be so complex in some cases. It's also funny that there is still, there is actually, Google Cloud SQL supports it, a kind of language, well, not a language, called PL/Proxy, which is sharding for those who have their workload
only in functions.
It can route to the proper shard.
It was created at Skype.
It still exists, but not super popular these days.
But there is a big requirement to write everything in functions. In your case, if you continue, I would expect in some cases you would say, okay, don't use functions. But I'm afraid it's not possible. Like, I love functions. Actually, Supabase loves functions, because they use PostgREST. PostgREST, like, it provokes you to use functions.
Oh really? Oh yeah, yeah, yeah. Actually, I saw that, yeah.
So, in Vitess, I feel like this was a mistake
that we made, which is if we felt that anything,
any functionality that you used didn't make sense.
Like if I were you, I wouldn't do this, right?
Because it won't scale, it's a bad idea, you know, it's like,
those we didn't support.
We didn't want to support.
We said, no, we will never do this for you
because, you know, we'll not give you a rope
long enough to hang yourself.
This is, basically, that was our philosophy.
But in the new one, in Multigres, we want to move away from that. Which means
that if you want to call a function that writes, have at it.
Just put a comment that it's going to write something.
Yeah, if you want a function that calls a function that writes, have at it. If we cannot, like the worst case scenario for us is
we don't know how to optimize this.
And what we will do is we'll execute the whole thing
on the VTGate side.
There's another, I remember there is an interesting solution in, I think, AWS RDS Proxy, which, as I know it, maybe I'm wrong, when they needed to create a global... it's called, I think, Aurora Global Database,
maybe, or something like this.
So there is a secondary cluster living in a different region, and it's purely read-only,
but it accepts writes. And when a write comes, this proxy routes it to the original primary, waits until this write is propagated back to the replica, and responds.
Oh wow! I don't think that feature can be supported.
No, it's just some exotic interesting solution I just wanted to share.
Maybe, you know, if, for example, you originally route a write to a replica,
then somehow in Postgres you understand, oh, it's actually a write.
So maybe 100% is theoretically impossible to support.
Okay, it's super exotic.
Okay.
But I think if people are doing like that, doing things like that, it means that they are trying to solve
a problem that doesn't have a good existing solution.
So if we can find a good existing solution, I think they'll be very happy to adopt that
instead of whatever they were trying to do. Well, this is just multi-region setup.
I saw more than one CTO who wanted it for Postgres, saying, we are
still single region.
We need to be present in multiple regions in case one AWS region is down.
It's also about the chain.
So availability and the business characteristics.
So yeah, anyway. Okay. Yeah, it's exotic. But interesting still. Yeah.
So you've got a lot of work ahead of you, Sugu.
I feel like we barely covered like one of so many topics.
Let's touch something else.
Maybe it's a very long episode, but it's worth it, I think.
It's super interesting.
What else?
I think the other interesting one would be 2PC and isolation.
Isolation from what?
The one issue with a sharded solution is that, again, this is a philosophy: for the longest time in Vitess we didn't allow 2PC.
We said, you shard in such a way that you do not have distributed transactions.
And many people lived with that.
And some people actually did not.
Let me interrupt you here, because this is the feature I liked best about Vitess.
It's this materialize feature, when data is brought over.
Oh yeah, materialized is another topic.
That's actually a better topic than 2PC.
Well, yeah, because this is your strength, right?
So this is like, I love this idea.
Basically, distributed materialized view,
which is incrementally updated.
Yes, yes, yes.
That's great.
We need it in the Postgres ecosystem. Just maybe as a separate project, even.
Correct.
We'd like it everywhere.
So yeah, this is how you avoid distributed transactions basically, right?
No, this is one way to avoid it.
One way.
Yeah.
There are two use cases where materialized views are super awesome.
You know, a table that has multiple foreign keys,
that has foreign keys to two different tables
is the classic use case,
where the example I gave was a user that's producing music
and listeners that are listening to it,
which means that the row where I listen to this music
has two foreign keys, one to the creator
and one to the listener.
And where should this row live?
Should this row live with the creator
or should this row live with the listener
is a classic problem.
And there is no perfect solution for it.
It depends on your traffic pattern.
But what if the traffic pattern is one way in one case
and another way in another case?
There is no perfect solution.
So this is where, in Multigres, what you could do is you say, okay, in most cases, this row should live with the creator.
Let's assume that, right?
So then you say this row lives with the creator and we shard it this way, which
means that if you join the creator table with
the, with this event row, it'll be all local joins. But if you join the listener's table
with this event row, it's a huge cross-shard wild goose chase. So in this case, you
can say materialize this table using a different foreign key, which is the listener's foreign key, into
the same sharded database as a different table name. And now you can do a local
join with the listener and this event table. And this materialized view is near
real time, basically the time it takes to read the WAL and apply it. And this
can go on forever.
And this is actually also the secret behind resharding,
changing the sharding key.
This is essentially a table that is present in real time with two sharding keys.
If you say, oh, at some point of time,
this is more authoritative,
all you have to do is swap this out,
make one the source, the other is a target.
You've changed your sharding key.
Actually, the change sharding key works exactly like this for a table.
Distributed denormalization technique.
This is what it is.
Yeah, exactly.
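A tiny illustration of the idea just described: the same listen event is stored twice, once sharded by the creator and once, as a materialized copy, sharded by the listener, so joins on either side stay local. The hash, shard count, and table names here are arbitrary assumptions; Vitess and Multigres decide shard placement with their own mechanisms.

```go
// Sketch only: one event row kept under two sharding keys. The hashing scheme,
// shard count, and table names are arbitrary assumptions for illustration,
// not how Vitess or Multigres actually assign rows to shards.
package main

import (
	"fmt"
	"hash/fnv"
)

const shards = 256

// shardOf maps a sharding key to a shard number with a simple hash.
func shardOf(key uint64) int {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d", key)
	return int(h.Sum64() % shards)
}

// ListenEvent has two foreign keys, so either one could be the sharding key.
type ListenEvent struct {
	CreatorID  uint64
	ListenerID uint64
}

func main() {
	ev := ListenEvent{CreatorID: 42, ListenerID: 1001}

	// Source table: sharded by creator, so joins with the creators table are local.
	fmt.Println("listen_events (by creator)  -> shard", shardOf(ev.CreatorID))

	// Materialized copy: the same row, sharded by listener, so joins with the
	// listeners table are local too. Resharding swaps which copy is the source.
	fmt.Println("listen_events_by_listener   -> shard", shardOf(ev.ListenerID))
}
```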
The other use case is when you reshard,
you leave behind smaller tables,
reference tables, we call them.
And they have to live in a different database because they are too small.
And even if you shard them, they won't shard well.
Like if you have, you know, a billion rows in one table and, you know, a thousand rows
in a smaller table, you don't want to shard your thousand row table.
And there's no benefit to sharding that either.
So it's better that that row lives,
that table lives in a separate database.
But if you want to join between these two,
how do you do it, right?
The only way you join is join at the application level,
read one and then read the other.
And so at high QPS, it's not efficient. So what we can do
is actually materialize this table on all the shards as a reference, and then all joins become
local. Yeah, and you definitely need logical replication for all this. So this is where
we started, like challenges with logical replication. Yeah. Yeah. Great.
You do have the reason why 2PC is still important.
Because there are trade-offs to this solution,
which is there's a lag.
So it takes time for the things to go through the WAL and come to the other side.
Whereas 2PC is essentially basically the transaction system itself trying to complete a transaction.
Which means that it will handle cases where there are race conditions, right?
If somebody else tries to change that row elsewhere while this row is being changed,
2PC will block that from happening.
Whereas in the other case you cannot do that.
Yeah, if it's just user views of some video like on YouTube, we can say, okay, there will
be some lag, probably some small mistake, it's fine. But if it's financial data, it
should be 2PC, but write latency will be high and throughput will be low, right? This is... I actually read the design of it,
which is, by the way, again a very elegant API.
And I assume I can see the implementation on the API.
And I don't think we will see performance problems
with 2PC.
We need to benchmark it.
We will benchmark it, but I will be very surprised.
I think there are some isolation issues that we may not have time to go through today because
it's a long topic.
Like the way 2PC is currently supported in Postgres, I think it will perform really well.
Isolation issues when we're at read committed and use 2PC.
You mean this, right? Not repeatable read, but the default.
Yeah, there will be some trade-offs on read committed, but not the kind that will affect most applications.
MVCC will be the bigger challenge. But from what I hear, the most common use case is read committed.
Of course, it's default.
Yeah, it's faster.
It's the default.
Yeah.
So people won't even, yeah, I don't, I think this is...
They're already in a bad state in some ways.
It won't be worse.
It won't be worse.
Yes.
Yes.
Yeah.
As for 2PC, it of course depends a lot on the distance
between nodes, right?
Like if they are far... the client is somewhere,
the two nodes are somewhere,
and if it's different availability zones, it depends, right?
So this distance, the network, is a big contributor
to latency.
Because there are four communication
messages that are needed.
So correct.
Correct.
Actually, I have the mathematics for it, but you're probably right.
It's about double the number of round trips.
Yeah, if we put everything in one AZ, client and both primaries, we are fine.
But in reality, they will be in different places, and if it's different regions, it's
a nightmare, of course.
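A rough back-of-the-envelope for that, with purely illustrative round-trip times (none of these numbers are from the episode): a classic two-phase commit adds a prepare round plus a commit round per participant, so the added commit latency is roughly twice the gateway-to-primary round trip.

```latex
% Illustrative only; the real cost depends on topology and implementation.
T_{\mathrm{2PC}} \;\approx\; 2 \times \mathrm{RTT}_{\text{gateway} \leftrightarrow \text{primary}}
% Same AZ:           RTT ~ 0.5 ms       ->  ~1 ms added per commit
% Different AZs:     RTT ~ 1-2 ms       ->  ~2-4 ms added per commit
% Different regions: RTT of tens of ms  ->  tens of ms added per commit
```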
But at least it is.
Yeah, the 2PC is not done by the client, by the way.
The 2PC would be done by VTGate, which should have the nodes nearby.
Ah, should have. In one availability zone?
Unless you did some crazy configuration,
they should be pretty close to each other.
Pretty close means still different availability zones,
in general case, right?
No. The availability zone is only for durability,
for replica level.
But 2PC is coordinating between two primaries,
which may actually be on the same machine for all you care.
It's not...
Well, imagine a real practical case.
We have shard, every shard has primary and a couple of standbys, right?
Correct.
So, are you saying that we need to keep primaries all in the same availability zone?
That's usually how things are.
Ah, interesting.
I didn't know about this.
By the way, I wanted to rant a little bit about the PlanetScale benchmarks they published
last week.
They compared to everyone.
I'm sorry, I will take a little bit of time. They compared to everyone and they just published, like, PlanetScale versus something.
And on this very topic, on the charts, they have PlanetScale in a single AZ, everything, client
and server in the same AZ.
And there is a line for the normal case, where the client is in a different AZ,
but the same-AZ line is the one that's active by default, and the normal one is not.
And for the others, like Neon, Supabase, everyone, it's a different AZ.
And of course PlanetScale looks really good, because by default
they presented numbers for the same availability zone.
And below the chart everything is explained, but who reads it, right?
People just see the graphs.
And you can select the proper PlanetScale numbers and see that they are similar.
But by default the same-AZ number is chosen.
This is, like, benchmarking...
You know, like...
You know, like...
Well, I think if you look at the architecture,
even in a fair comparison,
PlanetScale should come out ahead,
because of the performance of a local disk.
Of course.
Should.
But this was SELECT 1.
Disks don't matter.
Disks don't matter.
But SELECT 1 is not a benchmark.
Well, it was part of the benchmark.
It's just checking the query path, but it fully depends on where the client and server are located.
So what's the point of showing better numbers by just putting the client closer, right?
I don't like that part of that benchmark.
Oh, okay. Yeah. I saw the publications, but I didn't go into the details because I thought,
well, it has to be faster because it's on local disks.
Well, for data which is not fully in cache, of course. Yeah. Local disks are amazing versus
EBS volumes.
You're right. Yeah. If the data is in cache, then the performance of everything would be the same.
Yeah, well, I wanted to share this. I was annoyed about this, but I fully support the idea of
local disks. It's great. I think we need to use them in more and more systems.
I wouldn't be surprised, if you reached out to PlanetScale and wanted to run your benchmarks,
they may be willing to give you the...
Yeah, this got published, and in general the benchmarks look great.
The idea is great. Okay, and actually with local disks, the only concern is usually the
hard limit. We cannot have more space, but if you have a sharded
solution, there is no such thing.
Correct. But speaking about the hard limit, today's SSDs, you can buy a 100-plus-terabyte
SSD, a single SSD, and you can probably stack them up one next to the other.
But in cloud there are limitations.
In cloud there are limitations, for instance.
Okay, yeah.
I saw AWS SSDs over 100 terabytes.
In Google Cloud, 72 terabytes is the hard limit for z3-metal, and I didn't see more.
So 72 terabytes is a lot, but sometimes it's already not a lot.
At that limit, your storage is not the limit. You will not be able to run a database of that size on a single machine.
Why not? You'll max out your CPU.
We have cases... CPU, well, again, the problem will be replication. If we talk about a single node, we can have 360 cores,
or even, yes, almost 1,000 cores already
for Xeon Scalable Generation 5 or something.
So hundreds of cores.
The problem is Postgres design.
If physical replication was multi-threaded,
we could scale more.
By the way, replication is not the only problem.
Backup recovery.
If your machine goes down, you're down for hours.
Recovering something of that size.
Yeah, yeah.
Not many hours.
Someone in my team recently saw 17 or 19 terabytes per hour for recovery with pgBackRest. Wow!
Yes, in AWS.
I was like, my jaw dropped.
I was, on our podcast, I usually say one terabyte is-
Can you repeat that?
17 or 19 terabytes per hour?
17 terabytes per hour.
17 terabytes per hour with local disks.
So this is important.
With EBS, it's not possible.
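To put those throughput numbers in perspective, a bit of hedged arithmetic, using the 72-terabyte cloud limit mentioned earlier purely as an example database size:

```latex
% Restore time = database size / restore throughput (illustrative)
72\ \mathrm{TB} \div 1\ \mathrm{TB/h} = 72\ \mathrm{h} \quad \text{(about three days)}
72\ \mathrm{TB} \div 17\ \mathrm{TB/h} \approx 4.2\ \mathrm{h}
```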
It's good to know.
Yeah, well, with Michael, I was always saying
one terabyte per hour is what you should achieve.
If it's below that, it's bad.
Now I'm thinking, one terabyte is already...
Yeah, yeah.
So with EBS volumes, we managed to achieve, I think,
seven terabytes per hour to restore
with WAL-G.
But there is a danger there, you could become a noisy neighbor. So we actually built throttling
in our restore just to prevent noisy neighbors.
With local disks you lose the ability to use EBS snapshots, cloud disk snapshots.
Correct, correct, yeah.
That's what you lose, unfortunately.
And they're great, and people enjoy them more and more.
So I agree.
And as I remember, for 17 terabytes,
it was 128 threads of WAL-G or pgBackRest,
I don't remember.
Wow.
But with local disks, S3.
I need to update my knowledge,
this technology is changing too fast.
Exactly, yeah.
Hundreds of cores, terabytes of RAM already, right?
Like, yeah.
Yeah, yeah.
But it does go straight to your point of the smaller they are, the faster you can
recover still.
And you don't hit some of these limits.
Like these systems were not designed with these types of limits in mind.
Some weird data structure, you know, suddenly the limit on it is only, you know, a hundred
items, and you hit that limit and then you're stuck.
Like recently Metronome had an issue.
Yeah.
They had that outage.
Multixact.
The multixact thing which nobody has ever run into before,
but they hit that problem.
Yeah. We saw many problems also when you're on the edge.
It actually pushes Postgres forward sometimes.
But if you want to be on the safe side... I really like it.
It's kind of a resilience characteristic.
Even if something is down, only a small part of your system is down.
That's great. That's mature architecture already.
That actually makes it easier to achieve five nines of uptime, because of the way you calculate it.
Like if only one node is down, you weight it by the fraction of users being affected.
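To make that calculation concrete, a hedged example (the shard count and outage length here are hypothetical, not from the episode):

```latex
% Five nines = 99.999% availability, i.e. about 5.26 minutes of full
% downtime per year:
(1 - 0.99999) \times 365.25 \times 24 \times 60 \approx 5.26\ \text{min/year}
% If an outage affects only 1 of 100 equally sized shards, roughly 1% of
% users are hit, so a 60-minute shard outage consumes only about 0.6 minutes
% of the user-weighted budget:
60\ \text{min} \times \frac{1}{100} = 0.6\ \text{min}
```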
Downtime budget. Yeah. That's good. Cool. I think it's maybe one of the longest episodes we've had.
Oh my God!
I enjoyed it.
I hope we will continue the discussion of issues with logical replication, for example,
and maybe how things will be improved, and so on.
Looking forward to testing the POC once you have it.
Thank you so much for coming.
I am so excited. Yes. Thank you. Yeah. Are there any last things you wanted to add, or
anything you wanted help from people on? I would say it feels like nothing is happening
on the repository except me pushing, you know, a few things, changes, but a
huge amount of work is happening in the background. Some of
this design work about consensus is almost ready to go, and there's also hiring going
on; there are people coming on board very soon, so you will see this snowball. It's a very tiny
snowball right now, but it's going to get very big as momentum builds up.
So pretty excited about that.
We may still have one or two spots open to add to the team, but it's filling
up fast.
So, if any of you are very familiar with these areas: it is a very high bar to contribute to Multigres.
You have to understand consensus, you have to understand query processing. But if there are people who want to contribute, we are still looking for maybe one or two people,
also on the orchestration side and the Kubernetes side of things.
Do you mind a small joke at the end, just not to finish on a serious note?
I do not mind at all, let's hear it.
Yeah, so I know what you're doing. You're writing a lot of Markdown right now,
and then you will feed it to AI.
I wish!
My god, I almost hope that day never comes, but
it is so fun working on this project, creating it. Why would I want to give it to an AI to do?
You know?
Okay.
Good.
Thank you.
Enjoyed it a lot.
Yeah.
Yeah.
Thank you so much for joining us.
It's great to have you as part of the Postgres community now.
And I'm excited to see what you get up to.
And we are, too.
Wonderful.
Thanks so much.
Thank you.
Bye-bye.
Bye.