Postgres FM - Top ten dangerous issues
Episode Date: May 9, 2025

Nikolay and Michael discuss ten dangerous Postgres-related issues — ones that might be painful enough to get onto the CTO's or even the CEO's desk — and then what you can do proactively.

The ten issues discussed are:
- Heavy lock contention
- Bloat control and index maintenance
- Lightweight lock contention
- Transaction ID wraparound
- 4-byte integer PKs hitting the limit
- Replication limits
- Hard limits
- Data loss
- Poor HA choice (split brain)
- Corruption of various kinds

Some previous episodes they mentioned that cover the issues in more detail:
- PgDog https://postgres.fm/episodes/pgdog
- Performance cliffs https://postgres.fm/episodes/performance-cliffs
- Zero-downtime migrations https://postgres.fm/episodes/zero-downtime-migrations
- Queues in Postgres https://postgres.fm/episodes/queues-in-postgres
- Bloat https://postgres.fm/episodes/bloat
- Index maintenance https://postgres.fm/episodes/index-maintenance
- Subtransactions https://postgres.fm/episodes/subtransactions
- Four million TPS https://postgres.fm/episodes/four-million-tps
- Transaction ID wraparound https://postgres.fm/episodes/transaction-id-wraparound
- pg_squeeze https://postgres.fm/episodes/pg_squeeze
- synchronous_commit https://postgres.fm/episodes/synchronous_commit
- Managed service support https://postgres.fm/episodes/managed-service-support

And finally, some other things they mentioned:
- A great recent SQL Server-related podcast episode on tuning techniques https://kendralittle.com/2024/05/20/erik-darling-and-kendra-little-rate-sql-server-performance-tuning-techniques/
- Postgres Indexes, Partitioning and LWLock:LockManager Scalability (blog post by Jeremy Schneider) https://ardentperf.com/2024/03/03/postgres-indexes-partitioning-and-lwlocklockmanager-scalability/
- Do you vacuum everyday? (talk by Hannu Krosing) https://www.youtube.com/watch?v=JcRi8Z7rkPg
- pg_stat_wal https://pgpedia.info/p/pg_stat_wal.html
- The benefit of lz4 and zstd for Postgres WAL compression (Small Datum blog, Mark Callaghan) https://smalldatum.blogspot.com/2022/05/the-benefit-of-lz4-and-zstd-for.html
- Split-brain in case of network partition (CloudNativePG issue/discussion) https://github.com/cloudnative-pg/cloudnative-pg/discussions/7462

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
- Michael Christofides, founder of pgMustard
- Nikolay Samokhvalov, founder of Postgres.ai

With credit to:
- Jessie Draws for the elephant artwork
Transcript
Hello, hello. This is Postgres FM. I'm Nik, Postgres AI, and as usual, my co-host is Michael,
pgMustard. Hi, Michael. How are you doing? I'm good. How are you? Great. Everything's all right.
A lot of bugs to fix and incidents to troubleshoot, to perform root cause analysis,
as we say, right? RCA. Sounds related to our topic today, maybe.
Oh, yeah, maybe, yes.
So the topic I chose is,
I'm still not 100% sure how to name it properly.
So let's decide together.
But the situation is simple for me, relatively.
So we help a lot of startups, and at some point I decided
to focus only on startups, having raised a few startups myself and helped many startups. I
know how it feels to choose technology and grow, grow, grow until some problems start hitting you. This is exactly like... usually people
choose RDS or Cloud SQL or Supabase, anything, and they don't need to hire DBAs or DBREs,
and they grow quite well until a few terabytes of data or 10,000 TPS, that kind of scale.
And then problems pop up here and there, and sometimes they come in batches, not just one
problem but several.
And here usually, for us it's good, they come to us.
I mean, Postgres AI, we still have a consulting wing, quite strong and growing.
And we helped more than 20 startups over the last year,
which I'm very proud of. And I collected a lot of case studies, so to speak. And I
decided to have some classification of problems that don't feel good at a very high
level, for example CTO level or even CEO level, when they might
start thinking: is Postgres the right choice?
Or is it giving us too much headache?
And it's not about, like, suddenly running out of disk space or major upgrades requiring a
maintenance window, although those can also cause headaches. It's more about
problems where you don't know what to do, or you see it requires a lot of effort to solve properly.
Yeah, I've had a sneak peek of your list, so I like how you've described it. I also like the thought
process of whether it hits the CTO or the CEO. And I was thinking, let's say you have a non-technical CEO,
if they start hearing the word Postgres too often,
it's probably a bad sign, right?
Exactly.
Ideally, you might mention it once every few years
when you do a major version upgrade,
but then nothing bad happens
and they don't hear it again for a few years.
But if they're hearing, if it's getting to the CEO that Postgres is
causing problems over and over again, the natural question is going to be, is
there an alternative?
What could we do instead?
Or, you know, is this a big problem?
You know, so I guess it's these kinds of dangers, not just to the
startup, but also to Postgres's continued use at that startup.
Yeah, I like the word dangerous here because when you deal with some of these problems,
it might feel dangerous to have Postgres for them. It's bad. I would like if things were
better. Yeah. So I have a list of 10 items, and we can discuss them.
The list is unordered, and I'm going to post it to my social networks so folks can discuss it.
I sincerely think this list is useful. If you're a startup, it's great to just use this checklist to see how your cluster is
doing or clusters are doing and are you ready.
So Postgres growth readiness checklist.
And interesting that I didn't include vertical and horizontal scaling there.
I did it indirectly, we will touch it. But obviously, this is the most discussed topic,
biggest danger, like how Postgres scales, like cluster, single primary and multiple standbys,
how far we can go. We know we can go very far, very, very far on a single cluster. At
some point microservices or maybe sharding. It's great. But we had a great episode with
Lev Kokotov of PgDog, and one of the items I have today resonates
with what he said during our episode. So anyway, let's exclude vertical and horizontal scaling
and talk about stuff which kind of sounds boring.
My first item is heavy lock contention.
This is very popular.
Maybe 50% of companies that come to us have this issue.
Somehow.
Yeah. So at some point I decided to start saying to everyone:
if you have queue-like workloads,
and/or if you don't know how dangerous it is
to change schema in Postgres,
just adding a column can be a problem, right?
We discussed it, I think, many
times. You are not ready to grow and at some point, sooner or later, it will hit you. It
will hit you as a spike of active sessions. We know some managed Postgres platforms provoke you to have a huge max_connections, like 5,000 or 2,500.
Why do they do this?
Easier for them, but it's dangerous
because it additionally creates a kind of performance cliff.
Yeah, yeah.
It's another version of these cliffs, isn't it?
We had another good episode recently.
Yeah, I plan to research this a little bit. Probably we will publish something in this area
to prove that it's still not good. Even if you have Postgres 14 plus, which has great optimizations
for a large number of idle connections, it's still not good.
Yeah, and there have been some improvements. Like I know a very good engineer who took
Postgres down by adding a column with a default, I think it was, but that was many years ago.
There have been some improvements in recent years, so some DDL changes are less dangerous
than they were.
Yeah, there are several levels.
Yes. Yeah, of course. And if they get stuck, if they don't have a lock_timeout,
for example. In fact, yeah, we're probably going to be pointing to episodes on every single one
of these bullet points, but we had one on zero-downtime migrations. I think it's probably
the best for that. And we had a separate one on queues actually, didn't we? So yeah, yeah.
Yeah. So definitely there are solutions here, and you just need to actually deploy them. It's interesting that I see some companies
grow quite far without noticing this problem, for example with DDL. It's like going to a
casino: you can win, you can win, and sometimes, boom, you lose. Because if you deploy some
DDL and you get blocked, you can block others and it can
be a disaster. We discussed it several times. And if you had a hundred successful deployments,
it doesn't mean you will keep winning. So it's better to have... And it concerns me.
I have a feeling we should implement this in Postgres, like alter table concurrently
or something like this.
It should itself perform these retries
with a low lock_timeout.
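A minimal sketch of that retry pattern, as teams typically script it today; the table name and timeout value are hypothetical:

```sql
-- Fail fast instead of blocking everyone queued up behind the DDL, then retry
SET lock_timeout = '2s';
ALTER TABLE orders ADD COLUMN note text;
-- If this fails with "canceling statement due to lock timeout",
-- the deploy tooling waits briefly and retries the whole attempt.
```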
Yeah, it's tricky, isn't it?
But I agree, but then people still need to know
that it existed to actually use it
because I think the main issue here is people not realizing
that it can be a problem.
And the fact that it probably hits users,
let's say you've got a statement timeout,
like when are you actually gonna notice
that users have been waiting for,
are you gonna notice that spike on your monitoring?
I'm not sure, like it depends how many users
actually got stuck waiting behind it and had slow queries.
And it's going to be hard to reproduce; you might not know why it was that.
So log_lock_waits is off, so you don't see who blocked you.
It might be autovacuum running in its aggressive mode, or it can be another session, a long-running
transaction, which holds an AccessShareLock on a table, and
you cannot alter it, and boom, you block others. So this is like a chain reaction.
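A rough snapshot of who is blocking whom can be taken like this (just a sketch; continuous sampling in monitoring works better):

```sql
-- Sessions currently blocked, and the PIDs blocking them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       state,
       left(query, 60) AS query_sample
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```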
And yeah, it's not good. And queue-like workloads, same: at some smaller
scale you don't see problems at all, or you occasionally experience them, but if you grow very fast, you will start hitting these
problems very badly.
They look like spikes of heavy lock contention, or just heavy lock, "Lock" in Postgres terminology.
It's the same, so just lock contention.
It doesn't look good.
Suggestion is so simple.
It's funny that we talk a lot and people that come to us,
actually they mention they watch the podcast.
And I say, like, okay, queue-like workload: just take care of indexes,
take care of bloat, like maybe partitioning.
But most importantly, like
SKIP LOCKED, right? And that's it. Like, this is the solution, but we spend hours discussing the details,
because when you go to reality, it's not easy to apply this; there are
objections sometimes, but this is what we do:
we work with those objections and help to implement.
So yeah, but we had an episode for everything.
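For reference, the queue-consumer pattern being referred to, sketched against a hypothetical jobs table:

```sql
-- Workers pick jobs without blocking each other; rows locked by others are simply skipped
BEGIN;
SELECT id, payload
FROM jobs
WHERE status = 'queued'
ORDER BY id
FOR UPDATE SKIP LOCKED
LIMIT 10;
-- process the selected jobs in the application, mark them done, then
COMMIT;
```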
There are episodes for everything. So this was number one, heavy lock contention.
And I chose the most popular reasons.
Of course there are other reasons,
but in my view, DDL and queue-like workloads
are the biggest ones.
Okay, the next one is boring, super boring.
Bloat control and index management.
We had episodes about it, maybe several actually.
But since, again,
managed Postgres platforms don't give you tools.
For example, RDS: they did a great job in autovacuum tuning, but only half of it.
They made it very aggressive in terms of resources:
they relaxed the throttling and gave it a lot of resources. But they don't adjust the scale factors.
So autovacuum doesn't visit your tables often enough for OLTP.
So bloat can accumulate and so on.
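A per-table tuning sketch of those scale factors, with a hypothetical table name and values:

```sql
-- Make autovacuum visit a hot OLTP table much earlier than the 20% default
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor  = 0.01,  -- vacuum after ~1% of rows are dead
  autovacuum_analyze_scale_factor = 0.01   -- keep statistics fresh as well
);
```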
And they don't give you the resources to understand the reasons for bloat.
Actually, I'm thinking about it and I think it's tricky, and it's also a problem of the Postgres
documentation, because it lacks clarity on how to troubleshoot the reasons for bloat.
Because we always say "long-running transaction", but not every transaction is harmful. For
example, in the default transaction isolation level, read committed,
a transaction is not that harmful if it consists of many small queries.
If it's a single long query, it holds a snapshot, and that's harmful.
So I guess in observability we should shift fully from "long-running transaction" language
to xmin horizon language and
discuss that.
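One possible way to look at the xmin horizon and its usual holders, as a sketch (exact columns vary a bit by version):

```sql
-- What is holding the xmin horizon back right now?
SELECT 'active backend' AS source, pid::text AS id, backend_xmin AS horizon_xmin, xact_start AS held_since
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
UNION ALL
SELECT 'replication slot', slot_name, xmin, NULL
FROM pg_replication_slots
WHERE xmin IS NOT NULL
UNION ALL
SELECT 'prepared transaction', gid, transaction, prepared
FROM pg_prepared_xacts;
```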
Anyway, I can easily imagine and I observe how people think, oh, MongoDB doesn't have
this stuff, or some other database system, they don't have the problem with bloat, or
indexes.
Actually, with indexes, my true belief is that degradation of index health is happening
in other systems as well, we also discussed it.
So they need to be rebuilt.
I was listening to a SQL Server podcast
just for fun the other day,
and they had the exact same problem.
But in the episode where we talked about index maintenance,
I think it came up that even if you're really on top
of autovacuum, even if you have it configured really nicely, there can still be occasions
where you get some bloat. If you have like a spike or if you have a large deletion or
you have like a, there's a few cases where you can end up with sparsely populated indexes
that can't self-heal. Like if, for example, you've got even a UUIDv7 index,
and then you have a section of it,
maybe delete some old data and it's not partitioned,
then you've got a gap in your index.
So there's a bunch of reasons
why they can get bloated anyway,
even if you're on top of auto-vacuum.
So I think this is one of those ones that,
yes, auto-vacuum fixes most of the problems, but you probably still want to have some plan for index maintenance
anyway.
Yeah. So there are certain things that are not automated by Postgres itself, or by Kubernetes
operators, or by managed service providers; well, some of them automate some things, but not everyone, not everything.
Even upgrades,
also a lack of automation there.
We can mention this lack of automation of ANALYZE, but fortunately future Postgres
versions will definitely be fine, because dump/restore of statistics is finally implemented
and goes into Postgres 18, which is super great news.
Lack of automation might feel like, oh, this is a constant headache, but it's solvable.
It requires some effort, but fortunately it's solvable. Okay, next thing: let's talk about lightweight lock
contention. We talked about heavy lock contention or just lock contention.
Lightweight lock contention is also...
This feels like pain of various kinds.
Lightweight locks can be called latches;
they are in memory.
So when some operations with the buffer pool happen, for example, there are lightweight locks Postgres needs to acquire, or when working
with WAL or various data structures; we can also mention the lock manager. So things like
LWLock:LockManager, or BufferMapping, or the subtransaction and multixact SLRUs. When you hear these terms, for me, imagine they are written in a bloody red font, with drops of blood.
Because I know so many projects suffered big pain, big incidents.
So for me, these terms are like bloody terms, because it was a lot of pain sometimes.
For example, you know I'm a big fan of sub-transactions, right?
My natural advice is just to eliminate them all.
Well, over time I'm softer.
I say, okay, you just need to understand them
and use them very carefully.
But the lock manager, a couple of years ago,
remember Jeremy Schneider posted...
Yeah, great post.
Horror stories and we discussed it as well.
So this kind of contention might hit you
and it feels like a performance cliff usually. So all good, all good, boom.
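One rough way to see such a cliff forming is to sample wait events, for example:

```sql
-- What active sessions are waiting on right now (watch for LWLock spikes)
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;
```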
Right, so what is, or was, is it changing in 18? But it was a hard-coded limit, right?
16, you mean, for fast path? Also, as far as I know, the sizes are now configurable, I think in 17 already.
Well, nice, good, but not always enough because, okay, you can buy some time, but still there
is a cliff and if you're not far from it, again, boom.
Or this, I recently saw it: remember we discussed 4 million transactions per second.
We discussed that first we found pg_stat_kcache was an issue, it was fixed, and then pg_stat_statements.
Okay, pg_stat_statements: if the transactions are super fast, it brings an observer effect.
We see it in newer Postgres versions as LWLock:pg_stat_statements, because finally the code is covered by proper probes,
right?
Not probes; it's wrapped, and it's visible in wait event analysis when observing pg_stat_activity.
So I saw it recently at one customer, you know, some layer of LWLock:pg_stat_statements.
So we need to discuss what's happening.
It happens only when you have a lot of very fast queries, but it can be a problem as well.
But yeah, and performance cliffs.
It requires some practice to understand where they are.
It's hard because you need to understand how to
measure usage, how to understand like saturation risks.
This requires some practice.
I think this is one of the hardest ones to see coming.
After all our stories with LWLock:LockManager, every time I see some query exceed 1,000 QPS, queries per second, I'm already
thinking: okay, this patient is developing some chronic disease.
Okay, that's another one I hadn't heard. We've done several rules of thumb before,
but that's another good one. So 1000 queries per second for a single query. Check, is it?
It's also very relative to how many vCPUs we have.
If we have fewer, it can hit faster.
Although we couldn't reproduce exactly the same nature as we see on huge machines like
192 cores, 96 cores.
We couldn't reproduce that nature on eight core machines at all.
So yeah, it's for big boys only, you know.
This is like, or maybe for adults.
So young projects don't experience these problems.
Anyway, that's a good point.
Actually, the startups that have hit this that you've written about and things have tended to be
further along in their journeys.
Huge, huge.
But yeah, but still growing quickly.
And it's even a bigger problem at that point.
But yeah, good point.
Should we move on?
Yeah.
So the next one is our usual suspect, right?
It's the wraparound of the 4-byte transaction ID and multixact ID.
So many words have already been said about this. It just bothers me that monitoring doesn't cover,
for example, usually it doesn't cover multixact IDs, and people still don't have alerts and so on.
So it's sad, right?
Two billion transactions are easy to reach these days.
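A minimal monitoring sketch covering both counters mentioned here:

```sql
-- Transaction ID and multixact "age" per database; alert well before ~2 billion
SELECT datname,
       age(datfrozenxid)    AS xid_age,
       mxid_age(datminmxid) AS multixact_age
FROM pg_database
ORDER BY xid_age DESC;
```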
I get the impression though.
I mean, there were a few high profile incidents that got blogged about.
I think, yes. Yeah, exactly.
And I feel like I haven't seen one in a long while.
And I know there are a lot of projects that are having to,
you know, I think Adyen have spoken about,
if they weren't on top of this,
it would only be a matter of hours before they'd hit,
you know, it's that kind of volume.
So they're really, really having to monitor
and stay on top of it all the time.
But I haven't heard of anybody actually hitting this
for quite a while. Do you think,
I wondered if, for example, there were some changes to, I think it was autovacuum:
I think it kicks in to do an anti-wraparound vacuum differently now, or it might
be a lighter type of vacuum that it runs. I think I remember Peter Geoghegan
posting about it, something like that. Do you remember a change in that area?
I don't remember, honestly.
I just know this is still a problem.
Again, at the CTO level, it feels like:
how come Postgres still has 4-byte transaction IDs,
and what kind of risks do I need to take into account?
But you're right,
managed Postgres providers do quite a good job here.
They take care of it. I had a guest on Postgres TV, Hannu Krosing, who talked about how to escape from it
in a non-traditional and, in his opinion, and actually my opinion as well, better way,
without single-user mode. And since he is part of the Cloud SQL team, it also shows how much effort managed
Postgres providers put into this area, realizing this is a huge risk.
Yeah.
And it's not even, even if it's a small risk, the impact when it happens is not small.
So it's one of those ones where...
Absolutely good correction.
It's low risk, high impact.
Exactly.
Yes.
Yes.
So I think the cases that were blogged about were down for hours and hours, possibly even...
was it even a day or two of downtime for those organizations?
And that is then... I mean, you're talking about dangers, right?
Global downtime.
Exactly.
You're going to lose some customers over that, right?
Yeah.
Unlike the next item: 4-byte integer primary keys, it's still a thing.
You know, I was surprised to have this case recently, which was overlooked by our tooling.
Oh, really?
I couldn't like realize how come?
Yeah, because it was a non-traditional way to have this.
Go on.
Well, first of all, it was a sequence which was used by multiple tables, one for all of them,
and somehow it was defined in such a way that our report in postgres_checkup didn't see it. So when it came, I was like: how come? This
is an old friend, or old enemy, not friend, enemy. Old enemy. Yeah. I haven't seen you for so many years.
And you look different, you know,
because multiple tables use it,
but still, it's not fun.
And this causes partial downtime
because some part of workload cannot work anymore,
you cannot insert.
Yeah.
Yeah, so, by the way, I also learned that if you just do an in-place ALTER TABLE for a huge table,
it's not as dumb as I thought. I checked the source code. I was impressed, and this code is from 9-point-something,
from before. So if you do ALTER TABLE ... ALTER COLUMN to change from int4 to int8, it actually performs
a job similar to VACUUM FULL, recreating indexes, and you don't have bloat. I expected
like 50% bloat. Oh, why? Because I thought it would rewrite the whole table. I was mistaken.
It's quite smart. Of course, it's a
blocking operation, it causes downtime to perform it, but you end up with quite a clean
state of the table and indexes. Not just a quite clean state; it's fresh.
Yeah. So that is a table rewrite, no?
Yes. Well, table, yes. Well, you're right. I was thinking of a table rewrite as a very
dumb thing, like creating new tuples and deleting the old tuples.
Got it, got it, got it.
But there is a mechanism for table rewrite in the code.
Now I finally saw it; I'm still learning, you know, sorry.
You might end up with some padding issues
if you had it optimized well before, but yeah.
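For reference, the blocking in-place change being described, against a hypothetical table:

```sql
-- Rewrites the table and rebuilds its indexes; holds AccessExclusiveLock for the duration
ALTER TABLE events ALTER COLUMN id TYPE bigint;
```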
Yeah, it also feels like Postgres could implement
some reshape eventually because there are building blocks
in the code already, I see it: first, like, an offline style to change column order,
and then, if you want it, a fully online style.
If pg_squeeze goes to core, right?
Yeah, yeah, yeah.
That would be great.
Yeah, I'm just connecting paths here,
and it can become very powerful in like three to five years maybe.
But it's a lot of work additionally.
So for all those who are involved in moving these huge building blocks,
I have huge respect.
Yeah.
So OK.
And I think this is, if you know what you're doing,
this one's easier to recover
from. I assume like with the sequence, for example, you can handle it multiple ways,
but you can set the sequence to, like, negative 2 billion. Everyone thinks they
are smart. And this is the first thing I hear every time we
discuss this: let's use negative values.
Of course, if you can use negative values, do it.
Because we know Postgres integers are signed integers.
So we use only half of the capacity: of the 4-byte capacity, half is 2.1 billion, roughly.
So you have 2.1 billion more, but it's not always possible to use them.
But this is an old, old story, still making some people nervous,
and I think it's good to check in advance. It's so much better to
have alerts and to keep checking. I've saved several companies, big ones, from this just by raising it, and I know in some companies
it was one year or a few years of work to fix it.
So what was the problem before? Was it looking at columns instead of looking at sequences?
Or... sequences are always 8 bytes, right? That's it, they are always 8 bytes.
There was some problem with the report, I don't remember honestly. It was not done the
standard way, not just CREATE TABLE where you have a default with a sequence and you see it.
Something else, some function, I don't remember exactly. But usually our report catches such things.
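A quick self-check along these lines, sketched with pg_sequences; it flags sequences approaching the int4 limit, which matters when they feed 4-byte columns:

```sql
-- How far each sequence has progressed relative to the int4 maximum (2,147,483,647)
SELECT schemaname, sequencename, last_value,
       round(100.0 * last_value / 2147483647, 2) AS pct_of_int4_max
FROM pg_sequences
WHERE last_value IS NOT NULL
ORDER BY last_value DESC;
```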
Or you can just check yourself: if you have 4-byte primary keys, it's time to move
to 8 bytes, or maybe to UUID version 7. That's it about this. Then let's heat up the situation: replication limits. So, in the beginning I
mentioned vertical and horizontal scaling and usually people say there's not enough CPU or
something, and disk I/O, and we need to scale, and you can scale read-only workloads by having
more read-only standbys. But it's hard to scale writes. And
this is true. But it's also true that at some point, and Lev mentioned it in our PgDog episode,
he mentioned that at Instacart they had it, right? So like 200 megabytes per second
of WAL generation, or it's already over 100. I don't remember exactly
what he mentioned, but somewhere in that area...
It was a lot.
Well, just 10 WALs per second, for example, gives you 160 megabytes. Well, on RDS that's fewer files,
because they have 64-megabyte WAL segments; they raised the size four times. Anyway, at 100 or 200, 300 megabytes per second,
you start hitting problems with single-threaded processes. Actually, it surprises me
why we don't monitor this. Any Postgres monitoring should have low-level access; RDS doesn't have it, but with low-level access we should see how
much CPU each important single-threaded Postgres process uses: WAL sender, WAL receiver,
logical replication worker, maybe checkpointer as well.
That would be helpful.
Of course, because it's dangerous to grow at scale
when you hit 100% of a single vCPU. And then you need to either scale vertically or horizontally, or
start saving on WAL generation. And fortunately, in pg_stat_statements we have WAL metrics, three columns, since Postgres 13.
But unfortunately, this is query level.
What we need is also table level, to understand which tables are responsible for
a lot of WAL.
And pg_stat_all_tables lacks it.
I think it's a good idea to implement.
If someone wants some hacking, this is a great idea:
add three more WAL-related metric columns to pg_stat_all_tables, pg_stat_sys_tables,
and pg_stat_user_tables. It would be great. Also maybe to pg_stat_database, as a global view of how much WAL.
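A sketch of what those three columns already make possible at the query level, on Postgres 13 or newer:

```sql
-- Top WAL-producing queries according to pg_stat_statements
SELECT queryid, calls, wal_records, wal_fpi,
       pg_size_pretty(wal_bytes) AS wal_size,
       left(query, 60) AS query_sample
FROM pg_stat_statements
ORDER BY wal_bytes DESC
LIMIT 20;
```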
Yeah, I've got a vague memory that there were a couple of WAL-related new views, new system views.
Oh yes, but that's about... are you talking about pg_stat_io? No. WAL-related: in Postgres 13 it went to EXPLAIN and it went to pg_stat_statements.
This is what happened.
Anyway, this is not an easy problem to solve.
It's easy to check if you have access.
Unfortunately, if you're on managed Postgres, you don't have access.
They need to check what's happening, especially on standbys.
And also it makes sense to tune compression properly, because compression can eat your
CPU.
Remember we discussed wal_compression;
I always said: let's turn it on.
Now I think: let's turn it on, unless you have this problem
and you're already getting close to the limit.
Yeah.
You need to check how much of that CPU is WAL compression. And
also, we have new compression algorithms implemented in fresh Postgres versions: lz4 and
zstd. So we can choose a better one.
For example, lz4 should be providing something similar.
As I remember, I saw some benchmarks;
I didn't do it myself yet.
So it should be similar to the pglz default compression
in terms of compression ratio,
but it takes much less CPU,
like two to three times less.
So it's worth choosing that.
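A minimal sketch of switching the algorithm; Postgres 15 and newer accept an algorithm name, and the server must be built with the corresponding library:

```sql
-- Check the current setting, then switch WAL compression to lz4 ('zstd' is also an option)
SHOW wal_compression;
ALTER SYSTEM SET wal_compression = 'lz4';
SELECT pg_reload_conf();
```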
I looked at it briefly just for our own use for plan storage.
And it was, we even got better performance as well.
So it won on both for us.
Well, you probably saw what I saw on the Small Datum blog, right?
No.
Okay.
Maybe.
Let's move on because there are more things to discuss.
Design limits. So some people already think about what they will do when their table reaches 32 terabytes.
Yeah, I guess this and the last one both feel like adult problems again, right? Like there aren't too many small startups hitting the Instacart level of WAL generation
or 32 terabytes of data.
So yeah, it's really big clusters.
But we can start thinking about them earlier and be better prepared.
Maybe sometimes without spending too much time, because, you know, if you spend too much time thinking about how you will bring statistics to a new cluster after
a major upgrade... but it's already implemented in 18.
So just a couple of years more and everyone forgets about this problem.
It's the kind of thing that's useful to have, like, uh, if you've, if you've
got like a to-do list you're
going to be using in a couple of years time, or a calendar
that you know you're going to get an alert for,
just put a reminder in for a year or two's time.
Just check.
We also mentioned a few times during the latest episodes,
this bothers me a lot:
first I learned Cloud SQL has this limit,
then I learned RDS also has this limit,
64 terabytes per whole database. What's happening here? It's already not a huge database.
But again, in a couple of years, who knows, they might have increased that.
Yeah, well, I think it's solvable. Of course, I guess this is the limit of a single EBS volume, or a disk on Google Cloud,
PD SSD or whatever they call it.
So yeah, it's solvable, I think, right?
With some tricks.
But these days, it doesn't feel huge.
64 terabytes feels like a big database, right?
But when we say...
Yeah, to most of us.
But we have stories of 100-plus-terabyte databases.
All of them are self-managed, I think.
And all of them were sharded.
All of those were sharded.
Yeah, 100%.
Yeah, we had a great episode, 100 terabytes,
where it was Adyen, who else?
We had Notion and we had Figma.
Figma, right.
And Notion and Figma, they are on RDS, but it's sharded; a single cluster of that size is impossible
on RDS, right?
And I think Adyen has 100-plus terabytes.
No, they have 100 terabytes, but it's self-managed.
Yes.
Because on RDS it's impossible,
not supported. Well, and they're sharded also. Yeah, yeah, but in large companies like
that there are always some parts that are sharded and some that are not. So anyway, when you have 10 to 20
terabytes, it's time to think, if you are on RDS or Cloud SQL: how
will you grow 5x? Because if you're at 20 terabytes, growing 5x means 100, and that's already not possible
with a single cluster, right? Another reason to think about splitting somehow.
Then a few more items, like data loss. Data loss is a big deal if you have poorly designed
backups or HA solutions. Yeah, it can be. Let's join this with poor HA choices leading to failures
like split brain. So, data loss, split brain. Actually, there is an ongoing discussion
in the project called
CloudNativePG, where I raised the topic of split brain and demonstrated how to do
it a couple of weeks ago.
The good news is I see they decided to implement something, to move in a direction similar to Patroni, because when a network partition happens and
the primary is basically alone, it's bad because it remains active.
And as I demonstrated, some parts of the application might still be talking to it.
This is classical split-brain.
And based on the discussion, I saw it triggered a question I never actually thought about deeply:
is split brain just a variant of data loss?
Well, I guess you technically might not have lost the data. It's still there. You just have two
versions. Which one is correct?
It's worse than data loss, because now you have two versions of reality.
Yeah, and with data loss it's bad, but you can apologize and ask: please bring your data back again.
And in some cases we allow some data loss; of course, it's also a really
sad thing to have, but sometimes we have it officially: some data loss might happen, the risk is very low, and it's at maximum this number
of bytes for example, but it might happen. With split-brain it's worse. You need to spend
a lot of effort to merge realities into one.
Yeah.
Most cases of data loss I've seen have tended to be at least two things gone wrong.
Like user error, or some way that the node's gone down, but quite often it's user error: an accidental DELETE without a WHERE clause, or dropping a table in one
environment.
Well, this is like higher-level data loss.
Well, but that can cause the low-level data loss if you then also
don't have tested backups and it turns out you didn't have a good one.
So it's the combination of the two things.
Often for me, it's still, yeah.
Well, yes.
Yeah, if backups are missing, it's bad; you cannot recover. But also, data loss for me, the classical, lower-level kind,
is when the database said "commit successful" and then my data is gone. That's
scary and dangerous. As a CTO, it undermines trust in Postgres again, right?
Yeah. If procedures are leading to data loss and also split brains...
Is it actually happening often, or is it more the CTO
thinking: I don't want to choose this technology because it could happen?
Depends on the project
and the level of control.
I'm pretty confident that in many, many web and mobile app, OLTP-style projects,
unless they are financial or something, like social networks, social media, maybe even e-commerce
and so on, data loss happens, sometimes unnoticed.
If you have asynchronous replicas and a failover happened, the process
of failover, like Patroni by default with asynchronous replicas, allows up to one megabyte,
it's written in the config, up to one megabyte of data loss officially, a mebibyte. So one
mebibyte is possible to lose. And who will complain if it's social media comments
or something? We store, like, comments,
we lost some comments, maybe nobody noticed.
But if it's a serious project, it's better not to do it.
Or split brain.
Yeah.
Anyway, this is more... it's not like Postgres doesn't do its job; Postgres does its job quite well here.
It's a question mostly for everything around Postgres, the infrastructure, and if it's managed
Postgres, for their infrastructure: how do they guarantee there is no data loss?
And last time we discussed the problem with synchronous_commit, and we discussed in detail
how right now Postgres doesn't do a good job, not revealing proper LSNs on standbys.
So even Patroni cannot choose properly, and there can be data loss in the case of synchronous_commit with remote_write.
We discussed it.
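A hedged sketch of the settings in question; the standby names are hypothetical, and stricter levels cost commit latency:

```sql
-- Wait for at least one of the listed standbys before reporting commit success
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (replica1, replica2)';
-- 'remote_write' is weaker than 'on'; 'remote_apply' is the strictest level
ALTER SYSTEM SET synchronous_commit = 'on';
SELECT pg_reload_conf();
```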
Okay.
Anyway, this feels like something for improvement definitely here.
Okay, corruption. My general feeling is that people don't realize how many types
of corruption might happen.
And it remains unnoticed in so many cases.
When you start talking about it, people's reaction sometimes
is: wait, what?
So yeah, corruption at various levels might happen.
So actually, maybe this one, in terms of the list, where we started, the point of
the topic, was kind of dangers, right? Is this one of those ones that, if it silently
happened for a while, suddenly becomes a complete loss of trust in the underlying system, or...
Yeah. As usual, I can rant about how Postgres defaults are outdated. We know data checksums
were only recently enabled by default.
Yeah, great change.
Like a month or two ago, right? It will only be in 18.
It should have been done five years ago.
We saw the evidence.
Yeah.
Yeah, many managed Postgres providers did it,
like RDS, that's great.
Which is great.
And that kind of is then the default for a lot of people.
Yeah, but it also doesn't guarantee
that you don't have corruption.
You need to read all the pages from time to time, right?
And do they offer something, some anti-corruption tooling?
They don't.
Nobody.
Okay, they enabled it, so what?
This is just a small piece of the whole puzzle.
Yeah, and amcheck is improving as well.
I think, is it in 18?
Yeah, that's great, that's great.
We're getting GIN indexes as well.
Yeah, big.
In fact, I think... did we have an episode on amcheck?
I feel like it came up at least once or twice.
Maybe index maintenance.
I cannot remember all our episodes or many.
Yeah, me neither.
But it's great to see the progress in the area of amcheck,
but again, it's a small piece of the puzzle,
and it has settings, options to choose from.
So it's not trivial to choose among the options.
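A minimal sketch of running these checks; the index name is hypothetical, and heapallindexed is the more thorough, more expensive option:

```sql
-- Verify checksums are on, then check a B-tree index with amcheck (contrib extension)
SHOW data_checksums;
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('orders_pkey'::regclass, true);  -- second argument: heapallindexed
```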
And also again, like a couple of episodes ago,
we discussed support and how difficult it is
to sometimes understand what's happening, right?
And right now I have a case where a very big platform
doesn't help to investigate corruption, and nobody has the
ability to investigate it but them, yet they are not helping.
It was one of the reasons I was provoked to talk about that in that episode.
It looks really bad.
You guys are responsible
for that corruption case and you don't do a great job. I think it's a problem with the industry
and we discussed it already, so let's not repeat it. But in general, I think if you are a
CTO or a leader who decides priorities, my big advice is to take this list as a checklist
and evaluate the situation for yourself;
better let us do it, of course,
but you can do it yourself,
and then plan some proactive measures,
because corruption testing can be done
proactively even on RDS. If it happens,
you need support, of course, because sometimes it's low level and you
don't have access, but at least you will feel in control of it. So for corruption,
anti-corruption tooling is needed; this is what I feel. Yeah, that's it, that's all my list.
But I'm sure it's lacking something.
Like more security-related stuff,
for example, which, as usual, I tend to leave aside.
What do you think, was it good?
Yeah, you see a lot more of these things than I do
obviously, but yeah, I think it's a really good list, and a good checklist, right?
Of course, you could go on forever with things to be scared of.
But this feels like, if you ticked all of these off, you'd be in such a
good position versus most, and obviously things can still go wrong, but these are,
at least even
if they're not the most common, some of the things that could cause the biggest issues,
the things that are most likely to get on the CEO's desk or in the inbox. So yeah,
this feels like if you're on top of all of these things you're going to go a long long way before
you hit issues. I have a question for you. Guess which of these 10 items I have
never had in my production life. Oh, maybe give me a second. It's tricky, right?
Transaction ID wraparound? Exactly. I only... It's rare. Well, it's rare, yeah. Let's cross it off, no problem. Yeah, I only found a way to emulate it, which we did multiple times, but never had it in
reality in production.
Everything else I had.
Yeah.
Not once.
I nearly guessed too quickly that it was going to be split brain, but then I was like, no wait, read the whole list.
But I'm guessing you had split brains, like, one or a couple of times maybe, max?
Yeah, replication manager: split brain as a service. Yes.
Okay. Yeah, pre-Patroni days.
Yeah, makes sense. All right. Nice one. Thanks so much, Nikolay. Thank you. See you next time. Catch you next time. Bye.