Postgres FM - Top ten dangerous issues

Episode Date: May 9, 2025

Nikolay and Michael discuss ten dangerous Postgres-related issues, ones that might be painful enough to get onto the CTO's or even CEO's desk, and then what you can do proactively about each. The ten issues discussed are:

- Heavy lock contention
- Bloat control and index maintenance
- Lightweight lock contention
- Transaction ID wraparound
- 4-byte integer PKs hitting the limit
- Replication limits
- Hard limits
- Data loss
- Poor HA choice (split brain)
- Corruption of various kinds

Some previous episodes they mentioned that cover the issues in more detail:

- PgDog https://postgres.fm/episodes/pgdog
- Performance cliffs https://postgres.fm/episodes/performance-cliffs
- Zero-downtime migrations https://postgres.fm/episodes/zero-downtime-migrations
- Queues in Postgres https://postgres.fm/episodes/queues-in-postgres
- Bloat https://postgres.fm/episodes/bloat
- Index maintenance https://postgres.fm/episodes/index-maintenance
- Subtransactions https://postgres.fm/episodes/subtransactions
- Four million TPS https://postgres.fm/episodes/four-million-tps
- Transaction ID wraparound https://postgres.fm/episodes/transaction-id-wraparound
- pg_squeeze https://postgres.fm/episodes/pg_squeeze
- synchronous_commit https://postgres.fm/episodes/synchronous_commit
- Managed service support https://postgres.fm/episodes/managed-service-support

And finally, some other things they mentioned:

- A great recent SQL Server-related podcast episode on tuning techniques https://kendralittle.com/2024/05/20/erik-darling-and-kendra-little-rate-sql-server-performance-tuning-techniques/
- Postgres Indexes, Partitioning and LWLock:LockManager Scalability (blog post by Jeremy Schneider) https://ardentperf.com/2024/03/03/postgres-indexes-partitioning-and-lwlocklockmanager-scalability/
- Do you vacuum everyday? (talk by Hannu Krosing) https://www.youtube.com/watch?v=JcRi8Z7rkPg
- pg_stat_wal https://pgpedia.info/p/pg_stat_wal.html
- The benefit of lz4 and zstd for Postgres WAL compression (Small Datum blog, Mark Callaghan) https://smalldatum.blogspot.com/2022/05/the-benefit-of-lz4-and-zstd-for.html
- Split-brain in case of network partition (CloudNativePG issue/discussion) https://github.com/cloudnative-pg/cloudnative-pg/discussions/7462

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:

- Michael Christofides, founder of pgMustard
- Nikolay Samokhvalov, founder of Postgres.ai

With credit to:

- Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello, hello. This is Postgres FM. I'm Nik, Postgres.ai, and as usual, my co-host is Michael, pgMustard. Hi, Michael. How are you doing? I'm good. How are you? Great. Everything's all right. A lot of bugs to fix and incidents to troubleshoot, to perform root cause analysis, as we say, right? RCA. Sounds related to our topic today, maybe. Oh, yeah, maybe, yes. So the topic I chose is, I'm still not 100% sure how to name it properly. So let's decide together.
Starting point is 00:00:39 But the situation is simple for me, relatively. So we help a lot of startups, and at some point I decided to focus only on startups. Having raised a few startups and helped many startups, I know how it feels to choose technology and grow, grow, grow until some problems start hitting you. This is exactly it: usually people choose RDS or Cloud SQL or Supabase, anything, and they don't need to hire DBAs, DBREs, and they grow quite well until a few terabytes of data or 10,000 TPS, that kind of scale. And then problems pop up here and there, and sometimes they come in batches, not just one problem but several.
Starting point is 00:01:34 And here usually, for us it's good, they come to us. I mean, Postgres.ai, we still have a consulting wing, quite strong and growing. And we helped more than 20 startups over the last year, which I'm very proud of. And I collected a lot of case studies, so to speak. And I decided to have some classification of problems that don't feel good at a very high level, for example CTO level or even CEO, when they might start thinking: is Postgres the right choice? Or is it giving us too much headache?
Starting point is 00:02:13 And it's not about like, oh, out of disk space suddenly or major upgrades requiring some maintenance window, although this also can cause some headache. But it's more about problems like where you don't know what to do or you see it requires a lot of effort to solve it properly. Yeah, I've had a sneak peek of your list, so I like how you've described it. I also like the thought process of whether it hits the CTO or the CEO. And I was thinking, let's say you have a non-technical CEO, if they start hearing the word Postgres too often, it's probably a bad sign, right? Exactly.
Starting point is 00:02:53 Ideally, you might mention it once every few years when you do a major version upgrade, but then nothing bad happens and they don't hear it again for a few years. But if they're hearing, if it's getting to the CEO that Postgres is causing problems over and over again, the natural question is going to be, is there an alternative? What could we do instead?
Starting point is 00:03:14 Or, you know, is this a big problem? You know, so I guess it's these kinds of dangers, not just to the startup, but also to Postgres's continued use at that startup. Yeah, I like the word dangerous here because when you deal with some of these problems, it might feel dangerous to have Postgres for them. It's bad. I would like it if things were better. Yeah. So I have a list of 10 items. Yeah. And we can discuss, and the list is unordered, and I'm going to post it to my social networks so folks can discuss, and I sincerely think that this list is useful. If you're a startup, it's great to just use this checklist to see how your cluster is
Starting point is 00:04:09 doing, or clusters are doing, and are you ready. So, a Postgres growth readiness checklist. And it's interesting that I didn't include vertical and horizontal scaling there. I did it indirectly, we will touch on it. But obviously, this is the most discussed topic, the biggest danger: how Postgres scales, like a cluster, single primary and multiple standbys, how far we can go. We know we can go very far, very, very far on a single cluster. At some point, microservices or maybe sharding. It's great. But we had a great episode with Lev Kokotov of PgDog. And it resonates with one of the items I have today. It resonates
Starting point is 00:05:00 with what he said during our episode. So anyway, let's exclude vertical and horizontal scaling and talk about stuff which kind of sounds boring. My first item is heavy lock contention. This is very popular. Maybe 50% of companies that come to us have this issue somehow. Yeah. So at some point I decided to start saying to everyone: if you have queue-like workloads, or additionally,
Starting point is 00:05:33 and/or if you don't know how dangerous it is to change schema in Postgres, just adding a column can be a problem, right? We discussed it, I think, many times. You are not ready to grow and at some point, sooner or later, it will hit you. It will hit you as a spike of active sessions. We know some managed Postgres platforms provoke you to have a huge max_connections, like 5,000 or 2,500. Why do they do this? It's easier for them, but it's dangerous
Starting point is 00:06:13 because it creates like kind of performance cliff additionally. Yeah, yeah. It's another version of these cliffs, isn't it? We had another good episode recently. Yeah, I plan to research this a little bit. Probably we will publish something in this area to prove that it's still not good. Even if you have Postgres 14 plus, which has great optimizations for large number of idle connections, it's still not good. Yeah, and there have been some improvements. Like I know a very good engineer who took
Starting point is 00:06:47 Postgres down by adding a column with a default, I think it was, but that was many years ago, and there have been some improvements in recent years, so some DDL changes are less dangerous than they were. Yeah, there are several levels. Yes. Yeah, of course. And if they get stuck, if they don't have a lock_timeout, for example. In fact, yeah, we're probably going to be pointing to episodes on every single one of these bullet points, but we had one on zero-downtime migrations. I think it's probably the best for that. And we had a separate one on queues actually, didn't we? So yeah, yeah.
Starting point is 00:07:17 Yeah. So definitely there are solutions here, and you just need to practically deploy them. It's interesting that I see some companies grow quite far not noticing this problem, for example with DDL. It's like going to a casino: you can win, you can win, and sometimes, boom, you lose. Because if you deploy some DDL and you get blocked, you can block others, and it can be a disaster. We discussed it several times. And if you had a hundred successful deployments, it doesn't mean you will keep winning. So it's better to have... And it concerns me. I have a feeling we should implement this in Postgres, like ALTER TABLE CONCURRENTLY or something like this.
Starting point is 00:08:08 It should itself perform these retries with a low lock_timeout. Yeah, it's tricky, isn't it? But I agree. But then people still need to know that it exists to actually use it, because I think the main issue here is people not realizing that it can be a problem. And the fact that it probably hits users,
Starting point is 00:08:29 let's say you've got a statement timeout, like when are you actually gonna notice that users have been waiting? Are you gonna notice that spike on your monitoring? I'm not sure, it depends how many users actually got stuck waiting behind it and had slow queries. And it's going to be hard to reproduce: you might not know why it was that.
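For illustration only, a minimal sketch of the retry-with-low-lock_timeout idea being discussed, assuming a hypothetical table and column; many teams put this loop in their migration tooling rather than in a DO block:

```sql
-- Retry DDL with a low lock_timeout, so a blocked ALTER gives up quickly
-- instead of blocking everything queued behind it. Names are hypothetical.
DO $$
DECLARE
  attempt int := 0;
BEGIN
  LOOP
    attempt := attempt + 1;
    BEGIN
      SET LOCAL lock_timeout = '100ms';          -- fail fast if the lock is busy
      ALTER TABLE orders ADD COLUMN note text;   -- the actual schema change
      EXIT;                                      -- success, leave the loop
    EXCEPTION WHEN lock_not_available THEN       -- SQLSTATE 55P03, lock_timeout hit
      IF attempt >= 20 THEN
        RAISE;                                   -- give up after 20 tries
      END IF;
      PERFORM pg_sleep(0.5);                     -- back off, then retry
    END;
  END LOOP;
END
$$;
```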
Starting point is 00:08:48 So log_lock_waits is off by default, so you don't see who blocked you. It might be autovacuum running in its aggressive mode, or it can be another session, a long-running transaction, which holds an AccessExclusiveLock on a table, and you cannot alter it, and boom, you block others. So this is like a chain reaction, and yeah, it's not good. And queue-like workloads, same: at some smaller scale you don't see problems at all, you occasionally experience them, but if you grow very fast, you will start hitting these problems very badly. They look like spikes of heavy lock contention, or just heavyweight lock, and lock, in Postgres terminology.
Starting point is 00:09:39 It's the same, so just lock contention. It doesn't look good. The suggestion is so simple. It's funny that we talk a lot, and people that come to us actually mention they watch the podcast. And I say, like, okay, queue-like workload: just take care of indexes, take care of bloat, maybe partitioning. But most importantly, like
Starting point is 00:10:05 SKIP LOCKED, right? And that's it. Like, this is the solution, but we spend hours discussing details, because when you go to reality, it's not easy to learn this, there are objections sometimes, but this is what we do: we work with those objections and help to implement. So yeah, but for everything we had an episode. There are episodes for everything. So this was number one, heavy lock contention.
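A minimal sketch of the SKIP LOCKED consumer pattern being referred to; table and column names here are hypothetical:

```sql
-- Each worker grabs a small batch of pending jobs, skipping rows that
-- other workers already locked, so consumers never queue up behind each other.
WITH picked AS (
  SELECT id
  FROM jobs
  WHERE status = 'pending'
  ORDER BY id
  LIMIT 10
  FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status = 'processing', started_at = now()
FROM picked
WHERE jobs.id = picked.id
RETURNING jobs.id;
```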
Starting point is 00:10:38 I chose the most popular reasons. Of course there are other reasons, but in my view, DDL and queue-like workloads are the biggest ones. Okay, next it's boring, super boring: bloat control and index management. We had episodes about it, maybe several actually. But again, managed Postgres platforms don't give you tools. For example, RDS: they did a great job in autovacuum tuning, but only half of it.
Starting point is 00:11:15 They made it very aggressive in terms of how much resources it gets: they raised the throttling limits, they gave it a lot of resources. But they don't adjust the scale factors. So vacuum visits your tables not often enough for OLTP, so bloat can accumulate, and so on.
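As a hedged illustration of the scale-factor point, per-table settings can be lowered for hot tables; the table name and thresholds below are arbitrary examples, not recommendations from the episode:

```sql
-- Trigger autovacuum/autoanalyze on a hot table at roughly 1% of changed rows,
-- instead of the global default of 20% (autovacuum_vacuum_scale_factor = 0.2).
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor  = 0.01,
  autovacuum_analyze_scale_factor = 0.01
);
```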
Starting point is 00:12:02 And they don't give you the resources to understand the reasons for bloat. Actually, I'm thinking about it, and I think it's tricky, and it's also a problem of the Postgres documentation, because it lacks clarity on how we troubleshoot the reasons for bloat. Because we always say "long-running transaction", but not every transaction is harmful. For example, in the default transaction isolation level, read committed, a transaction is not that harmful if it consists of many small queries. If it's a single query, it holds a snapshot, it's harmful. So I guess in observability we should shift from "long-running transaction" language to "xmin horizon" language fully, and discuss that.
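A sketch of what xmin-horizon observability can look like; one possible starting query (replication slots, standbys with hot_standby_feedback, and prepared transactions can also hold the horizon back):

```sql
-- Sessions holding the oldest snapshots (xmin), which block vacuum from
-- cleaning up dead tuples, regardless of how "long-running" they look.
SELECT pid,
       state,
       backend_xmin,
       now() - xact_start AS xact_age,
       left(query, 60)    AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;
```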
Starting point is 00:12:40 Anyway, I can easily imagine, and I observe, how people think: oh, MongoDB doesn't have this stuff, or some other database system, they don't have the problem with bloat or indexes. Actually, with indexes, my true belief is that degradation of index health happens in other systems as well, we also discussed it. So they need to be rebuilt. I was listening to a SQL Server podcast just for fun the other day, and they had the exact same problem. But in the episode where we talked about index maintenance,
Starting point is 00:13:02 I think it came up that even if you're really on top of autovacuum, even if you have it configured really nicely, there can still be occasions where you get some bloat. If you have a spike, or you have a large deletion, there are a few cases where you can end up with sparsely populated indexes that can't self-heal. Like, for example, you've got even a UUIDv7 index, and then you have a section of it, maybe you delete some old data and it's not partitioned, then you've got a gap in your index.
Starting point is 00:13:35 So there's a bunch of reasons why they can get bloated anyway, even if you're on top of autovacuum. So I think this is one of those ones that, yes, autovacuum fixes most of the problems, but you probably still want to have some plan for index maintenance anyway. Yeah.
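For that index-maintenance plan, a minimal sketch; index and table names are hypothetical, and REINDEX ... CONCURRENTLY requires Postgres 12 or newer:

```sql
-- Rebuild a bloated index online, without blocking reads or writes.
REINDEX INDEX CONCURRENTLY idx_orders_created_at;

-- Or rebuild all indexes of one table.
REINDEX TABLE CONCURRENTLY orders;
```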
Starting point is 00:14:07 So there are certain things that are not automated by Postgres itself, or by Kubernetes operators (well, some of them automate some things, but not everyone, not everything), or by managed service providers. Even upgrades: also a lack of automation there. We can mention this lack of automation of ANALYZE, but fortunately, future Postgres versions will definitely be fine, because dump/restore of statistics is finally implemented and goes into Postgres 18, which is super great news. Lack of automation might feel like, oh, this is a constant headache, but it's solvable. It requires some effort, but it's solvable. Okay, next thing: let's talk about lightweight lock contention. We talked about heavy lock contention, or just lock contention.
Starting point is 00:14:46 Lightweight lock contention is also... this feels like pain, of various kinds. Lightweight locks can be called latches. It's in memory. So when some operations with the buffer pool happen, for example, there are lightweight locks Postgres needs to establish, or when working with WAL or various data structures; we can also mention the lock manager. So things like the LWLock lock manager, or buffer mapping, or subtrans SLRU, multixact SLRU. When you hear these terms, for me, imagine them written in a bloody red font, with drops of blood.
Starting point is 00:15:37 Because I know so many projects suffered big pain, big incidents. So for me, these terms are like bloody terms, because there was a lot of pain sometimes. For example, you know I'm a big fan of subtransactions, right? My natural advice is just to eliminate them all. Well, over time I'm softer. I say, okay, you just need to understand them and use them very carefully. But the lock manager, a couple of years ago,
Starting point is 00:16:14 remember Jeremy Schneider posted... Yeah, great post. Horror stories, and we discussed it as well. So this kind of contention might hit you, and it feels like a performance cliff usually. So all good, all good, boom. Right, so what is, or was, is it changing in 18? But it was a hard-coded limit, right? Sixteen, you mean, this is for fast path? Also, as far as I remember, sizes are now configurable, I think in 17 already.
Starting point is 00:16:46 Well, nice, good, but not always enough, because, okay, you can buy some time, but still there is a cliff, and if you're not far from it, again, boom. Or this, I recently saw it. Remember we discussed four million transactions per second? We discussed that first we found pg_stat_kcache was an issue, it was fixed, and then pg_stat_statements. Okay, pg_stat_statements: if the transactions are super fast, it brings an observer effect. We see it in newer Postgres versions as the LWLock pg_stat_statements wait event, because finally the code is covered by proper probes, right?
Starting point is 00:17:29 Not probes, it's wrapped, and it's visible in wait event analysis, observing pg_stat_activity. So I saw it recently at one customer, you know, like some layer of LWLock pg_stat_statements. So we need to discuss what's happening. It happens only when you have a lot of very fast queries, but it can be a problem as well. But yeah, performance cliffs. It requires some practice to understand where they are. It's hard because you need to understand how to
Starting point is 00:18:06 measure usage, how to understand saturation risks. This requires some practice. I think this is one of the hardest ones to see coming. After all our stories with the LWLock lock manager, every time I see some query exceed 1,000 QPS, queries per second, I'm already thinking: okay, this patient is developing some chronic disease. Okay, that's another one I hadn't heard. We've done several rules of thumb before, but that's another good one. So 1,000 queries per second for a single query, that's the check, is it? It's also very relative to how many vCPUs we have.
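A rough sketch of how one might look for queries in that territory using pg_stat_statements; the pg_stat_statements_info view needs Postgres 14 or newer, and this averages over the whole period since the stats reset, so it understates peaks:

```sql
-- Top queries by call count, with an approximate average QPS since stats_reset.
SELECT calls,
       round(calls / greatest(extract(epoch FROM now() - stats_reset), 1)) AS approx_qps,
       left(query, 60) AS query
FROM pg_stat_statements, pg_stat_statements_info
ORDER BY calls DESC
LIMIT 10;
```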
Starting point is 00:18:49 If we have less, it can hit faster. Although we couldn't reproduce exactly the same nature as we see on huge machines like 192 cores, 96 cores. We couldn't reproduce that nature on eight core machines at all. So yeah, it's for big boys only, you know. This is like, or maybe for adults. So young projects don't experience these problems. Anyway, that's a good point.
Starting point is 00:19:23 Actually, the startups that have hit this that you've written about and things have tended to be further along in their journeys. Huge, huge. But yeah, but still growing quickly. And it's even a bigger problem at that point. But yeah, good point. Should we move on? Yeah.
Starting point is 00:19:38 So the next one is our usual suspect, right? It's wraparound of the 4-byte transaction ID and multixact ID. So many words have already been said about this. It just bothers me that monitoring doesn't cover it, for example, usually it doesn't cover multixact IDs, and people still don't have alerts, and so on. So it's sad, right? Idle-in-transaction sessions are easy to create these days. I get the impression, though, I mean, there were a few high-profile incidents that got blogged about.
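A minimal monitoring sketch covering both kinds of wraparound mentioned here:

```sql
-- Per-database XID and multixact ages. Both should stay far below ~2 billion;
-- autovacuum starts aggressive freezing at autovacuum_freeze_max_age (200M by default).
SELECT datname,
       age(datfrozenxid)    AS xid_age,
       mxid_age(datminmxid) AS multixact_age
FROM pg_database
ORDER BY 2 DESC;
```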
Starting point is 00:20:18 I think, yes. Yeah, exactly. And I feel like I haven't seen one in a long while. And I know there are a lot of projects that are having to, you know, I think Adyen have spoken about it: if they weren't on top of this, it would only be a matter of hours before they'd hit it, you know, it's that kind of volume. So they're really, really having to monitor
Starting point is 00:20:38 and stay on top of it all the time. But I haven't heard of anybody actually hitting this for quite a while. Do you think, I wondered if, for example, there were some changes to, I think it was autovacuum, like it kicks in to do an anti-wraparound vacuum differently, or it might be a lighter type of vacuum that it runs now. I think I remember Peter Geoghegan posting about it, something like that. Do you remember a change in that area? I don't remember, honestly.
Starting point is 00:21:07 I just know this is still a problem. Again, at CTO level, it feels like: how come Postgres still has 4-byte transaction IDs, and what kind of risks do I need to take into account? But you're right, managed Postgres providers do quite a good job here. They take care. I had a guest at Postgres TV, Hannu Krosing, who talked about how to escape from it in a non-traditional and, in his opinion,
Starting point is 00:21:46 and actually my opinion as well, better way, without single-user mode. And since he is part of the Cloud SQL team, it also shows how much effort managed Postgres providers put into this area, realizing this is a huge risk. Yeah. And it's not even, even if it's a small risk, the impact when it happens is not small. So it's one of those ones where... Absolutely, good correction. It's low risk, high impact. Exactly.
Starting point is 00:22:11 Yes. Yes. So I think the cases that were blogged about were down for hours and hours, possibly even... was it even a day or two of downtime for those organizations? And that was... that is then... I mean, you're talking about dangers, right? Global downtime. Exactly.
Starting point is 00:22:30 You're going to lose some customers over that, right? Yeah. Now, the next item: 4-byte integer primary keys. It's still a thing. You know, I was surprised to have this case recently, which was overlooked by our tooling. Oh, really? I couldn't, like, realize, how come? Yeah, because it was a non-traditional way to have this. Go on.
Starting point is 00:22:56 Well, first of all, it was a sequence which was used by multiple tables, one for all of them, and somehow it was defined so that our report in postgres_checkup didn't see it. So when it came, I was like: how come? This is an old friend, or old enemy, not friend, enemy. Old enemy. Yeah. I haven't seen you for so many years. And you look different, you know, because multiple tables use it, but still, it's not fun. And this causes partial downtime, because some part of the workload cannot work anymore,
Starting point is 00:23:38 you cannot insert. Yeah. Yeah, so, by the way, I also learned that if you just do an in-place ALTER TABLE for a huge table, it's not as dumb as I thought. I checked the source code, I was impressed, and this code is from nine-point-something, from before. So if you ALTER TABLE ... ALTER COLUMN to change from int4 to int8, it actually performs a job similar to VACUUM FULL, recreating indexes, and you don't have bloat. I expected like 50% bloat. Oh, why? Because I thought it would rewrite the whole table. I was mistaken.
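A sketch of the blocking, in-place change being described, with hypothetical table, column, and sequence names; this takes an exclusive lock and rewrites the table, so it needs a maintenance window:

```sql
-- Rewrites the table and rebuilds its indexes under an AccessExclusiveLock.
ALTER TABLE orders
  ALTER COLUMN id TYPE int8;

-- If the backing sequence was created as integer (the default for serial since PG 10),
-- widen it too so it can go past 2.1 billion.
ALTER SEQUENCE orders_id_seq AS bigint;
```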
Starting point is 00:24:22 It's quite smart. Yeah, it's, of course, a blocking operation. It causes downtime to perform it, but you end up having quite a clean state of the table and indexes. Not just quite clean, it's fresh. Yeah. So that is a table rewrite, no? Yes. Well, table, yes. Well, you're right. I was thinking of a table rewrite as a very dumb thing, like create new tuples and delete other tuples. Got it, got it, got it. But there is a mechanism of table rewrite in the code.
Starting point is 00:24:52 Now I saw it finally, I'm still learning, you know, sorry. You might end up with some padding issues if you had it optimized well before, but yeah. Yeah, it also feels like Postgres could implement some reshape eventually because there are building blocks in the code already, I see it, and like to first, like offline style to change column order, and then if you want it, and then fully online style.
Starting point is 00:25:19 If PgSqueeze goes to core, right? Yeah, yeah, yeah. That would be great. Yeah, I'm just connecting paths here and can be very powerful in like three to five years maybe. But it's a lot of work additionally. So all those who are involved in moving huge building blocks, I have huge respect.
Starting point is 00:25:41 Yeah. So OK. And I think this one, if you know what you're doing, is easier to recover from. I assume, like with the sequence, for example, you can handle it multiple ways, but you can set the sequence to, like, negative 2 billion. Everyone thinks they are smart. And this is the first thing I hear when we discuss this: let's use negative values.
Starting point is 00:26:07 Of course, if you can use negative values, do it. Because we know Postgres integers are signed integers, so we use only half of the capacity. Four-byte capacity, half of it is 2.1 billion, roughly. So you have 2.1 billion more, but it's not always possible to use. But this is an old, old story, still making some people nervous, and I think it's good to check in advance. So much better to have alerts, to be checking. I've saved several companies, big ones, from this just by raising it, and I know in some companies
Starting point is 00:26:49 It was like one year or a few years of work to fix it. So what was the problem before, was it looking at columns instead of looking at sequences? Or what, sequences are always 8 bytes, right? It was always so, yes, they are 8 bytes. The problem was with the report, I don't remember honestly. There was some problem with the report. It was not the standard way, not just create a table and you have a default with a sequence and you see it. Something else, some function, I don't remember exactly. But usually our report catches such things. Or you can just check yourself: if you have 4-byte primary keys, it's time to move to 8 bytes, or to UUID version 7, maybe. That's it about this.
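A sketch of a self-check along those lines, using catalog queries; adapt as needed:

```sql
-- Tables whose primary key includes an int4 column (overflow at ~2.1 billion).
SELECT c.relname AS table_name,
       a.attname AS column_name
FROM pg_index i
JOIN pg_class c     ON c.oid = i.indrelid
JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = ANY (i.indkey)
WHERE i.indisprimary
  AND a.atttypid = 'int4'::regtype;

-- How much of each sequence's own range is used. Note: an int4 column still
-- overflows at 2.1 billion even if its sequence is 8-byte, so the query above
-- is the more important check.
SELECT schemaname, sequencename,
       round(100.0 * last_value / max_value, 2) AS pct_used
FROM pg_sequences
WHERE last_value IS NOT NULL
ORDER BY 3 DESC;
```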
Starting point is 00:27:47 Then let's heat up the situation: replication limits. So, in the beginning I mentioned vertical and horizontal scaling, and usually people say there's not enough CPU or something, and disk I/O, and we need to scale, and you can scale read-only workloads, having more read-only standbys. But it's hard to scale writes. And this is true. But it's also true that at some point, and Lev mentioned it in our PgDog episode, he mentioned that at Instacart they had it, right? So like 200 megabytes per second of WAL generation, or it's already over 100, I don't remember exactly what he mentioned, but somewhere in that area... It was a lot. Well, just 10 WALs per second, for example, gives you 160. Well, RDS, that's less
Starting point is 00:28:37 because they have 64-megabyte WALs. They raised it four times. Anyway, at 100 or 200, 300 megabytes per second, you start hitting problems with single-threaded processes. Actually, I wonder why we don't monitor this. Any Postgres monitoring should have low-level access. RDS doesn't have it, but with low-level access we should see how much CPU usage there is for every important single-threaded Postgres process: WAL sender, WAL receiver, logical replication worker, checkpointer maybe as well. That would be helpful. Of course, because it's dangerous to grow at scale when you hit 100% of a single vCPU. And then you need to either vertically or horizontally scale, or start saving on WAL generation. And fortunately, in pg_stat_statements we have WAL metrics, three columns, since Postgres 13.
Starting point is 00:29:47 But unfortunately, this is query level. What we need, we also need table level, to understand which tables are responsible for a lot of WAL. And pg_stat_all_tables lacks it. I think it's a good idea to implement. If someone wants some hacking, this is a great idea: add three more columns, WAL-related metrics, to pg_stat_all_tables, pg_stat_sys_tables, and pg_stat_user_tables. It would be great. Also, maybe to pg_stat_database, as a global view of things, how much WAL per database.
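A sketch of the query-level view that does exist today (the WAL columns were added in Postgres 13):

```sql
-- Which queries generate the most WAL, per pg_stat_statements.
SELECT round(wal_bytes / 1024 / 1024) AS wal_mb,
       wal_fpi,
       wal_records,
       left(query, 60) AS query
FROM pg_stat_statements
ORDER BY wal_bytes DESC
LIMIT 10;
```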
Starting point is 00:30:21 Yeah, I've got a vague memory, there were a couple of WAL-related new views, new system views. Oh yes, but that's about, yeah, it's about, are you talking about pg_stat_io? No. No. WAL-related, in Postgres 13 it went to EXPLAIN and it went to pg_stat_statements. This is what happened. Anyway, this is not an easy problem to solve. It's easy to check if you have access. Unfortunately, if you're on managed Postgres, you don't have access.
Starting point is 00:30:57 They need to check it, what's happening, especially on standbys. And also it makes sense to tune compression properly, because compression can eat your CPU. Remember we discussed WAL compression, I always said let's turn it on. Now I think: let's turn it on unless you have this problem, in which case the WAL sender
Starting point is 00:31:19 will get close to the limit. Yeah. You need to check how much of that is WAL compression. And also we have new compression algorithms implemented in fresh Postgres versions. So, yeah, zstd. So we can pick a better one. For example, lz4 should provide something similar. As I remember, I saw some benchmarks, I didn't do it myself yet.
Starting point is 00:31:53 So it should be similar to pglz, the default compression, in terms of compression ratio, but it takes much less CPU, like two to three times less CPU. So it's worth choosing that. I looked at it briefly, just for our own use for plan storage, and we even got better performance as well. So it won on both for us.
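A minimal sketch of switching the setting; the lz4 and zstd values need Postgres 15 or newer and a server built with the corresponding library, and on managed services this would go through the parameter group or flags instead:

```sql
-- wal_compression is reloadable, no restart needed.
ALTER SYSTEM SET wal_compression = 'lz4';
SELECT pg_reload_conf();
```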
Starting point is 00:32:16 Well, you probably saw what I saw on the Small Datum blog, right? No. Okay. Maybe. Let's move on, because there are more things to discuss. Design limits. So some people already think about what they will do when their table reaches 32 terabytes. Yeah, I guess this and the last one both feel like adult problems again, right? Like there aren't too many small startups hitting the Instacart level of WAL generation
Starting point is 00:32:50 or 32 terabytes of data. So yeah, it's really big clusters. But we can start thinking about them earlier and be better prepared. Maybe sometimes not spending too much time, because, you know, if you spend too much time thinking about how you will bring statistics to a new cluster after a major upgrade... but it's already implemented in 18. So just a couple of years more and everyone forgets about this problem. It's the kind of thing that's useful to have, like, if you've got a to-do list you're
Starting point is 00:33:25 going to be using in a couple of years time, or a calendar that you know you're going to get an alert for, just put a reminder in for a year or two's time. Just check. We also mentioned a few times during last episodes, latest episodes, this bothers me a lot. First I learned Cloud SQL has this limit, then I learned RDS also has this limit,
Starting point is 00:33:45 64 terabytes per whole database. What's happening here? It's already not a huge database. But again, in a couple of years, who knows, they might have increased that. Yeah, well, I think it's solvable. Of course, I guess this is the limit of a single EBS volume, or a disk on Google Cloud, PD SSD or however they call it. So yeah, it's solvable, I think, right? With some tricks. But these days, it doesn't feel huge. 64, it feels like a big database, right?
Starting point is 00:34:23 But when we say... Yeah, to most of us. But we have stories, 100-plus-terabyte databases. All of them are self-managed, I think. And all of them were sharded. All of those were sharded. Yeah, 100%. Yeah, we had a great episode, 100 terabytes,
Starting point is 00:34:43 where it was Adyen, who else? We had Notion and we had Figma. Figma, right. And Notion and Figma, they are on RDS, but it's sharded; a single cluster, it's impossible on RDS, right? And I think Adyen has 100-plus terabytes. No, they have 100 terabytes, but it's self-managed. Yes.
Starting point is 00:35:04 Because on RDS it's impossible, not supported. Well, and they're sharded also. Yeah, yeah, yeah, but in large companies like that there are always, like, some parts that are sharded, some not. So anyway, when you have 10 to 20 terabytes, it's time to think, if you are on RDS or Cloud SQL: how will you grow 5x? Because if you're at 20 terabytes, growing 5x is 100, and it's already not possible with a single cluster, right? Another reason to think about splitting somehow. Then a few more items, like data loss. Data loss is a big deal, if you poorly designed backups or HA solutions.
Starting point is 00:35:57 like split brains. So data loss, split brain. Actually, I thought we had the discussion, there is an ongoing discussion in the project called CloudNativePG, where I raised the topic of split brain and demonstrated how to do it a couple of weeks ago. The good news is I see they decided to implement something, to move in a direction similar to Patroni, because when a network partition happens and the primary is basically alone, it's bad, because it remains active. And as I demonstrated, some parts of the application might still talk to it. This is classical split brain.
Starting point is 00:36:42 And based on that discussion, I saw it triggered a thought, I never thought about it deeply, actually, but is split brain just a variant of data loss? Well, I guess you technically might not have lost the data. It's still there, you just have two versions. Which one is correct? It's worse than data loss, because now you have two versions of reality. Yeah. And with data loss, it's bad, but you can apologize and ask: please bring your data back again. And in some cases we allow some data loss, of course it is also a really sad thing to have, but sometimes we have it, like, officially: some data loss might happen, the risk is very low, and at maximum this number
Starting point is 00:37:34 of bytes, for example, but it might happen. With split brain it's worse: you need to spend a lot of effort to merge realities into one. Yeah. Most cases of data loss I've seen have tended to be at least two things gone wrong. Like user error, or some way that the node's gone down, but quite often it's user error: an accidental DELETE without a WHERE clause, or dropping a table in one environment. Well, this is like higher-level data loss. Well, but that can cause the lower-level data loss if you then also
Starting point is 00:38:12 don't have tested backups, and it turns out you didn't have a good one. So it's the combination of the two things. Often, for me, it's still, yeah. Well, yes. Yeah, if backups are missing, it's bad, you cannot recover. But also, data loss for me, classical, lower-level, is: the database said commit successful, and then my data is gone. So that's scary and dangerous. As, like, a CTO, it undermines trust in Postgres again, right? Yeah. If procedures are
Starting point is 00:38:46 leading to data loss and also split brains. Is it actually happening often, or is it more the CTO thinking: I don't want to choose this technology because it could happen? Depends on the project and the level of control. I'm pretty confident that in many, many, many web and mobile app, OLTP-style projects, unless they are financial or something, like social networks, social media, maybe even e-commerce and so on, data loss happens, sometimes unnoticed. If you have asynchronous replicas and a failover happened, in the process of failover, like Patroni by default with asynchronous replicas allows up to one megabyte,
Starting point is 00:39:32 it's written in the config, up to one megabyte of data loss officially. Mebibyte, so one mebibyte is possible to lose. And who will complain if it's social media, comments or something. We store, like, comments; we lost some comments, nobody noticed, maybe. But if it's a serious project, it's better not to do it. Or split brain. Yeah. Anyway, this is more, it's not like Postgres doesn't, Postgres does its job quite well here.
Starting point is 00:40:02 It's a question mostly for everything around Postgres, the infrastructure, and if it's managed Postgres, their infrastructure: how they guarantee there is no data loss. And last time we discussed the problem with synchronous_commit, and we discussed in detail how right now Postgres doesn't do a good job, not revealing proper LSNs on standbys. So even Patroni cannot choose properly, and can have data loss in the case of synchronous_commit with remote_write. We discussed it. Okay.
Starting point is 00:40:38 Anyway, this feels like something definitely for improvement here. Okay, corruption. My general feeling: people don't realize how many types of corruption might happen. And it remains unnoticed in so many cases. When you start talking about it, people's reaction sometimes is: wait, what? So yeah, corruption at various levels might happen.
Starting point is 00:41:10 So actually, maybe this, so in terms of the list, where we started, kind of the point of the topic, there was kind of dangers, right? Is this one of those ones that, if it silently happened for a while, it suddenly becomes a complete loss of trust in the underlying system, or... Yeah. As usual, I can rant about how Postgres defaults are outdated. We know data checksums were only recently enabled by default. Yeah, great change. Like a month or two ago, right? It will only be in 18. It should have been done five years ago.
Starting point is 00:41:46 We saw evidence. Yeah. Yeah, many managed Postgres providers did it, like RDS, that's great. Which is great. And that kind of is then the default for a lot of people. Yeah, but it also doesn't guarantee that you don't have corruption.
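A quick way to check whether your own cluster has them on:

```sql
-- 'on' means page checksums are enabled cluster-wide.
SHOW data_checksums;
```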
Starting point is 00:42:00 You need to read all the pages from time to time, right? And do they offer something, some anti-corruption tooling? They don't. Nobody. Okay, they enabled it, so what? This is just a small piece of the whole puzzle. Yeah, and amcheck is improving as well. I think, is it in 18?
Starting point is 00:42:20 Yeah, that's great, that's great. We're getting GIN indexes as well. Yeah, big. In fact, I think we, did we have an episode on amcheck? I feel like it came up at least once or twice. Maybe index maintenance. I cannot remember all our episodes, or many. Yeah, me neither.
Starting point is 00:42:34 But it's great to see the progress in the area of amcheck, but again, it's a small piece of the puzzle, and it has settings, like, options to choose from. So it's not trivial to choose among the options.
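A minimal amcheck sketch; the index name is hypothetical, bt_index_check only takes light locks, and bt_index_parent_check blocks writes, so it belongs off-peak:

```sql
-- amcheck ships with Postgres as a contrib extension.
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Light verification of a B-tree index; the second argument (heapallindexed)
-- also checks that every heap tuple has a matching index entry.
SELECT bt_index_check('orders_pkey'::regclass, true);

-- Stricter parent/child checks; takes a ShareLock, so it blocks writes.
SELECT bt_index_parent_check('orders_pkey'::regclass);
```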
Starting point is 00:43:04 And also, again, like a couple of episodes ago, we discussed support and how difficult it is to sometimes understand what's happening, right? And I right now have a case where a very big platform doesn't help to investigate corruption, and nobody has the ability to investigate it but them, but they are not helping. It was one of the reasons I was provoked to talk about that in that episode. It looks really bad: you guys are responsible for that corruption case, and you don't do a great job. I think it's a problem with the industry, and we discussed it already, so let's not repeat. But in general, I think if you are a CTO or a leader who decides priorities, my big advice is to take this list as, like, a checklist,
Starting point is 00:43:49 and evaluate the situation for yourself, better let us do it of course, but you can do it yourself, and then plan some proactive measures, because corruption testing can be done proactively, even on RDS. If it happens, you need support, of course, because sometimes it's low level and you don't have access, but at least you will feel control over it.
Starting point is 00:44:18 So, for corruption, anti-corruption tooling is needed, this is what I feel. Yeah, that's it, that's all my list. But I'm sure it's lacking something, like more security-related stuff, for example; as usual, I tend to leave it aside. What do you think, was it good? Yeah, you see a lot more of these things than I do, obviously, but yeah, I think it's a really good list, and checklists, right? Of course, you could go on forever with things to be scared of.
Starting point is 00:44:54 But this, this feels like if you ticked all of these off, you'd be in such a good position versus most, and obviously things can still go wrong, but these are some of the most, at least even if they're not the most common, some of the things that could cause the biggest issues, the things that are most likely to get on the CEO's desk or in the inbox. So yeah, this feels like if you're on top of all of these things you're going to go a long long way before you hit issues. I have a question to you. Guess among these 10 items which item I never had in my production life. Oh maybe give me a second. It's tricky right?
Starting point is 00:45:35 Transaction ID wraparound? Exactly. I only... It's rare. Well, it's rare, yeah. Let's cross it off, no problem. Yeah, I only found a way to emulate it, which we did multiple times, but never had it in reality in production. Everything else I've had. Yeah. Not once. I nearly guessed too quickly that it was going to be split brain, but then I was like, no, wait, read the whole list. But I'm guessing you had split brains like once or a couple of times maybe, max, man. Yeah, replication manager: split brain as a service. Yes. Okay. Yeah, pre-Patroni days.
Starting point is 00:46:20 Yeah, makes sense. All right. Nice one. Thanks so much, Nikolay. Thank you. See you next time. Catch you next time. Bye.
