Postgres FM - High availability
Episode Date: July 28, 2023

Nikolay and Michael discuss HA (high availability) — what it means, tools and techniques for maximising it, while going through some of the more common causes of downtime.

Here are some links to some things they mentioned:
https://en.wikipedia.org/wiki/High_availability
https://postgres.fm/episodes/upgrades
https://github.com/shayonj/pg_easy_replicate/
pg_easy_replicate discussion on Hacker News https://news.ycombinator.com/item?id=36405761
https://postgres.fm/episodes/connection-poolers
https://www.postgresql.org/docs/current/libpq.html
Support load balancing in libpq (new feature in Postgres 16) https://commitfest.postgresql.org/42/3679/
target_session_attrs options for high availability and scaling (2021, a post by Laurenz Albe) https://www.cybertec-postgresql.com/en/new-target_session_attrs-settings-for-high-availability-and-scaling-in-postgresql-v14/
Postgres 10 highlight - read-write and read-only mode of libpq (2016, a post by Michael Paquier) https://paquier.xyz/postgresql-2/postgres-10-libpq-read-write/
Postgres 10 highlight - Quorum set of synchronous standbys (2017, a post by Michael Paquier) https://paquier.xyz/postgresql-2/postgres-10-quorum-sync/
https://github.com/zalando/patroni
https://postgres.fm/episodes/replication
https://blog.rustprooflabs.com/2021/06/postgres-bigint-by-default
Zero-downtime Postgres schema migrations need this: lock_timeout and retries (2021) https://postgres.ai/blog/20210923-zero-downtime-postgres-schema-migrations-lock-timeout-and-retries
A fix in Patroni to mitigate a very long shutdown attempt when archive_command has a lot of WALs to archive https://github.com/zalando/patroni/pull/2067

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello, hello, this is Postgres FM episode number 56. I'm Nikolay, this is my co-host Michael. Hi, Michael.
Hello, Nikolay.
So we are going to talk today about HA, very funny topic. If you write it three times, you will have ha-ha-ha, right?
Yeah, I never thought of it like that.
Yeah, HA, very funny topic, actually. Many people fight for some achievements in this area,
but it's high availability,
opposite to low availability,
which is like downtime and incidents and so on.
Yeah, or number of nines.
That's how it was popular to describe
how many nines you were a few years ago, wasn't it?
Right. How many nines do you think is good?
Well, I mean, I remember when loads of
the major providers were claiming five nines. And then when you did the math and realized like
quite how little downtime that is, it's like five minutes per year. Yeah, nearly impossible to
guarantee that. It's called downtime budget. So yeah, what do you think? I mean, I think four is
still pretty good. But you, what would you say?
Oh yes, excellent. I think if during the whole year you allowed only five minutes of downtime, I think it's
quite good. Of course, six nines means like half a minute of downtime per year, which I think is absolutely
excellent. But I would take three as well. Like, three nines, it's more than four hours.
Well, for many, it's okay, not for all.
Actually, I skipped one.
Four nines, it's 52 minutes per year.
And it's okay.
Well, it's already quite a good goal, four nines.
Yeah.
Right.
But of course, five nines is, I think,
maybe it's like A mark and six nines is A plus mark, if you can achieve it.
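For reference, a quick sketch of the arithmetic behind these grades, i.e. how much downtime per year each number of nines allows:

```python
# Allowed downtime per year for a given number of nines.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for nines in (3, 4, 5, 6):
    unavailability = 10 ** -nines               # e.g. 4 nines -> 0.0001
    budget_minutes = MINUTES_PER_YEAR * unavailability
    print(f"{nines} nines: ~{budget_minutes:.1f} minutes of downtime per year")
# 3 nines: ~525.6 (about 8.8 hours); 4 nines: ~52.6; 5 nines: ~5.3; 6 nines: ~0.5
```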
And I mentioned budget because some people have management for HA, for downtime, a so-called budget.
And, for example, if you plan some work which has planned downtime, or you have high risks of some downtime,
if you, for example, upgrade something, the OS, or do a major Postgres upgrade
or something, and you expect some things can go wrong. So in this case, if you already had downtime
minutes or seconds which contributed to lowering your HA characteristics this month, you probably
want to postpone your work till next month, right? This is a regular approach to manage this HA budget, downtime budget. And presumably because some services do credits, don't
they, for their customers, if they don't achieve certain goals? So that might avoid them paying
out lots in terms of penalties. I had a question for you as well, though. We normally talk
about extremely high throughput environments,
but if you've got quite a low, like, let's say you might even provide
an internal application for a team or something,
have you ever heard that phrase or the kind of philosophical question,
if a tree falls in a forest and there's no one around to hear it,
does it make a sound?
If a service is unavailable but it receives no requests in that time, is it down?
Well, I usually consider a service down when it's observed as down by an external service. It should be observed
externally. So if from outside, from various points, for example, from various regions in the
world, service cannot be reached. And some simple checks like front page is not responding at all or returning 5xx errors,
meaning some problems on server side, or some simple API calls like health checks are not
responding from outside.
This is downtime.
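A minimal sketch of such an external probe, assuming a hypothetical https://example.com/health endpoint; run something like this from several regions and count timeouts and 5xx responses as downtime:

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service answered with a non-5xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as e:
        return e.code < 500   # 4xx means reachable, just unhappy
    except OSError:
        return False          # unreachable or timed out, counts as down


if __name__ == "__main__":
    print("up" if probe("https://example.com/health") else "down")
```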
Internally, probably we can have everything working and just some
network incident happening. Like, some small issue with network
configuration can lead to huge downtime, a huge outage for many customers. For example,
sometimes we see AWS or GCP has issues with some reconfiguration of network devices or
routing and so on. And this leads to huge incidents. It's always DNS, right?
Not always, actually, but very often, yes. That was the joke. Sorry. Just for listeners,
I've had awful internet today. So this is the first day that we can't see each other.
So Nikolai couldn't see that I was grinning, which is my normal giveaway that I'm telling a joke.
Episode number 56, and I didn't learn how to recognize your smiling just from audio.
Yeah.
Okay.
So HA is about uptime.
Uptime is opposite of downtime.
And what should we discuss in this area?
Where do you want to start?
Well, I think there might be a lot of people out there like me
that conflate HA and high availability with just switchover and failover tools
and processes and runbooks, and you make a really
good point that it's not about those. Those are a couple of techniques we have to achieve higher
availability, but it's really about avoiding downtime. And when you made that point to me
beforehand, while we were preparing, it made sense to then go and list all the ways we can cause
ourselves downtime, or, let's say, the main causes of Postgres-specific downtime.
And then maybe once we've gone through those, or maybe as we go through those,
discuss some of the things that we can then do about it. Does that sound sensible?
Yeah, it does. And you're also right. It depends on your point of view.
If, for example, you are responsible only for infrastructure.
And remember, we can distinguish infrastructure DBA and application DBA.
And if you are purely infrastructure DBA, at some point, you probably should consider everything engineers, developers do with application changes,
for example, schema changes and so on.
Probably you should consider it a black box
and care only about failures related to infrastructure.
For example, as you mentioned, autofailover,
meaning that some node is down due to a number of reasons, or PgBouncer or any pooler also being a single point of failure,
SPOF, right?
Or some networking issues, something like this.
So purely infrastructure, and you consider everything application does
as a black box. But if you are an application DBA, or if you care about both parts,
actually, the boundary here is not so strict, right? So I prefer to consider everything.
And if you're a database expert, or at least part of your work is about databases, related to databases.
In this case, I prefer taking both infrastructure things like autofailover
and purely application things.
For example, we create primary key integer 4.
We have 2 billion rows.
Soon we have a huge outage. Or we forgot to implement lock_timeout and retries logic.
And we release something, some small schema change, while autovacuum is working in transaction ID wraparound prevention mode, and we are going to have a queue of queries which are waiting for us while we are
waiting for this autovacuum. So we have at least a partial outage as well, sometimes global. I mean,
sometimes everything is going down because of this simple mistake, which is easy to make because it's the default behavior. If you don't do anything, at some point you will have it.
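A minimal sketch of the lock_timeout-and-retries idea mentioned here, using psycopg2; the table, column, and connection string are hypothetical:

```python
import time

import psycopg2

# A short, metadata-only change is the best candidate for this pattern.
DDL = "ALTER TABLE orders ADD COLUMN note text"


def run_ddl_with_retries(dsn: str, attempts: int = 10) -> None:
    conn = psycopg2.connect(dsn)
    try:
        for attempt in range(1, attempts + 1):
            try:
                with conn, conn.cursor() as cur:   # commits on success
                    # Fail fast instead of queueing behind autovacuum or a
                    # long transaction and blocking every query behind us.
                    cur.execute("SET LOCAL lock_timeout = '2s'")
                    cur.execute(DDL)
                return
            except psycopg2.errors.LockNotAvailable:
                time.sleep(min(2 ** attempt, 30))  # back off, then retry
        raise RuntimeError("lock never acquired; retry in a quieter period")
    finally:
        conn.close()


# run_ddl_with_retries("host=db.example.com dbname=shop user=app")
```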
So I prefer to care about both worlds
and include application stuff
in the set of measures to prevent downtime.
So for example, for me, primary key integer four, when we know we will
have a lot of rows inserted, it's also definitely related to HA. Yeah, good point. I mean, you've
mentioned the DBA roles, but really, it's development as well, isn't it? It's everybody
who has responsibility for any of the design is involved in this as well.
And even things that we can't avoid, right? We've had episodes already about major version
upgrades, and discussed quite how difficult it is to do those, potentially even impossible to do those
without any downtime, like with absolutely zero downtime.
It's possible, so yeah. Oh yeah, we discussed that it's possible. I provided a recipe. By the way,
we mentioned this new tool to orchestrate major upgrades, the tool called pg_easy_replicate,
which supports major upgrades, written in Ruby as I remember. And
we mentioned that I haven't tried it, but I
criticized it a little bit. But this tool got attention on Hacker News. And yesterday, the author
contacted me directly on Twitter. So we discussed improvements. And obviously, it's possible to
achieve practically zero downtime, I mean, like a one- or two-second latency spike, if you have
not huge data volumes and not big TPS numbers. For example, if you have less than 10
terabytes and less than 10,000 TPS on the primary, it's definitely possible to achieve
basically a zero-downtime upgrade, where you will have just a spike of latency. But this needs a component besides
Postgres: you need PgBouncer or any pooler (we had an episode about poolers) which supports pause-resume,
with the ability to switch the primary to a different machine.
Yeah, right, or presumably some of the
load balancing tools and operators support that kind of thing as well. Pause-resume is not very popular, I think.
But I might be mistaken.
It's an interesting topic.
I think it's underdeveloped.
It received not a lot of attention so far, this topic.
Because, you know, we have bigger problems.
We have bigger points when we can have downtime.
For example, even minor upgrade can lead to downtime
for up to one minute. And there are many risks where you can put everything down to many minutes.
Yeah, I know we don't have time to discuss each one in depth, but should we run through a few
more?
You know, let's discuss autofailover first because, of course, it's a very important component
to achieve HA.
And I remember discussions 10 years ago or so, discussions that it should live outside Postgres.
This should not be inside Postgres itself. So here we are.
We have Patroni and alternatives. And inside Postgres we almost don't have anything.
Actually, we have something. And in recent Postgres versions, we have improvements in libpq. This is a library which is used in many, many, many drivers,
which are used by application code to connect to Postgres. And already for many years, you could
use multiple hosts in connection. So you can say this host or that host or that host.
And automatically, libpq,
so, like, I mean at the driver level,
would try the first one and, if it's down,
immediately try the second one.
And this is already convenient
to achieve lower downtime, right?
Yeah, or higher availability.
Or bigger uptime, or how to say it.
Yeah.
Better uptime.
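To make the multi-host idea concrete, a minimal sketch using psycopg2 (which uses libpq underneath); the host names are hypothetical, and libpq simply tries them in order until one accepts the connection:

```python
import psycopg2

conn = psycopg2.connect(
    "host=pg1.example.com,pg2.example.com,pg3.example.com "
    "dbname=app user=app connect_timeout=3"
)
with conn.cursor() as cur:
    cur.execute("SELECT inet_server_addr(), pg_is_in_recovery()")
    addr, in_recovery = cur.fetchone()
    print(f"connected to {addr}, standby: {in_recovery}")
conn.close()
```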
And in Postgres 15 or 16, I don't remember particularly,
additional improvements happened there.
I don't remember details, but I remember previously it was quite,
you know, like the order is hard-coded.
Ah, load balancing went there.
So load balancing is supported at protocol level right now
in very new Postgres versions.
So maybe in 16, actually, which is not released yet;
it's in beta 2, as we speak.
But this is good addition.
And so, meaning that some things go to Postgres core,
basically, because libpq is part of Postgres.
But also there is target_session_attrs.
I don't remember exactly, but you can say, now I need read-write connection.
And now I need only read-only connection.
And establishing connection, libpq knows what you need.
And it can automatically switch to the primary.
It knows you will need writes.
And this is also some interesting thing.
You can build interesting stuff on top of it.
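A sketch of what that looks like in a connection string, again with hypothetical host names; target_session_attrs=read-write makes libpq skip standbys, and load_balance_hosts is the Postgres 16 addition mentioned above:

```python
import psycopg2

WRITE_DSN = (
    "host=pg1.example.com,pg2.example.com,pg3.example.com "
    "dbname=app user=app target_session_attrs=read-write"
)
READ_DSN = (
    "host=pg1.example.com,pg2.example.com,pg3.example.com "
    "dbname=app user=app target_session_attrs=prefer-standby "
    "load_balance_hosts=random"  # requires libpq from Postgres 16+
)

conn = psycopg2.connect(WRITE_DSN)
with conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery()")
    print("on primary:", not cur.fetchone()[0])   # expected: True
conn.close()
```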
But of course, it's not autofailover.
Autofailover is needed.
I remember when we had a couple of servers or three servers,
and many people, many DBAs, considered autofailover a very bad thing
because who knows what it will do.
It was quite a popular opinion that you should do it yourself.
It's better to be paged, wake up at 1 a.m. and perform failover properly.
And it's okay to be down for a few minutes,
but not to allow something who-knows-who developed; who knows what it will do, maybe it will lead
to split brain or something. So it was a matter of trust. But then tooling started to improve.
As usual, I should warn: don't use replication manager (repmgr) and other tools which are not
based on consensus algorithms like Raft or Paxos.
So use Patroni or similar tool which is based on consensus algorithm.
And a lot of work was invested to make it work properly.
So with replication manager, I personally had several cases of split brain.
With Patroni, I never had it.
I had different issues.
And usually, Kukushkin, the maintainer of Patroni,
fixes them quite quickly.
So it's quite well maintained.
And I think right now it's the de facto standard.
Many systems, many Postgres operators, include it as a component.
And now nobody thinks anymore that you should use manual failover, right?
It's already nonsense. Yeah, I can't imagine somebody still believing that. But for anybody
that isn't familiar with split brain, the time I've come across that most is, well, in fact,
actually, definition wise, is that just when you can end up with one database, one node believing one thing and one node believing another in terms of data.
So we have two primaries.
Old name for primary is master.
We have two masters.
So we have two primary nodes
and some writes go to one node,
other writes go to another node
and nobody knows how to merge them back and solve it.
It's usually very hard in the general case. In some cases
it's possible. For example, if we had a mild version of split brain, when some
writes on one node happened to particular tables and on another node to different tables, in this
case it's probably quite easy to merge. But in the general case, it's an unsolvable problem.
So you should avoid it.
This is the biggest problem of autofailover, basically.
So this risk of split brain is the biggest problem.
And those tools which do not use a consensus algorithm
are basically bad in this area.
Even if they have, like, a witness node
and a lot of heuristics coded, it's not good. And I remember I
spent some time analyzing replication manager's C code, and I found places exactly where split
brain can occur, and it's like an unsolved issue. So just don't use it. Maybe I
sound not polite, but it's based on many years of experience and observations of
troubles I saw and my customers saw and experienced. So Patroni is battle-proven and
the de facto standard right now, but there are a few other alternatives which are also
based on Raft or Paxos. Well, I think a lot of people these days are using cloud provider defaults. You
know, you can just click a button that says high availability and it gives you your replicas
and autofailover as a feature, not having to worry about exactly how that's
implemented. And even some people are using, I believe it's a big feature of the Kubernetes operators,
or still not sure how to pronounce that, but the operators that are coming up left, right, and center.
So I think there are new options where you don't have to worry about the exact underlying technology, or maybe I should be more worried about that.
Yeah, yeah, yeah.
Okay, so split-brain is one risk if you use autofailover,
and we agreed we should use autofailover to achieve better HA characteristics.
Another risk is data loss.
By default, if we use asynchronous standby nodes,
if autofailover happens, we might lose some data which was written on the primary,
but none of the standbys received it.
In this case, autofailover happens,
and different node becomes primary,
and it lacks a tail of writes.
So some portion is not propagated there yet,
because, for example, of replication lag,
which all standbys had.
Because they are allowed to have lag,
because they're asynchronous.
By default, physical standbys are asynchronous.
And Patroni, for example, has a special knob in its configuration
which says lags of more than, by default, I think, 10 MB are not allowed
for autofailover to happen.
So if all asynchronous standby nodes have lag above this threshold,
autofailover won't happen.
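For illustration, a sketch of the same kind of lag signal, read from pg_stat_replication on the primary (this is not Patroni's own code, just the underlying idea); the connection string and threshold are hypothetical:

```python
import psycopg2

LAG_THRESHOLD_BYTES = 10 * 1024 * 1024  # example threshold: 10 MB

QUERY = """
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication
"""

conn = psycopg2.connect("host=primary.example.com dbname=app user=monitor")
with conn.cursor() as cur:
    cur.execute(QUERY)
    for name, lag in cur.fetchall():
        ok = lag is not None and lag <= LAG_THRESHOLD_BYTES
        print(f"{name}: {lag} bytes behind ({'ok' if ok else 'too far behind'})")
conn.close()
```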
So we choose, we need to prioritize.
We have a trade-off here.
What to choose?
Data loss or availability.
Data loss versus downtime and HA goes down.
I mean, HA characteristics, so uptime worsens.
And usually people can just, we have two solutions.
We can tune this threshold and say,
okay, even one megabyte is too much for us.
But we understand that it increases risks to be down
because autofailover doesn't happen.
The cluster exists, but in read-only mode,
because the primary has gone due to a number of reasons. Some hardware failure, for example. And then
manual intervention is needed. But by the way, a story about autofailovers and how they work
properly. So Postgres AI, it's not a big database. I mean, our website.
And it's running on Kubernetes Zalando operator,
Patroni inside.
And at some point, I noticed the timeline exceeds 100.
What do you mean?
100.
When we have a failover or switchover,
when we change primary node, timeline increments by one.
So we created a cluster with initdb, timeline is one.
We have switchover, we change primary, or failover.
So switchover is planned failover, basically.
Failover is like unplanned change of primary.
Switchover is planned. For example, we want to upgrade or something, change something, we perform switchover, and then we, for example, remove old node, old primary.
And I've noticed that our timeline exceeded 100, meaning that we had so many autofailovers and nobody noticed.
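A sketch of how monitoring could watch the timeline number and catch silent failovers like this; the connection string is hypothetical:

```python
import psycopg2

conn = psycopg2.connect("host=pg.example.com dbname=app user=monitor")
with conn.cursor() as cur:
    cur.execute("SELECT timeline_id FROM pg_control_checkpoint()")
    timeline = cur.fetchone()[0]
conn.close()

# Compare against the value your monitoring stored last time; any increase
# means a promotion (failover or switchover) happened since the last check.
print("current timeline:", timeline)
```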
Is that a good thing or a bad thing? Both. Because it was due to, in Kubernetes,
it's usually a regular problem
to have out-of-memory killer killing Postgres.
It's a big problem and people have solutions to it.
So we needed to adjust some settings,
add more memory, so we did.
But the main problem here was
our monitoring didn't tell us about these cases.
And we lived with it for some months,
maybe actually a year or maybe more.
So we had the changing node,
changing primary node all the time.
And it just happened and happened and happened
and nobody complained, which is a good thing.
But the lack of proper monitoring and alerting is a bad thing, of course.
We should have noticed earlier.
But this case demonstrates how autofailover can be helpful
not to notice some issues, right?
Meaning it can hide those issues.
So it worked well. Patroni worked well.
What didn't work well was monitoring, in this case.
And it's, of course, our fault.
We should configure it properly
and receive all alerts and so on.
Every time a failover happens,
we should receive an alert, of course,
and investigate and try to prevent it
because, of course, it's a stress.
Again, I mentioned risks of split-brain,
which I can say Patroni
almost certainly shouldn't have,
and the risk of data loss.
So this is not good.
But in these cases, data loss didn't happen because,
okay, shutdown, but we have two streams of data propagation.
First, streaming replication, though actually
I'm not sure, with OOM killer issues, is it SIGKILL?
So, meaning that the WAL sender won't send all bytes to standbys, right?
But at least we also have archiving, and archiving goes to object storage.
It was WAL-G, and then we restore.
So data loss at least was not noticed there.
But it's a bad story, actually.
I feel right now embarrassed
about it. Yeah, do as I say, not as I do, right? Yeah, don't do this. Yeah, or learn from my mistakes.
It feels like this is a good time to go on to some of the other things. We could... you mentioned
alerting already. A few other things that we've mentioned in previous episodes that you can alert
on are at certain points of running out of transaction IDs.
So making sure you've got some...
Hold on. Sorry, hold on.
Let me finish with the second solution.
The second solution is using synchronous replicas.
And this is what serious setups should have.
You have a synchronous replica, and better, you have quorum commit.
So you have a semi-synchronous approach when you say, okay, I have five replicas, five standby nodes, and when a commit happens on
the primary, it should be received or applied by at least one of the nodes. And
modern Postgres allows very fine-tuned configuration here, so you can choose
and decide. Of course, if you do it, you increase commit latency for writing transactions.
We talked about that in detail, I think, in our replication episode.
Right.
So this is what you should have, like, you should use it.
In this case, ideally, if autofailover happens, you will have zero data loss because data always exists somewhere already.
It's at least at one node.
It's guaranteed.
And in this case, this is the serious approach.
This is what you should do.
And that's it.
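A minimal sketch of quorum commit, assuming standbys whose application_name values are node2, node3, and node4 (hypothetical names); a commit on the primary then waits until at least one of them confirms it:

```python
import psycopg2

conn = psycopg2.connect("host=primary.example.com dbname=postgres user=postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute(
        "ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (node2, node3, node4)'"
    )
    cur.execute("SELECT pg_reload_conf()")  # picked up without a restart
conn.close()
```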
Also, I wanted just to finish about switchover.
Switchover by default will lead to downtime because of shutdown checkpoint.
We discussed it a few times as well.
But if you just decide to perform switchover,
Patroni needs to shut down primary.
I don't remember if Patroni issues an explicit checkpoint.
It probably should, but it's worth checking.
The recipe for a good, fast switchover is:
make sure you issue an explicit CHECKPOINT.
This helps the shutdown process
perform the shutdown checkpoint much faster,
because for a clean shutdown, Postgres
needs to write all dirty buffers from
the buffer pool to disk, and
if we issued an explicit checkpoint,
the shutdown checkpoint will
have only a little to write.
This is the recipe.
And in this case, when you perform
shutdown, during shutdown,
WAL senders make sure that everything
is sent to standbys.
So clean shutdown, standby nodes received all data from the primary,
so no lags.
And also archive command, everything is backed up.
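A sketch of that recipe: issue an explicit CHECKPOINT right before triggering the switchover, so the shutdown checkpoint has little left to flush; the connection string is hypothetical, and the switchover itself is still done by your HA tooling:

```python
import psycopg2

conn = psycopg2.connect("host=primary.example.com dbname=postgres user=postgres")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CHECKPOINT")  # flush dirty buffers ahead of the shutdown
conn.close()
# ...then trigger the switchover (for example via Patroni), so the shutdown
# checkpoint that follows has very little work left to do.
```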
And this is a big problem I noticed in Patroni, actually.
We had an incident when we also, due to poor monitoring,
monitoring is super important for HA, of course, but due to poor monitoring is super important for hr of course but due to
poor monitoring we didn't notice that archive command was lagging a lot meaning that a lot of
walls were not backed up so primary accumulated many like say thousand walls being not backed up
but when you try to perform clean shutdown and patroni tried to do it right
it tries to wait on it postgres this is a postgres behavior postgres waits
while archive command is working and it's working very long because and maybe there is some reason
that cannot work properly so it still waits and waits and patroni waits and so on. So now I know it was fixed in Patroni and now it waits not long and performs failover.
Also trade-off here, right?
Because we want to back up everything.
But Patroni's choice is to perform failover in this case
and let the backup process continue working.
This is super interesting topic.
Again, worth checking details,
but many people fail to understand,
and I also was failing to understand,
that when we shut down Postgres,
it waits while archive command is working.
And this can lead to downtime, actually.
Unarchived WALs.
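A sketch of an archiving health check that would catch this kind of lag, using pg_stat_archiver; the connection string is hypothetical:

```python
import psycopg2

QUERY = """
SELECT archived_count,
       failed_count,
       last_archived_wal,
       last_failed_wal,
       now() - last_archived_time AS since_last_archive
FROM pg_stat_archiver
"""

conn = psycopg2.connect("host=primary.example.com dbname=app user=monitor")
with conn.cursor() as cur:
    cur.execute(QUERY)
    archived, failed, last_ok, last_fail, since = cur.fetchone()
    print(f"archived={archived} failed={failed} last_ok={last_ok} "
          f"last_failed={last_fail} since_last_archive={since}")
conn.close()
# Alert if failed_count keeps growing, or if since_last_archive is large while
# the server is still producing WAL.
```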
I feel like we could do a whole episode solely on Patroni.
That might be a good one in the future.
Yeah, actually, good idea.
I can refresh memory on some interesting cases and so on.
And yeah, we can do it.
I think it's a good idea.
Patroni is an interesting tool developed at good pace
and receiving a lot of good features.
But I must say it's mostly on the shoulders
of a single maintainer,
Alexander Kukushkin, and definitely it's Python. So I think more contributors could help this
project, because more people work with Python than C, right? And for some people, it's easier to
read and write Python code. So I advocate helping Alexander with better
bug descriptions, testing, and sometimes contributions in the form of pull requests to
the GitHub repository of Patroni. It's worth doing.
Nice.
Okay, I think we covered both failover and switchover and how they can contribute to downtime
and affect uptime or HA characteristics.
Let's switch to something else. Well, I don't think we've actually covered it, but I've written
a list in preparation of all the things that I've seen cause downtime in the past. I guess we're
running a bit short of time and we could run through a few of those. But in terms of solutions,
it's not just those, right? It's not just failover, and you made this
point early on. We've also got those alerts that we can set up, we've got logging we can do, to help
not only to avoid this in the first place, but also to help us learn, when it does
happen, exactly what happened, so we can prevent it from happening again. Some of it's just learning,
right? Like, it's good to be notified or alerted when you're running out of integers in an int4 column.
But it's even better to not have int4 columns in the first place.
Like, I read a really good post by Ryan from Rust Proof Labs.
Yeah, just saying why not default to int8?
It's, you know, only a few extra bytes, and you avoid so much pain in the future if you ever scale. So yeah, that makes a lot
of sense to me. So some of it's learning, I think some of it's helping educate your team, and some
of it I've seen you and others do, or suggest people do: your health checks and your capacity
planning, things like that, where you do look ahead a bit. And so a lot of these issues are
scaling issues, right? It's not too surprising that you can cause yourself downtime through scaling. We've all
read the stories, normally on Hacker News, or a lot of good blog posts over the years, of people
causing themselves downtime via lots of traffic or just general scaling issues or just mistakes.
Yeah, absolutely. Which are not mistakes if people don't know about
it. That's why it's good to learn and read about other people's failures. Yeah, exactly. Yeah. So
let me emphasize one thing here. I'm naturally working in the area of prevention of issues.
And I tend to pay too little attention to monitoring and reactive components.
They are very important. So every time something occurs, it should lead to an alert. I think it's
even more important than prevention; it's the first step of building a good HA system. First, you need
to cover yourself with good alerts. And they can be different. For example, sometimes it's an alert
about failover,
about an out-of-disk-space incident,
about some locking,
deadlocks and so on, many things.
But then you cover yourself with alerts.
And I must say,
monitoring systems currently,
Postgres monitoring systems,
none of them have good alerts for Postgres.
None of them.
Like, many have
parts of it. We can also play this game: give me some monitoring system and I'll name you 10
alerts it's lacking. This is the problem we have. We have no mature monitoring systems designed yet
or provided on the market. It's a big, big underdeveloped area. Many, many systems, they are all good in their sense, but no system to which
I would give even a B plus. I'm not speaking about an A mark, right? So no system would get
even a B plus from me. But, so, we need to cover, with alerts, things like: an incident happened, some problem happened, all these things.
Then we should try to think about prevention already and cover with some alerts about thresholds,
like 80% of something.
For example, you mentioned int4 overflow.
We know when it happens.
It's 2.1 billion.
If some sequence has 2.1 billion value, okay, we should define some alert.
In this case, I would define it maybe at 40, 50% even, because it takes a lot of time to develop proper solution and so on.
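A sketch of such an alert, comparing each sequence's current value against the int4 limit (a deliberately rough check; it also flags sequences that feed int8 columns); the connection string and threshold are hypothetical:

```python
import psycopg2

INT4_MAX = 2_147_483_647
ALERT_AT = 0.50  # alert early; migrating to int8 takes time

QUERY = "SELECT schemaname, sequencename, COALESCE(last_value, 0) FROM pg_sequences"

conn = psycopg2.connect("host=pg.example.com dbname=app user=monitor")
with conn.cursor() as cur:
    cur.execute(QUERY)
    for schema, seq, last_value in cur.fetchall():
        used = last_value / INT4_MAX
        if used >= ALERT_AT:
            print(f"{schema}.{seq}: {used:.0%} of int4 capacity used")
conn.close()
```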
Or also transaction ID wraparound prevention.
By the way, for these two cases, failover won't help, because everything you have on the primary, yeah,
the standbys also have, right? So you need to work at the application level, with the schema, adjusting
something, or, in the case of transaction ID wraparound, at, like, the DBA level: you need to fix
vacuuming, you need to make sure autovacuum or your explicit vacuum did its job properly.
Also, monitoring long transactions, a lot of things.
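And a sketch of the corresponding wraparound check, looking at how far each database's datfrozenxid has aged; the connection string is hypothetical, and 2 billion is used as a rough ceiling:

```python
import psycopg2

QUERY = """
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2000000000, 1) AS pct_towards_wraparound
FROM pg_database
ORDER BY 2 DESC
"""

conn = psycopg2.connect("host=pg.example.com dbname=app user=monitor")
with conn.cursor() as cur:
    cur.execute(QUERY)
    for datname, xid_age, pct in cur.fetchall():
        print(f"{datname}: age={xid_age} ({pct}% of the way to wraparound)")
conn.close()
```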
And a lot of alerts can happen. Then we have like 80%, 50% thresholds for alerting for dangerous behavior.
And also we need some forecasting component
because it's not enough usually to say,
okay, we reached 50% of int4 capacity.
Usually people immediately ask next question, when will doomsday happen? So they need some forecast
and you can apply some machine learning techniques here or use some
libraries. Or just basic statistics
and some rough estimates. It doesn't have to be
too fancy.
Yeah, yeah. So, yeah, in this case, this is already about prevention. And yeah, health checks can help, and
this, like, proactive, not reactive but proactive, management of your Postgres can help
find problems.
Yeah. Do you want to mention some other issues we wanted to
mention, or should we do a quick-fire round and you tell me how common they've been in your experience?
Let's do it. Hardware failure?
Well, hardware failure actually improves over time. I remember, like, 15
years ago it was a nightmare. Right now, even in a cloud environment, where hardware failure should be
considered quite a common thing, I feel it has improved. I mean, cloud providers and many providers
improved the characteristics of hardware, and it's becoming less common, actually. So, of course,
if you manage, like, many thousands of servers, you will have multiple incidents per week, definitely.
Maybe almost every day you will have incidents. It's normal.
But if you manage only a dozen of servers, it's quite stable already.
If you have an incident per week with hardware managing just a few dozen virtual machines or physical machines,
it's time to think about moving somewhere
changing provider.
Yeah. How about operating system failures?
Operating system failures also improve. Unless you went with some fancy kernel settings, it's not happening often. Usually here
we have failures of a different kind, not like everything was fine and then suddenly something.
Usually, well, we need tuning, of course, some kind of tuning,
and we need to take care of behavior of page cache in Linux and so on,
especially if you have a lot of writes.
But usually when we upgrade the operating system,
this is where we might have problems, you know, related to,
first of all, related to glibc
version and collation
changes. Not normally downtime,
but corruption, right?
It's not downtime, I agree.
However,
Postgres 15 started to complain
about glibc
version changes. And I don't
remember, is it a warning or an error?
I think it's a log warning.
I'm not sure.
It's good if it's just a warning.
I don't remember.
Yeah, actually, I'm not sure at all.
I've got a couple more, three more, I think.
Pooler or load balancer issues?
Pooler we discussed recently.
If you grow, grow, grow, if you have a single...
First of all, the pooler, for example PgBouncer, can be
a single point of failure, and if this virtual machine is down, it's not good. You need to have
multiple ones and balance between them. And the second big problem, it happens all the time with many
people still: PgBouncer is a single-threaded process, so it utilizes only one core. You can have dozens of cores on this machine,
but it will take only one core.
So you need to run multiple PgBouncers.
Otherwise you will have saturation,
and it will cause downtime, definitely,
somewhere between, like, 10 or 20,000 TPS,
roughly, depending on the particular CPU you are using.
This experience is based mostly on Intel processors.
If you're, for example, on an ARM processor,
in this case, it can happen earlier.
So, I mean, 10,000 TPS,
it's already not a huge load today.
So a single PgBouncer,
it's a big bomb that can trigger at some point, and this is not good.
Second to last one, running out of disk space?
Very common of course. If you don't have monitoring and your project is growing, it's very common.
It's a good thing that one of the reasons was improved. I mean, previously for replication slots,
basically mainly logical replication slots,
if you created it, but no logical replica is using it,
it leads to accumulation of WALs on the primary
and eventually you are out of disk space.
But now you can specify a threshold after which
Postgres will say enough waiting.
I give up this logical slot in favor of
not being down because of disk space. It can be improved in Postgres, I think. People complain
about the behavior: when Postgres runs out of disk space, you need to perform some dance to fix it,
and it can be improved. I remember recent discussions. I don't remember the details, but
I was personally in this position many times.
Usually I escape rather quickly, but it's very annoying when you're out of disk space
and you need to escape. So yeah.
Yeah, I thought you were going to say this is less common now with
bigger disks, and cloud providers, some of them let you kind of scale this, don't they?
Yeah, most of them support zero-downtime scaling, I mean, scaling up anyway. Scaling up, yeah, not down. Yeah, yeah, good
point. And if you have 90... I remember some issues on Azure, but I don't like Azure anyway,
it's not my natural choice. Like, they didn't support online change until recently, and recently
also with some limitations. But GCP and Azure and others support online change. So you can
just say, I want like a terabyte more. And then you, depending on your file system, you
need to issue a few commands in terminal and then Postgres already is fine. Right? So it's
easy. For what it's worth, I'm warming to Azure. I think there's some cool Postgres already is fine, right? So it's easy.
For what it's worth, I'm warming to Azure.
I think there's some cool Postgres stuff going on there,
but I haven't tried it myself.
Well, that's a different level,
what I call the second layer of the cloud world, right?
Managed Postgres services,
of course they should have it fully automated.
If your managed cloud provider doesn't offer scaling up
for disk space, it's not good at all.
All of them should have this feature already.
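Coming back to replication slots as a disk-space risk: a sketch of the threshold mentioned a few minutes ago (max_slot_wal_keep_size, Postgres 13+), together with a check of slot status; the connection string and size are hypothetical:

```python
import psycopg2

conn = psycopg2.connect("host=primary.example.com dbname=postgres user=postgres")
conn.autocommit = True
with conn.cursor() as cur:
    # Cap how much WAL an abandoned slot may retain before it is invalidated.
    cur.execute("ALTER SYSTEM SET max_slot_wal_keep_size = '50GB'")
    cur.execute("SELECT pg_reload_conf()")
    cur.execute(
        "SELECT slot_name, active, wal_status, safe_wal_size "
        "FROM pg_replication_slots"
    )
    for slot_name, active, wal_status, safe_wal_size in cur.fetchall():
        print(slot_name, active, wal_status, safe_wal_size)
conn.close()
```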
Final one, direct denial of service.
Either an external actor, or I've seen this internally as well, where someone's accidentally hammering their own database.
Have you seen this?
I guess it isn't necessarily a full outage, but yeah,
at least a partial outage normally. Yeah, first slow down and then outage,
especially if you have max connections a lot. And like, it's very common if some mistake
happened in application code, and it started to issue some queries at a huge rate,
a lot of queries per second of some kind,
it can put you down definitely, of course.
Yeah.
Awesome, that's all I have.
Good.
I think it was a good overview of some of it,
maybe not comprehensive, because this topic is huge, actually.
And especially if you include,
as I prefer,
if you include application level stuff into it,
HA is a huge topic
and big departments are working
to achieve good HA.
So SREs and DBAs, DBREs and so on.
But I hope we both gave an overview
and dived into a few particular areas.
Of course, not into all of them.
Thank you so much.
Yeah, awesome.
Thanks so much, Nikolay.
Yeah, I forgot to mention the importance of likes and subscriptions.
Please do it because I know we received a lot of attention recently with recent episodes
and also our anniversary and so on.
And every time, if you like it, please make sure you leave a like or a comment or something. This is
super important for us. This is our fuel and also it helps to grow our channels. We have multiple
channels and this helps to grow and more people can benefit from what we discuss here. So please don't underestimate these social media likes and comments and so on.
And share with colleagues as usual.
Thank you everyone for everything.
Yeah, we appreciate it.
Take care.