Postgres FM - Beginner tips

Episode Date: July 21, 2023

Nikolay and Michael discuss 10 beginner tips Nikolay recently shared — they go into a bit more detail on each, and even disagree a little on one or two!

Here are some links to some things they mentioned:

Nikolay's tweet with all 10 tips: https://twitter.com/samokhvalov/status/1679953049899642880

Tip 1: tuples are physical versions of rows
Related episodes: https://postgres.fm/episodes/how-to-become-a-dba and https://postgres.fm/episodes/hot-updates

Tip 2: always use EXPLAIN (ANALYZE, BUFFERS)
Related episode: https://postgres.fm/episodes/buffers-by-default

Tip 3: throw away pgAdmin
Related episode: https://postgres.fm/episodes/psql-vs-guis

Tip 4: enable as much logging as you can afford
Related episode: https://postgres.fm/episodes/default-configuration

Tip 5: install pg_stat_statements
Related episodes: https://postgres.fm/episodes/pg_stat_statements, https://postgres.fm/episodes/auto_explain, and https://postgres.fm/episodes/macro-query-analysis-intro

Tip 6: run experiments on realistic data sets (use thin cloning and branching)
Related episode: https://postgres.fm/episodes/database-branching

Tip 7: make sure data checksums are enabled
Related episode: https://postgres.fm/episodes/corruption

Tip 8: tune autovacuum to run frequently and move faster
Related episode: https://postgres.fm/episodes/vacuum

Tip 9: query optimization will eventually be more important than configuration tuning
Related episodes: https://postgres.fm/episodes/intro-to-query-optimization and https://postgres.fm/episodes/102-query-optimization

Tip 10: indexes need to be rebuilt, unfortunately, since their health declines over time
Related episode: https://postgres.fm/episodes/index-maintenance

Bonus tip: the official documentation, release notes, and source code comments/readmes are your friends!

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is brought to you by:

Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:

Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres.ai. Hey Nikolay, what are we going to be talking about today? Your time to announce. Well, yeah, so anybody following you on Twitter probably saw your tweet this week. It's got a couple of hundred retweets, a lot of likes, and it was a summary of tips for beginners, you said, right? So 10 tips for people that are new to Postgres. So we thought we'd discuss that. New means probably people who have used Postgres for a few years already.
Starting point is 00:00:36 But, you know, like you can be a developer, you can use some backend developer or even frontend developer sometimes. You can use some ORM, Rails, Django or something, maybe touch some SQL sometimes, but consider Postgres as some kind of black box, relational black box, you know. But at some point, of course, it's worth understanding at least something, like to scratch surface and understand what's happening under the hood,
Starting point is 00:01:03 just to avoid bad surprises, like why my delete is so slow, or why it's behaving like that. Like I deleted a lot of data, but disk space consumption is still the same. Or why bloat accumulates so quickly, and so on. Yeah, so I think you're right. I think some of these will be useful to people even people that have been using postgres for years but i also think some of them will be really useful to people that have only picked it up recently or whose organization are
Starting point is 00:01:35 migrating to postgres from another database so i think there's a whole range of people that are new to postgres or even i mean even i even I enjoyed reading these and, you know, thought, oh yeah, I do need to look into that. I haven't, I've never installed that extension or something. So yeah, even for me, there were a few things. Right. But as you very correctly noticed, all of these things we already discussed. So each, each, every, every one of all those 10 tips have a link, can have a link to our previous episode for a deeper dive, right? Yeah, absolutely. I think some of them aren't the exact same topic.
Starting point is 00:02:16 So I'm interested to ask you a couple of questions on a few of them. But yeah, I'll include links to episodes for maybe people that are newer to our podcast and haven't listened to every single episode we've done right and from there you can go to very beautiful blog posts articles talks and other books and so on and so on you can use this episode as an entry point other episodes as like deeper entry points to specific topics and from there you can learn a lot so it's kind of like a tree right yeah yeah Yeah, yeah, absolutely. Right, so, I mean, we don't have that long to discuss 10 things. Should we get started?
Starting point is 00:02:50 Yeah, let's do it. So the first one on our list is all about tuples. Do you want to give the summary of this? Yeah, so I don't know why everyone continues mixing. Not everyone. Many people continue mixing rows and tuples. Tuples are physical version of rows. It's MVCC model, multiversal concurrency control model, Postgres users.
Starting point is 00:03:13 And you can feel them if you look into secret columns. Not secret, but invisible columns. CTID, Xmin, Xmax. And we discussed it. So CTID is physical location of this tuple. Again, tuple is a physical version of row. So if you change it, physical location usually also changes. Well, in most cases, it changes.
Starting point is 00:03:36 And CTID consists of two parts. It's a page number and internal address inside page, so like Fset. And my usual trick, if you need to extract only for page number, for example, you convert it to point data type and then address it like X or Y component. So X is page, Y is offset, and then you can extract it and convert it back to number and that's it. So it's easy. Because this like CTID data type, I think it lacks some operators or something or functions. By the way, maybe it's a small addition that could be first contribution to Postgres if you want to try to writing some C code. It's quite simple. So this city ID is
Starting point is 00:04:17 interesting and it's one of my favorite tricks to surprise people, like saying update table set ID equals ID, where ID equals one, and you suddenly see that city ID, this hidden column changes its value. Even if logically, you didn't do anything. But Postgres copies whole tuple each time you change it, right? In the best case, it can happen inside the same page and so you don't receive index updates you don't need to update all indexes in the worst case it goes to different page and all indexes should be updated unfortunately receiving new at entries and and xmin xmax it's like date of birth date of death right how to say in terms of transaction id we discussed it right we discussed it we have episode about mvcc or or blood or so and so on it's all related i think the episode
Starting point is 00:05:14 we discussed this most in was our other more beginner focused episode which was how to become a dba so i'll link up that one for sure i I thought it was a great tip. I had nothing to add. I think it's quite simple as well. And you've gone into good detail. Right, but I'm quite sure that this should be clear, not only to DBAs, this knowledge should be clear for everyone who works with Postgres. Because even if these columns are hidden, you inevitably will deal with bloat, with slowdown, with some slow delete, some slow update, massive update, for example, and so on. And understanding this starts with understanding what tuple is. Yeah, I completely agree. It's going to be coming up again in a couple of the later tips as well.
Starting point is 00:05:57 Because DBS usually don't write most of the code. Most of the code is written by developers. And if developers don't understand that update set ID equals ID leads to physical action, quite expensive one, they will continue writing it, right? If they understand it, they will avoid it, because
Starting point is 00:06:16 they understand what's happening under the hood. I agree. Number two. Number two. Number two is one that's close to my heart as well, isn't it? Always use buffers when checking explain analyze. Yeah, we mentioned it so many times. If it's not your first episode, you definitely know it. Nothing to add.
Starting point is 00:06:34 But if it's first episode, just remember buffers should always be inside parentheses with analyze when you write explain analyze. So analyze should be in parentheses, comma, buffers. Don't use explain analyze. Use analyze should be in parentheses, comma, buffers. Don't use explain, analyze. Use explain, analyze buffers always. No exclusions. I was going to add to it. I think especially if you're brand new to using explain
Starting point is 00:06:54 and you might have to ask a colleague or somebody online for help, I would also recommend when you're getting the explain plan to use verbose and settings as well. They're two more parameters. Verbose because it gives the people reading it more information about your schema as well as output columns and things like that. So more information for people that aren't as familiar with the query that you're working with. And settings because if you've changed anything locally that will show up, or if you've configured the database differently to how production is configured or something like that, those
Starting point is 00:07:29 things will show up and they'll help people avoid wasting time when helping you. They might spot an obvious issue in there. Yeah, well, yeah, yeah, yeah. So deviations from settings. But I also like, okay, verbose will make it verbose. If you write JSON, formal JSON, it will be even more verbose and so on, but it's more difficult to consume by humans. But I also think sometimes it makes sense to turn off costs, for example, if you just care about execution in some cases,
Starting point is 00:08:00 and you want narrower output to share with your colleagues, in some cases, it makes sense to remove costs. But of course, costs are very important to understand what a planner is doing. As a beginner, I wouldn't be doing that. Just in case somebody is helping, you can use it for something. Right. But if you have your posts, it's so wide and long and so on. I just see sometimes people share plans with cost off just because the place where they share,
Starting point is 00:08:29 if they don't do it, it's hard to read the plan, if it's textual. I kind of disagree. Buffers, for example, in JSON format, buffers add 12 keys per node. So that's a lot more than for Bose it's adding. Well, okay, okay. I just noticed quite experienced guys do it because of I also started to use it sometimes, especially
Starting point is 00:08:52 if you already collected plan without execution using just explain, there is no reason to repeat the same information and subsequent plans, right? If it's, you know, the same plan, but sometimes, of course, you add index plan changes. So if you don't use costs, then it's hard to compare and understand why planner chose this, what planner thinks about
Starting point is 00:09:14 costs of particular nodes inside the plan. Okay, okay. I agree with you, but sometimes still like, okay. Let's move on. Move on. So always use buffers. No matter what else we've just so always use buffers no matter what else we've
Starting point is 00:09:25 just said use buffers number three you've said throw away pg admin i start with yeah i said that many years ago and i keep saying the same so why tell us why many many reasons and uh where to start okay this is my personal opinion. Same opinion I have, for example, about Windows, but I understand that some people like Windows, so it's okay. But in this case, I see that it's not only my opinion. Every person I talk to says the same. PGAdmin takes a very important stage.
Starting point is 00:10:04 It pretends it's an official Postgres UI, graphical UI, right? But it's a separate project. It's developed by not the same people. It has different release cycle and so on. And I think it's advertised unfairly. And it forms wrong expectations. Because the quality of this product is not good. That's it. It has a lot of issues, UX issues, first of all. I think you're right. I think there are better free and even open source graphical user interfaces available.
Starting point is 00:10:37 I think you're right also that because it comes with a couple of the officially linked to distribution. So if you download Postgres for mac os or for windows pg admin comes bundled which does lend itself to believing it's like default official but equally i was looking at the stack overflow survey results recently and i was seeing that even though postgres has gone ahead of mysqSQL in terms of popularity and in terms of even usage for professional developers, when people are learning to code, it's still a long way behind MySQL.
Starting point is 00:11:11 And I was actually thinking that one of the reasons is, like, I think it is useful for beginners to have a graphical interface in a lot of cases. So I can see the temptation to bundle one. But yeah, I think this is good advice and there are good alternatives, which is really helpful. Yeah, there are a lot of alternatives and just check opinions. Don't trust me. Check reviews, opinions from others. And you will see so many people don't like PgAdmin,
Starting point is 00:11:35 especially when they already saw something better. There are some better tools. If you want UI. But my advice for me is learn console, PSQL and Tmux and VI. And well, maybe even without the VI, some people just use nano and that's it. It's not for them. PSQL is something worth learning. Even if you spend most of your time in UI, you will still benefit from understanding some PSQL tricks because CLI, it's better for programmers because you can code something in it. You can make your colleagues repeat the very same steps you did, right? Because it will be a script. Well, more than that, it's also Postgres quality. It's also shipped with Postgres,
Starting point is 00:12:18 written by the same people. A lot of people, same people, many same, like not the same people, but a lot of people who work on Postgres engine itself, they work also sometimes with on P-SQL and it has so many features. It has the same release cycle, the same quality, same, same, same. P-SQL is definitely Postgres native tool. PgAdmin is not native tool. It pretends to be native, but it's not. One counter I have heard to this
Starting point is 00:12:45 from a few people that have also said they prefer other interfaces is that they still use pgAdmin for its PLPG SQL debugger. So I don't know of another tool that has one built in. I think you can use TMAX and VI and still have debug it's a hardcore approach. I personally don't use debugger, honestly. I just use printlining everywhere. Somehow I use debugger in many, many languages and environments quite often trying to pretend it works. But somehow always, like printlining is always with you, always. And if you know how to do it right, and in PGSQL there are approaches how to do it right. For example, you can raise debug, for example, right? And all raise log, raise info, raise notice, raise warning, raise exception, although it's an error, but it's called exception. You can
Starting point is 00:13:37 catch them. In this case, you will deal with sub-transactions. Also, interesting thing, I have an article about sub-transactions to warn about their issues under heavy loads but printlining is always a few, it's always working maybe less convenient maybe a little bit more time consuming in some cases but it's good
Starting point is 00:13:58 so debugger I respect debugger obviously, debugger is a great thing it's like professional level, right? But still, I personally don't use it, so I cannot recommend. But to be fair, PgAdmin also has schema div component, for example.
Starting point is 00:14:16 I looked into it. It's code. It's quite interesting code. And you shouldn't expect magic there. I mean, it's schema div. It will produce some DDL to convert one schema to another without context. So you might have issues if, for example, you rename column versus you drop and add column. So data can be lost.
Starting point is 00:14:36 So you should be careful with these steps. And also, of course, it will be DDL, which is written in non-zero downtime fashion. And I don't see anyone who implemented zero downtime DIV in terms of DDL. It's a hard topic. But some other UIs also have components for DIV. Also, PGAdmin has good components to work with GIS systems and present some data on map. I just saw people using this quite well. I personally don't do it, but yes, it has some good features.
Starting point is 00:15:12 I think DeBeaver also has it already, right? That's actually when you compare, when you look at comparison discussions online between the two or between various GUIs, the one that comes up most often for praise when it comes to displaying graphical information is DBeaver. It's what a lot of people love about it. But bottom line, I personally don't like pgAdmin at all. I think it should be replaced everywhere,
Starting point is 00:15:38 but I'm not alone. And if you don't trust my opinion here, just check other reviews. That's it. Let's move on. Next is all settings, logging settings, right? So logging is super important. And usually when you need it and you find yourself working with default logging and something happened and you don't see it, don't see data. So you should enable logging of checkpoints of auto-vacuum. I recommend to log everything.
Starting point is 00:16:07 Of course, if you have a lot of tiny tables and auto-vacuum constantly touching it, writing like, okay, I just processed like one megabyte table, it's maybe spamming your logs, but it's still worth understanding some details usually and so on. I usually recommend enabling logging for all autovacuum occurrences. And temp files, I also prefer logging all temp files. Log log weights, which is based on dead log timeout, which is by default one second. Some people tend to increase it. It's for dead log detection, but it also
Starting point is 00:16:40 indirectly defines how events of being blocked by other session will be logged, right? Blocked, locked. So this is by default disabled. So if some locking issues are happening and locks, you don't see them, unfortunately. And that's bad. I prefer seeing everything here. So just enable it. And every time some session waits more than one second, by default, deadlock timeout, you will see the details about this case and it will help you improve your application code and so on. And also log min duration statement. Well, sometimes it's okay to set to zero, but very rarely because yeah, to log all queries with duration. I'm shaking my head well there is
Starting point is 00:17:26 opinion we should have it for example if you have nginx for example right it's normal to log all requests to http server i mean nginx for example right http server we lock all requests why with database we can do we cannot We log all requests. Why with database we can do, we cannot do log all requests? Because usually, especially if you put, if you use syslog, journal, journal D, and also you put your logs on very like some magnetic device, well, it will be sequentially sequential rights. It's not that bad usually but still if you you can do the math how many iops and megabytes per second you will need to have right but yeah so observer effect is no joke i put very important production systems a few times myself down put
Starting point is 00:18:21 them down multiple times and it was a super expensive experiment. And the lessons learned were very hard. So yeah, I wouldn't do it on serious production. I mean, putting zero, but trying to go down definitely, yes. Well, and saying it to something, I think is the key here. Not minus one, you mean? Yeah, something something positive but there's a value that anybody can afford here right like there's there's five seconds one second it'd be yeah right so like if you can't afford to do that you've got bigger problems right because if you if everything is slow it won't be so so many queries you can't have that many yeah right but i think it's super fast but still you need to learn your observer effect.
Starting point is 00:19:06 Understand these capabilities where you have logs and maybe test it and understand, okay, we are okay to log maybe 10 log entries per second, for example, normally, and maybe like 100 during peak times. It's okay for us, for example. Or maybe we should adjust settings and raise some bars like log min duration statement or log out of vacuum min duration. We can also raise it a little bit and so on. Wonderful.
Starting point is 00:19:34 Should we move on? Yeah. So it's quite related. Still looking at performance, but also here a bit more logging. We've got install pgstat statements, is a must you've said and if you can a few more extensions pgstat kcache pgweight sampling auto explain even you mentioned pg sentinel here as well but with a few extra decisions on settings for those last few right right so yeah pgstat statements should be everywhere installed everywhere overhead is very
Starting point is 00:20:02 low we had episode about it of course and extensions are, these pieces of advice are mostly for people who manage Postgres themselves. Because cloud providers usually provide something extra to PGSTAT statements. For example, RDS provides performance insights, so you don't need PG-Wide sampling. But they don't provide anything similar to pgstartk cache, which gives you physical angle for query analysis, unfortunately. So you cannot answer who is responsible for high CPU consumption, which query is responsible for high CPU consumption. You can only indirectly answer this. With pgstartk cache, you can directly answer this.
Starting point is 00:20:43 So these extensions are good things. But auto-explain is everywhere, right? only indirectly answer this with pgstart kcache you can directly answer this so yeah so these extensions are good good things but auto explain is everywhere right auto explains everywhere it is now yeah we had an episode on that as well where i claimed that it wasn't on heroku but then the next day it was available yeah yeah yeah i remember this and you, worth reminding, you have an article saying that sometimes it's okay to enable timing option for auto-explain. And of course, we want buffers there. Both have some overhead. And your post was about, unlike common opinion, sometimes this overhead is not huge. And you can enable them both, probably.
Starting point is 00:21:24 Not always. And also never with a threshold of zero milliseconds oh right right right so again it's only for slower queries but slow down zero is fine if you use sampling and maybe it was you who told me that sampling in auto expand exists for already many years, unlike slow query logging component, which is log mean duration statement based. Now we have, in modern Postgres version, we have sampling capabilities for both. And these things are like, they compete. If you choose AutoExplain, you don't need log mean duration statement because...
Starting point is 00:22:02 Well, we discussed this. Let's keep details for those who are interested, and we have a link to episode where we dive deeper into this topic. Perfect. Let's move on. So my opinion is that all development should happen on larger datasets. It's not worth wasting time developing on small databases, on empty databases. Just take bigger size, realistic size clone and just develop using it. Of course, you should remove PII, personal identifiable information, like phones, SSNs, emails, and so on. Because if you use it
Starting point is 00:22:43 for development, some auditors won't be happy if you use production clone for development. But not always. Sometimes people are okay to clone. In many projects, it's fine. Because you might work with just public data, for example. But my idea is, I see people tend to develop with small or fake,
Starting point is 00:23:07 some mocked databases. Then they are surprised why some queries work slow or don't work at all. Yeah, exactly. I think people are surprised not only that more rows can be slower, but also that it can affect planner decisions quite so drastically. Even whether, you know, I think people understand that join algorithms depend on data sizes but they don't always realize that scan operations can depend on it as well and all sorts of planner choices and it's even not just query performance right it's also spotting locking issues if you have a tiny table you might not spot something that's taking an exclusive lock that's going to cause you downtime if you apply that in production so i think it can help you with other kind of spot other issues as well by
Starting point is 00:23:48 doing this i think it's a good tip right but even even not related to performance if you develop using some like if you present your feature to your product manager or project manager or some qa specialist or some other people not engineers. And it has some random data. You don't have good page navigation. Search doesn't work. And because you just generated a few rows, that's it. It's also not working well compared to the case when you have production clone
Starting point is 00:24:18 with PII removed or you generated more realistic data sets. And you can share a clone quickly using DatabaseLab, for example, database branching. Cheap and fast, you can share it and they can dependently test it. It's very good. It unblocks many not clear in the beginning processes inside development process. But I also wanted to say that the idea of using large databases sometimes conflicts with the idea I want to take my laptop and go to some vacation or work on plane where internet is slower. So people, of course, want to have database on their laptops, on their working
Starting point is 00:25:00 machines sometimes to avoid the need to use internet connection and i understand that of course and in this case if you want one terabyte well probably today we already have disks with like one terabyte some sometimes you need subsetting of course so you need to minimize but still you can work with more realistic data and if you can clone if you can have one terabyte on disk, but clone it 10 times without extra costs in terms of money and time, this is how things should work. And worth noticing, we have copy and write everywhere already. We have containers, it's copy and write on file level, layerFS, overlay2. When you create a container based on some image, it doesn't copy everything because it's copy-on-write. And we also have Git.
Starting point is 00:25:47 Git is also copy-on-write on file level. You create a branch, it doesn't copy the whole repository, right? It just virtually creates a copy, but you share the same files among all branches which are the same until you change some file, right? Same here. you need to develop applying copy and write to databases and this is this is exactly what we implemented trying to help people i mean we both say yeah you've spent a lot of time thinking about this we did a whole episode
Starting point is 00:26:19 on it i think the database branching episode is the best one to share here so i'll do that nice so we've got a few more, but luckily some of these are shorter. Make sure data checksums are enabled. So this is about avoiding corruption. We've got a whole episode on that. Anything additionally to add there? Well, unfortunately, the places created
Starting point is 00:26:38 a few years ago probably don't have them enabled. And I think still Postgres itself has default off. Because it should be done during a need to be time when you create a cluster. Of course, they have overhead. Everything has overhead, but this overhead is worth having, because without them risks are high. And it's hard to like you need to implement a lot of additional procedures to control your corruption risks. And somebody, by the way,
Starting point is 00:27:05 asked about if you have checksums on lower level, for example, if you have ZFS, like Database Lab, because it uses ZFS as well. In this case, probably you don't need data checksums. I think we have data checksums in Postgres level at higher level. They protect not only from hardware issues,
Starting point is 00:27:23 for example, when data is corrupted and data checksum says okay i expected to read something but i see this page has something else so we have corruption i think it also protects from from particular maybe not like very likely but some bugs in postgres itself or in file system itself because checksumming at higher level feels safer to me. You created some content in shared buffers. You have checksum. And then everywhere, like Postgres itself, file system, personal system, and then hardware,
Starting point is 00:28:01 all of them are under control. If you do it at lower level, you have some layers of your system which bypass that checksumming. Maybe, of course, it's already good if you have checksums in file system. Definitely good. But it feels not as complete as Postgres would do it.
Starting point is 00:28:21 Corruption can happen in many, many, many places. Yeah. If we remember this is for beginners, I guess a couple of other things to mention that I think we covered both in the episode, I can't remember. One is that you could do a smoke test to see if you've definitely got some corruption, for example, in some of your indexes using Amcheck. And you can also use Amcheck for HIP. If you just opened documentation, it was a surprise for me a couple of weeks ago.
Starting point is 00:28:46 Amchek is officially for heap as well. Oh, really? Yes. Cool. It has pgAmchek, which has a parallelization option, dash j. You can use multiple jobs and move faster, of course, using more cores and put more pressure on your disk or your system. So you can say heap all indexed, some flag, heap all indexed as I remember it. And in this case, if all your tables have unique keys or primary keys,
Starting point is 00:29:15 which cover whole table, primary key covers whole table, heap will be checked as well. Oh, great. In this case, you don't need pgdump to def null anymore. Because it turned out, I started recently, how to run pgdump to def null in parallel. And it turned out not to be so trivial. And then someone, basically, well, Andrej Borodin,
Starting point is 00:29:39 told me that pgm check can be used for checking everything, both indexes and heap. that's great nice the other thing i wanted to mention is that some backup tools have the option to check for corruption at backup time which is quite nice if you can't yeah but is it checking corruption in database or in backups corruption can happen in many places. This is the main rule of corruption. It can happen everywhere. I think it checks it at the database level. I could be wrong. It should be used something existing
Starting point is 00:30:13 like Amchek. I'm looking into it. I don't know it. It's interesting. Should we move on? Yeah. Number eight is tune out a vacuum to run frequently
Starting point is 00:30:24 and to make it move on? Yeah. Number eight is tune auto-vacuum to run frequently and to make it move faster. Yes. Default are bad. Default settings are not good enough for you usually. In most modern cases, although recently it was adjusted, like a few years ago, default settings were adjusted. Cost limit, cost delay. Cost delay went down to two milliseconds from 20 to two, so 10 times more quota in terms of CPU and especially IO. And now a single worker can
Starting point is 00:30:54 move faster and so many things inside this auto vacuum tuning. Basically, you need to move it faster. So because by default, it's also throttled. So throttling and also frequency of occurrences. By default, it happens like with 10% 20% of tuples affected, it's too infrequent. It's too conservative. For OLTP, you should use like 1% half a percent. So make it 10 times to run 10 times more frequently or even 20 times more frequently and so yeah completely agree we've got a whole episode on vacuum i think that's worth people checking
Starting point is 00:31:33 out if they aren't aware of this but yeah the best analogy i ever heard was vacuum's a bit like exercise if it hurts you're not doing it often enough. Right. And I usually say if you, for example, data deleted or insert with rollback something, you produce some dead tuples, well, you put some pressure. And if you have 96 cores, 128 cores, and all your cores can do that, right? They can do writes and produce dead tuples.
Starting point is 00:32:02 Why do you have only three workers? Three workers is not enough to clean up after all those guys, right? You should allocate at least 10% of your course, like 10 workers or maybe 20 workers to clean up faster. Because a lot of work done, a lot of cleanup work should be done. And the bigger your tables get, the worse the default settings get for them in terms of those scale factors.
Starting point is 00:32:28 So, yeah. And you partitioned, but didn't change Autowacom workers number. It's still three, but you partitioned. Well, it's not good as well. You cannot benefit from having a lot of cores, maybe. So, yeah, it's an interesting topic.
Starting point is 00:32:43 We definitely discussed it, but this should be done by all because defaults are not enough in most cases. Agreed. Your topic, query optimization. Yeah, this is probably my favorite. I actually look back and we've done two episodes on this one.
Starting point is 00:32:58 So the introduction to query optimization and we call it 102 as well. So yeah, obviously I liked your tip here. It's that query optimization and we call it 102 as well so yeah obviously i liked your tip here it's that query optimization will eventually be more important than config tuning yeah config tuning is enough like one per year query optimization should be done every time you perform a lot of changes in application so if you tend to have a couple of releases per day you need a lot of optimization activities. Yeah, I've seen quite a few people end up throwing money at this problem instead.
Starting point is 00:33:30 You know, instead of tuning a few queries, they scale up the database. People think configuration tuning is something like a silver bullet. No, you change infrastructure level quite infrequently. For example, you upgrade your VM or move to different infrastructure, upgrade your operational system. In this case, you need to reconsider Postgres settings. But usually you've done it like 50-80 approach. You did a good enough job. It's not worth spending time to fine-tune it further and further. It won't pay off. But if you have very frequent releases, you have big instability in this area. Fortunately,
Starting point is 00:34:13 usually it's not so wide as configuration tuning. Configuration tuning usually is like global. You change the workmem, default workmem, it will affect all queries. In query tuning, usually it's very narrow. Okay, a couple of tables, several queries working with those tables. So you can scale this process and involve more people into this process, especially if they all have independent clone databases. I won't stop saying this, because it's obvious. Everyone should have their own big size database. Yeah. And again, going back to previous tips, PG stat statements makes this a lot easier as well.
Starting point is 00:34:51 Yeah. Yeah. Well, from top to bottom analysis. Yes, I agree. Okay. And last tip is about indexes. And I found it interesting that people don't realize that index health declines over time. Even in modern Postgres, even with all the duplication optimization
Starting point is 00:35:08 happened in Postgres 13 and 14, if you have Postgres 15, the latest major version 15, even with it, over time, index health will decline because they receive a lot of updates. Not in all cases they can rebalance, right? They receive some bloat still.
Starting point is 00:35:27 It's much better with newer Postgres, much better than before. Postgres 12 is much worse than Postgres 15 in this area. But still you need to re-index them occasionally from time to time. You need to re-index them and you should just control bloat using some bloat estimate queries they will produce some numbers with some mistakes sometimes like up to like two sometimes i saw 20 30 percent is mistake it says 50 but it has only 20 percent of bloat it's okay i mean we can re-index them but re-indexing is heavy operation and it requires two things. Disk I.O. and it also requires some brief locking and it can be blocked by auto-vacuum running
Starting point is 00:36:10 in transaction ID wraparound prevention mode. So you need retries, you need to be careful and usually you need to put this additional work on some weekend or nighttime. But it should be automated. And I think it's already obvious that we need some better tooling. And it's a shame that cloud providers don't provide additional tools to automate this work, because almost everyone needs it. This is another popular tweet that got a lot of likes when I
Starting point is 00:36:40 criticized cloud providers, because progress went in the wrong direction, in my opinion, because they implemented backups, provisioning, replication, autofailover, and then jumped to very exotic topics like serverless. And they left us with all these needs alone, basically, like query analysis, index maintenance, some of the vacuum-related work, bloat analysis, repacking, and so on and so on. And monitoring in cloud providers has very slow progress as well. You know, like down-to-earth topics, not exotic ones. It's very great that we have a lot of progress on the frontier of exotic topics, but we still
Starting point is 00:37:23 need to close some gaps in administration. And I think this is one of them, index maintenance. Also unused indexes, redundant indexes, invalid indexes. Sorry, I cannot stop here. We've got a whole episode on this for anybody new to this topic. I'll link that up. But yeah, I do want to give a shout out.
Starting point is 00:37:44 The work on Postgres 13 on deduplication and i think on 14 on bottom-up deletion is wonderful work and has massively reduced this problem but it still exists but it does mean if you're upgrading like i've seen quite a few people lately up doing upgrades from like 11 to 15 or 11 to 14 so those people especially if you're upgrading in a way that preserves your bloat definitely worth looking into re-indexing as part of your well after your upgrade because you might get significant savings and speed ups okay you mean re-index all indexes to be in better shape because these these optimizations will work only after you rebuild your index, right? Well, the optimizations work only afterwards. I'm not sure, but basically... Some kinds of optimization definitely will work only after you create new index. So you need to
Starting point is 00:38:36 rebuild it, yes. But yeah, the damage is already done, basically. So it's going back to the start and then it won't degrade as quickly in the new version, but you'll also get that initial boost. Right. But the main idea, you need to re-index anyway, you know, for newer Postgres. And I don't think this need will go away in the future because looking at other systems like SQL Server, for example, they also have this need. Okay. Index maintenance, meaning we rebuild indexes sometimes. Of course, index maintenance also means we clean up unused indexes, redundant indexes. But also we rebuild indexes which health is not so good, meaning blowout accumulated.
Starting point is 00:39:17 This is normal. Nice. So these are your 10 tips. I've actually got a bonus one I wanted to add. I think not all beginners realize that the Postgres documentation and release notes will be as good as they are. And even the source code, the comments in the source code is really accessible even to beginners.
Starting point is 00:39:34 So those things are your friends when you're trying to work out why something's happening or what the truth should be. So yeah, documentation, release notes, and source code comments are all brilliant. And the readme files in source code as well. That's, yeah, good point. Some directories will have readme files, and they are quite good, plain English,
Starting point is 00:39:52 explaining things not explained in documentation sometimes. Yeah, completely agree. Good. I also completely agree. I think it was good. I think it was helpful for some people. Please share your opinion, and don't forget to share this episode with colleagues you think might benefit from listening. And till next time. Thank you. Much appreciated. Take care. Thank you.
Starting point is 00:40:16 Bye bye.
