Postgres FM - Corruption
Episode Date: May 19, 2023

Nikolay and Michael discuss database corruption — various types, how they can come about, and what to do (and not do) if you come across it.

Here are links to a few things we mentioned:

- The dangers of streaming across versions of glibc (TripAdvisor mailing list thread)
- The floor is Java meme
- Data Corruption talk by Sebastian Webber (on Postgres TV)
- Data corruption monitoring & troubleshooting talk by Andrey Borodin (on Postgres TV)
- Postgres data corruption (document from Nikolay)
- Data Corruption and Bugs Runbook (document from Nikolay)
- Corruption (Postgres wiki)
- Checksums
- pg_checksums
- Original pg_checksums (by Credativ)
- amcheck
- Our episode on index maintenance
- 14.4 release notes about the create index / reindex concurrently issue and fix
- amcheck to check unique constraints in btree indexes (Commitfest entry)
- amcheck verification of GiST and GIN (Commitfest entry)
- How to corrupt your Postgres database (blog post from Cybertec)
- Christophe Pettus talk
- Christophe Pettus slides
- pg_hexedit
- pageinspect
- pg_catcheck

~~~

What did you like or not like? What should we discuss next time? Let us know on YouTube, on social media, or by commenting on our Google doc.

If you would like to share this episode, here's a good link (and thank you!)

~~~

Postgres FM is brought to you by:

- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to:

- Jessie Draws for the amazing artwork
Transcript
Hello, hello, this is Postgres.fm. I'm Nikolai and my co-host is Michael. Hi, Michael.
Hello, Nikolai.
And today we are talking about corruption, right?
Yeah, absolutely. Database corruption.
Technical corruption.
Yeah.
No money involved unless you need to pay some consultants a lot of money to fix it,
or at least to understand what's happening and how to escape from it.
Because sometimes you cannot fix it, you just need to escape, lose some data, and save what you can, and so on, right?
Yeah, these can be some pretty horrible stories, can't they? Every now and again, I see somebody mailing one of the mailing lists, or popping up on one of the Discords or Slacks or something like that. And they say something along the lines of, I'm getting this error message, help. And you can just tell, your heart sinks and you think, oh, they're in trouble here. This could be bad.
Right. Let's start maybe from what corruption is. What definition do you have in your mind? Because I've heard various kinds of definitions.
Yeah, good point. I guess it's a few categories of issue.
There's some things on the subtler side,
which are like incorrect information coming back,
let's say for a query,
all the way through to the database not starting.
You've probably got one written in front of you,
but I guess it's data not being stored on disk in the way it was supposed to be.
Something along those lines?
Yeah, well, we can distinguish a particular type of corruption called data corruption.
And in this case, that is quite broad, but I think it's quite a good definition.
You wrote something, you asked the database to write something, you tried to read it, and you cannot read it, or you read something else, and so on. For example, you wrote zero, but when you read it, it's one. This is corruption.
But if you cannot read it and you have some error instead, or just no response at all, well, sometimes it can be called corruption. But sometimes, if the server is down, it's not corruption, it's just down, right? Or it's not working well at all and has some bug or something.
Some people can call this corruption as well, but it probably doesn't involve any wrong data writes or data reads. Maybe just the server itself is not working properly.
And data is fine on disk, for example.
There's no real corruption there.
But if we apply this quite broad definition,
you wrote something, but you cannot read it
or read something else.
A lot of cases can be called corruption,
but which are not really corruption.
So I'm struggling to find a very good definition
for corruption.
Well, I think probably if we got given 100 cases and we had to categorize them into corruption and not corruption, we'd probably agree on 98 or 99 of them.
It feels like there's quite generally accepted cases.
Should we go through some of the more common ones?
How do you want to do this?
Yeah, definitely.
First of all, data corruption can be in the heap, in tables, for example. And it's not uncommon, because a lot of databases still have data checksums disabled, right? Because you need to specifically enable them, and this has some overhead. Some managed Postgres providers enable them by default.
Some don't.
But if you installed Postgres many years ago, it's hard to switch data checksums on, because basically you need initdb and dump/restore.
Or you need to play some tricks,
and we can discuss it with some extension.
But it's not easy if you have a big database under load
and the maintenance window is not welcome.
It's hard.
I see many cases when people keep postponing this, living without data checksums enabled.
So they probably have data corruption in table files easily.
It can also be in TOAST files, which are also part of the data files. Then it can be in various other places, for example in indexes. And let's discuss this in detail a little bit later. But now let's just walk around and see what else can be corrupted.
Visibility map can be corrupted.
Yeah, a bit more niche, that one, I guess. But the other major category I've seen a lot... like, I've heard of a lot of people that have had table data corrupted, indexes corrupted. It seems to be the most common one
to me, especially recently with the drama around version 14.
But the other one that I see quite often is
backup file corruption.
Oh, yes.
And that catches people out.
Yes.
And let's talk about this in detail as well.
Now let's just collect ideas for our episode.
And also, what else can be corrupted? WAL files potentially can be corrupted, but fortunately, for ages we have had WAL record checksums enabled. Everyone has it enabled, it's there already. That's why we don't hear a lot about WAL corruption.
If you think also about levels,
for example, data,
you write something to the table.
Corruption might happen at the software level. It might also happen at lower levels, the file system, for example. It might happen in hardware. So we also have layers. We have verticals and we also have a horizontal classification. It's like a matrix. When we say data files are corrupted, well, but what if something happened at the protocol level? You are writing zero, but the protocol substituted one.
You have data checksums enabled.
It's crazy.
It's probably not going to happen, but I'm just trying to explain about levels.
So protocol replaced your value, right?
And then Postgres wrote with data checksums enabled,
everything is fine, checksum is fine,
file system is fine, disks are fine.
No corruption, right?
But it already happened at upper level.
Yeah.
And you haven't even mentioned operating systems, I think.
And that's like one of the major.
Yeah.
So there are so many places.
The application... if the application took some value from user input and then replaced it, Postgres cannot notice it, because it's already at a much higher level.
But we won't be talking about application level.
Yeah.
Let's, as far as I'm concerned, I don't think if Postgres has been told to store a certain value and then it returns that same value, we're not talking about corruption.
So if the application is to blame, I think that's out of the scope for me of the definition here.
Cool.
Right.
Also, replication, both logical and physical, might have issues.
And it might spread some corruption, for example. Well, of course, if some data is corrupted on the
primary, it will be easily spread if you use physical. But interesting point here, if you use
logical, probably you'll be fine. It depends, of course.
Well, I was going to say you could have different corruption on each. They could be exactly the same, it could be different, all combinations seem to be possible. But yeah, really good point that logical can help avoid some of the types of corruption spreading. Makes sense. But equally, I've heard of problems. I was reading up on a TripAdvisor mailing list thread a while back where I think it was actually replication that was part of the problem.
It was replicating between two systems
as part of an upgrade,
but the replica was on a different operating system
and that ended up causing corruption.
So it almost certainly was glibc related.
I can't remember the details though.
Yeah, well, maybe to finish with the upper level, when application-level corruption happens. I remember I had a very weird case when my scripts based on psql returned one value, while the application had a different value from the database. And the application was Java.
You know this meme, the floor is Java, and all the junior developers looking like goats, they are all up in a tree. So it was Java. It was quite a good customer. And they asked me, like... I trust psql much more, of course. What happens in Java with all those things, Spring, Hibernate, a lot of stuff there, what's happening here? But then I noticed that in DBeaver, the value is the same as in Java. And I think, maybe we have an issue with psql, maybe some bug. But then I realized DBeaver is also Java.
Good point.
So if you use JDBC... It has extra_float_digits or something, so numbers look different sometimes, and you just need to change this and you'll be fine. So this is also a kind of corruption to me, because I expect one value, but it's really at an upper level. The stored value was fine, psql showed it well.
It's called extra_float_digits. The default in Postgres is not the same as the default in JDBC. And this is how you can easily get corruption. If you don't notice it... okay, it's good that I noticed it, but if you write some application and your application doesn't notice it, switching from JDBC to other things like ODBC, for example, switching back and forth can cause corruption. If you read something and write it to a different place, to a different table, you will have different results. It's corruption. Just caused by lack of knowledge about this setting.
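For reference, a minimal sketch of how this shows up, assuming an older Postgres (11 or earlier), where float output was rounded by default and the JDBC driver sets extra_float_digits = 3 on connect:

```sql
-- Server default on Postgres 11 and older rounds float8 text output:
SHOW extra_float_digits;   -- 0 on those versions
SELECT 0.1::float8;        -- displayed as 0.1

-- The JDBC driver sets extra_float_digits = 3 at connection time,
-- so the very same stored value renders with full precision:
SET extra_float_digits = 3;
SELECT 0.1::float8;        -- displayed as 0.10000000000000001
```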
Just realized we forgot to say that this was a listener request.
Oh, yeah, yeah, yeah.
So thank you for that.
And they mentioned they were interested in what kind of types of corruption there are,
how to detect it, how to fix it, how to safeguard from it.
And even, are there any performance penalties to some of those safeguards?
So that's an interesting question.
I feel like it's hard to carry on without talking about... well, they asked about the types of it. I think we've covered that a little bit, but it would be easier to keep going by talking about how we can detect it, or how we generally find out about this kind of thing.
Yeah, let's dive into specific types of corruption. But to finalize this overview of types of corruption, I must admit that it's one of those topics where, you know this saying, the more I learn, the more I understand that I don't know anything.
It's the topic where you can easily realize that,
oh my, this also can happen, this can happen.
And then you already have a feeling that there should be many more types and kinds of corruption, which probably keep going unnoticed by everyone, right? But at some point, we start talking about them. So we had a few
sessions on Postgres TV also, including with Andrey Borodin, where we discussed in detail
various things. I will try to mention a few approaches.
And also, I have a couple of documents where we try to summarize a lot of links, tools,
and also mitigation issues, monitoring approaches, and so on.
So where do you want to start?
Heap, Index, anything else?
Well, or even, as you said, some of those links could be useful. I was looking at the Postgres wiki, I thought it had a really good page on corruption.
It's good, but it's very, very limited.
I agree, but they make roughly the same point as you, which is that, almost by definition, a lot of cases of corruption are quite unique. If they happened all the time, hopefully there'd be things in place to mitigate them. So the people designing software systems really don't want this to happen. This is like... we spend quite a lot of episodes talking about things like performance, but performance doesn't matter at all if... you know, this is quite a low-level need. This is something people come to databases really wanting. If you say, oh, our database is a thousand times faster and has loads of other benefits, but there's a five percent higher chance of corruption, or... I don't even know how those numbers would work, but this is a really basic need. This is something that Postgres, I think, has quite a good reputation on generally, but we still have some quite famous cases of issues. So yeah, I thought it was a good starting point, that wiki article.
I'll link it up as well.
But you're right, it's not huge.
Definitely.
And when we talk about corruption, I always start also from that article.
It has some very core knowledge.
And in the first lines, you will see the first rule: if you try to fix something, make a backup of your data directory, or your backups, or whatever you have. Because while trying to investigate and fix it, you are going to use some interesting, risky tools and approaches. Some of them probably won't work well, and you may cause additional harm or just lose everything. So you need to start with an additional backup of everything.
Yeah, if you can copy everything
and make notes of what you're changing, if you can't.
There's two things though, right?
Firstly, you can make things worse.
You can end up in an unrecoverable state
when you were in a recoverable state previously.
And the second thing is, even if you do recover,
if you don't have a copy of what it was like
in a corrupted state,
you might struggle to understand what the root cause was, or there might be some useful information in there for working out what caused this. And it also, I think, says quite early on that you might want to contact professionals for this. So most of us in our day-to-day work who aren't consultants, I count myself in this, don't see this often, if ever, in our careers. Hopefully you'll never see this in your career. But consultants get called about this, I don't know, you could tell me the exact number, but probably at least monthly, seeing different cases of this kind of thing. So there are people that have seen this dozens of times, if not hundreds of times, in their career. So you might want to lean on their services, and it's probably worth it financially to pay somebody to help you with this.
It's one of the topics which consultants like my team and I like to do, because clients have fears here. If you explain properly what kinds of fears they should have, they easily have them, and of course, they are usually ready to invest into improving things.
For example, improving monitoring and mitigation and so on.
Let's start from backups, probably.
I find that as probably one of the easiest places.
Backups without proper testing are like Schrodinger backups, right?
Because nobody knows if they are working.
And of course, all backups should be tested.
Of course, it requires some resources. You need to provision virtual machines, for example,
and test them and so on. You need to test recovery from a backup. And usually it means... usually physical backups, they have a full backup, plus maybe delta backups, and also a WAL stream, a stream of 16 or 64 megabyte WAL files. So when we test, we need to test full backups, delta backups, and also we need to test all WALs and point-in-time recovery until the very last point before the next backup started. So it's actually a lot to test. Because otherwise, probably full backups are working, but later in the day, for example, you have daily backups and you test that the full backup is working, but then some WAL is missing, for example, or corrupted or something. Well, WAL corruption might be something which won't happen, as we discussed, but the backup tool compressing WALs with some innovative compression might cause corruption. You cannot restore it, for example. A full backup test passed, but for some WAL later, if you don't test the whole stream until the next full backup, you probably also won't be able to restore to some point.
Until the latest point, usually.
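As a rough sketch of what such a restore-and-PITR test involves on a throwaway machine, assuming WAL-G as the backup tool and a hypothetical target time (the exact commands depend entirely on your setup):

```bash
# Fetch the latest full backup into an empty data directory
wal-g backup-fetch "$PGDATA" LATEST

# Configure WAL replay up to a point in time, then promote
cat >> "$PGDATA/postgresql.auto.conf" <<'EOF'
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '2023-05-19 12:00:00+00'
recovery_target_action = 'promote'
EOF
touch "$PGDATA/recovery.signal"

# Start Postgres and let it replay WAL; failures here surface exactly
# the missing or corrupted WAL problems described above
pg_ctl -D "$PGDATA" start
```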
In my experience,
larger companies
tend to have this down. They tend to
have a process for doing this
on a semi-regular basis, if not
very regular basis. It's somebody's
job to do a full run of this
every couple of weeks or
every couple of months, depending on how
serious...
Every day. It should be the same frequency as backup creation.
Oh, sure, if you automate it. So you're talking about an automatic...
Everyone must automate it, otherwise the backup system is super weak. That's what people mean when they say untested backups are not backups. This is it. You need to test all backups. Of course, you need to pay for it, unfortunately.
And all WALs, including all WALs.
But a lot of people, I think,
there's a lot of, especially smaller companies,
startups that either have never tested a backup
or did it once as like a one-off.
Or they just trust RDS and that's it.
Yeah, exactly.
That's another good point about only having backups in one place, right?
But how can we be sure that RDS tests all backups properly?
Or that they're not corrupt, right?
How do you know they're not being stored in a way that's accidentally...
You need to test restore.
Oh, by the way, when you restore, there are also some levels.
First, you restore and can reach a recovery point and Postgres is working,
like accepting connections queries, right?
But then usually we think,
okay, what about data in this restored Postgres?
Maybe something wrong, right?
Maybe corruption from production propagated to backup.
And here we usually use a very simple trick: you dump it fully to /dev/null, causing reads of all data files, but not indexes, by the way, right? And dumping... it sounds silly probably, but this is a very good approach just to cause reading of all files. It might take a long time. And while this runs, you need to pay the cloud for these resources.
Resource consumption is significant here.
But this is the only option to test backup and see it definitely works, right?
So backup recovery works.
Point-in-time recovery to any point in the day works.
And we can read data.
But reading data, that's already verification of the data files, not the recovery itself, not the restoration process, right? It's already beyond backups. But this is a good place: if you have backup verification automation, at least some, maybe not every time, but sometimes, this is a good point where you can start testing various things, including data corruption risks and index corruption risks as well.
Yeah.
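A minimal sketch of that trick against the restored cluster (mydb is a placeholder):

```bash
# Dump everything to /dev/null: this forces a full read of all table and
# TOAST data, so broken pages surface as errors. Indexes are not read.
pg_dumpall > /dev/null

# Or for a single database:
pg_dump -d mydb -f /dev/null
```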
And also worth mentioning, some larger companies, you mentioned, they have some additional replica, for example, it might be delayed replica, used sometimes to quickly recover accidentally, mistakenly deleted files, deleted
records in database.
So they have, for example, a replica which lags some hours, like six hours or four hours,
always behind.
And this replica, usually, of course, is based on WAL shipping. And this replica can be considered as a test tool for the WAL stream. You still need to test recovery of full backups, but WALs are tested by this replica. Or you can install Database Lab Engine, and it will be reading WALs as well, and if it's down, there is a problem with WALs. So it's a complex system to test backups.
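For reference, the delayed standby described here is driven by a single setting; the value is just an example:

```
# postgresql.conf on the delayed standby: replay WAL with a 6-hour lag,
# leaving a window to stop replay before a bad change is applied
recovery_min_apply_delay = '6h'
```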
Yeah, that sounds good. And it's a really good point that these can be corrupted. And the best way to check is to see if you can restore it. S3 can have problems as well.
To store backups in object storage is very good because S3 or GCS, they have very reliable storage
so they don't lose data. But uptime might be worse than EBS volumes, for example.
So when you test it, you need to understand that sometimes it's slightly down, but they won't lose
your data. This is a very good thing. So some retry logic needs to be implemented. It's like
it's a whole project to test backups properly. But okay, let's shift to data checksums probably,
and then we'll talk about indexes. Because I think these topics are narrower.
So data checksums should be enabled.
Of course, there is some penalty overhead you need to pay for it.
But in my opinion, they should be enabled everywhere.
There is also pg_checksums, which, remember, all new versions have included.
And it helps you enable checksums, and there is a recipe,
so you can do it on replica first and perform switchover
and then cover all existing replicas.
So with very small downtime, close to zero, it's possible.
But for older versions, this tool, pg_checksums, didn't exist. But there is a third-party one. That tool is available for older Postgres versions, and it's also possible to use it with replication and switchover, achieving downtime close to zero.
Do you know what that one's called?
Also pgchecksums.
Okay, cool. Makes sense. I'll find it and link it up. I wasn't aware of that. Yeah, so they're both called similarly.
Maybe some difference.
I don't remember in detail.
But this is it.
You need to enable data checksums, and that's it.
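A minimal sketch of the offline path with the in-core tool (Postgres 12+; the cluster must be shut down cleanly first):

```bash
pg_ctl -D "$PGDATA" stop
pg_checksums --enable --progress -D "$PGDATA"
pg_ctl -D "$PGDATA" start

# Verify:
psql -Atc "SHOW data_checksums;"   # should return "on"
```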
Still, that doesn't cover everything. For example, some constraints can be violated silently.
Data checksums won't save you from this.
It's like upper level corruption.
Data checksums will save you from having wrong writes.
So you wrote something, disk, for example, corrupted.
File system has bugs.
And then you read something else or cannot read it at all.
This is where data checksums are needed.
Also, you need to choose: there's a setting that says whether you stop if corruption is detected, or you can continue but just have some errors in the logs.
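In Postgres terms, the knobs in this area look roughly like this (a sketch; both settings exist in core and default to off, which is the safe choice):

```sql
-- By default a checksum failure makes the read fail with an ERROR.
-- This superuser-only setting downgrades it to a WARNING so reads keep
-- going around the damage: useful for salvage, dangerous to leave on.
SET ignore_checksum_failure = on;

-- A harsher escape hatch: treat unreadable pages as empty (data loss!).
SET zero_damaged_pages = on;
```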
It's interesting.
I had very interesting discussions with very experienced people who have good experience with Oracle and SQL Server.
And we discussed what is better for current business. For example, you have e-commerce, which generates $1,000 per minute, for example, or per second.
And you have corruption and you have very good, strong tool and you enabled it and it puts your
server down if corruption is detected saying, oh, corruption, fire, right? Of course, you'd prefer to continue.
And if corruption is limited,
you don't want to be down at all.
So it's better to know about it,
mitigate somehow,
but you want to continue operating.
This is interesting, right?
And they also mentioned that SQL Server, for example, has an interesting approach: when corruption is found on one node, it can auto-heal, getting non-corrupted pages from replicas, for example from standby nodes, and bringing them in.
Interesting. Let's auto-heal.
Postgres doesn't have it, but it's an interesting idea.
But rule of thumb here, enable data checksums.
If you don't have them enabled, use this extension to enable.
So indexes.
Let's talk about indexes.
Yeah.
Amcheck is a good tool.
Very good tool.
Is it worth quickly covering, like, B-tree indexes specifically, why do we tend to see more index corruption than heap corruption?
Because of collation and glibc version changes, for example, if you upgrade from one Linux, for example one Ubuntu, to another. There was an article which said, like, glibc version 2.28 has dangerous changes. So you definitely need to be careful. And it said other versions are fine. But in my practice, I have zero trust. If we plan an OS upgrade, if we plan any upgrade of underlying software, glibc, you can check the version with ldd --version.
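For reference, a quick sketch of that check, run on the old host, the new host, and inside any container image you plan to use:

```bash
# glibc version of the host or container
ldd --version | head -1

# Collation settings the databases were created with
psql -Atc "SELECT datname, datcollate, datctype FROM pg_database;"
```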
If we expect that this version will change, or if we also play with containers, you might have some binaries in the container. And if you have a container with Postgres, and PGDATA is outside of the container, probably PGDATA was created using a different glibc version than the one in the container. If it is mismatched here, what can happen? The B-tree is built using one order of characters, one collation, right? And now we have a different order. And you can, for example, just not be able to find something: you select something, and previously you had some rows returned,
but now you don't have them at all.
And we can distinguish two types of problems here.
One is read problems.
And another one is propagation of read problems to writes,
which hurts more because it's permanent.
So customers cannot find... they search something and cannot find it.
Or they get back incorrect data, right? Like they get back data they're not expecting. So that's often how these present themselves. Or, as you said, the write problem: we can end up with duplicates, like we can end up looking for an identifier, it doesn't exist, so we insert another one. And we don't think it exists because we're looking it up via the B-tree lookup.
Duplicates is a slightly different problem, not related to the ordering of the glibc version.
But it also can happen.
Duplicates is a different problem.
I think it could be related to...
Isn't it using the primary key? Because it's... isn't it using the B-tree for the unique constraint checks?
Right, but if order changed,
the uniqueness is not violated.
Yeah, we just changed the order of columns
and order by.
This is a key problem.
Cool.
Well, my understanding is
you can also end up with data
being in the wrong partition, for example.
So ordering affects quite a few things.
Oh, yes, partition corruption.
So I was thinking the index is used for those things, but no.
So a uniqueness violation might happen, and user actions might cause it sometimes.
But I don't see how the order, like how the order of characters, can lead to it. Maybe I'm wrong, maybe I just don't see it. But what can happen, in my practice I saw it, what can happen is wrong order. So the user expects one order, gets a different order, and misses results completely, because of some ordering and LIMIT the results are out of scope, and this is unexpected, and so on. And the bigger problem is indeed writes. And if it's some billing subsystem, you can propagate wrong writes, some people cannot get money, for example, right?
Yeah, I was just thinking, when you gave your e-commerce example, I was wondering what the trade-off would be for, let's say, a bank. So if you're worried about people's balances being wrong, for example,
is that a different matter?
But I still think they'd make the trade-off of keeping it up
and dealing with the consequences in a lot of cases.
So it's an interesting question, like which applications would choose which.
Yeah, yeah.
But anyway, OS upgrades are dangerous, right?
And containers are slightly dangerous.
If you bring PGDATA... we had it also. You bring PGDATA from a different operating system and you run your Postgres in a container.
Well, even without containers,
if you just copy PgData between two nodes
and one node is running Ubuntu 16.04
and another is 22.04,
you are going to get in trouble
because you copy at physical level in this case.
You need to copy at logical, dump restore.
This is where dump restore plays good
because it eliminates the problem completely.
And I'm not sure we've said it explicitly
since we switched to talking about indexes,
but the tool for checking B-tree index corruption is still amcheck.
Amcheck, right.
And it has several modes,
and unfortunately the proper way to check it is heavy,
and you shouldn't do it on production because it requires exclusive lock.
This is where you probably need to put it to backup verification system.
And after you restore, you run it on all indexes. It will take a lot of time, but you need to run it using a special snippet, the parent-check one, blah blah, I don't remember exactly, but it also checks the relationships between leaf pages, and this is where the order can be easily broken. So you need this, and it will take time. And the funny thing from practice is that if you do it at the SQL level, it's single-threaded.
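The snippet referred to here is, roughly, the heavier amcheck function; a sketch, with the index name as a placeholder (heapallindexed additionally verifies that every heap tuple is present in the index):

```sql
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Heavier B-tree check: verifies parent/child and sibling relationships
-- between pages; it blocks writes to the table, so run it on a clone.
SELECT bt_index_parent_check('my_index'::regclass, heapallindexed => true);
```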
But it's not fun.
If you have a lot of indexes and you have a lot of cores, you usually…
There is a trade-off.
It's backup verification, same trade-off.
If it's in cloud, you usually think, okay, I will hire a much more powerful machine, but for a shorter time.
And I will do job much faster
and destroy it.
So it's, I mean, budgeting here,
it's interesting for such kind
of verification projects.
But if you do amcheck, don't do it in a single thread.
Do it utilizing all cores and disks,
like saturating them
because it's a verification machine.
We should go full speed.
Nobody's there, right?
Right. So, of course. But then, twice, I wrote parallelization scripts for it, and only then realized that since Postgres 14 there is a CLI tool, pg_amcheck, which has a -j (--jobs) option for parallelization. And you can use it for older Postgres as well.
Yes, that's a really good point. I think you brought this up in our episode on index maintenance, so we've actually covered amcheck and this stuff in quite a lot of detail, for people who want to go back to that one. I'll link that one up in the show notes.
But it's covered even more in our sessions with Andrey Borodin and in the documents I mentioned.
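A minimal sketch of that parallel CLI invocation (database name and job count are placeholders):

```bash
# Check heap and B-tree indexes of one database with 8 parallel jobs,
# using the heavier parent/heapallindexed checks
pg_amcheck --jobs=8 --heapallindexed --parent-check --progress -d mydb
```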
We will definitely link to this episode and these documents. And we have a table of monitoring and prevention and mitigation measures.
And for amcheck, we see patches to check uniqueness violations. There is such a patch, but it's not committed yet. It's ready for committer, and unfortunately it won't be in Postgres 16, because it's already almost beta. Next week we have beta, right? Beta one.
So we've gone past the feature freeze for Postgres 16, right?
Right, right.
But I hope it will make it into Postgres 17. And I think it's one of those good cases where you can start using it before the official release and also help test it, because on these verification machines, temporary machines where you verify your backups, you can probably run not-yet-polished software and so on.
It's a good point. And what's the worst that... I guess if it's got bugs, that's the normal reason not to use it, right? Is it that there are false positives and false negatives?
False negatives, like you miss some problem... well, without using it, you miss it for sure if it happens. So still, unpolished software is still beneficial.
False positives are annoying. It says, oh, uniqueness violation.
Okay, it's easy to check and verify.
Yeah, exactly.
Right.
So it's quite a nice failure case.
For corruption, slightly more complex, of course.
For example, it says, oh, there's index corruption.
Well, in this case, you can just rebuild it on
this machine, this very machine. You rebuild
it and check again. If it still
says corruption
in the same place, probably
it's false positive.
Because you fixed it already. Rebuilding
fixes it. Yes, so that's
I guess where we should
be going is how do you fix some of these
issues? And with indexing, it's relatively... with B-tree indexes, because it's an ordering issue normally, right, re-indexing should fix the ordering, because you're re-indexing with the new collation.
So that's the fix.
And as you say, if it shows again, it should be a false positive.
Although, I guess that would then mean it's a bug in amcheck, right?
Yeah, of course
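So the fix is, roughly, the following sketch (index name is a placeholder; mind the CONCURRENTLY caveat that comes up next):

```sql
-- Rebuild the suspect index so it is sorted with the current collation;
-- on a busy production system you would normally add CONCURRENTLY
REINDEX INDEX my_index;
```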
Yeah, and worth mentioning again, Postgres versions 14.0 through 14.3 are super bad, because if you use CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY, you might get corruption. And in this case, if you use the same version on this verification machine and amcheck says corruption, you rebuild it with CONCURRENTLY again.
Probably it says corruption again.
Not good.
Concurrently option in create index and reindex is super cool.
But it has a long trail of bugs, unfortunately.
And if we run amcheck on this clone, non-production, I would verify all findings with a regular REINDEX.
That's a good shout.
Which is more reliable.
Because you don't have people reading, right? You're not worried about the locks.
Single user machine, right?
It's faster too, right?
In this case,
even if I was running 14.0
and corruption came from production because of the use of concurrently
option. Here amcheck would find it. I run REINDEX, and amcheck shouldn't find it anymore. It means it's not a false positive, right? But amcheck for B-tree is quite reliable. It shouldn't produce false positives. It's quite broadly used and well tested, production tested. Amcheck with this patch for uniqueness violations is, of course, a different story. And also the amcheck patch which, likewise, is ready for committer but won't happen in Postgres 16, unfortunately: amcheck for GiST and GIN indexes. A long-awaited thing. It's very much needed.
Andrey Borodin also participated in this. And this patch I tested myself, applied it to amcheck and started to test.
I saw false positives.
And I was told that there were many false positives
in the beginning,
but a lot of them were fixed already.
So it has many iterations already.
So I would consider this quite good.
And I encourage people to try it in non-production.
Again, you still need non-production because of this exclusive lock.
Andrey mentioned some improvements for future Postgres that will probably make it possible to run amcheck in production in a less aggressive way. But currently, I prefer running amcheck on the clone, where I can do more things and so on, utilize all cores, saturate disks, and so on.
It's better.
So that's probably it with amcheck. Again, I'm looking forward to having GiST and GIN support and also uniqueness violation support in future versions of amcheck,
but we can already start using them,
taking just patches, right?
What else? The visibility map: there is also a tool to check visibility map corruption, pg_visibility, because it's not good to have that corrupted either, right? If it happens.
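For reference, a sketch of those checks with the pg_visibility extension (table name is a placeholder; an empty result means nothing suspicious was found):

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- TIDs wrongly marked all-visible or all-frozen in the visibility map
SELECT * FROM pg_check_visible('my_table'::regclass);
SELECT * FROM pg_check_frozen('my_table'::regclass);
```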
I think this episode could easily be like three hours long. I suspect we could go on for a long time.
Yeah. But there are a few good posts that I wanted to mention.
I mentioned my document has lots of information. If you find something is missing, let me know, I will put it in. I'm trying to collect everything about corruption, and it's a very broad topic.
It has many directions.
Nice, I look forward to reading that.
There's one by Cybertec, quite a fun post by Laurenz, I believe, going through quite a few examples of ways to corrupt the database, which is quite a nice way of learning the kinds of things you can do that end up there.
How to corrupt your database?
It's an interesting area of DBA practice.
Let's corrupt to learn.
Yeah.
And I mentioned, I remember something like corruption related to xid wraparound.
Is it this topic?
Is this post or no?
I can't remember that one, sorry.
It's okay. How to corrupt? Yeah, I remember it. I have it in my list.
There are two posts on Cybertec about corruption. One about xid wraparound, and another just about corruption.
There's also an interesting tool, Noisia, from Alexey Lesovsky, it's like how to damage your Postgres.
A tool just to learn and practice how to escape from situations.
We didn't cover the cases when you already have corruption, what to do about it. Of course, pg_resetwal is a tool.
It will take time.
And you definitely need to back up before.
I use it several times in my practice.
One case was interesting
when Postgres was running on top of
NFS in AWS
and got corrupted.
But fortunately,
only one table was corrupted.
It was pg_statistic.
So it was super easy to fix.
Just ANALYZE the whole database and it's rebuilt. But to find what was corrupted, it took time, like half an hour at least,
and it was nerve-wracking because it was under pressure from clients. And I remember some cases take hours or a day to fix and to escape with some data losses.
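As an aside, a sketch of that particular fix: pg_statistic holds planner statistics, which ANALYZE regenerates:

```sql
-- Re-analyze every table in the current database; this rewrites the rows
-- in pg_statistic, effectively rebuilding its contents
ANALYZE;
```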
So it's a huge topic.
There's one more final thing I wanted to mention.
There's a good presentation.
I think it was quite an old one from Christophe Pettus, and the slides are available online.
I'll check and see if I can find the talk as well.
But I thought the slides alone were very good.
And that was detailing a previous case.
So you talked about the 14.0 index and re-index concurrently corruption issue.
There was also apparently one that I wasn't aware of back in 9.3.1.
So even in a minor release, there was a case.
Let me mention a few names as well.
I mentioned Andrey Borodin, a lot of work related to amcheck, indexes and so on. Of course, Peter Geoghegan and his tools, work on B-tree indexes and so on. And also a tool called pg_hexedit. I used it a few times. You need Linux for it, but it's still a very good thing, built on top of pageinspect, to visualize B-trees and other types of indexes as well. Also Tomas Vondra, pg_check, pg_catcheck from EnterpriseDB, by the way, an interesting thing.
What was it called?
pg_catcheck, catalog check. I think it's about... yeah, I'm not sure it can be applied to regular Postgres, but it's like a catalog checker, to find corruption in system catalogs. It's also an interesting topic, worth exploring as well. pg_toolkit by Bertrand Drouvot, sorry, my pronunciation is wrong. And also there are works by Jeremy Schneider, good posts and talks as well, and so on. Interesting materials by Jeremy Schneider from the AWS team.
A lot of stuff.
Wonderful.
And if people can only take one thing away from this, what should they... maybe they should restore, like, check their backups. What would be your number one piece of advice?
Check backups, with the WAL stream. Dump to /dev/null, causing reads of the data. Add amcheck for B-tree.
And of course, enable checksums, right?
Data checksums.
If you want to go further, use visibility map checks right on production.
Also use patched amcheck to verify GiST and GIN, if you have any, in non-production after backup
restoration.
And also uniqueness violation,
and many more tools, like I mentioned,
this pg_catcheck and so on and so on.
It's a rabbit hole.
One more thing.
Read release notes, even for minor version releases.
Because if there are any Postgres bugs causing issues like this,
they will be mentioned in there with mitigation steps.
Sorry, I completely forgot that one.
I didn't mention many good names as well.
Sorry, it's a big list.
pg_checksums, it's from Credativ. And it went into Postgres core in Postgres 12.
Very cool.
Yeah.
Well, if Nikolai forgot to mention anybody, it's a definite case of corruption.
He's just mentioned all his friends.
That's it.
No, if I forgot to mention something, I mentioned it, it just was not recorded.
Oh, yeah.
Good one.
Blame the internet connection.
Recording corruption.
Right, right.
Yeah.
So if you have more ideas, please tell us on Twitter, LinkedIn, and comments, YouTube, anywhere and help me
collect. This is community effort. Like 99% of what is there, it's not from me. I'm like a hub.
And these documents are quite good. Like, if you take any of these documents, you can find work
for any DBA for a year at least in serious projects. Definitely for a year. Like a lot of stuff.
That's bad joke from me for today.
Would you consider your work a collation?
Okay.
Okay.
It's time.
Thanks, everybody.
Good ending point.
Right.
Catch you next week.
Have a good... Yeah, bye.