Postgres FM - caSe-inSENsiTive
Episode Date: August 8, 2025Nikolay and Michael discuss case-insensitive data — when we want to treat columns as case-insensitive, and the pros and cons of using citext, functions like lower(), or a custom collation.�...�Here are some links to things they mentioned:citext https://www.postgresql.org/docs/current/citext.htmlOur episode on over-indexing https://postgres.fm/episodes/over-indexingNondeterministic collations https://www.postgresql.org/docs/current/collation.html#COLLATION-NONDETERMINISTICHow to migrate from Django’s PostgreSQL CI Fields to use a case-insensitive collation (blog post by Adam Johnson) https://adamj.eu/tech/2023/02/23/migrate-django-postgresql-ci-fields-case-insensitive-collationThe collation versioning problem with ICU 73 (blog post by Daniel Vérité) https://postgresql.verite.pro/blog/2023/10/20/icu-73-versioning.htmlamcheck https://www.postgresql.org/docs/current/amcheck.html~~~What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!~~~Postgres FM is produced by:Michael Christofides, founder of pgMustardNikolay Samokhvalov, founder of Postgres.aiWith credit to:Jessie Draws for the elephant artwork
Transcript
Discussion (0)
Hello, hello. This is PostGosafm. As usual, I'm Nick, PostGIS AI, and as usual my co-host is Michael
Pigemaster. Hi, Michael. Hello, Nicola. How are you doing today?
Have good, thanks. How are you?
Great. Very good, very good. Thank you for asking. So today we have a very big topic. I'm joking.
Yeah, I hope this podcast episode should be the shortest because we are going to discuss very narrow topic, although this topic hits almost every project, as I noticed.
So every system has usually some table like users, it has some column like email, and in email, we usually expect the search to not to distinguish registry of characters.
it should be case insensitive search right so if it's all uppercase email or
lowercase or mixed all those values must be considered the same however all people made
this mistake including myself many many times you usually you start using varchar something or
text for email column and then you see that multiple multiple accounts multiple users
have basically the same record just written in different cases.
Right. So here we usually, what do we do? We just introduce additional unique index
over expression, lower email, so we convert email value to lower case or uppercase,
it doesn't matter, and build an index, bitter index with unique option, create unique index concurrently.
on lower or upper function from that value,
and this gives us uniqueness, cancer-sensitive uniqueness, right?
But that's not it.
We then need to fix our search queries.
And make sure that every future use of email uses the function as well
in order to be able to use the index,
or have two indexes if you want to be able to, you know,
there's a few kind of like gotchas around this.
um this approach honestly in like roughly half of the cases you know i we do a lot of health check
of our startups like usually like they are grown to some like terabyte or so and we see
some tables and not every time we look into the table structure but oftentimes we do and i
just see some indexes like index on raw email value and on lower email i see it a lot
and some queries use one index so we hunt for unused indexes which part of our health check
and we discussed it on our podcast in terms of how to maintain good health of indexes
extra indexes is not what you want to have and in this case if we have index on raw value
and on lower of email often we also see both of them are used which means there is a mess right
so all of this was introduction of our today's topic
which is called CI text right
yeah so this was
I know it's a very narrow topic
this was my choice because I was looking through
the contract modules at what was in there
that was both popular and that we hadn't discussed yet
and we've got to enough episodes now
that that was not a long list of ones that
I mean there were definitely some other extension
we had some other contraib modules
that we haven't discussed yet,
but the ones on that list
are not ones I hear discussed
often, not ones I see in use month.
Whereas CI text, I feel like
at least was quite popular.
It did get quite a lot of use,
or at least a lot of people
were interested in the use case.
So yeah, I was interested
kind of in your perspective on this,
and I also, after suggesting it as a topic,
started looking into it more
and found a topic more interesting
than I was expecting to.
so yeah I'm looking forward to getting your thoughts on this
and what people should be doing
but just in terms of those drawbacks
of using the function approach
there's a couple more mentioned by the documentation
that I thought were good
one is less important but
there is just the verbosity of every query
and just having to have functions in them
but a more important one potentially
depending on other opinions you might have
is that you also then can't use
as a primary key if it's just text because, or at least you'd have to have another unique
index on it to guarantee uniqueness to make sure somebody couldn't sign up with the same
email address, but with some case different because it's the same email address.
But yeah, email is an important one.
I don't know if you'd actually want email as a primary key anywhere, but also username, like a lot
of services allow you to set a username that is case-insensitive, like social media
handles and things, those will often be, if not case insensitive, definitely unique, depending
on case, in a lot of cases.
So like Twitter handles, for example.
Yeah.
And the question is why it's in extension, not in core, right?
Well, there's a couple of questions, right?
Like, could this be a data type?
Yeah.
Could it be a core data type?
Well, it is data type, but it's living outside of the core.
But I actually think, so, well, I don't know everyone how quickly you want to jump ahead.
But the reason I found this super interesting is in the docs for CI text, the contra module, it actually suggests not using either CI text or the lower function approach, but instead creating a case and sensitive collation.
And that was so elegant.
I mean, but it is something you have to do.
at the user level
and we could
yeah we could
shit like I don't see why pagegres
couldn't come with one of those
pre-built that you could just use
and maybe even maybe
it wouldn't be a case insensitive feel
I could imagine a case
sorry I need to stop using the word case
to mean like argument
I can imagine an argument for
having email as a data type
that could be
oh
that's interesting
I remember for Posgis
some huge snippet
to check that it's really
email. It's like
there's RFC, right? And
it's not trivial to ensure
some whether it's valid
email address or not. It's not
really like so trivial
as one might think.
Well, so I've seen two cases for this.
One is the really complicated one,
which actually I think
suits the argument.
that it should be pre-built
because you don't want
everybody rolling
their own version of that
and the other version is
no, just strip it right back
all you need to do is
it does it contain an ad
maybe is it all lowercase
or can like cast it to lowercase
and then that's it
maybe a period after the app
or it's a full stop after the app
but I think anything more complicated
than that runs into potentially
not allowing things that should be allowed
right so I will tell you what I think first of all I agree that collation is
elegant solution but and actually collation it's the job of collations even in
SQL standard is to define rules how we compare values right and KSensitive or Kets
insensitive this is comparison so it's exactly where there should be the only
thing as I remember there should be some like you need to distinguish
comparison at byte level like logical level
it's like deterministic or something right yes yeah yeah yeah so honestly i never used it myself
never i just read about this and i i have no idea my own experience is i was like when i first
discovered cites probably it happened 19 years ago or so like on my second year of postgous
use or maybe in the first year even i have the history of it here by the way yeah it's a very
all contribute module. First release
to the public, 17th of
February 2003, but it
wasn't Contribute module at that point.
That only happened a few years later
and it was
kind of like a redesign at that
point. But yeah, so you were pretty early.
Well, yeah, I started using Postgres
in 2005 or 6 and of
course it was social media
so we definitely had the table
users. At the time I
tried to avoid
plural so it was maybe something.
something else, like person or something.
Yeah, then I was convinced by the influence of Ruben Reelsen, Jango, and others, like, okay,
it should be plural.
And then I quickly became a fan of say I text.
And then I quickly became not a fan because, like, maybe one or two years later, I stopped
being a fan because, like, it's extra effort to install extension and then you need to
describe everyone and what, like, whoa, this is special data type.
And I found it easier to keep using text or virtual.
and these days I think just text
well it's not a topic
text or varchar and with limit or no
for length
but did you use collations or no no no no no no
no I just interesting just functions
unique yeah function index
with unique option if we need uniqueness
that's it and then we need to adjust all the queries
and I and I was in my own projects I was hunting to
If we made decision that it should be case and sensitive search,
then probably it should be global,
so we should get rid of index on raw value.
So that's it.
And fix it on all queries and avoid sequential scans,
and that's it.
So this is like, you either want sensitive search,
so index should be on raw value.
Or you want case and sensitive search.
In this case, you must have index on lower or upper.
You need to choose in advance
and make it a rule, like part of your code style, probably.
And then use it usually on both sides of comparison,
so input value also get lowercase or uppercase
and the value of column,
and we know that we rely on index on expression here, right?
Yeah, well, this is an interesting thing that the original author,
so it was originally, the CI text module was originally authored by Donald Fraser,
and in the initial email about this,
one of the use cases was
multi-column functional indexes.
So if you wanted to have,
so if you want an index on a single,
like I got just on email, fine, you can do a function on that.
But at least at the time, you couldn't have it on two,
let's say you wanted it on email, comma, username,
which may have been more common back then
before they had like the includes option.
If you wanted to be on the lower of both of those,
you'd have to have like a kind of apparent function
that did both of those.
So it got a bit messy in terms of multi-column indexes,
whereas if you have them as their own data type,
you can just do that at will.
Multi-colonization on two expressions.
So lower email, lower something else.
I think right now it's definitely possible.
Okay, great.
I didn't realize that.
But back then it wasn't.
So that was like, yeah, that's interesting.
I don't remember this, this nuance.
Yeah, okay.
But yeah, so it's difficult to name, to use the term multi-colum here
because it's already not column, but expression.
Yeah.
Multi-layered or something.
I don't know.
Multi-dimension?
Maybe.
Index on expressions.
It should work these days, not a problem, but also there is a, like,
There is in convenience, if you go this route and use like Lauer, email or something,
it's easier to forget that we need to, when index is created,
we need to run an alize on the table because it lacks statistics on expression.
Yes. Yeah, good point.
If you build the regular index on raw, a common value, statistics usually already there.
Yeah.
But for expressional indexes, it's not there.
But, yeah, I haven't, since then, I made this decision in favor of expression indexes.
And I never went back.
So why should I use CI text in 2025?
I don't think you should, but I also don't think you should be having to do lower functions.
Both paths are bad.
Collation type is good.
Okay.
Yes.
That's what I think is the solution now.
So it's like super interesting.
I actually came across.
just by chance while looking into this,
a blog post by Adam Johnson,
who's worked on Django for many years.
It seems like Django actually removed support
or mostly removed support for CI Text
in version 5.1 of Django.
So they actually deprecated this
and he blogged about what you should do instead
and wrote up about the
how to set application
and what we kind of define all the kind of ICU.
I don't even know what you call them like codes.
well it's called a locale isn't it
but it's like made up of a bunch of different
complete gibberish in my opinion
but you can look up the
spec and kind of choose
which language do you want it to be
and do you want it to be K-sensitive or not
do you want it to be sensitive to accents or not
so yeah that's cool
and it's much more flexibility
yes and it's also SQL standard approach
I think
yes it is
the last time I looked into
SQL standard it was more than 10 years ago
so my memory might
make tricks with me but I remember
something about collations
so this came up
well in 2003
when it was first proposed I think
I read that Tom
a post from Tom Lane in that thread
that said basically the SQL standard
way of doing this is via collations but I can't
see us doing that any time soon so this
makes sense as a hack in the meantime
oh that's great
yeah so that's great
So it's very hard.
How are we going to name this episode?
I thought it will be CI text now.
It's shifted.
I was going to suggest maybe case-in-sensitive text or something.
Because it shows confusion, you know, like it's like-case-sensitive comparison, yeah.
Because that's what we want, isn't it?
But it goes beyond just cases.
You mentioned accents.
That's great, because some people might use, like,
names for example like Thomas Juan maybe but as you said this I think the two main
use I think the two main use cases for this are email addresses and
usernames that's where I've seen it and email addresses are unique no matter
the case but I don't know if they're unique no they are not they are not but
we might we might decide for example if username in our system is allowing like
variants of latin characters we might say okay we allow them but we don't we don't want collisions
like one one one user used pure latin another user used like native version like from another language
many languages have have it right but we say we allow them but first person who used it
should be like should be protected from collisions in this case we might decide to use collations
on defining unique indexes.
That's a great idea, I think.
Yeah, good point.
In fact, I know it's not quite related to that,
but reading about this,
I came across somebody giving an example
of lower not always being ideal,
or lower and upper have, like, being subtly different
because in, there's like in some languages,
you can get an uppercase character
that can have two alternative, lowercase versions of it.
I didn't know about this.
Nor me.
But that, I mean, you can imagine,
and the kind of hassle that's going to end up causing.
So, yeah, it's...
The collations are where this stuff is defined properly, right?
Yeah, yeah.
So it gives you big freedom.
But what are our downsides of using collisions?
Well, I did wonder, because do you remember the...
You know, if G-LibC changes and we can end up with corrupted indexes?
I don't...
How can I...
Well, of course you remember.
Yes, of course I remember.
We had multiple cases with multiple clients.
Could that be, could that be an issue here?
I don't, I don't, well, if you use ICU, it should not be so.
I don't see.
Right, yeah, right.
And I think in this case, we definitely want to specify provider ICU, right?
Yes, good point.
Yeah, and this comparison is, it's, I don't see issue here.
And also there we have issues,
Not with, like, comparison, but with order, right?
True.
So...
Isn't that the same thing?
No.
Who is bigger, who is lower, who's upper, who is lower, basically, in the order.
It doesn't, it's not the same as who equals who, right now.
Except, if you're looking for, like, is there a clash?
Is this, is this unique?
You might traverse it only to the point where you're,
expecting it to be, but it's actually lower down. So you say, oh, no, I've already got past
that point in the index. You can insert this and you can end up with duplicate values in the
unique index. I think order. Yeah, what I don't know is if for source, here we use
deterministic false, right? Yes, yes. And it means that at byte level, the characters are
different, but we logically, we think about them that as equal. But if it, if you think about
is equal in the order by they should go together right but do we have deterministic ordering of them
always this is i don't know like should be right but yeah or this is some an deterministic thing if
so that that might cause index corruption or yeah if it's not unique index if it's unique index it's
just one value i don't see corruption uh risks but if it's non unique index
index. And like these values will come in groups. Is there like deterministic structure?
I mean, the rule how they are ordered. So we avoid a switch of order when JLIPC version changes.
But again, here we talk about ICU. So JLIPC is not involved, right? So everything should be fine.
The only thing, just not to forget, ICU should be enabled at compile time.
I just, yes.
Yeah, which is default, I think, right now.
Is it default?
I don't know.
I just remember Cursor and I were fighting a few weeks ago
because my laptop didn't have ICU installed,
so I always reminded that I created even Cursor rule without ICU.
If you run configure without issue,
then I surrendered and installed ICUs
just because somehow Cursor kept losing this rule somehow.
I don't know why.
so yeah
so it should
it should be present
it might be not present
and not be available
on the system
I think
well lots of people
use and managed
services these days
I think it tends
to be on those
but yeah
really good point
but yeah
this this seems to me
like the way to do it now
and just to
complete that
I'm going to read
from Adam's blood post
the syntax is
create collation
and then you can call it
if you want
for example case insensitive
and then you set
provider
and we, he does provider equals ICU, then locale, and then that's the string that you can
like set all the various settings we mentioned earlier, and then you choose whether it's, you have
to choose in this case deterministic equals force. So that's the bite-wise thing that you're
talking about. Right. Right. So yeah. In fact, just, and is it worth going through that
gobbledygook in the locale quickly? So he's gone with und, so UND, or und, I thought it was
German at first, but it actually just means undetermined language.
So you could set it to, like, German or French or something if you really wanted to.
U specifies Unicode attributes.
And then KS. Level 2.
So apparently that's collation strength.
And then level two doesn't include case in comparisons, only less than accents.
So letters and accents can be considered different, but case can't.
So, yeah.
So that's what you'd want.
email address use case, I think.
Yeah. So, yeah, that's interesting.
Anything else? We have chances to have this episode, as I promised, very short.
Or there are some additional aspects here.
No, I think that's it.
So it looks like both of us, if we were building a new system,
we probably would try to use this approach with collations.
Sounds like the most flexible, most powerful these days, right?
Yeah.
Yeah. You know what? I'm checking right now, and I see cases about collection. Corruption.
Oh, no.
Yeah. I see Daniel Verite, or how to pronounce.
Oh, yeah. Is he a psycho PG?
Not sure. But I remember this name. I read some blogposts in the past, but definitely
collation versioning problem with ICU 73.
so it's interesting about to think about possible corruption when something changes under the hood
yeah and if we go this route what to expect I like personal experience with it
but I would definitely study this topic closer because I thought I thought if we
use ICU collations well JLIPC change is not a problem anymore
We can upgrade OS, right?
But it looks like an ICU library can be, upgrade can be a problem.
So we need to check this, right?
We should be very careful considering what happened in the past with GDPC changes.
I remember it started from not knowing about it at all,
then thinking only one change was a problem,
and then adjusting vision, like, you know what every,
gdlypc version upgrade should be very carefully tested because it like things change quite often
actually there yeah so i i would bring this experience to here and then and double check what's
happening it's probably a case for re-indexing at those times like having a maybe a prioritized list
of indexes that you want to make sure like maybe unique indexes at least uh maybe all
It's only about unique indexes.
If we talk about corruption, we, like, again, order what matters.
And we don't want some values to be returned in different order
or search is not working as expected and so on, yeah.
I was thinking about the limitation of only being able to concurrently create,
or like re-index one at a time.
You probably want to prioritize lists.
So which ones would you do first?
Yeah.
Maybe you're unique ones first.
Yeah, during OS upgrades, we usually look at them and using Amcheck.
to identify
indexes which are going to be corrupted
and then we plan to
just rebuild them
or we rebuild them
on if it's
this upgrade involves logical
replication we rebuild them
before we switch
this is flex
this logical replication gives us
flexibility here
so yeah
but I'm check
back in the days
could only check
B3 indexes
yeah
There was a multi-year work to bring it to GIST and GIN.
I don't remember.
There was some progress there recently,
so Postgust 18 probably will bring something,
but I don't remember from top of my head.
So that's why you probably want to rebuild all GYN and GIST indexes
just for safety,
because corruption might happen there as well.
Nice.
Yeah.
Okay.
Yeah, I hope this was helpful.
I learned something from you.
Thank you so much.
Next time I need this, I probably will look into collations based on ICU and deterministic equals false.
Yeah, nice.
Yeah, and probably will think about accents next time.
Yeah, it's a good thing to keep in mind.
For sure.
All right, nice on a clar.
Yeah, thank you.
See you next time.
Bye-bye.