Postgres FM - caSe-inSENsiTive

Starting point is 00:00:00 Hello, hello. This is PostGosafm. As usual, I'm Nick, PostGIS AI, and as usual my co-host is Michael Pigemaster. Hi, Michael. Hello, Nicola. How are you doing today? Have good, thanks. How are you? Great. Very good, very good. Thank you for asking. So today we have a very big topic. I'm joking. Yeah, I hope this podcast episode should be the shortest because we are going to discuss very narrow topic, although this topic hits almost every project, as I noticed. So every system has usually some table like users, it has some column like email, and in email, we usually expect the search to not to distinguish registry of characters. it should be case insensitive search right so if it's all uppercase email or lowercase or mixed all those values must be considered the same however all people made

Starting point is 00:01:15 this mistake including myself many many times you usually you start using varchar something or text for email column and then you see that multiple multiple accounts multiple users have basically the same record just written in different cases. Right. So here we usually, what do we do? We just introduce additional unique index over expression, lower email, so we convert email value to lower case or uppercase, it doesn't matter, and build an index, bitter index with unique option, create unique index concurrently. on lower or upper function from that value, and this gives us uniqueness, cancer-sensitive uniqueness, right?

Starting point is 00:02:07 But that's not it. We then need to fix our search queries. And make sure that every future use of email uses the function as well in order to be able to use the index, or have two indexes if you want to be able to, you know, there's a few kind of like gotchas around this. um this approach honestly in like roughly half of the cases you know i we do a lot of health check of our startups like usually like they are grown to some like terabyte or so and we see

Starting point is 00:02:41 some tables and not every time we look into the table structure but oftentimes we do and i just see some indexes like index on raw email value and on lower email i see it a lot and some queries use one index so we hunt for unused indexes which part of our health check and we discussed it on our podcast in terms of how to maintain good health of indexes extra indexes is not what you want to have and in this case if we have index on raw value and on lower of email often we also see both of them are used which means there is a mess right so all of this was introduction of our today's topic which is called CI text right

Starting point is 00:03:32 yeah so this was I know it's a very narrow topic this was my choice because I was looking through the contract modules at what was in there that was both popular and that we hadn't discussed yet and we've got to enough episodes now that that was not a long list of ones that I mean there were definitely some other extension

Starting point is 00:03:53 we had some other contraib modules that we haven't discussed yet, but the ones on that list are not ones I hear discussed often, not ones I see in use month. Whereas CI text, I feel like at least was quite popular. It did get quite a lot of use,

Starting point is 00:04:10 or at least a lot of people were interested in the use case. So yeah, I was interested kind of in your perspective on this, and I also, after suggesting it as a topic, started looking into it more and found a topic more interesting than I was expecting to.

Starting point is 00:04:24 so yeah I'm looking forward to getting your thoughts on this and what people should be doing but just in terms of those drawbacks of using the function approach there's a couple more mentioned by the documentation that I thought were good one is less important but there is just the verbosity of every query

Starting point is 00:04:43 and just having to have functions in them but a more important one potentially depending on other opinions you might have is that you also then can't use as a primary key if it's just text because, or at least you'd have to have another unique index on it to guarantee uniqueness to make sure somebody couldn't sign up with the same email address, but with some case different because it's the same email address. But yeah, email is an important one.

Starting point is 00:05:10 I don't know if you'd actually want email as a primary key anywhere, but also username, like a lot of services allow you to set a username that is case-insensitive, like social media handles and things, those will often be, if not case insensitive, definitely unique, depending on case, in a lot of cases. So like Twitter handles, for example. Yeah. And the question is why it's in extension, not in core, right? Well, there's a couple of questions, right?

Starting point is 00:05:42 Like, could this be a data type? Yeah. Could it be a core data type? Well, it is data type, but it's living outside of the core. But I actually think, so, well, I don't know everyone how quickly you want to jump ahead. But the reason I found this super interesting is in the docs for CI text, the contra module, it actually suggests not using either CI text or the lower function approach, but instead creating a case and sensitive collation. And that was so elegant. I mean, but it is something you have to do.

Starting point is 00:06:22 at the user level and we could yeah we could shit like I don't see why pagegres couldn't come with one of those pre-built that you could just use and maybe even maybe it wouldn't be a case insensitive feel

Starting point is 00:06:35 I could imagine a case sorry I need to stop using the word case to mean like argument I can imagine an argument for having email as a data type that could be oh that's interesting

Starting point is 00:06:51 I remember for Posgis some huge snippet to check that it's really email. It's like there's RFC, right? And it's not trivial to ensure some whether it's valid email address or not. It's not

Starting point is 00:07:10 really like so trivial as one might think. Well, so I've seen two cases for this. One is the really complicated one, which actually I think suits the argument. that it should be pre-built because you don't want

Starting point is 00:07:23 everybody rolling their own version of that and the other version is no, just strip it right back all you need to do is it does it contain an ad maybe is it all lowercase or can like cast it to lowercase

Starting point is 00:07:37 and then that's it maybe a period after the app or it's a full stop after the app but I think anything more complicated than that runs into potentially not allowing things that should be allowed right so I will tell you what I think first of all I agree that collation is elegant solution but and actually collation it's the job of collations even in

Starting point is 00:08:00 SQL standard is to define rules how we compare values right and KSensitive or Kets insensitive this is comparison so it's exactly where there should be the only thing as I remember there should be some like you need to distinguish comparison at byte level like logical level it's like deterministic or something right yes yeah yeah yeah so honestly i never used it myself never i just read about this and i i have no idea my own experience is i was like when i first discovered cites probably it happened 19 years ago or so like on my second year of postgous use or maybe in the first year even i have the history of it here by the way yeah it's a very

Starting point is 00:08:49 all contribute module. First release to the public, 17th of February 2003, but it wasn't Contribute module at that point. That only happened a few years later and it was kind of like a redesign at that point. But yeah, so you were pretty early.

Starting point is 00:09:05 Well, yeah, I started using Postgres in 2005 or 6 and of course it was social media so we definitely had the table users. At the time I tried to avoid plural so it was maybe something. something else, like person or something.

Starting point is 00:09:21 Yeah, then I was convinced by the influence of Ruben Reelsen, Jango, and others, like, okay, it should be plural. And then I quickly became a fan of say I text. And then I quickly became not a fan because, like, maybe one or two years later, I stopped being a fan because, like, it's extra effort to install extension and then you need to describe everyone and what, like, whoa, this is special data type. And I found it easier to keep using text or virtual. and these days I think just text

Starting point is 00:09:52 well it's not a topic text or varchar and with limit or no for length but did you use collations or no no no no no no no I just interesting just functions unique yeah function index with unique option if we need uniqueness that's it and then we need to adjust all the queries

Starting point is 00:10:13 and I and I was in my own projects I was hunting to If we made decision that it should be case and sensitive search, then probably it should be global, so we should get rid of index on raw value. So that's it. And fix it on all queries and avoid sequential scans, and that's it. So this is like, you either want sensitive search,

Starting point is 00:10:35 so index should be on raw value. Or you want case and sensitive search. In this case, you must have index on lower or upper. You need to choose in advance and make it a rule, like part of your code style, probably. And then use it usually on both sides of comparison, so input value also get lowercase or uppercase and the value of column,

Starting point is 00:11:02 and we know that we rely on index on expression here, right? Yeah, well, this is an interesting thing that the original author, so it was originally, the CI text module was originally authored by Donald Fraser, and in the initial email about this, one of the use cases was multi-column functional indexes. So if you wanted to have, so if you want an index on a single,

Starting point is 00:11:29 like I got just on email, fine, you can do a function on that. But at least at the time, you couldn't have it on two, let's say you wanted it on email, comma, username, which may have been more common back then before they had like the includes option. If you wanted to be on the lower of both of those, you'd have to have like a kind of apparent function that did both of those.

Starting point is 00:11:54 So it got a bit messy in terms of multi-column indexes, whereas if you have them as their own data type, you can just do that at will. Multi-colonization on two expressions. So lower email, lower something else. I think right now it's definitely possible. Okay, great. I didn't realize that.

Starting point is 00:12:12 But back then it wasn't. So that was like, yeah, that's interesting. I don't remember this, this nuance. Yeah, okay. But yeah, so it's difficult to name, to use the term multi-colum here because it's already not column, but expression. Yeah. Multi-layered or something.

Starting point is 00:12:30 I don't know. Multi-dimension? Maybe. Index on expressions. It should work these days, not a problem, but also there is a, like, There is in convenience, if you go this route and use like Lauer, email or something, it's easier to forget that we need to, when index is created, we need to run an alize on the table because it lacks statistics on expression.

Starting point is 00:12:59 Yes. Yeah, good point. If you build the regular index on raw, a common value, statistics usually already there. Yeah. But for expressional indexes, it's not there. But, yeah, I haven't, since then, I made this decision in favor of expression indexes. And I never went back. So why should I use CI text in 2025? I don't think you should, but I also don't think you should be having to do lower functions.

Starting point is 00:13:34 Both paths are bad. Collation type is good. Okay. Yes. That's what I think is the solution now. So it's like super interesting. I actually came across. just by chance while looking into this,

Starting point is 00:13:46 a blog post by Adam Johnson, who's worked on Django for many years. It seems like Django actually removed support or mostly removed support for CI Text in version 5.1 of Django. So they actually deprecated this and he blogged about what you should do instead and wrote up about the

Starting point is 00:14:06 how to set application and what we kind of define all the kind of ICU. I don't even know what you call them like codes. well it's called a locale isn't it but it's like made up of a bunch of different complete gibberish in my opinion but you can look up the spec and kind of choose

Starting point is 00:14:23 which language do you want it to be and do you want it to be K-sensitive or not do you want it to be sensitive to accents or not so yeah that's cool and it's much more flexibility yes and it's also SQL standard approach I think yes it is

Starting point is 00:14:41 the last time I looked into SQL standard it was more than 10 years ago so my memory might make tricks with me but I remember something about collations so this came up well in 2003 when it was first proposed I think

Starting point is 00:14:57 I read that Tom a post from Tom Lane in that thread that said basically the SQL standard way of doing this is via collations but I can't see us doing that any time soon so this makes sense as a hack in the meantime oh that's great yeah so that's great

Starting point is 00:15:13 So it's very hard. How are we going to name this episode? I thought it will be CI text now. It's shifted. I was going to suggest maybe case-in-sensitive text or something. Because it shows confusion, you know, like it's like-case-sensitive comparison, yeah. Because that's what we want, isn't it? But it goes beyond just cases.

Starting point is 00:15:38 You mentioned accents. That's great, because some people might use, like, names for example like Thomas Juan maybe but as you said this I think the two main use I think the two main use cases for this are email addresses and usernames that's where I've seen it and email addresses are unique no matter the case but I don't know if they're unique no they are not they are not but we might we might decide for example if username in our system is allowing like variants of latin characters we might say okay we allow them but we don't we don't want collisions

Starting point is 00:16:19 like one one one user used pure latin another user used like native version like from another language many languages have have it right but we say we allow them but first person who used it should be like should be protected from collisions in this case we might decide to use collations on defining unique indexes. That's a great idea, I think. Yeah, good point. In fact, I know it's not quite related to that, but reading about this,

Starting point is 00:16:50 I came across somebody giving an example of lower not always being ideal, or lower and upper have, like, being subtly different because in, there's like in some languages, you can get an uppercase character that can have two alternative, lowercase versions of it. I didn't know about this. Nor me.

Starting point is 00:17:10 But that, I mean, you can imagine, and the kind of hassle that's going to end up causing. So, yeah, it's... The collations are where this stuff is defined properly, right? Yeah, yeah. So it gives you big freedom. But what are our downsides of using collisions? Well, I did wonder, because do you remember the...

Starting point is 00:17:29 You know, if G-LibC changes and we can end up with corrupted indexes? I don't... How can I... Well, of course you remember. Yes, of course I remember. We had multiple cases with multiple clients. Could that be, could that be an issue here? I don't, I don't, well, if you use ICU, it should not be so.

Starting point is 00:17:52 I don't see. Right, yeah, right. And I think in this case, we definitely want to specify provider ICU, right? Yes, good point. Yeah, and this comparison is, it's, I don't see issue here. And also there we have issues, Not with, like, comparison, but with order, right? True.

Starting point is 00:18:15 So... Isn't that the same thing? No. Who is bigger, who is lower, who's upper, who is lower, basically, in the order. It doesn't, it's not the same as who equals who, right now. Except, if you're looking for, like, is there a clash? Is this, is this unique? You might traverse it only to the point where you're,

Starting point is 00:18:40 expecting it to be, but it's actually lower down. So you say, oh, no, I've already got past that point in the index. You can insert this and you can end up with duplicate values in the unique index. I think order. Yeah, what I don't know is if for source, here we use deterministic false, right? Yes, yes. And it means that at byte level, the characters are different, but we logically, we think about them that as equal. But if it, if you think about is equal in the order by they should go together right but do we have deterministic ordering of them always this is i don't know like should be right but yeah or this is some an deterministic thing if so that that might cause index corruption or yeah if it's not unique index if it's unique index it's

Starting point is 00:19:33 just one value i don't see corruption uh risks but if it's non unique index index. And like these values will come in groups. Is there like deterministic structure? I mean, the rule how they are ordered. So we avoid a switch of order when JLIPC version changes. But again, here we talk about ICU. So JLIPC is not involved, right? So everything should be fine. The only thing, just not to forget, ICU should be enabled at compile time. I just, yes. Yeah, which is default, I think, right now. Is it default?

Starting point is 00:20:13 I don't know. I just remember Cursor and I were fighting a few weeks ago because my laptop didn't have ICU installed, so I always reminded that I created even Cursor rule without ICU. If you run configure without issue, then I surrendered and installed ICUs just because somehow Cursor kept losing this rule somehow. I don't know why.

Starting point is 00:20:38 so yeah so it should it should be present it might be not present and not be available on the system I think well lots of people

Starting point is 00:20:46 use and managed services these days I think it tends to be on those but yeah really good point but yeah this this seems to me

Starting point is 00:20:52 like the way to do it now and just to complete that I'm going to read from Adam's blood post the syntax is create collation and then you can call it

Starting point is 00:21:03 if you want for example case insensitive and then you set provider and we, he does provider equals ICU, then locale, and then that's the string that you can like set all the various settings we mentioned earlier, and then you choose whether it's, you have to choose in this case deterministic equals force. So that's the bite-wise thing that you're talking about. Right. Right. So yeah. In fact, just, and is it worth going through that

Starting point is 00:21:28 gobbledygook in the locale quickly? So he's gone with und, so UND, or und, I thought it was German at first, but it actually just means undetermined language. So you could set it to, like, German or French or something if you really wanted to. U specifies Unicode attributes. And then KS. Level 2. So apparently that's collation strength. And then level two doesn't include case in comparisons, only less than accents. So letters and accents can be considered different, but case can't.

Starting point is 00:22:04 So, yeah. So that's what you'd want. email address use case, I think. Yeah. So, yeah, that's interesting. Anything else? We have chances to have this episode, as I promised, very short. Or there are some additional aspects here. No, I think that's it. So it looks like both of us, if we were building a new system,

Starting point is 00:22:27 we probably would try to use this approach with collations. Sounds like the most flexible, most powerful these days, right? Yeah. Yeah. You know what? I'm checking right now, and I see cases about collection. Corruption. Oh, no. Yeah. I see Daniel Verite, or how to pronounce. Oh, yeah. Is he a psycho PG? Not sure. But I remember this name. I read some blogposts in the past, but definitely

Starting point is 00:23:00 collation versioning problem with ICU 73. so it's interesting about to think about possible corruption when something changes under the hood yeah and if we go this route what to expect I like personal experience with it but I would definitely study this topic closer because I thought I thought if we use ICU collations well JLIPC change is not a problem anymore We can upgrade OS, right? But it looks like an ICU library can be, upgrade can be a problem. So we need to check this, right?

Starting point is 00:23:44 We should be very careful considering what happened in the past with GDPC changes. I remember it started from not knowing about it at all, then thinking only one change was a problem, and then adjusting vision, like, you know what every, gdlypc version upgrade should be very carefully tested because it like things change quite often actually there yeah so i i would bring this experience to here and then and double check what's happening it's probably a case for re-indexing at those times like having a maybe a prioritized list of indexes that you want to make sure like maybe unique indexes at least uh maybe all

Starting point is 00:24:32 It's only about unique indexes. If we talk about corruption, we, like, again, order what matters. And we don't want some values to be returned in different order or search is not working as expected and so on, yeah. I was thinking about the limitation of only being able to concurrently create, or like re-index one at a time. You probably want to prioritize lists. So which ones would you do first?

Starting point is 00:24:56 Yeah. Maybe you're unique ones first. Yeah, during OS upgrades, we usually look at them and using Amcheck. to identify indexes which are going to be corrupted and then we plan to just rebuild them or we rebuild them

Starting point is 00:25:12 on if it's this upgrade involves logical replication we rebuild them before we switch this is flex this logical replication gives us flexibility here so yeah

Starting point is 00:25:25 but I'm check back in the days could only check B3 indexes yeah There was a multi-year work to bring it to GIST and GIN. I don't remember. There was some progress there recently,

Starting point is 00:25:40 so Postgust 18 probably will bring something, but I don't remember from top of my head. So that's why you probably want to rebuild all GYN and GIST indexes just for safety, because corruption might happen there as well. Nice. Yeah. Okay.

Starting point is 00:25:58 Yeah, I hope this was helpful. I learned something from you. Thank you so much. Next time I need this, I probably will look into collations based on ICU and deterministic equals false. Yeah, nice. Yeah, and probably will think about accents next time. Yeah, it's a good thing to keep in mind. For sure.

Starting point is 00:26:17 All right, nice on a clar. Yeah, thank you. See you next time. Bye-bye.

Postgres FM - caSe-inSENsiTive

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.