Postgres FM - Full text search

Episode Date: May 24, 2024

Nikolay and Michael discuss full text search in Postgres — some of the history, some of the features, and whether it now makes sense to try to replace or combine it with semantic search.

Here are some links to things they mentioned:

Full Text Search https://www.postgresql.org/docs/current/textsearch.html
tsearch2 https://www.postgresql.org/docs/9.6/tsearch2.html
Dictionaries https://www.postgresql.org/docs/current/textsearch-dictionaries.html
RUM index https://github.com/postgrespro/rum
Okapi BM25 https://en.wikipedia.org/wiki/Okapi_BM25
tf–idf https://en.wikipedia.org/wiki/Tf%E2%80%93idf
unaccent https://www.postgresql.org/docs/current/unaccent.html
tsvector and tsquery https://www.postgresql.org/docs/current/datatype-textsearch.html
GiST indexes https://www.postgresql.org/docs/current/gist.html
GIN indexes https://www.postgresql.org/docs/current/gin.html
Controlling Text Search (including the setweight function) https://www.postgresql.org/docs/current/textsearch-controls.html
pg_trgm https://www.postgresql.org/docs/current/pgtrgm.html
btree_gist https://www.postgresql.org/docs/current/btree-gist.html
btree_gin https://www.postgresql.org/docs/current/btree-gin.html
websearch_to_tsquery https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
pgvector https://github.com/pgvector/pgvector
Our previous episode on search https://postgres.fm/episodes/search

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello, hello, this is Postgres FM, episode number 98, right? 98? Yes. Yeah, and this is Nikolay Samokhvalov, founder of Postgres.ai, and my co-host is Michael Christofides, founder of pgMustard. Hi, Michael. Wow, that's the first time you've attempted my last name, and you did great. Okay, yeah. Thank you so much. I practiced all these years. So what's the topic today?
Starting point is 00:00:29 Yeah, so this was your suggestion. I think it's a great one. We've been meaning to come back to it for a while now. It's full-text search. Why did you want to talk about it? I think it was in our list for a very long time. Maybe from the very beginning because it's a huge topic.
Starting point is 00:00:45 And when we built that list, I think pgvector already existed, but only a few people paid attention to it, unlike today. So since we didn't manage to discuss full-text search before this AI boom started last year, I think we should maybe compare it a little bit. And since I used full-text search a lot in the past, I still remember many features it offers in Postgres. And I think it's interesting to not only discuss the usual comparison,
Starting point is 00:01:21 embedded full-text search in Postgres versus Elastic, but also embedded full-text search in Postgres versus the pgvector extension and similarity search. Or maybe not versus, maybe together. And this is interesting. But I think times are changing and the evolution is very interesting. And today people probably pay less attention to full-text search, but they at least should know what capabilities it can offer. And let's maybe start from use cases and discuss the functions and features FTS, full-text search, provides in Postgres.
Starting point is 00:02:07 What do you think? Or maybe we can talk about some historical aspects. I also can provide some overview of what was happening in Postgres and not only in Postgres. What do you think? Features, use cases, history, what's better? Actually, I don't know the history that well. That would be awesome to hear a tiny bit about that, and then I'd
Starting point is 00:02:30 love to jump onto use cases, where you see it used most commonly, that kind of thing. Yeah, actually, we can combine them. Let's see how we can do it. So if, for example, we roll back to the first web era, web 1.0, so to speak, like 25 years ago, when Google was created. Because Google is, originally, a search engine, right? And it solves the problem of finding proper texts, the most relevant texts, right?
Starting point is 00:03:01 And at that time, before Google, other search engines had already been created, obviously. And the full-text search capabilities were originally called tsearch2. Actually, I remember seeing this contrib module in 2005, maybe, when I first started working with Postgres, created by Oleg Bartunov, Teodor Sigaev, and then Alexander Korotkov joined them. They created it because they were building a search engine originally. They helped build a quite big search engine.
Starting point is 00:03:37 And then they incorporated many of their works into Postgres, and this became tsearch2, a contrib module, which later was merged into the core, so you don't need to install an extension, it's already in core, right? But the interesting thing is that it was before social networks and so on,
Starting point is 00:03:59 but if we look at how it's built, it has a lot of things. For example, you can exclude some words. And by default, it excludes some words. It's called a dictionary, which, in Postgres full-text search terminology, is a program. It can process input, somehow transforming it. So stop words, it's just the exclusion of various articles, for example, and so on, which probably don't carry any meaning. So we can just exclude them. That's it. Then we have snowball, which is a stemmer. So it removes the endings. And it's
Starting point is 00:04:42 super fast and super simple. So, for example, runs and run, it's kind of the same word, so the s should be removed, right? And then, as a more full-fledged thing, we can ask to use some real dictionary, usually Ispell, which is used also for the thesaurus. It can give a lot of things, but it's usually slower and requires some memory. And all of these things are configurable in full-text search, and they were configurable in tsearch2 originally. And it's great.
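A minimal sketch of this preprocessing in SQL, using the built-in english configuration (to_tsvector and ts_lexize are standard Postgres functions):

```sql
-- Stop words are dropped and endings are stripped by the snowball stemmer;
-- the positions of the original words are kept.
SELECT to_tsvector('english', 'The cat runs and the dogs ran');
-- 'cat':2 'dog':6 'ran':7 'run':3

-- ts_lexize applies a single dictionary directly:
SELECT ts_lexize('english_stem', 'runs');  -- {run}
```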
Starting point is 00:05:19 But if you think about the task, the use case we are trying to solve, we want to present the user some information which is most relevant. What is most relevant? The user has some query: I want a recipe with, I don't know, with some eggplant, for example, if we have, for example, a food recipe website or some knowledge base. And then, just by removing stop words and normalizing words, we can filter, right? But filtering is just one thing. We also need to somehow order, maybe, right, and present only the most relevant. What is the most relevant?
Starting point is 00:06:07 Maybe these words are most frequently used in these documents, right? Like repeated many times, for example. And this is a quite basic idea: if the word is used many times, the document is considered most relevant. Now, a confession: I never used this. Why? Because I'm from a different era. I'm from web 2.0, where most relevant meant something else, because
Starting point is 00:06:33 it was social networks. Also, time matters a lot. If it's a fresh document, like a fresh blog post or something, it matters a lot. But Postgres full-text search lacks this. It's hard to build together. There is a special index, RUM. I mentioned it many times already. Maybe I need to finally revisit it and think, maybe it still can be used, because if you're on self-managed Postgres,
Starting point is 00:07:00 this type of index can be installed as an additional extension. But original full-text search can order only by this ts_rank, there is a function called ts_rank. And there is another function, ts_rank_cd, which also considers positions and the distance between words, cover density, so the closer the words are,
Starting point is 00:07:21 the better ranking score is, right? But for me, it was always nonsense to use only these word positions and so on. I couldn't imagine how I can use it because I always needed to take into account time. And also sometimes things like likes, reactions, comments. If this post is heavily discussed, it's very important, right? But back to history, we know there is BM25 and also there is this TF IDF. The idea of TF, it's term frequency, inverted document frequency, right?
Starting point is 00:07:59 If some document has this term, like a word, mentioned many times, it's good for this document, right? And if, in general, this term is mentioned in our whole database, the whole dataset, not very frequently, it's also good for our document, because it means that it's exceptional in terms of frequency, right? So that's why we invert the document frequency. So it means that for such a thing, we need to analyze the whole dataset. Postgres full-text search doesn't do it. It's quite hard to maintain this type of thing, but I think there are some attempts to have it in Postgres, maybe some extensions exist.
Starting point is 00:08:35 And this is what Elastic uses, right? But there is also another approach. Back to 25 years ago: Google. Why Google? Why did Google win? Because it had PageRank. If we think deeply about this,
Starting point is 00:08:56 it's already from Web 2.0 idea, because PageRank, it takes into account kind of social network of documents, I would say. It takes into account links, right? So interactions. We are not alone.
Starting point is 00:09:14 And this idea is super powerful. I'm curious, are there any extensions which implement this in Postgres, and how could it be implemented? This is super interesting. But this is a big power: if we have links to a document, we value this document more. And I think there are good
Starting point is 00:09:34 algorithms how to solve it with billions of documents, of course. There are papers written. Anyone who studied search engine theory, of course. There is a course from Stanford, Mining Massive Datasets, very good material where I studied this page rank
Starting point is 00:09:51 and other algorithms closer to machine learning on big datasets. So idea is like links, right? You multiply by some matrices and you have some good index and you can use it. Again, Postgres doesn't provide this. And Elastic doesn't provide PageRank, I guess. But it's good because we start taking into account not only word positions.
Starting point is 00:10:17 Full-text search in Postgres is only word positions for a single document, and that's it. But since full-text search in Postgres is inside a relational database system, we can have other columns, right? And we can have indexes on them. So, for example, likes, or something like a timestamp of when the document was created. We can construct SQL which takes into account all these things. It will just be slow, right, maybe, because the fastest query is always an index scan, or an index-only scan is the best, right?
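A sketch of the kind of multi-factor query being described here; the posts table, its columns, and the weighting formula are all hypothetical:

```sql
-- Text relevance combined with likes and recency in one ORDER BY;
-- without a specialized index, this sorting happens on the fly.
SELECT id, title
FROM posts
WHERE tsv @@ to_tsquery('english', 'eggplant')
ORDER BY ts_rank(tsv, to_tsquery('english', 'eggplant'))
         + 0.01 * likes
         - 0.1 * extract(epoch FROM now() - created_at) / 86400 DESC
LIMIT 10;
```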
Starting point is 00:10:53 But when we have full-text search, we have other columns, and we probably have B-trees on them. The combination is a problem, right? This is the problem. There are ways you can solve it, but let's maybe discuss them a little bit later, I mean, btree_gist and other extensions. But originally, the combination of different factors
Starting point is 00:11:18 in the ranking task, this is the problem, right? Later, Google and others started taking into account social network activities. If some document is mentioned on Twitter, for example, or Facebook, this happened like 10 or 15 years ago. Everyone who dealt with search engine optimization knows that if you have a lot of likes in social networks, your document goes up in positions. But if we deal with our own system, we can take into account interactions like that, because it's just some data we store, and that's it. And today it's usually not that bad if you don't have tens of billions of records, just millions, or tens or hundreds of millions.
Starting point is 00:12:05 It's not a huge volume of data. So you can rely on multiple indexes and let Postgres decide what to choose based on statistics. This is very similar to what we discussed, remember? Yeah. Yeah, very similar. Because it triggered memories of the problems with full-text search I had when trying to incorporate it into social media projects. Because you have two indexes to choose from.
Starting point is 00:12:30 Because you cannot choose them together. Well, you can. You can have a bitmap index scan, but it's already slower. But most likely Postgres will choose just one index, for example, order by time, or full-text search, based on statistics. And then, on the fly, it will apply either sorting in memory or filtering in memory, right? Yeah, exactly.
Starting point is 00:12:52 That's where the inefficiency comes in. Yeah, yeah. So this is like a historical overview. We had simple stop words, stemming, dictionaries. It's not simple, by the way, it's a lot of functionality. But ranking is based on just our
Starting point is 00:13:09 opinion about how this document is relevant to your query, not considering relations or other things. Second thing is, I think, page rank, and Postgres doesn't have it. And the third thing is taking into account various
Starting point is 00:13:25 factors like time and likes and so on. But also there is similarity search now. We know search engines have used it for a long time, right? So... Actually, I want to go back to... I think your example is remarkably useful. So the example you gave
Starting point is 00:13:42 of searching a site of recipes for eggplant, let's add to that. Maybe we're not super sophisticated and we type "an eggplant". So the "an" would count as a stop word; we don't want recipes that say "an" a lot to rank highly. So that's getting rid of the stop words. We might want, for example, to do something slightly more: we might want to rank recipes that list eggplant in the title, we might want those to score higher than ones where it's just listed in the document more times. So that's an interesting additional challenge. Secondly, it might be, let's say, a user-generated
Starting point is 00:14:22 recipe site and we might want to factor in how many likes a recipe has got or how recent it is or something like that. So there's like all these, even in the simplest example you can think of these days, it can get... Oh, I've got one more. You might also, if it's user-generated content, you could have British people or like people using aubergine instead of eggplant, and you also want those to rank higher.
Starting point is 00:14:47 Yes, exactly. So synonyms, it's already a bridge to discussing full-text search versus similarity. Let's keep it as the last item, right? But a great attempt to zoom in. This is good. Let's do it, diving deeper. So let me briefly mention how full-text search works in Postgres.
Starting point is 00:15:09 So, this pre-processing: we just discussed the removal of stop words, and you can control it. The good thing about full-text search in Postgres is a lot of customization options, a lot. For example, you can control the stop list and stop considering "an" a stop word, right? Or you can, for example, remove accents. There is a special extension, unaccent,
Starting point is 00:15:32 right? Additional preprocessing. So like in French, for example, with lots of... Or there are support of multiple languages also. It's also good. So you can start considering additional stop words, for example, remove it if they mention
Starting point is 00:15:47 too often, like almost all documents have them, so let's remove them, because it doesn't make sense to use them, right? Then stemming, or dictionary processing, Ispell dictionary processing. So in the end we have a set of normalized words, right, and then
Starting point is 00:16:03 we build an array. It's called tsvector. Again, also a vector, by the way. So it's a set of words with positions, already in a normalized form. But basically it's just an array of
Starting point is 00:16:19 texts. Like words or lexemes or whatever the yeah yeah for example i forgot also to mention hyphened words i think you also can control it as i remember it was many years ago i think you can choose either to consider them as a single one word or to split them to multiple words like two two words basically yeah first and second or or together like you can put a pair of separate words and whole word is hyphen right total three already right so that if people search one or the other they still get that result back right for example shift left testing shift left If it's a half and you can put both,
Starting point is 00:17:07 not both shift left and shift left as three separate entries in your test vector. Yeah. You can control how Postgres will do it. We're building test vector. Also, you can concatenate test vectors. So data can come from different sources.
Starting point is 00:17:23 For example, we have title and we have body, or subject and body, like, for example, if it's email, or if it's a blog post, title and content text, how you name it doesn't matter. And you can concatenate two tsvectors, or you can concatenate the texts before that, with an additional space, right, and then build the tsvector. So a tsvector is just an array of texts. And you can explicitly store it, consuming space.
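What the hyphen handling and the concatenation look like with the default parser (posts, title, and body are hypothetical names):

```sql
-- The default parser indexes the whole hyphenated word plus its parts:
SELECT to_tsvector('english', 'shift-left');
-- 'left':3 'shift':2 'shift-left':1

-- Concatenating tsvectors built from different columns:
SELECT to_tsvector('english', title) || to_tsvector('english', body)
FROM posts;
```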
Starting point is 00:17:53 Or you can rely on a function, because then we need to create an index on it. Originally, it was GiST. I think it was the work of Hellerstein in Berkeley very long ago, in the early 90s maybe. And it was not finished, but the guys I mentioned, Bartunov and Sigaev, just saw this directory in the Postgres source code, as they say, and understood this is exactly what they needed. There's also an additional thing. So GiST is Generalized Search Tree. It's similar to a B-tree, but it's abstract. You can use it for many data types and operators. For a B-tree, it will be numbers and the operators less than, more than, and equals,
Starting point is 00:18:51 and just one axis. R3, it's two axes, and then you can have it for arrays as well. In this case, we talk about operators intersect, includes, contains, is contained by, so and overlaps right overlaps is at at usually operate you can actually redefine it and use your own symbols right it's it's actually first you define function then operator it's since posgus is very extendable but historically it's at at or less than at or at greater than. So these operators
Starting point is 00:19:29 you can define how to deal with them in terms of this B-tree-like index structure. So it's balanced, with many, many children on each node. And balancing is automated. Everything is automated. So when the guys,
Starting point is 00:19:45 these folks, found this, Hellerstein's work, it was not finished, and WAL was not supported. They worked on that and supported WAL. And then an additional thing, we won't go deep because of lack of time, and I also forgot everything, but there's also a thing called a signature tree. So to put our tsvectors into this structure, this additional concept of a signature tree, you need to build a signature. It's also defined. So these signatures are placed into the leaves of this tree. And then the search is very B-tree-like, but with additional exclusions, which makes the GiST option not popular at all these days. A recheck is needed, because it's not an exact search.
Starting point is 00:20:36 I've heard it's called lossy index types. Right, right. So if you check, this is what you do in PgMaster, right? Explain, analyze buffers, right? You check and see in the plan, recheck happened. If gist was used, recheck happened. So let me close some gap. Query also processed to build a vector, right?
Starting point is 00:20:59 But it's called tsquery. It's a different data type, but it's also similar, also a bunch of words, also with preprocessing. You can have different preprocessing, you decide how to preprocess, right? But usually it's the same as for the tsvector; for the tsquery, the same preprocessing. You can also remove stop words, normalize words, and so on. And then our question is: we need to find documents which include everything we asked for in our query.
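A minimal sketch of that matching step; the docs table and its tsv column are hypothetical:

```sql
CREATE INDEX ON docs USING gin (tsv);

-- @@ matches a tsvector against a tsquery;
-- plainto_tsquery simply ANDs the normalized words together.
SELECT id
FROM docs
WHERE tsv @@ plainto_tsquery('english', 'eggplant recipe');
```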
Starting point is 00:21:31 And rank them. Yeah, and rank also additionally. So there is an approach to put it to trees, to such values, to leaves. And then it searches similar to b3, but instead of greater than or where to go, greater than or less than, right? Instead of that, it checks if this vector is contained in that vector, right? I mean, this array is contained in that array, and that's why you need to go left or right, something like this. So it's called Russian doll tree because it's like Russian doll, right? Like somehow. RD tree.
Starting point is 00:22:09 So, by the way, this term RD-tree is also in Berkeley papers somehow. So I think it was invented before Bartunov and Sigaev's work. Probably, maybe it influenced them, I don't know. But then, like,
Starting point is 00:22:24 obviously it's slow in terms of search for large volumes of data because of this recheck. And this is not how Google and others worked in terms of, not even Google. Google, we know, PageRank is a bigger thing. But eventually, Burtunov, Sigaev, and then
Starting point is 00:22:39 Korotkov, they created GIN. GIN is the Generalized Inverted Index, where we have a list of terms, our words, and for each, a list of documents. Right, all tuple IDs, tuple
Starting point is 00:22:55 city IDs, I think, where this is stored, but not lists. There are B3s there actually for faster search. So it means that GIST is good only at one thing, update speed. GIST versus GIN, yeah. Right, right. But GIN, it was improved.
Starting point is 00:23:14 There is also the fastupdate option. But anyway, the default option for us is GIN, right? Yeah. And for comparing a tsquery and a tsvector, GIN is good, like, search is fast, but ordering by ts_rank, right, that is probably not good. So back to the comparison, back to these use cases you started mentioning. Let's think about them. First, you said, or it was second: let's distinguish subject and body, right? For example.
Starting point is 00:23:37 Let's think about them. First, you said, or it was second, let's distinguish subject and body, right? For example. Yeah, should we rank, should the presence of an ingredient in it or... Word. Yeah, words.
Starting point is 00:23:55 Because we know that people aren't going to be searching in our search bar for random things. Maybe there's a few different things they might search by. The most common is probably ingredient. Should recipes that list that in their title bar for random things like maybe there's like a few different things they might search by the most common is probably ingredient should recipes that list that in their title rank higher or but it'd be way too higher even if they only mention it like a few couple of times later on it's i feel like personally i'd be expecting those to rank higher than ones where it was like a small ingredient or i only needed a
Starting point is 00:24:25 small amount of it, maybe. Maybe the amount of the ingredient matters a lot, maybe that's easier. Right, right. So it's definitely a good thing to want, right? If a word is in the title, it means that maybe the whole article is about this, right? If eggplant is in the title, we pay more attention to it. Makes sense. So for this, there are two bits which you can use. Two bits means four options, right?
Starting point is 00:24:55 So there is this function, setweight. And when you combine information from multiple sources, from a title column and a content column, for example, a body column, you can say setweight A for the first one, setweight B for the second one. And there are only A, B, C, D, uppercase, because, again, only two bits are used for this.
Starting point is 00:25:20 And then later, in your query, you can also say: I pay attention to this or to that. Actually, you can use it for filtering as well, and this is what I did, but originally it was created for ranking. I don't remember the details, but the setweight function, you need to search in the documentation, it should be there, and you will find out how. It's a very strange concept. Why only four? Because of two bits only. They had only two bits to spend on it. Maybe there should be more. But I used it sometimes, like, we search only in the subject, right?
Starting point is 00:25:56 And I say, okay, only a category. It's called category maybe. It's embedded inside TS vector. So you can skip using it. It's there, but you can just ignore it in search. But in different time, you can say, I want to search only category IA. It means only title search. You don't
Starting point is 00:26:14 need to build two TS vectors separately. However, you could, right? Got it. So like if, for example, going back to our example, I search what's clearly like an author name, and we can maybe on the application side we're doing a little like a quick um maybe we're doing something first to try and categorize what people are searching for if it's an author name i could then send that
Starting point is 00:26:36 through to Postgres as: let's only look in this category. Right, right. But, again, it originally was created for ranking, I just didn't use it that way, so I cannot explain how. But the documentation, of course, explains how. A good thing to mention also: now we have the GENERATED ALWAYS thing, right? Generated columns. They are stored, but Postgres maintains them automatically. So I think generating the tsvector is probably a good thing to use together with that functionality. So you have the author, title, and body of a blog post, and then the tsvector can be generated based on some expression,
Starting point is 00:27:21 which puts different categories to different sources of data, again, up to four, and generates and puts TS vector there, and then you have index on it, right? Or you can define a trigger. All my life, I defined triggers for this. And also, there is another option not to store it at all and just use index on expression gin and then your big expression maybe with these set weight we just discussed but in this case it's good in terms of storage less thing to
Starting point is 00:27:54 keep in memory but it might be bad in terms of I don't know sometimes you need to deal with TS vector directly right and then if you have only expression I don't know. Sometimes you need to deal with TS vector directly, right? And if you have only expressional index, then you need to construct this expression once again to deal with it, right?
Starting point is 00:28:16 So I don't like the expression-index approach somehow, because it limits what you can do with such records. So if you don't store it, you need to build it on the fly to deal with it, right? To additionally somehow analyze it or so. So I always prefer to store them, although they can consume a lot of space, with TOAST and so on. Sometimes I just put them in a separate table, understanding that then I will need to join.
Starting point is 00:28:44 For me, it's easier to create trigger and use regular column, right? Sure, sure. But if you think store it or not to store it, it's good to think what you will do with it. And if you want to deal with TS
Starting point is 00:28:59 vector in different ways, then keeping it stored only in the index itself. It's maybe not enough. It's good to put it to table. But to put it to the same table as a column and rely on toasting
Starting point is 00:29:15 and so on, or maybe to allocate a different table with a one-to-one relationship, it's a good question. Again, it depends on the workloads. Right. So, okay, this we covered. Can you remind me of other use cases you mentioned?
Starting point is 00:29:34 Well, I think it's all, I've considered this kind of one use case, but I guess there are all the complexities that you may or may not want to support. One that I didn't mention, but probably we care about, is: what if somebody spells eggplant with one G? Yeah, typos. Yes. For this, there is pg_trgm. Trigrams?
Starting point is 00:29:57 Three grams, yeah. Extension, which actually also uses RD3. GIN, actually. GIN these days. RD3, forget about it. Forget about GIST. Very rarely used. So GIN, right? So we have some text. Basically, 3G should
Starting point is 00:30:13 work for words. We have word, and we suspect maybe there is a typo there. So it appends one space on one side and two spaces to different side, and then just split three words, three letters, three letters, three letters, right?
Starting point is 00:30:30 And then we have array again. Vector. Vector, right. And then the question is which is the most the closest. Closest means like most of overlapping. Most of members of array
Starting point is 00:30:45 are the same. Might be not all of them. And if just one letter typo, it means three members of this vector will be different. Ah, nice.
Starting point is 00:31:01 Yeah. It wouldn't include EGG. It wouldn't include GGP, and is that they're the only two? In this case, I think possibly just they're the only two. So three grams, we're just shifting, shifting, shifting also, right? We don't just split, we're shifting. So we start position, first is, for space then two letters then second first letter
Starting point is 00:31:26 If you, for example, mixed up the positions of two letters, just swapped them, the overlap is also huge. This is the interesting idea behind trigrams. And again, we use either the RD-tree, nobody loves it, or GIN. We use GIN, so we can find the arrays which are closest, where the overlap is higher. That's how we find words which are actually present in our document dataset. And then we say: you probably thought about this word. One thing: in the past, I remember we maintained a list of words with usage counts, probably using some statistics
Starting point is 00:32:27 provided by full-text search. I don't remember details, but you can build the list of words with stats and then you can store it in a table and then use three grams on top of it. And you need the triggers to maintain this, or you need to refresh it periodically. For example, if you just imported a lot of documents which use new words, this list of words is already outdated. You need to rebuild it. And then this trigram approach worked. You have input words, you check the table, and it works. Now I think we don't need it, right?
Starting point is 00:33:05 It's automatically maintained, this list of words. Or we just don't care, build an index, and then take the whole input. That's how I've seen it used, yeah. So there are several ways to use it, and maintaining this additional table probably still makes sense, right?
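The built-in ts_stat function does this kind of word statistics; posts and tsv are hypothetical names:

```sql
-- Returns each lexeme, the number of documents it appears in,
-- and its total number of occurrences.
SELECT word, ndoc, nentry
FROM ts_stat('SELECT tsv FROM posts')
ORDER BY ndoc DESC
LIMIT 20;
```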
Starting point is 00:33:23 List of words. Do you think it would then speed up queries, or what do you think the benefit is? You lose speed here probably, right? But accuracy is good. Accuracy, yeah. This is great, but I think the conversation
Starting point is 00:33:40 you mentioned earlier, like the comparison of full-text search versus semantic search in the current climate. And what I wanted to introduce was whether the use case suits false positives or suits false negatives better. I think it's really difficult to come up with a solution that does neither, but often you see it slightly biased towards one. Like Google, for example: what PageRank was really good at was making sure it didn't miss a website that was clearly very relevant to your search term, and so it was very, very good at avoiding false negatives. But you did often, especially in the earlier days, get quite a lot of false positives, like you'd get
Starting point is 00:34:23 articles that didn't match your intention, at least, even if they did match the words. But in the recipe example, maybe we don't care so much about that. If you miss one great recipe but you get 17 perfectly matched ones, that's still a good result. Maybe you'd rather have it that way around on the trade-off. I don't know how you see that and whether that's relevant. I see this as many different small use cases.
Starting point is 00:34:49 For example, first of all, I remember your different use cases. This is what I mentioned as well. We want to take into account likes, comments, timestamps, everything. It gives us the ability to build a very big query, and then we think: okay, but our single full-text search GIN index is not enough, right? Sometimes we can use btree_gist, for example, right?
Starting point is 00:35:16 Which allows us combining these arrays of text and regular numbers or timestamps in a single index. And this is good, right? And we can have single index scan. Also, there is B3Gene. Also interesting thing, right? Also, there is RAM, which you need to install as a separate extension. But these two, B3Gest and B3Gene, they are included with regular
Starting point is 00:35:41 contrib module. So with any Postgres it's available. But in some cases... Oh, I forgot to mention, Postgres also supports, I think, the function websearch_to_tsquery or something like this. It supports
Starting point is 00:35:57 some small language of phrase search and inverted search. You can exclude, user can say, I want everything, but this word should not be there. Just minus, right? Or put in double quotes for exact match, right? This all great means that user can control and in regular manner, Google also supports things like that, right? So it means that you want, for example, to see exact phrase mentioning,
Starting point is 00:36:28 you know it's there and you just can do it. But then similarity search. Two big use cases. For example, you mentioned synonyms, right? Someone doesn't use the word eggplant, there's another word meaning the same thing. Aubergine. Honestly, that's what we call them.
Starting point is 00:36:51 Aubergine. A-U-B-E-R. I'm not going to try the rest. I haven't heard it. Cool. So if you want a synonym search, full text search supports it. But you need to maintain it. You need to maintain the dictionary of synonyms.
Starting point is 00:37:07 And the normalization process will do it automatically: all synonyms will be folded in our tsvector and tsquery. It will be the one word we chose, so we can maintain synonyms. It's easy. Not easy, I mean, it's straightforward. It's not easy because it requires the effort of maintaining synonyms, right? But on the other hand, if we use pgvector,
Starting point is 00:37:32 it probably puts both words in this highly dimensional space very close, right? Yeah. Because meaning is the same. Hmm. Right. Interesting. So maybe semantic search provided by PGgVector is better here, right? Okay.
Starting point is 00:37:51 Another thing is, for example, "something is not working" versus "something is working". "Not" is a stop word, right? What are we trying to search? Something is not working. For example, my car is not working. Maybe I should put "not working" in double quotes, so "not" is definitely there, for an exact phrase search. Or maybe I just need to use pgvector,
Starting point is 00:38:16 because it definitely will distinguish semantically that not working and working very fine in this highly dimensional space. So similarity is not good. so similarity is not good right distance is not good and if we are trying to find my car is not working we won't find documents if they will be put not high right in terms of ranking by similarity. But also pgVector... I think that depends. I think that's contextually important and interesting because
Starting point is 00:38:50 let's say this was a car forum. How many posts are you going to find about cars working great? Realistically, in the data set. Well, it was a bad example. But in general, inverted,
Starting point is 00:39:10 since how full-text search works in Postgres, it will remove not as a stop word. Yeah. And it's bad sometimes, right? Because it's opposite meaning, right? Yeah, makes sense. I didn't realize that. We might think about similar examples where removal of stop word basically leads to very bad ranking or filtering also. Yeah.
Starting point is 00:39:33 Like one that comes up quite often is film titles, or band titles. If you put "the" in front of it, it could be a completely different band. Or... English has a lot of such examples. For example, "go on": "go" and "on", taken separately, it's just... but "go on" means continue, right? Yeah, the meaning is very different. If you just look at the words separately, you don't get this meaning at all. And there are many such phrases, right? In many languages. And semantic search
Starting point is 00:40:14 will capture the meaning and put it to vector, right? And then we use approximate nearest neighbor search, ANN search. But it lacks many capabilities, like, for example, exact phrase search or inverted word search
Starting point is 00:40:30 or this categorization I mentioned. What do you think? It's interesting. Should we use only semantic search these days? I don't think so, but only because I'm thinking of some quite boring use cases. For example, I quite often think about software-as-a-service applications. Let's say you might be looking up a customer, and you might want to look them up
Starting point is 00:40:58 by email address, or by name, or by... there are a few different things you might want to look them up by, and you might want to provide your users with a single search bar to look those up. Or you can have tags, for example. Based on what I just described, overall, these vectors, or arrays of text, text arrays, you can use GIN for tags, putting all the text for a document into a single value, a single column, right? All the text, denormalizing them. I first did it, by the way, using GiST in 2008.
Starting point is 00:41:35 It was the topic of my first talk in the US I presented in 2008. Can you imagine? And we were building this. We put what was social media and all tags were stored in single column instead of EAV approach when you have separately words like tags, terms, can be phrases. One table documents another table
Starting point is 00:42:02 and between them a relationship table, right? And then you need to have three-table joins, like two joins, all the time. In terms of storage and speed of search, it was terrible. It remains terrible, the EAV, Entity-Attribute-Value, approach. But here you can put in all the tags, and full-text search can provide you exact search by tags. Versus if you put everything into a vector, probably how we would do it: we probably would just take the author, title, body, and then we have tags, and we append, probably, "tags", colon, this, comma, this, comma, this, right? And rely on our vector search, hoping that it will be used somehow.
Starting point is 00:42:53 But we lose control, right? If it's full-text search, or we have multiple indexes, we can reliably find the proper documents which definitely contain some tags. This might be my answer: I think I care a lot about reducing false negatives. Like, it would be really bad if you searched the exact name of a customer in my application and we didn't show it back to you. So sometimes the problem with the index types for vector search is they can have false negatives: results that should be the correct answer turn out to not be returned in the answer.
Starting point is 00:43:35 And also, we remember that vector search, since the dimensionality is super high, is approximate nearest neighbors. So that's what I'm trying to explain. But full-text search is k-nearest neighbors, or something like filtering. It's exact. Yes. Precise. So let me rephrase you. If we want to present the user just a single input with a button, fine,
Starting point is 00:44:02 right? Probably similarity search is enough, right? Or if we want to have some of this language, like exclude this word, or exact phrase search in double quotes, or we have an advanced search form, like choose the author, a date range, or something, similarity search is not enough, and probably we will need not only full-text search but also, like, faceted search with additional filtering. Some things we can put into the full-text search index, as I described, but not all of them. And it's interesting that now we have similarity search, now we have full-text search, typo correction basically, but it requires effort, as I said. Other indexes, B-trees, don't go away at all, right?
Starting point is 00:44:52 And then we can probably build some system which can be very powerful in terms of what you can do with it. And if we consider a particular example: for example, I know you use PostgreSQL.org and Google search, right? Two search engines, right? And PostgreSQL.org search is remarkably weak. Yeah, it's not great, is it? It seems it doesn't use full-text search, or maybe it uses it in some strange form, because I don't see all the power full-text search provides. Maybe we could look into the details,
Starting point is 00:45:35 because I guess the source code should be available. But in general, it can be improved, and things like just using Postgres' own full-text search would improve it, and we could find more things. But if pgvector were installed, it would be even better.
Starting point is 00:45:55 Because we could combine semantic search and full-text search. I think, for me, full-text search is good in terms of these capabilities, like this language, like phrase search and negative exclusion, and also categories. Yes. So, speaking as someone who built a lot of systems using full-text search and recently built one system using pgvector and similarity search based on semantic similarity, right? Semantic similarity.
Starting point is 00:46:31 I now think it's probably the proper time to start adding full-text search capabilities also, and it will be tricky and interesting to properly combine them. I saw simple examples, trivial examples of how to combine them. Let's find 50 documents using semantic search, 50 documents using full-text search. That's it. Or let's find 1,000 using full-text search, then rerank them using similarity and leave only 100. And then define some additional ranking and leave only 10. Quite weak examples, because I have no idea how pagination will work, right? And also, what about speed?
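A sketch of the naive combination just described; the table, columns, limits, and the final merge are all hypothetical:

```sql
-- 50 candidates from each method, then a trivial merge (UNION de-duplicates)
WITH semantic AS (
    SELECT id
    FROM docs
    ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
    LIMIT 50
), fts AS (
    SELECT id
    FROM docs
    WHERE tsv @@ websearch_to_tsquery('english', 'eggplant')
    ORDER BY ts_rank(tsv, websearch_to_tsquery('english', 'eggplant')) DESC
    LIMIT 50
)
SELECT id FROM semantic
UNION
SELECT id FROM fts;
```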
Starting point is 00:47:06 So for me, it remains an open question how to look back at the old functionality provided by full-text search and bring it to similarity search. So it's a super interesting topic. It depends on the use case, right? So I'm thinking about your bot for Postgres.ai. You don't need to paginate, right? You're returning the results in a chat-based interface. It varies.
Starting point is 00:47:34 So it's an interesting, depending on the use case, maybe you don't always have the same constraints we're used to having via web search type interfaces. Well, I think yes and no we don't need it now but i i already see building the bot i follow my own behavior using google when using google yeah and there i definitely sometimes go to page number two page number three but not page 20 like pagination maybe it's not that big a deal if you're only looking at the first three pages. You're probably right.
Starting point is 00:48:07 But speed will suffer. But anyway, I think I should combine full-text, start combining with full-text search, because sometimes I want exact search: I know the exact phrase or function name. Sometimes I want to exclude something. I mean, the bot already wants to exclude something. It happens.
Starting point is 00:48:27 And also, I think about two-phase search, when we search for many more items. Maybe indeed pagination is not needed. I think about the idea of searching a lot of items, like 1,000 or 2,000 items, but then analyzing snippets, like in Google, using, again, automated analysis with an LLM, and leaving only a few. And for those, we need to expand, because I always
Starting point is 00:48:52 open every relevant-looking snippet. I just click it and open a new tab and consume the whole document, right? But we cannot consume whole documents for 10 or 20 items. It's interesting there. But full-text search, I think, is about to return, because, again, sometimes I want
Starting point is 00:49:12 to search by title reliably, I know it was mentioned in the title or something, or by authors, also the author. Yeah, makes sense. Exciting times. Maybe it was
Starting point is 00:49:26 slightly messy in terms of content structure this time, but I still have questions, and I don't like
Starting point is 00:49:35 the examples like: let's just take 50 semantic, 50 full-text search. Do you want to hear from people? Should people let us know how they're
Starting point is 00:49:41 currently doing it? Yeah, maybe there are better ideas already. I think it's a very actively evolving area, right? And maybe there will be more systems where capabilities will be combined, because we know, for example, Google combines a lot, including similarity and full-text things.
Starting point is 00:50:08 Yeah, so if you're doing this in Postgres already, combining the two techniques, let us know. Yeah, how exactly, what are the details. It's super... Yeah, yeah. Good. Nice one. Cheers, Nikolay. Thank you.
