Postgres FM - Full text search
Episode Date: May 24, 2024

Nikolay and Michael discuss full text search in Postgres — some of the history, some of the features, and whether it now makes sense to try to replace or combine it with semantic search.

Here are some links to things they mentioned:

Full Text Search https://www.postgresql.org/docs/current/textsearch.html
tsearch2 https://www.postgresql.org/docs/9.6/tsearch2.html
Dictionaries https://www.postgresql.org/docs/current/textsearch-dictionaries.html
RUM index https://github.com/postgrespro/rum
Okapi BM25 https://en.wikipedia.org/wiki/Okapi_BM25
tf–idf https://en.wikipedia.org/wiki/Tf%E2%80%93idf
unaccent https://www.postgresql.org/docs/current/unaccent.html
tsvector and tsquery https://www.postgresql.org/docs/current/datatype-textsearch.html
GiST indexes https://www.postgresql.org/docs/current/gist.html
GIN indexes https://www.postgresql.org/docs/current/gin.html
Controlling Text Search (including setweight function) https://www.postgresql.org/docs/current/textsearch-controls.html
pg_trgm https://www.postgresql.org/docs/current/pgtrgm.html
btree_gist https://www.postgresql.org/docs/current/btree-gist.html
btree_gin https://www.postgresql.org/docs/current/btree-gin.html
websearch_to_tsquery https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
pgvector https://github.com/pgvector/pgvector
Our previous episode on search https://postgres.fm/episodes/search

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello, hello, this is PostgresFM, episode number 98, right? 98?
Yes.
Yeah, and this is Nikolay Samokhvalov, founder of Postgres.ai, and my co-host is Michael
Christofides, founder of pgMustard. Hi, Michael.
Wow, that's the first time you've attempted my last name, and you did great.
Okay, yeah. Thank you so much. I practiced all these years.
So what's the
topic today?
Yeah, so this was your suggestion. I think it's a great
one. We've been meaning to come back to it for a
while now. It's full-text search.
Why did you want to talk about it?
I think it was in our list
for a very long time.
Maybe from the very beginning
because it's a huge topic.
And when we built that list, I think PgVector already existed, but only a few people paid
attention to it, unlike today.
So since we didn't manage to discuss full-text search before this AI boom started last year,
I think we should maybe compare it a little bit.
And since I used full-text search a lot in the past,
I still remember many features it offers in Postgres.
And I think it's interesting to discuss not only the usual comparison,
full-text search in Postgres versus Elastic,
but also full-text search embedded in Postgres versus the pgvector extension and similarity
search. Or maybe not versus, maybe together. And this is interesting. But I think times are changing
and the evolution is very interesting. And today people probably pay less attention to full-text search,
but they at least should know what capabilities it can offer.
And let's maybe start from use cases
and discuss the functions and features FTS,
full-text search, provides in Postgres.
What do you think?
Or maybe we can talk about some historical aspects.
I also can provide some overview
of what was happening in Postgres
and not only in Postgres.
What do you think?
Features, use cases, history, what's better?
Actually, I don't know the history that well, so it would be awesome to hear a tiny bit about that, and then I'd
love to jump onto use cases, where you see it used most commonly, that kind of thing.
Yeah, actually, we can combine them, I know how we can do it. So if, for example, we roll back to the first web era,
web 1.0, so to speak, like 25 years ago,
and when Google was created.
Because Google is kind of,
it's a search engine originally, right?
And it solves the problem of finding proper texts,
most relevant texts, right?
And at that time,
before Google, other search engines were already created,
obviously. And Postgres full-text search capabilities were originally called tsearch2. Actually,
I remember this contrib module I was seeing around 2005, maybe, when I first started working
with Postgres, created by Oleg Bartunov and Teodor Sigaev,
and then Alexander Korotkov joined them.
They created it because they were building a search engine originally.
They helped build quite a big search engine.
And then they incorporated many of their works into Postgres,
and this became tsearch2.
Contrib module,
which later was merged to the core,
and you don't need to install extension,
so it's already in core, right?
But the interesting thing is that
it was before social networks and so on.
But if we look at how it's built,
it has a lot of things.
For example, you can exclude some words.
And by default, it excludes some words.
This is done by a dictionary, which in Postgres full-text search terminology is a program.
It can process input, somehow transforming it. So stop words, that's just the exclusion of various articles,
for example, and so on, which probably don't carry any meaning. So we can just exclude them.
That's it. Then we have Snowball, which is a stemmer. So it removes the endings. And it's
super fast and super simple. So, for example, runs and run,
it's kind of the same word, so the s should be removed, right? And then as a more full-fledged
thing, we can ask to use some real dictionary, usually Ispell, which is also used
for the thesaurus. It can give a lot of things,
but it's usually slower and requires some memory.
And all of these things are configurable in full-text search,
and they were configurable in tsearch2 originally.
And it's great.
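To make the pipeline just described concrete, here is a toy Python sketch of stop-word removal plus a deliberately naive suffix stripper. This is not the real Snowball stemmer or a Postgres dictionary, just an illustration of the idea that "runs" and "run" normalize to the same lexeme:

```python
# Toy normalization pipeline: stop-word removal + naive suffix stripping.
# NOT the real Snowball stemmer or a Postgres dictionary -- just the idea.

STOP_WORDS = {"a", "an", "the", "is", "with", "and"}

def naive_stem(word: str) -> str:
    """Crudely strip a few common English endings."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text: str) -> list[str]:
    """Lowercase, drop stop words, stem what remains."""
    words = text.lower().split()
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

print(normalize("The dog runs with an eggplant"))  # ['dog', 'run', 'eggplant']
```

In real Postgres the equivalent steps are performed by the dictionaries configured in the text search configuration, and the stemming rules are far more careful than this.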
But if you think about the task, the use case,
what we want to try to solve,
we want to present the user
with the information which is most relevant. What is most relevant? The user has some query: I want
a recipe with, I don't know, some eggplant, for example. Say the user types eggplant, and we have, for example, a food recipe website or some knowledge base. Then, just by
removing stop words and normalizing words, we can filter, right? But filtering is just one thing. We also need
to somehow order, maybe, right, and present only the most relevant.
What is the most relevant?
Maybe these words are most frequently used
in these documents, right?
Like repeated many times, for example.
And this is quite basic idea.
Like if the word is used many times,
it's considered most relevant.
Now, a confession: I never used this. Why? Because
I'm from a different era. I'm from web 2.0, where most relevant meant something else, because
it was social networks. There, time also matters a lot. If it's a fresh document, like a fresh
blog post or something, it matters a lot. But Postgres full-text search lacks this.
It's hard to build together.
There is a special index, RUM.
I mentioned it many times already.
Maybe I need to finally revisit it
and think whether it still can be used,
because if you're on self-managed Postgres,
this type of index can be installed
as an additional extension.
But original full-text search can order only by this ts_rank,
the function is called ts_rank.
And there is another function, ts_rank_cd, which also considers positions
and distance between words,
cover density it's called,
so the closer the words are,
the better the ranking score is, right?
But for me, it was always nonsense to use only these word positions and so on.
I couldn't imagine how I can use it because I always needed to take into account time.
And also sometimes things like likes, reactions, comments.
If this post is heavily discussed, it's very important, right?
But back to history: we know there is BM25,
and also there is this tf-idf.
The idea: tf is term frequency, idf is inverse document frequency, right?
If some document has this term, like a word, mentioned many times,
it's good for this document, right? And if, in general, this term is mentioned in our whole database, our whole dataset, not very frequently,
it's also good for our document, because it means the term is exceptional in terms of frequency, right?
So that's why it's inverse document frequency.
So it means that for such a thing,
we need to analyze the whole dataset.
Postgres full-text search doesn't do it.
It's quite hard to maintain such a thing,
but I think there are some attempts to have it in Postgres,
maybe some extensions exist.
And this is what Elastic uses, right?
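The tf-idf idea described above can be sketched in a few lines of Python. This is a toy over a made-up three-document corpus, not what Postgres or Elastic actually compute (Elastic's BM25 adds length normalization and saturation on top):

```python
# Toy tf-idf: term frequency in the document times log-inverse of how many
# documents contain the term. A sketch of the idea only.
import math

docs = [
    ["eggplant", "curry", "recipe"],
    ["chicken", "curry", "recipe"],
    ["eggplant", "parmesan", "eggplant"],
]

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    tf = doc.count(term) / len(doc)           # how often the term appears here
    df = sum(1 for d in corpus if term in d)  # how many docs contain it at all
    idf = math.log(len(corpus) / df)          # rarer across the corpus -> higher
    return tf * idf

# "eggplant" appears twice in doc 2, and is rarer than "recipe" corpus-wide,
# so doc 2 scores highest for it.
scores = [tf_idf("eggplant", d, docs) for d in docs]
print(scores.index(max(scores)))  # 2
```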
But there is another also approach, and back to
25 years ago, Google, why Google?
Why did Google win?
Because it had PageRank.
If we think deeply about this,
it's already from
Web 2.0 idea,
because PageRank, it
takes into account
kind of social network of documents, I would say.
It takes into account links, right?
So interactions.
We are not alone.
And this idea is super powerful.
I'm curious, are there any extensions which implement this in Postgres and how it could be implemented?
This is super interesting.
But this is a big power. If
we have links,
we value
this document more. And I think
there are good
algorithms how to solve it with
billions of documents, of course.
There are papers written,
which anyone who studied search engine
theory knows, of course.
There is a course from Stanford,
Mining Massive Datasets,
very good material where I studied this page rank
and other algorithms closer to machine learning
on big datasets.
So the idea is links, right?
You multiply by some matrices
and you get a good index you can use.
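The "multiply by some matrices" idea can be shown with a minimal power-iteration PageRank over a three-page toy graph. This is a sketch of the classic algorithm, nothing like Google's production system, and it assumes every page has at least one outgoing link:

```python
# Minimal PageRank via power iteration. Rank flows along links; a damping
# factor models the random surfer. Assumes no dangling pages (every page
# has outgoing links).

def pagerank(links: dict[int, list[int]],
             damping: float = 0.85, iters: int = 50) -> list[float]:
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)   # split rank among targets
            for target in outgoing:
                new[target] += damping * share
        rank = new
    return rank

# Page 2 is linked to by both other pages, so it ends up ranked highest.
graph = {0: [1, 2], 1: [2], 2: [0]}
ranks = pagerank(graph)
print(ranks.index(max(ranks)))  # 2
```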
Again, Postgres doesn't provide this. And Elastic doesn't provide
PageRank, I guess. But it's good because we start taking into
account not only word positions.
Full-text search in Postgres is only word position for a single document
and that's it. But
since full-text search in Postgres is inside a relational database system,
we can have other columns, right? And we can have indexes on them. So for example,
likes, or something like the timestamp of when the document was created. We can construct SQL
which takes into account all these things. It will just be slow, maybe, right,
because the idea is that the fastest query is always an index scan,
or an index-only scan, right?
But when we have full-text search, we have other columns,
and we probably have B-trees on them.
The combination is a problem, right?
This is the problem.
But there are ways you can solve it;
let's maybe discuss that a little bit later,
I mean, btree_gist and other extensions.
But originally, the combination of different factors
in the ranking task, this is the problem, right?
Later, Google and others,
they started taking into account social network activities, if some document is mentioned on Twitter, for example, or on Facebook.
This happened like 10 or 15 years ago. Everyone who dealt with search engine optimization
knows that if you have a lot of likes in social networks,
your document goes up in positions. But if we deal with our own system, we can take into account
interactions like that, because it's just some data we store, and that's it. And today, it's usually not that
bad if you don't have tens of billions of records, just millions or tens or hundreds of millions.
It's not a huge volume of data.
So you can rely on multiple indexes and let Postgres decide what to choose based
on statistics. This is very similar to what we discussed, remember?
Yeah.
Yeah, very similar.
Because it reminds me of the problems with full-text search I had
trying to incorporate it into social media projects.
Because you have two indexes to choose from.
Because you cannot use them together.
Well, you can.
You can have a bitmap index scan, but it's already slower.
But most likely Postgres will choose just one index, for example, ordering by time,
or the full-text search one, based on statistics.
And then on the fly, it will apply either sorting in memory
or filtering in memory, right?
Yeah, exactly.
That's where the inefficiency comes in.
Yeah, yeah.
So this is like a historical overview.
We had simple stop words, stemming, dictionaries.
It's not simple, by the way.
It's a lot of functionality.
But ranking is based on
just our
opinion about
how relevant this document is to your
query, not considering
relations or other
things. The second thing is, I
think, PageRank, and Postgres doesn't have it.
And the third thing is
taking into account various
factors like time and likes and so on.
But also there is similarity search now.
We know search engines have used
it for a long time, right?
So...
Actually, I want to go back to... I think your
example is remarkably
useful. So the example you gave
of searching a site of
recipes for eggplant, let's add to
that. Maybe we're not super sophisticated and we type "an eggplant". So, for
example, the "an" would count as a stop word. We don't
want recipes that say "an" a lot to rank highly. So that's getting rid of the stop words. We might want, for example, to do
something slightly more: we might want to rank recipes that list eggplant in the title, we might
want those to score higher than ones where it's just listed in the document more times. So that's
an interesting additional challenge. Secondly, it might be, let's say, a user-generated
recipe site, and we might want to factor in how many likes a recipe has got,
or how recent it is, or something like that.
So there are all these... even in the simplest example you can think of
these days, it can get...
Oh, I've got one more.
You might also, if it's user-generated content,
you could have British people or like people using aubergine
instead of eggplant, and you also want those to rank higher.
Yes, exactly.
So synonyms, it's already a bridge to discussing full-text search versus similarity.
Let's keep it as the last item, right?
But great attempt to zoom.
This is good.
Let's do it.
Or diving deeper.
So let's briefly mention how full-text search works in Postgres.
So there's this pre-processing we just discussed, the removal of stop words,
and you can control it.
The good thing about full-text search in Postgres is
a lot of customization options, a lot.
For example, you can control the stop word list and stop considering "an" a
stop word, right?
Or you can, for example, remove accents.
There is a special extension, unaccent,
right? Additional
preprocessing. So like in
French, for example, with lots of accents...
And there is support for multiple languages
also. That's also good.
So you can start considering
additional stop words, for example:
if some words are mentioned
too often, like almost all documents
have them, let's remove them, because
it doesn't make sense to use them, right?
Then stemming or dictionary
processing, Ispell dictionary processing.
So in the end we have
a set of normalized words,
right, and then
we build an array.
It's called tsvector.
Again, also a vector, by the way.
So it's a set
of words with positions,
already
in a normalized form.
But basically it's just an array of
texts.
Like words, or lexemes, or whatever, yeah.
Yeah. For example, I also forgot
to mention hyphenated words. I think you can also control this. As I remember, from many years ago,
I think you can choose either to consider them as a single word or to split them into multiple
words, like two words basically. Yeah, first and second.
Or together: you can put the pair of separate words plus the whole hyphenated word,
three entries in total, right?
So that if people search one or the other, they still get that
result back, right? For example, shift-left testing: if it's hyphenated, you can put both
"shift" and "left" and the whole "shift-left" as three separate entries in your
tsvector.
Yeah.
You can control how Postgres will do it.
When building the tsvector,
also,
you can concatenate tsvectors.
So data can come from different sources.
For example,
we have title,
we have body, subject or body, like, for example, if it's email; or if it's a blog post, title and content, text,
how you name it doesn't matter. And you can concatenate two tsvectors, or you can
concatenate the texts before that, with an additional space, right, and then build the tsvector.
So a tsvector is just an array of texts.
And you can explicitly store it, consuming space.
Or you can rely on a function.
Because then we need to create an index on it.
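The shape of a tsvector, normalized lexemes with the positions where they occurred, can be modeled with a toy Python function. This mimics the look of `to_tsvector` output only; it is not Postgres's internal representation, and the stop-word list here is made up:

```python
# Toy model of what a tsvector holds: each non-stop word mapped to the
# positions where it occurred. Mimics the shape of to_tsvector output,
# not Postgres internals.

STOP_WORDS = {"a", "an", "the", "is"}

def to_tsvector_toy(text: str) -> dict[str, list[int]]:
    vector: dict[str, list[int]] = {}
    for position, word in enumerate(text.lower().split(), start=1):
        if word in STOP_WORDS:
            continue  # stop words are dropped, but positions are still counted
        vector.setdefault(word, []).append(position)
    return vector

print(to_tsvector_toy("The cat and the cat sat"))
# {'cat': [2, 5], 'and': [3], 'sat': [6]}
```

Note how repeated words collapse into one entry with multiple positions, which is what functions like ts_rank_cd consult for word proximity.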
Originally, it was GiST. I think it was the work of Hellerstein in Berkeley very long ago, in the early 90s maybe.
And it was not finished, but the guys I mentioned, Bartunov and Sigaev, just saw this directory in the Postgres source code, as they say, and understood this is exactly what they needed.
There's also an additional thing.
So GiST is a generalized search tree.
It's similar to a B-tree, but it's abstract.
You can use it for many data types and operators. For a B-tree, it's numbers and the operators less than, greater than, and equals,
and just one axis.
An R-tree, it's two axes, and then you can have it for arrays as well.
In this case, we talk about operators like intersects, includes, contains,
is contained by, and overlaps, right? Overlaps is usually the && operator.
You can actually redefine these and use your own symbols, right? Actually, first you
define a function, then an operator, since Postgres is very extendable. But historically it's @@, or <@, or @>.
So for these operators,
you can define how to
deal with them in terms of this
B-tree-like index structure.
So it's balanced,
with many, many children
on each node. And
balancing is automated. Everything is automated.
So when these folks found
Hellerstein's work, it was not finished, and WAL
was not supported. They worked on that and added WAL support. And then an additional thing, we won't
go deep because of lack of time, and I also forgot everything, but there's also a thing
called a signature tree. So to put our tsvectors
into this structure, this additional concept of a signature tree, you need to build signatures.
These signatures are placed into the leaves of this tree. And then search is
very B-tree-like, but with additional exclusion, which is what makes the GiST option not popular at all these days.
A recheck is needed because the search is not exact.
I've heard these called lossy index types.
Right, right.
So if you check, this is what you do in pgMustard, right?
EXPLAIN, ANALYZE, BUFFERS, right?
You check and see in the plan that a recheck happened.
If GiST was used, a recheck happened.
So let me close a gap.
The query is also processed to build a vector, right?
But it's called a tsquery,
a different data type,
but it's similar, also a bunch of words, also with preprocessing.
You can have different preprocessing.
You decide how to preprocess, right?
But usually it's the same preprocessing for the tsquery as for the tsvector.
You can also remove stop words, normalize words, and so on.
And then our task is: we need to find documents which include everything we asked for in our query.
And rank them.
Yeah, and rank also additionally.
So there is an approach to put such values into trees, into leaves. And then it searches similarly to
a B-tree, but instead of checking
greater than or less than to decide where to go, it checks
if this vector is contained in that vector, right? I mean, this array is contained in
that array, and that's how you decide to go left or right, something like this. So it's called a Russian doll tree, because it's like a Russian doll, right?
Like somehow.
An RD-tree.
So by the way,
this term RD-tree is also
in Berkeley papers somehow.
So I think it was invented
before Bartunov and Sigaev's work.
Probably, maybe it influenced them,
I don't know.
But then like,
obviously it's slow in terms of search
for large volumes of data because of
this recheck.
And this is not how Google and others
worked in terms of,
not even Google. Google, we know, PageRank
is a bigger thing. But eventually,
Bartunov, Sigaev, and then
Korotkov, they created GIN.
GIN is a Generalized Inverted Index,
where we have a list of
terms, our words,
and for each,
a list of
documents. Right, all the
tuple IDs, the
ctids, I think, are
stored there, but
not as plain lists. There are actually B-trees there
for faster search.
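The inverted-index idea behind GIN, each term pointing to the documents that contain it, can be sketched in a few lines of Python. A toy only: real GIN stores tuple pointers and uses B-trees for both the keys and large posting lists:

```python
# Sketch of the inverted-index idea behind GIN: term -> list of doc IDs,
# with an AND query done by intersecting the lists.
from functools import reduce

docs = {
    1: ["eggplant", "curry"],
    2: ["eggplant", "parmesan"],
    3: ["chicken", "curry"],
}

# Build term -> list of doc IDs (real GIN keeps trees of tuple pointers).
index: dict[str, list[int]] = {}
for doc_id, terms in docs.items():
    for term in set(terms):
        index.setdefault(term, []).append(doc_id)

def search(*terms: str) -> set[int]:
    """Documents containing ALL the given terms."""
    return reduce(set.intersection, (set(index.get(t, [])) for t in terms))

print(sorted(search("eggplant")))           # [1, 2]
print(sorted(search("eggplant", "curry")))  # [1]
```

This is why lookups are fast: each query term goes straight to its posting list, and only those lists are combined.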
So it means that GiST is good at only one thing: update speed.
GiST versus GIN, yeah.
Right, right.
But GIN, it was improved.
There is also the fastupdate option.
But anyway, the default option for us is GIN, right?
Yeah.
And comparing the tsquery and tsvector, GIN is good, like, search is fast.
But order by ts_rank, right?
That's probably not good.
So back to the comparison,
back to these use cases you started mentioning.
Let's think about them.
First, you said, or it was second,
let's distinguish subject and body, right?
For example.
Yeah, should we rank,
should the presence of an ingredient in it or...
Word.
Yeah, words.
Because we know that people aren't going to be searching
in our search bar for random things.
Maybe there's a few different things they might search by.
The most common is probably an ingredient. Should recipes that list that in their title
rank higher, even if they only mention it a couple of times later on? I feel
like, personally, I'd be expecting those to rank higher than ones where it was a small
ingredient, or I only needed a
small amount of it. Maybe the amount of the ingredient matters a lot, maybe that's easier.
Right, right. So it's definitely a good thing to want, right? If a word is
in the title, it means that maybe the whole article is about this, right? If eggplant is in the title, the article pays
more attention to it. Makes sense.
So for this, there are
two bits which you can use.
Two bits means
four options, right?
So there's this function setweight,
and when you combine
information from multiple sources,
from a title column and a content
column, for example, a body column,
you can say setweight A to the first one, setweight B to the second one.
And there are only A, B, C, D, uppercase.
Because, again, only two bits are used for this.
And then later in your query, you can also say, I pay attention to this or to that. Actually,
you can use it for filtering as well, and this is what I did. But originally it was created for
ranking. I don't remember the details, but the setweight function, you can search the documentation, it
should be there, and you will find out how to use it. It's a very strange concept. Why only four?
Because of two bits only.
They had only two bits to spend on it.
Maybe there should be more.
But I used it sometimes, like, when we search only in the subject, right?
And I say, okay, only category A.
It's called category, maybe.
It's embedded inside the tsvector.
So you can skip using it.
It's there, but you can just ignore it in search. But
at a different time, you can say, I want to search
only category A. It means
only title search. You don't
need to build two tsvectors
separately. However, you could, right?
Got it. So, for example,
going back to our example, if
I search what's clearly an
author name, then maybe
on the application side we're doing something
quick first,
to try and categorize what people are searching for. If it's an author name, I could then send that
through to Postgres as: let's only look in this category.
Right, right. And again, it originally
was created for ranking. I just didn't use it that way, so I cannot explain how.
But documentation, of course, explains how.
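The category-filtering behavior described above can be mimicked with a toy Python model: tag each lexeme with a label (A to D) depending on its source column, then restrict a search to certain labels. In real Postgres the label lives in two bits inside the tsvector and the query side uses syntax like `word:A`; this sketch only imitates the behavior:

```python
# Toy model of setweight: tag lexemes with a label (A-D) by source column,
# then filter a search by category. Mimics Postgres behavior only; the real
# label is stored in two bits inside the tsvector.

def weighted_vector(title: str, body: str) -> list[tuple[str, str]]:
    lexemes = [(w, "A") for w in title.lower().split()]  # title -> weight A
    lexemes += [(w, "B") for w in body.lower().split()]  # body  -> weight B
    return lexemes

def matches(vector: list[tuple[str, str]], word: str,
            categories: str = "ABCD") -> bool:
    """Does the word occur in any of the allowed categories?"""
    return any(lex == word and cat in categories for lex, cat in vector)

doc = weighted_vector("Eggplant curry", "Fry the eggplant with spices")

print(matches(doc, "spices"))       # True  (search everywhere)
print(matches(doc, "spices", "A"))  # False (title-only search)
print(matches(doc, "curry", "A"))   # True
```

For ranking rather than filtering, Postgres's ts_rank accepts an array of four numeric weights, one per label, which is the originally intended use.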
So, a good thing to mention also: now we have the GENERATED ALWAYS thing, right?
Generated columns. They are stored, but Postgres maintains them automatically. So I think generating the tsvector is probably a good thing
to use together with that functionality.
So you have author, title, body of a blog post,
and then the tsvector can be generated based on some expression,
which assigns different categories to different sources of data, again, up to four,
and generates and puts the tsvector there, and then you have an index on it, right?
Or you can define a trigger. All my life, I defined triggers for this. And also, there is
another option: not to store it at all and just use an index on an expression, GIN, and then your big expression, maybe with the
setweight we just
discussed. In this case,
it's good in terms of
storage, less to
keep in memory, but
it might be bad because
sometimes you need to deal with the
tsvector
directly, right?
And if you have only an expression index, then you need to construct this expression once again to deal with it, right?
So I don't like this approach somehow, because it limits what you can do with such records.
So if you don't store it, you need to build it on the fly
to deal with it, right?
To additionally analyze it somehow, or so.
So I always prefer to store them,
although they can consume a lot of space, with TOAST and so on.
Sometimes I just put them in a separate table,
understanding that then I will need to join.
For me, it's easier to create a
trigger and use a regular column,
right? Sure, sure. But
if you think about storing it
or not storing it, it's good to think about
what you will do with it.
And if you
want to deal with the tsvector
in different
ways, then keeping it
stored only in the index
itself is maybe not enough.
It's good to
put it in a table. But whether to put
it in the same table as a column
and rely on TOAST
and so on, or to
allocate a different table
with a one-to-one relationship,
that's a good question.
Again, it depends on the workloads.
Right.
So, okay, this we covered.
Can you remind me of other use cases you mentioned?
Well, I think that's all. I've considered this kind of one use case,
but I guess there are all the complexities
that you may or may not want to support.
One that I didn't mention, but probably we care about,
is what if somebody spells eggplant with one G?
Yeah, typos.
Yes.
For this, there is pg_trgm, trigrams?
Trigrams, yeah.
An extension, which actually also uses the RD-tree.
GIN, actually. GIN these days.
The RD-tree, forget about it. Forget about GiST.
It's very rarely used.
So GIN, right?
So we have some text.
Basically, trigrams
work on words. We have a
word, and we suspect maybe there is
a typo in it.
So it appends two spaces on
one side and one space
on the other side, and then just splits it into
three letters,
three letters, three letters, right?
And then we have an array again.
A vector.
Vector, right. And then
the question is which is
the closest. Closest means the most
overlap.
Most of the
members of the arrays
are the same.
Maybe not all of them.
And if it's just a one-letter typo,
it means
only about three members of this
vector
will be different.
Ah, nice.
Yeah.
It wouldn't include EGG.
It wouldn't include GGP.
And are those the only two?
In this case, I think they're possibly the only two.
So with trigrams, we're just shifting, shifting, shifting also, right?
We don't just split, we're shifting.
So we start at the first position: spaces plus a letter, then first letter,
second letter, third letter; shift, shift; and we construct an array out of it. And then if,
for example, you just missed, forgot one letter, the overlap will still be huge.
The distance is low, the similarity is high.
If you, for example, mixed positions of two letters, just swap them, also huge.
This is an interesting idea behind trigrams.
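The padding-and-shifting just described can be sketched in Python. This follows pg_trgm's scheme of two spaces on one side and one on the other, and uses shared-over-total trigrams as the similarity score; treat it as an illustration rather than an exact reimplementation of the extension:

```python
# Sketch of pg_trgm-style trigram similarity: pad the word with spaces,
# take every 3-character window, compare sets.

def trigrams(word: str) -> set[str]:
    padded = "  " + word.lower() + " "          # two spaces before, one after
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(sorted(trigrams("cat")))                   # ['  c', ' ca', 'at ', 'cat']
print(similarity("eggplant", "egplant"))         # high: just a one-letter typo
print(similarity("eggplant", "chicken") == 0.0)  # True: nothing shared
```

A one-letter typo only disturbs a few windows, so the score stays high, which is exactly why this catches misspellings.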
And again, we use either RD3, nobody loves it, or GIN.
We use GIN so we can find the arrays which are closest, where the overlap is highest. That's how we find
words which are actually present in our document dataset. And then we say: you probably meant this word. One thing, in the past, I remember we maintained a list of words with usage counts, probably using some statistics
provided by full-text search. I don't remember the details, but you can build the list of words
with stats, and then you can store it in a table and use trigrams on top of it.
And you need triggers to maintain this, or you need to refresh it periodically.
For example, if you just imported a lot of documents which use new words, this list of words is already outdated.
You need to rebuild it.
And then this trigram approach worked:
you have input words, you check the table, and it works.
Now I think we don't need it, right?
It's automatically maintained, this list of words.
Or we just don't care and build an
index and then take whole input.
That's how I've seen it
used, yeah. So there
are several ways to use it
and maintaining this additional
table probably still makes sense, right?
List of words.
Do you think it would then speed up queries, or what do you
think the benefit is?
You lose speed here probably, right?
But accuracy is good.
Accuracy, yeah.
This is great,
but I think the conversation
you mentioned earlier, the comparison
of full-text search versus semantic
search in the current climate. What I wanted to introduce was whether the use case suits
false positives or suits false negatives better. I think it's really difficult to come up
with a solution that does neither, but often you see it slightly biased one way. Like Google,
for example: what PageRank was really good at was making sure it didn't miss a website that was
clearly very relevant to your search term, and so it was very, very good at avoiding false negatives.
But you did often, especially in the earlier days, get quite a lot of false positives, like you get
articles that didn't match your intention, at least,
even if they did match the words.
But the recipe example, maybe we don't care so much about.
If you miss one great recipe, but you get 17 perfectly matched ones,
that's still a good result.
Maybe you'd rather be that way around on the trade-off.
I don't know how you see that and whether that's relevant.
I see this as many different small use cases.
For example, first of all,
I remember your different use cases.
This is what I mentioned as well.
We want to take into account likes, comments, timestamps, everything.
It gives us the ability to build a very big query,
and then we think, okay, but our single full-text
search GIN index is not enough, right?
Sometimes we can use btree_gist, for example, right?
Which allows us to combine these arrays of text
and regular numbers or timestamps in a single index.
And this is good, right?
And we can have a single index scan.
Also, there is btree_gin.
Also an interesting thing, right?
Also, there is RUM, which you need to install as a separate extension.
But these two, btree_gist and btree_gin, are included in the regular
contrib modules.
So they're available with any Postgres.
But in some cases, I forgot to mention,
Postgres also supports,
I think, the function websearch_to_tsquery
or something like this. It supports
a small
language of phrase
search and inverted search. The
user can say, I want everything,
but this word should not be there: just a minus, right? Or put it in double quotes for an exact match,
right? This is all great. It means that the user can control it in a regular manner, and Google also supports
things like that, right? So it means that if you want, for example,
to see an exact phrase mentioned,
you know it's there, and you can just do it.
But then similarity search.
Two big use cases.
For example, you mentioned synonyms, right?
Someone doesn't use the word eggplant,
there's another word meaning the same thing.
Aubergine.
Honestly, that's what we call them.
Aubergine.
A-U-B-E-R.
I'm not going to try the rest.
I haven't heard it.
Cool.
So if you want a synonym search, full text search supports it.
But you need to maintain it.
You need to maintain the dictionary of synonyms.
And the normalization process will handle it automatically:
all synonyms will become, in our tsvector and tsquery,
one word we chose, so we can maintain synonyms.
It's easy.
Not easy.
I mean, it's straightforward.
It's not easy because it requires the effort of maintaining the synonyms, right?
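As a sketch, maintaining synonyms looks roughly like this; the dictionary and configuration names are made up, and the synonym file lives in the tsearch_data directory of the Postgres installation:

```sql
-- Assumes a file $SHAREDIR/tsearch_data/my_synonyms.syn with lines like:
--   aubergine eggplant
CREATE TEXT SEARCH DICTIONARY my_syn (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

CREATE TEXT SEARCH CONFIGURATION my_english (COPY = english);
ALTER TEXT SEARCH CONFIGURATION my_english
    ALTER MAPPING FOR asciiword WITH my_syn, english_stem;

-- Both words should now normalize to the same lexeme:
SELECT to_tsvector('my_english', 'aubergine')
       @@ to_tsquery('my_english', 'eggplant');
```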
But on the other hand, if we use pgvector,
it probably puts both words very close in this high-dimensional space, right?
Yeah.
Because meaning is the same.
Hmm.
Right.
Interesting.
So maybe semantic search provided by pgvector is better here, right?
Okay.
Another thing is, for example, something is not working versus something is working.
Not is a stop word, right?
What do we try to search?
Something is not working.
For example, my car is not working.
Maybe not working I should put in double quotes
so not is definitely there for exact phrase search.
Or maybe I just need to use pgvector,
because it will definitely distinguish semantically
that "not working" and "working" differ
in this high-dimensional space.
So similarity is not good, right?
Distance is not good. And if we are trying to find "my car is not working", we won't find documents
if they are not ranked high in terms of similarity. But also pgvector...
I think that depends. I think that's
contextually important and interesting because
let's say this was
a car forum.
How many posts are you going to find about
cars working great?
Realistically, in the data set.
Well, it was a bad example.
But in general,
given how full-text search works in Postgres,
it will remove "not" as a stop word.
Yeah.
And it's bad sometimes, right?
Because it's opposite meaning, right?
Yeah, makes sense.
I didn't realize that. We might think about similar examples where removal of a stop word basically leads to very bad ranking or filtering.
Yeah.
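The stop-word behavior is easy to see directly:

```sql
SELECT to_tsvector('english', 'my car is not working');
-- 'car':2 'work':5  ("my", "is", and "not" are dropped as English stop words)
```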
Like one that comes up quite often is film titles, or band names.
If you put "the" in front of it, it could be a completely different band. Or... English has a lot of such examples.
For example, "go on". "Go" is just going somewhere, right? But "go on"
means continue. Yeah, the meaning is very different. If you just look at the words separately, you don't get this meaning at all. And there
are many such phrases.
Right?
In many languages.
And semantic search
will capture the meaning
and put it to vector, right?
And then we use
approximate nearest neighbor search,
ANN search.
But it lacks many capabilities,
like, for example, exact phrase search
or inverted word search
or this categorization I mentioned.
What do you think?
It's interesting.
Should we use only semantic search these days?
I don't think so,
but only because I'm thinking of like some quite boring use cases
where, for example... I quite often think about software-as-a-service applications.
Let's say you might be looking up a customer, and you might want to look them up
by email address, or by name; there are a few different things you might want to look them
up by, and you might
want to provide your users with a single search bar to look those up. Or you can have tags, for
example.
Based on what you just described, these arrays of text,
text arrays, you can use GIN for tags, putting all text for a document into a single value, a single
column, right?
All the text, denormalized.
I first did it, by the way, using GIST in 2008.
It was the topic of my first talk in the US I presented in 2008.
Can you imagine?
And we were building this, it was social media,
and all tags were stored in a single column
instead of the EAV approach,
where you have the words separately, like tags,
terms, which can be phrases:
one table for documents, another table for terms,
and between them a relationship table, right?
And then you need to have three-table joins, like two joins, all the time.
In terms of storage and speed of search, it was terrible.
It remains terrible, EAV, Entity Attribute Value approach.
But here you can put all tags, and full-text search can provide you exact search
by tags. Versus if you put everything into a vector, how would we probably
do it? We would probably just take author, title, body, and then we have tags. And we'd probably append "tags: this, this, this", right?
And rely on our vector search to use it somehow.
But we lose control, right?
If it's full-text search, or if we have multiple indexes,
we can reliably find proper documents which definitely contain some tags.
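A sketch of that tags-in-one-column idea without EAV, using a hypothetical table with tags stored as a text array:

```sql
-- GIN over a text[] column; no join table needed.
CREATE TABLE documents (
    id    bigserial PRIMARY KEY,
    body  text,
    tags  text[]
);
CREATE INDEX ON documents USING gin (tags);

-- Exact, reliable tag filtering: documents containing BOTH tags.
SELECT id FROM documents WHERE tags @> ARRAY['postgres', 'search'];
```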
This might be my answer, I think.
If I care a lot about reducing false negatives... like, it would be really bad if you searched the
exact name of a customer in my application and we didn't show it back to you, if we couldn't find it.
Sometimes the problem with the index types for vector search is they can have false negatives:
results that should be the correct answer turn out to then not be returned.
And also, we remember that vector search, since dimensionality is super high, is approximate nearest neighbors.
So that's what I'm trying to explain.
But full-text search is not nearest neighbors, it's like filtering.
It's exact.
Yes.
Precise.
So let me rephrase you.
If we want to present the user just a single input with a button, fine.
Probably similarity search is enough, right?
Or if we want to have some of this language, like exclude this word, or exact phrase search in
double quotes, or we have an advanced search form, like choose the author, date range, or
something, similarity search is not enough, and probably we will need not only
full-text search but also faceted search with additional filtering, things like that. Some things we
can put into the full-text search index, as I described, but not all of them. And it's interesting that
now we have similarity search, now we have full-text search, typo correction, basically, but it requires effort, as I said.
Other indexes, B-tree, don't go away at all, right?
And then we can probably build some system which can be very powerful in terms of what you can do with it.
And if we consider a particular example, for example, I know you used PostgreSQL.org and Google search, right?
Two search engines, right?
And PostgreSQL.org search is remarkably weak.
Yeah, it's not great, is it? It seems it doesn't use full-text search,
or maybe it uses it in some strange form,
because I don't see all the power full-text search provides.
Maybe we could look into details,
because I guess source code should be available.
But in general, it can be improved,
and things like just using Postgres' own full-text search
will improve it and we could
find more things.
But if pgvector
were installed,
it would be even better.
Because we could combine
semantic search and full-text
search. I think,
for me, full-text search
is good in terms of these capabilities, like this language: phrase search and negative exclusion, and also categories.
Yes.
So, as someone who built a lot of systems using full-text search, and recently built one system using pgvector and search based on semantic similarity, right?
Semantic similarity.
I now think it's probably proper time to start adding full-text search capabilities also,
and it will be tricky and interesting how to properly combine them.
I saw simple examples, trivial examples, like how to combine them.
Let's find 50 documents using semantic search, 50 documents using full-text search.
That's it.
Or let's find 1,000 using full-text search, then rerank them using similarity and leave only 100. And then define some additional ranking and leave only 10.
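One of those trivial two-phase combinations might be sketched like this; it assumes pgvector, a precomputed embedding column, and a parameterized query embedding, with all names hypothetical:

```sql
-- Phase 1: recall a broad candidate set with full-text search;
-- Phase 2: rerank candidates by semantic similarity and keep the top 10.
WITH candidates AS (
    SELECT id, embedding
    FROM documents
    WHERE to_tsvector('english', body)
          @@ websearch_to_tsquery('english', 'logical replication')
    LIMIT 1000
)
SELECT id
FROM candidates
ORDER BY embedding <=> $1   -- cosine distance operator from pgvector
LIMIT 10;
```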
Quite weak examples because I have no idea how pagination will work, right?
And also, what about speed?
So for me, it remains an open question how to look back at the old functionality
provided by full-text search and bring it to similarity search.
So it's a super interesting topic.
It depends on the use case, right?
So I'm thinking about your bot for Postgres AI.
You don't need to paginate, right?
You're returning the results in a chat-based interface.
Various.
So it's interesting: depending on the use case,
maybe you don't always have the same constraints
we're used to having via web-search-type interfaces.
Well, I think yes and no. We don't need it now,
but building the bot, I already follow my own behavior when using Google.
And there I definitely sometimes go to page number two, page number three, but not page 20.
Pagination maybe isn't that big a deal if you're only looking at the first three pages.
You're probably right.
But speed will suffer.
But anyway, I think I should combine full text.
Start combining with full text search
because sometimes I want exact search.
I know exact phrase or function name.
Sometimes I want to exclude something.
I mean, bot already wants to exclude something.
It happens.
And also I think about two-phase search,
when we search for more items.
Maybe indeed pagination is not needed.
I think about the idea to search a lot of items,
like 1,000, 2,000 items,
but then analyze snippets, like in Google,
using, again, automated analysis with an LLM,
and leave only a few. And for those, we need to expand, because I always
open every relevant-looking
snippet. I just click it, open a new tab, and consume the whole document, right?
But we cannot consume whole documents
for 10 or 20 items.
It's interesting there.
But full-text search, I think, is about to return,
because, again, sometimes I want
to search by title reliably.
I know it was mentioned in the title,
or something, or authors.
Also author.
Yeah, makes sense.
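Reliable title or author ranking is where setweight helps; a sketch, with column names assumed:

```sql
-- Weight title and author matches above body matches when ranking:
SELECT id,
       ts_rank(
           setweight(to_tsvector('english', coalesce(title, '')),  'A') ||
           setweight(to_tsvector('english', coalesce(author, '')), 'B') ||
           setweight(to_tsvector('english', coalesce(body, '')),   'D'),
           websearch_to_tsquery('english', 'postgres')
       ) AS rank
FROM documents
ORDER BY rank DESC
LIMIT 10;
```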
Exciting times.
Maybe it was slightly messed up in terms of content structure this time,
but I still have questions. And I don't like examples like,
let's just take 50 semantic, 50 full-text search.
Do you want to hear from people?
Should people let us know how they're currently...
Yeah, maybe there are better ideas already.
I think it's a very evolving area, right?
And maybe there will be more systems where capabilities will be combined
because we know, for example, in Google, it combines a lot,
including similarity and full-text things.
Yeah, so if you're doing this in Postgres already, combining
the two techniques, let us know.
Yeah, how exactly,
what are the details, it's super...
Yeah, yeah. Good.
Nice one. Cheers, Nikolay.
Thank you.