Postgres FM - Search
Episode Date: March 29, 2024

Nikolay and Michael have a high-level discussion on all things search — touching on full-text search, semantic search, and faceted search. They discuss what comes in Postgres core, what is possible via extensions, and some thoughts on performance vs implementation complexity vs user experience.

Here are some links to things they mentioned:

- Simon Riggs https://www.linkedin.com/feed/update/urn:li:activity:7178702287740022784/
- Companion databases episode https://postgres.fm/episodes/companion-databases
- pgvector episode https://postgres.fm/episodes/pgvector
- Full Text Search https://www.postgresql.org/docs/current/textsearch.html
- Semantic search https://en.wikipedia.org/wiki/Semantic_search
- Faceted search https://en.wikipedia.org/wiki/Faceted_search
- Faceting large result sets in PostgreSQL https://www.cybertec-postgresql.com/en/faceting-large-result-sets/
- RUM index https://github.com/postgrespro/rum
- Hybrid search (Supabase guide) https://supabase.com/docs/guides/ai/hybrid-search
- Elastic https://www.elastic.co/
- GiST indexes https://www.postgresql.org/docs/current/gist.html
- GIN indexes https://www.postgresql.org/docs/current/gin.html
- btree_gist https://www.postgresql.org/docs/current/btree-gist.html
- btree_gin https://www.postgresql.org/docs/current/btree-gin.html
- pg_trgm https://www.postgresql.org/docs/current/pgtrgm.html
- Text Search Types (tsvector and tsquery) https://www.postgresql.org/docs/current/datatype-textsearch.html
- Postgres full text search with the "websearch" syntax (blog post by Adam Johnson) https://adamj.eu/tech/2024/01/03/postgresql-full-text-search-websearch/
- Understanding Postgres GIN Indexes: The Good and the Bad (blog post by Lukas Fittl) https://pganalyze.com/blog/gin-index
- ParadeDB https://www.paradedb.com/
- ZomboDB https://www.zombodb.com/
- Introduction to Information Retrieval (book by Manning, Raghavan, and Schütze) https://www.amazon.co.uk/Introduction-Information-Retrieval-Christopher-Manning/dp/0521865719
- How to build a search engine with Ruby on Rails (blog post by Justin Searls) https://blog.testdouble.com/posts/2021-09-09-how-to-build-a-search-engine-with-ruby-on-rails/

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is brought to you by:
- Nikolay Samokhvalov, founder of Postgres.ai
- Michael Christofides, founder of pgMustard

With special thanks to Jessie Draws for the amazing artwork.
Transcript
Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres.ai.
Hello Nikolay, what are we talking about today?
Hi Michael, let's talk about search at a high level.
But before we proceed, let me express a few thoughts about the news we got a couple of hours ago about Simon Riggs. We are recording this on Wednesday, March 27, and Simon Riggs just passed away. I remember him as a very bright mind. I remember he was not an easy person to deal with, obviously. I remember it took like 100 emails to convince him to come to Moscow to speak at a conference. Many people were involved, but eventually he came, and it was a great talk. And the work he did... in general, it's a big loss, obviously, for the Postgres community.
So, yeah, condolences to family, friends, and co-workers, ex-co-workers, and so on.
And Simon built a lot of things.
And he was quite brave to attack very complex topics in the Postgres system in general, right?
In core of Postgres and in the engine itself.
For example, point-in-time recovery, things related to replication.
Many achievements were made by Simon or involving Simon.
So it's a big loss, definitely.
Yeah, over many years as well, right?
I have only actually had the opportunity to meet him a couple of times
at a couple of London events and heard him speak.
And not only was he a great contributor to the code base,
but I was amazed at how he was able to communicate and educate
and also community build, right?
Like he was involved in organizing a lot of events, especially in the UK, growing companies, and a lot more around the ecosystem as well.
I must say, I remember very well this look in Simon's eyes, which had some sparkles. And I remember the very first conference on the American continent I attended, in 2007, speaking with Peter Eisentraut.
I was a baby, actually. I had implemented some parts of the XML functions and the XML type in Postgres. And I remember Simon looking directly at me with those sparks and asking: what's your next thing to build in Postgres? I was caught off guard and didn't answer anything, actually. And yeah, this is what I remember about Simon: this look, and courage.
Do I pronounce it right? Courage, yeah. Courage, yes. He obviously had
big courage, huge.
And the ability to silence Nikolay, that's quite the...
Yeah, well, yeah. It's interesting, yeah. So, it's sad, very sad. So, yeah.
Yeah, absolutely.
Condolences to everybody who knew him and worked with him.
I don't really know how to move on from that, actually,
but we were going to talk about search, right?
Let's just, after this small break, let's return to the search topic.
It's wide.
It's a very wide topic.
And I guess we just want to touch it a little bit today
at a very high level, right?
Yeah, well, it's amazing. If you consider all the things Postgres is used for, search is one of the top use cases. But looking back at our episodes, we've only touched on it a few times, like when we looked at using external databases, a second type of database alongside the main one, companion databases or something like that. So we've touched on them a few times, but we haven't done anything on full-text search, and we haven't done anything more recently on semantic search. We've done a pgvector episode and a few related-ish subjects. But it crossed my mind that we hadn't really touched on this as a topic. And obviously it's one of those subjects where the more you learn, the more you realize you didn't know, or the more complicated it gets. So I can imagine us doing quite a few follow-ups on the more specific parts, implementation details, or the tricky bits. But yeah.
Don't forget faceted search.
There's such a term, right?
Yeah.
I could be saying it wrong, but I think I've heard it called faceted. You know, like roaring bitmaps and things like that.
Well, usually we start from the UI. In my head, this starts from the UI. We have some big form consisting of multiple, very different selectors and filters and so on. This is very common in various marketplaces. For example, imagine Airbnb: you want to limit price and location and various categories and properties and so on.
And let me put this on the table right away.
Ideally, we should always have just a single index scan or index-only scan. This is our ultimate goal, always. But unfortunately, it's not always possible. Why do we need it? Because it's the best plan. Do you agree or not? Because you deal with plans, with pgMustard, with explain plans and so on. A single index scan is the best.
One of the things I love about databases and performance in general is when it gets to the point where you have to trade one thing off against another. And I think search is one of those topics where we're often trading off user experience versus complexity of the backend. Some of the nicest search features are just a search bar, but without getting any more input from the user, you have to do a lot of work on the backend to serve that in any kind of performant way. So you've got the complexity of matching the results to the intent, with the additional caveat that you want it to give at least some good results quickly, and those are in tension. It's really easy, well, not easy, but you can give great results if you can search through everything you have and score everything, if you've got forever to return them. But if you've given yourself a budget of returning within a few hundred milliseconds, suddenly it becomes a more difficult problem.
So I love that it's all trade-offs, and yes, one of those is performance, but sometimes I think you're willing to pay a little bit of performance for a better result.
Yeah, this is difficult. What are "better results", what is high quality of search, right? I remember a definition that users should be happy, which is very broad. What makes users happy? Maybe we return good results but the UI is very bad, so they're not happy, right? It's quite an interesting topic. And I think you're right, but also I just dove into the very bottom of performance. Performance matters a lot, right? If search is very slow, the users won't be happy, and that means poor quality of search, right? So we do care about performance, but we also care about things like:
if it's a full-text search,
we want stop words to be removed and ignored.
We want some dictionaries to be used,
maybe synonyms to be applied,
and so on and so on, right?
This matters a lot. But it also brings us back to the performance part, because if these steps are slow, it's also bad.
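For instance, the normalization being described is visible directly in psql (a classic example from the Postgres docs):

```sql
-- Stop words are dropped; remaining words are normalized to lexemes:
SELECT to_tsvector('english', 'The Fat Rats are running');
-- 'fat':2 'rat':3 'run':5

-- Queries are normalized the same way, so "runs" matches "running":
SELECT to_tsvector('english', 'The Fat Rats are running')
       @@ plainto_tsquery('english', 'rat runs');
-- true
```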
Why was I mentioning faceted search? I just see a common pattern. Postgres is huge in terms of capabilities, extensibility, various index types, extensions. But we have simple problems unsolved. For example, take full-text search combined with ordering by timestamp or ID. Instead of the old-school regular approach, "return the most relevant documents to me", I want fresh documents to go first, because it's social media and this is the number one pattern, but they also need to match the full-text search query I used, right? I just need to see the latest entries matching some text patterns. And this problem is unsolved in Postgres, unfortunately. The best attempt to solve it is called the RUM index, which is an extension, like a new generation of the GIN index. But why isn't it in core? Because it has issues: it's huge, it's slow, and so on. And I now observe similar things, and not only observe, I touch them.
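For the "freshest documents matching a full-text query" pattern, the rum extension can index a tsvector together with a timestamp. This is a rough sketch following the extension's README; the table and column names are illustrative, and rum must be installed separately:

```sql
CREATE EXTENSION rum;

-- Index the tsvector with the timestamp attached to it:
CREATE INDEX posts_rum_idx ON posts
  USING rum (tsv rum_tsvector_addon_ops, created_at)
  WITH (attach = 'created_at', to = 'tsv');

-- Newest matching posts first, served from the index;
-- ordering by distance to a far-future timestamp yields newest-first:
SELECT id, created_at
FROM posts
WHERE tsv @@ to_tsquery('english', 'vacuum & autovacuum')
ORDER BY created_at <=> '3000-01-01'::timestamptz
LIMIT 20;
```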
For example, just before we started recording, you showed me the Supabase blog post about how to combine full-text search and semantic search based on embeddings. I don't like the word "embeddings", I like the word "vectors", because "embedding" in a database doesn't settle in my mind at all. An embedding is what we embed into our prompt; that's content. But a vector is a vector. Maybe I missed something, but why do we call vectors embeddings? Honestly, in our database this column is called embedding just because OpenAI dictated it. But I also see the OpenAI Assistants API naming its state machine's states "status", which also... what's happening there? Its "status" is a state, a state machine's state, like in progress, function call, etc. But that's off topic.
So these vectors give us a great capability, semantic search. And we have text search. And the Supabase article describes how to combine them. But basically, we perform two searches and then merge the results. It means we cannot do pagination. Pagination is very important. Maybe the user needs to go to the second page, the third page. In quality search engines, they do need it. And in this case, it's similar to the offset problem we've described many times.
Well, that's the only solution, I guess: pagination through offset.
So far, yes, but maybe it's also possible to combine something, right? Honestly, GIN is also about multi-dimensional things, and... I only know parts of the things here. But what I don't like: two index scans, and then we combine things and we lose pagination. I mean, we can have pagination, but if we want to go to page number 100, it's insane how much data we need to fetch. EXPLAIN (ANALYZE, BUFFERS) will show very bad numbers, which means it's not working well. And a different example, sorry, I will finish my complaining speech.
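The two-searches-then-merge approach from the Supabase guide is typically done with reciprocal rank fusion. A rough sketch with illustrative names (a docs table with a tsv tsvector column and a pgvector embedding column; :query_embedding is a placeholder parameter):

```sql
WITH fts AS (
  SELECT id, row_number() OVER (ORDER BY ts_rank(tsv, q) DESC) AS rank
  FROM docs, websearch_to_tsquery('english', 'wal archiving') q
  WHERE tsv @@ q
  ORDER BY ts_rank(tsv, q) DESC
  LIMIT 50
), semantic AS (
  SELECT id, row_number() OVER (ORDER BY embedding <=> :query_embedding) AS rank
  FROM docs
  ORDER BY embedding <=> :query_embedding
  LIMIT 50
)
-- Reciprocal rank fusion: score each document by 1/(k + rank) from each list:
SELECT id,
       coalesce(1.0 / (60 + f.rank), 0)
     + coalesce(1.0 / (60 + s.rank), 0) AS rrf_score
FROM fts f
FULL OUTER JOIN semantic s USING (id)
ORDER BY rrf_score DESC
LIMIT 20;
```

Note how pagination suffers: going deep into the fused list means re-running both scans with larger limits and merging again, which is exactly the offset-style problem being described.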
So a different example is what we have right now. In our bot, we imported more than 900,000 emails from six mailing lists, 25 years of them. So we have more than 1 million documents. Before that, we had imported into the bot's knowledge base only documentation, source code, and blog posts, and all of it was relatively fresh, almost. But when we imported 25 years of mailing list archives, I'm asking: hey bot, what can you tell me about subtransactions? Okay, here is the documentation, my article, but also this very good email from Bruce Momjian from 2002. And it went to first place. It's not good. We basically need to take into account the age of the data here, right? How to do that?
there is no good way
if you work with pgvector
there is no good way to
deprioritize
all documents
to take into account the age of the data.
So what we did,
we just, when
usually we need to find like 10 or
15, 20 entries, maximum like 100
usually entries, and
embed them as embeddings
to the prompt.
So what we do, we find
1000 entries.
And then just in memory, Postgres recalculates adjusted similarity,
adjusted distance based on logarithm of H.
And this is how we do it.
If nothing new, okay, we are satisfied with old documents.
So we take into account the age.
But again, this doesn't scale well.
If we will have a lot of like 10 million documents, it will be worse.
And also we cannot have pagination if we talk about search here, right?
Kind of similar problem as well.
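The over-fetch-and-re-rank trick might look roughly like this; the schema and the exact penalty formula are illustrative (the transcript only says "logarithm of age"):

```sql
WITH candidates AS (
  -- Step 1: over-fetch by raw vector distance (pgvector's <=> operator)
  SELECT id, title, created_at,
         embedding <=> :query_embedding AS distance
  FROM docs
  ORDER BY embedding <=> :query_embedding
  LIMIT 1000
)
-- Step 2: re-rank in memory, penalizing older documents logarithmically
SELECT id, title,
       distance
         * (1 + ln(1 + extract(epoch FROM now() - created_at) / 86400.0) / 10)
         AS adjusted_distance
FROM candidates
ORDER BY adjusted_distance
LIMIT 20;
```

This only works while the candidate set is a small slice of the data, which is why it does not scale to tens of millions of documents.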
And this makes me think: it's great that we have extensibility, but these types of searches are so different. We have, like... what is the name for when different things are combined? It means that it's hard to build a good system which works...
Yes, heterogeneous.
I know how to spell it, but I cannot pronounce it, because I've seen it many times in scientific and technical papers, but pronunciation...
I'm not even sure I know how to pronounce it. Heterogeneous, or something like that?
I can pronounce it only in Russian, sorry. So, what I'm trying to say: we are kind of at the early Linux stage. You need to compile a lot of drivers and deal with it to make the system work as you want, like a good product, right? Compared to something like Elastic, where you take it and things work together very well, because it's a single product. What do you think about it? This is a problem. Extensibility has a negative side here.
I think you've jumped straight to the question of where the limits of Postgres' search capabilities are right now. And that's a really interesting topic, and quite deep already. But it skips over all the things you can do already in Postgres, and there are a ton of different built-in things, add-on modules, and extensions that mean those limits are being pushed further and further. And I think a lot of people come from an assumption that Postgres won't be able to handle search super well, because products like Elastic exist and are successful, and therefore probably people aren't doing this in the database. But I see a lot of use cases that can be served adequately, with good results, in acceptable response times for users, without touching any external services. So I think you're right that there are edges, and there are limits that can be better served by other products, but those limits are quite far down the road for a lot of use cases. You can build pretty good search features for a lot of different use cases, especially if you're willing to learn exactly how it works and factor in your own product's or service's requirements. If you're not just searching every field for every word, if it's, I'm assuming, a text-search-type field, it can be really powerful already.
Yeah, I agree. I agree. But, yeah, well.
Can we talk about some of them quickly? Just to cover the basics.
Yeah, let's talk about them. I agree. And you're basically echoing the usual problem I have. I've had cases where people listening to me said I'm a Postgres hater. So again, of course, this criticism goes quite deep. And of course, I don't like the idea of having Elasticsearch for full-text search, with the need to constantly synchronize, or maybe one of these new vector database systems, what's the name, Pinecone or something like that. You basically need to synchronize data from your main OLTP database all the time, and you have a lag, and then you bring some regular data there and you think about how to combine it and search that data. Because obviously, for Elastic, you need to bring not only textual data but also categories, to have the same faceted search. Sometimes people want: I want to do full-text search, but also limit price, for example, right? Some range. And this is usually stored in a regular column in the relational database.
And of course, we have good capabilities to combine that with full-text search and achieve a single index scan. For example, you can use a GiST index. Well, GiST is slower; it works well for smaller datasets. But you combine it with btree_gist, right? And then you have the capability to combine both full-text search and a numeric range filter, and have a single index scan. This is perfect.
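A minimal sketch of that combination, with an illustrative listings table (tsv tsvector, price numeric); btree_gist teaches GiST about scalar columns, and btree_gin does the analogous thing for GIN:

```sql
CREATE EXTENSION btree_gist;

-- One GiST index covering both the full-text column and the scalar:
CREATE INDEX listings_search_idx ON listings USING gist (tsv, price);

-- Full-text match plus a price range, servable by a single index scan:
SELECT id
FROM listings
WHERE tsv @@ to_tsquery('english', 'loft & balcony')
  AND price BETWEEN 100 AND 200;
```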
Again, I'm staying on the same point.
Single index scan is the best.
But unfortunately, in many cases,
we cannot achieve it.
Ideally, the user types something, chooses something on the form, presses search, or maybe it's automated. I don't like automated, I like to press search explicitly. Anyway, we have a request, and this request translates to a single index scan, and we return the results. This is the ideal case in terms of performance. Otherwise, for bigger datasets, you will have very bad performance.
Well, in terms of performance, but also in terms of system resources, right? We're also not having to use a lot of system resources to satisfy quite a lot of searches, whereas a lot of the alternatives, because they're not just one index scan, require more resources as well. So I think it's efficient from a couple of angles, but it very much limits what the user can search for, if it has to be indexable that way. Some other searches wouldn't be possible. I don't know about you, but, to give people a bit of insight into how we do this, we agreed on a topic about 24 hours ago, and with every product I've used since, I've been thinking: how does search work here, exactly? And it's really interesting how different products implement it; not everyone does it the same. And we've been somewhat spoiled as users by Google, in my opinion, Google and Gmail, both of which have had incredibly good search features for quite a long time, and most people have experienced those. But it isn't the same in every product. Firstly, not every product is capable of doing that, but also it's not quite the right trade-off for a lot of products either. A lot of products I use, things like Slack, for example, or Stripe, will encourage you to use filters. They let you type whatever you want in, and they will perform a wide search depending on whatever you type. But they encourage the use of filters. For example, in Slack: search within just one channel, or just from one person, things that filter it right down to make those searches much more efficient. So it's interesting that they're doing that, partly, I guess, to give you exactly the results you're looking for high up, but also, I guess, to reduce system use. They don't have to do as much work if you filter it down for them.
All right. So I think there are a few things that beginners, or when I was a beginner, when I didn't know quite how this stuff worked, I don't think I fully appreciated: the complexity of doing search well. So there are the basics. When we say full-text search, by the way, I never really understood what the word "full" is doing in there. It's basically just text search: does this document, or this sentence, contain this word and this word, or this word or this word? Basic text-based searches. I don't know why it's called "full". Do you know?
No. Good question. I think, right, so I understand the difference. You can compare the whole value as-is, but then you'll be dealing with the problem of the size of this value, right?
But there's also, there are a million complexities even if we only consider that. Like, should you care about case, do capital letters matter?
So full-text search is a very well-established area already. Instead of comparing the whole value, or matching some mask, regular expressions, right, which is also an interesting and related topic, GIN can be used there, trigram search can be used. Instead of that, we consider words. First of all, as usual, we have this problem with the first normal form, but it's off-topic, right? Because this value is not atomic anymore. We consider each word as an atom in this molecule, right? And first of all, we usually normalize words, either using a stemmer, the Snowball stemmer, right, or we use some dictionary. Not synonyms; synonyms can also be used, but I'm talking about ispell dictionaries, for example, when you take a word and...
Stemmers are very dumb. They just cut the ending, that's it. You can feed them some new word and they will cut it according to some rule. But ispell is a dictionary. It knows the language, it knows the set of words, and it can transform words in their different forms to a normalized form. It normalizes every word, right?
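ts_lexize shows the difference directly. The Snowball stemmer ships with Postgres; the ispell line assumes an english_ispell dictionary has been configured, which Postgres does not ship by default:

```sql
-- Snowball: rule-based suffix stripping, works even on made-up words:
SELECT ts_lexize('english_stem', 'jumping');   -- {jump}
SELECT ts_lexize('english_stem', 'blorfings'); -- suffix cut by rule, no dictionary needed

-- ispell knows the language's word list and returns NULL for unknown words
-- (requires dictionary files to be installed and configured first):
-- SELECT ts_lexize('english_ispell', 'jumping');
```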
And then, basically, we can build a tree and use GiST, the generalized search tree, which can be used even for B-tree or R-tree. B-tree is one dimension, R-tree is two or more dimensions, and the R-tree in Postgres is based on GiST, because the GiST-based implementation was better than the original one. That was not the case for B-tree: B-tree remained the native implementation, but there is a GiST implementation too. That's why I already mentioned it, right?
So, a tree, right? Great. So you can find the entries which have your words. But there's also an inverted index, GIN, right? And GiST, actually... oh, I didn't finish about GiST. So B-tree is one dimension, just one axis. R-tree: two or more dimensions, and you can build a tree of, for example, rectangles in two dimensions. But what to do with text? It has a lot of words. We can consider the words as a kind of array or set, right? And then we can say this set contains that set, or is contained in it. So we define operators: intersects, contains, is contained. And in this case, we talk about sets, and we can still build the tree based on GiST. There are seven functions you need to implement to define the operations. So basically, for sets, we can build a tree, and it's actually called an RD-tree, a Russian doll tree. That's the official name from the Berkeley paper; we can attach a link to it. And this is how full-text search was originally implemented, based on GiST.
But later GIN was also implemented, the generalized inverted index, which works much better for very large volumes of data. And this is what search engines use. It's basically a list of terms, with links to the documents in which each term is mentioned, and then there are internal B-trees to find each term faster. I think there are two kinds of B-tree inside GIN, but that's implementation details. In general, it means we can say: okay, these words are present in this document, right? And we can find them very fast. And we can also order by rank. Rank is an interesting thing; it's calculated to surface the most relevant documents, for example, documents where the words are mentioned more often, right?
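A minimal GIN-backed full-text setup, with illustrative names (the generated column requires Postgres 12 or newer):

```sql
CREATE TABLE posts (
  id   bigserial PRIMARY KEY,
  body text,
  tsv  tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
);
CREATE INDEX posts_tsv_gin ON posts USING gin (tsv);

-- GIN finds matching rows quickly; ts_rank then orders them by relevance
-- (the ranking is computed per matching row, not stored in the index):
SELECT id, ts_rank(tsv, q) AS rank
FROM posts, to_tsquery('english', 'index & bloat') q
WHERE tsv @@ q
ORDER BY rank DESC
LIMIT 10;
```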
Do we have phrase search? Can we do double quotes?
Yeah, we have some of that.
Whoa.
And we can write some formulas, right? There's "followed by".
Yeah, you can do "followed by" for phrase search, but there's also... So we have some data types and loads of functions that are really helpful for doing this without...
Four categories, right? A, B, C, D, right? We can... like weighting.
What do you mean by categories? I don't remember.
I remember, when you define a GIN index, you need to convert the data using to_tsvector. So you convert to a special text search vector, the tsvector type. And you can say that some parts are considered one category, and some parts a different category. There are a maximum of four categories. And when you search, you can say: I'm searching within only a specific category. It means that, for example, you can build one tsvector but, if you're indexing emails, for example, take the words from the subject and mark them as category A, right? And the body is B. And then you have the freedom and flexibility to search globally, not taking into account the origin of the words, or, inside the same search, you can limit it, saying: I'm searching only inside the subject. So you can mark these parts of the tsvector, which is good, but four is not enough in many cases.
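The categories are the weight labels A through D, attached with setweight. A sketch for the email example (illustrative schema):

```sql
-- Subject words get weight A, body words get weight B:
UPDATE emails SET tsv =
    setweight(to_tsvector('english', coalesce(subject, '')), 'A')
 || setweight(to_tsvector('english', coalesce(body, '')), 'B');

-- Search only within the subject (weight A):
SELECT id FROM emails
WHERE tsv @@ to_tsquery('english', 'vacuum:A');

-- Or search everywhere, but rank subject matches higher;
-- the weights array is {D, C, B, A}:
SELECT id,
       ts_rank('{0.1, 0.2, 0.4, 1.0}', tsv, to_tsquery('english', 'vacuum')) AS rank
FROM emails
WHERE tsv @@ to_tsquery('english', 'vacuum')
ORDER BY rank DESC;
```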
Right.
So there are many capabilities in development.
Yes, and you don't have to build that much around it to get something pretty powerful out.
And one thing I learned about relatively recently, from a blog post, was websearch_to_tsquery. So there's tsquery, the query representation, and tsvector, the vector representation, the normalized one, once you've taken each word, normalized it, taken the plurals out of it, and things like that. "Websearch" means you can write a query a bit like you might in a fairly complex search engine: taking the NOT operator and saying I don't want documents that include this word, or using AND- and OR-type operators as well. So it lets you build something that has some basic search-engine-like features built into the text search field without much work at all. And these all come in Postgres core; you don't even need an extension for this lot, which is pretty cool. But yeah, I was going to move on. Obviously that contains loads of complexities and helps you solve some of the trickier things immediately, but there are also relatively built-in modules for fuzzy search, like handling typos elegantly, the kind of thing that starts to get a bit more complicated. Do you want to be able to match on people nearly getting the name right? Not all products do, but it's pretty common that we do want to. When I first came across trigram search, the pg_trgm extension, I was blown away by how elegant a solution that is to typos.
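Quick sketches of both features just mentioned, with illustrative table and column names:

```sql
-- websearch_to_tsquery accepts search-engine-style syntax
-- (double quotes for phrases, "or", and "-" for exclusion), in core:
SELECT websearch_to_tsquery('english', '"index scan" or gin -gist');

-- pg_trgm handles typos via trigram similarity:
CREATE EXTENSION pg_trgm;
CREATE INDEX users_name_trgm ON users USING gin (name gin_trgm_ops);

SELECT name, similarity(name, 'Nikolai') AS sim
FROM users
WHERE name % 'Nikolai'   -- % matches above the similarity threshold
ORDER BY sim DESC
LIMIT 5;
```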
So, were you not as impressed? I was thinking it's so simple and it works so well. Trigrams.
Yeah. Let me disagree with you. I've used it many times, for many, many years.
Go on.
And I cannot say it's elegant, because, first of all, it requires a lot of effort. It's not just that you say "I want trigrams here" and that's it. No, there are things you need to do. And also, at really high volumes, it doesn't work well in terms of performance.
Sure.
And if you have a lot of updates coming, you will quickly bump into this dilemma: GIN with fastupdate or without fastupdate, and the pending list, which is 4 megabytes by default, right? And during a regular SELECT, Postgres decides the pending list needs to be processed, and the SELECT times out. Then you need to tune it.
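The knobs being referred to, roughly (index and table names are illustrative):

```sql
-- GIN's fastupdate (on by default) buffers new entries in a pending list;
-- when it fills, some unlucky query gets to flush it:
SHOW gin_pending_list_limit;   -- 4MB by default

-- Option 1: disable the pending list at index creation:
CREATE INDEX docs_tsv_idx ON docs USING gin (tsv) WITH (fastupdate = off);

-- Option 2: keep fastupdate, but flush explicitly during quiet periods:
SELECT gin_clean_pending_list('docs_tsv_idx'::regclass);
```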
Well, I still have a feeling it could be better.
No doubt. And I guess this is one of those examples of: it can work really well for a while, and once you're at a certain scale, maybe an external system is beneficial. But yeah.
And it's an opportunity for new Postgres companies to improve this, so that you benefit from a more polished version when you develop some product on top of Postgres, right?
Yeah, and there are some startups, right, that are quite focused on search, or at least... is it ParadeDB that's looking into this stuff?
Right, maybe you're right, yeah.
There's also one other that it would be a shame to finish this episode without mentioning: ZomboDB.
Yes.
Who take the hard parts out of synchronizing Postgres with Elasticsearch.
Yeah, that's a good target.
Also, I wanted to mention... sorry, I was trying to find a book I recently bought. For specialists, it should be a super important book, I guess. And I need to thank Andrei Borodin, as usual; this guy helps me a lot with Postgres in general. This book is called Introduction to Information Retrieval, by Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. This book is interesting, and this is exactly where I saw this definition: what is good quality? Users are happy. And then a lot of formulas, right? It's interesting.
Well, and it's a moving target. One of the good blog posts I reread as part of doing research for this was by Justin Searls, about a Ruby implementation, and he made a really good point at the end about it being a moving goalpost. Users might be happy waiting a few seconds one year, and then five, ten years later that may not be considered good enough anymore, because they can use other search engines that are much faster. Or your data volumes grow. You talked about what your implementation relies on; well, volume might change, but patterns might change too. And you might find it's harder to provide as good results as your data changes, or as your users' expectations change. So it's a constantly moving goalpost as well. Not only might the same results not be good enough 10 years later, but also, yeah, it's a tricky one. I think user happiness is one good metric, but Google also uses lots of other metrics.
Go on.
So, since we're talking about semantic search, which supports these kinds of AI systems, these embeddings, I'm thinking not about user happiness but about LLM happiness, so to speak. And I think that usually we deal with very large documents, and when we generate vectors, there are certain limits. For example, the OpenAI limit is 8,191 tokens, roughly 30,000 characters. And, for example, my article about subtransactions exceeds 30,000 characters, so it was hard to vectorize it, right? We needed to summarize it first using an LLM and then vectorize only the summary. Unfortunate, right? But it works, okay.
But what I'm trying to say: when we talk about traditional search engines, search results are not whole documents. They are snippets.
True.
And part of this happiness, or quality, is how we present the results. For example, we can provide snippets and highlight the words which are present in the query, right? It's good for user experience: the user immediately sees the words they typed. If it's a synonym it will be different, but anyway, it's a good practice. But if you think about how Google works, you see some results: first page, second page, and so on. And it's good when the very relevant results are on the very first page, right? Maybe at the very top of the first page, and you're satisfied.
There is a new topic: LLM happiness. They should be satisfied. But what does it mean? It means that what we decided to include, to embed into the prompt and use in answers, should be very relevant, right? And this is a very big topic which is not yet explored properly. In my opinion... this is just what I have in mind, I'm sharing it, and if some folks are interested, I would love to have a discussion around it. So here is what I think we are going to do; not yet, but we are going to do it. We are going to return many snippets or summaries, not whole documents, and then ask the LLM internally, during the same request cycle: evaluate, is it relevant? Based on this snippet, is it relevant to the original request, on a scale from 1 to 10? Then we limit the results. And then, and this is important, and this is how humans behave, we open each document, inspect it fully, and think again: is it relevant? And only then... and maybe we go to the second page, I don't know. Maybe a second page is not needed; our first page can just be bigger, because everything is automated here. But we inspect the full document and decide: is it worth keeping and using in the answer or not? Maybe at the end of this process we will have zero documents left.
Maybe we need to think, maybe provide a different query. And this is what I see from LLMs sometimes if we do this: you ask something, there is this RAG system which performs a search, and you are not satisfied. And if you just tell it, if you consider the LLM as a junior engineer, you just say: I'm not satisfied, you can do better, invent some new searches and try them. You don't say exactly which searches to use. In all the cases where I did this, I was satisfied at the second step. So I mean, the internal process of searching can be very complex. It might take longer, much longer, but the result will be much better. And the question of quality here is shifting: this semantic search should be somehow different to be considered good quality for this type of process, for AI systems.
So I'm reading this book, understanding that it was written for search engines targeting humans, and I'm very interested in how this topic will change. I'm sure a lot will be inherited; there is a huge amount of science in search engines, information retrieval is a huge topic, right? So a lot will be inherited, but there will be changes as well, because of the high level of automation, and I'm very curious how quality will be redefined. So this is a very interesting topic. I have many more questions than answers.
Yeah.
Thank you for listening to this long speech. It just sits in my head right now: what will happen?
Don't you consider search engines semantic search too?
Yeah, well, Google, it's known, and all top search engines, they have been doing semantic search for many, many years already. I know that. But they didn't provide... do you know of an API to Google? I only know some workaround solutions to it, right? But now we are building small Googles for our own small knowledge bases. And it's interesting. pgvector is a very basic thing; of course there's a lot of work there, like the two types of indexes it provides, performance improvements, and so on. But what to do with age? What to do with this embedding process? These questions are big in my head right now.
Cool.
I'm looking forward to some in-depth ones on this.
Yeah, I hope we will have some follow-up episodes
maybe about full-text search as well
and semantic search as well and so on.
Faceted search as well.
Yeah.
Sounds good. All right. Thanks so much, Nikolay. Take care.
Thank you.