Postgres FM - Search

Episode Date: March 29, 2024

Nikolay and Michael have a high-level discussion on all things search — touching on full-text search, semantic search, and faceted search. They discuss what comes in Postgres core, what is possible via extensions, and some thoughts on performance vs implementation complexity vs user experience.

Here are some links to things they mentioned:

Simon Riggs https://www.linkedin.com/feed/update/urn:li:activity:7178702287740022784/
Companion databases episode https://postgres.fm/episodes/companion-databases
pgvector episode https://postgres.fm/episodes/pgvector
Full Text Search https://www.postgresql.org/docs/current/textsearch.html
Semantic search https://en.wikipedia.org/wiki/Semantic_search
Faceted search https://en.wikipedia.org/wiki/Faceted_search
Faceting large result sets in PostgreSQL https://www.cybertec-postgresql.com/en/faceting-large-result-sets/
RUM index https://github.com/postgrespro/rum
Hybrid search (Supabase guide) https://supabase.com/docs/guides/ai/hybrid-search
Elastic https://www.elastic.co/
GiST indexes https://www.postgresql.org/docs/current/gist.html
GIN indexes https://www.postgresql.org/docs/current/gin.html
btree_gist https://www.postgresql.org/docs/current/btree-gist.html
btree_gin https://www.postgresql.org/docs/current/btree-gin.html
pg_trgm https://www.postgresql.org/docs/current/pgtrgm.html
Text Search Types (tsvector and tsquery) https://www.postgresql.org/docs/current/datatype-textsearch.html
Postgres full text search with the “websearch” syntax (blog post by Adam Johnson) https://adamj.eu/tech/2024/01/03/postgresql-full-text-search-websearch/
Understanding Postgres GIN Indexes: The Good and the Bad (blog post by Lukas Fittl) https://pganalyze.com/blog/gin-index
ParadeDB https://www.paradedb.com/
ZomboDB https://www.zombodb.com/
Introduction to Information Retrieval (book by Manning, Raghavan, and Schütze) https://www.amazon.co.uk/Introduction-Information-Retrieval-Christopher-Manning/dp/0521865719
How to build a search engine with Ruby on Rails (blog post by Justin Searls) https://blog.testdouble.com/posts/2021-09-09-how-to-build-a-search-engine-with-ruby-on-rails/

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is brought to you by:

Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:

Jessie Draws for the amazing artwork

Transcript
Starting point is 00:00:00 Hello and welcome to Postgres FM, a weekly show about all things PostgreSQL. I am Michael, founder of pgMustard. This is my co-host Nikolay, founder of Postgres.AI. Hello Nikolay, what are we talking about today? Hi Michael, let's talk about search at a high level. But before we proceed, let me express a few words, a few thoughts about the news we got a couple of hours ago about Simon Riggs. We are recording this on Wednesday, March 27, and Simon Riggs just passed away. I remember him as a very bright mind. I remember he was not an easy person to deal with, obviously. I remember like 100 emails
Starting point is 00:00:51 even 100 emails to convince him to come to Moscow to speak at conference. Many people involved but eventually he did, he came and it was great talk. But the work he did and like in general, yeah, it's a big loss, obviously, for Postgres community. So, yeah, condolences to family, friends and co-workers, ex-co-workers and so on. And Simon built a lot of things. And he was quite brave to attack very complex topics in Postgres system in general, right? In core of Postgres and in the engine itself. For example, point-in-time recovery, things related to replication. Many achievements were made by Simon or involving Simon.
Starting point is 00:01:42 So it's a big loss, definitely. Yeah, over many years as well, right? I have only actually had the opportunity to meet him a couple of times at a couple of London events and heard him speak. And not only was he a great contributor to the code base, but I was amazed at how he was able to communicate and educate and also community build, right? Like he was involved in organizing
Starting point is 00:02:06 a lot of events especially in the uk um growing companies and a lot more around the ecosystem as well i must say i remember very well this look in in simon's eyes which was like has had some sparkles and i remember the conference the, the very first conference in the American continent I attended in 2007, speaking with Peter Eisentraut. I was a baby, actually. I implemented some parts of XML implementation functions and type in Postgres.
Starting point is 00:02:41 And I remember Simon looking directly to me with those sparks and asking, what's your next thing to build in Postgres. And I remember Simon looking directly to me with those sparks and asking, what's your next thing to build in Postgres? I was caught out of guard, didn't answer anything, actually. And yeah, so this is what I remember about Simon, this look and courage. Do I pronounce it right? Courage, yeah. Courage, yes. He obviously had big courage, huge. And the ability to silence Nikolai, that's quite the...
Starting point is 00:03:14 Yeah, well, yeah. It's interesting, yeah. So, it's sad, very sad. So, yeah. Yeah, absolutely. Condolences to everybody who knew him and worked with him.
Starting point is 00:03:31 I don't really know how to move on from that, actually, but we were going to talk about search, right? Let's just, after this small break, let's return to the search topic. It's wide. It's a very wide topic. And I guess we just want to touch it a little bit today at a very high level, right? Yeah, well, it's amazing.
Starting point is 00:03:52 If you consider all the things Postgres is used for, search is one of the top use cases. But looking back at our episodes, we've touched on it a few times, like when we've looked at using external databases or um forgotten what somebody called them actually like a second type of database like a partner database or something like that um so we touched on them a few times but never um we haven't done anything on full text search we haven't done anything more recently on semantic
Starting point is 00:04:20 search we've done like a pg vector episode and a few a few kind of related ish subjects but no it crossed my mind that we hadn't touched on this really as a as a topic and i obviously it's one of those subjects that the more you learn the more you realize you didn't know or the more complicated it gets so i can imagine us doing quite a few follow-ups on the more specifics or you know implementation details or the tricky parts. But yeah. Don't forget faceted search. There's such a term, right?
Starting point is 00:04:52 Yeah. I could be saying it wrong, but I think I've heard it called faceted. You're like roaring bitmaps and things like that. Well, usually we start from UI. In my head, this starts from UI. We have some things in big form consisting
Starting point is 00:05:08 of multiple, very different selectors and filters and so on. This is very common in various marketplaces. For example, imagine Airbnb. You want to limit price and location
Starting point is 00:05:23 and various categories and properties and so on. And let me put this on the table right away. Ideally, we always should have just a single index scan or index only scan. This is our ultimate goal always. But unfortunately, it's not always possible. Why do we need it? Because it's the best plan. And I think you can get like do you agree or not because you deal with plans with pg master to explain plans and so on like we yeah single index scan is the best
Starting point is 00:05:58 one of the things i love about databases and performance in general is when it gets to the point where you have to trade one thing off against another. And I think search is one of those topics where often we're trading off user experience versus complexity of the backend. So some of the nicest search features are just a search bar. But without getting any input from the user, you have to do really a lot of work on the back end to be able to serve that in any kind of performant way. So you've got the complexity of matching the results to the intent with the additional caveat that you want it to give some at least some results good results quickly and those are trade that's a trade-off like it it's really easy to get well not easy but you can give great results if you can search through everything you have and score everything and you know if you've got forever to return them, but if you've given yourself a budget of returning within a few hundred milliseconds, suddenly that becomes a more difficult problem. So I love that it's a trade-offs and yes one of those is performance
Starting point is 00:07:26 but sometimes i think you are willing to pay a little bit of performance for a better result yeah this this is difficult i guess what what is better results and what is like high quality of search right like i i remember definition that users should be happy, which is very broad. What makes users happy? Maybe we return good results, but UI is very bad, so they're not happy, right? It's quite an interesting topic. And I think you're right, but also I just dove into the very bottom of performance. Performance matters a lot, right? If search is very slow, the users won't be happy and it means poor quality of the search, right? So we do care about performance, but also we do care about things like if it's a full-text search, we want stop words to be removed and ignored.
Starting point is 00:08:31 We want some dictionaries to be used, maybe synonyms to be applied, and so on and so on, right? This matters a lot, and of course, but this also moves us to performance part because if these steps are slow, it's also bad. Why I was mentioning faceted search?
Starting point is 00:08:51 I just see a common pattern. Postgres is huge in terms of capabilities and extensibility and various index types, extensions. But we have simple problems unsolved. For example, take full-text search and order by timestamp or ID. I want the very, like, instead of old-school regular approach, return, like, most relevant documents to me, I want fresh documents to go first because it's social media and this is number one pattern but also they need to follow some full-text search
Starting point is 00:09:33 query i used right i just need to see the latest but following some text patterns and this problem is unsolved in postgres unfortunately and there is good, the best attempt to solve it, it's called the RAM index, which is extension, like a new generation of GIN index. But why isn't it in core? Because it has issues. It's very, it's huge, it's slow, and so on. And similar things I now observe, and not only observe, I touch them. For example, just before we started recording, you showed me the Supabase blog post about how to combine full-text search and semantic search based on embeddings. I don't like the word embeddings. I like the word vectors.
Starting point is 00:10:22 Because embedding, in my opinion, in a database, it doesn't settle in my mind at all. Embedding is what we embed to our prompt. This is content. But vector is vector. Maybe I missed something, but why do we call vectors embedding? Honestly, in our database, this column is called embedding just because OpenAI dictated it.
Starting point is 00:10:49 But I also see OpenAI assistance APIs name state machines' states as status, which also like, what's happening there? It's status is state. State machines state, like in progress, function call, etc.
Starting point is 00:11:06 But it's off topic. So these vectors, they provide us great capability to have semantic search. And we have text search. And supervised article describes how to combine them. But basically, we perform two searches and then merge results. It means we cannot do pagination. Pagination is very important. Maybe user needs to go to second page, third page.
Starting point is 00:11:33 In quality search engines, they do need it. And in this case, it means that it's similar to offset problem. We described many times. Well, that's the only solution, I guess, is pagination through offset. So far, yes, but maybe it also possible to combine something right. Honestly, gene is also about multi dimensional things and like, I don't know, I don't know, it also has can I don't
Starting point is 00:11:56 know, like it's I, I know only parts of things here. But what like, like, I don't like to index scans, and then we combine things and we combine things and we lose pagination. I mean, we can have pagination, but if we want to go to page number 100, it's insane how much data we need to fetch. And buffers will show very bad numbers and analyze buffers, it means it's not working well. And different example, sorry, I will finish my complaining speech. So different example is what we have right now. In our bot, we imported more than 900,000 emails from six mailing lists, 25 years of them. So we have more than 1 million documents.
Starting point is 00:12:48 And of course, immediately, like before we only imported to the bots knowledge base only documentation, source code and blog posts. And all things were quite relatively fresh, Almost. But when we imported 25 years of mailing lists, archives, I'm asking, hey, bot, what can you explain me about sub-transactions? Okay, this is documentation, my article, but also
Starting point is 00:13:16 this very good email from Bruce Momjan from 2002. And it went to first place. It's not good we need to basically we need to take into account the age of the data
Starting point is 00:13:31 here right how to do that there is no good way if you work with pgvector there is no good way to deprioritize all documents to take into account the age of the data.
Starting point is 00:13:46 So what we did, we just, when usually we need to find like 10 or 15, 20 entries, maximum like 100 usually entries, and embed them as embeddings to the prompt. So what we do, we find
Starting point is 00:14:01 1000 entries. And then just in memory, Postgres recalculates adjusted similarity, adjusted distance based on logarithm of H. And this is how we do it. If nothing new, okay, we are satisfied with old documents. So we take into account the age. But again, this doesn't scale well. If we will have a lot of like 10 million documents, it will be worse.
Starting point is 00:14:31 And also we cannot have pagination if we talk about search here, right? Kind of similar problem as well. And this makes me think, great that we have extensibility, but these types of searches are so different. We have like... What is the name of when different things are combined? So it means that it's hard to build good system which works... Yes, heterogeneous. I know how to spell it, but I cannot pronounce it because I saw it many times in papers, but in scientific papers
Starting point is 00:15:12 and so on, in technical papers, but yeah, pronunciation... I'm not even sure I know how to pronounce it. Heterogeneous or something like that? I cannot pronounce only in Russian, sorry. So, what I'm trying to say, we are kind of Linux early stage. You need to compile a lot of drivers and deal with it to make the system work as you want, like a good product, right?
Starting point is 00:15:38 Compared to some things like Elastic, when you take it and things work together very well because it's a single product. What do you think about it? This is a problem. Accessibility has a negative side here. I think you've jumped straight to where are the limits of Postgres' search capabilities right now. And that's a really interesting topic and quite deep already. But it skips over all the things you can
Starting point is 00:16:07 do already in postgres and there are a ton of different like inbuilt things or a mod add-on modules or extensions that mean that those limits are being pushed further and further and i think a lot of people come from an assumption that Postgres won't be able to handle search super well because products like Elastic exist and are successful. And therefore, probably people aren't doing this in the database. But I see a lot of use cases that can be served adequately with good results in acceptable response times for users without touching any external services. So I think you're right that there are edges and there are limits that can be better served by other products, but those limits are quite far down the road for a lot of use cases. You can build pretty good search features for a lot of different use cases especially
Starting point is 00:17:05 if you're willing to learn exactly how it works and factor in your own product or services requirements if you're not just searching every field for every word or like i'm assuming like a text search type field it can be really powerful already. Yeah, I agree. I agree. But, yeah, well. Can we talk about some of them quickly? Like just to cover the basics. Yeah, let's talk about them. I agree.
Starting point is 00:17:35 And you're like basically echoing the usual problem I have. Like people, I had the cases when people listening to me said I'm a Postgres hater. So again, of course, this criticism goes quite deep. And of course, I don't like the idea to have elastic search for full-text search and the need to constantly synchronize or maybe some some what's the name of these new vector database systems, Pinecone or something like that. So you basically need to synchronize
Starting point is 00:18:14 data from your main LTP database all the time and you have a lag and then you bring some regular data there and you think how to combine and search that data. Because obviously for Elastic you need to not only bring textual data but you need to bring categories to have the same faceted search sometimes people want like I want to do full text search but again limit price, for example, right? Range, some range. And this usually is stored in a regular column in the relational database.
Starting point is 00:18:51 And of course, we have good capabilities to combine it with full-text search and achieve single index scan. For example, if you use GIST index, well, GIST is slower. It works well for smaller datasets. But you combine it with GIST B3 and GIST B3, right? B2GIST, I think.
Starting point is 00:19:14 Right. And then you have a capability to combine both full-text search and numeric range filter and have a single index scan. This is perfect. Again, I'm staying on the same point.
Starting point is 00:19:30 Single index scan is the best. But unfortunately, in many cases, we cannot achieve it. Ideally, user types something, chooses something on the form, press search, or maybe it's automated. Like, I don't like automated, I like to press search explicitly. Anyway,
Starting point is 00:19:48 we have request, and this request translates to single index scan, and we return, and this is an ideal case in terms of performance. Otherwise, for bigger data sets, you will have very bad performance.
Starting point is 00:20:03 Well, in terms of performance, but also in terms of system resources, right? Like we're also not having to use a lot of system resources to satisfy quite a lot of searches, whereas a lot of the alternatives require, because they're not just one index scan, require more resources as well. So I think it's efficient from a couple of angles, but it very much limits what the user can search for if you can if it has to be indexable that way some other search searches
Starting point is 00:20:35 wouldn't be possible like i i don't know about you but since um to give people a bit of an insight into how we do this we agreed a topic about 24 24 hours ago and every product I've used since I've been like thinking, how does search work here exactly? And it's really interesting how different products implement it. And not everyone does it the same. And we've been somewhat spoiled as users by Google, in my opinion, Google and Gmail, both of which have incredibly good search features for quite a long time and most people have experienced those but it isn't the same in every product not every product is firstly capable of doing that but also it is not quite the right trade-off for a lot of products either so like a lot of products you use or a lot of products I use, things like Slack, for example, or Stripe, they will encourage you to use filters.
Starting point is 00:21:30 They let you type whatever you want in and they will perform a wide search depending on whatever you type. But they encourage use of, you know, for example, in Slack, search within just one channel or just from one person or you know things filter it right down to make those searches much more efficient so it's interesting that they're doing that partly i guess to give you the results you're exactly looking for high up but also i guess to reduce system use like they they don't have to do as much work if you filter it down for them. All right. So I think there's a few things that the big beginners, I think, or when I was a beginner, when I didn't know quite how this stuff worked, I don't think I fully appreciated the complexity of doing search well. So there's the basics we when we say full text search by the way i never really understood whether like what the word full is doing in there it's basically just text search
Starting point is 00:22:30 does this document or does this uh sentence or something contain this word and this word or this word or this word and so basic kind of text-based searches i don't know why it's called full. Do you know? No. Good question. I think, right, so I understand the difference. Like, you can compare the whole value, in this case, a bit reasoner, but you'll be dealing with the problem of the size of this value, right?
Starting point is 00:23:05 But there's also, there's like a million complexities just, if we only consider that, there's a million complexities to do, like should you care about the case of, like does the capital letters matter? So full-text search is a very well-established area already. And where instead of comparing whole value or doing some mask, regular expressions, right, which is also an interesting topic, but it's also related, like a gene can be, T-gram search can be used, right? Yes. Instead of that, we consider words like, first of all, we, as usual, have this problem
Starting point is 00:23:49 with first normal form, but it's off-topic, right? Because this value is not atomic anymore, right? We consider each value as like, we have atoms in this molecule, right? And first of all, some words we usually either normalize using
Starting point is 00:24:07 this stemmer snowball stemmer right or we use some dictionary to find kind of synonyms or like no no synonyms is one thing it's also synonyms you can be used but I'm talking about I spell dictionaries, for example, when you take a word and... Stammers is very dumb. It's just cut the ending. That's it. You can feed some new word and it will cut it according to some rule. But I spell, it's a dictionary.
Starting point is 00:24:39 It knows the language. It knows the set of words. And it can transform words in different forms to some, like, normalized. It normalizes every word, right? And then we have, basically, we can build either a tree and use GIST, generalized index search tree, which can be used for even B3 or R3. B3 is one dimension, R3 is two or more dimensions, and R3 is based on GIST in Postgres because
Starting point is 00:25:15 implementation based on GIST was better than original implementation, which was not the case for B3. B3 remained the native implementation, but there is GIST implementation. That's why I already mentioned it, right? And then, so, 3, right? Great. So you can find the entries which have your words,
Starting point is 00:25:40 but also you can, there's an inverted index, GIN, right? And GIST actually, oh, gist, I didn't mention. So, B3 is one dimension, so just one axis. R3, two or more dimensions. And you can build three, for example, rectangles in two dimensions. But what to do with text? Because it has a lot of words. We can consider words as kind of array or set, right? And then we can say this set contains that set, right?
Starting point is 00:26:12 Or is contained. So we define operators, intersects, is contained, contained. And in this case, we talk about sets and we can still build the tree based on gist. There are seven functions you need to implement to define the operations. And like basically, so for sets, we can build tree and it's called actually RD3, Russian doll tree. This is how full-text search, it's official name for Berkeley paper.
Starting point is 00:26:42 We can attach the link to it. And this is how originally full-text search was implemented based on GIST. But also later it was implemented GIN, which is generalized inverted index, which works much better for very large volumes of data. And this is what search engines use. So it's basically a list of terms and links to in which document terms are mentioned
Starting point is 00:27:11 and then there are internal B3s to find faster each term. I think there are two kinds of B3 inside GEN, but it's like implementation details. In general, it means that we can say okay these words are present in this document right and we can very fast find them and we can
Starting point is 00:27:33 also order by like rank rank it's interesting thing it's calculated based on most relevant documents. For example, I don't know, words are mentioned more in this document, right? Do we have phrase search? Can we do double quotes? Yeah, we have also some. Whoa. And or we can write some formulas, right? There's like followed by. Yeah, you can do like followed by for phrase search,
Starting point is 00:28:04 but there's also a... So we have some data types and loads of functions that are really helpful for doing this without... Four categories, right? A, B, C, D, right? We can... Like waiting. What do you mean by categories?
Starting point is 00:28:19 I don't remember. I remember when you define Genindex, you need to convert data using two TS vectors. So you convert to special text search vector, TS vector type. And you can say that some parts can be considered one category, some parts different category. There are maximum four categories. And when you search, you can say, I'm searching within only a specific category. It means that, for example, you can build one TS vector, but take, for example, if you're indexing, for example, emails, you can take words from subject and mark them as category A, for example, right?
Starting point is 00:29:00 But body is B. And then you have freedom and flexibility to search globally, not taking into account the origin of the words. Or you can limit inside single, same search, you can limit saying I'm searching only inside subject. So you can mark these parts of TS vector, which is good, but four is not enough in many cases. Right. So there are many capabilities in development. Yes, and you don't have to build that much around it to get something pretty powerful out. And one thing I learned about relatively recently from a blog post was web search to TS query. So there's TS query the the query representation and the
Starting point is 00:29:47 tears vector being the vector representation like the normalized so once you've like taken each word normalized the like plaws out of it and things like that yeah web search means you can achieve a query that there's a bit like you might imagine a fairly complex search engine search like taking like the not operator and saying i don't want documents that include this word or using and and or type operators as well so it's it could let you build something that has some basic search engine like features built into the text search field without much work at all so and these will come like in postgres core i don't you don't even need an extension for the for this lot which is pretty cool but yeah i was going to move on i think obviously that contains loads of complexities
Starting point is 00:30:37 and helps you solve some some of the trickier things immediately but there's also relatively inbuilt feet like like modules for fuzzy search so like handling typos elegantly or like the kind of things that just start to get a bit more complicated do you want do you want to be able to match on people nearly getting the name right and not not all products do but it's pretty common that we do want to. When I first came across Treegram search or pgTreegram, the extension, I was blown away by how elegant a solution that is to typos. So were you not as impressed? I was thinking it's so simple and it works so well. Treegrams.
Starting point is 00:31:23 Yeah. Let me disagree with you. I use it many times. Go on. For many, many years. And I cannot say it's elegant because it requires, first of all, it requires a lot of effort.
Starting point is 00:31:36 It's not just that you say, I want 3 grams here, that's it. No, you need to do something, like some things you need to do. And also at really high volumes, it doesn't work well in terms of performance. Sure. And if you have
Starting point is 00:31:50 a lot of updates coming, you will bump into this dilemma fast update or without fast update gene and pending list which is by default 4 megabytes, right?
Starting point is 00:32:05 And during regular select, it decides it needs to be processed. Select is timing out. Then you need to tune it. Well, I still have a feeling it could be better. No doubt. And I guess this is one of the examples of can work really well for a while. And once you like at a certain scale, maybe an external system is beneficial. But yeah. who are new Postgres companies, to improve and benefit from a more polished version
Starting point is 00:32:49 when you develop some product on top of Postgres, right? Yeah, and there are some startups, right, that are quite focused on search, or at least what, is it ParadeDB that are looking into this stuff? Right, maybe you're right, yeah. There's also one other that's worth, it would be a shame to finish this episode without mentioning them.
Starting point is 00:33:11 ZomboDB. Yes. Who do synchronization, who take the hard parts out of synchronizing Postgres with Elasticsearch. Yeah, that's a good target. Also, I wanted to mention, sorry, I was trying to find a book I recently bought. For specialists, it should be a super important book, I guess. And I need to thank Andrei Borodin, as usual.
Starting point is 00:33:38 This guy helps me a lot with Postgres in general. So this book is called Introduction to Information Retrieval by Christopher Manning, I'm sorry, and Henrik Schutze. So this book is interesting, and
Starting point is 00:33:55 this is exactly where I saw this, like what is good quality? Users are happy. And then a lot of formulas, right? It's interesting. Well, and it's a moving target one of the good blog posts like i reread as part of doing research for this was by justin searles about a ruby implementation and he made a really good point at the end about it being a moving goal post so users might be happy with waiting a few seconds one year, and then five, ten years later,
Starting point is 00:34:25 that may not be considered good enough anymore because they can use other search engines that are much faster. Or your data volumes grow. You talked about if your implementation relies on, well, volume might change, but also patterns might change. And you might find it's harder to provide as good results as your data changes or as your users change in their expectations. So it's kind of a moving goalpost constantly as well. So not only might the same results 10 years later not be good enough, but also like, yeah, it's a tricky one. But I think user happiness is one good one, but also Google uses lots of metrics like things like Pound.
Starting point is 00:35:13 Go on. So since we talk about semantic search, which supports some kind of AI systems, like these embeddings, I'm thinking about not user happiness, but LLM happiness, so to speak. And I think that usually we deal with very large documents, and when we generate vectors, there are certain limits. For example, the OpenAI limit is 8,191 tokens, roughly 30,000 characters. And for example, my article about subtransactions,
Starting point is 00:35:45 it exceeds 30,000 characters. So it was hard to vectorize it, right? And we needed to summarize it first using an LLM and then vectorize only the summary. Unfortunate, right? But it works, okay. What I'm trying to say is, when we talk about traditional search engines,
Starting point is 00:36:04 search results are not whole documents. They are snippets. True. And part of this happiness, or quality, is also how we present results. For example, we can provide snippets and highlight the words which are present in the query, right? It's good for user experience; the user immediately sees the words they typed mentioned. If it's a synonym it will look a bit different, but
Starting point is 00:36:33 anyway like it's it's a good practice but if you think like about google how it works you see some results second page first page second page and so on. But then it's good when very relevant results are on the very first page, right? Maybe on the first top, like included in the first page, and you're satisfied. There is a new topic, LLM happiness. They should be satisfied. But what does it mean? It means that what we decided to include, to embed, to prompt and use in answers should be very relevant, right? And this is a very big topic which is not yet discovered properly.
Starting point is 00:37:14 In my opinion, what I have in my mind, I'm just sharing, and if some guys are interested, I would love to have a discussion around it. So what I think, this is what we are going to do. Not yet, we are going to do it. We are going to return many snippets or summaries or something like that, not whole documents, and then ask LLM internally during the same cycle of requests. We just say, LLM, evaluate. Is it relevant? Do you think based on this snippet, is it relevant to your original request? And if it's on the scale from 1 to 10, then we limit the result and then it's important, and this is how humans behave. We open each document and inspect it fully and think again is it relevant and only then and
Starting point is 00:38:06 maybe we go to second page I don't know like maybe second page is not needed we just how our first page can be bigger because it's everything is automated here but we inspect full document and decide is it worth keeping in using in the in the answer or no maybe no maybe in the end of this process we will have zero documents left. Maybe we need to think maybe to provide different query. And what I see from LLM sometimes if we do this, like you ask something, there is this RAC system which performs a search and you are not satisfied. And if you just tell, like if you consider LLM as a junior engineer, you just say,
Starting point is 00:38:48 I'm not satisfied, you can do better. Invent some new searches and try them. You don't say exactly what searches to use. LLM can use, in all cases I did this, I was satisfied on the second step. So I mean the internal process of searching can be very complex.
Starting point is 00:39:07 It might take longer, much longer, but the result will be much better. And the question about quality here is shifting. This semantic search should be somehow different to be considered as good quality for this type of process, for AI systems. So I'm reading this book, understanding that it was written for search engines targeting humans, and I'm very interested how this topic will be changed. I'm sure a lot of things will be inherited, a lot of sciences there, like huge science in search engine, like information search, it's a huge topic, right?
Starting point is 00:39:49 So a lot will be inherited, but there will be changes as well because of high level of automation, and I'm very curious how quality will be changed. So this is a very interesting topic. I have many more answers, many more questions than answers. Yeah. Thank you for listening to this long speech. It just sits in my head right now, like what will happen. Don't you consider search engines semantic search too? Yeah, well, Google, it's known,
Starting point is 00:40:27 and all top search engines, they do semantic search for many, many years already. I know that. But now we, like, they didn't provide, like, do you know API to Google? I don't know. I only know some workaround solutions to it, right? But now we are building small Googles for our small knowledge bases.
Starting point is 00:40:46 And it's interesting what like PgVector is a very basic thing. Of course, a lot of work there a lot like two types of indexes it provides and so on, performance improvements. But what to do with age? What to do with this embedding process? Yeah, these questions are big right now in my head. Cool. I'm looking forward to some in-depth ones on this. Yeah, I hope we will have some follow-up episodes maybe about full-text search as well and semantic search as well and so on.
Starting point is 00:41:21 Faceted search as well. Yeah. Sounds good. All right. Thanks so much, Nikolai. Take care. Thank you.
