Postgres FM - pgvectorscale

Episode Date: June 14, 2024

Nikolay is joined by Mat Arye and John Pruitt, from Timescale, to discuss their new extension pgvectorscale and high-performance vector search in Postgres more generally.

Main links:
https://github.com/timescale/pgvectorscale
https://www.timescale.com/blog/pgvector-vs-pinecone
https://postgres.fm/people/matvey-arye
https://postgres.fm/people/john-pruitt

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

Postgres FM is produced by:
Nikolay Samokhvalov, founder of Postgres.ai
Michael Christofides, founder of pgMustard

With special thanks to:
Jessie Draws for the elephant artwork

Transcript
Starting point is 00:00:00 Hello, hello, this is PostgresFM, episode number 101. My name is Nik Samokhvalov, founder of Postgres.ai. And I thought about, since last time we had episode number 100, and it was interesting because we discussed databases at scale of 100 terabytes and beyond, logically this episode should be something very simple and beginner entry level because it's 101, you know, in the US. 101 means something like some introduction to something, right?
Starting point is 00:00:34 But I guess we will dive quite deep. I hope we will dive quite deep into some details of some things. Very interesting to me personally. But also, since a couple of days ago Timescale released an addition to pgvector called pgvectorscale, I asked
Starting point is 00:00:54 developers to join and they agreed. So meet Mat Arye. Hi, Mat. Thank you for coming. Thank you for having us. And John Pruitt. Thank you for coming as well. It's great to have you.
Starting point is 00:01:13 Great to have you. And let's start. It's kind of an interview, but maybe it will be some informal discussion. Because, I don't know, we've known each other for some time, so why not, right? So let's start from the beginning, from maybe some distance. Why do we need vector search in Postgres at all, and maybe vector search in general? Because recently, not long ago, LLMs have been getting bigger and bigger at a very high pace in terms of how many dimensions, but also in terms of context window. Gemini now supports 1 million. There are talks about 2 million. Although I must admit Gemini is terrible in terms of
Starting point is 00:02:05 reliability. We have a lot of credits, being an AI startup, we have a lot of credits to spend. It's a very well-known problem with Gemini that it returns 500 errors all the time. It's making me very sad. But a 1 million token context window means that probably we don't need vector search at all and we can just put everything into one prompt, right? There was such a crazy idea when context became really large. Imagine if we have like a 100 million token context window. You can just put everything into a single prompt. Or is it an insane idea? What do you think? I think there are a few points to point out. First of all, there's still the bandwidth that you need to use up to send the question to the LLM, right?
Starting point is 00:03:05 So there is going to be a pure networking limit at some point. Another point is that even with very big context windows or token windows, yes, you can give the LLM a lot of information, but there's actually an open question about whether it will use all of the information it gets well. And there is some academic research on LLMs that is actually showing that giving it less information, but more relevant information, in the prompt gets you better results than giving it all the information you have, right? And so, I mean, with LLMs in general, all of this is purely empirical research.
Starting point is 00:04:08 There's actually not very good scientific backing either way. But with the LLMs that people have tested, I know that huge prompts are often not very good in terms of performance, in terms of the answers you get. So in that sense, narrowing problems down using either vector search or text search or hybrid search, whatever you want, in many cases gives you a better answer. And apart from that, and this is a bit controversial, I'm actually not convinced that RAG is the killer app for vector databases. There's a lot of other use cases for vector databases, starting from clustering, recommendation engines, plain old search engines, right?
Starting point is 00:05:20 A lot of other things that have nothing to do with RAG and that increasing context windows don't help with at all. These are just orthogonal things. And the basic thing that vector search gives you is semantic search. And there's a lot of applications for that. So not only RAG, right? So interesting. Yeah, that's... Okay, so your answer is like
Starting point is 00:05:53 has two points to summarize. First, we still need to narrow down and find... And if we put everything into one prompt, it will not be efficient. I agree: with LLMs, sometimes, if you put too much information, the answers are not good.
Starting point is 00:06:14 Quality of answers decreases for sure. And second, it's not only about RAG. I would like to explore this. Do you see some good examples beyond retrieval augmented generation? Actually, while you think, let me just spend maybe half a minute, because I know our audience is mostly Postgres engineers, DBAs, and backend developers. Of course, many have already heard about RAG, but just a small recap. It's quite a simple thing. And actually, you're right, maybe vector search is not needed in RAG.
Starting point is 00:06:50 For example, we can use full-text search there, right? So the idea is: when some request is coming to the LLM, before passing this request to the LLM, we have intermediate software which finds additional information which should be relevant, and here vector search can be useful, but maybe not only vector search, for example full-text search. And then we augment this request with this information, and the LLM has some context. Basically it's similar to if you go to an exam and you have a lot of information in your pockets; when you need to answer, you can get this information
Starting point is 00:07:34 out of your pockets and answer better. Or use Wikipedia, for example. Similar. But you just told me that it's not only RAG. What else for vector search particularly? I mean, fundamentally, I think any time that you are right now using full-text search or text search, you could substitute that with semantic search to give you better results. Let me just give a simple example for people. With a full-text search, searching for a query on car does not return results that have truck in the document, right?
Starting point is 00:08:25 Semantic search solves that because it's kind of I mean, that sounds almost magical, but it's kind of a search on meaning. So things that are close together in meaning get retrieved. So you could use
Starting point is 00:08:42 a totally different lexical structure, a totally different word, and somehow this type of search figures out that they are similar in the meaning space, if you will. Or you could use it to augment full-text search or as a hybrid of the two. Have you seen good examples of this?
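To make the car/truck example concrete, here is a minimal SQL sketch, assuming a hypothetical documents table with a pgvector embedding column and embeddings computed by some external model:

```sql
-- Hypothetical schema: a documents table with pre-computed embeddings.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    body      text,
    embedding vector(768)  -- produced by an external embedding model
);

-- "car" and "truck" land close together in embedding space, so a query
-- embedded from "car" can retrieve documents that only mention trucks.
-- :query_embedding stands for the embedding of the user's query,
-- computed in the application and passed in as a parameter.
SELECT id, body
FROM documents
ORDER BY embedding <=> :query_embedding  -- <=> is pgvector's cosine distance
LIMIT 10;
```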
Starting point is 00:09:10 I saw terrible techniques like, for example, let's find 100 results from full-text search, 100 results from vector search, and then combine them. But I always have a question: what's the ranking system? Because it's very different, and sometimes we want fresh information. A specific question I would like to explore maybe later is how to reflect freshness of data. We can discuss it maybe later, because I also already want to discuss the thing you just released. But did you see any good examples of a combination of full-text search and semantic search, like vector search?
Starting point is 00:09:54 It's hard to say because we are database providers, so we kind of see the... We don't see the examples that our customers have. We get told by our customers that
Starting point is 00:10:10 hey, this technique works. You find it useful, blah, blah, blah. But we rarely actually see the results ourselves. But like anecdotally, a lot of people are using hybrid
Starting point is 00:10:26 and like in the AI community, I think it's almost becoming the standard, which is potentially a problem for Postgres we can go into later, but that's what we're hearing at least a lot. But I will agree that this is totally all ad hoc, non-scientific, like crack my left ear with my right arm type thing, right? So, yeah. For me, it looks almost always like all examples I see, they look ugly because, for example, if you think about pagination, there is no way to have pagination there.
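One common way to combine the two result lists that Nikolay describes is reciprocal rank fusion; a rough sketch, assuming the same hypothetical documents table plus a hypothetical tsvector column, might look like this:

```sql
-- A rough hybrid-search sketch using reciprocal rank fusion (RRF), one of the
-- ad hoc combination schemes mentioned above. Assumes the hypothetical
-- documents table plus a tsvector column named fts_vector.
WITH vec AS (
    SELECT id, row_number() OVER (ORDER BY dist) AS rank
    FROM (
        SELECT id, embedding <=> :query_embedding AS dist
        FROM documents
        ORDER BY dist
        LIMIT 100                     -- top 100 by semantic similarity
    ) v
),
fts AS (
    SELECT id, row_number() OVER (ORDER BY score DESC) AS rank
    FROM (
        SELECT id, ts_rank(fts_vector, q) AS score
        FROM documents, to_tsquery('english', 'car') AS q
        WHERE fts_vector @@ q
        ORDER BY score DESC
        LIMIT 100                     -- top 100 by full-text relevance
    ) f
)
SELECT id,
       COALESCE(1.0 / (60 + vec.rank), 0)
     + COALESCE(1.0 / (60 + fts.rank), 0) AS rrf_score  -- 60 is the usual RRF constant
FROM vec
FULL OUTER JOIN fts USING (id)
ORDER BY rrf_score DESC
LIMIT 20;
```

As the discussion notes, the ranking constant and list sizes here are arbitrary; this is exactly the kind of ad hoc tuning being criticized.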
Starting point is 00:11:11 Although if we go back to vector search, it's also problematic there, because to go to page number 100, you need to extract all 100 pages and it's very inefficient performance-wise. So let's discuss the project. It's called pgvectorscale. It looks like you just appended a suffix, scale, to pgvector. And it's a very interesting approach, because let me ask you this. It's very well known that TimescaleDB is a great thing for time series. It has a free version and you can host it yourself, a community edition, I think it's called. I doubt it's true open source because it doesn't have an OSI-approved license, right?
Starting point is 00:12:00 So it's like some specific thing. Or maybe an Apache-licensed version exists, but it lacks a lot of good stuff. For example, I think it lacks compression. Maybe I'm wrong, is it? Yes, it lacks compression. So we have an Apache version, we have a community version. Yes. Right, I think you can use the extended version, still not paying, if you host it yourself, but it cannot be considered true open source because this license is already something like hybrid and so on. We call it source available. Source available, yeah, yeah.
Starting point is 00:12:33 That's a common term. Right, so then Postgres itself has a very, very, very simple and permissive license. That's why many commercial products are developed on top of it. Then we have pgvector, which is a separate extension, which in my opinion
Starting point is 00:12:56 should at some point go to core Postgres. But there is a big question of how, because it's breaking all the rules: just by creating an index, you might get results which differ from the original results without the index. It's something which we didn't have before, because indexes always just sped up queries. Now we can change results because it's approximate nearest neighbor search. This project, I think it's
Starting point is 00:13:27 under which license? I don't remember, but I think, yeah. Postgres license. It's also the Postgres license, right. Maybe that's why you also developed your product on top of it using the Postgres license. But it's very interesting, because before, Timescale had this major product, everyone loved compression especially, but only if you go that route, source available, or you go to Timescale Cloud. Now you release this thing under the Postgres license, meaning that anyone, any cloud provider, most of them already have pgvector. So naturally, they should already be considering
Starting point is 00:14:13 adding your extension to their services. And I think, of course, you've thought about it very well. So my question is why you changed your approach compared to TimescaleDB. Yeah, so I'll speak for myself, maybe not for the company. I think, look, we all love Postgres, but I think Postgres needs different things in different areas. In the time series area, Postgres already has a lot of functionality built in, and we are already building upon a well-known foundation, where I would argue that, given our broad definition of time series,
Starting point is 00:15:09 in many ways, Postgres already dominates the market. So we felt that we could build a company while contributing some things to Postgres, but also, quite frankly, making a business for ourselves and making money in this area by preventing hyperscalers from making all the money off of our work, which is what the TSL license does. It doesn't stop any individual from benefiting from our work, or really any company. It only stops the cloud providers from making money off our work.
Starting point is 00:15:54 And that was done for pragmatic reasons, so that we could make a business. In terms of vector, I think the whole vector market is much more nascent and much younger. And I don't think Postgres has won the market; a lot of other databases have gotten millions of dollars in the past year to develop their own solution. This is a very hot, fast-moving market where I don't think Postgres yet has a particular...
Starting point is 00:16:42 It hasn't won, right? And so, because we like Postgres so much, we wanted to help Postgres win. And so... There are many benchmarks where Postgres has very poor results with pgvector on an IVFFlat index, without HNSW, and so on, right? I will say, still, with pgvector 0.7 things have improved on the pgvector side
Starting point is 00:17:15 quite a lot as well. But yeah, I think not only in terms of benchmarks, I think in terms of actual usage. You know, last episode you talked about 100 terabyte databases. We want to reach 100 terabyte databases that have vector data. We have to do a lot of work to make that happen. Right. And I think that's the ultimate goal. And that should be the ultimate goal of everybody in the community. And we thought that it was more wise to help the entire community succeed at this point in time. Yeah. When I saw benchmarks like 100,000, 500,000 vectors, I was like, what's happening here? And people say, oh, it's already a lot. It's not a lot, right? You take any company, take their databases, and they already have so many data entries in their databases. And out of those entries, we can create vectors usually, right? So it means that we should speak about billions already,
Starting point is 00:18:28 but it's not there yet because it's problematic because vectors are large and building indexes is slow, takes a lot of time. And latencies of search, of course. Yeah, so everything is quite slow. And I'm glad you did your benchmarks starting
Starting point is 00:18:46 with 50 million vectors, already very good. And I'm super glad to hear that you talk about terabytes of data. But by the way, some people say, okay, for us, like for Postgres guys, not for Timescale, regular Postgres guys, one terabyte is usually one billion rows. With Timescale compression, it's many more. One terabyte should be maybe tens of billions of rows. So it's a different order of magnitude. But if we talk about pgvector, 50 million vectors... Let's talk about, like, this project has two things to bring to the table. Maybe let's first close with a non-technical question, because I cannot skip it, this question is just sitting in my head. So pgvector had the IVFFlat index originally, then HNSW was added, and I remember Neon participated
Starting point is 00:19:50 starting with a separate project and decided to contribute to pgvector. So now it has two indexes. And pgvectorscale brings a third type of index, right? Mm-hmm. It could be a pull request to pgvector, no? So we actually talked with Andrew Kane about this. The issue
Starting point is 00:20:11 is that we are written in Rust and not C, and Andrew thought that it would be better as a separate project, and I can't say I disagree. I think either could have worked, but
Starting point is 00:20:29 I think given the language difference, it makes some sense to have it as a separate project. But we did offer to upstream. Yeah. Well, makes total sense, but why Rust? Because I like to work quickly, and I thought Rust would allow me to do that more than other things. I should say I'm one of the people that worked on compression in TimescaleDB, and our compression system had its own data type on disk. And I remember the pain of having to make sure that the way you're writing data on disk would work very well across platforms, big-endian, little-endian, and all of these alignment issues. It's just a lot of double checking that you have to do to
Starting point is 00:21:37 make sure everything is correct. I wanted to avoid this. And with Rust, a lot of this could be a lot simpler. So like... Yeah, makes total sense. Yeah. Right. So this makes total sense. But for the end user, it's interesting, especially for those who self-manage. So you need to take Postgres, then you need to bring pgvector, then you need to bring pgvectorscale. It's an interesting situation. We do have packages, but yes. Yeah, so this is interesting.
Starting point is 00:22:20 But now I understand better. Let's talk about the index itself. What can you tell us about the index compared to HNSW? I know HNSW is like a memory thing. This works with disk. This is definitely the right thing to do if we talk about terabytes of data and billions of rows and so on. Yeah. So I'm going to hand-wave a bit. The way to index vector data is using a graph structure, so a graph data structure. You can imagine each vector is a node and is connected to other nodes that are mostly close to it, right? And then there are some far edges,
Starting point is 00:23:28 and the traditional problem in these types of indexes is getting from one end of the graph to another end of the graph. So your starting point is very far from your query vector, which is what you're looking for. It takes a lot of hops in these graphs. And HNSW and DiskANN are pretty much different ways of solving that basic problem.
Starting point is 00:24:04 HNSW solves this by introducing layers into the graph, where the top layer where you start only has long-distance edges. So it kind of allows you to jump a long distance but close to where your query is, and then you go down a level in order to make more fine-grained jumps, and there are several levels. By the time you get to the lowest level, you are kind of in the most fine-grained area of the graph, and those top levels are the things that help you solve this long-jump issue, if you would. DiskANN doesn't use multiple graphs. Instead, it kind of uses a neat trick in how the graph is constructed, specifically to inject these long
Starting point is 00:25:28 edges into the graph, which probabilistically allows you to jump faster to where you need to go. And this kind of flat structure is what allows you to keep a better locality of where you are. So instead of using multiple levels, the flat structure allows you to make fewer hops in memory, which allows this to work on disk, on SSD rather than RAM. By the way, maybe it's a side note, but I noticed in your Postgres setup you use local SSD disks, not EBS volumes.
Starting point is 00:26:19 Is that fair? I mean, they're ephemeral. If you restart the machine, you lose them, right? So in your benchmark particularly, I noticed this small point. Our benchmarks very concretely benchmark Pinecone against self-hosted Postgres, and most self-hosted Postgres people I know actually use NVMes and they do like... But not local NVMes. But you can do backups other ways, right? You could do streaming replication.
Starting point is 00:27:10 You could do other ways to get around this issue of things going away. I'm just very curious how much it helped. Because, of course, latency is better, disk latency is better. I can tell you it helped immensely, and I can say that we are right now actively thinking about how to bring this kind of
Starting point is 00:27:33 performance to the cloud as well. Stay tuned, I think we'll have some exciting stuff coming out. But yes, that's a very astute observation, because, you know, quite frankly, none of this works on EBS.
Starting point is 00:27:50 You need the random read performance of SSDs to make this work. So, by the way, I would be shocked if Pinecone wasn't using...
Starting point is 00:28:06 I don't know. I have no idea what they use, but I would be shocked if this was EBS-based. And for these 50 million vectors, how big was the database? The DiskANN index was about 40 gigs. The entire table, I think, was around 250, somewhere between 250 and 300 gigs.
Starting point is 00:28:40 Right, and the machine has definitely less than that. So, of course, it was not cached. I mean, we do need the disk. Okay. And back to indexes, DiskANN. I saw you mentioned, in the article and in the project itself, the Microsoft work, but you modified this work, right? But the question is: HNSW and DiskANN are both approximate nearest neighbor algorithms. Can we say one is better than another for most workloads and data types, data sets, or are there cases where one can win and cases where another can win? I think
Starting point is 00:29:30 the HNSW index supports concurrent index builds, which could make it faster, at least for building the index. For building the index, yeah. So, yeah. But for search? For search and accuracy, look, we haven't benchmarked everything.
Starting point is 00:29:46 For the things we have benchmarked, we've seen higher throughput and higher accuracy, or the tradeoff was always kind of better than HNSW, but, you know, we haven't benchmarked everything. We've kind of concentrated on modern embedding systems. So, for example, we haven't gotten to benchmark very low dimensional vectors like 128 or 256; the lowest thing we've benchmarked is 768. So a lot of caveats, and I think, you know,
Starting point is 00:30:29 the space is so new, people should actually test on their own data, but I can say, in terms of search, we haven't seen where we're worse yet. People should always test on their own data, even if it's old school full-text search and so on,
Starting point is 00:30:50 because who knows what you have, right? Absolutely. That's one thing that made benchmarking so hard, especially versus our competitors, because a lot of the specialized vector databases have very few parameters that let you actually control the index. Whereas on Postgres, with both pgvector and pgvectorscale, there are, you know, several different parameters to tune the build of the index. And then also query time, plus all of the Postgres settings. And, you know, if you're self-hosting, you've also got OS and machine-level things to play with. So it's just a mind-boggling number of variables you could play with
Starting point is 00:31:29 if you had the time and money to spend on it. Yeah, and in pgvectorscale particularly, I checked the source code and the documentation that we have so far, and I saw also parameters you can touch, a couple of them at query time, and some can also be adjusted during build time. What can you say about them? Like some list sizes and so on. The query search list size, and, during build time, num_neighbors, how many neighbors?
Starting point is 00:32:02 Do you recommend trying to tune these, or so far no? We tried to put in sensible defaults. That is what we saw as the best combination, but obviously we wanted to let people also experiment. But in terms of build parameters, I would recommend keeping them at the defaults. The runtime parameters are really where I think people should experiment, because that allows you, at query time, to make the tradeoff between accuracy and speed, right? And there's the rescore parameter, which I think is the one we recommend to actually tune. I just realized today that we probably should change the default; it's pretty low. That was my fault.
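A sketch of where those knobs live, with parameter and GUC names written from memory of the pgvectorscale README at the time; the values shown are only illustrative, and the current docs should be treated as authoritative:

```sql
-- A sketch of building and tuning a StreamingDiskANN index. Names and syntax
-- follow the pgvectorscale README as recalled; verify before relying on them.
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;  -- CASCADE pulls in pgvector

-- Build-time parameters (the defaults are what the team recommends; shown
-- here only to illustrate where num_neighbors and search_list_size live).
CREATE INDEX documents_embedding_diskann_idx
    ON documents
    USING diskann (embedding)
    WITH (num_neighbors = 50, search_list_size = 100);

-- Query-time knobs: this is where to trade accuracy for speed per session.
SET diskann.query_search_list_size = 100;  -- candidates kept during graph traversal
SET diskann.query_rescore = 400;           -- how many candidates get exact re-scoring

SELECT id, body
FROM documents
ORDER BY embedding <=> :query_embedding
LIMIT 10;
```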
Starting point is 00:33:00 But that is the parameter I think people should play with. Yeah, that one seemed to have the most impact on both speed and accuracy at query time. Right. Okay, so one more question about the index. You call it StreamingDiskANN, right? Why streaming? How does it differ from the original Microsoft implementation? Yeah, so I think both the HNSW implementation and the original DiskANN implementation,
Starting point is 00:33:39 you tell them ahead of time how many things you want to return, and that just doesn't play very well with Postgres, quite frankly, because the Postgres model for indexes is that Postgres, the other parts of Postgres, can always ask the index for the next tuple. And this is done so that you can filter your results after index retrieval. So for example, the closest vectors to a given query that also meet some other criteria, that belong to the business department or the engineering department, right? The vector index only fetches the closest things to the query.
Starting point is 00:34:46 And then what Postgres does is that after index retrieval, it will filter on department equals engineering, right? Now, let's say your closest hundred vectors are all from the business department. That means, if that's correct... What? I mean, in this case, it will be fast because... Right, but if your parameter is, say, returning the closest 50 from the index, then no results will be returned at all.
Starting point is 00:35:29 Because the index returns 50 results. Those 50 results are then filtered by department equals engineering. Different department, right, right, right. A lot of work to be done, but zero results, right? Exactly. And there are zero results because there's this arbitrary limit that you give at the beginning of the query to tell it, hey, retrieve this many results. Whereas in reality, Postgres has no idea how many results you need to retrieve in order to match both the vector query and the filter
Starting point is 00:36:10 on department equals engineering, right? So what the streaming part of the algorithm does is it removes that restriction and makes the algorithm work in the way Postgres expects other indexes to work. So you can tell the index, hey, give me the next closest thing, then the next closest. Actually, you could traverse the entire graph, your entire table, like that. And that makes all of your queries that have a secondary filter completely accurate. And this secondary filter, that's a different column, for example, text or integer, right? Yeah, it's a different column. Often it's a JSONB, right? So you could have an article, and you could have a list of associated tags with it in JSONB. And you can say, hey, find me all the articles about
Starting point is 00:37:14 cars that also come from country USA, right? Or any kind of other metadata. So it's this combination of semantic and metadata, which is actually incredibly common. Right. But there we usually have either the situation where we have a B-tree and a GIN index
Starting point is 00:37:41 and Postgres needs to decide which one to use and then still apply a filter, or we need to use an additional extension like btree_gin, I'm bad with names as usual, and then try to achieve a
Starting point is 00:38:00 single index scan. But here it's not possible. I mean, the ideal world is a single index scan without additional filtering. Yes, that is the ideal world. I'll give you that. Right now, none of the indexes support that. And so the best you can do, the state of the art on Postgres right now, and I would argue the state of the art period, but we could leave that argument for another time, is you can hope that you can retrieve from the vector index and then post-filter and still have accurate results. I can imagine, for example, if we take this department name, put it inside this input text which builds a vector, and then additionally we filter by an integer like department ID,
Starting point is 00:39:07 it probably will work better in this case, right? Because first we apply semantic search, mentioning in the query which department we want. But then we think, okay, this filter will do the final polishing. But it sounds like, again, some ugly solution. Yeah, it's not only ugly. I'm actually not sure it won't work. But using my intuition, the department engineering in the text would kind of skew the semantics away from the actual thing in the text you're talking about. And so I'm not sure that would combine well in this kind of semantic multidimensional space.
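Coming back to the post-filtering approach Mat describes, a minimal sketch with a hypothetical department column on the same documents table looks like this; the streaming behavior is what lets the query keep pulling the next-closest tuple until the LIMIT is satisfied:

```sql
-- Post-filtering sketch: Postgres pulls tuples from the vector index in
-- distance order and applies the WHERE clause afterwards. Because the
-- streaming index can always yield the "next closest" tuple, the query still
-- fills its LIMIT even if the nearest few hundred vectors all belong to
-- other departments.
SELECT id, body
FROM documents
WHERE department = 'engineering'   -- post-filter, applied after index retrieval
ORDER BY embedding <=> :query_embedding
LIMIT 10;
```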
Starting point is 00:39:57 I mean, you could maybe add a dimension and like synthetically set a value in the dimension to represent which department it is. And one more dimension for time, like timestamp. Because this is so natural. I'm very curious if there are some works already in this direction because everyone needs creation time, like publication time or something for each data entry. And I didn't see good discussions about that yet. For example, we loaded almost one million entries from Postgres mailing list archives. It started working great, but when you
Starting point is 00:40:37 see some discussion from Bruce Momjian from 2002 about something, I don't know. I remember just some cases. It's already not relevant at all. You think maybe I should delete all this data, but still it might be relevant. So what we did, an ugly solution: we return 1,000 or maybe 5,000 results, and then we dynamically apply some score,
Starting point is 00:41:01 like I think it's a logarithmic approach for time. So we add some penalty if the article, well, email, is very old. It quickly becomes less relevant. We just combine it with the score, the similarity or distance pgvector provides, right? And this works well, but sometimes you have like one second latency on a one million row data set, and this is terrible and doesn't scale. So this problem of additional dimensions, I think it's huge.
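A sketch of the re-ranking hack Nikolay describes, with an illustrative weight and a hypothetical created_at column:

```sql
-- Over-fetch by vector distance, then penalize old entries. The weight 0.05
-- and the created_at column are illustrative, not taken from the actual setup.
WITH candidates AS (
    SELECT id, body, created_at,
           embedding <=> :query_embedding AS dist
    FROM documents
    ORDER BY dist
    LIMIT 1000
)
SELECT id, body, created_at,
       dist + 0.05 * ln(1 + extract(epoch FROM now() - created_at) / 86400.0)
           AS adjusted_score   -- distance plus a logarithmic age penalty (age in days)
FROM candidates
ORDER BY adjusted_score
LIMIT 10;
```

The latency problem mentioned above comes from the over-fetch: the index has to stream out 1,000 candidates before the penalty can be applied.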
Starting point is 00:41:40 We need to extend the original vector with non-semantic dimensions and use them in filtering and achieve one single index scan. Great performance, right? I would say that in the academic literature there is some progress being made. There is a filtered DiskANN paper, which I believe is two years old, which is ancient in this space. There was another paper from a group out in Berkeley talking about building filtering into these graph-based indexes as well. But, you know, you've got to crawl before you can walk. We just haven't had any chance to really implement these inside this index. But there is academic work in this area. Right. For this algorithm...
Starting point is 00:42:48 So you would need to be able to signal that certain dimensions, if that's where you're putting your filters, are... that they would need to be exact versus approximate. Yeah. Right. And do you think this approach is better than HNSW in this particular problem of filtering? Yes, I do, and
Starting point is 00:43:15 obviously I'm biased, but I think the simplicity of going from multiple levels to a single level really helps in a lot of these things, just because there are a lot fewer edge cases to consider. And let's not forget, this helps as well, right? I think streaming was a lot easier to implement because
Starting point is 00:43:48 it's a single level. Interesting. Let's not forget to talk about compression, because I'm sure we need it in the vector world; in vector search, we need it. And I saw in recent pgvector there are ideas, let's not use integers,
Starting point is 00:44:09 let's use floats, and so on, with already some kind of compression, so to speak. But you talk about real compression, right? Maybe some experience from the TimescaleDB extension? I mean, or no, or it's different, because there it's time series. I remember the blog has excellent articles, but maybe it's different here, right? I don't think I directly used any of the algorithms from the time series space for this, but the basic insight of using the statistical properties of the vectors you have, to kind of make compression give you better results, is exactly what led to it... So we have an algorithm called statistical binary quantization, SBQ, and it takes quite a simple but well-known algorithm called BQ, binary quantization, and kind of adapts it to your data set in a better way. You know, we have a blog post about it. It's pretty simple: it's pretty much using the means of each dimension and the standard deviations
Starting point is 00:45:29 to kind of better segment your space, if you would. And yeah, we just, honestly, we took a very experimental approach. We took various data sets and we tried different things, and it turned out that this worked. It added a few percentage points to the accuracy, which, once you get into the 90s, a few percentage points is a pretty big deal. And so yeah, you can read about it in our blog post. The algorithm is fully explained there.
Starting point is 00:46:17 And that's interesting. I'll add to that. So the compression that Mat's talking about is happening in the index. And the TimescaleDB compression is converting row-based into columnar and then compressing each column in the heap. And so we actually tried compressing the vectors
Starting point is 00:46:42 in the heap with TimescaleDB compression, and seemingly random vector strings of numbers don't compress very well. So that didn't help very much. Because in TimescaleDB, one of the key ideas is that for time series, values are changing, not jumping, how to say it. So there are deltas, and these deltas are quite small and so on. Yeah, this is what I remember from those blog posts. So yeah, for vectors, this doesn't work, I understand.
Starting point is 00:47:14 Okay, and for index compression, how much would we expect to achieve in terms of compression ratio? So SBQ uses one bit per dimension, and
Starting point is 00:47:37 the uncompressed version is float32, so it's a very easy calculation. It's always a 32x compression ratio. Good. Understood. Great. So yeah, anything else in the technical area we should mention?
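To put the 32x into concrete terms: a 768-dimensional float32 embedding takes 768 × 4 bytes = 3,072 bytes, while its one-bit-per-dimension SBQ representation inside the index takes 768 bits = 96 bytes.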
Starting point is 00:48:02 I think, beyond the size, it also helps with performance, because you can fit more. Of course. Fewer buffers to load into the buffer pool. Buffers, buffers, right. This makes sense, and also, I remember, your blog post also mentions
Starting point is 00:48:19 storage costs; you compare with Pinecone in terms of how much you need to spend each month to store 50 million vectors, and the difference is very noticeable, I would say. So yeah, this makes sense for sure. Anything else, like technical stuff, anything maybe you're working on right now? I think for pgvectorscale, that's about it. We haven't talked about pgai, which is the other thing we announced this week.
Starting point is 00:48:58 This is Python, an untrusted language, so it won't be possible to run it on managed services, as I understand. But it allows you to... or is it something different? Because it makes calls to the OpenAI API. It does. So with untrusted languages, you can't allow users on clouds to write their own functions. But if the functions are included inside an extension, that's fine. And it's easy to see why, because most extensions are written in C,
Starting point is 00:49:37 which is completely untrusted, right? So the entire point of the pgai extension is specifically so that this could be run on the clouds. So it limits capabilities, and if a cloud vendor decides to include it and verifies everything works well, bad calls cannot be done through it. So yeah, I understand that. It's almost like whitelisting certain functions. Interesting, interesting, interesting. Yeah, makes sense. So the idea is to make it really simple
Starting point is 00:50:17 to create vectors from regular data types, just transparently calling this, just with SQL, right? I think there are other similar implementations of this, but I like your idea of betting on cloud providers including this, even without providing untrusted language capabilities. There are a few things that do similar things, and I was always curious why they didn't just run the Python code to do this. And so that's what we did.
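For flavor only, the pattern being described looks roughly like this; the extension and function names are written from memory and may not match the actual pgai API, so check its README before copying anything:

```sql
-- Illustrative only: the exact pgai function names and signatures should be
-- checked against the pgai docs. The extension name and openai_embed() call
-- below are assumptions, and the OpenAI API key is assumed to be configured
-- the way pgai expects.
CREATE EXTENSION IF NOT EXISTS ai CASCADE;

-- Generate an embedding for each row's text directly from SQL. As discussed
-- just below, this blocks on a network call to OpenAI while running on the
-- primary, so it suits experimentation and small batches, not bulk pipelines.
UPDATE documents
SET embedding = openai_embed('text-embedding-3-small', body)
WHERE embedding IS NULL;
```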
Starting point is 00:51:01 A lot of people do very complicated stuff to get the same result, I don't know why, but it was really hard work. I like this approach myself very much. I mean, I've done it a lot for many years, just a SELECT and it calls something externally. But of course there should be a huge warning on the side, like, it doesn't scale well, because with a Python call you basically add latency to your queries on the primary, and CPU, primary CPU, is
Starting point is 00:51:33 the most expensive resource you have, right? Because if you want to generate a vector, you cannot run it on a replica, because you need to write it. Right. So you must do it on the primary. And while this query is running, it occupies the primary node, and offloading this work to Python application
Starting point is 00:52:01 nodes makes total sense, because the database doesn't notice this at all and the primary doesn't notice. You only speak to it when the result is already retrieved from OpenAI or another LLM provider. So this is handy but very dangerous in terms of scalability and performance and future issues.
Starting point is 00:52:21 CPU load and so on. Completely agreed. Completely agreed with you, and this is a way for people to get up and running quickly and testing and experimenting,
Starting point is 00:52:36 and we do have another project called PgVectorizer for when you need to batch things up and scale it and do it in the background. We have all of that as well. I haven't seen that, that's interesting. Yeah, but all of that plumbing that you have to write to drag data out of the database and then send it off to an API and put it back in the database...
Starting point is 00:53:09 I mean, you end up writing so much code that, you know, when it's in the database, you don't have to. Not to mention all of the, you know, bandwidth you're consuming and latency there. So, you know. Right. I remember I was copying images from S3 in SQL. Of course, it's similar: you call some APIs from Postgres. It's great, but... it's an interesting direction.
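The batching idea can be sketched on the SQL side as a simple work queue drained by an external worker; this is a generic pattern, not how PgVectorizer is actually implemented, and the table and column names are hypothetical:

```sql
-- Queue rows that need embeddings; an external worker (for example a Python
-- service) claims them in batches, calls the embedding API, and writes back.
CREATE TABLE embedding_queue (
    document_id bigint PRIMARY KEY REFERENCES documents (id),
    enqueued_at timestamptz NOT NULL DEFAULT now()
);

-- Application code or a trigger inserts into embedding_queue whenever a
-- document is created or its body changes (omitted here).

-- The worker claims a batch; SKIP LOCKED lets many workers run in parallel
-- without blocking each other.
BEGIN;
SELECT d.id, d.body
FROM embedding_queue q
JOIN documents d ON d.id = q.document_id
ORDER BY q.enqueued_at
LIMIT 100
FOR UPDATE OF q SKIP LOCKED;

-- ...the worker calls the embedding API outside the database, then:
-- UPDATE documents SET embedding = $1 WHERE id = $2;
-- DELETE FROM embedding_queue WHERE document_id = $2;
COMMIT;
```

This keeps the slow network calls off the primary's backends, which is exactly the warning raised above.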
Starting point is 00:53:35 Of course, with a warning. For a start, it's good. For a start, it's good, for sure. Great. Okay, I think I don't have any more questions. I'm glad, and I'm looking forward to trying this project. We wanted to do it before this call, like yesterday already, but we had some technical difficulties. I hope we will solve them soon and try pgvectorscale
Starting point is 00:54:01 for our case. I will tell you when we have results. Thank you. So as a summary... Just one more thing about that. This is a new project. We're trying to be very responsive. So if you run into any problems at all, file GitHub issues or contact us on Discord,
Starting point is 00:54:23 we are more than happy and very eager to talk to anybody. This is a young project, as I said, so all feedback is very welcome. Sure, yeah, that's a good point. And honestly, seeing this work in the Timescale company repository on GitHub makes me have some good expectations that if some problems are encountered, they will be addressed and so on. This is good.
Starting point is 00:54:50 And as a summary, let's say everyone who works with vectors in Postgres should check out this new type of index and compare. And this is great. I think there are many people who are doing this right now, and I'm looking forward to results in our case and other cases as well. Thank you for the good project, very interesting. Yeah. Thank you. Thank you. Thank you. Good luck with it. So yeah. Bye-bye.
