Postgres FM - pgvector
Episode Date: January 19, 2024

Michael is joined by Jonathan Katz, PostgreSQL Core Team member, Principal Product Manager at AWS, and also pgvector contributor, to discuss pgvector. They cover what it is, why it's so popular all of a sudden, some tuning and tradeoff guidance, and a look to the future.

Here are some links to things they mentioned:

pgvector https://github.com/pgvector/pgvector
Our episode on companion databases https://postgres.fm/episodes/companion-databases
Vectors are the new JSON (talk by Jonathan) https://www.youtube.com/watch?v=D_1zunKblAU
pgvector 0.5.0 feature highlights and howtos (blog post by Jonathan) https://jkatz05.com/post/postgres/pgvector-overview-0.5.0
HNSW indexing and searching (blog post by Jonathan) https://aws.amazon.com/blogs/database/accelerate-hnsw-indexing-and-searching-with-pgvector-on-amazon-aurora-postgresql-compatible-edition-and-amazon-rds-for-postgresql
Our episode on TOAST https://postgres.fm/episodes/toast
pgvector changelog https://github.com/pgvector/pgvector/blob/master/CHANGELOG.md
HQANN paper https://arxiv.org/abs/2207.07940
HNSW fast build branch https://github.com/pgvector/pgvector/tree/hnsw-fast-build
pgconf.dev https://2024.pgconf.dev

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and today I am delighted to be joined by Jonathan Katz,
PostgreSQL core team member, amongst other things,
Principal Product Manager at AWS, and contributor to pgvector.
Welcome to the show, Jonathan.
Thank you, Michael. Happy to be here.
Definitely a huge fan of your work and very excited for our conversation today.
Kind of you to say, me too.
So I reached out to you about this topic because it's about time we did an episode on pgvector.
How could we not have, given the last year? But neither of us knew it well; both Nikolay and
myself hadn't used it until recently. Nikolay has been using it. So we wanted to make sure we covered it, and who better to, well, Andrew Kane was unavailable
and you are definitely the next best person.
So we really appreciated all the content
you've been putting out,
the talks you've been giving on this.
So yeah, thanks for joining.
Yeah, happy to be here.
I mean, and by the way, when you say next best person,
it's like well below Andrew Kane.
Andrew has done phenomenal work on pgvector
over the past several years. Yeah, right. And it definitely predates the ChatGPT hype, right?
It was a couple of years beforehand. Do you want to give us a little bit of an introduction
as to what it is and how you became involved with it? Yeah, so pgvector, you know,
at the surface sounds, you know, very simple. It's a Postgres extension that allows
you to store and search vectors. And, I mean, I'll talk a little bit about how I got involved in it.
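A minimal sketch of what that looks like in SQL, with an illustrative items table and embedding column, and toy three-dimensional vectors standing in for real embeddings:

    CREATE EXTENSION vector;

    -- illustrative schema; real embeddings usually have hundreds or
    -- thousands of dimensions, not 3
    CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

    INSERT INTO items (embedding) VALUES ('[1, 2, 3]'), ('[4, 5, 6]');

    -- exact nearest-neighbor search, ordered by L2 distance (<->)
    SELECT id FROM items ORDER BY embedding <-> '[3, 1, 2]' LIMIT 5;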
But really to understand, you know, how this all came to be, it actually helps to look back in
history of Postgres. Postgres has actually always been able to support vectors, you know, back since
the Berkeley days, and it comes down to the array data type. So I mean, quite simply, a vector is, you know,
an array with certain properties: it has, you know, a certain amount of dimensionality,
there's certain things that each dimension must meet. And, you know, there's all sorts of math
around it. I mean, there's, you know, over a century of study of what a vector is.
But the reason why it was added to Postgres was not because of any mathematical properties,
it was actually as a shortcut to be able to look up ACL rules. So instead of having to do a join to another catalog table to get the
ACL rules, they're just embedded within a single column, you know, as an array. Oh, really?
Yeah, yeah, no, it's actually pretty cool. And then fast forward to the early 2000s,
Postgres added the cube data type, which allowed you to index up to 100 dimensional vectors,
ultimately using the gist index.
But it added all sorts of distance operations, which are the mathematical operations you
need for a vector, which we can talk a little bit more about that as we get into this.
But this is to say Postgres has actually supported vectors and vector lookups for a while.
But certainly there's been a more pressing need for it.
And in terms of a little
bit of how I got involved, perhaps my secret in the Postgres community was I originally wanted to go
into machine learning when I was in college. And while I was studying it, I was like, oh,
it's really fascinating, but to do anything with it, you're going to need a PhD and everything's
ad hoc. And I had a very
entrepreneurial spirit; I think I got impatient. And of course, you know, I didn't have the foresight to
see, like, oh, this will be commoditized and it'll be very simple to access machine learning algorithms
through simple APIs. So credit where credit is due: some very smart people, you know,
identified that and, you know, have been really building towards that. But again, you know, working with machine learning systems in college,
the fundamental data type was the vector. And back then, you know, a high dimensional
vector was something considered to be like 20 dimensions. And think about it, Michael, like,
what's a 20 dimensional vector? Like, I can barely, you know, I can understand three,
three is like points in space. I understand four, points in space moving around. What's five, plus time, right? Yeah.
Yeah, I studied, I don't know if you know, but I studied maths at university and did
pretty much just pure. I did a tiny bit of stats and a tiny bit of mechanics, but
I was best at the pure stuff. And yeah, five dimensions hurts my head to think about,
like, in order to try and
conceptualize it in any way. Yeah, it's funny that you say that. I studied math too, though, like, I would
say I was much stronger on the computer science side, but real analysis was one of my favorite
classes, because, you know, you were proving, like, everything in n-dimensional space. And again, like,
everything's a vector or matrix. You basically spend a semester just
studying vector spaces and all their properties. And to me, that was fun.
And maybe it was telling, fast forwarding 20 years later.
But my journey back to vectors came probably a little bit later than some folks.
The pgvector project started in 2021.
Andrew had the foresight to see that a lot of data was going to be generated around machine
learning and AI systems. And Postgres didn't really have support for the types of vector
lookups that we needed. So let's back up a second. So I said that Postgres had support for vectors
all along, but it didn't, it didn't. One thing about relational databases and Postgres in
particular is that you look for exact results.
You say like, hey, I want to find like the 10, you know, well, let's say I want to look up like Michael or Jonathan in a database.
You know, I'm going to write a select query with a where clause, you know, where name equals Michael, right?
And you expect to get the results for like Michael, like all the Michaels in the database.
You know, same thing when you're looking up points.
Postgres has the ability to
index things by k nearest neighbor, or, you know, find all the points that are around me. And again,
if I say like, hey, go in and find like the 10 closest coffee shops to me, you're going to find
exactly the 10 closest coffee shops. And these things work in, you know, like three dimensions,
four dimensions, which is typically where Postgres operates. If you take the PostGIS extension,
which adds geospatial support to Postgres, you can index three and four dimensional points,
and you can do it quite well. The GIST and the SPGIST indexes provide the ability to write these
indexes. But as you start adding dimensions, you go from, you know, even like eight to 10 to 100,
like, which is the limit of the cube extension, it's going to start getting very expensive, first in terms of placing the vector, or, I mean, it's not really a point anymore.
But, you know, think about it as a point: you're trying to place this in space and trying to figure
out, I have 100 dimensions, and I have to situate it, you know, somewhere in my graph, wherever
that is. It's going to be very expensive. And then even if you are doing some kind of exact
nearest neighbor search, or some index search, there's gonna be a lot of cost to it.
Yeah. And this is where Postgres, you know, starts falling over in a way, you know, because
a lot of these vectors we're seeing, if you look at some of the popular embedding systems, they're 1536 dimensions, which again, mind blowing number, like, I still don't even know
what like a five dimensional vector is. And here we're talking about 1500 dimensional vectors.
And to do an exact nearest neighbor
search, as they call it, or k-nearest neighbor, you have to compare against every
single vector that you store in a database. So you have a million vectors that are 1500
dimensions, you're doing a million times 1500 computations on them. And that's really expensive.
I can tell you it's very expensive, because, you know, I've been personally
benchmarking this for quite a bit.
And it makes sense, right?
It's at minimum a sequential scan of the entire table plus all of the CPU of doing those calculations 1,500 times per row.
1,500 times, yeah.
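You can see that cost in a query plan; a sketch against the illustrative items table from above, since without an approximate index the planner can only scan everything:

    -- an exact k-nearest-neighbor query with no vector index
    EXPLAIN (ANALYZE, COSTS OFF)
    SELECT id FROM items ORDER BY embedding <-> '[3, 1, 2]' LIMIT 10;
    -- expect a plan like Limit -> Sort -> Seq Scan on items,
    -- with the distance computed for every row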
And we can talk about it in terms of, you know, all sorts of the optimizations that are in place there.
Yeah.
This is where like that exact nearest neighbor problem gets hard.
And again,
you know,
Postgres didn't have the indexing mechanisms to handle it.
We'll probably further dive into that as we talk more.
But actually, what I'll say,
one of the big ones is just in terms of storing the data:
storing the data within the index, you have to fit your index value for a single row within an eight kilobyte limit, within the page. So that's already going to start creating some constraints. Meanwhile, outside of Postgres there was a whole field of modern vector research, as I like to call it, and it starts getting
to this idea of approximate nearest neighbor. And the idea with approximate nearest neighbor is that
you're trying to find your nearest neighbors, but with a best fit, which is you're not going to
search every single vector in a particular table or database. You can search a subset of it,
and you're hoping that you're seeing the representative set of the data you want to
return.
Now, again, for us relational database folks, that's mind blowing.
It's like, wait a second, I'm not returning the exact answers.
I'm returning like an approximate subset.
And I'll tell you the first time I used PG vector, and I was playing around with it,
and I was not getting like the results I expected.
At first, I'm like, oh, this is broken.
This doesn't work.
This doesn't make sense.
And, you know, that was the app developer in me, the DBA in me that started diving deep
and trying to understand like, oh, wow, there's like a whole science around this that we kind
of have to relearn when using a database.
Yeah, well, I often say to people, you can run a database without adding any indexes
and you can run any query you want and get the results back.
It's just faster to use an index.
Your query result will not change as a result of you having an index.
And that sentence is no longer true
if we include these approximate nearest neighbor type indexes.
So I have to stop saying that or work out a way of caveating it.
Yeah, well, just say, well, in this world,
when you're dealing with vector similarity search. But back to the question, you know, how did I get involved in pgvector?
So rewind to about the end of 2022. We may have heard about some of these generative AI systems
that captured the imagination, and we go and we, you know, type things into them and we get
back these, you know, human-like responses. And it's like, we're trying to think, like, oh, what are all the different ways to use it?
And one of the techniques that emerged as these systems grew in rapid popularity was retrieval
augmented generation, where these large language models, foundational models have trained on vast
quantities of data, typically publicly available data. But you may have data sitting in your
database, or there's some other private source that is not publicly available. For good
reason, it might be your, your product catalog, or information related to, you know, a documentation
collection that you have. But you're trying to deliver something to your users, your customers
that can take advantage of what's in these generative AI systems, but provide that personalized experience. And to do that, you need some way to be able to take
that data that exists in your private databases and safely and securely augment them with your
foundational models to be able to deliver that. And that's where the vector search comes in,
because the vector search provides a representation that you can search over and be able to augment
those answers. So seeing that, seeing this is how people want to use it, and frankly,
people want to use it with Postgres. You know, a lot of the user and customer conversations
I had were around, hey, I already have this data existing in my Postgres database, or I've already
built all my search mechanisms around Postgres, and I want to be able to simply use it. And there was an extension that did this, and it was pgvector.
So for me personally, it started to become a perfect storm.
You know, I'd studied this vector stuff in college.
I was a hobbyist machine learningist, albeit like, you know, the field has gotten way more complex than when I studied it.
And I happen to like databases as it happens,
particularly Postgres.
So I looked at this extension.
It seemed like a lot of people
were starting to adopt it in rapid fashion.
I mean, if you look at the star history of PG vector,
it's a curve like this.
Like I've never seen that before for a Postgres extension.
But I stepped back a minute as well
and I saw that these are net new database workloads.
A lot of what you see in Postgres in the past has been moving database workloads over, you know, a lot of
traditional, you know, transactional relational applications. Though, you know, you might call it
once in a generation, which in the tech world is every 10 years,
there's this new workload that emerges that just generates net new data. So about 10, 15 years ago, we saw that
with JSON. We developed this lingua franca across the web that made it simple to communicate between
all these APIs, but there became a demand to be able to store JSON data directly in a database
and easily query it out. And Postgres developed a solution for that, being the JSON and the JSONB
data type, which became quite efficient at being
able to query it. And what was nice is that, you know, the Postgres community rallied around what
people were doing and created a very efficient, effective solution. Side note, I've talked to
one Postgres user that has 40 terabytes of JSON in a single database, and they say Postgres rocks it,
which I'm like, wow, that's great.
I'm glad it works.
That's so nice for you.
Yeah, yeah.
It is really cool.
Again, I can't figure out how I would personally do it, but I'm glad that it works really well for them.
But we're seeing something similar with vectors.
You know, vectors are just a data type.
You know, it's actually a well-studied data type.
It's something you kind of look at, you know, in your introductory computer science classes.
But when it comes to efficient data search and retrieval, there's like so many nasty properties with them that make it like a really fun and challenging problem to work on.
But the thing is, like, this is, you know, this is this new workload that is going to be available to databases, including Postgres. And for me personally, I want to make
sure that we're positioning Postgres in a way to handle it. So where I got started with
pgvector was first using it, helping to get it deployed at AWS. But like any open source project,
if you're interested in it, just participate. There's a few things that I particularly focused
on,
which was performance, both from a contribution standpoint and a testing standpoint.
And first, I would just want to say a large chunk of the performance features have been
developed by Andrew Kane. He's done an awesome job on it. And more recently,
Heikki Linnakangas has even further improved the performance on it.
Where I stepped in was both just benchmarking
to make sure that we're focusing on the right areas of performance
and trying to prioritize what things to look at.
And a few patches here and there,
particularly around some of the index costing early on for IVFFlat.
But yeah, I saw that.
Oh, thank you.
I would love to get onto performance and tips for people.
But before that, I hadn't thought of this as net new data.
That's a super interesting way of looking at it.
And I hadn't thought of JSON as necessarily net new data.
I was thinking of it much more along the lines of the NoSQL workloads that felt quite competitive with the relational for a while.
But yeah, really interesting way of looking at it and it does
feel like an opportunity much like that did to be able to handle more within postgres with all of
the benefits that we've done episodes on uh the pros and cons of having data inside and outside
using specialized databases but you get gained so many benefits like the transactional nature
of things or the low latency or being able to join things together like there are so many benefits, like the transactional nature of things or the low latency or being able to join things together.
There are so many benefits of having it all together.
And not to mention the lower maintenance overhead or the lower operational knowledge needed of managing multiple systems.
So, yeah, great point. Love it. Thank you.
Cool. Yeah, so I was looking through your PRs and the blog posts on the additions. So IVFFlat was
the first index type that pgvector supported, and then more recently, well,
not as new as I expected when I looked back at the history, but the competing index type HNSW got added,
which you've also been involved in tuning, I've seen. Yeah, so IVFFlat was first.
And I think one of the reasons why is that it is a bit simpler to implement.
So it's a clustering algorithm.
Clustering algorithms are well studied.
I mean, they continue to be improved.
But one of the things you have to think about
when you're looking at implementing these algorithms
is that something that might work for an in-memory workload doesn't necessarily work for a database workload, where you're going between memory and an IO level, wherever your storage may reside.
And with IVFFlat, where I began getting involved was at AWS; we had rolled out pgvector.
I started talking to some of our customers who were early PG Vector adopters.
And it was definitely all over the board.
But I was talking to one who was definitely very savvy
on vector searches in general.
And they were remarking that
when they were trying to increase the number of probes
in their query.
So probe, so let's step back.
So the way IVFFlat works is that
you build clusters of vectors.
So let's say you have, let's keep it simple.
You have like 10,000 vectors.
What you want to do is you want to try to bucket them into lists.
So let's say you want to have a hundred lists.
So there would be a thousand vectors.
No, there would be a hundred vectors per list.
There we go.
Clearly I did well with my math major.
Arithmetic was not included, right? Yeah, exactly. Yeah, after a certain year, you never saw a number.
So, let's back up. We have 10,000 vectors, we want to put them in a hundred lists, each having
100 vectors. What happens with IVFFlat is that you try to find centers. So you want to find all
the vectors, you know, near a particular center, because the idea is that those are going
to be the
most similar vectors to each other. So that way, when you do a lookup, you're going to one of those
centers. And the idea is that these are likely, by approximation, the vectors that are most similar
to the ones that you're looking up. However, it may not be because let's say you have, you know,
two centers here and here, and your query vector is here. Yeah, which one are you closer to?
I had to
have somebody explain this to me twice, and the second time they used an analogy that worked
better for me. If we go back to two dimensions and look at, let's say, a map of Europe,
and we had the countries laid out, we could have, for example, it's probably not a great analogy, but
you could have the capital cities. And first off, we could compare any one point and say which
capital city is this nearest to. That's a much cheaper operation than looking at every single
city. And then we can go into that country and look for which of the actual towns or cities
we're closest to. But if something's close to a border, or the fact that not all countries
are the same size, you can end up in situations where you don't necessarily get the right center or country by doing that first one. Is that a reasonable analogy?
Yeah, that's really good because I started thinking about that. So, I'm based in New York
and I'm thinking, what capital am I closest to? It's technically Trenton, which is in New Jersey.
So, you know, if I just looked at one probe, it would say,
oh, you're in New Jersey.
Yeah.
Actually, I think it would probably take three probes, because I think Albany is the third closest capital to me.
There you go.
So there you go.
Yeah.
Great analogy.
So probes is the number of...
So you'd then do that first search and say, which are the three closest centers?
Yeah.
Centers, list.
I mean, list is the term that's used.
It might be used interchangeably
with centers. So if we increase that number of probes, we're increasing the cost of doing the
query because we're looking at more things, we're comparing to more things, but we're
increasing the likelihood we're going to get back the correct result or the number of things that
are correct. In our expected set.
So the measurement of expectation is recall.
So if you know that these set of vectors
are like your actual 10 exact vectors,
but you get like 8 out of the 10,
then you have 80% recall.
So with IVFFlat, as you say,
as you increase the number of probes,
you're going to increase your recall.
But you are going to increase the query time.
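In SQL, those two knobs look roughly like this; a sketch on the illustrative items table, with the lists and probes values picked arbitrarily (vector_l2_ops matches the <-> distance operator used here):

    -- build an IVFFlat index with 100 lists (clusters)
    CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops)
        WITH (lists = 100);

    -- visit the 3 closest lists per query instead of the default 1;
    -- more probes means higher recall, but a slower query
    SET ivfflat.probes = 3;
    SELECT id FROM items ORDER BY embedding <-> '[3, 1, 2]' LIMIT 10;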
Yeah. Now, in that patch we were talking about,
the problem was that we were actually over-costing.
So anytime
we were above five probes,
we then flipped back to a sequential
scan, which will give you 100%
recall, but it was taking the query time
from, you know, it was like a 50x
increase in query time based upon the
dataset. And we
definitely still had room to grow. So the first patch really was focused on let's get the costing
to be a more reasonable level, which gets into, you know, startup costs being one of those elements.
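If you want to see how the planner weighs this on your own data, a sketch (the probe count here is arbitrary):

    -- check whether the planner still picks the IVFFlat index at higher
    -- probes; the over-costing described above showed up as an unwanted
    -- flip to a sequential scan
    SET ivfflat.probes = 10;
    EXPLAIN SELECT id FROM items ORDER BY embedding <-> '[3, 1, 2]' LIMIT 10;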
For me personally, it really was my first time diving into the query planner code,
and I definitely saw all sorts of interesting things. I mean, first off, the query planner
code, there's a level of
brilliance to it. It encapsulates a lot, and, like, the work that folks have done over the past,
you know, 35 years tuning it is quite remarkable. I'd also say that it is quite clean. It takes
a lot, like, there's a lot of it, right? So there's a lot to dive into, but once you get
into it, you can navigate your way around. Editing it is another story. It's one thing to understand it,
it's another thing to propose a modification. But what's really cool about Postgres is that you
can make a lot of impact within an extension, or basically, you can add this functionality
without having to fork Postgres, you could just write it in a separate module and then add it in.
And PG Vector does this through the index access method interface,
which is actually quite robust.
You can build index types that don't fit the natural B-tree ordering
of a relational database.
So it's quite powerful there.
And we were able to ultimately get, I'd say,
pretty good costing for IVFFlat without having to make
any upstream modifications to Postgres.
Nice.
And the other, yeah, so on IVFFlat, because of these centers,
my understanding is that
it suits a data set you already have that is static or not changing much,
because if you're adding data, it can't adjust as it goes.
So there is this trade-off inherent in it.
And I'm guessing some of the customers you were talking to
had different workloads that wouldn't necessarily suit that.
And actually, what happened with your customer conversations next?
Yeah, so that's a good discussion.
And I'd say, you know, let's broaden it,
like, to how it's used with pgvector in general. You know, a lot of these conversations I have around the open source project, which is, again, awesome. So that's definitely something that came up: that if I keep adding vectors to the data set, or every now and then you update them as well, or even delete them, what happens is that the centers start to skew, because where you calculated all these centers are may start shifting.
So your results might start shifting as well.
And you might not be getting the expected results or the recall that you'd want to see.
The other thing as well is that for a lot of app developers, I mean, first off, again, you have to wrap your head around approximate nearest neighbor that you may not be getting the exact results that you expect.
And like I said, I tripped up over that as well. Like it's, you know, it is it's normal to make that
mistake, you know, you have to learn something different. The other thing is that it started
turning app developers into DBAs in the sense that you had to worry about some of your tuning
parameters. And granted, you know, there's two of them, right? There's the number of lists when you
build the index, which is not necessarily terrible, because as an app developer, you're still writing your SQL code, you're creating that index. But you have
to think like, what is the number of lists, and the PG vector project gives some guidance on the
number of lists to pick, but it's still like you have to experiment. And then you have to be mindful
of things like, oh, am I adding more data to it? I need to reindex. How do I reindex, you know,
who's doing the reindexing, et cetera.
And then you have to set probes, which again, you know, you're writing queries possibly
in your ORM and suddenly you have to set like this magic parameter to choose, you know,
how many of these lists you want to visit during a query.
So it's not as simple as set and forget.
And, you know, that's when I come back to my history of being an app developer: I always
want
to just write code and I want it all to just work. Yeah, I happen to explore the database as a side
effect of that, not as a first order principle of that. So it became time to explore an algorithm
that could be closer to set and forget. Like none of these approximate nearest neighbor algorithms
are set and forget, but we can at least make it easier. And this is where HNSW came in: Hierarchical Navigable Small Worlds. So HNSW is a little bit different than
IVFFlat, because it's not cluster based, it's graph based. And when you hear graph, you can
actually almost hear, you know, a graph is a superset of a tree. So even though Postgres is
very tree based in a lot of its algorithms, you can implement something graph based as well,
particularly with the custom index framework. And the way HNSW works is that you build a hierarchy,
or layers of vectors. And as you go down the hierarchy, the layers get denser and
denser. And the idea is that you start at the top layer, and you try to find which vector
am I closest to. You find your local
maximum, or effectively the vector I'm closest to. Then you go down, you search within that area,
and you find the one that you're next closest to, and you go down, and it gets denser, and you
search, and so on and so forth till you get to the final layer. And the vectors that you should be
clustered around are most likely the vectors you're most similar to. When you're building the index,
there's two parameters, you now have two index building parameters.
You have something called m, which is the number of vectors that you're linked to at a given time,
and pgvector defaults that to 16. You have something called ef_construction, which is the search
radius that you're keeping as you're descending down. The idea is that if you look at a larger
search radius, you're going to see the vectors that you're more similar to. The idea also with a larger M is that
you're going to be clustered to vectors you're more similar to as well. But there's definitely
trade-offs between how you pick that. What happens is that when you actually do the query,
you should be looking over a smaller space. So IVFFlat is going to grow linearly as you add
the number of probes, and it's going to
grow linearly by the number of vectors that are in each list. So the idea is that if you can keep
the lists relatively small, you're going to be able to do faster searches with IVFFlat. But
there's also trade-offs to that as well, that you might not be getting the recall that you want. So
you can see an IVFFlat query can get expensive over time, particularly if you're
linearly growing it. With HNSW, you're only searching over a smaller subset of the space. And again,
this could be dictated by your one search parameter, which is ef_search, but you're only
looking at a subset of the graph. So you're going to be looking at far fewer vectors. And the idea
is that you're being put into a cluster that's very similar to you. So there's trade-offs.
With
IVFFlat, I can build the index super quickly. And technically, if I'm only using one probe,
I might be able to query it super quickly. But the trade-off is that to boost recall in IVFFlat,
you know, what we've seen empirically is that it's going to start getting more expensive.
With HNSW, we can query really quickly, and again, the results
are showing that we can query and get high recall, you know, we're
basically seeing the expected results, and with pretty good accuracy.
But the trade off is gonna be on the build time, because we're going to do a lot more
work upfront to build that hierarchy and see all of the, you know, basically try to visit
enough vectors that seem most similar to me.
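Spelled out in SQL, again as a sketch on the illustrative items table, with pgvector's documented defaults made explicit:

    -- m = links per vector, ef_construction = search radius during build
    -- (16 and 64 are pgvector's defaults)
    CREATE INDEX ON items USING hnsw (embedding vector_l2_ops)
        WITH (m = 16, ef_construction = 64);

    -- ef_search = how much of the graph to examine per query (default 40);
    -- raising it trades query time for recall
    SET hnsw.ef_search = 100;
    SELECT id FROM items ORDER BY embedding <-> '[3, 1, 2]' LIMIT 10;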
And that's the push and pull between,
you know,
these two algorithms and just vector similarity search in general: you're going to have to pay the cost somewhere, and you have to figure out
what that cost is going to be and what makes the most sense for your
workload.
With HNSW at least,
you know,
if you're willing to pay that upfront cost,
which,
it seems like a lot of folks are,
you are going to get this very high performance,
high recall system, most likely. And again, there are other factors that go into this as well including
your embedding model or how you're acquiring the data or how quickly you're adjusting the data so
like, there's a lot to consider here. Yeah, I'm a performance guy at heart, and I love it when
there is actually a true trade-off, and people in different situations will have different preferences. But to me, it seems like a lot of the currently popular use cases
have a desire for as high a recall as is possible within a latency threshold. So they're willing to pay some latency and index build time costs for as high a recall as possible, to a point.
And then everyone has like a different point at which that is.
But yeah, I actually wanted to ask almost the opposite question.
Do you see some use cases where the index build time itself is the primary driver?
Or any other reasons that you're seeing people choose IVFFlat at the moment?
Yeah, there are cases where you might be needing to do rapid loads and a bit of analysis, but it's very transient.
So one of the cases I've heard is that there's a system with,
let's say, like 100 million vectors.
And it's really more about getting a rapid build and doing a spot check on the data within it,
as opposed to having something that's permanent that needs to be run all the time.
I think it's, like, a lot of these transient or ephemeral workloads where an IVFFlat index can make sense.
But frankly, like, I've talked to people who have just rolled out production systems with IVFFlat with several million vectors in them.
They're perfectly happy.
They're like, we're getting the results
that we need. We're getting it in the amount
of time that we expected to return
the query in. And they're completely
fine with it. So some of it is personal
preference. And some of it, like I said,
depends on other factors, such as
the data that has been vectorized.
How similar are the vectors based on the
embedding model? And are you able
to look at the distances within a range
where you're seeing the set that you want?
Awesome. That's really interesting.
So you've got a great talk on this that I'll link up
that you gave recently at PGConf EU.
And you've also had a couple of great blog posts
on the releases around 0.5.0 and 0.5.1.
I'll link those up so people can read them in their own time.
I did actually want to ask, you mentioned some tuning, like some people becoming DBAs and having
to think about, you mentioned right up top the page size of Postgres, for example. Do you want
to talk a little bit about why that's important and some of the lower level stuff that people
need to think about when they're doing this? All right, so, tuning or technical, I heard two different things. Let's go
with tuning, as more practical. We don't have to just explain TOAST, we've done a whole episode
on TOAST before, but there was some interesting stuff in there that really got me
thinking, that I hadn't thought about before. Yeah, so there's a whole bunch of areas to go into.
So just to briefly recap, because I think this does impact tuning.
So Toast is a system that, well, step back further.
So the foundational unit of Postgres is the page.
That's the atomic unit.
When you store data, you're actually storing it within this atomic unit.
You're not just storing a row randomly on disk.
It's going to be fit within a page.
By default, the page for Postgres is 8 kilobytes.
Now, you can recompile Postgres to use a different page size,
but most folks just use the default for a lot of good reasons.
What's interesting is that, you know, so if you have data
that doesn't fit within a page, what happens is that it gets toasted, and it gets stored out of line, and you can store it arbitrarily large.
I believe it's up to one gigabyte per field.
Now, what's interesting, well, there's three interesting things here.
So first, Postgres has what's called a minimum toast threshold, which is two kilobytes.
So anything above two kilobytes is going to get toasted, unless you change that threshold. The second thing is that index pages must abide by that 8 kilobyte limit.
And you can actually toast data in line with an index page to shrink it down a bit so you can get a little bit more on that index page.
But if it's going to be over 8 kilobytes, Postgres can't index it.
So this is where it gets interesting for vectors for a few
reasons. So first, any vector that's above that 2 kilobyte threshold, which I believe, off the top of
my head, is around 514 dimensions for pgvector currently,
is going to get toasted. Okay, that might seem okay. But currently, you know,
there's a hard cap in terms of indexing pgvector vectors of 2000 dimensions,
though there are some valid use cases I've heard of with vectors that go beyond that size.
But typically, most things are going to fall within that range for now.
Now, there's a few things here.
First, you're like, well, you say you can toast things in line.
So why can't we toast these vectors in line?
Well, I hate to be the bearer of bad news, but it's actually very challenging to compress effectively, you know,
2000 dimensions of random floating point numbers.
Yeah.
So there's not really much you can do other than dimensionality reduction to
shrink it down.
There are some techniques that are out there called quantization,
which we can touch on after this, but, you know,
they all have their trade-offs, such as, you know, losing information. So the second thing is that toasting can screw up the query planner a bit. What do I mean by this?
So currently, when the query planner is trying to estimate parallel workers, or essentially
you're reading data in parallel, it basically looks at your heap pages,
or your main table pages, and uses that to drive the estimate. And the thing about toast is that your heap pages can actually be
quite small. You know, in this case, let's say it's an id and a vector: well, I'm not going to have that
many pages in my regular heap, because it could just be a bunch of ids and pointers to my toast
table. The toast table is going to be quite large, and likely you need those parallel workers to suck
all the data out. But what Postgres is going to do is it's going to underestimate the workers today.
And it actually makes sense historically, because the data that you typically toasted was not in
your search path or your ordering path. It's typically something that you need to do some
post filtering on as you pull it out, you know, think like, you know, a big blob of text. Yeah,
but here, we are actively querying the vectors. We are
calculating distances between them. So we probably want them closer to our main set of data. Now,
we can choose to store the vectors in line. You use this technique called SET STORAGE PLAIN,
and that will keep your 1500-dimensional vector in line with your other table data.
But also keep in mind, this is going to cost a full page because
a 1500 dimensional vector is about six kilobytes and Postgres is just going to have to allocate
a page for each of them. So no matter what, you're storing eight kilobytes of data per row.
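The storage tweak being described is standard Postgres DDL; a sketch, again on the illustrative items table:

    -- keep vectors in the main heap instead of the TOAST table; this
    -- applies to newly written rows, and each row must still fit the
    -- 8 kilobyte page
    ALTER TABLE items ALTER COLUMN embedding SET STORAGE PLAIN;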
So this gets very interesting. Yeah. So in terms of tuning, again, part of this,
you've got to look at what your workload is and what makes the most sense there. And
what we're seeing so far, at least with HNSW, is that, you know, even if you're toasting a 1500 dimensional
vector, the estimates are still pretty good overall for making sure that you're using your
HNSW index. I haven't seen as much impact there. We saw more impact, particularly with
IVFFlat, based upon some of that costing model. But I think
this is one of those areas where there are some improvements to be made to how we
work with toasted data upstream.
I think I have a couple of emails or threads on that subject.
And if I get my C chops better, maybe I can propose a patch.
But that's
like one area to be mindful of: how you store the data.
And again,
I think we're starting to get a little bit more set and forget there,
but I think as we see like these workloads increase, it is definitely something to be mindful of.
From a biased standpoint, as a Postgres supporter and advocate,
I've seen quite a lot of benchmarks that show PG Vector doing very well as a vector database compared to other dedicated systems.
So that's impressive considering they're probably not even tuning some of those lower level
things while doing those benchmarks.
So it's exciting for me to hear that there's actually some potential further wins there
on the Postgres side.
Yeah, I mean, and maybe like hot off the presses, Andrew proposed a patch last night for speeding
up HNSW index building.
Since like the 0.5.1 release, there's already been a lot of work to improve this, but
the patch last night even further accelerates
it. So there's
support for parallel builds for
HNSW indexes.
And the initial work still
had to leverage a lot of the data coming from disk
as opposed to being fully in memory.
The proposal last night is fully in memory.
And I actually did a test.
I was so excited to see it.
I did a test.
I have this 10 million, 1500 dimensional vector data set,
or really, like, 1,536 dimensional data set.
Yeah.
Random vectors, right?
This is really more just for like beating up on performance.
You know, I'm not measuring recall here,
which for real benchmarks,
you got to measure performance and recall.
But I'm trying to just understand like,
how good is the index build time?
So I did this with what's currently, you know, on the master branch to date, and then I
also did the test with the hnsw-fast-build branch, which is this in-memory system. So I saw a couple
of things. So first, this new branch was about, I think, 7.3x faster at building the HNSW index
than the other branch.
And just to compare,
like, let's get some real numbers.
So the old branch,
which is,
you know,
the unreleased support for the HNSW parallel builds,
it took about three hours and change to build the entire index.
And this was with, like, 64 parallel workers.
I'm throwing,
you know,
a big, beefy box at this. With the new branch, it took 25 minutes.
It's like, wow.
Like, that's mind-blowing, right?
These are big vectors.
You're doing a lot of computations.
I even had cranked up the ef_construction value, which, you know, does increase the time.
I did compare it to this previous method I've been recommending, which was concurrent inserts.
And I just, you know,
did a spot check between a blog post I wrote about it and this new patch. So in the blog post,
where I had a lower value of ef_construction, it was, I think, 64, I was getting a little
bit over a thousand vectors per second, you know, looking over the entire run.
With this new technique that Andrew posted, I was getting over 6,500 vectors per second.
So again, huge.
And this is what's cool about this,
because this is one of those set-and-forget things
that you might need to tweak one parameter.
In this case, I tweaked max_parallel_maintenance_workers.
But again, huge performance boost in terms of the index build.
And it simplifies it too, because now it makes it truly viable
to, say, preload all your vectors,
which will be a much faster operation
than doing concurrent inserts,
and even, like, further speed up
the load of your vectors.
But then you can just do create index,
you know, embeddings using HNSW and boom,
like you have an index way faster than other systems.
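A sketch of that load-then-index flow; the file path and worker count are illustrative, and the parallel HNSW build refers to the in-progress work being discussed:

    -- load all the vectors up front (path illustrative)...
    COPY items (embedding) FROM '/path/to/vectors.csv' WITH (FORMAT csv);

    -- ...then build the index in one pass, with parallel workers
    SET max_parallel_maintenance_workers = 64;
    CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);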
And, like,
this is a key thing to look at, you know, you touched on this, Michael: when you're
dealing with this vector data, you've got to look at everything, you've got to look at your ingestion time,
your index build time, your query time. You know, you have to focus on all these things,
because there is a trade-off with all of them. Yeah, well, exciting times ahead. It feels like the most rapidly improving area at the moment, and
for good reason. So, I saw pgvector has got a great changelog, and I saw there's
even an unreleased part, so 0.5.2 is currently on there, unreleased. So is there anything you're
particularly excited about that's coming up, or anything not on that list that you'd love to see?
Yeah.
And I think this is where it really helps to have feedback
on how people are using it.
So I was very excited about parallel HNSW builds.
Like, this is...
Like, I can't...
Even, like, in the middle of that test I was running last night,
like, I emailed Andrew, like, Andrew, the results are amazing.
Like, I can't wait for it to finish.
Like, I can see where this is trending.
So there's two things in particular I'm very excited for.
The first, you know, the first, and we'll see if it makes it in,
but it's around pre-filtering.
So a lot of the queries today we've been seeing is, like,
select star from vector table, order by, blah, blah, blah, blah,
you know, limit 10.
Find me my 10 nearest neighbors.
But what's happening in practice?
In practice, it's like select star from table where,
you know, some condition, category ID equals this,
order by, blah, blah, blah, blah.
So what this is kind of like is really a multi-column index
because what will happen is like today,
either you won't use the index,
effectively you're doing an exact nearest neighbor search, which means very accurate results. And maybe, you know, your data set gets filtered down
to a small enough size where it doesn't matter, like, you're getting very fast
search. But what if you have multiple fields that you need to filter over? Or what if the data set
you get back is 50,000 of these, you know, 1500 dimensional vectors?
This can be expensive.
So there's a patch out there.
It's based on a newer paper called HQANN.
You can check the pgvector repo right now.
That lets you build effectively a multi-column index where it's able to
build the links or group together.
Well, there's two things, right?
Because you still want to be able to search over your entire vector set and
find your most similar vectors, But it also groups the vectors by your filters,
so that way you're searching over just your filters,
and it does that pre-filtering.
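The query shape in question, as a sketch; category_id is a hypothetical filter column on the illustrative items table:

    -- today, the planner either skips the vector index (an exact scan) or
    -- filters after the index scan; the HQANN-style patch aims to apply
    -- this kind of filter inside the index itself
    SELECT id FROM items
    WHERE category_id = 123
    ORDER BY embedding <-> '[3, 1, 2]'
    LIMIT 10;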
This has come up by...
It's almost like...
This is like a what-have-you-done-for-me-lately feature,
because as soon as you put it out,
users find, like, you know, more of these case
studies emerge, and users find, like, hey, I really need this. What's great is that there are
people who are testing it against real data. And this is where, if you want to be involved in pgvector,
you can help: when you see these patches out there, please test and report back on them.
Because if you're finding the HQANN branch useful, describe your use case. Like, there's an
open issue where people talk about it; you know, they might say, like, hey, I really need this. And what makes it super valuable is when you talk
about like, here's exactly how I'm using it, because it just further justifies that this is
the right direction. I mean, one of the big goals of PG vector is to try to keep the code base
simple to maintain and also make the product simple or the extension simple to use. So hearing
more about what people are doing is great. Another patch that I'm excited for,
and again, we'll see if it makes it into 0.5.2,
you know, no guarantees,
is being able to support smaller dimensional sizes.
So right now, dimensions are four byte floats,
but there are definitely embedding models
that provide two byte floats
or one byte unsigned integers.
So there's a couple of branches that have those in there.
But again, hearing those use cases
will help further support it
because the added bonus to supporting
these smaller dimensions is that
we can index larger vectors
and that'll be able to go beyond that 2,000 dimension limit.
Yes, interesting.
The one final question I had is,
do you see any world where this becomes
part of core Postgres in the long term?
Yeah.
So at PGCon 2023 last year, I had a lightning talk,
which is basically to first shout to the wind to the community,
like, hey, these vector workloads, this is real.
That was still early.
But it's like, hey, this is coming.
I call it like a storm of data coming.
We want to make sure Postgres is positioned for it.
This is very similar to what we saw with JSON.
And I was able to get an unconference session as well,
where we discussed it.
And the general consensus was in the fullness of time,
it does make sense to have something like this
in upstream Postgres.
Great.
But there's, I think there's a few things here.
First, we have to look at release cycles.
You know, Postgres releases once a year.
In fact, the feature freeze for Postgres 17 is coming up in less than three months.
But effectively, there's a... Let's say we come up with the idea we want to support a vector data
type right now, or let's say after feature freeze. I mean, there's effectively an 18-month window
before it gets in because we have to go through the whole cycle.
And given the pace that this field is moving, we don't necessarily want to wait on the Postgres release cycle.
So being able to do this work in PGVector or other extensions does help accelerate adoption of Postgres as a vector database, so to speak.
The other thing is that once it is in upstream Postgres, that is the on-disk format.
That is the rule.
And Tom Lane made a very good point during that unconference session, which is
like, let's see how things shake out
in terms of what the on-disk format
is. Now,
pgvector is also trying to stay true to that
contract, and trying to keep the on-disk format
stable, effectively
not changing it. Because
as soon as you change it,
you've got to, like, rebuild, reindex, restore everything.
That's the cost of the operation.
So PGVector is trying to apply the same level of rigor
as Postgres to implementing these features,
but it can move a little bit faster
because it is an extension.
It can have its own release lifecycle.
So that's where it is, you know.
I'm not going to say this is an official
or unofficial community position, but is there interest in supporting it upstream? Absolutely.
But given the rapid emergence and development we need to make, we're trying to make as much
progress as possible within PG vector. Here's the other thing too, like the big difference between
now and JSON, we have the index access method. We can do this in an extension. That is a big change since 10, 15 years ago.
That's really interesting and good to know. There's another big difference, which is cloud providers.
So whilst it could be in core Postgres, currently a lot of people that install Postgres themselves, or manage Postgres themselves more specifically, can install pgvector. And people that don't are often on a managed service provider. Most managed
service providers have added support for pgvector already, which is the fastest I've ever seen them
add a new extension, and it's been pretty much everyone. So most folks that want access to pgvector,
at least some version of it, maybe not
always the latest version, can have it now as well. Yeah, it's definitely exciting to
see all the adoption of pgvector. And yeah, you know, to tease a little bit, I still
think the best is yet to come. Like, I think pgvector has made a lot of progress in the past
year. I mean, it's been tremendous work by the community, and in particular Andrew. Like, I can't
say enough nice things about the work Andrew has done, and really the diligence
he's put into it.
And I still think there's more to do.
Like I said, even just like, well, the patch that came out last night, I mean, just shows
that there's still even more performance that we can get out of it.
And for folks listening, we're recording on the 16th.
So I'll link that up.
Wonderful.
Jonathan, thank you so much.
Is there anything else I should have asked you but didn't,
or any last things you wanted to say?
Yeah, I think one of the big things when you think about contributing to open source
is writing a patch, writing code.
And I'd say this in the context of pgvector,
but I think this applies to any project, even Postgres itself,
is that there are many different ways to contribute. Testing is huge, you know, because testing, particularly if you can test something
that's close to a real workload, like don't test your production workloads with these things,
but like test something that's close to production. That helps because that helps drive the use case
and hearing how you use something and just talking about your different use cases and being supportive
of the community in that way. That helps. Helping to write documentation, helping to advocate for something that you think can help others.
Again, all these things can help a project.
So if you want to contribute to PG Vector or Postgres,
there are a variety of different ways.
And maybe to, as it's top of mind right now,
PGCon, which was effectively the annual developer conference,
it's evolved.
It's now called pgconf.dev,
P-G-C-O-N-F dot dev.
It's being held in Vancouver at the end of May.
I can tell you,
I can almost guarantee vector workloads will be a topic of discussion there,
but just all things Postgres.
And the idea is that,
while certainly a lot of the folks
at the discussions are around
technical hacking topics,
really, if you step back, the gist is, how do we continue building and growing the Postgres
community? So the CFP actually just closed yesterday, or when this airs, probably a few
days before. But that's a great way to participate as well, even if you're new to the community,
because I know personally, at one point, I was new to the community. And the first time I went to PGCon,
I'm like, oh my God,
I know nothing about how a database works.
I can write a select query, but geez.
But it is a way to help have impact on the community.
And just talking to folks
who are working on the software
or working on events or hosting podcasts
or finding ways to help grow the community,
it's a great way to
participate and help grow. So certainly, you know, think about attending and participating. And
again, there's all sorts of different ways to contribute. So that's the parting message I have.
I'm far from a database hacker. I can write, like, a couple of lines of C here and there. But
where I've found a home in open source
is working on all sorts of other aspects
around open source projects.
What a great message to end on.
Thank you, Michael.
Thanks so much for your time, Jonathan,
and for everything you do in the community.
Likewise.
Thank you for having me on your wonderful podcast.
Cheers.
Take care.
Take care.
Goodbye.