Postgres FM - pgvector
Episode Date: January 19, 2024

Michael is joined by Jonathan Katz, PostgreSQL Core Team member, Principal Product Manager at AWS, and also pgvector contributor, to discuss pgvector. They cover what it is, why it's so popular all of a sudden, some tuning and tradeoff guidance, and a look to the future.

Here are some links to things they mentioned:

pgvector https://github.com/pgvector/pgvector
Our episode on companion databases https://postgres.fm/episodes/companion-databases
Vectors are the new JSON (talk by Jonathan) https://www.youtube.com/watch?v=D_1zunKblAU
pgvector 0.5.0 feature highlights and howtos (blog post by Jonathan) https://jkatz05.com/post/postgres/pgvector-overview-0.5.0
HNSW indexing and searching (blog post by Jonathan) https://aws.amazon.com/blogs/database/accelerate-hnsw-indexing-and-searching-with-pgvector-on-amazon-aurora-postgresql-compatible-edition-and-amazon-rds-for-postgresql
Our episode on TOAST https://postgres.fm/episodes/toast
pgvector changelog https://github.com/pgvector/pgvector/blob/master/CHANGELOG.md
HQANN paper https://arxiv.org/abs/2207.07940
HNSW fast build branch https://github.com/pgvector/pgvector/tree/hnsw-fast-build
pgconf.dev https://2024.pgconf.dev

~~~

What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc!

~~~

Postgres FM is produced by:
Michael Christofides, founder of pgMustard
Nikolay Samokhvalov, founder of Postgres.ai

With special thanks to:
Jessie Draws for the amazing artwork
Transcript
Hello and welcome to PostgresFM, a weekly show about all things PostgreSQL.
I am Michael, founder of pgMustard, and today I am delighted to be joined by Jonathan Katz,
PostgreSQL core team member, amongst other things,
Principal Product Manager at AWS, and contributor to pgvector.
Welcome to the show, Jonathan.
Thank you, Michael. Happy to be here.
Definitely a huge fan of your work and very excited for our conversation today.
Kind of you to say, me too.
So I reached out to you about this topic because it's about time we did an episode on pgvector.
How could we not have, given the last year? But neither of us knew it well; both Nikolay and
myself hadn't used it until recently. Nikolay has been using it. So we wanted to make sure we covered it, and who better to, well, Andrew Kane was unavailable
and you are definitely the next best person.
So we really appreciated all the content
you've been putting out,
the talks you've been giving on this.
So yeah, thanks for joining.
Yeah, happy to be here.
I mean, and by the way, when you say next best person,
it's like well below Andrew Kane.
Andrew has done phenomenal work on pgvector
over the past several years. Yeah, right. And it definitely predates the ChatGPT hype, right?
It was a couple of years beforehand. Do you want to give us a little bit of an introduction
as to what it is and how you became involved with it? Yeah, so pgvector, you know,
at the surface sounds, you know, very simple. It's a Postgres extension that allows
you to store and search vectors. And, I mean, I'll talk a little bit about how I got involved in it.
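A minimal sketch of what that looks like in SQL, with an illustrative items table and embedding column, and toy three-dimensional vectors standing in for real embeddings:

    CREATE EXTENSION vector;

    -- illustrative schema; real embeddings usually have hundreds or
    -- thousands of dimensions, not 3
    CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

    INSERT INTO items (embedding) VALUES ('[1, 2, 3]'), ('[4, 5, 6]');

    -- exact nearest-neighbor search, ordered by L2 distance (<->)
    SELECT id FROM items ORDER BY embedding <-> '[3, 1, 2]' LIMIT 5;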
But really to understand, you know, how this all came to be, it actually helps to look back in
history of Postgres. Postgres has actually always been able to support vectors, you know, back since
the Berkeley days, and it comes down to the array data type. So I mean, quite simply, a vector is, you know,
an array with certain properties: it has, you know, a certain amount of dimensionality,
there's certain things that each dimension must meet. And, you know, there's all sorts of math
around it. I mean, there's, you know, over a century of study of what a vector is.
But the reason why it was added to Postgres was not because of any mathematical properties,
it was actually as a shortcut to be able to look up ACL rules. So instead of having to do a join to another catalog table to get the
ACL rules, they're just embedded within a single column, you know, as an array. Oh, really?
Yeah, yeah, no, it's actually pretty cool. And then fast forward to the early 2000s,
Postgres added the cube data type, which allowed you to index up to 100 dimensional vectors,
ultimately using the gist index.
But it added all sorts of distance operations, which are the mathematical operations you
need for a vector, which we can talk a little bit more about that as we get into this.
But this is to say Postgres has actually supported vectors and vector lookups for a while.
But certainly there's been a more pressing need for it.
And in terms of a little
bit of how I got involved, perhaps my secret in the Postgres community was I originally wanted to go
into machine learning when I was in college. And while I was studying it, I was like, oh,
it's really fascinating, but to do anything with it, you're going to need a PhD and everything's
ad hoc. And I had a very
entrepreneurial spirit; I think I got impatient. And of course, you know, I didn't have the foresight to
see, like, oh, this will be commoditized and it'll be very simple to access machine learning algorithms
through simple APIs. So credit where credit is due: some very smart people, you know,
identified that and, you know, have been really building towards that. But again, you know, working with machine learning systems in college,
the fundamental data type was the vector. And back then, you know, a high dimensional
vector was something considered to be like 20 dimensions. And think about it, Michael, like,
what's a 20 dimensional vector? Like, I can barely, you know, I can understand three,
three is like points in space. I understand four, points in space moving around. What's five, plus time, right? Yeah.
Yeah, I studied, I don't know if you know, but I studied maths at university and did
pretty much just pure. I did a tiny bit of stats and a tiny bit of mechanics, but
I was best at the pure stuff. And yeah, five dimensions hurts my head to think about,
like, in order to try and
conceptualize it in any way. Yeah, it's funny that you say that. I studied math too, though, like, I would
say I was much stronger on the computer science side, but real analysis was one of my favorite
classes, because, you know, you were proving, like, everything in n-dimensional space. And again, like,
everything's a vector or matrix. You basically spend a semester just
studying vector spaces and all their properties. And to me, that was fun.
And maybe it was telling, fast forwarding 20 years later.
But my journey back to vectors came probably a little bit later than some folks.
The pgvector project started in 2021.
Andrew had the foresight to see that a lot of data was going to be generated around machine
learning and AI systems. And Postgres didn't really have support for the types of vector
lookups that we needed. So let's back up a second. So I said that Postgres had support for vectors
all along, but it didn't, it didn't. One thing about relational databases and Postgres in
particular is that you look for exact results.
You say like, hey, I want to find like the 10, you know, well, let's say I want to look up like Michael or Jonathan in a database.
You know, I'm going to write a select query with a where clause, you know, where name equals Michael, right?
And you expect to get the results for like Michael, like all the Michaels in the database.
You know, same thing when you're looking up points.
Postgres has the ability to
index things by k nearest neighbor, or, you know, find all the points that are around me. And again,
if I say like, hey, go in and find like the 10 closest coffee shops to me, you're going to find
exactly the 10 closest coffee shops. And these things work in, you know, like three dimensions,
four dimensions, which is typically where Postgres operates. If you take the PostGIS extension,
which adds geospatial support to Postgres, you can index three and four dimensional points,
and you can do it quite well. The GIST and the SPGIST indexes provide the ability to write these
indexes. But as you start adding dimensions, you go from, you know, even like eight to 10 to 100,
like, which is the limit of the cube extension, it's going to start getting very expensive, first in terms of placing the vector, or, I mean, it's not really a point anymore.
But, you know, think about it as a point: you're trying to place this in space and trying to figure
out, I have 100 dimensions, and I have to situate it, you know, somewhere in my graph, wherever
that is. It's going to be very expensive. And then even if you are doing some kind of exact
nearest neighbor search, or some index search, there's gonna be a lot of cost to it.
Yeah. And this is where Postgres, you know, starts falling over in a way, you know, because
a lot of these vectors we're seeing, if you look at some of the popular embedding systems, they're 1536 dimensions, which again, mind blowing number, like, I still don't even know
what like a five dimensional vector is. And here we're talking about 1500 dimensional vectors.
And to do an exact nearest neighbor
search, as they call it, or k-nearest neighbor, you have to compare against every
single vector that you store in a database. So you have a million vectors that are 1500
dimensions, you're doing a million times 1500 computations on them. And that's really expensive.
I can tell you it's very expensive, because, you know, I've been personally
benchmarking this for quite a bit.
And it makes sense, right?
It's at minimum a sequential scan of the entire table plus all of the CPU of doing those calculations 1,500 times per row.
1,500 times, yeah.
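You can see that cost in a query plan; a sketch against the illustrative items table from above, since without an approximate index the planner can only scan everything:

    -- an exact k-nearest-neighbor query with no vector index
    EXPLAIN (ANALYZE, COSTS OFF)
    SELECT id FROM items ORDER BY embedding <-> '[3, 1, 2]' LIMIT 10;
    -- expect a plan like Limit -> Sort -> Seq Scan on items,
    -- with the distance computed for every row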
And we can talk about it in terms of, you know, all sorts of the optimizations that are in place there.
Yeah.
This is where like that exact nearest neighbor problem gets hard.
And again,
you know,
Postgres didn't have the indexing mechanisms to handle it.
We'll probably further dive into that as we talk more.
But actually, what I'll say,
one of the big ones is just in terms of storing the data:
storing the data within the index, you have to fit your index value for a single row within an eight kilobyte limit, within the page. So that's already going to start creating some constraints. Meanwhile, outside of Postgres there was a whole field of modern vector research, as I like to call it, and it starts getting
to this idea of approximate nearest neighbor. And the idea with approximate nearest neighbor is that
you're trying to find your nearest neighbors, but with a best fit, which is you're not going to
search every single vector in a particular table or database. You can search a subset of it,
and you're hoping that you're seeing the representative set of the data you want to
return.
Now, again, for us relational database folks, that's mind blowing.
It's like, wait a second, I'm not returning the exact answers.
I'm returning like an approximate subset.
And I'll tell you the first time I used PG vector, and I was playing around with it,
and I was not getting like the results I expected.
At first, I'm like, oh, this is broken.
This doesn't work.
This doesn't make sense.
And, you know, that was the app developer in me, the DBA in me that started diving deep
and trying to understand like, oh, wow, there's like a whole science around this that we kind
of have to relearn when using a database.
Yeah, well, I often say to people, you can run a database without adding any indexes
and you can run any query you want and get the results back.
It's just faster to use an index.
Your query result will not change as a result of you having an index.
And that sentence is no longer true
if we include these approximate nearest neighbor type indexes.
So I have to stop saying that or work out a way of caveating it.
Yeah, well, just say, well, in this world,
when you're dealing with vector similarity search. But back to the question, you know, how did I get involved in pgvector?
So rewind to about the end of 2022. We may have heard about some of these generative AI systems
that captured the imagination, and we go and we, you know, type things into them and we get
back these, you know, human-like responses. And it's like, we're trying to think, like, oh, what are all the different ways to use it?
And one of the techniques that emerged as these systems grew in rapid popularity was retrieval
augmented generation, where these large language models, foundational models have trained on vast
quantities of data, typically publicly available data. But you may have data sitting in your
database, or there's some other private source that is not publicly available. For good
reason, it might be your, your product catalog, or information related to, you know, a documentation
collection that you have. But you're trying to deliver something to your users, your customers
that can take advantage of what's in these generative AI systems, but provide that personalized experience. And to do that, you need some way to be able to take
that data that exists in your private databases and safely and securely augment them with your
foundational models to be able to deliver that. And that's where the vector search comes in,
because the vector search provides a representation that you can search over and be able to augment
those answers. So seeing that, seeing this is how people want to use it, and frankly,
people want to use it with Postgres. You know, a lot of the user and customer conversations
I had were around, hey, I already have this data existing in my Postgres database, or I've already
built all my search mechanisms around Postgres, and I want to be able to simply use it. And there was an extension that did this, and it was pgvector.
So for me personally, it started to become a perfect storm.
You know, I'd studied this vector stuff in college.
I was a hobbyist machine learningist, albeit like, you know, the field has gotten way more complex than when I studied it.
And I happen to like databases as it happens,
particularly Postgres.
So I looked at this extension.
It seemed like a lot of people
were starting to adopt it in rapid fashion.
I mean, if you look at the star history of PG vector,
it's a curve like this.
Like I've never seen that before for a Postgres extension.
But I stepped back a minute as well
and I saw that these are net new database workloads.
A lot of what you see in Postgres in the past has been moving database workloads over, you know, a lot of
traditional, you know, transactional relational applications. Though, you know, you might call it
once in a generation, which in the tech world is every 10 years,
there's this new workload that emerges that just generates net new data. So about 10, 15 years ago, we saw that
with JSON. We developed this lingua franca across the web that made it simple to communicate between
all these APIs, but there became a demand to be able to store JSON data directly in a database
and easily query it out. And Postgres developed a solution for that, being the JSON and the JSONB
data type, which became quite efficient at being
able to query it. And what was nice is that, you know, the Postgres community rallied around what
people were doing and created a very efficient, effective solution. Side note, I've talked to
one Postgres user that has 40 terabytes of JSON in a single database, and they say Postgres rocks it,
which I'm like, wow, that's great.
I'm glad it works.
That's so nice for you.
Yeah, yeah.
It is really cool.
Again, I can't figure out how I would personally do it, but I'm glad that it works really well for them.
But we're seeing something similar with vectors.
You know, vectors are just a data type.
You know, it's actually a well-studied data type.
It's something you kind of look at, you know, in your introductory computer science classes.
But when it comes to efficient data search and retrieval, there's like so many nasty properties with them that make it like a really fun and challenging problem to work on.
But the thing is, like, this is, you know, this is this new workload that is going to be available to databases, including Postgres. And for me personally, I want to make
sure that we're positioning Postgres in a way to handle it. So where I got started with
pgvector was first using it, helping to get it deployed at AWS. But like any open source project,
if you're interested in it, just participate. There's a few things that I particularly focused
on,
which was performance, both from a contribution standpoint and a testing standpoint.
And first, I would just want to say a large chunk of the performance features have been
developed by Andrew Kane. He's done an awesome job on it. And more recently,
Heikki Linnakangas has even further improved the performance on it.
Where I stepped in was both just benchmarking
to make sure that we're focusing on the right areas of performance
and trying to prioritize what things to look at.
And a few patches here and there,
particularly around some of the index costing early on for IVFFlat.
But yeah, I saw that.
Oh, thank you.
I would love to get onto performance and tips for people.
But before that, I hadn't thought of this as net new data.
That's a super interesting way of looking at it.
And I hadn't thought of JSON as necessarily net new data.
I was thinking of it much more along the lines of the NoSQL workloads that felt quite competitive with the relational for a while.
But yeah, really interesting way of looking at it and it does
feel like an opportunity much like that did to be able to handle more within postgres with all of
the benefits that we've done episodes on uh the pros and cons of having data inside and outside
using specialized databases but you get gained so many benefits like the transactional nature
of things or the low latency or being able to join things together like there are so many benefits, like the transactional nature of things or the low latency or being able to join things together.
There are so many benefits of having it all together.
And not to mention the lower maintenance overhead or the lower operational knowledge needed of managing multiple systems.
So, yeah, great point. Love it. Thank you.
Cool. Yeah, so I was looking through your PRs and the blog posts on the additions. So IVFFlat was
the first index type that pgvector supported, and then more recently, well,
not as new as I expected when I looked back at the history, but the competing index type HNSW got added,
which you've also been involved in tuning, I've seen. Yeah, so IVFFlat was first.
And I think one of the reasons why is that it is a bit simpler to implement.
So it's a clustering algorithm.
Clustering algorithms are well studied.
I mean, they continue to be improved.
But one of the things you have to think about
when you're looking at implementing these algorithms
is that something that might work for an in-memory workload doesn't necessarily work for a database workload, where you're going between memory and an IO level, wherever your storage may reside.
And with IVFFlat, where I began getting involved was at AWS; we had rolled out pgvector.
I started talking to some of our customers who were early PG Vector adopters.
And it was definitely all over the board.
But I was talking to one who was definitely very savvy
on vector searches in general.
And they were remarking that
when they were trying to increase the number of probes
in their query.
So probe, so let's step back.
So the way IVFFlat works is that
you build clusters of vectors.
So let's say you have, let's keep it simple.
You have like 10,000 vectors.
What you want to do is you want to try to bucket them into lists.
So let's say you want to have a hundred lists.
So there would be a thousand vectors.
No, there would be a hundred vectors per list.
There we go.
Clearly I did well with my math major.
Arithmetic was not included, right? Yeah, exactly. Yeah, after a certain year, you never saw a number.
So, let's back up. We have 10,000 vectors, we want to put them in a hundred lists, each having
100 vectors. What happens with IVFFlat is that you try to find centers. So you want to find all
the vectors, you know, near a particular center, because the idea is that those are going
to be the
most similar vectors to each other. So that way, when you do a lookup, you're going to one of those
centers. And the idea is that these are likely, by approximation, the vectors that are most similar
to the ones that you're looking up. However, it may not be because let's say you have, you know,
two centers here and here, and your query vector is here. Yeah, which one are you closer to?
I had to
have somebody explain this to me twice, and the second time they used an analogy that worked
better for me. If we go back to two dimensions and look at, let's say, a map of Europe,
and we had the countries laid out, we could have, for example, it's probably not a great analogy, but
you could have the capital cities. And first off, we could compare any one point and say which
capital city is this nearest to. That's a much cheaper operation than looking at every single
city. And then we can go into that country and look for which of the actual towns or cities
we're closest to. But if something's close to a border, or the fact that not all countries
are the same size, you can end up in situations where you don't necessarily get the right center or country by doing that first one. Is that a reasonable analogy?
Yeah, that's really good because I started thinking about that. So, I'm based in New York
and I'm thinking, what capital am I closest to? It's technically Trenton, which is in New Jersey.
So, you know, if I just looked at one probe, it would say,
oh, you're in New Jersey.
Yeah.
Actually, I think it would probably take three probes, because I think Albany is the third closest capital to me.
There you go.
So there you go.
Yeah.
Great analogy.
So probes is the number of...
So you'd then do that first search and say, which are the three closest centers?
Yeah.
Centers, list.
I mean, list is the term that's used.
It might be used interchangeably
with centers. So if we increase that number of probes, we're increasing the cost of doing the
query because we're looking at more things, we're comparing to more things, but we're
increasing the likelihood we're going to get back the correct result or the number of things that
are correct. In our expected set.
So the measurement of expectation is recall.
So if you know that these set of vectors
are like your actual 10 exact vectors,
but you get like 8 out of the 10,
then you have 80% recall.
So with IVFFlat, as you say,
as you increase the number of probes,
you're going to increase your recall.
But you are going to increase the query time.
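In SQL, those two knobs look roughly like this; a sketch on the illustrative items table, with the lists and probes values picked arbitrarily (vector_l2_ops matches the <-> distance operator used here):

    -- build an IVFFlat index with 100 lists (clusters)
    CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops)
        WITH (lists = 100);

    -- visit the 3 closest lists per query instead of the default 1;
    -- more probes means higher recall, but a slower query
    SET ivfflat.probes = 3;
    SELECT id FROM items ORDER BY embedding <-> '[3, 1, 2]' LIMIT 10;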
Yeah. Now, in that patch we were talking about,
the problem was that we were actually over-costing.
So anytime
we were above five probes,
we then flipped back to a sequential
scan, which will give you 100%
recall, but it was taking the query time
from, you know, it was like a 50x
increase in query time based upon the
dataset. And we
definitely still had room to grow. So the first patch really was focused on let's get the costing
to be a more reasonable level, which gets into, you know, startup costs being one of those elements.
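If you want to see how the planner weighs this on your own data, a sketch (the probe count here is arbitrary):

    -- check whether the planner still picks the IVFFlat index at higher
    -- probes; the over-costing described above showed up as an unwanted
    -- flip to a sequential scan
    SET ivfflat.probes = 10;
    EXPLAIN SELECT id FROM items ORDER BY embedding <-> '[3, 1, 2]' LIMIT 10;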
For me personally, it really was my first time diving into the query planner code,
and I definitely saw all sorts of interesting things. I mean, first off, the query planner
code, there's a level of
brilliance to it. It encapsulates a lot, and, like, the work that folks have done over the past,
you know, 35 years tuning it is quite remarkable. I'd also say that it is quite clean. It takes
a lot, like, there's a lot of it, right? So there's a lot to dive into, but once you get
into it, you can navigate your way around. Editing it is another story. It's one thing to understand it,
it's another thing to propose a modification. But what's really cool about Postgres is that you
can make a lot of impact within an extension, or basically, you can add this functionality
without having to fork Postgres, you could just write it in a separate module and then add it in.
And PG Vector does this through the index access method interface,
which is actually quite robust.
You can build index types that don't fit the natural B-tree ordering
of a relational database.
So it's quite powerful there.
And we were able to ultimately get, I'd say,
pretty good costing for IVFFlat without having to make
any upstream modifications to Postgres.
Nice.
And the other, yeah, so on IVFFlat, because of these centers,
my understanding is that
it suits a data set you already have that is static or not changing much,
because if you're adding data, it can't adjust as it goes.
So there is this trade-off inherent in it.
And I'm guessing some of the customers you were talking to
had different workloads that wouldn't necessarily suit that.
And actually, what happened with your customer conversations next?
Yeah, so that's a good discussion.
And I'd say, you know, let's broaden it,
like, to how it's used with pgvector in general. You know, a lot of these conversations I have around the open source project, which is, again, awesome. So that's definitely something that came up: that if I keep adding vectors to the data set, or every now and then you update them as well, or even delete them, what happens is that the centers start to skew, because where you calculated all these centers are may start shifting.
So your results might start shifting as well.
And you might not be getting the expected results or the recall that you'd want to see.
The other thing as well is that for a lot of app developers, I mean, first off, again, you have to wrap your head around approximate nearest neighbor that you may not be getting the exact results that you expect.
And like I said, I tripped up over that as well. Like it's, you know, it is it's normal to make that
mistake, you know, you have to learn something different. The other thing is that it started
turning app developers into DBAs in the sense that you had to worry about some of your tuning
parameters. And granted, you know, there's two of them, right? There's the number of lists when you
build the index, which is not necessarily terrible, because as an app developer, you're still writing your SQL code, you're creating that index. But you have
to think like, what is the number of lists, and the PG vector project gives some guidance on the
number of lists to pick, but it's still like you have to experiment. And then you have to be mindful
of things like, oh, am I adding more data to it? I need to reindex. How do I reindex, you know,
who's doing the reindexing, et cetera.
And then you have to set probes, which again, you know, you're writing queries possibly
in your ORM and suddenly you have to set like this magic parameter to choose, you know,
how many of these lists you want to visit during a query.
So it's not as simple as set and forget.
And, you know, that's when I come back to my history of being an app developer: I always
want
to just write code and I want it all to just work. Yeah, I happen to explore the database as a side
effect of that, not as a first order principle of that. So it became time to explore an algorithm
that could be closer to set and forget. Like none of these approximate nearest neighbor algorithms
are set and forget, but we can at least make it easier. And this is where HNSW came in: Hierarchical Navigable Small Worlds. So HNSW is a little bit different than
IVFFlat, because it's not cluster based, it's graph based. And when you hear graph, you can
actually almost hear, you know, a graph is a superset of a tree. So even though Postgres is
very tree based in a lot of its algorithms, you can implement something graph based as well,
particularly with the custom index framework. And the way HNSW works is that you build a hierarchy,
or layers of vectors. And as you go down the hierarchy, the layers get denser and
denser. And the idea is that you start at the top layer, and you try to find which vector
am I closest to. You find your local
maximum, or effectively the vector I'm closest to. Then you go down, you search within that area,
and you find the one that you're next closest to, and you go down, and it gets denser, and you
search, and so on and so forth till you get to the final layer. And the vectors that you should be
clustered around are most likely the vectors you're most similar to. When you're building the index,
there's two parameters, you now have two index building parameters.
You have something called m, which is the number of vectors that you're linked to at a given time,
and pgvector defaults that to 16. You have something called ef_construction, which is the search
radius that you're keeping as you're descending down. The idea is that if you look at a larger
search radius, you're going to see the vectors that you're more similar to. The idea also with a larger M is that
you're going to be clustered to vectors you're more similar to as well. But there's definitely
trade-offs between how you pick that. What happens is that when you actually do the query,
you should be looking over a smaller space. So IVFFlat is going to grow linearly as you add
the number of probes, and it's going to
grow linearly by the number of vectors that are in each list. So the idea is that if you can keep
the lists relatively small, you're going to be able to do faster searches with IVFFlat. But
there's also trade-offs to that as well, that you might not be getting the recall that you want. So
you can see an IVFFlat query can get expensive over time, particularly if you're
linearly growing it. With HNSW, you're only searching over a smaller subset of the space. And again,
this could be dictated by your one search parameter, which is ef_search, but you're only
looking at a subset of the graph. So you're going to be looking at far fewer vectors. And the idea
is that you're being put into a cluster that's very similar to you. So there's trade-offs.
With
IVFFlat, I can build the index super quickly. And technically, if I'm only using one probe,
I might be able to query it super quickly. But the trade-off is that to boost recall in IVFFlat,
you know, what we've seen empirically is that it's going to start getting more expensive.
With HNSW, we can query really quickly, and again, the results
are showing that we can query and get high recall, you know, we're
basically seeing the expected results, and with pretty good accuracy.
But the trade off is gonna be on the build time, because we're going to do a lot more
work upfront to build that hierarchy and see all of the, you know, basically try to visit
enough vectors that seem most similar to me.
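Spelled out in SQL, again as a sketch on the illustrative items table, with pgvector's documented defaults made explicit:

    -- m = links per vector, ef_construction = search radius during build
    -- (16 and 64 are pgvector's defaults)
    CREATE INDEX ON items USING hnsw (embedding vector_l2_ops)
        WITH (m = 16, ef_construction = 64);

    -- ef_search = how much of the graph to examine per query (default 40);
    -- raising it trades query time for recall
    SET hnsw.ef_search = 100;
    SELECT id FROM items ORDER BY embedding <-> '[3, 1, 2]' LIMIT 10;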
And that's the push and pull between,
you know,
these two algorithms and just vector similarity search in general: you're going to have to pay the cost somewhere, and you have to figure out
what that cost is going to be and what makes the most sense for your
workload.
With HNSW at least,
you know,
if you're willing to pay that upfront cost,
which,
it seems like a lot of folks are,
you are going to get this very high performance,
high recall system, most likely. And again, there are other factors that go into this as well including
your embedding model or how you're acquiring the data or how quickly you're adjusting the data so
like, there's a lot to consider here. Yeah, I'm a performance guy at heart, and I love it when
there is actually a true trade-off, and people in different situations will have different preferences. But to me, it seems like a lot of the currently popular use cases
have a desire for as high a recall as is possible within a latency threshold. So they're willing to pay some latency and index build time costs for as high a recall as possible, to a point.
And then everyone has like a different point at which that is.
But yeah, I actually wanted to ask almost the opposite question.
Do you see some use cases where the index build time itself is the primary driver?
Or any other reasons that you're seeing people choose IVFFlat at the moment?
Yeah, there are cases where you might be needing to do rapid loads and a bit of analysis, but it's very transient.
So one of the cases I've heard is that there's a system with,
let's say, like 100 million vectors.
And it's really more about getting a rapid build and doing a spot check on the data within it,
as opposed to having something that's permanent that needs to be run all the time.
I think it's, like, a lot of these transient or ephemeral workloads where an IVFFlat index can make sense.
But frankly, like, I've talked to people who have just rolled out production systems with IVFFlat with several million vectors in them.
They're perfectly happy.
They're like, we're getting the results
that we need. We're getting it in the amount
of time that we expected to return
the query in. And they're completely
fine with it. So some of it is personal
preference. And some of it, like I said,
depends on other factors, such as
the data that has been vectorized.
How similar are the vectors based on the
embedding model? And are you able
to look at the distances within a range
where you're seeing the set that you want?
Awesome. That's really interesting.
So you've got a great talk on this that I'll link up
that you gave recently at PGConf EU.
And you've also had a couple of great blog posts
on the releases around 0.5.0 and 0.5.1.
I'll link those up so people can read them in their own time.
I did actually want to ask, you mentioned some tuning, like some people becoming DBAs and having
to think about, you mentioned right up top the page size of Postgres, for example. Do you want
to talk a little bit about why that's important and some of the lower level stuff that people
need to think about when they're doing this? All right, so, tuning or technical, I heard two different things. Let's go
with tuning, as more practical. We don't have to just explain TOAST, we've done a whole episode
on TOAST before, but there was some interesting stuff in there that really got me
thinking, that I hadn't thought about before. Yeah, so there's a whole bunch of areas to go into.
So just to briefly recap, because I think this does impact tuning.
So Toast is a system that, well, step back further.
So the foundational unit of Postgres is the page.
That's the atomic unit.
When you store data, you're actually storing it within this atomic unit.
You're not just storing a row randomly on disk.
It's going to be fit within a page.
By default, the page for Postgres is 8 kilobytes.
Now, you can recompile Postgres to use a different page size,
but most folks just use the default for a lot of good reasons.
What's interesting is that, you know, so if you have data
that doesn't fit within a page, what happens is that it gets toasted, and it gets stored out of line, and you can store it arbitrarily large.
I believe it's up to one gigabyte per field.
Now, what's interesting, well, there's three interesting things here.
So first, Postgres has what's called a minimum toast threshold, which is two kilobytes.
So anything above two kilobytes is going to get toasted, unless you change that threshold. The second thing is that index pages must abide by that 8 kilobyte limit.
And you can actually toast data in line with an index page to shrink it down a bit so you can get a little bit more on that index page.
But if it's going to be over 8 kilobytes, Postgres can't index it.
So this is where it gets interesting for vectors for a few
reasons. So first, any vector that's above that 2 kilobyte threshold, which I believe, off the top of
my head, is around 514 dimensions for pgvector currently,
is going to get toasted. Okay, that might seem okay. But currently, you know,
there's a hard cap in terms of indexing pgvector vectors of 2000 dimensions,
though there are some valid use cases I've heard of with vectors that go beyond that size.
But typically, most things are going to fall within that range for now.
Now, there's a few things here.
First, you're like, well, you say you can toast things in line.
So why can't we toast these vectors in line?
Well, I hate to be the bearer of bad news, but it's actually very challenging to compress effectively, you know,
2000 dimensions of random floating point numbers.
Yeah.
So there's not really much you can do other than dimensionality reduction to
shrink it down.
There are some techniques that are out there called quantization,
which we can touch on after this, but, you know,
they all have their trade-offs, such as, you know, losing information. So the second thing is that toasting can screw up the query planner a bit. What do I mean by this?
So currently, when the query planner is trying to estimate parallel workers, or essentially
you're reading data in parallel, it basically looks at your heap pages,
or your main table pages, and uses that to drive the estimate. And the thing about toast is that your heap pages can actually be
quite small. You know, in this case, let's say it's an id and a vector: well, I'm not going to have that
many pages in my regular heap, because it could just be a bunch of ids and pointers to my toast
table. The toast table is going to be quite large, and likely you need those parallel workers to suck
all the data out. But what Postgres is going to do is it's going to underestimate the workers today.
And it actually makes sense historically, because the data that you typically toasted was not in
your search path or your ordering path. It's typically something that you need to do some
post filtering on as you pull it out, you know, think like, you know, a big blob of text. Yeah,
but here, we are actively querying the vectors. We are
calculating distances between them. So we probably want them closer to our main set of data. Now,
we can choose to store the vectors in line. You use this technique called SET STORAGE PLAIN,
and that will keep your 1500-dimensional vector in line with your other table data.
But also keep in mind, this is going to cost a full page because
a 1500 dimensional vector is about six kilobytes and Postgres is just going to have to allocate
a page for each of them. So no matter what, you're storing eight kilobytes of data per row.
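The storage tweak being described is standard Postgres DDL; a sketch, again on the illustrative items table:

    -- keep vectors in the main heap instead of the TOAST table; this
    -- applies to newly written rows, and each row must still fit the
    -- 8 kilobyte page
    ALTER TABLE items ALTER COLUMN embedding SET STORAGE PLAIN;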
So this gets very interesting. Yeah. So in terms of tuning, again, part of this,
you've got to look at what your workload is and what makes the most sense there. And
what we're seeing so far, at least with HNSW, is that, you know, even if you're toasting a 1500 dimensional
vector, the estimates are still pretty good overall for making sure that you're using your
HNSW index. I haven't seen as much impact there. We saw more impact, particularly with
IVFFlat, based upon some of that costing model. But I think
this is one of those areas where there are some improvements to be made to how we
work with toasted data upstream.
I think I have a couple of emails or threads on that subject.
And if I get my C chops better, maybe I can propose a patch.
But that's
like one area to be mindful of: how you store the data.
And again,
I think we're starting to get a little bit more set and forget there,
but I think as we see like these workloads increase, it is definitely something to be mindful of.
From a biased standpoint, as a Postgres supporter and advocate,
I've seen quite a lot of benchmarks that show PG Vector doing very well as a vector database compared to other dedicated systems.
So that's impressive considering they're probably not even tuning some of those lower level
things while doing those benchmarks.
So it's exciting for me to hear that there's actually some potential further wins there
on the Postgres side.
Yeah, I mean, and maybe like hot off the presses, Andrew proposed a patch last night for speeding
up HNSW index building.
Since like the 0.5.1 release, there's already been a lot of work to improve this, but
the patch last night even further accelerates
it. So there's
support for parallel builds for
HNSW indexes.
And the initial work still
had to leverage a lot of the data coming from disk
as opposed to being fully in memory.
The proposal last night is fully in memory.
And I actually did a test.
I was so excited to see it.
I did a test.
I have this 10 million, 1500 dimensional vector data set,
or really, like, 1,536 dimensional data set.
Yeah.
Random vectors, right?
This is really more just for like beating up on performance.
You know, I'm not measuring recall here,
which for real benchmarks,
you got to measure performance and recall.
But I'm trying to just understand like,
how good is the index build time?
So I did this with what's currently, you know, on the master branch to date, and then I
also did the test with the hnsw-fast-build branch, which is this in-memory system. So I saw a couple
of things. So first, this new branch was about, I think, 7.3x faster at building the HNSW index
than the other branch.
And just to compare,
like, let's get some real numbers.
So the old branch,
which is,
you know,
the unreleased support for the HNSW parallel builds,
it took about three hours and change to build the entire index.
And this was with, like, 64 parallel workers.
I'm throwing,
you know,
a big, beefy box at this. With the new branch, it took 25 minutes.
It's like, wow.
Like, that's mind-blowing, right?
These are big vectors.
You're doing a lot of computations.
I even had cranked up the ef_construction value, which, you know, does increase the time.
I did compare it to this previous method I've been recommending, which was concurrent inserts.
And I just, you know,
did a spot check between a blog post I wrote about it and this new patch. So in the blog post,
where I had a lower value of ef_construction, it was, I think, 64, I was getting a little
bit over a thousand vectors per second, you know, looking over the entire run.
With this new technique that Andrew posted, I was getting over 6,500 vectors per second.
So again, huge.
And this is what's cool about this,
because this is one of those set-and-forget things
that you might need to tweak one parameter.
In this case, I tweaked max_parallel_maintenance_workers.
But again, huge performance boost in terms of the index build.
And it simplifies it too, because now it makes it truly viable
to, say, preload all your vectors,
which will be a much faster operation
than doing concurrent inserts,
and even, like, further speed up
the load of your vectors.
But then you can just do create index,
you know, embeddings using HNSW and boom,
like you have an index way faster than other systems.
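A sketch of that load-then-index flow; the file path and worker count are illustrative, and the parallel HNSW build refers to the in-progress work being discussed:

    -- load all the vectors up front (path illustrative)...
    COPY items (embedding) FROM '/path/to/vectors.csv' WITH (FORMAT csv);

    -- ...then build the index in one pass, with parallel workers
    SET max_parallel_maintenance_workers = 64;
    CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);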
And, like,
this is a key thing to look at, you know, you touched on this, Michael: when you're
dealing with this vector data, you've got to look at everything, you've got to look at your ingestion time,
your index build time, your query time. You know, you have to focus on all these things,
because there is a trade-off with all of them. Yeah, well, exciting times ahead. It feels like the most rapidly improving area at the moment, and
for good reason. So, I saw pgvector has got a great changelog, and I saw there's
even an unreleased part, so 0.5.2 is currently on there, unreleased. So is there anything you're
particularly excited about that's coming up, or anything not on that list that you'd love to see?
Yeah.
And I think this is where it really helps to have feedback
on how people are using it.
So I was very excited about parallel HNSW builds.
Like, this is...
Like, I can't...
Even, like, in the middle of that test I was running last night,
like, I emailed Andrew, like, Andrew, the results are amazing.
Like, I can't wait for it to finish.
Like, I can see where this is trending.
So there's two things in particular I'm very excited for.
The first, you know, the first, and we'll see if it makes it in,
but it's around pre-filtering.
So a lot of the queries today we've been seeing is, like,
select star from vector table, order by, blah, blah, blah, blah,
you know, limit 10.
Find me my 10 nearest neighbors.
But what's happening in practice?
In practice, it's like select star from table where,
you know, some condition, category ID equals this,
order by, blah, blah, blah, blah.
So what this is kind of like is really a multi-column index
because what will happen is like today,
either you won't use the index,
effectively you're doing an exact nearest neighbor search, which means very accurate results. And maybe, you know, your data set gets filtered down
to a small enough size where it doesn't matter, like, you're getting very fast
search. But what if you have multiple fields that you need to filter over? Or what if the data set
you get back is 50,000 of these, you know, 1500 dimensional vectors?
This can be expensive.
So there's a patch out there.
It's based on a newer paper called HQANN.
You can check the pgvector repo right now.
That lets you build effectively a multi-column index where it's able to
build the links or group together.
Well, there's two things, right?
Because you still want to be able to search over your entire vector set and
find your most similar vectors, But it also groups the vectors by your filters,
so that way you're searching over just your filters,
and it does that pre-filtering.
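The query shape in question, as a sketch; category_id is a hypothetical filter column on the illustrative items table:

    -- today, the planner either skips the vector index (an exact scan) or
    -- filters after the index scan; the HQANN-style patch aims to apply
    -- this kind of filter inside the index itself
    SELECT id FROM items
    WHERE category_id = 123
    ORDER BY embedding <-> '[3, 1, 2]'
    LIMIT 10;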
This has come up by...
It's almost like...
This is like a what-have-you-done-for-me-lately feature,
because as soon as you put it out,
users find, like, you know, more of these case
studies emerge, and users find, like, hey, I really need this. What's great is that there are
people who are testing it against real data. And this is where, if you want to be involved in pgvector,
you can help: when you see these patches out there, please test and report back on them.
Because if you're finding the HQANN branch useful, describe your use case. Like, there's an
open issue where people talk about it; you know, they might say, like, hey, I really need this. And what makes it super valuable is when you talk
about like, here's exactly how I'm using it, because it just further justifies that this is
the right direction. I mean, one of the big goals of PG vector is to try to keep the code base
simple to maintain and also make the product simple or the extension simple to use. So hearing
more about what people are doing is great. Another patch that I'm excited for,
and again, we'll see if it makes it into 0.5.2,
you know, no guarantees,
is being able to support smaller dimensional sizes.
So right now, dimensions are four byte floats,
but there are definitely embedding models
that provide two byte floats
or one byte unsigned integers.
So there's a couple of branches that have those in there.
But again, hearing those use cases
will help further support it
because the added bonus to supporting
these smaller dimensions is that
we can index larger vectors
and that'll be able to go beyond that 2,000 dimension limit.
Yes, interesting.
The one final question I had is,
do you see any world where this becomes
part of core Postgres in the long term?
Yeah.
So at PGCon 2023 last year, I had a lightning talk,
which is basically to first shout to the wind to the community,
like, hey, these vector workloads, this is real.
That was still early.
But it's like, hey, this is coming.
I call it like a storm of data coming.
We want to make sure Postgres is positioned for it.
This is very similar to what we saw with JSON.
And I was able to get an unconference session as well,
where we discussed it.
And the general consensus was in the fullness of time,
it does make sense to have something like this
in upstream Postgres.
Great.
But there's, I think there's a few things here.
First, we have to look at release cycles.
You know, Postgres releases once a year.
In fact, the feature freeze for Postgres 17 is coming up in less than three months.
But effectively, there's a... Let's say we come up with the idea we want to support a vector data
type right now, or let's say after feature freeze. I mean, there's effectively an 18-month window
before it gets in because we have to go through the whole cycle.
And given the pace that this field is moving, we don't necessarily want to wait on the Postgres release cycle.
So being able to do this work in PGVector or other extensions does help accelerate adoption of Postgres as a vector database, so to speak.
The other thing is that once it is in upstream Postgres, that is the on-disk format.
That is the rule.
And Tom Lane made a very good point during that unconference session, which is
like, let's see how things shake out
in terms of what the on-disk format
is. Now,
pgvector is also trying to stay true to that
contract, and trying to keep the on-disk format
stable, effectively
not changing it. Because
as soon as you change it,
you've got to, like, rebuild, reindex, restore everything.
That's the cost of the operation.
So PGVector is trying to apply the same level of rigor
as Postgres to implementing these features,
but it can move a little bit faster
because it is an extension.
It can have its own release lifecycle.
So that's where it is, you know.
I'm not going to say this is an official
or unofficial community position, but is there interest in supporting it upstream? Absolutely.
But given the rapid emergence and development we need to make, we're trying to make as much
progress as possible within PG vector. Here's the other thing too, like the big difference between
now and JSON, we have the index access method. We can do this in an extension. That is a big change since 10, 15 years ago.
That's really interesting and good to know. There's another big difference, which is cloud providers.
So whilst it could be in core Postgres, currently a lot of people that install Postgres themselves, or manage Postgres themselves more specifically, can install pgvector. And people that don't are often on a managed service provider. Most managed
service providers have added support for pgvector already, which is the fastest I've ever seen them
add a new extension, and it's been pretty much everyone. So most folks that want access to pgvector,
at least some version of it, maybe not
always the latest version, can have it now as well. Yeah, it's definitely exciting to
see all the adoption of pgvector. And yeah, you know, to tease a little bit, I still
think the best is yet to come. Like, I think pgvector has made a lot of progress in the past
year. I mean, it's been tremendous work by the community, and in particular Andrew. Like, I can't
say enough nice things about the work Andrew has done, and really the diligence
he's put into it.
And I still think there's more to do.
Like I said, even just like, well, the patch that came out last night, I mean, just shows
that there's still even more performance that we can get out of it.
And for folks listening, we're recording on the 16th.
So I'll link that up.
Wonderful.
Jonathan, thank you so much.
Is there anything else I should have asked you but didn't,
or any last things you wanted to say?
Yeah, I think one of the big things when you think about contributing to open source
is writing a patch, writing code.
And I'd say this in the context of pgvector,
but I think this applies to any project, even Postgres itself,
is that there are many different ways to contribute. Testing is huge, you know, because testing, particularly if you can test something
that's close to a real workload, like don't test your production workloads with these things,
but like test something that's close to production. That helps because that helps drive the use case
and hearing how you use something and just talking about your different use cases and being supportive
of the community in that way. That helps. Helping to write documentation, helping to advocate for something that you think can help others.
Again, all these things can help a project.
So if you want to contribute to PG Vector or Postgres,
there are a variety of different ways.
And maybe to, as it's top of mind right now,
PGCon, which was effectively the annual developer conference,
it's evolved.
It's now called pgconf.dev,
P-G-C-O-N-F dot dev.
It's being held in Vancouver at the end of May.
I can tell you,
I can almost guarantee vector workloads will be a topic of discussion there,
but just all things Postgres.
And the idea is that,
while certainly a lot of the folks
at the discussions are around
technical hacking topics,
really, if you step back, the gist is, how do we continue building and growing the Postgres
community? So the CFP actually just closed yesterday, or when this airs, probably a few
days before. But that's a great way to participate as well, even if you're new to the community,
because I know personally, at one point, I was new to the community. And the first time I went to PGCon,
I'm like, oh my God,
I know nothing about how a database works.
I can write a select query, but geez.
But it is a way to help have impact on the community.
And just talking to folks
who are working on the software
or working on events or hosting podcasts
or finding ways to help grow the community,
it's a great way to
participate and help grow. So certainly, you know, think about attending and participating. And
again, there's all sorts of different ways to contribute. So that's the parting message I have.
I'm far from a database hacker. I can write, like, a couple of lines of C here and there. But
where I've found a home in open source
is working on all sorts of other aspects
around open source projects.
What a great message to end on.
Thank you, Michael.
Thanks so much for your time, Jonathan,
and for everything you do in the community.
Likewise.
Thank you for having me on your wonderful podcast.
Cheers.
Take care.
Take care.
Goodbye.