Disseminate: The Computer Science Research Podcast - Liana Patel | ACORN: Performant and Predicate-Agnostic Hybrid Search | #60

Episode Date: November 11, 2024

In this episode, we chat with Liana Patel to discuss ACORN, a groundbreaking method for hybrid search in applications using mixed-modality data. As more systems require simultaneous access to embedded images, text, video, and structured data, traditional search methods struggle to maintain efficiency and flexibility. Liana explains how ACORN, leveraging Hierarchical Navigable Small Worlds (HNSW), enables efficient, predicate-agnostic searches by introducing innovative predicate subgraph traversal. This allows ACORN to outperform existing methods significantly, supporting complex query semantics and achieving 2–1,000 times higher throughput on diverse datasets. Tune in to learn more!

Links:
ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data [SIGMOD'24]
Liana's LinkedIn
Liana's X

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host, Jack Wardby. Today we are going to be going on a journey into the world of vector databases, and taking us on that journey is going to be Liana Patel, who is going to be talking about her recent work that she published at SIGMOD this year: ACORN, Performant and Predicate-Agnostic Search over Vector Embeddings and Structured Data. Now, if you don't know what any of those things mean, listener, by the end of the podcast I'm sure you will do. So yeah, welcome to the show, Liana. Thank you, Jack. It's great to be with you. Cool. Before we do get into all the cool technical stuff, let's take a quick detour and find out about the person behind the paper, right? So yeah, tell us about your journey and how you became interested in databases
Starting point is 00:01:07 and machine learning research. Yeah, so I'm now a fourth-year PhD student at Stanford. Before that, I was an undergrad at UPenn, where I did a lot of systems work, and I had a really good advisor there, Boon Thau Loo, who got me very interested in this area. And when I came to Stanford, I was broadly interested in ML systems, so I was reading and exploring a little bit of both, but I didn't exactly know what I wanted to work on. And my two really wonderful advisors, Matei Zaharia and Carlos
Starting point is 00:01:40 Guestrin, I broadly brainstormed with them about a number of different areas. And this was an exciting area, you know, vector databases at the time. This is now, I guess, a few years ago, before the term vector databases was really mainstream. So it was a little bit lesser known, but seemed really interesting. And, you know, I read some of these ANN indexing papers, like HNSW and a couple of others, and I thought the ideas were really cool and they seemed really powerful, so I was interested to explore more there. Yeah, it must have been nice to be, not an early adopter, but sort of a pioneer, in the sense that the space wasn't really that well defined and you've been able to help define it. That must have been quite nice. It was really interesting, yeah, it was
Starting point is 00:02:24 an interesting time. I remember we had a chat with a couple of folks in industry and we would use the term vector databases, and of course there were companies in that area who had been working on this, but we would get funny looks when we used that word, because it wasn't as popular as it is now. Yeah, and another thing as well, obviously, with this being a fast-moving space, and I guess the machine learning space in general, really the AI space,
Starting point is 00:02:48 is so fast-moving. When I was reading your paper, I think it was at the start, section two, there was a background section and the sheer volume of references there, I was like, wow.
Starting point is 00:02:58 I mean, I thought it was bad in general in research, there's always new material coming out, but that's a lot of research in, I guess, quite a short amount of time as well. Yeah, it's crazy, because ANN search is really well
Starting point is 00:03:09 studied, so there's all these classical papers, but then also, you're right, it's such a fast-moving space, so there's a lot of recent works too. So there's a really massive amount of literature to look over, so we tried to be thorough and do our homework. Nice, cool. So yeah, we've mentioned a few key words there when we were just chatting, but let's set some more context for the listener and give some background to your work. So I guess maybe we can start with a kind of 101 on the vector database space. Tell us what a vector embedding is, what approximate, and I'll use the acronym, then you can explain it, ANN search is, and yeah, tell us all about it. Yeah, great. So ANN
Starting point is 00:03:46 search is approximate nearest neighbor search. And the basic paradigm here is that we want to represent everything as a vector in a metric space. And that will be very convenient for us to work over because then we can essentially just compare two points, compute a distance between them, and that will give us some notion of semantic similarity. And that's great when we want to work with unstructured data and do these types of similarity search. So the user has a query. I want to find all the items from my database that are most similar to it. All I need to do now is embed my query, convert it to a vector. And offline, I've already converted everything in my database to a vector. So this problem essentially reduces to just finding the nearest neighbors to my query. But okay, so actually finding the exact nearest
Starting point is 00:04:29 neighbors, we can't do that efficiently. So we'll actually find the approximate nearest neighbors. Okay, so how does this relate to vector databases? And, you know, there's a lot of work on RAG now. Well, ANN search is really the core component of a wide range of systems for search, for recommendation, for text retrieval, and most recently, a lot of work on retrieval augmented generation, where the crux of doing your similarity search is going to be some type of ANN index, some type of data structure that can efficiently perform your search and then give you back your results for that query. So a really popular use case has been RAG, where, for an AI system, if you want to ground your LLM on the context or
Starting point is 00:05:14 the knowledge in your corpus that is most relevant to the question, you would first do this search. And there's now a lot of offerings, both from startups and also in industry, that serve these types of queries with specialized indices or specialized data systems, or just data types in your database. And then you can perform that, get back the most relevant data, pass it to your LLM, and get a grounded response, and also ideally reduce the risk of hallucination. Given that, then, let's shift to the research focus of your paper now. So, on to this thing called hybrid search, because whilst ANN is good,
Starting point is 00:05:51 it's not great for every use case, right? And there's maybe some certain types of queries and things that we may want to do that it doesn't cover. So yeah, tell us about hybrid search, and the elevator pitch, I guess, for ACORN is where I'm going. Yeah, so ANN search is really great when you have fully unstructured data. So the queries that I want to serve here are similarity searches: given my query, find me the most similar documents. But, you know, there's this very realistic setting that we have in a lot of use cases, where I have
Starting point is 00:06:19 both unstructured and structured data. So let me give you an example of this. Let's say I want to find articles that are relevant to, let's say, the presidential elections. That's going to be my query. But I actually want to find articles that are from the last day, let's say. So date, in my database or my knowledge corpus, would be a structured field. And so this actually represents a hybrid query, where what I'm trying to do is a similarity search on the text part of my query, you know, election, or maybe something about presidential election, and I also want to serve a predicate here over my data, which is filtering on date. So that's the hybrid query. It's essentially combining both search over unstructured data and structured data.
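To make that concrete, here is a toy sketch of a hybrid query in Python. Everything here is invented for illustration: the two-dimensional "embeddings", the article metadata, and the brute-force scan that stands in for a real ANN index.

```python
from datetime import date

def dist2(a, b):
    # Squared Euclidean distance between two embedding vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical records: (embedding, structured fields). Real embeddings
# would come from an embedding model, not be hand-written like this.
articles = [
    ([0.9, 0.1], {"date": date(2024, 11, 4), "title": "election preview"}),
    ([0.8, 0.2], {"date": date(2024, 10, 1), "title": "campaign history"}),
    ([0.1, 0.9], {"date": date(2024, 11, 4), "title": "sports roundup"}),
]

def hybrid_query(query_vec, predicate, k=1):
    # A hybrid query combines similarity search over the unstructured
    # part with a predicate over the structured fields.
    hits = [(emb, meta) for emb, meta in articles if predicate(meta)]
    hits.sort(key=lambda rec: dist2(query_vec, rec[0]))
    return [meta["title"] for _, meta in hits[:k]]

# "Articles similar to my election query, from the last day."
recent = lambda meta: meta["date"] >= date(2024, 11, 4)
print(hybrid_query([1.0, 0.0], recent, k=1))  # -> ['election preview']
```

The interesting part, as the discussion goes on to show, is that serving this efficiently at scale is much harder than the tiny change in query semantics suggests.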
Starting point is 00:07:04 Nice. And I'm guessing the existing solutions before ACORN weren't great at these, right? So give us a rundown. Over the last few years there's been pre- and post-filtering, and then some specialized data structures, but they haven't really solved the problem. So yeah, tell us about those. Yeah, yeah, I think this is a really interesting problem, right? Because it feels like the semantics of my query have barely changed, but actually the problem changes a lot. So the most naive way you can do this is by taking your standard ANN index. And your ANN index is going to give you efficient search for the similarity piece over your unstructured data.
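The two baseline strategies around a vanilla ANN index, filtering before the search versus after it, can be sketched like this. This is a hedged toy version: a brute-force scan stands in for the ANN index, and the vectors and labels are invented.

```python
def dist2(a, b):
    # Squared Euclidean distance; a stand-in for the real metric.
    return sum((x - y) ** 2 for x, y in zip(a, b))

vecs = [[0.0], [1.0], [2.0], [3.0]]
labels = ["a", "b", "a", "b"]

def pre_filter(query, pred, k):
    # Pre-filtering: apply the predicate first, then rank survivors by
    # distance. This is a brute-force scan over everything that passes,
    # which is why it scales badly when many records pass the filter.
    cand = [i for i, lab in enumerate(labels) if pred(lab)]
    return sorted(cand, key=lambda i: dist2(query, vecs[i]))[:k]

def post_filter(query, pred, k, expand=3):
    # Post-filtering: take an expanded top-(k * expand) result from the
    # (simulated) ANN index, then drop whatever fails the predicate and
    # hope at least k survivors remain.
    top = sorted(range(len(vecs)), key=lambda i: dist2(query, vecs[i]))[:k * expand]
    return [i for i in top if pred(labels[i])][:k]

print(pre_filter([0.4], lambda l: l == "a", k=1))             # -> [0]
print(post_filter([0.4], lambda l: l == "a", k=1))            # -> [0]
# The post-filtering failure mode: too small an expansion leaves
# zero survivors, forcing an expensive retry with a bigger scope.
print(post_filter([0.4], lambda l: l == "b", k=1, expand=1))  # -> []
```

The `expand` knob is exactly the search-scope expansion discussed below: every extra candidate costs a distance computation, and there is still no accuracy guarantee.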
Starting point is 00:07:41 And you can essentially try to do your filtering either before or after doing that search. So that will give you pre-filtering, which is one method. So there, first you filter your data on your predicate, and then you do this brute force type of search over your data doing distance computations to perform that ANN piece. That's going to have a lot of scaling issues. So anytime you have a predicate that is maybe a medium to high selectivity, essentially a lot of things are going to pass your filter and it's going to become very inefficient to actually do that brute force distance computation piece. On the other hand, you can also do post-filtering. So there what you would do is you search your ANN index and essentially you're going to expand the scope of your search so that afterwards you'll perform your filtering and you hope to retrieve enough
Starting point is 00:08:26 things from your search that pass the filter. But actually doing this type of search scope expansion becomes very expensive, where essentially you're paying distance computations to do that type of expansion. And again, this is not going to give you very good accuracy guarantees; it'll have all different types of failure cases, and it's expensive again. So in both cases, and we benchmarked this, performance can degrade significantly, much worse than what you would expect typically from your ANN search. So researchers were realizing this, and we know now that pre- and post-filtering have these very severe performance limitations, even though they're
Starting point is 00:09:03 widely used in practice. And so the prior works that we had been looking at were all these types of specialized indices. And there's a number of those works, things like Filtered-DiskANN and HQANN, and Qdrant also offers one version of this. And essentially what they do is they put some type of restriction on the predicate set or your query workload, and they then use that to build a type of specialized index that can serve the workload well. And so in particular, some of the assumptions that they'll make are things like, you know, your predicate set has to have low cardinality. So for instance, you have to have less than maybe 10,000 distinct predicates that you can serve. Or maybe the types of predicate operators
Starting point is 00:09:44 that you want to serve are also limited. So maybe they can only be equality operators. So think about it: I can no longer do date ranges; I have to do my date equals X or my date equals Y. So again, these types of things become very restrictive. And for the general case, like the workloads that we're seeing, you generally don't want to make these assumptions. Ideally, you have great performance and you can do that for any query workload that you're given, or any predicate, essentially. Yeah, nice. When I was reading, I kind of wrote down, I don't know if you have any sort of, what's the word, gauge on how far this very limited set of predicates can actually get you in practice? Like, is it the case that you can
Starting point is 00:10:22 maybe just say it's good enough for 90% of the cases, or is it just catching 10% of what people want to do with this sort of thing? That's a good question. So there are use cases where it can be okay to make some of these assumptions; it really depends. In general, our philosophy with ACORN is that you should be able to serve all query workloads very robustly and not make any assumptions. If you can make some of these restrictive assumptions, you can actually do something like this ideal, or theoretical ideal, which we propose in the paper, which we call this oracle partition index. And so you actually have a decent way to handle this in a more brute-force way, which
Starting point is 00:10:59 will be okay. Yeah, so I guess it really depends on the data, which I don't really have a full view of. While you were talking there, it's interesting, sort of, if you have this nice general way of doing things, but then you can almost, I'm probably solutionizing here and taking this off course, so I apologize, but sort of being able to declare hints or things like that, you can incorporate that sort of information at runtime into your query
Starting point is 00:11:25 and then either decide to go the one way or the other way, sort of be able to factor it in and make a smart decision. I'm thinking, you're almost choosing a different type of index based on what information's coming in, but maybe I'm getting off course. So I think you're making an implicit assumption here, which is that if you can make these restrictive assumptions,
Starting point is 00:11:44 you can actually get better performance. And maybe I'm jumping ahead a little bit, which maybe I shouldn't do, but a small spoiler is: you can actually get state-of-the-art performance without making those types of assumptions, which is really cool. Okay, nice. Yeah, that is nice. Cool. So let's talk about ACORN in some more detail and let's get into it. So yeah, high level, what is the sales pitch for ACORN and what's the magic idea behind it? Yeah. So what we essentially want to do, kind of what I was describing, is we want to provide both a very performant index for hybrid search, and it should also ideally be predicate-agnostic. And what I mean by that is we're not
Starting point is 00:12:21 making any assumptions about the query workload, the types of predicates you can serve, the cardinality of predicates, things like that. So we're going to build this type of general index, and then we'll serve search for whatever query we get at query time. And the intuition here is that there is this kind of theoretical ideal, which is not actually tractable to build in all cases, or in the practical case, but it'll give us a nice way of thinking about how we want to do our search and how to build the index. And so for this theoretical ideal, essentially, you know, we can imagine, if we have some predicate set where maybe I have X number of distinct predicates, what I would do is this very simple type of partitioning approach, where I'll just take all my dataset records, or all my items, that pass each one of my given predicates, and I'll build an ANN index, or, you know, the one we use in the
Starting point is 00:13:16 paper is HNSW, which stands for Hierarchical Navigable Small Worlds. Cool name, by the way. It's a cool name for something like that. Yeah, I love the HNSW paper. It's really great. Really awesome name as well. But yeah, HNSW is the one we use in the paper. But the idea is that essentially, if you have this type of system where you can assume that you have some low-cardinality predicate set, or something that you can enumerate over, essentially you would just build an ANN index for every one of your known predicates. And then what you've essentially done is you've reduced this hybrid search problem to an ANN search problem. And then all the nice performance and methods that we can use in ANN search, we can apply here. And so HNSW is the one that we use
Starting point is 00:13:59 in the paper. But again, this is the general idea. Okay, so you can't actually do that because, in general, like I said, we don't want to make any type of assumption on the cardinality of the predicate set. Also, sometimes you just don't know your predicates during construction. So you will typically build your index offline, and you just want to build an index that will be able to serve whatever you have at query time. So we can't do that. But what we'll do instead is ACORN will build this kind of general index that looks kind of like HNSW, but not quite. And the main idea is that we're going to essentially emulate search over the oracle partition index without actually ever building it. And so the way we'll do this is, essentially, during search we'll traverse this ACORN index, and specifically
Starting point is 00:14:42 we'll traverse what we call the predicate subgraph. And this predicate subgraph is the subgraph induced by the nodes that pass my search predicate, the predicate of my query. So that's the thing I want to search over; it's the subgraph within my ACORN index. So the search actually is not going to be too complicated. But what will be difficult is that we actually need to construct this index such that any arbitrary predicate subgraph is going to look something like an HNSW graph. And so basically that subgraph needs to have all these types of navigability and connectivity properties, just like HNSW.
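The oracle-partition idealization just described can be sketched as follows. This is illustrative only: a sorted scan plays the role of each per-partition HNSW index, and the predicates are assumed to be known and enumerable up front, which is exactly the assumption ACORN avoids making.

```python
def dist2(a, b):
    # Squared Euclidean distance on toy one-dimensional "vectors".
    return sum((x - y) ** 2 for x, y in zip(a, b))

vecs = [[0.0], [1.0], [2.0], [3.0]]
labels = ["a", "b", "a", "b"]

def build_oracle_partitions(predicates):
    # One "index" per known predicate, holding only the records that
    # pass it. In the idealization each partition would be an HNSW.
    return {name: [i for i, lab in enumerate(labels) if pred(lab)]
            for name, pred in predicates.items()}

def oracle_search(partitions, query, pred_name, k):
    # Hybrid search now reduces to plain ANN search within one
    # partition: no filtering is needed at query time at all.
    part = partitions[pred_name]
    return sorted(part, key=lambda i: dist2(query, vecs[i]))[:k]

partitions = build_oracle_partitions(
    {"is_a": lambda l: l == "a", "is_b": lambda l: l == "b"})
print(partitions["is_a"])                             # -> [0, 2]
print(oracle_search(partitions, [2.9], "is_b", k=1))  # -> [3]
```

The catch, as noted in the conversation, is that one index per distinct predicate is intractable for high-cardinality or unknown predicate sets; ACORN's goal is to emulate searching these partitions without ever materializing them.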
Starting point is 00:15:19 And if we can do that, essentially what we're doing is, you know, emulating search over that oracle partition. Nice, cool. So it might be useful now to maybe go into how we actually construct the HNSW and search it, because I guess that's the baseline that we're going to build ACORN on top of, or augment slightly. So yeah, I guess, how do we construct it and how do you search? Great, yeah. So HNSW is this really awesome graph-based index for ANN search. I mean, there's a number of ANN indices; HNSW is a graph-based one, and there's other types of indices too, like tree-based ones, hashing (LSH-based) ones, cluster-based ones, etc. But graph-based ones are particularly
Starting point is 00:16:04 nice because they give us very good performance, and HNSW is one of these very widely used indices. So we focus on this one. So there's a couple of components. So, you know, hierarchical navigable small worlds, that's what HNSW stands for, and the name is very descriptive.
Starting point is 00:16:21 So I can maybe briefly talk about some of the properties that we have without going into too much depth, but I really like the paper, so I encourage anyone who's interested to check out the original paper by Yury Malkov. So first of all, it's a hierarchical graph structure in the sense that HNSW will have multiple layers. And in each of those layers, we're going to have a subset of the nodes in the dataset represented. At the bottom level, every item in your dataset is going to be a node in that level.
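As a small aside, the usual way nodes get assigned to layers can be sketched with the geometric level-sampling rule from the HNSW paper (the normalization parameter `m_l` below follows that paper; the rest of the setup is invented for illustration):

```python
import math
import random

def sample_level(m_l=0.5, rng=random):
    # Each node's top layer is drawn so that layer populations shrink
    # exponentially: level = floor(-ln(U) * m_l) for uniform U in (0, 1].
    u = 1.0 - rng.random()          # avoid log(0)
    return int(-math.log(u) * m_l)

random.seed(0)
levels = [sample_level() for _ in range(10_000)]
# Every node lives on layer 0; only a small fraction reach layer 1+.
frac_upper = sum(l > 0 for l in levels) / len(levels)
```

With `m_l = 0.5`, roughly `e**-2`, about 14%, of nodes appear above the bottom layer, and each further layer up keeps exponentially fewer, which is the sparsening described next.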
Starting point is 00:16:52 And then as you go up your graph, those layers are going to become sparser. So essentially, you're going to sample some subset of your nodes at every layer above: to get to the (i+1)-th level, I'll take some subset of my nodes from the i-th level. So that's the hierarchical piece. And there are some nice properties that arise from that, related to the efficiency of traversing that graph; I won't go into too much detail there. But another key property is that you also have a bounded degree for every node in your graph. And so again, this is a graph index where every vector, every item in your dataset, is a node in your graph, and then you're going to have edges connecting those nodes. And you're going to have edges between things that are
Starting point is 00:17:40 close together, things that are semantically similar. And every node will have at most M neighbors. And so that gives you a bounded degree, which again will give you some nice properties for your search. Okay, so essentially you have this type of graph index. And if it's not completely crystal clear, that's okay; let's just black-box it as a graph structure where all your nodes represent your items and you have some edges between them. Okay, so then for your search traversal, the way we'll actually find our approximate nearest neighbors is we're going to start from some predefined entry point, and then we're essentially just going to do a greedy search. And the way this looks on HNSW is kind of a slight variation of greedy search, where what I'll
Starting point is 00:18:24 actually do is a greedy search layer by layer. So I'll start from my top layer: I have an entry point into my top level, and then I'll do a greedy search just on that level. And once my greedy search converges, I'll use that point to drop down into my next level. And remember I said that every level above is a subset: if I have level i, level i+1 is going to be a subset of it. And so if I find something on level i+1, it will also be in level i. And so I can just essentially drop down to that point and then repeat my procedure. I again do a greedy search, find my convergence point, drop down, and I continue until I get to my bottom level, which has all the nodes in my dataset and will give me kind of this
Starting point is 00:19:05 finest granularity search. So the idea is, you can think of it as starting from this kind of zoomed-out view and gradually refining your search as you go down, because you have more and more nodes as you drop down each level. I mean, this thing is a lot easier to describe with images, right? This is where the limitation of the audio-only format comes in; we need a whiteboard now and it'd be a lot easier. But yeah, we're waving our hands a lot here. So just to run through that again real quick: we have N number of levels, and then on
Starting point is 00:19:45 the base level, that's where all the nodes are, and each node represents a vector, right? One of the vector embeddings, I guess. And they're linked up to some degree M, which is configurable; that's how many edges we'll have going out of and into any one of these nodes. And then as we go up, we have a smaller subset of them all. And when we want to search, we start at the top and do a greedy search, which means, I guess, we just go to the next one based off where we are at the moment, so it's using local-only information. Explore the tier you're on, then drop down (because, it being a subset, you're also in the next level), and you keep doing that till you get to the very bottom, at which
Starting point is 00:20:22 point you're left with your k (or N, whatever letter you want to use) nearest vectors, the ones most similar to your query, and job's a good'un. Yes, exactly. Yes, that's a great high-level sketch. Perfect, I had a good teacher. Cool. Amazing stuff, yeah.
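That layer-by-layer greedy descent, as just recapped, can be sketched in a few lines. This uses toy one-dimensional "vectors" and hand-built layers; a real HNSW constructs these graphs itself and uses a beam search (an `ef`-sized candidate list) rather than pure greedy.

```python
def dist2(a, b):
    # Squared Euclidean distance between toy vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_search(layer, vecs, query, entry):
    # Greedy walk on one layer: hop to any neighbour that is closer to
    # the query than the current node; stop at a local minimum.
    cur, improved = entry, True
    while improved:
        improved = False
        for nbr in layer.get(cur, []):
            if dist2(query, vecs[nbr]) < dist2(query, vecs[cur]):
                cur, improved = nbr, True
    return cur

def hnsw_search(layers, vecs, query, entry):
    # Layers are ordered top (sparsest) to bottom (all nodes). The
    # convergence point on each layer seeds the next one down, which
    # works because every upper-layer node also exists below.
    cur = entry
    for layer in layers:
        cur = greedy_search(layer, vecs, query, cur)
    return cur

vecs = [[0.0], [2.0], [4.0], [6.0], [8.0]]
bottom = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
top = {0: [4], 4: [0]}        # sparse sample of the bottom layer
print(hnsw_search([top, bottom], vecs, [6.5], entry=0))  # -> 3
```

The top layer jumps the walk most of the way across the dataset in one hop; the bottom layer then refines it to the true nearest node, which is the zoom-out-then-refine behaviour described above.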
Starting point is 00:20:38 On that note, we've got our HNSW. How do we modify this? How do we get to ACORN? Okay, so there's really two pieces here, right? It's the search algorithm and the construction algorithm that I'll talk about. I'll start with the search, because it's a smaller modification, very intuitive. Okay, so the goal, which I mentioned earlier, is that we want to traverse and navigate over this predicate subgraph. Okay, so our search is going to essentially look very similar to HNSW
Starting point is 00:21:05 search. We're going to do this greedy search that starts at some fixed entry point at the top level of our graph. And the main difference is just going to be how we do neighbor lookups at every node that we traverse. So typically, when we do a neighbor lookup, we would look at all the neighbors and pick the one that gets us closest to the point we're looking for, so essentially the one with the minimum distance. But what we'll do here instead is first just filter out the nodes that don't pass the predicate. So now what I'm essentially doing is searching exclusively over these nodes that pass the predicate. And so this is helping me navigate over that predicate subgraph
Starting point is 00:21:45 that I was mentioning. So kind of a simple idea there: we're just performing a filtering step during our greedy search. Okay, there's going to be a small addition that we add, and I'll describe that in a bit more detail, but I'll talk about the construction algorithm first. So how do we actually construct an index that gets us close to something where we can emulate the oracle partition with any type of predicate subgraph? The main idea is that we're going to create this denser HNSW graph. When we create that denser graph, that's actually going to create a bit of an issue, because we're going to blow up our memory overhead
Starting point is 00:22:21 or the space footprint of the index. And this is really important, because graphs like HNSW are actually memory-resident, so you care a lot about keeping a modest memory overhead. And so to address that, we'll then perform this type of compression. And that compression is going to be predicate-agnostic. So again, it must not use any information about possible predicates I might see at query time, because again, this is offline; this is the construction process. Okay, so the first piece, creating the denser graph, is going to just expand each one of
Starting point is 00:22:53 my neighbor lists by this factor gamma. So typically, I told you my nodes have a bounded degree of M; what we'll do instead is take M times gamma neighbors per node. So there's a small question you might be asking yourself: how do I set this gamma? What is this gamma? That gamma is an expansion factor.
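Putting the two ACORN ideas together, denser M-times-gamma neighbor lists plus the predicate-filtered neighbor lookup from earlier, here is a toy single-layer sketch. The nearest-neighbor edge lists stand in for real HNSW construction, and the pruning step discussed below is skipped entirely.

```python
def dist2(a, b):
    # Squared Euclidean distance on toy one-dimensional vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

vecs = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
labels = ["a", "b", "a", "b", "a", "b"]

def build_layer(m, gamma):
    # Give every node m * gamma neighbours (its nearest ones, in this
    # toy version) instead of m, so predicate subgraphs stay connected.
    deg = m * gamma
    return {i: sorted((j for j in range(len(vecs)) if j != i),
                      key=lambda j: dist2(vecs[i], vecs[j]))[:deg]
            for i in range(len(vecs))}

def acorn_greedy(layer, query, entry, pred):
    # ACORN's search twist: filter each neighbour list on the query
    # predicate before the greedy comparison, so the walk stays on the
    # predicate subgraph. Assumes the entry point passes the predicate.
    cur, improved = entry, True
    while improved:
        improved = False
        for nbr in [n for n in layer[cur] if pred(labels[n])]:
            if dist2(query, vecs[nbr]) < dist2(query, vecs[cur]):
                cur, improved = nbr, True
    return cur

only_a = lambda lab: lab == "a"
sparse = build_layer(m=2, gamma=1)   # plain degree-2 graph
dense = build_layer(m=2, gamma=2)    # densified, degree 4
# On the sparse graph the "a" subgraph is disconnected: the walk
# starts at node 0, reaches node 2, and gets stuck there.
print(acorn_greedy(sparse, [5.0], 0, only_a))  # -> 2
# With the gamma expansion the walk reaches node 4, the true
# nearest neighbour among nodes passing the predicate.
print(acorn_greedy(dense, [5.0], 0, only_a))   # -> 4
```

The contrast between the two runs is the whole motivation for the expansion: without it, filtering out non-passing neighbors can strand the greedy walk on a disconnected fragment of the predicate subgraph.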
Starting point is 00:23:16 And maybe I won't go into all the details here, but essentially we would set this to be inversely proportional to the minimum predicate selectivity that we want to handle with our index. And there are ways to set this, I won't go into too much detail, but you can set it either with or without knowledge of your predicate set. Essentially, there is a point where pre-filtering would just become more efficient and give you a very high-accuracy solution, and so you can find that point and choose your gamma that way. Okay, so so far we've just expanded our graph. We've got our denser
Starting point is 00:23:43 HNSW graph. Now we've blown up our memory; we have a very high space footprint. So what do we do? We have to do some type of pruning procedure. And we basically analyzed this in the paper; HNSW actually does have a pruning procedure, but what we want is to remove any nodes that are already neighbors of nodes previously stored in our neighbor list. So that's the key idea of the pruning. And what we'll actually do during search is a slight modification where we'll expand to neighbors of neighbors when we look over our neighbor list. Okay, cool. So we've got this denser graph, and then we have this... so when does the pruning step happen? That is happening at construction time, right, when I'm deciding... okay, so when I'm searching, I kind of have my friends-of-friends, two-hop nodes, right? I know I say friends of friends because I'm always thinking about my
Starting point is 00:24:49 go-to example of a graph, a friends-of-friends social network. Yeah, my friends of friends; I've got my two hops, and that allows me to look for... okay, so now I'm getting it: the pruning is discarding, when actually creating the list, I only include people if they're not already a friend of somebody else in there. So, using the friends example again, I don't have them in my list. Okay, so it gives us a nicer, richer representation, cramming more information in, basically, which allows us to search more efficiently, right, and get closer to this idealistic goal. Okay, cool. So with that, you mentioned a lot there about being memory-resident
Starting point is 00:25:26 and being conscious of space. There are two variations of ACORN, right? There's ACORN-gamma and ACORN-1. I said "ACORN" then as in the singer Akon. Yes. So yeah, tell us about the differences between the ACORN variants, I guess. Yes, yeah, so that's a great point. So we can do something maybe even stronger than the compression that ACORN-gamma does: we can move this type of expansion entirely into the search, keeping the search expansion during search and not actually doing the neighbor-list expansion during construction. And so what that gives us is this ACORN-1 algorithm, which is search-heavy and construction-light, meaning that we're doing less work during construction and we're doing
Starting point is 00:26:15 slightly more work during search. And so essentially our construction algorithm now will just give us M neighbors per node, with no pruning. And then during search, what we'll do is expand our neighbor list to also look at neighbors of neighbors, and once we do that, we'll apply our predicate filter to that neighbor list and get back the subset that will help us traverse our predicate subgraph. Nice. So I guess we're trading off here: search time for construction time, right? As well as space in memory. So I guess that is a nice segue into how ACORN performs. Tell me how you went about evaluating it. What datasets did you use? Who did you compare it against? How bad does post-filtering
Starting point is 00:27:00 suck? Tell me everything. Yeah, yeah. So we looked at a number of datasets, and there are really two types. So first, I talked a lot about these specialized indices and prior works that were good for these types of constrained query workloads, specifically workloads with low-cardinality predicate sets. And so we tested on those datasets because we wanted to be able to compare to all of those prior works and baselines. So that was the first class. And then we also tested on these more complex datasets, which have, you know, arbitrary predicates, various types of predicate operators, even regex expressions, for example, and also high-cardinality predicate sets. So you can think of, you know, many more than thousands or
Starting point is 00:27:46 tens of thousands of predicates in your predicate set. So these are the two types of data sets we looked at. They were these simpler ones, which prior works had looked at, and also these more complex ones. And then we also did a number of ablations on various properties of the query workload. So things like the selectivity of the predicates that we're looking at, we actually define this notion of correlation. So there are cases where your query vector might be very close to the cluster of points that pass your predicate. That represents this positive query correlation setting, but the reverse might also be true. So your query vector might also be far away from the cluster of points that pass your predicate. And so ideally, a hybrid index needs to serve both of these.
Starting point is 00:28:29 And so correlation was a really important property to also analyze. So we looked at that. And then last... Sorry to interject real quick, Liana. Do you have a kind of example of the query correlation? Because that's a really interesting thing. Do you have like an intuitive thing of like, kind of like what that might...
Starting point is 00:28:43 An example of that, basically. Yeah, yeah. That's a great question. So I really liked looking into this, because we actually looked at it on this image data set, and there were kind of a lot of fun examples that we showed in the paper. So again, slightly limited because I'm going to try to just describe them, but I'll give you one example. So, you know, we have this image of this small child that's in this meadow of like green grass. And so the query image is just that. And my predicate is going to be some type of keyword. So, you know, one keyword I might have is green, like show me images that look like that child in the meadow. And I also want a keyword match on the word green. So intuitively, because the child is in this meadow of green grass,
Starting point is 00:29:30 green, it kind of corresponds with that image, right? And so what I'll get back are these images that look a lot like my query vector. All the things that are most similar to my query image, in fact, actually kind of match this keyword green. They're also small children in these grassy fields or meadows. But my filter could be something else. So let's say I have this image of this cute, adorable child in this grassy field, but my keyword filter is going to be scary, let's say. Scary, it's not really capturing my image. It's kind of like the opposite
Starting point is 00:30:03 intuitively of what my image is showing. So what I'll actually get back are kind of like these images of, you know, children during Halloween, or something like a sad child, kind of like a child making a scary face. So that's an example where I don't actually expect the things that match my keyword, my predicate, to be very semantically similar to my image, which is this cute child, not scary at all, in this field of grass. That's a really cool example, yeah. I found it fascinating when I was reading through. I was like, oh yeah, wow, okay. How did you encounter that? I'm probably going off topic again, but
Starting point is 00:30:40 how did you kind of first come across this idea of query correlation? Yeah, that's a great question. So I'm trying to think back to how we first started thinking about it. A reason why you should care a lot about query correlation is that one of the baselines that's pretty widely used in a lot of systems, post-filtering, will fail significantly when you vary your query correlation. So it was very relevant for studying that method. And we actually showed this in the paper: if you have these negative-correlation queries, post-filtering will give you very bad accuracy.
Starting point is 00:31:13 And the reason is that the implicit assumption you're making when you post-filter is that the place your vector search drops you within your larger vector space is going to be close to the set of things that pass your predicate. So in other words, the post-filtering assumption is that your hybrid search targets are close to your ANN search targets. That's the assumption you're making. But that doesn't have to be true in all cases. And there are many cases, like outlier detection, you know, those are real use cases where that might not be true. And so yeah, it becomes very important to be able to actually serve that. Yeah, that's really cool. Anyway, so let's get back to the evaluation. Sorry. Yeah,
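A toy simulation makes this failure mode concrete. Below is a hedged Python sketch with synthetic data and illustrative names: the predicate-passing points are placed far from the query, i.e., a negatively correlated query, and post-filtering a fixed unfiltered candidate budget recovers essentially none of the true hybrid answer.

```python
import random

def topk(points, query, k, ids=None):
    """Exact k-nearest-neighbor ids by squared Euclidean distance."""
    ids = range(len(points)) if ids is None else ids
    d = lambda p: sum((a - b) ** 2 for a, b in zip(p, query))
    return sorted(ids, key=lambda i: d(points[i]))[:k]

random.seed(1)
# 1,000 points near the origin; the 50 predicate-passing points live far
# away, giving a *negatively correlated* query (the query is at the origin).
points = [[random.gauss(0, 1) for _ in range(4)] for _ in range(1000)]
passing = set(random.sample(range(1000), 50))
for i in passing:
    points[i] = [c + 10 for c in points[i]]   # shift the passing cluster away
query, k, budget = [0.0] * 4, 10, 100

truth = set(topk(points, query, k, ids=list(passing)))   # ideal hybrid answer
# Post-filtering: take the top-`budget` unfiltered neighbors, then filter.
post = [i for i in topk(points, query, budget) if i in passing][:k]
recall = len(set(post) & truth) / k
print(f"post-filtering recall at budget {budget}: {recall:.2f}")  # near 0
```

The budgeted ANN pass spends its entire candidate list on points near the query, none of which pass the predicate, so the filter step has nothing left to return; pre-filtering (searching only `passing`) trivially achieves recall 1.0 here, at the cost of a scan.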
Starting point is 00:31:57 we've been through the data sets, I think. And did we talk about who we compared against? Okay, yes, great. So there were a number of baselines here, and we looked at essentially benchmarking all of these methods. So that includes methods like Filtered-DiskANN, which has these two variants. Milvus also has hybrid search, and we looked at various indices with their hybrid search. There was also HNSW pre-filtering and post-filtering. There's also this method NHQ, which we looked at, which is a specialized index.
Starting point is 00:32:27 So essentially the baselines were these sets of specialized indices, as well as pre- and post-filtering, which we looked at. And then one additional thing that was also kind of interesting, to give us some intuition on these data sets with low-cardinality predicate sets, which are relatively simpler: it was also possible to simulate this oracle partition index, which represents this theoretical ideal. And so the goal there was to check: are we actually getting performance close, with ACORN, to that oracle partition? And so these are
Starting point is 00:32:59 all the methods that we were comparing against. Cool, now it's numbers time! So yeah, how did it do? Yeah, okay. So I'll give you the high level first. So across all these data sets, ACORN-gamma and ACORN-1 achieve 2 to 1,000x higher QPS at fixed recall. So really significant performance gains, which is really cool. And this is across all these data sets. If I'm to break down the numbers for you, we'll first start on these low-cardinality predicate sets, which is what the specialized indices are really designed for. Okay, so we basically expect those specialized index methods, like Filtered-DiskANN and NHQ, etc., to perform very well on those. In these settings, ACORN-gamma outperformed those indices by 2 to 3x higher QPS at a fixed recall. So that was 0.9 recall that we looked at. And we also very closely
Starting point is 00:33:53 emulate the oracle partition method. So that's pretty cool, because we're essentially getting very close to the theoretical ideal there without any of its assumptions. Okay, so that's the first setting. The second setting was these more complex data sets. And so we looked at things like keyword search, regex matching, and some others. And there, ACORN-gamma outperformed the baselines, which were pre- and post-filtering, by 20 to 1,000x. So really significant performance gains, very much highlighting the limitations of pre- and post-filtering, where post-filtering will typically not give you high-quality results, and pre-filtering just won't scale very well.
Starting point is 00:34:30 So those are kind of the main benchmarks. And then we also did a number of ablations looking at variations in predicate selectivity, scaling of the data set, and then also this correlation setting. And ACORN-gamma and ACORN-1 actually perform very robustly across all three dimensions of the query workload, and so that was very cool to see. Nice, that's it! Hybrid search, predicate-agnostic hybrid search: done, solved problem. Now that's it, yeah, mic drop. Yes! So, I always have to ask about limitations of work, even though obviously we've just concluded it's a solved problem, but yeah, are there any scenarios in which the performance of ACORN is sort of suboptimal still? Yeah, so I would say that the main downside, or kind of the main overhead of these methods,
Starting point is 00:35:15 it's the time to index. And also, depending on which method you're using, it can also be space footprint. So as I mentioned, with ACORN-gamma you're essentially doing this neighbor-list expansion during your construction, and so that will give you a higher time to index. It essentially increases your TTI, the time to index, by this gamma log gamma factor. And so we did some complexity analysis looking at this, and that's one thing you should definitely consider if you want to use this method for hybrid search. ACORN-1 actually addresses that limitation: you're no longer doing this gamma log gamma; you're just going to potentially increase your space footprint, and very marginally compared to HNSW. So that's one thing to consider: what is the overhead that you're willing to pay for this type of generality in your hybrid search method?
Starting point is 00:36:07 So yeah, that's a big thing. I would say one general thing is that, you know, HNSW does not give you guarantees on performance. And so again, with ACORN, we don't do that either. What you do need to do is just benchmark your performance on the given data set you have, and ideally a query workload that you have. Yeah, find what works for your workload. On the time-to-index thing, is this sort of the pattern here: I create my index, and then how long does that index then get
Starting point is 00:36:36 used for? Because obviously data is changing, right? So I'm thinking here, how about we modify our index in place rather than having to build a new one from scratch every time? Is this sort of an updatable version of ACORN something that you would maybe consider, or maybe that's not the way people do things? Yeah, this is a great thought, because it's actually great future work. And so in general, incremental updates
Starting point is 00:36:59 to graph-based indices like HNSW are kind of non-trivial. And there has been work studying this: as you keep updating your index, you will get this degradation in performance, which can be small or sometimes large. So essentially, as you keep applying updates, your index is not doing as great. So this is kind of a research problem that has gotten some work in this area, and I think there's more to do, especially with ACORN in particular. So standard graph-based indices have some methods that can be applied. Essentially, you look at two-hop neighbors of the node that you're updating. So
Starting point is 00:37:34 this is particularly an issue if you're doing a delete. So you'll essentially look at the neighborhood of that item that you're trying to delete, and you'll rearrange your edges so that you can account for that. So we haven't yet applied that to ACORN, so this is a great direction for future work; it'd be really interesting to look at that. And you're exactly right: that would help you to amortize this fixed cost you pay to do your indexing, and essentially you can keep your index longer if you were able to solve for that. So yeah, that's a great thought. Yeah, kind of on future work then, what else have you got in the pipeline for ACORN? Where else do we go with it? Yeah, I think there's,
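The two-hop delete repair described here can be sketched as follows. This is an illustrative simplification with hypothetical names; real streaming-update systems (e.g., the FreshDiskANN line of work) additionally re-prune the spliced edges by distance rather than just capping degree.

```python
def delete_with_repair(neighbors, victim, max_degree):
    """Remove `victim` from an adjacency-list graph index, reconnecting each
    node that pointed at `victim` to `victim`'s own out-neighbors (its
    two-hop neighborhood) so the graph stays navigable after the delete."""
    out_edges = [n for n in neighbors[victim] if n != victim]
    for node, nbrs in neighbors.items():
        if node == victim or victim not in nbrs:
            continue
        nbrs.remove(victim)
        # Splice in the victim's two-hop neighborhood, capped at max_degree.
        for cand in out_edges:
            if cand != node and cand not in nbrs and len(nbrs) < max_degree:
                nbrs.append(cand)
    del neighbors[victim]
    return neighbors

# Tiny example: nodes 0 and 1 point at node 2; after deleting 2 they
# inherit 2's out-neighbors (0 and 3) instead of dangling.
g = {0: [1, 2], 1: [2, 3], 2: [0, 3], 3: [1]}
delete_with_repair(g, victim=2, max_degree=3)
```

Without the splice step, deleting a hub node can disconnect regions of the graph; the repair keeps former in-neighbors reachable through the victim's old edges, which is what lets the index survive updates without a full rebuild.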
Starting point is 00:38:12 you know, so much to do in this area. I'm really excited about all the excitement and activity that's been in industry and also in research. Some things that could be really cool: obviously, one is addressing some of the TTI and the overhead limitations; that could be very powerful. There's also a lot of interest, I think, in kind of like serverless architectures for vector databases, or for serving these types of ANN queries. So looking at what that might look like for a graph-based index like HNSW or ACORN could be very powerful. So in particular, if you have this distributed setup and you want to maybe be able to route your query across different nodes, you know, how do you actually build that efficiently? That could be really interesting.
Starting point is 00:38:54 I think there's this interesting work to do on the indexing side, something that I've also been really interested in, and that has been some of the recent work that I've been looking at. I think there is also so much to do in thinking through, you know, what other queries do users want to ask over their data? And how do we serve those queries, or those questions, with these efficient systems that are general purpose? So, you know, RAG is something that has been pretty well studied. Hybrid search is this interesting variant where it's not quite RAG,
Starting point is 00:39:25 but I think there's actually so much more to do just in this general space. RAG is really just one point in this larger space of queries that we need to ask, or need to serve, over the data. Nice. Is that more a case of defining, at the top level, how you ask these queries, or more in the sense of:
Starting point is 00:39:44 I'm asking these types of queries, and these are the indexes I need to use for this subset of queries? Or is it something else, have I missed the point? Yeah, maybe, I guess I was very vague about what I meant by this. I think the broader theme, when you think about where hybrid search or ANN search is mostly used, it's probably in this LLM paradigm, where you do retrieval, then you pass it to your LLM. And that's good. Essentially, that represents a point lookup on your data, where you have some answer.
Starting point is 00:40:14 Basically, you have a question, and that answer can be localized to a little bit of data in your database. But I think there's this whole space of what we would call bulk semantic processing, where you might want to run your LLM over a lot of data and serve these queries or questions over your data, where the query pattern that you need looks very different from what RAG gives you. Very nice. Talking about sort of real-world impact, I guess, kind of
Starting point is 00:40:42 building off slightly what we were just talking about: what impact do you think ACORN can have, or is already having, in sort of production systems? Can it be easily incorporated into an existing vector database? Is it out there in the wild already? Yeah, what's the story there? Has it had much impact? Yeah, so I'm really excited to see the impact of ACORN. You know, some of the things we thought through as we were designing the method: we wanted it to be very easy to implement in existing HNSW libraries or existing ANN indexing libraries.
Starting point is 00:41:13 And we've actually seen some adoptions. So there's this really cool release, for example, from the Weaviate team, and I'm not being paid to say this, but it was a really cool release where they actually incorporated this ACORN-based hybrid search, and we see these very substantial performance improvements. So that's really cool. I think
Starting point is 00:41:29 hybrid search in general can be very powerful for letting users leverage both structured and unstructured data. And so I'm very excited to see more systems adopt methods like this, so that you can really do that very effectively. You need to send them the invoice, Liana, for that sponsorship, the paid promotion! Cool, yeah. So, how long have you been working on the ACORN project for, from start to finish? How long has the journey been with it? Yeah, I guess it was about a year or so. So this was one of the first major projects of my PhD. I had a lot of fun with it, and then we presented it at SIGMOD, which you saw. So yeah, it was about a year. Yeah, yeah. The question I'm leading on to is sort of
Starting point is 00:42:17 across that duration, what was the thing that jumped out the most when you were studying it? Like, oh, that's surprising, did not expect that. What caught you off guard? Yeah, I think there were really two things, which maybe I somewhat hinted at earlier. I think the first thing is, when I was first looking at this space, it was totally not clear at all, you know, why is this problem even difficult to solve? Because all we're really doing is changing the semantics of a typical retrieval query very slightly, where you're adding this predicate. But it feels like pre- or post-filtering should work quite well. But when we actually benchmarked those methods, what we saw is that just changing that query very slightly completely changes the performance.
Starting point is 00:43:00 And what you typically expect out of an ANN index, you don't get from your hybrid search query with the baselines that people were using; performance was very severely degraded. So that was just surprising in itself, kind of just understanding the problem, right, and the nuances here as to why hybrid search is so difficult. I think the second thing that was kind of cool, and that I think was a very fun outcome of this project, was that you actually don't need to forego generality to get very good performance on this problem. And so specifically, when we looked at these specialized indices, you know, the question was: can you really get something that does quite well without making those assumptions about your query workload? And actually, yeah,
Starting point is 00:43:45 you can, you can do that. And essentially the way we formulated it in ACORN was to, you know, think about this oracle partition index, or that theoretical ideal, and then try to emulate that. So that was also a really cool outcome. Yeah. I mean, you kind of do have that implicit assumption that when you generalize something it ends up not being as performant; a specialization normally means you're targeting one specific use case, therefore you're probably going to go faster or whatever. Yeah, that is a really surprising outcome, right? The generality making it pretty good, and actually it's faster than everything else, which is a nice outcome. How did this idea come about in the first place, and who
Starting point is 00:44:22 proposed this sort of space initially? Who proposed the space? Yeah, that's a good question. So I think broadly, I was interested in this area, and I was talking with my advisors about vector databases and, you know, what could be interesting problems within this area. And so we had read some of the existing works, and I felt like hybrid search could be one interesting problem, where there are these real query workloads where people really want to ask questions over both unstructured and structured data. And so that was initially compelling. And I don't think it was until we really ran those benchmarks that I was convinced, like, this is a good problem to focus on.
Starting point is 00:45:03 So yeah, it was very incremental and very iterative, going from "this could be interesting" to "okay, I want to work on this". Yeah, it's funny, I think often there's not just a light bulb moment of, okay, I'm going to do this, this is the thing I want to do for the next 12 months, right? It's always sort of incremental. And that actually is a very nice segue into my next question, which is about idea generation and selecting things to work on. So yeah, how do you approach that specific problem? Is it incremental? How do you go about it? Yeah, this is a really hard question to answer. Because also, my favorite question! Long-time listeners will know I love this question, this is my favorite one. Yeah, yeah, it's a great question. I'd actually be curious to hear your thoughts as well, since you also completed a PhD. But yeah, I find that it is somewhat ad hoc. I think one of the very nice things about academia is that you have an incredible amount of freedom. And so generally, what will steer the broad direction is just my interest, like, what do I want
Starting point is 00:46:08 to learn about, or what would be nice to spend the next year learning about? So I think that's kind of first, because, you know, just doing all the work, there's a lot of experiments that need to be done, a lot of papers to read. So just: what feels interesting? That's the first thing. And then I also think that there's a lot of thought that goes into impact, and this is something where I really appreciate my advisors, because I think they have great taste, and they've helped me really refine my taste as well in terms of what could be an impactful project. And so, yeah, what ends up happening is I'll modulate between these things: what is interesting, what could be impactful.
Starting point is 00:46:42 I do try to find some sweet spot between those two. And I would say that, again, it's a very iterative process where I'll brainstorm broadly: these could be directions that we could go in. Sometimes I'll be a little blocked. Sometimes I'll have these late nights of thinking about problems and how to formulate them. And eventually things will come together, where at a certain point we've actually formalized: this is the problem definition, and this is what we think we could achieve, ideally. Yeah, that's amazing. I mean, I imagine Matei knows a lot about impact, right, in terms of which works, he's been quite successful on that front. So yeah, there's good guidance there. But no, do you know, I think
Starting point is 00:47:24 you hit on a few key points that I completely agree with. I mean, you need to surround yourself with people that you can sound ideas off, where you can go and say, what do you think of this, and they can rely on their experience to inform whether you should drop this or pursue that any further. But I found that my approach to it changed across my PhD, as I got more and more familiar with an area. So I think what I lacked at the start, what you need at the start, is a certain amount of breadth, basically. So you know what's out there and you know what you like, right? So you need that for one, and then once you've sort of narrowed down, it's
Starting point is 00:47:58 reading: you've got to read a lot, right? You've got to know your area, because only once you've got to that point can I then ask, I think, oh, maybe tweak it this way or maybe tweak it that way. But I always like to, when I'm reading things,
Starting point is 00:48:12 go through and highlight the assumptions, right? And then challenge those assumptions, because that's where you can create something new a lot of the time. Oh, what happens if we relax this?
Starting point is 00:48:21 But you've got to start from somewhere, right? Yeah. I think getting a lot of breadth in there, and surrounding yourself with really smart people, always helps. And then, when you get something that really grabs you, something you're really interested in, just be relentless with it, pursuing it and iterating. And if you keep iterating, eventually something good's going to fall out of it. I hope so, anyway. But yeah, hopefully.
Starting point is 00:48:45 Cool, so we're getting our way towards the last word now. Before we do that, I wanted to get your take on the vector database space, because there was this keynote, or a panel, at VLDB this year where they were talking about: do we need standalone vector databases, versus should that just be consumed by SQL and the relational world? And I want to get your opinion on it: whether you think it should be its own independent thing, and it needs to be, or it should be,
Starting point is 00:49:17 we can kind of put these indexes and things in an existing relational database and, job done. And then you obviously have all the benefits of transactions and all the other nice things that we've got from decades of work on these relational systems. Yeah, so what's your take on that question? Yeah, this is a really good question. So I think I have some slight biases, but, you know, a large premise of this work was that I think there's so much power in unifying both structured and unstructured data and working over both of those. So I do think that systems that will be very successful can bring both of those worlds together and let users seamlessly operate between both structured and unstructured data.
Starting point is 00:50:02 And ACORN is one example of something that moves us in that direction. As you noted at the beginning, I did also spend an internship at one of these specialized vector database companies. So I think there can also be power in, you know, the performance of building those specialized solutions. Maybe for some specific domains, for example, like recommendation, you really need very high-efficiency vector search. But yeah, my general thought is that, whether vectors are a data type in your database or you have a full-fledged vector database, I think there needs to be broader support for all the things that you would ideally want to do over your structured data. Transactions could be one of them. You know, analytics over your structured data.
Starting point is 00:50:50 Hybrid search over your unstructured data. I think that's really crucial. Yeah, I completely agree. It's that classic debate of specialized systems versus one-size-fits-all systems, right? And it was funny to see that discussion at VLDB, because obviously I worked for a graph database company, and it was subject to the same sort of debate when graph databases were emerging. But anyway, yes, cool. So let's get on to the last word then. What's the one thing you want the listener to take away, sorry,
Starting point is 00:51:31 from this podcast today? One takeaway. Yeah. So I think the lesson of ACORN, in my mind, was really this key piece: you can get high-efficiency performance without foregoing generality. And I think this is a powerful concept for systems and for optimization. You can actually achieve both. And so I think that's a great goal to strive for, and something that has also shaped how I approach research
Starting point is 00:51:52 and set the goal, or the target, for my projects. Well, we'll leave things there. Thank you so much, Liana, for coming on. It's been a lovely chat. I've really enjoyed myself, and I'm sure the listener will have as well. Where can we find you on social media? Do you want X, Bluesky, Twitter? Yeah, well, Twitter is X, so it's the same thing twice there! Yeah, yeah. I'm on Twitter and LinkedIn, yeah, both of those. Cool, we'll
Starting point is 00:52:17 drop links to all that in the show notes, and all the relevant materials and everything we've mentioned today, we'll put those in the show notes as well. And yeah, we'll see you all next time for some more awesome computer science research.
