Disseminate: The Computer Science Research Podcast - Liana Patel | ACORN: Performant and Predicate-Agnostic Hybrid Search | #60
Episode Date: November 11, 2024. In this episode, we chat with Liana Patel to discuss ACORN, a groundbreaking method for hybrid search in applications using mixed-modality data. As more systems require simultaneous access to embedded images, text, video, and structured data, traditional search methods struggle to maintain efficiency and flexibility. Liana explains how ACORN, leveraging Hierarchical Navigable Small Worlds (HNSW), enables efficient, predicate-agnostic searches by introducing innovative predicate subgraph traversal. This allows ACORN to outperform existing methods significantly, supporting complex query semantics and achieving 2–1,000 times higher throughput on diverse datasets. Tune in to learn more! Links: ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data [SIGMOD'24]; Liana's LinkedIn; Liana's X. Hosted on Acast. See acast.com/privacy for more information.
Transcript
Discussion (0)
Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host, Jack Wardby.
Today we are going to be going on a journey into the world of vector databases and taking us on that journey is going to be Liana Patel,
who is going to be talking about her recent work that she published at SIGMOD this year: ACORN, Performant and Predicate-Agnostic Search over Vector Embeddings and Structured Data. Now, if you don't know what any of those things are, dear listener, by the end of the podcast, I'm sure you will. So yeah, welcome to the show, Liana.
Thank you, Jack. It's great to be with you.
Cool. Before we do get into all the cool technical stuff, let's take a quick detour and find
about the person behind the paper, right? So yeah, tell us about your journey and how you became interested in databases
and machine learning research.
Yeah, so I'm a now fourth year PhD student at Stanford.
Before that, I was an undergrad at UPenn
where I did a lot of systems work
and I had a really good advisor there, Boon Thau Loo,
who kind of got me very interested in this area. And when I came to Stanford, I was kind of broadly interested in
ML systems. So I was kind of reading and exploring a little bit of both, but I didn't exactly know
what I wanted to work on. And with my two really wonderful advisors, Matei Zaharia and Carlos Guestrin, I kind of broadly brainstormed a number of different areas.
And kind of this was an exciting area, you know, vector databases at the time. This is now,
I guess, a few years ago before kind of the term vector databases was really mainstream.
So it was kind of a little bit lesser known, but seemed really interesting. And, you know,
I read some of these ANN indexing papers, like HNSW and a couple of others, and I thought the ideas were really cool and seemed really powerful, so I was kind of interested to explore
more there.
Yeah, it must have been nice to be, not an early adopter, but sort of a pioneer, in the sense that the space wasn't really that well defined, and being able to help define it must have been quite nice.
It was really interesting, yeah, it was an interesting time. I remember we had a chat with a couple of folks in industry, and we would use the term vector databases (of course, there were companies in that area who had been working on this), but we would kind of get funny looks when we used that word, because it wasn't necessarily as popular as it is now.
Yeah. And another thing as well,
obviously with this being a fast-moving space, and the machine learning space in general, really all of the AI space, being so fast-moving: when I was reading your paper, I think it was at the start, in section two, there was a background section, and the sheer volume of references there, I was like, wow. I mean, I thought it was bad in general, in research there's always new material coming out, but that's a lot of research in, I guess, quite a short amount of time as well.
Yeah, it's crazy, because ANN search is really well studied, so there are all these classical papers, but then also, you're right, it's such a fast-moving space, so there's a lot of recent work too. So there's a really massive amount of literature to look over. So we tried to be thorough and do our homework.
Nice, cool. So yeah, well, we've
mentioned a few keywords there when we were just chatting, but let's set some more context for the listener and give some background to your work. So I guess maybe we can start with a 101 on the vector database space. Tell us what a vector embedding is, what approximate, and I'll use the acronym then you can explain it, ANN search is, and vector search, and yeah, tell us all about it.
Yeah, great. So ANN
search is approximate nearest neighbor search. And the basic paradigm here is that we want to
represent everything as a vector in a metric space. And that will be very convenient for us to work
over because then we can essentially just compare two points, compute a distance between them,
and that will give us some notion of semantic similarity. And that's great when we want to work with unstructured data and do these types of
similarity search. So the user has a query. I want to find all the items from my database that are
most similar to it. All I need to do now is embed my query, convert it to a vector. And offline,
I've already converted everything in my database to a vector. So this problem essentially reduces
to just finding the nearest neighbors to my query. But okay, so actually finding the exact nearest
neighbors, we can't do that efficiently. So we'll actually find the approximate nearest neighbors.
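To make concrete what ANN search is approximating, here is a minimal exact k-nearest-neighbor scan. This is a toy sketch, not anything from the paper: two-dimensional vectors stand in for real high-dimensional embeddings, and the brute-force loop is exactly what ANN indices are built to avoid at scale.

```python
import math

def euclidean(a, b):
    # Plain Euclidean distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(query_vec, corpus_vecs, k):
    # Exact nearest-neighbor search: one distance computation per corpus
    # vector. ANN indices exist because this brute-force scan is too slow
    # at scale, trading a little recall for large speedups.
    order = sorted(range(len(corpus_vecs)),
                   key=lambda i: euclidean(query_vec, corpus_vecs[i]))
    return order[:k]

corpus = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.1, 0.9]]
print(exact_knn([1.0, 1.0], corpus, 2))  # [1, 3]: exact match first, then the near-duplicate
```

The returned indices are sorted by distance, so "embed the query, then find its nearest vectors" reduces the whole similarity-search problem to this one primitive.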
Okay, so how does this relate to vector databases? And, you know, there's a lot of work on RAG. Well, ANN search is really the core component of a wide range of systems for
search, for recommendation, for text retrieval, and most recently, a lot of work on retrieval
augmented generation, where the crux of doing your similarity search is going to be some type
of ANN index, some type of data structure that can efficiently perform your search,
and then give you back your results for that query. So kind of a really popular use
case has been RAG, where, for an AI system, if you want to ground your LLM on the context or
the knowledge in your corpus that is most relevant to the question, you would first do this search.
And there are now a lot of offerings, both from startups and in industry, that serve these types of queries with specialized indices or specialized data systems, or just data types in your database. And then you can perform that search, get back the most relevant data, pass it to your LLM, and get a grounded response, and also ideally reduce the risk of hallucination.
So, given that, let's shift to the research focus of your paper now, this thing called hybrid search,
because whilst ANN is good,
it's not great for every use case, right?
And there's maybe some certain types of queries
and things that we may want to do that that doesn't cover.
So yeah, tell us about hybrid search, and the elevator pitch, I guess, for ACORN is where I'm going.
Yeah, so ANN search is really great when you have fully unstructured data. So the queries that I
want to serve here are similarity search. So given my query, find me the most similar documents.
But, you know, this very realistic setting that we have in a lot of use cases is where I have
both unstructured and structured data. So let's give you an example of this. Like, let's say I want to find articles
that are relevant to, let's say, the presidential elections. That's going to be my query. But I want
to actually find papers that are from the last day, let's say, or articles that are from the last day.
So date in my database or my knowledge corpus would be a structured field. And so this actually
represents a hybrid query where what I'm trying to do, I'm trying to do a similarity search on the text part of my query,
you know, election, or maybe something about presidential election. And I also want to serve
a predicate here over my data, which is filtering on date. So that's the hybrid query. It's
essentially combining both search over unstructured data and structured data.
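The hybrid query just described can be sketched in a few lines. This is a toy illustration, not the paper's system: the articles, fields, and two-dimensional vectors are all made up, and the naive filter-then-rank shown here is just the simplest way to express "predicate plus similarity" in code.

```python
import math
from datetime import date

# Hypothetical corpus: each item has an embedding plus structured fields.
articles = [
    {"vec": [0.9, 0.1], "published": date(2024, 11, 10), "title": "Election results"},
    {"vec": [0.8, 0.2], "published": date(2024, 1, 5),  "title": "Election history"},
    {"vec": [0.1, 0.9], "published": date(2024, 11, 10), "title": "Cooking tips"},
]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hybrid_search(query_vec, predicate, k):
    # A hybrid query combines a structured predicate (here: a date range)
    # with vector similarity over the unstructured part of the data.
    candidates = [a for a in articles if predicate(a)]        # structured filter
    candidates.sort(key=lambda a: dist(query_vec, a["vec"]))  # similarity rank
    return [a["title"] for a in candidates[:k]]

recent = lambda a: a["published"] >= date(2024, 11, 9)
print(hybrid_search([1.0, 0.0], recent, 1))  # ['Election results']
```

The interesting part, as the discussion below gets into, is that doing this efficiently at scale is much harder than this sketch suggests.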
Nice. And I'm guessing the existing solutions before ACORN weren't great at these, right? So give us a rundown of those: over the years there's been pre- and post-filtering, and then some specialized data structures, but they haven't really solved the problem. So yeah, tell us about those.
Yeah, yeah, I think this is a really interesting problem, right? Because it feels like the
semantics of my query have barely changed, but actually the problem changes a lot.
So kind of like the most naive way you can do this is by taking your standard ANN index.
And so your ANN index is going to give you efficient search for the similarity piece over your unstructured data.
And you can essentially try to do your filtering either before or after doing that search. So that will give you pre-filtering, which is one method. So there, first you filter your data on your predicate, and then you do this brute force type of search
over your data doing distance computations to perform that ANN piece. That's going to have a
lot of scaling issues. So anytime you have a predicate that is maybe a medium to high selectivity,
essentially a lot of things are going to pass your filter and it's going to become
very inefficient to actually do that brute force distance computation piece.
On the other hand, you can also do post-filtering. So there what you would do is you search your ANN
index and essentially you're going to expand the scope of your search so that afterwards you'll
perform your filtering and you hope to retrieve enough
things from your search that pass the filter. But they're actually doing this type of search
scope expansion becomes very expensive where essentially you're paying distance computations
to do that type of expansion. And again, this is not going to give you very good accuracy
guarantees. It'll kind of have all different types of failure cases. And also it's expensive again.
So in both cases, and we benchmarked this, performance can degrade significantly,
much worse than what you would expect typically from your ANN search.
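The post-filtering failure mode just described can be sketched concretely. This is a toy sketch under stated assumptions: the "ANN index" below is actually an exact scan (for clarity), and the corpus and predicate are made up so that only one in five items passes the filter, forcing the search scope to expand repeatedly.

```python
import math

# Toy corpus: (vector, passes_predicate) pairs; only every 5th item passes.
corpus = [([float(i), 0.0], i % 5 == 0) for i in range(100)]
TOTAL = len(corpus)

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ann_search(query_vec, k):
    # Stand-in for an ANN index: an exact scan here, for clarity.
    order = sorted(range(TOTAL), key=lambda i: dist(query_vec, corpus[i][0]))
    return order[:k]

def post_filter_search(query_vec, predicate, k):
    # Post-filtering: search with an expanded scope, then filter; if too
    # few results pass, double the scope and retry. Each expansion pays
    # additional distance computations, which is where it gets expensive,
    # and with a hard scope cap it can still return fewer than k results.
    expand = k
    while True:
        passing = [i for i in ann_search(query_vec, expand) if predicate(i)]
        if len(passing) >= k or expand >= TOTAL:
            return passing[:k]
        expand *= 2

print(post_filter_search([0.0, 0.0], lambda i: corpus[i][1], 3))  # [0, 5, 10]
```

Pre-filtering is the mirror image: apply the predicate first, then brute-force distances over everything that passed, which blows up whenever the predicate is not very selective.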
So kind of researchers were realizing this, and we kind of know now that pre- and post-filtering,
you know, they have these very severe performance limitations, even though they're kind of like
widely used in practice. And so the prior works that we had been looking at, they were all these types of
specialized indices. And there are a number of those works, things like Filtered-DiskANN and HQANN; Qdrant also offers one version of this. And essentially what they do is they will put some
type of restriction on the predicate set or your query workload. And they will then use
that to try to build this type of specialized index that can serve the workload well. And so
in particular, like some of the assumptions that they'll make are things like, you know,
your predicate set has to have low cardinality. So for instance, you have to have less than
maybe 10,000 distinct predicates that you can serve. Or maybe the types of predicate operators
that you want to serve are also limited. So maybe they can only be equality operators. So
think about, I can no longer do date ranges. I have to do my date equals X or my date equals Y.
So again, these types of things become very restrictive. And for the general case,
like the workloads that we're seeing, you generally don't want to make these assumptions.
Ideally, you have great performance and you can do that for any query workload that you're given or any predicate
essentially.
Yeah, nice. When I was reading, I kind of wrote down: I don't know if you have any sort of, what's the word, gauge on how far this very limited set of predicates can actually get you in practice? Like, is it the case that it's good enough for 90% of the cases, or is it catching just 10% of what people want to do with this sort of thing?
That's a good question. So there are use cases where it can be okay to make some of these assumptions; it really depends. In general, our philosophy with ACORN is that you should be able to serve all query workloads very robustly and not make any assumptions. If you can make some of these restrictive assumptions, you can actually do something like this ideal, or theoretical ideal, which we propose in the paper, which we call the oracle partition index. And so you actually have a decent way to handle this in a more brute-force way, which will be okay. Yeah, so I guess it really depends on the data, and I don't have kind of a bird's-eye view on it.
While you were talking there, it's interesting: if you have this nice general way of doing things, then you could almost, and I'm probably solutionizing here and taking this off course, so I apologize, but being able to declare hints or things like that, you could incorporate that sort of information at runtime into your query, and then either decide to go one way or the other, be able to factor it in and make a smart decision. I'm thinking you're almost choosing a different type of index based on what information's coming in, but maybe I'm getting off course.
So I think you're making an implicit assumption here, which is that if you can make these restrictive assumptions, you can actually get better performance. And maybe I'm jumping ahead a little bit, which maybe I shouldn't do, but a small spoiler is: you can actually get state-of-the-art performance without making those types of assumptions, which is really cool.
Okay, nice. Yeah, that is nice. Cool. So let's talk
about ACORN in some more detail and let's get into it. So yeah, high level, what is the sales pitch for
Acorn and what's the magic idea behind it? Yeah. So what we essentially want to do,
kind of what I was describing is we want to provide both a very performant index for hybrid
search and also it should ideally be predicate agnostic. And what I mean by that is we're not
making any assumptions about the query workload, the types of predicates you can serve, the cardinality of predicates, things like that.
So we're going to build this type of general index, and then we'll serve search for whatever query we get at query time.
And the intuition here is that there is this kind of theoretical ideal, which is not actually tractable to build in all cases, or in the practical case, but it'll give us a nice way of thinking about how we want to do our search and how to build the index. And for this theoretical ideal, essentially, you know, we can imagine if we have some predicate set where maybe I have X number of distinct predicates, what I would do is this very simple type of partitioning approach: I'll just take all my dataset records, all my items that pass each one of my given predicates, and I'll build an ANN index; the one we use in the paper is HNSW, which stands for Hierarchical Navigable Small Worlds.
Cool name, by the way.
It's a cool name for something like that. Yeah, I love the HNSW paper.
It's really great. Really awesome name as well. But yeah, HNSW is the one we use in the paper.
But the idea is that essentially, if you have this type of system where you can assume that
you have some low cardinality predicates set or something that you can enumerate over,
essentially, you would just build an ANN index for every one of your known predicates. And then what you've essentially done is you've reduced
this hybrid search problem to an ANN search problem. And then all the nice kind of performance
and methods that we can use in ANN search, we can apply here. And so HNSLV is the one that we use
in paper. But again, this is the general idea. Okay, so you can actually do that because in general,
like I said, we don't want to make any type of assumption on the carnality of the predicate set.
Also, sometimes you just don't know your predicates during construction. So you will
typically build your index offline, and you just want to build an index that will be able to serve
whatever you have at query time. So we can't do that. But what we'll do instead is ACORN will build this kind of general index that looks kind of like HNSW, but not quite. And the main idea is that we're going to essentially
emulate search over the oracle partition index without actually ever building that. And so the
way we'll do this is essentially during search, we'll traverse this Acorn index and specifically
we'll traverse what we call the predicate subgraph.
And so this predicate subgraph is going to be the subgraph that's induced by the nodes that pass my predicate, my search predicate of my query.
So that's the thing that I want to search over.
It's the subgraph within my ACORN index.
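The predicate subgraph is a standard graph-theory notion, and it can be sketched in a few lines. This is a conceptual toy, not ACORN's implementation (which never materializes the subgraph; it only traverses it on the fly): given the index's adjacency lists and the set of nodes passing the query's predicate, the predicate subgraph keeps exactly those nodes and the edges between them.

```python
def induced_subgraph(adj, passes):
    # Keep only nodes passing the predicate, and only edges whose both
    # endpoints pass. This is the subgraph ACORN's search navigates.
    return {u: [v for v in nbrs if passes(v)]
            for u, nbrs in adj.items() if passes(u)}

# Tiny 4-node graph; the predicate keeps even-numbered nodes only.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
even = lambda n: n % 2 == 0
print(induced_subgraph(adj, even))  # {0: [2], 2: [0]}
```

The hard part, as discussed next, is constructing the index so that this induced subgraph stays navigable and well connected for any predicate, which a plain HNSW graph does not guarantee.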
So the search actually is not going to be too complicated. But what will be difficult is we actually need to construct this index such that any arbitrary predicate subgraph
is going to look something like an HNSW graph.
And so basically that subgraph needs to have all these types of navigability
and connectivity properties just like HNSW.
And if we can do that, what we're doing is essentially, you know, emulating search over that oracle partition index.
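The oracle partition index that ACORN emulates can be sketched as follows. This is a toy illustration of the idea only: `build_index` is a hypothetical stand-in for real HNSW construction, and the whole approach assumes a small, fully known predicate set, which is exactly the assumption ACORN avoids.

```python
def build_index(vectors):
    # Hypothetical stand-in for building a real ANN index (e.g. HNSW);
    # here it just materializes the vectors.
    return list(vectors)

def build_oracle_partitions(items, predicates):
    # One ANN index per distinct predicate, over exactly the items that
    # pass it. A hybrid query for predicate p then reduces to plain ANN
    # search over partitions[p]. Feasible only when the predicate set is
    # enumerable and low-cardinality, and impossible when predicates are
    # unknown at construction time.
    return {name: build_index(v for v, fields in items if pred(fields))
            for name, pred in predicates.items()}

items = [([0.1], {"year": 2023}), ([0.2], {"year": 2024}), ([0.3], {"year": 2024})]
predicates = {"y2023": lambda f: f["year"] == 2023,
              "y2024": lambda f: f["year"] == 2024}
parts = build_oracle_partitions(items, predicates)
print(len(parts["y2024"]))  # 2 vectors indexed under the 2024 predicate
```

The cost is one full index per predicate, which is why this is a theoretical ideal rather than a practical design once predicates number in the thousands or are unknown in advance.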
Nice, cool. So it might be useful now to maybe go into how we actually go about constructing the HNSW and searching it, because I guess that's the baseline that we're going to build ACORN on top of, or augment slightly. So yeah, I guess,
how do we construct it and how do we search it?
Great, yeah. So HNSW is this really awesome graph-based index for ANN search. So there are a number of ANN indices; HNSW is a graph-based one. There are other types of indices too, like tree-based ones, hashing (LSH-based) ones, cluster-based ones, etc. But graph-based ones are particularly nice because they give us very good performance, and HNSW is one of these very widely used indices.
So there's a couple of components.
So, you know, hierarchical navigable small worlds,
that's what HNSW stands for.
And kind of the name is very descriptive.
So I can maybe briefly talk about some of the properties
that we have without
going into too much depth, but I really like the paper. So I kind of encourage anyone who's interested to check out the original paper by Yury Malkov. So first of all, it's a hierarchical
graph structure in the sense that HNSW will have multiple layers. And in each of those layers,
we're going to have a subset of nodes in the data set represented.
So at the bottom level, every node will be, every item in your data set is going to be
a node in that level.
And then as you go up your graph, those layers are going to become sparser.
And so essentially, you're going to sample some subset of your nodes at every layer above.
So to get to the (i+1)-th level, I'll take some subset of my nodes from the i-th level. So that's the hierarchical piece. And
there's kind of some nice properties that arise from that related to efficiency of traversing
that graph. I won't go into too much detail there. But another key property is that you also have a bounded degree of every node in your graph. And so again, this is a graph index where every vector, every item
in your data set is a node in your graph, and then you're going to have edges connecting those nodes.
And the edges connect nodes that are close together, points that are semantically similar. And every node will have at most M neighbors. And so that gives you a
bounded degree, which again will give you some nice properties of your search. Okay, so essentially
you have this type of graph index. And if it's not completely crystal clear, that's okay. But let's
just kind of black box it as you have a graph structure where essentially all your nodes represent your
items and you have some edges between them. Okay, so then your search traversal, the way we'll do
this to actually find our approximate nearest neighbors, we're going to start from some
predefined entry point. And then we're essentially just going to do a greedy search. And the way this
looks on HNSW is kind of like a slight variation of greedy search, where what I'll
actually do is I'll do a greedy search layer by layer. So I'll start from my top layer. I have
an entry point into my top level and then I'll do a greedy search just on that level. And once my
greedy search converges, I'll use that point to drop down into my next level. And remember I said that every level above is a subset of the level below. So if I have level i, level i+1 is going to be a subset of it. And so if I find something on level i+1, it will also be in level i. And so I
can just essentially drop down to that point and then repeat my procedure. I do again a greedy
search, find my convergence point, drop down, and I continue until I get to my bottom level,
which then has all my nodes in my data set and will give me kind of this
finest granularity search. So the idea is kind of, you can think of it as you're starting from
these, you're starting from this kind of like zoomed out search and you're gradually refining
your search as you go down because you have more and more nodes as you drop down every level.
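The layered greedy descent just described can be sketched in miniature. This is a toy sketch, not the real HNSW algorithm (which also keeps a candidate beam, controlled by an `ef` parameter, rather than a single current node): a tiny two-layer graph over one-dimensional vectors, where the top layer is a sparse subset of the bottom one.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_layer_search(adj, vecs, entry, query):
    # Greedy search on one layer: hop to the neighbor closest to the
    # query until no neighbor improves on the current node.
    cur = entry
    while True:
        best = min(adj[cur], key=lambda n: dist(vecs[n], query), default=cur)
        if dist(vecs[best], query) >= dist(vecs[cur], query):
            return cur
        cur = best

def hnsw_search(layers, vecs, entry, query):
    # Descend layer by layer (top layer is sparsest). The convergence
    # point on each layer is the entry point into the layer below, which
    # works because upper layers contain subsets of the nodes below them.
    cur = entry
    for adj in layers:  # ordered top (sparse) -> bottom (all nodes)
        cur = greedy_layer_search(adj, vecs, cur, query)
    return cur

vecs = {0: [0.0], 1: [2.0], 2: [4.0], 3: [6.0]}
layers = [
    {0: [3], 3: [0]},                        # top layer: sparse subset
    {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]},  # bottom layer: all nodes
]
print(hnsw_search(layers, vecs, 0, [3.9]))  # 2, the true nearest node
```

The top layer gets the search quickly into the right region (node 3), and the bottom layer refines it to the true nearest neighbor (node 2), which is the "zoomed out, then gradually refined" picture from the conversation.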
I mean, this thing is a lot easier to describe with images, right? This is where the limitations of the audio-only format come in. We need a whiteboard now, and it would be a lot easier. But yeah.
Yeah, I'm gesturing a lot here.
Yeah, we're waving our hands a lot now, we're doing all that. So just to kind of run through that again real quick: we have
N number of levels, and on the base level, that's where all the nodes are, and each node represents a vector, right? One of the vector embeddings. And they're linked up to some degree M, which is configurable, and that's how many edges we'll have going out of and into any one of these nodes. And then as we go up, we have a smaller subset of them all. And when we want to search, we start at the top. We do a greedy search, which means, I guess, we just go to the next one based off where we are at the moment, so it's kind of like using local-only information: explore the tier you're on, then drop down, because obviously you're in the next level since it's a subset, and you keep doing that till you get to the very bottom, at which point you're left with k number of, or n number of, whatever letter you want to use, nearest vectors that are similar to your query, and job's a good'un.
Yes, exactly.
Yes, that's a great high level sketch.
Perfect, I had a good teacher.
Cool.
Amazing stuff, yeah.
On that note, we've got our HNSW. How do we modify this? How do we get to ACORN?
Okay, so there's really two pieces here,
right? So it's going to be the search algorithm and the construction algorithm that I'll talk
about. So I'll start with the search because it's kind of a smaller modification, very intuitive.
Okay, so the goal here that I mentioned earlier was we want to traverse to and navigate over this
predicate subgraph. Okay, so our search is going to essentially look very similar to HNSW search. We're going to do this greedy search that starts at some fixed entry point at the top level of
our graph. And the main difference is just going to be how we do neighbor lookups at every node
that we traverse. So typically, when we do neighbor lookup, we would look at all the neighbors and we
pick the one that gets us closest to the point that we're looking
for. So essentially that has the minimum distance. But what we'll instead do here is we'll first just
filter out the nodes that don't pass the predicate. So now what I'm essentially doing is I'm going to
be searching over exclusively these nodes that pass the predicate. And so this is
helping me navigate over that predicate subgraph
that I was mentioning. So kind of simple idea there, we're just performing a filtering step
during our greedy search. Okay, so there's going to be kind of a small addition that we add.
And I'll describe that a little bit in more detail. But I'll talk about the construction
algorithm first. So how do we actually construct an index that gets us close to something where
we can
emulate the oracle partition with any type of predicate subgraph that we have? So the main idea
is that we're going to create this denser HNSW graph. When we create that denser graph, that's
actually going to create a little bit of an issue because we're going to blow up our memory overhead
or the space footprint of the index. And this is really important because
actually graphs like HNSW, they're memory resident. And so you actually care a lot about
kind of keeping a modest memory overhead. And so to address that, we'll then perform this type of
compression. And that compression is going to be predicate agnostic. So again, it needs to not use
any information about possible predicates I might see during query time.
Because again, this is just offline.
This is the construction process.
Okay, so the first piece creating the denser graph is going to just expand each one of
my neighbor lists by this factor gamma.
So typically, I told you my nodes have a bounded degree of m.
But what we'll instead do is we'll take m times gamma neighbors per node. So there's a small question you might be asking yourself: how do I set this gamma? What is this gamma?
That gamma is an expansion factor.
And maybe I won't go into all the details here, but essentially we would set this to be inversely proportional to the minimum selectivity predicate that we would want to handle with
our index.
And there are ways to set this.
I won't go into too
much detail, but you can set that either with or without knowledge of your predicate set.
And essentially, there is a point where pre-filtering would just become more efficient
and give you a very high accuracy solution. And so you can essentially find that point
and choose your gamma that way. Okay, so so far, we've just expanded our graph. We got our denser
HNSW graph. Now we've blown up our memory; we have a very high space footprint. So what do we do? We have to do some type of pruning procedure. And so we basically analyzed this in the paper; HNSW actually does have a pruning procedure, but what we want is this: we're going to remove any nodes that are already neighbors of neighbors of nodes previously stored in our neighbor list. So that's the key idea of the pruning. And what we'll actually do during search is a slight modification where we'll expand to neighbors of neighbors when we look over our neighbor list.
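The modified neighbor lookup just described, two-hop expansion followed by a predicate filter, can be sketched as a single greedy step. This is a simplified toy, not the paper's implementation: it shows one hop of greedy search on a tiny one-dimensional graph, and the graph, vectors, and predicate are all made up for illustration.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def acorn_greedy_step(adj, vecs, cur, query, passes):
    # ACORN-style neighbor lookup: expand to neighbors-of-neighbors (to
    # compensate for the predicate-agnostic pruning done at construction),
    # then keep only candidates passing the query predicate. The filter is
    # what confines the traversal to the predicate subgraph.
    candidates = set(adj[cur])
    for n in adj[cur]:
        candidates.update(adj[n])  # two-hop expansion
    candidates = [c for c in candidates if passes(c) and c != cur]
    if not candidates:
        return cur
    best = min(candidates, key=lambda c: dist(vecs[c], query))
    return best if dist(vecs[best], query) < dist(vecs[cur], query) else cur

vecs = {0: [0.0], 1: [1.0], 2: [2.0], 3: [3.0]}
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
odd = lambda n: n % 2 == 1
# From node 1, the two-hop expansion reaches node 3 even though node 2
# (which fails the predicate) sits between them on the graph.
print(acorn_greedy_step(adj, vecs, 1, [3.0], odd))  # 3
```

The example shows why the expansion matters: with plain one-hop neighbors, the filtered candidate set from node 1 would be empty and the search would stall, exactly the connectivity problem the denser construction and two-hop lookup are designed to avoid.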
Okay, cool. So we've got this denser graph, and then we have this pruning step. When does the pruning step happen? That's happening at construction time, right, when I'm deciding...
Okay, so when I'm searching, I kind of have my friends of friends, my two-hop nodes, right? I say friends of friends because I'm always thinking about my go-to example of a graph: a friends-of-friends social network. Yeah, my friends of friends, I've got my two hops, and that allows me to look further. Okay, so now I'm getting it: the pruning is discarding, when actually creating the graph, I only include people in my list if they're not already a friend of somebody else in it, using the friends example again, I don't have them in my list. Okay, so it gives us a nicer, richer representation, cramming more information in, basically, which allows us to search more efficiently, right, and get closer to this idealistic goal.
Okay, cool. So with that, you mentioned a lot there about being memory resident and being conscious of space. There are two variations of ACORN, right? There's ACORN-gamma and ACORN-1. Um, I said "Akon" then, as in the singer, Akon. So yeah, tell us about the differences between the ACORN variants, I guess.
Yes, yeah, that's a great point. So we can do something maybe even stronger than this compression that ACORN-gamma does. And what we can actually do is keep the search expansion entirely during search, and not actually do the neighbor-list expansion during construction.
And so what that gives us is the ACORN-1 algorithm, which is search-heavy and construction-light,
meaning that we're doing less work during construction and we're doing more work,
slightly more work during search. And so essentially our construction algorithm now,
it will just give us M neighbors per node with no pruning. And then during search, what we'll do is expand our neighbor list to also look at neighbors of neighbors, and once we do that, we'll apply our predicate filter to that neighbor list and get back the subset that will help us traverse our predicate subgraph.
Nice. So, I guess we're trading off here search time,
I guess, for construction time, as well as space in memory. So I guess that is a nice segue into how ACORN performs. Tell me how you went about evaluating it, and what were the datasets you used? Who did you compare it against? How badly does post-filtering suck? Tell me everything.
Yeah, yeah. So we looked at a number of datasets, and there are really two types. So first, I talked a lot about these specialized indices
and prior works that were good for these types of constrained query workloads, specifically
workloads with low cardinality predicate sets. And so we tested on those data sets because we
wanted to be able to compare to all of those prior works and those baselines. So that was the first class. And then we also tested on these kind of more complex data
sets, which have, you know, these arbitrary predicates, various types of predicate operators,
like even regex expressions, for example, and also high cardinality predicate sets.
So you can think about, you know, like many more than thousands or
tens of thousands of predicates in your predicate set. So these are the two types of data sets we
looked at. They were these simpler ones, which prior works had looked at, and also these more
complex ones. And then we also did a number of ablations on various properties of the query
workload. So, things like the selectivity of the predicates that we're looking at. We actually define this notion of correlation. So there are cases where your query vector might be very close to the cluster of points that pass your predicate. That represents this positive query correlation setting. But the reverse might also be true: your query vector might also be far away from the cluster of points that pass your predicate. And so ideally, a hybrid index needs to serve both of these. And so correlation was a really important property to also analyze. So we looked at that.
And then last...

Sorry to interject real quick, Liana. Do you have a kind of example of the query correlation? Because that's a really interesting thing. Just an intuitive example of that, basically.
Yeah, yeah, that's a great question. So I really liked looking into this, because we actually looked at it on this image dataset, and there were a lot of fun examples that we showed in the paper. So, again, slightly limited because I'm going to try to just describe them, but I'll give you one example. So, you know, we have this image of this small child that's in this meadow of, like,
green grass. And so the query image is just that. And my predicate is going to be some type of
keyword. So, you know, one keyword I might have is green, like show me images that look like that
child in the meadow. And I also want a keyword match on the word green.
So intuitively, because the child is in this meadow of green grass,
green, it kind of corresponds with that image, right?
And so what I'll get back are these images that look a lot like my query vector.
All the things that are most similar to my query image, in fact,
actually kind of match this keyword green.
They're also small children
in these grassy fields or meadows. But my filter could be something else. So let's say I have this
image of this cute, adorable child in this grassy field, but my keyword filter is going to be scary,
let's say. Scary is not really capturing my image; it's kind of like the opposite, intuitively, of what my image is showing. So what I'll actually get back are kind of like these images of, you know, children
during Halloween or like something like a sad child, kind of like a child making a scary face.
So that's an example where I don't actually expect the things that match my keyword, my predicate,
to be very semantically similar to my image which is kind of this cute child
not scary at all, in this field of grass.

That's a really cool example. I found it fascinating when I was reading through; I was like, oh yeah, because of this and that, and I was like, wow, okay. And, probably going off topic again, but how did you first come across this idea of query correlation?

Yeah, that's a great question.
So I'm trying to think back to how we first started thinking about it. So, a reason why you should care a lot about query correlation is that one of the baselines that is pretty widely used in a lot of systems, post-filtering, will fail significantly when you vary your query correlation. So it was very relevant for studying that method. And we actually showed this in the paper: if you have these negative correlation queries, post-filtering will give you very bad accuracy.
And the reason is that the implicit assumption you're making when you post-filter is that the place your vector search drops you in, within your larger vector space, is going to be something that is close to the set of things that pass your predicate. So, in other words, the post-filtering assumption is that your hybrid search targets are close to your ANN search targets. That's the assumption you're making. But that doesn't have to be true in all cases, and there are many cases, like LRE detection, you know, those are real use cases where that might not be true. And so, yeah, it becomes very important to be able to actually serve that.

Yeah, that's really cool. Anyway, so let's get back to the evaluation. Sorry. Yeah,
we've been through the data sets, I think. And did we talk about who we compared against?
Okay, yes, great. So there were a number of baselines here.
And so we looked at essentially benchmarking all of these methods.
So that includes methods like FilterDiskANN, which has kind of these two variants.
Milvus also has this hybrid search.
And we looked at various indices.
And with their hybrid search, there was HNSW pre-filtering and post-filtering.
There's also this method NHQ, which we looked at, which is a specialized index.
So essentially the baselines were these sets of specialized indices, as well as pre- and
post-filtering, which we looked at.
And then one additional thing that was also kind of interesting, just to give us some intuition: on these datasets with low cardinality predicate sets, which are relatively simpler, it was also possible to simulate this oracle partition index, which represents this theoretical ideal. And so the goal there was to check: are we actually getting performance close to that oracle partition with ACORN? And so these are
all the methods that we were comparing against.

Cool. Now it's numbers time. So yeah, how did it do?

Yeah, okay. So I'll give you the high level first. Across all these datasets, ACORN-γ and ACORN-1 achieve 2 to 1,000x higher QPS at fixed recall. So really significant performance gains, which is really cool. And this is across all these datasets.
If I'm to break down the numbers for you, we'll first start on these low cardinality predicate sets, which are what the specialized indices are really designed for. So we basically expect those specialized index methods, like FilterDiskANN and NHQ, etc., to perform very well on those. In these settings, ACORN-γ outperformed those indices with 2 to 3x higher QPS at a fixed recall; that was 0.9 recall that we looked at. And we also very closely emulate the oracle partition method. So that's pretty cool, because we're essentially getting very close to the theoretical ideal there without any such assumptions. Okay, so that's the first
setting. The second setting was these kind of more complex datasets. And so we looked at things like keyword search, regex matching, some of these others. And there, ACORN-γ outperformed the baselines,
which were pre- and post-filtering by 20 to 1000x. So really significant performance gains,
very much highlighting the limitations of pre- and post-filtering, where post-filtering
will typically not give you high quality results, and pre-filtering just won't scale very well.
So those are kind of the main benchmarks. And then we also did a number of ablations, looking at variations in predicate selectivity, scaling of the dataset, and then also this correlation setting. And ACORN-γ and ACORN-1 actually perform very robustly across all three dimensions of query workloads, and so that was very cool to see.

Nice, that's it: hybrid search, predicate-agnostic hybrid search, done, solved problem. That's it. Mic drop. Yes, so I always have to ask about limitations of work, even though we've obviously just concluded it's a solved problem. But yeah, are there any scenarios in which the performance of ACORN is still sort of suboptimal?
Yeah, so I would say that the main downside, or kind of the main overhead, of these methods is the time to index. And, depending on which method you're using, it can also be space footprint. So as I mentioned, with ACORN-γ you're essentially doing this neighbor-list expansion during your construction, and that will give you a higher time to index.
It essentially increases your TTI, the time to index, by this gamma log gamma factor, and we did some complexity analysis looking at this. That's one thing you should definitely consider if you want to use this method for hybrid search. ACORN-1 actually addresses that limitation: you're no longer doing this gamma log gamma; you're just going to increase your space footprint, and very marginally compared to HNSW. So one thing to consider is what overhead you're willing to pay for this type of generality in your hybrid search method.
So yeah, kind of that's a big thing.
I would say one general thing is that, you know, HNSW does not give you guarantees on performance, and so, again, with ACORN we don't do that either. What you do need to do is benchmark your performance on the given dataset you have, and ideally a query workload that you have, and find what works for your workload.

On the time-to-index thing: is the pattern here that I create my index and then, how long does that index get used for? Because obviously data is changing, right? So I'm thinking here, how about we modify our index in place rather than having to build a new one from scratch every time? Is this sort of an updatable version of ACORN something that you would maybe consider? Or maybe that's not the way people do things.
Yeah, this is a great thought, because it's actually great future work. And so, in general, incremental updates to graph-based indices like HNSW are kind of non-trivial.
And there has been work studying this: as you keep updating your index, you will get this degradation in performance, sometimes small, sometimes large. So essentially, as you keep updating, your index is not doing as great. This is a research problem that has gotten some work in the area.
And I think there's more to do, especially with ACORN in particular. So standard graph-based indices have some methods that can be applied. Essentially, you look at two-hop neighbors of the node that you're updating. This is particularly an issue if you're doing a delete: you'll essentially look at the neighborhood of the item that you're trying to delete, and you'll rearrange your edges so that you can account for that. We haven't yet applied that to ACORN, so this is a great direction for future work; it'd be really interesting to look at. And you're exactly right, that would help you amortize this fixed cost you pay to do your indexing, and essentially you can keep your index longer if you were able to solve for that. So yeah, that's a great thought.

Yeah. On future work, then, what else have you got in the pipeline for ACORN? Where else do we go with it?

Yeah, I think there's,
you know, so much to do in this area. I'm really excited about all the excitement and activity in industry and also in research. Some things that could be really cool: obviously, one is addressing some of the TTI and overhead limitations; that could be very powerful. There's also a lot
of interest, I think, in kind of like serverless architectures for vector databases or for serving
these types of ANN queries. So looking at what that might look like for a graph-based index like
HNSW or ACORN could be very powerful. So in particular, if you have this
distributed setup and you want to maybe be able to route your query across different nodes,
you know, how do you actually build that efficiently? That could be really interesting.
I think there's this interesting work to do on the indexing side, something that I've also been really interested in, and it has been some of the recent work that I've been looking at.
I think there is also so much to do and kind of thinking through,
you know, what other queries do users want to ask over their data?
And kind of how do we serve those queries or those questions
with these efficient systems that are general purpose?
So, you know, RAG is kind of something that has been pretty well studied.
Hybrid search is kind of this interesting variant where it's not quite RAG,
but I think there's actually so much more to do
just in this general space.
RAG is really just one point
in this larger space of queries that we need to ask
or need to serve over the data.
Nice. Is that more a case of defining the top-level layer where you ask these queries, or more in the sense of: I'm asking these types of queries, and these are the indexes I need to use for this subset of queries? Or is it something else, have I missed the point?

Yeah, maybe, I guess I was very vague about what I meant by this. I think the broader theme is: when you think about where hybrid search or ANN search is mostly used, it's probably in this LLM paradigm, where you do retrieval, then you pass it to your LLM.
And that's good.
Essentially, that represents a point lookup on your data,
where you have some answer.
Basically, you have a question,
and that answer can be localized to a little bit of data in your database.
But I think there's this whole space of what we would call
bulk semantic processing,
where you might want to run your LLM over a lot of data and serve these queries or questions over your data, where the query pattern that you need looks very different than what RAG gives you.

Very nice. Talking about sort of real-world impact, building off slightly what we were just talking about: what impact do you think ACORN can have, or is already having, in production systems? Can it be easily incorporated into an existing vector database? Is it out there in the wild already? What's the story with it, has it had much impact?

Yeah, so I'm really excited to see the impact of ACORN. You know, some
of the things we thought through as we were designing the method,
we wanted it to be very easy to implement
in existing HNSW libraries
or existing ANN indexing libraries.
And we've actually seen some adoptions. So there's this really cool release, for example, from the Weaviate team, and I'm not being paid to say this, but it was a really cool release where they actually incorporated this ACORN-based hybrid search. And we see these very substantial performance improvements. So that's really cool. I think
hybrid search in general can be very powerful for letting users leverage both structured and
unstructured data. And so I'm very excited to see more systems adopt methods like this
so that you can really do that very effectively.

You need to send them the invoice, Liana, for that sponsorship, the paid promotion. Cool, yeah. So how long have you been working on the ACORN project for, from start to finish? How long has the journey been with it?

Yeah, I guess it was about a year or so. This is one of the first major projects of my PhD; I had a lot of fun with it, and then we presented it at SIGMOD, which you saw. So yeah, it was about a year.

Yeah, the question I'm leading on to is sort of,
across that duration, what was the thing that jumped out the most when you were studying it? Like, oh, that's surprising, did not expect that. What caught you off guard?

Yeah, I think there were really two things, which maybe I somewhat hinted at earlier. I think the first thing is, when I was first looking at this space, it was totally not clear at all, you know, why this problem is even difficult to solve. Because all we're really doing is changing the semantics of a typical retrieval query very slightly, where you're adding this predicate.
But it feels like it should be, you know, pre or post filtering should work quite well.
But when we actually benchmarked those methods, what we saw is that just changing the query very slightly completely changes the performance. What you typically expect out of an ANN index, you don't get from your hybrid search query with the baselines people were using; performance was very severely degraded. So that was just surprising in itself, kind of just understanding the problem, right, and the nuances here as to why hybrid search is so difficult. I think the second thing that was kind of cool, and that I think was a very fun outcome of this project, was that you essentially don't need to forego generality for very good performance on this problem.
And so specifically when we looked at these specialized indices, you know, the question was like, can you really get something that does quite well without making those assumptions about your query workload?
And actually, yeah, you can do that. And essentially the way we formulated it in ACORN was to, you know,
think about this Oracle partition index or that theoretical ideal, and then try to emulate that.
So that was also a really cool outcome.

Yeah. I mean, you kind of do have that implicit assumption that when you generalize something it ends up not being as performant; a specialization normally means you target one specific use case, therefore you're probably going to go faster, or whatever. Yeah, that is a really surprising outcome, right, that the generality makes it pretty good. And actually it's faster than everything else, which is a nice outcome. How did this idea come about in the first place? Who proposed this sort of space initially?

Yeah, that's a good question. So I think, broadly,
I was interested in this area and I was talking with my advisors about vector databases and, you know, what could be interesting problems within this area. We had read some of the existing works, and I felt like hybrid search could be one interesting problem, where there are these real query workloads where people really want to ask questions over both unstructured and structured data. So that was initially compelling. And I don't think it was until we really ran those benchmarks that I was convinced, like, this is a good problem to focus on. So yeah, it was very incremental and very iterative, going from "this could be interesting" to "okay, I want to work on this."

Yeah, it's funny, I think often there's not just a light bulb moment of, okay, I'm going to do this, this is the thing I want to do for the next 12 months, right? It's always sort of incremental. And that actually is a very nice segue into my next question,
which is about idea generation and selecting things to work on. So yeah, how do you approach that specific problem? Is it incremental? How do you go about it?

Yeah, this is a really hard question to answer.

It's also my favorite question; long-time listeners will know I love this question. It's my favorite one.

Yeah, it's a great question. I'd actually be curious to hear your thoughts as well, since you also completed a PhD. But yeah, I find that it is somewhat ad hoc. I think one of the very nice things about academia is that you have an incredible amount of freedom. So generally, what will steer the broad direction is just my interest: what do I want to learn about, or what would be nice to spend the next year learning about? I think that's first, because there's a lot of work, you know, a lot of experiments that need to be done, a lot of papers to read. So just what feels interesting, that's the first thing. And then I also think that there's a lot
of thought that goes into impact. And this is something where I really appreciate my advisors, because I think they have great taste, and they've helped me really refine my taste as well in terms of what could be an impactful project. So, yeah, what ends up happening is I'll modulate between these things: what is interesting, what could be impactful. I try to find some sweet spot between those two. And I would say that, again, it's a very iterative process, where I'll brainstorm broadly, like, these could be directions that we could go in. Sometimes I'll be a little blocked. Sometimes I'll have these late nights of thinking about problems and how to formulate them. And eventually things will come together, where at a certain point we've actually formalized: this is the problem definition, and this is what we think we could ideally achieve.

Yeah, that's amazing. I mean, I imagine Matei knows a lot about impact, right, in terms of what works; he's been quite successful on that front. So yeah, you've got good guidance there. But no, you know, I think
you hit on a few key points that I completely agree with. I mean, you need to surround yourself with people you can sound ideas off; you can go and say, what do you think of this, and they can rely on their experience to inform whether you should drop this or pursue that further. But I found that my approach changed across my PhD as I got more and more familiar with an area. I think what you need at the start is a certain amount of breadth, basically, so you know what's out there and you know what you like, right? You need that for one. And then once you've sort of narrowed down, you've got to read a lot, right? You've got to know your area, because only once you've got to that point can I then ask, oh, maybe tweak it this way, or maybe tweak it that way.
But I always like to, when I'm reading things, go through and highlight the assumptions, right? And then challenge those assumptions, because that's where you can create something new a lot of the time. Oh, what happens if we relax this? But you've got to start from somewhere, right?

Yeah.

I think getting a lot of breadth in there, surrounding yourself with really smart people, always helps. And then, when you get something that really grabs you, go in, basically: be relentless with it, pursue it, and keep iterating. And if you keep iterating, eventually something good's going to fall out of it. I hope so, anyway. But yeah, hopefully.
Cool. So we're making our way towards the last word now. Before we do that, I wanted to get your take on the vector database space, only because there was this keynote or a panel at VLDB this year where they were talking about: do we need standalone vector databases, or should that be consumed by SQL and the relational world? I want to get your opinion on that, whether it should be its own independent thing, or whether we can put these indexes and things in an existing relational database and job done, and then you obviously have all the benefits of transactions and all the other nice things that we've got from decades of work on these relational systems. So what's your take on that question?

Yeah, this is a really good
question. So I think I have some slight biases, but, you know, a large premise of this work was that I think there's so much power in unifying both structured and unstructured data and working over both of those. So I do think that the systems that will be very successful can bring both of those worlds together and let users seamlessly operate between structured and unstructured data.
And ACORN is one example of something that moves us in that direction. As you noted at the beginning, I did also spend an internship at one of these specialized vector database companies, so I think there can also be power in the performance of building those specialized solutions. Maybe for some specific domains, for example recommendation, you really need very high efficiency vector search. But yeah, my general thought is that, whether vectors are a data type in your database or you have a full-fledged vector database, I think there needs to be broader support for all the things that you would ideally want to do. Transactions could be one of them; analytics over your structured data; hybrid search over your unstructured data. I think that's really crucial.
Yeah, I completely agree. It's that classic debate of specialized systems versus one-size-fits-all systems, right? And it was funny to see that discussion at VLDB, because obviously I worked for a graph database company, and it was subject to the same sort of debate when they were emerging. But anyway, cool, so let's get on to the last word then. What's the one takeaway you want the listener to get from this podcast today?

Yeah.
So I think the lesson of ACORN, in my mind, was really this key piece: you can get high-efficiency performance without foregoing generality. And I think this is a powerful concept for systems and for optimization. You can actually achieve both. So I think that's a great goal to strive for, and something that has also shaped how I approach research and set the goal, or the target, for my projects.
Well, we'll leave things there. Thank you so much, Liana, for coming on. It's been a lovely chat; I've really enjoyed myself, and I'm sure the listener will have as well. Where can we find you on social media? X, Bluesky, Twitter? Well, Twitter is X, so that's the same thing twice there.

Yeah, I'm on Twitter and LinkedIn, both of those.

Cool. We'll drop links to all that in the show notes, and all the relevant materials and everything we've mentioned today, we'll put those in the show notes as well. And we'll see you all next time for some more awesome computer science research.