Disseminate: The Computer Science Research Podcast - Navid Eslami | Diva: Dynamic Range Filter for Var-Length Keys and Queries | #67
Episode Date: November 13, 2025

In this episode of Disseminate: The Computer Science Research Podcast, Jack sits down with Navid Eslami, PhD researcher at the University of Toronto, to discuss his award-winning paper "DIVA: Dynamic Range Filter for Variable Length Keys and Queries", which earned Best Research Paper at VLDB. Navid breaks down how range filters extend the power of traditional filters for modern databases and storage systems, enabling faster queries, better scalability, and theoretical guarantees.

We dive into:
- How DIVA overcomes the limitations of existing range filters
- What makes it the "holy grail" of filtering for dynamic data
- Real-world integration in WiredTiger (the MongoDB storage engine)
- Future challenges in data distribution smoothing and hybrid filtering

Whether you're a database engineer, systems researcher, or student exploring data structures, this episode reveals how cutting-edge research can transform how we query, filter, and scale modern data systems.

Links:
- Diva: Dynamic Range Filter for Var-Length Keys and Queries [VLDB'25]
- Diva on GitHub
- Navid's LinkedIn

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Disseminate the computer science research podcast.
Hello and welcome to disseminate the computer science research podcast.
I'm your host, Jack Wardby.
This is the podcast where we interview computer science researchers about their latest work,
highlighting the problems they've tackled, solutions they've developed,
and how their findings can be applied in practice.
It's our mission here to help further narrow the gap between research and practice
and make computer science research more accessible.
So if you are an industry practitioner,
researcher or a student, then this is definitely a podcast for you. You can listen
along on Apple, Spotify, all major podcast platforms, and you can now also watch along on
YouTube. So if you do enjoy the show, please like, follow, subscribe, do all those things,
and yeah, tell a friend to help disseminate disseminate. In today's show, I was really happy
with that joke when I wrote it down. So, yeah. That's great.
Anyway, cool. So in today's show, we're going to be talking with Navid Eslami about his
cutting-edge research on filters, specifically range filters. Navid is doing his PhD in the
Systems and Network Group, the CISNET group at the University of Toronto, and he recently
published a paper on the topic of filters at the very large databases conference. That's the
VLDB conference. And it was titled Diva, Dynamic Range Filter for Variable Length Keys and Queries,
which won the best research paper award. So welcome to the show, Navid, and congratulations on the
best paper award. Thank you so much.
It's great to be here.
Cool.
Well, let's get stuck straight in then.
So let's set some background for the chat.
And for those who are new to the area, can you tell us, first of all, what a filter is?
Okay, sure.
So if that's okay, I'm going to start with like the sort of killer applications of filters.
Go for it.
Yeah.
Yeah, yeah, that's great.
Okay.
So nowadays, a lot of data systems, particularly database systems,
ingest large amounts of data at high rates, right?
So this kind of necessitates the databases to support a high write throughput.
And the way they do this traditionally, or I guess in the past 20 years,
it's been through this data structure called an LSM tree.
So at a high level, an LSM tree basically just buffers your insertions into your tables.
And once the buffer gets to a specific size that's hefty enough,
it's going to write it to disk
using very few I/Os.
But the way this works is
whenever it writes to disk, it's going to create a
new file that's kind of disparate
from everything else that it's written until
now. This is great for writes
because you don't care where you're
writing and you're just buffering everything.
But it's not so great if you
want to do queries, right? Because then you
have to look at every single one of the files
and say, okay,
does the thing that I'm looking for exist in this file or not?
And then search in that.
So to get around this issue,
people have started using filters over these LSM tree data structures.
So what this means is, okay, so what a filter is,
it's basically, for each one of these files,
the filter is just going to kind of approximate that file in very low memory.
So it's a compact, probabilistic data structure in a sense,
that you can give to it a key
and ask, does this key exist in that file or not?
The filter is going to tell you
either it's for sure not going to exist in the file
or it's going to tell you, okay,
it exists, but with high probability,
it might make a mistake and tell you that it exists
even though it doesn't.
The more memory you give the filter,
the lower this probability of a mistake is going to go.
So that probability is also known as the false
positive rate in a sense. So why is this thing nice for something like an LSM tree? It's because if you want to do something like a point query, you want to find like one specific record with a given ID, well, chances are almost all of the files that you're storing on disk aren't going to have anything relevant to that ID, right? So there's only one of them that's going to contain the record that you're looking for. So before accessing each file, if you just ask the filter,
hey, does the record with this ID exist in my file or not,
then you can avoid accessing most of those files.
This is going to speed up your system significantly.
And even beyond just database systems,
filters have been used in similar contexts for like network application.
You can imagine instead of accessing things on disk,
you have some storage medium over the network
and you don't want to do this round trip every single time.
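To make the "no means no, yes means maybe" behavior concrete, here is a minimal Bloom-filter-style sketch of the kind of point filter Navid describes, one per LSM file; the class, parameters, and hash choice are illustrative, not Diva's actual code:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a 'no' is definite, a 'yes' may be wrong with a
    probability that shrinks as you give the filter more bits."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes well-spread bit positions from the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def insert(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def may_contain(self, key):
        # False => key is definitely not in the file; True => probably is,
        # but could be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))
```

In the LSM setting, you would keep one such filter per on-disk file and call `may_contain` before paying for the disk access (or the network round trip).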
Nice.
So, yeah, I always get my false positives and my false negatives mixed around as well, so this
is going to be fun for me today, because I always get them the wrong way around.
So yeah, so I guess in the example you ran there with the database: I've got some
database, I've got a list of all the best podcasts in the world. Now I want to go
and check, is Disseminate there, right? I'm going to ask that query, and
if it definitely is there, it's going to say so. So I'll have a, yeah, it's there,
or I'll get a, maybe it's there,
go and check, sort of thing.
Is that how it's going to work?
But it's never going to tell me that
it's not there when it is there, right?
We get no false, no false...
This is the question.
...negatives. Negatives.
There we go, yeah. Yeah, okay, fantastic.
Cool, yeah, exactly.
Brilliant, because false negatives would result in data loss
if you do that.
We don't want that.
Yeah, exactly. Cool. So that's a filter.
but your work focuses on
an extension of these I guess and that range filters
So, yeah, how do we go from filters to range filters?
What's the difference there?
Yeah, sure.
So range filters are in a sense a really natural extension of filters.
Because filters traditionally only support queries with a single key, right?
So the question is, okay, I have my database and, well, a lot of my queries are actually ranges.
I want to scan all the pieces of data that are within a given range.
I can't really use a standard filter to do this
because it doesn't give me the relevant API.
So the question is, can I have a range,
can I have a filter or a range filter that I can give it a range
and ask, is there anything included within this range or not in my file?
So then if you just have this data structure,
now suddenly you can do range queries over your LSM as well
as the point queries you were doing before.
Nice, cool.
Could you maybe give an example of that?
I don't know of what a fun example might be of such an application that would do such range queries.
No, sure.
So I guess one standard example might be, let's say if you have a database of patients,
and you want to just find the patients that have an age between, let's say, 50 and 60.
So your range would just be 50 to 60.
And you're looking for, your keys are the integers between these two numbers.
So you just want to filter those out.
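As a toy illustration of the API difference, here is an exact, deliberately non-compact version of a range filter for that patient-age query; a real range filter answers the same question approximately, in far less memory (the class and method names here are made up for illustration):

```python
import bisect

class ExactRangeFilter:
    """Exact stand-in that shows the range-filter API. A real range filter
    answers may_intersect approximately, trading exactness for a bounded
    false positive rate and a tiny memory footprint."""
    def __init__(self, keys):
        self.keys = sorted(keys)

    def may_intersect(self, lo, hi):
        # Is there any stored key k with lo <= k <= hi?
        i = bisect.bisect_left(self.keys, lo)
        return i < len(self.keys) and self.keys[i] <= hi
```

For the patient example: `ExactRangeFilter([23, 47, 55, 71]).may_intersect(50, 60)` is `True` (55 falls in the range), while `may_intersect(60, 70)` is `False`.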
Cool.
So that sets us up nicely then for the sales pitch for Diva, I guess, then.
So, yeah, what's the, I guess, the TLDR on it, the overall elevator pitch for it?
How, I guess, and how would you probably explain it to your grandma, I guess, as well,
sort of if you sort of summarizing up the whole piece of research in one go?
Sure.
So at a high level, what Diva does is basically it lets you do range filtering with theoretical guarantees.
so it's going to give you a provable false positive rate
that also works on string keys
so not necessarily only integers
and it doesn't impose any restrictions
on the kinds of queries you can feed to it.
So the range of the query can be extremely big,
or it can be really small
that's the high level pitch
nice that's cool so yeah
let's dig into it then. So how did you
arrive at this research topic, I
guess, and kind of what was the specific gap or question you were setting out to answer?
Okay.
So we were basically looking at the existing range filters, and we saw that, okay, almost all
of them, well, almost none of them give you a nice false positive rate guarantee.
So they use some amount of memory, but the filtering that they give you doesn't necessarily
correlate with the amount of memory you give to the filter.
So it might not give you the best possible false positive rate, given that memory budget you give it.
So the second issue is, the filters that do actually give you a false positive rate guarantee,
they don't really support range queries that are too big.
Because if you issue a range query that's really big to those types of filters,
what's going to happen is their false positive rate is going to go all the way up to one,
which means that for any query you give it, it's
always going to return, yes, it exists, which is going to make it practically useless.
In addition to that, sometimes the performance of these filters also gets really, really bad
when you give them a long range query, so much so that it's like on the same level as doing
an I.O., which is not great. So that's the second problem. The third problem is that
almost none of these filters support variable-length keys,
meaning that they only operate over integers, I'd say.
The only one that does support variable-length keys
does not give you any sort of guarantee
on the false positive rate that it gives you.
And the fourth issue, we saw, which was that, okay,
all of these filters are kind of like static structures.
You can't really do insertions and deletions
on them. Except for one, which was actually our own work prior to Diva. That was like the first
range filter that supported insertions and deletions, and it could accommodate workloads of growing
size so it could expand itself. But yeah, none of the other filters achieve this. And you can see that
I have these four criteria that I want to achieve. And I want to achieve them with high performance.
in some sense I have five goals
and what's interesting
was that okay of these five goals
two of them are kind of non-existent
among the current filters
and for sure none of the current filters
give you these five at the same time
so we thought okay
let's try and do this and
give the people kind of like the holy grail
of filters
Nice, yeah, that's what I was thinking about:
you get all these five things, that's all the five special
ingredients to make some, yeah, really good sauce.
And so, yeah, just to recap really
quick: so there's nothing out there that sort of gave any sort of guarantees on the
false positive rate, and the ranges were often too big, or if they got too big, the filter basically
kind of got redundant because the false positive rate went to one. Basically there was nothing that could
handle variable-size or large ranges. Integer keys only, which again feels an incredibly
restrictive API, right? Like, kind of, I imagine a lot of use cases or a lot of data kind of have
strings as keys. So yeah, it'd be nice to have that. And lastly, static
structures, right? So, yeah, data is dynamic, right? So people want to insert and delete things.
I mean, it's often, I guess, an assumption in a lot of research works:
oh yeah, just assume things are static, because you simplify the game a little bit, right?
But yeah, in reality, things are dynamic, right? So yeah, and we all want to do that as fast
as possible, right? And achieve all of these goals while having really high performance.
So yeah, so that's, that was the goal then. And so I guess maybe now should we talk a little
bit, we kind of covered off there, kind of what the existing solutions are like and how
they perform and where they fall short. So maybe we don't need to go into the
in too much depth then.
So let's switch on to Diva then
and talk about how you went about achieving
the Holy Grail.
Right.
Okay.
So this whole story with Diva
started from this theorem
from the theory community back in 2014.
So it's basically,
the theorem is quantifying
how, like, if you want to have a range filter
that guarantees a given false positive rate,
how much memory you need to use.
And if you look at it, you'll see that, okay, if I want to support string keys,
the theorem is basically telling me I have to store the actual keys.
There's nothing interesting I can do.
So in some sense, when you hear this statement, you feel like, okay,
the problem is kind of intractable.
I can't really do much here.
Well, then we looked at it and we saw that, okay,
this theorem is actually only talking about the worst-case scenario,
which doesn't seem to be that
common in reality.
So we thought to ourselves, okay,
what can we say about the
common case that we do see in reality and in practice?
Maybe we can do better for those.
And that brought us to this assumption of,
well, most of the time, our data follows some nice distribution.
So maybe if we assume that our data comes from this nice, smooth distribution,
then we could do filtering better.
So that's kind of like the starting point.
We wanted to bypass this theorem or this lower bound to achieve space efficiency while...
Just going to ignore the theorem committee.
By the way, the theorem committee sound really kind of a cool sort of, I don't know, quite a scary group of people.
The theorem committee, I don't know, yeah.
Yeah, the people in theoretical computer science.
Cool. Sorry, I continued. I interrupted.
No, that's okay.
But yeah, we had, I guess we had this observation and we thought, okay, let's suppose our data has a distribution.
And kind of the first step we took after making this assumption was, okay, so I want to leverage this distribution somehow.
So what I do is I take some samples from this distribution to kind of get an idea what the distribution looks like and learn it in a sense.
And that's basically just, this corresponds to saying that, okay, if I have a data set with some number of keys,
I'm just going to take like every thousandth key, that's going to be my sample set.
This is going to give me a really good idea of what the distribution looks like.
And based on that, I can do some more interesting things.
In particular, if your distribution is smooth, smooth enough, let's say, in between each two
samples you pick, the keys are going to have an almost uniform distribution. If you look
at the distribution function, in between each two samples, there's going to be like an
almost straight line connecting them. That corresponds to an almost uniform distribution. So this
means that the keys that you kind of omitted between these samples were almost
uniformly distributed. So we're going to leverage this property to give you a really space-
efficient encoding of the keys in your set.
So the way we would do this is, well, we're first going to store the samples that we picked out in some data structure.
Let's say it's just a binary search tree.
In our work, we use something known as a y-fast trie.
This is like a cute data structure from the 80s.
But that's besides the point.
It just gives you like a slightly faster lookup time than just the binary search tree.
Well, let's say we have a binary search tree over our samples.
Now, this gives me access to that nice approximation of the distribution.
Then what I do for the keys in between any two consecutive samples,
because I know they're uniform, I can lossily compress them
and store them in these data structures we like to call infix stores.
So the compression, the way it goes is any two consecutive samples I take,
well, these two samples have the longest common prefix, right?
any key that I take in between these two samples in lexicographic order is also going to have the same longest common prefix.
So because I'm already storing that longest common prefix in my binary search tree,
I don't have to store it in these infix stores too.
So I can just remove that longest common prefix from what's left of each key,
then I can take like a small patch of bits at the middle that kind of
differentiates between the keys that are in between the two samples.
And because I had this uniformity property, that small patch of bits, which we call an infix because it's in the middle of the key, is going to be almost uniformly distributed.
It's going to inherit the uniformity of the key itself.
So I have kind of like a small summary of the key, which, well, it's part of the key itself, so it's order preserving.
it's almost uniform.
So in some sense, the uniformity kind of makes it look like a hash of the key.
Because that's what a hash is in reality.
It's like a small summary of the key that looks really random.
In our case, it also looks random, but the additional benefit is that it's actually order preserving.
So we summarize these keys in between the samples as these infixers.
And we're going to leverage this hash-like property of the infixes to store them
in something that looks like a hash table.
So the way this is going to go is each infix,
we're going to split it into an address and a remainder is what you like to call it.
This procedure is known as quotient thing in the sense.
The address is telling you where in the hash table you want to store the infix,
and in that position, you're going to store the remainder.
So what you get in the end is an order-preserving hash table
that gives you
constant-time access
to the infixes
in between two samples.
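A rough sketch of the infix derivation just described, under the assumption that keys are byte strings and that we strip the two bounding samples' longest common prefix before grabbing a small patch of bits; the bit-string manipulation is deliberately naive and the function names are invented:

```python
def longest_common_prefix_bits(a: bytes, b: bytes) -> int:
    """Length in bits of the longest common prefix of two keys."""
    bits_a = ''.join(f'{byte:08b}' for byte in a)
    bits_b = ''.join(f'{byte:08b}' for byte in b)
    n = 0
    for x, y in zip(bits_a, bits_b):
        if x != y:
            break
        n += 1
    return n

def extract_infix(key: bytes, lo_sample: bytes, hi_sample: bytes,
                  infix_bits: int = 8) -> int:
    """Drop the bounding samples' shared prefix, keep the next few bits of the
    key. The result is order-preserving and, for smooth data, near-uniform."""
    lcp = longest_common_prefix_bits(lo_sample, hi_sample)
    key_bits = ''.join(f'{byte:08b}' for byte in key)
    chunk = key_bits[lcp:lcp + infix_bits].ljust(infix_bits, '0')
    return int(chunk, 2)

def quotient(infix: int, infix_bits: int, addr_bits: int):
    """Quotienting: split an infix into a bucket address (high bits)
    and a remainder (low bits) to store in the hash-table slot."""
    remainder_bits = infix_bits - addr_bits
    return infix >> remainder_bits, infix & ((1 << remainder_bits) - 1)
```

Because the address is taken from the high-order bits of an order-preserving infix, bucket order follows key order, which is what makes the resulting hash table order-preserving.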
nice cool
So once we've got this data structure,
can you walk us through
kind of how we go the other way then
so I think it'd be quite nice
to go through
how a range query works
on top of this then
sure of course
so a range query
is actually really simple
to handle
So you had this binary search tree, right?
And then in between each two consecutive samples,
I'm assuming that the samples are ordered lexicographically, right?
In between each two consecutive samples, I have a small infix store.
So the way a range query would go is, I have my range.
I'm first going to look at this binary search tree that I have.
I'm going to check if there's any key that is included within my range.
Because my binary search tree is storing the actual
samples themselves without any loss,
I can do this check exactly.
And if I find something that's in the range,
well, the answer to my query is a true positive.
So I just returned that and I'm happy.
So in some sense, the interesting case is when I don't have any samples within my range.
In that case, well, if you think about it, you'll see that, okay,
this means that the end points of the range query are both in between two consecutive samples,
because otherwise they would have contained some sample.
So this means that, in other words,
this means that the endpoints of the range query
are fully included within a particular infix store.
So then what you do is, you take the endpoints of the query,
you kind of do the same compression algorithm
that we did for the keys in between the two samples
to convert these endpoints into endpoint infixes.
And then look at the order-preserving hash table to see if there's anything in between these two endpoint infixes.
If we find something, the answer is a, well, the query is a positive.
And if we don't, the query is a negative.
We get false positives because this compression scheme of removing the longest common prefix and also removing some suffix to get the infix, this is lossy.
So there is some chance that you might accidentally match something that wasn't exactly the key
you were hoping for.
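Putting those two steps together, a hedged sketch of the range-query path might look like this, assuming a sorted list of exactly-stored samples, one infix set per gap between consecutive samples, and a `to_infix` compression function (all hypothetical names; boundary sentinels below the first sample are elided):

```python
import bisect

def range_query(samples, infix_stores, lo, hi, to_infix):
    """Sketch: 'samples' is a sorted list of exactly-stored sample keys,
    'infix_stores[i]' holds the infixes of keys between samples[i] and
    samples[i+1], and 'to_infix(key, gap)' applies the lossy compression."""
    # Step 1: exact check against the samples themselves.
    i = bisect.bisect_left(samples, lo)
    if i < len(samples) and samples[i] <= hi:
        return True  # a sample lies in [lo, hi]: guaranteed true positive
    # Step 2: no sample in range, so both endpoints fall in the same gap.
    gap = i - 1
    lo_fix, hi_fix = to_infix(lo, gap), to_infix(hi, gap)
    # Positive iff some stored infix lies between the endpoint infixes;
    # this is where (only) false positives can sneak in, since infixes are lossy.
    return any(lo_fix <= f <= hi_fix for f in infix_stores[gap])
```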
Nice, cool. So this, as you've been talking through this example, then, I'm thinking, okay,
I can see how this is all constructed and works nicely for static data. How do we then go and make
this dynamic? Like, how do we then handle inserts, and deletes? And yeah, expansions as well.
And kind of what are expansions? I want to ask as well, I saw that mentioned in the paper.
Expansions? What do they mean? Sure, sure. So, yeah, let me start with how,
with how you can do inserts and delete. Then I'll get there.
So inserts and deletes are quite straightforward. Well, they're not straightforward in general,
but they'll be straightforward for you now, I'd say, yeah.
Right.
So yeah, inserts and deletes, the way they work is, let's say you get an insert, right?
What Diva does is, when it sees the new key, it's going to flip a coin, with some probability, that decides whether to take that new key
to be a sample.
So it's going to insert it into the binary search
tree of samples that we had.
If it does this, then this means that it kind of
lands in an infix store.
It has to split the infix store into two halves.
So now you're kind of like, this is sort of similar
to how a B-tree works, if you're familiar with it.
So you insert a new thing and it splits like the infix store
into two halves, which is just like a smaller subset of the problem.
So that's one scenario if my coin tells me to insert it as a sample.
The other scenario is my coin tells me to not insert it as a sample.
So this means that I just go to the relevant infix store for that key, and I just insert
it as an infix.
Really simple.
And deletions are kind of the same.
You follow the same procedure.
Well, you don't have the coin flip anymore.
But if you find a sample
in your tree that matches your deletion,
then you just remove that and merge
the two infix stores.
And if you don't find the sample,
then you just go to the relevant infix store
and find an infix that
matches your key
the best,
in some sense,
and you remove the infix
that matches it the most.
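The insert path just described can be sketched roughly like this, assuming a sorted sample list, one infix set per gap between consecutive samples, and a hypothetical `to_infix` compression function; note that the real structure re-derives infixes against the new, tighter sample bounds on a split, which this simplified version skips:

```python
import bisect
import random

def insert(key, samples, infix_stores, to_infix,
           sample_prob=0.001, rng=random.random):
    """Sketch of the coin-flip insert (boundary sentinels elided).
    sample_prob mirrors the 'every thousandth key' sampling rate."""
    gap = bisect.bisect_left(samples, key) - 1
    if rng() < sample_prob:
        # Coin flip says: promote the key to a sample. The gap's infix store
        # splits in two, much like a B-tree node split; infixes are
        # order-preserving, so we can partition them around the new key's infix.
        pivot = to_infix(key, gap)
        old = infix_stores.pop(gap)
        infix_stores.insert(gap, {f for f in old if f >= pivot})
        infix_stores.insert(gap, {f for f in old if f < pivot})
        samples.insert(gap + 1, key)
    else:
        # Common case: just add the key's lossy infix to its gap's store.
        infix_stores[gap].add(to_infix(key, gap))
```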
Nice, cool. Yeah. And expansions.
Right. So expansions.
So an expansion is basically
just a fancy word.
for saying that I want my data structure to resize itself as I insert more stuff.
So you can imagine in the insertion procedure that I mentioned,
because I'm doing this coin flip procedure,
there is a chance that I, for a long time, I don't insert any samples.
So I'm going to have an infix store that's kind of big.
It's going to get full.
So I have to reallocate it in some sense and give it more memory
so that it can store the extra data.
That reallocation process is known as expansion.
And the cute thing that we do in Diva,
that's sort of different from how other filters do expansion,
is, well, other filters do this, I guess,
classic procedure of expanding by doubling the size of the array
and then inserting your new keys.
We do it by expanding fractionally, in some sense.
We increase it by a little bit, so that you don't have this wasted space
from doubling.
Because if you double, then there is a chance that half of the space that you're using for
your filter is going to be empty, which is not great for something that's supposed to be
really space efficient.
Yeah.
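The difference between the two growth policies is easy to see in a couple of lines; the 10% growth factor below is purely illustrative, not Diva's actual constant:

```python
def grow_doubling(capacity: int) -> int:
    # Classic resize policy: double the array. Right after an expansion,
    # up to about half of the allocated space can sit empty.
    return capacity * 2

def grow_fractional(capacity: int, factor: float = 1.1) -> int:
    # Fractional expansion in the spirit described above: grow by a small
    # step so the structure stays nearly full after each resize.
    return max(capacity + 1, int(capacity * factor))
```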
Cool.
I guess now it might be quite a nice time to then talk about evaluation and results and how
well Diva performs.
So I guess the first thing would be to ask, how did you go about evaluating Diva?
What was the setup looking like?
Who are you comparing it against?
And yeah, so let's take that question first.
Then you can give us some clickbait results that we can
chop up and then make some shots out of.
Right.
Awesome.
Okay.
So the overall setup is, we do both microbenchmarks and an integration benchmark.
For the integration, we're integrating the range filters that are integrable, in some sense, into WiredTiger,
So let me start with the micro benchmarks.
For these benchmarks, we are comparing against pretty much every single range filter that's out there.
The ones that are important to note, I think, are Surf.
Surf is like the first range filter that was designed.
This is the only range filter that supports string keys, variable-length keys.
And Memento, which is our previous work.
This is the only range filter that supports dynamic updates.
Okay.
So when compared to the other range filters, in general, Diva gives you almost the best false positive rate.
Like, it's better than, actually, it gives you the best false positive rate that's stable with the range query size.
So no matter what range query size you feed it, it's always going to give you a really stable and constant false positive rate that's better than anything else.
There are like one or two filters that are slightly better
than Diva in terms of the false positive rate.
Well, those guys are actually much, much slower in terms of query time, and they're not
updatable.
So you can't do any inserts and deletes, still.
So that's like the setup against ranges of different size.
If you care about string keys, well, then the only other relevant range filter is Surf.
Compared to Surf, Diva gives you like one to two orders of magnitude better false
positive rate at the same memory budget.
Yeah.
So that's like the benefit you would get.
And it's also really much faster than Surf.
Yeah, anyhow, it is much faster.
How much faster is it then?
Give us some numbers.
So it kind of depends on the type of data
that you're working with.
But in theory, what can happen is
surf, morally speaking,
Surf is like a trie,
so it has to go down this multi-level trie,
and the number of levels that are in the trie
is basically the length of the strings in your workload.
So that means for each level of the trie,
you're going to incur a single cache miss,
which is going to scale pretty poorly in the length of the key.
Diva in comparison, because it's using the y-fast trie thingy,
is going to take actually logarithmic time
in terms of the length of the key.
So it's exponentially faster,
in some sense.
Nice, cool.
Yeah.
Do you want to talk about,
should we talk about the WiredTiger...
Let me, I'll let you continue.
Sorry, I interrupted.
Yeah.
There is one more thing that I want to.
Keep going.
Yeah.
Yeah.
So yes.
So against Memento filter as well,
which is the only other dynamic one.
Well, now both Diva and Memento can expand.
So when compared to each other, you'll see that, okay.
Performance-wise, Memento filter is a little bit faster.
Well, yeah, it's like faster by 1.5x, I think, if I recall correctly.
And the reason for that is, well, because it doesn't have this extra binary search tree
that it has to go through to get to where it's supposed to go.
It's just using hashing.
But even though it's slightly faster, it's much, much worse in terms of false positive rate
when you do long range queries, even as you're expanding the dataset.
Before we talk about the WiredTiger implementation, I'm very intrigued about that.
How is the performance affected by the kind of proportion of inserts and deletes going on as well?
Like, what's the relationship there between, I guess, the writes and the reads?
Right.
So they don't really impact each other that much.
Because the scaling, yeah.
If you were using a normal binary search tree in Diva, then they would impact each other, because then, with more inserts,
the binary search tree is going to get bigger,
and you're going to be slower in your search.
Yeah.
But the y-fast trie actually gives you an access time
that's independent of the number of keys.
Okay.
Yeah.
In that sense, they're both awesome.
Yeah, very nice property.
Cool.
So let's talk WiredTiger then.
So first of all, tell us a little bit about the implementation,
if and how easy it was to implement it in WiredTiger.
And then, yeah, tell us how much
it improves what was already existing in WiredTiger as well.
Yeah, sure. So
the implementation itself was actually very easy,
because we give you like a really nice header-only
implementation of Diva, you can just plug and play,
and we just had to place it on top of the
interface for accessing the B-tree in WiredTiger.
So before accessing, so again,
WiredTiger is like a key-value store that's formed as a B-tree.
Just before accessing the B-tree to get your desired record,
you just have to ask Diva first to see if it exists.
If it exists, it tells you go access to the B-tree,
and if it doesn't, it tells you don't bother.
So it was really easy to integrate it.
We just put Diva as like an upper layer over WiredTiger.
And compared to the things that existed beforehand,
well, Memento filter was really the only other
range filter that was integrable with WiredTiger.
Because now, you have this dynamic database,
which you're doing inserts and deletes to.
So the filter that you're using must also be dynamic.
It must also support inserts and deletes.
So really the only other filter that fit the bill was Memento.
Compared to that, well, if the workload was similar,
Diva actually achieved the same, yeah,
if you were working over integer keys and you didn't have long range queries, let's say,
it achieved the same false positive rate as Memento, and it gave you pretty much the same performance.
Because when you're using a database system, the bottleneck really comes from, well, most of the time,
it comes from the disk accesses you're doing, not the filter time itself.
But I guess the core benefit is that now you can also do string keys,
which Memento filter doesn't even work with.
Or if you have long-range queries, which Memento filter can potentially be extended to support,
Memento is going to get a really bad false positive rate,
which is going to translate into a very slow system,
which is just going to do redundant accesses to disk,
whereas Diva is going to let you filter everything out, irrespective of the range size.
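The layering pattern Navid describes, a filter sitting in front of the storage engine, boils down to a guard check like this sketch; the store and filter objects are hypothetical stand-ins, not WiredTiger's actual API:

```python
class FilteredStore:
    """Sketch of the layering: a range filter guards a key-value store so
    definite misses never touch storage."""
    def __init__(self, store, range_filter):
        self.store = store          # e.g. a B-tree-backed key-value store
        self.filter = range_filter  # must answer may_intersect(lo, hi)

    def scan(self, lo, hi):
        if not self.filter.may_intersect(lo, hi):
            return []  # definite miss: skip the storage access entirely
        # May still be a false positive, in which case the scan comes back empty.
        return self.store.range_scan(lo, hi)
```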
Nice, cool. So for all our listeners who have always wanted to do range queries over variable-
length keys but haven't been able to, and now can: is Diva actually available to
users of MongoDB? Can they actually use Diva? Is it publicly available, I guess, is what
I'm asking. Yeah, of course, Diva is open source, so you can just go to my GitHub page and
pick it up. I tried to make it as user-friendly as possible. It's just a
header-only file, you include it, plug and play, you just do your workloads and you're
happy. It is also thread-safe, actually. That's a new addition, so if you have a multi-threaded system, you should be
happy with that as well. Fantastic. We'll put a link to that in the notes so the interested listeners can
go and have a play. Yeah, so let's change gears a little bit now and talk about surprises. So whilst
you were working on the project, on range filters and Diva, was there any sort of thing that
sprung up, any sort of counterintuitive findings or results that really sort of challenged your
initial assumptions about range filters? Yeah, yeah, for sure, there were. So the biggest one was
when we started this project, we had this conception that basically every single workload with
variable length keys had a smooth distribution, in particular, like natural language, we assumed
had a smooth distribution. But when we got into evaluating it, actually,
we saw that, okay, there are some specific natural language data sets that don't seem to be as nice in terms of the distribution.
They're not as smooth as you would expect.
For those, it actually turns out that both Diva and Surf, if you just use them as is on these datasets, you're not going to get a great false positive rate.
But we also observe that, okay, the reason why we are getting a bad false positive rate on these datasets is actually because there's a lot of redundancy in the English language in the sense that there are specific combinations of characters and words that are really common.
And if you think about it, okay, because the way I'm deriving the infix is I'm just removing the longest common prefix
and I'm taking some bits,
that redundancy is going to translate into the infix that I take at the very end.
So what that's going to do is it's going to make the infixes less suitable for differentiating the keys in some sense.
And that's going to mess with the false positive rate.
So where am I going with this?
I'm saying that I have this redundancy.
That's causing the issues.
So why not do some compression algorithm to remove the
redundancy? And it turns out we did an order-preserving compression of the keys.
And when you do that, the distribution of the keys in your workload is going
to get much smoother. And you can use Diva over that newly modified
data set.
Nice, that's a good segue into limitations, because I was going to ask you
about this assumption of having a smooth data distribution, and kind of say, well, how
likely is that in reality? How often do we see that? But, as you said, you can actually often, with the
technique you just mentioned there with compression, take data that isn't smooth and kind of
make it smooth, right? But yeah, I guess the question that still stands is: are there any
limitations, any scenarios in which I shouldn't be using Diva, or where Diva could still be improved, maybe
in certain places?
Yeah, okay. So the core limitation is, as you mentioned, if you have a
non-smooth distribution, you have to do some pre-processing in some sense to make it smooth.
That, I think, is like an open research question in some sense.
We have to study it a bit more to see what type of pre-processing to apply that best smooths out the distribution.
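As a toy illustration of one such pre-processing step (my own sketch, not the paper's method): an order-preserving rank transform maps each key to its fractional rank in a sorted sample, which preserves comparison order while pushing the output distribution toward uniform.

```python
import bisect


def build_smoother(sample):
    """Return an order-preserving map from keys to [0, 1).

    Hypothetical sketch: the sorted sample approximates the key
    distribution's CDF, so mapping a key to its fractional rank
    preserves order while making the output roughly uniform.
    """
    s = sorted(sample)

    def smooth(key):
        # Fractional rank of `key` within the sample.
        return bisect.bisect_left(s, key) / len(s)

    return smooth


# Skewed keys become evenly spread ranks, but order is preserved.
smooth = build_smoother(["aa", "aab", "aac", "zz"])
assert smooth("aa") < smooth("aab") < smooth("aac") < smooth("zz")
```

This is only one possible transform; which pre-processing best smooths a given distribution is exactly the open question discussed here.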
There is also another limitation in the sense that, okay, your range queries
might be constructed in such a way that you care about some range that's really close by to a given key.
So you can imagine you have a key that's, let's say, 1,000, and your range is from 1,001 to 1,002.
It's really close to the key, but it doesn't include it.
This type of workload, we call it the correlated workload in the literature.
For this one, because Diva only takes the infix and throws away the lower-order bits of the keys,
it's not really able to help you in filtering these types of queries out.
But the thing is, this is kind of a philosophical point.
If the ranges that you care about are really similar to the keys,
that in some sense kind of implies that you know a lot about your dataset.
And if you know a lot about your data set, then it's kind of questionable that you want to issue range queries that are kind of empty.
And you know what the data set is, so why not just iterate over the stuff that you know exists?
So in some sense, this workload that I mentioned, this correlated workload may or may not be really something that you see in practice.
To me, it seems like the most reasonable scenario where you would use a range filter is when you don't know a lot about the dataset.
So the range queries you issue are likely to be empty.
In that scenario, Diva is perfectly fine.
You could use it.
But if for some reason you happen to know a lot about your dataset and you want to issue this range query that's really close by to a key, it's not going to help you there.
You have to use something similar to Memento to filter those out.
And by the way, this is the worst-case scenario that the lower bound,
the theorem that I mentioned at the beginning, is kind of exploiting.
So in some sense, you can't really do better than the memory usage of Memento to handle these.
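To make the correlated-workload point concrete, here is a toy model (my own simplification, not Diva's actual infix derivation) in which a filter keeps only the high-order bits of each key; the nearby empty range then becomes indistinguishable from the key itself.

```python
def infix(key, low_bits=6):
    # Toy model: keep only the high-order bits of the key,
    # analogous to dropping the low-order suffix when deriving an infix.
    return key >> low_bits


key = 1000
lo, hi = 1001, 1002  # empty range right next to the key

# All three values share the same truncated representation, so a
# filter that stores only these truncated values must answer
# "maybe present" for the empty range — a guaranteed false positive.
assert infix(key) == infix(lo) == infix(hi)
```

A structure like Memento, which retains low-order information, can still separate the key from the adjacent range, which is why it handles correlated workloads.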
Do you feel like you could have both Memento and Diva in the same system?
I don't know how you would basically have it, but could both coexist
in the same system, and depending on your query you use one or the other, right? And
I don't know whether the selection would need to be sort of human-driven, or
whether you could have some sort of, I don't know, optimizer or something that can say, okay, you should probably
use this one or use that one. Anything like that, a way they can coexist?
I think so, yeah. The simplest solution might be to just put a threshold on the size of the range.
If it's smaller than something, use memento.
If it's bigger than something, use diva.
Sure.
Some simple heuristic like that would work.
Yeah.
Yeah.
Yeah, I actually thought about this problem
for a while.
And I don't think you can do better memory-wise than just having a separate Memento
instance and a separate Diva instance and using them together.
That seems to be the optimal you can achieve.
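The threshold heuristic discussed above could be sketched like this. This is a hypothetical wrapper: the filter classes here are exact stand-in stubs purely for illustration, not the real Memento or Diva APIs.

```python
class StubFilter:
    """Stand-in for a real range filter; also reports which instance answered."""

    def __init__(self, name):
        self.name = name
        self.keys = set()

    def insert(self, key):
        self.keys.add(key)

    def may_contain_range(self, lo, hi):
        # A real filter answers probabilistically; the stub is exact.
        return any(lo <= k <= hi for k in self.keys), self.name


class HybridRangeFilter:
    """Route short ranges to Memento, long ranges to Diva."""

    def __init__(self, memento, diva, threshold):
        self.memento, self.diva, self.threshold = memento, diva, threshold

    def insert(self, key):
        # Keys go into both filters so either one can answer any query.
        self.memento.insert(key)
        self.diva.insert(key)

    def may_contain_range(self, lo, hi):
        if hi - lo <= self.threshold:
            return self.memento.may_contain_range(lo, hi)
        return self.diva.may_contain_range(lo, hi)


h = HybridRangeFilter(StubFilter("memento"), StubFilter("diva"), threshold=16)
h.insert(1000)
assert h.may_contain_range(1001, 1002) == (False, "memento")  # short, correlated
assert h.may_contain_range(0, 5000) == (True, "diva")         # long range
```

As noted in the conversation, keeping two full instances doubles the memory, which appears hard to improve on for handling both workloads.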
Nice.
Cool.
I guess then, where do you go next
with Diva? Where's the future?
What's the future hold for it?
Yeah, that's a good question.
I think the main thing that we should
research in the future is how you can smooth out
non-smooth distributions.
That's like the biggest thing you can,
in my opinion, you can work on.
I am actually working on
some techniques to address that.
So hopefully we'll get that
at some point.
So just on that: like you said, a very fertile area for research is
finding how we can take data sets and make them smooth.
Do you have a handle on how unsmooth data actually is out in the wild?
Or is it just application-sensitive or context-dependent?
It's kind of really hard to say.
Well, most people that I've talked to that are actually interested in using range filters,
seem to have data that is quite smooth.
I'm not entirely sure about, like, the average workload across all industries, of course.
But the people that I've talked to seem to have smooth distributions mostly.
And when I think about this problem, it seems to me, okay, usually your keys are maybe names, natural language, or, like, email addresses.
Those seem to me to be non-smooth, but they're very structured, and the structure is very well understood.
So we should be able to do some nice compression to make them work as well.
I don't think it's that bad, but of course you can have a weird composite key that's not smooth.
That's completely domain-specific;
you have to do a specific compression for that.
Some adversarial case.
Yeah, they always exist, right?
Cool.
So let's talk about impact then and practical relevance.
And so my first question is, I guess,
how much impact have you seen Diva have so far since it was published?
Right.
So it is a little bit early, I think, for us to see some impact.
But I have seen a lot of interest.
And we are actually in talks with some companies about integrating Diva into their systems.
I'm hoping to see Diva show up in a lot of systems,
mainly because this LSM-tree structure that I mentioned
is like the killer application for a range filter.
And it's extremely widely used across the industry.
One really famous example is like in RocksDB.
You can just plug Diva into RocksDB and you get all of the benefits out of the box.
Yeah.
And we do actually have an internal project in the University of Toronto that does this.
But it's not published yet, so that's like internal knowledge for you.
Yeah, cool, sweet. So
I guess that brings us nicely on to the last word then. And there's two
things here: first of all, what should practitioners take away from this
work, and secondly, what should researchers take away from this work and your spot on the podcast?
Right. So for practitioners, I think the nice message is that you can actually have data structures and I guess techniques that are grounded in theory that give you really nice guarantees that are also useful in practice.
I think that instead of, well, I mean, going to heuristics to implement a, you know, whatever data structure you want for your system is a fine choice.
And it's often, oftentimes the correct choice given the time budget you have.
But if you really want something that's never going to break under any circumstances, I believe it's possible to design structures that give you strong guarantees like that.
Diva is one example of that.
I think I would like to encourage practitioners to think about this a bit more, maybe,
like consider having structures that give them strong guarantees
so that they don't have to deal with the headaches of things breaking later on.
Nice.
That's for the practitioners.
For the researchers, I think the nice message I have to give is, again,
tied back to the lower bound that I mentioned,
at the beginning. The statement of the lower bound kind of seemed pessimistic, right? It seemed
like you couldn't do anything good in the space and everything was done, so why even bother?
But oftentimes, maybe not oftentimes, but a lot of the time, when you look at these lower bounds,
you see that they're actually leveraging properties of the problem that are really hard,
and they only appear in the worst case. In reality, a lot of people don't deal with those worst cases.
So I would actually encourage people to think about what properties we have of the data we care about in practice that can help us in designing these types of structures.
Like it could be an algorithm, it could be a data structure, and in designing any one of these, I think it's nice to find those specific properties and kind of tie the algorithm to that property to get a better performance out of the
entire thing. So this gives you a broader spectrum of problems you can work on if you
go beyond this worst case in some sense. Nice, yeah. That's a great message to end on there.
One last question actually. I said that was the last word, but one last question: Diva. You need to tell us
about the name, because naming's important. Why is it called Diva? Yeah, it's
like a weird acronym I came up with on the spot. So Diva, the D and the I
come from dynamic: if you look at the title of the paper, the D is from the start of dynamic and the I is from
the end of the word, and VA is just for variable. So it's like concatenated pieces, and you get
Diva. Oh yeah, and it's a bit of a diva as well. It's pretty cool, right? So yeah, you've
got that going for it as well. Great stuff. Well, we'll end things there then.
We'll put links to everything you mentioned in the show notes as well, so the listeners can go and connect
with you. Are you on social media anywhere? Where's best to find you?
Yeah, that's a good question.
I'm not actually on social media.
Maybe the homepage or something, we can link that maybe.
Yeah, I do have a GitHub page.
So if you're interested in contacting me, please check that out.
I will answer emails.
And yeah, I'd be happy to chat.
Great stuff, yeah.
And go play with Diva.
That's the last thing we want to say.
Yes, cool.
Please do.
Great stuff.
Thank you very much.
