Disseminate: The Computer Science Research Podcast - Navid Eslami | Diva: Dynamic Range Filter for Var-Length Keys and Queries | #67

Episode Date: November 13, 2025

In this episode of Disseminate: The Computer Science Research Podcast, Jack sits down with Navid Eslami, PhD researcher at the University of Toronto, to discuss his award-winning paper "Diva: Dynamic Range Filter for Variable Length Keys and Queries", which earned the Best Research Paper award at VLDB. Navid breaks down how range filters extend the power of traditional filters for modern databases and storage systems, enabling faster queries, better scalability, and theoretical guarantees.

We dive into:
- How Diva overcomes the limitations of existing range filters
- What makes it the "holy grail" of filtering for dynamic data
- Real-world integration in WiredTiger (the MongoDB storage engine)
- Future challenges in data distribution smoothing and hybrid filtering

Whether you're a database engineer, systems researcher, or student exploring data structures, this episode reveals how cutting-edge research can transform how we query, filter, and scale modern data systems.

Links:
- Diva: Dynamic Range Filter for Var-Length Keys and Queries [VLDB'25]
- Diva on GitHub
- Navid's LinkedIn

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Disseminate the computer science research podcast. Hello and welcome to disseminate the computer science research podcast. I'm your host, Jack Wardby. This is the podcast where we interview computer science researchers about their latest work, highlighting the problems they've tackled, solutions they've developed, and how their findings can be applied in practice. It's our mission here to help further narrow the gap between research and practice and make computer science research more accessible.
Starting point is 00:00:27 So if you are an industry practitioner, researcher or a student, then this is definitely a podcast for you. You can listen along on Apple, Spotify, all major podcast platforms, and you can now also watch along on YouTube. So if you do enjoy the show, please like, follow, subscribe, do all those things, and yeah, tell a friend to help disseminate Disseminate. In today's show, I was really happy with that joke when I wrote it down. So, yeah. That's great. Anyway, cool. So in today's show, we're going to be talking with Navid Eslami about his cutting-edge research on filters, specifically range filters. Navid is doing his PhD in the
Starting point is 00:01:04 Systems and Network Group, the CISNET group at the University of Toronto, and he recently published a paper on the topic of filters at the very large databases conference. That's the VLDB conference. And it was titled Diva, Dynamic Range Filter for Variable Length Keys and Queries, which won the best research paper award. So welcome to the show, Navid, and congratulations on the best paper award. Thank you so much. It's great to be here. Cool. Well, let's get stuck straight in then.
Starting point is 00:01:31 So let's set some background for the chat. And for those who are new to the area, can you tell us, first of all, what a filter is? Okay, sure. So if that's okay, I'm going to start with like the sort of killer applications of filters. Go for it. Yeah. Yeah, yeah, that's great. Okay.
Starting point is 00:01:49 So nowadays, a lot of data systems, particularly database systems, ingest large amounts of data at high rates, right? So this kind of necessitates the databases to support a high write throughput. And the way they do this traditionally, or I guess for the past 20 years, it's been through this data structure called an LSM tree. So at a high level, an LSM tree basically just buffers your insertions into your tables. And once the buffer gets to a specific size that's hefty enough, it's going to write it to disk
Starting point is 00:02:29 using very few I/Os. But the way this works is whenever it writes to disk, it's going to create a new file that's kind of disparate from everything else that it's written until now. This is great for writes because you don't care where you're
Starting point is 00:02:47 writing and you're just buffering everything. But it's not so great if you want to do queries, right? Because then you have to look at every single one of the files and say, okay, does the thing that I'm looking for exist in this file or not? And then search in that. So to get around this issue,
Starting point is 00:03:05 people have started using filters over these LSM tree data structures. So what this means is, okay, so what a filter is, it's basically, for each one of these files, the filter is just going to kind of approximate that file in very low memory. So it's a compact, probabilistic data structure in a sense, that you can give to it a key and ask, does this key exist in that file or not? The filter is going to tell you
Starting point is 00:03:35 either it's for sure not going to exist in the file, or it's going to tell you, okay, it exists with high probability, but it might make a mistake and tell you that it exists even though it doesn't. The more memory you give the filter, the lower this probability of a mistake is going to go. So that probability is also known as the false
Starting point is 00:03:55 positive rate in a sense. So why is this thing nice for something like an LSM tree? It's because if you want to do something like a point query, you want to find like one specific record with a given ID, well, chances are almost all of the files that you're storing on disk aren't going to have anything relevant to that ID, right? So there's only one of them that's going to contain the record that you're looking for. So before accessing each file, if you just ask the filter, hey, does the record with this ID exist in my file or not, then you can avoid accessing most of those files. This is going to speed up your system significantly. And even beyond just database systems, filters have been used in similar contexts for like network applications. You can imagine instead of accessing things on disk, you have some storage medium over the network
Starting point is 00:04:49 and you don't want to do this round trip every single time. Nice. So yeah, I always get my false positives and my false negatives mixed up as well, so this is going to be fun for me today because I always get them the wrong way around. So I guess, in the example you gave there with the database, I've got some database, I've got a list of all the best podcasts in the world, and now I want to go and check if Disseminate is there, right? I'm going to ask that query, and if it definitely is there, it's going to say so, so I'll have a yeah, it's there,
Starting point is 00:05:20 or I'll get a maybe it's there, go and check, sort of thing. Is that how it's going to work? But it's never going to tell me that it's not there when it is there, right? We get no false, no false, this is the question, positives? Negatives. Negatives.
Starting point is 00:05:34 There we go. Yeah, yeah, okay, fantastic. Cool, yeah, exactly. Brilliant, because false negatives would result in data loss, and we don't want that. Yeah, exactly. Cool, so that's a filter, but your work focuses on an extension of these, I guess, and that's range filters.
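The filter contract Jack and Navid just pinned down — a definite "no", a probabilistic "yes" — can be sketched with a toy Bloom filter. This is an illustrative stand-in only (Diva itself is a range filter, not a Bloom filter), and every name here is made up for the sketch:

```python
import hashlib

class ToyBloomFilter:
    """Approximate membership: 'no' is always correct (no false negatives);
    'yes' may occasionally be a false positive. More bits => fewer mistakes."""
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key: str):
        # Derive several bit positions per key from independent hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p] = True

    def maybe_contains(self, key: str) -> bool:
        # All probed bits set => "probably there"; any clear bit => "definitely not".
        return all(self.bits[p] for p in self._positions(key))

f = ToyBloomFilter()
f.insert("disseminate")
assert f.maybe_contains("disseminate")  # inserted keys are never denied
```

An LSM tree would consult a filter like this before each file read and skip the read entirely on a "definitely not" answer — that is the speedup described above.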
Starting point is 00:05:50 So, yeah, how do we go from filters to range filters? What's the difference there? Yeah, sure. So range filters are in a sense a really natural extension of filters. Because filters traditionally only support queries with a single key, right? So the question is, okay, I have my database and, well, a lot of my queries are actually ranges. I want to scan all the pieces of data that are within a given range. I can't really use a standard filter to do this
Starting point is 00:06:21 because it doesn't give me the relevant API. So the question is, can I have a filter, a range filter, that I can give a range and ask, is there anything included within this range or not in my file? So then if you just have this data structure, now suddenly you can do range queries over your LSM tree as well as the point queries you were doing before. Nice, cool.
Starting point is 00:06:45 Could you maybe give an example of that? I don't know of what a fun example might be of such an application that would do such range queries. No, sure. So I guess one standard example might be, let's say if you have a database of patients, and you want to just find the patients that have an age between, let's say, 50 and 60. So your range would just be 50 to 60. And you're looking for, your keys are the integers between these two numbers. So you just want to filter those out.
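As a concrete sketch of the API being described, here is an exact baseline for the patients-aged-50-to-60 style of question. The names are illustrative, not Diva's actual interface; a real range filter gives the same may-contain answer approximately, in a fraction of the memory:

```python
import bisect

class ExactRangeFilter:
    """Exact baseline: stores the sorted keys outright. A real range filter
    answers may_contain_range approximately using far less space."""
    def __init__(self, keys):
        self.keys = sorted(keys)

    def may_contain_range(self, lo, hi) -> bool:
        # Find the first key >= lo and check it does not overshoot hi.
        i = bisect.bisect_left(self.keys, lo)
        return i < len(self.keys) and self.keys[i] <= hi

ages = ExactRangeFilter([34, 47, 52, 58, 71])
assert ages.may_contain_range(50, 60)       # patients aged 52 and 58 exist
assert not ages.may_contain_range(61, 70)   # nobody between 61 and 70
```

A plain point filter cannot answer `may_contain_range` at all without probing every possible key in the range, which is why the extended API matters.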
Starting point is 00:07:19 Cool. So that sets us up nicely then for the sales pitch for Diva, I guess, then. So, yeah, what's the, I guess, the TLDR on it, the overall elevator pitch for it? And how would you explain it to your grandma, I guess, as well, sort of summarizing up the whole piece of research in one go? Sure. So at a high level, what Diva does is basically it lets you do range filtering with theoretical guarantees. So it's going to give you a provable false positive rate
Starting point is 00:07:49 that also works on string keys, so not necessarily only integers, and it doesn't impose any restrictions on the kinds of queries you can feed to it. So the query, the range, can be extremely big or it can be really small. That's the high-level pitch. Nice, that's cool. So yeah,
Starting point is 00:08:09 let's dig in then. So how did you arrive at this research topic, I guess, and kind of what was the specific gap or question you were setting out to answer? Okay. So we were basically looking at the existing range filters, and we saw that, okay, almost all of them, well, almost none of them give you a nice false positive rate guarantee. So they use some amount of memory, but the filtering that they give you doesn't necessarily correlate with the amount of memory you give to the filter.
Starting point is 00:08:45 So it might not give you the best possible false positive rate, given that memory budget you give it. So the second issue is, the filters that do actually give you a false positive rate guarantee, they don't really support range queries that are too big. Because if you issue a range query that's really big to those types of filters, what's going to happen is their false positive rate is going to go all the way up to one, which means that any query you give it, it's always going to return, yes, it exists, which is going to make it practically useless. In addition to that, sometimes the performance of these filters also gets really, really bad
Starting point is 00:09:28 when you give them a long-range query, and so much so that it's like on the same level as doing an I/O, which is not great. So that's the second problem. The third problem is that almost none of these filters support variable-length keys, meaning that they only operate over integers, I'd say. The only one that does support variable-length keys does not give you any sort of guarantee on the false positive rate that it gives you. And the fourth issue we saw was that, okay,
Starting point is 00:10:05 all of these filters are kind of like static structures. You can't really do insertions and deletions to them. Except for one, which was actually our own work prior to Diva. That was like the first range filter that supported insertions and deletions, and it could accommodate workloads of growing size, so it could expand itself. But yeah, none of the other filters achieve this. And you can see that I have these four criteria that I want to achieve. And I want to achieve them with high performance. So in some sense I have five goals, and what's interesting
Starting point is 00:10:42 was that, okay, of these five goals, two of them are kind of non-existent among the current filters, and for sure none of the current filters give you these five at the same time. So we thought, okay, let's try and do this and give the people kind of like the holy grail
Starting point is 00:10:58 of filters. Nice, yeah, that's what I was thinking, you get all these five things, that's the five special ingredients to make some really good sauce there. And so yeah, just to recap really quick, there's nothing out there that gave any sort of guarantees on the false positive rate, and if the ranges got too big the filters basically kind of got redundant because the false positive rate went to one. Basically there was nothing that could kind of
Starting point is 00:11:23 handle variable-size or large ranges. Integer keys only, which again feels like an incredibly restrictive API, right? Like, I imagine a lot of use cases, a lot of data, kind of have strings as keys, so yeah, it'd be nice to have that. And lastly, static structures, right? So, yeah, data is dynamic, right? So people want to insert and delete things. I mean, it's often, I guess, an assumption in a lot of research works. Oh, yeah, just assume things are static, because you simplify the game a little bit, right? But yeah, in reality, things are dynamic, right? So yeah, and we all want to do that as fast as possible, right? And achieve all of these goals while having really high performance.
Starting point is 00:11:56 So yeah, so that's, that was the goal then. And so I guess maybe now should we talk a little bit, we kind of covered off there, kind of what the existing solutions are like and how they perform and where they fall short. So maybe we don't need to go into them in too much depth then. So let's switch on to Diva then and talk about how you went about achieving the holy grail. Right.
Starting point is 00:12:17 Okay. So this whole story with Diva started from this theorem from the theory community back in 2014. So basically the theorem is quantifying how, like if you want to have a range filter that guarantees a given false positive rate,
Starting point is 00:12:36 how much memory you need to use. And if you look at it, you'll see that, okay, if I want to support string keys, the theorem is basically telling me I have to store the actual keys. There's nothing interesting I can do. So in some sense, when you hear this statement, you feel like, okay, the problem is kind of intractable. I can't really do much here. Well, then we looked at it and we saw that, okay,
Starting point is 00:12:59 this theorem is actually only talking about the worst case scenario, which doesn't seem to be that common in reality. So we thought to ourselves, okay, what can we say about the common case that we do see in reality and in practice? Maybe we can do better for those. And that brought us to this assumption of,
Starting point is 00:13:24 well, most of the time, our data follows some nice distribution. So maybe if we assume that our data comes from this nice, smooth distribution, then we could do filtering better. So that's kind of like the starting point. We wanted to bypass this theorem or this lower bound to achieve space efficiency while... Just going to ignore the theorem committee. By the way, the theorem committee sound really kind of a cool sort of, I don't know, quite a scary group of people. The theorem committee, I don't know, yeah.
Starting point is 00:13:55 Yeah, the people in theoretical computer science. Cool. Sorry, you continue. I interrupted. No, that's okay. But yeah, we had, I guess we had this observation and we thought, okay, let's suppose our data has a distribution. And kind of the first step we took after making this assumption was, okay, so I want to leverage this distribution somehow. So what I do is I take some samples from this distribution to kind of get an idea what the distribution looks like and learn it in a sense. And that's basically just, this corresponds to saying that, okay, if I have a data set with some number of keys, I'm just going to take like every thousandth key, that's going to be my sample set.
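The every-thousandth-key sampling can be sketched in a couple of lines. The key format and spacing here are made up for illustration (quadratically spaced keys stand in for a smooth, non-uniform distribution):

```python
# Toy dataset: 10,000 fixed-width string keys with smoothly varying gaps.
keys = sorted(f"key{i * i:09d}" for i in range(10_000))

# Keep every 1000th key as a sample; the samples sketch the distribution
# of the whole set at 0.1% of its size.
samples = keys[::1000]
assert len(samples) == 10
assert samples == sorted(samples)  # samples inherit the sort order
```

Between any two consecutive samples sit roughly a thousand omitted keys, and it is those omitted keys that the smoothness assumption treats as near-uniform.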
Starting point is 00:14:45 This is going to give me a really good idea of what the distribution looks like. And based on that, I can do some more interesting things. In particular, if your distribution is smooth, smooth enough, let's say, in between each two samples you pick, the keys are going to have an almost uniform distribution. If you look at like the distribution function, in between each two samples, there's going to be like an almost straight line connecting them. That corresponds to an almost uniform distribution. So this means that the keys that you kind of omitted between these samples were almost uniformly distributed. So we're going to leverage this property to give you a really space
Starting point is 00:15:25 efficient encoding of the keys in your set. So the way we would do this is, well, we're first going to store the samples that we picked out in some data structure. Let's say it's just a binary search tree. In our work, we use something known as a y-fast trie. This is like a cute data structure from the 80s. But that's besides the point. It just gives you like a slightly faster lookup time than just the binary search tree. Well, let's say we have a binary search tree over our samples.
Starting point is 00:15:57 Now, this gives me access to that nice approximation of the distribution. Then what I do for the keys in between any two consecutive samples, because I know they're uniform, I can lossily compress them and store them in these data structures we like to call infix stores. So the compression, the way it goes is, any two consecutive samples I take, well, these two samples have some longest common prefix, right? Any key that I take in between these two samples in lexicographic order is also going to have the same longest common prefix. So because I'm already storing that longest common prefix in my binary search tree,
Starting point is 00:16:41 I don't have to store it in these infix stores too. So I can just remove that longest common prefix from each key, then I can take like a small patch of bits at the middle that kind of differentiates between the keys that are in between the two samples. And because I had this uniformity property, that small patch of bits, which we call an infix because it's in the middle of the key, that small patch is going to be almost uniformly distributed. It's going to inherit the uniformity of the key itself. So I have kind of like a small summary of the key, which, well, it's part of the key itself, so it's order preserving, and it's almost uniform.
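A minimal sketch of that infix extraction — drop the samples' longest common prefix, keep a small patch of the remaining bits — might look like this. This is illustrative only; Diva's actual encoding differs in detail, and the sample/key strings are made up:

```python
def longest_common_prefix(a: str, b: str) -> str:
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

def extract_infix(key: str, lo_sample: str, hi_sample: str, infix_bits: int = 8) -> int:
    """Drop the samples' shared prefix from the key, then keep a small
    patch of the leading bits of what remains as the 'infix'."""
    lcp = longest_common_prefix(lo_sample, hi_sample)
    suffix = key[len(lcp):].encode()
    # Pad/truncate to 4 bytes so every suffix maps to a 32-bit integer,
    # then keep only the top `infix_bits` bits.
    as_int = int.from_bytes((suffix + b"\x00" * 4)[:4], "big")
    return as_int >> (32 - infix_bits)

# Keys that sort between the two samples keep their relative order
# even after this lossy compression:
keys = ["apple", "apply", "apricot"]
infixes = [extract_infix(k, "appetite", "aptly") for k in keys]
assert infixes == sorted(infixes)
```

Because taking leading bits of the suffix is monotone, the compressed infixes sort the same way the keys do — the order-preserving property discussed next.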
Starting point is 00:17:27 So in some sense, the uniformity kind of makes it look like a hash of the key. Because that's what a hash is in reality. It's like a small summary of the key that looks really random. In our case, it also looks random, but the additional benefit is that it's actually order preserving. So we summarize these keys in between the samples as these infixes. And we're going to leverage this hash-like property of the infixes to store them in something that looks like a hash table. So the way this is going to go is, each infix,
Starting point is 00:18:02 we're going to split it into an address and a remainder, as we like to call it. This procedure is known as quotienting, in a sense. The address is telling you where in the hash table you want to store the infix, and in that position, you're going to store the remainder. So what you get in the end is an order-preserving hash table that gives you
Starting point is 00:18:29 constant time access to the infixes in between two samples. Nice, cool. So once we've got this data structure, can you walk us through kind of how we go the other way then? So I think it'd be quite nice
Starting point is 00:18:42 to go through how a range query works on top of this then. Sure, of course. So a range query is actually really simple to handle.
Starting point is 00:18:56 So you had this binary search tree, right? And then in between each two consecutive samples, I'm assuming that the samples are ordered lexicographically, right? In between each two consecutive samples, I have a small infix store. So the way a range query would go is, I have my range. I'm first going to look at this binary search tree that I have. I'm going to check if there's any key that is included within my range. Because my binary search tree is storing the actual
Starting point is 00:19:27 samples themselves without any loss, I can do this check exactly. And if I find something that's in the range, well, the answer to my query is a true positive. So I just return that and I'm happy. So in some sense, the interesting case is when I don't have any samples within my range. In that case, well, if you think about it, you'll see that, okay, this means that the endpoints of the range query are both in between two consecutive samples, because otherwise they would have contained some sample.
Starting point is 00:19:57 So this means that, in other words, this means that the endpoints of the range query are fully included within a particular infix store. So then what you do is, you take the endpoints of the query, you kind of do the same compression algorithm that we did for the keys in between the two samples to convert these endpoints into endpoint infixes. And then look at the order-preserving hash table to see if there's anything in between these two endpoint infixes.
Starting point is 00:20:30 If we find something, the answer is, well, the query is a positive. And if we don't, the query is a negative. We get false positives because this compression scheme of removing like the longest common prefix and just removing some other suffix to get the infix, this is lossy. So there is some chance that you might accidentally match something that wasn't exactly the key you were hoping for. Nice, cool. So, as you've been talking through this example, then, I'm thinking, okay, I can see how this is all constructed and works nicely for static data. How do we then go and make this dynamic? Like, how do we then handle inserts and deletes? And yeah, expansions as well.
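The two-step lookup just described — an exact check against the samples, then an approximate check against one infix store — can be sketched like this. It is a toy model: `toy_infix` just takes the first byte of the key (a stand-in for the real lossy compression), and queries are assumed to fall inside the sampled key space:

```python
import bisect

def toy_infix(key: str, lo_sample: str, hi_sample: str) -> int:
    # Stand-in for the real compression: the first byte of the key.
    # (Fine here because the toy samples share an empty common prefix.)
    return ord(key[0])

def may_contain_range(samples, infix_stores, lo: str, hi: str) -> bool:
    """Sketch of the two-step lookup. `samples` is a sorted list standing in
    for the binary search tree; `infix_stores[i]` holds the infixes of keys
    between samples[i] and samples[i+1]."""
    # Step 1: exact check -- does any sample fall inside [lo, hi]?
    i = bisect.bisect_left(samples, lo)
    if i < len(samples) and samples[i] <= hi:
        return True  # guaranteed true positive
    # Step 2: both endpoints sit between the same pair of consecutive
    # samples, so compress them like the stored keys and compare infixes.
    lo_fix = toy_infix(lo, samples[i - 1], samples[i])
    hi_fix = toy_infix(hi, samples[i - 1], samples[i])
    return any(lo_fix <= f <= hi_fix for f in infix_stores[i - 1])

samples = ["banana", "date", "plum"]
stores = [[toy_infix(k, "banana", "date") for k in ["cherry"]],
          [toy_infix(k, "date", "plum") for k in ["grape", "mango"]]]
assert may_contain_range(samples, stores, "fig", "hazelnut")         # hits "grape"
assert not may_contain_range(samples, stores, "nectarine", "orange")
```

The false positives come entirely from step 2: two different keys can compress to the same infix, so a match there is only a "maybe".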
Starting point is 00:21:10 And kind of what are expansions? I want to ask as well, I saw that mentioned in the paper. Expansions? What do they mean? Sure, sure. So, yeah, let me start with how you can do inserts and deletes. Then I'll get there. So inserts and deletes are quite straightforward, well, they're not straightforward. They're not straightforward for you now, I'd say, yeah. Right. So yeah, inserts and deletes, the way they work is, let's say you get an insert, right? What Diva does is when it sees the new key, it's going to flip a coin, and with some probability it's going to take that new key
Starting point is 00:21:52 to be a sample. So it's going to insert it into the binary search tree of samples that we had. If it does this, then this means that it kind of lands in an infix store. It has to split the infix store into two halves. So now, this is sort of similar to how a B-tree works, if you're familiar with it.
Starting point is 00:22:13 So you insert a new thing and it splits like the infix store into two halves, which is just like a smaller subset of the problem. So that's one scenario if my coin tells me to insert it as a sample. The other scenario is my coin tells me to not insert it as a sample. So this means that I just go to the relevant infix store for that key, and I just insert it as an infix. Really simple. And deletions are kind of the same. You follow the same procedure.
Starting point is 00:22:43 Well, you don't have the coin flip anymore. But if you find a sample in your tree that matches your deletion, then you just remove that and merge the two infix stores. And if you don't find the sample, then you just go to the relevant infix store
Starting point is 00:23:01 and find an infix that matches your key the best, in some sense, and you remove the infix that matches it the most. Nice, cool. Yeah. And expansions? Right. So expansions.
Starting point is 00:23:18 So an expansion is basically just a fancy word. for saying that I want my data structure to resize itself as I insert more stuff. So you can imagine in the insertion procedure that I mentioned, because I'm doing this coin flip procedure, there is a chance that I, for a long time, I don't insert any samples. So I'm going to have an infix store that's kind of big. It's going to get full.
Starting point is 00:23:45 So I have to reallocate it in some sense and give it more memory so that it can store the extra data. That reallocation process is known as expansion. And the cute thing that we do in Diva, that's sort of different from how other filters do expansion, is, well, other filters do this, I guess, classic procedure of expanding by doubling the size of the array and then inserting your new keys.
Starting point is 00:24:12 We do it by expanding fractionally, in some sense. We increase it by a little bit so that you don't have this wasted space from doubling. Because if you double, then there is a chance that half of the space that you're using for your filter is going to be empty, which is not great for something that's supposed to be really space efficient. Yeah. Cool.
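The contrast between doubling and fractional expansion can be sketched with a toy store. This is illustrative only — the class name and growth factor are made up, and the real structure manages compact bit-packed slots rather than a Python list:

```python
class InfixStore:
    """Toy store that grows fractionally instead of doubling."""
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.infixes = []

    def add(self, infix: int) -> None:
        if len(self.infixes) >= self.capacity:
            # Expansion: grow ~20% (plus one to guarantee progress) rather
            # than 2x, so only a small slice of the allocation sits empty.
            self.capacity = max(self.capacity + 1, int(self.capacity * 1.2))
        self.infixes.append(infix)

store = InfixStore()
for i in range(100):
    store.add(i)
assert len(store.infixes) == 100
assert store.capacity < 120   # doubling from 4 would have landed on 128
```

With doubling, up to half the allocation can sit empty right after an expansion; with ~20% growth, at most about a sixth does — the space-efficiency point made above.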
Starting point is 00:24:33 I guess now it might be quite a nice time to then talk about evaluation and results and how well Diva performs. So I guess the first thing would be to ask, how did you go about evaluating Diva? What was the setup looking like? Who are you comparing it against? And yeah, so let's take that question first. Then you can give us some clickbait results that we can chop up and make some shorts out of.
Starting point is 00:24:55 Right. Awesome. Okay. So the overall setup is we do both microbenchmarks and an integration benchmark. For the integration, we're integrating the range filters that are integrable, in some sense, into WiredTiger, which is sort of like the back end of MongoDB. So let me start with the microbenchmarks. For these benchmarks, we are comparing against pretty much every single range filter that's out there.
Starting point is 00:25:27 The ones that are important to note, I think, are Surf. Surf is like the first range filter that was designed. This is the only range filter that supports string keys, variable-length keys. And Memento, which is our previous work. This is the only range filter that supports dynamic updates. Okay. So when compared to the other range filters, in general, Diva gives you almost the best false positive rate. Like, actually, it gives you the best false positive rate that's stable with the range query size.
Starting point is 00:26:04 So no matter what range query size you feed it, it's always going to give you a really stable and constant false positive rate that's better than anything else. There are like one or two filters that are slightly better than Diva in terms of the false positive rate. Well, those guys are actually much, much slower in terms of query time, and they're not updatable. So you can't do any inserts and deletes. So that's like the setup against ranges of different size. If you care about string keys, well, then the only relevant range filter is Surf,
Starting point is 00:26:42 and compared to Surf, Diva gives you like one to two orders of magnitude better false positive rate at the same memory budget. Yeah. So that's like the benefit you would get. And it's really much faster than Surf. Anyhow, it is much faster. How much faster is it then? Give us some numbers.
Starting point is 00:27:03 So it kind of depends on the type of data that you're working with. But in theory, what can happen is, Surf, morally speaking, Surf is like a trie, so it has to go down this multi-level trie, and the number of levels that are in the trie
Starting point is 00:27:23 is basically the length of the strings in your workload. So that means for each level of the trie, you're going to incur a single cache miss, which is going to scale pretty poorly in the length of the key. Diva in comparison, because it's using the y-fast trie thingy, it's going to take actually logarithmic time in terms of the length of the key. So it's exponentially faster in some sense
Starting point is 00:27:46 in answering. Nice, cool. Yeah. Do you want to talk about, should we talk about the WiredTiger? Let me, I'll let you continue. Sorry,
Starting point is 00:27:53 I interrupted. Yeah. There is one more thing that I want to. Keep going. Yeah. Yeah. So yes. So against Memento filter as well,
Starting point is 00:28:02 which is the only other dynamic one. Well, now both Diva and Memento can expand. So when compared to each other, you'll see that, okay. Performance-wise, Memento filter is a little bit faster. Well, yeah, it's like faster by 1.5x, I think, if I recall correctly. And the reason for that is, well, because it doesn't have this extra binary search tree that it has to go through to get to where it's supposed to go. It's just using hashing.
Starting point is 00:28:32 But even though it's slightly faster, it's much, much worse in terms of false positive rate when you do long range queries, even as you're expanding the dataset. Before we talk about the WiredTiger implementation, I'm very intrigued about that. How is the performance affected by the kind of proportion of inserts and deletes going on as well? Like, what's the relationship between, I guess, the writes and the reads? Right. So they don't really impact each other that much. Because the scaling, yeah.
Starting point is 00:29:08 If you were using a normal binary search tree in Diva, then they would impact each other, because then with more inserts, the binary search tree is going to get bigger and you're going to be slower in your search. Yeah. But the y-fast trie actually gives you an access time that's independent of the number of keys. Okay. Yeah.
Starting point is 00:29:29 In that sense, they're both awesome. Yeah, very nice property. Cool. So let's talk WiredTiger then. So first of all, tell us a little bit about the implementation, if and how easy it was to implement it in WiredTiger. And then, yeah, tell us how it improves what was already existing in WiredTiger as well.
Starting point is 00:29:47 Yeah, sure. So the implementation itself was actually very easy, because we give you like a really nice header-only implementation of Diva you can just plug and play, and we just had to place it on top of the interface for accessing the B-tree in WiredTiger. So again, WiredTiger is like a key-value store that's formed as a B-tree.
Starting point is 00:30:12 Just before accessing the B-tree to get your desired record, you just have to ask Diva first to see if it exists. If it exists, it tells you go access the B-tree, and if it doesn't, it tells you don't bother. So it was really easy to integrate it. We just put Diva as like an upper layer over WiredTiger. And compared to the things that existed beforehand, well, Memento filter was really the only other
Starting point is 00:30:42 range filter that was integrable with WiredTiger. Because now you have this dynamic database, which you're doing inserts and deletes to. So the filter that you're using must also be dynamic. It must also support inserts and deletes. So really the only other filter that fit the bill was Memento. Compared to that, well, if the workload was similar, Diva actually achieved the same, yeah,
Starting point is 00:31:11 If you were working over integer keys and you didn't have long range queries, let's say, it achieved the same false positive rate as Memento, and it gave you pretty much the same performance. Because when you're using a database system, the bottleneck really comes from, well, most of the time, it comes from the disk accesses you're doing, not the filter time itself. But I guess the core benefit is that now you can also do string keys, which Memento filter doesn't even work with. Or if you have long-range queries, which Memento filter can potentially be extended to support, Memento is going to get a really bad false positive rate,
Starting point is 00:31:54 which is going to translate into a very slow system, which is just going to do redundant accesses to disk, whereas Diva is going to let you filter everything out irrespective of the range size. Nice, cool. So for all our listeners who have always wanted to do range queries over variable-length keys but haven't been able to, and now can, is Diva actually available to users of MongoDB? Can they actually use Diva? Is it publicly available, I guess, is what I'm asking. Yeah, of course, Diva is open source, so you can just go to my GitHub page and pick it up. I tried to make it as user-friendly as possible. It's just a
Starting point is 00:32:36 header-only file. You include it, plug and play, you just do your workloads and you're happy. It is also thread-safe, actually, that's a new addition. So if you have a multi-threaded system, you should be happy with that as well. Fantastic. We'll put a link to that in the notes so the interested listeners can go and have a play. Yeah, so let's change gears a little bit now and talk about surprises. So whilst you were working on the project, and with range filters and Diva, was there anything that sprung up, any counterintuitive findings or results that really challenged your initial assumptions about range filters? Yeah, yeah, for sure, there were. So the biggest one was when we started this project, we had this conception that basically every single workload with
Starting point is 00:33:23 variable-length keys had a smooth distribution. In particular, natural language, we assumed, had a smooth distribution. But when we got into evaluating it, actually, we saw that, okay, there are some specific natural language data sets that don't seem to be as nice in terms of the distribution. They're not as smooth as you would expect. For those, it actually turns out that both Diva and SuRF, if you just use them as is on these datasets, you're not going to get a great false positive rate. But we also observed that, okay, the reason why we are getting a bad false positive rate on these datasets is actually because there's a lot of redundancy in the English language, in the sense that there are specific combinations of characters and words that are really common. And if you think about it, okay, the way I'm deriving the infix is I'm just removing the longest prefix and I'm taking some bits,
Starting point is 00:34:32 so that redundancy is going to translate into the infix that I take at the very end. So what that's going to do is it's going to make the infixes less suitable for differentiating the keys in some sense. And that's going to mess with the false positive rate. So where am I going with this? I'm saying that I have this redundancy that's causing the issues. So why not do some compression algorithm to remove the redundancy? And it turns out we did an order-preserving compression of the keys.
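One way to picture what an order-preserving transform buys you is a rank (CDF-style) remapping over a key sample. This is only an illustrative sketch, not the compression scheme from the paper: it preserves key ordering while turning a skewed distribution into a roughly uniform one.

```python
import bisect


def build_smoother(sample_keys):
    """Return an order-preserving map from keys to ranks over a sample."""
    sorted_keys = sorted(set(sample_keys))

    def smooth(key):
        # Rank of the key among the sample; the output is roughly
        # uniform even when the input distribution is heavily skewed.
        return bisect.bisect_left(sorted_keys, key)

    return smooth
```

Order is preserved (k1 < k2 implies smooth(k1) <= smooth(k2)), which is the property a range filter needs any such pre-processing step to keep.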
Starting point is 00:35:07 And it turns out when you do that, the distribution of the keys in your workload is going to get much smoother, and you can use Diva over that newly modified data set. Nice. That's a good segue into limitations, because I was going to ask you about this assumption of having a smooth data distribution there, and kind of say, well, how likely is that in reality? How often do we see that? But you can actually often, with the technique you just mentioned there with compression, take data that isn't smooth and kind of make it smooth, right? But yeah, I guess the question that still stands is: are there any limitations, any scenarios in which I shouldn't be using Diva, or Diva can still be improved, maybe
Starting point is 00:35:47 in certain places? Yeah, okay. So, yeah, the core limitation is, as you mentioned, if you have a non-smooth distribution, you have to do some pre-processing in some sense to make it smooth. That, I think, is an open research question in some sense. We have to study it a bit more to see what type of pre-processing we have to apply that best smooths out the distribution. There is also another limitation, in the sense that, okay, your range queries might be constructed in such a way that you care about some range that's really close by to a given key. So you can imagine you have a key that's, let's say, 1,000, and your range is from 1,001 to 1,002.
Starting point is 00:36:41 It's really close to the key, but it doesn't include it. This type of workload, we call it the correlated workload in the literature. For this one, because of the way Diva only takes the infix and throws away the lower-order bits of the keys, it's not really able to help you in filtering these types of queries out. But the thing is, this is kind of a philosophical point. If the ranges that you care about are really similar to the keys, that in some sense kind of implies that you know a lot about your dataset. And if you know a lot about your dataset, then it's kind of questionable that you want to issue range queries that are kind of empty.
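You can see why throwing away the lower-order bits hurts on correlated queries with a tiny numeric example. Treating "dropping the low-order bits" as a plain right shift is purely illustrative here; Diva's actual infix derivation is more involved.

```python
def infix(key, low_bits=4):
    # Keep only the high-order part of the key, discarding the low
    # bits, roughly what an infix-style filter stores about each key.
    return key >> low_bits


# The stored key 1000 and the empty query range [1001, 1002] collapse
# to the same infix, so the filter cannot rule the query out: a
# guaranteed false positive on this correlated workload.
```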
Starting point is 00:37:30 And you know what the dataset is, so why not just iterate over the stuff that you know exists? So in some sense, this workload that I mentioned, this correlated workload, may or may not really be something that you see in practice. To me, it seems like the most reasonable scenario where you would use a range filter is when you don't know a lot about the dataset, so the range queries you issue are likely to be empty. In that scenario, Diva is perfectly fine. You could use it. But if for some reason you happen to know a lot about your dataset and you want to issue this range query that's really close by to a key, it's not going to help you there. You have to use something similar to Memento to filter those out.
Starting point is 00:38:21 And by the way, this is like the worst-case scenario that the lower bound that I mentioned, the theorem that I mentioned, is kind of exploiting. So in some sense, you can't really do better than the memory usage of Memento to handle these. Do you feel like you could have both Memento and Diva in the same system? I don't know how you would basically do it, but could both coexist in the same system, and depending on your query you use one or the other, right? And I don't know whether the selection of that would need to be sort of human-driven, or you could have some sort of optimizer or something that can say, okay, you should probably
Starting point is 00:38:58 use this one or use this one, essentially. Anything like that, a way they can coexist? I think so, yeah. I think, well, the simplest solution might be to just put a threshold on the size of the range. If it's smaller than something, use Memento. If it's bigger than something, use Diva. Sure, some simple heuristic like that would work. Yeah. Yeah.
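The threshold heuristic suggested here could be sketched like this. The filter classes and the `may_contain_range` method are hypothetical, just to show the routing logic, not the real APIs of Memento or Diva.

```python
class StubFilter:
    """Hypothetical filter that records how often it was consulted."""

    def __init__(self, name):
        self.name = name
        self.calls = 0

    def may_contain_range(self, lo, hi):
        self.calls += 1
        return True  # a real filter answers "maybe" or "definitely not"


def hybrid_may_contain(short_range_filter, long_range_filter, lo, hi,
                       threshold=16):
    # Simple heuristic: short (possibly correlated) ranges go to a
    # Memento-style filter, long ranges to a Diva-style filter.
    chosen = short_range_filter if (hi - lo) <= threshold else long_range_filter
    return chosen.may_contain_range(lo, hi)
```

The threshold would presumably be tuned per workload; the point is only that the two instances can sit side by side behind one dispatch function.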
Starting point is 00:39:17 I think, yeah, I actually thought about this problem for a while, and I don't think you can do better memory-wise than just having a separate Memento instance and a separate Diva instance and using them together. That seems to be the optimal you can achieve. Nice. Cool. I guess then, where do you go next
Starting point is 00:39:42 with Diva? Where's the future? What does the future hold for it? Yeah, that's a good question. I think the main thing that we should research in the future is how you can smooth out non-smooth distributions. That's the biggest thing, in my opinion, that you can work on.
Starting point is 00:40:03 I am actually working on some techniques to address that, so hopefully we'll get that at some point. So just on that, like you said, this very fertile area for research is finding how we can take data sets and make them smooth. Do you have a sort of handle on how unsmooth data actually is out in the wild? Or is it just application-sensitive or context-dependent?
Starting point is 00:40:32 It's kind of really hard to say. Well, most people that I've talked to that are actually interested in using range filters seem to have data that is quite smooth. I'm not entirely sure about, like, the average workload across all industries, of course. But the people that I've talked to seem to have smooth distributions mostly. And when I think about this problem, it seems to me, okay, usually your keys are maybe names, natural language, or perhaps email addresses. Those seem to me to be non-smooth, but they're very structured, and the structure is very well understood. So we should be able to do some nice compression to make them work as well.
Starting point is 00:41:23 I don't think it's that bad, but of course you can have a weird composite key that's not smooth. So that's completely domain-specific; you have to do a specific compression for that. Some adversarial case. Yeah, they always exist, right? Cool. So let's talk about impact then, and practical relevance. My first question is, I guess,
Starting point is 00:41:47 how much impact have you seen Diva have so far since it was published? Right. So it is a little bit early, I think, for us to see some impact. But I have seen a lot of interest, and we are actually in talks with some companies about integrating Diva into their systems. I'm hoping to see Diva show up in a lot of systems, mainly because this LSM-tree structure that I mentioned is like the killer application for a range filter.
Starting point is 00:42:21 And it's extremely widely used across the industry. One really famous example is RocksDB. You can just plug Diva into RocksDB and you get all of the benefits out of the box. Yeah. And we do actually have an internal project at the University of Toronto that does this, but it's not published yet, so that's like internal knowledge for you. Yeah, yeah, cool, sweet. So I guess that brings us nicely on to the last word then. There are two things here. The first is, what should practitioners take away from this
Starting point is 00:43:00 work, and secondly, what should researchers take away from this work and your appearance on the podcast? Right. So for practitioners, I think the nice message is that you can actually have data structures, and I guess techniques, that are grounded in theory, that give you really nice guarantees, and that are also useful in practice. I mean, going with heuristics to implement whatever data structure you want for your system is a fine choice, and oftentimes it's the correct choice given the time budget you have. But if you really want something that's never going to break under any circumstances, I believe that it's possible to design structures that give you strong guarantees like that. Diva is one example of that. I would like to encourage practitioners to think about this a bit more, maybe, to consider having structures that give them strong guarantees
Starting point is 00:44:10 so that they don't have to deal with the headaches of things breaking later on. Nice. That's for the practitioners. For the researchers, I think the nice message I have to give is, again, tied back to the lower bound that I mentioned at the beginning. The statement of the lower bound kind of seemed pessimistic, right? It seemed like you couldn't do anything good in the space and everything was done, so why even bother? But a lot of the time, when you look at these lower bounds,
Starting point is 00:44:43 you see that they're actually leveraging properties of the problem that are really hard, and they only appear in the worst case. In reality, a lot of people don't deal with those worst cases. So I would actually encourage people to think about what properties we have of the data we care about in practice that can help us in designing these types of structures. It could be an algorithm, it could be a data structure, and in designing any one of these, I think it's nice to find those specific properties and tie the algorithm to that property to get better performance out of the entire thing. So this gives you a broader spectrum of problems you can work on if you go beyond this worst case, in some sense. Nice, yeah, a great message to end on there. One last question, actually. I said that was the last word, but one last question: Diva. You need to tell us about the name, because naming's important. Why is it called Diva? Yeah, that's like a,
Starting point is 00:45:47 it's like a weird acronym I came up with on the spot. So Diva, the D and the I come from "dynamic". If you look at the title of the paper, the D is from "dynamic" and the I is from the end of the word, and the VA is just for "variable". So it's like concatenating pieces, and you get Diva. Oh yeah, it's a bit of a diva as well, which is pretty cool, right? So yeah, you've got that going for it as well. Great stuff. Well, we'll end things there then. That's great. We'll put links to everything you mentioned in the show notes as well, so the listeners can go and connect with you. Are you on social media anywhere? Where's best to find you? Yeah, that's a good question. I'm not actually on social media.
Starting point is 00:46:28 Maybe like the homepage or something, we can link maybe. So yeah, we can drop that in there. Yeah, I do have a GitHub page. So if you're interested in contacting me, please check that out. I will answer emails. And yeah, I'd be happy to chat. Great stuff, yeah. And go play with Diva.
Starting point is 00:46:45 That's the last thing we want to say. Yes, cool. Please do. Great stuff. Thank you very much.
