Disseminate: The Computer Science Research Podcast - Marco Costa | Taming Adversarial Queries with Optimal Range Filters | #58

Episode Date: October 14, 2024

In this episode, we sit down with Marco Costa to discuss the fascinating world of range filters, focusing on how they help optimize queries in databases by determining whether a range intersects with a given set of keys. Marco explains how traditional heuristic range filters often suffer high false positive rates and slow query times, especially when dealing with adversarial inputs where queries are correlated with the keys. He walks us through the limitations of existing heuristic-based solutions and the common challenges they face in maintaining accuracy and speed under such conditions.

The highlight of our conversation is Grafite, a novel range filter introduced by Marco and his team. Unlike previous approaches, Grafite comes with clear theoretical guarantees and offers robust performance across various datasets, query sizes, and workloads. Marco dives into the technicalities, explaining how Grafite delivers faster query times and maintains predictable false positive rates, making it the most reliable range filter in scenarios where queries are correlated with keys. Additionally, he introduces a simple heuristic filter that excels on uncorrelated queries, pushing the boundaries of current solutions in the field.

SIGMOD '24 Paper - Grafite: Taming Adversarial Queries with Optimal Range Filters

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the computer science research podcast. Today is another episode of our Cutting Edge series, and I'm going to be joined by Marco Costa, who will be talking about his recent paper from SIGMOD, Grafite: Taming Adversarial Queries with Optimal Range Filters. Marco is a research engineer at Huawei. Welcome to the show, Marco. Hey, thank you, Jack. Thank you for inviting me. Yeah, I'm really looking forward to this chat today. So we always start off on the podcast by getting you to tell the listener more about your story and how you ended up being interested in database research. So yeah, tell us more about
Starting point is 00:00:59 yourself, Marco. So why database research? Our research on Grafite is mainly a matter of algorithm engineering, right? But since this is a field where
Starting point is 00:01:19 the main purpose of a range filter, of what Grafite is, is for databases, it's mainly used in databases. And so when I started this work as my master's thesis back in 2022, I had no idea about database research. I was just a master's student, and then this work drove me straight into it. It started as a pure algorithm engineering work, an algorithm engineering thesis. But since this is a field that is really used mainly in databases,
Starting point is 00:02:09 that's what I had to start studying, trying to implement Grafite in modern databases. And that's basically how it started. Nice, yeah. So you kind of found a back doorway into databases there, but now you've got the bug, right? That's it. Databases for life now.
Starting point is 00:02:31 Cool. So we're going to be talking about Grafite today. So let's set the listener up with some context for this chat, because the title has some cool words in there, like adversarial queries and optimal range filters. So let's kick things off then. Tell us, what are range queries? And yeah, what the heck is an adversarial query?
Starting point is 00:02:51 Basically, range queries are simple. You have a set of keys and you want to check if a range intersects with those keys. So you have two endpoints, A and B, and you want to see if there is any key of your set that falls between them. Approximate range queries are a bit different. They can be seen as an extension of what Bloom filters do, from point queries to range queries. What a Bloom filter does is give you an approximate answer where you don't have false
Starting point is 00:03:34 negatives, but you can have false positives in the answer of the filter. So it can tell you, with a certain grade of error, whether a key belongs to the set or not, but it cannot give you a false negative answer. So if the filter tells you that the key is definitely not in the set, then it's definitely not in the set. You cannot have any false negative probability. And approximate range queries are the same, basically, just on ranges rather than on single points. What are adversarial queries? Well, when we started evaluating Grafite and the other state-of-the-art range filters,
Starting point is 00:04:33 what we saw is that, before Grafite, what we had were mainly heuristic range filters. So what is a heuristic range filter? It's a range filter that answers approximate range emptiness queries, but cannot give you a bound on the false positive rate.
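To make the one-sided error contract concrete, here is a toy sketch (entirely mine, not from the paper): keys are stored reduced modulo m, standing in for hashing into a smaller universe. Like a Bloom filter, it may wrongly answer "maybe present", but it never wrongly answers "definitely absent".

```python
# Toy one-sided-error membership filter: keys stored reduced modulo m.
# False positives are possible (two keys can share a residue);
# false negatives are impossible (a stored key always keeps its residue).
class ModFilter:
    def __init__(self, keys, m=256):
        self.m = m
        self.residues = {k % m for k in keys}

    def may_contain(self, key):
        # True  -> key is *possibly* in the set (could be a false positive)
        # False -> key is *definitely not* in the set
        return key % self.m in self.residues
```

For example, with keys {5, 300} and m=256, querying 261 returns True even though 261 is absent (261 % 256 == 5), while querying 6 correctly returns False.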
Starting point is 00:05:00 And since these range filters cannot provide you any guarantee, you can build an adversarial workload that basically makes them useless, because it drives the false positive rate straight up to one. So basically you have no filtering. And you can imagine what we use these data structures for: since they use very low memory, they can be kept in main memory, and you want to use them to guard access to the disk, to avoid useless I/Os on disk. So when we have a query on our system, we first query the range filter. If it tells us that, yes, the range or the key is probably there, then you go to the disk and you perform the actual query.
Starting point is 00:06:15 And so if you can build an adversarial workload on top of range filters, you can build certain kinds of queries that drive the false positive rate straight to one, and you're basically providing no filtering. You're giving your adversary, or maybe not even an adversary, free I/Os on your system. And we also saw that the adversarial workloads we have for range queries are workloads where the range is very close to the input keys, which is pretty normal, no?
Starting point is 00:07:06 Because when you want to do a range query, ideally you are going to query a range that is close to your input set, not completely on the other side of the universe. So the closer this range gets to the input keys, the more these heuristic range filters just provide no filtering at all.
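The disk-guarding pattern Marco describes can be sketched with a couple of stand-in classes (all names here are mine, for illustration; the filter is exact rather than approximate, to keep the sketch short):

```python
from bisect import bisect_left

class SortedRangeFilter:
    """Exact stand-in for a range filter: answers 'may [lo, hi] intersect the keys?'."""
    def __init__(self, keys):
        self.keys = sorted(keys)

    def may_intersect(self, lo, hi):
        i = bisect_left(self.keys, lo)           # first key >= lo
        return i < len(self.keys) and self.keys[i] <= hi

class FakeDisk:
    """Counts how many times we actually touch 'disk'."""
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.ios = 0

    def scan(self, lo, hi):
        self.ios += 1
        return [k for k in self.keys if lo <= k <= hi]

def lookup(filter_, disk, lo, hi):
    if not filter_.may_intersect(lo, hi):
        return []                                # definitely empty: no I/O paid
    return disk.scan(lo, hi)                     # possible false positive: wasted I/O
```

Every query the filter rules out is a disk access saved; every false positive is a wasted `scan`, which is exactly why a workload that drives the false positive rate to one removes the benefit entirely.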
Starting point is 00:07:35 So that's why we wanted to provide a solution that we call robust. And so what Grafite does is: you provide a false positive rate, you provide your input set, and that's it. You have a certain, measurable way
Starting point is 00:08:05 to know what's going to be your false positive probability when you perform a query. Nice, cool. So I'm just going to run through that and try to recap it, Marco. So going back,
Starting point is 00:08:18 way back to the start of your answer there, when we were actually introducing what range queries are. To give, I guess, a concrete example, and you can tell me if this works: obviously, my go-to example is footballers. So I've got a list of every Premier League player and their age, right? That's my kind of set of values. And then I guess my range query would
Starting point is 00:08:37 be, go and get me every player that is between the ages of 21 and 23, right? Is that basically the kind of query you would potentially run over that set of inputs? Would that be a good example? We run the range queries on keys, straight on the keys. So imagine
Starting point is 00:08:57 your input set is the date of birth of football players. So you want to know if a football player was born between 1997 and 1998. That's basically a range query. Cool, gotcha.
Starting point is 00:09:30 And then the problem here is, to kind of get at what the false positive is: it's always going to tell you when something is definitely not in that range, but sometimes it's going to say, hey, this is in there, when it's potentially not in there. That's what we mean by false positives. I always get my false positives and false negatives the wrong way around, Marco. See it this way.
Starting point is 00:09:46 When you perform a query, it can tell you it may be in the set, it may be true, or it can tell you it's definitely not there. Gotcha. Cool. And then we've got this bunch of adversarial queries, and these exist because the state of the art uses heuristics, right? And they have basically no bound on the error. So if you get your queries set up in a certain way, this could basically drive that whole thing to
Starting point is 00:10:15 one. Yes. And it's useless, right? Because it's saying you have to go and check everybody again, essentially, because it's saying everyone's possibly in the range, in the set, sorry. Cool. And that's bad, because we use this thing for performance, and then we're not getting any gain from it. So yeah, that's where Grafite comes in. So let's go through the design of Grafite then, and tell us how it actually manages to avoid the issues that the current state-of-the-art approaches have. Well, our approach basically starts from one of the first theoretical papers in the field.
Starting point is 00:11:00 It's a paper published by Goswami et al., and it proves the theoretical lower bound of the problem. It proves that you cannot achieve approximate range filtering on a certain number of keys, for a certain length of the query, without using at least a certain number of bits. So it's basically a lower bound for the problem.
Starting point is 00:11:38 In that paper, they also propose a solution for the problem that uses optimal space and optimal time. In that solution, they use carefully crafted hash functions to reduce the universe size, from the original universe where your input set belongs to a reduced universe. And then they store that reduced universe in a very complicated kind of data structure. They do all the theorems, the lemmas, the proofs, and whatever, and they show that this data structure is optimal for the problem. But it's very complicated, and there is,
Starting point is 00:12:29 I think, no implementation of it anywhere, because it's mainly theoretical and there is no way to implement it efficiently. So what we did is take that approach, but change the underlying data structure, using a different type of data structure that is really fast, really performant, and that we have implementations for. But basically we use the same idea of reducing the universe size
Starting point is 00:13:05 through hashing, storing the hashed set of keys, and then performing range queries on top of that. And what we proved is that using this structure, the famous Elias-Fano structure,
Starting point is 00:13:20 we achieve the same optimal space and optimal time. Actually, we even have better space if we go and check the lower-order terms of the formula. But yeah, we managed to implement it, and this is where this approach comes from. Nice. Cool.
Starting point is 00:13:49 Before we go on to discussing how it performed and the results of your evaluation, there was another contribution in your work; Grafite wasn't the only thing. There's a thing called bucketing. So it might now be quite a good time to tell us about bucketing as well, and how that fits into this landscape. Bucketing is another contribution. What we said with Giorgio and Paolo
Starting point is 00:14:07 is that, okay, we have heuristic range filters, and now we have a robust range filter as well, we have Grafite. But the designs of the heuristic range filters that we have right now are very complicated.
Starting point is 00:14:23 They have a lot of tricks, a lot of parameters, a lot of stuff that you have to be careful about. But maybe some system engineer still wants to use a heuristic range filter in their system, for a lot of reasons. So can we design a heuristic range filter that is way, way simpler than the state of the art that we have now? And then we came up with bucketing, which is basically simple. What we do is simply map the input set into buckets. So we divide the universe into buckets of fixed length, and we store a one if at least one key falls in the bucket, or a zero otherwise. Then we have a bit vector,
Starting point is 00:15:25 and we can compress that bit vector in an efficient way. And that's what bucketing is. It's a heuristic range filter, which suffers on adversarial queries, correlation, and everything we talked about before, but it's way, way simpler than what the other state-of-the-art range filters do. Cool.
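The bucketing idea Marco just described fits in a few lines. A toy sketch (parameter names and sizes are mine; the paper additionally compresses the resulting bit vector, which this sketch keeps as a plain list):

```python
# Toy version of the bucketing heuristic: divide the universe into
# fixed-length buckets and keep one bit per bucket, set when at least
# one key falls inside it.
class BucketFilter:
    def __init__(self, keys, universe_size, bucket_len):
        self.bucket_len = bucket_len
        n_buckets = (universe_size + bucket_len - 1) // bucket_len
        self.bits = [False] * n_buckets
        for k in keys:
            self.bits[k // bucket_len] = True    # mark bucket non-empty

    def may_intersect(self, lo, hi):
        # the range may contain a key iff it touches a non-empty bucket
        first = lo // self.bucket_len
        last = hi // self.bucket_len
        return any(self.bits[first:last + 1])
```

The correlation problem is visible immediately: a query landing in the same bucket as a key, but not on the key itself, is always a false positive.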
Starting point is 00:15:46 So let's talk about how bucketing and Grafite compare to the state of the art then. So the first thing would probably be to tell us how you actually went about evaluating these two algorithms. What was the actual setup of the experiments you ran? Okay. First of all, of course, the experiments are important,
Starting point is 00:16:09 but there is also a careful theoretical evaluation. So first we did the theoretical evaluation, and then for the experiments we used the classical datasets that are used in the state-of-the-art range filter works.
Starting point is 00:16:32 So correlated and uncorrelated datasets. By uncorrelated, we mean that there is no correlation between the ranges and the keys, and by correlated, of course, the opposite. And it's important to say that we didn't compare heuristic range filters with robust range filters, because that's basically like comparing apples and oranges. They serve different purposes and it's not fair to compare them.
Starting point is 00:17:15 So we divided the comparison between heuristic and robust. We compared bucketing, of course, with the heuristic range filters, and Grafite with the other robust range filters. We used synthetic datasets, as I told you, with uncorrelated and correlated ranges, and also real-world datasets like the Facebook IDs or OpenStreetMap positions, these kinds of standard datasets. And then we tested the main things, such as query time, construction time, and false positive rate, and also the false positive rate as the correlation increases. And as we expected, you can see the actual difference between heuristic and robust range filters. The heuristic filters that can in some sense deal with correlated workloads,
Starting point is 00:18:25 they increase their query time exponentially. So this makes them, I don't know, I would say a bit useless in those kinds of workloads. Yeah, that's it. Cool. So yeah, let's put some numbers on these experiments then. So give us the rundown. How much better are Grafite and bucketing compared to the other state-of-the-art approaches for heuristic and robust range filters?
Starting point is 00:18:54 What's the magnitude of the performance gain? Well, if we speak only about Grafite and we take just the false positive rate, we have almost two orders of magnitude better for Grafite compared to the other state of the art for what concerns the false positive rate, and almost up to 90 times faster query times than the other state-of-the-art range filters. For what concerns bucketing, what we saw is that it basically matches
Starting point is 00:19:35 what the other heuristic range filters do in terms of false positive rate, but it does that in much better time. So up to two to three orders of magnitude less in both the query time and the construction time.
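Pulling together the construction Marco described earlier, the core of Grafite can be sketched like this (a toy stand-in of mine, not the paper's code: the hash function, parameters, and names are all assumptions, and the real implementation stores the hashed keys with Elias-Fano rather than a plain sorted list). The idea: hash the high bits of each key ("block id") into a reduced universe while keeping the low bits intact, so a short query range maps to at most two contiguous hashed ranges, and answer emptiness queries on the sorted hashed set:

```python
from bisect import bisect_left

def build_filter(keys, query_bits=8, reduced_bits=16):
    """Store sorted hashed keys; queries must have length < 2**query_bits."""
    mask = (1 << query_bits) - 1

    def h(x):
        # toy multiplicative hash of the block id, low bits kept verbatim
        block = (x >> query_bits) * 0x9E3779B97F4A7C15 % (1 << reduced_bits)
        return (block << query_bits) | (x & mask)

    return sorted({h(k) for k in keys}), h, query_bits, mask

def may_intersect(filt, a, b):
    """No false negatives; false positives only on block-hash collisions."""
    hashed, h, l, mask = filt

    def nonempty(lo, hi):                         # emptiness query on sorted set
        i = bisect_left(hashed, lo)
        return i < len(hashed) and hashed[i] <= hi

    if (a >> l) == (b >> l):                      # query lies inside one block
        return nonempty(h(a), h(b))
    # query straddles two blocks: tail of the first, head of the second
    return nonempty(h(a), h(a) | mask) or nonempty(h(b) & ~mask, h(b))
```

Because keys within a block keep their low bits and share a hashed block prefix, any key inside the query range always lands inside one of the probed hashed ranges, which is what rules out false negatives.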
Starting point is 00:20:04 Nice. So yeah, a clear win on all fronts then. So that kind of leads into my next question: are there any situations still now where either bucketing or Grafite would not perform particularly well? Like, what are the current limitations of these approaches?
Starting point is 00:20:24 So these kinds of data structures perform really well in LSM-tree environments. So in LSM-tree databases, that's where these data structures are adopted. You exploit the fact that when you merge different runs in LSM trees, you have to rebuild everything, so you rebuild the data structure too. It's not very suitable in dynamic
Starting point is 00:20:56 environments because of this reason. But yeah, it will definitely be a nice future work to try to design and implement a dynamic version of it. Yeah, that's always one of my questions as well, Marco. Like, where do we go next with this line of research? And obviously that is making these data structures dynamic, so they can work well in a dynamic environment.
Starting point is 00:21:21 But are there any other particular interesting areas of research you think could follow from this work? These kinds of data structures have also been used, just for research purposes, in other fields like query autocompletion and so on and so forth. But I don't think they are completely suitable for that purpose, for the same reason: because these structures are static. So when you change the set,
Starting point is 00:21:52 you have to rebuild everything. And this is a cost. So if you can pay that cost, perfect. You cannot achieve anything better, because with Grafite you're basically at the lower bound. But maybe, with the current state of computer architectures and storage,
Starting point is 00:22:30 it's okay for you to pay something more in terms of space, but have something more dynamic, so that you don't have to rebuild every time your set changes. So that's why they're mainly exploited in LSM-tree kinds of databases. Yeah, cool. Another sort of question that's been bouncing around my brain whilst we've been talking is about adversarial queries in the wild, I guess is probably the way to phrase it. How common would you see these sorts of query workloads
Starting point is 00:23:02 appearing in production systems, sort of exploiting the bad path, in a way, right? Like, is it something that we see frequently? Well, imagine that you have, I don't know, a database that stores the IDs of your registered users in your system. Those IDs basically pick points in the universe space. So what a user probably wants to do, or what is normal behavior for the system, is to query something that's close
Starting point is 00:23:43 to those kinds of IDs, something that's not completely elsewhere. If you have your IDs between one thousand and five thousand, you don't want to go to thirty thousand, right? So the closer you go to those input keys, the higher the correlation. And the higher the correlation, the higher the false positive rate gets for range filters that cannot deal with those kinds of workloads. So yes, basically it's what is supposed to be
Starting point is 00:24:21 normal behavior for a system: going as close as you can to the input set. Cool, yeah. I guess, going on to my next question then: obviously we always like to talk about impact, and impact is one of the key things that we like to talk about on the podcast. Has Grafite had any sort of impact in the real world in the short term?
Starting point is 00:24:48 Are you aware of any systems, maybe LSM-based systems, that are thinking of adopting this? And, kind of longer term, two questions for the price of one here: what sort of impact do you think Grafite and bucketing can have in the future, long term? Well, for sure these kinds of filters are used in modern storage databases.
Starting point is 00:25:13 And you also have to consider that you can use a range filter to answer point queries as well. So you basically don't have to have both a Bloom filter and a range filter in your system, because you can use a range filter to answer a range query where the size of your range is one. So it's a point query.
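Marco's point about range filters subsuming point queries is just a degenerate range. A tiny sketch (`ExactRangeFilter` is my stand-in for any filter exposing a `may_intersect(lo, hi)` method):

```python
class ExactRangeFilter:
    """Stand-in for any range filter with a may_intersect(lo, hi) method."""
    def __init__(self, keys):
        self.keys = sorted(keys)

    def may_intersect(self, lo, hi):
        return any(lo <= k <= hi for k in self.keys)

def point_query(range_filter, key):
    # a point-membership query is a range query whose range has size one
    return range_filter.may_intersect(key, key)
```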
Starting point is 00:25:44 I don't know to what extent these kinds of filters, I'm speaking about heuristic range filters, are preferred over classical solutions, which can be Bloom filters or
Starting point is 00:25:59 prefix Bloom filters for range queries. For what concerns prefix Bloom filters, they still have the same problem with adversarial workloads and so on. A solution that can provide robustness on this kind of topic, maybe it's something that's going to be implemented in future systems, or at least a good benchmark will tell us whether it's worth replacing more trivial solutions. On the implementation effort, from your perspective, how hard was it to implement this?
Starting point is 00:26:50 Like, how long was the engineering effort, I guess, in going from, okay, idea, to actually implementing the thing, getting it working correctly, and being able to evaluate it? Was it quite a long period, or easy? No, it wasn't a long period. It's a simple implementation.
Starting point is 00:27:08 There is a GitHub repository, you can check it. It's really simple, and I think this is its strength: keeping it simple. Sometimes we want to overcomplicate things that don't need to be complicated. So sometimes, if we just stop and maybe reflect a little bit more, we can find easier solutions for tasks. I'm not saying that we don't need complex solutions for a lot of topics, but sometimes it may just be overkill for some
Starting point is 00:27:54 sorts of problems. Yeah, I completely agree with that message there, Marco. There is sometimes that inherent desire to assume that more complex means better, which I think is, on some level, maybe human nature. I think we're all guilty of it. But having a clean, simple solution that achieves the same goal, it's like, yeah, obviously that should
Starting point is 00:28:15 be better, right? But sometimes we reach for the more complex thing we think we should be doing. But yeah, it's interesting. Cool, yeah. So I guess, while working on this project, was there anything that jumped out as a surprise, or was quite kind of like, huh, that's really interesting, while you were working on Grafite? Well, maybe because I was a fresh grad from my master's, it gave me the chance to explore a lot of
Starting point is 00:28:47 research, to see how things are done in this kind of environment, so how research is actually done, and what we need to do to do things properly. So yeah, that was a bit of a shake for me, because I was a master's student, and now I was trying to write a paper that was going to be published at one of the best conferences in the world. So yeah, it was the joy of your first paper. The gap between when you come out of university and when you actually start to
Starting point is 00:29:47 tackle state-of-the-art work is different. Yeah, I mean, it's a very similar experience when you're first exposed to the level of rigor and the scientific process you go through to get something to a good enough level to, like you say, be published at one of the leading conferences in this research area. It's an experience, right? It's challenging, but I think rewarding in the long term. But I mean, you started off great, right? One paper, one SIGMOD paper, that's a pretty strong record, Marco. So yeah, maybe it's kind of, well, you know, like when athletes just retire as a winner: just retire with one SIGMOD paper. Yeah, I've got my gold medal. Yeah, been there, done that, got the t-shirt. Cool. So moving on then, talking about, I think maybe we mentioned
Starting point is 00:30:41 this a little bit earlier on. You were kind of telling us about how you ended up being interested in database research, but the origin story of this paper: obviously there was an existing paper that was a bit of the motivation behind this, and you kind of brought this data structure along and made it real.
Starting point is 00:31:00 How did that actually come about? Tell us a little bit more about the paper's backstory, I guess. Well, I was looking for a master's thesis, and that's the beginning of the story. So Giorgio, one of my co-authors, Giorgio Vinciguerra, he read the Goswami paper that we were discussing before,
Starting point is 00:31:29 and he saw that maybe the existing solutions weren't exploiting that kind of result at all. And he told me, maybe we can do something about it. We can try to read that Goswami paper carefully, and see if we can do something that takes into consideration what they did. Because they propose a solution,
Starting point is 00:31:56 but they were not able to implement it, and the robust range filter solutions we had at the time that were implemented were inefficient in terms of theoretical bounds. So that's how it started. Then it continued with my other advisor, Paolo Ferragina; we discussed whether it was possible to transform that master's thesis into a paper. I was in Pisa, but I was more interested in trying to explore the industry field rather than continuing in research. So I did a six-month research fellowship contract. Yeah, that's basically what I worked on.
Starting point is 00:33:04 Cool. I guess, as we were talking about you going through the scientific process and getting familiar with how to produce a research paper of top quality, obviously part of that is generating ideas, and then deciding which ones are actually worth spending time working on and which ones need to go straight in the bin. So I don't know, Marco, if you have a personal approach that you've developed over this experience for generating ideas
Starting point is 00:33:32 and then choosing the ones which you think actually have some value. Well, this is not an easy question. So I think that you can have all the thinking process, to say to yourself, okay, I will just stop here at this desk, and I will just try to think and think and think until something comes into my mind and I come up with a great idea. But in my experience, the moment when you have an idea is a very short time frame. It's something that just comes into your mind.
Starting point is 00:34:20 You have that moment when you actually stop there and you reflect and say, okay, yes, maybe this can work, maybe this can be a good approach. And maybe it comes at the most desperate time of your life: in the shower, or, I don't know, when you go out for a walk. But yeah, most of the time, in my experience, I wasn't at the desk when I had the best ideas, my best creative process. Yeah, no, that's true. It comes to you when you least expect it, right? And I
Starting point is 00:35:03 think that sometimes, if you try and force it and sit there, like you're saying, lock yourself in a room, it's often almost antagonistic to generating ideas. Like, it has the opposite effect. Your brain just kind of goes, I'm going to do anything but that. Yeah, no, that's really cool. So yeah, Marco, we're at the time for the last word now on the podcast. So what's the one takeaway you want the listener to get from this chat today? My key takeaway, in my experience,
Starting point is 00:35:41 I would say, is that when we face simple problems, we don't have to overkill the solution. Usually the best solution, in our case the optimal solution, is simple. So usually simple problems have simple solutions. Let's try not to overkill solutions just for the sake of it. Yeah, I think that's a great message to end on there, Marco. Thank you so much for coming on the podcast. Thank you.
Starting point is 00:36:18 Thank you for having me here. More than welcome, it's been a pleasure. Before we wrap up, where can we find you on social media? Are you active on any of those platforms where people can engage with you? You can find me on LinkedIn,
Starting point is 00:36:29 I think. On LinkedIn and my GitHub. I don't use other kinds of social media much.
Starting point is 00:36:37 That's the healthiest thing to do, man, honestly. Cool. Fantastic. And yeah,
Starting point is 00:36:42 we'll see you all next time, listeners, for some more awesome computer science research. Thank you.
