Disseminate: The Computer Science Research Podcast - Marco Costa | Taming Adversarial Queries with Optimal Range Filters | #58
Episode Date: October 14, 2024

In this episode, we sit down with Marco Costa to discuss the fascinating world of range filters, focusing on how they help optimize queries in databases by determining whether a range intersects with a given set of keys. Marco explains how traditional filters, like Bloom filters, often result in high false positive rates and slow query times, especially when dealing with adversarial inputs where queries are correlated with the keys. He walks us through the limitations of existing heuristic-based solutions and the common challenges they face in maintaining accuracy and speed under such conditions.

The highlight of our conversation is Grafite, a novel range filter introduced by Marco and his team. Unlike previous approaches, Grafite comes with clear theoretical guarantees and offers robust performance across various datasets, query sizes, and workloads. Marco dives into the technicalities, explaining how Grafite delivers faster query times and maintains predictable false positive rates, making it the most reliable range filter in scenarios where queries are correlated with keys. Additionally, he introduces a simple heuristic filter that excels on uncorrelated queries, pushing the boundaries of current solutions in the field.

SIGMOD '24 paper: Grafite: Taming Adversarial Queries with Optimal Range Filters
Transcript
Hello and welcome to Disseminate, the computer science research podcast.
Today is another episode of our Cutting Edge series and I'm going to be joined by Marco Costa,
who will be talking about his recent paper from SIGMOD, Grafite: Taming Adversarial Queries with Optimal Range Filters.
Marco is a research engineer at Huawei. Welcome to the show, Marco.
Hey, thank you, Jack. Thank you for inviting me.
Yeah, I'm really looking forward to this chat today. So we always start off on the podcast by getting you to tell the listener more about your story and how you ended up being interested in database research. So yeah, tell us more about yourself, Marco. Why database research?
Our research on Grafite is mainly a matter of algorithm engineering, right? But range filters, which is what Grafite is about, exist mainly for database purposes; they're mainly used in databases. And so when I started this work as my master's thesis back in 2022, I had no idea about database research. I was just a master's student. And then this work drove me straight into it. It started as pure algorithm engineering work, an algorithm engineering thesis. But since this is a field that is really used mainly in databases, that's what I had to start studying, trying to implement Grafite in modern databases. And that's basically how it all started.
Nice, yeah.
So you kind of found a back door into databases there,
but now you've got the bug, right?
That's it.
Databases for life now.
Cool.
So we're going to be talking about Grafite today.
So let's set the listener up with some context for this chat then
because the title has some cool words in there like adversarial queries
and optimal range filters.
So let's kick things off, Marco.
Tell us what are range queries
and yeah, what the heck is an adversarial query?
Basically, what range queries are is simple. You have a set of keys, and you want to check if a range intersects with those keys. So you have two endpoints, A and B, and you want to see if there is any key of your set that falls between them. Approximate range queries are a bit different: they can be seen as an extension of what Bloom filters do, from point queries to range queries. What a Bloom filter does is give you an approximate answer where you don't have false negatives, but you can have false positives. So it can tell you, with a certain degree of error, whether a key belongs to the set or not, but it can never give you a false negative answer: if the filter tells you that the key is definitely not in the set, then it's definitely not in the set. And approximate range queries are the same, basically, just on ranges rather than on single points.
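To pin that contract down, here is a minimal sketch (an illustration in Python, not code from the paper): the exact ground-truth check that an approximate range filter is compared against. The filter may answer "maybe" when this returns False, but it must never answer "no" when it returns True.

```python
import bisect

def exact_range_nonempty(sorted_keys, a, b):
    """Ground truth: does any key fall in [a, b]?
    An approximate range filter may say "maybe" when this is False
    (a false positive), but it must never say "no" when this is True."""
    i = bisect.bisect_left(sorted_keys, a)  # successor search for a
    return i < len(sorted_keys) and sorted_keys[i] <= b
```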
What are adversarial queries? Well, when we started evaluating Grafite and the other state-of-the-art range filters, we saw that before Grafite, what we had were mainly heuristic range filters. So what is a heuristic range filter? It's a range filter that answers approximate range emptiness queries but cannot give you a bound on the false positive rate. And since these range filters cannot provide any guarantee, you can build an adversarial workload that basically makes them useless, because it drives the false positive rate straight up to one. So you have no filtering at all. And you can imagine how we use these data structures: since they use very little memory, they can be kept in main memory, and you want to use them to guard access to the disk, to avoid useless I/Os. So when we have a query on our system, we first query the range filter; if the filter tells us that yes, the range, or the key, is probably there, then you go to the disk and perform the actual query. So if you can build an adversarial workload on top of a range filter, certain kinds of queries that drive the false positive rate straight to one, you're basically providing no filtering. You're giving your adversary, or maybe not even an adversary, free I/Os on your system. And we also saw that the adversarial workloads for range queries are workloads where the ranges are very close to the input keys, which is pretty normal, no? Because when you do a range query, ideally you're going to query a range that is close to your input set, not something completely on the other side of the universe. And the closer the ranges get to the input keys, the more these heuristic range filters provide no filtering at all. So that's why we wanted to provide a solution that we call robust. What Grafite does is: you provide a target false positive rate, you provide your input set, and that's it. You have a measurable, certain way to know what your false positive probability is going to be when you perform a query.
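As a rough illustration of what "queries correlated with the keys" means here (a hedged sketch; the function name and parameters are invented for this example, and the paper's adversarial construction is more careful):

```python
import random

def correlated_queries(keys, num_queries, max_gap, range_len):
    """Sketch of a correlated, adversarial-style workload: each query range
    starts just past some existing key. The smaller max_gap is, the more
    correlated the workload, and the closer a heuristic filter's false
    positive rate is driven toward one."""
    queries = []
    for _ in range(num_queries):
        k = random.choice(keys)
        lo = k + random.randint(1, max_gap)  # land near, but after, a key
        queries.append((lo, lo + range_len - 1))
    return queries
```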
Nice, cool. So I'm just going to run through that and try to recap it, and you can tell me if I've grasped it, Marco. Going way back to the start of your answer there, when we were introducing what range queries are: to give a concrete example (and you can tell me if I'm thinking about this right), my go-to example is footballers. So I've got a list of every Premier League player and their age, right? That might be my set of values, and then I guess my range query would be: go and get me every player that is between the ages of 21 and 23. That is basically the way one of those queries would work: you would potentially go to that set of inputs and run that query over it, right? No? Would that be a good example?

We run the range queries on keys, straight on the keys. So imagine your input set is the dates of birth of football players. Then you want to know if a football player was born between 1997 and 1998, say. That's basically a range query.

Cool, gotcha. And then the problem here, to get at what the false positive is, is that the filter is always right when it tells you something is definitely not in that range, but sometimes it's going to say, hey, this is in there, when it's potentially not in there. That's what we mean by the false negatives... or the false positives. I always get my false positives and false negatives the wrong way around, Marco.
See it this way: when you perform a query, it can tell you it may be in the set, which may or may not be true, or it can tell you it's definitely not there.
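In code, with the date-of-birth example and the ground-truth check sketched earlier (again just an illustration):

```python
birth_years = sorted([1989, 1995, 1997, 2001])
print(exact_range_nonempty(birth_years, 1997, 1998))  # True: someone born in 1997
print(exact_range_nonempty(birth_years, 1990, 1994))  # False: definitely nobody
```

A filter standing in for exact_range_nonempty could answer True on the second query too (a false positive), but never False on the first.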
Gotcha. Cool.
And then we've got this bunch of adversarial queries, which can really hurt, because the existing state of the art uses these heuristics, right? And they have basically no bound on the error. So if your queries are set up in a certain way, this could basically drive that whole thing to one.

Yes. And then it's useless, right? Because you have to go and check everything again, essentially, because it's saying everything is possibly in the range... in the set, sorry. Cool. And that's bad, because we use this thing for performance, and then we're not getting any gain from it. So yeah, that's where Grafite comes in. Let's go through the design of Grafite then: tell us how it actually manages to avoid the issues that the current state-of-the-art approaches have.
Well, our approach basically starts from one of the first papers, the first theoretical paper in the field. It's a paper by Goswami and coauthors, and it proves the theoretical lower bound for the problem. It proves that you cannot achieve approximate range filtering on a certain number of keys, for a certain length of query, without using at least a certain number of bits. So it's basically a lower bound for the problem. In that paper, they also propose a solution that uses optimal space and optimal time. In that solution, they use carefully crafted hash functions to reduce the universe size, from the original universe where your input set lives down to a reduced universe, and then they store that reduced universe in a very complicated kind of data structure. They do all the theorems, the lemmas, the proofs, and whatever, and they prove that this data structure is optimal for the problem. But it's very complicated, and I think there is no implementation of it anywhere, because it's mainly theoretical and there is no way to implement it efficiently. So what we did was take that approach but change the underlying data structure, using a different type of data structure that is really fast, really performant, and that we have implementations for, while keeping the same ideas: reducing the universe size through hashing, storing the hashed set of keys, and then performing range queries on top of that. And what we proved is that using this structure, the famous Elias-Fano structure, we achieve the same optimal space and optimal time. Actually, we even get better space if we go and check the lower-order terms of the formula. But yeah, we managed to implement it, and that's where this approach comes from.
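To make that design concrete, here is a toy sketch: reduce the universe with a locality-preserving hash, keep the hashed keys in a sorted sequence, and answer a range query with at most two successor searches. This is a simplified reading, not Grafite's actual construction: the real filter uses a carefully chosen pairwise-independent hash and an Elias-Fano encoding of the hashed keys, while this sketch uses a naive block hash and a plain sorted list.

```python
import bisect
import random

class SimplifiedGrafite:
    """Toy sketch of the hash-then-store idea behind Grafite (not the
    paper's exact construction)."""

    def __init__(self, keys, L, reduced_bits=32):
        self.L = L                                 # max supported query length
        self.r = 1 << reduced_bits                 # reduced universe size
        self.m = self.r // L                       # number of hashed blocks
        self.c1 = random.randrange(1, self.m, 2)   # toy stand-in for a
        self.c2 = random.randrange(self.m)         # pairwise-independent hash
        self.hashed = sorted({self._h(k) for k in keys})

    def _h(self, x):
        # Locality-preserving reduction: keys in the same length-L block map
        # to the same hashed block, and offsets inside the block are kept.
        return ((self.c1 * (x // self.L) + self.c2) % self.m) * self.L + x % self.L

    def _nonempty(self, lo, hi):
        i = bisect.bisect_left(self.hashed, lo)    # successor search
        return i < len(self.hashed) and self.hashed[i] <= hi

    def may_intersect(self, a, b):
        """False is exact (no false negatives); True may be a false positive
        caused by collisions in the reduced universe."""
        assert b - a + 1 <= self.L, "query longer than supported length L"
        if a // self.L == b // self.L:             # query inside one block
            return self._nonempty(self._h(a), self._h(b))
        end_a = (a // self.L + 1) * self.L - 1     # last point of a's block
        start_b = (b // self.L) * self.L           # first point of b's block
        return (self._nonempty(self._h(a), self._h(end_a)) or
                self._nonempty(self._h(start_b), self._h(b)))
```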
Nice. Cool.
Before we go on to talking about how it actually performed, and discussing the results of your evaluation, there was another contribution in your work; that wasn't the only thing. There's a thing called bucketing. So it might be quite a good time now to tell us about bucketing as well, and how that fits into this landscape.
Bucketing is another contribution. What we said with Giorgio and Paolo is: okay, we have heuristic range filters, and we have a robust range filter as well, Grafite. But the designs of the heuristic range filters we have right now are very complicated. They have a lot of tricks, a lot of parameters, a lot of stuff you have to be careful about. But maybe some systems engineer still wants to use a heuristic range filter in their system, for any number of reasons. So, can we design a heuristic range filter that is way, way simpler than the state of the art we have now? And then we came up with bucketing, which is basically simple. What we do is simply map the input set into buckets. We divide the universe into buckets of fixed length, and we store a one if at least one key falls in the bucket, or a zero otherwise. Then we have a bit vector, and we can compress that bit vector in an efficient way. And that's what bucketing is. It's a heuristic range filter, which suffers from adversarial queries, correlation, and everything we talked about before, but it's way, way simpler than what the other state-of-the-art range filters do.
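And the scheme really is that simple; a sketch fits in a few lines (the bit-vector compression the paper applies is left out here):

```python
class Bucketing:
    """Minimal sketch of the bucketing heuristic: split the universe into
    fixed-length buckets and keep one bit per bucket recording whether any
    key falls inside it."""

    def __init__(self, keys, universe_size, num_buckets):
        self.width = -(-universe_size // num_buckets)  # ceiling division
        self.bits = [False] * num_buckets
        for k in keys:
            self.bits[k // self.width] = True

    def may_intersect(self, a, b):
        # "Maybe" iff any bucket overlapping [a, b] contains a key. A query
        # that lands near (but misses) the keys easily shares a bucket with
        # them, which is exactly the correlated-workload weakness above.
        return any(self.bits[a // self.width : b // self.width + 1])
```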
Cool. So let's talk about how bucketing and Grafite compare to the state of the art then.
So the first thing probably would be good to tell us about how you actually went about
evaluating these two algorithms.
What was the actual setup of the experiments you ran?
Okay.
First of all, of course, the experiments are important, but there is also a careful theoretical evaluation. So first we did the theoretical evaluation, and then, for the experiments, we used the classical datasets that are used in this kind of work on state-of-the-art range filters. So, correlated and uncorrelated workloads: by uncorrelated we mean that there is no correlation between the ranges and the keys, and correlated is, of course, the opposite.
And it's important to say that we didn't compare heuristic range filters with robust range filters, because that's basically like comparing apples and oranges. They serve different purposes, and it's not fair to compare them. So we split the comparison between heuristic and robust: we compared bucketing, of course, with the heuristic filters, and Grafite with the other robust range filters. We used synthetic datasets, as I told you, with uncorrelated and correlated ranges, and also real-world datasets, like the Facebook IDs or OpenStreetMap positions, these kinds of standard datasets. And then we measured the main things, such as query time, construction time, and false positive rate, and also the false positive rate as the correlation increases. And, as we expected, you can see the actual difference there between heuristic and robust range filters: the heuristic filters that can, in some sense, deal with correlated workloads increase their query time exponentially. So this makes them, I don't know, I would say a bit useless in those kinds of workloads. Yeah, that's it.
Cool. So yeah, let's put some numbers on these experiments then. Give us the rundown: how much better are Grafite and bucketing compared to the other state-of-the-art approaches for heuristic and robust range filters? What's the magnitude of the performance gain?
Well, if we speak only about Grafite and take just the false positive rate, Grafite is almost two orders of magnitude better compared to the other state of the art, and its queries are up to almost 90 times faster than the other state-of-the-art range filters. For what concerns bucketing, what we saw is that it basically matches what the other heuristic range filters do in terms of false positive rate, but it does that in far better time: up to two to three orders of magnitude less, in both query time and construction time.
Nice.
So yeah, a clear win on all fronts then. That kind of leads into my next question: are there any situations, still, where either bucketing or Grafite would not perform particularly well? What are the current limitations of these approaches?
So these kinds of data structures perform really well in LSM-tree environments; LSM-tree databases are where they are used. You exploit the fact that when you merge different runs in LSM trees, you have to rebuild everything anyway, so you rebuild the filter too. For this reason, it's not very suitable in dynamic environments. But yeah, it would definitely be nice future work to try to design and implement a dynamic version of it.
Yeah, that's always one of my questions as well, Marco: where do we go next with this line of research? Obviously one direction is making these data structures dynamic, so they can work well in a dynamic environment. But are there any other particularly interesting areas of research you think could follow from this work?
These kinds of data structures have also been used, just for research purposes, in other fields, like query autocompletion and so on and so forth. But I don't think they are completely suitable for that, for the same reason: the structures are static. When you change the set, you have to rebuild everything, and that is a cost. If you can pay that cost, perfect; you cannot achieve anything better, because with Grafite you're basically at the lower bound. But maybe, with the current state of computer architectures and storage, it's okay for you to pay something more in terms of space but have something more dynamic, so that you don't have to rebuild every time your set changes. That's why they're mainly exploited in LSM-tree kinds of databases.
Yeah, cool.
Another sort of question that's been bouncing around my brain whilst we've been talking is about adversarial queries in the wild, I guess is probably the way to phrase it. How common would you expect these sorts of query workloads to appear in production systems, exploiting the bad path, in a way? Like, is it something that we see frequently?

Well, imagine that you have, I don't know, a database that stores the identifiers, the IDs, of the registered users in your system. Those IDs basically pick out points in the universe space. What a user probably wants to do, or what's the normal behavior of the system, is to query something that is close to those IDs, not something completely elsewhere. If your IDs are between one thousand and five thousand, you don't want to go to thirty thousand, right? So the closer you go to those input keys, the higher the correlation; and the higher the correlation, the higher the false positive rate climbs for range filters that cannot deal with those kinds of workloads. So yes, basically, it's what is supposed to be the normal behavior of a system: going as close as you can to the input set.
Cool, yeah. I guess that leads to my next question. Obviously we always like to talk about impact; it's one of the key things we like to talk about on the podcast. Has Grafite had any sort of impact in the real world in the short term? Are you aware of any systems, maybe LSM-based systems, that are thinking of adopting this? And, longer term (two questions for the price of one here), what sort of impact do you think Grafite and bucketing can have in the future?
Well, for sure they can be used in modern databases, in modern storage systems. And you also have to consider that you can use a range filter to answer point queries as well. So you don't have to have both a Bloom filter and a range filter in your system, because you can use the range filter to answer a range query where the size of your range is one. So it's a point query.
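Concretely, reusing the toy SimplifiedGrafite sketch from earlier (a real system would of course use the actual Grafite implementation):

```python
f = SimplifiedGrafite(keys=[10, 42, 77], L=16)
print(f.may_intersect(42, 42))  # a point query is just a range query with a == b
print(f.may_intersect(20, 30))  # an ordinary range query on the same filter
```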
I don't know to what extent these kinds of filters (I'm speaking about heuristic range filters) are preferred over classical solutions, which can be Bloom filters, or prefix Bloom filters for range queries. For what concerns prefix Bloom filters, they still have the same problem with adversarial workloads and so on, whereas Grafite is a solution that can provide robustness on this front. Maybe it's something that is going to be implemented in future systems, or at least a good benchmark will tell us whether or not it's worth replacing the more trivial solutions.
On the implementation effort, so, for you personally: how hard was it to implement this? How long was the engineering effort, I guess, in going from, okay, idea, to actually implementing the thing, getting it working correctly, and being able to evaluate it? Was it quite a long period, or easy, or...?
No, it wasn't a long period. It's a simple implementation; there is a GitHub repository, you can check it. It's really simple, and I think that is its strength: keeping it simple. Sometimes we want to overcomplicate things that don't need to be complicated, and sometimes, if we just stop and reflect a little bit more, we can find easier solutions for our tasks. I'm not saying that we don't need complex solutions for a lot of topics, but sometimes a complex solution may just be overkill for some sorts of problems.
Yeah, I completely agree with that message there, Marco. There is sometimes that inherent desire to treat more complex as meaning better, which I think is, on some level, maybe human nature; I think we're all guilty of it. But having a clean solution that achieves the same goal simply... obviously that should be better, right? But sometimes we reach for the more complex thing we think we should be doing. Yeah, it's interesting. Cool. So I guess, while working on this project, was there anything that jumped out as a surprise, or was quite kind of, huh, that's really interesting, while you were working on Grafite?
Well, it gave me the chance (I was a fresh graduate from my master's) to actually explore a lot of research: to see how things are done in this kind of environment, how research is actually done, and what you need to do to do things properly. So yeah, it was a bit of a shake-up for me, because I was a master's student, and then I was trying to write a paper that was going to be published at one of the best conferences in the world.

So yeah, it was the joy of your first paper.

Yes, the gap between when you come out of university and when you actually start doing state-of-the-art work is real.

Yeah, I mean, it's a very similar experience when you're first exposed to the level of rigor, and the scientific process you have to go through, to get something to a good enough level to, like you say, be published at one of the leading conferences in this research area. It's an experience, right? It's challenging, but I think rewarding in the long term. But I mean, you started off great, right? One paper, one SIGMOD paper: that's a pretty strong record, Marco. So yeah, maybe, you know, like when athletes retire as winners, you could just retire now: one SIGMOD paper, done.

Yeah, I've got my gold medal. Been there, done that, got the t-shirt.

Cool. So moving on then. I think we mentioned this a little bit earlier on, when you were telling us about how you ended up being interested in database research, but the origin story of this paper: obviously there was an existing paper that was a bit of the motivation behind this, and you brought this data structure along and made it real. How did that actually come about? Tell us a little bit more about the paper's backstory, I guess.
Well, I was looking for a master's thesis, and that's the beginning of the story. Giorgio, one of my co-authors, Giorgio Vinciguerra, had read the Goswami paper that we were discussing before, and he saw that the existing solutions weren't exploiting that result at all. And he told me: maybe we can do something about it. We can try to read that Goswami paper carefully and see if we can do something that takes into consideration what they did. Because they propose a solution but were not able to implement it, and the robust range filter solutions we had at the time that were implemented were inefficient in terms of theoretical bounds. So that's how it started. Then it continued with my other advisor, Paolo Ferragina: we discussed whether it was possible to transform that master's thesis into a paper. I had the chance to stay on in Pisa, but I was more interested in trying to explore industry rather than continuing in research, so I did a six-month research fellowship contract. Yeah, that's basically what I worked on.
Cool.
I guess, as we were talking about you going through the scientific process and getting familiar with how to produce a research paper of top quality: part of that is obviously generating ideas, and then deciding which ones are actually worth spending time working on and which ones need to go straight in the bin. So, I don't know, Marco, if you have a personal approach that you've developed through this experience for generating ideas, and then choosing the ones which you think actually have some value?
So I think that you can have all the thinking process to say to yourself, okay, I will just stop here on this desk and I will just try to think and think and think until something comes into my mind and I come up with a great idea.
But in my experience,
when you have the moment
when you have an idea,
it's a very short time frame.
So it's something that just comes into your mind.
You have that moment
when you actually stop there and you reflect and say,
okay, yes, maybe this can work.
Maybe this can be a good approach.
And maybe it comes in the most desperate time of your life,
under your shower, under your, I don't't know when you go out for a walk but yeah most of
the time it didn't in my experience i wasn't at the desk when i had the best ideas my best
creative process but yeah yeah no that's true it comes to you when you least expect right and i
think that sometimes is if you try and force it and sit there like you're saying lock yourself in a room it's often almost
antagonistic to generating ideas like it has the opposite effect your brain just kind of goes
i'm gonna do anything but yeah no that's really cool so yeah marco we're at the uh the time for
the last word now for the podcast so what's the one takeaway you want the listener
to get from this chat today?
My key takeaway, in my experience, I would say, is that when we face simple problems, we don't have to overkill the solution. Usually the best solution, in our case the optimal solution, is simple. So usually simple problems have simple solutions. Let's try not to overkill solutions just for the sake of it.
Yeah, I think that's a great message to end on there, Marco.
Thank you so much for coming on the podcast.
Thank you.
Thank you for having me here.
More than welcome.
It's been a pleasure.
Whereabouts can we find you on social media? Are you active on any of those platforms where people can engage with you?
You can find me on LinkedIn, I think. On LinkedIn and my GitHub; I don't use many other kinds of social media.
Cool, that's the healthiest thing to do, man, honestly. Fantastic. And yeah, we'll see you all next time, listeners, for some more awesome computer science research. Thank you!