Disseminate: The Computer Science Research Podcast - Felix S Campbell | Efficient Answering of Historical What-if Queries | #2
Episode Date: July 1, 2022

Summary: In this interview Felix discusses "historical what-if queries", a novel type of what-if analysis that determines the effect of a hypothetical change to the transactional history of a database. For example, "how would revenue be affected if we had charged an additional $6 for shipping?" In his research Felix has developed efficient techniques for answering these historical what-if queries, i.e., determining how a modified history affects the current database state. During the show, Felix talks about reenactment, a replay technique for transactional histories, and how he and his co-authors optimize this process using program and data slicing techniques to determine which updates and what data can be excluded from reenactment without affecting the result.

Questions:
0:42: Can you start off by explaining what historical what-if queries are?
1:56: What is the naive approach to answering these types of questions?
2:47: What are the problems with this naive approach and why is your solution better?
3:45: Tell us about reenactment, how does that work?
4:48: In your paper you mention two additional techniques, data slicing and program slicing, can you tell us more about these?
6:44: How do reenactment, data slicing, and program slicing compare to other techniques in the literature? Where do they improve on the pitfalls of those?
8:00: Are there any commercial DBMSs that provide similar functionality out of the box?
8:57: How did you go about evaluating your solution?
10:40: What are the parameters you varied in your evaluation?
14:11: Where do you see this research being most useful? Who can use this?
15:17: Are the code/toolkit publicly available?
16:15: What is the most interesting aspect of working on what-if queries and more generally in the area of data provenance?
17:36: What do you have planned for future research?
Links:
Felix's homepage
Illinois Institute of Technology (IIT) Database Group's homepage
Efficient Answering of Historical What-if Queries SIGMOD paper
SIGMOD presentation

Contact Info:
Email: fcampbell@hawk.iit.edu
LinkedIn

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the podcast bringing you overviews of the latest computer science research.
I'm your host, Jack Waudby. We're recording today from ACM SIGMOD/PODS in Philadelphia,
and I'm delighted to say I'm joined by Felix Campbell, who will be talking about his paper,
Efficient Answering of Historical What-If Queries.
Felix is a student at the Illinois Institute of Technology,
and in the fall he will be joining Ben Gurion
University to start his PhD. He is interested in data provenance and query optimization
methods. Felix, thank you for joining the show. Yeah, of course. Thanks for having me.
Can you start off by explaining to our listeners what historical what-if queries are?
Sure. So with a historical what-if query, we want to allow users to leverage the history of their database. Typically there's actually a lot of interesting information there that we can exploit to reason about our data a little bit better. So namely, historical what-if queries allow users to pose hypothetical scenarios and see how they would have played out against the rest of their database. So the canonical example that we use is an e-commerce scenario: you might have a shipping policy in the database that reduced the shipping fee to zero for
orders of greater than $50.
And so if you as a data scientist or analyst decided, you know, we want to see what would
have happened if we had set this threshold to $60 instead, we now provide a framework
to not only answer this query, but to answer it efficiently, so that data scientists can more effectively test these interventions.
Fantastic.
So what's the naive approach to answering these types of questions?
Sure.
So naively, right, the structure of the query is such that we want to know the difference
between the original database and this hypothetical database.
So naively, in order to do
that, we have our original database. It already exists. It's current. It's the truth, ground
truth, so to speak. And so we need to create a copy of the database prior to the transactional
history. So we use a time travel feature that's supported by many database management systems in
order to achieve that. So after copying, then we apply the set of modifications to the history,
which we then evaluate over the database.
And then we simply just take the symmetric difference, right?
So that's anything that's exclusively in the original ground truth database
or anything that's exclusively in the hypothetical database.
And it's annotated to let you know which is coming from where.
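The naive strategy just described can be sketched in a few lines of Python. This is purely illustrative, not the paper's implementation: relations are plain sets of tuples, each update in the history is a function from a set of rows to a set of rows, and all names are made up for the example.

```python
def apply_history(db, history):
    """Apply a list of updates (each a function: rows -> rows) in order."""
    for update in history:
        db = update(db)
    return db

def naive_whatif(db_before, history, modified_history):
    """Replay both histories from the pre-history state and return the
    symmetric difference, annotated with which side each tuple came from."""
    original = apply_history(set(db_before), history)
    hypothetical = apply_history(set(db_before), modified_history)
    return ({("original", t) for t in original - hypothetical} |
            {("hypothetical", t) for t in hypothetical - original})

# Toy shipping example: rows are (order_id, total, fee).
orders = {(1, 55, 5), (2, 70, 5), (3, 40, 5)}
free_over_50 = lambda rows: {(i, t, 0 if t > 50 else f) for (i, t, f) in rows}
free_over_60 = lambda rows: {(i, t, 0 if t > 60 else f) for (i, t, f) in rows}
diff = naive_whatif(orders, [free_over_50], [free_over_60])
```

In a real system the `db_before` copy would come from the DBMS's time travel feature, and both replays would incur full transaction processing overhead, which is exactly the cost discussed next.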
Okay, fantastic.
So what are the problems with this naive approach,
and why is your solution better?
Sure, of course. So, right, the key part here is there's a lot of overhead in the naive approach to the problem, right?
Namely, copying the entire database can be a little bit impractical with larger relations,
especially if this is something that you want to evaluate a few times, right?
So if you change the threshold to $60 from our earlier example,
maybe you want to try $65 or $70 or $45.
But in this case, you'd have to be copying your large relation for each hypothetical modification to the history.
So on top of that, actually evaluating these transactional histories,
again, over the database can be problematic
as you have the overhead that would occur from actually executing these queries with side effects, right?
So you're actually writing to disk, you're incurring the transaction processing overhead.
So we would like to avoid this, ideally, if possible.
Cool. So with that in mind, can we go into a little bit more depth on your solution and talk about reenactment?
How does that work?
Sure.
So, as I just mentioned, right, there's this big issue with copying. And the issue is that we can't reuse the copy after we've executed the history, because now we've applied all of its side effects. Reenactment avoids this: it replays the transactional history as a query instead. So you run the reenactment query, and you get back a relation that is the same as if you had executed the actual updates, inserts, and deletes as they were originally run.
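A rough sketch of the reenactment idea in Python (illustrative only; in a relational engine this is done inside a query, roughly with CASE expressions, rather than row-by-row functions): each update becomes a pure per-row transformation, and the whole history composes into one side-effect-free evaluation over the pre-history state.

```python
def reenact_update(condition, set_clause):
    """Turn 'UPDATE ... SET ... WHERE condition' into a pure per-row function,
    mirroring CASE WHEN condition THEN set_clause(row) ELSE row END."""
    return lambda row: set_clause(row) if condition(row) else row

def reenact_history(rows, reenacted_updates):
    """Run every row through the composed updates; nothing is written back."""
    out = []
    for row in rows:
        for update in reenacted_updates:
            row = update(row)
        out.append(row)
    return out

# Toy example: free shipping for orders over $50.
orders = [{"id": 1, "total": 55, "fee": 5}, {"id": 2, "total": 40, "fee": 5}]
free_ship = reenact_update(lambda r: r["total"] > 50,
                           lambda r: {**r, "fee": 0})
result = reenact_history(orders, [free_ship])
# 'orders' itself is left untouched: no side effects, so the pre-history
# state can be reused for the next what-if variant.
```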
So in your paper as well, you mentioned two additional techniques
that can help in this sort of thing.
So these are data slicing and program slicing.
Can you go into a little bit more depth about these as well and describe how they're useful?
Sure, of course.
So data slicing and program slicing are techniques that we borrowed, at least program slicing in particular, from the programming languages community.
But namely, what we seek to do in both is to kind of reduce the size of the computation.
So now that we're doing reenactment, we're able to kind of piece apart those computations and reduce the amount of computation necessary.
So in data slicing, intuitively, because of the structure of the problem, we concern ourselves only with tuples that might appear in the symmetric difference. And knowing this, we investigated a little further to see which tuples could actually end up in the answer of the historical what-if query, which is that symmetric difference. And we came up with a way to filter out tuples prior to reenactment, to get rid of all of the extra data that we would otherwise have to process.
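The data-slicing filter described above can be sketched as follows. This is an illustrative single-modified-update version, not the paper's general condition: a row can only reach the symmetric difference if the original and the hypothetical versions of the changed update treat it differently, so every other row can be dropped before reenactment.

```python
def data_slice(rows, original_update, modified_update):
    """Keep only rows on which the two versions of the changed update disagree."""
    return [r for r in rows if original_update(r) != modified_update(r)]

# With the shipping example, only orders in the ($50, $60] band can differ
# between the $50 threshold and the hypothetical $60 threshold.
rows = [{"total": 40}, {"total": 55}, {"total": 70}]
orig = lambda r: {**r, "fee": 0} if r["total"] > 50 else {**r, "fee": 5}
hyp = lambda r: {**r, "fee": 0} if r["total"] > 60 else {**r, "fee": 5}
sliced = data_slice(rows, orig, hyp)
```

The paper additionally handles inserts, deletes, and multiple modified updates; this point-wise test only conveys the intuition.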
And in the case of program slicing, it's a bit of an analog to data slicing. In this case, we have a transactional history, and a lot of these transactions are actually irrelevant to computing the answer to the historical what-if query, right?
If we are concerning ourselves with modifying the shipping fee based on price,
we don't have to care about changing a customer's country.
That's a little bit not so precise of a statement, right?
Because changing their country might affect a later shipping fee change.
But in general, right, just to demonstrate,
there's probably a decent amount of updates that
you don't need to run again in order to answer the same question. How does this solution then
compare to other techniques in the literature? And what are the benefits of this compared to
the other ones? Where does it improve on the pitfalls of those solutions? Sure. So there's
a few existing what-if analysis systems. It's not a
completely new idea, this what-if analysis hypothetical reasoning. I would say in particular
our approach tries to leverage the transactional history of the database. This presents a lot of
information. There's a great talk today about a paper that is also a framework for hypothetical reasoning, but they don't really
use the historical, you know, nature that we are able to actually leverage to sort of,
you know, provide some predictions or some hypothetical answers. So in general, I would
say the key to our approach, at least the problem, is a little bit novel, right?
Because we're looking at modifying this history and seeing how it would have been different.
And then the technique is novel in that, you know, there's been data slicing and program slicing done before in literature,
even with respect to databases.
But some of the more technical, like key technical aspects are, I would say, pretty novel.
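The program-slicing idea Felix described a moment ago (dropping updates that cannot affect the modified update's results) can be sketched as a toy attribute-level dependency check. This is only the intuition: the paper's actual test is finer-grained, reasoning about update conditions via linear programming, and with multiple modified updates dependencies must also be checked backwards, as Felix notes later in the episode. All names here are invented for the example.

```python
def program_slice(history, modified_index):
    """Keep the modified update plus any later update that (transitively)
    reads an attribute written by an already-kept update."""
    relevant = {modified_index}
    tainted = set(history[modified_index]["writes"])
    for i in range(modified_index + 1, len(history)):
        update = history[i]
        if tainted & set(update["reads"]):    # depends on a changed attribute
            relevant.add(i)
            tainted |= set(update["writes"])  # its outputs are now tainted too
    return [u for i, u in enumerate(history) if i in relevant]

# Changing the shipping-fee update makes the country fix irrelevant,
# but a discount computed from the fee still has to be reenacted.
history = [
    {"name": "set_fee", "reads": ["total"], "writes": ["fee"]},
    {"name": "fix_country", "reads": ["country"], "writes": ["country"]},
    {"name": "fee_discount", "reads": ["fee"], "writes": ["discount"]},
]
sliced = program_slice(history, modified_index=0)
```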
Following on from that, are there any commercial database systems at the moment that provide anything like this in their current offerings?
So I think there's a number of commercial database systems that definitely provide the groundwork.
I think the key things that we require for this is audit logging, right?
So we actually need to have a transactional history before we do anything, so audit logging is key.
But additionally, time travel, right? We need to know
what was the state of the database prior
to the history that we're trying to investigate
hypothetically. But in terms
of out-of-the-box functionality,
it doesn't really quite exist.
Originally, this paper kind of referred to our system as a middleware for answering these sorts of what-if queries, and this kind of nomenclature pops up throughout the literature.
But yeah, so to our knowledge,
it's definitely not available out of the box.
We actually have a slight extension to SQL to facilitate the what-if queries.
How did you go about evaluating your solution?
What sort of benchmarks did you use?
What were the results?
Give us some numbers and say how good this is.
Sure. So, to evaluate this, we used a couple of data sets: some standardized benchmarks and some real-world data.
The challenge with this is, obviously,
it's hard to measure some sort of ground truth, right?
This is hypothetical reasoning.
We're doing A-B testing without the B, right?
There's not really...
It can be hard to...
And that's definitely a challenge that...
I don't know how much I should reference this other talk, but...
Go for it.
Of course, right?
That's sort of an issue they had too, right?
They need to input this causal graph because they don't know.
That's a hard problem.
It's something for someone to sort of analyze
just with domain expertise.
But yeah, so we used a couple of standard benchmarks.
TPC-C specifically is one such benchmark that's used for pretty standard query answering.
Also the Yahoo! Cloud Serving Benchmark (YCSB).
But also, we used a TaxiTrips database from the city of Chicago.
They have an open data portal.
And so the workloads that we used for at least the standardized benchmarks,
they specify some workloads.
So they specify some updates, some inserts, some deletes.
And so this is, we took these workloads,
we made some hypothetical modifications ourselves
and tried to understand how, you know,
different parameters affect the runtime.
What sort of parameters are we talking here?
What are the knobs we can turn here?
Yeah, exactly.
So there's quite a few knobs that we can turn.
There's a lot of, you can really dial it in.
But in particular, right, the key things that we've identified, I mean, on their face they're pretty obvious, but the two key things, in terms of the speedup you'll get, are the percentage of dependent updates, and the data. So we term a dependent update as an update that we can't eliminate from the reenactment.
And so
if everything is dependent on everything
you're not going to be able to benefit.
And the same thing with the
data. The data is I think as well.
So if every dependent update is
modifying all the data, you're not going to be able
to get rid of anything from your processing.
So those are the two things. So
typically in a real-world scenario, right,
you usually have updates that
make certain partitions or cuts.
So if you're only modifying,
for example, customers in the UK
or customers in the US,
you're able to sort of isolate that
as compared to modifying every single customer.
The main difference in
running time as well is obviously we scale
in the size of the transactional history.
So the data size
doesn't matter with regards
to program slicing, which is nice.
Some other previous approaches, which don't address exactly the same problem but do a kind of similar program slicing, are linear in the size of the data. So that's not good, right? You have more data, it's going to take longer; it's something we would like to avoid. But in general, the results are, you know, against the naive method, our method is usually an order of magnitude better. And typically, our reenactment approach with all of our optimizations enabled wins out over reenactment alone or either optimization alone.
And there are some cases where this isn't necessarily true.
Yeah, what are those cases?
Sure. So as I mentioned, we have these cases at basically the extremes. So if we have an extremely high amount of tuples being modified, or an extremely high amount of dependent updates, you're not going to be able to benefit.
But there's an extreme on the data side too, where you have very few tuples, or rows rather, being modified by the updates that are dependent. The thing is, the program slicing is actually a pretty heavy algorithm. It uses linear programming, and just at a high level, it's definitely a hard problem. So there are some cases where the amount of data that we're actually reenacting on is so small, right? In the cases we were testing, it could be maybe only a couple to tens of tuples out of millions. And in those cases, it's a lot simpler to just run those 10, 20, up to 100 or so tuples through the whole history rather than to run the program slicing routine and try to cut away some of those updates. Sometimes it's just faster to just run it.
There are definitely cases, in particular with low selectivity, where the program slicing might not benefit. But we definitely see this. I mean, you don't even have to have that
high of a percentage, you know, like even 10%, 25%. We start seeing some pretty significant
improvement. Fantastic. Where do you see this research being the most useful, and who can use this?
So in general, we kind of propose this as being just one tool in a hypothetical reasoning toolbox. We would like to be able to help users come up with hypothetical reasoning in the future,
but the question is, right, who wants this? And that's a good question. And really, I think it's applicable to a lot of business scenarios, right,
where we might have some, you know, transactions,
you know, business transactions,
or we might have some, you know,
really anything where you might say, like,
oh, we have a lot of, you know, updates and inserts and deletes that kind of encode what our business does in the database. Then it's like, oh, we can totally leverage that to help you make data-driven decisions based on how your database behaved in the past, if that makes sense. Anyone who wants to, you know, make better decisions, hopefully.
So I believe the code and the toolkit is publicly available.
Where can people go and find that?
Yes.
So it's on the database group at IIT's website, cs.iit.edu/~dbgroup.
We'll link the relevant places in the show notes.
But it's implemented on top of a system that we have developed, led in particular by the head of the group, Boris Glavic. It's a system called GProM, which is a middleware for a lot of different provenance-aware tasks. So it's an extension on top of that.
But in general, the algorithm can be found in the paper and can be certainly implemented elsewhere.
It's not the easiest to deal with, but it's certainly tractable.
So what's the most interesting lesson that you've learned while working on what-if queries, and more generally in the area of data provenance?
Sure. It can be difficult to reason about; there are sometimes some unexpected surprises. You know, you'll think that you have a condition that's good enough, right? And then you might find a counterexample. And so, obviously, everything's an iterative approach.
You come up with an idea, you test it, you look around.
But that's definitely, there's a lot of nuance to this.
For example, if we have some modified update
that occurs later in the history,
we have to consider that updates before it
are producing input to that as well.
So we need to check the dependency between those later modified updates and updates that come prior. This isn't the case when there's only one modified update.
Then we can just check them only going forward.
We don't need to go backwards.
So I think definitely there's a lot to reason about.
We're pretty certain now, based on the proofs. I say pretty certain, but we're certain, you know, let me be a little bit more confident: we're certain that we definitely ironed out the semantics of all this.
Great. And so what do you have planned for the future?
Sure. So in general, like I said, this is just one tool in like a hypothetical reasoning toolbox.
There's a lot of good work being done in this space right now,
be it from years ago introducing what-if analysis, as I said, like Caravan,
as I said, like Caravan,
but also in the talk that we had just seen today, HypeR.
Yeah, so there's a lot of interesting work to continue to grow this toolbox.
Obviously, it's a hard problem to answer.
But hopefully, either kind of directly continuing this line of work, perhaps by making the implementation
more industry-ready, a little bit more robust, a little bit more performant, or also increase
the utility, right? So maybe, as we mentioned, learning some relationships between the updates in the history.
For example, if we delete a customer, right, we're going to need to delete their orders too, right?
But currently that's possible, right, because the framework is not super smart, right?
If you tell it to do it, it'll do it.
But you might, you know, get results that aren't helpful.
So we really would like to maybe, in the future, assist users with this type of hypothetical reasoning.
And we can also apply this technique, the program slicing,
the sort of underlying technical approach can be appropriated for some other interesting,
sometimes more theoretical tasks.
Well, that's fantastic.
I think we'll leave it there.
Thank you so much, Felix.
Awesome.
Thank you so much for having me.
Yeah, if you're interested in knowing more about Felix's work,
all the relevant links will be put into the show notes.
Do you have any social media accounts you'd like to shout out?
Just my website.
Okay, well, if you want to learn more, go look at Felix's website.
Brilliant, and thank you for listening, and we will see you next time.
Awesome.