Disseminate: The Computer Science Research Podcast - Felix S Campbell | Efficient Answering of Historical What-if Queries | #2

Episode Date: July 1, 2022

Summary: In this interview Felix discusses "historical what-if queries", a novel type of what-if analysis that determines the effect of a hypothetical change to the transactional history of a database. For example, "how would revenue be affected if we had charged an additional $6 for shipping?" In his research Felix has developed efficient techniques for answering these historical what-if queries, i.e., determining how a modified history affects the current database state. During the show, Felix talks about reenactment, a replay technique for transactional histories, and how he and his co-authors optimize this process using program and data slicing techniques to determine which updates and what data can be excluded from reenactment without affecting the result.

Questions:
0:42: Can you start off by explaining what historical what-if queries are?
1:56: What is the naive approach to answering these types of questions?
2:47: What are the problems with this naive approach and why is your solution better?
3:45: Tell us about reenactment, how does that work?
4:48: In your paper you mention two additional techniques, data slicing and program slicing, can you tell us more about these?
6:44: How do reenactment, data slicing and program slicing compare to other techniques in the literature? Where do they improve on the pitfalls of those?
8:00: Are there any commercial DBMSs that provide similar functionality out of the box?
8:57: How did you go about evaluating your solution?
10:40: What are the parameters you varied in your evaluation?
14:11: Where do you see this research being most useful? Who can use this?
15:17: Are the code/toolkit publicly available?
16:15: What is the most interesting aspect of working on what-if queries and, more generally, in the area of data provenance?
17:36: What do you have planned for future research?

Links:
Felix's homepage
Illinois Institute of Technology (IIT) Database Group's homepage
Efficient Answering of Historical What-if Queries SIGMOD paper
SIGMOD presentation

Contact Info:
Email: fcampbell@hawk.iit.edu
LinkedIn

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the podcast bringing you overviews of the latest computer science research. I'm your host, Jack Wardby. We're recording today from ACM SIGMOD/PODS in Philadelphia, and I'm delighted to say I'm joined by Felix Campbell, who will be talking about his paper, Efficient Answering of Historical What-if Queries. Felix is a student at the Illinois Institute of Technology, and in the fall he will be joining Ben Gurion University to start his PhD. He is interested in data provenance and query optimization methods. Felix, thank you for joining the show. Yeah, of course. Thanks for having me.
Starting point is 00:00:57 Can you start off by explaining to our listeners what historical what-if queries are? Sure. So with historical what-if queries we want to allow users to be able to leverage the history of their database. Typically there's actually a lot of interesting information in there that we can exploit to reason about our data a little bit better. So, namely, historical what-if queries allow users to pose hypothetical scenarios and see how they would have played out with the rest of their database. So the canonical example that we use is, in like an e-commerce scenario, you might have a shipping policy in the database that reduced the shipping fee to zero for orders of greater than $50. And so if you as a data scientist or analyst decided, you know, we want to see what would have happened if we had set this threshold to $60 instead, we now provide a framework to not only answer this query but to answer it efficiently, to more effectively allow data scientists to test these interventions. Fantastic. So what's the naive approach to answering these types of questions? Sure.
Starting point is 00:02:15 So naively, right, the structure of the query is such that we want to know the difference between the original database and this hypothetical database. So naively, in order to do that, we have our original database. It already exists. It's current. It's the truth, ground truth, so to speak. And so we need to create a copy of the database prior to the transactional history. So we use a time travel feature that's supported by many database management systems in order to achieve that. So after copying, then we apply the set of modifications to the history, which we then evaluate over the database.
Starting point is 00:02:49 And then we simply just take the symmetric difference, right? So that's anything that's exclusively in the original ground truth database or anything that's exclusively in the hypothetical database. And it's annotated to let you know which is coming from where. Okay, fantastic. So what are the problems with this naive approach, and why is your solution better? Sure, of course. So, right, the key part here is there's a lot of overhead in the naive approach to the problem, right?
Starting point is 00:03:15 Namely, copying the entire database can be a little bit impractical with larger relations, especially if this is something that you want to evaluate a few times, right? So if you change the threshold to $60 from our earlier example, maybe you want to try $65 or $70 or $45. But in this case, you'd have to be copying your large relation for each indication historically. So on top of that, actually evaluating these transactional histories, again, over the database can be problematic
Starting point is 00:03:44 as you have the overhead that would occur from actually executing these queries with side effects, right? So you're actually writing to disk, you're incurring the transaction processing overhead. So we would like to avoid this, ideally, if possible. Cool. So with that in mind, can we go into a little bit more depth on your solution and talk about reenactment? How does that work? Sure. So as I just mentioned, this big issue with copying, right? With that in mind, can we go into a little bit more depth on your solution and talk about reenactment? How does that work? Sure. So, as I just mentioned, right, this big issue with copying, right?
Starting point is 00:04:15 And the issue is that, right, we can't reuse the copy after we've executed the history, right? Because now we've made any side effects, right? So you run the query, the reenactment query, right? And you get back a relation that is going to be the same thing as if you had executed the actual updates, inserts, and deletes as they were open. So in your paper as well, you mentioned two additional techniques that can help in this sort of thing. So these are data slicing and programming slicing.
Starting point is 00:05:07 Can you go into a little bit more depth about these as well and describe how they're useful? Sure, of course. So data slicing and program slicing are kind of some techniques that we borrowed, at least particularly program slicing for the programming language community. But namely, what we seek to do in both is to kind of reduce the size of the computation. So now that we're doing reenactment, we're able to kind of piece apart those computations and reduce the amount of computation necessary. So in data slicing, intuitively, because of the structure of the problem, we concern ourselves with only tuples that might appear in the symmetric difference. And so knowing this, we kind of investigate this a little bit further to see which tuples will necessarily have to be in the answer of the historical word of grid, which is the
Starting point is 00:05:59 difference. So we investigated that, and we've come up with how to filter out the tuples prior to the reenactment to get rid of all of that extra data that we would have to process otherwise. And in the case of program slicing, it's kind of a bit of an analog to data slicing. In this case, we have a transactional history and a lot of these transactions are actually irrelevant to computing the answer to the historical difficulty, right? If we are concerning ourselves with modifying the shipping fee based on price, we don't have to care about changing a customer's country.
Starting point is 00:06:33 That's a little bit not so precise of a statement, right? Because changing their country might affect a later shipping fee change. But in general, right, just to demonstrate, there's probably a decent amount of updates that you don't need to run again in order to answer the same question. How does this solution then compare to other techniques in the literature? And what are the benefits of this compared to the other ones? Where does it improve on the pitfalls of those solutions? Sure. So there's a few existing what-if analysis systems. It's not a
Starting point is 00:07:05 completely new idea, this what-if analysis hypothetical reasoning. I would say in particular our approach tries to leverage the transactional history of the database. This presents a lot of information. There's a great talk today about a paper that is also a framework for hypothetical reasoning, but they don't really use the historical, you know, nature that we are able to actually leverage to sort of, you know, provide some predictions or some hypothetical answers. So in general, I would say the key to our approach, at least the problem, is a little bit novel, right? Because we're looking at modifying this history and seeing how it would have been different. And then the technique is novel in that, you know, there's been data slicing and program slicing done before in literature,
Starting point is 00:07:59 even with respect to databases. But some of the more technical, like key technical aspects are, I would say, pretty novel. Following on from that, are there any commercial database systems at the moment that provide anything like this in their current offerings? So I think there's a number of commercial database systems that definitely provide the groundwork. I think the key things that we require for this is audit logging, right? So we actually need to have a transactional history before we do anything, so audit logging is key. But additionally, time travel, right? We need to know what was the state of the database prior
Starting point is 00:08:29 to the history that we're trying to investigate hypothetically. But in terms of out-of-the-box functionality, it doesn't really quite exist. Originally, this paper kind of referred to our system as a middleware for answering sort of relative queries, and this kind of nomenclature as a middleware for answering a sort of a what-if queries.
Starting point is 00:08:47 And this kind of nomenclature pops up. It's a story history. But yeah, so to our knowledge, it's definitely not an out-of-the-box. We actually have a slight extension to SQL to facilitate the what-if queries. How did you go about evaluating your solution? What sort of benchmarks did you use? What were the results?
Starting point is 00:09:04 Give us some numbers and say how good this is. So we were going to evaluate this. So we used a couple of data sets, some standardized benchmarks, and some sort of real-world data. The challenge with this is, obviously, it's hard to measure some sort of ground truth, right? This is a hypothetical reason.
Starting point is 00:09:24 We're doing A-B testing without the B, right? There's not really... It can be hard to... And that's definitely a challenge that... I don't know how much I should reference this other talk, but... Go for it. Of course, right? That's sort of an issue they had too, right?
Starting point is 00:09:39 They need to input this causal graph because they don't know. That's a hard problem. It's something for someone to sort of analyze just with domain expertise. And so, but yeah, so we used some couple standard benchmarks. TPC-C specifically is one of the such benchmark that's used for pretty standard query answering. Also the Yahoo Cloud Services benchmark.
Starting point is 00:10:06 But also, we used a TaxiTrips database from the city of Chicago. They have an open data portal. And so the workloads that we used for at least the standardized benchmarks, they specify some workloads. So they specify some updates, some inserts, some deletes. And so this is, we took these workloads, we made some hypothetical modifications ourselves and tried to understand how, you know,
Starting point is 00:10:34 different parameters affect the runtime. What sort of parameters are we talking here? What are the knobs we can turn here? Yeah, exactly. So there's quite a few knobs that we can turn. There's a lot of, you can really dial it in. But in particular, right, the key thing that we've identified is, I mean, it's, you know, on its face, it's pretty obvious.
Starting point is 00:10:57 But, right, the two key things I would say are, in terms of the speedup you'll get, is the percentage of dependent updates. key things I would say are in terms of the speed up you'll get is the percentage of dependent updates. So we term a dependent update as an update that we can't eliminate from the reenactment. And so if everything is dependent on everything
Starting point is 00:11:16 you're not going to be able to benefit. And the same thing with the data. The data is I think as well. So if every dependent update is modifying all the data, you're not going to be able to get rid of anything from your processing. So those are the two things. So typically in a real-world scenario, right,
Starting point is 00:11:31 you usually have updates that make certain partitions or cuts. So if you're only modifying, for example, customers in the UK or customers in the US, you're able to sort of isolate that as compared to modifying every single customer. The main difference in
Starting point is 00:11:48 running time as well is obviously we scale in the size of the transactional history. So the data size doesn't matter with regards to program slicing, which is nice. Some other previous approaches that don't do it exactly for the same problem, but do kind of don't do it exactly the same prop for the same problem but do kind of a
Starting point is 00:12:06 similar program slicing are linear in the size of the data so it's not good right you have more data it's going to take longer it's something we would like to avoid but in general the results are you know against the naive method to our method we're usually in order of magnitude better. Typically, our reenactment approach with all of our optimizations enabled is usually to win out over reenactment alone or either optimization alone. And there are some cases where this isn't necessarily true. Yeah, what are those cases? Sure. So as I mentioned, we have this case where we have basically the extremes. So if we have an extremely where we have basically the extremes.
Starting point is 00:12:51 So if we have an extremely high amount of tuples being modified or extremely high amount of dependent updates, you're not going to be able to benefit. But there's an extreme in the data side where if you have very few tuples that are actually being or right rows being modified by the updates that are dependent, it's going to be the case that it's actually, the program slicing is actually a pretty heavy algorithm. It uses linear programming, and just to kind of like high level, it's definitely a hard problem. So there's some cases where, because the amount of data that we're actually reenacting on
Starting point is 00:13:24 is so small, it could be amount of data that we're actually reenacting on is so small, right? It could be in the cases that we were testing, we were testing maybe only, you know, a couple to tens of tuples out of millions. And in those cases, it's a lot simpler to just do those 10, 20 to 100 so tuples through the whole history rather than to try to do the program slicing routine, right? And try to cut away some of routine, right, and try to cut away some of that, some of those updates. Sometimes it's just faster to just run it. It's definitely cases where, in particular with low selectivity, where, you know, the program slicing might not benefit. But we definitely see this. I mean, you don't even have to have that high of a percentage, you know, like even 10%, 25%. We start seeing some pretty significant
Starting point is 00:14:02 improvement. Fantastic. Where do you see this research being the most useful, and who can use this? So in general, we kind of propose this as being just one tool in a hypothetical reasoning toolbox. We don't really, we would like to be able to help users in the future come up with a hypothetical reasoning, but the question is, right, who wants this? And that's a good question. And really, I think it's applicable to a lot of business scenarios, right, who wants this? And that's a good question. And really, I think it's applicable to a lot of business scenarios, right, where we might have some, you know, transactions,
Starting point is 00:14:33 you know, business transactions, or we might have some, you know, really anything where you might say, like, oh, we have a lot of, you you know updates and inserts and deletes that kind of encode sort of what our business does in the database then it's like oh we can we can totally use that to leverage that to help you make data-driven decisions um based on sort of how your how your database behaved in the past if that makes sense anyone Anyone who wants to, you know, make better decisions, hopefully. So I believe the code and the toolkit is publicly available.
Starting point is 00:15:10 Where can people go and find that? Yes. So it's on the database group at IIT's website, cs.iit.edu slash tilde dbgroup. We'll link the relevant places in the show notes. But it's implemented on top of a system that we have developed, particularly by the head of the group, Boris Golovkin. But it's a system called GPROM. So that's like a middleware for a lot of different confidence-aware tasks.
Starting point is 00:15:38 So it's an extension on top of that. But in general, the algorithm can be found in the paper and can be certainly implemented elsewhere. It's not the easiest to deal with, but it's certainly tractable. So what's the most interesting lesson that you've learned while working on what-if queries, and more generally in the area of data provenance sure um it can be a little uh it can be difficult to reason about there's sometimes some some unexpected surprises um you know you'll you'll think that you have like a condition that's that's good enough right and then you might find a countererexample. And so, obviously, everything's an iterative approach.
Starting point is 00:16:27 You come up with an idea, you test it, you look around. But that's definitely, there's a lot of nuance to this. For example, if we have some modified update that occurs later in the history, we have to consider that updates before it are producing input to that as well. So we need to check the dependency between those later modified updates and updates that come prior. This isn need to check the dependency between those later modified updates
Starting point is 00:16:45 and updates that come prior. This isn't the case when there's only one modified update. Then we can just check them only going forward. We don't need to go backwards. So I think definitely there's a lot to reason about. We're pretty certain now based on the proof I say pretty certain but we're certain you know let me be a little bit more confident when we say we're certain that we definitely kind of ironed out
Starting point is 00:17:13 the semantics that all this has great and so what do you have planned for the future sure so in general like I said this is just one tool in like a hypothetical reasoning toolbox. There's a lot of good work being done in this space right now, be it from years ago introducing WIDF analysis, as I said, like Caravan, but also in the talk that we had just seen today, HypeR. Yeah, so there's a lot of interesting work to continue to grow this toolbox.
Starting point is 00:17:43 Obviously, it's a hard problem to answer. But hopefully, either kind of directly continuing this line of work, perhaps by making the implementation more industry-ready, a little bit more robust, a little bit more performant, or also increase the utility, right? So maybe, as we mentioned, maybe sort of learning some relationships between the updates and the history. For example, if we delete a customer, right, we're going to need to delete their orders too, right? But it's currently possible that, right, it's that the framework is not super, it's not smart, right? If you tell it to do it, it'll do it. But you might, you know, get results that aren't helpful.
Starting point is 00:18:25 So we really would like to maybe in the future assist users in this type of little reason. And we can also apply this technique, the program slicing, the sort of underlying technical approach can be appropriated for some other interesting, sometimes more theoretical tasks. Well, that's fantastic. I think we'll leave it there. Thank you so much, Felix. Awesome.
Starting point is 00:18:44 Thank you so much for having me. Yeah, if you're interested in knowing more about Felix's work, all the relevant links will be put into the show notes. Do you have any social media accounts you'd like to shout out? Just my website. Okay, well, if you want to learn more, go look at Felix's website. Brilliant, and thank you for listening, and we will see you next time. Awesome. you next time.
