Disseminate: The Computer Science Research Podcast - Sainyam Galhotra | Causal Feature Selection for Algorithmic Fairness | #5

Episode Date: July 8, 2022

Summary: The use of machine learning (ML) in high-stakes societal decisions has encouraged the consideration of fairness throughout the ML lifecycle. Although data integration is one of the primary steps to generate high-quality training data, most of the fairness literature ignores this stage. In this interview Sainyam discusses why he focuses on fairness in the integration component of data management, aiming to identify features that improve prediction without adding any bias to the dataset. Sainyam works under the causal fairness paradigm and, without requiring the underlying structural causal model a priori, he has developed an approach to identify a sub-collection of features that ensures fairness of the dataset by performing conditional independence tests between different subsets of features.

Questions:
0:35: Can you introduce your work and describe the problem you're aiming to solve?
2:39: Can you elaborate on what fairness means?
3:51: Let's dig into your solution: how does the causal approach work?
4:41: How does your approach compare to other approaches in your evaluations?
6:17: How can data scientists apply your findings to the real world?
7:54: What was the most unexpected challenge you faced while working on algorithmic fairness?
8:29: What is next for your research?
9:17: Tell us about your other publications at SIGMOD?
10:57: How can researchers get involved in algorithmic fairness?

Links: SIGMOD Presentation | Paper | Homepage | Twitter | LinkedIn

Hosted on Acast. See acast.com/privacy for more information.
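To make the approach described in the summary a little more concrete, here is a minimal, hedged sketch of feature screening via independence tests against a sensitive attribute. It is not the algorithm from the paper: it uses marginal chi-squared tests rather than the conditional independence tests over subsets of features described above, and the dataset and column names are hypothetical.

```python
# A minimal sketch (not the paper's algorithm): keep the features whose
# dependence on the sensitive attribute is not statistically significant.
# Assumes categorical columns in a pandas DataFrame; all names are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency

def select_fair_features(df, sensitive, candidates, alpha=0.05):
    selected = []
    for col in candidates:
        table = pd.crosstab(df[col], df[sensitive])    # contingency table
        _, p_value, _, _ = chi2_contingency(table)     # chi-squared test of independence
        if p_value > alpha:                            # no evidence of dependence
            selected.append(col)
    return selected

# Hypothetical usage on a loan dataset:
# df = pd.read_csv("loans.csv")
# fair_features = select_fair_features(df, sensitive="gender",
#                                      candidates=["zip_code", "income", "education"])
```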

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the podcast bringing you the latest computer science research. I'm your host, Jack Wardby. We're recording today from ACM SIGMOD/PODS in sunny Philadelphia. I'm delighted to say I'm joined here today by Sainyam Galhotra from the University of Chicago, who is going to be talking about his paper, Causal Feature Selection for Algorithmic Fairness. Thank you for joining us on the show. Thanks, Jack, for having me. I'm excited to be here. Brilliant. Let's dive straight in.
Starting point is 00:00:50 So, first of all, can you set the scene and introduce your work and describe the problem you're aiming to solve and the motivation for doing so? Sure. So, nowadays, machine learning algorithms are being widely used in decision-making systems, which impact us on a day-to-day basis. When you go to a bank to get a loan, the bank uses a credit scoring system to figure out the chances that you will be able to repay, and those systems are used to decide whether you should get a loan or not. When you apply to a company for a job, they screen your resume using an automated system.
Starting point is 00:01:24 And in the past we have identified that there are many such systems which are biased against sensitive attributes like race, gender, and age. These are some of the critical issues that such software faces, and the goal is to make sure that the software is fairer towards these sensitive attributes. That is the main focus of this work. To go into more detail: prior work has attacked this problem by changing the training data for the machine learning algorithms. Machine learning uses training data which has correlations between the sensitive attribute and the target, because of historical biases in society, the way the data was collected, and many other reasons.
Starting point is 00:02:06 Because of these unwanted correlations, people used to modify the data distributions so that any classifier they train, any analysis they do on the dataset, is going to be fair. Which is great, but it used to lead to a loss in accuracy of the final classifier, and the analysis may be incorrect because they had to add some noise to the data to make it fairer. So the approach we took in this paper was slightly different: we said, let's not modify the data values.
Starting point is 00:02:39 Let's just pick a subset of the data, a subset of the features, which, when used for machine learning or any other analysis, will make sure that the outcome is fair. And fair is, again, with respect to the sensitive attributes. That's the high-level motivation. Excellent.
Starting point is 00:02:55 Can you elaborate on what fairness actually means in this sense? Because from my understanding, there are several different definitions of fairness. Which is the one that you used in your work here? So the intuitive definition is that the sensitive attribute should not be used by any analysis algorithm or classifier that is being trained. It's easier said than done, because no loan prediction software uses the gender information directly to make the prediction; what it uses is different attributes like zip code, income, education, and so on. And many of these attributes are often correlated with the sensitive attributes.
Starting point is 00:03:32 And this is the correlation that causes the problem. For example, Amazon's same-day delivery was found to be racially biased when it was launched a couple of years ago. And the reason was it used zip code to figure out where to deliver, and the zip codes were highly correlated with race. Okay, right. So the problem is much more complex because many attributes are correlated,
Starting point is 00:03:55 causally dependent on each other, and we need to figure out what is the right set of variables which are not going to actually cause unfairness. Fantastic. So let's dig into the solution a little bit more. There are two ways of approaching this problem: the causal approach and the associative approach. You opted for the causal one.
Starting point is 00:04:14 So how does that actually work in practice, and how does your solution work? We looked at the causal dependencies between the different attributes that the loan prediction software would use, for example, age, income, education, race, gender, and so on. We looked at the causal dependencies between these attributes to figure out which attributes are actually fine to be used. If an attribute like zip code is causally dependent on age or sex, then it should not be used. We inferred which of these attributes should be used for the training or the analysis.
Starting point is 00:04:47 And the algorithm we used to identify these variables works by analyzing the dataset we had there. Excellent. Rather than assuming that you have the graph up front and someone's telling you this information, which obviously is not often true in reality. So compared to the other approaches, such as the associative approach, how does your approach actually fare, and how did you evaluate it?
Starting point is 00:05:15 So compared to other approaches, feature selection-based techniques are generally more robust to real-world issues like data distribution shifts. Our algorithm is great theoretically, but in the real world those assumptions may often not hold in practice. Sure. And when those are violated, feature selection-based techniques are much more robust compared to prior techniques in the literature which modify the data.
Starting point is 00:05:40 How you need to modify the data depends on its distribution, and if the distribution is going to change in real time, the algorithm would not work well. That was one of the main reasons. Another thing is we don't want to modify the data, because it can lead to erroneous analysis: if tomorrow you do some other analysis on the data, we will have to modify it again, and so on. So that was the main motivation for not modifying the data. In terms of the evaluation, we just looked at the prediction outcome of the classifier: we looked at the correlation between the prediction outcome and the sensitive attribute for those individuals, and whether it was high or low. If it's high, it's unfair; if it's low, it's fine.
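As a rough illustration of the evaluation Sainyam describes here (not the exact procedure from the paper), one could train a classifier on a chosen feature subset and measure the correlation between its predictions and the sensitive attribute. The column names, the logistic regression model, and the 0.1 cut-off below are illustrative assumptions.

```python
# A rough sketch of the evaluation idea: train a classifier on a chosen feature
# subset and measure how correlated its predictions are with the sensitive
# attribute. Assumes a 0/1 target column; all names here are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def unfairness_score(df, features, target, sensitive):
    X = df[features].to_numpy()
    y = df[target].to_numpy()
    preds = LogisticRegression(max_iter=1000).fit(X, y).predict(X)  # training-set predictions, for brevity
    s = pd.factorize(df[sensitive])[0]          # numerically encode the sensitive attribute
    return abs(np.corrcoef(preds, s)[0, 1])     # |correlation|: high means potentially unfair

# Hypothetical usage; the 0.1 cut-off is an arbitrary illustration:
# score = unfairness_score(df, ["income", "education"],
#                          target="loan_approved", sensitive="gender")
# print("unfair" if score > 0.1 else "fine")
```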
Starting point is 00:06:16 Awesome. So how can these findings be leveraged by data scientists in the real world? How can they take your results and apply them to their practice? Have we kind of now got the final word on what's the best thing to do to achieve fairness? Right. So our approach works well under a certain assumption. What we say is that we can identify the set of features that ensures fairness. But that is a binary question. In reality, you may say that I'm okay to have less than five percent or ten percent
Starting point is 00:06:51 unfairness, or unfairness caused by certain specific attributes. So those are the real-world extensions where this approach should be used; the approach needs a bit of an extension to be used there. I think this approach is very useful in ranking all the attributes of the individuals based on their unfairness. Okay. And the domain scientist can look at those attributes and pick, say, the top k based on whatever their threshold is. For example, in this case our algorithm would say that zip code is probably highly unfair but education is fine, so we pick education. So, after all, fairness is a problem for which the decision has to be taken by the decision maker. Sure. The algorithm cannot solve the problem end-to-end. Yeah. The algorithm is just going to act as guidance for the decision maker.
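A hedged sketch of the ranking idea mentioned above: score each candidate attribute by its dependence on the sensitive attribute (mutual information is used here purely for illustration), sort from least to most dependent, and let the domain expert keep the top k. All names are hypothetical.

```python
# A hedged sketch of the ranking idea: score each attribute by its mutual
# information with the sensitive attribute and keep the k least dependent ones.
# Column names and the choice of k are hypothetical.
import pandas as pd
from sklearn.metrics import mutual_info_score

def rank_by_dependence(df, sensitive, candidates):
    scores = [(col, mutual_info_score(df[col], df[sensitive])) for col in candidates]
    return sorted(scores, key=lambda pair: pair[1])     # least dependent first

# Hypothetical usage:
# ranking = rank_by_dependence(df, sensitive="race",
#                              candidates=["zip_code", "income", "education"])
# top_k = [name for name, _ in ranking[:2]]             # e.g., keep the two "safest" features
```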
Starting point is 00:07:34 So the next question I have is, what's the most interesting or unexpected or challenging lesson that you learned while working on this paper and, more generally, in the area of algorithmic fairness? So I think algorithmic fairness is a very difficult topic, but it's a very practical topic because it impacts us on a daily basis. Not being a social researcher myself, I wanted to figure out what is the core technical issue that I can solve and how it impacts the real world. So finding that connection is the toughest job in this case.
Starting point is 00:08:18 How do you define the problem, and find the real value in that problem? So what's next? Where do you take this work further? What's the next step to improve on this? So we are generalizing the notion of fairness. Like I mentioned, right now we treat it as binary: either it's fair or unfair, there's nothing in between. Now we are trying to build a continuous spectrum where we say that it's epsilon unfair.
Starting point is 00:08:40 And you can vary the value of epsilon based on your requirements and figure out what kind of classifier you want to train. So that is just one direction of work. There's another line of work where we are looking into other issues in the data: suppose the data has sampling bias and you did not have sufficient data to do the training, then how can you make algorithms fairer? So these are a couple of the interesting, promising directions that we are looking at. That's great. So whilst we're here, you may as well tell us briefly about your other publications at SIGMOD. What other topics have you been working on and would the listener maybe be interested in looking at?
Starting point is 00:09:15 Sure. So there's another paper which uses causal inference and similar notions to understand hypothetical queries. For example, you are an individual. Now you ask a question: what would be my salary if I go for a master's or if I go for a doctorate degree? So this is a hypothetical scenario where you are projecting the future. So asking such questions
Starting point is 00:09:36 on a database is one work which we propose, where we answer "what if" and "how to" queries. "What if" is like: what if I do this, what will happen? "How to" is like: how can I reach there? And we capture the causal dependencies between attributes to give you a good answer for that.
Starting point is 00:09:52 There's another line of work where we are using causal inference-style techniques to debug data and systems and the mismatch between them. Generally, you implement a system assuming the input data is in a specific format. But what if it's not in that format? How do you figure
Starting point is 00:10:07 out what should you do? How should you change the data? Or how should you change the code to make sure the input data is in the correct format for the system? That's the third one. The fourth one is more on the data integration side of things, where you get data from different sources. How do you organize it well?
Starting point is 00:10:24 Can you figure out what are the different types of records you have? How can you arrange them in the form of a hierarchy, like the Amazon product-type hierarchy, where the products are organized based on the different product types? Fascinating stuff. So there's plenty there for the listener to get stuck into. My last question would be a bit of advice from you to the listener: how would they get more involved in your research area? Where can they look? Where can they find more resources to maybe themselves find a career researching algorithmic fairness?
Starting point is 00:10:59 I think algorithmic fairness, even though there's been a lot of work recently, is still a very new topic where there are millions of challenges which need to be solved. You just need to dive deep in, read the most recent papers, try to understand what it is they're trying to do, what it is they're not trying to do, and what you can improve on. And slowly, you'll be able to figure out what should be done and how to improve things further. That's fantastic.
Starting point is 00:11:24 So I presume the source code for all of your projects is available publicly online, so we'll link those in the show notes and obviously to all of your papers as well. But that's everything for today. We'll end it there. Thank you. Thank you very much, Sainyam.
Starting point is 00:11:37 And like I said before, if you're interested in learning any more, I'll put all the relevant links in the show notes. So thank you for listening, and see you next time.
