Disseminate: The Computer Science Research Podcast - Sainyam Galhotra | Causal Feature Selection for Algorithmic Fairness | #5
Episode Date: July 8, 2022
Summary: The use of machine learning (ML) in high-stakes societal decisions has encouraged the consideration of fairness throughout the ML lifecycle. Although data integration is one of the primary steps to generate high-quality training data, most of the fairness literature ignores this stage. In this interview Sainyam discusses why he focuses on fairness in the integration component of data management, aiming to identify features that improve prediction without adding any bias to the dataset. Sainyam works under the causal fairness paradigm and, without requiring the underlying structural causal model a priori, he has developed an approach to identify a sub-collection of features that ensure fairness of the dataset by performing conditional independence tests between different subsets of features.
Questions:
0:35: Can you introduce your work and describe the problem you're aiming to solve?
2:39: Can you elaborate on what fairness means?
3:51: Let's dig into your solution: how does the causal approach work?
4:41: How does your approach compare to other approaches in your evaluations?
6:17: How can data scientists apply your findings to the real world?
7:54: What was the most unexpected challenge you faced while working on algorithmic fairness?
8:29: What is next for your research?
9:17: Tell us about your other publications at SIGMOD.
10:57: How can researchers get involved in algorithmic fairness?
Links: SIGMOD Presentation, Paper, Homepage, Twitter, LinkedIn
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the podcast bringing you the latest computer science research.
I'm your host, Jack Wardby.
We're recording today from ACM SIGMOD/PODS in sunny Philadelphia.
I'm delighted to say I'm joined here today by Sainyam Galhotra from the University of Chicago,
who is going to be talking about his paper, Causal Feature Selection for Algorithmic Fairness.
Thank you for joining us on the show.
Thanks, Jack, for having me. I'm excited to be here.
Brilliant. Let's dive straight in.
So, first of all, can you set the scene and introduce your work
and describe the problem you're aiming to solve and the motivation for doing so?
Sure. So, nowadays, machine learning algorithms are being widely used
in decision-making systems, which impact us on a day-to-day basis.
When you go to a bank for getting loans,
the bank uses a credit scoring system to figure out what are the chances that you will be able to repay,
and those systems are used to decide whether you should get a loan or not.
You apply to a company for jobs, they screen your resumes using an automated system.
And in the past, we have identified
that there are many such systems which are biased against sensitive attributes like race, gender, and
age. These are some of the critical issues that this software faces, and the goal is to make
sure the software is fairer with respect to these sensitive attributes. That is the main focus of this work.
To go into more detail about the work: prior work has attacked this problem by changing the training
data for the machine learning algorithms. Machine learning uses training data which has
correlations between the sensitive attribute and the target, because of historical biases in society,
the way the data was collected, and many other reasons.
Because of these unwanted correlations, people used to modify the data distributions
so that any classifier they train, any analysis they do on the data set is going to be fair,
which is great, but it tends to lead to a loss in accuracy of the final classifier.
The analysis may also be incorrect, because they had to add some noise to the data
to make it fairer.
So the approach we took in this paper was slightly different:
we said, let's not modify the data values.
Let's just pick a subset of the data,
a subset of the features,
which, when used for machine learning
or any other analysis,
will make sure that the outcome is fair.
And fair is, again, with respect to the sensitive attributes.
That's the high-level motivation.
Excellent.
Can you elaborate on what fairness actually means in this sense?
Because from my understanding,
there are quite a few different definitions of fairness.
Which is the one that you used in your work here?
So the intuitive definition is that the sensitive attribute should not be used by any analysis
algorithm that is being trained for the classification. It's easier said than done, because no loan
prediction software uses the gender information directly to make the prediction. What it uses is
different attributes like zip code, income, education, and so on.
And many of these attributes are often correlated with the sensitive attributes.
And this is the correlation that causes a problem.
For example, Amazon's same-day delivery was found to be racially biased
when it was launched a couple of years ago.
And the reason was it used zip code to figure out where to deliver,
and the zip codes were highly correlated with race.
Okay, right.
So the problem is much more complex
because many attributes are correlated,
causally dependent on each other,
and we need to figure out what is the right set of variables
which are not going to actually cause unfairness.
Fantastic.
So let's dig into the solution a little bit more.
So there are two ways of approaching this problem.
There's the causal approach and then the associative approach.
So you opted for the causal one.
So how does that actually work in practice, and how does your solution work?
We looked at the causal dependencies between the different attributes that the loan prediction software would use.
For example, age, income, education, race, gender, and so on.
We looked at these causal dependencies between the attributes
to figure out which attributes are actually fine to use.
If an attribute like zip code is causally dependent on age or sex,
then it should not be used.
We inferred which of these attributes should be used for the training or the analysis.
And the algorithm we used to identify these variables works by analyzing
the dataset we have.
Excellent.
Rather than assuming that you have the graph kind of up front and someone's telling you
this information, which obviously is not often true in reality.
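To make this concrete for readers, here is a minimal sketch in Python of this kind of dependence-based feature filtering. It is not the paper's actual algorithm: the paper performs conditional independence tests between subsets of features and does not assume the causal graph up front. The column names, data layout, and the chi-squared test used here are illustrative assumptions only.

```python
# A minimal sketch of dependence-based feature filtering, assuming tabular data in a
# pandas DataFrame with categorical columns. NOT the paper's algorithm: the paper uses
# conditional independence tests between subsets of features; here we only test each
# feature's marginal dependence on the sensitive attribute, purely for illustration.
import pandas as pd
from scipy.stats import chi2_contingency

def select_candidate_features(df: pd.DataFrame, sensitive: str, target: str,
                              alpha: float = 0.05) -> list:
    """Keep features that show no statistically significant dependence on the
    sensitive attribute (column names are hypothetical)."""
    selected = []
    for col in df.columns:
        if col in (sensitive, target):
            continue
        # Chi-squared test of independence between this feature and the sensitive attribute.
        table = pd.crosstab(df[col], df[sensitive])
        _, p_value, _, _ = chi2_contingency(table)
        if p_value > alpha:  # fail to reject independence -> keep the feature
            selected.append(col)
    return selected

# e.g. features = select_candidate_features(loans, sensitive="race", target="approved")
```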
So compared to the other approaches, such as the associative approach,
how does your approach actually fare in your evaluation,
and how did you evaluate it?
So compared to other approaches, feature selection-based techniques
are generally more robust to real-world issues like data distribution shifts.
Because our algorithm is great theoretically,
but in the real world those assumptions may not always hold in practice.
Sure.
And when those are violated,
feature selection based techniques are much more robust
as compared to other prior literature techniques which modify the data.
How you need to modify the data depends on its distribution.
And if the distribution is going to change in real time, the algorithm would not work well. That was one of the main
reasons. Another thing is we don't want to modify the data, because it can lead to erroneous
analysis: if tomorrow you do some other analysis on the data, we would have to modify it again,
and so on. So that was the main motivation for not modifying the data.
In terms of the evaluation, we just looked at the prediction outcome of the classifier.
We looked at the correlation between the prediction outcome and the sensitive attribute
for those individuals, and whether it was high or low. If it's high, it's unfair; if it's low, it's fine.
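As a rough illustration of the evaluation just described, one could measure the correlation between the classifier's predictions and a binary-encoded sensitive attribute. The variable names below are hypothetical and this is only a sketch, not necessarily the paper's exact metric.

```python
# Illustrative fairness check: how strongly do predictions track the sensitive attribute?
import numpy as np

def prediction_bias(predictions: np.ndarray, sensitive: np.ndarray) -> float:
    """Pearson correlation between predicted outcomes and a binary-encoded sensitive attribute."""
    return float(np.corrcoef(predictions, sensitive)[0, 1])

# e.g. score = prediction_bias(clf.predict(X_test), sensitive_test)
# A value near 0 suggests the outcomes are not tracking the sensitive attribute;
# a large magnitude suggests unfairness in the sense discussed above.
```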
Awesome. So how can these findings be leveraged by
data scientists in the real world?
How can they take your results and apply them to their practice?
Have we kind of now got the final word on what's the best thing to do to achieve fairness?
Right. So our approach is working well under a certain assumption.
What we say is that we can identify the set of features that ensure fairness.
But that is a binary question. In reality, you may say that I'm okay with less than five percent
unfairness, or ten percent unfairness, or unfairness caused by certain specific attributes. Those are
the real-world extensions where this approach needs a bit more work before it can be used.
So I think this approach is very useful in ranking all the attributes of the individuals based on
their unfairness, and the domain scientist can look at those attributes and pick the top k based on
whatever their threshold is. For example, in this case our algorithm would say that zip code
is probably highly unfair but education is fine, so we pick education.
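To make the ranking idea concrete, here is a tiny hypothetical sketch: given per-attribute unfairness scores (however they are computed), rank the attributes and keep those under an analyst-chosen threshold. None of these names or scores come from the paper.

```python
# Hypothetical sketch: rank attributes by an unfairness score and apply an analyst-chosen cutoff.
def attributes_within_budget(unfairness: dict, threshold: float) -> list:
    """Return attribute names whose unfairness score is at or below the threshold,
    ordered from least to most unfair."""
    ranked = sorted(unfairness.items(), key=lambda kv: kv[1])
    return [name for name, score in ranked if score <= threshold]

# e.g. attributes_within_budget({"zip_code": 0.9, "education": 0.1, "income": 0.3}, threshold=0.5)
# -> ["education", "income"]   (zip_code is flagged as too unfair to use)
```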
So, after all, fairness is a problem for which the decision has to be taken by the decision maker.
Sure.
The algorithm cannot solve the problem end-to-end.
Yeah.
The algorithm is just going to act as guidance for the decision maker.
So the next question I have is, what's the most interesting, unexpected, or challenging lesson that you learned while working on this paper, and more generally in the area of algorithmic fairness?
So I think algorithmic fairness is a very difficult topic, but it's a very practical topic because it impacts us on a daily basis. Not being a social researcher myself,
I wanted to figure out what is the core technical issue that I can solve
and how it impacts the real world.
So finding that connection is the toughest job in this case:
how do you define the problem, and how do you find the real value in that problem?
So what's next? Where do you take this work further?
What's the next step to improve on this?
So we are generalizing the notion of fairness.
Like I mentioned, right now we treat it as binary.
Either it's fair or unfair. There's nothing in between.
Now we are trying to build a continuous spectrum
where we say that it's epsilon unfair.
And you can vary the value of epsilon based on your requirements
and figure out what the classifier is that you want to train. So that is just one direction of work.
There's another line of work where we are looking into other issues in the data.
For example, suppose the data has sampling bias, or you did not have sufficient data to do the training; then how
can you make the algorithms fairer? So these are a couple of directions that we are looking at.
Several interesting and promising directions there. That's great.
So whilst we're here, you may as well tell us briefly about your other publications at SIGMOD.
What other topics have you been working on that the listener might be interested in looking at?
Sure. So there's another paper which uses causal inference and similar notions to understand hypothetical queries.
For example, you are an individual.
Now you ask a question, what
would be my salary if I
go for a master's or if I go for a doctorate
degree? So this is a hypothetical
scenario where you are projecting the future.
So answering such questions
over a database is another
piece of work we propose, where we
answer what-if and how-to queries.
What-if is like, what if I do this, what
will happen? How-to is like,
how can I reach there? And we capture the
causal dependencies between attributes to
give you a good answer for that.
There's another line of work where we are using
causal inference
kind of techniques to debug
data and systems and the mismatch
between them. Generally, you
implement a system, assuming the input
data is in a specific format. But what if
it's not in that format? How do you figure
out what should you do?
How should you change the data? Or how should you change
the code to make sure the input data
is in the correct format for the system?
That's the third one. The fourth
one is more on the data integration side of
things, where you get data from different
sources. How do you organize it well?
Can you figure out what are the different types of records you have?
How can you arrange them in the form of a hierarchy,
like the Amazon type hierarchy,
where the products are organized based on the different product types?
Fascinating stuff.
So there's plenty there for the listener to get stuck into.
My last question would be, I guess, asking for a bit of advice from you to the listener: how would they get more involved in your research area?
Where can they look? Where can they find more resources to kind of maybe themselves find a career researching in algorithmic fairness?
I think algorithmic fairness is a very new topic, even though there has been a lot of work recently.
There are millions of challenges which still need to be solved.
You just need to dive deep in, read the most recent papers,
try to understand what it is that they're trying to do,
what it is that they're not doing, and what you can improve on.
And slowly, you'll be able to figure out what should be done
and how to improve those further.
That's fantastic.
So I presume the source code for all of your projects
is available publicly online,
so we'll link those in the show notes
and obviously to all of your papers as well.
But that's everything for today.
We'll end it there.
Thank you.
Thank you very much, Sainyam.
And like I said before, if you're interested in learning any more,
I'll put all the relevant links into the show notes.
So thank you for listening and see you next time.