Sample selection bias and up/down-sampling #540

Open · rth opened this issue Feb 5, 2019 · 5 comments

rth commented Feb 5, 2019

It's a bit of an open-ended question. In my understanding, up-/down-sampling the input data depending on the target class is equivalent to having a dataset with sample selection bias. The possible impact of the latter on ML models is discussed, e.g., by Zadrozny (2004).

In the use case of imbalanced-learn, I gather this is not an issue because the sample selection depends only on the target variable y, never on the features in X (which corresponds to case 2 on page 2 of the above-linked paper)?
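For concreteness, this is the y-only selection I mean; the dataset is synthetic, but the sampler is the actual imbalanced-learn API:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Synthetic 90%/10% imbalanced binary dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))

# Whether a sample is kept depends only on its label y, never on its
# feature values in X -- case 2 of Zadrozny 2004.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # both classes now at the minority-class count
```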

An orthogonal question: assume we do have a dataset with sample selection bias based on some feature in X (case 3, page 2 of the same paper). In other words, the distribution of one of the columns of X does not match the real-world distribution, and we would like to compensate for it. Could one of the approaches in imbalanced-learn be used (or adapted) for this? Would something like this be in the scope of this project?
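To make the question concrete, here is a minimal sketch of the kind of compensation I have in mind (hypothetical, not an existing imbalanced-learn API): assuming we know both the biased and the real-world marginal of one column, resample with weights given by the density ratio:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical setup: column 0 was sampled from N(1, 1) in our dataset,
# but is known to follow N(0, 1) in the real world (case 3: selection
# depends on a feature, not on y).
X = rng.normal(loc=1.0, scale=1.0, size=(5000, 3))

# Importance weights = target density / sampling density for column 0.
w = norm.pdf(X[:, 0], loc=0.0, scale=1.0) / norm.pdf(X[:, 0], loc=1.0, scale=1.0)
p = w / w.sum()

# Resample with replacement according to the weights; column 0 of the
# resampled data now approximately follows the real-world N(0, 1).
idx = rng.choice(len(X), size=len(X), replace=True, p=p)
X_res = X[idx]
print(X[:, 0].mean(), X_res[:, 0].mean())  # ~1.0 vs ~0.0
```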

glemaitre (Member) commented

I would say yes. Then we would need to think about the right module to put it in.

> In other words, the distribution of one of the columns of X does not match the real-world distribution and we would like to compensate for it.

I have not looked at the paper yet, but is it related to importance sampling, in which you would resample the X column such that it follows a given "real-world" distribution?

In the case of over-sampling, we could think about something similar, in which you estimate the distribution (or parameters such as covariances) from other datasets and use this in the rebalancing procedure. It would be a kind of data augmentation using knowledge from data instead of random generation.
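A rough sketch of that idea (hypothetical, nothing like this exists in imbalanced-learn yet): estimate a Gaussian from an external dataset that is assumed to describe the minority class well, and draw the new samples from it rather than duplicating or interpolating existing ones:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical external dataset, assumed to describe the minority class well.
X_external = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 2]], size=2000)

# Estimate the distribution parameters from the external data...
mu = X_external.mean(axis=0)
cov = np.cov(X_external, rowvar=False)

# ...and use them to generate synthetic minority samples for the
# imbalanced dataset, instead of purely random duplication.
n_missing = 300  # how many minority samples we want to add
X_synthetic = rng.multivariate_normal(mu, cov, size=n_missing)
```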

I would be really interested in implementing such a feature, or in helping with it.

glemaitre (Member) commented

We should include some of these in 1.X.

glemaitre added this to the 1.0 milestone on Nov 17, 2019
glemaitre (Member) commented

@rth Did you see some of these methods in the literature? Probably we should look at the fairness papers.

rth (Author) commented Nov 17, 2019

I have not really looked into this question since opening the issue in February.

chkoar (Member) commented Nov 19, 2020

Well,

> Probably we should look at the fairness papers.

Yes, there is a body of research on this subject. I think this problem is itself an imbalanced one, so we could tackle it inside imbalanced-learn. API-wise, we would probably need some changes. I leave here one (of the many) relevant papers.
