Sample selection bias and up/down-sampling #540

Open · rth opened this issue Feb 5, 2019 · 5 comments

rth commented Feb 5, 2019

It's a bit of an open-ended question. In my understanding, up-/down-sampling the input data depending on the target class is equivalent to having a dataset with sample selection bias. The possible impact of the latter on ML models is discussed, e.g., by Zadrozny (2004).

In the use case of imbalanced-learn, I gather this is not an issue because the sample selection depends only on the target variable y, never on the features in X (which corresponds to case 2 on page 2 of the above-linked paper)?
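For concreteness, this is the y-only selection I mean; the dataset is synthetic, but the sampler is the actual imbalanced-learn API:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Synthetic 90%/10% imbalanced binary dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))

# Whether a sample is kept depends only on its label y, never on its
# feature values in X -- case 2 of Zadrozny 2004.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # both classes now at the minority-class count
```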

An orthogonal question: assume we do have a dataset with sample selection bias based on some feature in X (case 3, page 2 of the same paper). In other words, the distribution of one of the columns of X does not match the real-world distribution, and we would like to compensate for it. Could one of the approaches in imbalanced-learn be used (or adapted) for this? Would something like this be in the scope of this project?
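To make the question concrete, here is a minimal sketch of the kind of compensation I have in mind (hypothetical, not an existing imbalanced-learn API): assuming we know both the biased and the real-world marginal of one column, resample with weights given by the density ratio:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical setup: column 0 was sampled from N(1, 1) in our dataset,
# but is known to follow N(0, 1) in the real world (case 3: selection
# depends on a feature, not on y).
X = rng.normal(loc=1.0, scale=1.0, size=(5000, 3))

# Importance weights = target density / sampling density for column 0.
w = norm.pdf(X[:, 0], loc=0.0, scale=1.0) / norm.pdf(X[:, 0], loc=1.0, scale=1.0)
p = w / w.sum()

# Resample with replacement according to the weights; column 0 of the
# resampled data now approximately follows the real-world N(0, 1).
idx = rng.choice(len(X), size=len(X), replace=True, p=p)
X_res = X[idx]
print(X[:, 0].mean(), X_res[:, 0].mean())  # ~1.0 vs ~0.0
```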

glemaitre (Member) commented

I would say yes. Then we would need to think about the right module to put it in.

> In other words, the distribution of one of the columns of X does not match the real-world distribution and we would like to compensate for it.

I have not looked at the paper yet, but is it related to importance sampling, in which you would resample the X column such that it follows a given "real-world" distribution?

In the case of over-sampling, we could think about something similar, in which you estimate the distribution (or parameters such as covariances) from other datasets and use this in the rebalancing procedure. It would be a kind of data augmentation using knowledge from data instead of random generation.
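A rough sketch of that idea (hypothetical, nothing like this exists in imbalanced-learn yet): estimate a Gaussian from an external dataset that is assumed to describe the minority class well, and draw the new samples from it rather than duplicating or interpolating existing ones:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical external dataset, assumed to describe the minority class well.
X_external = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 2]], size=2000)

# Estimate the distribution parameters from the external data...
mu = X_external.mean(axis=0)
cov = np.cov(X_external, rowvar=False)

# ...and use them to generate synthetic minority samples for the
# imbalanced dataset, instead of purely random duplication.
n_missing = 300  # how many minority samples we want to add
X_synthetic = rng.multivariate_normal(mu, cov, size=n_missing)
```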

I would be really interested in implementing such a feature, or in helping with it.

glemaitre (Member) commented

We should include some of these in 1.X.

glemaitre added this to the 1.0 milestone on Nov 17, 2019
glemaitre (Member) commented

@rth Did you see some of these methods in the literature? Probably we should look at the fairness papers.

rth (Author) commented Nov 17, 2019

I have not really looked into this question since opening the issue in February.

chkoar (Member) commented Nov 19, 2020

Well,

> Probably we should look at the fairness papers.

Yes, there is a body of research on this subject. I think this problem is itself an imbalanced one, so we could tackle it inside imbalanced-learn. API-wise, we would probably need some changes. I leave here one (of the many) relevant papers.
