Sample selection bias and up/down-sampling #540
Comments
I would say yes. Then, we would need to think about the right module to do that.
I have not looked at the paper yet, but is this related to importance sampling, where you would like to sample a column of `X` so that it follows a given "real-world" distribution? In the case of over-sampling, we could think about something similar, in which you estimate a distribution (or parameters such as covariances) from other datasets and use it in the rebalancing procedure. It would be a kind of data augmentation that uses knowledge from data instead of random generation. I would be really interested in implementing something like this, or helping with it.
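To make the importance-sampling idea concrete, here is a minimal NumPy sketch (not imbalanced-learn's actual API): it estimates per-bin importance weights `p_real(x) / p_sample(x)` for one column and resamples with those weights so the column better matches an assumed "real-world" distribution. The choice of a standard normal as the target distribution and the binning scheme are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical biased sample: one column of X is shifted compared to an
# assumed "real-world" distribution (here: a standard normal).
x = rng.normal(loc=1.0, scale=1.0, size=10_000)
bins = np.linspace(-4, 4, 21)

# Empirical density of the biased sample, per bin
sample_hist, _ = np.histogram(x, bins=bins, density=True)

# Assumed real-world density, evaluated at the bin centers
centers = (bins[:-1] + bins[1:]) / 2
real_pdf = np.exp(-centers**2 / 2) / np.sqrt(2 * np.pi)

# Importance weight for each sample: p_real(x) / p_sample(x)
idx = np.clip(np.digitize(x, bins) - 1, 0, len(centers) - 1)
weights = real_pdf[idx] / np.maximum(sample_hist[idx], 1e-12)

# Weighted resampling approximates drawing from the real-world distribution
resampled = rng.choice(x, size=10_000, replace=True, p=weights / weights.sum())
print(x.mean(), resampled.mean())  # resampled mean moves toward 0
```

In practice one would estimate both densities from data (e.g. with a density estimator rather than a histogram), but the rebalancing step would look much the same.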
We should include some of these in 1.X
@rth Did you find some of these methods in the literature? We should probably look at the fairness papers.
I have not really looked into this question since opening this issue in February.
Well, yes. There is a body of research on this subject. I think this problem is itself related to imbalance, so we can tackle it here.
It's a bit of an open-ended question. In my understanding, up/down-sampling the input data depending on the target class is equivalent to having a dataset with sample selection bias. The possible impact of the latter on ML models is discussed e.g. by Zadrozny (2004).
In the use case of imbalanced-learn, I gather that this is not an issue, because the sample selection only happens depending on the target variable `y`, not on any of the features in `X` (which corresponds to case 2 on page 2 of the above-linked paper).

An orthogonal question: assume we do have a dataset with sample selection bias based on some feature in `X` (case 3, page 2 of the same paper). In other words, the distribution of one of the columns of `X` does not match the real-world distribution, and we would like to compensate for it. Could one of the approaches in imbalanced-learn be used (or adapted) for this? Would something like this be in the scope of this project?
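The distinction between the two cases can be illustrated with a small NumPy sketch (plain NumPy rather than imbalanced-learn's samplers; the dataset and selection probabilities are made up for illustration): in case 2 the selection depends only on `y`, so the class proportions change but the feature distribution within each class does not; in case 3 the selection depends on a feature of `X`, so the feature distribution itself is biased.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dataset: one feature, binary target correlated with it
X = rng.normal(size=(1000, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Case 2: selection depends only on y, e.g. random under-sampling of the
# majority class down to the minority class size
counts = np.bincount(y, minlength=2)
minority = np.flatnonzero(y == counts.argmin())
majority = np.flatnonzero(y == counts.argmax())
keep = rng.choice(majority, size=len(minority), replace=False)
sel2 = np.sort(np.concatenate([minority, keep]))
X2, y2 = X[sel2], y[sel2]
print(y2.mean())  # classes are exactly balanced after case-2 sampling

# Case 3: selection probability depends on the feature itself, so the
# retained X no longer follows the original feature distribution
p_keep = 1 / (1 + np.exp(-2 * X[:, 0]))
sel3 = rng.random(len(y)) < p_keep
X3 = X[sel3]
print(X.mean(), X3.mean())  # case-3 selection shifts the feature mean upward
```

Compensating for case 3 would require reweighting or resampling based on the biased feature, which is closer to importance sampling than to the class-conditional resampling that imbalanced-learn currently implements.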