
paper: The Effect of Class Distribution on Classifier Learning: An Empirical Study #730

Open
Sandy4321 opened this issue Jun 21, 2020 · 19 comments

Comments

@Sandy4321

Friends,
there is an interesting discussion there about what could be done better:
catboost/catboost#392 (comment)
@ShaharKatz
and a reference to this paper:
The Effect of Class Distribution on Classifier Learning: An Empirical Study
https://pdfs.semanticscholar.org/8939/585e7d464703fe0ec8ca9fc6acc3528ce601.pdf

@glemaitre
Member

Could you elaborate on what the method is doing?

@ShaharKatz

Sure. Empirical research has shown that class imbalance does not necessarily mean worse performance. The proposed method consists of grid-searching over several (and reversed) re-samplings to produce several classifiers and picking the best one. A bias correction is then made (easily in the case of trees) in the form of a higher or lower threshold per leaf, based on the original-to-resampled ratio (e.g. if you sampled a class twice as frequently, the threshold for classifying into that class should be twice as high per leaf). The annotated version shows what I believe to be the essence.
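
A minimal sketch of how I read that correction for a tree fitted on resampled data (the function and argument names below are only illustrative, not from the paper):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def debiased_proba(tree, X, resample_ratio):
    """Correct a tree's leaf frequencies back to the original class distribution.

    resample_ratio[c] = n_resampled[c] / n_original[c] for class c.
    For a decision tree, predict_proba returns the class frequencies of the
    leaf each sample falls into, so dividing by the resampling ratio and
    renormalising is equivalent to shifting the decision threshold per leaf.
    """
    proba = tree.predict_proba(X)
    corrected = proba / np.asarray(resample_ratio, dtype=float)
    return corrected / corrected.sum(axis=1, keepdims=True)

# e.g. class 1 was over-sampled to twice its original frequency:
# tree = DecisionTreeClassifier().fit(X_resampled, y_resampled)
# proba = debiased_proba(tree, X_test, resample_ratio=[1.0, 2.0])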

@Sandy4321
Author

@ShaharKatz
do you know of some Python code to illustrate this, for any classifier?

@ShaharKatz

As far as I read, the article itself doesn't come with source code. The intuition and example case are pretty straightforward: take an imbalanced binary-label dataset that is perfectly linearly separable. An SVM shouldn't have a problem with this dataset, but by downsampling the majority class you can lose the fine-tuning of the decision edge. I think I'll first show that this actually works on a dataset with reproducible code, and then, if the results are good, incorporate it.
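
Something along these lines as a first reproducible sketch (purely illustrative; the blob geometry and the 20:1 imbalance are arbitrary choices, not from the paper):

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from imblearn.under_sampling import RandomUnderSampler

# A linearly separable, imbalanced binary dataset (20:1).
X, y = make_blobs(
    n_samples=[2000, 100], centers=[(-2.0, 0.0), (2.0, 0.0)],
    cluster_std=0.5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# SVM fitted on the full imbalanced training data.
svm_full = LinearSVC().fit(X_train, y_train)

# SVM fitted after downsampling the majority class to a 1:1 ratio.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)
svm_down = LinearSVC().fit(X_res, y_res)

print("full data:", svm_full.score(X_test, y_test))
print("downsampled:", svm_down.score(X_test, y_test))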

@glemaitre
Member

It looks to me as if it would be easy to do with standard scikit-learn/imbalanced-learn components in a few lines of code (pipeline + grid-search):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from imblearn.datasets import make_imbalance
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

data = load_iris()
X, y = data.data, data.target

# Make iris artificially imbalanced: 10 / 20 / 50 samples per class.
X, y = make_imbalance(
    X, y, sampling_strategy={0: 10, 1: 20, 2: 50}, random_state=42
)

# Resampling lives inside the pipeline, so it is applied only on training folds.
model = Pipeline(
    [("sampler", SMOTE()),
     ("scaler", StandardScaler()),
     ("clf", LogisticRegression())]
)

# Grid-search over several target class distributions for the sampler.
param_grid = {
    "sampler__sampling_strategy": [
        {0: 20, 1: 30}, {0: 30, 1: 30}, {0: 30, 1: 20},
    ]
}
grid = GridSearchCV(model, param_grid=param_grid).fit(X, y)
print(grid.best_params_)

@glemaitre
Member

So if I am not missing anything, IMO it would not be worth creating an estimator wrapping all possible parameters of these models when it seems pretty easy to create a pipeline in this case.

WDYT?

@ShaharKatz

Maybe the current implementation already addresses this, but the over/under-sampling is only the first part; the second part (which might still need implementation) is the de-biasing of the estimator's results. The article shows it for trees (which is very intuitive), but for logistic regression we need a different correction.
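
For logistic regression, one candidate is the classic prior/intercept correction: shift the fitted log-odds by the log of the resampling ratio. A minimal sketch, assuming a binary problem (the helper name and the way the ratio is passed are only illustrative):

import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

def prior_corrected_proba(clf, X, resample_ratio):
    """Shift a fitted binary LogisticRegression's log-odds by log(resample_ratio).

    resample_ratio = (n_resampled_pos / n_original_pos)
                     / (n_resampled_neg / n_original_neg)
    Over-sampling the positive class inflates the intercept by roughly
    log(resample_ratio); subtracting it maps the predicted probabilities
    back to the original class distribution.
    """
    log_odds = clf.decision_function(X) - np.log(resample_ratio)
    p_pos = expit(log_odds)
    return np.column_stack([1.0 - p_pos, p_pos])

# clf = LogisticRegression().fit(X_resampled, y_resampled)
# proba = prior_corrected_proba(clf, X_test, resample_ratio=2.0)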

@Sandy4321
Author

Yes, the de-biasing of the estimator's results.
It would be great to add this.

@Sandy4321
Author

What is the problem?
@ShaharKatz knows what to do.
Let him do it, and afterwards you can test how good it is.

@Sandy4321
Author

If @ShaharKatz wants to do it, why stop him?

@glemaitre
Member

@Sandy4321 We have to be careful when adding a new algorithm to the source code. Basically, any code added comes with the responsibility to maintain it, so we need to weigh the benefits and limitations of the proposed solution and decide whether it is worth adding.

That said, I have not looked at the paper yet, so I cannot say whether it is worth it. Regarding the debiasing, I would think it should be linked to the scoring used during the fit of GridSearchCV and might be implementable using make_scorer from scikit-learn.

@glemaitre
Member

OK, so I see that the debiasing is actually a ratio applied at the leaf level of the tree, so this would have to be added in scikit-learn's tree code base directly.
I am wondering whether it could always be applied, even when not resampling the dataset?

@glemaitre
Member

One thing I am not sure about is how well this method works with deep trees, where you will have very few samples in each leaf.

@ShaharKatz

Regarding the implementation @glemaitre suggested: this isn't really a simple scorer, since it must know the resampling technique used in the pre-processing stage. On the other hand, this isn't really a preprocessing step either, since it obviously takes action during inference.

On one hand this shouldn't be model-specific, since most models don't do the resampling internally (which is why this repo comes in handy), but on the other hand the model implementation is relevant, since the correction goes down to the leaf level (in trees).

It's:

  1. an adjacent model to the actual model being trained;
  2. fitted using the resampling technique(s) used;
  3. model-specific (it doesn't work just on the probabilities).

This is the reason I think this repo is the best place for it: because it deals specifically with imbalanced learning, it can take this "hybrid" which doesn't necessarily play nicely with the existing interfaces.

@Sandy4321
Author

@ShaharKatz

@Sandy4321
Author

@ShaharKatz
If it is so complicated to incorporate your great suggestion into this package, would you like to create a stand-alone Python package to benefit all of us?
Afterwards you could add your code to this package, once the package owners have tested it ...
Please do not give up, we need your code...

@chkoar
Member

chkoar commented Jul 24, 2020

The article shows it for trees (which is very intuitive), but for logistic regression we need a different correction.

Could this be generalized across predictors (1)?

On the other hand, this isn't really a preprocessing step either, since it obviously takes action during inference.

If the answer to (1) is yes, then I suppose the above sentence indicates that the solution could be a meta-estimator, no?
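
A minimal skeleton of what such a meta-estimator could look like (the class, attribute, and parameter names below are only illustrative, not an existing imbalanced-learn API):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class DebiasedResampledClassifier(BaseEstimator, ClassifierMixin):
    """Illustrative sketch: resample, fit, then divide predict_proba by the
    per-class resampling ratio at inference time and renormalise."""

    def __init__(self, sampler, estimator):
        self.sampler = sampler
        self.estimator = estimator

    def fit(self, X, y):
        self.classes_, counts_before = np.unique(y, return_counts=True)
        X_res, y_res = clone(self.sampler).fit_resample(X, y)
        counts_after = np.array([np.sum(y_res == c) for c in self.classes_])
        # how much the sampler inflated/deflated each class
        self.resample_ratio_ = counts_after / counts_before
        self.estimator_ = clone(self.estimator).fit(X_res, y_res)
        return self

    def predict_proba(self, X):
        proba = self.estimator_.predict_proba(X) / self.resample_ratio_
        return proba / proba.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

# e.g. with any imblearn sampler and any classifier exposing predict_proba:
# model = DebiasedResampledClassifier(RandomUnderSampler(), DecisionTreeClassifier()).fit(X, y)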

@Sandy4321
Author

Hi,
why be against this definite improvement?
@ShaharKatz
could you do it as a separate package?
We need your code.

@ShaharKatz

Regarding your question, @chkoar: this is model-specific. We have a solution for trees, and I'm currently looking at a solution for logistic regression. We don't have a general framework yet.
@Sandy4321: I want to see that it provides value and can be generalised. If the generalisation allows this to be a meta-estimator, then there's no problem committing the code here; if not, then yes, this would require a different project.
