Approaches for binary classification when you have hugely more data in one class #19218

lesshaste · 2021-01-20T14:10:38Z

lesshaste
Jan 20, 2021

I have a situation which I feel must be quite common. I am doing binary classification and have about 10,000 examples from the positive class. I can make an essentially unlimited number of examples from the negative class however. What is the best approach and is there an elegant scikit learn solution?

One simple idea would be the following. Take all the examples from the positive class and 10,000 examples from the negative chosen at random. Build a classifier (call it classifier 1) and store it. Now repeat as many times as you like storing classifiers 1,2,3...

When you want to perform prediction you take the median of all the predicted probabilities of the classifiers you have stored.

This is just something I made up and I can't believe it isn't a studied problem. What would an expert do and does scikit learn support it?

jnothman · 2021-01-20T21:24:35Z

jnothman
Jan 20, 2021
Maintainer

Learning with imbalance is a well established area of research. Have you taken a look at imbalanced-learn? https://imbalanced-learn.org/

0 replies

glemaitre · 2021-01-21T08:51:50Z

glemaitre
Jan 21, 2021
Maintainer

To add up on @jnothman answer, I would probably favour the following solution if I was to use something from imbalanced-learn: create an ensemble of strong learners but that have been trained on different negative samples. BalancedBaggingClassifier would do this job.

Another thing to try would be to look at a novelty detection algorithm and train the model solely with the samples of the positive class.

2 replies

ogrisel Jan 21, 2021
Maintainer

Also note: if you want to use a rebalancing classifier such as BalancedRandomForestClassifier as suggested above, make sure that your minority class has at least a few hundreds of samples. You should not expect a miracle if your classifier has fewer than a hundred examples for the minority class.

Also make sure to evaluate your model with metrics (precision, recall, Matthews' correlation coefficient, balanced accuracy) computed with the natural class balancing of the data, not a rebalanced version. Otherwise your performance metric would not reflect the behavior of the model in the natural setting.

If you have fewer than a few hundreds examples labeled with the minority class, I would focus on treating the problem as a novelty / anomaly detection problem rather than a supervised classification problems. Keep the few examples from the minority class you have as a validation test to select the "best" novelty detection strategy for your particular problem (for instance using a ranking metric).

jnothman Jan 21, 2021
Maintainer

A thought: When you say "look at novelty detection" I read "use a more biased model or feature representation". Supervised learning might work just fine, but don't expect it to find a signal in a mixed bag of features.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Approaches for binary classification when you have hugely more data in one class #19218

{{title}}

Replies: 2 comments 2 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Approaches for binary classification when you have hugely more data in one class #19218

lesshaste Jan 20, 2021

Replies: 2 comments · 2 replies

jnothman Jan 20, 2021 Maintainer

glemaitre Jan 21, 2021 Maintainer

ogrisel Jan 21, 2021 Maintainer

jnothman Jan 21, 2021 Maintainer

lesshaste
Jan 20, 2021

Replies: 2 comments 2 replies

jnothman
Jan 20, 2021
Maintainer

glemaitre
Jan 21, 2021
Maintainer

ogrisel Jan 21, 2021
Maintainer

jnothman Jan 21, 2021
Maintainer