
Imbalanced Data

Introduction

This repository is a companion to my Medium blog post on handling imbalanced datasets.
The data used in this repository is sourced, with gratitude, from Daniel Perico's Kaggle entry earthquakes. The key idea behind this collection is to provide a level playing field for comparing a variety of methods that address imbalance. Feel free to plug in your own dataset and play with the algorithms' options; it should be fairly easy to change what you need, or to copy and paste pieces for further comparisons.

If you think I've missed an important algorithm or method, please let me know.

Results

This table shows the results of the examples run in the notebooks. The best in each category are highlighted in dark grey, the worst in light grey.

[Results table]

It's important to note that the different algorithms produce very different results, especially the neural networks compared with the regular Random Forest. I have not tested for model size; this could play a role in choosing a model candidate, since the Random Forest's size depends on its hyperparameters and the dataset, while the neural network's size remains fixed.
I have included the F1 score as well as the ROC score to show how the two differ with respect to precision and recall. Examples like NearMiss version 1 undersampling or the balanced ensemble algorithms show how misleading it can be to rely on a metric focussed only on the true and false positive rates, however perfect it may seem for a classification problem.

Here is a set of algorithms other than Random Forest that provide alternative baselines for further experimentation:

[Alternative baselines table]

Installation and requirements

The CSV file containing the unmodified data is assumed to reside in the /data folder.

The project requirements are listed in requirements.txt and can be installed with

pip install -r requirements.txt

Additional ML libraries

The non-sklearn classifiers come from the sources listed below.

NB:

  • To run the example of the GAN as a sample generator for oversampling, you have to get the script from the DIAL team on GitHub.
  • To run the example using the Hellinger distance criterion in Random Forest, you need to set up the criterion as described in Evgeni Dubov's GitHub repository.
    (You might also have to copy the resulting files into your notebook directory to import the module.)
  • To be compatible with all libraries and algorithms, I recommend creating an environment based on Python 3.6, which is the lowest common denominator for compatibility among the libraries.

Jupyter Notebooks

All notebooks are standalone and can be executed individually. To save resources, variables are reused and redefined for each example within a notebook, so the examples can also be run sequentially as a whole.
Each model evaluation calculates and prints the precision, recall, ROC, F1 and accuracy scores, and can also produce a confusion matrix if desired. Unless specified otherwise, all notebooks scale the data and one-hot encode the categorical features. This is recommended practice for many of the resampling algorithms, and it further ensures that the method is comparable and compatible with any other algorithms a user might like to try, such as the neural networks in notebooks 09 and 10.

(NB: as the appendices show, there is a small, reproducible difference in model performance based on the amount and type of preprocessing)
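As an illustration of that shared pattern, here is a minimal sketch, not the notebooks' exact code: it assumes a pandas DataFrame loaded from a hypothetical file name, with a hypothetical binary column named 'target'.

# Minimal sketch of the shared preprocessing and evaluation pattern.
# The file name and the 'target' column name are hypothetical placeholders.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv('data/earthquakes.csv')        # file name hypothetical
X, y = df.drop(columns='target'), df['target']
num_cols = X.select_dtypes('number').columns
cat_cols = X.columns.difference(num_cols)

# Scale numeric features, one-hot encode the categoricals.
pre = ColumnTransformer([('num', StandardScaler(), num_cols),
                         ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)])
model = Pipeline([('pre', pre), ('rf', RandomForestClassifier(random_state=42))])

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print('precision:', precision_score(y_test, pred))
print('recall:   ', recall_score(y_test, pred))
print('ROC AUC:  ', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
print('F1:       ', f1_score(y_test, pred))
print('accuracy: ', accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))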

01 - EDA

This notebook contains a short exploration of the dataset. Its main focus is to check the dataset for class imbalance and to examine the missing values and correlations.
For a detailed explanation of the data types in the columns, see the US Geological Survey description page.
The exploration is rudimentary, since no feature engineering is performed beyond the creation of a binary class.
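A quick way to run that imbalance and missing-value check (a sketch; file name and column name are assumptions as before):

import pandas as pd

df = pd.read_csv('data/earthquakes.csv')          # file name hypothetical
print(df['target'].value_counts(normalize=True))  # class proportions
print(df.isna().sum())                            # missing values per column
print(df.corr())                                  # pairwise numeric correlations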

02 - Basic, Basic_weights, Basic_Hellinger

These notebooks share the same structure: they all use the dataset as-is, restricted to rows without missing values. The basic notebook contains a default Random Forest and, for good measure, a simple baseline model that guesses every event is an earthquake.
The basic_weights notebook is the same, but the Random Forest uses balanced class weights. Similarly, the basic_Hellinger notebook uses the Random Forest, but swaps the split criterion for the custom Hellinger distance criterion.
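In sklearn terms, the weights variant is a one-argument change; the Hellinger variant plugs in the compiled criterion from the repository mentioned above (the import and constructor call follow that repository's examples, so treat them as assumptions):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Balanced class weights: a single argument on an otherwise default forest.
rf_weighted = RandomForestClassifier(class_weight='balanced', random_state=42)

# Hellinger distance as the split criterion, assuming the compiled module
# from Evgeni Dubov's repository is importable from the notebook directory:
# from hellinger_distance_criterion import HellingerDistanceCriterion
# hdc = HellingerDistanceCriterion(1, np.array([2], dtype='int64'))
# rf_hellinger = RandomForestClassifier(criterion=hdc)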

03 - Imputation

This notebook uses the sklearn IterativeImputer to 'salvage' more data from the dataset by filling in missing values. This notebook does not one-hot encode the data, due to resource constraints. The imputation is limited to rows that contain fewer than 5 missing values.
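A sketch of that setup (the 5-value threshold comes from the text; the file name and the restriction to numeric columns are assumptions):

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_csv('data/earthquakes.csv')        # file name hypothetical
keep = df.isna().sum(axis=1) < 5                # rows with fewer than 5 gaps
num_cols = df.select_dtypes('number').columns
imputed = df.loc[keep].copy()
imputed[num_cols] = IterativeImputer(random_state=42).fit_transform(imputed[num_cols])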

04 - Knnimputation

This notebook is similar to notebook 03, except that it uses the sklearn KNNImputer. For the sake of time, only rows with fewer than 3 missing values were imputed in this notebook.
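The same pattern with the KNN-based imputer (again a sketch; n_neighbors is left at a guessed default):

import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv('data/earthquakes.csv')        # file name hypothetical
keep = df.isna().sum(axis=1) < 3                # rows with fewer than 3 gaps
num_cols = df.select_dtypes('number').columns
imputed = df.loc[keep].copy()
imputed[num_cols] = KNNImputer(n_neighbors=5).fit_transform(imputed[num_cols])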

05 - Undersampling

This notebook uses a standard Random Forest to model a variety of controlled undersampling and cleaning undersampling methods.
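The samplers come from imbalanced-learn and share one interface; here is a sketch with three representatives (which exact samplers the notebook covers is an assumption), where X_train and y_train are the preprocessed training data from the earlier sketch:

from collections import Counter
from imblearn.under_sampling import NearMiss, RandomUnderSampler, TomekLinks

for sampler in (RandomUnderSampler(random_state=42),  # controlled
                NearMiss(version=1),                  # controlled
                TomekLinks()):                        # cleaning
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(type(sampler).__name__, Counter(y_res))     # resulting class counts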

06 - Oversampling

This notebook uses a standard Random Forest to model random oversampling as well as a variety of synthetic oversamplers.
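The oversamplers follow the same interface; a sketch with an illustrative selection (X_train and y_train as before):

from collections import Counter
from imblearn.over_sampling import ADASYN, RandomOverSampler, SMOTE

for sampler in (RandomOverSampler(random_state=42),   # duplicates minority rows
                SMOTE(random_state=42),               # synthetic interpolation
                ADASYN(random_state=42)):             # density-adaptive SMOTE variant
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(type(sampler).__name__, Counter(y_res))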

07 - Combined sampling

This notebook uses a standard Random Forest to model two hybrid over- and undersampling methods that combine SMOTE with the EditedNearestNeighbours and TomekLinks cleaning algorithms.
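These correspond to imbalanced-learn's two combined samplers (a sketch, X_train and y_train as before):

from collections import Counter
from imblearn.combine import SMOTEENN, SMOTETomek

for sampler in (SMOTEENN(random_state=42),     # SMOTE + EditedNearestNeighbours
                SMOTETomek(random_state=42)):  # SMOTE + TomekLinks
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(type(sampler).__name__, Counter(y_res))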

08 - Balanced ensembles

This notebook uses a variety of modelling algorithms with integrated resamplers instead of the standard Random Forest while keeping the preprocessing similar to the basic model approach.
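imbalanced-learn ships several such estimators with built-in resampling; whether the notebook uses exactly this set is an assumption, but the usage pattern is the same as for any sklearn classifier:

from imblearn.ensemble import (BalancedBaggingClassifier,
                               BalancedRandomForestClassifier,
                               EasyEnsembleClassifier,
                               RUSBoostClassifier)

clf = BalancedRandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)   # the resampling happens inside the estimator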

09 - ANNs

This notebook uses a simple neural network as a point of comparison to the Random Forest model.
To make the modelling comparable, training includes a cross-validation step, and the results are collected to calculate the model's average performance.
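A sketch of that loop with a deliberately small Keras network (architecture, epochs and fold count are illustrative, not the notebook's settings); X and y are assumed to be the preprocessed NumPy arrays:

import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from tensorflow import keras

def build_model(n_features):
    # Small binary classifier; layer sizes are illustrative.
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                          random_state=42).split(X, y):
    model = build_model(X.shape[1])
    model.fit(X[train_idx], y[train_idx], epochs=20, batch_size=256, verbose=0)
    pred = (model.predict(X[val_idx]) > 0.5).astype(int).ravel()
    scores.append(f1_score(y[val_idx], pred))
print('mean F1 over folds:', np.mean(scores))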

09a - ANNs resampled

For further comparison, this notebook employs the same neural network as notebook 09 and trains the model on data preprocessed with SMOTE and data preprocessed with random undersampling.
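One detail worth making explicit: the resampling is fitted on the training folds only, so the validation folds keep the original class distribution. A sketch of that arrangement (reusing build_model and the X/y arrays from the previous sketch):

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    # Resample the training fold only; the validation fold stays untouched.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X[train_idx], y[train_idx])
    model = build_model(X.shape[1])     # build_model as in the previous sketch
    model.fit(X_res, y_res, epochs=20, batch_size=256, verbose=0)
    # Evaluate on X[val_idx], y[val_idx] as before; swap SMOTE for
    # RandomUnderSampler(random_state=42) for the undersampled variant.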

10 - Alternative methods

This notebook applies a different oversampler, a GAN, to generate synthetic data. It's a very interesting approach that, to my surprise, was the only oversampling/synthetic method that managed to increase the precision of the default model. This could be an interesting approach for situations where precision is the focus, but other tools that successfully increased the precision, like the Hellinger distance criterion, are out of reach due to platform constraints or security concerns.
It took a bit of hyperparameter tuning to get to the result, so a grid search based on the desired metric might be the way to go if the compute is available.
The most important aspect to note is that the loss functions are coupled: the generator's loss depends on the discriminator's loss. That makes the loss itself a less reliable indicator of performance, so training and evaluating the downstream model is required to grade the effectiveness of the generator, which slows down the development process. As an example, the generator could show a lower loss after 65 epochs of training than after 165 epochs with other parameter settings, yet the data the former generates performs much worse in training the model than the data from the latter.
As for which hyperparameters to tune, the standard deviation of the weight initialisation had the greatest immediate effect, but the batch size and learning rates appear to be the most significant factors overall. The ratio of the learning rates between the discriminator and generator is especially important, since the generator depends on the discriminator to improve. While the default values have a higher learning rate for the discriminator than for the generator, I am not convinced that this is a general rule to follow. Depending on the data and the architecture of the discriminator and generator, it might even be useful to slow down the discriminator relative to the generator.
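In Keras terms, that ratio amounts to two optimisers with independent learning rates; the values below are illustrative, not the tuned settings, and the DIAL script may structure this differently:

from tensorflow import keras

# Independent optimisers let the discriminator/generator learning-rate
# ratio be tuned directly; the values here are illustrative only.
d_optimizer = keras.optimizers.Adam(learning_rate=2e-4)   # discriminator
g_optimizer = keras.optimizers.Adam(learning_rate=1e-4)   # generator

# The initialisation spread mentioned above is likewise a single knob:
init = keras.initializers.RandomNormal(stddev=0.02)       # std. dev. illustrative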

11 - Alternative Baselines

I took all the classifiers I could find from libraries/packages that still appear to be active (which is why I neglected pybrain, Veles and NervanaNeon) and ran them, in more or less their default settings, on the scaled, one-hot encoded data.
Not all the classifiers included in this notebook appear in the table (such as the Shogun-ML classifiers), because for one reason or another I could not obtain results from them.