
Proper way of fitting classifiers before creating a heterogeneous pool #264

Open
francescopisu opened this issue Dec 30, 2021 · 4 comments

Comments

@francescopisu

francescopisu commented Dec 30, 2021

Hey, I'm working on a research paper focused on building a binary classification model in the biomedical domain. The dataset comprises approximately 800 data points. Let's say I want to feed a heterogeneous pool of classifiers to the dynamic selection methods. Following the examples, I've found two different ways of splitting the dataset and fitting the base classifiers of the pool.

  1. Split into train/test (e.g., 75/25) and then split the training portion into train/DSEL (e.g., 50/50).
    In this random forest example, the RF is fitted on the 75% training portion and the DS methods on the 50% DSEL portion.
  2. In all the other examples, the 50% training portion is used to fit the classifiers and the 50% DSEL portion is used to fit the DS methods.
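
For concreteness, here is a minimal sketch of both schemes (scikit-learn + DESlib; KNORAE just stands in for any dynamic selection method, and the breast cancer dataset is only a placeholder for my data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from deslib.des import KNORAE

X, y = load_breast_cancer(return_X_y=True)

# First split: 75% train / 25% test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Second split: halve the training portion into a fitting part and DSEL.
X_fit, X_dsel, y_fit, y_dsel = train_test_split(X_train, y_train, test_size=0.5, random_state=0)

# Scheme 1 (random forest example): pool fitted on the whole 75% training portion,
# DS method fitted on the 50% DSEL portion.
pool_1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
ds_1 = KNORAE(pool_1).fit(X_dsel, y_dsel)

# Scheme 2 (other examples): pool fitted only on the 50% fitting portion,
# disjoint from the 50% DSEL portion used to fit the DS method.
pool_2 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fit, y_fit)
ds_2 = KNORAE(pool_2).fit(X_dsel, y_dsel)

print(ds_1.score(X_test, y_test), ds_2.score(X_test, y_test))
```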

Furthermore, I wanted to point out this tip taken from the tutorial:

An important point here is that in case of small datasets or when the base classifier models in the pool are weak estimators such as Decision Stumps or Perceptrons, an overlap between the training data and DSEL may be beneficial for achieving better performance.

That seems to be my case, as my dataset is rather small compared to most datasets in the ML domain. Hence, I was thinking of fitting my base classifiers on the 75% part and then leveraging some overlap to get better performance (and this really is the case: overlapping leads to a median AUC of 0.76, whereas non-overlapping gives 0.71).

What would be the best way of dealing with the problem?

@francescopisu
Author

Any update on this?

@Menelau
Collaborator

Menelau commented Jan 4, 2022

Hello,

In your case (a small dataset with about 800 samples), I believe the best way of handling it is to use the whole 75% for training the base models and to have some overlap between the training data and DSEL.

Does your dataset also suffer from class imbalance? If so, that is another indication that you should consider an overlapping approach, as the minority class may be quite underrepresented if the dataset is divided too much. I have done that in quite a few papers, such as the FIRE-DES++ one:

FIRE-DES++: Enhanced online pruning of base classifiers for dynamic ensemble selection: https://www.etsmtl.ca/Unites-de-recherche/LIVIA/Recherche-et-innovation/Publications/Publications-2019/Cruz_PR_2019.pdf

Also, if your dataset suffers from class imbalance, you could try applying data augmentation to increase the DSEL size, as we did in a previous publication:

A study on combining dynamic selection and data preprocessing for imbalance learning: https://en.etsmtl.ca/Unites-de-recherche/LIVIA/Recherche-et-innovation/Publications/Publications-2018/Roy_Neurocomputing_2018_InPress.pdf
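
As a rough illustration of that idea (assuming imbalanced-learn's SMOTE as the augmentation step; the paper compares several preprocessing techniques, so treat this only as a sketch):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from deslib.des import KNORAU

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
X_fit, X_dsel, y_fit, y_dsel = train_test_split(X_train, y_train, test_size=0.5, stratify=y_train, random_state=0)

# Fit the pool on its part of the training data.
pool = BaggingClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Oversample only the DSEL partition so the minority class is better
# represented when the region of competence is estimated.
X_dsel_aug, y_dsel_aug = SMOTE(random_state=0).fit_resample(X_dsel, y_dsel)

ds = KNORAU(pool).fit(X_dsel_aug, y_dsel_aug)
print(ds.score(X_test, y_test))
```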

@francescopisu
Author

Hello Menelau, thank you very much for your time and valuable input. My dataset suffers from mild imbalance, but this time it is the negative class (say, absence of the condition) that is underrepresented, which is the class I'm less interested in. I'm going to read these two articles and try to apply what you suggested.

@francescopisu
Author

francescopisu commented Feb 22, 2022

-- EDIT: In my previous answer I reported wrong results due to some implementation errors on my part. I have removed that part, as I am now getting more realistic results.

I have some feedback regarding the overlap. If I understood correctly, Xnew_train and Xdsel should be chosen such that set.intersection(Xnew_train, Xdsel) is not the empty set.

Let Xtrain, Xtest, ytrain, ytest = train_test_split(features, target, test_size=0.2)
be the train and test sets after the first split (the test set is 20% of the original dataset).

Then, I'm sampling 80% of the entries from Xtrain to get the new training set Xnew_train. The remaining 20% becomes the Xdsel set. At this point there's no overlap.

To introduce overlap, I'm randomly sampling 10% of the entries in Xnew_train and adding them to Xdsel to get Xnew_dsel.

Example with easy numbers:
dataset = (1000, n_features)
Xtrain, Xtest = (800, n_features), (200, n_features)
Xnew_train = 80% of 800 -> (640, n_features)
Xdsel = 20% of 800 -> (160, n_features)
Xnew_dsel = Xdsel + 10% of Xnew_train (64 rows) -> (160 + 64 = 224, n_features)

I'm also doing it in the cross-validation procedure: I split the training portion into new_train and dsel (80%, 20%), sample 10% of rows from new_train and add them to dsel to get new_dsel.
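
For reference, this is roughly what the single split looks like in code (numpy + scikit-learn; make_classification is just a stand-in for my actual data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# First split: 80% train / 20% test.
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Second split: 80% of the training portion for fitting, 20% for DSEL (disjoint so far).
Xnew_train, Xdsel, ynew_train, ydsel = train_test_split(
    Xtrain, ytrain, test_size=0.2, stratify=ytrain, random_state=0)

# Overlap: copy a random 10% of the fitting rows into DSEL.
n_overlap = int(0.1 * len(Xnew_train))
idx = rng.choice(len(Xnew_train), size=n_overlap, replace=False)
Xnew_dsel = np.vstack([Xdsel, Xnew_train[idx]])
ynew_dsel = np.concatenate([ydsel, ynew_train[idx]])

print(Xnew_train.shape, Xnew_dsel.shape)  # (640, 20) (224, 20)
```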

Am I implementing the overlap correctly?

Thank you for your time
