Base classifiers: clone should deep copy the pool of classifiers #89

luizgh · 2018-09-14T20:42:28Z

check_estimator is failing for instances, since during clone the pool of classifiers are not being properly copied.
Need to investigate how this problem is addressed in sklearn (e.g. on sklearn.ensembles)

luizgh · 2018-09-19T14:44:14Z

I did some investigation today, and I could not find a solution for the problem without changing the scikit-learn code. Here is the sequence of steps that cause the problem:

GridSearchCV calls "clone" on the DES estimator
The clone function calls the "get_params" function of the DES estimator. At this point, the pool of classifiers is still "fitted"
The clone function then clones each parameter with safe=False. When cloning the pool of classifiers, the result is a pool that is not "fitted" anymore.

We don't have any control over these steps, so I am not sure there is anything we can do. Other ensemble methods in scikit-learn always fit the base classifiers as part of the "fit" procedure (e.g. see the voting classifier: http://scikit-learn.org/stable/modules/ensemble.html#voting-classifier). I sent an email on the sklearn mailing list to see if anyone can help.

luizgh · 2018-09-19T17:00:01Z

Note: there is an issue in scikit learn that proposes a "Freeze" mechanism for estimators, that could address this issue: scikit-learn/scikit-learn#8370

Menelau · 2018-09-20T20:04:15Z

@luizgh Yes, looks like freezing mechanism could address this issue (and can also be beneficial to other ensemble/meta-estimators methods). For what I check there were other ensemble libraries with the same problem.

Let's check them if that would be the best solution to our problem, if yes we will need to wait for the integration of the freezing mechanism on sklearn in order to address this issue.

luizgh · 2018-09-21T12:32:00Z

Agreed. For now I will just update the documentation to add a "known issue" that these estimators do not work with GridSearch / CV methods. I will also remove this github issue from the 0.3 milestone.

Do you have a suggestion on where to add this "Known issues" section? I was thinking that the best is to create a new section on the User guide (https://deslib.readthedocs.io/en/latest/user_guide.html). An alternative is to add it directly on the first page, but I don't think it is necessary (at least this problem will raise an Exception if people try to use a GridSearch, so the problem is not hidden from the user).

luizgh · 2018-09-21T19:15:13Z

Waiting on the resolution of scikit-learn/scikit-learn#8370

…all DES/DCS/static classifiers. * - Moving code to validate the parameters from __init__ to the fit method (sklearn style) * Refactoring DCS classes: Changing class attributes names to the sklearn style. Attributes estimated from the data now have an underscore after its name. * Changes to make the class compatible with the sklearn standards: - Moving code to validate the estimator parameters from the __init__ to the fit method; - Refactoring: Changing class attributes names to the sklearn style. Attributes estimated from the data now have an underscore after its name; - Addition BaseEstimator to the inherited classes for the get_params and set_params methods. * Updating the test routines to the according to the new changes in attribute names and parameter validation scheme * PEP8 formatting * Refactoring according to sklearn guidelines: Changing names of class attributes that are estimated based on the data (on the fit method) * Updating test routines according to the attributes name change * Refactoring according to sklearn guidelines: - Moving code to validate parameters from __init__ to fit - Change in attribute names (using an underscore after the name of attributes estimated from the data) * Updating test routines according to the refactoring on attribute name change and the new method for validating the estimator parameters * Fixing problem with identation * Refactoring: Moving code that validate parameters to the fit method; change ins the attribute names (sklearn standard) and accepting a clustering method as input parameter. * Updated test routines for the DESClustering class according to the new guidelines. * Adding code to verify whether the object passed as the clustering method is part of the sklearn clustering classes. * Updating the test routines that check if the base classifier implements the predict_proba function (Now the check happens inside the fit method) * Moving the _check_predict_proba function to the fit method. * Refactoring: remove old DFP masks * Refactoring * - Changing default value of pool_classifiers to None - Modifying name of random state attribute from rng to random_state * updating the n_classifies_ attribute in the test routines * Changing the name of the attribute rng tp random_state in the integration tests. * Fixing error in the docstring (return value of the method) * Changing check for proba after the fit method; refactoring attribute names according to sklearn guidelines * Adding random_state parameter * Adding random_state parameter * Adding the DFP and IH and random_state hyper-parameters to DESMI class. * changing random_state default value * Adding DESMI to the list of DES techniques * Adding DES Logarithmic * Making DS clustering compatible with sklearn estimators guidelines. * Making DESKNN compatible with sklearn estimators guidelines. * Making KNOP compatible with sklearn estimators guidelines. * Making META-DES compatible with sklearn estimators guidelines. * Making Probabilistic techniques compatible with sklearn estimators guidelines. * Updating test routines according to the new changes in variable names; Removing not used test cases * Updating test routines according to the new variable names; Removing obsolete test functions * Updating name of variables estimated from the data according to the sklearn guidelines * Adding random_state to the clustering definition * Making DESMI class an sklearn estimator * Merging with master branch * Adding sklearn's "check_estimator" tests (#84) * Adding sklearn's "check_estimator" for probabilistic DS methods * Adding test to show #89 is indeed a problem * Adding warning on base class (k bigger than DSEL) #93 * Adding known issue with GridSearch #89 * Fixes #91 * Marking the grid search test to skip (#89) * Adding tests for python 3.7 (#98) * Workaround for travis 3.7 support (#98) * Fix #92 * adding pytest_cache to the list of ignored folders * removing .idea from project * Fixing problem with rng in DCS classes when using "random" or "diff" as selection method (rng during predict/predict_proba). Fixes #88 * Base class for static ensembles * Making SingleBest an sklearn estimator * Making StaticSelection a sklearn estimator * Removing unused imports * Making Oracle class compatible with sklearn * Using sklearn check_array to assert a given array is 2d * Removing commented code lines * Fixing docstring on static ensemble classes; Solving a bug with label encoder for the single best class * Adding license information * automatically convert array to 2d * Updating tests with Oracle technique (using fit to setup label encoder) * updating oracle tests (setup label encoder in the fit method) * updating test; removing check estimator from Oracle since it is not a real classifier * Adding check array to predict. * Enforcing the predictions of the base classifiers are integers. * Fixing random state bug * removing commented lines of code * adding kdn score method * PEP8 formatting; Cleaning commented code. * Adding license information * Adding license information; moving kdn_score function to utils.instance_hardness.py; Adding Label encoder; Refactoring variable names according to sklearn standards * removing unused code * Adding check_estimator test for OLP method * Solving problem with label encoder when no base classifier predicts the correct label * Test routines for the SGH class * Adding predict proba; Checking if the method was fitted before calling predict and predict_proba. * Adding checks to raise an error in regression problems * skipping test while the batch processing version is not implemented * Adding parameter to indicate percentage of data used for DSEL in the training-DSEL split * Updating variable names. * Updating requirements version (sklearn 0.19) due to estimators check * Updating requirements version (sklearn 0.19) due to estimators check * Updating requirements; travis * Print values of N_ and J_ on error * Fixed checks for pct_accuracy * Fix test name * Fixing test

Menelau · 2019-04-24T00:53:50Z

@luizgh Do you know if there is an update on this issue?

luizgh · 2019-04-24T03:34:51Z

@Menelau it is on the roadmap, but from the discussions in scikit-learn/scikit-learn#8370 it seems that it may take a lot of time. They are actually proposing a workaround for pre-trained models (see scikit-learn/scikit-learn#13288), but I am not sure if that would address our problem

luizgh added this to the v0.3 - integration to sklearn contrib milestone Sep 14, 2018

luizgh self-assigned this Sep 14, 2018

luizgh added a commit that referenced this issue Sep 18, 2018

Adding test to show #89 is indeed a problem

44c9bec

luizgh added a commit that referenced this issue Sep 21, 2018

Adding known issue with GridSearch #89

b009e1d

luizgh removed this from the v0.3 - integration to sklearn contrib milestone Sep 21, 2018

luizgh added a commit that referenced this issue Sep 21, 2018

Marking the grid search test to skip (#89)

81b3aa7

luizgh added the waiting third party Waiting for changes out of our control (e.g. in another library) label Sep 21, 2018

Menelau mentioned this issue Sep 18, 2020

Kfold - TimeSeriesSplit? #222

Closed

Menelau mentioned this issue Jun 29, 2022

Multi-label classification #267

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Base classifiers: clone should deep copy the pool of classifiers #89

Base classifiers: clone should deep copy the pool of classifiers #89

luizgh commented Sep 14, 2018

luizgh commented Sep 19, 2018

luizgh commented Sep 19, 2018

Menelau commented Sep 20, 2018

luizgh commented Sep 21, 2018

luizgh commented Sep 21, 2018

Menelau commented Apr 24, 2019

luizgh commented Apr 24, 2019

Base classifiers: clone should deep copy the pool of classifiers #89

Base classifiers: clone should deep copy the pool of classifiers #89

Comments

luizgh commented Sep 14, 2018

luizgh commented Sep 19, 2018

luizgh commented Sep 19, 2018

Menelau commented Sep 20, 2018

luizgh commented Sep 21, 2018

luizgh commented Sep 21, 2018

Menelau commented Apr 24, 2019

luizgh commented Apr 24, 2019