Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base classifiers: clone should deep copy the pool of classifiers #89

Open
luizgh opened this issue Sep 14, 2018 · 7 comments
Open

Base classifiers: clone should deep copy the pool of classifiers #89

luizgh opened this issue Sep 14, 2018 · 7 comments
Assignees
Labels
waiting third party Waiting for changes out of our control (e.g. in another library)

Comments

@luizgh
Copy link
Collaborator

luizgh commented Sep 14, 2018

check_estimator is failing for instances, since during clone the pool of classifiers are not being properly copied.
Need to investigate how this problem is addressed in sklearn (e.g. on sklearn.ensembles)

@luizgh luizgh self-assigned this Sep 14, 2018
luizgh added a commit that referenced this issue Sep 18, 2018
@luizgh
Copy link
Collaborator Author

luizgh commented Sep 19, 2018

I did some investigation today, and I could not find a solution for the problem without changing the scikit-learn code. Here is the sequence of steps that cause the problem:

  1. GridSearchCV calls "clone" on the DES estimator
  2. The clone function calls the "get_params" function of the DES estimator. At this point, the pool of classifiers is still "fitted"
  3. The clone function then clones each parameter with safe=False. When cloning the pool of classifiers, the result is a pool that is not "fitted" anymore.

We don't have any control over these steps, so I am not sure there is anything we can do. Other ensemble methods in scikit-learn always fit the base classifiers as part of the "fit" procedure (e.g. see the voting classifier: http://scikit-learn.org/stable/modules/ensemble.html#voting-classifier). I sent an email on the sklearn mailing list to see if anyone can help.

@luizgh
Copy link
Collaborator Author

luizgh commented Sep 19, 2018

Note: there is an issue in scikit learn that proposes a "Freeze" mechanism for estimators, that could address this issue: scikit-learn/scikit-learn#8370

@Menelau
Copy link
Collaborator

Menelau commented Sep 20, 2018

@luizgh Yes, looks like freezing mechanism could address this issue (and can also be beneficial to other ensemble/meta-estimators methods). For what I check there were other ensemble libraries with the same problem.

Let's check them if that would be the best solution to our problem, if yes we will need to wait for the integration of the freezing mechanism on sklearn in order to address this issue.

@luizgh
Copy link
Collaborator Author

luizgh commented Sep 21, 2018

Agreed. For now I will just update the documentation to add a "known issue" that these estimators do not work with GridSearch / CV methods. I will also remove this github issue from the 0.3 milestone.

Do you have a suggestion on where to add this "Known issues" section? I was thinking that the best is to create a new section on the User guide (https://deslib.readthedocs.io/en/latest/user_guide.html). An alternative is to add it directly on the first page, but I don't think it is necessary (at least this problem will raise an Exception if people try to use a GridSearch, so the problem is not hidden from the user).

luizgh added a commit that referenced this issue Sep 21, 2018
@luizgh luizgh added the waiting third party Waiting for changes out of our control (e.g. in another library) label Sep 21, 2018
@luizgh
Copy link
Collaborator Author

luizgh commented Sep 21, 2018

Waiting on the resolution of scikit-learn/scikit-learn#8370

Menelau pushed a commit that referenced this issue Sep 28, 2018
…all DES/DCS/static classifiers.

* - Moving code to validate the parameters from __init__ to the fit method (sklearn style)

* Refactoring DCS classes: Changing class attributes names to the sklearn style. Attributes estimated from the data now have an underscore after its name.

* Changes to make the class compatible with the sklearn standards:
- Moving code to validate the estimator parameters from the __init__ to the fit method;
- Refactoring: Changing class attributes names to the sklearn style. Attributes estimated from the data now have an underscore after its name;
- Addition BaseEstimator to the inherited classes for the get_params and set_params methods.

* Updating the test routines to the according to the new changes in attribute names and parameter validation scheme

* PEP8 formatting

* Refactoring according to sklearn guidelines: Changing names of class attributes that are estimated based on the data (on the fit method)

* Updating test routines according to the attributes name change

* Refactoring according to sklearn guidelines:

- Moving code to validate parameters from __init__ to fit
- Change in attribute names (using an underscore after the name of attributes estimated from the data)

* Updating test routines according to the refactoring on attribute name change and the new method for validating the estimator parameters

* Fixing problem with identation

* Refactoring: Moving code that validate parameters to the fit method; change ins the attribute names (sklearn standard) and accepting a clustering method as input parameter.

* Updated test routines for the DESClustering class according to the new guidelines.

* Adding code to verify whether the object passed as the clustering method is part of the sklearn clustering classes.

* Updating the test routines that check if the base classifier implements the predict_proba function (Now the check happens inside the fit method)

* Moving the _check_predict_proba function to the fit method.

* Refactoring: remove old DFP masks

* Refactoring

* - Changing default value of pool_classifiers to None
- Modifying name of random state attribute from rng to random_state

* updating the n_classifies_ attribute in the test routines

* Changing the name of the attribute rng tp random_state in the integration tests.

* Fixing error in the docstring (return value of the method)

* Changing check for proba after the fit method; refactoring attribute names according to sklearn guidelines

* Adding random_state parameter

* Adding random_state parameter

* Adding the DFP and IH and random_state hyper-parameters to DESMI class.

* changing random_state default value

* Adding DESMI to the list of DES techniques

* Adding DES Logarithmic

* Making DS clustering compatible with sklearn estimators guidelines.

* Making DESKNN compatible with sklearn estimators guidelines.

* Making KNOP compatible with sklearn estimators guidelines.

* Making META-DES compatible with sklearn estimators guidelines.

* Making Probabilistic techniques compatible with sklearn estimators guidelines.

* Updating test routines according to the new changes in variable names; Removing not used test cases

* Updating test routines according to the new variable names; Removing obsolete test functions

* Updating name of variables estimated from the data according to the sklearn guidelines

* Adding random_state to the clustering definition

* Making DESMI class an sklearn estimator

* Merging with master branch

* Adding sklearn's "check_estimator"  tests (#84)

* Adding sklearn's "check_estimator" for probabilistic DS methods

* Adding test to show #89 is indeed a problem

* Adding warning on base class (k bigger than DSEL) #93

* Adding known issue with GridSearch #89

* Fixes #91

* Marking the grid search test to skip (#89)

* Adding tests for python 3.7 (#98)

* Workaround for travis 3.7 support (#98)

* Fix #92

* adding pytest_cache to the list of ignored folders

* removing .idea from project

* Fixing problem with rng in DCS classes when using "random" or "diff" as selection method (rng during predict/predict_proba). Fixes #88

* Base class for static ensembles

* Making SingleBest an sklearn estimator

* Making StaticSelection a sklearn estimator

* Removing unused imports

* Making Oracle class compatible with sklearn

* Using sklearn check_array to assert a given array is 2d

* Removing commented code lines

* Fixing docstring on static ensemble classes; Solving a bug with label encoder for the single best class

* Adding license information

* automatically convert array to 2d

* Updating tests with Oracle technique (using fit to setup label encoder)

* updating oracle tests (setup label encoder in the fit method)

* updating test; removing check estimator from Oracle since it is not a real classifier

* Adding check array to predict.

* Enforcing the predictions of the base classifiers are integers.

* Fixing random state bug

* removing commented lines of code

* adding kdn score method

* PEP8 formatting; Cleaning commented code.

* Adding license information

* Adding license information; moving kdn_score function to utils.instance_hardness.py; Adding Label encoder; Refactoring variable names according to sklearn standards

* removing unused code

* Adding check_estimator test for OLP method

* Solving problem with label encoder when no base classifier predicts the correct label

* Test routines for the SGH class

* Adding predict proba; Checking if the method was fitted before calling predict and predict_proba.

* Adding checks to raise an error in regression problems

* skipping test while the batch processing version is not implemented

* Adding parameter to indicate percentage of data used for DSEL in the training-DSEL split

* Updating variable names.

* Updating requirements version (sklearn 0.19) due to estimators check

* Updating requirements version (sklearn 0.19) due to estimators check

* Updating requirements; travis

* Print values of N_ and J_ on error

* Fixed checks for pct_accuracy

* Fix test name

* Fixing test
@Menelau
Copy link
Collaborator

Menelau commented Apr 24, 2019

@luizgh Do you know if there is an update on this issue?

@luizgh
Copy link
Collaborator Author

luizgh commented Apr 24, 2019

@Menelau it is on the roadmap, but from the discussions in scikit-learn/scikit-learn#8370 it seems that it may take a lot of time. They are actually proposing a workaround for pre-trained models (see scikit-learn/scikit-learn#13288), but I am not sure if that would address our problem

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting third party Waiting for changes out of our control (e.g. in another library)
Projects
None yet
Development

No branches or pull requests

2 participants