
Adds fit_params support for stacking classifiers #255

Open · jrbourbeau wants to merge 5 commits into rasbt:master from jrbourbeau:add_fit_params_for_cross_val_score

Conversation

jrbourbeau
Contributor

@jrbourbeau jrbourbeau commented Sep 22, 2017

Description

This PR aims to add fit parameter support for StackingClassifier, StackingCVClassifier, and EnsembleVoteClassifier.

Related issues or pull requests

Fixes #177, fixes #178, fixes #179

Pull Request requirements

  • Added appropriate unit test functions in the ./mlxtend/*/tests directories
  • Ran nosetests ./mlxtend -sv and made sure that all unit tests pass
  • Checked the test coverage by running nosetests ./mlxtend --with-coverage
  • Checked for style issues by running flake8 ./mlxtend
  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file
  • Modify documentation in the appropriate location under mlxtend/docs/sources/ (@rasbt will take care of that)
  • Checked that the Travis-CI build passed at https://travis-ci.org/rasbt/mlxtend

@pep8speaks

pep8speaks commented Sep 22, 2017

Hello @jrbourbeau! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on October 20, 2017 at 03:42 UTC

@jrbourbeau
Contributor Author

jrbourbeau commented Sep 22, 2017

@rasbt so far I've only added fit_params support for StackingClassifier. Any comments on the implementation? The following code snippet with cross_val_score should work now:

import numpy as np
from mlxtend.classifier import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score

# Generate some data
X, y = make_blobs(random_state=2)

# Build a StackingClassifier
classifiers = [RandomForestClassifier(random_state=2),
               SGDClassifier(random_state=2)]
meta_classifier = RandomForestClassifier(random_state=2)
sclf = StackingClassifier(classifiers=classifiers, meta_classifier=meta_classifier)

# Define some fit_params
fit_params = {'randomforestclassifier__sample_weight': np.arange(X.shape[0]),
              'sgdclassifier__intercept_init': np.unique(y),
              'meta-randomforestclassifier__sample_weight': np.full(X.shape[0], 7)}

# Pass fit params to StackingClassifier and cross_val_score
sclf.fit(X, y, **fit_params)
print('predictions = {}'.format(sclf.predict(X)))
print('scores = {}'.format(cross_val_score(sclf, X, y, fit_params=fit_params)))
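For reference, the double-underscore prefixes above come from the lowercased class names. A minimal sketch of how they are generated (this uses scikit-learn's private _name_estimators helper, which mlxtend mirrors internally; treat the exact import as an assumption):

from sklearn.pipeline import _name_estimators

# Maps each lowercased class name to the corresponding estimator:
print(dict(_name_estimators(classifiers)))
# e.g. {'randomforestclassifier': RandomForestClassifier(...),
#       'sgdclassifier': SGDClassifier(...)}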

@coveralls

Coverage Status

Coverage decreased (-0.06%) to 89.597% when pulling 6fd998d on jrbourbeau:add_fit_params_for_cross_val_score into 3424df6 on rasbt:master.

@coveralls

Coverage Status

Coverage increased (+0.05%) to 89.705% when pulling 27206b3 on jrbourbeau:add_fit_params_for_cross_val_score into 3424df6 on rasbt:master.

@rasbt
Owner

rasbt commented Sep 23, 2017

Wow, this is awesome -- it even includes support for both the first-level classifiers and the meta-classifier. Again, thanks so much for the PR. I am happy to help with extending this to the other ensemble/stacking classifiers (and regressors) :).

@jrbourbeau
Contributor Author

jrbourbeau commented Sep 25, 2017

Great! I can work on adding fit parameters to other estimators. Here are the ones I had in mind:

  • StackingClassifier
  • StackingCVClassifier
  • EnsembleVoteClassifier
  • StackingRegressor
  • StackingCVRegressor

Any others you can think of?

@rasbt
Owner

rasbt commented Sep 26, 2017

I think that includes all of the ensemble methods I could currently think of as well :). I was/am a bit busy due to a paper deadline at the end of the week, but I am happy to take care of a few of them as well so that you don't have to work on it all alone -- and of course, there's really no hurry :)

@rasbt
Owner

rasbt commented Sep 27, 2017

Just a very minor suggestion, could you change the docstring for the fit_params in fit() to the following:

    fit_params : dict of string -> object, optional
        Parameters to pass to the fit methods of the `classifiers` and
        `meta_classifier`.

(The string -> object is how scikit-learn lists it; it's maybe good to use the same convention for consistency)

@jrbourbeau
Contributor Author

For sure, I definitely think the fit_params docstring here should match the corresponding scikit-learn docstring. Where did you find the fit_params : dict of string -> object, optional docstring in scikit-learn? I was using fit_params : dict, optional from the cross_val_score documentation (http://scikit-learn.org/dev/modules/generated/sklearn.model_selection.cross_val_score.html).

@rasbt
Owner

rasbt commented Sep 27, 2017

Yeah, I think that also "fit_params : dict of string -> object, optional" would only be minimally more helpful compared to "dict" (found it in the GridSearchCV docs; http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

@jrbourbeau
Contributor Author

jrbourbeau commented Sep 27, 2017

Do you think adding something like Example 3 - Stacked Regression with sample_weight to the StackingRegressor user guide page would be useful for users? (Similar examples could be added for the stacked classifiers as well.) Not sure if that's too specific an example or not.

@coveralls

Coverage Status

Coverage increased (+0.09%) to 89.75% when pulling 05942e9 on jrbourbeau:add_fit_params_for_cross_val_score into 3424df6 on rasbt:master.

@rasbt
Owner

rasbt commented Sep 27, 2017

Example 3 - Stacked Regression with sample_weight

Yeah, I think this would be very useful! Maybe, "Example 3 - Stacked Regression with sample_weight using fit_params"? That way it may be a bit clearer how to refer to the meta- vs. first-level classifiers via the string names.

@jrbourbeau
Contributor Author

@rasbt sorry for the delay, I've been out of town at a conference.

I ran into an issue when trying to add fit_params support for StackingCVClassifier. The issue arises when each CV fold is trained:

for num, (train_index, test_index) in enumerate(skf):
    if self.verbose > 0:
        print("Training and fitting fold %d of %d..." %
              ((num + 1), final_cv.get_n_splits()))
    try:
        model.fit(X[train_index], y[train_index])

Some classifier fit parameters are arrays that have a value for each sample being trained on (e.g. the sample_weight parameter for RandomForestClassifier). This case is pretty straightforward because the sample_weight array can be indexed using the same train_index array as the training data.
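For illustration, a minimal, self-contained sketch of that straightforward case (the estimator and CV setup here are stand-ins, not the PR's internals):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X, y = make_blobs(random_state=2)
sample_weight = np.arange(X.shape[0], dtype=float)
model = RandomForestClassifier(random_state=2)

# sample_weight has one entry per sample, so it can be sliced with the
# same train_index used to slice the training data:
for train_index, test_index in StratifiedKFold(n_splits=2).split(X, y):
    model.fit(X[train_index], y[train_index],
              sample_weight=sample_weight[train_index])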

However, it's less clear how to deal with fit parameters that aren't of shape (n_samples,). For example, SGDClassifier has coef_init and intercept_init fit parameters that are of shape (n_classes, n_features) and (n_classes,), respectively. I guess one could check if a fit parameter is an array of shape (n_samples,), and if so index it with the train_index array. But this seems a little hack-ish to me 😕 I can already think of a couple of edge cases that would lead to problems.

Any suggestions on how to get around this issue?

@rasbt
Owner

rasbt commented Oct 10, 2017

No need to apologize, hope you enjoyed the conference and the trip!

how to deal with fit parameters

Hm, that's a good question ...

  • n_classes: I guess we don't have to worry about n_classes as that shouldn't change.
  • n_features: We don't have to worry about that in the 1st level classifiers as the k-folds should have the same feature dimension as the training set. However, it's a bit tricky regarding the 2nd-level (meta) classifier. Maybe we should add a check in the fit method for that, for example,
if not self.use_features_in_secondary:
    n_features_in_2nd = len(self.classifiers)
else:
    n_features_in_2nd = len(self.classifiers) + X.shape[1]

if 'coef_init' in meta_fit_params and \
        meta_fit_params['coef_init'].shape[1] != n_features_in_2nd:
    raise AttributeError('Number of features in the `coef_init` array passed'
                         ' via `fit_params` for the meta-classifier must be'
                         ' equal to the number of features this classifier'
                         ' expects based on the number of first-level'
                         ' classifiers and whether `use_features_in_secondary`'
                         ' is set to `True` or `False`.'
                         ' Expected: %d'
                         ' Got: %d' % (n_features_in_2nd,
                                       meta_fit_params['coef_init'].shape[1]))

On the other hand, there are probably multiple parameters whose shape depends on n_features somehow, and my guess is it would be quite tricky to maintain this.

  • n_samples: this is probably the trickiest, since we are modifying the training set in a sense, and like you said, we kind of need a way to determine when to use the sub-indices to pass the correct weights.

I agree with you that an automated checking based on the shape is quite unstable and might break in certain edge cases. Hm, I currently don't have a good idea for how to handle that.

For now, maybe we should just handle sample_weight explicitly by passing it the respective training set indices, and be very clear in the fit docstring that any other parameter whose shape is based on n_samples (or on n_features for the meta-classifier) might result in unexpected behavior?
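For what it's worth, a minimal sketch of that idea (the helper name and structure are illustrative, not the PR's actual implementation): slice only sample_weight per fold and forward everything else unchanged.

def fit_fold(model, X, y, train_index, fit_params):
    # Hypothetical helper: only sample_weight is indexed per fold;
    # all other fit parameters are forwarded as-is.
    fold_params = dict(fit_params)
    if 'sample_weight' in fold_params:
        fold_params['sample_weight'] = fold_params['sample_weight'][train_index]
    model.fit(X[train_index], y[train_index], **fold_params)
    return model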

@rasbt
Owner

rasbt commented Oct 12, 2017

Btw, I just saw that the VotingClassifier in scikit-learn (which was ported from mlxtend's EnsembleVoteClassifier some time ago) currently also doesn't support sample weights:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/voting_classifier.py#L176

(Edit: Please ignore the passage above, I misread the code)

So, I was thinking that before we come up with some hacky workarounds for those parameters that rely on n_samples or n_features, we could maybe take a similar approach and just allow support for fit params that are more general, e.g., the ones mentioned in

#177

{'xgbclassifier__eval_metric': 'mlogloss',
 'xgbclassifier__eval_set': [(X_test, y_test)],
 'xgbclassifier__early_stopping_rounds': 100,
 'xgbclassifier__verbose': False}

So, I think the best way would be to add similar exceptions for

sample_weight, coef, coef_init

to get at least some working versions that generally support fit_params without unexpected behavior in edge cases :).
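A minimal sketch of what such a guard could look like (the function name and the exact parameter list are assumptions, not the PR's code):

# Parameters whose shape depends on n_samples or n_features and therefore
# can't be passed through unchanged for every CV fold:
UNSUPPORTED_CV_FIT_PARAMS = ('sample_weight', 'coef_init', 'intercept_init')

def check_cv_fit_params(fit_params):
    for key in fit_params:
        # e.g. 'randomforestclassifier__sample_weight' -> 'sample_weight'
        if key.rsplit('__', 1)[-1] in UNSUPPORTED_CV_FIT_PARAMS:
            raise ValueError('%s depends on the training set shape and is not'
                             ' supported; only fit_params that stay constant'
                             ' across CV folds can be passed.' % key)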

@jrbourbeau
Contributor Author

That sounds like a good plan to me! So, just to be clear: right now we'll only support fit_params that are the same for each training fold (e.g. like the params in #177)? So something like

for num, (train_index, test_index) in enumerate(skf):

    if self.verbose > 0:
        print("Training and fitting fold %d of %d..." %
              ((num + 1), final_cv.get_n_splits()))

    try:
        model.fit(X[train_index], y[train_index], **clf_fit_params)

will work because we won't have to worry about having different slices of fit_params for different CV folds.

And obviously this will need to be clarified in the documentation 😃

@rasbt
Owner

rasbt commented Oct 13, 2017

Oh yeah, that's probably the best way to handle this for now :)

@coveralls

Coverage Status

Coverage increased (+1.3%) to 90.918% when pulling 162c159 on jrbourbeau:add_fit_params_for_cross_val_score into 3424df6 on rasbt:master.

@rasbt
Owner

rasbt commented Oct 14, 2017

Thanks for the PR! Hm, it's weird that the unit tests on Windows for Py 2.7 fail. My first thought was that it could be due to banker's rounding in Python 3, i.e.,

Python 2.7:
>>> round(2.5)
3.0

Python 3:
>>> round(2.5)
2

But then, it wouldn't explain why the Py27 tests pass on Travis. Any idea about what might be going on?

@jrbourbeau
Contributor Author

Yeah, it is really weird that the tests fail for 2.7 on AppVeyor but not on Travis; I'm not sure what's going on there. I have seen kind of random test failures happen before on some CI services. I'll try adding a small commit to see if re-running the tests fails again.

@coveralls

Coverage Status

Coverage increased (+1.3%) to 90.918% when pulling 7ac7887 on jrbourbeau:add_fit_params_for_cross_val_score into 3424df6 on rasbt:master.

@rasbt
Owner

rasbt commented Oct 15, 2017

Sorry about the trouble with AppVeyor! I see something like

 assert scores_mean == 0.94, scores_mean
AssertionError: 0.95

I am not sure why it suddenly occurs; it seems like some rounding error somewhere. I can't see anything in your new code additions (like integer division) that may cause this, as those are "old" unit tests that fail. This is really weird!

@rasbt
Owner

rasbt commented Oct 18, 2017

Argh, I just found an issue with Travis CI: the unit tests were not properly executed in Python 2.7. It's fixed in the master branch now.

Maybe try to sync your master branch of your fork (e.g., like described here: http://rasbt.github.io/mlxtend/contributing/#syncing-an-existing-fork)

and then you could rebase your fork on top of the master branch. I think the following should do it:

git checkout your_branch
git rebase upstream/master

Alternatively, instead of rebasing, you could execute the following four commands that should also fix the problem:

git checkout origin/master ci/.travis_install.sh
git checkout origin/master ci/.travis_test.sh
git checkout origin/master .travis.yml
git checkout origin/master mlxtend/plotting/test_ecdf.py

(the last one was due to a bug that only surfaces in Python 2.7; it went unnoticed because the Travis py27 tests didn't run recently, and AppVeyor does not run the plotting functions).

This will probably not solve the unit test issue, but at least we would know if there's something odd about Python 2.7 on Windows or in general (once more, Python 2.7 turns out to be an annoyance; time to replace it completely with Python 3.6 :))

@jrbourbeau
Contributor Author

Awesome, good to know! I'll update and see if we at least get consistent test failures.

once more, Python 2.7 turns out to be an annoyance, time to replace it completely by Python 3.6 :)

👍

@jrbourbeau jrbourbeau force-pushed the add_fit_params_for_cross_val_score branch from 7ac7887 to 2372300 on October 19, 2017 at 17:10
@coveralls

Coverage Status

Coverage increased (+0.1%) to 90.918% when pulling 2372300 on jrbourbeau:add_fit_params_for_cross_val_score into 922f44f on rasbt:master.

@rasbt
Owner

rasbt commented Oct 19, 2017

Awesome, good to know! I'll update and see if we at least get consistent test failures.

Thanks! I am curious to see how that will turn out.

@jrbourbeau
Contributor Author

I believe the rebase worked fine; my commits for this PR are now on top of your commits related to Python 2.7 on Travis. But it looks like the tests are still passing on Travis and still failing on AppVeyor 😕

@rasbt
Owner

rasbt commented Oct 19, 2017

Oh hm, that's weird. It seems like the files are still the old ones and Travis is still not running those Py27 tests. Sorry about the inconvenience, but maybe this could be resolved via the following (after you synced your master branch with the upstream one)

git checkout origin/master ci/.travis_install.sh
git checkout origin/master ci/.travis_test.sh
git checkout origin/master .travis.yml
git checkout origin/master mlxtend/plotting/test_ecdf.py

@jrbourbeau
Contributor Author

Thanks for helping out with this!

Just to be clear, I've done

# Switch to master branch from feature branch
$ git checkout master
# Update my local master branch to be up-to-date with upstream
$ git fetch upstream
$ git merge upstream/master
# Update origin to also be synced up with upstream
$ git push origin master 

which has synced up my master branch with upstream (your mlxtend repo). A quick inspection of git log shows that my local master branch matches the upstream log (https://github.com/rasbt/mlxtend/commits/master) and my origin log (https://github.com/jrbourbeau/mlxtend/commits/master), so I think everything should be up-to-date.

I then switched to my feature branch and did the checkouts you've provided

# Switch back to feature branch
$ git checkout add_fit_params_for_cross_val_score
# Checkout files from my updated origin/master
$ git checkout origin/master ci/.travis_install.sh
$ git checkout origin/master ci/.travis_test.sh
$ git checkout origin/master .travis.yml
$ git checkout origin/master mlxtend/plotting/tests/test_ecdf.py

This didn't seem to update any of my local files (at least git status didn't indicate that any changes had been made). So, if I'm not mistaken, everything (origin, my local master, upstream, and my local feature branch) should be up to date with one another.

@jrbourbeau
Contributor Author

jrbourbeau commented Oct 20, 2017

Actually, one thing I did notice related to the Travis tests (and my previous comment) is that .travis.yml

mlxtend/.travis.yml

Lines 11 to 15 in 922f44f

install:
- sudo apt-get update
- source ci/.travis_install.sh
script:
- bash ci/.travis_test.sh

is pointing to scripts in the ci directory. But those scripts, mlxtend/ci/.travis_install.sh and mlxtend/ci/.travis_test.sh, haven't been updated in four months. I think they should be pointing to mlxtend/.travis_install.sh and mlxtend/.travis_test.sh, which were updated a couple of days ago, instead.

@rasbt
Owner

rasbt commented Oct 20, 2017

Oh, there's definitely something funny going on. I think I screwed up a rebase after adjusting those travis files, which reverted them back to the four-month-old version. Let me address this ... thanks for mentioning it!

@rasbt
Owner

rasbt commented Oct 20, 2017

Okay, like you said, the issue was that I left the wrong paths in place after playing around with Travis a few days ago. I moved the files into the right place now and, based on the logs, it should all be fine now :)

@jrbourbeau jrbourbeau force-pushed the add_fit_params_for_cross_val_score branch from 2372300 to ab389b1 on October 20, 2017 at 03:42
@coveralls

Coverage Status

Coverage increased (+0.1%) to 90.918% when pulling ab389b1 on jrbourbeau:add_fit_params_for_cross_val_score into 4388714 on rasbt:master.

@kingychiu

Hi @rasbt, what's the progress on this enhancement? As a workaround, I implemented a class inheriting from StackingClassifier, but I'd like to see official support for the fit parameters :). Is there anything left I can help with here?

@jrbourbeau
Contributor Author

Hi @kingychiu, thanks for commenting! I totally lost track of this PR.

I'm busy for the remainder of this week and into next week, but I'd like to finish up my work on this PR later next week.


Returns
-------
self : object

"""
self.clfs_ = [clone(clf) for clf in self.classifiers]
self.named_clfs_ = {key: value for key, value in
                    _name_estimators(self.clfs_)}
Contributor

dict(_name_estimators(self.clfs_)) might work, but no guarantee.

Owner

hm yeah, I think it should be equivalent indeed
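A quick standalone check of that equivalence (toy pairs, not mlxtend code); dict() accepts any iterable of (key, value) pairs, which is exactly what _name_estimators returns:

pairs = [('randomforestclassifier', 'clf_a'), ('sgdclassifier', 'clf_b')]
assert dict(pairs) == {key: value for key, value in pairs}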

clf_fit_params = {}
for key, value in six.iteritems(fit_params):
    if name in key and 'meta-' not in key:
        clf_fit_params[key.replace(name + '__', '')] = value
Contributor

@qiagu qiagu Oct 2, 2019

for key in list(six.iterkeys(fit_params)):
    if name in key and 'meta-' not in key:
        clf_fit_params[key[len(name) + 2: ]] = fit_params.pop(key)

might be more efficient, since otherwise fit_params is iterated over in full again for each classifier.

Contributor

I suppose key.startswith(name) is True. Is that the case?
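It should be, given how the keys are built (name + '__' + parameter). A tiny illustration (toy strings, not mlxtend code) of why startswith would be the stricter test:

name = 'randomforestclassifier'
meta_key = 'meta-randomforestclassifier__sample_weight'

print(name in meta_key)           # True  -- a substring match also hits meta keys
print(meta_key.startswith(name))  # False -- startswith excludes them, which would
                                  #          make the extra 'meta-' check redundant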

clf_fit_params = {}
for key, value in six.iteritems(fit_params):
    if name in key and 'meta-' not in key:
        clf_fit_params[key.replace(name + '__', '')] = value
Contributor

pop can help here as well.

@qiagu
Contributor

qiagu commented Oct 2, 2019

Solid enhancement. I look forward to this PR being merged!
Once the issues are solved, of course.

@rasbt
Owner

rasbt commented Oct 3, 2019

Thanks for the code review, @qiagu !
