
How does vecstack.StackingTransformer differ from sklearn.ensemble.StackingClassifier? #37

Closed
zachmayer opened this issue Feb 11, 2020 · 4 comments


@zachmayer

This might be useful to add to the readme

@vecxoz
Owner

vecxoz commented Feb 14, 2020

Hi,
Thanks for the good question!

Both classes are based on the same algorithm, described in the paper Stacked Generalization by David H. Wolpert, but they have very different conceptual implementations and applications. The most important difference is the transformer vs. predictor architecture.

vecstack.StackingTransformer is a dedicated transformer which handles both classification and regression tasks in a single class. It outputs out-of-fold (OOF) predictions and does not have a predict method. Direct access to OOF predictions is extremely important in stacking because it enables the full cycle of analytics and optimization: computing correlations, optimizing weights for averaging, etc. When everything is ready, we can easily combine an arbitrary number of stacking levels and a final estimator using sklearn.pipeline.Pipeline.
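To illustrate the kind of OOF analytics mentioned above, here is a minimal sketch using only sklearn's cross_val_predict as a stand-in for the OOF output that StackingTransformer exposes directly (the dataset and base models mirror the example below; the correlation check is just one possible diagnostic, not part of either library's API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=1, n_classes=3, flip_y=0, random_state=0)

# Out-of-fold class probabilities for two base models
oof_et = cross_val_predict(ExtraTreesClassifier(n_estimators=100, random_state=0),
                           X, y, cv=5, method='predict_proba')
oof_rf = cross_val_predict(RandomForestClassifier(n_estimators=100, random_state=0),
                           X, y, cv=5, method='predict_proba')

# Example analytics on OOF: correlation between the two models' predicted
# probabilities for class 0 -- very high correlation suggests the models
# add little diversity to the ensemble.
corr = np.corrcoef(oof_et[:, 0], oof_rf[:, 0])[0, 1]
print(round(corr, 3))
```

Having these arrays in hand is what lets you compute correlations, blend with optimized weights, or drop redundant base models before fitting the final estimator.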

sklearn.ensemble.StackingClassifier was designed as a predictor which outputs predictions from the final estimator. It does not give access to out-of-fold predictions for the train set, even though it has a transform method. StackingClassifier.transform can correctly transform only X_test, because the estimators used in this method are trained on the full X_train, while stacking requires a cross-validation procedure to transform X_train.

Below I put together a self-contained example which shows the common ground between the two implementations (where the results are exactly identical). You can easily iterate on it to compare other aspects that matter for your use cases:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline
from vecstack import StackingTransformer

X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=3, n_redundant=1,
                           n_classes=3, flip_y=0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)

estimators = [
    ('et', ExtraTreesClassifier(n_estimators=100, random_state=0)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=0))]

final_estimator = LogisticRegression(random_state=0)

#-------------------------------------------------------------------------------
# vecstack.StackingTransformer
#-------------------------------------------------------------------------------

stack = StackingTransformer(estimators=estimators, 
                            regression=False, 
                            variant='B',
                            n_folds=5,
                            shuffle=False, 
                            stratified=True,
                            needs_proba=True)

steps = [('stack', stack),
         ('final_estimator', final_estimator)]

pipe = Pipeline(steps)

y_pred_vecstack = pipe.fit(X_train, y_train).predict_proba(X_test)

#-------------------------------------------------------------------------------
# sklearn.ensemble.StackingClassifier
#-------------------------------------------------------------------------------

clf = StackingClassifier(estimators=estimators,
                         final_estimator=final_estimator,
                         stack_method='predict_proba')

y_pred_sklearn = clf.fit(X_train, y_train).predict_proba(X_test)

print((y_pred_vecstack == y_pred_sklearn).all()) # True

#-------------------------------------------------------------------------------
# Compare transformation
#-------------------------------------------------------------------------------

S_test_vecstack = stack.transform(X_test)
S_test_sklearn = clf.transform(X_test)
print((S_test_vecstack == S_test_sklearn).all()) # True

S_train_vecstack = stack.transform(X_train)
S_train_sklearn = clf.transform(X_train)
print((S_train_vecstack == S_train_sklearn).all()) # False

# Show that sklearn's S_train equals the predictions of the base estimators
# refit on the full X_train (i.e. NOT out-of-fold predictions):
et = ExtraTreesClassifier(random_state=0, n_estimators=100)
rf = RandomForestClassifier(random_state=0, n_estimators=100)
y_pred_et = et.fit(X_train, y_train).predict_proba(X_train)
y_pred_rf = rf.fit(X_train, y_train).predict_proba(X_train)
print((S_train_sklearn == np.hstack([y_pred_et, y_pred_rf])).all()) # True

@zachmayer
Author

This is really informative! Thank you for the write-up, and thank you for the great package! (I think your comment would be a great addition to the readme, btw.)

Have you considered adding your package to sklearn-contrib?

@zachmayer
Author

"Direct access to OOF is extremely important in stacking" <- this seems like the key point. I totally agree!

@vecxoz
Owner

vecxoz commented Feb 15, 2020

Thanks for your kind words!
Yes, OOF is all we need.
Actually, I had not considered sklearn-contrib. I just want to be sklearn-compatible :)
