Stacked ensembler: use same CV data splitter as the rest of automl #1592

dsherry opened this issue Dec 22, 2020 · 4 comments

bug Issues tracking problems with existing features.



dsherry commented Dec 22, 2020

Currently, the stacked ensembler has its own setup for CV. IterativeAlgorithm calls the _make_stacked_ensembler util, but doesn't currently thread through the data splitter from automl search.

The data splitter set up by default in the stacked ensembler doesn't set shuffle=True, which could lead to poor performance if the input dataset has an ordering. It also doesn't pick up automl's settings for other parameters such as n_folds, so the ensembler's CV can differ from the one used by the rest of the search.
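
To illustrate the ordering concern, here is a minimal sketch that uses scikit-learn's KFold directly rather than the stacked ensembler's actual setup. The toy data and the helpers is_contiguous and collect_test_folds are made up for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy target sorted in ascending order, standing in for a dataset with an ordering.
y = np.arange(30)
X = y.reshape(-1, 1)


def is_contiguous(indices):
    """Return True if the indices form one unbroken, ascending run."""
    return bool(np.all(np.diff(indices) == 1))


def collect_test_folds(cv):
    """Collect the test indices of every fold produced by the splitter."""
    return [test for _, test in cv.split(X, y)]


unshuffled = collect_test_folds(KFold(n_splits=3))
shuffled = collect_test_folds(KFold(n_splits=3, shuffle=True, random_state=0))

# Report, per fold, whether the test indices are one contiguous block.
for name, folds in [("shuffle=False", unshuffled), ("shuffle=True", shuffled)]:
    print(name, [is_contiguous(fold) for fold in folds])
```

Without shuffling, each test fold is a contiguous block of the ordered data, so the model fitted for that fold never sees training rows from the range it is evaluated on.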

Also, this difference is preventing us from supporting sklearn 0.24.0. Fixing this issue should allow us to support that version.

Let's have automl pass its data splitter through IterativeAlgorithm down into the stacked ensembler.

@dsherry dsherry added the bug Issues tracking problems with existing features. label Dec 22, 2020
@dsherry dsherry added this to the December 2020 milestone Dec 22, 2020
dsherry added a commit that referenced this issue Dec 22, 2020

dsherry commented Dec 22, 2020

@angela97lin does my explanation make sense / was there a reason we chose not to do this when we were setting up stacking? :)


@dsherry I think your explanation makes sense! IIRC, when we were setting up stacking and trying to make it run faster, we wanted a default that didn't have too many folds. Hence the self._default_cv(n_splits=3, random_state=random_state) line, where we take the default specified by scikit-learn and hardcode n_splits to 3.
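
For context on how a call shaped like the one quoted above interacts with scikit-learn 0.24: to my knowledge, from 0.24 onward KFold and StratifiedKFold raise a ValueError when random_state is set while shuffle is left at False, whereas earlier releases only warned. The sketch below assumes StratifiedKFold as a stand-in for _default_cv, and make_default_cv is a hypothetical helper, not evalml code.

```python
from sklearn.model_selection import StratifiedKFold


def make_default_cv(random_state, shuffle):
    """Hypothetical helper mirroring the quoted _default_cv(...) call."""
    return StratifiedKFold(n_splits=3, shuffle=shuffle, random_state=random_state)


# With shuffle=True the random_state is actually used, so this is accepted.
cv = make_default_cv(random_state=0, shuffle=True)
print(cv.get_n_splits())

# With shuffle=False the outcome depends on the installed scikit-learn version.
try:
    make_default_cv(random_state=0, shuffle=False)
    print("accepted (older scikit-learn)")
except ValueError as err:
    print("rejected:", err)
```

Keeping n_splits=3 and adding shuffle=True preserves the "few folds" default while making the random_state meaningful.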


Dug into this a bit more and tried to thread the data splitter used by AutoML through to the stacked ensemble component. However, I ran into this issue (after making the API updates necessary to get our TrainingValidationSplit class to work):

estimator = WrappedSKClassifier(pipeline=LogisticRegressionBinaryPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'm...Logistic Regression Classifier':{'penalty': 'l2', 'C': 1.0, 'n_jobs': -1, 'multi_class': 'auto', 'solver': 'lbfgs'},}))
X =            0         1         2         3         4
0   0.965469  0.041236  0.028701  0.659165  0.213375
1   0.043831...978  0.079577
48  0.376344  0.920154  0.314640  0.180086  0.197598
49  0.682661  0.046529  0.400513  0.412513  0.751464
y = array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 0])

    def cross_val_predict(estimator, X, y=None, *, groups=None, cv=None,
                          n_jobs=None, verbose=0, fit_params=None,
                          pre_dispatch='2*n_jobs', method='predict'):
        """Generate cross-validated estimates for each input data point

        The data is split according to the cv parameter. Each sample belongs
        to exactly one test set, and its prediction is computed with an
        estimator fitted on the corresponding training set.

        Passing these predictions into an evaluation metric may not be a valid
        way to measure generalization performance. Results can differ from
        :func:`cross_validate` and :func:`cross_val_score` unless all tests sets
        have equal size and the metric decomposes over samples.

        Read more in the :ref:`User Guide <cross_validation>`.

        estimator : estimator object implementing 'fit' and 'predict'
            The object to use to fit the data.

        X : array-like of shape (n_samples, n_features)
            The data to fit. Can be, for example a list, or an array at least 2d.

        y : array-like of shape (n_samples,) or (n_samples, n_outputs), \
            The target variable to try to predict in the case of
            supervised learning.

        groups : array-like of shape (n_samples,), default=None
            Group labels for the samples used while splitting the dataset into
            train/test set. Only used in conjunction with a "Group" :term:`cv`
            instance (e.g., :class:`GroupKFold`).

        cv : int, cross-validation generator or an iterable, default=None
            Determines the cross-validation splitting strategy.
            Possible inputs for cv are:

            - None, to use the default 5-fold cross validation,
            - int, to specify the number of folds in a `(Stratified)KFold`,
            - :term:`CV splitter`,
            - An iterable yielding (train, test) splits as arrays of indices.

            For int/None inputs, if the estimator is a classifier and ``y`` is
            either binary or multiclass, :class:`StratifiedKFold` is used. In all
            other cases, :class:`KFold` is used.

            Refer :ref:`User Guide <cross_validation>` for the various
            cross-validation strategies that can be used here.

            .. versionchanged:: 0.22
                ``cv`` default value if None changed from 3-fold to 5-fold.

        n_jobs : int, default=None
            Number of jobs to run in parallel. Training the estimator and
            predicting are parallelized over the cross-validation splits.
            ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
            ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
            for more details.

        verbose : int, default=0
            The verbosity level.

        fit_params : dict, defualt=None
            Parameters to pass to the fit method of the estimator.

        pre_dispatch : int or str, default='2*n_jobs'
            Controls the number of jobs that get dispatched during parallel
            execution. Reducing this number can be useful to avoid an
            explosion of memory consumption when more jobs get dispatched
            than CPUs can process. This parameter can be:

                - None, in which case all the jobs are immediately
                  created and spawned. Use this for lightweight and
                  fast-running jobs, to avoid delays due to on-demand
                  spawning of the jobs

                - An int, giving the exact number of total jobs that are
                  spawned

                - A str, giving an expression as a function of n_jobs,
                  as in '2*n_jobs'

        method : {'predict', 'predict_proba', 'predict_log_proba', \
                  'decision_function'}, default='predict'
            The method to be invoked by `estimator`.

        predictions : ndarray
            This is the result of calling `method`. Shape:

                - When `method` is 'predict' and in special case where `method` is
                  'decision_function' and the target is binary: (n_samples,)
                - When `method` is one of {'predict_proba', 'predict_log_proba',
                  'decision_function'} (unless special case above):
                  (n_samples, n_classes)
                - If `estimator` is :term:`multioutput`, an extra dimension
                  'n_outputs' is added to the end of each shape above.

        See Also
        cross_val_score : Calculate score for each CV split.
        cross_validate : Calculate one or more scores and timings for each CV
            split.

        In the case that one or more classes are absent in a training portion, a
        default score needs to be assigned to all instances for that class if
        ``method`` produces columns per class, as in {'decision_function',
        'predict_proba', 'predict_log_proba'}.  For ``predict_proba`` this value is
        0.  In order to ensure finite output, we approximate negative infinity by
        the minimum finite float value for the dtype in other cases.

        >>> from sklearn import datasets, linear_model
        >>> from sklearn.model_selection import cross_val_predict
        >>> diabetes = datasets.load_diabetes()
        >>> X = diabetes.data[:150]
        >>> y = diabetes.target[:150]
        >>> lasso = linear_model.Lasso()
        >>> y_pred = cross_val_predict(lasso, X, y, cv=3)
        X, y, groups = indexable(X, y, groups)

        cv = check_cv(cv, y, classifier=is_classifier(estimator))
        splits = list(cv.split(X, y, groups))

        test_indices = np.concatenate([test for _, test in splits])
        if not _check_is_permutation(test_indices, _num_samples(X)):
>           raise ValueError('cross_val_predict only works for partitions')
E           ValueError: cross_val_predict only works for partitions

../venv/lib/python3.7/site-packages/sklearn/model_selection/ ValueError

This is an error thrown when we try to call:

clf = StackedEnsembleClassifier(input_pipelines=[logistic_regression_binary_pipeline_class(parameters={})], cv=TrainingValidationSplit())
clf.fit(X, y)

This happens because scikit-learn validates that the cv passed is a true cross-validation method, i.e. that the test sets across all splits form a partition of the data. It rejects single splits such as TrainingValidationSplit, where some of the data will never be in the test set (since there is only one split).
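
The partition requirement can be reproduced outside evalml. This is a minimal sketch, assuming scikit-learn's ShuffleSplit(n_splits=1) as a stand-in for TrainingValidationSplit and a plain LogisticRegression in place of the wrapped pipeline; the random data is made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_predict

rng = np.random.default_rng(0)
X = rng.random((50, 5))
y = np.tile([1, 0], 25)  # alternating labels, as in the traceback above

estimator = LogisticRegression()

# K-fold: every sample lands in exactly one test fold, so this is a partition.
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
preds = cross_val_predict(estimator, X, y, cv=kfold)
print(preds.shape)

# Single split: most samples never appear in a test set, so it is rejected.
single_split = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
try:
    cross_val_predict(estimator, X, y, cv=single_split)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)
```

Any splitter handed to the stacked ensembler therefore needs to cover every row exactly once across its test sets, which a single train/validation split cannot do.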

As such, I think the best plan for now is to do the easy thing to support scikit-learn 0.24 and set the default cv's shuffle=True. We can revisit this if we think it's a useful thing to do. Thoughts, @dsherry?



#1593 should no longer be blocked by this issue, since what was necessary for 0.24.0 should have been resolved in #1613.
