
Update components and pipelines to return Woodwork data structures #1668

Merged
merged 118 commits into from Jan 27, 2021

Conversation

@angela97lin (Contributor) commented on Jan 8, 2021:

Closes #1406

Not too much implementation-wise, but here are some notes that may be useful:

  • Calling .to_series()/.to_dataframe() on a Woodwork data structure doesn't always produce what we want, because the inferred types might not match our expectations. For example, Woodwork can convert to a pandas series with the nullable Int64 dtype; converting that to numpy then creates an array with object dtype, which is usually not what we want. 😬
  • Context for the partial dependence method update: previously, we could pass our pipelines directly to scikit-learn's partial dependence method after a bit of finagling (adding the attributes it expects). That no longer works, because predict now returns Woodwork data structures, which scikit-learn doesn't know how to handle. Instead, I wrap our pipeline in scikit_learn_wrapped_estimator (already implemented for ensembles) and update the fields on the wrapped object. IMO this is cleaner than before, when we had to delete the ugly attributes we added just so scikit-learn could operate, since we don't care about the wrapped object afterwards :d
  • The Imputer's handling of indices and restoring them to the original was moved to SimpleImputer. This way, SimpleImputer supports the same behavior, and we don't need to maintain it in two separate places, since the Imputer is just a combination of SimpleImputers.
  • I noticed a lot of duplicate code between fit_features and compute_final_component_features, so I refactored it into a new helper function, _fit_transform_features_helper.
  • Currently, the ComponentGraph object is responsible for keeping track of the original logical types passed by a user and converting the transformed data back to those types as necessary. We will have to update this in "Custom woodwork types aren't preserved in components" (#1662).
  • Unrelated, but cleaned up docstrings for random state for consistency as I was updating each component.
  • There was a lot of duplicate code between fit/transform and fit_transform for FeatureSelector, so I removed fit_transform.
  • Updated a lot of tests for consistency. Mock fit should still return self. Mock transforms still need to return WW data structures.
  • Combine _compute_predictions for classification and time series classification pipelines.
  • Cleaned up a lot of duplicate implementations of components' fit/transform/fit_transform and removed them where unnecessary.
  • Removed random_state from OutliersDataCheck. No functional change, but since we no longer use IsolationForest, it isn't necessary.
  • Updated user_guide/components notebook: removed duplicate sections of LinearRegressor code and DropNull code before code generation.
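The nullable-dtype gotcha from the first bullet can be reproduced with plain pandas, independent of Woodwork. A minimal sketch:

```python
import numpy as np
import pandas as pd

# pandas' nullable Int64 dtype, the kind Woodwork may infer for integer columns
s = pd.Series([1, 2, 3], dtype="Int64")

# Converting directly to numpy yields an object array, not int64
print(s.to_numpy().dtype)  # object

# Casting to the plain numpy-backed dtype first gives the expected array
print(s.astype("int64").to_numpy().dtype)  # int64
```

This is why a blind .to_series() followed by .to_numpy() can silently hand downstream code an object array.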
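The wrapping described in the partial-dependence bullet can be sketched roughly as below. This is a simplified illustration, not evalml's actual scikit_learn_wrapped_estimator; the class name and fitted_ attribute are hypothetical, though _estimator_type is the attribute scikit-learn's duck typing genuinely checks:

```python
import numpy as np

class WrappedPipeline:
    """Hypothetical adapter: exposes a pipeline whose predict returns
    Woodwork/pandas structures through a scikit-learn-friendly interface."""

    _estimator_type = "classifier"  # attribute scikit-learn checks via duck typing

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def fit(self, X, y):
        self.pipeline.fit(X, y)
        return self

    def predict(self, X):
        # Convert whatever the pipeline returns into a plain ndarray
        return np.asarray(self.pipeline.predict(X))
```

The benefit noted above: the sklearn-facing attributes live on a throwaway wrapper object, instead of being bolted onto the pipeline and deleted again afterwards.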
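The fit/transform/fit_transform deduplication mentioned above follows a standard pattern: a base-class fit_transform composed from fit and transform, so components only override it when a faster combined path exists. A generic sketch, not evalml's actual base class:

```python
class BaseTransformer:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        raise NotImplementedError

    def fit_transform(self, X, y=None):
        # Default: delegate to fit then transform, so subclasses
        # don't need to duplicate this logic
        return self.fit(X, y).transform(X)

class DoubleValues(BaseTransformer):
    """Toy component: inherits fit_transform for free."""
    def transform(self, X):
        return [2 * v for v in X]

print(DoubleValues().fit_transform([1, 2, 3]))  # [2, 4, 6]
```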

Perf test graphs (I want to double-check these; I've had some issues running the tests):
The time AutoML takes increased a bit. In most cases the increase was under a second; in the most extreme case it was 12 seconds. I think this is expected and not too large.
[image: AutoML runtime comparison]

Performance is... odder. I don't know how much of this is due to randomness and would like to try again, but here are the results thus far:
[image: performance comparison]

More perf test results for the interested:
[images: additional perf test results]

@angela97lin angela97lin self-assigned this Jan 8, 2021
Makefile: outdated review comment (resolved)
@chukarsten (Contributor) left a comment:

Angela - wow. I can't believe you went through all that. That was a lot of changes and great attention to detail. My eyes are bleeding a little bit. Hopefully, next time, we can keep it to a little less of a heroic PR? I don't see anything in here that will require a re-review, though - just some docstrings to maybe touch up. Again, great work.

evalml/automl/utils.py: outdated review comment (resolved)
@@ -461,19 +461,20 @@ def partial_dependence(pipeline, X, feature, grid_resolution=100):
raise ValueError("Pipeline to calculate partial dependence for must be fitted")
if pipeline.model_family == ModelFamily.BASELINE:
raise ValueError("Partial dependence plots are not supported for Baseline pipelines")
if isinstance(pipeline, evalml.pipelines.ClassificationPipeline):
A reviewer (Contributor) commented on this hunk:
Funny that you should make this change. I was just looking at this section of the code and questioning it. This certainly explains why it's being done, but it still doesn't feel like the right place to do it. Was there an argument against adding this _estimator_type attribute to the pipeline class itself? I don't think it's in the scope of this PR to do so, but it would be good if we came out of this with an issue determining how to ultimately remove this chunk from partial dependence.
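For context on why this attribute matters: scikit-learn's utilities decide classifier-vs-regressor behavior by duck typing on _estimator_type rather than by class hierarchy. A minimal sketch mirroring the kind of check sklearn has used internally (not a copy of its implementation):

```python
def looks_like_classifier(estimator):
    # Mirrors the duck-typed check behind sklearn's historical is_classifier:
    # any object carrying the attribute passes, regardless of its base class
    return getattr(estimator, "_estimator_type", None) == "classifier"

class FakePipeline:
    _estimator_type = "classifier"

print(looks_like_classifier(FakePipeline()))  # True
print(looks_like_classifier(object()))        # False
```

This is why either a wrapper object or an attribute on the pipeline class itself can satisfy partial dependence.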

evalml/pipelines/classification_pipeline.py: review comment (resolved)
np.testing.assert_allclose(clf.predict(X), get_random_state(0).choice(np.unique(y), len(X)))

predictions = clf.predict(X)
assert_series_equal(pd.Series(get_random_state(0).choice(np.unique(y), len(X)), dtype="Int64"), predictions.to_series())
A reviewer (Contributor) commented:
It's a little unclear to me what we're trying to test here.

@angela97lin (Contributor Author) replied:

I think we're just trying to test that the baseline classifier's predict method returns as we'd expect, here for the case where the strategy is "random" (just randomly select from target to use as prediction). I'll break this down into two separate lines, so hopefully it'll be a bit more clear :)
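The behavior under test, the baseline classifier's "random" strategy, amounts to sampling uniformly from the observed classes, one prediction per row. A rough sketch of the idea (hypothetical helper, not evalml's implementation):

```python
import numpy as np
import pandas as pd

def baseline_random_predict(X, y, seed=0):
    # "random" strategy: sample uniformly from the unique target values,
    # producing one prediction per row of X
    rng = np.random.RandomState(seed)
    return pd.Series(rng.choice(np.unique(y), size=len(X)))

y = pd.Series([0, 1, 1, 0])
preds = baseline_random_predict([[1], [2], [3]], y)
print(len(preds))  # 3
```

The test then just checks that predict matches this sampling when seeded identically.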

evalml/tests/component_tests/test_components.py: outdated review comment (resolved)
@bchen1116 (Contributor) left a comment:

Giant PR! Wow, this was a lot of edits.

There are some issues popping up in AutoMLSearch in the docs here as well. I think this was caused somewhere by these changes, since the errors seem to stem from type errors thrown by ww. Would be useful to fix before merging this PR!

Also, do we want to change the tutorials in the docs to use and return Woodwork instead of pandas?

Other than that, I left a few comments/questions, but looks good! Hopefully I didn't miss any major errors. This PR was a doozy.

Outdated review comments (resolved) on:
evalml/automl/automl_algorithm/automl_algorithm.py
evalml/tests/automl_tests/test_automl.py
evalml/tests/component_tests/test_components.py
evalml/tests/component_tests/test_datetime_featurizer.py
evalml/tests/component_tests/test_lgbm_classifier.py
@angela97lin (Contributor Author) replied:

@bchen1116 Thank you for reviewing and pointing out the docs issues!! I'll dig into them and see what's up 😁

@@ -402,45 +416,6 @@
"AutoML will perform a search over the allowed ranges for each parameter to select models which produce optimal performance within those ranges. AutoML gets the allowed ranges for each component from the component's `hyperparameter_ranges` class attribute. Any component parameter you add an entry for in `hyperparameter_ranges` will be included in the AutoML search. If parameters are omitted, AutoML will use the default value in all pipelines. "
]
},
{
"cell_type": "code",
@angela97lin (Contributor Author) commented:

I think this is accidental duplicate code, deleting 😱

"from evalml.pipelines.components.utils import generate_component_code\n",
"\n",
"class MyDropNullColumns(Transformer):\n",
@angela97lin (Contributor Author) commented:

No need to repeat this; I believe the only difference is the name, and there's nothing here that's special for code gen, so deleting!

Development

Successfully merging this pull request may close these issues.

Update pipeline and components to return Woodwork data structures
4 participants