Update components and pipelines to return Woodwork data structures #1668
Conversation
Angela - wow. I can't believe you went through all that. That was a lot of changes and great attention to detail. My eyes are bleeding a little bit. Hopefully, next time, we can keep it to a little less of a heroic PR? I don't see anything in here that will require a re-review though - just some doc strings to maybe touch up on. Again, great work.
@@ -461,19 +461,20 @@ def partial_dependence(pipeline, X, feature, grid_resolution=100):
        raise ValueError("Pipeline to calculate partial dependence for must be fitted")
    if pipeline.model_family == ModelFamily.BASELINE:
        raise ValueError("Partial dependence plots are not supported for Baseline pipelines")
    if isinstance(pipeline, evalml.pipelines.ClassificationPipeline):
Funny that you should make this change. I was just looking at this section of the code and questioning it. This certainly explains why it's being done, but it still doesn't feel like the right place to do it. Was there an argument against adding this _estimator_type attribute to the pipeline class itself? I don't think it's in the scope of this PR to do so, but it would be good if we came out of this with an issue determining how to ultimately remove this chunk from partial dependence.
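A rough sketch of what that alternative might look like (simplified stand-in classes, not evalml's actual pipeline hierarchy; the only assumption is that sklearn's is_classifier/is_regressor helpers just read the _estimator_type string):

```python
from sklearn.base import is_classifier, is_regressor

# Illustrative only: if each pipeline type carried _estimator_type as a class
# attribute, sklearn could introspect it directly and partial_dependence
# wouldn't need to patch the attribute on at call time.
class SketchClassificationPipeline:
    _estimator_type = "classifier"

class SketchRegressionPipeline:
    _estimator_type = "regressor"

assert is_classifier(SketchClassificationPipeline())
assert is_regressor(SketchRegressionPipeline())
```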
evalml/pipelines/components/ensemble/stacked_ensemble_classifier.py
evalml/pipelines/components/estimators/classifiers/lightgbm_classifier.py
np.testing.assert_allclose(clf.predict(X), get_random_state(0).choice(np.unique(y), len(X)))

predictions = clf.predict(X)
assert_series_equal(pd.Series(get_random_state(0).choice(np.unique(y), len(X)), dtype="Int64"), predictions.to_series())
It's a little unclear to me what we're trying to test here.
I think we're just trying to test that the baseline classifier's predict method returns what we'd expect, here for the case where the strategy is "random" (just randomly select from the target to use as predictions). I'll break this down into two separate lines, so hopefully it'll be a bit clearer :)
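For anyone skimming, here's a standalone illustration of what the "random" strategy is expected to produce; this uses toy data, not the actual test fixture:

```python
import numpy as np
import pandas as pd

# The "random" baseline strategy: predictions are drawn uniformly from the
# unique target values, with a fixed seed so the test can reproduce the draw.
y = pd.Series([0, 1, 1, 2, 0])
rng = np.random.RandomState(0)
expected = rng.choice(np.unique(y), size=len(y))
print(expected)
```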
Giant PR! Wow this was a lot of edits.
There are some issues popping up in AutoMLSearch in the docs, here as well. I think this was caused by these changes somewhere, since the errors seem to stem from type errors thrown by ww. Would be useful to fix before merging this PR!
Also, do we want to change the tutorials in the docs to use and return Woodwork instead of pandas? (A rough sketch of what that could look like is below.)
Other than that, I left a few comments/questions, but this looks good! Hopefully I didn't miss any major errors. This PR was a doozy.
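If we do move the tutorials over, here's a rough sketch of what a cell could look like, assuming the DataTable API from the Woodwork version evalml pins at this point; the column names and logical types are made up for illustration:

```python
import pandas as pd
import woodwork as ww
from woodwork.logical_types import Categorical

# Hypothetical tutorial snippet: wrap the pandas data in a Woodwork DataTable
# before handing it to evalml, instead of passing a raw DataFrame.
df = pd.DataFrame({"age": [25, 33, 41], "plan": ["basic", "premium", "basic"]})
X = ww.DataTable(df, logical_types={"plan": Categorical})

# When a tutorial needs pandas again (e.g. for plotting), convert back.
print(X.to_dataframe().dtypes)
```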
@bchen1116 Thank you for reviewing and pointing out the docs issues!! I'll dig into them and see what's up 😁
@@ -402,45 +416,6 @@
    "AutoML will perform a search over the allowed ranges for each parameter to select models which produce optimal performance within those ranges. AutoML gets the allowed ranges for each component from the component's `hyperparameter_ranges` class attribute. Any component parameter you add an entry for in `hyperparameter_ranges` will be included in the AutoML search. If parameters are omitted, AutoML will use the default value in all pipelines. "
   ]
  },
  {
   "cell_type": "code",
I think this is accidental duplicate code, deleting 😱
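For context on the `hyperparameter_ranges` text in that cell, here's a minimal sketch of how a custom component might declare its ranges. The component and its parameters are invented for illustration, and it assumes the skopt-style spaces that I believe evalml's built-in components use:

```python
from skopt.space import Real

from evalml.pipelines.components import Transformer


class MyClipTransformer(Transformer):
    """Hypothetical component, only here to illustrate hyperparameter_ranges."""
    name = "My Clip Transformer"
    hyperparameter_ranges = {
        # AutoML samples from these ranges during the search; any parameter
        # left out of this dict keeps its default value in every pipeline.
        "lower": Real(0.0, 0.25),
        "upper": Real(0.75, 1.0),
    }

    def __init__(self, lower=0.05, upper=0.95, random_state=0, **kwargs):
        parameters = {"lower": lower, "upper": upper}
        parameters.update(kwargs)
        super().__init__(parameters=parameters, random_state=random_state)
```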
"from evalml.pipelines.components.utils import generate_component_code\n", | ||
"\n", | ||
"class MyDropNullColumns(Transformer):\n", |
No need to repeat this; I believe the only difference is the name, and there's nothing special about it that code gen needs, so deleting!
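In case a usage example helps here, a tiny sketch of how this utility could be called in the docs. It assumes generate_component_code accepts a component instance and returns the reconstruction code as a string, and uses the built-in Imputer purely for illustration:

```python
from evalml.pipelines.components import Imputer
from evalml.pipelines.components.utils import generate_component_code

# Presumably emits Python code that recreates the component with its
# current parameters; print it to inspect the generated snippet.
print(generate_component_code(Imputer(numeric_impute_strategy="median")))
```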
Closes #1406

Not too much implementation-wise, but some notes that may be useful:

- `.to_series()`/`.to_dataframe()` on a Woodwork data structure: the types inferred might not be what we want. Example: Woodwork could convert to a series w/ Int64. Converting that to numpy would create an array with an object type, which is usually not what we want 😬 (see the sketch below this list).
- `scikit_learn_wrapped_estimator` (implemented for ensemble): updating the fields now on this wrapped object. IMO kinda cleaner than before, where we had to delete these ugly attributes we added just for scikit-learn to operate, since we don't care about the wrapped object afterwards :d
- `fit_features` and `compute_final_component_features`: I refactored these using a new helper function, `_fit_transform_features_helper`.
- `_compute_predictions` for classification and time series classification pipelines.
- `random_state` removed from OutliersDataCheck: no functional change, but since we're currently not using IsolationForest anymore, this isn't necessary.
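Since the Int64 note above is easy to trip over, here's a tiny standalone sketch of the pitfall (illustrative only, not code from this PR):

```python
import pandas as pd

# A series with the nullable Int64 dtype converts to numpy as an object array
# unless a dtype is requested explicitly, which downstream estimators
# generally don't handle well.
s = pd.Series([1, 2, 3], dtype="Int64")
print(s.to_numpy().dtype)                # object
print(s.to_numpy(dtype="int64").dtype)   # int64, only because we asked for it
```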
Perf test graphs (I want to double check on this, but I'm having some issues running the tests):
The time AutoML takes increased a bit. In most cases, this was less than a second. In the most extreme case, it was a 12-second increase. I think this is expected and not too large.
Performance is... odder. I don't know how much of this is due to randomness, and I'd like to try again, but here are the results thus far:
More perf test results for the interested: