Adds recommended actions for InvalidTargetDataCheck and update _make_component_list_from_actions to address this action #1989

angela97lin · 2021-03-16T20:56:32Z

Closes #1881

This suddenly became a much bigger PR, so happy to split this up or explain more if it's too confusing 😅

- Add actions to InvalidTargetDataCheck to impute a target with missing values.
- Add a TargetImputer component that can impute a target with missing values.
- Update _make_component_list_from_actions to handle the new action code, IMPUTE_COL.
- Update _retain_custom_types_and_initalize_woodwork to handle DataColumns.
- Update InvalidTargetDataCheck to distinguish a fully null target from a target that has some nulls, using two different DataCheckMessageCodes (TARGET_IS_EMPTY_OR_FULLY_NULL vs. TARGET_HAS_NULL). An action is only added for TARGET_HAS_NULL.
  - For the fully null or empty case, we return immediately rather than letting the other checks run. I think this makes more sense than also returning other warnings (e.g. not having two unique values), since the empty or fully null target is the immediate issue.
- Cleanup: updated InvalidTargetDataCheck to return TARGET_BINARY_NOT_TWO_UNIQUE_VALUES for time series binary problems as well.
- Cleanup: updated InvalidTargetDataCheck to return TARGET_BINARY_NOT_TWO_EXAMPLES_PER_CLASS for time series multiclass problems as well.
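
To make the idea of imputing a target concrete, here is a minimal pandas-only sketch. It is not the TargetImputer added in this PR: the function name `impute_target`, its parameters, and the set of supported strategies are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def impute_target(y: pd.Series, impute_strategy: str = "most_frequent", fill_value=None) -> pd.Series:
    """Fill missing target values (hypothetical helper, not evalml's TargetImputer)."""
    # A fully null or empty target has nothing to learn a fill value from.
    if y.isnull().all():
        raise ValueError("Cannot impute a target that is empty or fully null.")

    if impute_strategy == "most_frequent":
        # mode() returns values in sorted order, so ties resolve to the smallest.
        value = y.mode(dropna=True).iloc[0]
    elif impute_strategy == "mean":
        value = y.mean()
    elif impute_strategy == "median":
        value = y.median()
    elif impute_strategy == "constant":
        if fill_value is None:
            raise ValueError("fill_value is required for the constant strategy in this sketch.")
        value = fill_value
    else:
        raise ValueError(f"Unknown impute_strategy: {impute_strategy}")

    return y.fillna(value)


y = pd.Series([np.nan, 1, 1, 2])
y_imputed = impute_target(y, impute_strategy="most_frequent")
```

The early `ValueError` mirrors the reasoning above: when every value is null, no action is recommended because there is nothing to impute from.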

ANGE TODO / to check:

TargetImputer in pipelines, then fit and score. Make sure no errors.

codecov · 2021-03-18T15:08:30Z

Codecov Report

Merging #1989 (f0cdcbd) into main (c335c4e) will decrease coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1989     +/-   ##
=========================================
- Coverage   100.0%   100.0%   -0.0%     
=========================================
  Files         282      284      +2     
  Lines       23004    23271    +267     
=========================================
+ Hits        22995    23261    +266     
- Misses          9       10      +1

Impacted Files	Coverage Δ
evalml/pipelines/components/__init__.py	`100.0% <ø> (ø)`
evalml/data_checks/data_check_action_code.py	`100.0% <100.0%> (ø)`
evalml/data_checks/data_check_message_code.py	`100.0% <100.0%> (ø)`
evalml/data_checks/invalid_targets_data_check.py	`100.0% <100.0%> (ø)`
...alml/pipelines/components/transformers/__init__.py	`100.0% <100.0%> (ø)`
...lines/components/transformers/imputers/__init__.py	`100.0% <100.0%> (ø)`
...components/transformers/imputers/target_imputer.py	`100.0% <100.0%> (ø)`
evalml/pipelines/utils.py	`100.0% <100.0%> (ø)`
evalml/tests/component_tests/test_components.py	`100.0% <100.0%> (ø)`
...valml/tests/component_tests/test_simple_imputer.py	`100.0% <100.0%> (ø)`
... and 10 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

angela97lin · 2021-03-21T05:11:33Z

evalml/tests/data_checks_tests/test_invalid_targets_data_check.py

-    messages = invalid_targets_check.validate(X, y)
-    assert messages == {
+
+    expected = {


Just cleaning up duplicate expected values :)

angela97lin · 2021-03-21T05:53:12Z

evalml/data_checks/invalid_targets_data_check.py

@@ -57,7 +59,7 @@ def validate(self, X, y):
                                                                   "code": "TARGET_HAS_NULL",\
                                                                   "details": {"num_null_rows": 2, "pct_null_rows": 50}}],\
                                                       "warnings": [],\
-                                                       "actions": []}
+                                                       "actions": [{'code': 'IMPUTE_COL', 'details': {'column': None, 'impute_strategy': 'most_frequent', 'is_target': True}}]}


I wanted a way to specify that we want to impute the target without relying on the name of the column.

Makes sense!
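
A rough sketch of how an action like the one in the diff above could be turned into a component, keyed on `is_target` rather than on a column name. This is not the body of `_make_component_list_from_actions`: the function name `plan_components_from_actions`, the `"Target Imputer"` label, and the returned (name, parameters) shape are assumptions for illustration. Only the action dictionary itself is taken from the docstring in the diff.

```python
from typing import Any, Dict, List, Tuple


def plan_components_from_actions(actions: List[Dict[str, Any]]) -> List[Tuple[str, Dict[str, Any]]]:
    """Map data check actions to (component label, parameters) pairs (hypothetical)."""
    planned = []
    for action in actions:
        details = action["details"]
        # The target is identified by the is_target flag, so 'column' may be None.
        if action["code"] == "IMPUTE_COL" and details.get("is_target"):
            planned.append(("Target Imputer", {"impute_strategy": details["impute_strategy"]}))
        # Other action codes are ignored in this sketch.
    return planned


actions = [{"code": "IMPUTE_COL",
            "details": {"column": None, "impute_strategy": "most_frequent", "is_target": True}}]
planned = plan_components_from_actions(actions)
```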

angela97lin · 2021-03-21T20:41:00Z

evalml/utils/woodwork_utils.py

@@ -69,37 +69,45 @@ def _convert_woodwork_types_wrapper(pd_data):
    return pd_data


-def _retain_custom_types_and_initalize_woodwork(old_datatable, new_dataframe, ltypes_to_ignore=None):
+def _retain_custom_types_and_initalize_woodwork(old_woodwork_data, new_pandas_data, ltypes_to_ignore=None):


Updated to handle DataColumns/Series

bchen1116

Nice! These tests are super thorough, and big fan of the cleanup! I left a few nitpicks and documentation fix comments, but nothing blocking!

Addressed all changes :)

CLAassistant · 2021-03-26T14:35:19Z

All committers have signed the CLA.

chukarsten

Indeed, this was a doozy. I think the only thing that I'd be a little iffy on is what we do when the user specifies a constant imputation of one type but the column is full of data of the other type. I don't really see anything blocking, but would like to address that! Great job!

chukarsten · 2021-03-26T20:11:54Z

evalml/data_checks/invalid_targets_data_check.py

@@ -82,18 +90,27 @@ def validate(self, X, y):
                                                    details={"unsupported_type": y.logical_type.type_string}).to_dict())
        y_df = _convert_woodwork_types_wrapper(y.to_series())
        null_rows = y_df.isnull()
-        if null_rows.any():
+        if null_rows.all():
+            results["errors"].append(DataCheckError(message="Target values are either empty or fully null.",


Nit: are "empty" and "fully null" different? If they're not I'd just go with "Target values are fully null."

Yeah, they're different in that empty refers to len(y) == 0, and fully null is len(y) != 0 but all nan values 😢
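
As a small illustration of why a single `null_rows.all()` branch covers both cases: in pandas, `all()` on an empty boolean Series is `True`, so an empty target and a fully null target take the same path. The helper name `classify_target_nulls` and the `"OK"` return value are assumptions for this sketch; the two message codes are the ones named in this PR.

```python
import numpy as np
import pandas as pd


def classify_target_nulls(y: pd.Series) -> str:
    """Return which null-related code would apply to the target (hypothetical helper)."""
    null_rows = y.isnull()
    # all() is vacuously True for an empty Series, so len(y) == 0 lands here too.
    if null_rows.all():
        return "TARGET_IS_EMPTY_OR_FULLY_NULL"
    if null_rows.any():
        return "TARGET_HAS_NULL"
    return "OK"


empty_result = classify_target_nulls(pd.Series([], dtype="float64"))
fully_null_result = classify_target_nulls(pd.Series([np.nan, np.nan]))
partial_result = classify_target_nulls(pd.Series([np.nan, 1.0]))
```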

chukarsten · 2021-03-29T15:09:28Z

evalml/tests/component_tests/test_target_imputer.py

+@pytest.mark.parametrize("fill_value, y, y_expected", [(None, pd.Series([np.nan, 0, 5]), pd.Series([0, 0, 5])),
+                                                       (None, pd.Series([np.nan, "a", "b"]), pd.Series(["missing_value", "a", "b"]).astype("category")),
+                                                       (3, pd.Series([np.nan, 0, 5]), pd.Series([3, 0, 5])),
+                                                       (3, pd.Series([np.nan, "a", "b"]), pd.Series([3, "a", "b"]).astype("category"))])


This last parametrized test case is a very interesting one. Do we want to match types? Like if the integer 3 is put in, do we want it filling with the integer 3? Or the string 3? Do we want to allow cross-type imputation? Or perhaps raise a value error.

Yeah, that's a good question! This follows the behavior in SimpleImputer / Imputer right now, but I think it's okay because the type of the series is category, which allows for mixed-type categories:

Happy to file an issue if you think this is worth a greater discussion though!
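
For readers following the discussion, here is a minimal pandas sketch of the mixed-type category point. It is not the snippet from the original comment and does not go through TargetImputer; it only shows that a categorical Series can hold an integer category alongside string categories.

```python
import numpy as np
import pandas as pd

# Same shape as the last parametrized case: string values with one missing entry.
y = pd.Series([np.nan, "a", "b"]).astype("category")

# A categorical can only be filled with a known category, so register 3 first.
y_filled = y.cat.add_categories([3]).fillna(3)

# The categories now mix str and int values.
category_values = set(y_filled.cat.categories)
```

Whether cross-type fills like this should be allowed or rejected is the open question raised above; the sketch only shows that pandas itself permits it.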

init

c9e2e66

angela97lin self-assigned this Mar 16, 2021

fix tests

3d76716

angela97lin changed the title ~~Adds recommended actions for InvalidTargetDataCheck and update _make_component_from_actions to address this action~~ Adds recommended actions for InvalidTargetDataCheck and update _make_component_list_from_actions to address this action Mar 17, 2021

angela97lin added 7 commits March 17, 2021 14:05

release notes

70299b5

add init code for target imputer

8ac18d3

Merge branch 'main' into 1881_fill_in_actions_cont

a035a41

welp

1327230

hmm testing

b53407d

Merge branch 'main' into 1881_fill_in_actions_cont

542ec07

fix some tests

8c08bc8

angela97lin added 13 commits March 18, 2021 13:11

test renaming

97e2f48

Merge branch 'main' into 1881_fill_in_actions_cont

18497fb

some updates, more tests to go

ac28999

Merge branch '1881_fill_in_actions_cont' of github.com:alteryx/evalml…

f9c04e8

… into 1881_fill_in_actions_cont

fix tests, add impute strategy

d0ed8ee

lint mclint

15ec313

Merge branch 'main' into 1881_fill_in_actions_cont

364fd95

fix tests

fbf7ead

codecov testing

f681b28

linting

5f8f2b1

clean up and fix tests

e68927f

remove unreachable

0944f08

Merge branch 'main' into 1881_fill_in_actions_cont

31145e5

angela97lin marked this pull request as ready for review March 21, 2021 04:00

angela97lin commented Mar 21, 2021

cleanup docstrings

d34e0c9

angela97lin commented Mar 21, 2021

angela97lin added 3 commits March 23, 2021 18:15

fix up tests

43a442f

merge

9890722

oops

3b9f13c

angela97lin commented Mar 24, 2021

angela97lin requested review from bchen1116 and freddyaboulton March 24, 2021 15:12

angela97lin added 3 commits March 25, 2021 00:51

Merge branch 'main' into 1881_fill_in_actions_cont

cdc320c

move release notes

da982d9

Merge branch 'main' into 1881_fill_in_actions_cont

ff4b679

bchen1116 approved these changes Mar 25, 2021

chukarsten approved these changes Mar 29, 2021

angela97lin added 9 commits March 29, 2021 15:23

merge

c1919e3

Merge branch 'main' into 1881_fill_in_actions_cont

cccb717

fixing from comments and rename from details to metadata

6068a8c

Merge branch '1881_fill_in_actions_cont' of github.com:alteryx/evalml…

1402507

… into 1881_fill_in_actions_cont

fix test and add one for X not None

65a301a

fix tests with indices

9c2ff4e

Merge branch 'main' into 1881_fill_in_actions_cont

aabdbd2

cleanup

fb82cee

codecov

4f093c9

angela97lin mentioned this pull request Mar 31, 2021

Data check actions: support TargetImputer in pipelines #2059

Closed

angela97lin added 4 commits March 31, 2021 02:31

remove from component graph and cleanup

56388f8

add another test

2d72bc6

clean up merge:

5cdb59b

Merge branch 'main' into 1881_fill_in_actions_cont

f0cdcbd

angela97lin merged commit 2f46b6a into main Mar 31, 2021

angela97lin deleted the 1881_fill_in_actions_cont branch March 31, 2021 16:30

chukarsten mentioned this pull request Apr 6, 2021

Release v0.22.0 #2109

Merged

angela97lin Mar 21, 2021

angela97lin Mar 21, 2021

freddyaboulton Mar 22, 2021

angela97lin Mar 21, 2021

bchen1116 left a comment

CLAassistant commented Mar 26, 2021 •

edited

chukarsten left a comment

chukarsten Mar 26, 2021

angela97lin Mar 29, 2021

chukarsten Mar 29, 2021

angela97lin Mar 29, 2021
