New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds recommended actions for InvalidTargetDataCheck and update _make_component_list_from_actions to address this action #1989
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1989 +/- ##
=========================================
- Coverage 100.0% 100.0% -0.0%
=========================================
Files 282 284 +2
Lines 23004 23271 +267
=========================================
+ Hits 22995 23261 +266
- Misses 9 10 +1
Continue to review full report at Codecov.
|
… into 1881_fill_in_actions_cont
messages = invalid_targets_check.validate(X, y) | ||
assert messages == { | ||
|
||
expected = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just cleaning up duplicate expected values :)
@@ -57,7 +59,7 @@ def validate(self, X, y): | |||
"code": "TARGET_HAS_NULL",\ | |||
"details": {"num_null_rows": 2, "pct_null_rows": 50}}],\ | |||
"warnings": [],\ | |||
"actions": []} | |||
"actions": [{'code': 'IMPUTE_COL', 'details': {'column': None, 'impute_strategy': 'most_frequent', 'is_target': True}}]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanted a way to specify that we want to impute the target without relying on the name of the column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense!
@@ -69,37 +69,45 @@ def _convert_woodwork_types_wrapper(pd_data): | |||
return pd_data | |||
|
|||
|
|||
def _retain_custom_types_and_initalize_woodwork(old_datatable, new_dataframe, ltypes_to_ignore=None): | |||
def _retain_custom_types_and_initalize_woodwork(old_woodwork_data, new_pandas_data, ltypes_to_ignore=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to handle DataColumns/Series
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! These tests are super thorough, and big fan of the cleanup! I left a few nitpicks and documentation fix comments, but nothing blocking!
evalml/pipelines/components/transformers/imputers/target_imputer.py
Outdated
Show resolved
Hide resolved
evalml/pipelines/components/transformers/imputers/target_imputer.py
Outdated
Show resolved
Hide resolved
evalml/pipelines/components/transformers/imputers/target_imputer.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, this was a doozy. I think the only thing that I'd be a little iffy on is what we do when the user specifies a constant imputation of one type but the column is full of data of the other type. I don't really see anything blocking, but would like to address that! Great job!
@@ -82,18 +90,27 @@ def validate(self, X, y): | |||
details={"unsupported_type": y.logical_type.type_string}).to_dict()) | |||
y_df = _convert_woodwork_types_wrapper(y.to_series()) | |||
null_rows = y_df.isnull() | |||
if null_rows.any(): | |||
if null_rows.all(): | |||
results["errors"].append(DataCheckError(message="Target values are either empty or fully null.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: are "empty" and "fully null" different? If they're not I'd just go with "Target values are fully null."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, they're different in that empty refers to len(y) == 0, and fully null is len(y) != 0 but all nan values 😢
@pytest.mark.parametrize("fill_value, y, y_expected", [(None, pd.Series([np.nan, 0, 5]), pd.Series([0, 0, 5])), | ||
(None, pd.Series([np.nan, "a", "b"]), pd.Series(["missing_value", "a", "b"]).astype("category")), | ||
(3, pd.Series([np.nan, 0, 5]), pd.Series([3, 0, 5])), | ||
(3, pd.Series([np.nan, "a", "b"]), pd.Series([3, "a", "b"]).astype("category"))]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This last parametrized test case is a very interesting one. Do we want to match types? Like if the integer 3 is put in, do we want it filling with the integer 3? Or the string 3? Do we want to allow cross-type imputation? Or perhaps raise a value error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
… into 1881_fill_in_actions_cont
Closes #1881
This suddenly became a much bigger PR, so happy to split this up or explain more if it's too confusing 😅
InvalidTargetDataCheck
to impute target with missing values.TargetImputer
component that can impute target with missing values_make_component_list_from_actions
to handle new code,IMPUTE_COL
_retain_custom_types_and_initalize_woodwork
to handle DataColumnsInvalidTargetDataCheck
to separate out when target is fully null vs has nulls with two different DataCheckMessageCodes (TARGET_HAS_NULL vs TARGET_IS_EMPTY_OR_FULLY_NULL). Only add an action when TARGET_HAS_NULLInvalidTargetDataCheck to return
TARGET_BINARY_NOT_TWO_UNIQUE_VALUES` for time series binary problems as wellInvalidTargetDataCheck
to returnTARGET_BINARY_NOT_TWO_EXAMPLES_PER_CLASS
for time series multiclass as wellANGE TODO / to check: