
Update data check message to include "code" and "details" fields #1451

Merged: 24 commits, Nov 23, 2020

Conversation

@angela97lin (Contributor) commented Nov 20, 2020

Closes #1430

Also addresses #1422 by converting the check for binary classification targets != {0, 1} to return a warning instead of an error.
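
For context, here is a rough sketch of the JSON-friendly message shape the new "code" and "details" fields aim at. Everything other than the "code" and "details" keys themselves (the other key names, the code value, the details payload) is an assumption for illustration, not evalml's actual DataCheckWarning.to_dict() output:

```python
import json

# Hypothetical message dict -- key names and values are illustrative assumptions,
# not evalml's actual to_dict() output. The idea is that each data check message
# carries a machine-readable "code" and structured "details" next to the prose.
example_warning = {
    "message": "Numerical binary classification target classes must be [0, 1], got [0, 2] instead",
    "data_check_name": "InvalidTargetDataCheck",  # assumed key name
    "level": "warning",                           # assumed key name
    "code": "TARGET_BINARY_INVALID_VALUES",       # assumed code value (stringified enum)
    "details": {"target_values": [0, 2]},         # assumed details payload
}

print(json.dumps(example_warning, indent=2))  # serializes cleanly, per the goal of #1430
```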

@angela97lin added this to the November 2020 milestone on Nov 20, 2020
@angela97lin self-assigned this on Nov 20, 2020
codecov bot commented Nov 22, 2020

Codecov Report

Merging #1451 (c197f3c) into main (2ddca58) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1451     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         222      223      +1     
  Lines       14891    14930     +39     
=========================================
+ Hits        14884    14923     +39     
  Misses          7        7             
Impacted Files Coverage Δ
evalml/data_checks/__init__.py 100.0% <100.0%> (ø)
evalml/data_checks/class_imbalance_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/high_variance_cv_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/highly_null_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/id_columns_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/no_variance_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/outliers_data_check.py 100.0% <100.0%> (ø)
... and 12 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@dsherry (Contributor) left a comment:

🚢 ! 😁

No blocking comments

 if set(unique_values) != set([0, 1]):
-    messages["errors"].append(DataCheckError("Numerical binary classification target classes must be [0, 1], got [{}] instead".format(", ".join([str(val) for val in unique_values])), self.name).to_dict())
+    messages["warnings"].append(DataCheckWarning(message="Numerical binary classification target classes must be [0, 1], got [{}] instead".format(", ".join([str(val) for val in unique_values])),
@dsherry (Contributor) commented:

I didn't think about it until now, but this check could potentially blow up if a large regression dataset is used. The message would get super long because we include all unique values. Would be a good thing to file.

@angela97lin (Contributor, Author) commented:

@dsherry Right, but that's why we have problem_type passed in for the initialization of InvalidTargetDataCheck, right? Or are you concerned that the user might accidentally pass in the wrong problem type?

@dsherry (Contributor) commented:

Yep, that's what I was thinking: if someone accidentally ran binary classification with a regression target. One quick way around this would be to show the number of uniques and the counts of the 100 most frequent uniques instead of the counts of all the uniques!

No need to add this to this PR, heh. But this would be great to file as a performance enhancement.
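
A minimal sketch of the truncation suggested here (the idea behind the follow-up that gets filed below as #1460); the helper name and the use of pandas value_counts are assumptions, not code from evalml or this PR:

```python
import pandas as pd

def summarize_target_values(y, top_n=100):
    """Summarize a target column without listing every unique value.

    Hypothetical helper: returns the number of unique values plus the counts
    of the top_n most frequent ones, so a data check message stays short even
    when a regression-style target is run through a classification check.
    """
    counts = pd.Series(y).value_counts()
    return {
        "num_unique": int(counts.size),
        "top_value_counts": counts.head(top_n).to_dict(),
    }

# e.g. a continuous target accidentally handed to a binary classification check
summary = summarize_target_values([0.1, 0.2, 0.2, 3.5, 7.8, 7.8, 7.8])
# -> {'num_unique': 4, 'top_value_counts': {7.8: 3, 0.2: 2, ...}}
```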

@angela97lin (Contributor, Author) commented:

Ah, got it! Filed #1460 :)

@freddyaboulton (Contributor) left a comment:

Thanks @angela97lin! I think this is great. My only comment is that enums are not actually JSON serializable, so we should convert them to strings in validate!

@dsherry (Contributor) commented Nov 23, 2020:

Oh shoot, that's a super great point @freddyaboulton: Python enums aren't JSON serializable. Yes, let's include the stringified version of the enum. (@angela97lin)

@angela97lin (Contributor, Author) commented:

@freddyaboulton @dsherry Ah, really good point! Will update this to include the stringified versions instead. Thanks for catching this!
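
For illustration, a minimal standalone example of the serialization issue discussed above, using a generic enum rather than evalml's DataCheckMessageCode:

```python
import json
from enum import Enum

class MessageCode(Enum):
    # Stand-in for a message-code enum such as DataCheckMessageCode
    TARGET_BINARY_INVALID_VALUES = "target_binary_invalid_values"

# A raw enum member is not JSON serializable:
try:
    json.dumps({"code": MessageCode.TARGET_BINARY_INVALID_VALUES})
except TypeError as err:
    print(err)  # Object of type MessageCode is not JSON serializable

# Including the stringified enum keeps the message dict JSON-friendly:
print(json.dumps({"code": MessageCode.TARGET_BINARY_INVALID_VALUES.name}))
# {"code": "TARGET_BINARY_INVALID_VALUES"}
```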

Labels: none yet
Projects: none yet

Development: Successfully merging this pull request may close these issues:
- Data checks: JSON-friendly message fmt, include a type enum and affected column names

3 participants