
Update data check message to include "code" and "details" fields #1451

Merged: 24 commits, Nov 23, 2020

Conversation

@angela97lin (Contributor) commented Nov 20, 2020

Closes #1430

Also addresses #1422 by converting the check for binary classification targets != {0, 1} to return a warning instead of an error.
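
For context, here is a rough sketch of the JSON-friendly message shape the new "code" and "details" fields aim at. Everything other than the "code" and "details" keys themselves (the other key names, the code value, the details payload) is an assumption for illustration, not evalml's actual DataCheckWarning.to_dict() output:

```python
import json

# Hypothetical message dict -- key names and values are illustrative assumptions,
# not evalml's actual to_dict() output. The idea is that each data check message
# carries a machine-readable "code" and structured "details" next to the prose.
example_warning = {
    "message": "Numerical binary classification target classes must be [0, 1], got [0, 2] instead",
    "data_check_name": "InvalidTargetDataCheck",  # assumed key name
    "level": "warning",                           # assumed key name
    "code": "TARGET_BINARY_INVALID_VALUES",       # assumed code value (stringified enum)
    "details": {"target_values": [0, 2]},         # assumed details payload
}

print(json.dumps(example_warning, indent=2))  # serializes cleanly, per the goal of #1430
```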

@angela97lin added this to the November 2020 milestone on Nov 20, 2020
@angela97lin self-assigned this on Nov 20, 2020
codecov bot commented Nov 22, 2020

Codecov Report

Merging #1451 (c197f3c) into main (2ddca58) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1451     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         222      223      +1     
  Lines       14891    14930     +39     
=========================================
+ Hits        14884    14923     +39     
  Misses          7        7             
Impacted Files Coverage Δ
evalml/data_checks/__init__.py 100.0% <100.0%> (ø)
evalml/data_checks/class_imbalance_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/high_variance_cv_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/highly_null_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/id_columns_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/no_variance_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/outliers_data_check.py 100.0% <100.0%> (ø)
... and 12 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@dsherry (Contributor) left a comment:

🚢 ! 😁

No blocking comments

 if set(unique_values) != set([0, 1]):
-    messages["errors"].append(DataCheckError("Numerical binary classification target classes must be [0, 1], got [{}] instead".format(", ".join([str(val) for val in unique_values])), self.name).to_dict())
+    messages["warnings"].append(DataCheckWarning(message="Numerical binary classification target classes must be [0, 1], got [{}] instead".format(", ".join([str(val) for val in unique_values])),
@dsherry (Contributor) commented:

I didn't think about it until now, but this check could potentially blow up if a large regression dataset is used. The message would get super long because we include all unique values. Would be a good thing to file.

@angela97lin (Contributor, Author) commented:

@dsherry Right, but that's why we have problem_type passed in for the initialization of InvalidTargetDataCheck, right? Or are you concerned that the user might accidentally pass in the wrong problem type?

@dsherry (Contributor) commented:

Yep, that's what I was thinking: if someone accidentally ran binary classification with a regression target. One quick way around this would be to show the number of uniques and the counts of the 100 most frequent uniques instead of the counts of all the uniques!

No need to add this to this PR, heh. But this would be great to file as a performance enhancement.
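
A minimal sketch of the truncation suggested here (the idea behind the follow-up that gets filed below as #1460); the helper name and the use of pandas value_counts are assumptions, not code from evalml or this PR:

```python
import pandas as pd

def summarize_target_values(y, top_n=100):
    """Summarize a target column without listing every unique value.

    Hypothetical helper: returns the number of unique values plus the counts
    of the top_n most frequent ones, so a data check message stays short even
    when a regression-style target is run through a classification check.
    """
    counts = pd.Series(y).value_counts()
    return {
        "num_unique": int(counts.size),
        "top_value_counts": counts.head(top_n).to_dict(),
    }

# e.g. a continuous target accidentally handed to a binary classification check
summary = summarize_target_values([0.1, 0.2, 0.2, 3.5, 7.8, 7.8, 7.8])
# -> {'num_unique': 4, 'top_value_counts': {7.8: 3, 0.2: 2, ...}}
```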

@angela97lin (Contributor, Author) commented:

Ah, got it! Filed #1460 :)

@freddyaboulton (Contributor) left a comment:

Thanks @angela97lin! I think this is great. My only comment is that enums are not actually JSON serializable, so we should convert them to strings in validate!

@dsherry (Contributor) commented Nov 23, 2020:

Oh shoot, that's a super great point @freddyaboulton: Python enums aren't JSON serializable. Yes, let's include the stringified version of the enum. (@angela97lin)

@angela97lin (Contributor, Author) commented:

@freddyaboulton @dsherry Ah, really good point! Will update this to include the stringified versions instead. Thanks for catching this!
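
For illustration, a minimal standalone example of the serialization issue discussed above, using a generic enum rather than evalml's DataCheckMessageCode:

```python
import json
from enum import Enum

class MessageCode(Enum):
    # Stand-in for a message-code enum such as DataCheckMessageCode
    TARGET_BINARY_INVALID_VALUES = "target_binary_invalid_values"

# A raw enum member is not JSON serializable:
try:
    json.dumps({"code": MessageCode.TARGET_BINARY_INVALID_VALUES})
except TypeError as err:
    print(err)  # Object of type MessageCode is not JSON serializable

# Including the stringified enum keeps the message dict JSON-friendly:
print(json.dumps({"code": MessageCode.TARGET_BINARY_INVALID_VALUES.name}))
# {"code": "TARGET_BINARY_INVALID_VALUES"}
```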

Labels: none yet
Projects: none yet

Development: Successfully merging this pull request may close these issues:
- Data checks: JSON-friendly message fmt, include a type enum and affected column names

3 participants