Update data checks to return DataCheckResults object #1444

angela97lin · 2020-11-17T02:22:09Z

validate always returns a DataCheckResults object now; no messages means returning {DataCheckMessageType.WARNING: [], DataCheckMessageType.ERROR: []}
to use DataCheckMessageType as key?
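
For readers skimming the thread, here is a minimal sketch of the empty-result shape described above. The enum is a local stand-in so the snippet runs on its own; its member values are assumptions rather than evalml's actual definitions, and `empty_results` is a hypothetical helper, not part of the library.

```python
from enum import Enum


class DataCheckMessageType(Enum):
    """Local stand-in for evalml's enum; the member values are assumed."""
    WARNING = "warning"
    ERROR = "error"


def empty_results():
    """Hypothetical helper: the shape returned when a check has nothing to report."""
    return {DataCheckMessageType.WARNING: [], DataCheckMessageType.ERROR: []}


results = empty_results()
# Callers index by enum member, which is why the enum has to be imported.
has_errors = bool(results[DataCheckMessageType.ERROR])
```

Keying by enum member rules out misspelled string keys, at the cost of an extra import wherever the results are read.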

codecov · 2020-11-17T17:45:26Z

Codecov Report

Merging #1444 (b08447c) into main (f54abd3) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1444     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         220      220             
  Lines       14672    14687     +15     
=========================================
+ Hits        14665    14680     +15     
  Misses          7        7

Impacted Files	Coverage Δ
evalml/data_checks/data_check.py	`100.0% <ø> (ø)`
evalml/data_checks/default_data_checks.py	`100.0% <ø> (ø)`
evalml/automl/automl_search.py	`99.7% <100.0%> (-<0.1%)`	⬇️
evalml/data_checks/class_imbalance_data_check.py	`100.0% <100.0%> (ø)`
evalml/data_checks/data_checks.py	`100.0% <100.0%> (ø)`
evalml/data_checks/high_variance_cv_data_check.py	`100.0% <100.0%> (ø)`
evalml/data_checks/highly_null_data_check.py	`100.0% <100.0%> (ø)`
evalml/data_checks/id_columns_data_check.py	`100.0% <100.0%> (ø)`
evalml/data_checks/invalid_targets_data_check.py	`100.0% <100.0%> (ø)`
evalml/data_checks/no_variance_data_check.py	`100.0% <100.0%> (ø)`
... and 12 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f54abd3...76aaaff.

jeremyliweishih

LGTM - very nice

freddyaboulton

@angela97lin Looks good! I left a comment about how defining a class could make things easier/cleaner in the long run.

freddyaboulton · 2020-11-17T22:00:43Z

evalml/data_checks/class_imbalance_data_check.py

        """
+        messages = {


I think the downside of using DataCheckMessageType as the keys is that the user needs to import the enum in order to look at the warnings and errors. That can be annoying, especially if the user is running automl and trying to access data_checks_results.

Two alternatives:

1. Creating a DataCheckResults class.

2. Using string values like "warning" and "error" as keys in the dict you have here.

The benefits of 1 are that:

We can expose the warnings and errors as instance properties, and the user doesn't have to import another class.

We can also have an API for checking whether there are warnings or errors, or whether the results are empty, as opposed to if self._data_check_results[DataCheckMessageType.ERROR]. The nice thing about this is that if we ever add data to the dict, is_empty wouldn't break users' code downstream, but the dict would, because they'd be checking if results == {DataCheckMessageType.WARNING: [], DataCheckMessageType.ERROR: []}.

The downside is that it's another class, but our docs already show how to get data from the data checks results.

The benefit of 2 is that it'd be a small change, but it would definitely lead to typos lol.

My vote is for 1, though I think it'd be fine to keep it as-is for now too. I saw that you left "to use DataCheckMessageType as key?" in the PR description, so wanted to offer my thoughts lol.
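
A rough sketch of what option 1 could look like, under stated assumptions: the class body, the `is_empty` property, and the sample warning text are all hypothetical illustrations of the proposal, not the implementation that was eventually merged.

```python
class DataCheckResults:
    """Hypothetical container for data check output; not evalml's actual class."""

    def __init__(self, warnings=None, errors=None):
        # Copy the inputs so callers cannot mutate the stored lists by accident.
        self._warnings = list(warnings or [])
        self._errors = list(errors or [])

    @property
    def warnings(self):
        return self._warnings

    @property
    def errors(self):
        return self._errors

    @property
    def is_empty(self):
        # Stays valid even if more fields are added to the class later.
        return not self._warnings and not self._errors


results = DataCheckResults(warnings=["Column 'id' may be an ID column"])
if not results.is_empty:
    for warning in results.warnings:
        print(warning)
```

With this shape no enum import is needed to read the results, and a misspelled attribute such as `results.erors` raises an AttributeError instead of silently missing a key.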

angela97lin · 2020-11-18

Ooo I like this suggestion a lot. I agree. The reason I left the comment about using DataCheckMessageType as key was that, as I was updating the code, I too felt the inconvenience / frustration of having to import the enum everywhere, but as @dsherry had mentioned, since we already have the enums in place we might as well use them.

That being said, I like the idea of creating a separate DataCheckResults class, and having warnings and errors as attributes 🤔 That way, the user doesn't need to directly type in the keys as strings, and any typos will result in an AttributeError instead.

bchen1116

LGTM!

dsherry

@angela97lin LGTM, 🚢 ! I suggest you merge this and then start #1430 because #1430 builds on this work.

docs/source/release_notes.rst

docs/source/user_guide/data_checks.ipynb

evalml/automl/automl_search.py

dsherry · 2020-11-18T19:54:56Z

evalml/data_checks/class_imbalance_data_check.py

        """
+        messages = {


@freddyaboulton @angela97lin sure I'm on board with having a DataCheckResults class! This would make it easy to access the errors and warnings. We could also define a to_json method which returns native python types instead of DataCheckError/DataCheckWarning instances.

Is your proposal that we do that instead of merging this PR?

UPDATE: we discussed this, further changes tracked by #1430.
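
To make the to_json idea concrete, here is a small sketch under assumptions: `DataCheckMessage` is a stand-in for DataCheckError/DataCheckWarning with assumed fields, and the output layout is hypothetical. The format actually adopted is tracked in #1430.

```python
import json


class DataCheckMessage:
    """Stand-in for DataCheckError/DataCheckWarning; the fields are assumed."""

    def __init__(self, message, data_check_name):
        self.message = message
        self.data_check_name = data_check_name


def to_json(warnings, errors):
    """Hypothetical conversion of message objects into native Python types."""

    def as_dict(msg):
        return {"message": msg.message, "data_check_name": msg.data_check_name}

    return {
        "warnings": [as_dict(w) for w in warnings],
        "errors": [as_dict(e) for e in errors],
    }


payload = to_json(
    warnings=[DataCheckMessage("Column 'id' is likely an ID column", "IDColumnsDataCheck")],
    errors=[],
)
# Only dicts, lists, and strings remain, so the standard json module can serialize it.
serialized = json.dumps(payload)
```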

evalml/data_checks/data_checks.py

angela97lin · 2020-11-19T18:59:45Z

Closing in favor of #1448

angela97lin added 9 commits November 13, 2020 12:54

init

f560617

docstrs

b6a66ac

update class imbalance

29284c1

more conversion

f50c267

docstr

86b92ad

more updates

a3e4b8f

Merge branch 'main' into 1325_data_checks_returns_dict

8031cc2

cleanup

ad8a15e

Merge branch '1325_data_checks_returns_dict' of github.com:alteryx/ev…

455dfdc

…alml into 1325_data_checks_returns_dict

angela97lin self-assigned this Nov 17, 2020

angela97lin added 7 commits November 16, 2020 22:59

fix some tests for data checks

cbd9f42

fix more tests

e4e4b78

fix more tests

aa101da

test doctest

61489de

fix doctest

82b3c3a

fix no variance data check

ac7181f

fix data checks tests

86273fe

update notebook

2b8e910

angela97lin marked this pull request as ready for review November 17, 2020 19:17

angela97lin requested review from freddyaboulton, bchen1116, dsherry, christopherbunn, eccabay and jeremyliweishih November 17, 2020 19:17

angela97lin added this to the November 2020 milestone Nov 17, 2020

jeremyliweishih approved these changes Nov 17, 2020

freddyaboulton approved these changes Nov 17, 2020

bchen1116 approved these changes Nov 18, 2020

angela97lin mentioned this pull request Nov 18, 2020

Data checks: JSON-friendly message fmt, include a type enum and affected column names #1430

Closed

Merge branch 'main' into 1325_data_checks_returns_dict

c4f12ab

dsherry approved these changes Nov 18, 2020

angela97lin added 9 commits November 18, 2020 20:25

Merge branch 'main' into 1325_data_checks_returns_dict

b08447c

update to use data check results class

677edf1

fix doctests

2a801f0

clear notebook outputs

1038edd

fix notebooks

96ceaef

fix doctests

bc49716

add equality test

a30bd71

fix docstr

aaf52ea

add to_json

76aaaff

angela97lin requested review from dsherry and freddyaboulton November 19, 2020 06:19

angela97lin changed the title ~~Update data checks to return dictionary of warnings and errors instead of list~~ Update data checks to return DataCheckResults object Nov 19, 2020

angela97lin mentioned this pull request Nov 19, 2020

Update data checks to return a dictionary #1448

Merged

angela97lin closed this Nov 19, 2020

angela97lin deleted the 1325_data_checks_returns_dict branch January 13, 2021 20:37
