Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update data checks to return DataCheckResults object #1444

Closed
wants to merge 27 commits into from

Conversation

angela97lin
Copy link
Contributor

@angela97lin angela97lin commented Nov 17, 2020

Closes #1325

  • validate always returns a DataCheckResults object now; no messages means returning {DataCheckMessageType.WARNING: [], DataCheckMessageType.ERROR: []}
  • to use DataCheckMessageType as key?

@angela97lin angela97lin self-assigned this Nov 17, 2020
@codecov
Copy link

codecov bot commented Nov 17, 2020

Codecov Report

Merging #1444 (b08447c) into main (f54abd3) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1444     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         220      220             
  Lines       14672    14687     +15     
=========================================
+ Hits        14665    14680     +15     
  Misses          7        7             
Impacted Files Coverage Δ
evalml/data_checks/data_check.py 100.0% <ø> (ø)
evalml/data_checks/default_data_checks.py 100.0% <ø> (ø)
evalml/automl/automl_search.py 99.7% <100.0%> (-<0.1%) ⬇️
evalml/data_checks/class_imbalance_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/data_checks.py 100.0% <100.0%> (ø)
evalml/data_checks/high_variance_cv_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/highly_null_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/id_columns_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/no_variance_data_check.py 100.0% <100.0%> (ø)
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f54abd3...76aaaff. Read the comment docs.

Copy link
Contributor

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - very nice

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin Looks good! I left a comment about how defining a class could make things easier/cleaner in the long run.

"""
messages = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the downside of using DataCheckMessageType as the keys is that the user needs to import the enum in order to look at the warnings and results. I think that can be annoying, especially if the user is running automl and trying to access data_checks_results:

  1. Creating a DataCheckResults class.
  2. Using string values like "warning" and "error" as keys in the dict you have here.

The benefits of 1 is thatL

  1. We can expose the warnings and errors as instance properties and the user doesn't have to import another class.
  2. We can also have an api for checking if there are warnings or errors, or it's empty as opposed to if self._data_check_results[DataCheckMessageType.ERROR]. The nice thing about this is that if we ever add data to the dict, the is_empty wouldn't break users code downstream but the dict would because they'd be checking if results == {DataCheckMessageType.WARNING: [], DataCheckMessageType.ERROR: []}
    The downside is that it's another class but our docs already show how to get data from the data checks results.

The benefit of 2 is that it'd be a small change but would definitely lead to typos lol.

My vote is for 1 I think it'd be fine to keep it as-is for now too. I saw that you left to use DataCheckMessageType as key? in the PR description so wanted to offer my thoughts lol.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooo I like this suggestion a lot. I agree, the reason why I left the comment about using DataCheckMessageType as key was because as I was updating the code, I too felt the inconvenience / frustration of having to import the enum everywhere, but as @dsherry had mentioned, since we already have the enums in place we might as well use them.

That being said, I like the idea of creating a separate DataCheckResults class, and having warnings and errors as attributes 🤔 That way, the user doesn't need to directly type in the keys as strings, and any typos will result in an AttributeError instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton @angela97lin sure I'm on board with having a DataCheckResults class! This would make it easy to access the errors and warnings. We could also define a to_json method which returns native python types instead of DataCheckError/DataCheckWarning instances.

Is your proposal that we do that instead of merging this PR?

UPDATE: we discussed this, further changes tracked by #1430.

Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin LGTM, 🚢 ! I suggest you merge this and then start #1430 because #1430 builds on this work.

docs/source/release_notes.rst Outdated Show resolved Hide resolved
docs/source/user_guide/data_checks.ipynb Show resolved Hide resolved
docs/source/user_guide/data_checks.ipynb Show resolved Hide resolved
evalml/automl/automl_search.py Outdated Show resolved Hide resolved
"""
messages = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton @angela97lin sure I'm on board with having a DataCheckResults class! This would make it easy to access the errors and warnings. We could also define a to_json method which returns native python types instead of DataCheckError/DataCheckWarning instances.

Is your proposal that we do that instead of merging this PR?

UPDATE: we discussed this, further changes tracked by #1430.

evalml/data_checks/data_checks.py Outdated Show resolved Hide resolved
@angela97lin angela97lin changed the title Update data checks to return dictionary of warnings and errors instead of list Update data checks to return DataCheckResults object Nov 19, 2020
@angela97lin
Copy link
Contributor Author

Closing in favor of #1448

@angela97lin angela97lin deleted the 1325_data_checks_returns_dict branch January 13, 2021 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data checks: make it easier to process results
5 participants