
Always shuffle data in default automl data split strategies #1265

Merged
merged 3 commits into from Oct 6, 2020

Conversation

@dsherry (Contributor) commented Oct 6, 2020

Fix #1259

@codecov bot commented Oct 6, 2020

Codecov Report

Merging #1265 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main    #1265   +/-   ##
=======================================
  Coverage   99.93%   99.93%           
=======================================
  Files         207      208    +1     
  Lines       13157    13211   +54     
=======================================
+ Hits        13149    13203   +54     
  Misses          8        8           
Impacted Files Coverage Δ
evalml/automl/automl_search.py 99.59% <100.00%> (ø)
...automl/data_splitters/training_validation_split.py 100.00% <100.00%> (ø)
evalml/tests/automl_tests/test_automl.py 100.00% <100.00%> (ø)
...sts/automl_tests/test_training_validation_split.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 95ec2db...935bb1c.

@freddyaboulton (Contributor) left a comment:
@dsherry This looks good to me!

# if shuffle is disabled, the mean value learned on each CV fold's training data will be incredibly inaccurate,
# thus yielding an R^2 well below 0.

n = 100000
Contributor:

If this test doesn't take too long it might be worth running it with n=1000 so that the KFold CV is used?

Contributor Author:

Ah shoot you're right. I'll track this down. I upped the number because I wasn't reproing it as reliably at lower values.
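The failure mode described in the test's comment can be reproduced outside evalml. Below is a minimal sketch (hypothetical, not the PR's actual test; assumes numpy and scikit-learn are available): when the target is sorted by row order, an unshuffled KFold hands each fold a contiguous block of rows far from the training mean, so even a predict-the-training-mean baseline scores an R² well below 0. Shuffling restores a near-zero baseline score.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

def mean_model_cv_r2(y, shuffle):
    """Average R^2 of a predict-the-training-mean baseline under 3-fold CV."""
    # KFold requires random_state=None when shuffle=False
    cv = KFold(n_splits=3, shuffle=shuffle, random_state=0 if shuffle else None)
    scores = []
    for train_idx, test_idx in cv.split(y):
        # constant prediction: the mean of the training fold
        pred = np.full(len(test_idx), y[train_idx].mean())
        scores.append(r2_score(y[test_idx], pred))
    return float(np.mean(scores))

y = np.arange(100, dtype=float)  # target perfectly sorted by row order

print(mean_model_cv_r2(y, shuffle=False))  # strongly negative
print(mean_model_cv_r2(y, shuffle=True))   # close to 0
```

This is why the lower-`n` repro was flaky: with shuffling on (the fix), the ordering pathology disappears regardless of dataset size.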

@@ -645,6 +645,32 @@ def generate_fake_dataset(rows):
assert automl.data_split.test_size == (automl._LARGE_DATA_PERCENT_VALIDATION)


def test_data_split_shuffle():
Contributor:
great test!

@jeremyliweishih (Contributor) left a comment:

LGTM

@@ -5,7 +5,7 @@
 class TrainingValidationSplit(BaseCrossValidator):
     """Split the training data into training and validation sets"""

-    def __init__(self, test_size=None, train_size=None, shuffle=True, stratify=None, random_state=0):
+    def __init__(self, test_size=None, train_size=None, shuffle=False, stratify=None, random_state=0):
Contributor:

Should this be shuffle=True?

Contributor Author:

@gsheni I updated this to be the same as the other sklearn-defined CV methods, which have shuffle=False by default. Then, in the automl code which calls this, I set shuffle=True. I just did it to be consistent with sklearn.
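The sklearn convention described here can be seen directly with `KFold`, whose constructor also defaults to `shuffle=False` and relies on the caller to opt in. A minimal sketch (KFold stands in for evalml's `TrainingValidationSplit` here, mirroring how the automl code passes `shuffle=True` explicitly):

```python
from sklearn.model_selection import KFold

# Library default: no shuffling, matching sklearn's other CV methods.
default_cv = KFold(n_splits=3)

# Caller (e.g. the automl search code in this PR) opts in explicitly.
automl_cv = KFold(n_splits=3, shuffle=True, random_state=0)

print(default_cv.shuffle, automl_cv.shuffle)
```

Keeping the splitter's default conservative while the automl layer opts in means the class behaves predictably for anyone using it standalone.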

Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, LGTM!

@dsherry merged commit 5ef841a into main Oct 6, 2020
@dsherry deleted the ds_1259_shuffle_data branch October 6, 2020 17:42
@dsherry mentioned this pull request Oct 29, 2020
Development

Successfully merging this pull request may close these issues.

Poor performance on diamond dataset
5 participants