
Always shuffle data in default automl data split strategies #1265

Merged
merged 3 commits into from Oct 6, 2020

Conversation

@dsherry (Contributor) commented Oct 6, 2020

Fix #1259

@codecov bot commented Oct 6, 2020

Codecov Report

Merging #1265 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main    #1265   +/-   ##
=======================================
  Coverage   99.93%   99.93%           
=======================================
  Files         207      208    +1     
  Lines       13157    13211   +54     
=======================================
+ Hits        13149    13203   +54     
  Misses          8        8           
Impacted Files Coverage Δ
evalml/automl/automl_search.py 99.59% <100.00%> (ø)
...automl/data_splitters/training_validation_split.py 100.00% <100.00%> (ø)
evalml/tests/automl_tests/test_automl.py 100.00% <100.00%> (ø)
...sts/automl_tests/test_training_validation_split.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 95ec2db...935bb1c.

@freddyaboulton (Contributor) left a comment:
@dsherry This looks good to me!

# if shuffle is disabled, the mean value learned on each CV fold's training data will be incredibly inaccurate,
# thus yielding an R^2 well below 0.

n = 100000
Contributor:

If this test doesn't take too long it might be worth running it with n=1000 so that the KFold CV is used?

Contributor Author:

Ah shoot you're right. I'll track this down. I upped the number because I wasn't reproing it as reliably at lower values.
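The failure mode described in the test's comment can be reproduced outside evalml. Below is a minimal sketch (hypothetical, not the PR's actual test; assumes numpy and scikit-learn are available): when the target is sorted by row order, an unshuffled KFold hands each fold a contiguous block of rows far from the training mean, so even a predict-the-training-mean baseline scores an R² well below 0. Shuffling restores a near-zero baseline score.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

def mean_model_cv_r2(y, shuffle):
    """Average R^2 of a predict-the-training-mean baseline under 3-fold CV."""
    # KFold requires random_state=None when shuffle=False
    cv = KFold(n_splits=3, shuffle=shuffle, random_state=0 if shuffle else None)
    scores = []
    for train_idx, test_idx in cv.split(y):
        # constant prediction: the mean of the training fold
        pred = np.full(len(test_idx), y[train_idx].mean())
        scores.append(r2_score(y[test_idx], pred))
    return float(np.mean(scores))

y = np.arange(100, dtype=float)  # target perfectly sorted by row order

print(mean_model_cv_r2(y, shuffle=False))  # strongly negative
print(mean_model_cv_r2(y, shuffle=True))   # close to 0
```

This is why the lower-`n` repro was flaky: with shuffling on (the fix), the ordering pathology disappears regardless of dataset size.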

@@ -645,6 +645,32 @@ def generate_fake_dataset(rows):
assert automl.data_split.test_size == (automl._LARGE_DATA_PERCENT_VALIDATION)


def test_data_split_shuffle():
Contributor:
great test!

@jeremyliweishih (Contributor) left a comment:

LGTM

@@ -5,7 +5,7 @@
 class TrainingValidationSplit(BaseCrossValidator):
     """Split the training data into training and validation sets"""

-    def __init__(self, test_size=None, train_size=None, shuffle=True, stratify=None, random_state=0):
+    def __init__(self, test_size=None, train_size=None, shuffle=False, stratify=None, random_state=0):
Contributor:

Should this be shuffle=True?

Contributor Author:

@gsheni I updated this to be the same as the other sklearn-defined CV methods, which have shuffle=False by default. Then, in the automl code which calls this, I set shuffle=True. I just did it to be consistent with sklearn.
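The sklearn convention described here can be seen directly with `KFold`, whose constructor also defaults to `shuffle=False` and relies on the caller to opt in. A minimal sketch (KFold stands in for evalml's `TrainingValidationSplit` here, mirroring how the automl code passes `shuffle=True` explicitly):

```python
from sklearn.model_selection import KFold

# Library default: no shuffling, matching sklearn's other CV methods.
default_cv = KFold(n_splits=3)

# Caller (e.g. the automl search code in this PR) opts in explicitly.
automl_cv = KFold(n_splits=3, shuffle=True, random_state=0)

print(default_cv.shuffle, automl_cv.shuffle)
```

Keeping the splitter's default conservative while the automl layer opts in means the class behaves predictably for anyone using it standalone.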

Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, LGTM!

@dsherry merged commit 5ef841a into main Oct 6, 2020
@dsherry deleted the ds_1259_shuffle_data branch October 6, 2020 17:42
@dsherry mentioned this pull request Oct 29, 2020
Development

Successfully merging this pull request may close these issues.

Poor performance on diamond dataset
5 participants