Poor performance on diamond dataset #1259

dsherry · 2020-10-05T16:58:30Z

Problem
Automl yields models with negative R2.

Repro
Dataset here.

import evalml
import pandas as pd
import numpy as np
df = pd.read_csv('stones_encoded_small.csv')
y = df.pop('total_sales_price')
automl = evalml.automl.AutoMLSearch(problem_type='regression')
automl.search(df, y)

Data checks will fail due to highly null / single-value columns. You can disable them with data_checks='disabled'. Or, to address them and continue:

cols_to_drop = ['culet_condition', 'fancy_color_dominant_color', 'fancy_color_intensity', 'fancy_color_overtone', 'fancy_color_secondary_color', 'fluor_color', 'image_file_url', 'diamond_id', 'currency_code', 'currency_symbol', 'has_sarineloupe']
df.drop(columns=cols_to_drop, inplace=True)
automl = evalml.automl.AutoMLSearch(problem_type='regression')
automl.search(df, y)

The results are highly similar either way: negative R2 values for all models, i.e. the models can't produce meaningful results.

Switching the metric to MSE and MAE yields similarly poor models.

Discussion
My first suspicion is that the features aren't getting the right type. When I look at the dtypes inferred by pandas, I see many are set as float64 but only have a few unique values, i.e. they should be set as categorical. I gave that a shot but it didn't seem to change the model results, so there's more to the story.


SydneyAyx · 2020-10-05T20:10:03Z

Hi Team,

I believe this is related to the input data set being sorted by the target variable, combined with the sampling method used for the 3-fold cross validation. This data set is sorted by price from lowest to highest. I suspect that the cross validation is splitting the records in order, so the splits are tied to the target variable. That would make the R2 values really low, because each model is tested against target values that fall outside the range seen in its training data. This behavior is resolved by shuffling the full data set before feeding it into the search.
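To illustrate the effect described above, here is a minimal, self-contained sketch. It does not use evalml or the diamond data: the synthetic `rows` data, the 1-nearest-neighbor predictor, and the `mean_cv_r2` helper are all assumptions made for illustration. The nearest-neighbor predictor stands in for any model that cannot extrapolate beyond the target range it was trained on.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination; negative when worse than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def predict_1nn(train, x):
    """Return the target of the training row whose feature is closest to x."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def mean_cv_r2(rows, n_splits=3):
    """Mean R2 over contiguous folds, i.e. K-fold splitting without shuffling."""
    fold_size = len(rows) // n_splits
    scores = []
    for k in range(n_splits):
        test = rows[k * fold_size:(k + 1) * fold_size]
        train = rows[:k * fold_size] + rows[(k + 1) * fold_size:]
        preds = [predict_1nn(train, x) for x, _ in test]
        scores.append(r2_score([y for _, y in test], preds))
    return sum(scores) / n_splits

# Synthetic (feature, target) rows, sorted by target like the price-sorted dataset.
rows = [(x, 2 * x + 1) for x in range(300)]
sorted_r2 = mean_cv_r2(rows)

# Same rows, shuffled once before splitting.
shuffled_rows = rows[:]
random.Random(0).shuffle(shuffled_rows)
shuffled_r2 = mean_cv_r2(shuffled_rows)

print(f"sorted input:   mean R2 = {sorted_r2:.2f}")
print(f"shuffled input: mean R2 = {shuffled_r2:.2f}")
```

On the sorted input each test fold covers a target range the training folds never saw, so the mean R2 comes out negative. After shuffling, the same model scores close to 1.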

gsheni · 2020-10-05T20:17:37Z

As @SydneyAyx mentioned, you get better R2 scores once you shuffle the dataset.

import evalml
import pandas as pd
import numpy as np
from evalml.data_checks import EmptyDataChecks

df = pd.read_csv('stones_encoded_small.csv')

# shuffles data
df = df.sample(frac=1)

y = df.pop('total_sales_price')
automl = evalml.automl.AutoMLSearch(problem_type='regression')
automl.search(df, y, data_checks=EmptyDataChecks())

dsherry · 2020-10-06T05:21:26Z

Thank you @SydneyAyx @gsheni ! Great detective work there, genius :)

Yes, confirmed. It appears our default data splitters in automl don't currently set shuffle=True.

@SydneyAyx @gsheni one workaround is to shuffle before running automl as @gsheni showed above. Another workaround is to set your own data splitter, like so:

import evalml
import pandas as pd
import numpy as np
import sklearn.model_selection
df = pd.read_csv('stones_encoded_small.csv')
y = df.pop('total_sales_price')

data_splitter = sklearn.model_selection.KFold(n_splits=3, random_state=0, shuffle=True)
automl = evalml.automl.AutoMLSearch(problem_type='regression', data_split=data_splitter)
automl.search(df, y, data_checks='disabled')

I'll get a PR up with the evalml fix.

dsherry added the bug Issues tracking problems with existing features. label Oct 5, 2020

dsherry assigned freddyaboulton Oct 5, 2020

freddyaboulton assigned dsherry and unassigned freddyaboulton Oct 5, 2020

dsherry mentioned this issue Oct 6, 2020

Always shuffle data in default automl data split strategies #1265

Merged

dsherry closed this as completed in #1265 Oct 6, 2020

dsherry mentioned this issue Dec 22, 2020

Stacked ensembler: use same CV data splitter as the rest of automl #1592

Closed
