Poor performance on diamond dataset #1259

Closed
dsherry opened this issue Oct 5, 2020 · 3 comments · Fixed by #1265

@dsherry
Contributor

dsherry commented Oct 5, 2020

Problem
Automl yields models with negative R2.

Repro
Dataset here.

import evalml
import pandas as pd
import numpy as np
df = pd.read_csv('stones_encoded_small.csv')
y = df.pop('total_sales_price')
automl = evalml.automl.AutoMLSearch(problem_type='regression')
automl.search(df, y)

Data checks will fail due to highly null / single-value columns. You can disable them with data_checks='disabled'. Or, to address them and continue:

# Highly null / single-value columns flagged by the data checks (duplicate entries removed)
cols_to_drop = ['culet_condition', 'fancy_color_dominant_color', 'fancy_color_intensity', 'fancy_color_overtone', 'fancy_color_secondary_color', 'fluor_color', 'image_file_url', 'diamond_id', 'currency_code', 'currency_symbol', 'has_sarineloupe']
df.drop(columns=cols_to_drop, inplace=True)
automl = evalml.automl.AutoMLSearch(problem_type='regression')
automl.search(df, y)
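
Rather than hard-coding the list, columns like these can be found programmatically. The following is a rough sketch in plain pandas, not evalml's actual data check logic: the helper name `find_uninformative_columns`, the 0.95 null threshold, and the toy column values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def find_uninformative_columns(df, null_threshold=0.95):
    """Return columns that are mostly null or hold at most one distinct value.

    Hypothetical helper; the threshold is an assumed value, not evalml's.
    """
    # Fraction of missing values per column
    null_fraction = df.isna().mean()
    highly_null = null_fraction[null_fraction >= null_threshold].index.tolist()

    # Number of distinct non-null values per column
    n_unique = df.nunique(dropna=True)
    single_value = n_unique[n_unique <= 1].index.tolist()

    # Merge both lists, preserving order and dropping repeats
    return list(dict.fromkeys(highly_null + single_value))


# Toy frame standing in for the real dataset
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0],
    "currency_code": ["USD"] * 4,   # single value
    "fluor_color": [np.nan] * 4,    # entirely null
})
cols = find_uninformative_columns(toy)
print(cols)
toy_clean = toy.drop(columns=cols)
```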

The results are highly similar either way: negative R2 values for all models, i.e. the models can't produce meaningful results.

Switching the metric to MSE and MAE yields similarly poor models.

Discussion
My first suspicion is that the features aren't getting the right type. When I look at the dtypes inferred by pandas, I see many are set as float64 but only have a few unique values, i.e. they should be set as categorical. I gave that a shot but it didn't seem to change the model results, so there's more to the story.
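
For reference, a minimal sketch of that dtype experiment in pandas. The column names and the cutoff `max_unique` are made up for illustration; the real dataset's columns and a sensible threshold would differ.

```python
import pandas as pd

# Toy frame: one float column with few distinct values, one truly continuous
df_toy = pd.DataFrame({
    "cut_grade": [1.0, 2.0, 1.0, 3.0, 2.0, 1.0],      # hypothetical encoded category
    "carat": [0.31, 0.52, 0.74, 1.01, 1.23, 2.05],    # hypothetical continuous feature
})

max_unique = 3  # assumed cutoff for "only a few unique values"

# Float columns whose number of distinct values is at or below the cutoff
low_cardinality = [
    col for col in df_toy.select_dtypes(include="float64").columns
    if df_toy[col].nunique() <= max_unique
]

# Cast only those columns to categorical
df_toy = df_toy.astype({col: "category" for col in low_cardinality})
print(df_toy.dtypes)
```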

@dsherry dsherry added the bug Issues tracking problems with existing features. label Oct 5, 2020
@SydneyAyx

Hi Team,

I believe this is caused by the input data set being sorted by the target variable, combined with the sampling method used for the 3-fold cross validation. This data set is sorted by price from lowest to highest. I suspect that the cross validation splits the records in order, so each split is tied to the target variable: the R2 values are very low because the models are tested against target values outside the range seen in the training data. Shuffling the full data set before feeding it into the search resolves this behavior.
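
The effect can be reproduced outside evalml. Below is a minimal sketch using scikit-learn on synthetic data (not the diamond dataset): rows are sorted by the target, then scored with an unshuffled and a shuffled 3-fold split. The decision tree is just a convenient model that cannot extrapolate beyond the target range it was trained on; the data and seeds are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: y depends linearly on one feature plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 0.5, size=300)

# Sort rows by the target, mimicking a file ordered by price
order = np.argsort(y)
X_sorted, y_sorted = X[order], y[order]

model = DecisionTreeRegressor(random_state=0)

# Unshuffled folds: each test fold covers a target range absent from training
unshuffled = cross_val_score(
    model, X_sorted, y_sorted, cv=KFold(n_splits=3), scoring="r2"
)

# Shuffled folds: every fold sees the full target range
shuffled = cross_val_score(
    model, X_sorted, y_sorted,
    cv=KFold(n_splits=3, shuffle=True, random_state=0), scoring="r2"
)

print("unshuffled R2 per fold:", unshuffled)
print("shuffled R2 per fold:  ", shuffled)
```

In this sketch the unshuffled folds average a negative R2, while the shuffled folds score close to 1 on the same rows.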

@gsheni
Contributor

gsheni commented Oct 5, 2020

  • As @SydneyAyx mentioned, you get better R2 scores once you shuffle the dataset.
import evalml
import pandas as pd
import numpy as np
from evalml.data_checks import EmptyDataChecks

df = pd.read_csv('stones_encoded_small.csv')

# shuffles data
df = df.sample(frac=1)

y = df.pop('total_sales_price')
automl = evalml.automl.AutoMLSearch(problem_type='regression')
automl.search(df, y, data_checks=EmptyDataChecks())

@dsherry
Contributor Author

dsherry commented Oct 6, 2020

Thank you @SydneyAyx @gsheni! Great detective work there :)

Yes, confirmed. It appears our default data splitters in automl don't currently set shuffle=True.

@SydneyAyx @gsheni one workaround is to shuffle before running automl as @gsheni showed above. Another workaround is to set your own data splitter, like so:

import evalml
import pandas as pd
import numpy as np
import sklearn.model_selection
df = pd.read_csv('stones_encoded_small.csv')
y = df.pop('total_sales_price')

data_splitter = sklearn.model_selection.KFold(n_splits=3, random_state=0, shuffle=True)
automl = evalml.automl.AutoMLSearch(problem_type='regression', data_split=data_splitter)
automl.search(df, y, data_checks='disabled')

I'll get a PR up with the evalml fix.
