Poor performance on diamond dataset #1259
Comments
Hi Team, I believe this is related to the input data set being sorted by the target variable, combined with the sampling method used for the 3-fold cross-validation. This data set is sorted by price from lowest to highest. I suspect the cross-validation is splitting the records in order, so the splits are tied to the target variable, meaning the R2 values are really low because each model is being tested against target variable values that were not included in its training data. This behavior is resolved by doing a shuffle on the full data set prior to feeding it into the search.
```python
import evalml
import pandas as pd
import numpy as np
from evalml.data_checks import EmptyDataChecks

df = pd.read_csv('stones_encoded_small.csv')

# shuffle the full data set before searching
df = df.sample(frac=1)

y = df.pop('total_sales_price')
automl = evalml.automl.AutoMLSearch(problem_type='regression')
automl.search(df, y, data_checks=EmptyDataChecks())
```
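The hypothesis above is easy to check directly: with an unshuffled KFold on a target-sorted dataset, every test fold is a contiguous block of rows, so its target range never overlaps the training folds' range. A minimal sketch with synthetic data (not the diamond set):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data sorted by target, mimicking the diamond set sorted by price.
y = np.arange(100)              # target values 0..99, strictly increasing
X = y.reshape(-1, 1).astype(float)

# Default KFold does NOT shuffle: each test fold is a contiguous block,
# so its target range is disjoint from the training folds' target range.
for train_idx, test_idx in KFold(n_splits=3).split(X):
    print(f"test targets: {y[test_idx].min()}..{y[test_idx].max()}")
```

Each printed range (roughly the bottom, middle, and top third of the target values) is entirely absent from the corresponding training fold, which is exactly the failure mode described above.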
Thank you @SydneyAyx @gsheni ! Great detective work there, genius :) Yes, confirmed. It appears our default data splitters in automl don't currently set `shuffle=True`. @SydneyAyx @gsheni one workaround is to shuffle before running automl, as @gsheni showed above. Another workaround is to set your own data splitter, like so:

```python
import evalml
import pandas as pd
import numpy as np
import sklearn.model_selection

df = pd.read_csv('stones_encoded_small.csv')
y = df.pop('total_sales_price')

# use a KFold splitter with shuffling enabled
data_splitter = sklearn.model_selection.KFold(n_splits=3, random_state=0, shuffle=True)
automl = evalml.automl.AutoMLSearch(problem_type='regression', data_split=data_splitter)
automl.search(df, y, data_checks='disabled')
```

I'll get a PR up with the evalml fix.
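To see the effect of shuffling on scores directly, here is a small experiment on synthetic target-sorted data (standing in for the diamond set) using a tree model, which cannot extrapolate beyond the target values it was trained on:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the diamond data: rows sorted by target,
# with one near-perfect (slightly noisy) feature.
rng = np.random.RandomState(0)
y = np.sort(rng.uniform(0, 1000, size=300))
X = y.reshape(-1, 1) + rng.normal(0, 5, size=(300, 1))

model = DecisionTreeRegressor(random_state=0)
unshuffled = cross_val_score(model, X, y, cv=KFold(n_splits=3), scoring="r2")
shuffled = cross_val_score(
    model, X, y, cv=KFold(n_splits=3, shuffle=True, random_state=0), scoring="r2"
)
print("unshuffled mean R2:", unshuffled.mean())  # strongly negative
print("shuffled mean R2:", shuffled.mean())      # close to 1
```

Even though the feature nearly determines the target, the unshuffled splits force the model to predict target ranges it never saw, reproducing the negative R2 from the original report; shuffling fixes it.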
Problem
Automl yields models with negative R2.
Repro
Dataset here.
Data checks will fail due to highly null / single-value columns. You can disable them with `data_checks='disabled'`, or address them and continue. The results are highly similar either way: negative R2 values for all models, i.e. the models can't produce meaningful results.
Switching the metric to MSE and MAE yields similarly poor models.
Discussion
My first suspicion is that the features aren't getting the right type. Looking at the dtypes inferred by pandas, many are set as `float64` but have only a few unique values, i.e. they should be set as categorical. I gave that a shot but it didn't seem to change the model results, so there's more to the story.