
Running AutoML on Iris Dataset Fails #966

Closed
SydneyAyx opened this issue Jul 23, 2020 · 3 comments
SydneyAyx commented Jul 23, 2020

Running evalml 0.11.2. It looks like the option to set data checks to False has been removed from AutoMLSearch, which was previously a work-around for this issue.
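For reference, a minimal reproducer along these lines (a sketch only: it assumes a local Iris CSV with the string "class" column left as the raw target, not necessarily the exact code that produced the trace below):

import pandas as pd
from evalml.automl import AutoMLSearch

# Load the Iris dataset; the "class" column holds string labels
# ("Setosa", "Versicolor", ...), so y ends up with object dtype.
df = pd.read_csv("iris.csv")   # assumed local copy of the Iris CSV
X = df.drop(columns=["class"])
y = df["class"]

automl = AutoMLSearch(objective="log_loss_multi", max_pipelines=5, problem_type="multiclass")
automl.search(X, y)            # fails with the TypeError below on evalml 0.11.2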


TypeError Traceback (most recent call last)
in
1 automl = AutoMLSearch(objective="log_loss_multi", max_pipelines=5, problem_type="multiclass")
2
----> 3 automl.search(X, y)

~\.conda\envs\evalml_test_1.0\lib\site-packages\evalml\automl\automl_search.py in search(self, X, y, data_checks, feature_types, raise_errors, show_iteration_plot)
316
317 data_checks = self._validate_data_checks(data_checks)
--> 318 data_check_results = data_checks.validate(X, y)
319
320 if len(data_check_results) > 0:

~\.conda\envs\evalml_test_1.0\lib\site-packages\evalml\data_checks\data_checks.py in validate(self, X, y)
33 messages = []
34 for data_check in self.data_checks:
---> 35 messages_new = data_check.validate(X, y)
36 messages.extend(messages_new)
37 return messages

~\.conda\envs\evalml_test_1.0\lib\site-packages\evalml\data_checks\label_leakage_data_check.py in validate(self, X, y)
53 if len(X.columns) == 0:
54 return []
---> 55 corrs = {label: abs(y.corr(col)) for label, col in X.iteritems() if abs(y.corr(col)) >= self.pct_corr_threshold}
56
57 highly_corr_cols = {key: value for key, value in corrs.items() if value >= self.pct_corr_threshold}

~\.conda\envs\evalml_test_1.0\lib\site-packages\evalml\data_checks\label_leakage_data_check.py in <dictcomp>(.0)
53 if len(X.columns) == 0:
54 return []
---> 55 corrs = {label: abs(y.corr(col)) for label, col in X.iteritems() if abs(y.corr(col)) >= self.pct_corr_threshold}
56
57 highly_corr_cols = {key: value for key, value in corrs.items() if value >= self.pct_corr_threshold}

~\.conda\envs\evalml_test_1.0\lib\site-packages\pandas\core\series.py in corr(self, other, method, min_periods)
2252 if method in ["pearson", "spearman", "kendall"] or callable(method):
2253 return nanops.nancorr(
-> 2254 this.values, other.values, method=method, min_periods=min_periods
2255 )
2256

~\.conda\envs\evalml_test_1.0\lib\site-packages\pandas\core\nanops.py in _f(*args, **kwargs)
67 try:
68 with np.errstate(invalid="ignore"):
---> 69 return f(*args, **kwargs)
70 except ValueError as e:
71 # we want to transform an object array

~\.conda\envs\evalml_test_1.0\lib\site-packages\pandas\core\nanops.py in nancorr(a, b, method, min_periods)
1238
1239 f = get_corr_func(method)
-> 1240 return f(a, b)
1241
1242

~\.conda\envs\evalml_test_1.0\lib\site-packages\pandas\core\nanops.py in _pearson(a, b)
1254
1255 def _pearson(a, b):
-> 1256 return np.corrcoef(a, b)[0, 1]
1257
1258 def _kendall(a, b):

<__array_function__ internals> in corrcoef(*args, **kwargs)

~\.conda\envs\evalml_test_1.0\lib\site-packages\numpy\lib\function_base.py in corrcoef(x, y, rowvar, bias, ddof)
2524 warnings.warn('bias and ddof have no effect and are deprecated',
2525 DeprecationWarning, stacklevel=3)
-> 2526 c = cov(x, y, rowvar)
2527 try:
2528 d = diag(c)

<__array_function__ internals> in cov(*args, **kwargs)

~\.conda\envs\evalml_test_1.0\lib\site-packages\numpy\lib\function_base.py in cov(m, y, rowvar, bias, ddof, fweights, aweights)
2429 w *= aweights
2430
-> 2431 avg, w_sum = average(X, axis=1, weights=w, returned=True)
2432 w_sum = w_sum[0]
2433

<__array_function__ internals> in average(*args, **kwargs)

~\.conda\envs\evalml_test_1.0\lib\site-packages\numpy\lib\function_base.py in average(a, axis, weights, returned)
391
392 if weights is None:
--> 393 avg = a.mean(axis)
394 scl = avg.dtype.type(a.size/avg.size)
395 else:

~\.conda\envs\evalml_test_1.0\lib\site-packages\numpy\core\_methods.py in _mean(a, axis, dtype, out, keepdims)
152 if isinstance(ret, mu.ndarray):
153 ret = um.true_divide(
--> 154 ret, rcount, out=ret, casting='unsafe', subok=False)
155 if is_float16_result and out is None:
156 ret = arr.dtype.type(ret)

TypeError: unsupported operand type(s) for /: 'str' and 'int'
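The failing frame is the label leakage check calling y.corr(col) with the raw string target. Stripped of evalml, the same failure mode can be seen directly in pandas (the exact exception text depends on the pandas version; on the version in the trace above it ends in the same "unsupported operand type(s) for /" error):

import pandas as pd

y = pd.Series(["Setosa", "Versicolor", "Virginica"])  # object dtype, like the Iris "class" column
col = pd.Series([5.1, 7.0, 6.3])                      # a numeric feature column

# pandas routes this through np.corrcoef -> np.cov -> mean of an object
# array of strings, which is where the division by an int blows up.
y.corr(col)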

In the other environment (the tool), it does something slightly different: the search executes instead of failing with a stack trace, but all scores for all pipelines are nan.

Optimizing for Log Loss Multiclass.
Lower score is better.

Searching up to 4 pipelines.
Allowed model families: random_forest, xgboost, linear_model, catboost

(1/4) Mode Baseline Multiclass Classificati... Elapsed:00:00
Starting cross validation
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Finished cross validation - mean Log Loss Multiclass: nan
(2/4) CatBoost Classifier w/ Simple Imputer Elapsed:00:00
Starting cross validation
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Finished cross validation - mean Log Loss Multiclass: nan
(3/4) XGBoost Classifier w/ Simple Imputer Elapsed:00:02
Starting cross validation
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Finished cross validation - mean Log Loss Multiclass: nan
(4/4) Random Forest Classifier w/ Simple Im... Elapsed:00:02
Starting cross validation
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Finished cross validation - mean Log Loss Multiclass: nan

Search finished after 00:02
Best pipeline: Mode Baseline Multiclass Classification Pipeline
Best pipeline Log Loss Multiclass: nan
ToolId 3: AutoML tool done
Finished in 14.397 seconds
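For what it's worth, the repeated scoring error is the standard numpy behaviour when np.isnan is handed an object array of strings, which is consistent with the raw string labels reaching a numeric check somewhere during scoring (this is only an illustration of the numpy behaviour, not a claim about the exact evalml call site):

import numpy as np

labels = np.array(["Setosa", "Versicolor", "Virginica"], dtype=object)

# Raises: TypeError: ufunc 'isnan' not supported for the input types, and the
# inputs could not be safely coerced to any supported types according to the
# casting rule ''safe''
np.isnan(labels)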

The pandas data types are the same in both environments.

sepal.length float64
sepal.width float64
petal.length float64
petal.width float64
class object
dtype: object

The Jupyter notebook is using Python 3.7.3 and the tool is using Python 3.6.8.

SydneyAyx added the bug label on Jul 23, 2020
dsherry commented Jul 23, 2020

@SydneyAyx : yep, we changed the mechanism for disabling data checks in 0.11.2:

automl.search(..., data_checks=None, ...)
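Spelled out, the work-around looks something like this (a sketch using the same AutoMLSearch arguments as in your traceback; X and y are your existing feature matrix and target):

from evalml.automl import AutoMLSearch

automl = AutoMLSearch(objective="log_loss_multi", max_pipelines=5, problem_type="multiclass")

# Passing data_checks=None disables the data checks, including the label
# leakage check that was raising the TypeError on the string target.
automl.search(X, y, data_checks=None)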

Making a note that we should add that to the user guide section.

Please give that a shot, and if that still doesn't fix your issue, let's talk again.

If that does fix the issue: I remember #828 was previously filed to track this, and we closed it in favor of #645, which is currently in progress. However, I'm not sure #645 will actually fix the underlying problem, so let's keep this issue open.

dsherry commented Jul 23, 2020

Ah, I got confused about the timeline: #932 was merged last week and fixes this issue! I just ran the reproducer I wrote in #828 to confirm this. The next release (0.12.0, next Tues) will include the fix.

I'll keep this open and close it when we put that release out.

dsherry added this to the July 2020 milestone on Jul 23, 2020
dsherry assigned and then unassigned dsherry on Jul 23, 2020
dsherry commented Aug 3, 2020

Fixed in v0.12.0 which just went out!

dsherry closed this as completed on Aug 3, 2020