
Running AutoML on Iris Dataset Fails #966

Closed
SydneyAyx opened this issue Jul 23, 2020 · 3 comments
SydneyAyx commented Jul 23, 2020

Running evalml 0.11.2. It looks like the option to set data checks to False has been removed from AutoMLSearch, which was previously a work-around for this issue.
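For reference, a minimal reproducer along these lines (a sketch only: it assumes a local Iris CSV with the string "class" column left as the raw target, not necessarily the exact code that produced the trace below):

import pandas as pd
from evalml.automl import AutoMLSearch

# Load the Iris dataset; the "class" column holds string labels
# ("Setosa", "Versicolor", ...), so y ends up with object dtype.
df = pd.read_csv("iris.csv")   # assumed local copy of the Iris CSV
X = df.drop(columns=["class"])
y = df["class"]

automl = AutoMLSearch(objective="log_loss_multi", max_pipelines=5, problem_type="multiclass")
automl.search(X, y)            # fails with the TypeError below on evalml 0.11.2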


TypeError Traceback (most recent call last)
in
1 automl = AutoMLSearch(objective="log_loss_multi", max_pipelines=5, problem_type="multiclass")
2
----> 3 automl.search(X, y)

~\.conda\envs\evalml_test_1.0\lib\site-packages\evalml\automl\automl_search.py in search(self, X, y, data_checks, feature_types, raise_errors, show_iteration_plot)
316
317 data_checks = self._validate_data_checks(data_checks)
--> 318 data_check_results = data_checks.validate(X, y)
319
320 if len(data_check_results) > 0:

~\.conda\envs\evalml_test_1.0\lib\site-packages\evalml\data_checks\data_checks.py in validate(self, X, y)
33 messages = []
34 for data_check in self.data_checks:
---> 35 messages_new = data_check.validate(X, y)
36 messages.extend(messages_new)
37 return messages

~\.conda\envs\evalml_test_1.0\lib\site-packages\evalml\data_checks\label_leakage_data_check.py in validate(self, X, y)
53 if len(X.columns) == 0:
54 return []
---> 55 corrs = {label: abs(y.corr(col)) for label, col in X.iteritems() if abs(y.corr(col)) >= self.pct_corr_threshold}
56
57 highly_corr_cols = {key: value for key, value in corrs.items() if value >= self.pct_corr_threshold}

~\.conda\envs\evalml_test_1.0\lib\site-packages\evalml\data_checks\label_leakage_data_check.py in <dictcomp>(.0)
53 if len(X.columns) == 0:
54 return []
---> 55 corrs = {label: abs(y.corr(col)) for label, col in X.iteritems() if abs(y.corr(col)) >= self.pct_corr_threshold}
56
57 highly_corr_cols = {key: value for key, value in corrs.items() if value >= self.pct_corr_threshold}

~\.conda\envs\evalml_test_1.0\lib\site-packages\pandas\core\series.py in corr(self, other, method, min_periods)
2252 if method in ["pearson", "spearman", "kendall"] or callable(method):
2253 return nanops.nancorr(
-> 2254 this.values, other.values, method=method, min_periods=min_periods
2255 )
2256

~\.conda\envs\evalml_test_1.0\lib\site-packages\pandas\core\nanops.py in _f(*args, **kwargs)
67 try:
68 with np.errstate(invalid="ignore"):
---> 69 return f(*args, **kwargs)
70 except ValueError as e:
71 # we want to transform an object array

~\.conda\envs\evalml_test_1.0\lib\site-packages\pandas\core\nanops.py in nancorr(a, b, method, min_periods)
1238
1239 f = get_corr_func(method)
-> 1240 return f(a, b)
1241
1242

~\.conda\envs\evalml_test_1.0\lib\site-packages\pandas\core\nanops.py in _pearson(a, b)
1254
1255 def _pearson(a, b):
-> 1256 return np.corrcoef(a, b)[0, 1]
1257
1258 def _kendall(a, b):

<__array_function__ internals> in corrcoef(*args, **kwargs)

~\.conda\envs\evalml_test_1.0\lib\site-packages\numpy\lib\function_base.py in corrcoef(x, y, rowvar, bias, ddof)
2524 warnings.warn('bias and ddof have no effect and are deprecated',
2525 DeprecationWarning, stacklevel=3)
-> 2526 c = cov(x, y, rowvar)
2527 try:
2528 d = diag(c)

<__array_function__ internals> in cov(*args, **kwargs)

~\.conda\envs\evalml_test_1.0\lib\site-packages\numpy\lib\function_base.py in cov(m, y, rowvar, bias, ddof, fweights, aweights)
2429 w *= aweights
2430
-> 2431 avg, w_sum = average(X, axis=1, weights=w, returned=True)
2432 w_sum = w_sum[0]
2433

<__array_function__ internals> in average(*args, **kwargs)

~\.conda\envs\evalml_test_1.0\lib\site-packages\numpy\lib\function_base.py in average(a, axis, weights, returned)
391
392 if weights is None:
--> 393 avg = a.mean(axis)
394 scl = avg.dtype.type(a.size/avg.size)
395 else:

~\.conda\envs\evalml_test_1.0\lib\site-packages\numpy\core\_methods.py in _mean(a, axis, dtype, out, keepdims)
152 if isinstance(ret, mu.ndarray):
153 ret = um.true_divide(
--> 154 ret, rcount, out=ret, casting='unsafe', subok=False)
155 if is_float16_result and out is None:
156 ret = arr.dtype.type(ret)

TypeError: unsupported operand type(s) for /: 'str' and 'int'
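The failing frame is the label leakage check calling y.corr(col) with the raw string target. Stripped of evalml, the same failure mode can be seen directly in pandas (the exact exception text depends on the pandas version; on the version in the trace above it ends in the same "unsupported operand type(s) for /" error):

import pandas as pd

y = pd.Series(["Setosa", "Versicolor", "Virginica"])  # object dtype, like the Iris "class" column
col = pd.Series([5.1, 7.0, 6.3])                      # a numeric feature column

# pandas routes this through np.corrcoef -> np.cov -> mean of an object
# array of strings, which is where the division by an int blows up.
y.corr(col)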

In the other environment (the tool), it does something slightly different: the search executes instead of failing with a stack trace, but all scores for all pipelines are nan.

Optimizing for Log Loss Multiclass.
Lower score is better.

Searching up to 4 pipelines.
Allowed model families: random_forest, xgboost, linear_model, catboost

(1/4) Mode Baseline Multiclass Classificati... Elapsed:00:00
Starting cross validation
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Finished cross validation - mean Log Loss Multiclass: nan
(2/4) CatBoost Classifier w/ Simple Imputer Elapsed:00:00
Starting cross validation
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Finished cross validation - mean Log Loss Multiclass: nan
(3/4) XGBoost Classifier w/ Simple Imputer Elapsed:00:02
Starting cross validation
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Finished cross validation - mean Log Loss Multiclass: nan
(4/4) Random Forest Classifier w/ Simple Im... Elapsed:00:02
Starting cross validation
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Finished cross validation - mean Log Loss Multiclass: nan

Search finished after 00:02
Best pipeline: Mode Baseline Multiclass Classification Pipeline
Best pipeline Log Loss Multiclass: nan
ToolId 3: AutoML tool done
Finished in 14.397 seconds
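For what it's worth, the repeated scoring error is the standard numpy behaviour when np.isnan is handed an object array of strings, which is consistent with the raw string labels reaching a numeric check somewhere during scoring (this is only an illustration of the numpy behaviour, not a claim about the exact evalml call site):

import numpy as np

labels = np.array(["Setosa", "Versicolor", "Virginica"], dtype=object)

# Raises: TypeError: ufunc 'isnan' not supported for the input types, and the
# inputs could not be safely coerced to any supported types according to the
# casting rule ''safe''
np.isnan(labels)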

The pandas data types are the same in both environments.

sepal.length float64
sepal.width float64
petal.length float64
petal.width float64
class object
dtype: object

The Jupyter notebook is using Python 3.7.3 and the tool is using Python 3.6.8.

SydneyAyx added the bug label on Jul 23, 2020
dsherry commented Jul 23, 2020

@SydneyAyx : yep, we changed the mechanism for disabling data checks in 0.11.2:

automl.search(..., data_checks=None, ...)
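Spelled out, the work-around looks something like this (a sketch using the same AutoMLSearch arguments as in your traceback; X and y are your existing feature matrix and target):

from evalml.automl import AutoMLSearch

automl = AutoMLSearch(objective="log_loss_multi", max_pipelines=5, problem_type="multiclass")

# Passing data_checks=None disables the data checks, including the label
# leakage check that was raising the TypeError on the string target.
automl.search(X, y, data_checks=None)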

Making a note that we should add that to the user guide section.

Please give that a shot, and if that still doesn't fix your issue, let's talk again.

If that does fix the issue: I remember #828 was previously filed to track this, and we closed it in favor of #645, which is currently in progress. However, I'm not sure #645 will actually fix the underlying problem, so let's keep this issue open.

dsherry commented Jul 23, 2020

Ah, I got confused about the timeline: #932 was merged last week and fixes this issue! I just ran the reproducer I wrote in #828 to confirm this. The next release (0.12.0, next Tues) will include the fix.

I'll keep this open and close it when we put that release out.

dsherry added this to the July 2020 milestone on Jul 23, 2020
dsherry assigned and then unassigned dsherry on Jul 23, 2020
dsherry commented Aug 3, 2020

Fixed in v0.12.0 which just went out!

dsherry closed this as completed on Aug 3, 2020