
Imputer cannot fit when there is None in a categorical or boolean column #1075

Closed
freddyaboulton opened this issue Aug 19, 2020 · 3 comments · Fixed by #1144

@freddyaboulton
Contributor

Reproducer

import pandas as pd
from evalml.pipelines.components import Imputer

# Categorical (string) column containing None
df = pd.DataFrame({"a": [1, 2, 3], "b": ["1", "2", None]})
imputer = Imputer()
imputer.fit(df)

# Boolean column containing None
df_with_bool = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, None]})
imputer = Imputer()
imputer.fit(df_with_bool)

Both have the same stacktrace:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-69-9af4cfc17aec> in <module>
      1 df_with_bool = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, None]})
      2 imputer = Imputer()
----> 3 imputer.fit(df_with_bool)

~/sources/evalml/evalml/utils/base_meta.py in _set_fit(self, X, y)
     12         @wraps(method)
     13         def _set_fit(self, X, y=None):
---> 14             return_value = method(self, X, y)
     15             self._is_fitted = True
     16             return return_value

~/sources/evalml/evalml/pipelines/components/transformers/imputers/imputer.py in fit(self, X, y)
     76         X_categorical = X_null_dropped.select_dtypes(include=categorical_dtypes + boolean)
     77         if len(X_categorical.columns) > 0:
---> 78             self._categorical_imputer.fit(X_categorical, y)
     79             self._categorical_cols = X_categorical.columns
     80         return self

~/sources/evalml/evalml/utils/base_meta.py in _set_fit(self, X, y)
     12         @wraps(method)
     13         def _set_fit(self, X, y=None):
---> 14             return_value = method(self, X, y)
     15             self._is_fitted = True
     16             return return_value

~/sources/evalml/evalml/pipelines/components/transformers/imputers/simple_imputer.py in fit(self, X, y)
     42         if not isinstance(X, pd.DataFrame):
     43             X = pd.DataFrame(X)
---> 44         self._component_obj.fit(X, y)
     45         self._all_null_cols = set(X.columns) - set(X.dropna(axis=1, how='all').columns)
     46         return self

~/miniconda3/envs/evalml/lib/python3.8/site-packages/sklearn/impute/_base.py in fit(self, X, y)
    300                                                     fill_value)
    301         else:
--> 302             self.statistics_ = self._dense_fit(X,
    303                                                self.strategy,
    304                                                self.missing_values,

~/miniconda3/envs/evalml/lib/python3.8/site-packages/sklearn/impute/_base.py in _dense_fit(self, X, strategy, missing_values, fill_value)
    384                 row_mask = np.logical_not(row_mask).astype(np.bool)
    385                 row = row[row_mask]
--> 386                 most_frequent[i] = _most_frequent(row, np.nan, 0)
    387 
    388             return most_frequent

~/miniconda3/envs/evalml/lib/python3.8/site-packages/sklearn/impute/_base.py in _most_frequent(array, extra_value, n_repeat)
     40             # has already been NaN-masked.
     41             warnings.simplefilter("ignore", RuntimeWarning)
---> 42             mode = stats.mode(array)
     43 
     44         most_frequent_value = mode[0][0]

~/miniconda3/envs/evalml/lib/python3.8/site-packages/scipy/stats/stats.py in mode(a, axis, nan_policy)
    498     counts = np.zeros(a_view.shape[:-1], dtype=np.int)
    499     for ind in inds:
--> 500         modes[ind], counts[ind] = _mode1D(a_view[ind])
    501     newshape = list(a.shape)
    502     newshape[axis] = 1

~/miniconda3/envs/evalml/lib/python3.8/site-packages/scipy/stats/stats.py in _mode1D(a)
    485 
    486     def _mode1D(a):
--> 487         vals, cnts = np.unique(a, return_counts=True)
    488         return vals[cnts.argmax()], cnts.max()
    489 

<__array_function__ internals> in unique(*args, **kwargs)

~/miniconda3/envs/evalml/lib/python3.8/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
    259     ar = np.asanyarray(ar)
    260     if axis is None:
--> 261         ret = _unique1d(ar, return_index, return_inverse, return_counts)
    262         return _unpack_tuple(ret)
    263 

~/miniconda3/envs/evalml/lib/python3.8/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
    320         aux = ar[perm]
    321     else:
--> 322         ar.sort()
    323         aux = ar
    324     mask = np.empty(aux.shape, dtype=np.bool_)

TypeError: '<' not supported between instances of 'NoneType' and 'bool'

This works when np.nan is used instead of None.
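
For contrast, a minimal check (using the same Imputer API as the reproducer above) that fits cleanly when np.nan is the missing marker:

import numpy as np
import pandas as pd
from evalml.pipelines.components import Imputer

# Same boolean column, but with np.nan instead of None
df_with_nan = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, np.nan]})
imputer = Imputer()
imputer.fit(df_with_nan)  # fits without the TypeError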

@freddyaboulton freddyaboulton added the bug Issues tracking problems with existing features. label Aug 19, 2020
@dsherry dsherry added this to the September 2020 milestone Aug 27, 2020
@dsherry
Contributor

dsherry commented Aug 27, 2020

@freddyaboulton thanks for the clear reproducer! It appears this explains another bug #1092 as well.

Problem
If any feature in the pandas dataframe has object type and contains a None value, our Imputer fails.

  1. X = pd.DataFrame({'feature1': [False, True, None, np.nan]}) creates a feature with object type. Imputer.fit fails.
  2. X = pd.DataFrame({'feature1': [False, True, np.nan]}) creates a feature with object type. Imputer.fit works.
  3. X = pd.DataFrame({'feature1': [False, True]}) creates a feature with bool type. Imputer.fit works.

The same is true for category type. A similar situation happens for string features, although the third case doesn't apply: a string column gets object type even when it has no missing values.
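
To make the dtype behavior concrete, a small sketch (the Imputer outcomes noted in the comments are the ones described above, not checked by this snippet):

import numpy as np
import pandas as pd

# Case 1: None alongside bools -> object dtype; Imputer.fit fails
print(pd.DataFrame({'feature1': [False, True, None, np.nan]}).dtypes)

# Case 2: np.nan as the only missing marker -> still object dtype; Imputer.fit works
print(pd.DataFrame({'feature1': [False, True, np.nan]}).dtypes)

# Case 3: no missing values -> bool dtype; Imputer.fit works
print(pd.DataFrame({'feature1': [False, True]}).dtypes)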

Notes
The confusing thing here is that None can mean different things. It could be the same as nan, or it could be intended as its own category.

I think it's fine to treat it as nan, as long as we document and explain that convention.

Workaround
Clean None out of bool/category/string features: df = df.fillna(value=np.nan)
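
Applied to the failing reproducer, a minimal sketch of that workaround:

import numpy as np
import pandas as pd
from evalml.pipelines.components import Imputer

df_with_bool = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, None]})
df_with_bool = df_with_bool.fillna(value=np.nan)  # replaces None with np.nan in the object column
imputer = Imputer()
imputer.fit(df_with_bool)  # no TypeError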

Fix
Short-term:

  • Update Imputer to replace None with np.nan (a rough sketch is below).
  • Update the Imputer API doc and automl user guide to mention this.
  • Add test coverage of Imputer with None in the data, for all intended datatypes.

We could instead add a DataCheck which errors if there are Nones in the data. But this feels unnecessary since Nones can be easily converted.
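
For illustration, a rough sketch of the first bullet; this is a hypothetical helper, not the actual evalml change, which would live inside Imputer.fit/transform:

import numpy as np
import pandas as pd

def _replace_none_with_nan(X):
    # Hypothetical helper: normalize None to np.nan up front so the downstream
    # sklearn/scipy mode computation never has to compare None with other values.
    if not isinstance(X, pd.DataFrame):
        X = pd.DataFrame(X)
    return X.fillna(value=np.nan)

Imputer.fit and Imputer.transform could call something like this on X before splitting out the numeric and categorical columns.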

Long-term:
Once we update evalml to use the new DataTable data structure, users will be able to configure the type of each feature ahead of time. I hope that standardization will make these sorts of errors irrelevant.

@angela97lin
Contributor

Is this related to #540?

@dsherry
Contributor

dsherry commented Aug 27, 2020

@angela97lin 🤦 100% related... in fact it's a dup. Haha. We even decided there to have the imputer convert Nones to np.nans.

Closing #540 in favor of this because the writeups here are more up-to-date.

Thank you!
