
Imputer cannot fit when there is None in a categorical or boolean column #1075

Closed
freddyaboulton opened this issue Aug 19, 2020 · 3 comments · Fixed by #1144

@freddyaboulton
Contributor

Reproducer

import pandas as pd
from evalml.pipelines.components import Imputer

# Categorical (string) column containing None
df = pd.DataFrame({"a": [1, 2, 3], "b": ["1", "2", None]})
imputer = Imputer()
imputer.fit(df)

# Boolean column containing None
df_with_bool = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, None]})
imputer = Imputer()
imputer.fit(df_with_bool)

Both have the same stacktrace:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-69-9af4cfc17aec> in <module>
      1 df_with_bool = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, None]})
      2 imputer = Imputer()
----> 3 imputer.fit(df_with_bool)

~/sources/evalml/evalml/utils/base_meta.py in _set_fit(self, X, y)
     12         @wraps(method)
     13         def _set_fit(self, X, y=None):
---> 14             return_value = method(self, X, y)
     15             self._is_fitted = True
     16             return return_value

~/sources/evalml/evalml/pipelines/components/transformers/imputers/imputer.py in fit(self, X, y)
     76         X_categorical = X_null_dropped.select_dtypes(include=categorical_dtypes + boolean)
     77         if len(X_categorical.columns) > 0:
---> 78             self._categorical_imputer.fit(X_categorical, y)
     79             self._categorical_cols = X_categorical.columns
     80         return self

~/sources/evalml/evalml/utils/base_meta.py in _set_fit(self, X, y)
     12         @wraps(method)
     13         def _set_fit(self, X, y=None):
---> 14             return_value = method(self, X, y)
     15             self._is_fitted = True
     16             return return_value

~/sources/evalml/evalml/pipelines/components/transformers/imputers/simple_imputer.py in fit(self, X, y)
     42         if not isinstance(X, pd.DataFrame):
     43             X = pd.DataFrame(X)
---> 44         self._component_obj.fit(X, y)
     45         self._all_null_cols = set(X.columns) - set(X.dropna(axis=1, how='all').columns)
     46         return self

~/miniconda3/envs/evalml/lib/python3.8/site-packages/sklearn/impute/_base.py in fit(self, X, y)
    300                                                     fill_value)
    301         else:
--> 302             self.statistics_ = self._dense_fit(X,
    303                                                self.strategy,
    304                                                self.missing_values,

~/miniconda3/envs/evalml/lib/python3.8/site-packages/sklearn/impute/_base.py in _dense_fit(self, X, strategy, missing_values, fill_value)
    384                 row_mask = np.logical_not(row_mask).astype(np.bool)
    385                 row = row[row_mask]
--> 386                 most_frequent[i] = _most_frequent(row, np.nan, 0)
    387 
    388             return most_frequent

~/miniconda3/envs/evalml/lib/python3.8/site-packages/sklearn/impute/_base.py in _most_frequent(array, extra_value, n_repeat)
     40             # has already been NaN-masked.
     41             warnings.simplefilter("ignore", RuntimeWarning)
---> 42             mode = stats.mode(array)
     43 
     44         most_frequent_value = mode[0][0]

~/miniconda3/envs/evalml/lib/python3.8/site-packages/scipy/stats/stats.py in mode(a, axis, nan_policy)
    498     counts = np.zeros(a_view.shape[:-1], dtype=np.int)
    499     for ind in inds:
--> 500         modes[ind], counts[ind] = _mode1D(a_view[ind])
    501     newshape = list(a.shape)
    502     newshape[axis] = 1

~/miniconda3/envs/evalml/lib/python3.8/site-packages/scipy/stats/stats.py in _mode1D(a)
    485 
    486     def _mode1D(a):
--> 487         vals, cnts = np.unique(a, return_counts=True)
    488         return vals[cnts.argmax()], cnts.max()
    489 

<__array_function__ internals> in unique(*args, **kwargs)

~/miniconda3/envs/evalml/lib/python3.8/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
    259     ar = np.asanyarray(ar)
    260     if axis is None:
--> 261         ret = _unique1d(ar, return_index, return_inverse, return_counts)
    262         return _unpack_tuple(ret)
    263 

~/miniconda3/envs/evalml/lib/python3.8/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
    320         aux = ar[perm]
    321     else:
--> 322         ar.sort()
    323         aux = ar
    324     mask = np.empty(aux.shape, dtype=np.bool_)

TypeError: '<' not supported between instances of 'NoneType' and 'bool'

This works when np.nan is used instead of None.
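
For contrast, a minimal check (using the same Imputer API as the reproducer above) that fits cleanly when np.nan is the missing marker:

import numpy as np
import pandas as pd
from evalml.pipelines.components import Imputer

# Same boolean column, but with np.nan instead of None
df_with_nan = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, np.nan]})
imputer = Imputer()
imputer.fit(df_with_nan)  # fits without the TypeError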

@freddyaboulton freddyaboulton added the bug Issues tracking problems with existing features. label Aug 19, 2020
@dsherry dsherry added this to the September 2020 milestone Aug 27, 2020
@dsherry
Contributor

dsherry commented Aug 27, 2020

@freddyaboulton thanks for the clear reproducer! It appears this explains another bug #1092 as well.

Problem
If any feature in the pandas dataframe has object type and contains a None value, our Imputer fails.

  1. X = pd.DataFrame({'feature1': [False, True, None, np.nan]}) creates a feature with object type. Imputer.fit fails.
  2. X = pd.DataFrame({'feature1': [False, True, np.nan]}) creates a feature with object type. Imputer.fit works.
  3. X = pd.DataFrame({'feature1': [False, True]}) creates a feature with bool type. Imputer.fit works.

The same is true for category type. A similar situation happens for string features, although the third case doesn't apply: a string column gets object type even when it has no missing values.
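
To make the dtype behavior concrete, a small sketch (the Imputer outcomes noted in the comments are the ones described above, not checked by this snippet):

import numpy as np
import pandas as pd

# Case 1: None alongside bools -> object dtype; Imputer.fit fails
print(pd.DataFrame({'feature1': [False, True, None, np.nan]}).dtypes)

# Case 2: np.nan as the only missing marker -> still object dtype; Imputer.fit works
print(pd.DataFrame({'feature1': [False, True, np.nan]}).dtypes)

# Case 3: no missing values -> bool dtype; Imputer.fit works
print(pd.DataFrame({'feature1': [False, True]}).dtypes)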

Notes
The confusing thing here is that None can mean different things. It could be the same as nan, or it could be intended as its own category.

I think it's fine to treat it as nan, as long as we document and explain that convention.

Workaround
Clean None out of bool/category/string features: df = df.fillna(value=np.nan)
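
Applied to the failing reproducer, a minimal sketch of that workaround:

import numpy as np
import pandas as pd
from evalml.pipelines.components import Imputer

df_with_bool = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, None]})
df_with_bool = df_with_bool.fillna(value=np.nan)  # replaces None with np.nan in the object column
imputer = Imputer()
imputer.fit(df_with_bool)  # no TypeError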

Fix
Short-term:

  • Update Imputer to replace None with np.nan (a rough sketch is below).
  • Update the Imputer API doc and automl user guide to mention this.
  • Add test coverage of Imputer with None in the data, for all intended datatypes.

We could instead add a DataCheck which errors if there are Nones in the data. But this feels unnecessary since Nones can be easily converted.
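
For illustration, a rough sketch of the first bullet; this is a hypothetical helper, not the actual evalml change, which would live inside Imputer.fit/transform:

import numpy as np
import pandas as pd

def _replace_none_with_nan(X):
    # Hypothetical helper: normalize None to np.nan up front so the downstream
    # sklearn/scipy mode computation never has to compare None with other values.
    if not isinstance(X, pd.DataFrame):
        X = pd.DataFrame(X)
    return X.fillna(value=np.nan)

Imputer.fit and Imputer.transform could call something like this on X before splitting out the numeric and categorical columns.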

Long-term:
Once we update evalml to use the new DataTable data structure, users will be able to configure the type of each feature ahead of time. I hope that standardization will make these sorts of errors irrelevant.

@angela97lin
Contributor

Is this related to #540?

@dsherry
Contributor

dsherry commented Aug 27, 2020

@angela97lin 🤦 100% related... in fact it's a dup. Haha. We even decided there to have the imputer convert Nones to np.nans.

Closing #540 in favor of this because the writeups here are more up-to-date.

Thank you!
