@freddyaboulton thanks for the clear reproducer! It appears this explains another bug #1092 as well.
Problem
If any feature in the pandas dataframe has object type and contains a None value, our Imputer fails.
X = pd.DataFrame({'feature1': [False, True, None, np.nan]}) creates a feature with object type. Imputer.fit fails.
X = pd.DataFrame({'feature1': [False, True, np.nan]}) creates a feature with object type. Imputer.fit works.
X = pd.DataFrame({'feature1': [False, True]}) creates a feature with bool type. Imputer.fit works.
The same is true for category type. A similar situation happens for string types, although the last case (a column with no missing values) doesn't apply.
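To make the three cases above concrete, here is a small sketch that only inspects what pandas produces for each input. It does not call the evalml Imputer; the variable names are chosen just for illustration.

```python
import numpy as np
import pandas as pd

# The three inputs from the cases above.
with_none = pd.DataFrame({'feature1': [False, True, None, np.nan]})
with_nan = pd.DataFrame({'feature1': [False, True, np.nan]})
bool_only = pd.DataFrame({'feature1': [False, True]})

# Collect the dtype pandas infers for each feature.
dtypes = {
    'with_none': str(with_none['feature1'].dtype),
    'with_nan': str(with_nan['feature1'].dtype),
    'bool_only': str(bool_only['feature1'].dtype),
}
print(dtypes)

# Only the first frame actually holds a Python None object.
none_counts = {
    'with_none': sum(v is None for v in with_none['feature1']),
    'with_nan': sum(v is None for v in with_nan['feature1']),
}
print(none_counts)
```

The first two features share the object dtype, so the dtype alone does not distinguish the failing case from the working one; the difference is the presence of a literal None.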
Notes
The confusing thing here is that None can mean different things. It could be the same as nan, or it could be intended as its own category.
I think it's fine to treat it as nan as long as we document and explain that convention.
Workaround
Clean None out of bool/category/string features: df = df.fillna(value=np.nan)
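A quick sketch of the workaround applied to the failing input from the Problem section. It only checks that the None is gone afterwards; whether Imputer.fit then succeeds is as described above and is not exercised here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'feature1': [False, True, None, np.nan]})

# Count literal None objects before the cleanup.
none_before = sum(v is None for v in df['feature1'])

# The workaround: fill missing entries (None and nan alike) with np.nan.
df = df.fillna(value=np.nan)

# Count literal None objects after the cleanup.
none_after = sum(v is None for v in df['feature1'])
missing_after = int(df['feature1'].isna().sum())

print(none_before, none_after, missing_after)
```

Both missing entries are still reported by isna() after the call; only their representation changes.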
Fix
Short-term:
Update Imputer to replace None with np.nan
Update Imputer API doc and automl user guide to mention this.
Add test coverage of Imputer with the inclusion of None in the data, for all intended datatypes.
We could instead add a DataCheck which errors if there are Nones in the data. But this feels unnecessary since Nones can be easily converted.
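As a rough sketch of the short-term items, the conversion could live in a small helper that runs before fitting, with a check across several datatypes that could seed the new tests. The names replace_none_with_nan and count_none are hypothetical and are not part of evalml; how the real Imputer would wire this in is left open.

```python
import numpy as np
import pandas as pd


def replace_none_with_nan(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of X with None replaced by np.nan."""
    X = X.copy()
    for col in X.columns:
        # Only object columns can hold a literal None.
        if X[col].dtype == object:
            X[col] = X[col].map(lambda v: np.nan if v is None else v)
    return X


def count_none(X: pd.DataFrame) -> int:
    """Hypothetical helper: count literal None objects across all columns."""
    return int(sum(v is None for col in X.columns for v in X[col].tolist()))


# One column per datatype to cover; dtype=object keeps None as-is on input.
X = pd.DataFrame({
    'bool_feature': pd.Series([False, True, None, np.nan], dtype=object),
    'string_feature': pd.Series(['a', 'b', None, np.nan], dtype=object),
    'numeric_feature': [1.0, 2.0, 3.0, np.nan],
})
X_clean = replace_none_with_nan(X)

print(count_none(X), count_none(X_clean))
```

Restricting the replacement to object columns leaves numeric features untouched, and the total number of missing entries is unchanged, which matches the convention of treating None as nan.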
Long-term:
Once we update evalml to use the new DataTable data structure, users will be able to configure the types of each feature ahead of time. I hope this standardization will make these sorts of errors irrelevant.
Reproducer
Both have the same stacktrace:
This works when it is np.nan instead of None.