Understanding behavior of Simple Imputer with categorical values #19445

theonlypoi · 2021-02-12T08:39:18Z

theonlypoi
Feb 12, 2021

I am trying to understand the behaviour of an already fitted SimpleImputer on a dataframe that has only missing values.

Adding sample code for better understanding.

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

df_1 = pd.DataFrame({"col_1": ["A", "A", "B", np.nan, "A"]}, dtype="category")
df_2 = pd.DataFrame({"col_1": [np.nan, np.nan, np.nan]}, dtype="category")

imputer = SimpleImputer(strategy="constant", fill_value="missing")
imputer.fit(df_1)

Now, if I call imputer.transform(df_1), it works correctly and imputes the missing value in df_1.
However, if I call imputer.transform(df_2), it raises ValueError: could not convert string to float: 'missing'.
If I modify df_2 to include one non-missing value, it works fine.

Shouldn't it impute the missing values in df_2 as "missing", since the imputer is already fitted?
What would be a better approach to handle such cases?

Answered by alfaro96

NicolasHug · 2021-02-12T09:01:12Z

NicolasHug
Feb 12, 2021
Maintainer

Looks like an unsupported edge case but I agree it can be surprising. It will work as expected if you specify dtype=object for df_2.

Unless you really need to, try to use numpy arrays instead of pandas dataframes.

@thomasjpfan may have thoughts on how to best handle this

3 replies

theonlypoi Feb 12, 2021
Author

Thank you for your quick response.

Unless you really need to, try to use numpy arrays instead of pandas dataframes.

I will keep this in mind.

theonlypoi Feb 12, 2021
Author

Are there any specific reasons to prefer numpy arrays over pandas dataframes, given that pandas is also built on top of numpy?
Is it related to faster computation?

NicolasHug Feb 12, 2021
Maintainer

We officially support numpy arrays, not pandas dataframes. For now, it's impossible to get a df as the output of our transformers, for example, so you might prefer having homogeneous objects right from the start.

Most of the time, passing a df as input goes smoothly, as we still try to be as compatible as possible. But in some edge-cases like this one (categorical dtype is specific to pandas, not to numpy, and we don't account for it everywhere), it can lead to surprising results.

Disclaimer: our partial pandas support is something that has always bugged me for different reasons so I tend to advise against it. Other devs will have different opinions ;) But in any case, using numpy arrays in scikit-learn is always safer / less bug-prone.
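As a rough illustration of the numpy-only route suggested above, here is a minimal sketch that builds the same data as object-dtype numpy arrays instead of categorical dataframes. The names X_train, X_new and X_new_imputed are made up for this example and do not appear in the thread.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same values as df_1 and df_2 in the question, but as object-dtype
# numpy arrays, so no pandas categorical dtype is involved.
X_train = np.array([["A"], ["A"], ["B"], [np.nan], ["A"]], dtype=object)
X_new = np.array([[np.nan], [np.nan], [np.nan]], dtype=object)

imputer = SimpleImputer(strategy="constant", fill_value="missing")
imputer.fit(X_train)

# The all-missing array keeps its object dtype, so the string
# fill value can be written into it.
X_new_imputed = imputer.transform(X_new)
print(X_new_imputed.ravel().tolist())
```

Because the dtype is stated explicitly, an all-missing array is not silently treated as a float array.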

alfaro96 · 2021-02-12T09:01:49Z

alfaro96
Feb 12, 2021
Collaborator

Thank you @theonlypoi for reaching out in the scikit-learn discussions!

We use the check_array function to convert the DataFrame to a numpy array. Since all values in df_2 are missing (np.nan), the resulting numpy array has a floating-point dtype rather than the object dtype needed to hold the string fill value.
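The dtype difference can be seen without scikit-learn at all. The following is a small sketch, using np.asarray on Series as a stand-in for the conversion step (an assumption made for brevity; check_array does more than this). The names s_1, s_2, arr_1 and arr_2 are illustrative only.

```python
import numpy as np
import pandas as pd

# The two columns from the question, as categorical Series.
s_1 = pd.Series(["A", "A", "B", np.nan, "A"], dtype="category")
s_2 = pd.Series([np.nan, np.nan, np.nan], dtype="category")

arr_1 = np.asarray(s_1)
arr_2 = np.asarray(s_2)

# String categories convert to an object array,
# while the all-NaN column converts to a float array.
print(arr_1.dtype.kind)  # "O"
print(arr_2.dtype.kind)  # "f"
```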

To solve the issue, you can replace:

df_2 = pd.DataFrame({"col_1": [np.nan, np.nan, np.nan]}, dtype="category")

with:

df_2 = pd.DataFrame({"col_1": [np.nan, np.nan, np.nan]}, dtype=object)

That will solve the issue 😉.

1 reply

thomasjpfan Feb 12, 2021
Maintainer

When the values of a pandas categorical dtype can all be expressed as floats, the result will be a float array when used with np.asarray:

import pandas as pd
import numpy as np

s = pd.Series([1, 2, np.nan], dtype="category")
np.asarray(s)
# array([ 1.,  2., nan])
# this is a float ndarray

So for the original post with all nans, the feature will be converted to all floats, which triggers the error. Even the above case with [1, 2, np.nan] would trigger an error:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

df_2 = pd.DataFrame({"col_1": [1, 2, np.nan]}, dtype="category")
imputer = SimpleImputer(strategy="constant", fill_value="missing")

# This will error
imputer.fit(df_2)

I am undecided on whether this should be handled by sklearn, which would involve specific logic for categorical dtypes.
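Until that is decided, one user-side option is to cast categorical columns to object before calling the imputer. This is only a sketch of a workaround, not something scikit-learn does internally; categoricals_to_object is a hypothetical helper written for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def categoricals_to_object(df):
    """Return a copy of df with every categorical column cast to object.

    Hypothetical helper for illustration; not part of pandas or sklearn.
    """
    cat_cols = df.select_dtypes(include="category").columns
    return df.astype({col: object for col in cat_cols})


df_1 = pd.DataFrame({"col_1": ["A", "A", "B", np.nan, "A"]}, dtype="category")
df_2 = pd.DataFrame({"col_1": [np.nan, np.nan, np.nan]}, dtype="category")

imputer = SimpleImputer(strategy="constant", fill_value="missing")
imputer.fit(categoricals_to_object(df_1))

# After the cast, the all-NaN column is an object column, so the
# string fill value can be inserted.
result = imputer.transform(categoricals_to_object(df_2))
print(result.ravel().tolist())
```

The cast discards the category metadata, so this only makes sense when the downstream steps do not rely on the categorical dtype.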
