Error handling when assigning OneHotEncoder default output to pandas DataFrame #26883

StefanieSenger · 2023-07-23T12:42:10Z

OneHotEncoder defaults to sparse_output=True, so transform() returns a sparse matrix of type <class 'scipy.sparse._csr.csr_matrix'>.

Example:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"Color": ["red", "blue", "blue", "green", "yellow", "red", "blue", "red", "yellow", "red"],
                   "Target": [0, 1, 1, 0, 0, 0, 0, 1, 0, 0]})

ohe = OneHotEncoder()  # default: sparse_output=True

ohe.fit(df[["Color"]])
color_encoded = ohe.transform(df[["Color"]])
print(type(color_encoded))

Output:
<class 'scipy.sparse._csr.csr_matrix'>

To assign the transformed columns to a pandas DataFrame, we need to set sparse_output=False; otherwise pandas raises an error, as shown here:

df[ohe.get_feature_names_out()] = color_encoded

Output:

Traceback (most recent call last):
  File "/home/stefanie/Python/issue_sklearn_ohe/new.py", line 12, in <module>
    df[ohe.get_feature_names_out()] = color_encoded
  File "/home/stefanie/.pyenv/versions/3.10.6/envs/sklearn_ohe_issue/lib/python3.10/site-packages/pandas/core/frame.py", line 3938, in __setitem__
    self._setitem_array(key, value)
  File "/home/stefanie/.pyenv/versions/3.10.6/envs/sklearn_ohe_issue/lib/python3.10/site-packages/pandas/core/frame.py", line 3994, in _setitem_array
    return self._setitem_array(key, value)
  File "/home/stefanie/.pyenv/versions/3.10.6/envs/sklearn_ohe_issue/lib/python3.10/site-packages/pandas/core/frame.py", line 3989, in _setitem_array
    self._iset_not_inplace(key, value)
  File "/home/stefanie/.pyenv/versions/3.10.6/envs/sklearn_ohe_issue/lib/python3.10/site-packages/pandas/core/frame.py", line 4016, in _iset_not_inplace
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

But if we inspect the shape of the output matrix, it looks right:

print("len(df['Color'].value_counts()): ", len(df["Color"].value_counts()))
print("color_encoded.shape: ", color_encoded.shape)

Output:

len(df['Color'].value_counts()):  4           # there are 4 different values to encode
color_encoded.shape:  (10, 4)                 # our output seems to have the shape of  (10,4)

In fact, color_encoded.shape == color_encoded.toarray().shape, because a sparse matrix reports its shape as if it were a dense matrix.

And:
print(len(ohe.get_feature_names_out()) == color_encoded.shape[1])

Output:
True

I find this confusing and wonder whether there is anything we can do in scikit-learn to make the handling of OneHotEncoder more intuitive in this regard.

I know the behavior comes from scipy and pandas, but can we somehow catch this in scikit-learn and show a warning if sparse_output=True and the user tries to assign the result to a pandas DataFrame? Or could we change the shape that the sparse matrix reports?

If nothing else is possible, we could add a note to the docstring that sparse_output=False should be used when the output is intended for a pandas DataFrame.

There are many related issues and PRs that discuss passing the sparse output matrix on to ColumnTransformer, but I haven't understood why sparse_output=True is the default.

adrinjalali · 2023-07-24T16:05:42Z

Maintainer

I guess a few things are relevant here: the way pandas handles sparse data is very different from scipy.sparse, and scipy sparse matrices are going through a lot of change.

A sparse matrix can't be converted to a pandas DataFrame as-is, and it probably shouldn't be, since densifying it might explode the memory.

The default of OHE is sparse because one-hot encoded data is inherently sparse (if returned dense, it is a matrix with mostly zeros). So that default is sensible.

But I think the way the docs are written causes a bit of confusion for users here. What we do is convert to pandas if the output is not sparse, and otherwise we don't touch it. And the docs don't really say that:

That definitely needs to be updated, and I think it would make sense to raise a warning if the output is sparse but the user says they want pandas.

cc @thomasjpfan

2 replies

adrinjalali Jul 27, 2023
Maintainer

OK, so I was confusing the sparse_output parameter set on OHE with set_output (it is confusing to have both, though).

The documentation of sparse_output needs to be improved to mention that we return a CSR matrix, and the documentation of transform also needs to be explicit about that.

And if the user sets the output to pandas, we actually fail, which is good, but I think the error message is a bit confusing, since it's not clear to the user how they can fix it.

StefanieSenger Jul 27, 2023
Author

I will make a PR to improve this.
