Error handling when assigning OneHotEncoder default output to pandas DataFrame #26883
I guess a few things are relevant here: the way pandas handles sparse data is very different from scipy.sparse, and scipy sparse matrices are going through a lot of change. A sparse matrix can't be converted to a pandas DataFrame as-is, and it probably shouldn't be, since densifying it might explode the memory. The default for OHE is sparse output because the encoded result is always sparse by nature (even when returned dense, it is a matrix of mostly zeros). So that default is sensible. But I think the way we have the docs causes a bit of confusion for users. What we do is convert to pandas if the output is not sparse, and otherwise we don't touch it. The docs don't really say that. That definitely needs to be updated, and I think it'd make sense to maybe raise a warning if the output is sparse but the user says they want pandas. cc @thomasjpfan
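To illustrate how differently pandas represents sparse data, here is a minimal sketch that wraps a scipy sparse matrix in pandas' own sparse column format instead of densifying it. The identity matrix and the column names a, b, c are illustrative assumptions, not taken from the discussion:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical 3x3 sparse matrix standing in for an encoder's output.
mat = sparse.csr_matrix(np.eye(3))

# pandas keeps sparsity per column (SparseDtype), not as one 2-D sparse object.
sparse_df = pd.DataFrame.sparse.from_spmatrix(mat, columns=["a", "b", "c"])

print(sparse_df.dtypes)          # one sparse dtype per column
print(sparse_df.sparse.density)  # fraction of explicitly stored values
```

Each column here is backed by its own sparse array, which is a different model from a single scipy matrix and is part of why the two do not interoperate transparently.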
For OneHotEncoder a default param is sparse_output=True. After transform() we get a sparse matrix of type <class 'scipy.sparse._csr.csr_matrix'>.
Example:
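A minimal sketch of such an example, assuming a small hypothetical DataFrame with a single color column. The data and the column name are illustrative assumptions; the names ohe and color_encoded follow the snippets in this post:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (an assumption for illustration).
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# With default settings the encoder returns a scipy sparse matrix.
ohe = OneHotEncoder()
color_encoded = ohe.fit_transform(df[["color"]])

print(type(color_encoded))
```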
Output:
<class 'scipy.sparse._csr.csr_matrix'>
When we want to assign the transformed column into a pandas DataFrame, we need to set sparse_output=False, because otherwise we get an error coming from pandas, as shown here:
df[ohe.get_feature_names_out()] = color_encoded
Output:
But if we inspect the shape of the output matrix, it seems all right:
Output:
In fact, color_encoded.shape == color_encoded.toarray().shape, because the sparse matrix reports its shape as if it were a dense matrix. And:
print(len(ohe.get_feature_names_out()) == color_encoded.shape[1])
Output:
True
I find this confusing and wonder if there is anything we can do in scikit-learn to make the handling of OneHotEncoder more intuitive in this regard.
I know this comes from scipy and pandas, but can we somehow catch it in scikit-learn and show a warning if sparse_output=True and the user tries to assign the result to a pandas DataFrame? Or could we change how the sparse matrix reports its shape? If nothing else is possible, we could add to the docstring that sparse_output=False should be used when the output is intended for a pandas DataFrame.
There are many related issues and PRs that discuss handing down the sparse output matrix to ColumnTransformer, but I haven't understood why sparse_output=True is the default.