PyError with OneHotEncoder (Julia 0.6.0 on Windows10) #32

Open · ValdarT opened this issue Jul 6, 2017 · 6 comments

@ValdarT (Contributor) commented Jul 6, 2017

I'm getting a PyError with this code.

using DataFrames
using ScikitLearn
@sk_import preprocessing: OneHotEncoder

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

mapper = DataFrameMapper([([:B], OneHotEncoder())]);

fit_transform!(mapper, df)
ERROR: PyError (ccall(@pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, arg, C_NULL)) <type 'exceptions.ValueError'>
ValueError('could not convert string to float: M',)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1844, in fit
    self.fit_transform(X)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1902, in fit_transform
    self.categorical_features, copy=True)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1697, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\utils\validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)

It seems specific to OneHotEncoder. For example, LabelBinarizer works fine like this:

mapper = DataFrameMapper([(:B, LabelBinarizer())]);
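
For completeness, the full LabelBinarizer variant looks roughly like this (a minimal sketch, assuming the same df as above):

using DataFrames
using ScikitLearn
@sk_import preprocessing: LabelBinarizer

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

# note :B (a 1-d column) rather than [:B] (a one-column matrix)
mapper = DataFrameMapper([(:B, LabelBinarizer())]);

fit_transform!(mapper, df)  # should give a 4x1 matrix of 0/1 values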

I'm on Windows 10 using Julia 0.6.0.
Package versions:

- Conda                         0.5.3
- DataArrays                    0.5.3
- DataFrames                    0.10.0
- PyCall                        1.14.0
- ScikitLearn                   0.3.0
- ScikitLearnBase               0.3.0

I let ScikitLearn.jl automatically handle the installation of Python dependencies. The installed versions are:

- python                        2.7.13
- numpy                         1.13.0
- scikit-learn                  0.18.2

@cstjean (Owner) commented Jul 6, 2017

It's probably a bug, but have you checked if the equivalent code works in Python?

You can use ScikitLearn.Preprocessing.DictEncoder() until this gets fixed. The semantics are a bit different, but for single-column input matrices it should be the same:

DictEncoder()

For every unique row in the input matrix, associate a 0/1 binary column in the output matrix. This is similar to OneHotEncoder, but considers the entire row as a single value for one-hot-encoding. It works with any hashable datatype.

It is common to use it inside a DataFrameMapper, with a particular subset of the columns.
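
Something along these lines should work for your case (an untested sketch, reusing the df from your report; DictEncoder is pure Julia, so no @sk_import is needed):

using DataFrames
using ScikitLearn

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

# DictEncoder one-hot-encodes each unique row of the selected column(s)
mapper = DataFrameMapper([([:B], ScikitLearn.Preprocessing.DictEncoder())]);

fit_transform!(mapper, df)  # should give one 0/1 column per unique value of :B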

@cstjean (Owner) commented Jul 6, 2017

Thank you for the detailed bug report!

@ValdarT (Contributor, Author) commented Jul 7, 2017

Sorry, my mistake. It turns out OneHotEncoder only accepts integer values. That's rather unexpected and weird in my opinion, but it is clearly stated in the docs. At least I'm not the only one: scikit-learn-contrib/sklearn-pandas#63. : )

However, I still get an 'invalid Array dimensions' error with this code:

using DataFrames
using ScikitLearn
@sk_import preprocessing: OneHotEncoder

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

mapper = DataFrameMapper([([:A], OneHotEncoder())]);

fit_transform!(mapper, df)
invalid Array dimensions

Stacktrace:
 [1] Array{Float64,N} where N(::Tuple{Int64}) at .\boot.jl:317
 [2] py2array(::Type{T} where T, ::PyCall.PyObject) at C:\Users\...\.julia\v0.6\PyCall\src\conversions.jl:381
 [3] convert(::Type{Array{Float64,2}}, ::PyCall.PyObject) at C:\Users\...\.julia\v0.6\PyCall\src\numpy.jl:480
 [4] transform(::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame) at C:\Users\...\.julia\v0.6\ScikitLearn\src\dataframes.jl:150
 [5] #fit_transform!#16(::Array{Any,1}, ::Function, ::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame, ::Void) at C:\Users\...\.julia\v0.6\ScikitLearnBase\src\ScikitLearnBase.jl:152
 [6] fit_transform!(::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame) at C:\Users\...\.julia\v0.6\ScikitLearnBase\src\ScikitLearnBase.jl:152

whereas the equivalent code works fine in Python:

import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'A': [1,2,3,4], 'B': ["M", "F", "F", "M"]})
mapper = DataFrameMapper([(['A'], OneHotEncoder())])

mapper.fit_transform(df)

@ValdarT (Contributor, Author) commented Jul 7, 2017

Fortunately, a change to make OneHotEncoder accept strings is in the works: scikit-learn/scikit-learn#4920

@cstjean (Owner) commented Jul 7, 2017

Figured it out: OneHotEncoder returns a sparse matrix by default, which PyCall doesn't know how to convert (JuliaPy/PyCall.jl#204). It would have to be fixed there, or at the very least there should be a clearer error message on that end.

Fortunately, you can solve the problem with OneHotEncoder(sparse=false).
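
For your second example that would be roughly (an untested sketch):

using DataFrames
using ScikitLearn
@sk_import preprocessing: OneHotEncoder

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

# sparse=false asks scikit-learn for a dense array, which PyCall can convert to a Julia matrix
mapper = DataFrameMapper([([:A], OneHotEncoder(sparse=false))]);

fit_transform!(mapper, df)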

> Turns out OneHotEncoder only accepts integer values

Use DictEncoder! It's pure Julia, so it'll be way faster than OneHotEncoder, and it will work with any hashable type (almost anything).

@ValdarT (Contributor, Author) commented Jul 7, 2017

Thank you!

> Use DictEncoder!

Will do.

Hopefully we can soon replace all the preprocessing steps with pure Julia implementations. The work at JuliaML seems to be getting there step by step.
