PyError with OneHotEncoder (Julia 0.6.0 on Windows10) #32

Open · ValdarT opened this issue Jul 6, 2017 · 6 comments

@ValdarT (Contributor) commented Jul 6, 2017

I'm getting a PyError with this code.

using DataFrames
using ScikitLearn
@sk_import preprocessing: OneHotEncoder

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

mapper = DataFrameMapper([([:B], OneHotEncoder())]);

fit_transform!(mapper, df)
ERROR: PyError (ccall(@pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, arg, C_NULL)) <type 'exceptions.ValueError'>
ValueError('could not convert string to float: M',)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1844, in fit
    self.fit_transform(X)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1902, in fit_transform
    self.categorical_features, copy=True)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\preprocessing\data.py", line 1697, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
  File "C:\Users\...\.julia\v0.6\Conda\deps\usr\lib\site-packages\sklearn\utils\validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)

It seems specific to OneHotEncoder. For example, LabelBinarizer works fine like this:

mapper = DataFrameMapper([(:B, LabelBinarizer())]);
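
For completeness, the full LabelBinarizer variant looks roughly like this (a minimal sketch, assuming the same df as above):

using DataFrames
using ScikitLearn
@sk_import preprocessing: LabelBinarizer

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

# note :B (a 1-d column) rather than [:B] (a one-column matrix)
mapper = DataFrameMapper([(:B, LabelBinarizer())]);

fit_transform!(mapper, df)  # should give a 4x1 matrix of 0/1 values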

I'm on Windows 10 using Julia 0.6.0.
Package versions:

- Conda                         0.5.3
- DataArrays                    0.5.3
- DataFrames                    0.10.0
- PyCall                        1.14.0
- ScikitLearn                   0.3.0
- ScikitLearnBase               0.3.0

I let ScikitLearn.jl automatically handle the installation of Python dependencies. The installed versions are:

- python                        2.7.13
- numpy                         1.13.0
- scikit-learn                  0.18.2

@cstjean (Owner) commented Jul 6, 2017

It's probably a bug, but have you checked if the equivalent code works in Python?

You can use ScikitLearn.Preprocessing.DictEncoder() until this gets fixed. The semantics are a bit different, but for single-column input matrices it should be the same:

DictEncoder()

For every unique row in the input matrix, associate a 0/1 binary column in the output matrix. This is similar to OneHotEncoder, but considers the entire row as a single value for one-hot-encoding. It works with any hashable datatype.

It is common to use it inside a DataFrameMapper, with a particular subset of the columns.
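
Something along these lines should work for your case (an untested sketch, reusing the df from your report; DictEncoder is pure Julia, so no @sk_import is needed):

using DataFrames
using ScikitLearn

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

# DictEncoder one-hot-encodes each unique row of the selected column(s)
mapper = DataFrameMapper([([:B], ScikitLearn.Preprocessing.DictEncoder())]);

fit_transform!(mapper, df)  # should give one 0/1 column per unique value of :B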

@cstjean (Owner) commented Jul 6, 2017

Thank you for the detailed bug report!

@ValdarT (Contributor, Author) commented Jul 7, 2017

Sorry, my mistake. It turns out OneHotEncoder only accepts integer values. That's rather unexpected and weird in my opinion, but it is clearly stated in the docs. At least I'm not the only one: scikit-learn-contrib/sklearn-pandas#63. : )

However, I still get an 'invalid Array dimensions' error with this code:

using DataFrames
using ScikitLearn
@sk_import preprocessing: OneHotEncoder

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

mapper = DataFrameMapper([([:A], OneHotEncoder())]);

fit_transform!(mapper, df)
invalid Array dimensions

Stacktrace:
 [1] Array{Float64,N} where N(::Tuple{Int64}) at .\boot.jl:317
 [2] py2array(::Type{T} where T, ::PyCall.PyObject) at C:\Users\...\.julia\v0.6\PyCall\src\conversions.jl:381
 [3] convert(::Type{Array{Float64,2}}, ::PyCall.PyObject) at C:\Users\...\.julia\v0.6\PyCall\src\numpy.jl:480
 [4] transform(::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame) at C:\Users\...\.julia\v0.6\ScikitLearn\src\dataframes.jl:150
 [5] #fit_transform!#16(::Array{Any,1}, ::Function, ::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame, ::Void) at C:\Users\...\.julia\v0.6\ScikitLearnBase\src\ScikitLearnBase.jl:152
 [6] fit_transform!(::ScikitLearn.DataFrameMapper, ::DataFrames.DataFrame) at C:\Users\...\.julia\v0.6\ScikitLearnBase\src\ScikitLearnBase.jl:152

whereas the equivalent code works fine in Python:

import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'A': [1,2,3,4], 'B': ["M", "F", "F", "M"]})
mapper = DataFrameMapper([(['A'], OneHotEncoder())])

mapper.fit_transform(df)

@ValdarT (Contributor, Author) commented Jul 7, 2017

Fortunately, a change to make OneHotEncoder accept strings is in the works: scikit-learn/scikit-learn#4920

@cstjean (Owner) commented Jul 7, 2017

Figured it out: OneHotEncoder returns a sparse matrix by default, which PyCall doesn't know how to convert (JuliaPy/PyCall.jl#204). It would have to be fixed there, or at the very least there should be a clearer error message on that end.

Fortunately, you can solve the problem with OneHotEncoder(sparse=false).
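
For your second example that would be roughly (an untested sketch):

using DataFrames
using ScikitLearn
@sk_import preprocessing: OneHotEncoder

df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])

# sparse=false asks scikit-learn for a dense array, which PyCall can convert to a Julia matrix
mapper = DataFrameMapper([([:A], OneHotEncoder(sparse=false))]);

fit_transform!(mapper, df)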

> Turns out OneHotEncoder only accepts integer values

Use DictEncoder! It's pure Julia, so it'll be way faster than OneHotEncoder, and it will work with any hashable type (almost anything).

@ValdarT (Contributor, Author) commented Jul 7, 2017

Thank you!

> Use DictEncoder!

Will do.

Hopefully we can soon replace all the preprocessing steps with pure Julia implementations. The work at JuliaML seems to be getting there step by step.
