
LabelBinarizer and LabelEncoder fit and transform signatures not compatible with Pipeline #3112

Closed
hxu opened this issue Apr 26, 2014 · 6 comments


hxu commented Apr 26, 2014

I get this error when I try to use LabelBinarizer and LabelEncoder in a Pipeline:

sklearn/pipeline.pyc in fit_transform(self, X, y, **fit_params)
    141         Xt, fit_params = self._pre_transform(X, y, **fit_params)
    142         if hasattr(self.steps[-1][-1], 'fit_transform'):
--> 143             return self.steps[-1][-1].fit_transform(Xt, y, **fit_params)
    144         else:
    145             return self.steps[-1][-1].fit(Xt, y, **fit_params).transform(Xt)

TypeError: fit_transform() takes exactly 2 arguments (3 given)
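
A minimal way to reproduce this (an illustrative snippet, not taken from the traceback above):

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import LabelBinarizer

    pipe = Pipeline([('binarize', LabelBinarizer())])
    # Pipeline passes both X and y to the last step's fit_transform,
    # but LabelBinarizer.fit_transform accepts only a single argument.
    pipe.fit_transform(['a', 'b', 'c'], [0, 1, 0])  # raises TypeError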

It seems like this is because these classes' fit and transform signatures differ from most other estimators' and accept only a single argument.

I think this is a pretty easy fix (just change the signature to def fit(self, X, y=None)) that I'd be happy to send a pull request for, but I wanted to check whether there's a reason the signatures are the way they are that I haven't thought of.
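
For illustration, a wrapper with the proposed signature would look something like this (a hypothetical sketch, not an actual patch; PipelineLabelEncoder is an invented name):

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.preprocessing import LabelEncoder

    class PipelineLabelEncoder(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            # y is accepted (so Pipeline can pass it) but ignored
            self.encoder_ = LabelEncoder().fit(X)
            return self
        def transform(self, X):
            return self.encoder_.transform(X)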

@jnothman (Member)

I think you're right to fix that.


@jnothman (Member)

In #3113 we have decided this is not to be fixed because label encoding doesn't really belong in a Pipeline.


tutuca commented Jun 15, 2016

@jnothman, just so I know: what should I do instead if I need to vectorize a categorical feature in a pipeline?

@jnothman (Member)

You might be best off writing your own Pipeline-like code (perhaps inheriting from the existing Pipeline) to handle your specific case.


Kallin commented Jul 10, 2017

Instead of using LabelBinarizer in a pipeline, I just implemented my own transformer:

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.preprocessing import LabelBinarizer

    class CustomBinarizer(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None, **fit_params):
            # nothing to learn up front; the binarizer is fit in transform
            return self
        def transform(self, X):
            # fit a fresh LabelBinarizer on X and one-hot encode it
            return LabelBinarizer().fit(X).transform(X)

Seems to do the trick!
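
For example, it can then be dropped into a Pipeline (a usage sketch; the step name is illustrative):

    from sklearn.pipeline import Pipeline

    pipe = Pipeline([('onehot', CustomBinarizer())])
    X = ['cat', 'dog', 'fish', 'cat']
    pipe.fit_transform(X)  # one column per class seen in X
    # note: because transform() refits on whatever data it is given,
    # the columns depend on the classes present at transform time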

Edit: this is a better solution:
https://github.com/scikit-learn/scikit-learn/pull/7375/files#diff-1e175ddb0d84aad0a578d34553f6f9c6


jnothman commented Jan 29, 2018

I see that there have been a lot of negative reactions on this page. I think there has been a long-standing misunderstanding of the purpose of LabelBinarizer and LabelEncoder: these are for targets, not features. Although admittedly they were designed (and poorly named) before my time.
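
For the supported use on targets, for example:

    from sklearn.preprocessing import LabelEncoder

    y = ['spam', 'ham', 'spam', 'eggs']
    le = LabelEncoder()
    le.fit_transform(y)  # array([2, 1, 2, 0])
    le.classes_          # array(['eggs', 'ham', 'spam'], dtype='<U4')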

Although I think users could have been using CountVectorizer (or DictVectorizer with dataframe.to_dict(orient='records') if you're coming from a dataframe) for this purpose for a long time, we have recently merged a CategoricalEncoder (#9151) into master; it may be rolled into OneHotEncoder, together with a new OrdinalEncoder, before release (#10521).
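
For instance, the DictVectorizer route from a dataframe looks roughly like this (illustrative data):

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    df = pd.DataFrame({'color': ['red', 'blue', 'red'],
                       'size': [1.0, 2.0, 3.0]})
    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(df.to_dict(orient='records'))
    # string values are one-hot encoded, numeric values pass through:
    # the resulting columns are ['color=blue', 'color=red', 'size']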

I hope this satisfies the needs of a clearly disgruntled populace.

I must say that as someone who has been volunteering enormous quantities of free time to the development of this project for nearly five years now (and has recently been employed to work on it too), seeing this magnitude of negative reactions, rather than constructive contributions to the library, is quite saddening. Admittedly, my response above, that you should write a new Pipeline-like thing rather than a new transformer for categorical inputs, was a misunderstanding on my part (and could have been corrected by others), which I hope is understandable given the enormous workload of maintaining this project.
