
One Hot Encoder: Drop one redundant feature by default for features with two categories #1936

Closed
freddyaboulton opened this issue Mar 5, 2021 · 14 comments · Fixed by #1997
Labels: enhancement (An improvement to an existing feature.)

@freddyaboulton
Contributor

freddyaboulton commented Mar 5, 2021

Our one hot encoder creates a feature for every level of the original categorical feature:

from evalml.pipelines import OneHotEncoder
import pandas as pd
df = pd.DataFrame({"category": ["a", "b"], "number": [4, 5]})
OneHotEncoder().fit_transform(df).to_dataframe()

[Screenshot: the transformed dataframe, with float-valued category_a and category_b columns alongside number.]

The category_a and category_b columns are completely collinear, which makes one redundant. This could have adverse effects on estimator fitting. I think we should drop one by default.

FYI @rpeck

@freddyaboulton added the enhancement label on Mar 5, 2021
@rpeck

rpeck commented Mar 5, 2021

💯 % we should drop the negative-case column.

If we do the OHE ourselves first, then sklearn hopefully won't expand them. As Freddy said, you can think of this as generating two columns that have perfect collinearity.

There are two problems I see with expanding a binary into two columns rather than one:

  1. Like other forms of feature collinearity it messes up lots of things in interpretability, because the effect of the one original source column is divided between the two OHE columns. Freddy's new SHAP rollups address this, obviously, but things like Feature Importance and Partial Dependence plots will still have the issue.
  2. Tree models like Random Forest and GBM randomly sample their input features. The source column in this case will be randomly sampled twice as often as it really should be, so it can have an outsized impact on the model.
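For concreteness, a quick check of the perfect-collinearity point, reusing freddyaboulton's snippet above (column names taken from the screenshot; this is just a sketch, not part of the proposed change):

from evalml.pipelines import OneHotEncoder
import pandas as pd

df = pd.DataFrame({"category": ["a", "b", "a"], "number": [4, 5, 6]})
encoded = OneHotEncoder().fit_transform(df).to_dataframe()

# For a two-category feature, the encoded columns always sum to 1,
# so either column is fully determined by the other.
assert (encoded["category_a"] + encoded["category_b"] == 1).all()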

@rpeck

rpeck commented Mar 5, 2021

@freddyaboulton Q: The dataframe above shows the OHE columns as floats. Is this really true?

@freddyaboulton
Contributor Author

@rpeck Yes!

@rpeck

rpeck commented Mar 5, 2021

@freddyaboulton What the? That's weird. I've never seen anything but true booleans, or 0/1 integers. I wonder how the tree models actually handle this. It has a bad smell to me.

@rpeck

rpeck commented Mar 5, 2021

Third Law of Code: Thou Shalt Not Make == Comparisons With Floats

@rpeck

rpeck commented Mar 5, 2021

(ok, unless it's with Math.NaN)

@dsherry
Contributor

dsherry commented Mar 9, 2021

Hmm, I thought we were doing this!

I agree we should. I thought it was just a flag we had to set in the underlying impl.

@dsherry modified the milestone (Sprint 2021 Mar A) on Mar 9, 2021
@angela97lin
Contributor

@dsherry @freddyaboulton Looks like we have support for this via our drop parameter, but it only takes user input into consideration and isn't set by our impl on its own, so this issue just tracks setting the default for drop to something other than None?

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

We could go with either first or if_binary; not sure what the right call is.

@freddyaboulton
Contributor Author

freddyaboulton commented Mar 11, 2021

@angela97lin You're right that changing the default value would suffice! I think first is the way to go since we should avoid perfectly collinear features even when the number of categories > 2. What do you think @rpeck ?
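For reference, a minimal sketch of how the two options differ (plain scikit-learn, not the EvalML component; assuming a scikit-learn version around 0.24 and made-up column names):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"binary": ["a", "b", "a"], "multi": ["x", "y", "z"]})

# drop="first" removes one column for every categorical feature
first = OneHotEncoder(drop="first", sparse=False).fit(X)
print(first.get_feature_names(["binary", "multi"]))
# ['binary_b' 'multi_y' 'multi_z']

# drop="if_binary" only removes a column for features with exactly two categories
if_binary = OneHotEncoder(drop="if_binary", sparse=False).fit(X)
print(if_binary.get_feature_names(["binary", "multi"]))
# ['binary_b' 'multi_x' 'multi_y' 'multi_z']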

@angela97lin self-assigned this on Mar 15, 2021
@angela97lin
Contributor

Was reading into this a bit and found this link: https://inmachineswetrust.com/posts/drop-first-columns/

Key takeaways:

RE @rpeck's first comment: "Like other forms of feature collinearity it messes up lots of things in interpretability, because the effect of the one original source column is divided between the two OHE columns. Freddy's new SHAP rollups address this, obviously, but things like Feature Importance and Partial Dependence plots will still have the issue."

This makes sense for binary cases, but when a feature has more than two categories, dropping one column will still leave this issue.

Perhaps we shouldn't do this by default, but should update make_pipeline to create an OHE with first as the drop parameter if the estimator is a linear regressor?

Alas, I don't have a strong grasp of the underlying math to make the judgment, so I'd love to hear your thoughts, @freddyaboulton @rpeck @dsherry

@angela97lin
Contributor

Post-discussion with @freddyaboulton @rpeck @dsherry @chukarsten @jeremyliweishih

  • We will only do this for binary cases.
  • A nice-to-have, in the binary case, is to drop the minority class, but otherwise just choosing one of the two categories should suffice (a possible way to find the minority category is sketched below).
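A hedged sketch of that nice-to-have (illustrative data only): finding the minority category of a binary column with pandas, which could then be handed to whatever drop mechanism we end up with.

import pandas as pd

col = pd.Series(["a", "b", "b", "a", "b"], name="category")
# value_counts() sorts by frequency; idxmin() gives the least common category
minority_category = col.value_counts().idxmin()
print(minority_category)  # "a" (2 of 5 rows)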

@dsherry
Contributor

dsherry commented Mar 16, 2021

@angela97lin sounds good RE default behavior. Another nice-to-have: the ability to override that default behavior via the component params.

@angela97lin
Contributor

@dsherry If I'm understanding correctly, since we're updating the default value of drop (a parameter), users will have the ability to override this by setting the component parameter manually?

@angela97lin changed the title from "One Hot Encoder: Drop one redundant feature by default" to "One Hot Encoder: Drop one redundant feature by default for binary cases" on Mar 17, 2021
@angela97lin changed the title from "One Hot Encoder: Drop one redundant feature by default for binary cases" to "One Hot Encoder: Drop one redundant feature by default for features with two categories" on Mar 17, 2021
@angela97lin
Contributor

angela97lin commented Mar 17, 2021

Dug around to see what was necessary to implement this. In particular, I was curious how difficult it would be to always remove the minority class in the binary case.

The result of that digging is:

  • With scikit-learn, it is quite difficult to select which category to remove. From the documentation, this seems feasible via the array option for the drop parameter (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). However, after trying it out, the drop array requires that a category to drop is specified for every single column; there is no way to drop for only some columns. Hence the following, which tries to drop a category for column 0 while passing placeholders for columns 1 and 2, errors:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'col_1': ["a", "b", "b", "a", "b"],
                  'col_2': ["a", "b", "a", "c", "b"],
                  'col_3': ["a", "a", "a", "a", "a"]})

indices_to_drop = np.array([0, None, None])

ohe = OneHotEncoder(drop=indices_to_drop)
ohe.fit(X)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-a099fa2fc4a7> in <module>
----> 1 ohe.fit(X)

~/Desktop/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in fit(self, X, y)
    417         self._fit(X, handle_unknown=self.handle_unknown,
    418                   force_all_finite='allow-nan')
--> 419         self.drop_idx_ = self._compute_drop_idx()
    420         return self
    421 

~/Desktop/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in _compute_drop_idx(self)
    394                                 ["Category: {}, Feature: {}".format(c, v)
    395                                     for c, v in missing_drops])))
--> 396                 raise ValueError(msg)
    397             return np.array(drop_indices, dtype=object)
    398 

ValueError: The following categories were supposed to be dropped, but were not found in the training data.
Category: 0, Feature: 0
Category: 1, Feature: None
Category: 2, Feature: None

I believe this is also half of what this issue points out: scikit-learn/scikit-learn#16511

An alternative we could use to support this is to manually keep track of which columns and which values we want to drop during fitting, pass the data to scikit-learn, and then prune out the columns we marked for dropping. However, this requires some logic to map the transformed column name back to the original (feature, value) pair. (We have this logic in get_feature_names, but that connects the column names assuming nothing should be dropped...)
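A rough sketch of that alternative (plain scikit-learn with made-up column names, assuming ~0.24 behavior): let scikit-learn expand everything, record during fit which transformed columns we want gone, and prune them after transform.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"col_1": ["a", "b", "b", "a", "b"],
                  "col_2": ["a", "b", "a", "c", "b"]})

ohe = OneHotEncoder(sparse=False)
ohe.fit(X)

# Decided during fit: only the binary col_1 loses its minority category "a".
cols_to_drop = ["col_1_a"]

feature_names = ohe.get_feature_names(X.columns)
transformed = pd.DataFrame(ohe.transform(X), columns=feature_names)
pruned = transformed.drop(columns=cols_to_drop)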

All of this is to say: maybe just using scikit-learn's built-in if_binary will suffice for now, and we can file a separate issue for always dropping the minority class. Honestly, I'm also in favor of moving away from scikit-learn's OHE implementation given how much we've had to work around it.

Useful resources:
OHE doc: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
Code in scikit-learn causing inflexibility: https://github.com/scikit-learn/scikit-learn/blob/95119c13af77c76e150b753485c662b7c52a41a2/sklearn/preprocessing/_encoders.py#L338
Related issue: scikit-learn/scikit-learn#16511


For using if_binary: scikit-learn requires that handle_unknown is error. This doesn't play well with our top_n parameter, which drops everything but the top N categories, because the encoder won't know what to do with categories it never saw when it comes time to transform. As Becca noted in #830, we would have to set top_n to None for these parameters to work together.
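A minimal repro of that constraint (plain scikit-learn, assuming the ~0.24 behavior described above, where drop cannot be combined with handle_unknown="ignore"):

from sklearn.preprocessing import OneHotEncoder

try:
    OneHotEncoder(drop="if_binary", handle_unknown="ignore").fit([["a"], ["b"]])
except ValueError as err:
    # In this scikit-learn version, drop requires handle_unknown="error",
    # which is what clashes with our top_n behavior at transform time.
    print(err)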

With this in mind, maybe it is best to just roll our own impl 🤔
