
One Hot Encoder: Drop one redundant feature by default for features with two categories #1936

Closed
freddyaboulton opened this issue Mar 5, 2021 · 14 comments · Fixed by #1997
Labels: enhancement (An improvement to an existing feature.)

@freddyaboulton
Contributor

freddyaboulton commented Mar 5, 2021

Our one hot encoder creates a feature for every level of the original categorical feature:

from evalml.pipelines import OneHotEncoder
import pandas as pd
df = pd.DataFrame({"category": ["a", "b"], "number": [4, 5]})
OneHotEncoder().fit_transform(df).to_dataframe()

[Screenshot: the transformed dataframe, with float-valued category_a and category_b columns alongside number.]

The category_a and category_b columns are completely collinear, which makes one redundant. This could have adverse effects on estimator fitting. I think we should drop one by default.

FYI @rpeck

@freddyaboulton added the enhancement label on Mar 5, 2021
@rpeck

rpeck commented Mar 5, 2021

💯 % we should drop the negative-case column.

If we do the OHE ourselves first, then sklearn hopefully won't expand them. As Freddy said, you can think of this as generating two columns that have perfect collinearity.

There are two problems I see with expanding a binary into two columns rather than one:

  1. Like other forms of feature collinearity it messes up lots of things in interpretability, because the effect of the one original source column is divided between the two OHE columns. Freddy's new SHAP rollups address this, obviously, but things like Feature Importance and Partial Dependence plots will still have the issue.
  2. Tree models like Random Forest and GBM randomly sample their input features. The source column in this case will be randomly sampled twice as often as it really should be, so it can have an outsized impact on the model.
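For concreteness, a quick check of the perfect-collinearity point, reusing freddyaboulton's snippet above (column names taken from the screenshot; this is just a sketch, not part of the proposed change):

from evalml.pipelines import OneHotEncoder
import pandas as pd

df = pd.DataFrame({"category": ["a", "b", "a"], "number": [4, 5, 6]})
encoded = OneHotEncoder().fit_transform(df).to_dataframe()

# For a two-category feature, the encoded columns always sum to 1,
# so either column is fully determined by the other.
assert (encoded["category_a"] + encoded["category_b"] == 1).all()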

@rpeck

rpeck commented Mar 5, 2021

@freddyaboulton Q: The dataframe above shows the OHE columns as floats. Is this really true?

@freddyaboulton
Contributor Author

@rpeck Yes!

@rpeck

rpeck commented Mar 5, 2021

@freddyaboulton What the? That's weird. I've never seen anything but true booleans, or 0/1 integers. I wonder how the tree models actually handle this. It has a bad smell to me.

@rpeck

rpeck commented Mar 5, 2021

Third Law of Code: Thou Shalt Not Make == Comparisons With Floats

@rpeck

rpeck commented Mar 5, 2021

(ok, unless it's with Math.NaN)

@dsherry
Contributor

dsherry commented Mar 9, 2021

Hmm, I thought we were doing this!

I agree we should. I thought it was just a flag we had to set in the underlying impl.

@dsherry modified the milestone (Sprint 2021 Mar A) on Mar 9, 2021
@angela97lin
Contributor

@dsherry @freddyaboulton Looks like we have support for this via our drop parameter, but it only takes user input into consideration and isn't set by our impl on its own, so this issue just tracks setting the default for drop to something other than None?

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

We could go with either first or if_binary; not sure what the right call is.

@freddyaboulton
Contributor Author

freddyaboulton commented Mar 11, 2021

@angela97lin You're right that changing the default value would suffice! I think first is the way to go since we should avoid perfectly collinear features even when the number of categories > 2. What do you think @rpeck ?
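For reference, a minimal sketch of how the two options differ (plain scikit-learn, not the EvalML component; assuming a scikit-learn version around 0.24 and made-up column names):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"binary": ["a", "b", "a"], "multi": ["x", "y", "z"]})

# drop="first" removes one column for every categorical feature
first = OneHotEncoder(drop="first", sparse=False).fit(X)
print(first.get_feature_names(["binary", "multi"]))
# ['binary_b' 'multi_y' 'multi_z']

# drop="if_binary" only removes a column for features with exactly two categories
if_binary = OneHotEncoder(drop="if_binary", sparse=False).fit(X)
print(if_binary.get_feature_names(["binary", "multi"]))
# ['binary_b' 'multi_x' 'multi_y' 'multi_z']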

@angela97lin self-assigned this on Mar 15, 2021
@angela97lin
Contributor

Was reading into this a bit and found this link: https://inmachineswetrust.com/posts/drop-first-columns/

Key takeaways:

RE @rpeck's first comment: "Like other forms of feature collinearity it messes up lots of things in interpretability, because the effect of the one original source column is divided between the two OHE columns. Freddy's new SHAP rollups address this, obviously, but things like Feature Importance and Partial Dependence plots will still have the issue."

This makes sense for binary cases, but when a feature has more than two categories, dropping one column will still leave this issue.

Perhaps we shouldn't do this by default, but should update make_pipeline to create an OHE with first as the drop parameter if the estimator is a linear regressor?

Alas, I don't have a strong grasp of the underlying math to make the judgment, so I'd love to hear your thoughts, @freddyaboulton @rpeck @dsherry

@angela97lin
Contributor

Post-discussion with @freddyaboulton @rpeck @dsherry @chukarsten @jeremyliweishih

  • We will only do this for binary cases.
  • A nice-to-have, in the binary case, is to drop the minority class, but otherwise just choosing one of the two categories should suffice (a possible way to find the minority category is sketched below).
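A hedged sketch of that nice-to-have (illustrative data only): finding the minority category of a binary column with pandas, which could then be handed to whatever drop mechanism we end up with.

import pandas as pd

col = pd.Series(["a", "b", "b", "a", "b"], name="category")
# value_counts() sorts by frequency; idxmin() gives the least common category
minority_category = col.value_counts().idxmin()
print(minority_category)  # "a" (2 of 5 rows)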

@dsherry
Contributor

dsherry commented Mar 16, 2021

@angela97lin sounds good RE default behavior. Another nice-to-have: the ability to override that default behavior via the component params.

@angela97lin
Contributor

@dsherry If I'm understanding correctly, since we're updating the default value of drop (a parameter), users will have the ability to override this by setting the component parameter manually?

@angela97lin changed the title from "One Hot Encoder: Drop one redundant feature by default" to "One Hot Encoder: Drop one redundant feature by default for binary cases" on Mar 17, 2021
@angela97lin changed the title from "One Hot Encoder: Drop one redundant feature by default for binary cases" to "One Hot Encoder: Drop one redundant feature by default for features with two categories" on Mar 17, 2021
@angela97lin
Contributor

angela97lin commented Mar 17, 2021

Dug around to see what was necessary to implement this. In particular, I was curious how difficult it would be to always remove the minority class in the binary case.

The result of that digging is:

  • With scikit-learn, it is quite difficult to select which category to remove. From the documentation, this seems feasible via the array option for the drop parameter (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). However, after trying it out, the drop array requires that a category to drop is specified for every single column; there is no way to drop for only some columns. Hence the following, which tries to drop a category for column 0 while passing placeholders for columns 1 and 2, errors:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'col_1': ["a", "b", "b", "a", "b"],
                  'col_2': ["a", "b", "a", "c", "b"],
                  'col_3': ["a", "a", "a", "a", "a"]})

indices_to_drop = np.array([0, None, None])

ohe = OneHotEncoder(drop=indices_to_drop)
ohe.fit(X)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-a099fa2fc4a7> in <module>
----> 1 ohe.fit(X)

~/Desktop/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in fit(self, X, y)
    417         self._fit(X, handle_unknown=self.handle_unknown,
    418                   force_all_finite='allow-nan')
--> 419         self.drop_idx_ = self._compute_drop_idx()
    420         return self
    421 

~/Desktop/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in _compute_drop_idx(self)
    394                                 ["Category: {}, Feature: {}".format(c, v)
    395                                     for c, v in missing_drops])))
--> 396                 raise ValueError(msg)
    397             return np.array(drop_indices, dtype=object)
    398 

ValueError: The following categories were supposed to be dropped, but were not found in the training data.
Category: 0, Feature: 0
Category: 1, Feature: None
Category: 2, Feature: None

I believe this is also half of what this issue points out: scikit-learn/scikit-learn#16511

An alternative we could use to support this is to manually keep track of which columns and which values we want to drop during fitting, pass the data to scikit-learn, and then prune out the columns we marked for dropping. However, this requires some logic to map the transformed column name back to the original (feature, value) pair. (We have this logic in get_feature_names, but that connects the column names assuming nothing should be dropped...)
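A rough sketch of that alternative (plain scikit-learn with made-up column names, assuming ~0.24 behavior): let scikit-learn expand everything, record during fit which transformed columns we want gone, and prune them after transform.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"col_1": ["a", "b", "b", "a", "b"],
                  "col_2": ["a", "b", "a", "c", "b"]})

ohe = OneHotEncoder(sparse=False)
ohe.fit(X)

# Decided during fit: only the binary col_1 loses its minority category "a".
cols_to_drop = ["col_1_a"]

feature_names = ohe.get_feature_names(X.columns)
transformed = pd.DataFrame(ohe.transform(X), columns=feature_names)
pruned = transformed.drop(columns=cols_to_drop)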

All of this is to say: maybe just using scikit-learn's built-in if_binary will suffice for now, and we can file a separate issue for always dropping the minority class. Honestly, I'm also in favor of moving away from scikit-learn's OHE implementation given how much we've had to work around it.

Useful resources:
OHE doc: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
Code in scikit-learn causing inflexibility: https://github.com/scikit-learn/scikit-learn/blob/95119c13af77c76e150b753485c662b7c52a41a2/sklearn/preprocessing/_encoders.py#L338
Related issue: scikit-learn/scikit-learn#16511


For using if_binary: scikit-learn requires that handle_unknown is error. This doesn't play well with our top_n parameter, which drops everything but the top N categories, because the encoder won't know what to do with categories it never saw when it comes time to transform. As Becca noted in #830, we would have to set top_n to None for these parameters to work together.
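A minimal repro of that constraint (plain scikit-learn, assuming the ~0.24 behavior described above, where drop cannot be combined with handle_unknown="ignore"):

from sklearn.preprocessing import OneHotEncoder

try:
    OneHotEncoder(drop="if_binary", handle_unknown="ignore").fit([["a"], ["b"]])
except ValueError as err:
    # In this scikit-learn version, drop requires handle_unknown="error",
    # which is what clashes with our top_n behavior at transform time.
    print(err)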

With this in mind, maybe it is best to just roll our own impl 🤔
