One Hot Encoder: Drop one redundant feature by default for features with two categories #1936
💯% we should drop the negative-case column. If we do the OHE ourselves first, then sklearn hopefully won't expand them. As Freddy said, you can think of this as generating two columns that have perfect collinearity. There are two problems I see with expanding a binary into two columns rather than one: …
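A minimal sketch of that perfect collinearity (using pandas' `get_dummies` as a stand-in for our encoder, so the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"category": ["a", "b", "a", "a", "b"]})
ohe = pd.get_dummies(df["category"], prefix="category", dtype=int)

# Each column is exactly 1 minus the other, i.e. perfect collinearity.
assert (ohe["category_a"] + ohe["category_b"] == 1).all()
print(np.corrcoef(ohe["category_a"], ohe["category_b"])[0, 1])  # -1.0
```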
@freddyaboulton Q: The dataframe above for the OHE columns shows them as floats. Is this really true?

@rpeck Yes!

@freddyaboulton What the? That's weird. I've never seen anything but true booleans, or 0/1 integers. I wonder how the tree models actually handle this. It has a bad smell to me.
Third Law of Code: Thou Shalt Not Make == Comparisons With Floats |
(ok, unless it's with …)
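A quick check of both points above, the float64 default and the fragility of exact float equality. This is my own sketch; it assumes scikit-learn ≥ 1.2, where the dense-output parameter is spelled `sparse_output` (older releases call it `sparse`):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["a"], ["b"], ["a"]])

# The default output dtype really is float64...
print(OneHotEncoder(sparse_output=False).fit_transform(X).dtype)  # float64

# ...but the dtype parameter can request 0/1 integers instead.
print(OneHotEncoder(sparse_output=False, dtype=np.int64).fit_transform(X).dtype)  # int64

# And the "Third Law" in action: ten 0.1s don't sum to exactly 1.0.
total = sum([0.1] * 10)
print(total == 1.0)            # False (total is 0.9999999999999999)
print(np.isclose(total, 1.0))  # True -- tolerance-based comparison
```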
Hmm, I thought we were doing this! I agree we should. I thought it was just a flag we had to set in the underlying impl. |
@dsherry @freddyaboulton Looks like we have support for it via our underlying scikit-learn encoder:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html We could either go with …
@angela97lin You're right that changing the default value would suffice! I think …
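For reference, a sketch of the two `drop` options scikit-learn documents, presumably the ones being weighed above: `drop="first"` drops one column for every feature, while `drop="if_binary"` only drops a column for two-category features (assumes scikit-learn ≥ 1.2 for `sparse_output` and ≥ 1.0 for `get_feature_names_out`):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One binary feature and one three-level feature.
X = np.array([["a", "x"], ["b", "y"], ["a", "z"]])

enc_first = OneHotEncoder(drop="first", sparse_output=False).fit(X)
print(enc_first.get_feature_names_out())
# ['x0_b' 'x1_y' 'x1_z'] -- one column dropped from every feature

enc_if_binary = OneHotEncoder(drop="if_binary", sparse_output=False).fit(X)
print(enc_if_binary.get_feature_names_out())
# ['x0_b' 'x1_x' 'x1_y' 'x1_z'] -- only the binary feature loses a column
```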
Was reading into this a bit and found this link: https://inmachineswetrust.com/posts/drop-first-columns/ Key takeaways: …
RE @rpeck's first comment: "Like other forms of feature collinearity it messes up lots of things in interpretability, because the effect of the one original source column is divided between the two OHE columns. Freddy's new SHAP rollups address this, obviously, but things like Feature Importance and Partial Dependence plots will still have the issue."

This makes sense for binary cases, but in the case where we have multiple categories, dropping one column will still have this issue. Perhaps we shouldn't do this by default, but should update … Alas, I don't have a strong enough grasp of the underlying math to make the judgement, so I'd love to hear your thoughts, @freddyaboulton @rpeck @dsherry
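To make the multi-category case concrete, a small rank check (my own illustration): the k one-hot columns always sum to an intercept column, so the full design matrix is rank-deficient, and dropping any one column removes that particular redundancy even though the interpretability concern above remains:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["a"], ["b"], ["c"], ["a"]])
ohe = OneHotEncoder(sparse_output=False).fit_transform(X)
print(ohe.sum(axis=1))  # [1. 1. 1. 1.] -- the columns sum to an intercept column

# With an intercept, all k dummy columns are linearly dependent: 4 columns, rank 3.
full = np.column_stack([np.ones(len(X)), ohe])
print(np.linalg.matrix_rank(full))  # 3

# Dropping any one column restores full column rank: 3 columns, rank 3.
dropped = np.column_stack([np.ones(len(X)), ohe[:, 1:]])
print(np.linalg.matrix_rank(dropped))  # 3
```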
Post-discussion with @freddyaboulton @rpeck @dsherry @chukarsten @jeremyliweishih
@angela97lin sounds good RE default behavior. Another nice-to-have: the ability to override that default behavior via the component params
@dsherry If I'm understanding correctly, since we're updating the default value of …
Dug around to see what was necessary to implement this. In particular, I was curious how difficult it would be to always remove the minority class in the binary case. The result of that digging is:
I believe this is also half of what this issue points out: scikit-learn/scikit-learn#16511

An alternative we can do to support this is to manually keep track of which columns and values we want to drop during fitting: pass the data to scikit-learn, then prune out the columns we stored and specified we wanted to drop. However, this requires some logic to determine the original (feature, value) from the transformed column name. (We have this logic in …)

All of this is to say, maybe just using the default scikit-learn … Useful resources: For using …

With this in mind, maybe it is best to just roll our own impl 🤔
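A minimal sketch of that manual-pruning alternative; the names (`columns_to_drop`, the hard-coded drop decisions) are hypothetical, not EvalML's actual API:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["a", "x"], ["b", "y"], ["a", "x"]])
enc = OneHotEncoder(sparse_output=False).fit(X)
names = enc.get_feature_names_out(["color", "shape"])

# Hypothetical fit-time decision: drop the minority value of each
# binary feature ("b" and "y" each appear once here).
columns_to_drop = {"color_b", "shape_y"}

keep = [i for i, name in enumerate(names) if name not in columns_to_drop]
pruned = enc.transform(X)[:, keep]
print([names[i] for i in keep])  # ['color_a', 'shape_x']
print(pruned.shape)              # (3, 2)
```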
Our one hot encoder creates a feature for every level of the original categorical feature:
The `category_a` and `category_b` columns are completely collinear, which makes one redundant. This could have adverse effects on estimator fitting. I think we should drop one by default.

FYI @rpeck
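An illustrative reconstruction of that expansion (using scikit-learn's encoder as a stand-in for our component; actual column names and output may differ):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["a"], ["b"], ["a"], ["a"], ["b"]])
enc = OneHotEncoder(sparse_output=False).fit(X)
print(enc.get_feature_names_out(["category"]))  # ['category_a' 'category_b']
print(enc.transform(X))
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]
#  [1. 0.]
#  [0. 1.]]
```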