One Hot Encoder: Drop one redundant feature by default for features with two categories #1997
Codecov Report
@@           Coverage Diff            @@
##             main    #1997    +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%
=========================================
  Files         274      274
  Lines       22325    22360    +35
=========================================
+ Hits        22319    22354    +35
  Misses          6        6
evalml/pipelines/components/transformers/encoders/onehot_encoder.py
@angela97lin Nice! This was a surprisingly tricky one. I think this looks good. I have some minor non-blocking comments and a question on test_ohe_column_names_unique
Nice! LGTM
Closes #1936
Logic for our own custom dropping for binary features (features with only two categories):
- `is_binary`: determine which features have two categories and, for each, which category is the majority category that should be dropped. Store this information. Pass `None` to the scikit-learn OHE so that scikit-learn does not drop anything itself.
- `get_feature_names`: find the name of the column that we should drop. Ex: col "original" with majority category "a" might have a transformed name of "original_a"; store "original_a" so we know to drop this column. Before returning, drop all columns that have been specified.

Impl notes:
- `_get_feature_names` is separated from `get_feature_names`: `_get_feature_names` reflects the columns before binary features are dropped and is meant for our private implementation, while `get_feature_names` is user-facing and returns the expected columns (without the dropped binary column).
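
For readers who want to see the dropping logic in isolation, here is a minimal standard-library sketch. It is not the code in this PR and does not call scikit-learn. The helper names (`find_binary_drops`, `all_feature_names`, `public_feature_names`), the `feature_category` naming scheme, and the tie-breaking rule are assumptions made for illustration only.

```python
from collections import Counter


def find_binary_drops(columns):
    """Map each two-category feature to the encoded column name to drop.

    `columns` maps a feature name to its list of values. The majority
    category is the one dropped; ties go to the category seen first
    (an assumption of this sketch, not something the PR states).
    """
    to_drop = {}
    for name, values in columns.items():
        counts = Counter(values)
        if len(counts) == 2:  # binary feature
            majority, _ = counts.most_common(1)[0]
            to_drop[name] = f"{name}_{majority}"
    return to_drop


def all_feature_names(columns):
    """Private-style view: every encoded column, before any dropping."""
    names = []
    for name, values in columns.items():
        for category in sorted(set(values)):
            names.append(f"{name}_{category}")
    return names


def public_feature_names(columns):
    """User-facing view: encoded columns minus the dropped binary ones."""
    dropped = set(find_binary_drops(columns).values())
    return [n for n in all_feature_names(columns) if n not in dropped]


# Hypothetical data: "original" is binary, "color" has three categories.
data = {
    "original": ["a", "a", "b", "a"],
    "color": ["red", "green", "blue", "red"],
}
print(find_binary_drops(data))
print(all_feature_names(data))
print(public_feature_names(data))
```

With this data, only `original_a` is removed from the user-facing names. The three `color_*` columns are kept because `color` is not binary.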