One Hot Encoder: Drop one redundant feature by default for features with two categories #1997

Merged: 16 commits into main from 1936_ohe on Mar 23, 2021

@angela97lin angela97lin commented Mar 19, 2021

Closes #1936

Logic for our custom dropping of binary features (features with only two categories):

  1. Fit. If is_binary, determine which features have exactly two categories and, for each, which category is the majority one; that category's column is the one to drop. Store this information. Pass drop=None to the scikit-learn OHE so that scikit-learn itself drops nothing.
  2. Transform. Get the output array from scikit-learn. In get_feature_names, find the name of each column we should drop. Ex: column "original" with majority category "a" might have the transformed name "original_a"; store "original_a" so we know to drop that column. Before returning, drop all columns that were stored this way.
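
The two steps above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function names `fit_binary_aware_ohe` and `transform_binary_aware_ohe` are hypothetical, and the `"{col}_{cat}"` naming is a simplifying assumption (it does not handle name collisions).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def fit_binary_aware_ohe(X: pd.DataFrame):
    """Fit a plain scikit-learn encoder and record which columns to drop."""
    # drop=None: scikit-learn keeps every category; the dropping is done by us.
    encoder = OneHotEncoder(drop=None)
    encoder.fit(X)

    names, to_drop = [], []
    for col, categories in zip(X.columns, encoder.categories_):
        # Assumed naming scheme for encoded columns: "<column>_<category>".
        names.extend(f"{col}_{cat}" for cat in categories)
        if len(categories) == 2:
            # For a two-category feature, mark the majority category's column.
            majority = X[col].value_counts().idxmax()
            to_drop.append(f"{col}_{majority}")
    return encoder, names, to_drop


def transform_binary_aware_ohe(encoder, names, to_drop, X: pd.DataFrame):
    """Encode with scikit-learn, then drop the recorded binary columns."""
    dense = encoder.transform(X).toarray()  # default output is a sparse matrix
    encoded = pd.DataFrame(dense, columns=names, index=X.index)
    return encoded.drop(columns=to_drop)


X = pd.DataFrame({
    "original": ["a", "a", "b", "a"],          # two categories, majority "a"
    "color": ["red", "green", "blue", "red"],  # three categories, kept whole
})
encoder, names, to_drop = fit_binary_aware_ohe(X)
result = transform_binary_aware_ohe(encoder, names, to_drop, X)
print(list(result.columns))
```

Only "original_a" is removed here; the three-category feature keeps all of its columns.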

Impl notes:

  • Separates _get_feature_names from get_feature_names: _get_feature_names returns the names before binary features are dropped and is meant for our private implementation, while get_feature_names is user-facing and returns the expected columns (without the dropped binary column).
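
A rough sketch of that private/public split, under stated assumptions: `BinaryAwareNames` and its constructor arguments are hypothetical stand-ins for state learned during fit, and only the two method names come from this PR.

```python
class BinaryAwareNames:
    """Hypothetical helper illustrating the split; not the evalml component."""

    def __init__(self, categories, majority):
        # categories: {feature: [category, ...]} as learned during fit
        # majority: {feature: majority category}, two-category features only
        self._categories = categories
        self._majority = majority

    def _get_feature_names(self):
        """Private: every encoded column name, before binary dropping."""
        return [
            f"{col}_{cat}"
            for col, cats in self._categories.items()
            for cat in cats
        ]

    def get_feature_names(self):
        """User-facing: names with the dropped binary columns removed."""
        dropped = {f"{col}_{cat}" for col, cat in self._majority.items()}
        return [n for n in self._get_feature_names() if n not in dropped]


names = BinaryAwareNames(
    categories={"original": ["a", "b"], "color": ["blue", "green", "red"]},
    majority={"original": "a"},
)
print(names._get_feature_names())
print(names.get_feature_names())
```

Keeping the full list private lets the transform step label the raw scikit-learn output before dropping, while callers only ever see the final column names.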

@angela97lin angela97lin self-assigned this Mar 19, 2021
@angela97lin angela97lin changed the title from "One Hot Encoder: Drop one redundant feature by default for features with two categories #1993" to "One Hot Encoder: Drop one redundant feature by default for features with two categories" Mar 19, 2021
codecov bot commented Mar 21, 2021

Codecov Report

Merging #1997 (b199d28) into main (2cbfa34) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1997     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         274      274             
  Lines       22325    22360     +35     
=========================================
+ Hits        22319    22354     +35     
  Misses          6        6             

Impacted Files                                           Coverage Δ
evalml/tests/component_tests/test_components.py          100.0% <ø> (ø)
...components/transformers/encoders/onehot_encoder.py    100.0% <100.0%> (ø)
...alml/tests/component_tests/test_one_hot_encoder.py    100.0% <100.0%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py            100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2cbfa34...b199d28.

@freddyaboulton freddyaboulton left a comment

@angela97lin Nice! This was a surprisingly tricky one. I think this looks good. I have some minor non-blocking comments and a question on test_ohe_column_names_unique.

@bchen1116 bchen1116 left a comment

Nice! LGTM

@angela97lin angela97lin merged commit 9b1ffde into main Mar 23, 2021
@angela97lin angela97lin deleted the 1936_ohe branch March 23, 2021 03:31
@dsherry dsherry mentioned this pull request Mar 24, 2021