One Hot Encoder: Drop one redundant feature by default for features with two categories #1997

angela97lin · 2021-03-19T17:28:31Z

Closes #1936

Logic for our own custom dropping for binary features (features with only two categories):

During fit, if drop is "if_binary", determine which features have exactly two categories and, for each of them, which category is the majority and should be dropped. Store this information. Pass drop=None to the scikit-learn OHE so that scikit-learn itself does not drop anything.
During transform, get the output array from scikit-learn. While computing feature names, find the name of each column that we should drop. Ex: col "original" with majority category "a" might have a transformed name of "original_a"; store "original_a" so we know to drop this column. Before returning, drop all columns that have been specified.
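
The two steps above can be sketched in plain Python. This is only an illustration of the idea, not the code in this PR: the helper names `fit_binary_drops` and `transform` are hypothetical, and a small dictionary-based encoder stands in for the scikit-learn OHE so the example has no dependencies.

```python
from collections import Counter


def fit_binary_drops(columns):
    """For each column with exactly two categories, record its majority category.

    `columns` maps a column name to a list of category values.
    Returns (categories, to_drop), where to_drop maps column -> majority category.
    """
    categories, to_drop = {}, {}
    for name, values in columns.items():
        counts = Counter(values)
        categories[name] = sorted(counts)
        if len(counts) == 2:
            # most_common(1) returns [(category, count)] for the most frequent value
            to_drop[name] = counts.most_common(1)[0][0]
    return categories, to_drop


def transform(columns, categories, to_drop):
    """One-hot encode every category, then drop the recorded binary columns."""
    # Step 1: encode everything, as an encoder that drops nothing would.
    encoded = {
        f"{name}_{cat}": [int(v == cat) for v in columns[name]]
        for name in columns
        for cat in categories[name]
    }
    # Step 2: turn each (column, majority category) pair into its
    # transformed column name and remove those columns before returning.
    names_to_drop = {f"{name}_{cat}" for name, cat in to_drop.items()}
    return {k: v for k, v in encoded.items() if k not in names_to_drop}


# Hypothetical data: "original" is binary with majority "a"; "color" is not binary.
data = {
    "original": ["a", "a", "b", "a"],
    "color": ["red", "green", "blue", "red"],
}
categories, to_drop = fit_binary_drops(data)
result = transform(data, categories, to_drop)
```

Here only "original_a" is removed; the three "color" columns are kept because "color" has more than two categories.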

Impl notes:

Separates _get_feature_names from get_feature_names: _get_feature_names returns the names before binary features are dropped and is meant for our private implementation, while get_feature_names is user-facing and returns the expected cols (without the binary feature that is dropped).
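
A minimal sketch of that split, assuming a hypothetical class `FeatureNameSketch` that already knows its categories and the transformed names to drop (the real component derives both during fit):

```python
class FeatureNameSketch:
    """Hypothetical stand-in showing private vs. user-facing feature names."""

    def __init__(self, categories, names_to_drop):
        # categories: dict mapping column name -> list of category values
        # names_to_drop: transformed names of the binary columns to remove
        self._categories = categories
        self._names_to_drop = set(names_to_drop)

    def _get_feature_names(self):
        # Private: every encoded column, before any binary column is dropped.
        return [
            f"{col}_{cat}"
            for col, cats in self._categories.items()
            for cat in cats
        ]

    def get_feature_names(self):
        # User-facing: the same names minus the dropped binary columns.
        return [
            name
            for name in self._get_feature_names()
            if name not in self._names_to_drop
        ]


sketch = FeatureNameSketch(
    categories={"original": ["a", "b"], "color": ["blue", "green", "red"]},
    names_to_drop=["original_a"],
)
all_names = sketch._get_feature_names()
public_names = sketch.get_feature_names()
```

Keeping the full list private lets transform line names up with the raw encoder output, while callers only ever see the columns that are actually returned.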

codecov · 2021-03-21T08:17:32Z

Codecov Report

Merging #1997 (b199d28) into main (2cbfa34) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1997     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         274      274             
  Lines       22325    22360     +35     
=========================================
+ Hits        22319    22354     +35     
  Misses          6        6

| Impacted Files | Coverage Δ |
|---|---|
| evalml/tests/component_tests/test_components.py | `100.0% <ø> (ø)` |
| ...components/transformers/encoders/onehot_encoder.py | `100.0% <100.0%> (ø)` |
| ...alml/tests/component_tests/test_one_hot_encoder.py | `100.0% <100.0%> (ø)` |
| evalml/tests/pipeline_tests/test_pipelines.py | `100.0% <100.0%> (ø)` |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2cbfa34...b199d28.

evalml/pipelines/components/transformers/encoders/onehot_encoder.py

freddyaboulton

@angela97lin Nice! This was a surprisingly tricky one. I think this looks good. I have some minor non-blocking comments and a question on test_ohe_column_names_unique.

evalml/pipelines/components/transformers/encoders/onehot_encoder.py

evalml/tests/component_tests/test_one_hot_encoder.py

bchen1116

Nice! LGTM

evalml/pipelines/components/transformers/encoders/onehot_encoder.py

angela97lin added 3 commits March 17, 2021 15:53

init

bcaf44b

hmmm doesn't quite work

8e55096

Merge branch 'main' into 1936_ohe

6c7632e

angela97lin self-assigned this Mar 19, 2021

angela97lin changed the title ~~One Hot Encoder: Drop one redundant feature by default for features with two categories #1993~~ One Hot Encoder: Drop one redundant feature by default for features with two categories Mar 19, 2021

angela97lin added 6 commits March 19, 2021 17:18

Merge branch 'main' into 1936_ohe

0e5737f

WIP

768c1f2

Merge branch '1936_ohe' of github.com:alteryx/evalml into 1936_ohe

b107388

clean up impl and tests

55243d7

release notes

81c65e6

fix tests

7e6e656

angela97lin commented Mar 21, 2021

evalml/pipelines/components/transformers/encoders/onehot_encoder.py (outdated)

angela97lin added 6 commits March 21, 2021 16:58

update attributes to private

02833de

linting

8abdb9f

split out helper

0951611

fix tests with provenance

597ae41

add test for get_feature_names

dd63d78

add test for top_n and if_binary

fdf579d

angela97lin marked this pull request as ready for review March 22, 2021 17:21

angela97lin requested review from dsherry, freddyaboulton, bchen1116, chukarsten, jeremyliweishih and ParthivNaresh March 22, 2021 17:21

freddyaboulton approved these changes Mar 22, 2021

bchen1116 approved these changes Mar 22, 2021

evalml/pipelines/components/transformers/encoders/onehot_encoder.py

clean up

b199d28

angela97lin merged commit 9b1ffde into main Mar 23, 2021

angela97lin deleted the 1936_ohe branch March 23, 2021 03:31

dsherry mentioned this pull request Mar 24, 2021

Release v0.21.0 #2029

Merged
