Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: having exhaustive feature selector take a custom iterator as input #834

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

jonathan-taylor
Copy link

Code of Conduct

A working candidate for #833

Description

Related issues or pull requests

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

@pep8speaks
Copy link

pep8speaks commented Jun 22, 2021

Hello @jonathan-taylor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 289:80: E501 line too long (97 > 79 characters)

Line 119:1: W293 blank line contains whitespace
Line 132:5: E303 too many blank lines (2)

Line 29:1: E302 expected 2 blank lines, found 1
Line 38:37: W291 trailing whitespace
Line 39:46: W291 trailing whitespace
Line 44:70: W291 trailing whitespace
Line 216:1: W293 blank line contains whitespace
Line 260:80: E501 line too long (80 > 79 characters)
Line 261:80: E501 line too long (82 > 79 characters)
Line 263:80: E501 line too long (85 > 79 characters)
Line 264:80: E501 line too long (86 > 79 characters)
Line 414:80: E501 line too long (86 > 79 characters)
Line 424:43: W291 trailing whitespace
Line 446:44: W291 trailing whitespace
Line 501:1: W293 blank line contains whitespace
Line 522:1: W391 blank line at end of file

Line 6:2: W291 trailing whitespace
Line 16:22: E128 continuation line under-indented for visual indent
Line 17:22: E128 continuation line under-indented for visual indent
Line 18:22: E128 continuation line under-indented for visual indent
Line 20:1: E302 expected 2 blank lines, found 1
Line 32:40: W291 trailing whitespace
Line 45:1: W293 blank line contains whitespace
Line 68:80: E501 line too long (80 > 79 characters)
Line 69:80: E501 line too long (84 > 79 characters)
Line 75:78: W291 trailing whitespace
Line 84:80: E501 line too long (81 > 79 characters)
Line 131:13: E741 ambiguous variable name 'l'
Line 151:80: E501 line too long (98 > 79 characters)
Line 155:80: E501 line too long (89 > 79 characters)
Line 158:1: W293 blank line contains whitespace
Line 172:57: W291 trailing whitespace
Line 192:1: W293 blank line contains whitespace
Line 199:44: W291 trailing whitespace
Line 205:1: W293 blank line contains whitespace
Line 240:80: E501 line too long (80 > 79 characters)
Line 241:80: E501 line too long (84 > 79 characters)
Line 247:78: W291 trailing whitespace
Line 256:80: E501 line too long (81 > 79 characters)
Line 270:1: W293 blank line contains whitespace
Line 277:71: W291 trailing whitespace
Line 281:72: W291 trailing whitespace
Line 285:1: W293 blank line contains whitespace
Line 295:57: W291 trailing whitespace
Line 308:43: E261 at least two spaces before inline comment
Line 309:80: E501 line too long (99 > 79 characters)
Line 310:65: E128 continuation line under-indented for visual indent
Line 310:80: E501 line too long (112 > 79 characters)
Line 314:43: E261 at least two spaces before inline comment
Line 315:80: E501 line too long (96 > 79 characters)
Line 316:68: E128 continuation line under-indented for visual indent
Line 316:80: E501 line too long (115 > 79 characters)
Line 326:1: W293 blank line contains whitespace
Line 355:80: E501 line too long (80 > 79 characters)
Line 356:80: E501 line too long (84 > 79 characters)
Line 364:78: W291 trailing whitespace
Line 373:80: E501 line too long (81 > 79 characters)
Line 379:54: W291 trailing whitespace
Line 407:80: E501 line too long (93 > 79 characters)
Line 449:80: E501 line too long (80 > 79 characters)
Line 450:80: E501 line too long (84 > 79 characters)
Line 458:78: W291 trailing whitespace
Line 467:80: E501 line too long (81 > 79 characters)
Line 473:54: W291 trailing whitespace
Line 501:80: E501 line too long (93 > 79 characters)
Line 513:1: W293 blank line contains whitespace
Line 516:1: E303 too many blank lines (3)
Line 537:80: E501 line too long (80 > 79 characters)
Line 543:74: W291 trailing whitespace
Line 558:50: W291 trailing whitespace
Line 568:47: W291 trailing whitespace
Line 588:1: W293 blank line contains whitespace
Line 612:1: E302 expected 2 blank lines, found 1
Line 617:40: W291 trailing whitespace
Line 635:1: E302 expected 2 blank lines, found 1
Line 641:40: W291 trailing whitespace
Line 665:80: E501 line too long (94 > 79 characters)
Line 668:1: W293 blank line contains whitespace
Line 669:1: E302 expected 2 blank lines, found 1
Line 682:33: E261 at least two spaces before inline comment
Line 689:1: W293 blank line contains whitespace
Line 690:1: E302 expected 2 blank lines, found 1
Line 703:33: E261 at least two spaces before inline comment
Line 740:39: W291 trailing whitespace
Line 750:1: W293 blank line contains whitespace
Line 761:1: W293 blank line contains whitespace
Line 764:33: E261 at least two spaces before inline comment
Line 764:34: E262 inline comment should start with '# '
Line 770:39: E261 at least two spaces before inline comment

Line 16:1: W293 blank line contains whitespace
Line 17:1: E302 expected 2 blank lines, found 1
Line 25:44: E231 missing whitespace after ','
Line 56:34: E231 missing whitespace after ','
Line 56:36: E231 missing whitespace after ','
Line 66:1: E302 expected 2 blank lines, found 1
Line 70:8: E231 missing whitespace after ','
Line 76:44: E231 missing whitespace after ','
Line 89:8: E231 missing whitespace after ','
Line 95:53: E231 missing whitespace after ','
Line 96:55: E231 missing whitespace after ','
Line 103:1: E302 expected 2 blank lines, found 1
Line 110:1: W293 blank line contains whitespace
Line 112:8: E231 missing whitespace after ','
Line 118:53: E231 missing whitespace after ','
Line 119:55: E231 missing whitespace after ','
Line 138:1: W293 blank line contains whitespace
Line 139:1: E302 expected 2 blank lines, found 1
Line 143:8: E231 missing whitespace after ','
Line 152:57: E231 missing whitespace after ','
Line 153:59: E231 missing whitespace after ','
Line 154:80: E501 line too long (81 > 79 characters)
Line 161:1: E302 expected 2 blank lines, found 1
Line 175:59: E231 missing whitespace after ','
Line 176:61: E231 missing whitespace after ','
Line 177:1: W293 blank line contains whitespace
Line 185:1: E302 expected 2 blank lines, found 1
Line 193:69: W291 trailing whitespace
Line 194:1: W293 blank line contains whitespace
Line 195:1: E302 expected 2 blank lines, found 1
Line 200:80: E501 line too long (112 > 79 characters)
Line 208:76: W291 trailing whitespace
Line 209:44: W291 trailing whitespace
Line 217:1: W293 blank line contains whitespace
Line 241:15: E231 missing whitespace after ','
Line 243:1: W293 blank line contains whitespace
Line 247:1: E302 expected 2 blank lines, found 1
Line 253:80: E501 line too long (109 > 79 characters)
Line 261:75: W291 trailing whitespace
Line 262:47: W291 trailing whitespace
Line 271:1: W293 blank line contains whitespace
Line 294:15: E231 missing whitespace after ','
Line 296:1: W293 blank line contains whitespace
Line 300:1: E302 expected 2 blank lines, found 1
Line 306:80: E501 line too long (109 > 79 characters)
Line 314:75: W291 trailing whitespace
Line 315:47: W291 trailing whitespace
Line 324:1: W293 blank line contains whitespace
Line 347:15: E231 missing whitespace after ','
Line 349:1: W293 blank line contains whitespace
Line 353:1: E302 expected 2 blank lines, found 1
Line 358:80: E501 line too long (121 > 79 characters)
Line 366:46: W291 trailing whitespace
Line 375:1: W293 blank line contains whitespace
Line 411:15: E231 missing whitespace after ','
Line 413:1: W293 blank line contains whitespace
Line 417:1: W293 blank line contains whitespace
Line 417:1: W391 blank line at end of file

Comment last updated at 2021-11-29 18:26:37 UTC

@jonathan-taylor
Copy link
Author

Obviously some linting needed, but have generalized the sequential / exhaustive feature selector to allow for categorical and custom search strategies. The previous searches default to special cases now. Post-processing of search results still need implementing.

from sklearn.exceptions import NotFittedError


class Column(NamedTuple):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is currently no type checking implemented in mlxtend, but maybe one day, so why not being proactive (or in other words, thanks for adhering to potential future best practices :)).


strategy = min_max(X,
max_features=4,
fixed_features=[2,3])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some pep8 issues, but we can fix them later, no worries.

@jonathan-taylor
Copy link
Author

jonathan-taylor commented Oct 6, 2021 via email

@rasbt
Copy link
Owner

rasbt commented Oct 6, 2021

Thanks a lot for the PR, this is very exciting!

Big picture-wise, there are a few thoughts.

  1. What do we do with the existing ExhaustiveFeatureSelector and SequentialFeatureSelector? We could deprecate them, that is, remove them from the documentation but leave them in the code for a few versions / years.

  2. If we do deprecate the existing SFS, two missing features would be floating-forward and floating-backward. I think right now, via

    for direction in ['forward', 'backward', 'both']:
         strategy = step(X,
                         direction=direction,
                         max_features=p,
                         fixed_features=[2,3],
                         categorical_features=categorical_features)

it only supports the standard forward and backward. I assume that 'both' means it runs forward first and finds the best set. Then it runs backward (independently) to find the best set. Then, the best set is determined by comparing the results from forward and backward? This is actually a neat addition.

  1. If we deprecate and add the floating variants, I think the only thing that we need to ensure is that it still remains compatible with scikit-learn pipelines and maybe GridSearchCV.

Amazing work, though. What you put together here is really exciting and impressive!

@rasbt
Copy link
Owner

rasbt commented Oct 6, 2021

I think EFS is straightforward, but SFS I have to understand the "floating" logic a little better -- I gather this is just "search both forward and backward" but somehow in self.subsets_​ we only keep a given model of a fixed size -- is this meant to be the best model of a fixed size? Or maybe I still have the logic wrong.

Oh, I should have described this a bit better in the docs. In the floating forward variant, it allows you to delete a previously selected feature if it improves performance. In the floating backward variant, it allows you to add back a previously deleted feature.

Maybe it is easier to explain with a toy example.

Say we have a feature set [0, 1, 3, 4, 5] and we do sequential "floating" forward selection.

Round 1

Starting with the empty set [], we check all features and find that feature 1 is best:

  • Select feature 1 to be added to [ ]

  • Now we have the optional floating step where we can optionally delete a previously selected feature. Here, this does not apply because we only have the currently selected feature.

  • Return feature set [1]

Round 2

Starting with the set [1], we check all features and find that feature 4 is best:

  • Select feature 4 to be added to [1]

  • Optional floating step. Here, we can try removing 1 but this can be skipped, because it would be the same as selecting the 4 in the first place in the previous round. We can try removing it for simplicity though.

  • Return feature set [1, 4]

Round 3

  • Select feature 2 to be added to [1, 4].

  • Optional floating step. Here, we can try removing 1 or 4. Removing 4 can technically be skipped, because it would be the same as selecting the 2 in the first place in the previous round. We can try removing it for simplicity though. I.e., we try [1, 2], [2, 4].

  • return [1, 2, 4] (assuming it is better than [1, 2], [2, 4])

Round 4

  • Select feature 3 to be added to [1, 2, 4].

  • Optional floating step. Here, we can try removing 1, 2, or 4. We try [1, 2, 3], [1, 3, 4], [2, 3, 4].

  • return [1, 2, 3] (assuming it is better than [1, 3, 4], [2, 3, 4], [1, 2, 3, 4])

Note that we would not have selected [1, 2, 3] with regular forward selection. With regular forward selection, we would have had

[ ] -> [1] -> [1, 4] -> [1, 2, 4]

Or in other words, with regular forward selection, any feature set of length 3 would have included the value 4. However, in combination with other features, 4 may not be useful.

So, in practice, the floating forward variant allows us to test a few more combinations than the regular forward variant, without being as expensive as the exhaustive feature selection of course.

The floating backward variant works similar to the floating forward variant, except we are allowed to add a previously removed feature back.

@jonathan-taylor
Copy link
Author

jonathan-taylor commented Oct 6, 2021 via email

@rasbt
Copy link
Owner

rasbt commented Oct 6, 2021

Yeah, I think with the new version it will be way easier to track.

Regarding "both" there is a difference between floating-forward and floating-backward though. They can both give you different end results. I think "both" is currently doing floating-forward?

@rasbt
Copy link
Owner

rasbt commented Nov 2, 2021

Thanks for the updates. Actually, some of the failing tests are related to sklearn 1.0. Due to bug fixes or changes, their have been very slight changes in some of the accuracy values, which will cause values when comparing the results on 3 positions after the decimal point. I am currently looking into this and will update the tests

@jonathan-taylor
Copy link
Author

jonathan-taylor commented Nov 2, 2021 via email

@rasbt
Copy link
Owner

rasbt commented Nov 2, 2021

I am not sure how many people use the floating versions. One option could be to deprecate/remove them. We could maybe add a module mlxtend.legacy for that where we can keep the old sequential feature selection code for a while. However, I think think floating variants can be quite powerful. Just looking at the paper again (attached), here is a comparison of regular forward and backward selection compared to an exhaustive ("optimal") method:
Screen Shot 2021-11-02 at 1 52 21 PM
1-s2.0-0167865594901279-main.pdf

Here it is compared to the floating version:

![Screen Shot 2021-11-02 at 1 52 06 PM](https://user-
Screen Shot 2021-11-02 at 1 54 56 PM
images.githubusercontent.com/5618407/139927438-77579a31-6f2a-48ff-8820-8deecd7eb76e.png)

So I would say having SFFS (floating forward) and SFBS (floating backward) is useful. But I can understand if it is too much work in terms of rewriting all the code so having only one of the two is probably sufficient (given that performance is very close anyway)

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@rasbt
Copy link
Owner

rasbt commented Nov 29, 2021

Finally got around fixing the CI. Then, I was trying to rebase this PR but since it is not in a feature branch but in the master branch itself, I couldn't figure out how. I manually applied all the changes to this PR, and it appears I created quite a mess

@rasbt
Copy link
Owner

rasbt commented Nov 29, 2021

Actually not sure why the workflow CI doesn't run. It worked fine in another PR that I just closed a few minutes ago. Maybe it would be easiest to just close this PR, and make a new PR with the 4 files you had initially submitted, i.e,

  • columns.py
  • generic_selector.py
  • strategy.py
  • test_generic_selector.py

if you could do this in a separate feature branch instead of master, I think this would make things easier. Let me know if I can help with anything

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants