
[WIP] ENH: SPIDER Sampling Algorithm #603

Open · wants to merge 24 commits into base: master
Conversation

@MattEding (Contributor) commented Sep 19, 2019

As requested in the pinned New Methods issue, I have implemented the Selective Pre-processing of Imbalanced Data (SPIDER) sampling algorithm.

Since the algorithm is deterministic, I developed unit tests by drawing out a sample dataset and working out the expected results by hand. See the linked notebook for diagrams.

Currently, the dense and sparse implementations return the same data points but in different orders (np.lexsort shows they are indeed the same). Consequently, this fails the existing unit test that compares dense vs. sparse outputs. I would rather not require sorting the results to make that test pass, given the overhead of sorting large datasets. Perhaps a parameter defaulting to sort=True could give users the option to bypass this issue.
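The np.lexsort check mentioned above can be sketched as follows (a minimal illustration with made-up data; the helper name is mine, not from the PR):

```python
import numpy as np

def rows_equal_ignoring_order(a, b):
    """True if 2-D arrays a and b contain the same rows, in any order."""
    if a.shape != b.shape:
        return False
    # np.lexsort treats the *last* key as primary, so reverse the columns
    # to sort rows lexicographically by the first column, then the second, ...
    a_sorted = a[np.lexsort(a.T[::-1])]
    b_sorted = b[np.lexsort(b.T[::-1])]
    return np.array_equal(a_sorted, b_sorted)

dense = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
shuffled = dense[[2, 0, 1]]  # same rows, different order
print(rows_equal_ignoring_order(dense, shuffled))  # True
```

The lexsort here is exactly the overhead referred to above, which is why an opt-out parameter could make sense for large datasets.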

The only other unit test I saw fail with pytest is a with raises(MESSAGE) check, but I will fix that later, since I am not sure whether other developers will want to keep my new sample_type = 'preprocess-strategy'. I chose to do this because SPIDER does both cleaning and oversampling, and inheriting from either did not feel appropriate: oversamplers allow sampling_strategy as a float, which does not make sense for this algorithm, while cleaners only allow under-sampling strategies, which is equally inappropriate.

Benchmark results, using cross-validation with the mean of 5 folds and comparing None, NCR, SMOTE, and the 3 SPIDER variants, are in the PKL, CSV, and PNG folders. I used the Zenodo datasets (excluding set 26, which is too large for my local computer to process in a time-effective manner). I set n_jobs=-1 everywhere on my 4-core MacBook, in case you want to interpret the time values. Additionally, when a scorer was undefined, I left the value as 0 rather than changing it to NaN.

TODO:

  • Resolve the two failing tests described above
  • Add an online reference explaining the algorithm
  • Possibly re-categorize the algorithm
  • Maybe change the self._{xyz} attributes into variables passed explicitly into methods, to avoid possible multiprocessing mutation complications
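The last TODO item can be illustrated with a toy sketch (the class and method names are hypothetical, not from the PR): helpers that mutate self._{xyz} attributes risk stale or duplicated state when joblib forks workers, whereas helpers that take inputs and return results keep the state flow explicit.

```python
class MutatingStyle:
    """Helpers communicate through attributes mutated on self."""
    def process(self, data):
        self._acc = []                # shared mutable state on the instance
        for item in data:
            self._step(item)          # mutates self._acc as a side effect
        return self._acc

    def _step(self, item):
        self._acc.append(item * 2)


class ExplicitStyle:
    """Helpers receive arguments and return values; nothing on self mutates."""
    def process(self, data):
        return [self._step(item) for item in data]

    @staticmethod
    def _step(item):
        return item * 2


print(MutatingStyle().process([1, 2, 3]))  # [2, 4, 6]
print(ExplicitStyle().process([1, 2, 3]))  # [2, 4, 6]
```

Both produce the same result here, but only the explicit style stays safe if the per-item work were ever dispatched across processes.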

@pep8speaks commented Sep 19, 2019

Hello @MattEding! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-22 00:59:05 UTC

@MattEding MattEding changed the title [MRG] ENH: SPIDER Sampling Algorithm [WIP] ENH: SPIDER Sampling Algorithm Sep 19, 2019
@chkoar (Member) commented Sep 19, 2019

Thanks Matt! That would be a nice addition!

@MattEding (Contributor, Author)

Here is a rough draft of online-documentation illustrations showing how the SPIDER algorithm works for arbitrary X and y values. The generated plots work as planned; I just need to add written text discussing how to interpret the graphs. It also needs to be moved from .ipynb to a .py file that Sphinx can understand.
https://github.com/MattEding/SPIDER-Algorithm/blob/master/SPIDER-Formulation.ipynb

@codecov (bot) commented Sep 30, 2019

Codecov Report

Merging #603 into master will decrease coverage by 0.02%.
The diff coverage is 97.92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #603      +/-   ##
==========================================
- Coverage   98.46%   98.43%   -0.03%     
==========================================
  Files          82       86       +4     
  Lines        4886     5062     +176     
==========================================
+ Hits         4811     4983     +172     
- Misses         75       79       +4
Impacted Files Coverage Δ
imblearn/utils/_validation.py 100% <100%> (ø) ⬆️
imblearn/combine/_preprocess/base.py 100% <100%> (ø)
imblearn/combine/_preprocess/__init__.py 100% <100%> (ø)
imblearn/utils/tests/test_validation.py 100% <100%> (ø) ⬆️
imblearn/combine/tests/test_spider.py 100% <100%> (ø)
imblearn/combine/__init__.py 100% <100%> (ø) ⬆️
imblearn/combine/_preprocess/_spider.py 96.39% <96.39%> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3839df1...53dfc4e. Read the comment docs.

@MattEding (Contributor, Author)

With Travis CI DISTRIB="ubuntu", it appears that this build is using TensorFlow 2.0, as the unit tests are failing with AttributeError: module 'tensorflow' has no attribute 'placeholder'.

@glemaitre (Member)

I would suggest moving this implementation into smote_variants. The idea behind this move is to benchmark the smote variants on a common benchmark on a large number of datasets and include in imbalanced-learn only the versions that show an advantage. You can see the discussion and contribute to it: https://github.com/gykovacs/smote_variants/issues/14

@MattEding, would this strategy be fine with you?

@chkoar (Member) commented Nov 17, 2019

I believe that SPIDER would be a great addition to the imblearn package. I need to find some time to thoroughly review it. @MattEding, thanks for your patience.

@MattEding (Contributor, Author)

@glemaitre I understand the motivation for not wanting to include it until it has a more proven track record. My only reservation about trying to include it in smote_variants is that SPIDER is unrelated to SMOTE. One of its motivations was to avoid creating synthetic samples:

"Random introduction of synthetic examples by SMOTE may be questionable or difficult to justify in some domains, where it is important to preserve a link between original data and a constructed classifier (e.g., to justify suggested decisions)."
Stefanowski and Wilk, 2008. Selective Pre-processing of Imbalanced Data for Improving Classification Performance.

As a side note, you may want to revisit New Methods inviting people to contribute new sampling techniques by specifying that contributions may need to go through other repositories first.

@glemaitre (Member)

My only reservation about trying to include it in smote_variants is that SPIDER is unrelated to SMOTE.

This is a really good point; I misread it when reviewing the different PRs :).

My proposal was a bit too quick, and @chkoar also has some concerns regarding it. We probably need to open an issue to have a public discussion about it.

So, as a summary, let's go ahead with the inclusion of this method. I should have a bit of time to review it in the coming days. @MattEding, could you resolve the conflict?

@lgtm-com (bot) commented Nov 17, 2019

This pull request introduces 4 alerts when merging 504d601 into 9b31677 - view on LGTM.com

new alerts:

  • 2 for Unused local variable
  • 1 for Unused import
  • 1 for Syntax error

@lgtm-com (bot) commented Nov 19, 2019

This pull request introduces 3 alerts when merging 2517bd9 into f356284 - view on LGTM.com

new alerts:

  • 2 for Unused local variable
  • 1 for Unused import

@lgtm-com (bot) commented Nov 19, 2019

This pull request fixes 3 alerts when merging a9cd91c into f356284 - view on LGTM.com

fixed alerts:

  • 2 for Mismatch in multiple assignment
  • 1 for Wrong name for an argument in a call

@lgtm-com (bot) commented Nov 20, 2019

This pull request introduces 1 alert when merging a8d21e0 into 9a1191e - view on LGTM.com

new alerts:

  • 1 for Wrong name for an argument in a class instantiation

@MattEding (Contributor, Author)

I am unfamiliar with how Codecov decides which lines are poorly covered. Since try-except blocks are normal control flow in Python (in fact, the EAFP style is considered good practice), I am curious why these are being flagged as poor coverage. For example, catching an UnboundLocalError is a lot cleaner than, say, locals().get('discard_indices', []). I am also presuming that the red flags in the helper functions are due to multiple return statements. Is this the case?
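The pattern in question might look like the following toy sketch (the function and the surrounding logic are illustrative; only the variable name follows the comment above). Note that the except branch only runs when the conditional assignment is skipped, so a coverage tool will flag it unless a test exercises both paths:

```python
def collect_discards(should_discard):
    if should_discard:
        discard_indices = [0, 2]       # only bound on this branch
    # EAFP: treat the unbound name as "nothing to discard" instead of
    # pre-initializing or reaching for locals().get('discard_indices', [])
    try:
        return discard_indices
    except UnboundLocalError:
        return []

print(collect_discards(True))   # [0, 2]
print(collect_discards(False))  # []
```

A test that calls the function with both arguments covers both return statements, which may be all that is needed to satisfy the coverage threshold.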

@nschlaffke

How are things looking? It would be nice to have this algorithm.

@MattEding (Contributor, Author)

@nschlaffke
It's been quite some time since I worked on SPIDER, but I believe I left it as a fully working algorithm: it passes the (very small) toy test dataset that I worked out manually. See examples/combine/plot_illustration_spider.py and the unit tests for details.

What still needs to be fleshed out is how it integrates with the rest of imblearn, and whether to lower the code-coverage threshold that rejects try-except clauses or to rewrite them in a different style. For example, I created a new base class for this, since it did not seem to fit the role of either over- or under-sampling. This may not be what the maintainers want, or they may want it refactored.

@glemaitre (Member)

I will try to have a look soon. I had to do some groundwork to ensure compatibility with scikit-learn 0.23, and I should be done with it.

@glemaitre (Member)

I will check when I get time. We can put a milestone for 0.9.


5 participants