n_neighbors inconsistency #601

MattEding · 2019-09-11T19:45:18Z

Description

All the following classes use n_neighbors:

ADASYN
OneSidedSelection
NeighbourhoodCleaningRule
NearMiss
AllKNN
RepeatedEditedNearestNeighbours
EditedNearestNeighbours
CondensedNearestNeighbour

Whereas k_neighbors is used with SMOTE and all its variants.

This poses a problem with duck-typing and pipelines.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import ADASYN
from imblearn.over_sampling import SMOTE

X, y = ...

smote = SMOTE()
adasyn = ADASYN()
logreg = LogisticRegression()

smote_pipe = Pipeline([('sampler', smote), ('classifier', logreg)])
adasyn_pipe = Pipeline([('sampler', adasyn), ('classifier', logreg)])

params = dict(sampler__n_neighbors=range(3, 6))
smote_grid = GridSearchCV(smote_pipe, params)
adasyn_grid = GridSearchCV(adasyn_pipe, params)

# fails due to k_neighbors instead of n_neighbors
# I am forced to make a new params dict
smote_grid.fit(X, y)

# succeeds
adasyn_grid.fit(X, y)

Expected Results

SMOTE would benefit from using n_neighbors so that the API is consistent.

Versions

Darwin-18.7.0-x86_64-i386-64bit
Python 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 14:38:56)
[Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.17.1
SciPy 1.3.1
Scikit-Learn 0.21.3
Imbalanced-Learn 0.5.0


glemaitre · 2019-09-18T17:27:41Z

I see. Could make sense. It would take 2 versions for the deprecation. However, you still have some other neighbors params in the smote variants as well. It could also be an issue.
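For illustration, a rename with a deprecation window might look roughly like the sketch below. `SketchSMOTE`, its `"deprecated"` sentinel, and `_resolve_n_neighbors` are hypothetical names made up for this example; this is not imbalanced-learn's actual code, only one common way such a cycle is handled.

```python
import warnings


class SketchSMOTE:
    """Hypothetical stand-in used only to illustrate the rename; not the real SMOTE."""

    def __init__(self, n_neighbors=5, k_neighbors="deprecated"):
        # __init__ only stores its arguments, following the scikit-learn
        # convention that keeps get_params/set_params working.
        self.n_neighbors = n_neighbors
        self.k_neighbors = k_neighbors

    def _resolve_n_neighbors(self):
        # Intended to be called at fit time: honour the old name if the
        # user still passes it, but warn that it is going away.
        if self.k_neighbors != "deprecated":
            warnings.warn(
                "'k_neighbors' is deprecated; use 'n_neighbors' instead.",
                FutureWarning,
            )
            return self.k_neighbors
        return self.n_neighbors
```

During the deprecation window both names would be accepted, so a grid keyed on `sampler__n_neighbors` would work for every sampler while existing `k_neighbors` code keeps running with a warning.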

You could always create your grid on the fly:

for pipe in [smote_pipe, adasyn_pipe]:
    neighbors_params_name = [p for p in pipe.get_params().keys() if 'neighbors' in p]
    params = {p: range(3, 6) for p in neighbors_params_name}
    gs_pipe = GridSearchCV(pipe, params)
    gs_pipe.fit(X, y)

MattEding · 2019-09-19T16:02:35Z

I would argue that the extra m_neighbors parameters in SVMSMOTE and BorderlineSMOTE have a different meaning than the n/k_neighbors found in other algorithms (and in those two classes themselves). The n/k_neighbors are used only for finding neighbors, whereas m_neighbors appears to be used for flagging samples as 'danger' or 'noise'.
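Given that distinction, a grid built by matching the substring 'neighbors' would also pick up m_neighbors. A minimal sketch of a stricter filter follows; `neighbors_grid` and `FakePipeline` are hypothetical names for this example, and the stand-in class only mimics the `get_params()` keys a pipeline wrapping BorderlineSMOTE would be expected to expose.

```python
def neighbors_grid(estimator, values, names=("n_neighbors", "k_neighbors")):
    """Build a parameter grid for n_neighbors/k_neighbors only."""
    grid = {}
    for param in estimator.get_params():
        # Pipeline parameters look like "sampler__k_neighbors";
        # compare only the part after the last "__".
        if param.rsplit("__", 1)[-1] in names:
            grid[param] = list(values)
    return grid


class FakePipeline:
    """Hypothetical stand-in so the sketch runs without imbalanced-learn."""

    def get_params(self, deep=True):
        return {
            "sampler__k_neighbors": 5,
            "sampler__m_neighbors": 10,
            "classifier__C": 1.0,
        }


grid = neighbors_grid(FakePipeline(), range(3, 6))
print(grid)
```

Because the helper relies only on `get_params()`, the same call should work on a real pipeline object, and the resulting dict can be passed to GridSearchCV as the parameter grid.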

I know this is a minor issue that has simple workarounds, but I felt that it was worth marking as an issue nonetheless.

glemaitre · 2019-11-17T11:43:33Z

We could think about modifying this in 1.X, since we will have more freedom to break the API.

MattEding · 2019-11-19T02:15:03Z

Additionally, I recently noticed that the inconsistency also occurs with self.nn_ vs self.nn_k_ for non-SMOTE and SMOTE respectively.
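Until the attribute names are unified, code that needs the fitted nearest-neighbors estimator could look it up under either name. The helper below is a sketch; `fitted_neighbors` is a hypothetical name, and the `SimpleNamespace` objects merely stand in for fitted samplers.

```python
from types import SimpleNamespace


def fitted_neighbors(sampler, names=("nn_", "nn_k_")):
    """Return the first fitted neighbors attribute found on the sampler."""
    for name in names:
        if hasattr(sampler, name):
            return getattr(sampler, name)
    raise AttributeError(
        f"{type(sampler).__name__} has none of {names}; is it fitted?"
    )


# Stand-ins for a fitted non-SMOTE sampler and a fitted SMOTE sampler.
non_smote_like = SimpleNamespace(nn_="estimator A")
smote_like = SimpleNamespace(nn_k_="estimator B")

print(fitted_neighbors(non_smote_like))
print(fitted_neighbors(smote_like))
```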

rola93 · 2020-02-03T13:48:21Z

Hey! I came here from #680.

Thanks for your answer.

I know it's more or less complex and needs some time for this cycle (waiting for two releases), but is it going to start?

Thanks

glemaitre added the Type: Enhancement label Nov 17, 2019

glemaitre added this to the 1.0 milestone Nov 17, 2019

chkoar mentioned this issue Jan 31, 2020

normalize parameter names #680

Closed
