[ENH] Add sample_indices_ for SMOTE/ADASYN classes #772

Open
glemaitre opened this issue Nov 2, 2020 · 5 comments
Labels
Type: Enhancement Indicates new feature requests

Comments

@glemaitre
Member

SMOTE/ADASYN classes currently do not provide a sample_indices_ attribute since they are generating samples that do not belong to the original dataset.

However, we could define a new semantic for samplers that generate data. sample_indices_ could expose, for each new point, a tuple of the indices of the samples used to generate it. For samples that are not generated, the entry would be a single integer.
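To make the proposed semantic concrete, here is a minimal sketch of a SMOTE-like interpolation that records which original samples each synthetic point came from. It is not imbalanced-learn code: the helper name `smote_like_with_provenance`, its parameters, and the brute-force neighbor search are all assumptions made for illustration.

```python
import numpy as np


def smote_like_with_provenance(X_minority, minority_indices, n_new,
                               k_neighbors=2, seed=0):
    """Interpolate n_new points and record their parent indices.

    Hypothetical helper for illustration only; not part of imbalanced-learn.
    Assumes the minority class has no duplicate points and at least
    k_neighbors + 1 samples.
    """
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Column 0 of the argsort is each point itself, so skip it.
    neighbors = np.argsort(dists, axis=1)[:, 1:k_neighbors + 1]

    X_new, provenance = [], []
    for _ in range(n_new):
        base = rng.integers(len(X_minority))
        nn = neighbors[base, rng.integers(k_neighbors)]
        gap = rng.random()
        X_new.append(X_minority[base] + gap * (X_minority[nn] - X_minority[base]))
        # Store indices relative to the full original dataset.
        provenance.append((int(minority_indices[base]), int(minority_indices[nn])))
    return np.asarray(X_new), provenance


rng = np.random.default_rng(42)
X = rng.random((8, 3))
y = np.array([0, 0, 2, 0, 2, 2, 2, 2])
minority_indices = np.flatnonzero(y == 0)  # rows 0, 1 and 3

X_new, parents = smote_like_with_provenance(X[minority_indices],
                                            minority_indices, n_new=2)
# Original rows keep a single integer; synthetic rows carry a tuple.
provenance = list(range(len(X))) + parents
```

With this layout, an entry's type says whether a row is original (an integer) or generated (a tuple of two original indices).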

This would implement a feature requested in issues and gitter.

@glemaitre glemaitre added the Type: Enhancement Indicates new feature requests label Nov 2, 2020
@glemaitre
Member Author

Thinking a bit more about it, and after reading #724, I think we should avoid reusing sample_indices_ with a different semantic. However, we could provide a new attribute with a proper semantic for SMOTE-like samplers.

@tianlinhe

I was thinking about the same issue because I need the sample indices for GroupKFold CV after oversampling with SMOTE. So I downloaded the repo and made some small local changes to imblearn/over_sampling/_smote/base.py. The code to oversample is unchanged:

import numpy as np
from imblearn.over_sampling import SMOTE as smo

# 8 samples, 3 features; class 0 is the minority (3 samples vs. 5)
X = np.random.random((8, 3))
y = np.array([0, 0, 2, 0, 2, 2, 2, 2])

oversample = smo(k_neighbors=2)
X_, y_ = oversample.fit_resample(X, y)

With my local changes, calling oversample.sample_indices() returns:

array([0, 1, 2, 3, 4, 5, 6, 7, 1, 3])

where the index of each synthetic sample is the same as that of its "mother" real sample.
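For the GroupKFold use case mentioned above, such an index array would be enough to propagate group labels to the resampled data. The sketch below assumes the array quoted above, plus hypothetical group labels and placeholder features. Only `GroupKFold` is a real scikit-learn API here; `sample_indices()` remains the commenter's local change.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Output quoted above: one "mother" index per resampled row.
sample_indices = np.array([0, 1, 2, 3, 4, 5, 6, 7, 1, 3])

# Hypothetical group labels for the 8 original samples.
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])

# Each resampled row inherits the group of the original sample it came from.
groups_resampled = groups[sample_indices]

# Placeholders with the resampled shape; the real X_, y_ would come
# from fit_resample.
X_res = np.zeros((len(sample_indices), 3))
y_res = np.array([0, 0, 2, 0, 2, 2, 2, 2, 0, 0])

cv = GroupKFold(n_splits=2)
for train_idx, test_idx in cv.split(X_res, y_res, groups=groups_resampled):
    # No group appears on both sides of a split.
    assert not set(groups_resampled[train_idx]) & set(groups_resampled[test_idx])
```

One caveat: a synthetic point also depends on a neighbor, which may belong to a different group than its mother sample, so inheriting only the mother's group may not fully prevent leakage.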

One can also call oversample.sample_indices(get_which_neighbors=True), which returns a list of tuples indicating which neighbor the synthetic sample was generated from:

[(0, 0),
 (1, 0),
 (2, 0),
 (3, 0),
 (4, 0),
 (5, 0),
 (6, 0),
 (7, 0),
 (1, 1),
 (3, 1)]

For a real sample, its neighbor is 0 (itself).
Please let me know if this is also what you have in mind! If you think it is implementable I can open a new branch.

base.txt
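As a small illustration of how the tuple form could be consumed, the sketch below splits the pairs quoted above into two arrays and flags the synthetic rows. It assumes the second element is a neighbor rank where 0 means "the sample itself", which is one reading of the output above rather than a confirmed specification.

```python
import numpy as np

# Output quoted above, assumed to be (mother index, neighbor rank) pairs.
pairs = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
         (5, 0), (6, 0), (7, 0), (1, 1), (3, 1)]

mother, neighbor_rank = np.array(pairs).T

# Rank 0 means "itself", so any non-zero rank marks a synthetic row.
is_synthetic = neighbor_rank != 0

print(np.flatnonzero(is_synthetic))  # [8 9]
print(mother[is_synthetic])          # [1 3]
```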

@nhm-7

nhm-7 commented May 17, 2021

Hi! Thanks for creating this issue. I think this feature can be useful to understand datasets we are working with.

Thinking a bit more about it, and after reading #724, I think we should avoid reusing sample_indices_ with a different semantic. However, we could provide a new attribute with a proper semantic for SMOTE-like samplers.

@glemaitre, IMO, the semantics should be defined by the owners of the datasets. If we take the example of #724, oversample the data, and suppose sample_indices_ is a tuple of the samples used to generate each new point, we would expect people generating new points (i.e., new people).

WDYT?

@JurajSlivka

Hi,
Is this issue still open?
I see there was a PR, but it seems outdated.

@JurajSlivka

Hi, is this issue still open? I see there was a PR, but it seems outdated.

Since it seems that no one is currently working on it, I will do it.


4 participants