generating samples when training set is sparse #582

Open
laurallu opened this issue Jul 1, 2019 · 1 comment

Comments

@laurallu

laurallu commented Jul 1, 2019

Description

When the training set X is sparse, the algorithm uses scipy.sparse to make the computation more efficient. A synthetic sample is generated only when X[row].nnz is not 0, so no synthetic sample is generated from a sample whose features are all 0. See lines 115-123 in smote.py. (The only exception is when the zeros of such a sample are explicitly stored as elements of the sparse matrix.)
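To illustrate the check described above, here is a minimal sketch using only scipy.sparse and NumPy. The toy matrix and the variable names are assumptions made for illustration; this is not the code from smote.py.

```python
import numpy as np
from scipy import sparse

# Toy data (hypothetical): row 1 is a sample whose features are all 0.
X = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                [0.0, 0.0, 0.0],
                                [0.0, 3.0, 0.0]]))

# Number of stored entries in each row of the CSR matrix.
nnz_per_row = [X[row].nnz for row in range(X.shape[0])]

# A guard of the form `if X[row].nnz:` would skip these rows.
skipped_rows = [row for row, nnz in enumerate(nnz_per_row) if nnz == 0]

# The exception: a zero that is stored explicitly still counts in nnz.
# Built from (data, indices, indptr) so the stored zero is kept as-is.
X_explicit = sparse.csr_matrix(
    (np.array([0.0]), np.array([0]), np.array([0, 1])), shape=(1, 3)
)
explicit_nnz = X_explicit.nnz

print(nnz_per_row, skipped_rows, explicit_nnz)
```

In this toy case only the all-zero row has no stored entries, so it is the only row such a guard would skip.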

New samples should be generated using the rule $s_{new} = s_i - \epsilon \times (s_i - s_{nn})$, where $s_{new}$ is the new synthetic sample, $s_i$ is the current sample, and $s_{nn}$ is a nearest neighbor of $s_i$. When $s_i$ is all zeros, this reduces to $s_{new} = \epsilon \times s_{nn}$. Thus I think a sample should still be generated from its nearest neighbors, and all-zero samples should be treated the same way as the other samples.
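As a quick numerical check of the rule above, here is a small sketch with dense NumPy arrays. The helper name `interpolate` and the example values are hypothetical and are not part of the library.

```python
import numpy as np


def interpolate(s_i, s_nn, eps):
    """Apply the rule s_new = s_i - eps * (s_i - s_nn)."""
    return s_i - eps * (s_i - s_nn)


# Hypothetical values: the current sample is all zeros.
s_i = np.zeros(3)
s_nn = np.array([2.0, 0.0, 4.0])
eps = 0.25

s_new = interpolate(s_i, s_nn, eps)

# With s_i all zeros, the rule reduces to eps * s_nn.
print(s_new, np.allclose(s_new, eps * s_nn))
```

The result is a non-trivial point on the segment between the origin and the neighbor, so an all-zero sample can still serve as an interpolation endpoint.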

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

@glemaitre
Member

I am not sure this is an issue, because there is only a single sample with all zeros. I would not be surprised if such a sample made the algorithm crash, but I would need to run the test.

A PR is welcome if you can show that it works.
