generating samples when training set is sparse #582

laurallu · 2019-07-01T21:54:05Z

Description

When the training set X is sparse, the algorithm uses scipy.sparse to make the computation more efficient. A synthetic sample is only generated when X[row].nnz is not 0, so when the current sample has 0 for all of its features, no synthetic sample is generated. See lines 115-123 in smote.py. (The only exception is when the zeros of an all-zero sample are stored explicitly as elements in the sparse matrix.)
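To illustrate the `nnz` behaviour described above, here is a minimal sketch using only scipy.sparse. The toy matrix and variable names are made up for illustration and this is not the code in smote.py; it only shows that an all-zero row reports `nnz == 0`, while explicitly stored zeros are still counted.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: the second row has 0 for all its features.
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0]])
X = sparse.csr_matrix(dense)

# Number of stored entries per row; the all-zero row stores nothing.
row_nnz = [X[i].nnz for i in range(X.shape[0])]
print(row_nnz)

# A check of the form `X[row].nnz != 0` would therefore skip row 1.
skipped_rows = [i for i, n in enumerate(row_nnz) if n == 0]
print(skipped_rows)

# The exception: a zero that is stored explicitly still counts in nnz.
# Built from (data, indices, indptr) so the zero is kept as an entry.
X_explicit = sparse.csr_matrix(
    (np.array([0.0]), np.array([0]), np.array([0, 1])), shape=(1, 3)
)
print(X_explicit.nnz)
```

Calling `eliminate_zeros()` on such a matrix drops the explicitly stored zeros, after which the row would report `nnz == 0` as well.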

The new samples should be generated using the rule $s_{new} = s_i - \epsilon \times (s_i - s_{nn})$, where $s_{new}$ is the new synthetic sample, $s_i$ is the current sample, and $s_{nn}$ is a nearest neighbor of $s_i$. When $s_i$ is all zeros, this reduces to $s_{new} = \epsilon \times s_{nn}$. Thus I think a sample should still be generated from its nearest neighbors, and all-zero samples should be treated the same way as the other samples.
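As a small worked check of the rule above, the sketch below applies it to an all-zero sample with NumPy. The helper name `interpolate` and the example values are hypothetical and chosen only for illustration; this is not the library's implementation.

```python
import numpy as np

def interpolate(s_i, s_nn, eps):
    """Apply the rule s_new = s_i - eps * (s_i - s_nn)."""
    return s_i - eps * (s_i - s_nn)

# Hypothetical values: the current sample is all zeros.
s_i = np.zeros(3)
s_nn = np.array([2.0, 0.0, 4.0])
eps = 0.25

s_new = interpolate(s_i, s_nn, eps)
print(s_new)

# With s_i = 0 the rule reduces to eps * s_nn.
print(np.allclose(s_new, eps * s_nn))
```

The result is a valid, non-trivial point on the segment between the origin and the neighbor, which is why skipping all-zero samples is not required by the rule itself.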

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

glemaitre · 2019-07-09T19:35:04Z

I am not sure this is an issue because there is only a single sample with all zeros. I would not be surprised if such a sample made the algorithm crash, but I would need to run the test.

A PR is welcome if you can show that it works.
