When the training set X is sparse, the algorithm uses scipy.sparse to make the computation more efficient. A synthetic sample is only generated when X[row].nnz is not 0, so when the current sample has 0 for all of its features, no synthetic sample is generated. See lines 115-123 in smote.py. (The only exception is when the all-zero features are explicitly stored as elements in the sparse matrix.)
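To make the nnz behaviour concrete, here is a small sketch. It does not reproduce the code in smote.py; the matrices and the names `row_nnz`, `skipped_rows`, and `X_explicit` are made up for illustration, and it only assumes that the guard has the form of a check on X[row].nnz as described above.

```python
import numpy as np
from scipy import sparse

# Three samples; the middle one has 0 for every feature.
X = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                [0.0, 0.0, 0.0],
                                [0.0, 3.0, 0.0]]))

# Number of stored entries per row.
row_nnz = [X[row].nnz for row in range(X.shape[0])]

# A guard that checks X[row].nnz would skip these rows (assumed form of the check).
skipped_rows = [row for row, n in enumerate(row_nnz) if n == 0]

# Explicitly stored zeros still count toward nnz, which is the exception noted above.
# Built from (data, indices, indptr): one row storing two zeros in columns 0 and 1.
X_explicit = sparse.csr_matrix(
    (np.array([0.0, 0.0]), np.array([0, 1]), np.array([0, 2])), shape=(1, 3)
)

print(row_nnz, skipped_rows, X_explicit.nnz, X_explicit.count_nonzero())
```

The last line contrasts nnz (stored entries) with count_nonzero() (entries whose value is actually nonzero), which is why an all-zero sample with explicitly stored zeros would pass the guard.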
The new samples should be generated using the rule $s_{new} = s_i - \epsilon \times (s_i - s_{nn})$, where $s_{new}$ is the new synthetic sample, $s_i$ is the current sample, and $s_{nn}$ is a nearest neighbor of $s_i$. When $s_i$ is all zeros, this reduces to $s_{new} = \epsilon \times s_{nn}$. Thus I think a sample should still be generated from its nearest neighbors, and all-zero samples should be treated the same way as the other samples.
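As a quick check of the rule for an all-zero sample, here is a minimal sketch on sparse rows. The helper `interpolate` and the example values are hypothetical and are not taken from smote.py.

```python
import numpy as np
from scipy import sparse


def interpolate(s_i, s_nn, eps):
    """Hypothetical helper: apply s_new = s_i - eps * (s_i - s_nn) to 1 x n sparse rows."""
    return s_i - eps * (s_i - s_nn)


# An all-zero current sample (no stored entries) and one of its nearest neighbors.
s_zero = sparse.csr_matrix((1, 3))
s_nn = sparse.csr_matrix(np.array([[2.0, 0.0, 4.0]]))

s_new = interpolate(s_zero, s_nn, eps=0.25)

# With s_i = 0 the rule reduces to eps * s_nn.
print(s_new.toarray())
```

With these values the result equals 0.25 times the neighbor, so the rule is well defined even though the current sample has nnz equal to 0.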
Steps/Code to Reproduce
Expected Results
Actual Results
Versions
I am not sure this is an issue, because there is only a single sample with all zeros. I would not be surprised if such a sample made the algorithm crash, but I would need to run the test.