[neighbors] sparse matrix to ind/dist matrix pair: same as PyNNDescent? #80

flying-sheep · 2023-10-12T09:19:29Z

We’ve been thinking about checking if Scanpy’s neighbors backends can benefit from the shortcut, but it seems like PyNNDescent behaves differently ((n_neighbors - 1) * n_obs instead of n_neighbors * n_obs)

Shouldn’t the two match up in behavior? Is this a cupy bug? I’m unsure where to harmonize this.

rapids_singlecell/src/rapids_singlecell/preprocessing/_neighbors.py

Lines 131 to 132 in b038bcc

    
           knn_dist = distances.data.reshape(n_obs, n_neighbors) 
        
           knn_indices = distances.indices.reshape(n_obs, n_neighbors)

https://github.com/scverse/scanpy/blob/616d580384c1daa8115a270220aafd8c09f9f0f1/scanpy/neighbors/_common.py#L64-L81

The text was updated successfully, but these errors were encountered:

Intron7 · 2023-10-12T09:31:50Z

This is not a bug but intended. cuML behaves like scikit-learn. When I started working on neighbors I also went for (n_neighbors - 1) * n_obs. But for rsc n_neighbors * n_obs is important for the umap shortcut. I can think about adding in the 0 zeros in UMAP and mimic PyNNDescent.

flying-sheep · 2023-10-12T11:05:39Z

No need, I just wanted to start the discussion about what is “correct”.

Seems like this isn’t so easy to answer, see relevant note from that linked document:

Note

When a specific number of neighbors is queried (using KNeighborsTransformer), the definition of n_neighbors is ambiguous since it can either include each training point as its own neighbor, or exclude them. Neither choice is perfect, since including them leads to a different number of non-self neighbors during training and testing, while excluding them leads to a difference between fit(X).transform(X) and fit_transform(X), which is against scikit-learn API. In KNeighborsTransformer we use the definition which includes each training point as its own neighbor in the count of n_neighbors. However, for compatibility reasons with other estimators which use the other definition, one extra neighbor will be computed when mode == 'distance'. To maximise compatibility with all estimators, a safe choice is to always include one extra neighbor in a custom nearest neighbors estimator, since unnecessary neighbors will be filtered by following estimators.

I assume “one extra neighbor” means “per cell, calculate n_neighbors neighbors that aren’t the same cell”, resulting in a sparse matrix with n_neighbors non-zero entries per row. Our code then creates knn_* matrices with n_neighbors columns that include that “additional” neighbor, but not the implicit, zero-distance, “self” neighbor.

So if PyNNDescent has n_neighbors - 1 entries per row in the sparse matrix, that means it doesn’t follow the sklearn advice and interprets n_neighbors as “number of neighbors including the same cell”. So using your code on PyNNDescent’s output returns two matrices with n_neighbors - 1 columns each.

cc @eroell @ivirshup

Intron7 added the enhancement New feature or request label Oct 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[neighbors] sparse matrix to ind/dist matrix pair: same as PyNNDescent? #80

[neighbors] sparse matrix to ind/dist matrix pair: same as PyNNDescent? #80

flying-sheep commented Oct 12, 2023

Intron7 commented Oct 12, 2023 •

edited

flying-sheep commented Oct 12, 2023

[neighbors] sparse matrix to ind/dist matrix pair: same as PyNNDescent? #80

[neighbors] sparse matrix to ind/dist matrix pair: same as PyNNDescent? #80

Comments

flying-sheep commented Oct 12, 2023

Intron7 commented Oct 12, 2023 • edited

flying-sheep commented Oct 12, 2023

Intron7 commented Oct 12, 2023 •

edited