Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[neighbors] sparse matrix to ind/dist matrix pair: same as PyNNDescent? #80

Open
flying-sheep opened this issue Oct 12, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@flying-sheep
Copy link
Member

We’ve been thinking about checking if Scanpy’s neighbors backends can benefit from the shortcut, but it seems like PyNNDescent behaves differently ((n_neighbors - 1) * n_obs instead of n_neighbors * n_obs)

Shouldn’t the two match up in behavior? Is this a cupy bug? I’m unsure where to harmonize this.

knn_dist = distances.data.reshape(n_obs, n_neighbors)
knn_indices = distances.indices.reshape(n_obs, n_neighbors)

https://github.com/scverse/scanpy/blob/616d580384c1daa8115a270220aafd8c09f9f0f1/scanpy/neighbors/_common.py#L64-L81

@Intron7
Copy link
Member

Intron7 commented Oct 12, 2023

This is not a bug but intended. cuML behaves like scikit-learn. When I started working on neighbors I also went for (n_neighbors - 1) * n_obs. But for rsc n_neighbors * n_obs is important for the umap shortcut. I can think about adding in the 0 zeros in UMAP and mimic PyNNDescent.

@Intron7 Intron7 added the enhancement New feature or request label Oct 12, 2023
@flying-sheep
Copy link
Member Author

No need, I just wanted to start the discussion about what is “correct”.

Seems like this isn’t so easy to answer, see relevant note from that linked document:

Note

When a specific number of neighbors is queried (using KNeighborsTransformer), the definition of n_neighbors is ambiguous since it can either include each training point as its own neighbor, or exclude them. Neither choice is perfect, since including them leads to a different number of non-self neighbors during training and testing, while excluding them leads to a difference between fit(X).transform(X) and fit_transform(X), which is against scikit-learn API. In KNeighborsTransformer we use the definition which includes each training point as its own neighbor in the count of n_neighbors. However, for compatibility reasons with other estimators which use the other definition, one extra neighbor will be computed when mode == 'distance'. To maximise compatibility with all estimators, a safe choice is to always include one extra neighbor in a custom nearest neighbors estimator, since unnecessary neighbors will be filtered by following estimators.

I assume “one extra neighbor” means “per cell, calculate n_neighbors neighbors that aren’t the same cell”, resulting in a sparse matrix with n_neighbors non-zero entries per row. Our code then creates knn_* matrices with n_neighbors columns that include that “additional” neighbor, but not the implicit, zero-distance, “self” neighbor.

So if PyNNDescent has n_neighbors - 1 entries per row in the sparse matrix, that means it doesn’t follow the sklearn advice and interprets n_neighbors as “number of neighbors including the same cell”. So using your code on PyNNDescent’s output returns two matrices with n_neighbors - 1 columns each.

cc @eroell @ivirshup

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants