Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Avoid memoryviews' slicing for KMeans Cython implementations #24565

Closed
adam2392 opened this issue Oct 3, 2022 · 4 comments
Closed

ENH Avoid memoryviews' slicing for KMeans Cython implementations #24565

adam2392 opened this issue Oct 3, 2022 · 4 comments

Comments

@adam2392
Copy link
Contributor

adam2392 commented Oct 3, 2022

Summary

Addresses issues raised in #17299

The proposal is to modify the LOC here: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/cluster/_k_means_common.pyx#L155-L159. There are currently three places where the _euclidean_sparse_dense Cython function is used and can be optimized.

The issue with the current implementation is that centers is a 2D memview and thus passing in centers[j] creates a 1D memview. I think going from 1D memview to another 1D memview is okay(?) If not, then we need to also modify the other arguments.

Proposal

Change the signature of the Cython function _euclidean_sparse_dense to this:

cdef floating _euclidean_sparse_dense(
        floating[::1] a_data,  # IN
        int[::1] a_indices,    # IN
        floating[::1] b,       # IN
        floating b_squared_norms,
        int b_index,
        bint squared) nogil:
    ...

and adjust the unit tests and corresponding Cython code. I will put up a draft PR to demonstrate what is needed there.

Misc.

cc: @jjerphan who brought this up as a possible Cython improvement for me to help out with.

@github-actions github-actions bot added the Needs Triage Issue requires triage label Oct 3, 2022
@adam2392
Copy link
Contributor Author

adam2392 commented Oct 3, 2022

Label: Cython, Clustering

@jjerphan
Copy link
Member

jjerphan commented Oct 3, 2022

Thank you for opening this issue and proposing a PR, @adam2392.

The issue with the current implementation is that centers is a 2D memview and thus passing in centers[j] creates a 1D memview.

Yes — cross-referencing #17299.

I think going from 1D memview to another 1D memview is okay(?) If not, then we need to also modify the other arguments.

I think 1-D slicing comes with some overhead. This is present in k-means' internals here for instance:

sq_dist = _euclidean_sparse_dense(
X_data[X_indptr[i]: X_indptr[i + 1]],
X_indices[X_indptr[i]: X_indptr[i + 1]],
centers[j], centers_squared_norms[j], True)
inertia += sq_dist * sample_weight[i]

but it might not be as costly.

I think we can proceed with separate benchmarks for both removing 2D- and 1D-slicing and see if those removals are useful.

@jjerphan jjerphan changed the title [ENH] Improve KMeans Cython code ENH Avoid memoryviews' slicing for KMeans Cython implementations Oct 6, 2022
@adam2392
Copy link
Contributor Author

adam2392 commented Mar 2, 2023

Closing this as there is no performance gain.

@adam2392 adam2392 closed this as completed Mar 2, 2023
@jeremiedbb
Copy link
Member

as seen in #24566 experiments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants