ENH Avoid memoryviews' slicing for `KMeans` Cython implementations #24565

adam2392 · 2022-10-03T17:34:14Z

Summary

Addresses issues raised in #17299

The proposal is to modify the lines here: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/cluster/_k_means_common.pyx#L155-L159. The `_euclidean_sparse_dense` Cython function is currently used in three places, each of which could be optimized.

The issue with the current implementation is that centers is a 2D memview and thus passing in centers[j] creates a 1D memview. I think going from 1D memview to another 1D memview is okay(?) If not, then we need to also modify the other arguments.

Proposal

Change the signature of the Cython function _euclidean_sparse_dense to this:

cdef floating _euclidean_sparse_dense(
        floating[::1] a_data,  # IN
        int[::1] a_indices,    # IN
        floating[::1] b,       # IN
        floating b_squared_norms,
        int b_index,
        bint squared) nogil:
    ...

and adjust the unit tests and corresponding Cython code. I will put up a draft PR to demonstrate what is needed there.
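To make the idea concrete, here is a minimal pure-Python sketch (not the Cython code from the repository or from the draft PR) contrasting the two calling conventions: one receives pre-sliced row data, the other receives the full arrays plus offsets and a center index, so no intermediate view is created per call. The names `sq_dist_sliced` and `sq_dist_offsets`, and the `start`/`end` parameters, are hypothetical and chosen only for illustration.

```python
import math


def sq_dist_sliced(a_data, a_indices, b, b_squared_norm, squared=True):
    """Distance between one sparse row (already sliced) and a dense vector b."""
    result = 0.0
    for value, col in zip(a_data, a_indices):
        bi = b[col]
        diff = value - bi
        # (a_k - b_k)^2 - b_k^2 covers the nonzero columns; the b_k^2 terms
        # of every column are added back through b_squared_norm below.
        result += diff * diff - bi * bi
    result += b_squared_norm
    result = max(result, 0.0)  # guard against tiny negative rounding errors
    return result if squared else math.sqrt(result)


def sq_dist_offsets(X_data, X_indices, start, end,
                    centers, j, centers_squared_norms, squared=True):
    """Same computation, but indexing into the full arrays (hypothetical variant)."""
    result = 0.0
    for k in range(start, end):
        bi = centers[j][X_indices[k]]
        diff = X_data[k] - bi
        result += diff * diff - bi * bi
    result += centers_squared_norms[j]
    result = max(result, 0.0)
    return result if squared else math.sqrt(result)


# Toy CSR matrix for the rows [1, 0, 2] and [0, 3, 0].
X_data = [1.0, 2.0, 3.0]
X_indices = [0, 2, 1]
X_indptr = [0, 2, 3]
centers = [[1.0, 1.0, 1.0], [0.0, 2.0, 0.0]]
centers_squared_norms = [3.0, 4.0]

i, j = 0, 0
start, end = X_indptr[i], X_indptr[i + 1]
d_sliced = sq_dist_sliced(X_data[start:end], X_indices[start:end],
                          centers[j], centers_squared_norms[j])
d_offsets = sq_dist_offsets(X_data, X_indices, start, end,
                            centers, j, centers_squared_norms)
```

Both variants return the same value; whether the offset-based form is measurably faster in Cython is exactly what would need benchmarking.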

Misc.

cc: @jjerphan who brought this up as a possible Cython improvement for me to help out with.


adam2392 · 2022-10-03T17:34:46Z

Label: Cython, Clustering

jjerphan · 2022-10-03T17:53:55Z

Thank you for opening this issue and proposing a PR, @adam2392.

> The issue with the current implementation is that centers is a 2D memview and thus passing in centers[j] creates a 1D memview.

Yes — cross-referencing #17299.

> I think going from 1D memview to another 1D memview is okay(?) If not, then we need to also modify the other arguments.

I think 1-D slicing comes with some overhead. This is present in k-means' internals here for instance:

scikit-learn/sklearn/cluster/_k_means_common.pyx

Lines 155 to 159 in 798aeeb

    
    sq_dist = _euclidean_sparse_dense(
        X_data[X_indptr[i]: X_indptr[i + 1]],
        X_indices[X_indptr[i]: X_indptr[i + 1]],
        centers[j], centers_squared_norms[j], True)
    inertia += sq_dist * sample_weight[i]

but it might not be as costly.
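For readers less familiar with the CSR layout, the following pure-Python sketch shows what the `X_indptr[i]: X_indptr[i + 1]` slices select and how the weighted inertia accumulates. It is an illustration only: the helpers `csr_row`, `densify`, and `weighted_inertia`, and the `labels` list giving each row's assigned center, are assumptions made for this example and do not appear in scikit-learn. The distance is computed on a densified row for clarity rather than speed.

```python
def csr_row(X_data, X_indices, X_indptr, i):
    """Return the nonzero values and column indices of row i of a CSR matrix."""
    start, end = X_indptr[i], X_indptr[i + 1]
    # In Python these slices copy; in Cython each slice builds a new
    # memoryview object, which is the overhead discussed in this thread.
    return X_data[start:end], X_indices[start:end]


def densify(values, cols, n_features):
    """Expand a sparse row into a dense list of length n_features."""
    row = [0.0] * n_features
    for value, col in zip(values, cols):
        row[col] = value
    return row


def weighted_inertia(X_data, X_indices, X_indptr, centers, labels, sample_weight):
    """Sum of weighted squared distances from each row to its assigned center."""
    n_features = len(centers[0])
    total = 0.0
    for i in range(len(X_indptr) - 1):
        values, cols = csr_row(X_data, X_indices, X_indptr, i)
        row = densify(values, cols, n_features)
        center = centers[labels[i]]
        sq_dist = sum((x - c) ** 2 for x, c in zip(row, center))
        total += sq_dist * sample_weight[i]
    return total


# Toy CSR matrix for the rows [1, 0, 2] and [0, 3, 0].
X_data = [1.0, 2.0, 3.0]
X_indices = [0, 2, 1]
X_indptr = [0, 2, 3]
centers = [[1.0, 1.0, 1.0], [0.0, 2.0, 0.0]]
labels = [0, 1]              # hypothetical assignment of rows to centers
sample_weight = [1.0, 2.0]

total_inertia = weighted_inertia(X_data, X_indices, X_indptr,
                                 centers, labels, sample_weight)
```

Each loop iteration performs two row slices, so the number of temporary views grows with the number of samples times the number of calls; that is why the per-slice cost is worth measuring.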

I think we can proceed with separate benchmarks for both removing 2D- and 1D-slicing and see if those removals are useful.

adam2392 · 2023-03-02T22:23:00Z

Closing this as there is no performance gain.

jeremiedbb · 2023-03-02T22:25:19Z

as seen in #24566 experiments.

github-actions bot added the Needs Triage Issue requires triage label Oct 3, 2022

adam2392 mentioned this issue Oct 3, 2022

ENH Improve Cython code for KMeans to not create additional 1D memviews in for-loop #24566

Closed

2 tasks

jjerphan added module:cluster cython Performance and removed Needs Triage Issue requires triage labels Oct 3, 2022

jjerphan changed the title ~~[ENH] Improve KMeans Cython code~~ ENH Avoid memoryviews' slicing for KMeans Cython implementations Oct 6, 2022

adam2392 closed this as completed Mar 2, 2023
