
ENH Optimized CSR-CSR support for Euclidean specializations of PairwiseDistancesReductions #24556

Merged

Conversation

@Vincent-Maladiere Vincent-Maladiere commented Sep 30, 2022

Reference Issues/PRs

Relates #23585 @jjerphan

What does this implement/fix? Explain your changes.

  • Create the base class MiddleTermComputer, generalizing GEMMTermComputer to handle the computation of $-2 X Y^\top$. MiddleTermComputer is extended by:
    • DenseDenseMiddleTermComputer when both X and Y are dense (its implementation originates from GEMMTermComputer)
    • SparseSparseMiddleTermComputer when both X and Y are CSR. This component relies on a Cython routine.
  • Change EuclideanArgKmin and EuclideanRadiusNeighbors so that they only rely on MiddleTermComputer
  • Adapt is_usable_for in BaseDistanceReductionDispatcher to cover the CSR-CSR case.
  • Change the logic of compute in ArgKmin and RadiusNeighbors to select the Euclidean specialization for the CSR-CSR case
  • Implement sparse versions of _sqeuclidean_row_norms
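The sparse squared row norms mentioned in the last bullet can be sketched in pure Python over the CSR buffers. This is only an illustration of the idea (the PR implements it as a GIL-free Cython routine), and the function name here is hypothetical:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_sqeuclidean_row_norms(X_csr):
    """Squared Euclidean norm of each row of a CSR matrix.

    Pure-Python sketch: since only non-zero entries contribute to the
    norm, each row's norm is the dot product of its data slice with
    itself, delimited by the indptr array.
    """
    norms = np.empty(X_csr.shape[0], dtype=X_csr.data.dtype)
    for i in range(X_csr.shape[0]):
        start, end = X_csr.indptr[i], X_csr.indptr[i + 1]
        row_data = X_csr.data[start:end]
        norms[i] = np.dot(row_data, row_data)
    return norms

X = csr_matrix(np.array([[0.0, 1.0], [0.0, 0.0], [2.0, 3.0]]))
print(sparse_sqeuclidean_row_norms(X))  # squared norms: 1, 0, 13
```

Note that an all-zero row simply has an empty data slice, so its norm comes out as 0 without any special casing.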

Any other comments?

For benchmark results, see the following comments:

@jjerphan (Member) left a comment

Thank you for working on this, @Vincent-Maladiere!

Here are a few initial comments to guide you through extending this class hierarchy.

@Vincent-Maladiere (Author) commented
Some observations

  1. The dense and sparse computers yield the same result when max(X.shape[0], Y.shape[0]) < chunk_size; otherwise the sparse computation is still incorrect.
  2. (Unrelated, but could be substantial.) It seems that we don't handle the case where parameter > X.shape[0]; in that scenario even the values of the dense computer are wrong.

Testing script:

This is working ✅

import numpy as np
from numpy.testing import assert_array_equal
from scipy.sparse import csr_matrix
from sklearn.metrics._pairwise_distances_reduction import ArgKmin

X = np.array([[0, 1], [0, 0], [2, 3]], dtype=np.float64)
Y = np.array([[1, 1], [0, 0], [0, 1]], dtype=np.float64)

X_csr = csr_matrix(X)
Y_csr = csr_matrix(Y)

parameter = 3
dist, indices = ArgKmin.compute(
    X,
    Y,
    parameter,
    chunk_size=100,
    return_distance=True,
)
dist_csr, indices_csr = ArgKmin.compute(
    X_csr,
    Y_csr,
    parameter,
    chunk_size=100,
    return_distance=True,
)
assert_array_equal(dist, dist_csr)

This fails ❌ (notice that the only difference is the length of X and Y)

import numpy as np
from numpy.testing import assert_array_equal
from scipy.sparse import csr_matrix
from sklearn.metrics._pairwise_distances_reduction import ArgKmin

X = np.array([[0, 1], [0, 0], [2, 3]] * 40, dtype=np.float64)
Y = np.array([[1, 1], [0, 0], [0, 1]] * 40, dtype=np.float64)

X_csr = csr_matrix(X)
Y_csr = csr_matrix(Y)

parameter = 3
dist, indices = ArgKmin.compute(
    X,
    Y,
    parameter,
    chunk_size=100,
    return_distance=True,
)
dist_csr, indices_csr = ArgKmin.compute(
    X_csr,
    Y_csr,
    parameter,
    chunk_size=100,
    return_distance=True,
)
assert_array_equal(dist, dist_csr)

@jjerphan (Member) commented

(unrelated but could be substantial) It seems that we don't handle the case when parameter > X.shape[0], in that scenario even the values of the dense computer are wrong.

Yes, that's right (I suppose you meant "when parameter > Y.shape[0]"?): it's not validated in ArgKmin. Most validations are done closer to the user-facing interfaces, here in kneighbors:

if n_neighbors > n_samples_fit:
    raise ValueError(
        "Expected n_neighbors <= n_samples, "
        " but n_samples = %d, n_neighbors = %d" % (n_samples_fit, n_neighbors)
    )

As those interfaces are directly used by users, we do not revalidate all the parameters, but this might change in the future.
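For illustration, the user-facing validation quoted above can be observed directly through NearestNeighbors (a sketch; the exact error message wording may vary across scikit-learn versions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Fit on only 3 samples.
X = np.array([[0.0, 1.0], [0.0, 0.0], [2.0, 3.0]])
nn = NearestNeighbors(n_neighbors=2).fit(X)

try:
    # Requesting more neighbors than fitted samples trips the
    # n_neighbors <= n_samples check in kneighbors.
    nn.kneighbors(X, n_neighbors=5)
except ValueError as exc:
    print(exc)
```

The lower-level ArgKmin.compute discussed here skips this check, which is why out-of-range values of `parameter` produce wrong results there instead of an error.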

@jjerphan (Member) left a comment

Nice job making this work, @Vincent-Maladiere!

Here are a few comments to finish this first part.

The next steps involve (in this order):

  • rename sklearn/metrics/_pairwise_distances_reduction/_gemm_term_computer.{pxd,pyx}.tp (mind the lines mentioning those files as well, IIRC in at least .gitignore and setup.cfg)
  • extending some tests to make sure that the Sparse-Sparse support for the Euclidean specialisations is correct
  • performing some benchmarks on the user-facing API; benchmarking kneighbors should suffice (this gist might be adapted)
  • updating and adapting the documentation/comments of those implementations with respect to those changes

To clarify, the following points are better done as part of subsequent PRs (preferably in this order):

  • refactoring to merge DatasetsPairs and MiddleTermComputers, and to encapsulate squared norms computations where appropriate
  • supporting the CSR-dense and dense-CSR cases for the Euclidean specialisation

@Vincent-Maladiere (Author) commented

As most of this work (except _sqeuclidean_row_norms) is not user-facing, should we append it to the changelog?

@jjerphan (Member) left a comment

As most of this work (except _sqeuclidean_row_norms) is not user-facing, should we append it to the changelog?

For now, to keep the CI green, I've labelled this PR with "No Changelog Needed" because it is still experimental. I also converted it to draft because this is still WIP ([WIP] titles were used before the draft mode existed, but they are not of much use now). Once we have assessed the performance changes, we can add a changelog entry and remove this label.

Also, it looks like a few Cython sources have been added lately.

They should be removable with:

rm sklearn/metrics/_pairwise_distances_reduction/{_gemm_term_computer,_radius_neighborhood}.{pxd,pyx}
git rm --cached -f sklearn/metrics/_pairwise_distances_reduction/{_gemm_term_computer,_radius_neighborhood}.{pxd,pyx}
git add sklearn/metrics/_pairwise_distances_reduction/{_gemm_term_computer,_radius_neighborhood}.{pxd,pyx}
git commit -m "Remove previous Cython templates and sources"

Here are a few comments.

@jjerphan jjerphan marked this pull request as draft October 20, 2022 09:38

Vincent-Maladiere commented Oct 26, 2022

Hardware Scalability on 76574eb (using this script)

The current implementation of Sparse-Sparse Euclidean NearestNeighbors does not benefit from the PairwiseDistancesReduction back-end introduced in #22134.

Following the design guidelines previously established, this new implementation:

  • Computes the sqeuclidean_row_norms from a CSR matrix without the GIL
  • Computes the middle term -2 * X @ Y.T following the idea introduced here. SparseSparseMiddleTermComputer performs the matmul between two CSR matrices in the BLAS GEMM fashion, without the GIL.
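The CSR-CSR middle term above can be sketched in pure Python as follows. This is a deliberately naive illustration of the dot-product accumulation over CSR buffers, not the PR's Cython routine, and the function name is hypothetical:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_sparse_middle_term(X_csr, Y_csr):
    """Compute -2 * X @ Y.T between two CSR matrices.

    For each pair of rows, only columns present in both rows
    contribute to the dot product.
    """
    n_x, n_y = X_csr.shape[0], Y_csr.shape[0]
    out = np.zeros((n_x, n_y))
    for i in range(n_x):
        x_start, x_end = X_csr.indptr[i], X_csr.indptr[i + 1]
        for j in range(n_y):
            y_start, y_end = Y_csr.indptr[j], Y_csr.indptr[j + 1]
            dot = 0.0
            for k in range(x_start, x_end):
                col = X_csr.indices[k]
                # Linear scan of Y's j-th row for clarity; an efficient
                # version would walk both sorted index arrays in step.
                for m in range(y_start, y_end):
                    if Y_csr.indices[m] == col:
                        dot += X_csr.data[k] * Y_csr.data[m]
                        break
            out[i, j] = -2.0 * dot
    return out
```

The real implementation avoids the Python interpreter entirely so that chunks of the output matrix can be filled in parallel without the GIL.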

This new design scales easily up to 8 cores on my laptop, unlike the current implementation that fails to do so.

Below are the results obtained from running the script linked in the title, with the following parameters:

  • size of X and Y (100,000 x n_features)
  • dtype: np.float64
  • density: 0.05 (5%)

This branch: speed_up_100000_100000_log_my_branch (scalability plot)

Main: speed_up_100000_100000_log (scalability plot)

However, I also observe that the absolute runtime of this implementation greatly depends on the number of features: in high dimensions (> 500), this branch currently performs worse than main, so a bit more investigation is required here.

n_features = 50 (main is blue): n_features_50 (runtime plot)

n_features = 100 (main is blue): n_features_100 (runtime plot)

n_features = 500 (main is blue): n_features_500 (runtime plot)

Raw data

This branch:

| n_threads | n_train | n_test | n_features | mean_runtime | stderr_runtime | commit |
|-----------|---------|--------|------------|--------------|----------------|---------|
| 1 | 100000 | 100000 | 50 | 140.372 | 0 | 76574eb |
| 2 | 100000 | 100000 | 50 | 69.0472 | 0 | 76574eb |
| 4 | 100000 | 100000 | 50 | 37.6792 | 0 | 76574eb |
| 8 | 100000 | 100000 | 50 | 33.1491 | 0 | 76574eb |
| 1 | 100000 | 100000 | 100 | 374.193 | 0 | 76574eb |
| 2 | 100000 | 100000 | 100 | 191.189 | 0 | 76574eb |
| 4 | 100000 | 100000 | 100 | 100.264 | 0 | 76574eb |
| 8 | 100000 | 100000 | 100 | 83.0642 | 0 | 76574eb |
| 1 | 100000 | 100000 | 500 | 6489.29 | 0 | 76574eb |
| 2 | 100000 | 100000 | 500 | 3357.13 | 0 | 76574eb |
| 4 | 100000 | 100000 | 500 | 1744.47 | 0 | 76574eb |
| 8 | 100000 | 100000 | 500 | 1274.04 | 0 | 76574eb |

Main:

| n_threads | n_train | n_test | n_features | mean_runtime | stderr_runtime | commit |
|-----------|---------|--------|------------|--------------|----------------|---------|
| 1 | 100000 | 100000 | 50 | 240.924 | 0 | a71c535 |
| 2 | 100000 | 100000 | 50 | 323.733 | 0 | a71c535 |
| 4 | 100000 | 100000 | 50 | 273.084 | 0 | a71c535 |
| 8 | 100000 | 100000 | 50 | 270.112 | 0 | a71c535 |
| 1 | 100000 | 100000 | 100 | 245.132 | 0 | a71c535 |
| 2 | 100000 | 100000 | 100 | 316.328 | 0 | a71c535 |
| 4 | 100000 | 100000 | 100 | 364.912 | 0 | a71c535 |
| 8 | 100000 | 100000 | 100 | 284.274 | 0 | a71c535 |
| 1 | 100000 | 100000 | 500 | 454.946 | 0 | a71c535 |
| 2 | 100000 | 100000 | 500 | 416.54 | 0 | a71c535 |
| 4 | 100000 | 100000 | 500 | 349.073 | 0 | a71c535 |
| 8 | 100000 | 100000 | 500 | 306.178 | 0 | a71c535 |

@Micky774 could you give me your thoughts about this PR? :)

@jjerphan (Member) left a comment

This should be my final review for this PR.

Here are some last comments.

Note that #24542 and #24715 must be merged before this PR.

@@ -84,8 +84,7 @@ cdef class RadiusNeighbors{{name_suffix}}(BaseDistancesReduction{{name_suffix}})
         """
         if (
             metric in ("euclidean", "sqeuclidean")
-            and not issparse(X)
-            and not issparse(Y)
+            and not (issparse(X) ^ issparse(Y))
@jjerphan (Member) commented

Side-note: in another PR, I think it would be nice to turn is_valid_sparse_matrix, i.e.:

def is_valid_sparse_matrix(X):
    return (
        isspmatrix_csr(X)
        and
        # TODO: support CSR matrices without non-zeros elements
        X.nnz > 0
        and
        # TODO: support CSR matrices with int64 indices and indptr
        # See: https://github.com/scikit-learn/scikit-learn/issues/23653
        X.indices.dtype == X.indptr.dtype == np.int32
    )

into is_supported_sparse_matrix in the global scope of base.pyx, and to propagate it in place of scipy.sparse.{issparse,isspmatrix_csr} in sklearn.metrics._pairwise_distances_reduction.
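As a standalone illustration, the predicate sketched above behaves as follows (assuming scipy builds small CSR matrices with 32-bit index buffers, which is its default when sizes fit):

```python
import numpy as np
from scipy.sparse import csr_matrix, isspmatrix_csr

def is_valid_sparse_matrix(X):
    # Same predicate as the snippet quoted above: CSR format,
    # at least one stored element, and 32-bit indices/indptr.
    return (
        isspmatrix_csr(X)
        and X.nnz > 0
        and X.indices.dtype == X.indptr.dtype == np.int32
    )

X = csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
print(is_valid_sparse_matrix(X))            # small CSR matrix: accepted
print(is_valid_sparse_matrix(X.toarray()))  # dense ndarray: rejected
print(is_valid_sparse_matrix(csr_matrix((2, 2))))  # no stored elements: rejected
```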

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@jjerphan (Member) left a comment

LGTM, with acceptance for merge pending due to a duplicated allocation and cast, as explained in the comment below.

@jjerphan (Member) left a comment

LGTM. Let's solve the TODOs in another PR.

@ogrisel (Member) left a comment

LGTM. Just a few nitpicks to make the not-XOR-based conditions easier to grasp, as I think it's quite rare to see such constructs in our code.
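For reference, the not-XOR construct from the diff above accepts the dense-dense and sparse-sparse cases while rejecting mixed ones; a sketch of an equivalent, arguably more explicit spelling (helper name is illustrative, not from the PR):

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

def both_or_neither_sparse(X, Y):
    # Equivalent to `not (issparse(X) ^ issparse(Y))`:
    # True when X and Y are both sparse or both dense.
    return issparse(X) == issparse(Y)

X_dense = np.zeros((2, 2))
X_sparse = csr_matrix(X_dense)

print(both_or_neither_sparse(X_dense, X_dense))    # dense-dense: usable
print(both_or_neither_sparse(X_sparse, X_sparse))  # sparse-sparse: usable
print(both_or_neither_sparse(X_dense, X_sparse))   # mixed: not usable
```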

@jjerphan removed the "Waiting for Second Reviewer" and "No Changelog Needed" labels Nov 18, 2022
@jjerphan (Member) left a comment

We are almost there!

@jjerphan jjerphan merged commit 9c9c858 into scikit-learn:main Nov 18, 2022

jjerphan commented Nov 18, 2022

Congrats @Vincent-Maladiere! 🎵 🎉 👏

Those are significant performance improvements for many estimators! 🎉
I hope you have learned a lot about scikit-learn and Cython development. 😄
