
ENH Optimized CSR-CSR support for Euclidean specializations of PairwiseDistancesReductions #24556

Merged

Conversation

@Vincent-Maladiere Vincent-Maladiere commented Sep 30, 2022

Reference Issues/PRs

Relates #23585 @jjerphan

What does this implement/fix? Explain your changes.

  • Create the base class MiddleTermComputer, generalizing GEMMTermComputer to handle the computation of $-2 X Y^\top$. MiddleTermComputer is extended by:
    • DenseDenseMiddleTermComputer when both X and Y are dense (its implementation originates from GEMMTermComputer)
    • SparseSparseMiddleTermComputer when both X and Y are CSR. This component relies on a Cython routine.
  • Change EuclideanArgKmin and EuclideanRadiusNeighbors so that they only rely on MiddleTermComputer
  • Adapt is_usable_for in BaseDistanceReductionDispatcher to cover the CSR-CSR case.
  • Change the logic of compute in ArgKmin and RadiusNeighbors to select the Euclidean specialization for the CSR-CSR case
  • Implement sparse versions of _sqeuclidean_row_norms
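The sparse squared row norms mentioned in the last bullet can be sketched in pure Python over the CSR buffers. This is only an illustration of the idea (the PR implements it as a GIL-free Cython routine), and the function name here is hypothetical:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_sqeuclidean_row_norms(X_csr):
    """Squared Euclidean norm of each row of a CSR matrix.

    Pure-Python sketch: since only non-zero entries contribute to the
    norm, each row's norm is the dot product of its data slice with
    itself, delimited by the indptr array.
    """
    norms = np.empty(X_csr.shape[0], dtype=X_csr.data.dtype)
    for i in range(X_csr.shape[0]):
        start, end = X_csr.indptr[i], X_csr.indptr[i + 1]
        row_data = X_csr.data[start:end]
        norms[i] = np.dot(row_data, row_data)
    return norms

X = csr_matrix(np.array([[0.0, 1.0], [0.0, 0.0], [2.0, 3.0]]))
print(sparse_sqeuclidean_row_norms(X))  # squared norms: 1, 0, 13
```

Note that an all-zero row simply has an empty data slice, so its norm comes out as 0 without any special casing.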

Any other comments?

For benchmark results, see the following comments:

@jjerphan (Member) left a comment

Thank you for working on this, @Vincent-Maladiere!

Here are a few initial comments to guide you through extending this class hierarchy.

@Vincent-Maladiere (Author) commented
Some observations

  1. The dense and sparse computers yield the same result when max(X.shape[0], Y.shape[0]) < chunk_size; otherwise the sparse computation is still incorrect.
  2. (Unrelated, but could be substantial.) It seems that we don't handle the case where parameter > X.shape[0]; in that scenario even the values of the dense computer are wrong.

Testing script:

This is working ✅

import numpy as np
from numpy.testing import assert_array_equal
from scipy.sparse import csr_matrix
from sklearn.metrics._pairwise_distances_reduction import ArgKmin

X = np.array([[0, 1], [0, 0], [2, 3]], dtype=np.float64)
Y = np.array([[1, 1], [0, 0], [0, 1]], dtype=np.float64)

X_csr = csr_matrix(X)
Y_csr = csr_matrix(Y)

parameter = 3
dist, indices = ArgKmin.compute(
    X,
    Y,
    parameter,
    chunk_size=100,
    return_distance=True,
)
dist_csr, indices_csr = ArgKmin.compute(
    X_csr,
    Y_csr,
    parameter,
    chunk_size=100,
    return_distance=True,
)
assert_array_equal(dist, dist_csr)

This fails ❌ (notice that the only difference is the length of X and Y)

import numpy as np
from numpy.testing import assert_array_equal
from scipy.sparse import csr_matrix
from sklearn.metrics._pairwise_distances_reduction import ArgKmin

X = np.array([[0, 1], [0, 0], [2, 3]] * 40, dtype=np.float64)
Y = np.array([[1, 1], [0, 0], [0, 1]] * 40, dtype=np.float64)

X_csr = csr_matrix(X)
Y_csr = csr_matrix(Y)

parameter = 3
dist, indices = ArgKmin.compute(
    X,
    Y,
    parameter,
    chunk_size=100,
    return_distance=True,
)
dist_csr, indices_csr = ArgKmin.compute(
    X_csr,
    Y_csr,
    parameter,
    chunk_size=100,
    return_distance=True,
)
assert_array_equal(dist, dist_csr)

@jjerphan (Member) commented

(unrelated but could be substantial) It seems that we don't handle the case when parameter > X.shape[0], in that scenario even the values of the dense computer are wrong.

Yes, that's right (I suppose you meant "when parameter > Y.shape[0]"?): it's not validated in ArgKmin. Most validations are done closer to the user-facing interfaces, here in kneighbors:

if n_neighbors > n_samples_fit:
    raise ValueError(
        "Expected n_neighbors <= n_samples, "
        " but n_samples = %d, n_neighbors = %d" % (n_samples_fit, n_neighbors)
    )

As those interfaces are directly used by users, we do not revalidate all the parameters, but this might change in the future.
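For illustration, the user-facing validation quoted above can be observed directly through NearestNeighbors (a sketch; the exact error message wording may vary across scikit-learn versions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Fit on only 3 samples.
X = np.array([[0.0, 1.0], [0.0, 0.0], [2.0, 3.0]])
nn = NearestNeighbors(n_neighbors=2).fit(X)

try:
    # Requesting more neighbors than fitted samples trips the
    # n_neighbors <= n_samples check in kneighbors.
    nn.kneighbors(X, n_neighbors=5)
except ValueError as exc:
    print(exc)
```

The lower-level ArgKmin.compute discussed here skips this check, which is why out-of-range values of `parameter` produce wrong results there instead of an error.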

@jjerphan (Member) left a comment

Nice job making this work, @Vincent-Maladiere!

Here are a few comments to finish this first part.

The next steps involve (in this order):

  • rename sklearn/metrics/_pairwise_distances_reduction/_gemm_term_computer.{pxd,pyx}.tp (mind the lines mentioning those files as well, IIRC in at least .gitignore and setup.cfg)
  • extending some tests to make sure that the Sparse-Sparse support for the Euclidean specialisations is correct
  • performing some benchmarks on the user-facing API; benchmarking kneighbors should suffice (this gist might be adapted)
  • updating and adapting the documentation/comments of those implementations with respect to those changes

To clarify, the following points are better done as part of subsequent PRs (preferably in this order):

  • refactoring to merge DatasetsPairs and MiddleTermComputers, and to encapsulate squared norms computations where appropriate
  • supporting the CSR-dense and dense-CSR cases for the Euclidean specialisation

@Vincent-Maladiere (Author) commented

As most of this work (except _sqeuclidean_row_norms) is not user-facing, should we append it to the changelog?

@jjerphan (Member) left a comment

As most of this work (except _sqeuclidean_row_norms) is not user-facing, should we append it to the changelog?

For now, to keep the CI green, I've labelled this PR with "No Changelog Needed" because it is still experimental. I also converted it to draft because this is still WIP ([WIP] titles were used before the draft mode existed, but they are not of much use now). Once we have assessed the performance changes, we can add a changelog entry and remove this label.

Also, it looks like a few Cython sources have been added lately.

They should be removable with:

rm sklearn/metrics/_pairwise_distances_reduction/{_gemm_term_computer,_radius_neighborhood}.{pxd,pyx}
git rm --cached -f sklearn/metrics/_pairwise_distances_reduction/{_gemm_term_computer,_radius_neighborhood}.{pxd,pyx}
git add sklearn/metrics/_pairwise_distances_reduction/{_gemm_term_computer,_radius_neighborhood}.{pxd,pyx}
git commit -m "Remove previous Cython templates and sources"

Here are a few comments.

@jjerphan jjerphan marked this pull request as draft October 20, 2022 09:38

Vincent-Maladiere commented Oct 26, 2022

Hardware Scalability on 76574eb (using this script)

The current implementation of Sparse-Sparse Euclidean NearestNeighbors does not benefit from the PairwiseDistancesReduction back-end introduced in #22134.

Following the design guidelines previously established, this new implementation:

  • Computes the sqeuclidean_row_norms from a CSR matrix without the GIL
  • Computes the middle term -2 * X @ Y.T following the idea introduced here. SparseSparseMiddleTermComputer performs the matmul between two CSR matrices in the BLAS GEMM fashion, without the GIL.
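The CSR-CSR middle term above can be sketched in pure Python as follows. This is a deliberately naive illustration of the dot-product accumulation over CSR buffers, not the PR's Cython routine, and the function name is hypothetical:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_sparse_middle_term(X_csr, Y_csr):
    """Compute -2 * X @ Y.T between two CSR matrices.

    For each pair of rows, only columns present in both rows
    contribute to the dot product.
    """
    n_x, n_y = X_csr.shape[0], Y_csr.shape[0]
    out = np.zeros((n_x, n_y))
    for i in range(n_x):
        x_start, x_end = X_csr.indptr[i], X_csr.indptr[i + 1]
        for j in range(n_y):
            y_start, y_end = Y_csr.indptr[j], Y_csr.indptr[j + 1]
            dot = 0.0
            for k in range(x_start, x_end):
                col = X_csr.indices[k]
                # Linear scan of Y's j-th row for clarity; an efficient
                # version would walk both sorted index arrays in step.
                for m in range(y_start, y_end):
                    if Y_csr.indices[m] == col:
                        dot += X_csr.data[k] * Y_csr.data[m]
                        break
            out[i, j] = -2.0 * dot
    return out
```

The real implementation avoids the Python interpreter entirely so that chunks of the output matrix can be filled in parallel without the GIL.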

This new design scales easily up to 8 cores on my laptop, unlike the current implementation that fails to do so.

Below are the results obtained from running the script linked in the title, with the following parameters:

  • size of X and Y (100,000 x n_features)
  • dtype: np.float64
  • density: 0.05 (5%)

This branch: speed_up_100000_100000_log_my_branch (scalability plot)

Main: speed_up_100000_100000_log (scalability plot)

However, I also observe that the absolute runtime of this implementation greatly depends on the number of features: in high dimensions (> 500), this branch currently performs worse than main, so a bit more investigation is required here.

n_features = 50 (main is blue): n_features_50 (runtime plot)

n_features = 100 (main is blue): n_features_100 (runtime plot)

n_features = 500 (main is blue): n_features_500 (runtime plot)

Raw data

This branch:

| n_threads | n_train | n_test | n_features | mean_runtime | stderr_runtime | commit |
|-----------|---------|--------|------------|--------------|----------------|---------|
| 1 | 100000 | 100000 | 50 | 140.372 | 0 | 76574eb |
| 2 | 100000 | 100000 | 50 | 69.0472 | 0 | 76574eb |
| 4 | 100000 | 100000 | 50 | 37.6792 | 0 | 76574eb |
| 8 | 100000 | 100000 | 50 | 33.1491 | 0 | 76574eb |
| 1 | 100000 | 100000 | 100 | 374.193 | 0 | 76574eb |
| 2 | 100000 | 100000 | 100 | 191.189 | 0 | 76574eb |
| 4 | 100000 | 100000 | 100 | 100.264 | 0 | 76574eb |
| 8 | 100000 | 100000 | 100 | 83.0642 | 0 | 76574eb |
| 1 | 100000 | 100000 | 500 | 6489.29 | 0 | 76574eb |
| 2 | 100000 | 100000 | 500 | 3357.13 | 0 | 76574eb |
| 4 | 100000 | 100000 | 500 | 1744.47 | 0 | 76574eb |
| 8 | 100000 | 100000 | 500 | 1274.04 | 0 | 76574eb |

Main:

| n_threads | n_train | n_test | n_features | mean_runtime | stderr_runtime | commit |
|-----------|---------|--------|------------|--------------|----------------|---------|
| 1 | 100000 | 100000 | 50 | 240.924 | 0 | a71c535 |
| 2 | 100000 | 100000 | 50 | 323.733 | 0 | a71c535 |
| 4 | 100000 | 100000 | 50 | 273.084 | 0 | a71c535 |
| 8 | 100000 | 100000 | 50 | 270.112 | 0 | a71c535 |
| 1 | 100000 | 100000 | 100 | 245.132 | 0 | a71c535 |
| 2 | 100000 | 100000 | 100 | 316.328 | 0 | a71c535 |
| 4 | 100000 | 100000 | 100 | 364.912 | 0 | a71c535 |
| 8 | 100000 | 100000 | 100 | 284.274 | 0 | a71c535 |
| 1 | 100000 | 100000 | 500 | 454.946 | 0 | a71c535 |
| 2 | 100000 | 100000 | 500 | 416.54 | 0 | a71c535 |
| 4 | 100000 | 100000 | 500 | 349.073 | 0 | a71c535 |
| 8 | 100000 | 100000 | 500 | 306.178 | 0 | a71c535 |

@Micky774 could you give me your thoughts about this PR? :)

@jjerphan (Member) left a comment

This should be my final review for this PR.

Here are some last comments.

Note that #24542 and #24715 must be merged before this PR.

@@ -84,8 +84,7 @@ cdef class RadiusNeighbors{{name_suffix}}(BaseDistancesReduction{{name_suffix}})
         """
         if (
             metric in ("euclidean", "sqeuclidean")
-            and not issparse(X)
-            and not issparse(Y)
+            and not (issparse(X) ^ issparse(Y))
@jjerphan (Member) commented

Side-note: in another PR, I think it would be nice to turn is_valid_sparse_matrix, i.e.:

def is_valid_sparse_matrix(X):
    return (
        isspmatrix_csr(X)
        and
        # TODO: support CSR matrices without non-zeros elements
        X.nnz > 0
        and
        # TODO: support CSR matrices with int64 indices and indptr
        # See: https://github.com/scikit-learn/scikit-learn/issues/23653
        X.indices.dtype == X.indptr.dtype == np.int32
    )

into is_supported_sparse_matrix in the global scope of base.pyx, and to propagate it in place of scipy.sparse.{issparse,isspmatrix_csr} in sklearn.metrics._pairwise_distances_reduction.
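As a standalone illustration, the predicate sketched above behaves as follows (assuming scipy builds small CSR matrices with 32-bit index buffers, which is its default when sizes fit):

```python
import numpy as np
from scipy.sparse import csr_matrix, isspmatrix_csr

def is_valid_sparse_matrix(X):
    # Same predicate as the snippet quoted above: CSR format,
    # at least one stored element, and 32-bit indices/indptr.
    return (
        isspmatrix_csr(X)
        and X.nnz > 0
        and X.indices.dtype == X.indptr.dtype == np.int32
    )

X = csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
print(is_valid_sparse_matrix(X))            # small CSR matrix: accepted
print(is_valid_sparse_matrix(X.toarray()))  # dense ndarray: rejected
print(is_valid_sparse_matrix(csr_matrix((2, 2))))  # no stored elements: rejected
```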

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@jjerphan (Member) left a comment

LGTM, with acceptance for merge pending due to a duplicated allocation and cast, as explained in the comment below.

@jjerphan (Member) left a comment

LGTM. Let's solve the TODOs in another PR.

@ogrisel (Member) left a comment

LGTM. Just a few nitpicks to make the not-XOR-based conditions easier to grasp, as I think it's quite rare to see such constructs in our code.
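For reference, the not-XOR construct from the diff above accepts the dense-dense and sparse-sparse cases while rejecting mixed ones; a sketch of an equivalent, arguably more explicit spelling (helper name is illustrative, not from the PR):

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

def both_or_neither_sparse(X, Y):
    # Equivalent to `not (issparse(X) ^ issparse(Y))`:
    # True when X and Y are both sparse or both dense.
    return issparse(X) == issparse(Y)

X_dense = np.zeros((2, 2))
X_sparse = csr_matrix(X_dense)

print(both_or_neither_sparse(X_dense, X_dense))    # dense-dense: usable
print(both_or_neither_sparse(X_sparse, X_sparse))  # sparse-sparse: usable
print(both_or_neither_sparse(X_dense, X_sparse))   # mixed: not usable
```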

@jjerphan removed the "Waiting for Second Reviewer" and "No Changelog Needed" labels Nov 18, 2022
@jjerphan (Member) left a comment

We are almost there!

@jjerphan jjerphan merged commit 9c9c858 into scikit-learn:main Nov 18, 2022

jjerphan commented Nov 18, 2022

Congrats @Vincent-Maladiere! 🎵 🎉 👏

Those are significant performance improvements for many estimators! 🎉
I hope you have learned a lot about scikit-learn and Cython development. 😄
