Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Private Cython Submodule for Reduction over Pairwise Distances #20254

Closed
wants to merge 261 commits into from

Conversation

jjerphan
Copy link
Member

@jjerphan jjerphan commented Jun 11, 2021

⚠ This PR has been superseded by #21462

What does this implement/fix? Explain your changes.

Aggregations over pairwise distances (argmin, argkmin, threshold filtering, cumulative sum, etc.) are one of the foundations used by many algorithms within scikit-learn.

Various strategies exist to compute them different, including some using chunks (see sklearn/metrics/pairwise.py). However, those are still mainly implemented at the level of Python leaving some rooms for manœuvre, especially for potential optimizations which can be made at a lower level.

This PR explores new implementations for such aggregations using Cython to have a finer control over computations (similarly to what has been explored and implemented for K-means).

Importantly, the core of the computation is chunked to make sure that the pairwise distance chunk matrices stay in CPU cache before applying the final reduction step. This can lead to several x speed-ups alone. Furthermore, the chunking strategy is also used to leverage OpenMP-based paralellism (using Cython prange loop) which gives another multiplicative speed-up in favorable cases on many-core machines.

Interfaces benefiting from this PR

Core interfaces:

  • sklearn.neighbors.KNeighborsMixin.kneighbors
  • sklearn.neighbors.RadiusNeighborsMixin.radius_neighbors
  • sklearn.metrics.pairwise_distances_argmin
  • sklearn.metrics.pairwise_distances_argmin_min

User facing interfaces (might not be exhaustive):

  • sklearn.cluster.Birch
  • sklearn.cluster.DBSCAN
  • sklearn.cluster.MeanShift
  • sklearn.cluster.OPTICS
  • sklearn.cluster.SpectralClustering
  • sklearn.feature_selection.mutual_info_regression
  • sklearn.neighbors.AffinityPropagation
  • sklearn.neighbors.KNeighborsClassifier
  • sklearn.neighbors.KNeighborsRegressor
  • sklearn.neighbors.LocalOutlierFactor
  • sklearn.neighbors.NearestNeighbors
  • sklearn.manifold.Isomap
  • sklearn.manifold.LocallyLinearEmbedding
  • sklearn.manifold.TSNE
  • sklearn.manifold.trustworthiness
  • sklearn.semi_supervised.LabelPropagation
  • sklearn.semi_supervised.LabelSpreading

Last Benchmarks

Benchmarks are made via https://github.com/jjerphan/scikit-learn/tree/benchmarks/pairwise_aggregation_cython.

For kneighbors(1.2× to 20× speed-up): #20254 (comment)

fast_sqeuclidean metric

Ran with:

asv continuous -b BruteForceNearestNeighborsBenchmark -e -q  main benchmarks/pairwise_aggregation_cython
       before           after         ratio                                                                                                                                                        
     [df20e815]       [3adec08f]                                                                                                                                                                   
     <main>           <benchmarks/pairwise_aggregation_cython>                                          n_train, n_test, n_features, (k, radius)
-         3.51±0s          2.99±0s     0.85  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (1000, 1000))                                                       
-         1.73±0s          1.43±0s     0.83  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (100, 100))                                                   
-         1.72±0s          1.39±0s     0.81  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (1000, 1000))                                                 
-         1.77±0s          1.41±0s     0.80  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (10, 10))                                                     
-         1.12±0s          865±0ms     0.77  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (1, 1))                                                       
-         1.08±0s          760±0ms     0.70  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (100, 100))                                                    
-         1.13±0s          746±0ms     0.66  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (1, 1))                                                       
-         1.18±0s          776±0ms     0.66  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (10, 10))                                                     
-         1.05±0s          679±0ms     0.64  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (1000, 1000))                                                  
-         2.55±0s          1.14±0s     0.45  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (100, 100))                                                         
-         2.89±0s          1.22±0s     0.42  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (1000, 1000))                                                       
-         2.32±0s          792±0ms     0.34  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (1000, 1000))                                                 
-         2.64±0s          881±0ms     0.33  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (1, 1))                                                             
-         2.09±0s          666±0ms     0.32  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (10, 10))                                                      
-         2.30±0s          720±0ms     0.31  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (1, 1))                                                        
-         2.51±0s          763±0ms     0.30  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (100, 100))                                                   
-         4.37±0s          959±0ms     0.22  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (10, 10))                                                           
-         1.42±0s          226±0ms     0.16  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (1, 1))                                                             
-         6.31±0s          974±0ms     0.15  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (1000, 1000))                                                        
-         1.93±0s          226±0ms     0.12  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (10, 10))                                                           
-         2.23±0s          225±0ms     0.10  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (100, 100))                                                          
-         3.98±0s          327±0ms     0.08  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (100, 100))                                                         
-         2.93±0s          125±0ms     0.04  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (10, 10))                                                            
-         3.41±0s          141±0ms     0.04  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (1, 1))                                                              
                                                                                                                                                                                                   
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.  
PERFORMANCE INCREASED.
Standard metric (here manhattan )

BruteForceNearestNeighborsBenchmark as been modified on top of in 3adec08 to use metric='manhattan' and ran with:

asv continuous -b BruteForceNearestNeighborsBenchmark -e -q  main benchmarks/pairwise_aggregation_cython
       before           after         ratio
     [df20e815]       [3adec08f]
     <main>           <benchmarks/pairwise_aggregation_cython>
-         2.19±0s          1.78±0s     0.81  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (10, 10))
-         3.35±0s          2.37±0s     0.71  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (100, 100))
-         2.45±0s          1.62±0s     0.66  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (1, 1))
-         2.94±0s          1.88±0s     0.64  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (1000, 1000))
-         6.97±0s          3.97±0s     0.57  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (1, 1))
-         8.15±0s          3.85±0s     0.47  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (10, 10))
-         5.92±0s          2.78±0s     0.47  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (100, 100))
-         7.46±0s          3.39±0s     0.45  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (1000, 1000))
-         2.62±0s          1.15±0s     0.44  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (1, 1))
-         4.21±0s          1.80±0s     0.43  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 50, (100, 100))
-         6.16±0s          2.30±0s     0.37  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 50, (10, 10))
-         7.97±0s          2.84±0s     0.36  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 100, (1000, 1000))
-         45.1±0s          15.7±0s     0.35  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (1000, 1000))
-         43.6±0s          14.9±0s     0.34  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (100, 100))
-         41.4±0s          13.8±0s     0.33  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (10, 10))
-         43.4±0s          13.7±0s     0.32  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (1, 1))
-         8.19±0s          2.54±0s     0.31  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (1, 1))
-         8.39±0s          2.55±0s     0.30  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (100, 100))
-         44.4±0s          13.4±0s     0.30  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (1, 1))
-         42.7±0s          12.5±0s     0.29  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (10, 10))
-         45.5±0s          13.2±0s     0.29  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 500, (100, 100))
-         11.2±0s          3.21±0s     0.29  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (1000, 1000))
-         44.3±0s          11.7±0s     0.26  bruteforce.BruteForceNearestNeighborsBenchmark.time_radius_neighbors(10000, 10000, 500, (1000, 1000))
-         8.88±0s          2.28±0s     0.26  bruteforce.BruteForceNearestNeighborsBenchmark.time_kneighbors(10000, 10000, 100, (10, 10))

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Any other comments?

The initial experiments have been in scikit-learn-inria-fondation/pdist_aggregation

Subsequent work include:

  • the adaptation for sparse datasets implementations by @mbatoul is still WIP in [WIP] ENH Implement sparse support jjerphan/scikit-learn#4
  • the adaptation using optimized interfaces of algorithms internals such as:
    • cluster._optics.compute_core_distances
    • setting return_distance=False where applicable on sklearn.neighbors.KNeighborsMixin.kneighbors and sklearn.neighbors.RadiusNeighborsMixin.radius_neighbors calls.
    • specialize ArgKmin(k=1) under ArgMin
    • define PairwiseDistancesReduction for KMeans and adapt its implementation accordingly
    • row norms of X_train once and attach it to KNeighborsMixins
  • compare OpenMP scheduling strategies (especially 'static' vs 'guided')

Main changes:

  • Group NeighborsHeap implementations in a private submodule and factor code from neighbors._binary_tree.NeighborsHeap as it is being reused here
  • Move DistanceMetrics from neighbors to metrics and unify types
  • Introduce a backward compatibility alias with a deprecation warning for sklearn.neighbors.DistanceMetrics because unfortunately it's part of the public API of scikit-learn.
  • Introduce the 'fast_sqeuclidean' strategy (the current one using the _gemm trick) and document it appropriately (tl;dr fast, but unstable when far from the origin)
  • Adapt pairwise_distances_argmin_min
  • Introduce general class hierarchy for reductions on pairwise distance
  • Implement a reduction for querying neighbors using a radius (similarly to RadiusNeighborsMixin._radius_neighbors_reduce_func)
  • Add tests for the private submodule
  • Add a guard on arm64 to use the latest version of openblas and use it in CI
  • Use metric="fast_sqeuclidean" in KNeighborsClassifier/Regressor by default to unlock maximum prediction performance while give the users full control to use a more numerically stable alternative otherwise.
  • Support sparse matrices (sparse-dense, dense-sparse, sparse-sparse) Design (see DatasetsPair)

Move neighbors.NeighborsHeap's code and  _typedefs under
sklearn.utils as cyclic imports are currently happening
between sklearn.neighbors and sklearn.metrics.

Also, using integral in some cases gave unexpected results.
Occurences were changed to use np.int_p, as exposed by
utils._typedefs.ITYPE_t (we don't need signed integer)
@jjerphan jjerphan force-pushed the pairwise_aggregation_cython branch from ce73f0c to cb85791 Compare June 23, 2021 16:16
@jjerphan
Copy link
Member Author

(Small note: force-pushing allow me here to have a somewhat clean history during adaptation prior to a first review).

@jjerphan jjerphan changed the title EHN Optimise aggregations over pairwise distances [WIP] ENH Optimise aggregations over pairwise distances Jun 24, 2021
This test segfaults:

test_neighbors.py::test_fast_sqeuclidean_correctness[1-10-5-1000]
The segfaults was due to reallocation
of on the same pointers, causing multiple
freeing on the same reference and memory leaks.

To resolve this, arrays of pointers for local
datastructures are allocated at the initialisation
of the interface so that they can be handled
separately in threads with proper allocation
and deallocation.

The memory management will be wrapped in
subsequent private template method for
each types of reduction and parallelisation
strategy. This is one of the next iteration.
Refactor ArgKmin._reduce_on_chunks to pave the
way to general interface for reductions.

Private datastructures will have to be accessed via
the implementation of this private method.
Introduce ParallelReduction as a abstract class,
and extend it using ArgKmin.

FastSquaredEuclideanArgKmin extends ArgKmin
for the "fast_sqeuclidean" strategy.
The associated _typedefs.pyx file has been
moved to utils to avoid circular dependencies
has it is being used in neighbors.
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the rest of the comments. I really like test_pairwise_distances_reduction

sklearn/utils/__init__.py Outdated Show resolved Hide resolved
sklearn/utils/_heap.pyx Outdated Show resolved Hide resolved
sklearn/utils/_testing.py Outdated Show resolved Hide resolved
sklearn/neighbors/tests/test_neighbors.py Outdated Show resolved Hide resolved
sklearn/neighbors/tests/test_neighbors.py Outdated Show resolved Hide resolved
sklearn/neighbors/tests/test_neighbors.py Outdated Show resolved Hide resolved
sklearn/neighbors/tests/test_neighbors.py Outdated Show resolved Hide resolved
sklearn/metrics/tests/test_pairwise_distances_reduction.py Outdated Show resolved Hide resolved
sklearn/metrics/tests/test_pairwise_distances_reduction.py Outdated Show resolved Hide resolved
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
@jjerphan
Copy link
Member Author

To re-answer @thomasjpfan's question (still present in a thread but hardly accessible because of new activity and impossibility to access it via a link):

Does self.distance_metric.rdist use a v-table look up? (I am curious. This may not be actionable)

Yes, it does. I do not think we can both have design flexibility without v-tables.


V-table implementation details
  • a struct is defined for the base class (here DistanceMetric):
struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_DistanceMetric {
  __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*dist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_ITYPE_t);
  __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*rdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_ITYPE_t);
  __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*csr_dist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice);
  __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*csr_rdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice);
  int (*pdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __Pyx_memviewslice, __Pyx_memviewslice);
  int (*cdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __Pyx_memviewslice, __Pyx_memviewslice, __Pyx_memviewslice);
  __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*_rdist_to_dist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t);
  __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*_dist_to_rdist)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t);
};
  • a struct is created for each subclass (for instance here EuclideanDistance) and wraps the original struct as __pyx_base.
struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_EuclideanDistance {
  struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_DistanceMetric __pyx_base;
};
static struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_EuclideanDistance *__pyx_vtabptr_7sklearn_7metrics_13_dist_metrics_EuclideanDistance;
  • subclasses' methods are converted to functions (not shown bellow), __pyx_base get set as the base and subclasses' methods get bounds to the table:
  __pyx_vtabptr_7sklearn_7metrics_13_dist_metrics_EuclideanDistance = &__pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance;
  __pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base = *__pyx_vtabptr_7sklearn_7metrics_13_dist_metrics_DistanceMetric;
  __pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base.dist = (__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_ITYPE_t))__pyx_f_7sklearn_7metrics_13_dist_metrics_17EuclideanDistance_dist;
  __pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base.rdist = (__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const *, __pyx_t_7sklearn_5utils_9_typedefs_ITYPE_t))__pyx_f_7sklearn_7metrics_13_dist_metrics_17EuclideanDistance_rdist;
  __pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base._rdist_to_dist = (__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t))__pyx_f_7sklearn_7metrics_13_dist_metrics_17EuclideanDistance__rdist_to_dist;
  __pyx_vtable_7sklearn_7metrics_13_dist_metrics_EuclideanDistance.__pyx_base._dist_to_rdist = (__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t (*)(struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_DistanceMetric *, __pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t))__pyx_f_7sklearn_7metrics_13_dist_metrics_17EuclideanDistance__dist_to_rdist;
  • when a subclass object is created, its __pyx_base v-tab is set to the one of its class.
static PyObject *__pyx_tp_new_7sklearn_7metrics_13_dist_metrics_EuclideanDistance(PyTypeObject *t, PyObject *a, PyObject *k) {
  struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_EuclideanDistance *p;
  #if CYTHON_COMPILING_IN_LIMITED_API
  newfunc new_func = (newfunc)PyType_GetSlot(__pyx_ptype_7sklearn_7metrics_13_dist_metrics_DistanceMetric, Py_tp_new);
  PyObject *o = new_func(t, a, k);
  #else
  PyObject *o = __pyx_tp_new_7sklearn_7metrics_13_dist_metrics_DistanceMetric(t, a, k);
  #endif
  if (unlikely(!o)) return 0;
  p = ((struct __pyx_obj_7sklearn_7metrics_13_dist_metrics_EuclideanDistance *)o);
  p->__pyx_base.__pyx_vtab = (struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_DistanceMetric*)__pyx_vtabptr_7sklearn_7metrics_13_dist_metrics_EuclideanDistance;
  return o;
}
  • on call sites, the dispatch is done via look-ups, e.g. on the code you commented:
  /* "sklearn/metrics/_dist_metrics.pyx":1334
 *     @final
 *     cdef DTYPE_t proxy_dist(self, ITYPE_t i, ITYPE_t j) nogil:
 *         return self.distance_metric.rdist(&self.X[i, 0],             # <<<<<<<<<<<<<<
 *                                           &self.Y[j, 0],
 *                                           self.d)
 */
  __pyx_t_5 = ((struct __pyx_vtabstruct_7sklearn_7metrics_13_dist_metrics_DistanceMetric *)__pyx_v_self->__pyx_base.distance_metric->__pyx_vtab)->rdist(__pyx_v_self->__pyx_base.distance_metric, (&(*((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const  *) ( /* dim=1 */ ((char *) (((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const  *) ( /* dim=0 */ (__pyx_v_self->X.data + __pyx_t_1 * __pyx_v_self->X.strides[0]) )) + __pyx_t_2)) )))), (&(*((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const  *) ( /* dim=1 */ ((char *) (((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t const  *) ( /* dim=0 */ (__pyx_v_self->Y.data + __pyx_t_3 * __pyx_v_self->Y.strides[0]) )) + __pyx_t_4)) )))), __pyx_v_self->d); if (unlikely(__pyx_t_5 == ((__pyx_t_7sklearn_5utils_9_typedefs_DTYPE_t)-1.0))) __PYX_ERR(1, 1334, __pyx_L1_error)
  __pyx_r = __pyx_t_5;
  goto __pyx_L0;

I do not know much about C compilers, but are static-qualified definitions helping with optimizations?

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not done yet, there are so many things to think about with this PR.

Do you think we can split the PairwiseDistancesRadiusNeighborhood out into a follow up PR? This way we can focus on PairwiseDistancesReduction and reduce the diff in this PR.

sklearn/cluster/_birch.py Outdated Show resolved Hide resolved
sklearn/cluster/_birch.py Outdated Show resolved Hide resolved

@final
cdef DTYPE_t proxy_dist(self, ITYPE_t i, ITYPE_t j) nogil:
return self.distance_metric.rdist(&self.X[i, 0],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does. My concern is that _compute_and_reduce_distances_on_chunks is going to call into .surrogate_dist in a nested loop where each

I do no not think the static definition helps to optimize this away, although I would need to look into it with https://compiler-explorer.com/.

Anyways, I do not think this is actionable for now. In a follow up PR, it would be interesting to investigate if hard coding a inline function and avoiding the vtable look changes our benchmarks.

sklearn/metrics/_pairwise_distances_reduction.pyx Outdated Show resolved Hide resolved
@jjerphan
Copy link
Member Author

jjerphan commented Oct 19, 2021

Answering @thomasjpfan's question:

Do you think we can split the PairwiseDistancesRadiusNeighborhood out into a follow up PR? This way we can focus on PairwiseDistancesReduction and reduce the diff in this PR.

Yes, I also would have preferred this PR to have a smaller granularity.

We can extract the patches to solely introduce PairwiseDistancesArgKmin and the other necessary abstractions (DatasetsPair). WDYT?

--

PS: @rth and I were/are in favor in splitting this PR is atomic ones, @ogrisel prefers to keep it as is.

@jjerphan
Copy link
Member Author

jjerphan commented Oct 20, 2021

A simple setup with perf(1) with:

  • X.shape=(100000, 100)
  • Y.shape=(10000, 100)

shows that most of cycles are spent in GEMM kernels for FastEuclideanPairwiseDistancesArgKmin.

When PRESCOTT kernels are used
-  100.00%        python
   -   87.78%        libopenblasp-r0-085ca80a.3.9.so
          84.30%        [.] dgemm_kernel_PRESCOTT
           2.49%        [.] dgemm_oncopy_PRESCOTT
           0.74%        [.] dgemm_beta_PRESCOTT
           0.18%        [.] blas_thread_server
           0.02%        [.] dgemm_tn
           0.01%        [.] ddot_k_PRESCOTT
           0.01%        [.] sched_yield@plt
           0.01%        [.] pthread_mutex_lock@plt
           0.01%        [.] blas_memory_free
           0.01%        [.] pthread_mutex_unlock@plt
           0.00%        [.] dgemm_
           0.00%        [.] blas_memory_alloc
           0.00%        [.] ddot_
   -    5.91%        _pairwise_distances_reduction.cpython-39-x86_64-linux-gnu.so
           5.91%        [.] __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_37FastEuclideanPairwiseDistancesArgKmin__compute_and_reduce_distances_on_chunks
           0.00%        [.] __pyx_memoryview_slice_memviewslice
           0.00%        [.] __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_24PairwiseDistancesArgKmin__parallel_on_X_prange_iter_finalize
           0.00%        [.] __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_26PairwiseDistancesReduction__parallel_on_X
   -    3.23%        _heap.cpython-39-x86_64-linux-gnu.so
           3.19%        [.] __pyx_fuse_1__pyx_f_7sklearn_5utils_5_heap_heap_push
           0.04%        [.] __pyx_fuse_1__pyx_f_7sklearn_5utils_5_heap_simultaneous_sort
   +    1.12%        python3.9
   -    0.94%        libgomp.so.1.0.0
           0.94%        [.] do_wait
           0.00%        [.] gomp_thread_start
           0.00%        [.] gomp_team_barrier_wait_end
   +    0.29%        [unknown]
   +    0.28%        libpthread-2.33.so
   +    0.23%        libc-2.33.so
   +    0.17%        libopenblasp-r0-5bebc122.3.13.dev.so
   +    0.04%        ld-2.33.so
   +    0.00%        cython_blas.cpython-39-x86_64-linux-gnu.so
   +    0.00%        _multiarray_umath.cpython-39-x86_64-linux-gnu.so
   +    0.00%        _cython_blas.cpython-39-x86_64-linux-gnu.so
   +    0.00%        _ctypes.cpython-39-x86_64-linux-gnu.so
   +    0.00%        _queue.cpython-39-x86_64-linux-gnu.so
When SKYLAKEX kernels are used
-  100.00%        python
   -   57.57%        libopenblasp-r0.3.17.so
          49.84%        [.] dgemm_kernel_SKYLAKEX
           4.44%        [.] dgemm_incopy_SKYLAKEX
           2.34%        [.] dgemm_oncopy_SKYLAKEX
           0.73%        [.] blas_thread_server
           0.12%        [.] dgemm_tn
           0.08%        [.] dot_compute
           0.01%        [.] dgemm_
           0.01%        [.] blas_memory_free
           0.01%        [.] blas_memory_alloc
           0.00%        [.] dgemm_beta_SKYLAKEX
   -   21.29%        _pairwise_distances_reduction.cpython-39-x86_64-linux-gnu.so
          21.28%        [.] __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_37FastEuclideanPairwiseDistancesArgKmin__compute_and_reduce_distances_on_chunks
           0.01%        [.] __pyx_memoryview_slice_memviewslice
           0.00%        [.] __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_24PairwiseDistancesArgKmin__parallel_on_X_prange_iter_finalize
   -   11.71%        _heap.cpython-39-x86_64-linux-gnu.so
          11.58%        [.] __pyx_fuse_1__pyx_f_7sklearn_5utils_5_heap_heap_push
           0.12%        [.] __pyx_fuse_1__pyx_f_7sklearn_5utils_5_heap_simultaneous_sort
   -    3.43%        libgomp.so.1.0.0
           3.43%        [.] do_wait
   +    2.94%        python3.9
   +    2.29%        libc-2.33.so
   +    0.62%        [unknown]
   +    0.09%        ld-2.33.so
   +    0.02%        _multiarray_umath.cpython-39-x86_64-linux-gnu.so
   +    0.02%        libpthread-2.33.so
   +    0.01%        _dist_metrics.cpython-39-x86_64-linux-gnu.so
   +    0.01%        _reordering.cpython-39-x86_64-linux-gnu.so
   +    0.00%        _cython_blas.cpython-39-x86_64-linux-gnu.so

jjerphan and others added 3 commits October 20, 2021 16:02
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
@ogrisel
Copy link
Member

ogrisel commented Oct 20, 2021

Interesting, on a machine with AVX512 and the proper version of OpenBLAS, the function __pyx_f_7sklearn_7metrics_29_pairwise_distances_reduction_37FastEuclideanPairwiseDistancesArgKmin__compute_and_reduce_distances_on_chunks starts to be significant (more than 20% of the time) which means that optimizing it (e.g. maybe loop unrolling, vectorization...) could start to make sense at some point. But not as part of this PR ;)

@ogrisel
Copy link
Member

ogrisel commented Oct 20, 2021

PS: @rth and I were/are in favor in splitting this PR is atomic ones, @ogrisel prefers to keep it as is.

I am fine with moving the radius neighbors code in a follow-up PR on top of this ArgKMin + infra PR.

Also reorder instructions to have X's before Y's.

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Oct 25, 2021
As to make scikit-learn#20254 smaller.
The removed hunks will be re-introduced in a subsequent PR.
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Oct 25, 2021
As to make scikit-learn#20254 smaller.
The removed hunks will be re-introduced in a subsequent PR.
jjerphan and others added 3 commits October 25, 2021 14:50
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
As to make scikit-learn#20254 smaller.
The removed hunks will be re-introduced in a subsequent PR.
As to make scikit-learn#20254 smaller.
The removed hunks will be re-introduced in a subsequent PR.
@jjerphan jjerphan changed the title ENH Private Cython submodule for aggregations over pairwise distances ENH Private Cython Submodule for Reduction over Pairwise Distances Oct 25, 2021
jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Oct 26, 2021
Taken and adapted from the description of scikit-learn#20254
written by Olivier.

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@jjerphan jjerphan added Superseded PR has been replace by a newer PR and removed Waiting for Reviewer labels Oct 26, 2021
@jjerphan
Copy link
Member Author

This PR has been superseded by #21462.

@lorentzenchr
Copy link
Member

#22134 is merged.

@jjerphan jjerphan deleted the pairwise_aggregation_cython branch October 21, 2022 14:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cython High Priority High priority issues and pull requests module:metrics module:neighbors module:utils Performance Superseded PR has been replace by a newer PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants