FEA Add support for float32 on `PairwiseDistancesReduction` using Tempita #23865

jjerphan · 2022-07-08T13:54:24Z

Reference Issues/PRs

Follows-up #22134

What does this implement/fix? Explain your changes.

This ports `PairwiseDistancesReduction` and its related implementations to 32-bit floats (float32) using Tempita.
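To illustrate the general idea of generating one implementation per dtype from a single template, here is a minimal sketch. It uses the standard library's `string.Template` as a stand-in and is not Tempita syntax (Tempita uses `{{...}}` delimiters). The class name suffixes, the `X` attribute, and the template body are hypothetical and are not taken from this PR's diff.

```python
from string import Template

# Hypothetical template: `$suffix` and `$ctype` are placeholders invented
# for this sketch; the real templates in the PR are written with Tempita.
KERNEL_TEMPLATE = Template(
    "cdef class PairwiseDistancesReduction$suffix:\n"
    "    cdef const $ctype[:, ::1] X\n"
)

# One (name suffix, C type) pair per supported floating-point precision.
SPECIALIZATIONS = [("64", "double"), ("32", "float")]

# Render the same source once per precision and concatenate the results.
generated = "\n".join(
    KERNEL_TEMPLATE.substitute(suffix=suffix, ctype=ctype)
    for suffix, ctype in SPECIALIZATIONS
)
print(generated)
```

The point of the approach is that the float64 and float32 variants share one source of truth, so a fix made in the template applies to both.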

Benchmark results

The hardware scalability plateaus at 64 threads because, asymptotically and using Amdahl's law, 2.5% of the code (part of which is due to interactions with CPython) is sequential.

Improving hardware scalability beyond that point means removing the last sequential portions of the code, which account for the few remaining percentage points.
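As a rough check of this reasoning, the sketch below applies the asymptotic form of Amdahl's law (the speed-up tends to 1 / s for a sequential fraction s) to two runtimes copied from the first raw-results table below. The helper names are illustrative only, and treating the 64-thread speed-up as the plateau is an assumption of this sketch.

```python
def amdahl_speedup(n_threads, serial_fraction):
    """Speed-up predicted by Amdahl's law for a given sequential fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)


def asymptotic_serial_fraction(plateau_speedup):
    """Asymptotic reading of the law: as n_threads grows, speed-up -> 1 / s."""
    return 1.0 / plateau_speedup


# Mean runtimes in seconds, copied from the first raw-results table
# (n_train = n_test = 100000, n_features = 50).
runtime_1_thread = 54.270973
runtime_64_threads = 1.482027

# Assumption of this sketch: the 64-thread speed-up is taken as the plateau.
plateau_speedup = runtime_1_thread / runtime_64_threads
estimated_fraction = asymptotic_serial_fraction(plateau_speedup)

print(f"speed-up at 64 threads: x{plateau_speedup:.1f}")
print(f"implied sequential fraction: {estimated_fraction:.1%}")
# With a 2.5% sequential fraction, the law caps the speed-up at 1 / 0.025 = 40.
print(f"speed-up cap for 2.5% sequential code: x{1 / 0.025:.0f}")
```

Read this way, the measured plateau implies a sequential fraction of a few percent, the same order of magnitude as the figure quoted above.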

Raw results

    n_threads  n_train  n_test  n_features  mean_runtime  stderr_runtime
0           1   100000  100000          50     54.270973               0
1           2   100000  100000          50     27.357690               0
2           4   100000  100000          50     13.772927               0
3           8   100000  100000          50      7.034176               0
4          16   100000  100000          50      3.851457               0
5          32   100000  100000          50      2.134666               0
6          64   100000  100000          50      1.482027               0
7         128   100000  100000          50      2.239688               0
8           1   100000  100000         100     77.925089               0
9           2   100000  100000         100     39.125349               0
10          4   100000  100000         100     19.810733               0
11          8   100000  100000         100     10.130284               0
12         16   100000  100000         100      5.506694               0
13         32   100000  100000         100      3.067685               0
14         64   100000  100000         100      2.061337               0
15        128   100000  100000         100      3.396916               0
16          1   100000  100000         500    274.099079               0
17          2   100000  100000         500    138.078319               0
18          4   100000  100000         500     70.136737               0
19          8   100000  100000         500     35.598209               0
20         16   100000  100000         500     19.321611               0
21         32   100000  100000         500     10.415704               0
22         64   100000  100000         500      7.194686               0
23        128   100000  100000         500     12.095341               0

Details

    n_threads  n_train  n_test  n_features  mean_runtime  stderr_runtime
0           1  1000000   10000          50     53.670283               0
1           2  1000000   10000          50     27.603052               0
2           4  1000000   10000          50     14.014151               0
3           8  1000000   10000          50      7.138670               0
4          16  1000000   10000          50      3.810226               0
5          32  1000000   10000          50      2.129321               0
6          64  1000000   10000          50      1.363076               0
7         128  1000000   10000          50      1.540974               0
8           1  1000000   10000         100     77.725753               0
9           2  1000000   10000         100     39.835435               0
10          4  1000000   10000         100     20.107033               0
11          8  1000000   10000         100     10.242633               0
12         16  1000000   10000         100      5.499012               0
13         32  1000000   10000         100      3.151450               0
14         64  1000000   10000         100      2.051802               0
15        128  1000000   10000         100      2.319589               0
16          1  1000000   10000         500    274.992947               0
17          2  1000000   10000         500    140.689740               0
18          4  1000000   10000         500     70.843511               0
19          8  1000000   10000         500     36.023845               0
20         16  1000000   10000         500     19.761463               0
21         32  1000000   10000         500     10.633548               0
22         64  1000000   10000         500      7.017808               0
23        128  1000000   10000         500      8.313477               0

Benchmark results between `main` (`a5d50cf`) and this PR @ `31b8b28` (via `2c842bd`)

Speed-ups range from ×1.2 to, well, ×250+: it looks like it just scales linearly.
The regressions (seen at 128 threads) are due to using too many cores when the size of the problem (i.e. `n_train` and `n_test`) is small.
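For reading the asv tables that follow: the `ratio` column is `after / before`, so the speed-up is its inverse, and when `main` timed out only a lower bound can be stated. The sketch below shows that arithmetic on a few values copied from the logs in this description; the helper names are made up for illustration.

```python
def speedup_from_ratio(ratio):
    """asv reports ratio = after / before, so the speed-up is its inverse."""
    return 1.0 / ratio


def speedup_lower_bound(timeout_s, after_s):
    """If `before` timed out, before > timeout_s, so speed-up > timeout_s / after_s."""
    return timeout_s / after_s


# Ratios copied from the asv comparison tables in this description.
smallest_gain = speedup_from_ratio(0.81)  # 32 threads, (1000, 1000, 100)
larger_gain = speedup_from_ratio(0.03)    # 64 threads, (1000, 100000, 100)

# main timed out (500 s) for n_train=10000000, n_test=1000, while this PR
# took 1.66 s with 128 threads, so only a lower bound is known.
timeout_bound = speedup_lower_bound(500, 1.66)

print(f"x{smallest_gain:.1f}")
print(f"x{larger_gain:.0f}")
print(f"more than x{timeout_bound:.0f}")
```

A ratio of 0.81 therefore corresponds to roughly ×1.2. The largest figures come from configurations where `main` hit the 500 s timeout, so they are lower bounds rather than measured ratios.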

1 thread

· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 31b8b28b <feat/pdr-32bit> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         2/9 failed
[ 50.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     11.3±0.2ms     105±2ms       1.06±0s    
                10000     89.6±0.9ms     878±4ms      8.76±0.01s  
               10000000    1.41±0m        failed        failed    
              ========== ============ ============= ==============

[ 50.00%] ···· For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     20.1±0.1ms     193±1ms       2.02±0s    
                10000      203±1ms       2.12±0s       21.0±0s    
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
-      20.1±0.1ms       11.3±0.2ms     0.56  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-         193±1ms          105±2ms     0.54  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-         2.02±0s          1.06±0s     0.52  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-         203±1ms       89.6±0.9ms     0.44  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         21.0±0s       8.76±0.01s     0.42  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 100000, 100)
-         2.12±0s          878±4ms     0.41  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

2 threads

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 31b8b28b <feat/pdr-32bit> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         2/9 failed
[ 50.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     6.89±0.2ms    53.9±0.7ms     531±3ms    
                10000     46.7±0.2ms    443±0.5ms      4.33±0s    
               10000000   42.7±0.06s      failed        failed    
              ========== ============ ============= ==============

[ 50.00%] ···· For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     18.3±0.2ms     171±1ms       1.78±0s    
                10000     179±0.9ms      1.86±0s       18.6±0s    
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
-      18.3±0.2ms       6.89±0.2ms     0.38  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-         171±1ms       53.9±0.7ms     0.32  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-         1.78±0s          531±3ms     0.30  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-       179±0.9ms       46.7±0.2ms     0.26  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.86±0s        443±0.5ms     0.24  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-         18.6±0s          4.33±0s     0.23  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

4 threads

· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 31b8b28b <feat/pdr-32bit> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         2/9 failed
[ 50.00%] ··· ========== ============= ============= ==============
              --                    n_test / n_features            
              ---------- ------------------------------------------
               n_train     1000 / 100   10000 / 100   100000 / 100 
              ========== ============= ============= ==============
                 1000     5.60±0.07ms    29.6±0.4ms     276±2ms    
                10000      27.3±0.3ms     230±1ms      2.23±0.01s  
               10000000    21.7±0.01s      failed        failed    
              ========== ============= ============= ==============

[ 50.00%] ···· For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     17.4±0.2ms    163±0.8ms      1.68±0s    
                10000      172±1ms       1.77±0s      17.5±0.02s  
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
-      17.4±0.2ms      5.60±0.07ms     0.32  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-       163±0.8ms       29.6±0.4ms     0.18  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-         1.68±0s          276±2ms     0.16  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-         172±1ms       27.3±0.3ms     0.16  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.77±0s          230±1ms     0.13  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

8 threads

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 31b8b28b <feat/pdr-32bit> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         1/9 failed
[ 50.00%] ··· ========== ============= ============= ==============
              --                    n_test / n_features            
              ---------- ------------------------------------------
               n_train     1000 / 100   10000 / 100   100000 / 100 
              ========== ============= ============= ==============
                 1000     6.83±0.03ms    17.1±0.3ms     150±2ms    
                10000      17.3±0.3ms     122±1ms      1.16±0.01s  
               10000000    11.5±0.01s     1.89±0m        failed    
              ========== ============= ============= ==============

[ 50.00%] ···· For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     18.8±0.2ms    167±0.7ms      1.71±0s    
                10000     176±0.6ms      1.80±0s      17.7±0.01s  
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
-      18.8±0.2ms      6.83±0.03ms     0.36  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-       167±0.7ms       17.1±0.3ms     0.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-       176±0.6ms       17.3±0.3ms     0.10  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.71±0s          150±2ms     0.09  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)
-         1.80±0s          122±1ms     0.07  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

16 threads

· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 31b8b28b <feat/pdr-32bit> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         1/9 failed
[ 50.00%] ··· ========== ============= ============= ==============
              --                    n_test / n_features            
              ---------- ------------------------------------------
               n_train     1000 / 100   10000 / 100   100000 / 100 
              ========== ============= ============= ==============
                 1000     9.97±0.08ms    60.3±0.3ms    87.2±0.6ms  
                10000      15.7±0.2ms    106±0.6ms      631±3ms    
               10000000    6.22±0.02s    59.4±0.2s       failed    
              ========== ============= ============= ==============

[ 50.00%] ···· For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     20.5±0.3ms     168±1ms       1.67±0s    
                10000      175±1ms       1.82±0s      18.0±0.02s  
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
-      20.5±0.3ms      9.97±0.08ms     0.49  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-         168±1ms       60.3±0.3ms     0.36  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-         175±1ms       15.7±0.2ms     0.09  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.82±0s        106±0.6ms     0.06  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-         1.67±0s       87.2±0.6ms     0.05  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

32 threads

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 31b8b28b <feat/pdr-32bit> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         1/9 failed
[ 50.00%] ··· ========== ============= ============= ==============
              --                    n_test / n_features            
              ---------- ------------------------------------------
               n_train     1000 / 100   10000 / 100   100000 / 100 
              ========== ============= ============= ==============
                 1000     17.4±0.08ms    95.7±0.3ms    59.3±0.3ms  
                10000      21.0±0.3ms    92.8±0.3ms     366±20ms   
               10000000    3.49±0.02s    32.3±0.1s       failed    
              ========== ============= ============= ==============

[ 50.00%] ···· For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     21.4±0.5ms     173±2ms       1.64±0s    
                10000      179±3ms       1.82±0s      18.1±0.01s  
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
-      21.4±0.5ms      17.4±0.08ms     0.81  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-         173±2ms       95.7±0.3ms     0.55  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
-         179±3ms       21.0±0.3ms     0.12  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-         1.82±0s       92.8±0.3ms     0.05  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-         1.64±0s       59.3±0.3ms     0.04  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

64 threads

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
·· Installing 31b8b28b <feat/pdr-32bit> into conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         1/9 failed
[ 50.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     33.2±10ms      185±20ms     49.0±0.2ms  
                10000     31.2±0.3ms     169±10ms      246±20ms   
               10000000   2.50±0.04s    20.0±0.02s      failed    
              ========== ============ ============= ==============

[ 50.00%] ···· For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000      37.9±5ms      200±2ms       1.76±0s    
                10000     208±0.8ms     2.07±0.01s    20.1±0.04s  
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
-       208±0.8ms       31.2±0.3ms     0.15  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 1000, 100)
-      2.07±0.01s         169±10ms     0.08  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-         1.76±0s       49.0±0.2ms     0.03  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

128 threads

· Creating environments
· Discovering benchmarks
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For scikit-learn commit 31b8b28b <feat/pdr-32bit> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         1/9 failed
[ 50.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000      250±30ms     1.45±0.1s     13.9±0.1s   
                10000      239±10ms     1.42±0.05s    12.8±0.1s   
               10000000   1.66±0.02s    13.1±0.08s      failed    
              ========== ============ ============= ==============

[ 50.00%] ···· For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

[ 50.00%] · For scikit-learn commit a5d50cf3 <main> (round 1/1):
[ 50.00%] ·· Building for conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[ 50.00%] ·· Benchmarking conda-py3.9-cython-joblib-numpy-scipy-threadpoolctl
[100.00%] ··· ...ancesArgKminBenchmark.time_nearest_neighbors         3/9 failed
[100.00%] ··· ========== ============ ============= ==============
              --                    n_test / n_features           
              ---------- -----------------------------------------
               n_train    1000 / 100   10000 / 100   100000 / 100 
              ========== ============ ============= ==============
                 1000     51.2±10ms      225±2ms      1.85±0.04s  
                10000      226±2ms      2.22±0.03s    21.3±0.03s  
               10000000     failed        failed        failed    
              ========== ============ ============= ==============

[100.00%] ···· For parameters: 10000000, 1000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 10000, 100
               
               
               asv: benchmark timed out (timeout 500s)
               
               For parameters: 10000000, 100000, 100
               
               
               asv: benchmark timed out (timeout 500s)

       before           after         ratio
     [a5d50cf3]       [31b8b28b]
     <main>           <feat/pdr-32bit>
+         225±2ms        1.45±0.1s     6.45  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 10000, 100)
+       51.2±10ms         250±30ms     4.89  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(1000, 1000, 100)
-      2.22±0.03s       1.42±0.05s     0.64  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 10000, 100)
-      21.3±0.03s        12.8±0.1s     0.60  pairwise_argkmin_estimator.PairwiseDistancesArgKminBenchmark.time_nearest_neighbors(10000, 100000, 100)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
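
As a reading aid for the comparison table above: the ratio column is the time on the PR branch divided by the time on `main`, so values below 1 are speed-ups and values above 1 are slowdowns. The short sketch below recomputes the ratios from the rounded timings printed in the table. The variable and function names are illustrative and not part of asv, and the last digit may differ from asv's output because asv works from unrounded measurements.

```python
# Timings (in seconds) copied from the asv comparison table, rounded as printed.
# Keys are (n_train, n_test, n_features); values are (before on main, after on the PR branch).
timings = {
    (1000, 10000, 100): (0.225, 1.45),
    (1000, 1000, 100): (0.0512, 0.250),
    (10000, 10000, 100): (2.22, 1.42),
    (10000, 100000, 100): (21.3, 12.8),
}


def ratio(before, after):
    """Return after / before: > 1 is a slowdown, < 1 is a speed-up."""
    return after / before


ratios = {params: ratio(*t) for params, t in timings.items()}
for params, r in ratios.items():
    verdict = "slower" if r > 1 else "faster"
    print(f"{params}: ratio={r:.2f} ({verdict} on the PR branch)")
```

Read this way, the two small `n_train=1000` cases regress while the two larger cases improve.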

Benchmarks information

Machine specification

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7742 64-Core Processor
Stepping:            0
CPU MHz:             3388.360
BogoMIPS:            4491.59
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-63,128-191
NUMA node1 CPU(s):   64-127,192-255
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

ogrisel · 2022-07-11T13:59:22Z

Thanks for the updated PR. I assume that merging with main is needed before starting to review this.

Could you run the benchmarks with more imbalanced train / tests, e.g. n_samples_train = int(1e7) and n_samples_test = 1000?

I wonder if the performance slowdown for a very large number of threads is caused by the fact that we have too few chunks to execute per thread, and using more imbalanced benchmark cases might validate (or invalidate) this hypothesis.
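
The hypothesis above can be made concrete with a back-of-the-envelope count of chunks per thread. This is a minimal sketch, not scikit-learn code: `chunk_size=256` is an assumed value, the helper name is hypothetical, and it assumes the parallelized dimension is split into fixed-size chunks that are shared among the threads.

```python
import math


def chunks_per_thread(n_samples, n_threads, chunk_size=256):
    """Return (number of chunks, average number of chunks per thread).

    chunk_size=256 is an assumption made for illustration.
    """
    n_chunks = math.ceil(n_samples / chunk_size)
    return n_chunks, n_chunks / n_threads


for n_samples in (100_000, 10_000_000):
    n_chunks, per_thread = chunks_per_thread(n_samples, n_threads=128)
    print(f"n_samples={n_samples}: {n_chunks} chunks, {per_thread:.1f} per thread")
```

Under these assumptions, 100,000 samples leave each of 128 threads with about 3 chunks, whereas 10,000,000 samples give each thread roughly 305, which is why the imbalanced case is a useful test.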

ogrisel · 2022-07-11T13:59:53Z

still, a 50x speed-up w.r.t. main is nice :)

jjerphan · 2022-07-11T20:43:27Z

> Thanks for the updated PR. I assume that merging with main is needed before starting to review this.

You're welcome! I was not sure whether this PR had been submitted correctly while I was travelling (🍀); another one is coming for a new pairwise_distances back-end.

> still, a 50x speed-up w.r.t. main is nice :)

Yes, I am quite glad we can reach this level of performance. I don't think we need to adapt the chunk size for the float32 case because the extra data structures take little additional memory (memory-wise, the only extra data structures are the original X_c and Y_c), but this can be tried in another PR. :)
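
To give a rough sense of scale for the memory argument above, the sketch below estimates the size of two per-thread chunk buffers. It assumes, for illustration only, that the extra buffers are float64 versions of the X_c and Y_c chunks and that a chunk holds 256 rows; neither figure is stated in this thread, and the helper name is hypothetical.

```python
def extra_chunk_bytes(chunk_size, n_features, itemsize=8):
    """Bytes needed for one X_c and one Y_c buffer of shape (chunk_size, n_features).

    itemsize=8 corresponds to float64 (an assumption about the buffer dtype).
    """
    return 2 * chunk_size * n_features * itemsize


# Feature counts taken from the benchmark grid in the PR description.
for n_features in (50, 100, 500):
    n_bytes = extra_chunk_bytes(chunk_size=256, n_features=n_features)
    print(f"n_features={n_features}: {n_bytes / 1024:.0f} KiB per thread")
```

The estimate grows linearly with the number of features and does not depend on the number of samples.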

> Could you run the benchmarks with more imbalanced train / tests, e.g. n_samples_train = int(1e7) and n_samples_test = 1000?

Yes. Let's try that.

> I wonder if the performance slowdown for a very large number of threads is caused by the fact that we have too few chunks to execute per thread, and using more imbalanced benchmark cases might validate (or invalidate) this hypothesis.

I share the same hypothesis. I think we can explore strategies to guarantee a minimal number of batches per thread in another PR (a task I have added to the TODO list in the description).

jjerphan · 2022-07-17T09:18:22Z

I've adapted the description with updated benchmarks script and results.

It looks like the implementation scales well on the (n_samples_train, n_samples_test) = (int(1e7), 1000) case. On main, the execution times out after 500s in all cases, even when using 128 threads, whereas with this PR it completes in less than 2s, reaching "×100+ speed-ups" in this case (see the raw logs in this PR description).

The drop is mainly present when using too many threads. I think the PairwiseDistancesArgKmin used under the hood scales even better and that it is the sequential part at the beginning of kneighbors which might be costly.
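
The effect of a sequential part on scalability can be sketched with Amdahl's law, which the PR description also invokes. The snippet below is illustrative only: it takes the 2.5% sequential fraction quoted in the description as an assumption, and the function name is hypothetical.

```python
def amdahl_speedup(n_threads, sequential_fraction):
    """Theoretical speed-up when only the parallel part benefits from more threads."""
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / n_threads)


for n_threads in (1, 16, 64, 128):
    s = amdahl_speedup(n_threads, sequential_fraction=0.025)
    print(f"{n_threads:>3} threads: at most x{s:.1f}")
```

With a 2.5% sequential fraction the speed-up can never exceed ×40 (1 / 0.025), however many threads are used. Amdahl's law only predicts a plateau, so it would not by itself explain runtimes getting worse with more threads.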

Perhaps we should have advertised the practical support for really large datasets more in the changelog for 1.1? :)

adrinjalali · 2022-07-17T09:31:11Z

WOW

thomasjpfan

We are on the path to make everything use Tempita :)

jjerphan · 2022-07-18T19:01:44Z

> We are on the path to make everything use Tempita :)

It looks like it. IMO, even if it is suboptimal, restrictive and hard to maintain in the long run, it is a rather pragmatic solution given where we are today.

From IRL discussions this week 🍀, it looks like @adrinjalali is interested in experimenting with alternatives, like Rust. I think it's worth exploring, but it might add complexity, especially to the build setup and to interfacing with other libraries like BLAS and the like.
Similar concerns also apply to C++.

Perhaps the work on #22438 might help?

thomasjpfan

I left a minor nit; otherwise LGTM.

ogrisel

I did another quick pass. LGTM. Let's merge and handle type renaming in dedicated PRs (e.g. #24153).

ogrisel · 2022-08-10T14:04:16Z

Thanks @jjerphan 🎉

jjerphan · 2022-08-10T15:36:33Z

Thanks @ogrisel and @thomasjpfan for the reviews!

FEA Port implementations to 32bit using Tempita

88a98a1

github-actions bot added the cython and module:metrics labels Jul 8, 2022

jjerphan mentioned this pull request Jul 8, 2022

POC 32bit datasets support for PairwiseDistancesReduction #22590

Closed

3 tasks

jjerphan added the No Changelog Needed label Jul 8, 2022

Validate Y_norm_squared

840b473

jjerphan mentioned this pull request Jul 11, 2022

PERF PairwiseDistancesReductions initial work #22587

Closed

Merge branch 'main' into feat/pdr-32bit

31b8b28

jjerphan added 2 commits July 17, 2022 12:45

MAINT Do not use PairwiseDistancesArgKmin for TSNE for now

2d44278

MAINT Adapt tests

991c21d

jjerphan marked this pull request as ready for review July 17, 2022 11:23

jjerphan added the Performance and float32 (Issues related to support for 32bit data) labels Jul 17, 2022

thomasjpfan reviewed Jul 17, 2022

sklearn/manifold/_t_sne.py (outdated, resolved)

sklearn/manifold/_t_sne.py (outdated, resolved)

sklearn/manifold/_t_sne.py (outdated, resolved)

jjerphan and others added 2 commits July 18, 2022 20:51

DOC Fix typo

350021e

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

DOC Fix typo

42b3225

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

jjerphan added 2 commits July 19, 2022 09:12

Retrigger CI

231a85f

DOC Add whats_new entry

3722882

jjerphan removed the No Changelog Needed label Jul 19, 2022

ogrisel reviewed Jul 20, 2022

jjerphan commented Jul 20, 2022

sklearn/metrics/_dist_metrics.pxd.tp (outdated, resolved)

jjerphan commented Jul 20, 2022

sklearn/metrics/_pairwise_distances_reduction/_base.pyx.tp (outdated, resolved)

jjerphan commented Jul 20, 2022

sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pxd.tp (outdated, resolved)

jjerphan commented Jul 20, 2022

sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pyx.tp (outdated, resolved)

ogrisel reviewed Jul 21, 2022

sklearn/manifold/tests/test_t_sne.py (outdated, resolved)

DOC Better motivate the xfail on t-SNE edge

e2cc073

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

thomasjpfan reviewed Jul 21, 2022

sklearn/metrics/_pairwise_distances_reduction/_base.pyx.tp (outdated, resolved)

sklearn/metrics/_pairwise_distances_reduction/_argkmin.pxd.tp (outdated, resolved)

Use stack-allocated vector[vector[T]] instead of thread-local buffers

d91b8ac

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

jjerphan mentioned this pull request Jul 25, 2022

FEA Fused sparse-dense support for PairwiseDistancesReduction #23585

Merged

4 tasks

thomasjpfan approved these changes Jul 25, 2022

sklearn/metrics/_pairwise_distances_reduction/_gemm_term_computer.pxd.tp (outdated, resolved)

MAINT Reword "need_upcastr" to "upcast_to_float64"

2a82699

Done with: grep -rl need_upcast . | xargs sed -i 's/need_upcast/upcast_to_float64/g'

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

jjerphan changed the title ~~FEA Port PairwiseDistancesReduction to 32bit using Tempita~~ FEA Add support for float32 on PairwiseDistancesReduction using Tempita Jul 27, 2022

jjerphan added 4 commits July 28, 2022 16:34

MAINT Also ignore generated Cython source at the project level

d9a89d7

Merge branch 'main' into feat/pdr-32bit

2596394

Merge branch 'main' into feat/pdr-32bit

99ccc7c

MAINT Fix dtype_validity

0cc1367

Micky774 mentioned this pull request Aug 4, 2022

PERF Implement PairwiseDistancesReduction backend for KNeighbors.predict_proba #24076

Merged

4 tasks

Merge branch 'main' into feat/pdr-32bit

93ca845

ogrisel reviewed Aug 9, 2022

sklearn/metrics/_pairwise_distances_reduction/_argkmin.pxd.tp (outdated, resolved)

thomasjpfan mentioned this pull request Aug 9, 2022

MNT Use float64_t and intp_t directly in Cython for _pairwise_distances_reduction #24153

Closed

MAINT Use types and dtypes via NumPy directly

eac14db

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

jjerphan mentioned this pull request Aug 10, 2022

MAINT Adapt PairwiseDistancesReduction heuristic for strategy="auto" #24043

Merged

ogrisel approved these changes Aug 10, 2022

ogrisel merged commit b7d0171 into scikit-learn:main Aug 10, 2022

jjerphan deleted the feat/pdr-32bit branch August 10, 2022 15:36

jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Aug 11, 2022

Merge branch 'main' into maint/pdr-sparse-support

c8bacc6

This update the branch after the merge of scikit-learn#23865.

lesteve mentioned this pull request Aug 11, 2022

⚠️ CI failed on Linux.ubuntu_atlas ⚠️ #24131

Closed

lesteve mentioned this pull request Oct 18, 2022

MNT Fix build when SKLEARN_OPENMP_PARALLELISM_ENABLED=False #24682

Merged

lorentzenchr mentioned this pull request Feb 9, 2023

RFC Guideline for usage of Cython types #25572

Closed

jjerphan mentioned this pull request Mar 29, 2023

Introduce SIMD intrinsics for _dist_metrics.pyx #26010

Open

Benchmarks results between main (a5d50cf) and this PR @ 31b8b28 (via 2c842bd)
