
Speed with large arrays #7

Open · grovduck opened this issue Jul 14, 2020 · 5 comments

@grovduck commented Jul 14, 2020

Thanks for this package, @mthh. This is more of a question than an issue. I'm using your code in a raster GIS context, trying to get natural breaks for large rasters (> 1,000,000 cells). I've built in equal-interval and quantile classification modes for comparison. I'm curious whether you've ever done speed comparisons on large datasets, and whether you'd advocate sampling to reduce classification times.

Here's a sample of what I'm seeing:

import numpy as np
import jenkspy

# Equal interval
def equal_intervals(arr, bin_count):
    return np.linspace(arr.min(), arr.max(), num=bin_count + 1)

# Quantile, recursing to handle bin edges duplicated by repeated values
def approx_quantiles(arr, bin_count):
    if arr.size <= bin_count:
        return np.sort(arr)
    q = np.linspace(0, 1, bin_count + 1)
    bins = np.quantile(arr, q)
    uniq, counts = np.unique(bins, return_counts=True)
    dup = uniq[counts > 1]
    if len(dup):
        # A single value spans several quantiles: give it its own bin and
        # re-bin the remaining values into one fewer class
        new = arr[arr != dup[0]]
        return np.sort(
            np.hstack((dup[0], approx_quantiles(new, bin_count - 1)))
        )
    return bins

# Natural breaks
def jenks(arr, bin_count):
    return jenkspy.jenks_breaks(arr, bin_count)

# Sample array (n=10000)
arr = np.random.randint(0, 100, 10000)

When timing these, jenks is far slower than the other two, but I admit I haven't studied the algorithm carefully enough to know whether that's expected.

%timeit equal_intervals(arr, 6)
61.5 µs ± 4.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit approx_quantiles(arr, 6)
431 µs ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit jenks(arr, 6)
514 ms ± 54.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Sampling the original array obviously helps speed, but the bin boundaries aren't exact:

import math
arr = np.random.randint(0, 1000, 100000)
for i in range(10, 0, -1):
    size = math.floor(arr.size * i / 10)
    sample = np.random.choice(arr, size=size)
    %timeit jenks(sample, 6)
    print(f"{i*10}% sample", jenks(sample, 6))

49.6 s ± 864 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100% sample [0.0, 167.0, 335.0, 503.0, 670.0, 835.0, 999.0]

39.6 s ± 661 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
90% sample [0.0, 165.0, 331.0, 497.0, 664.0, 831.0, 999.0]

30.8 s ± 1.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
80% sample [0.0, 167.0, 333.0, 500.0, 667.0, 834.0, 999.0]

23.8 s ± 686 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
70% sample [0.0, 170.0, 339.0, 507.0, 672.0, 834.0, 999.0]

17.7 s ± 708 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
60% sample [0.0, 168.0, 335.0, 500.0, 665.0, 831.0, 999.0]

12.1 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
50% sample [0.0, 169.0, 334.0, 500.0, 665.0, 831.0, 999.0]

7.28 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
40% sample [0.0, 168.0, 335.0, 502.0, 668.0, 833.0, 999.0]

4.31 s ± 221 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
30% sample [0.0, 168.0, 337.0, 503.0, 667.0, 832.0, 999.0]

2.05 s ± 48.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
20% sample [0.0, 170.0, 338.0, 507.0, 674.0, 837.0, 999.0]

458 ms ± 4.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10% sample [0.0, 167.0, 334.0, 500.0, 664.0, 830.0, 999.0]

Thanks for any pointers you may have on how to use jenkspy more efficiently.

@mthh (Owner) commented Aug 18, 2022

Thank you for your feedback and measurements, and sorry for the late answer.

Indeed, by its nature, the Fisher-Jenks classification algorithm is much more complex than quantile and equal-interval classification: it involves building two large matrices of size (number of values × number of desired classes), plus nested loops over the values to compute within-class variances and determine the best class boundaries.
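
To give an idea of the shape of the computation, here is a minimal pure-Python sketch of the classic matrix-based Fisher-Jenks dynamic program (this is not jenkspy's actual C code, and break conventions vary slightly between implementations):

def jenks_matrices(values, n_classes):
    # Fill the two (n+1) x (k+1) tables of the dynamic program.
    # lower_class_limits[l][j]: 1-based start index of the last class in the
    #   optimal split of the first l sorted values into j classes.
    # variance_combinations[l][j]: the corresponding minimal total
    #   within-class sum of squared deviations.
    n = len(values)
    lower_class_limits = [[0] * (n_classes + 1) for _ in range(n + 1)]
    variance_combinations = [[0.0] * (n_classes + 1) for _ in range(n + 1)]
    for j in range(1, n_classes + 1):
        lower_class_limits[1][j] = 1
        for l in range(2, n + 1):
            variance_combinations[l][j] = float("inf")
    for l in range(2, n + 1):           # length of the prefix being split
        s = s2 = 0.0
        w = 0
        variance = 0.0
        for m in range(1, l + 1):       # size of the last class
            lower = l - m + 1           # 1-based start index of the last class
            val = values[lower - 1]
            w += 1
            s += val
            s2 += val * val
            variance = s2 - s * s / w   # SSD of values[lower-1 .. l-1]
            if lower > 1:
                for j in range(2, n_classes + 1):
                    cost = variance + variance_combinations[lower - 1][j - 1]
                    if cost <= variance_combinations[l][j]:
                        lower_class_limits[l][j] = lower
                        variance_combinations[l][j] = cost
        lower_class_limits[l][1] = 1
        variance_combinations[l][1] = variance
    return lower_class_limits, variance_combinations

def jenks_breaks_py(data, n_classes):
    # Walk back through the table to recover the break values
    data = sorted(data)
    lower_class_limits, _ = jenks_matrices(data, n_classes)
    k = len(data)
    breaks = [0.0] * (n_classes + 1)
    breaks[n_classes] = data[-1]
    for j in range(n_classes, 0, -1):
        breaks[j - 1] = data[lower_class_limits[k][j] - 1]
        k = lower_class_limits[k][j] - 1
    return breaks

The tables take O(n * k) memory and the nested loops do O(n^2 * k) work, which is why runtimes climb so steeply once the input reaches tens of thousands of values.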

The implementation proposed by jenkspy is normally quite fast (it is implemented in C), but it doesn't use any parallel computing technique.
If you look at the mapclassify package, which offers the same discretization method in Python (the FisherJenks class), its results are much less satisfactory as the input array grows, because it is implemented purely in Python.

As a comparison, here are the timings (in seconds) that I obtain when comparing jenkspy to mapclassify, classifying a random array into 5 classes:

Array size   jenkspy                   mapclassify.FisherJenks
100          0.00013094300084048882    0.018350736998399952
500          0.001357422999717528      0.4472179149997828
1000         0.004580991000693757      1.7872164090003935
5000         0.09825777399964863       45.221845292999205
10000        0.3929008960003557        180.37603843200122
50000        9.964591654999822         not computed
100000       42.21563783500096         not computed

(figure: timing comparison of jenkspy and mapclassify.FisherJenks)
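
If you want to reproduce timings in this spirit, a small harness along these lines works (just a sketch, not the exact script I used; absolute numbers will vary with machine and library versions):

import timeit

import numpy as np
import jenkspy
import mapclassify

for n in (100, 500, 1000, 5000, 10000):
    data = np.random.random(n)
    t_jks = timeit.timeit(lambda: jenkspy.jenks_breaks(data, 5), number=1)
    t_mpc = timeit.timeit(lambda: mapclassify.FisherJenks(data, k=5), number=1)
    print(f"{n:>6}  jenkspy={t_jks:.4f}s  mapclassify={t_mpc:.4f}s")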

Of course these timings should be taken with a grain of salt (they depend on my machine, on the number of requested classes, etc.), but they give you an idea, and they seem consistent with your own.
And of course they aren't meant to denigrate other implementations: jenkspy specializes in computing these breaks, whereas mapclassify offers a much richer interface around its classifiers.

I am preparing a notebook comparing the performance and accuracy of various implementations; I will post the link here if you are interested.

Your idea of sampling the data to be discretized is probably the right one if you want to use the Fisher-Jenks algorithm. This is, for example, what software such as QGIS Desktop and the R classInt library do: both use (unless otherwise specified) a sample of 3000 values.

With jenkspy, on a reasonably modern computer, you can easily use a 5000- or 10000-value sample while keeping execution time below one second.
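
For example, a small QGIS-style wrapper around jenkspy could look like this (just a sketch; jenks_breaks_sampled is a hypothetical helper name, not part of jenkspy's API):

import numpy as np
import jenkspy

def jenks_breaks_sampled(arr, n_classes, max_size=3000, seed=None):
    # Mirror the QGIS heuristic: above max_size values, classify a random
    # sample of max(max_size, 10% of the data) instead of the full array
    arr = np.asarray(arr).ravel()
    if arr.size > max_size:
        size = max(max_size, arr.size // 10)
        rng = np.random.default_rng(seed)
        arr = rng.choice(arr, size=size, replace=False)
    return jenkspy.jenks_breaks(arr, n_classes)

breaks = jenks_breaks_sampled(np.random.randint(0, 1000, 100_000), 6)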

@grovduck (Author)

@mthh, thanks for the thorough answer. Yes, I'd be interested in seeing your comparison notebook once you've finished it. For my purposes, sampling the dataset gives me a good approximation of the classification I want.

@kwtk86 commented Dec 15, 2023

Thanks for your answer above; sampling the data does seem like a plausible way to speed up the Jenks algorithm.
However, I still wonder why GIS software like ArcGIS can run natural-breaks classification so much faster than jenkspy, even though jenkspy is implemented in Cython (or C). Is it just because they use parallel computing techniques?
Thank you.

@mthh (Owner) commented Dec 15, 2023

I couldn't say for sure about ArcGIS, but I know that QGIS, for example, doesn't take the whole dataset when you ask for a Jenks classification, only a sample of the values (you can see it in the code here: https://github.com/qgis/QGIS/blob/master/src/core/classification/qgsclassificationjenks.cpp#L80C8-L80C8 - the variable mMaximumSize is set to 3000, so if the dataset has more than 3000 values, the code samples 3000 values or 10% of the dataset, whichever is larger), which is why it is so much faster.
I imagine ArcGIS might do something similar.

Regarding Jenkspy, I think it's up to the user to implement this logic upstream if they wish.

@mthh (Owner) commented Dec 15, 2023

While searching the Web, I found a mention that ArcGIS also seems to take a sample (of 10000 values) for natural-breaks classification with Jenks: https://gis.stackexchange.com/a/84243.
