
Speed with large arrays #7

Open · grovduck opened this issue Jul 14, 2020 · 5 comments

@grovduck commented Jul 14, 2020

Thanks for this package, @mthh. This is more of a question than an issue. I'm using your code in a raster GIS context, trying to get natural breaks for large rasters (> 1,000,000 cells). I've built in equal-interval and quantile classification modes for comparison. I'm curious whether you've ever done speed comparisons on large datasets, and whether you'd advocate sampling to reduce classification times.

Here's a sample of what I'm seeing:

import numpy as np
import jenkspy

# Equal interval
def equal_intervals(arr, bin_count):
    return np.linspace(arr.min(), arr.max(), num=bin_count + 1)

# Quantile, recursing to handle bin edges duplicated by repeated values
def approx_quantiles(arr, bin_count):
    if arr.size <= bin_count:
        return np.sort(arr)
    q = np.linspace(0, 1, bin_count + 1)
    bins = np.quantile(arr, q)
    uniq, counts = np.unique(bins, return_counts=True)
    dup = uniq[counts > 1]
    if len(dup):
        # A single value spans several quantiles: give it its own bin and
        # re-bin the remaining values into one fewer class
        new = arr[arr != dup[0]]
        return np.sort(
            np.hstack((dup[0], approx_quantiles(new, bin_count - 1)))
        )
    return bins

# Natural breaks
def jenks(arr, bin_count):
    return jenkspy.jenks_breaks(arr, bin_count)

# Sample array (n=10000)
arr = np.random.randint(0, 100, 10000)

When timing these, jenks is far slower than the other two, but I admit I haven't studied the algorithm carefully enough to know whether that's expected.

%timeit equal_intervals(arr, 6)
61.5 µs ± 4.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit approx_quantiles(arr, 6)
431 µs ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit jenks(arr, 6)
514 ms ± 54.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Sampling the original array obviously helps speed, but the bin boundaries aren't exact:

import math
arr = np.random.randint(0, 1000, 100000)
for i in range(10, 0, -1):
    size = math.floor(arr.size * i / 10)
    sample = np.random.choice(arr, size=size)
    %timeit jenks(sample, 6)
    print(f"{i*10}% sample", jenks(sample, 6))

49.6 s ± 864 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100% sample [0.0, 167.0, 335.0, 503.0, 670.0, 835.0, 999.0]

39.6 s ± 661 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
90% sample [0.0, 165.0, 331.0, 497.0, 664.0, 831.0, 999.0]

30.8 s ± 1.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
80% sample [0.0, 167.0, 333.0, 500.0, 667.0, 834.0, 999.0]

23.8 s ± 686 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
70% sample [0.0, 170.0, 339.0, 507.0, 672.0, 834.0, 999.0]

17.7 s ± 708 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
60% sample [0.0, 168.0, 335.0, 500.0, 665.0, 831.0, 999.0]

12.1 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
50% sample [0.0, 169.0, 334.0, 500.0, 665.0, 831.0, 999.0]

7.28 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
40% sample [0.0, 168.0, 335.0, 502.0, 668.0, 833.0, 999.0]

4.31 s ± 221 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
30% sample [0.0, 168.0, 337.0, 503.0, 667.0, 832.0, 999.0]

2.05 s ± 48.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
20% sample [0.0, 170.0, 338.0, 507.0, 674.0, 837.0, 999.0]

458 ms ± 4.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10% sample [0.0, 167.0, 334.0, 500.0, 664.0, 830.0, 999.0]

Thanks for any pointers you may have on how to use jenkspy more efficiently.

@mthh (Owner) commented Aug 18, 2022

Thank you for your feedback and measurements, and sorry for the late answer.

Indeed, by its nature, the Fisher-Jenks classification algorithm is much more complex than quantile and equal-interval classification: it involves building two large matrices of size (number of values × number of desired classes), plus nested loops over the values to compute within-class variances and determine the best class boundaries.
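
To give an idea of the shape of the computation, here is a minimal pure-Python sketch of the classic matrix-based Fisher-Jenks dynamic program (this is not jenkspy's actual C code, and break conventions vary slightly between implementations):

def jenks_matrices(values, n_classes):
    # Fill the two (n+1) x (k+1) tables of the dynamic program.
    # lower_class_limits[l][j]: 1-based start index of the last class in the
    #   optimal split of the first l sorted values into j classes.
    # variance_combinations[l][j]: the corresponding minimal total
    #   within-class sum of squared deviations.
    n = len(values)
    lower_class_limits = [[0] * (n_classes + 1) for _ in range(n + 1)]
    variance_combinations = [[0.0] * (n_classes + 1) for _ in range(n + 1)]
    for j in range(1, n_classes + 1):
        lower_class_limits[1][j] = 1
        for l in range(2, n + 1):
            variance_combinations[l][j] = float("inf")
    for l in range(2, n + 1):           # length of the prefix being split
        s = s2 = 0.0
        w = 0
        variance = 0.0
        for m in range(1, l + 1):       # size of the last class
            lower = l - m + 1           # 1-based start index of the last class
            val = values[lower - 1]
            w += 1
            s += val
            s2 += val * val
            variance = s2 - s * s / w   # SSD of values[lower-1 .. l-1]
            if lower > 1:
                for j in range(2, n_classes + 1):
                    cost = variance + variance_combinations[lower - 1][j - 1]
                    if cost <= variance_combinations[l][j]:
                        lower_class_limits[l][j] = lower
                        variance_combinations[l][j] = cost
        lower_class_limits[l][1] = 1
        variance_combinations[l][1] = variance
    return lower_class_limits, variance_combinations

def jenks_breaks_py(data, n_classes):
    # Walk back through the table to recover the break values
    data = sorted(data)
    lower_class_limits, _ = jenks_matrices(data, n_classes)
    k = len(data)
    breaks = [0.0] * (n_classes + 1)
    breaks[n_classes] = data[-1]
    for j in range(n_classes, 0, -1):
        breaks[j - 1] = data[lower_class_limits[k][j] - 1]
        k = lower_class_limits[k][j] - 1
    return breaks

The tables take O(n * k) memory and the nested loops do O(n^2 * k) work, which is why runtimes climb so steeply once the input reaches tens of thousands of values.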

The implementation proposed by jenkspy is normally quite fast (it is implemented in C), but it doesn't use any parallel computing technique.
If you look at the mapclassify package, which offers the same discretization method in Python (the FisherJenks class), its results are much less satisfactory as the input array grows, because it is implemented purely in Python.

As a comparison, here are the timings (in seconds) that I obtain when comparing jenkspy to mapclassify, classifying a random array into 5 classes:

Array size   jenkspy                   mapclassify.FisherJenks
100          0.00013094300084048882    0.018350736998399952
500          0.001357422999717528      0.4472179149997828
1000         0.004580991000693757      1.7872164090003935
5000         0.09825777399964863       45.221845292999205
10000        0.3929008960003557        180.37603843200122
50000        9.964591654999822         not computed
100000       42.21563783500096         not computed

(figure: timing comparison of jenkspy and mapclassify.FisherJenks)
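
If you want to reproduce timings in this spirit, a small harness along these lines works (just a sketch, not the exact script I used; absolute numbers will vary with machine and library versions):

import timeit

import numpy as np
import jenkspy
import mapclassify

for n in (100, 500, 1000, 5000, 10000):
    data = np.random.random(n)
    t_jks = timeit.timeit(lambda: jenkspy.jenks_breaks(data, 5), number=1)
    t_mpc = timeit.timeit(lambda: mapclassify.FisherJenks(data, k=5), number=1)
    print(f"{n:>6}  jenkspy={t_jks:.4f}s  mapclassify={t_mpc:.4f}s")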

Of course these timings should be taken with a grain of salt (they depend on my machine, on the number of requested classes, etc.), but they give you an idea, and they seem consistent with your own.
And of course they aren't meant to denigrate other implementations: jenkspy specializes in computing these breaks, whereas mapclassify offers a much richer interface around its classifiers.

I am preparing a notebook comparing the performance and accuracy of various implementations; I will post the link here if you are interested.

Your idea of sampling the data to be discretized is probably the right one if you want to use the Fisher-Jenks algorithm. This is, for example, what software such as QGIS Desktop and the R classInt library do: both use (unless otherwise specified) a sample of 3000 values.

With jenkspy, on a reasonably modern computer, you can easily use a 5000- or 10000-value sample while keeping execution time below one second.
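
For example, a small QGIS-style wrapper around jenkspy could look like this (just a sketch; jenks_breaks_sampled is a hypothetical helper name, not part of jenkspy's API):

import numpy as np
import jenkspy

def jenks_breaks_sampled(arr, n_classes, max_size=3000, seed=None):
    # Mirror the QGIS heuristic: above max_size values, classify a random
    # sample of max(max_size, 10% of the data) instead of the full array
    arr = np.asarray(arr).ravel()
    if arr.size > max_size:
        size = max(max_size, arr.size // 10)
        rng = np.random.default_rng(seed)
        arr = rng.choice(arr, size=size, replace=False)
    return jenkspy.jenks_breaks(arr, n_classes)

breaks = jenks_breaks_sampled(np.random.randint(0, 1000, 100_000), 6)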

@grovduck (Author)

@mthh, thanks for the thorough answer. Yes, I'd be interested in seeing your comparison notebook once you've finished it. For my purposes, sampling the dataset gives me a good approximation of the classification I want.

@kwtk86 commented Dec 15, 2023

Thanks for your answer above; sampling the data does seem like a plausible way to speed up the Jenks algorithm.
However, I still wonder why GIS software like ArcGIS can run natural-breaks classification so much faster than jenkspy, even though jenkspy is implemented in Cython (or C). Is it just because they use parallel computing techniques?
Thank you.

@mthh (Owner) commented Dec 15, 2023

I couldn't say for sure about ArcGIS, but I know that QGIS, for example, doesn't take the whole dataset when you ask for a Jenks classification, only a sample of the values (you can see it in the code here: https://github.com/qgis/QGIS/blob/master/src/core/classification/qgsclassificationjenks.cpp#L80C8-L80C8 - the variable mMaximumSize is set to 3000, so if the dataset has more than 3000 values, the code samples 3000 values or 10% of the dataset, whichever is larger), which is why it is so much faster.
I imagine ArcGIS might do something similar.

Regarding Jenkspy, I think it's up to the user to implement this logic upstream if they wish.

@mthh (Owner) commented Dec 15, 2023

While searching the Web, I found a mention that ArcGIS also seems to take a sample (of 10000 values) for natural-breaks classification with Jenks: https://gis.stackexchange.com/a/84243.
