
Speed up the "slow" (not compiled with numba.jit) _fisher_jenks_means by 4 to 5 times. Saves 3 minutes classifying 8000 data points. #201

Open
wants to merge 18 commits into base: main
Conversation

JamesParrott

Hi everyone,

Thanks for a great project, and for all your work. I really like mapclassify. This PR is for information only: it highlights an interesting way to improve performance for users who are not using Numba. Using Numba is still preferable, since the jitted version is blazingly fast. Given how many other dependencies are already required, I don't know why Numba is optional, even if llvmlite is very large; I would include it by default.

Otherwise, if there's a willingness to relax and generalise the input types, and the instance variables created by MapClassifier (to lists of numbers, instead of only numpy arrays), which would technically be a breaking change, then this PR is a good UX improvement for casual mapclassify users. I don't know whether something else that takes even longer dominates their critical path, whether typical non-Numba mapclassify users have much more powerful machines than mine, or whether they all classify much larger data sets than I've tested, but I myself would always prefer a library that gets the answer 3 minutes faster.

From running mapclassify\tests\time_fisher_jenkss.py:

[screenshot: MapClassify_no_numpy_pure_functions_comparison]

[screenshot: MapClassify_no_numpy_end_user_comparison]

Background:
I've made a fork to target IronPython in Grasshopper, where Numpy and Numba are not directly available. That got me wondering how much performance its users would be missing out on. Compared to Numba users, they might have to wait 30 seconds longer for 3000 data points. But compared to non-Numba users, the answer, to my surprise, is: not a lot. In fact, my fork, which doesn't use Numpy at all, is significantly faster than the slow version of _fisher_jenks_means without Numba (the one that someone saw fit to raise a warning about in FisherJenks).

Basically, creating and iterating over numpy arrays is not free - there's a real overhead. That overhead is well worth it for large matrix multiplications and other vectorisable calculations in scientific computing on modern hardware. But on my laptop at least, running Python 3.11, for an implementation of an algorithm like the original Fisher-Jenks, which must iterate over every element in sequence anyway, numpy arrays are slower than plain Python iteration over nested lists.
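To illustrate the point above, here is a minimal, self-contained sketch (not code from this PR) that times the same element-by-element loop over a nested list and over an equivalent ndarray. The sizes and function names are illustrative only.

```python
import time
import numpy as np

n = 500
nested_list = [[float(i * n + j) for j in range(n)] for i in range(n)]
arr = np.array(nested_list)

def sum_elements(matrix):
    # Iterate element by element, as the Fisher-Jenks DP loops must.
    total = 0.0
    for row in matrix:
        for value in row:
            total += value
    return total

t0 = time.perf_counter()
list_total = sum_elements(nested_list)
t_list = time.perf_counter() - t0

t0 = time.perf_counter()
arr_total = sum_elements(arr)
t_arr = time.perf_counter() - t0

print(f"nested list: {t_list:.4f}s  ndarray: {t_arr:.4f}s")
```

On a typical CPython build the ndarray loop tends to be slower, because each element access boxes the value into a numpy scalar object, whereas the nested list yields existing Python floats directly.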

I don't know exactly how Numpy implements indexing in C, but when I tried to implement __getitem__ on tuples in pure Python, there was an even larger performance hit. So the only changes I've made in my alternative function are: switching [i, j] to the admittedly uglier [i][j]; generalising a numpy-specific type reference; and passing in a class for numpy (np) that defines int, float, etc. Hopefully the original code in _fisher_jenks_means is otherwise unaltered.
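The backend-swapping idea can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual implementation: the class name, methods, and the toy function are all made up to show why mat[i][j] (unlike mat[i, j]) works for both nested lists and ndarrays.

```python
import numpy as np

# Hypothetical shim: a minimal namespace standing in for numpy, so the
# same function body can run with or without numpy installed.
class PurePythonBackend:
    float64 = float
    int64 = int

    @staticmethod
    def zeros(shape):
        rows, cols = shape
        return [[0.0] * cols for _ in range(rows)]

def fill_diagonal_like(backend, n):
    # mat[i][j] indexing works for both nested lists and numpy
    # ndarrays, whereas mat[i, j] is numpy-only.
    mat = backend.zeros((n, n))
    for i in range(n):
        mat[i][i] = backend.float64(1)
    return mat

pure = fill_diagonal_like(PurePythonBackend, 3)   # nested lists
with_np = fill_diagonal_like(np, 3)               # ndarray
```

Passing the backend in as a parameter keeps the algorithm body identical in both code paths, which is the property the PR description aims for.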

python time_fisher_jenkss.py
Run times of the proposed function vs the original (excluding MapClassifier overhead)

Time: 0.009 seconds, data points: 100  _fjm_without_numpy, number_tests=1
Time: 0.081 seconds, data points: 300  _fjm_without_numpy, number_tests=1
Time: 0.911 seconds, data points: 1000  _fjm_without_numpy, number_tests=1
Time: 7.298 seconds, data points: 2800  _fjm_without_numpy, number_tests=1
Time: 63.026 seconds, data points: 8000  _fjm_without_numpy, number_tests=1
Time: 0.044 seconds, data points: 100  _fisher_jenks_means, number_tests=1
Time: 0.381 seconds, data points: 300  _fisher_jenks_means, number_tests=1
Time: 4.209 seconds, data points: 1000  _fisher_jenks_means, number_tests=1
Time: 33.791 seconds, data points: 2800  _fisher_jenks_means, number_tests=1
Time: 315.014 seconds, data points: 8000  _fisher_jenks_means, number_tests=1
Run times for end user, of the proposed code vs the original (inc MapClassifier overhead)

\mapclassif-Iron\mapclassify\tests\time_fisher_jenkss.py:56: UserWarning: Numba not installed. Using a less slow, pure python version.
  mapclassify.classify(y = data, scheme = 'fisherjenks', k=k)
Time: 0.042 seconds, data points: 100 without Numpy, much less slow, pure python code, number_tests=1
Time: 0.154 seconds, data points: 300 without Numpy, much less slow, pure python code, number_tests=1
Time: 1.559 seconds, data points: 1000 without Numpy, much less slow, pure python code, number_tests=1
Time: 12.467 seconds, data points: 2800 without Numpy, much less slow, pure python code, number_tests=1
Time: 105.765 seconds, data points: 8000 without Numpy, much less slow, pure python code, number_tests=1
Time: 0.046 seconds, data points: 100 with Numpy, existing "slow pure python" code, number_tests=1
Time: 0.385 seconds, data points: 300 with Numpy, existing "slow pure python" code, number_tests=1
Time: 4.263 seconds, data points: 1000 with Numpy, existing "slow pure python" code, number_tests=1
Time: 34.077 seconds, data points: 2800 with Numpy, existing "slow pure python" code, number_tests=1
Time: 296.244 seconds, data points: 8000 with Numpy, existing "slow pure python" code, number_tests=1
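For reference, a wall-clock timing harness like the one producing the numbers above can be sketched as below. The helper and the stand-in classifier are hypothetical; the real script is mapclassify\tests\time_fisher_jenkss.py.

```python
import random
import time

def time_call(fn, *args, repeats=1):
    """Time fn(*args) over `repeats` runs; return the best wall-clock time."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

# Hypothetical stand-in for the classifier under test.
def classify_stub(values, k=5):
    step = max(1, len(values) // k)
    return sorted(values)[::step][:k]

data = [random.random() for _ in range(1000)]
elapsed = time_call(classify_stub, data, repeats=3)
print(f"Time: {elapsed:.3f} seconds, data points: {len(data)}")
```

Taking the best of several repeats (rather than a single run) reduces noise from other processes, though the figures quoted above use number_tests=1.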

Fixes #118 if Numba is unavailable

@knaaptime
Member

thanks for this :) we'll take a look. (funny timing, actually, because i have some brand new materials that use jenks (for the first time in ages), albeit in a python 3.12 environment where numba isn't yet available, and the slowdown is very noticeable on a medium-sized dataset)

@JamesParrott
Author

thanks for this :) we'll take a look.

You're welcome. I didn't realise there were more tests than I spotted at first glance (I've delved into those and fixed them, by the way). Commit 0b48854 makes self.bins a numpy array again, so this PR no longer changes the return type or breaks anyone's code.
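The compatibility fix mentioned above amounts to something like the following sketch. The helper function here is invented for illustration; the point is only that an internal pure-Python routine can return plain lists while the classifier still exposes a numpy array, preserving the public API.

```python
import numpy as np

# Hypothetical placeholder for a pure-Python bin computation; the real
# Fisher-Jenks logic is much more involved.
def pure_python_bins(values, k):
    step = max(1, len(values) // k)
    return sorted(values)[step - 1::step][:k]

values = [4.0, 1.0, 3.0, 2.0, 5.0, 6.0]
# Convert back to an ndarray at the boundary, so callers who rely on
# numpy methods (e.g. bins.max()) are unaffected.
bins = np.array(pure_python_bins(values, k=3))
```

Converting once at the boundary costs a single array construction, which is negligible next to the O(n^2 * k) dynamic-programming loops.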

@jGaboardi
Member

@JamesParrott make sure the files you are pushing to this PR are formatted and linted properly. Currently, pre-commit is failing due to this.

@JamesParrott
Author

JamesParrott commented Dec 14, 2023

@JamesParrott make sure the files you are pushing to this PR are formatted and linted properly. Currently, pre-commit is failing due to this.

As I said above, this PR is for info only. But sure, no problem. I'll do it once the currently running actions finish, the tests pass, and I have time.

Is that why the actions runners are taking so long to complete, or is it just the Windows and MacOS ones?

@jGaboardi
Member

just the Windows and MacOS ones

exactly
