Add move_quantile function #418

andrii-riazanov · 2022-09-23T08:35:01Z

This PR adds move_quantile to the list of supported move functions.

Why?

Quantiles (and moving quantiles) are often useful statistics to look at, and having a fast move version of quantile would be great.

How?

Moving/rolling quantile is implemented in almost exactly the same way as moving median: via two heaps (max-heap and min-heap). The only difference is in sizes of the heaps -- for move_median they should have the same size (modulo parity nuances), while for the move_quantile sizes of the heaps should be rebalanced differently.
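The heap-size idea can be illustrated with a minimal insert-only sketch in pure Python. This is not the C code from this PR: the function name `running_quantile` is hypothetical, it handles a growing prefix rather than a sliding window (so no element removal), and it assumes the "midpoint" method, where the quantile is the average of the two order statistics around position q * (n - 1).

```python
import heapq
import math


def running_quantile(values, q):
    """Quantile ("midpoint" method) of every prefix of `values`, via two heaps."""
    low = []   # max-heap (values stored negated): the smallest elements so far
    high = []  # min-heap: the remaining, larger elements
    out = []
    for n, x in enumerate(values, start=1):
        # Insert so that every element of `low` stays <= every element of `high`.
        if not low or x <= -low[0]:
            heapq.heappush(low, -x)
        else:
            heapq.heappush(high, x)

        # For a median the heaps stay (nearly) equal in size. For a general q,
        # `low` is rebalanced to hold floor(q * (n - 1)) + 1 elements instead.
        pos = q * (n - 1)
        target = math.floor(pos) + 1
        while len(low) > target:
            heapq.heappush(high, -heapq.heappop(low))
        while len(low) < target:
            heapq.heappush(low, -heapq.heappop(high))

        if pos == math.floor(pos):
            out.append(-low[0])                  # position falls exactly on an element
        else:
            out.append((-low[0] + high[0]) / 2)  # midpoint of the two neighbours
    return out


print(running_quantile([5, 1, 4, 2, 3], 0.25))  # [5, 3.0, 2.5, 1.5, 2]
```

With q=0.5 the target size reduces to the usual "half of the elements" rule of the two-heap median; a real moving version additionally has to remove the element that leaves the window.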

The changes to transform move_median into move_quantile are very minor, and were implemented in the first commit 524afbf (36++, 13--). This commit fully implemented move_quantile with fixed q=0.25 out of move_median.

The initial approach was to substitute move_median with move_quantile completely. Then, on a move_median call, just call move_quantile(q=0.5). This is implemented and tested in the commits up to de181da, where a fully working move_quantile (and move_median via move_quantile) was implemented.

At this point, the new move_median bench was compared to the old one. The new move_median turned out to be 1-3% slower. Even though the changes were minor, the newly introduced arithmetic operations were apparently enough to cause this overhead. For a performance-oriented package, such a decrease in speed is not justifiable.
It was therefore decided to implement move_quantile in parallel with move_median. This causes a lot of code repetition, but it was needed so as not to sacrifice move_median performance (and to avoid abusing macros); see cd49b4f. Many of the functions in move_median.c were almost duplicated, hence the large diff. At this commit, both move_quantile and move_median were fully implemented and almost fully tested.
When move_quantile is called with q=0., instead move_min is called, which is much faster. Similarly with q=1. and move_max, and with q=0.5 and move_median.
Only interpolation method "midpoint" was implemented for now.
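For reference, the "midpoint" method and the q=0, q=0.5 and q=1 special cases can be checked against a naive pure-Python version. This is only an illustrative sketch: `naive_move_quantile` is a hypothetical name, and the NaN and min_count handling (NaNs are dropped from each window; the output is NaN until the window holds min_count values) is an assumption modelled on how Bottleneck's other move functions behave.

```python
import math


def naive_move_quantile(a, window, q, min_count=None):
    """O(n * window * log(window)) moving quantile, "midpoint" method."""
    if min_count is None:
        min_count = window
    out = []
    for i in range(len(a)):
        # Trailing window ending at index i, sorted, with NaNs dropped.
        start = max(0, i - window + 1)
        win = sorted(x for x in a[start:i + 1] if not math.isnan(x))
        if len(win) < max(min_count, 1):
            out.append(math.nan)
            continue
        # "midpoint": average of the order statistics just below and just
        # above position q * (n - 1); both coincide when it is an integer.
        pos = q * (len(win) - 1)
        lower, higher = win[math.floor(pos)], win[math.ceil(pos)]
        out.append((lower + higher) / 2)
    return out


print(naive_move_quantile([4, 1, 3, 2, 5], 3, 0.25))  # [nan, nan, 2.0, 1.5, 2.5]
```

In this sketch q=0 returns the window minimum, q=1 the maximum and q=0.5 the median, which is why dispatching those cases to move_min, move_max and move_median does not change the result.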

Other changes

Function parse_args in move_template.c was heavily refactored for better clarity

Technicalities

np.nanquantile behaves unexpectedly when the data contain np.inf values. See, for instance, "BUG: np.percentile gives unreasonable results when array contains np.inf" (numpy/numpy#21932) and "BUG: inf in quantile has undefined behaviour (and possibly different for -inf vs +inf)" (numpy/numpy#21091). In particular, np.nanquantile(q=0.5) doesn't give the same result as np.nanmedian on such data, because of how arithmetic operations work on np.inf. Our move_quantile behaves as expected and agrees with move_median when q=0.5. To test properly (and to have a slow NumPy version of move_quantile), we note that the np.nanmedian behaviour can be reproduced by taking
(np.nanquantile(a, q=0.5, method="lower") + np.nanquantile(a, q=0.5, method="higher")) / 2. This is what the slow function uses when there are np.inf values in the data. That this expression and np.nanmedian return the same result is tested in move_test.py. The issue is also discussed there in the comments (which I used pretty liberally).
When there are no infs in a, the usual np.nanquantile is called in bn.slow.move_quantile, so benching is "fair", since we don't consider infinite values during benching.
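The arithmetic behind the inf problem can be reproduced without NumPy. The sketch below assumes that a generic quantile routine interpolates as lower + (higher - lower) * frac; that form is an assumption made for illustration (the linked NumPy issues describe the actual behaviour), and both helper names are hypothetical.

```python
import math


def interpolate_linear(lower, higher, frac):
    # Generic linear interpolation between two neighbouring order statistics.
    return lower + (higher - lower) * frac


def interpolate_midpoint(lower, higher):
    # Plain average of the "lower" and "higher" order statistics.
    return (lower + higher) / 2


inf = math.inf
linear = interpolate_linear(inf, inf, 0.5)  # inf - inf is nan, so the result is nan
midpoint = interpolate_midpoint(inf, inf)   # inf + inf is inf, so the result is inf

print(linear, midpoint)  # nan inf
```

When both neighbours are +inf, the difference-based form yields NaN, while averaging the "lower" and "higher" values yields inf.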

Tests

Extensive tests were added for move_quantile. With the constant REPEAT_FULL_QUANTILE set to 1 in test_move_quantile_with_infs_and_nans, the test considers 200k instances and takes ~7 mins to run. It was also run with more repetitions and a larger range of parameters; the current values are set so that the GitHub Actions tests run in a reasonable time.

Benches

bn.move_quantile is significantly faster than bn.slow.move_quantile:

    Bottleneck 1.3.5.post0.dev24; Numpy 1.23.1
    Speed is NumPy time divided by Bottleneck time
    None of the array elements are NaN

   Speed  Call                                  Array
   269.9  move_quantile(a, 1, q=0.25)           rand(1)
  2502.7  move_quantile(a, 2, q=0.25)           rand(10)
  6718.9  move_quantile(a, 20, q=0.25)          rand(100)
  5283.4  move_quantile(a, 200, q=0.25)         rand(1000)
  5747.2  move_quantile(a, 2, q=0.25)           rand(10, 10)
  3197.3  move_quantile(a, 20, q=0.25)          rand(100, 100)
  3051.9  move_quantile(a, 20, axis=0, q=0.25)  rand(100, 100, 100)
  3135.6  move_quantile(a, 20, axis=1, q=0.25)  rand(100, 100, 100)
  3232.8  move_quantile(a, 20, axis=2, q=0.25)  rand(100, 100, 100)

The increase in speed was tested and confirmed separately (outside of bn.bench) as a sanity check. q = 0.25 is used for all benches with move_quantile.

A slight complication is that these benches now take very long to run, because of how slow np.nanquantile is. bn.bench(functions=["move_quantile"]) runs for about 20 minutes:

Bottleneck performance benchmark
    Bottleneck 1.3.5.post0.dev24; Numpy 1.23.1
    Speed is NumPy time divided by Bottleneck time
    NaN means approx one-fifth NaNs; float64 used

              no NaN     no NaN      NaN       no NaN      NaN    
               (100,)  (1000,1000)(1000,1000)(1000,1000)(1000,1000)
               axis=0     axis=0     axis=0     axis=1     axis=1  
move_quantile 6276.9     1961.2     1781.8     2294.7     2255.1

Further changes

Several things that can be improved with move_quantile going further:

Implement more interpolation methods. The refactoring of the parse_args function made it much easier to pass additional arguments to functions in move, and changing the behavior of mq_get_quantile should not be a problem either.
np.quantile supports a list (iterable) of quantiles to compute. This could be added here too; it is quite easy if it is implemented first at the Python level.
I attempted to make q a required argument of move_quantile (as it should be), but ran into some complications and left it as is. If a Python wrapper is created to parse an iterable q anyway, a non-keyword q can be added to that Python layer.

Wrap-up

Thanks for considering, and sorry for a large diff. 50% of that is duplicating code in move_median.c, and another 20% is new tests. You can see in de181da how few changes were actually made for move_quantile to work, but this approach just unfortunately slowed down move_median by a bit.


andrii-riazanov · 2022-09-29T06:30:54Z

Update 1

The implementation of move_median.c was refactored in 72677f8 to remove the code repetition and the use of macros completely. Now move_median and move_quantile use the same functions for managing heaps, and differ only where they calculate the actual statistic. This makes the implementation of both mm and mq cleaner while keeping the performance of move_median unchanged. The diff in the source code is much smaller now.

andrii-riazanov · 2022-10-02T03:18:40Z

Update 2

In 2c892db I added a very simple Python layer for move_quantile to support an iterable q argument. The argument q was also made a required (non-default) argument on the Python layer. The documentation (copied from move_template.c) was updated correspondingly.
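A thin Python layer of this kind could look roughly like the sketch below. It is not the code from 2c892db: the wrapper name `move_quantile_any_q` and the `single_q_func` parameter are hypothetical, and returning one result per quantile as a list is an assumption made only to keep the example free of dependencies.

```python
from collections.abc import Iterable


def move_quantile_any_q(single_q_func, a, window, q, **kwargs):
    """Call `single_q_func` once per quantile when `q` is iterable.

    `single_q_func` stands in for a routine that accepts one scalar q.
    """
    if isinstance(q, Iterable):
        # One result per requested quantile, in the order given.
        return [single_q_func(a, window, q=qi, **kwargs) for qi in q]
    # Scalar q: pass straight through.
    return single_q_func(a, window, q=q, **kwargs)


def fake_move_quantile(a, window, q):
    # Stand-in used only to show the dispatch; it just echoes its arguments.
    return (len(a), window, q)


print(move_quantile_any_q(fake_move_quantile, [1, 2, 3], 2, 0.25))
print(move_quantile_any_q(fake_move_quantile, [1, 2, 3], 2, [0.25, 0.75]))
```

Because q is positional and has no default in the wrapper, it is required there regardless of how the underlying function declares it.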

andrii-riazanov · 2023-04-11T01:27:07Z

Hi @rdbisme, I was wondering if someone could take a look or make a comment on this PR at their convenience. I know it's a large one, just want to understand what I could expect from this. Thanks :)

rdbisme · 2023-04-14T16:02:40Z

Ehi @andrii-riazanov, thanks for your contribution. I'm currently alone managing this package, mostly focusing on keeping it easily available and installable on supported Python versions.

I hope someone else from the community can step in and help to review implementations and improvements of the actual business logic as for your PR.

Otherwise, I'll try to find a bit of free time to actually give it a look, but it might take time.

Anyway, if anyone is reading this, feel free to step in this discussion and provide feedback :)

RichieHakim · 2023-04-28T18:43:58Z

While I'm not able to help directly with the code, I'm very thankful and eager to try this out. Also:

Currently, the best moving quantiles are: pandas.DataFrame.rolling.quantile + multiprocessing, as well as rolling_quantiles (https://github.com/marmarelis/rolling-quantiles)
These are partially benchmarked here: (https://github.com/RichieHakim/rolling_percentile)
This is a paper on how median calculation can be done faster than typical insertion sort methods (https://www.stat.cmu.edu/~ryantibs/papers/median.pdf)

andrii-riazanov and others added 29 commits September 12, 2022 21:24

Add moving quantile

524afbf

Implementation of moving quantile instead of moving median, which is a partial case of quantile. For now using the fixed constant in #define

initial changes from median to quantile

4f98aa4

initial changes from median to quantile

e470fa2

Merge branch 'quantile' of https://github.com/andrii-riazanov/bottleneck

1c97aac

into quantile

Change all move_median to move_quantile

10b8824

Add quantile and has_quantile parameters to the template

Add move_median as move_quantile without q argument at C level

c718a18

Fix bug with addressing quantile before assignment

009c835

Initial tests and some fixes

832035a

Added all imports Fixed a bug when q=1. Call move_max for this case (on python layer) Added a lot of tests for move_median and move_quantile

Finish extensive testing of move_quantile

7af7168

Fix keyword argument "method"/"interpolation" for different numpy versions (keyword was changed after 1.22.0) Copy over doc string for move_quantile from C layer to Python layer

Ignore warnings from numpy about infs and NaNs

42eddad

move_quantile(q=0) vs move_min benching

c2a2ae3

Revert "move_quantile(q=0) vs move_min benching"

9f4c5dc

This reverts commit c2a2ae3. move_min is significantly faster than move_quantile with q = 0. So in case of q=0 apply move_min instead. Same for q=1 and move_max.

Some changes (to amend later)

9447697

Bench move_quantile(q=0.5) with slow.move_median

de181da

Bring old move_median, add move_quantile separately

02a0ce1

Instead of move_quantile substituting move_median completely, have both move_median and move_quantile implemented separately.

Finish bringing move_median back

cd49b4f

Move move_quantile to C level fully

fcaefde

Remove the wrapper for move_quantile on python level which checked for q = 0 or 1. Now it's fully in C. Also check for q=0.5 as we checked it's 3-4% faster to call move_median

Refactor parse_args function in move_template

1bddedd

Add some more tests

Add docs and comments

7a413cb

Update move_test.py

6851ed2

Remove redundant import

Actually add docs and comments

97ecd15

Add comments, modify tests, change back gitignore

a7d5c22

Refactor parse_args again to actually work

5a2bcac

Mostly get rid of macros

Dial tests back a little to run reasonable time

c390863

Modify benches, restore old files

6f1e5d4

Change packaging module to pkg_resources

7f1c3af

for versions comparison

Update move_quantile benches in asv with q=0.25

4dadfe4

Make mm_handle and mq_handle the same

9012e24

This eliminates the need for a macro in move_median.c. mm_handle will just have an unused member "quantile" for the case of move_median.

Median and quantile with function pointers

72677f8

move_median and move_quantile now have all the same functions except for the construction of mm/mq.

Support of iterable q argument for move_quantile

2c892db

andrii-riazanov and others added 4 commits October 2, 2022 03:37

Make tests work with positional q in move_quantile

654ab14

Merge branch 'master' into quantile

04ce117

Merge branch 'master' into quantile

00cd119

Merge branch 'master' into quantile

4ec8945
