BUG: numpy.percentile output is not sorted #14685

Closed
A4Vision opened this issue Oct 12, 2019 · 16 comments · Fixed by #16273

Comments

@A4Vision

A4Vision commented Oct 12, 2019

The output of numpy.percentile is not always sorted

Reproducing code example:

import numpy as np
q = np.arange(0, 1, 0.01) * 100
percentile = np.percentile(np.array([0, 1, 1, 2, 2, 3, 3 , 4, 5, 5, 1, 1, 9, 9 ,9, 8, 8, 7]) * 0.1, q)
equals_sorted = np.sort(percentile) == percentile
print(equals_sorted)
assert equals_sorted.all()

Error message:

[ True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True False False True True True True False
True True True False]
AssertionError Traceback (most recent call last)
in
1 q = np.percentile(np.array([0, 1, 1, 2, 2, 3, 3 , 4, 5, 5, 1, 1, 9, 9 ,9, 8, 8, 7]) * 0.1, np.arange(0, 1, 0.01) * 100)
2 equals_sorted = np.sort(q) == q
----> 3 assert equals_sorted.all()

AssertionError:

Numpy/Python version information:

NumPy 1.17.2
Python 3.6.8 (v3.6.8:3c6b436a57, Dec 24 2018, 02:04:31)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]

@eric-wieser
Member

Why would you expect it to be sorted? Percentile is elementwise - the outputs are in the order of the inputs.

@A4Vision
Author

Hi!
Indeed, percentile is element-wise when considering q, which in our case is np.arange(0, 1, 0.01) * 100.
I expect the output to be sorted because q is sorted.

@seberg
Member

seberg commented Oct 12, 2019

There are some numerical errors within a single ULP that differ for different inputs with the same output value. I doubt there is anything to be done about that.

@eric-wieser
Member

eric-wieser commented Oct 13, 2019

A slightly reduced failing case:

In [40]: np.percentile(np.array([0, 1, 1, 2, 2, 3, 3 , 4, 5, 5, 1, 1, 9, 9 ,9, 8, 8, 7]) * 0.1, [89, 90, 95, 96, 98, 99])
Out[40]: array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9])

In [41]: np.diff(_)
Out[41]:
array([-1.11022302e-16,  2.22044605e-16, -1.11022302e-16,  1.11022302e-16,
       -1.11022302e-16])

Here the lack of sorting shows up in the diff.

I think there probably is something we can do about this. I think this comes down to the stability of these lines, which perform a lerp operation (essentially add(v_below*weights_below, v_above*weights_above)):

weights_above = indices - indices_below
weights_below = 1 - weights_above

x1 = take(ap, indices_below, axis=axis) * weights_below
x2 = take(ap, indices_above, axis=axis) * weights_above

if out is not None:
    r = add(x1, x2, out=out)
else:
    r = add(x1, x2)

There are a bunch of tradeoffs to be made when linearly interpolating floating point values, but I suspect that there's a "correct" choice here, and we just haven't made it.

Some more background here: https://math.stackexchange.com/questions/907327/accurate-floating-point-linear-interpolation
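
A minimal sketch (not the NumPy code path) contrasting the current two-product arrangement with a rearranged one on a flat region like the one above, where v_below == v_above:

import numpy as np

# Both endpoints equal, as between the repeated 9s in the reduced case above.
a = b = np.float64(0.9)
t = np.linspace(0.0, 1.0, 11)

# Current arrangement: v_below*weights_below + v_above*weights_above.
# The two rounded products need not sum back to exactly 0.9 for every t.
two_product = a * (1 - t) + b * t

# Rearranged: when a == b, (b - a) is exactly 0.0, so the result is a for every t.
one_product = a + (b - a) * t

print(two_product - 0.9)  # may contain entries one ULP away from zero
print(one_product - 0.9)  # exactly 0.0 everywhere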

@seberg
Member

seberg commented Oct 14, 2019

Yeah, I agree, +1 on reorganizing the operations so that the result is strictly monotonic (numerically). It would be good if it is also no worse, or at least almost identical, precision-wise. I am sure we really do not have to worry about a few extra operations/speed here.

EDIT: Marked as a good first issue. This is only a good first issue if you are willing to dive into the intricacies of IEEE floating-point numbers, but after that it is probably a fairly straightforward reorganization within Python code.

@ngonzo95

I would be interested in taking on this issue. I was looking at some of the failing cases and noticed that they all involve linearly interpolating between the same number; i.e. in Eric's example, all of the percentiles he listed fall between two 9s, so the linear interpolation between them should be exactly 9. Fixing the problem of linearly interpolating between two numbers that are the same seems like it would deal with the issues presented in this bug without a noticeable hit in performance. If, however, we want to ensure that the linear interpolation is always monotonic, we can do that, but it would require a piecewise function, which I would expect to decrease performance.

@seberg
Member

seberg commented Oct 16, 2019

@ngonzo95 there should be a way to spell the arithmetic of the interpolation differently to achieve this, i.e. change/rearrange the formula that is used for the calculation (so that it is mathematically identical, but numerically guarantees monotonicity). No piecewise calculation should be necessary.

@eric-wieser
Member

eric-wieser commented Oct 16, 2019

No piecewise calculation should be necessary.

It depends on what your requirements on lerp are. Some that we may or may not care about:

  • monotonic ((lerp(a, b, t1) - lerp(a, b, t0)) * (b - a) * (t1 - t0) >= 0)
  • bounded (a <= lerp(a, b, t) <= b)
  • symmetric (lerp(a, b, t) == lerp(b, a, 1-t))

(0 <= t <= 1)
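
A small check harness for these properties (illustrative code, not NumPy API):

import numpy as np

def check_lerp(lerp, a, b, n=10001):
    # Returns (monotonic, bounded, symmetric) flags for a candidate lerp.
    t = np.linspace(0.0, 1.0, n)
    y = lerp(a, b, t)
    monotonic = bool(np.all(np.diff(y) * np.sign(b - a) >= 0))
    bounded = bool(np.all((min(a, b) <= y) & (y <= max(a, b))))
    symmetric = bool(np.array_equal(y, lerp(b, a, 1.0 - t)))
    return monotonic, bounded, symmetric

# e.g. compare the current two-product arrangement with a rearranged one:
print(check_lerp(lambda a, b, t: a * (1 - t) + b * t, 0.1, 0.9))
print(check_lerp(lambda a, b, t: a + (b - a) * t, 0.1, 0.9))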

@seberg
Member

seberg commented Oct 16, 2019

Oh OK, I did not expect a piecewise calculation to be necessary, but I do not know the intricacies of this well enough, I guess.

@ngonzo95

Looking into it more, I discovered that the function a + (b-a)*t has the property of being both monotonic (by the definition noted above) and consistent (lerp(a, a, t) = a). I believe this should be sufficient for the function's requirements. One of the main drawbacks of this form is that lerp(a, b, 1) != b; however, I think the way we are calculating weights ensures that 0 <= t < 1.
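
The lerp(a, b, 1) != b drawback is easy to see with a contrived pair of magnitudes (not values that arise from percentile weights):

import numpy as np

a, b = np.float64(1e16), np.float64(1.0)
# (b - a) rounds to -1e16, so the recomputed endpoint misses b entirely:
print(a + (b - a) * np.float64(1.0))  # 0.0 rather than 1.0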

@eric-wieser
Member

One of the main drawbacks of this form is that lerp(a, b, 1) != b; however, I think the way we are calculating weights ensures that 0 <= t < 1.

Note that, unfortunately, lerp(a, b, 1-eps) > b is possible with that formulation.

@anshulshankar

I'm new to open source and wanted to solve this as my good first issue. How can I contribute? Are there any prerequisites?

@glemaitre

I was looking at some of the failing cases and noticed that they all involved linearly interpolating between the same number

In scikit-learn, we recently stumbled on this issue: scikit-learn/scikit-learn#15733

Since we expect q to be strictly increasing, we can apply np.maximum.accumulate to restore the ordering of the array. However, it would be great if the issue could be solved in NumPy directly. Is there anywhere we can dig in to work on a proper fix?
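
The workaround mentioned above, sketched against the example from this issue (assuming q is non-decreasing):

import numpy as np

a = np.array([0, 1, 1, 2, 2, 3, 3, 4, 5, 5, 1, 1, 9, 9, 9, 8, 8, 7]) * 0.1
q = np.arange(0, 100)  # non-decreasing percentiles
p = np.maximum.accumulate(np.percentile(a, q))  # clamp the tiny downward steps
assert (np.diff(p) >= 0).all()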

@eric-wieser
Member

@glemaitre: All of the relevant lines in numpy are linked in my comment above, #14685 (comment)

@arthertz

arthertz commented Dec 9, 2019

Hey, there seems to have been an update to one of the Stack Exchange answers linked by @eric-wieser, with a good alternative interpolation.
The thread includes a proof of monotonicity, and the proposed fix appears to address all of the issues mentioned.
If this makes sense for the issue, I would be willing to implement it as a first commit, or someone else could take it on.
[screenshot of the updated Stack Exchange answer]
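
One arrangement in the spirit of that answer is a two-sided lerp that interpolates from whichever endpoint is nearer, so lerp(a, b, 0) == a and lerp(a, b, 1) == b hold exactly. A simplified sketch (not necessarily the exact formulation that eventually landed):

import numpy as np

def lerp_two_sided(a, b, t):
    # a + (b - a)*t for t < 0.5, and b - (b - a)*(1 - t) otherwise,
    # so both endpoints are reproduced exactly.
    d = b - a
    return np.where(t < 0.5, a + d * t, b - d * (1 - t))

print(lerp_two_sided(0.1, 0.9, np.linspace(0.0, 1.0, 7)))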

@lumbric
Contributor

lumbric commented Dec 30, 2019

Note that there is another issue with lerp in quantile(): inf values are not handled correctly, see #12282.
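
A float-level illustration of how an infinite endpoint can poison a weighted sum of this shape, independent of the ordering problem above:

import numpy as np

v_below, v_above = np.float64(1.0), np.inf
w_below, w_above = np.float64(1.0), np.float64(0.0)
# inf * 0.0 is nan, and the nan propagates through the sum:
print(v_below * w_below + v_above * w_above)  # nan, even though w_above == 0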
