BUG: numpy.percentile output is not sorted #14685

Closed
A4Vision opened this issue Oct 12, 2019 · 16 comments · Fixed by #16273

Comments

@A4Vision

A4Vision commented Oct 12, 2019

The output of numpy.percentile is not always sorted

Reproducing code example:

import numpy as np
q = np.arange(0, 1, 0.01) * 100
percentile = np.percentile(np.array([0, 1, 1, 2, 2, 3, 3 , 4, 5, 5, 1, 1, 9, 9 ,9, 8, 8, 7]) * 0.1, q)
equals_sorted = np.sort(percentile) == percentile
print(equals_sorted)
assert equals_sorted.all()

Error message:

[ True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True False False True True True True False
True True True False]
AssertionError Traceback (most recent call last)
in
1 q = np.percentile(np.array([0, 1, 1, 2, 2, 3, 3 , 4, 5, 5, 1, 1, 9, 9 ,9, 8, 8, 7]) * 0.1, np.arange(0, 1, 0.01) * 100)
2 equals_sorted = np.sort(q) == q
----> 3 assert equals_sorted.all()

AssertionError:

Numpy/Python version information:

NumPy 1.17.2
Python 3.6.8 (v3.6.8:3c6b436a57, Dec 24 2018, 02:04:31)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]

@eric-wieser
Member

Why would you expect it to be sorted? Percentile is elementwise - the outputs are in the order of the inputs.

@A4Vision
Author

Hi!
Indeed, percentile is element-wise when considering q, which in our case is np.arange(0, 1, 0.01) * 100.
I expect the output to be sorted because q is sorted.

@seberg
Member

seberg commented Oct 12, 2019

There are some numerical errors within a single ULP that differ for different inputs with the same output value. I doubt there is anything to be done about that.

@eric-wieser
Member

eric-wieser commented Oct 13, 2019

A slightly reduced failing case:

In [40]: np.percentile(np.array([0, 1, 1, 2, 2, 3, 3 , 4, 5, 5, 1, 1, 9, 9 ,9, 8, 8, 7]) * 0.1, [89, 90, 95, 96, 98, 99])
Out[40]: array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9])

In [41]: np.diff(_)
Out[41]:
array([-1.11022302e-16,  2.22044605e-16, -1.11022302e-16,  1.11022302e-16,
       -1.11022302e-16])

Here the lack of sorting shows up in the diff.

I think there probably is something we can do about this. I think this comes down to the stability of these lines, which perform a lerp operation (essentially add(v_below*weights_below, v_above*weights_above)):

weights_above = indices - indices_below
weights_below = 1 - weights_above

x1 = take(ap, indices_below, axis=axis) * weights_below
x2 = take(ap, indices_above, axis=axis) * weights_above

if out is not None:
    r = add(x1, x2, out=out)
else:
    r = add(x1, x2)

There are a bunch of tradeoffs to be made when linearly interpolating floating point values, but I suspect that there's a "correct" choice here, and we just haven't made it.

Some more background here: https://math.stackexchange.com/questions/907327/accurate-floating-point-linear-interpolation
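
A minimal sketch (not the NumPy code path) contrasting the current two-product arrangement with a rearranged one on a flat region like the one above, where v_below == v_above:

import numpy as np

# Both endpoints equal, as between the repeated 9s in the reduced case above.
a = b = np.float64(0.9)
t = np.linspace(0.0, 1.0, 11)

# Current arrangement: v_below*weights_below + v_above*weights_above.
# The two rounded products need not sum back to exactly 0.9 for every t.
two_product = a * (1 - t) + b * t

# Rearranged: when a == b, (b - a) is exactly 0.0, so the result is a for every t.
one_product = a + (b - a) * t

print(two_product - 0.9)  # may contain entries one ULP away from zero
print(one_product - 0.9)  # exactly 0.0 everywhere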

@seberg
Member

seberg commented Oct 14, 2019

Yeah, I agree, +1 on reorganizing the operations so that the result is strictly monotonic (numerically). It would be good if it is also no worse, or at least almost identical, precision-wise. I am sure we really do not have to worry about a few extra operations/speed here.

EDIT: Marked as a good first issue. This is only a good first issue if you are willing to dive into the intricacies of IEEE floating-point numbers, but after that it is probably a fairly straightforward reorganization within Python code.

@ngonzo95

I would be interested in taking on this issue. I was looking at some of the failing cases and noticed that they all involve linearly interpolating between the same number; i.e. in Eric's example, all of the percentiles he listed fall between two 9s, so the linear interpolation between them should be exactly 9. Fixing the problem of linearly interpolating between two numbers that are the same seems like it would deal with the issues presented in this bug without a noticeable hit in performance. If, however, we want to ensure that the linear interpolation is always monotonic, we can do that, but it would require a piecewise function, which I would expect to decrease performance.

@seberg
Member

seberg commented Oct 16, 2019

@ngonzo95 there should be a way to spell the arithmetic of the interpolation differently to achieve this, i.e. change/rearrange the formula that is used for the calculation (so that it is mathematically identical, but numerically guarantees monotonicity). No piecewise calculation should be necessary.

@eric-wieser
Member

eric-wieser commented Oct 16, 2019

No piecewise calculation should be necessary.

It depends on what your requirements on lerp are. Some that we may or may not care about:

  • monotonic ((lerp(a, b, t1) - lerp(a, b, t0)) * (b - a) * (t1 - t0) >= 0)
  • bounded (a <= lerp(a, b, t) <= b)
  • symmetric (lerp(a, b, t) == lerp(b, a, 1-t))

(0 <= t <= 1)
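
A small check harness for these properties (illustrative code, not NumPy API):

import numpy as np

def check_lerp(lerp, a, b, n=10001):
    # Returns (monotonic, bounded, symmetric) flags for a candidate lerp.
    t = np.linspace(0.0, 1.0, n)
    y = lerp(a, b, t)
    monotonic = bool(np.all(np.diff(y) * np.sign(b - a) >= 0))
    bounded = bool(np.all((min(a, b) <= y) & (y <= max(a, b))))
    symmetric = bool(np.array_equal(y, lerp(b, a, 1.0 - t)))
    return monotonic, bounded, symmetric

# e.g. compare the current two-product arrangement with a rearranged one:
print(check_lerp(lambda a, b, t: a * (1 - t) + b * t, 0.1, 0.9))
print(check_lerp(lambda a, b, t: a + (b - a) * t, 0.1, 0.9))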

@seberg
Member

seberg commented Oct 16, 2019

Oh OK, I did not expect a piecewise calculation to be necessary, but I do not know the intricacies of this well enough, I guess.

@ngonzo95

Looking into it more, I discovered that the function a + (b-a)*t has the property of being both monotonic (by the definition noted above) and consistent (lerp(a, a, t) = a). I believe this should be sufficient for the function's requirements. One of the main drawbacks of this form is that lerp(a, b, 1) != b; however, I think the way we are calculating weights ensures that 0 <= t < 1.
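
The lerp(a, b, 1) != b drawback is easy to see with a contrived pair of magnitudes (not values that arise from percentile weights):

import numpy as np

a, b = np.float64(1e16), np.float64(1.0)
# (b - a) rounds to -1e16, so the recomputed endpoint misses b entirely:
print(a + (b - a) * np.float64(1.0))  # 0.0 rather than 1.0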

@eric-wieser
Member

One of the main drawbacks of this form is that lerp(a, b, 1) != b; however, I think the way we are calculating weights ensures that 0 <= t < 1.

Note that, unfortunately, lerp(a, b, 1-eps) > b is possible with that formulation.

@anshulshankar

I'm new to open source and wanted to solve this as my good first issue. How can I contribute? Are there any prerequisites?

@glemaitre

I was looking at some of the failing cases and noticed that they all involved linearly interpolating between the same number

In scikit-learn, we recently stumbled on this issue: scikit-learn/scikit-learn#15733

Since we expect q to be strictly increasing, we can apply np.maximum.accumulate to restore the ordering of the array. However, it would be great if the issue could be solved in NumPy directly. Is there anywhere we can dig in to work on a proper fix?
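
The workaround mentioned above, sketched against the example from this issue (assuming q is non-decreasing):

import numpy as np

a = np.array([0, 1, 1, 2, 2, 3, 3, 4, 5, 5, 1, 1, 9, 9, 9, 8, 8, 7]) * 0.1
q = np.arange(0, 100)  # non-decreasing percentiles
p = np.maximum.accumulate(np.percentile(a, q))  # clamp the tiny downward steps
assert (np.diff(p) >= 0).all()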

@eric-wieser
Member

@glemaitre: All of the relevant lines in numpy are linked in my comment above, #14685 (comment)

@arthertz

arthertz commented Dec 9, 2019

Hey, there seems to have been an update to one of the Stack Exchange answers linked by @eric-wieser, with a good alternative interpolation.
The thread includes a proof of monotonicity, and the proposed fix appears to address all of the issues mentioned.
If this makes sense for the issue, I would be willing to implement it as a first commit, or someone else could take it on.
[screenshot of the updated Stack Exchange answer]
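
One arrangement in the spirit of that answer is a two-sided lerp that interpolates from whichever endpoint is nearer, so lerp(a, b, 0) == a and lerp(a, b, 1) == b hold exactly. A simplified sketch (not necessarily the exact formulation that eventually landed):

import numpy as np

def lerp_two_sided(a, b, t):
    # a + (b - a)*t for t < 0.5, and b - (b - a)*(1 - t) otherwise,
    # so both endpoints are reproduced exactly.
    d = b - a
    return np.where(t < 0.5, a + d * t, b - d * (1 - t))

print(lerp_two_sided(0.1, 0.9, np.linspace(0.0, 1.0, 7)))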

@lumbric
Contributor

lumbric commented Dec 30, 2019

Note that there is another issue with lerp in quantile(): inf values are not handled correctly, see #12282.
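
A float-level illustration of how an infinite endpoint can poison a weighted sum of this shape, independent of the ordering problem above:

import numpy as np

v_below, v_above = np.float64(1.0), np.inf
w_below, w_above = np.float64(1.0), np.float64(0.0)
# inf * 0.0 is nan, and the nan propagates through the sum:
print(v_below * w_below + v_above * w_above)  # nan, even though w_above == 0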
