Percentile Linear returns incorrect value. #7875
Comments
How did you arrive at your expected values? I agree with the numpy values using the linear interpolation. For example, the 75th percentile, given there are 60 items in your list, should be the 44.25th element in the sorted list. Using the linear interpolation, this is |
What numpy version? |
I get the same values on 1.11.1 |
I am using numpy 1.11.1. The values I expected came first from SPSS, which I also checked using Excel's percentile function. Both of these returned these values: 75 = 14700792987.25 |
Should the formula not be (n+1)*0.75, giving 45.75? |
I don't think so, (n-1)*0.75 will give you the 75th percentile, starting from a 0 index. For example, in a 5 item list, the second to last item is the 75th percentile, and (5-1) * 0.75 = 3, which gives the index of the second to last item. Whereas (5+1)*0.75 would be 4.5. |
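To make the (n-1)*p position concrete, here is a minimal pure-Python sketch of that rule with linear interpolation. The function name `percentile_linear` and the sample list are made up for illustration; this is not NumPy's actual implementation, only the convention described in the comment above.

```python
import math

def percentile_linear(values, q):
    """Percentile at zero-based position (n - 1) * q / 100, linearly interpolated."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0   # fractional zero-based index
    lo = math.floor(pos)                # element just below the position
    hi = min(lo + 1, len(data) - 1)     # element just above, clamped at the end
    frac = pos - lo                     # how far we are between lo and hi
    return data[lo] + frac * (data[hi] - data[lo])

sample = [10, 20, 30, 40, 50]           # hypothetical 5-item list
print(percentile_linear(sample, 75))    # 40.0, the second to last item
```

With 60 items the same rule lands on position 59 * 0.75 = 44.25, i.e. a quarter of the way from the element at index 44 to the element at index 45.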
Hmm, in MATLAB I still get different answers:
|
The point is that d[44.25] is 14700792987.25 when doing linear interpolation, so numpy looks wrong to me, or at least computes a weird result which does match the documentation (depending on how you interpret it ...) |
oh no its |
R
|
I get 14102378961.75 for a[44.25] (i.e. a[44] + 0.25 * (a[45] - a[44])), as expected. MATLAB will be different as they use a different percentile scheme, as per http://www.mathworks.com/help/stats/prctile.html. For example, in a 5 element list, MATLAB views the first element as the 10th percentile, whereas numpy sees it as the 0th percentile. There is no agreed system for percentiles, so neither is "correct". |
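A small sketch of the two conventions for a 5 element list may help. The midpoint formula 100 * (i + 0.5) / n is my reading of the linked MATLAB page and should be treated as an assumption; the function name `element_percentiles` is hypothetical.

```python
def element_percentiles(n):
    """Percentile assigned to each sorted element under two conventions (n >= 2)."""
    # NumPy "linear" default: element i (zero-based) sits at 100 * i / (n - 1).
    numpy_linear = [100.0 * i / (n - 1) for i in range(n)]
    # Midpoint convention (assumed to match the linked MATLAB page):
    # element i sits at 100 * (i + 0.5) / n.
    midpoint = [100.0 * (i + 0.5) / n for i in range(n)]
    return numpy_linear, midpoint

numpy_linear, midpoint = element_percentiles(5)
print(numpy_linear)  # [0.0, 25.0, 50.0, 75.0, 100.0]
print(midpoint)      # [10.0, 30.0, 50.0, 70.0, 90.0]
```

Both schemes interpolate linearly between neighbouring elements; they only disagree on which percentile each element represents, which is why the results differ.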
R has nine different percentile types, and I doubt it covers all of these combinations.... It would be nice to clean it up for good, but we really need someone who knows this very well.... |
Useful article: "Selected methods for percentile estimation and their use in popular software". Method A: p(n+1); Method B: 0.5+pn; Method C: p(n-1)+1; Method D: p(n+1/3)+1/3 |
Hmm, in principle we could add more methods I guess. In practice, if our stuff currently is OK, maybe it would be good enough to document it and cite that or similar paper(s)/books. This would be good in any case; percentile estimation seems a non-trivial thing, and at least interested users should be able to get enough input to figure it out. |
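For comparison, the four formulas can be evaluated side by side. This sketch assumes each formula gives a one-based rank into the sorted data and that the value is then linearly interpolated between neighbouring elements; the helper names `rank` and `percentile_by_rank` are made up here and do not come from the article or from NumPy.

```python
def rank(method, p, n):
    """One-based (possibly fractional) rank for fraction p in a sample of size n."""
    formulas = {
        "A": p * (n + 1),
        "B": 0.5 + p * n,
        "C": p * (n - 1) + 1,
        "D": p * (n + 1 / 3) + 1 / 3,
    }
    return formulas[method]

def percentile_by_rank(values, p, method):
    """Interpolate linearly at the rank given by the chosen method."""
    data = sorted(values)
    n = len(data)
    r = min(max(rank(method, p, n), 1.0), float(n))  # clamp rank to [1, n]
    k = int(r)                                       # integer part of the rank
    if k >= n:
        return float(data[-1])
    return data[k - 1] + (r - k) * (data[k] - data[k - 1])

for m in "ABCD":
    print(m, f"{rank(m, 0.75, 60):.4f}")
# A 45.7500
# B 45.5000
# C 45.2500
# D 45.5833
```

Method C at one-based rank 45.25 is the same position as the zero-based index 44.25 discussed earlier in the thread, so it corresponds to the current numpy behaviour; Method A gives the 45.75 suggested above.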
@mikep2016 - thanks for tracking down that information. Would you consider adding that to the docstring - it would be very useful. |
I have added the alternative methods to the docstring and created a pull request. |
To contribute to this discussion... The blue line is Method1, which is the oldest/simplest "standard" definition as the inverse of the cumulative distribution function. There is no equivalent of this currently implemented in numpy. The default method "Linear" is fine. It corresponds to method 7 in H&F. (Note that all methods 4-9 are somehow linear, and if it were up to me I would have assigned Method4 to this name.) The other four (nearest-lower-upper-midpoint) do not directly correspond to any of those.
Finally, IMO, having worked on something similar to scipy/scipy#6801, scipy/scipy#6801 and scipy/scipy#6466, I feel that it would be quite useful to have at least Method1, to have a "proper inverse" of the empirical cdf. What do you think? |
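As a rough illustration of what a "proper inverse" of the empirical cdf could look like, here is a minimal sketch. It assumes the usual reading of H&F Method 1 (take the smallest sorted element whose cumulative fraction is at least p, with no interpolation); the function name `inverse_ecdf` and the sample data are hypothetical.

```python
import math

def inverse_ecdf(values, p):
    """Smallest sorted element x with ECDF(x) >= p, for 0 <= p <= 1."""
    data = sorted(values)
    n = len(data)
    k = max(math.ceil(n * p), 1)   # one-based rank; p = 0 maps to the first element
    return data[k - 1]

sample = [3, 1, 4, 1, 5]           # made-up data
print(inverse_ecdf(sample, 0.5))   # 3
```

Unlike the interpolating methods, the result is always one of the observed values.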
I'm currently studying statistics from McClave's book. There is an example for the following data: Out[70]: First column is the index, second column is the data. In the book we get the following answer: If you use numpy 1.20.3 to calculate, let's say, the 95th percentile, the answer is: np.percentile(data,95) which is wrong. The correct calculation is the one presented in the book. I calculated the answer by hand. Step 1) First calculate the rank = 95/100 * (40 + 1) = 38.95 As you can see, the result is the same as in the book. I also calculated the correct answer for the OP, and he was right with his expected values. Most people didn't notice that there were actually 52 items and not 60 in his array. In conclusion, is numpy's percentile calculation doing something wrong, or which fraction is it using to compute the linear answer? |
There is nothing wrong with it, but there are different ways to calculate percentile, see gh-10736. In particular, like many many default values, NumPy's is a sample percentile and not a population estimate (there are many population estimate choices!). Comments on that PR are very welcome. |
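The gap between the hand calculation and numpy's default can be seen from the rank alone. This sketch only compares the two one-based rank formulas for n = 40 and p = 0.95; the function names are made up, and the book's data is not reproduced here.

```python
def rank_n_plus_1(p, n):
    """One-based rank used in the hand calculation above: p * (n + 1)."""
    return p * (n + 1)

def rank_numpy_default(p, n):
    """One-based rank equivalent to the zero-based (n - 1) * p position."""
    return p * (n - 1) + 1

n, p = 40, 0.95
print(f"{rank_n_plus_1(p, n):.2f}")       # 38.95
print(f"{rank_numpy_default(p, n):.2f}")  # 38.05
```

The two conventions interpolate between the same pair of sorted values (the 38th and 39th) but at different fractions, 0.95 versus 0.05, which is enough to explain a different answer. As far as I know, NumPy 1.22 and later let you pick the p * (n + 1) convention with `np.percentile(data, 95, method="weibull")`.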
@jiglesiasc I am planning on merging the PR working on this today, after that fixups to the documentation (+would be very welcome). There will definitely be no change in default. It would be much too disruptive and we will follow R here. (Happy to be corrected about all of those things, I am not a statistician, but the default would be painful to change and would require some extremely thorough convincing.) Seems the original issue was also about the definition of the method being used and is covered by the other issue (and PR), so I will close this. Thanks all! |
The percentile function does not return the expected values. It seems to be getting the linear distance between two points the wrong way around and returns a value closer to the lower number.
For example:
expected values:
When reading the documentation, the linear methodology seems correct, but could there be an issue with the fraction it is using?