Restructure percentile methods #10736
I think this already exists? Using the wikipedia example:

```python
>>> np.percentile([15, 20, 35, 40, 50], [5, 30, 40, 50, 100], interpolation='lower')
array([15, 20, 20, 35, 50])
```
It does not. Look at example 2 on the wikipedia page:

```python
>>> np.percentile([3, 6, 7, 8, 8, 10, 13, 15, 16, 20], [25, 50, 75, 100], interpolation='lower')
array([ 7,  8, 13, 20])
```

when it should be `array([ 7, 8, 15, 20])`. It similarly fails in the third example.
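To make the disagreement concrete, the nearest-rank rule from the Wikipedia page can be worked out directly; a minimal sketch in plain Python (not numpy):

```python
import math

# Nearest-rank: the P-th percentile is the value at rank ceil(P/100 * N)
# in the sorted list (1-based ranks).
a = [3, 6, 7, 8, 8, 10, 13, 15, 16, 20]
for p in (25, 50, 75, 100):
    rank = math.ceil(p / 100 * len(a))  # 3, 5, 8, 10
    print(p, a[rank - 1])               # 7, 8, 15, 20
```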
Nearest sounds a lot like "nearest"? Though there is always the further question of how exactly the boundaries work.

I don't want to read it, but I think the difference might be the C parameter further down, so if someone who knows this wants to add it....

Frankly, I think adding the C parameter would be good. But mostly, better documentation would be nice, and someone who really knows this stuff is needed....
I don't know if this has anything to do with the C parameter, although I agree that the option of choosing it could be desirable. I have found another thread that incidentally brought up this issue (Dec. 2016). It seems that the algorithm I am looking for (which wikipedia calls nearest-rank) is mentioned in the commonly cited Hyndman & Fan (H&F) paper as the oldest and most studied definition of percentile (it was the one I learned in my stats course). It is a discontinuous function, so I think the parameter C does not apply here (I may be wrong). Here is how it looks against the other numpy options that intuitively seem to compute a similar thing (i.e., 'lower', 'nearest'):
To me it looks exactly like the C parameter at first sight; the nearest curve is more stretched than the H&F curve, which is expected since numpy uses 1 and apparently H&F uses 0.

If you want proof, repeat the whole thing with the same values repeated 1000 times; my guess is that they will converge.
A graph like that would be a great addition to the percentile docs. Edit: preferably one showing the open/closedness of the discontinuities. Note to readers: to keep this thread manageable, I've marked all discussions below about adding this graph to the docs as "resolved". The graph is now at the bottom of https://numpy.org/devdocs/reference/generated/numpy.percentile.html.
@seberg I will be honest here: I don't know how the interpolation is being calculated based on the C parameter. What makes me think it is not related is that the C parameter is only discussed in the linear interpolation section (Wikipedia), and both the Wikipedia page and the Hyndman & Fan paper discuss the algorithm I requested in sections separate from the interpolation ones. I don't know if there is any interpolation parameter that always gives the same results as the algorithm I am interested in. Even if there is, should this be the way to get to it? Changing a 'strange' parameter to get the most common definition of percentile does not seem the best way to implement it, imho.
@ricardoV94, maybe, but you can't just change the defaults, no matter how bad they are. We could expose something like method="H&F" to override both parameters at once. The C parameter is where you define 0% and 100% to be with respect to the datapoints (on the data point or not, etc.).

@seberg I am fine with method="H&F" or maybe method="classic". interpolation="none" could also make sense.
I think you can change the percentiles you pass in by rescaling them (see the scale_percentiles code further down).

And voila, with "nearest" you will get your "H&F" and with linear you will get the plot from Wikipedia (pending that I got something wrong, but I am pretty sure I got it right). As I said, the difference is where you place the data points from 0-100 (evenly) with respect to the last point. For C=1 you put min(data) at the 0th percentile, etc. I have no clue about "what makes more sense"; it probably depends a bit on the general view. The name inclusive for 1 and exclusive for 0 makes some sense, I guess (when you think about the total range of percentiles, since for exclusive the possible range lies outside the data range); C=1/2 is also exclusive in that sense, though. I would be for adding the C parameter, but I would want someone to come up with a descriptive name if possible. I would also not mind something like a "method" or so to make the best defaults obvious (a combination of interpolation + C). Or we basically decide that most combinations are never used and not useful; fine then....

In the end my problem is: I want a statistician to tell me which methods have consensus (R has some stuff, but the last time someone came around here it was just a copy-paste of the R docs or similar, without setting it into numpy context at all; needless to say, it was useless for a general audience, and citing papers would have been more helpful).

I don't want to read that H&F paper (honestly it also does not look very slick to read), but I think you could look at it from a support point of view too. The numpy "nearest" (or any other) version does not have identical support (in the percentiles) for each data point; H&F has equal support for "nearest", and maybe for midpoint it would be C=1/2, not sure. EDIT: midpoint has equal support (for the area in between data points, not for the point itself) in numpy, so with "C=1".
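A minimal sketch of the plotting-position view described above, assuming the usual convention that sorted data point k of n (1-based) is assigned the percentile 100*(k - C)/(n + 1 - 2C); the helper name is made up for illustration:

```python
import numpy as np

def plotting_positions(n, C):
    """Percentile assigned to each of n sorted data points for a given C.

    C=1 reproduces numpy's convention (min at 0%, max at 100%),
    C=0.5 matlab's, and C=0 the one attributed to H&F above.
    """
    k = np.arange(1, n + 1)
    return 100 * (k - C) / (n + 1 - 2 * C)

for C in (0, 0.5, 1):
    print(C, plotting_positions(4, C))
# 0    [20. 40. 60. 80.]
# 0.5  [12.5 37.5 62.5 87.5]
# 1    [0. 33.33 66.67 100.]   (last row rounded)
```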
@seberg It does not seem to work for me. Can you post your code showing it working?
Well, I got the sign wrong in that code up there, so it was the opposite (C=0 is a no-op, not C=1):

```python
import numpy as np
import matplotlib.pyplot as plt

def scale_percentiles(p, num, C=0):
    """
    p : float
        percentiles to be used (within 0 and 100 inclusive)
    num : int
        number of data points.
    C : float
        parameter C; should be 0, 0.5 or 1. Numpy uses 1, matlab 0.5, H&F is 0.
    """
    # work on a float copy so integer input does not break the in-place ops
    p = np.asarray(p, dtype=np.float64).copy()
    fact = (num + 1. - 2 * C) / (num - 1)
    p *= fact
    p -= 0.5 * (fact - 1) * 100
    p[p < 0] = 0
    p[p > 100] = 100
    return p

plt.figure()
# the first data set has 4 points, so num=4
plt.plot(np.percentile([0, 1, 2, 3], scale_percentiles(np.linspace(0, 100, 101), 4, C=0), interpolation='nearest'))
plt.plot(np.percentile([0, 1, 2, 3], scale_percentiles(np.linspace(0, 100, 101), 4, C=1), interpolation='nearest'))

plt.figure()
plt.plot(np.percentile([15, 20, 35, 40, 50], scale_percentiles(np.linspace(0, 100, 101), 5, C=1), interpolation='linear'))
plt.plot(np.percentile([15, 20, 35, 40, 50], scale_percentiles(np.linspace(0, 100, 101), 5, C=0.5), interpolation='linear'))
plt.plot(np.percentile([15, 20, 35, 40, 50], scale_percentiles(np.linspace(0, 100, 101), 5, C=0), interpolation='linear'))
```
@seberg Close but not there yet. For it to work I had to make the list of percentiles a float array. The function for the classical method is simple (see my next comment). Despite that, the C argument seems to be working as expected, so it could be implemented if people want to use it for the interpolation. I still would like a method='classic' or interpolation='none' that works like the wikipedia one.
For debugging, this is my ugly non-numpy implementation of the classical method, and a more numpythonic one:
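(A minimal sketch of such implementations, assuming the nearest-rank rule rank = ceil(p/100 * n); the function names are illustrative, not the original snippets:)

```python
import math
import numpy as np

def classic_percentile(data, ps):
    """Plain-Python nearest-rank percentile for a list of percentiles ps."""
    a = sorted(data)
    return [a[max(1, math.ceil(p / 100 * len(a))) - 1] for p in ps]

def classic_percentile_np(data, ps):
    """Numpythonic version: ranks are ceil(p/100 * n), clipped to [1, n]."""
    a = np.sort(np.asarray(data))
    ranks = np.ceil(np.asarray(ps, dtype=float) / 100 * a.size).astype(int)
    return a[np.clip(ranks, 1, a.size) - 1]

print(classic_percentile_np([3, 6, 7, 8, 8, 10, 13, 15, 16, 20], [25, 50, 75, 100]))
# [ 7  8 15 20]
```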
Those differences sound like floating point/rounding issues (which you seem aware of), and maybe my guess of C=0 was wrong and you want C=0.5. My point was to say where the difference comes from (the "C parameter", IMO, though there are probably good reasons to dislike many combinations); it was not to give you/implement a workaround.

As to the "classical" method, I frankly do not care much what classical is supposed to be. For all I know, classical just means "quite a few people use it".

Solution wise, my first impression is that "classical", or whatever name, just adds another confusing option with an unclear name. I hope that this discussion could go in the direction of actually making all good (common) options available to users in a clean and transparent way, best in a way that people actually might understand.

We can add one more method, but frankly I only half like it. When we last added more methods (I don't remember what changed exactly) I already delayed and hoped that someone would jump up and figure out what we should have. Needless to say, it never really happened. And now I am trying to point to the differences and see how it might fit with what we currently have.

So, my impression is (with possible problems with rounding and exact percentile matches) that we have (probably too) many "interpolation" options and would require the "C parameter", or whatever you want to call it, to be able to do almost anything. And I would be really happy if someone could tell me how all the (common) methods out there fall into those categories; it seems that more than C=0, 0.5, 1 exist even, and maybe some even outside those options....

Maybe I am going down the wrong lane, but adding "Method1" with an unclear name that does not really tell anyone how it differs from the other methods does not seem helpful to me (except for someone who happens to already know the name "Method1" and is looking for it). And please don't say that "classic" is the one obvious one; there is way too much variance in implementations out there.

Another way might be to deprecate "interpolation", but having a list of methods is also much less nice than hinting "linear interpolation" to say that it is not a step behaviour, etc.... And if we go that way, I still want a reasonable overview.

You do not have to do it, but if we want to add a new method, we need a way to add it that does not confuse everyone even more and is clear!
Let me summarize it then:
In sum: the current options of numpy.percentile seem both rather confusing and limited. The paper mentioned above offers a good overview of other useful methods. Together with the wikipedia page, they could work as a starting point for the design of a more exhaustive and useful set of options to numpy.percentile. Hopefully someone would like to work on this task.
Hello, I read the threads #10736 and #7875 and also had a look at the current numpy implementation and the paper by Hyndman & Fan.

But as far as I understand the numpy code, it does not look like method 7.

I'm not saying this is a bad implementation; it just doesn't look like method 7 to me. However, I'm not a mathematician or statistician, so I'm probably not understanding what is going on. If anyone could explain, I would gladly add some doc about this. Also, about the default implementation to choose: Hyndman and Fan stated that method 8 should be preferred; you can read that in the conclusion of the paper (PDF) as well as in a blog post by Hyndman.
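(For what it's worth, numpy's default can be checked numerically against H&F's method 7; a quick sketch, assuming their definition h = (n - 1)q + 1, written 0-based below:)

```python
import numpy as np

def hf_method7(data, q):
    """H&F method 7: position h = (n - 1) * q (0-based), then interpolate."""
    a = np.sort(np.asarray(data, dtype=float))
    h = (a.size - 1) * q
    lo = int(np.floor(h))
    hi = min(lo + 1, a.size - 1)
    return a[lo] + (h - lo) * (a[hi] - a[lo])

data = [15, 20, 35, 40, 50]
for q in (0.05, 0.3, 0.4, 0.5, 1.0):
    assert np.isclose(hf_method7(data, q), np.percentile(data, 100 * q))
# no assertion fires: numpy's default linear interpolation matches method 7
```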
Since this has not moved for years, in big part due to lack of clear input from someone who can give clear opinions, let's try pinging the right person ;). @robjhyndman, sorry if you do not have the time/interest for input, but maybe you can help us unstick this NumPy issue. The problem: NumPy's percentile currently exposes a confusing and limited set of "interpolation" options.
My questions for you are:
I would like to make the choice as easy as possible for the user, either by reducing the methods or by "hiding" some of the methods away a bit. But I do not have the confidence to step up and make that choice, and it seems nobody who passed by here in a few years really had the confidence either.
This is a little ironic, because the 1996 paper I wrote with Fan was intended to describe all the methods in use by software at the time and to encourage everyone to standardize on one method (8). Instead, the effect of the paper has been for software to add all the methods they were missing, and now NumPy has some that didn't exist back in 1996! I coauthored the quantile function in R.

While I lost the argument to standardize on method 8, I still think standardization is important. I think the best solution would be if NumPy replicated what the R quantile function provides.
Thanks! Neither can we change the default (but it is 7, so that is not terrible). I think what makes me hesitant is mostly how to present the methods (and guide users). I am still not sure about it.
This opinion is probably enough for me to just accept it if anyone does a PR to add the missing methods. Can you/we think of some names, like:
Just reading the word "unbiased" would entice me to check the notes! We could still provide the types as integers but use the strings to nudge users towards "recommended" (i.e. named) options. Python uses "inclusive" and "exclusive" for the two methods in statistics.quantiles. That would give us named types 6, 7, 8, and 9. The discontinuous options feel a bit more awkward to me; IIRC one of them is sometimes named...
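(For reference, the stdlib precedent: statistics.quantiles exists since Python 3.8, and its 'exclusive' corresponds to H&F type 6 while 'inclusive' corresponds to type 7:)

```python
import statistics

data = [15, 20, 35, 40, 50]
print(statistics.quantiles(data, n=4, method='exclusive'))  # [17.5, 35.0, 45.0]
print(statistics.quantiles(data, n=4, method='inclusive'))  # [20.0, 35.0, 40.0]
```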
Agreed that the names are better than numbers. Here is an extended list of suggestions.
Maybe exclusive/inclusive don't fit well enough anyway... As far as I understand, they refer to "including" the actual population range (i.e. the population minimum/maximum).
Honestly, I don't see the point of including anything other than the current default and a new...

I didn't realise Wolfram had given them names. Happy to use...
I'm working on an implementation which will handle "the 9" and will possibly improve performance as well. The performance boost seems especially true for nanpercentile. I hope to publish a PR here soon.
- Added the missing interpolation methods as an enumeration
- Reworked the whole process to compute nanquantile. This may significantly improve the performances
- Updated unit tests accordingly
- Added the missing interpolation methods as an enumeration.
- Reworked the whole process to compute quantiles.
- There is now a single function for nanquantiles and quantile. This can significantly improve the performances for nanquantile.
- Updated the existing unit tests.
The implementation of "the 9" is complete. I was inspired by your paper, @robjhyndman, as well as your library, @ricardoV94, among other sources.
Not sure what you want my OK for, but happy to see this implemented in Python. A reference to the 1996 paper is sufficient.

No need to mention my library :)
- Added the missing linear interpolation methods.
- Updated the existing unit tests.
- Added pytest.mark.xfail for boolean arrays

See:
- numpy#19857 (comment)
- numpy#19154
I'm 100% a user and not a coder, so I just wanted to add my thoughts after reading this thread and the other related one:
Hello @jiglesiasc, the 9 methods have been merged into the main branch with #19857. I'm not sure if the issue should stay open, though.
Yeah, let's close it. The main thing is that I want to quickly follow up to rename the interpolation keyword (as discussed above). I think the current docs are good, but a snippet of text to guide users towards (safe) choices may be nice.
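(For readers arriving later, assuming NumPy >= 1.22, where the keyword ended up being called method, the request from the top of the thread now works like this:)

```python
import numpy as np

data = [3, 6, 7, 8, 8, 10, 13, 15, 16, 20]

# H&F method 1, i.e. the nearest-rank method requested at the top:
print(np.percentile(data, [25, 50, 75, 100], method='inverted_cdf'))
# [ 7  8 15 20]

# H&F method 8, the recommendation from the Hyndman & Fan paper:
print(np.percentile(data, 50, method='median_unbiased'))  # 9.0
```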
As exemplified in the Wikipedia page: https://en.wikipedia.org/wiki/Percentile#The_nearest-rank_method