
Restructure percentile methods #10736

Closed
ricardoV94 opened this issue Mar 12, 2018 · 69 comments
Labels
00 - Bug · 01 - Enhancement · 62 - Python API (Changes or additions to the Python API. Mailing list should usually be notified.) · Priority: high (High priority, also add milestones for urgent issues) · triage review (Issue/PR to be discussed at the next triage meeting)

Comments

@ricardoV94

As exemplified in the Wikipedia page: https://en.wikipedia.org/wiki/Percentile#The_nearest-rank_method

@eric-wieser
Member

I think this already exists? Using the Wikipedia example:

>>> np.percentile([15, 20, 35, 40, 50], [5, 30, 40, 50, 100], interpolation='lower')
array([15, 20, 20, 35, 50])

eric-wieser added the "57 - Close?" label (Issues which may be closable unless discussion continued) on Mar 13, 2018
@ricardoV94
Author

ricardoV94 commented Mar 13, 2018

It does not. Look at example 2 on the Wikipedia page:

>>> np.percentile([3, 6, 7, 8, 8, 10, 13, 15, 16, 20], [25, 50, 75, 100], interpolation='lower')
array([ 7,  8, 13, 20])

It should be [7, 8, 15, 20].

It fails similarly on the third example.
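For reference, here is a minimal sketch (not from the original comment) of the nearest-rank rule as the Wikipedia page defines it, rank = ceil(p/100 * N) with 1-based indexing; it reproduces the expected values for example 2:

import math

def nearest_rank(data, p):
    """Wikipedia's nearest-rank percentile: rank = ceil(p / 100 * N), 1-based."""
    ordered = sorted(data)
    rank = max(math.ceil(p / 100 * len(ordered)), 1)
    return ordered[rank - 1]

data = [3, 6, 7, 8, 8, 10, 13, 15, 16, 20]
print([nearest_rank(data, p) for p in (25, 50, 75, 100)])    # [7, 8, 15, 20]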

@seberg
Member

seberg commented Mar 13, 2018

"Nearest-rank" sounds a lot like "nearest"? Though there is always another point about how exactly the boundaries work.
EDIT: That is, where exactly are 0 and 100 considered to be, at the data point or before the data point? (That is IIRC; anyway, there are a lot of annoying complexities here.)

@seberg
Member

seberg commented Mar 13, 2018

I don't want to read it all, but I think the difference might be the C parameter mentioned further down, so if someone who knows this stuff wants to add it....

@seberg
Member

seberg commented Mar 13, 2018

Frankly, I think adding the C parameter would likely be a real improvement. But mostly, better documentation would be nice, and someone who really knows this stuff is needed....

@ricardoV94
Author

ricardoV94 commented Mar 13, 2018

I don't know if this has anything to do with the C-parameter, although I agree that the option of choosing it could be desirable.

I have found another thread that incidentally brought up this issue (Dec. 2016). It seems that the algorithm I am looking for (which Wikipedia calls nearest-rank) is mentioned in this commonly cited paper by Hyndman & Fan (H&F) as being the oldest and most studied definition of percentile (it was the one I learned in my stats course). It is a discontinuous function, so I think the parameter C does not apply here (I may be wrong).

Here is how it looks against the other options provided by numpy that intuitively seem to compute a similar thing (i.e., 'lower', 'nearest'):

[figure: comparison of the H&F nearest-rank method against numpy's interpolation='lower' and interpolation='nearest' for the same data]
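A rough sketch of how such a comparison can be reproduced (the data and plotting details are assumptions, not the original figure's code; interpolation= is used because that was the keyword at the time, later renamed to method=):

import math
import numpy as np
import matplotlib.pyplot as plt

data = [15, 20, 35, 40, 50]
ps = np.arange(1, 101)

# Wikipedia / H&F nearest-rank: rank = ceil(p/100 * N), 1-based
nearest_rank = [sorted(data)[max(math.ceil(p / 100 * len(data)), 1) - 1] for p in ps]

plt.plot(ps, nearest_rank, drawstyle='steps-post', label='H&F nearest-rank')
plt.plot(ps, np.percentile(data, ps, interpolation='lower'), label="interpolation='lower'")
plt.plot(ps, np.percentile(data, ps, interpolation='nearest'), label="interpolation='nearest'")
plt.xlabel('percentile')
plt.legend()
plt.show()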

@seberg
Member

seberg commented Mar 13, 2018

At first sight it looks exactly like the C parameter to me: the 'nearest' curve is more stretched than the H&F curve, which is expected since numpy uses C=1 and apparently H&F uses C=0.

@seberg
Member

seberg commented Mar 13, 2018

If you want proof: repeat the whole thing with the same values repeated 1000 times; my guess is the curves will converge.
EDIT: Or maybe not, I don't have the patience or time to really think about it. But I still think it is the C parameter Wikipedia mentions, so please prove me wrong :)
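A quick way to try that suggestion (a sketch, not from the thread; it just prints the two results side by side so they can be compared):

import numpy as np

data = [15, 20, 35, 40, 50]
ps = np.linspace(0, 100, 11)

small = np.percentile(data, ps, interpolation='nearest')
large = np.percentile(np.repeat(data, 1000), ps, interpolation='nearest')
print(np.column_stack([ps, small, large]))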

@eric-wieser
Member

eric-wieser commented Mar 13, 2018

A graph like that would be a great addition to the percentile docs

edit: preferably one showing the open/closedness of the discontinuities

Note to readers: To keep this thread manageable, I've marked all discussions below about adding this graph to the docs as "resolved". The graph is now at the bottom of https://numpy.org/devdocs/reference/generated/numpy.percentile.html.

@ricardoV94
Author

ricardoV94 commented Mar 13, 2018

@seberg I will be honest here: I don't know how the interpolation is calculated based on the C parameter. What makes me think it is not related is that the C parameter is only discussed in the linear-interpolation section (on Wikipedia), and both the Wikipedia page and the Hyndman & Fan paper discuss the algorithm I requested in sections separate from the interpolation ones.

I don't know if there are any interpolation parameters that always give the same results as the algorithm I am interested in.

Even if there are, should that be the way to get to it? Changing a 'strange' parameter to get the most common definition of percentile does not seem the best way to implement it, imho.

@seberg
Member

seberg commented Mar 13, 2018

@ricardoV94, maybe, but you can't just change the defaults, no matter how bad they are. We could expose something like method="H&F" to override both parameters at once.

The C parameter is where you define 0% and 100% to be with respect to the data points (on the data point or not, etc.). As presented on Wikipedia, parameter C may well apply only to interpolation, but I am sure the same issue causes the difference here. C is a dubious name of course; a proper name might be something like range='min-max' or range='extrapolated', or probably something completely different. As I said, redo the plots with many, many data points (possibly with tiny noise), and I think you will see them converge, since the range definition matters less and less.

@ricardoV94
Author

ricardoV94 commented Mar 13, 2018

@seberg I am fine with method="H&F" or maybe method="classic". interpolation="none" could also make sense.


@seberg
Member

seberg commented Mar 13, 2018

I think you can change C with something like this (test it out on something). Call the function on your percentiles, then plug it into the numpy version (which uses C=1, which is a no-op except correcting out of bound percentiles right now):

import numpy as np

def scale_percentiles(p, num, C=0):
    """
    p : float
        percentiles to be used (within 0 and 100 inclusive)
    num : int
        number of data points.
    C : float
        parameter C, should be 0, 0.5 or 1. Numpy uses 1, matlab 0.5, H&F is 0.
    """
    p = np.asarray(p)
    fact = (num-1.+2*C)/(num-1)
    p *= fact
    p -= 0.5 * (fact-1) * 100
    p[p < 0] = 0
    p[p > 100] = 100
    return p

And voilà, with "nearest" you will get your H&F, and with "linear" you will get the plot from Wikipedia (unless I got something wrong, but I am pretty sure I got it right).

As I said, the difference is where you place the data points from 0-100 (evenly) with respect to the end points. For C=1 you put min(data) at the 0th percentile, etc. I have no clue about "what makes more sense"; it probably depends a bit on the general view. The names "inclusive" for 1 and "exclusive" for 0 make a bit of sense, I guess (when you think about the total range of percentiles, since with "exclusive" part of the possible range lies outside the data range). C=1/2 is also exclusive in that sense, though.
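To make that concrete, here is a small illustrative sketch of the plotting positions implied by each C, using the parametrization p_k = (k - C) / (n + 1 - 2C) that unifies the C = 0, 1/2, 1 variants (plotting_positions is just a hypothetical helper name):

import numpy as np

def plotting_positions(n, C):
    """Percentile assigned to each sorted data point x_(k), k = 1..n."""
    k = np.arange(1, n + 1)
    return 100 * (k - C) / (n + 1 - 2 * C)

for C in (0, 0.5, 1):
    print(C, plotting_positions(4, C))
# C=1 pins the first and last data points to 0 and 100 (numpy's convention);
# C=0 leaves room below the sample minimum and above the maximum.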

I would be for adding the C parameter, but I would want someone to come up with a descriptive name if possible. I would also not mind something like a "method" or so to make the best defaults obvious (a combination of interpolation + C). Or, if we basically decide that most combinations are never used and not useful, fine then....

In the end my problem is: I want a statistician to tell me which methods have consensus. (R has some stuff, but the last time someone came around here it was just a copy-paste of the R docs or similar, without setting it into a numpy context at all; needless to say, it was useless for a general audience. Citing papers would have been more helpful.)

@seberg
Member

seberg commented Mar 13, 2018

I don't want to read that H&F paper (honestly it also does not look very slick to read), but I think you could look at it from a support point of view too. The numpy "nearest" (or any other) version does not have identical support (in the percentiles) for each data point; H&F has equal support for "nearest", and maybe for midpoint it would be C=1/2, not sure.
I keep repeating myself: I don't know if such a support argument (against C=1 as numpy uses it) is actually a real reason.

EDIT: midpoint has equal support (for the area in between data points, not for the point itself) in numpy, so with C=1.

@ricardoV94
Author

@seberg It does not seem to work for me. Can you post your code showing it working?

@seberg
Member

seberg commented Mar 14, 2018

Well, I got the sign wrong in the code up there, so it was the opposite (C=0 is a no-op, not C=1):

import numpy as np
import matplotlib.pyplot as plt

def scale_percentiles(p, num, C=0):
    """
    p : float
        percentiles to be used (within 0 and 100 inclusive)
    num : int
        number of data points.
    C : float
        parameter C, should be 0, 0.5 or 1. Numpy uses 1, matlab 0.5, H&F is 0.
    """
    p = np.asarray(p)
    fact = (num+1.-2*C)/(num-1)
    p *= fact
    p -= 0.5 * (fact-1) * 100
    p[p < 0] = 0
    p[p > 100] = 100
    return p

plt.figure()
plt.plot(np.percentile([0, 1, 2, 3], scale_percentiles(np.linspace(0, 100, 101), 4, C=0), interpolation='nearest'))
plt.plot(np.percentile([0, 1, 2, 3], scale_percentiles(np.linspace(0, 100, 101), 4, C=1), interpolation='nearest'))
plt.figure()
plt.plot(np.percentile([15, 20, 35, 40, 50], scale_percentiles(np.linspace(0, 100, 101), 5, C=1), interpolation='linear'))
plt.plot(np.percentile([15, 20, 35, 40, 50], scale_percentiles(np.linspace(0, 100, 101), 5, C=0.5), interpolation='linear'))
plt.plot(np.percentile([15, 20, 35, 40, 50], scale_percentiles(np.linspace(0, 100, 101), 5, C=0), interpolation='linear'))

@ricardoV94
Author

ricardoV94 commented Mar 14, 2018

@seberg Close, but not there yet. For a = [0, 1, 2, 3] and percentiles = [25, 50, 75, 100], np.percentile(a, scale_percentiles(percentiles, len(a), C=0), interpolation='nearest') returns [0, 2, 3, 3], when it should return [0, 1, 2, 3].

I had to give the percentiles list dtype=np.float or your function would raise an error, but I don't think that is the issue.

The rule for the classical method is simple: compute percentile / 100 * N; if the result is a whole number, that is the (1-based) rank, otherwise use its ceiling as the rank.
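For example, on the Wikipedia data from example 2 above:

>>> data = [3, 6, 7, 8, 8, 10, 13, 15, 16, 20]    # N = 10
>>> 50 / 100 * 10    # whole number -> rank 5 -> data[5 - 1] = 8
5.0
>>> 75 / 100 * 10    # not whole -> rank = ceil(7.5) = 8 -> data[8 - 1] = 15
7.5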

Despite that, the C argument seems to be working as expected, so it could be implemented if people want to use it with the interpolation options. I would still like a method='classic' or interpolation='none' that behaves like the Wikipedia one.

@ricardoV94
Author

ricardoV94 commented Mar 14, 2018

For debugging, this is my ugly non-numpy implementation of the classical method:

def percentile(arr, p):
    arr = sorted(arr)

    index = p / 100 * len(arr)

    # If index is a whole number, and larger than zero, subtract one unit (due to 0-based indexing)
    if index % 1 < 0.0001 and index // 1 > 0:
        index -= 1

    return arr[int(index)]

and a more numpythonic one:

def indexes_classic(percentiles, set_size):
    percentiles = np.asarray(percentiles)

    indexes = percentiles / 100 * set_size
    indexes[np.isclose(indexes % 1, 0)] -= 1
    indexes = np.asarray(indexes, dtype=int)
    indexes[indexes < 0] = 0
    indexes[indexes > set_size - 1] = set_size - 1

    return indexes
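For example, a quick check (not from the original comment) against the Wikipedia data from example 2 gives the expected result:

>>> arr = np.array([3, 6, 7, 8, 8, 10, 13, 15, 16, 20])
>>> idx = indexes_classic([25, 50, 75, 100], len(arr))
>>> idx
array([2, 4, 7, 9])
>>> np.sort(arr)[idx]
array([ 7,  8, 15, 20])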

@seberg
Member

seberg commented Mar 14, 2018 via email

@ricardoV94
Author

ricardoV94 commented Mar 14, 2018

Let me summarize it then:

  1. Right now numpy offers only one useful method, interpolation='linear'; the others are just small variations around it that do not seem to really be used by anyone. Other packages have many more relevant options.

  2. Adding the other values C=0 and C=0.5 makes sense to me. All the interpolation methods can work in combination with them, although again most combinations are probably never going to be used.

  3. If one of the combinations of interpolation method and C argument manages to replicate the classical method (the reference, Wikipedia, and my personal experience agree that it is the most commonly taught method), then I am happy with it. It can be stated in the docs that such a combination produces the classical non-interpolating method. I am not sure whether the remaining mismatch is just due to float-precision issues, but I appreciate your effort to tackle it in a more integrated way!

  4. If none of the combinations achieves the same result, then I think a separate method would make sense. Calling it interpolation='none' would probably be the least confusing.

In sum: the current options of numpy.percentile seem both rather confusing and limited. The paper mentioned above offers a good overview of other useful methods. Together with the Wikipedia page, it could work as a starting point for the design of a more exhaustive and useful set of options for numpy.percentile. Hopefully someone would like to work on this task.

@bzah
Contributor

bzah commented Aug 12, 2021

Hello, I read the threads #10736 and #7875 and also had a look at the current numpy implementation and the Hyndman & Fan paper.
I don't really understand why it is said that numpy implements method 7, or an equivalent of it.
Method 7 is described as follows:

Gumbel (1939) also considered the modal position p_k = mode F(X_(k)) = (k - 1)/(n - 1). One nice property is that the vertices of Q7(p) divide the range into n - 1 intervals, and exactly 100p% of the intervals lie to the left of Q7(p) and 100(1 - p)% of the intervals lie to the right of Q7(p).

But, as far as I understand the numpy code, it does not look like method 7.
Here is a simplified version of the functions involved in the linear interpolation (_quantile_ureduce_func and _lerp) as they are implemented now.
To ease understanding, I assume that arr is a one-dimensional array that is already sorted.

quantiled_index = quantile_value * (len(arr) - 1)
index_below = floor(quantiled_index)
index_above = index_below + 1
weights_above = quantiled_index - index_below
x_above = arr[index_above]
x_below = arr[index_below]
if weights_above >= 0.5:
    result = x_above - ((x_above - x_below) * (1 - weights_above))
else:
    result = x_below + ((x_above - x_below) * weights_above)

I'm not saying this is a bad implementation; it just does not look like method 7 to me. However, I am not a mathematician or statistician, so I am probably missing what is going on. If anyone could explain, I would gladly add some documentation about this.
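For what it is worth, a small numeric check (not from the thread) suggests that the code above is indeed method 7: computing h = (n - 1)p + 1 and interpolating linearly between the surrounding order statistics gives the same values as numpy's default.

import numpy as np

def q7(data, p):
    """H&F method 7: h = (n - 1) * p + 1 (1-based), then linear interpolation."""
    x = np.sort(data)
    h = (len(x) - 1) * p + 1
    lo = int(np.floor(h)) - 1               # back to 0-based indexing
    hi = min(lo + 1, len(x) - 1)
    return x[lo] + (h - np.floor(h)) * (x[hi] - x[lo])

data = [15, 20, 35, 40, 50]
for p in (0.05, 0.3, 0.5, 0.95, 1.0):
    print(q7(data, p), np.percentile(data, 100 * p))    # the two columns agree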

Also, about the default implementation to choose: Hyndman and Fan stated that method 8 should be preferred; you can read that in the conclusion of the paper (PDF) as well as in a blog post by Hyndman.
And there is a relevant discussion on Stack Exchange as to why method 8 is not currently widely used.

@seberg
Member

seberg commented Aug 12, 2021

Since this has not moved for years, in big part due to a lack of clear input from someone who can give a clear opinion, let's try pinging the right person ;).

@robjhyndman, sorry if you do not have the time/interest for input, but maybe you can help us unstick the NumPy quantile API problem?

The problems NumPy's quantile implementation has are:

  • We always use a method that assumes the boundary points are the extreme points of the actual distribution (fixed C=1), IIRC. (That seems OK; it's not "Method 8", but probably fine.)
  • We added an interpolation=("nearest"|"lower"|"higher") argument at some point, which was probably ill thought out, since none of the "9 methods" that R has combines a non-interpolated version with C=1. So we have (non-default) options that we are not sure anyone should actually use!

My questions for you are:

  • Should we remove the current "interpolation" methods? Because nobody should use one that is not even included in the 9 that R has...
  • Could we either provide some guidelines or even a "preferred" set of options from "the 9"? Maybe even reduce the 9 to just a select 3 with clear differences? (Every "method" we can drop with very few affected users will make the life of most users easier!)

I would like to make the choice as easy as possible for the user, either by reducing the methods or by "hiding" some of the methods away a bit. But I do not have the confidence to step up and make that choice, and it seems nobody who passed by here in a few years really had the confidence either.

@robjhyndman

This is a little ironic because the 1996 paper I wrote with Fan was intended to describe all the methods in use by software at the time and to encourage everyone to standardize on one method (8). Instead, the effect of the paper has been for software to add all the methods they are missing -- and now NumPy has some that didn't exist back in 1996!

I coauthored the quantile() function in R to compute the 9 methods described in our paper. We originally argued for method 8, but I couldn't convince the R Core team to use that as a default, so the quantile() function in R uses type = 7 by default (for backwards compatibility) and almost no-one changes that default when using the function.

While I lost the argument to standardize on method 8, I still think standardization is important. I think the best solution would be if NumPy replicated what the quantile() function in R does, with the same 9 methods (and no more), and the same default method 7.

@seberg
Member

seberg commented Aug 13, 2021

We originally argued for method 8, but I couldn't convince the R Core team to use that as a default

Thanks! Neither can we change the default (but it is 7, so that is not terrible). I think what makes me hesitant is mostly how to present the methods (and guide users). I am still not sure about it.

While I lost the argument to standardize on method 8, I still think standardization is important. I think the best solution would be if NumPy replicated what the quantile() function in R does, with the same 9 methods (and no more), and the same default method 7.

This opinion is probably enough for me to just accept it if anyone does a PR to add method=... (or type=... with some good argument). However, I personally would much prefer a string-based API over a numbering-based approach. Of course we should refer to the R-style numbering, though!

Can you/we think of some names like:

  • "median-unbiased" (type 8)
  • "normal-unbiased" (type 9)

Just reading the word "unbiased" would entice me to check the notes! We could still provide the types as integer but use the strings to nudge users towards "recommended" (i.e. named) options.

Python uses method="exclusive" (type 6) and method="inclusive" (type 7), where exclusive is the default (uhoh)! It may make sense to "follow" python for those two names.
(I do not understand why python effectively recommends type 6 over type 8/9 by making it the default, but I have not attempted to find out.)
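For reference, this is the standard-library behaviour being referred to (Python >= 3.8):

>>> import statistics
>>> data = [15, 20, 35, 40, 50]
>>> statistics.quantiles(data, n=4, method='exclusive')    # type 6, the default
[17.5, 35.0, 45.0]
>>> statistics.quantiles(data, n=4, method='inclusive')    # type 7
[20.0, 35.0, 40.0]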

That would give us the named types 6, 7, 8, and 9 with names exclusive, inclusive, median-unbiased, and normal-unbiased (omitting the continuous 4 and 5). I suppose we need 4 and 5? But would we be fine with just numbering them, and waiting for someone to give them a better name if they want?

The discontinuous options feel a bit more awkward to me. IIRC, one of them is sometimes named "H&F", and "inverted-cdf" may be a name for type 1 (if that is not the same anyway)?
The "type 2" might be slightly trickier with its mid-point logic, but by now it seems easier to implement it than to decide on omitting it.

@robjhyndman

Agreed that the names are better than numbers. Here is an extended list of suggestions.

  1. "inverted-cdf"
  2. "averaged-inverted-cdf"
  3. "closest-observation"
  4. "interpolated-inverted-cdf"
  5. "midway-interpolated-inverted-cdf" (too long?)
  6. "exclusive" (no idea what that means here)
  7. "inclusive" (ditto)
  8. "median-unbiased"
  9. "normal-unbiased"

@seberg
Member

seberg commented Aug 13, 2021

Maybe exclusive/inclusive don't fit well enough anyway... As far as I understand, they refer to "including" the actual population range (i.e. population minimum/maximum).
Maybe "sample" could make sense for method 7? I am not sure about 6, because I don't see that it is in any way an obvious choice for "not sample", which the exclusive/inclusive nomenclature suggests IMO.

"midway-interpolated-inverted-cdf" seems a bit long indeed (I don't particularly hate it if it is niche enough, though). Alternatively, we could use the names wolfram refers to "hazen" and "weibull" instead if they don't have simple statistical properties. (Or go to the weird(?) option of using numbers only for niche definitions).

@ricardoV94
Author

ricardoV94 commented Aug 13, 2021

Honestly, I don't see the point of including anything other than the current default and a new inverted-cdf (and removing the weird current alternatives). Everything else is probably a better fit for scipy.stats, as it is rather specific.

@robjhyndman

I didn't realise Wolfram had given them names. Happy to use "hazen" and "weibull" for 5 and 6. Following that theme, we could use "gumbel" for 7 as it was proposed by Gumbel in 1939. Makes more sense than "inclusive".

@bzah
Contributor

bzah commented Sep 1, 2021

I'm working on an implementation which will handle "the 9" and will possibly improve performance as well. The performance boost seems especially significant for nanpercentile. I hope to publish a PR here soon.

bzah added a commit to bzah/numpy that referenced this issue Sep 8, 2021
- Added the missing interpolation methods as an enumeration
- Reworked the whole process to compute nanquantile. This may significantly improve performance
- Updated unit tests accordingly
bzah added a commit to bzah/numpy that referenced this issue Sep 9, 2021
bzah added a commit to bzah/numpy that referenced this issue Sep 9, 2021
- Added the missing interpolation methods as an enumeration.
- Reworked the whole process to compute quantiles.
- There is now a single function for nanquantile and quantile. This can significantly improve performance for nanquantile.
- Updated the existing unit tests.
@bzah
Contributor

bzah commented Sep 16, 2021

The implementation of "the 9" is complete. I was inspired by your paper, @robjhyndman, as well as your library, @ricardoV94, among other sources.
Would you give your OK for this feature?
I have already added a reference to the paper in the documentation; what else can I do to give due credit?

@robjhyndman

Not sure what you want my OK for. But happy to see this implemented in Python. A reference to the 1996 paper is sufficient.

@ricardoV94
Author

The implementation of "the 9" is complete, I was inspired by your paper @robjhyndman as well as your library @ricardoV94 among other sources.
Would you give your OK for this feature ?
I already added a reference to the paper in the documentation, what else can I do to give the due credit ?

No need to mention my library :)

seberg added the "62 - Python API" label (Changes or additions to the Python API. Mailing list should usually be notified.) on Sep 21, 2021
bzah added a commit to bzah/numpy that referenced this issue Oct 6, 2021
- Added the missing linear interpolation methods.
- Updated the existing unit tests.
- Added pytest.mark.xfail for boolean arrays (see numpy#19857 (comment) and numpy#19154)
bzah added a commit to bzah/numpy that referenced this issue Oct 6, 2021
bzah added a commit to bzah/numpy that referenced this issue Oct 21, 2021
seberg pushed a commit to bzah/numpy that referenced this issue Nov 1, 2021
seberg pushed a commit to bzah/numpy that referenced this issue Nov 4, 2021
@jiglesiasc

I'm 100% a user and not a coder, so I just wanted to add my thoughts after reading this thread and the other related one:

  • It's not really that important whether you stick to the H&F paper or to Wikipedia, since both accomplish more or less the same thing, as long as you clearly state in the documentation: "this percentile method is based on XXXXX". In the current documentation for numpy.percentile the definition of the linear interpolation is quite vague, which is why I'm here.
  • It is fundamental that numpy.percentile include more methods for linear interpolation, especially definition 6 from H&F, which is widely used in a lot of software packages (a small sketch of definition 6 follows below). This is necessary because we users often have to compare our calculations with the calculations from a particular study. As of today numpy.percentile works only as a tool for one's own study, not for comparing against others, which makes it less useful.
  • The default method is not really that relevant IMO; all of the methods tend to get closer when N is large. As seen in H&F, some methods have more desirable properties than others, but all the linear methods satisfy 5 out of the 6 properties, except definition 5, which satisfies all 6. But I get that you want to follow R. Out of curiosity, though: why follow R and not Matlab, or Excel? Any particular reason?
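Following up on the second point, here is a minimal illustrative sketch of H&F definition 6 (Weibull plotting positions: h = (n + 1)p with linear interpolation, clamped to the sample range); q6 is just a hypothetical name:

import numpy as np

def q6(data, p):
    """H&F definition 6: h = (n + 1) * p, then linear interpolation, clamped to the sample."""
    x = np.sort(data)
    h = min(max((len(x) + 1) * p, 1), len(x))
    lo = int(np.floor(h)) - 1
    hi = min(lo + 1, len(x) - 1)
    return x[lo] + (h - np.floor(h)) * (x[hi] - x[lo])

print(q6([15, 20, 35, 40, 50], 0.25))    # 17.5, which should match R's quantile(..., type = 6)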

@bzah
Contributor

bzah commented Nov 8, 2021

Hello @jiglesiasc, the 9 methods have been merged into the main branch with #19857.
If everything goes right, this feature should be available in numpy's next release, I think.

I'm not sure whether the issue should stay open, though.

@seberg
Member

seberg commented Nov 8, 2021

Yeah, let's close it. The main thing is that I want to quickly follow up to rename interpolation to method. If anyone here has further hints/suggestions, please also check the new docs: https://numpy.org/devdocs/reference/generated/numpy.quantile.html. Changing names, etc. should happen within the next few weeks (or we miss the next release).

I think the current docs are good, but a snippet of text to guide users towards (safe) choices may be nice.
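For readers landing here later: as of NumPy 1.22 the nine H&F methods are exposed through the method= argument (the old interpolation= keyword is deprecated), with names close to those suggested above. For example, the nearest-rank method originally requested in this issue is available as method='inverted_cdf':

>>> import numpy as np
>>> a = [3, 6, 7, 8, 8, 10, 13, 15, 16, 20]
>>> np.percentile(a, [25, 50, 75, 100], method='inverted_cdf')    # H&F type 1
array([ 7,  8, 15, 20])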
