
Percentile Linear returns incorrect value. #7875

Closed
mikep2016 opened this issue Jul 27, 2016 · 21 comments


mikep2016 commented Jul 27, 2016

The percentile function does not return the expected values. It seems to be getting the linear distance between two points the wrong way around and returns a value closer to the lower number.

For example:

```
array([   110531353,    167471747,    167471747,    183000406,
          200000000,    759174457,    921094606,    931142911,
         1300000000,   1341797102,   1380317195,   1380317195,
         1500000000,   1500000000,   1500000000,   1830004057,
         1932444073,   2000000000,   2000000000, 2345976525, 2500000000,
       2745006085, 2847019692, 3000000000, 3000000000, 3000000000,
       3000000000, 3312761268, 3500000000, 3588824707, 4000000000,
       4140951585, 5000000000, 6600000000, 7100000000, 7717299940,
       8445515490, 8972061767, 9662220364, 11000000000, 11042537559,
       12422854754, 13000000000, 13000000000, 13803171949, 15000000000,
       17000000000, 20000000000, 22085075118, 24845709508, 32025070994,
       34000000000, 35000000000, 36000000000, 36000000000, 39453076027,
       40000000000, 46930784627, 82819031694, 110425375592], dtype=int64)
```

```
>>> np.percentile(a, 75)
14102378961.75
>>> np.percentile(a, 50)
3794412353.5
>>> np.percentile(a, 25)
1747503042.75
```

expected values:

75 = 14700792987.25
50 = 3794412353.5
25 = 1582501014.25

When reading the documentation the linear methodology seems correct, but could there possibly be an issue with the fraction it is using?

@tom-bird (Contributor) commented:

How did you arrive at your expected values?

I agree with the numpy values using the linear interpolation.

For example, the 75th percentile, given there are 60 items in your list, should be the 44.25th element in the sorted list. Using the linear interpolation, this is i + (j - i) * 0.25 as per the docs, where i is a[44] and j is a[45].
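A quick sanity check of that formula, using a small made-up sample rather than the array from this issue, shows the hand computation agreeing with NumPy's default:

```python
import numpy as np

# Reproduce NumPy's default "linear" interpolation by hand: the q-th
# percentile sits at fractional index (n - 1) * q / 100 in the sorted data.
a = np.sort(np.array([1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 9.0]))
n = a.size

q = 75
pos = (n - 1) * q / 100                    # 7 * 0.75 = 5.25
i = int(pos)
frac = pos - i
manual = a[i] + (a[i + 1] - a[i]) * frac   # i + (j - i) * fraction, as in the docs

print(manual)   # 5.25, identical to np.percentile(a, 75)
```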


charris commented Jul 28, 2016

What numpy version?

@tom-bird (Contributor) commented:

I get the same values on 1.11.1

@mikep2016 (Author) commented:

I am using numpy 1.11.1

The values I expected came firstly from SPSS which I also checked using excel's percentile function. Both of these returned the values.

75 = 14700792987.25
50 = 3794412353.5
25 = 1582501014.25

@mikep2016 (Author) commented:

Should the formula not be (n+1)*0.75, giving 45.75?


tom-bird commented Jul 29, 2016

I don't think so, (n-1)*0.75 will give you the 75th percentile, starting from a 0 index.

For example, in a 5 item list, the second to last item is the 75th percentile, and (5-1) * 0.75 = 3, which gives the index of the second to last item. Whereas (5+1)*0.75 would be 4.5.
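Checking that rule directly on a toy 5-item list (hypothetical data, not from the issue):

```python
import numpy as np

# In a 5-element list, (n - 1) * 0.75 = 3 lands exactly on the
# second-to-last element under NumPy's default zero-indexed scheme.
a = [10, 20, 30, 40, 50]
p75 = np.percentile(a, 75)
print(p75)   # 40.0
```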

@matthew-brett (Contributor) commented:

Hmm, in MATLAB I get yet another set of answers:

```
a = [ 110531353, 167471747, 167471747, 183000406, 200000000, 759174457, 921094606, 931142911, 1300000000, 1341797102, 1380317195, 1380317195, 1500000000, 1500000000, 1500000000, 1830004057, 1932444073, 2000000000, 2000000000, 2345976525, 2500000000, 2745006085, 2847019692, 3000000000, 3000000000, 3000000000, 3000000000, 3312761268, 3500000000, 3588824707, 4000000000, 4140951585, 5000000000, 6600000000, 7100000000, 7717299940, 8445515490, 8972061767, 9662220364, 11000000000, 11042537559, 12422854754, 13000000000, 13000000000, 13803171949, 15000000000, 17000000000, 20000000000, 22085075118, 24845709508, 32025070994, 34000000000, 35000000000, 36000000000, 36000000000, 39453076027, 40000000000, 46930784627, 82819031694, 110425375592];
>> prctile(a, [25, 50, 75])

ans =

   1.0e+10 *

    0.1665    0.3794    1.4402
```

@juliantaylor (Contributor) commented:

The point is that d[44.25] is 14700792987.25 when doing linear interpolation, so numpy looks wrong to me, or at least computes a weird result which does match the documentation (depending on how you interpret it ...).
What does R compute?

@juliantaylor (Contributor) commented:

Oh no, it's 14102378961.75 ...
One should really use more readable numbers for this.
Weird, why does SPSS compute something else?

@matthew-brett (Contributor) commented:

R:

```
> quantile(a, c(.25, .50, .75));
        25%         50%         75% 
 1747503043  3794412354 14102378962 
```

@tom-bird (Contributor) commented:

I get 14102378961.75 for a[44.25] (i.e. a[44] + 0.25 * (a[45] - a[44])), as expected.

MATLAB will be different as they use a different percentile scheme, as per http://www.mathworks.com/help/stats/prctile.html. For example, in a 5 element list, MATLAB views the first element as the 10th percentile, whereas numpy sees it as the 0th percentile.

There is no agreed system for percentiles, so neither is "correct".
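For later readers: MATLAB's scheme (the k-th of n sorted values as the (k - 0.5)/n quantile, i.e. Hazen's rule) was eventually exposed by NumPy itself via the `method` keyword. This assumes NumPy >= 1.22, well after this comment was written:

```python
import numpy as np

# Hazen's rule: in a 5-element list the first element is the 10th
# percentile, matching the MATLAB behaviour described above.
a = [10, 20, 30, 40, 50]
p10 = np.percentile(a, 10, method="hazen")
p50 = np.percentile(a, 50, method="hazen")
print(p10, p50)   # 10.0 30.0
```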


seberg commented Jul 29, 2016

R has nine different percentile types, and I doubt it covers all of these combinations.... It would be nice to clean it up for good, but we really need someone who knows this very well....


mikep2016 commented Aug 1, 2016

useful article:

http://download.lww.com/wolterskluwer_vitalstream_com/PermaLink/EDE/A/EDE_2011_05_24_SCHOONJANS_201050_SDC1.pdf

Selected methods for percentile estimation and their use in popular software.

Method A: p(n+1)
Method A, cited for example by Altman [1] and CLSI [2], calculates an index p(n+1). This method is attributed to Gumbel and Weibull and is widely available in software packages such as JMP (SAS Institute Inc, Cary, NC) (only method), NCSS (NCSS, Kaysville, UT) (recommended method), Minitab (Minitab Inc., State College, PA), SAS (SAS Institute Inc, Cary, NC), Statistica (StatSoft, Inc., Tulsa, OK) and SPSS (IBM Corp., Somers, NY).

Method B: 0.5+pn
Method B, cited for example by Armitage et al. and Lentner, calculates an index 0.5+pn and is attributed to Hazen. Method B is used in the software programs Mathematica (Wolfram Research Inc., Champaign, IL) and MedCalc (MedCalc Software bvba, Mariakerke, Belgium) (only method, except for reference interval estimation).

Method C: p(n-1)+1
Method C uses p(n-1)+1 and is attributed to Gumbel [4]. This method is implemented in both Microsoft Excel 2003 [18] and 2007 [19] (Microsoft Corp., Redmond, WA), as well as in the Calc program, part of the OpenOffice suite by Sun Microsystems (Oracle Corp., Redwood Shores, CA), although this does not seem to be documented. Method C is the default in R [20] and is also among the methods available in Statistica (StatSoft, Inc., Tulsa, OK).

Method D: p(n+1/3)+1/3
Method D uses p(n+1/3)+1/3 and was proposed by Hyndman and Fan as the best choice among many other methods, including the above. We are not aware of any available software implementations of this method.
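The four rank formulas can be compared side by side. A small hypothetical helper (1-based ranks, p as a fraction; none of these names are NumPy API):

```python
# Compare the four 1-based rank formulas quoted from the paper above.
def rank_indices(p, n):
    return {
        "A: p(n+1)":       p * (n + 1),
        "B: 0.5+pn":       0.5 + p * n,
        "C: p(n-1)+1":     p * (n - 1) + 1,
        "D: p(n+1/3)+1/3": p * (n + 1 / 3) + 1 / 3,
    }

# For the n = 60 count assumed earlier in the thread, at p = 0.75:
idx = rank_indices(0.75, 60)
for name, r in idx.items():
    print(f"{name} -> {r}")
```

Method A gives rank 45.75 (the value mikep2016 expected), while Method C, the rule NumPy's default matches when shifted to 0-based indexing, gives 45.25.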


seberg commented Aug 1, 2016

Hmm, in principle we could add more methods I guess. In practice, if our stuff currently is OK, maybe it would be good enough to document it and cite that or similar paper(s)/books. This would be good in any case; percentile estimation seems a non-trivial thing, and at least interested users should be able to get enough input to figure it out.

@matthew-brett (Contributor) commented:

@mikep2016 - thanks for tracking down that information. Would you consider adding that to the docstring - it would be very useful.

@mikep2016 (Author) commented:

I have added the alternative methods to the docstring and created a pull request.


apbard commented Dec 17, 2016

To contribute to this discussion...
On the one hand, as @theultimatecrouton has mentioned, there is no an unique and unanimously accepted definition of percentiles. There are however some methods that have found larger use in applications and statistical software. An highly cited survey on this matter is Hyndman-Fan (H&F). And the 4 methods cited in the article proposed by @mikep2016 are actually taken from the 9 analysed there. On the other hand, I have been quite surprised by the methods implemented in numpy for a series of reason that I'll try to explain in the following.
Let's consider for instance the simple case where sample data is D = [0, 1, 2, 3]
In the following picture you can see the plot of the different methods (percentiles on X, values on Y):
ppf_test1

The blue line is Method 1, the oldest/simplest "standard" definition as the inverse of the cumulative distribution function. There is no equivalent of this currently implemented in numpy.
(image: method1)

The default method "linear" is fine. It corresponds to method 7 in H&F. (Note that all methods 4-9 are somehow linear; if it were up to me, I would have assigned method 4 to this name.)
(image: method4)

The other four (nearest, lower, upper, midpoint) do not directly correspond to any of those.
I have searched the literature and haven't found any reference to those definitions. Plus, I think (but it may be just me) that the names are quite misleading, and it seems to me that they are somehow "misaligned":

  • Upper: change points are at [~33.3, 50, ~66.6].
  • Lower: change points are at [~33.3, 50, ~66.6].
  • Midpoint: change points are at [~16.5, 50, ~83.5].
    For all of these I would have expected [25, 50, 75], with "lower" therefore equal to Method 1 (the blue line).
  • For "nearest" I personally would have expected change points at [12.5, 37.5, 62.5], as in method 3, but I agree that "nearest" is likewise a bit ambiguous.
    (image: method3)

Finally, IMO, working on something similar to scipy/scipy#6801 and scipy/scipy#6466, I feel it would be quite useful to have at least Method 1, as a "proper inverse" of the empirical cdf.
At the same time, I think it may not be such a bad idea to align with other packages: R and Octave implement all 9 methods proposed in H&F.
For the reasons I mentioned before, I am not so sure whether the non-"linear" methods (or at least their names) are so useful/intuitive.
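That "Method 1" inverse-ECDF definition can be sketched in a few lines (my own helper, not NumPy API; newer NumPy later exposed the same estimator as method='inverted_cdf'):

```python
import numpy as np

# Method 1: Q(p) is the smallest sample value x with ECDF(x) >= p,
# i.e. the value at the smallest 1-based rank k with k/n >= p.
def inv_ecdf(data, p):
    a = np.sort(np.asarray(data))
    n = a.size
    k = int(np.ceil(p / 100 * n))
    return int(a[max(k, 1) - 1])

D = [0, 1, 2, 3]
vals = [inv_ecdf(D, p) for p in (10, 25, 50, 75, 90)]
print(vals)   # [0, 0, 1, 2, 3]
```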

What do you think?
If you believe that could be of interest I am willing to contribute to the implementation of some of such methods.

@ricardoV94 commented:

@apbard I completely agree with you that Method 1 should be available, and that the names of the other interpolation methods are confusing. I posted this issue separately here before finding your message in this thread.


jiglesiasc commented Nov 3, 2021

I'm currently studying statistics from McClave's book. There is an example for the following data:

```
Out[70]:
0 -9
1 -5
2 -3
3 -1
4 -1
5 -1
6 0
7 0
8 0
9 2
10 2
11 2
12 3
13 4
14 6
15 6
16 6
17 7
18 7
19 7
20 7
21 8
22 8
23 8
24 8
25 9
26 10
27 10
28 10
29 10
30 10
31 10
32 11
33 12
34 12
35 14
36 16
37 21
38 25
39 35
dtype: int64
```

First column is the index, second column is the data. In the book we get the following answer:

(image: the book's worked answer for the 95th percentile)

If you use numpy 1.20.3 to calculate, say, the 95th percentile, the answer is:

```
np.percentile(data, 95)
Out[71]: 21.19999999999999
```

which is wrong.

The correct calculation is the one presented in the book. I calculated the answer by hand.

Step 1) First calculate the rank = 95/100 * (40 + 1) = 38.95
Step 2) Look for the 38th element (index number 37) which is 21, and look for the 39th element (index number 38) which is 25
Step 3) Use the linear formula: 95th percentile = 21 + (25 - 21) * 0.95 = 24.8.
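The book's p(n+1) method, following the three steps above, can be sketched as a helper of my own (not NumPy's default behaviour; ranks are 1-based):

```python
import numpy as np

# Method A rank: p * (n + 1), 1-based; interpolate between the two neighbours.
def percentile_np1(data, p):
    a = np.sort(np.asarray(data, dtype=float))
    n = a.size
    rank = p / 100 * (n + 1)        # step 1: e.g. 0.95 * 41 = 38.95 for n = 40
    lo = int(np.floor(rank))
    if lo < 1:                      # clamp below the first rank
        return a[0]
    if lo >= n:                     # clamp above the last rank
        return a[-1]
    frac = rank - lo
    # steps 2-3: take the lo-th and (lo+1)-th values and interpolate
    return a[lo - 1] + frac * (a[lo] - a[lo - 1])

data = [-9, -5, -3, -1, -1, -1, 0, 0, 0, 2, 2, 2, 3, 4, 6, 6, 6, 7, 7, 7,
        7, 8, 8, 8, 8, 9, 10, 10, 10, 10, 10, 10, 11, 12, 12, 14, 16, 21, 25, 35]
p95 = percentile_np1(data, 95)
print(round(p95, 6))   # 24.8, matching the book
```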

As you can see, the result is the same as in the book. I also calculated the correct answer for the OP, and he was right with his expected values. Most people didn't notice that there were actually 52 items and not 60 in his array.

In conclusion, is numpy's percentile calculation doing something wrong, or what fraction is it using to compute the linear answer?


seberg commented Nov 3, 2021

There is nothing wrong with it, but there are different ways to calculate percentiles, see gh-10736. In particular, like many default values, NumPy's is a sample percentile and not a population estimate (there are many population estimate choices!). Comments on that PR are very welcome.
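For later readers: the PR referenced here landed in NumPy 1.22, which exposes the Hyndman & Fan estimators by name through the `method` keyword (the example below assumes NumPy >= 1.22):

```python
import numpy as np

# NumPy >= 1.22: H&F estimators selectable by name; the default is still
# 'linear' (H&F 7, the sample quantile discussed in this thread), while
# 'weibull' is the p(n+1) rule used by SPSS and friends.
a = np.array([15, 20, 35, 40, 50])
default = np.percentile(a, 40)                    # H&F 7: approximately 29
weibull = np.percentile(a, 40, method="weibull")  # H&F 6: approximately 26
print(default, weibull)
```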


seberg commented Nov 4, 2021

@jiglesiasc I am planning on merging the PR working on this today; after that, fixups to the documentation would be very welcome. There will definitely be no change in the default. It would be much too disruptive, and we will follow R here.
I am also not personally convinced it would be good anyway. My interpretation is that the current version calculates a sample quantile (rather than a population estimate); this is actually similar to our std defaulting to ddof=0, and may even be more expected, e.g. for plotting boxplots in matplotlib.

(Happy to be corrected about all of those things, I am not a statistician, but the default would be painful to change and would require some extremely thorough convincing.)

Seems the original issue was also about the definition of the method being used and is covered by the other issue (and PR), so I will close this. Thanks all!

seberg closed this as completed Nov 4, 2021