
Percentile Linear returns incorrect value. #7875

Closed
mikep2016 opened this issue Jul 27, 2016 · 21 comments


mikep2016 commented Jul 27, 2016

The percentile function does not return the expected values. It seems to be getting the linear distance between two points the wrong way around and returns a value closer to the lower number.

For example:

```
array([   110531353,    167471747,    167471747,    183000406,
          200000000,    759174457,    921094606,    931142911,
         1300000000,   1341797102,   1380317195,   1380317195,
         1500000000,   1500000000,   1500000000,   1830004057,
         1932444073,   2000000000,   2000000000, 2345976525, 2500000000,
       2745006085, 2847019692, 3000000000, 3000000000, 3000000000,
       3000000000, 3312761268, 3500000000, 3588824707, 4000000000,
       4140951585, 5000000000, 6600000000, 7100000000, 7717299940,
       8445515490, 8972061767, 9662220364, 11000000000, 11042537559,
       12422854754, 13000000000, 13000000000, 13803171949, 15000000000,
       17000000000, 20000000000, 22085075118, 24845709508, 32025070994,
       34000000000, 35000000000, 36000000000, 36000000000, 39453076027,
       40000000000, 46930784627, 82819031694, 110425375592], dtype=int64)
```

```
>>> np.percentile(a, 75)
14102378961.75
>>> np.percentile(a, 50)
3794412353.5
>>> np.percentile(a, 25)
1747503042.75
```

expected values:

75 = 14700792987.25
50 = 3794412353.5
25 = 1582501014.25

When reading the documentation the linear methodology seems correct, but could there possibly be an issue with the fraction it is using?

@tom-bird (Contributor) commented:

How did you arrive at your expected values?

I agree with the numpy values using the linear interpolation.

For example, the 75th percentile, given there are 60 items in your list, should be the 44.25th element in the sorted list. Using the linear interpolation, this is i + (j - i) * 0.25 as per the docs, where i is a[44] and j is a[45].
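A quick sanity check of that formula, using a small made-up sample rather than the array from this issue, shows the hand computation agreeing with NumPy's default:

```python
import numpy as np

# Reproduce NumPy's default "linear" interpolation by hand: the q-th
# percentile sits at fractional index (n - 1) * q / 100 in the sorted data.
a = np.sort(np.array([1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 9.0]))
n = a.size

q = 75
pos = (n - 1) * q / 100                    # 7 * 0.75 = 5.25
i = int(pos)
frac = pos - i
manual = a[i] + (a[i + 1] - a[i]) * frac   # i + (j - i) * fraction, as in the docs

print(manual)   # 5.25, identical to np.percentile(a, 75)
```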


charris commented Jul 28, 2016

What numpy version?

@tom-bird (Contributor) commented:

I get the same values on 1.11.1

@mikep2016 (Author) commented:

I am using numpy 1.11.1

The values I expected came firstly from SPSS which I also checked using excel's percentile function. Both of these returned the values.

75 = 14700792987.25
50 = 3794412353.5
25 = 1582501014.25

@mikep2016 (Author) commented:

Should the formula not be (n+1)*0.75, giving 45.75?


tom-bird commented Jul 29, 2016

I don't think so, (n-1)*0.75 will give you the 75th percentile, starting from a 0 index.

For example, in a 5 item list, the second to last item is the 75th percentile, and (5-1) * 0.75 = 3, which gives the index of the second to last item. Whereas (5+1)*0.75 would be 4.5.
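Checking that rule directly on a toy 5-item list (hypothetical data, not from the issue):

```python
import numpy as np

# In a 5-element list, (n - 1) * 0.75 = 3 lands exactly on the
# second-to-last element under NumPy's default zero-indexed scheme.
a = [10, 20, 30, 40, 50]
p75 = np.percentile(a, 75)
print(p75)   # 40.0
```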

@matthew-brett (Contributor) commented:

Hmm, in MATLAB I get yet another set of answers:

```
a = [ 110531353, 167471747, 167471747, 183000406, 200000000, 759174457, 921094606, 931142911, 1300000000, 1341797102, 1380317195, 1380317195, 1500000000, 1500000000, 1500000000, 1830004057, 1932444073, 2000000000, 2000000000, 2345976525, 2500000000, 2745006085, 2847019692, 3000000000, 3000000000, 3000000000, 3000000000, 3312761268, 3500000000, 3588824707, 4000000000, 4140951585, 5000000000, 6600000000, 7100000000, 7717299940, 8445515490, 8972061767, 9662220364, 11000000000, 11042537559, 12422854754, 13000000000, 13000000000, 13803171949, 15000000000, 17000000000, 20000000000, 22085075118, 24845709508, 32025070994, 34000000000, 35000000000, 36000000000, 36000000000, 39453076027, 40000000000, 46930784627, 82819031694, 110425375592];
>> prctile(a, [25, 50, 75])

ans =

   1.0e+10 *

    0.1665    0.3794    1.4402
```

@juliantaylor (Contributor) commented:

The point is that d[44.25] is 14700792987.25 when doing linear interpolation, so numpy looks wrong to me, or at least computes a weird result which does match the documentation (depending on how you interpret it ...).
What does R compute?

@juliantaylor (Contributor) commented:

Oh no, it's 14102378961.75 ...
One should really use more readable numbers for this.
Weird, why does SPSS compute something else?

@matthew-brett (Contributor) commented:

R:

```
> quantile(a, c(.25, .50, .75));
        25%         50%         75% 
 1747503043  3794412354 14102378962 
```

@tom-bird (Contributor) commented:

I get 14102378961.75 for a[44.25] (i.e. a[44] + 0.25 * (a[45] - a[44])), as expected.

MATLAB will be different as they use a different percentile scheme, as per http://www.mathworks.com/help/stats/prctile.html. For example, in a 5 element list, MATLAB views the first element as the 10th percentile, whereas numpy sees it as the 0th percentile.

There is no agreed system for percentiles, so neither is "correct".
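For later readers: MATLAB's scheme (the k-th of n sorted values as the (k - 0.5)/n quantile, i.e. Hazen's rule) was eventually exposed by NumPy itself via the `method` keyword. This assumes NumPy >= 1.22, well after this comment was written:

```python
import numpy as np

# Hazen's rule: in a 5-element list the first element is the 10th
# percentile, matching the MATLAB behaviour described above.
a = [10, 20, 30, 40, 50]
p10 = np.percentile(a, 10, method="hazen")
p50 = np.percentile(a, 50, method="hazen")
print(p10, p50)   # 10.0 30.0
```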


seberg commented Jul 29, 2016

R has nine different percentile types, and I doubt it covers all of these combinations.... It would be nice to clean it up for good, but we really need someone who knows this very well....


mikep2016 commented Aug 1, 2016

useful article:

http://download.lww.com/wolterskluwer_vitalstream_com/PermaLink/EDE/A/EDE_2011_05_24_SCHOONJANS_201050_SDC1.pdf

Selected methods for percentile estimation and their use in popular software.

Method A: p(n+1)
Method A, cited for example by Altman [1] and CLSI [2], calculates an index p(n+1). This method is attributed to Gumbel and Weibull and is widely available in software packages such as JMP (SAS Institute Inc, Cary, NC) (only method), NCSS (NCSS, Kaysville, UT) (recommended method), Minitab (Minitab Inc., State College, PA), SAS (SAS Institute Inc, Cary, NC), Statistica (StatSoft, Inc., Tulsa, OK) and SPSS (IBM Corp., Somers, NY).

Method B: 0.5+pn
Method B, cited for example by Armitage et al. and Lentner, calculates an index 0.5+pn and is attributed to Hazen. Method B is used in the software programs Mathematica (Wolfram Research Inc., Champaign, IL) and MedCalc (MedCalc Software bvba, Mariakerke, Belgium) (only method, except for reference interval estimation).

Method C: p(n-1)+1
Method C uses p(n-1)+1 and is attributed to Gumbel [4]. This method is implemented in both Microsoft Excel 2003 [18] and 2007 [19] (Microsoft Corp., Redmond, WA), as well as in the Calc program, part of the OpenOffice suite by Sun Microsystems (Oracle Corp., Redwood Shores, CA), although this does not seem to be documented. Method C is the default in R [20] and is also among the methods available in Statistica (StatSoft, Inc., Tulsa, OK).

Method D: p(n+1/3)+1/3
Method D uses p(n+1/3)+1/3 and was proposed by Hyndman and Fan as the best choice among many other methods, including the above. We are not aware of any available software implementations of this method.
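The four rank formulas can be compared side by side. A small hypothetical helper (1-based ranks, p as a fraction; none of these names are NumPy API):

```python
# Compare the four 1-based rank formulas quoted from the paper above.
def rank_indices(p, n):
    return {
        "A: p(n+1)":       p * (n + 1),
        "B: 0.5+pn":       0.5 + p * n,
        "C: p(n-1)+1":     p * (n - 1) + 1,
        "D: p(n+1/3)+1/3": p * (n + 1 / 3) + 1 / 3,
    }

# For the n = 60 count assumed earlier in the thread, at p = 0.75:
idx = rank_indices(0.75, 60)
for name, r in idx.items():
    print(f"{name} -> {r}")
```

Method A gives rank 45.75 (the value mikep2016 expected), while Method C, the rule NumPy's default matches when shifted to 0-based indexing, gives 45.25.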


seberg commented Aug 1, 2016

Hmm, in principle we could add more methods I guess. In practice, if our stuff currently is OK, maybe it would be good enough to document it and cite that or similar paper(s)/books. This would be good in any case; percentile estimation seems a non-trivial thing, and at least interested users should be able to get enough input to figure it out.

@matthew-brett (Contributor) commented:

@mikep2016 - thanks for tracking down that information. Would you consider adding that to the docstring - it would be very useful.

@mikep2016 (Author) commented:

I have added the alternative methods to the docstring and created a pull request.


apbard commented Dec 17, 2016

To contribute to this discussion...
On the one hand, as @theultimatecrouton has mentioned, there is no an unique and unanimously accepted definition of percentiles. There are however some methods that have found larger use in applications and statistical software. An highly cited survey on this matter is Hyndman-Fan (H&F). And the 4 methods cited in the article proposed by @mikep2016 are actually taken from the 9 analysed there. On the other hand, I have been quite surprised by the methods implemented in numpy for a series of reason that I'll try to explain in the following.
Let's consider for instance the simple case where sample data is D = [0, 1, 2, 3]
In the following picture you can see the plot of the different methods (percentiles on X, values on Y):
ppf_test1

The blue line is Method 1, the oldest/simplest "standard" definition as the inverse of the cumulative distribution function. There is no equivalent of this currently implemented in numpy.
(image: method1)

The default method "linear" is fine. It corresponds to method 7 in H&F. (Note that all methods 4-9 are somehow linear; if it were up to me, I would have assigned method 4 to this name.)
(image: method4)

The other four (nearest, lower, upper, midpoint) do not directly correspond to any of those.
I have searched the literature and haven't found any reference to those definitions. Plus, I think (but it may be just me) that the names are quite misleading, and it seems to me that they are somehow "misaligned":

  • Upper: change points are at [~33.3, 50, ~66.6].
  • Lower: change points are at [~33.3, 50, ~66.6].
  • Midpoint: change points are at [~16.5, 50, ~83.5].
    For all of these I would have expected [25, 50, 75], with "lower" therefore equal to Method 1 (the blue line).
  • For "nearest" I personally would have expected change points at [12.5, 37.5, 62.5], as in method 3, but I agree that "nearest" is likewise a bit ambiguous.
    (image: method3)

Finally, IMO, working on something similar to scipy/scipy#6801 and scipy/scipy#6466, I feel it would be quite useful to have at least Method 1, as a "proper inverse" of the empirical cdf.
At the same time, I think it may not be such a bad idea to align with other packages: R and Octave implement all 9 methods proposed in H&F.
For the reasons I mentioned before, I am not so sure whether the non-"linear" methods (or at least their names) are so useful/intuitive.
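That "Method 1" inverse-ECDF definition can be sketched in a few lines (my own helper, not NumPy API; newer NumPy later exposed the same estimator as method='inverted_cdf'):

```python
import numpy as np

# Method 1: Q(p) is the smallest sample value x with ECDF(x) >= p,
# i.e. the value at the smallest 1-based rank k with k/n >= p.
def inv_ecdf(data, p):
    a = np.sort(np.asarray(data))
    n = a.size
    k = int(np.ceil(p / 100 * n))
    return int(a[max(k, 1) - 1])

D = [0, 1, 2, 3]
vals = [inv_ecdf(D, p) for p in (10, 25, 50, 75, 90)]
print(vals)   # [0, 0, 1, 2, 3]
```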

What do you think?
If you believe that could be of interest I am willing to contribute to the implementation of some of such methods.

@ricardoV94 commented:

@apbard I completely agree with you that Method 1 should be available, and that the names of the other interpolation methods are confusing. I posted this issue separately here before finding your message in this thread.


jiglesiasc commented Nov 3, 2021

I'm currently studying statistics from McClave's book. There is an example for the following data:

```
Out[70]:
0 -9
1 -5
2 -3
3 -1
4 -1
5 -1
6 0
7 0
8 0
9 2
10 2
11 2
12 3
13 4
14 6
15 6
16 6
17 7
18 7
19 7
20 7
21 8
22 8
23 8
24 8
25 9
26 10
27 10
28 10
29 10
30 10
31 10
32 11
33 12
34 12
35 14
36 16
37 21
38 25
39 35
dtype: int64
```

First column is the index, second column is the data. In the book we get the following answer:

(image: the book's worked answer for the 95th percentile)

If you use numpy 1.20.3 to calculate, say, the 95th percentile, the answer is:

```
np.percentile(data, 95)
Out[71]: 21.19999999999999
```

which is wrong.

The correct calculation is the one presented in the book. I calculated the answer by hand.

Step 1) First calculate the rank = 95/100 * (40 + 1) = 38.95
Step 2) Look for the 38th element (index number 37) which is 21, and look for the 39th element (index number 38) which is 25
Step 3) Use the linear formula: 95th percentile = 21 + (25 - 21) * 0.95 = 24.8.
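The book's p(n+1) method, following the three steps above, can be sketched as a helper of my own (not NumPy's default behaviour; ranks are 1-based):

```python
import numpy as np

# Method A rank: p * (n + 1), 1-based; interpolate between the two neighbours.
def percentile_np1(data, p):
    a = np.sort(np.asarray(data, dtype=float))
    n = a.size
    rank = p / 100 * (n + 1)        # step 1: e.g. 0.95 * 41 = 38.95 for n = 40
    lo = int(np.floor(rank))
    if lo < 1:                      # clamp below the first rank
        return a[0]
    if lo >= n:                     # clamp above the last rank
        return a[-1]
    frac = rank - lo
    # steps 2-3: take the lo-th and (lo+1)-th values and interpolate
    return a[lo - 1] + frac * (a[lo] - a[lo - 1])

data = [-9, -5, -3, -1, -1, -1, 0, 0, 0, 2, 2, 2, 3, 4, 6, 6, 6, 7, 7, 7,
        7, 8, 8, 8, 8, 9, 10, 10, 10, 10, 10, 10, 11, 12, 12, 14, 16, 21, 25, 35]
p95 = percentile_np1(data, 95)
print(round(p95, 6))   # 24.8, matching the book
```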

As you can see, the result is the same as in the book. I also calculated the correct answer for the OP, and he was right with his expected values. Most people didn't notice that there were actually 52 items and not 60 in his array.

In conclusion, is numpy's percentile calculation doing something wrong, or what fraction is it using to compute the linear answer?


seberg commented Nov 3, 2021

There is nothing wrong with it, but there are different ways to calculate percentiles, see gh-10736. In particular, like many default values, NumPy's is a sample percentile and not a population estimate (there are many population estimate choices!). Comments on that PR are very welcome.
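For later readers: the PR referenced here landed in NumPy 1.22, which exposes the Hyndman & Fan estimators by name through the `method` keyword (the example below assumes NumPy >= 1.22):

```python
import numpy as np

# NumPy >= 1.22: H&F estimators selectable by name; the default is still
# 'linear' (H&F 7, the sample quantile discussed in this thread), while
# 'weibull' is the p(n+1) rule used by SPSS and friends.
a = np.array([15, 20, 35, 40, 50])
default = np.percentile(a, 40)                    # H&F 7: approximately 29
weibull = np.percentile(a, 40, method="weibull")  # H&F 6: approximately 26
print(default, weibull)
```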


seberg commented Nov 4, 2021

@jiglesiasc I am planning on merging the PR working on this today; after that, fixups to the documentation would be very welcome. There will definitely be no change in the default. It would be much too disruptive, and we will follow R here.
I am also not personally convinced it would be good anyway. My interpretation is that the current version calculates a sample quantile (rather than a population estimate); this is actually similar to our std defaulting to ddof=0, and may even be more expected, e.g. for plotting boxplots in matplotlib.

(Happy to be corrected about all of those things, I am not a statistician, but the default would be painful to change and would require some extremely thorough convincing.)

Seems the original issue was also about the definition of the method being used and is covered by the other issue (and PR), so I will close this. Thanks all!

seberg closed this as completed Nov 4, 2021