Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect quantile computation #353

Open
jverre-drivalytix opened this issue Jan 4, 2019 · 7 comments
Open

Incorrect quantile computation #353

jverre-drivalytix opened this issue Jan 4, 2019 · 7 comments
Labels

Comments

@jverre-drivalytix
Copy link

jverre-drivalytix commented Jan 4, 2019

While looking at the quantile computation, I have noticed that the quantile calculation does not match up with the calculations returned by numpy in python (can be considered as the reference).

There are 4 different ways to interpolate quantiles when one quantile does not land on an exact value: linear, lower, higher, midpoint, nearest. However it seems like neither of these matches with the way simple-statistics computes percentiles.

Using data = [0, 0, 0.3, 1.2, 1.23, 3.5, 10, 12, 23.3, 32.1] and computing the 25th, median and 75th percentile we get:

  • simple-statistics: [ 0.3, 2.365, 12.0 ]
  • simple-statistics python implementation: [0.3, 1.23, 12]
  • numpy / linear: [ 0.525, 2.365, 11.5 ]
  • numpy / lower: [ 0.3, 1.23, 10.0 ]
  • numpy / higher: [ 1.2, 3.5, 12.0 ]
  • numpy / midpoint: [ 0.75, 2.365, 11.0 ]
  • numpy / nearest: [ 0.3, 1.23, 12.0 ]

For reproducibility, this is the code I used:

const ss = require('simple-statistics')

var data = [0, 0, 0.3, 1.2, 1.23,  3.5, 10, 12, 23.3, 32.1]
console.log(ss.quantile(data, [0.25, 0.5, 0.75]))
import numpy as np
import simplestatistics as ss

data = [0, 0, 0.3, 1.2, 1.23,  3.5, 10, 12, 23.3, 32.1]
for i in ['linear', 'lower', 'higher', 'midpoint', 'nearest']:
    print(i, np.percentile(data, [25, 50, 75], interpolation=i))

print(ss.quantile(data, [0.25, 0.50, 0.75]))
@tmcw
Copy link
Member

tmcw commented Jan 7, 2019

We're currently using quickselect for quantiles, so unfortunately it's a little complicated. I'll take a look at the implementation, see if there's a fix, and potentially replace it with a simpler (albeit less performant) implementation.

@mourner mourner added the bug label Jan 8, 2019
@mourner
Copy link
Contributor

mourner commented Jan 8, 2019

@tmcw I think quickselect doesn't matter here. What defines the behavior is this logic for sorted quantile. We seem to use some kind of a mix between Python's equivalent nearest and midpoint depending on whether the index landed on an integer value.

@Yomguithereal
Copy link
Member

I agree that we should stick to numpy's behavior but just know that there still are controversies about their choices which are currently being discuted by Python's core concerning the addition of the quantile methods to the py3 statistics module right now.

@mhkeller
Copy link

Any update on this?

@tmcw
Copy link
Member

tmcw commented May 15, 2020

I think we should switch from the blend of nearest and midpoint to just one or the other. Reading through the python documentation for quantiles it seems like they shipped a version with just nearest-like behavior. Would welcome any other ecosystem examples for what we should do here.

It's tempting to support a linear/nearest/etc option so that this change doesn't have to be a major version bump, but if the current behavior isn't what anyone would want as a default, the change to fix this issue will be a major version bump.

@yiyange
Copy link

yiyange commented Feb 26, 2021

Somewhat related, this line of comment is a bit misleading (i was mislead by this thus made some incorrect comment)

// If p is not integer, return the next element in array

p is definitely not an integer by this point. but idx can be integer or a float

@rbox-risk
Copy link

@tmcw From the long and well established R stats ecosystem, the quantile function runs 9 different algorithms, of which type=7 is the default. This default agrees with the numpy / linear output with these data.

rdata <- c(0, 0, 0.3, 1.2, 1.23, 3.5, 10, 12, 23.3, 32.1)
quantile(rdata, c(0.25,0.5,0.75))

gives

   25%    50%    75% 
 0.525  2.365 11.500 

Running all types

t(sapply(1:9, function(i) quantile(rdata, c(0.25,0.5,0.75), type=i)))

gives output (with annotations):

          25%   50%      75%
 [1,] 0.30000 1.230 12.00000 (ss python)
 [2,] 0.30000 2.365 12.00000
 [3,] 0.00000 1.230 12.00000
 [4,] 0.15000 1.230 11.00000
 [5,] 0.30000 2.365 12.00000 (ss)
 [6,] 0.22500 2.365 14.82500
 [7,] 0.52500 2.365 11.50000 (R default, numpy default/linear)
 [8,] 0.27500 2.365 12.94167
 [9,] 0.28125 2.365 12.70625

recovering the ss python implementation as type=1 and the ss js implementation as type=5. The documentation describes each algo, and provides some useful references

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants