Incorrect quantile computation #353

jverre-drivalytix · 2019-01-04T12:34:47Z

While looking at the quantile computation, I noticed that its results do not match those returned by numpy in Python (which can be considered the reference).

There are 5 different ways to interpolate a quantile when it does not land on an exact value: linear, lower, higher, midpoint, nearest. However, it seems that none of these matches the way simple-statistics computes percentiles.

Using data = [0, 0, 0.3, 1.2, 1.23, 3.5, 10, 12, 23.3, 32.1] and computing the 25th, 50th (median) and 75th percentiles, we get:

simple-statistics: [ 0.3, 2.365, 12.0 ]
simple-statistics python implementation: [0.3, 1.23, 12]
numpy / linear: [ 0.525, 2.365, 11.5 ]
numpy / lower: [ 0.3, 1.23, 10.0 ]
numpy / higher: [ 1.2, 3.5, 12.0 ]
numpy / midpoint: [ 0.75, 2.365, 11.0 ]
numpy / nearest: [ 0.3, 1.23, 12.0 ]

For reproducibility, this is the code I used:

const ss = require('simple-statistics')

var data = [0, 0, 0.3, 1.2, 1.23,  3.5, 10, 12, 23.3, 32.1]
console.log(ss.quantile(data, [0.25, 0.5, 0.75]))

import numpy as np
import simplestatistics as ss

data = [0, 0, 0.3, 1.2, 1.23,  3.5, 10, 12, 23.3, 32.1]
for i in ['linear', 'lower', 'higher', 'midpoint', 'nearest']:
    print(i, np.percentile(data, [25, 50, 75], interpolation=i))

print(ss.quantile(data, [0.25, 0.50, 0.75]))
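For readers without numpy at hand, the five interpolation modes can be sketched in plain Python. This is a minimal illustration, not numpy's implementation: the function name `percentile` is hypothetical, and it assumes the fractional position is `(n - 1) * q / 100` and that ties in `nearest` round to the even index. Under those assumptions it reproduces the numpy rows listed above for this data set.

```python
import math

def percentile(values, q, interpolation="linear"):
    """Percentile q (0-100), assuming a 0-based position of (n - 1) * q / 100."""
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile requires at least one value")
    pos = (len(xs) - 1) * q / 100.0          # fractional 0-based index
    lo, hi = math.floor(pos), math.ceil(pos)  # neighbouring integer indices
    if interpolation == "lower":
        return xs[lo]
    if interpolation == "higher":
        return xs[hi]
    if interpolation == "midpoint":
        return (xs[lo] + xs[hi]) / 2
    if interpolation == "nearest":
        # Python's round() sends exact halves to the even index (assumed to match numpy)
        return xs[round(pos)]
    if interpolation == "linear":
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])
    raise ValueError("unknown interpolation: " + interpolation)

data = [0, 0, 0.3, 1.2, 1.23, 3.5, 10, 12, 23.3, 32.1]
for mode in ["linear", "lower", "higher", "midpoint", "nearest"]:
    print(mode, [percentile(data, q, mode) for q in (25, 50, 75)])
```

The only difference between the modes is how the two neighbours of the fractional position are combined, which is why they all agree whenever the position lands exactly on an index.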


tmcw · 2019-01-07T17:53:49Z

We're currently using quickselect for quantiles, so unfortunately it's a little complicated. I'll take a look at the implementation, see if there's a fix, and potentially replace it with a simpler (albeit less performant) implementation.

mourner · 2019-01-08T10:14:40Z

@tmcw I think quickselect doesn't matter here. What defines the behavior is this logic for sorted quantile. We seem to use a mix of what numpy calls nearest and midpoint, depending on whether the index lands on an integer value.
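The mixed behavior described here can be illustrated with a short Python sketch. The branch structure is an assumption pieced together from this thread and from the output reported above; it is not a verbatim copy of quantile_sorted.js, and the name `quantile_sorted_sketch` is hypothetical. It assumes the index is computed as `n * p`, that a fractional index selects a single element, and that an integer index on an even-length array averages two neighbours.

```python
import math

def quantile_sorted_sketch(xs, p):
    """Quantile of an already sorted list, mixing element selection and averaging."""
    if not xs:
        raise ValueError("quantile requires at least one value")
    if p < 0 or p > 1:
        raise ValueError("p must be between 0 and 1")
    if p == 1:
        return xs[-1]
    if p == 0:
        return xs[0]
    idx = len(xs) * p                    # assumed index formula
    if idx % 1 != 0:
        # Fractional index: pick one element, no interpolation
        return xs[math.ceil(idx) - 1]
    i = int(idx)
    if len(xs) % 2 == 0:
        # Integer index, even length: average the two surrounding elements
        return (xs[i - 1] + xs[i]) / 2
    # Integer index, odd length: pick the element at that index
    return xs[i]

data = sorted([0, 0, 0.3, 1.2, 1.23, 3.5, 10, 12, 23.3, 32.1])
print([quantile_sorted_sketch(data, p) for p in (0.25, 0.5, 0.75)])
```

With the ten-value data set from the report, the 25th and 75th percentiles take the element-selection branch and the median takes the averaging branch, which is consistent with the reported simple-statistics result.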

Yomguithereal · 2019-01-08T10:37:12Z

I agree that we should stick to numpy's behavior, but note that there is still controversy about their choices, which is currently being discussed by Python's core developers in connection with adding quantile methods to the Python 3 statistics module.

mhkeller · 2020-05-15T18:18:12Z

Any update on this?

tmcw · 2020-05-15T18:51:51Z

I think we should switch from the blend of nearest and midpoint to just one or the other. Reading through the python documentation for quantiles it seems like they shipped a version with just nearest-like behavior. Would welcome any other ecosystem examples for what we should do here.

It's tempting to support a linear/nearest/etc option so that this change doesn't have to be a major version bump, but if the current behavior isn't what anyone would want as a default, the change to fix this issue will be a major version bump.

yiyange · 2021-02-26T23:45:41Z

Somewhat related: this comment line is a bit misleading (I was misled by it and made an incorrect comment as a result).

simple-statistics/src/quantile_sorted.js

Line 27 in 1db09fc

// If p is not integer, return the next element in array

p is definitely not an integer by this point, but idx can be an integer or a float.

rbox-risk · 2021-09-03T12:51:19Z

@tmcw In the long-established R stats ecosystem, the quantile function implements 9 different algorithms, of which type=7 is the default. This default agrees with the numpy / linear output for these data.

rdata <- c(0, 0, 0.3, 1.2, 1.23, 3.5, 10, 12, 23.3, 32.1)
quantile(rdata, c(0.25,0.5,0.75))

gives

   25%    50%    75% 
 0.525  2.365 11.500

Running all types

t(sapply(1:9, function(i) quantile(rdata, c(0.25,0.5,0.75), type=i)))

gives output (with annotations):

          25%   50%      75%
 [1,] 0.30000 1.230 12.00000 (ss python)
 [2,] 0.30000 2.365 12.00000
 [3,] 0.00000 1.230 12.00000
 [4,] 0.15000 1.230 11.00000
 [5,] 0.30000 2.365 12.00000 (ss)
 [6,] 0.22500 2.365 14.82500
 [7,] 0.52500 2.365 11.50000 (R default, numpy default/linear)
 [8,] 0.27500 2.365 12.94167
 [9,] 0.28125 2.365 12.70625

This recovers the ss Python implementation as type=1 and the ss JS implementation as type=5. The R documentation describes each algorithm and provides some useful references.
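To make the relationship between the continuous types concrete, here is a hedged Python sketch of types 4 through 9. It is not R's code: the function name `quantile_by_type` and the parameter table are illustrative, and it assumes each type differs only in an offset `m = a + p * (1 - a - b)` added to `n * p` before linear interpolation. It also omits the small numerical fuzz R applies near integer positions. Under those assumptions it reproduces rows 4 to 9 of the table above for this data set.

```python
import math

# Assumed (a, b) parameters for the continuous types 4-9
TYPE_PARAMS = {
    4: (0.0, 1.0),
    5: (0.5, 0.5),
    6: (0.0, 0.0),
    7: (1.0, 1.0),
    8: (1 / 3, 1 / 3),
    9: (3 / 8, 3 / 8),
}

def quantile_by_type(values, p, qtype=7):
    """Sample quantile for p in [0, 1] using one of the continuous types 4-9."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("quantile requires at least one value")
    a, b = TYPE_PARAMS[qtype]
    h = n * p + a + p * (1 - a - b)   # 1-based fractional position
    j = math.floor(h)
    g = h - j                         # interpolation weight between neighbours
    lo = min(max(j - 1, 0), n - 1)    # clamp to valid 0-based indices
    hi = min(max(j, 0), n - 1)
    return (1 - g) * xs[lo] + g * xs[hi]

rdata = [0, 0, 0.3, 1.2, 1.23, 3.5, 10, 12, 23.3, 32.1]
for qtype in sorted(TYPE_PARAMS):
    print(qtype, [round(quantile_by_type(rdata, p, qtype), 5) for p in (0.25, 0.5, 0.75)])
```

Seen this way, type=7 and type=5 share the same interpolation step and differ only in where the fractional position falls, which is why they agree on the median here but not on the quartiles.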

mourner added the bug label Jan 8, 2019
