Feature: Have other linear interpolation for percentiles #797

bzah · 2021-08-17T16:42:47Z

xclim version: 0.28
Python version: 3.x
Operating System: Mac os

Description

Follow up to the discussion with @huard.
In other climate indices libraries such as climdex and icclim, the percentile interpolation method used is the 8th method of Hyndman and Fan. Both library have their own C/C++ implementation.
In Xclim, the default interpolation of numpy is used, which correspond to the 7th method of Hyndman and Fan.
However numpy does not implement the 8th method (see numpy/numpy#10736 (comment)).

It would be great if xclim:

Use the Scipy mquantiles method instead of np.percentile
Have a parameter to choose the interpolation in calendar.percentile_doy

As usual, I can work on this issue.

The text was updated successfully, but these errors were encountered:

huard · 2021-08-17T18:08:27Z

Question: there are many places in the xclim code where we use DataArray.quantile. Do we need to change this as well, or we are focussing here on percentile_doy only ?

bzah · 2021-08-18T08:44:29Z

Well I guess it depends.
The method 8 is particularly fit if you don't know the distribution function of your sample and your goal is to have an approximately unbiased median. It is advocated to be the best default function for quantile estimation.
However, if you know the distribution function of your sample you may have a better estimation with another linear interpolation method.
For example the method 9 of Hyndman&Fan is better fit, if the sample follow a normal distribution.

Plus, If you which to have an unbiased mean instead of an unbiased median, other methods could give better results. I found this on this paper, but it is not published in a scientific journal AFAIK.

I'm not qualified to know if it would be a good idea to use method 8 everywhere.
@pagecp Do you have a thought on this subject ?

pagecp · 2021-08-18T08:50:16Z

@bzah Precipitation is clearly not a normal distribution, but temperature is much more... and the median is better than the mean for sure because it takes better into account the number of samples. So I would say method 9 for temperature may be better, and method 8 for precipitation....
But I am not an expert at all in statistics.

bzah · 2021-08-20T10:43:13Z

I spent some time jumping from library to library to see how others do this. It's a kind of a mess, so I will summarize here what I learned.

TL;DR

IMO, it is safe and best to use scipy::mquantiles with method 8 parameters on percentile_doy and only on percentile_doy.

On the method to use

As stated above, the method 8 of hyndman&fan should be used if we don't know the distribution function of the sample.
The method 9, should be used if we know we are working a normal distribution (such as temperatures).
For other distribution function I'm really not sure...

In the case of day of year sample, I don't think there is a recognizable distribution function even for temperatures. Instinctively I would think daily averaged temperature of a date is independent of the temperature in the previous year at the same date, but maybe I'm wrong.
In other cases, the best fitted method will depend on the data handled and if the distribution function is known.

About performances

In order to find a quantile we need to have a look at each value in the array and reorganize them.
We can either sort the whole array or partially sort it. The second option is what numpy does in ::percentiles which, I suppose, gives it such good performances on large arrays.

In the case of day of year values, it doesn't matter much because ::percentile_doy expect daily values or coarser, and even if the dataset span on a 100 years, the sort will be done a 100 doy values. In fact, it might even have better performances to use ::sort instead of ::partition on small arrays due to the possible overhead of numpy ::partition function (see here).
In other cases, it depends. On SciTools/Iris project for example, they have both numpy percentiles and scipy mquantiles and they let the user decide if they prioritize performances using numpy::percentiles or they prioritize correctness with scipy::mquantiles.
I don't like this approach because IMO it exposes too much of the internal mechanisms to the user.

About dask

To make things even simpler, Dask has its own implementation of quantiles.
This is probably necessary because to compute quantiles it may need to access different chunks.
For example with doy values, if the data is chunked on time, to computation of quantiles needs to access most (all?) of the chunks. However when chunked on lat/lon, we could have all doy values of interest per chunk.
A good explanation of this problem : SciTools/iris#3901 (comment).
There is also some known issues with Dask implementation of quantiles: dask/dask#6566
I'm still a big noob with Dask, so this is probably an issue that I don't fully understand.

About other methods

Because the fun never ends, there are other methods to compute quantiles.
One in particular is a "concurrent" to the 8th method as the best method when the distribution function is unknown.
This method is described in Harrell&Davis 1982.
It was known to Hyndman&Fan but not included in their paper from 1996 because it was not yet implemented in widely used packages.
Now scipy has an implementation of it under scipy.stats.mstats_extras::hdquantiles and is supposed to have very good performances (at least since this PR)
It may be particularly interesting on small samples if the distribution is not a normal distribution.
So it might be a good idea to use this one instead of method 8, but it's not clear to me what would be the side effects.

Final note

@huard I think it is best to see case by case if DataArray.quantile can be replaced by scipy::mquantiles, because:

The default of numpy is not that bad and is widely used.
When the backend is Dask it will call it's own implementation.
To choose the very best method to use, some analysis must be done on each case.

huard · 2021-08-20T12:11:57Z

Thanks @bzah for the explanations.

I'm happy to let you decide on a course of action. I suggest we focus on percentile_doy for now and not try to tweak with the DataArray.quantile.

[Warning] _calc_perc now returns a masked array ! It is now possible to use other linear interpolation methods for percentiles. By default, it uses the type 8 of Hyndman&Fan which is similar to climdex and icclim behaviors.

bzah · 2021-09-16T14:11:29Z

As a side note here to give some confidence in this code:
For rx90p, when bootstrapping the percentiles, it now gives results very similar to those of climdex.
Only a few points seem to be inaccurate and for some reasons mostly out of the bootstrap period.

Here the bootstrap was on 1990-2000.

[#797] Update linear interpolation of percentiles

bzah mentioned this issue Aug 20, 2021

[#797] Update linear interpolation of percentiles #802

Merged

7 tasks

huard closed this as completed in #802 Sep 16, 2021

huard added a commit that referenced this issue Sep 16, 2021

Merge pull request #802 from bzah/feature/update_percentile_method

abb4a36

[#797] Update linear interpolation of percentiles

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Have other linear interpolation for percentiles #797

Feature: Have other linear interpolation for percentiles #797

bzah commented Aug 17, 2021

huard commented Aug 17, 2021

bzah commented Aug 18, 2021

pagecp commented Aug 18, 2021

bzah commented Aug 20, 2021

huard commented Aug 20, 2021

bzah commented Sep 16, 2021

Feature: Have other linear interpolation for percentiles #797

Feature: Have other linear interpolation for percentiles #797

Comments

bzah commented Aug 17, 2021

Description

huard commented Aug 17, 2021

bzah commented Aug 18, 2021

pagecp commented Aug 18, 2021

bzah commented Aug 20, 2021

TL;DR

On the method to use

About performances

About dask

About other methods

Final note

huard commented Aug 20, 2021

bzah commented Sep 16, 2021