Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Have other linear interpolation for percentiles #797

Closed
bzah opened this issue Aug 17, 2021 · 6 comments · Fixed by #802
Closed

Feature: Have other linear interpolation for percentiles #797

bzah opened this issue Aug 17, 2021 · 6 comments · Fixed by #802

Comments

@bzah
Copy link
Contributor

bzah commented Aug 17, 2021

  • xclim version: 0.28
  • Python version: 3.x
  • Operating System: Mac os

Description

Follow up to the discussion with @huard.
In other climate indices libraries such as climdex and icclim, the percentile interpolation method used is the 8th method of Hyndman and Fan. Both library have their own C/C++ implementation.
In Xclim, the default interpolation of numpy is used, which correspond to the 7th method of Hyndman and Fan.
However numpy does not implement the 8th method (see numpy/numpy#10736 (comment)).

It would be great if xclim:

  • Use the Scipy mquantiles method instead of np.percentile
  • Have a parameter to choose the interpolation in calendar.percentile_doy

As usual, I can work on this issue.

@huard
Copy link
Collaborator

huard commented Aug 17, 2021

Question: there are many places in the xclim code where we use DataArray.quantile. Do we need to change this as well, or we are focussing here on percentile_doy only ?

@bzah
Copy link
Contributor Author

bzah commented Aug 18, 2021

Well I guess it depends.
The method 8 is particularly fit if you don't know the distribution function of your sample and your goal is to have an approximately unbiased median. It is advocated to be the best default function for quantile estimation.
However, if you know the distribution function of your sample you may have a better estimation with another linear interpolation method.
For example the method 9 of Hyndman&Fan is better fit, if the sample follow a normal distribution.

Plus, If you which to have an unbiased mean instead of an unbiased median, other methods could give better results. I found this on this paper, but it is not published in a scientific journal AFAIK.

I'm not qualified to know if it would be a good idea to use method 8 everywhere.
@pagecp Do you have a thought on this subject ?

@pagecp
Copy link

pagecp commented Aug 18, 2021

@bzah Precipitation is clearly not a normal distribution, but temperature is much more... and the median is better than the mean for sure because it takes better into account the number of samples. So I would say method 9 for temperature may be better, and method 8 for precipitation....
But I am not an expert at all in statistics.

@bzah
Copy link
Contributor Author

bzah commented Aug 20, 2021

I spent some time jumping from library to library to see how others do this. It's a kind of a mess, so I will summarize here what I learned.

TL;DR

IMO, it is safe and best to use scipy::mquantiles with method 8 parameters on percentile_doy and only on percentile_doy.

On the method to use

As stated above, the method 8 of hyndman&fan should be used if we don't know the distribution function of the sample.
The method 9, should be used if we know we are working a normal distribution (such as temperatures).
For other distribution function I'm really not sure...

In the case of day of year sample, I don't think there is a recognizable distribution function even for temperatures. Instinctively I would think daily averaged temperature of a date is independent of the temperature in the previous year at the same date, but maybe I'm wrong.
In other cases, the best fitted method will depend on the data handled and if the distribution function is known.

About performances

In order to find a quantile we need to have a look at each value in the array and reorganize them.
We can either sort the whole array or partially sort it. The second option is what numpy does in ::percentiles which, I suppose, gives it such good performances on large arrays.

In the case of day of year values, it doesn't matter much because ::percentile_doy expect daily values or coarser, and even if the dataset span on a 100 years, the sort will be done a 100 doy values. In fact, it might even have better performances to use ::sort instead of ::partition on small arrays due to the possible overhead of numpy ::partition function (see here).
In other cases, it depends. On SciTools/Iris project for example, they have both numpy percentiles and scipy mquantiles and they let the user decide if they prioritize performances using numpy::percentiles or they prioritize correctness with scipy::mquantiles.
I don't like this approach because IMO it exposes too much of the internal mechanisms to the user.

About dask

To make things even simpler, Dask has its own implementation of quantiles.
This is probably necessary because to compute quantiles it may need to access different chunks.
For example with doy values, if the data is chunked on time, to computation of quantiles needs to access most (all?) of the chunks. However when chunked on lat/lon, we could have all doy values of interest per chunk.
A good explanation of this problem : SciTools/iris#3901 (comment).
There is also some known issues with Dask implementation of quantiles: dask/dask#6566
I'm still a big noob with Dask, so this is probably an issue that I don't fully understand.

About other methods

Because the fun never ends, there are other methods to compute quantiles.
One in particular is a "concurrent" to the 8th method as the best method when the distribution function is unknown.
This method is described in Harrell&Davis 1982.
It was known to Hyndman&Fan but not included in their paper from 1996 because it was not yet implemented in widely used packages.
Now scipy has an implementation of it under scipy.stats.mstats_extras::hdquantiles and is supposed to have very good performances (at least since this PR)
It may be particularly interesting on small samples if the distribution is not a normal distribution.
So it might be a good idea to use this one instead of method 8, but it's not clear to me what would be the side effects.

Final note

@huard I think it is best to see case by case if DataArray.quantile can be replaced by scipy::mquantiles, because:

  • The default of numpy is not that bad and is widely used.
  • When the backend is Dask it will call it's own implementation.
  • To choose the very best method to use, some analysis must be done on each case.

@huard
Copy link
Collaborator

huard commented Aug 20, 2021

Thanks @bzah for the explanations.

I'm happy to let you decide on a course of action. I suggest we focus on percentile_doy for now and not try to tweak with the DataArray.quantile.

bzah added a commit to bzah/xclim that referenced this issue Aug 20, 2021
[Warning] _calc_perc now returns a masked array !
It is now possible to use other linear interpolation methods
for percentiles.
By default, it uses the type 8 of Hyndman&Fan which is similar
to climdex and icclim behaviors.
bzah added a commit to bzah/xclim that referenced this issue Aug 20, 2021
[Warning] _calc_perc now returns a masked array !
It is now possible to use other linear interpolation methods
for percentiles.
By default, it uses the type 8 of Hyndman&Fan which is similar
to climdex and icclim behaviors.
@bzah
Copy link
Contributor Author

bzah commented Sep 16, 2021

As a side note here to give some confidence in this code:
For rx90p, when bootstrapping the percentiles, it now gives results very similar to those of climdex.
Only a few points seem to be inaccurate and for some reasons mostly out of the bootstrap period.

Here the bootstrap was on 1990-2000.
tx90p_in_climpact_resampled_and_unit_conv_and_sorte

huard added a commit that referenced this issue Sep 16, 2021
[#797] Update linear interpolation of percentiles
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants