Shortest Probability Interval support #136

arthur-albuquerque · 2022-05-15T02:08:50Z

Hi Matthew,

Do you plan to support the Shortest Probability Interval (further details here) with functions like ggdist::median_spi?

Thanks!

mjskay · 2022-06-19T20:46:35Z

Ah at the moment I probably don't have time to investigate it, though I don't in principle object if it is a better variant of HDIs. How does it differ from hdci()?

arthur-albuquerque · 2022-06-20T23:22:07Z

This thread by @bwiernik seems relevant.

mjskay · 2022-06-21T01:34:28Z

Ah hrm. So it's also based on density estimators? I guess I'm not sure the advantage without doing more comparison (or seeing comparisons others have done if there are some). One nice property of hdci is that it doesn't require a density estimator.

bwiernik · 2022-06-21T07:30:47Z

The major advantage of SPI over HDI is reduced Monte Carlo error error. https://doi.org/10.1007/s11222-015-9563-8

Both HDI via empirical shortest distance algorithm and SPI via the method linked above are implemented in bayetestR if you wanted to swap in that for HPDInterval

mjskay · 2022-06-21T19:40:03Z

Yeah the reason I'm interested specifically in comparisons of SPI to HDCI (not HDI) is that HDCI uses the CDF approach from HDInterval, so if those two methods are similar I can (lazily) just leave things as they are ;)

bwiernik · 2022-06-21T20:36:46Z

What's HDCI stand for?

bwiernik · 2022-06-21T20:40:07Z

@DominiqueMakowski I think you had some simulation code for comparing intervals set up? Could you compare 80% and 95% intervals for HDI, HDCI, and SPI?

arthur-albuquerque · 2022-07-02T02:25:23Z

"hdci yields the highest-density continuous interval" https://mjskay.github.io/ggdist/reference/point_interval.html

mjskay · 2022-07-02T02:32:18Z

I should probably state more clearly what hdci is in the docs. It uses the algorithm for a continuous interval from HDInterval, which basically uses the ECDF to find the smallest interval of the specified mass: https://github.com/mikemeredith/HDInterval/blob/main/R/hdiVector.R

It is nice because it is simple, fast, and does not rely on picking a density estimator. But if the SPI algorithm has lower error maybe I'd switch it to that. Or a better option might be for me to make sure point_interval(.interval = ...) is compatible with the interval functions in bayestestR.

DominiqueMakowski · 2022-07-02T02:51:54Z

I think you had some simulation code for comparing intervals set up?

I can't seem to find it, but it was basically a big loop generating various distributions of different sizes and computing the different types of CIs. I'm not sure how we can reliably assess the "error" 🤔

Unless it already exists, it would be quite interesting to create a gist/vignette/paper that would 1) describe and explain the advantages/caveats of each interval method, 2) provide clean, community-contributed, R code to compute them (that could be re-used in other packages), and 3) provide some benchmarking on efficiency & behavior

_{(PS: I will be co-editing a special issue in Mathematics on Advances in Statistical Computing and I'm pretty sure such work would be a great fit...)}

mjskay · 2022-07-02T03:14:48Z

Unless it already exists, it would be quite interesting to create a gist/vignette/paper that would 1) describe and explain the advantages/caveats of each interval method, 2) provide clean, community-contributed, R code to compute them (that could be re-used in other packages), and 3) provide some benchmarking on efficiency & behavior

This is a great idea! I would be happy to help with something like this but don't really have the time to lead it. Maybe I could see if one of my students is interested...

DominiqueMakowski · 2022-07-02T04:14:23Z

Maybe I could see if one of my students is interested...

good idea! also tagging ~~@easystats/maintainers~~ (oops that doesn't work outside of the org) @strengejacke @mattansb

mattansb · 2022-07-02T04:19:47Z

Sounds great! I'm in (:

mattansb · 2022-07-02T04:33:57Z

Doesn't the current implementation of bayestestR::hdi() actually return HDCI (and doesn't estimate any densities, a-la Kruschke), i.e. always a single interval is returned?
I'm still not clear how this isn't the same as PSI...?

mattansb · 2022-07-02T04:37:16Z

Ah, I see this is a different algo to estimate the same thing as HDCI. Cool!

strengejacke · 2022-07-02T09:11:16Z

Yeah, I'm interested in supporting a paper, too!

DominiqueMakowski · 2022-07-02T09:29:14Z

Just need someone that actually does the job now 😁

mjskay · 2022-07-04T01:28:44Z

Alright I started playing with this for my own curiosity. Some comparisons here: https://github.com/mjskay/interval-estimators/blob/main/interval-estimators.md (source Rmd in same repo).

I can see three broad classes worth investigating:

equi-tailed intervals, for which the usual estimator is the quantile interval (which could come in several forms; e.g. quantile() has 9 different quantile estimators). ggdist::qi() and bayestestR::eti() are examples
shortest intervals, for which there is the empirical shortest interval based on quantiles (which could come in as many forms as the equi-tailed intervals); I believe ggdist::hdci() and bayestestR::hdi() are examples; then there is the "Spin" algorithm implemented as bayestestR::spi(); I also added a rough implementation of a variant that can use any of the 9 different quantile estimators in quantile() to the above document (see the function quantile_shortest_interval().
highest-density intervals, which unlike the other two may consist of multiple intervals. Implementations include ggdist::hdi() / HDInterval::hdi(density(x)), hdrcde::hdr(), and distributional::hdr()

I have been looking at a few quantities of interest:

root mean squared error of the lower and upper ends of the intervals
mean percentage overlap of the estimated interval with the true interval
coverage (probability mass of the true distribution contained by the interval)
variance (sd) of the endpoints of the intervals

Also to look at: bias.

One intriguing thing so far is that SPI doesn't perform much differently on these metrics compared to the empirical shortest interval implementations (with the exception of having coverage closer to nominal), which is surprising as the SPI paper says SPI should have lower error than HDCI. I'm not sure if that's an implementation issue or what.

The other thing I've found so far is that amongst HDI algorithms, distributional::hdr() is looking the best. In particular it has lower variance than others, performing consistently down near quantile intervals in terms of variance. All the other algorithms for HDIs or HDCIs tend to have higher variance than quantile intervals, which is why I have tended to recommend against HDIs and HDCIs. But changing the underlying implementation of HDIs to the one in distributional might reduce that concern for cases when HDIs are desired (so long as multiple intervals are acceptable).

See e.g. this chart: these are 80% ribbons; there are 100 groups along the x axis, each a sample of size 10,000 from N(0,1). As far as I'm concerned qi and (barely) hdr are the only two with acceptable variance here. The other three are different algorithms for shortest intervals. Given that 10,000 tends to be the upper end of effective sample sizes people tend to generate from MCMC, unless there's a better algorithm out there for shortest intervals we aren't aware of, I'm not sure I'd recommend using them...

mattansb · 2022-07-05T07:30:28Z

I find it very surprising that even with 500 samples the methods don't differ that much (and that SPI seems to have relatively larger variance, if ever so slightly...).

Would it be interesting to compare these methods with ggdist::curve_interval() in the context of interval-bands? Or is that too different of an approach?

mjskay · 2022-07-09T18:56:58Z

Would it be interesting to compare these methods with ggdist::curve_interval() in the context of interval-bands? Or is that too different of an approach

I think the comparison could make sense for single distributions. For multiple distributions, curve_interval() is not really looking at the same thing.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Shortest Probability Interval support #136

Shortest Probability Interval support #136

arthur-albuquerque commented May 15, 2022

mjskay commented Jun 19, 2022

arthur-albuquerque commented Jun 20, 2022

mjskay commented Jun 21, 2022

bwiernik commented Jun 21, 2022 •

edited

mjskay commented Jun 21, 2022

bwiernik commented Jun 21, 2022

bwiernik commented Jun 21, 2022

arthur-albuquerque commented Jul 2, 2022

mjskay commented Jul 2, 2022

DominiqueMakowski commented Jul 2, 2022

mjskay commented Jul 2, 2022

DominiqueMakowski commented Jul 2, 2022 •

edited

mattansb commented Jul 2, 2022

mattansb commented Jul 2, 2022

mattansb commented Jul 2, 2022

strengejacke commented Jul 2, 2022

DominiqueMakowski commented Jul 2, 2022

mjskay commented Jul 4, 2022

mattansb commented Jul 5, 2022

mjskay commented Jul 9, 2022

Shortest Probability Interval support #136

Shortest Probability Interval support #136

Comments

arthur-albuquerque commented May 15, 2022

mjskay commented Jun 19, 2022

arthur-albuquerque commented Jun 20, 2022

mjskay commented Jun 21, 2022

bwiernik commented Jun 21, 2022 • edited

mjskay commented Jun 21, 2022

bwiernik commented Jun 21, 2022

bwiernik commented Jun 21, 2022

arthur-albuquerque commented Jul 2, 2022

mjskay commented Jul 2, 2022

DominiqueMakowski commented Jul 2, 2022

mjskay commented Jul 2, 2022

DominiqueMakowski commented Jul 2, 2022 • edited

mattansb commented Jul 2, 2022

mattansb commented Jul 2, 2022

mattansb commented Jul 2, 2022

strengejacke commented Jul 2, 2022

DominiqueMakowski commented Jul 2, 2022

mjskay commented Jul 4, 2022

mattansb commented Jul 5, 2022

mjskay commented Jul 9, 2022

bwiernik commented Jun 21, 2022 •

edited

DominiqueMakowski commented Jul 2, 2022 •

edited