Lineribbons for random quantities and similar problems #179

fweber144 opened this issue May 16, 2023 · 6 comments


@fweber144

First of all: Thank you very much for developing this great package!

I have a feature request which is related to Bayesian posterior predictive checks (PPCs), but which might also be helpful in other settings.

In "overlay" PPCs with a large number of posterior draws, rendering the plot often takes a long time, presumably because of the large number of separate lines that have to be drawn. This affects not only rendering in RStudio's "Plots" pane, but also saving the plot to a PDF file and then opening that file in a PDF viewer. Here is an example, adapted from the ?bayesplot::ppc_dens_overlay examples:

library(bayesplot)
y <- example_y_data()
yrep <- example_yrep_draws()
# The computation itself doesn't take that long:
system.time(gg_obj <- ppc_dens_overlay(y, yrep[1:250, ]))
##    user  system elapsed 
##   0.064   0.004   0.067
# But the rendering does:
system.time(print(gg_obj))
##    user  system elapsed 
##   1.457   0.015   1.477

# # Saving to PDF (and opening that PDF) also takes a long time:
# system.time(ggplot2::ggsave("<desired_path>/ppc_overlay.pdf", width = 6, height = 6 * 0.618))

That's why I've been thinking about some kind of "lineribbon" plot for such settings: the data ($y$) line is plotted as before (e.g., a kernel density estimate of the observed response values), but the generated response values ($y_{\text{rep}}$) are not drawn as one line per posterior draw. Instead, they are drawn as a shaded area with some pre-specified coverage probability (defaulting, e.g., to 90%) or as a gradient-colored (possibly "ramped") ribbon. That would also make it possible to use all posterior draws instead of having to choose a subset of them.

I'm not sure about the best way to solve this mathematically, nor about the best way to implement it, so I guess some work needs to be done on that first.

Furthermore, I'm not sure if ggdist is really the best place for this; bayesplot might be another good place. But I've recently found the lineribbon plots here in ggdist, so I thought the feature request might fit in here. And as I said above, this feature could also be useful for plots other than PPCs, which would be another argument for having it in ggdist and not in bayesplot. In any case, I'm also tagging @jgabry in case he has already thought about this as well.

The reason why I think the existing lineribbon plots cannot be used for this is that they require multiple y-axis values for each x-axis value, but in PPCs (and possibly other settings), we usually don't have that (because we have random quantities on the x-axis).

@fweber144 fweber144 changed the title from "Lineribbons for kernel density estimates" to "Lineribbons for random quantities and similar problems" May 16, 2023
@mjskay
Owner

mjskay commented May 17, 2023

I believe Michael Betancourt has examples of plots like what you're describing (see e.g. Step 14 here), though he's not a ggplot user so they wouldn't have been made using it ;).

I think a workflow for making these would definitely fit in ggdist --- I've thought about implementing similar things in the past, like so-called probability boxes (see #45), which are basically envelopes around a set of CDFs. What I would probably want is something that generalizes the current slab stat / geometry into a slab that has a lineribbon around it, and then allow people to add a lineribbon to either the pdf or cdf (or some function of these I suppose), where the pdf may be estimated either using a kernel density estimator or a histogram.

@fweber144
Author

Thanks for the link to Michael Betancourt's case study. For histograms, things should be a lot easier than for kernel density estimates. Perhaps histograms even fit into the existing lineribbon framework? And perhaps ECDF plots as well (maybe this is what you wanted to point out with probability boxes)? So perhaps kernel density estimates are really the only hard special case. If that's the case, I would be happy with the other solutions mentioned above, even though a solution for kernel density estimates would be nice as well, of course.
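A minimal sketch of how ECDFs could fit the existing lineribbon stat, assuming every curve is evaluated on one shared x grid. The simulated `yrep` matrix and the names `grid` and `ecdf_df` are made up for illustration and do not come from ggdist or bayesplot:

```r
library(ggplot2)
library(ggdist)

set.seed(1234)
n_draws <- 200
n_obs <- 40
# Simulated stand-in for replicated data, one row per draw (illustrative assumption)
yrep <- matrix(rnorm(n_draws * n_obs), nrow = n_draws)

# Shared x grid so that every ECDF is evaluated at the same points
grid <- seq(min(yrep), max(yrep), length.out = 101)

# Long data frame: one row per (draw, grid point) with the ECDF value as y
ecdf_df <- do.call(rbind, lapply(seq_len(n_draws), function(i) {
  data.frame(draw = i, x = grid, y = ecdf(yrep[i, ])(grid))
}))

# stat_lineribbon() summarizes the y values at each x across draws
p <- ggplot(ecdf_df, aes(x, y)) +
  stat_lineribbon(.width = c(0.5, 0.8, 0.95)) +
  scale_fill_brewer()
print(p)
```

Because each x value now carries one y value per draw, the data have the "multiple y-axis values for each x-axis value" shape that the existing lineribbon plots expect.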

What you suggest for the implementation/user interface sounds reasonable to me, although I have to admit that I'm not too familiar (yet) with the slab stat / geom.

@mjskay
Owner

mjskay commented May 19, 2023

> And perhaps ECDF plots as well (maybe this is what you wanted to point out with probability boxes)?

exactly :)

> So perhaps kernel density estimates are really the only hard special case. If that's the case, I would be happy with the other solutions mentioned above, even though a solution for kernel density estimates would be nice as well, of course.

I'm not sure what the issue would be for KDEs --- if you use the same x grid to generate each density, it should be straightforward to calculate the ribbon (unless I'm missing something?)
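To make the shared-grid idea concrete, here is a rough base-R sketch, assuming simulated draws in place of real posterior predictive draws. The names `yrep`, `grid`, `dens`, and `band` are illustrative only:

```r
set.seed(1234)
n_draws <- 500
n_obs <- 40
# Simulated stand-in for replicated data, one row per draw (illustrative assumption)
yrep <- matrix(rnorm(n_draws * n_obs), nrow = n_draws)

# Shared grid spanning all of the data, so every KDE is evaluated at the same x values
from <- min(yrep)
to <- max(yrep)
n_grid <- 256
grid <- seq(from, to, length.out = n_grid)

# n_grid x n_draws matrix: column i holds the KDE of draw i on the shared grid
dens <- apply(yrep, 1, function(v) density(v, from = from, to = to, n = n_grid)$y)

# Pointwise 5%, 50%, and 95% quantiles across draws at each grid point
band <- apply(dens, 1, quantile, probs = c(0.05, 0.5, 0.95))

# Draw the 90% band as a single polygon plus the pointwise median curve
plot(grid, band[2, ], type = "n", ylim = range(band), xlab = "x", ylab = "density")
polygon(c(grid, rev(grid)), c(band[1, ], rev(band[3, ])), col = "grey80", border = NA)
lines(grid, band[2, ])
```

Note that this band is pointwise: at each x it covers 90% of the density values, which is not the same as containing 90% of the curves in their entirety. The plot consists of one polygon and one line regardless of the number of draws.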

@fweber144
Author

> I'm not sure what the issue would be for KDEs --- if you use the same x grid to generate each density, it should be straightforward to calculate the ribbon (unless I'm missing something?)

Yes, if the same x grid is used for the KDEs, then ribbons for them should be as "easy" as for other curves (in particular, as easy as for histograms and ECDF plots for which I was implicitly assuming the same x grid to be used across all curves as well). I'm not familiar with the details of KDEs and somehow was assuming that in general, the x grid would differ between the different KDEs drawn in such a plot.

@mjskay
Owner

mjskay commented May 19, 2023

> Yes, if the same x grid is used for the KDEs, then ribbons for them should be as "easy" as for other curves (in particular, as easy as for histograms and ECDF plots for which I was implicitly assuming the same x grid to be used across all curves as well). I'm not familiar with the details of KDEs and somehow was assuming that in general, the x grid would differ between the different KDEs drawn in such a plot.

Right, typically you might get something like this:

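library(ggplot2)
library(ggdist)

set.seed(1234)
# 20000 values; `draw` is recycled, giving 500 draws of 40 values each
df = data.frame(x = rnorm(20000), draw = 1:500)

df |> ggplot(aes(x, group = draw)) +
  stat_slab(fill = NA, color = "black", alpha = 0.1, density = "unbounded")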

image

ggdist does let you ensure that the densities fill the full scale using expand; typically combined with trim = FALSE so that the density is allowed to be nonzero outside the range of the data:

set.seed(1234)
df = data.frame(x = rnorm(20000), draw = 1:500) 

df |> ggplot(aes(x, group = draw)) + 
  stat_slab(fill = NA, color = "black", alpha = 0.1, density = "unbounded", trim = FALSE, expand = TRUE)

image

When using base::density you can get the same result by setting from and to to the min/max over all the data, which makes it amenable to use of a lineribbon:

library(dplyr)

from = min(df$x)
to = max(df$x)

df |>
  group_by(draw) |>
  reframe(with(density(x, from = from, to = to), data.frame(x, y))) |>
  ggplot(aes(x, y)) +
  stat_lineribbon() + 
  scale_fill_brewer()

image

Something like this might be sufficient for what you want in bayesplot? For ggdist, I'd like this stat/geom to act similarly to stat_slab, i.e. to allow another variable to be mapped onto the y axis, which means a bit more work... :)
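Applied to the bayesplot example data from the first comment, the same approach might look roughly like the sketch below. It assumes the `bayesplot`, `ggplot2`, and `ggdist` packages are installed. The grid size `n_grid` and the names `dens_rep` and `dens_y` are illustrative choices, and this is not how `ppc_dens_overlay()` itself is implemented:

```r
library(bayesplot)
library(ggplot2)
library(ggdist)

y <- example_y_data()
yrep <- example_yrep_draws()

# Shared grid over the observed and replicated data
from <- min(y, yrep)
to <- max(y, yrep)
n_grid <- 256

# One KDE per posterior draw, all evaluated on the shared grid
dens_rep <- do.call(rbind, lapply(seq_len(nrow(yrep)), function(i) {
  d <- density(yrep[i, ], from = from, to = to, n = n_grid)
  data.frame(draw = i, x = d$x, y = d$y)
}))

# KDE of the observed data on the same grid
dens_y <- with(density(y, from = from, to = to, n = n_grid), data.frame(x, y))

# Ribbons summarize all draws; the observed-data density is overlaid as a line
p <- ggplot(dens_rep, aes(x, y)) +
  stat_lineribbon(.width = c(0.5, 0.9)) +
  geom_line(data = dens_y, color = "red") +
  scale_fill_brewer()
print(p)
```

This uses every row of `yrep` rather than a subset, and the number of drawn elements no longer grows with the number of draws.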

@fweber144
Author

Great, thanks a lot!
