Feature request: confidence intervals for bar charts #163

Open
jaanli opened this issue Aug 29, 2023 · 3 comments

Comments

jaanli commented Aug 29, 2023

We have been using Mosaic for reporting on hundreds of thousands of hospital prices (demos here: https://beta.payless.health/examples/stlukes-bethlehem.html & https://beta.payless.health/examples/mount-sinai.html).

These prices are often listed according to minimum and maximum negotiated rates across several insurance products.

To accurately relay this information in a visualization, confidence intervals are necessary.

Would this be possible in Mosaic?

(In the docs I only found confidence intervals mentioned for the regressionY mark here: https://uwdata.github.io/mosaic/vgplot/#connected-marks)

Member

jheer commented Aug 29, 2023

There are two pieces to this: 1. Compute a confidence interval, and 2. Draw the interval as a mark.

For part 1, there are many possibilities. You could compute a standard parametric CI via an expression (e.g., MEAN(x) +/- 1.96 * STDDEV(x) / SQRT(COUNT()) for a 95% CI), though this bakes in a number of assumptions. Other methods, such as bootstrapped CIs or model-specific CIs, would need to be computed through other means.
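To make the arithmetic in that expression concrete, here is a minimal plain-JavaScript sketch of the same normal-approximation interval. The function name `parametricCI` is hypothetical and not part of Mosaic, and the sketch assumes the sample standard deviation (dividing by n - 1), which is what STDDEV is understood to compute in DuckDB.

```javascript
// Hypothetical helper (not a Mosaic API): normal-approximation CI for the mean.
// z = 1.96 corresponds to a 95% interval under the normality assumption.
function parametricCI(values, z = 1.96) {
  // Ignore null, undefined, and NaN entries, mirroring SQL aggregates skipping NULLs.
  const xs = values.filter((v) => v != null && !Number.isNaN(v));
  const n = xs.length;
  if (n < 2) return null; // sample standard deviation is undefined for n < 2

  const mean = xs.reduce((sum, v) => sum + v, 0) / n;
  // Sample variance with the n - 1 denominator (assumption, see lead-in).
  const variance = xs.reduce((sum, v) => sum + (v - mean) ** 2, 0) / (n - 1);
  // Half-width = z * standard error of the mean.
  const halfWidth = (z * Math.sqrt(variance)) / Math.sqrt(n);

  return { mean, lo: mean - halfWidth, hi: mean + halfWidth };
}
```

Passing a different `z` (for example 2.576 for roughly 99%) widens the interval without changing anything else.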

For part 2, given a calculated interval, you could use the link mark (for example) to visually represent the interval.

Something like this should work for a horizontally-oriented interval:

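// Horizontal 95% interval for each y value (normal approximation).
link(from(data), {
  // lower bound: mean - 1.96 * standard error, counting only non-null x
  x1: agg`AVG(x) - 1.96 * STDDEV(x) / SQRT(COUNT(*) FILTER (WHERE x IS NOT NULL))`,
  // upper bound: mean + 1.96 * standard error
  x2: agg`AVG(x) + 1.96 * STDDEV(x) / SQRT(COUNT(*) FILTER (WHERE x IS NOT NULL))`,
  y: 'your_y_variable'
})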

@domoritz
Member

#380 adds support for variance.

Member

jheer commented May 23, 2024

Variance aggregates were always supported in general, but in the next release, variance, stddev, etc. will also have data cube indexing support. However, we don't yet parse aggregate expressions to identify "indexable" aggregates, so for now only top-level aggregate functions will be indexed. That said, if you have a client that queries directly for avg, stddev, and count as separate columns, they will be indexed, and you can then combine them client-side into parametric CIs with an adjustable range.
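As a rough sketch of that last step, the client-side combination might look like the following. The row shape `{ key, avg, stddev, count }`, the function name `intervalsFor`, and the table of z-values are all assumptions for illustration; none of them are Mosaic APIs.

```javascript
// Approximate two-sided z-values for a few common confidence levels.
const Z_BY_LEVEL = new Map([
  [0.9, 1.645],
  [0.95, 1.96],
  [0.99, 2.576],
]);

// Hypothetical helper: turn pre-aggregated rows into interval endpoints.
// Each row is assumed to look like { key, avg, stddev, count }.
function intervalsFor(rows, level = 0.95) {
  const z = Z_BY_LEVEL.get(level);
  if (z === undefined) {
    throw new Error(`Unsupported confidence level: ${level}`);
  }
  return rows.map(({ key, avg, stddev, count }) => {
    // No interval when the standard deviation is missing or undefined (count < 2).
    if (stddev == null || count < 2) {
      return { key, avg, x1: null, x2: null };
    }
    const halfWidth = (z * stddev) / Math.sqrt(count);
    return { key, avg, x1: avg - halfWidth, x2: avg + halfWidth };
  });
}
```

Because only `z` changes between levels, a slider or menu could recompute the endpoints from the same three columns without issuing a new query.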


3 participants