Scaling plots #2280

Open
vladsavelyev opened this issue Jan 24, 2024 · 3 comments

vladsavelyev commented Jan 24, 2024

MultiQC HTML report should load and be useful to someone reading a report, regardless of the number of samples it was generated with. We should explore alternative representations of each plot type that handle large datasets efficiently. Such representations should:

  1. Show the overall distribution of the underlying data (by some sort of downsampling following sorting/clustering).
  2. Show detail for any extreme outliers that might be of interest.

✅ Table

An alternative representation of tables is a violin plot. When the number of samples exceeds config.max_table_rows=500, MultiQC builds a violin plot instead of a table. The violin shows the distribution, and each outlier gets a hoverable scatter point placed on top of the violin. This scales with both the number of columns and the number of rows.

✅ Violin plot

When the number of samples exceeds 2000, violins are downsampled by taking every Nth sample (sorted by values). Each outlier gets a hoverable scatter point placed on top of the violin (if the number of samples is <=50, all dots are shown).
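To make the "every Nth sample, sorted by values" idea concrete, here is a minimal sketch. It is not the MultiQC implementation: the function name `downsample_sorted` and the choice to always keep the maximum value are assumptions for illustration only.

```python
import math


def downsample_sorted(values, max_points=2000):
    """Keep roughly max_points values by taking every Nth value after sorting.

    Hypothetical helper, not part of MultiQC.
    """
    ordered = sorted(values)
    if len(ordered) <= max_points:
        return ordered
    # N is chosen so that at most max_points values survive the slicing.
    step = math.ceil(len(ordered) / max_points)
    kept = ordered[::step]
    # Assumption: keep the maximum too, so the violin's range is preserved.
    if kept[-1] != ordered[-1]:
        kept.append(ordered[-1])
    return kept
```

Because the values are sorted first, the kept points follow the same distribution shape as the full set, which is what the violin needs.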

❓ Line plot

The proposed alternative representation is a single line built from the median or mean values of all samples at a given data point. "Confidence intervals" are added as additional dotted lines to show percentiles and/or max and min values.
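A rough sketch of how such a summary could be computed, assuming each sample is a mapping of x to y. The function name `summarise_lines` and the choice of the 5th and 95th percentiles are assumptions, not something decided in this issue.

```python
from statistics import median, quantiles


def summarise_lines(samples):
    """Collapse many per-sample lines into one summary per x value.

    samples: dict of sample name -> dict of x -> y (hypothetical layout).
    """
    xs = sorted({x for series in samples.values() for x in series})
    summary = {}
    for x in xs:
        # Samples that lack this x are skipped for this data point.
        ys = [series[x] for series in samples.values() if x in series]
        if len(ys) >= 2:
            # n=20 gives 19 cut points: the first is the 5th percentile,
            # the last is the 95th.
            cuts = quantiles(ys, n=20, method="inclusive")
            p5, p95 = cuts[0], cuts[-1]
        else:
            p5 = p95 = ys[0]
        summary[x] = {
            "min": min(ys),
            "p5": p5,
            "median": median(ys),
            "p95": p95,
            "max": max(ys),
        }
    return summary
```

The median would be drawn as the main line, and the percentile and min/max series as the dotted lines around it.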

Additionally, a violin plot is added, built from the mean/median value across all data points for each sample. The violin has hoverable dots for outliers; clicking on one should show a standard line for that specific sample.

❓ Bar plot

One idea is to perform clustering, followed by downsampling, to keep the maximum number of bars around 1000 for the main barplot. An additional bar plot is placed for the outliers.

❓ Box plot

❓ Scatter plot

Downsampling of some sort to keep the number of points ~1000, plus placing points for all outliers on the same plot.
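One possible shape for this, as a sketch only. The issue does not say how outliers would be defined, so the Tukey fence rule (1.5 × IQR) used here is an assumption, as are the names `tukey_bounds` and `downsample_scatter` and the use of random sampling.

```python
import random
from statistics import quantiles


def tukey_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def downsample_scatter(points, max_points=1000, seed=0):
    """points: dict of sample name -> (x, y). Returns (kept, outliers).

    Hypothetical helper: outliers are never dropped, the remaining
    points are randomly sampled down to max_points.
    """
    x_lo, x_hi = tukey_bounds([x for x, _ in points.values()])
    y_lo, y_hi = tukey_bounds([y for _, y in points.values()])
    outliers = {
        name: (x, y)
        for name, (x, y) in points.items()
        if not (x_lo <= x <= x_hi and y_lo <= y <= y_hi)
    }
    regular = [name for name in points if name not in outliers]
    if len(regular) > max_points:
        # Fixed seed so the same report is produced on every run.
        regular = random.Random(seed).sample(regular, max_points)
    kept = {name: points[name] for name in regular}
    return kept, outliers
```

Both returned dicts would be drawn on the same plot, with the outliers keeping their sample labels.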

❓ Heatmap

Downsampling for the main heatmap, plus showing outlier pairs (as a list?)


ewels commented Jan 26, 2024

I'm slightly less worried about Scatter + Heatmaps, as both are only a single value / pair of values per sample, unlike line plots which may have hundreds or thousands of data points per sample. I also feel that, without individual sample labels, both scale relatively well visually. I think that with flat image equivalents for very high sample numbers, both are relatively safe. Long term it would be nice to think of a solution that gives interactive plots still I guess.

Line plots are the priority for me due to data size and legibility. Then bar plots because the individual samples become unreadable at scale.

@ewels ewels pinned this issue Feb 9, 2024

ewels commented Feb 11, 2024

Copying over idea from upcoming blog post for line graphs:

[Image: line_plot_styles]

Potential approaches for line plots with large sample numbers. (A) Current output with 960 samples. Overlapping lines hide distribution. (B) Instead of showing individual samples, show median (dotted line) and range (shading). Plot can be interactive in report. This example is not accurate, created manually in Adobe Illustrator. (C) Keep all lines but drop opacity to 5%, showing density of overplotting. Interactive plots in reports are not possible.

I must confess, I'm starting to come around more to the third option (C). It means we get no interactivity, but it is true to the data and doesn't "hide" anything. It does mean that we can't do anything with outliers though, and they will effectively disappear at high sample numbers (assuming opacity drops proportionately with sample number). But I wonder if we don't really care about a handful of outliers when we have thousands of samples... 🤔
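As a sketch of what "opacity drops proportionately with sample number" could mean: the function name `line_opacity`, the clamping limits, and the constant 48 (picked only so that 960 samples gives the 5% from panel C) are all assumptions.

```python
def line_opacity(n_samples, target=48.0, min_alpha=0.01, max_alpha=1.0):
    """Per-line opacity that falls in proportion to the number of samples.

    Hypothetical helper. target / n_samples is clamped to
    [min_alpha, max_alpha] so lines never vanish completely.
    """
    alpha = target / max(n_samples, 1)
    return max(min_alpha, min(max_alpha, alpha))
```

With these assumed defaults, anything above 4800 samples sits at the 1% floor, which is roughly where a lone outlier line stops being visible.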

@Joon-Klaps

> I must confess, I'm starting to come around more to the third option (C). It means we get no interactivity, but it is true to the data and doesn't "hide" anything. It does mean that we can't do anything with outliers though, and they will effectively disappear at high sample numbers (assuming opacity drops proportionately with sample number). But I wonder if we don't really care about a handful of outliers when we have thousands of samples... 🤔

I see where you're coming from with option C, but from my lab's perspective, those outliers can be really important and are the first thing we look for. We're all about optimizing protocols, and understanding why 4 out of 300 samples/conditions don't fit the norm is key for us. So option B seems more interesting to me, as it keeps those outliers in view while still representing the global trends nicely.
