Scaling plots #2280

vladsavelyev · 2024-01-24T17:05:51Z

A MultiQC HTML report should load and remain useful to the reader regardless of the number of samples it was generated with. We should explore alternative representations of each plot type that handle large datasets efficiently. Such representations should:

Show the overall distribution of the underlying data (by some sort of downsampling following sorting/clustering).
Show detail for any extreme outliers that might be of interest.

✅ Table

An alternative representation of a table is a violin plot. When the number of samples exceeds config.max_table_rows=500, MultiQC builds a violin plot instead of a table. The violin shows the distribution, and outliers are shown as hoverable scatter points placed on top of the violin. This scales with both the number of columns and the number of rows.
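The switch described above can be sketched as a simple threshold check. This is only an illustration: the function name `choose_table_representation` is hypothetical, and the only value taken from this issue is the `max_table_rows` default of 500.

```python
def choose_table_representation(n_samples: int, max_table_rows: int = 500) -> str:
    """Pick a representation for tabular data.

    Hypothetical helper: returns "violin" only when the sample count
    exceeds the threshold, mirroring the rule described in the issue.
    """
    return "violin" if n_samples > max_table_rows else "table"


if __name__ == "__main__":
    for n in (10, 500, 501, 5000):
        print(n, choose_table_representation(n))
```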

✅ Violin plot

When the number of samples exceeds 2000, violins are downsampled by taking every Nth sample (sorted by value). Outliers are shown as hoverable scatter points placed on top of the violin (if the number of samples is <=50, all dots are shown).
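A minimal sketch of this scheme, using only the standard library. The function names are hypothetical, and the outlier rule (Tukey fences at 1.5 x IQR) is an assumption made for illustration; the issue does not say how outliers are detected.

```python
import math
import statistics


def downsample_sorted(values, max_points=2000):
    """Sort the values and keep every Nth one so at most max_points remain."""
    ordered = sorted(values)
    if len(ordered) <= max_points:
        return ordered
    step = math.ceil(len(ordered) / max_points)
    return ordered[::step]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def scatter_points(values, show_all_up_to=50):
    """Dots drawn on top of the violin: everything for small inputs, else outliers."""
    if len(values) <= show_all_up_to:
        return list(values)
    return iqr_outliers(values)
```

Taking every Nth value of the sorted list keeps the quantiles of the distribution roughly intact, which is what the violin shape depends on.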

❓ Line plot

The proposed alternative representation is a single line built from the median or mean values of all samples at a given data point. "Confidence intervals" are added as additional dotted lines to show percentiles and/or max and min values.

Additionally, a violin plot is added, built from the mean/median values of all data points for each sample. The violin has hoverable dots for outliers; clicking on one should show a standard line for that specific sample.
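A rough sketch of how the two summaries for this proposal could be computed. It assumes every sample is measured on the same x grid; the function names and the nearest-rank percentile rule are illustrative choices, not MultiQC code.

```python
import math
import statistics


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (assumed definition)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_lines(samples, lower_pct=5, upper_pct=95):
    """Collapse many lines into one summary per x position.

    samples maps a sample name to its list of y values. Each returned entry
    holds the median plus the bands that would be drawn as dotted lines.
    """
    columns = zip(*samples.values())  # one tuple of y values per x position
    summary = []
    for column in columns:
        ordered = sorted(column)
        summary.append({
            "median": statistics.median(ordered),
            "lower": nearest_rank(ordered, lower_pct),
            "upper": nearest_rank(ordered, upper_pct),
            "min": ordered[0],
            "max": ordered[-1],
        })
    return summary


def per_sample_means(samples):
    """One value per sample, as input for the companion violin plot."""
    return {name: statistics.fmean(ys) for name, ys in samples.items()}
```

The per-sample means can then go through the same outlier detection as any other violin, which is what would make individual lines reachable by clicking a dot.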

❓ Bar plot

One idea is to perform clustering, followed by downsampling, to keep the maximum number of bars around 1000 for the main barplot. An additional bar plot is placed for the outliers.
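As a rough illustration of the "order, then downsample" idea, the sketch below sorts samples by their category fractions and keeps every Nth one. The sort is a crude stand-in for real clustering, chosen only to keep the example dependency-free. The function name is hypothetical, it assumes every sample has the same categories, and the separate outlier bar plot is not covered.

```python
import math


def downsample_bars(samples, max_bars=1000):
    """Keep at most max_bars samples from a stacked bar plot dataset.

    samples maps a sample name to a dict of category -> count.
    """
    def fractions(item):
        _, counts = item
        total = sum(counts.values()) or 1
        # Fixed category order so that the tuples are comparable across samples.
        return tuple(counts[cat] / total for cat in sorted(counts))

    # Sorting by fractions places similar-looking bars next to each other
    # (an assumption standing in for hierarchical clustering).
    ordered = sorted(samples.items(), key=fractions)
    step = max(1, math.ceil(len(ordered) / max_bars))
    return dict(ordered[::step])
```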

❓ Box plot

❓ Scatter plot

Downsampling of some sort to keep the number of points ~1000, plus placing points for all outliers on the same plot.
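One possible shape for this, as a sketch only: flag points outside Tukey fences on either axis as outliers, then randomly sample the rest down to the target. Both the outlier rule and the use of random sampling are assumptions; the issue leaves the method open.

```python
import random
import statistics


def tukey_fences(values, k=1.5):
    """Return (low, high) bounds at k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def downsample_scatter(points, max_points=1000, seed=0):
    """Split (x, y) points into a random subset to draw and the outliers.

    All outliers are kept so they can be drawn on the same plot.
    """
    if len(points) <= max_points:
        return list(points), []
    x_low, x_high = tukey_fences([p[0] for p in points])
    y_low, y_high = tukey_fences([p[1] for p in points])
    regular, outliers = [], []
    for p in points:
        inside = x_low <= p[0] <= x_high and y_low <= p[1] <= y_high
        (regular if inside else outliers).append(p)
    rng = random.Random(seed)  # fixed seed keeps reports reproducible
    kept = rng.sample(regular, min(max_points, len(regular)))
    return kept, outliers
```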

❓ Heatmap

Downsampling for the main heatmap, plus showing outlier pairs (as a list?)


ewels · 2024-01-26T14:43:04Z

I'm slightly less worried about Scatter + Heatmaps, as both are only a single value / pair of values per sample, unlike line plots which may have hundreds or thousands of data points per sample. I also feel that, without individual sample labels, both scale relatively well visually. I think that with flat image equivalents for very high sample numbers, both are relatively safe. Long term it would be nice to think of a solution that gives interactive plots still I guess.

Line plots are the priority for me due to data size and legibility. Then bar plots because the individual samples become unreadable at scale.

ewels · 2024-02-11T19:40:21Z

Copying over idea from upcoming blog post for line graphs:

Potential approaches for line plots with large sample numbers. (A) Current output with 960 samples. Overlapping lines hide distribution. (B) Instead of showing individual samples, show median (dotted line) and range (shading). Plot can be interactive in report. This example is not accurate, created manually in Adobe Illustrator. (C) Keep all lines but drop opacity to 5%, showing density of overplotting. Interactive plots in reports are not possible.

I must confess, I'm starting to come around more to the third option (C). It means we get no interactivity, but it is true to the data and doesn't "hide" anything. It does mean that we can't do anything with outliers though, and they will effectively disappear at high sample numbers (assuming opacity drops proportionately with sample number). But I wonder if we don't really care about a handful of outliers when we have thousands of samples... 🤔
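To make the "opacity drops proportionately with sample number" assumption concrete, here is a minimal sketch. The function name and both parameters are hypothetical; the floor of 0.05 simply echoes the 5% used in example (C).

```python
def line_opacity(n_samples, full_opacity_up_to=50, min_opacity=0.05):
    """Per-line opacity that falls off inversely with the number of samples.

    Lines stay fully opaque up to full_opacity_up_to samples, then fade in
    proportion to 1 / n_samples, never going below min_opacity.
    """
    if n_samples <= full_opacity_up_to:
        return 1.0
    return max(min_opacity, full_opacity_up_to / n_samples)
```

With these assumed defaults, 960 samples give an opacity of about 0.052, so a single outlying line is drawn at roughly 5% and becomes hard to see, which is the trade-off described above.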

Joon-Klaps · 2024-04-04T09:05:16Z

> I must confess, I'm starting to come around more to the third option (C). It means we get no interactivity, but it is true to the data and doesn't "hide" anything. It does mean that we can't do anything with outliers though, and they will effectively disappear at high sample numbers (assuming opacity drops proportionately with sample number). But I wonder if we don't really care about a handful of outliers when we have thousands of samples... 🤔

I see where you're coming from with option C, but from my lab's perspective, those outliers can be really important and are the first thing we look for. We're all about optimizing protocols, and understanding why 4 out of 300 samples/conditions don't fit the norm is key for us. So option B seems more interesting to me, as it keeps those outliers in view while still representing the global trends well.

vladsavelyev added core: back end core: front end labels Jan 24, 2024

vladsavelyev added this to the MultiQC v1.21 milestone Jan 24, 2024

ewels pinned this issue Feb 9, 2024

This was referenced Feb 12, 2024

Plot both lines and point on the same graph #2218

Closed

Plot confidence area around a linegraph #2214

Open

vladsavelyev modified the milestones: MultiQC v1.21: Versions API, MultiQC v1.22 Feb 19, 2024

vladsavelyev mentioned this issue Feb 22, 2024

Add box plot #2358

Merged

7 tasks

vladsavelyev mentioned this issue Apr 29, 2024

Search file blocks rather than individual lines for faster results #2513

Merged

1 task

ewels modified the milestones: MultiQC v1.22: Pydantic, MultiQC v1.23 May 3, 2024

vladsavelyev mentioned this issue May 17, 2024

Scaling line plots #2574

Draft
