Reusable data visualization

Radovan Bast (fosstodon.org/@radovan)

UiT The Arctic University of Norway

Text: CC-BY 4.0

About me

Theoretical chemist turned research software engineer.
I write research software, teach programming to researchers, and lead the CodeRefinery project.
I lead the high-performance computing group and the research software engineering group at UiT.

CodeRefinery

We teach all the essential tools which are usually skipped in academic education so everyone can make full use of software, computing, and data.

https://coderefinery.org
https://coderefinery.org/workshops/past/

Goals for this course/lesson

Our focus

Data visualization for .emph[publications and presentations] within and outside academia
.emph[Practical] recommendations
.emph[Reproducibility] for you and others
Know which tools exist -> .emph[good starting points]

What I will not focus on

Programming languages and technical details of tools
Data visualization for the general public (newspapers, television)

.quote["One thing I have learned over the years is that automation is your friend. I think figures should be autogenerated as part of the data analysis pipeline (which should also be automated), and they should come out of the pipeline ready to be sent to the printer, no manual post-processing needed."]

2 take-home messages

Prefer tools that can be automated/scripted

If data or requirements change, somebody will have to update figures.
Automation makes it a bit easier.

Optimize for comprehension and accessibility

So that we don't have to study the plot for 20 minutes with eyes hurting to get the message.
Font size, colors, suitable representation, good title, and caption.

Why visualize data?

Anscombe's quartet

All four plots have the .emph[same] mean of x and y, sample variance of x and y, correlation between x and y, linear regression line, and R^2 coefficient.
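To check this claim numerically, here is a minimal sketch using only the Python standard library. The data values are those commonly published for Anscombe's quartet (they are not part of these slides), and the helper names `pearson` and `summarize` are made up for this example.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet, values as commonly published (assumption: not from the slides).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (
        [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ),
}


def pearson(x, y):
    """Pearson correlation coefficient, written out to avoid extra dependencies."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def summarize(x, y):
    """Summary statistics that hide how different the four datasets look."""
    return {
        "mean_x": mean(x),
        "mean_y": mean(y),
        "var_x": variance(x),  # sample variance
        "var_y": variance(y),
        "r": pearson(x, y),
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: f"{value:.2f}" for key, value in stats.items()})
```

The printed numbers agree closely across the four datasets, which is why the differences only become visible once the data are plotted.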

Same Stats, Different Graphs

.cite[J. Matejka, G. Fitzmaurice, "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing"]

How many 5s?

464418163541729611394089491019

103214981928889407852268902875

389879353920237244649469321810

290602004777144868218046078720

522890797338149835404330684291

.cite[Inspired by https://courses.cs.washington.edu/courses/cse512/23sp/, in turn inspired by J. Stasko]

How many 5s?

464418163.red[5]41729611394089491019

1032149819288894078.red[5]226890287.red[5]

3898793.red[5]3920237244649469321810

290602004777144868218046078720

.red[5]2289079733814983.red[5]404330684291

.cite[Inspired by https://courses.cs.washington.edu/courses/cse512/23sp/, in turn inspired by J. Stasko]

Data visualization is a

"Visual representation and presentation of data to facilitate understanding"

Data visualizations map .emph[data values] onto .emph[aesthetics/channels]

position
length
shape
size
color
line width
line type
(there exist many more)
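
As a rough illustration of such a mapping, the sketch below writes a Vega-Lite style chart specification as a plain Python dictionary: each entry under `encoding` ties one data column to one channel. The data values and column names (`temperature`, `sightings`, `species`, `site`) are invented for this example, and the specification is only printed here; rendering it would need a Vega-Lite renderer such as Vega-Altair.

```python
import json

# Hypothetical observations; values and column names are made up for illustration.
data = [
    {"site": "A", "temperature": -2.5, "sightings": 3, "species": "arctic fox"},
    {"site": "B", "temperature": -1.0, "sightings": 10, "species": "reindeer"},
    {"site": "C", "temperature": 0.5, "sightings": 2, "species": "seal"},
]

# Declarative specification: data columns are mapped onto visual channels.
spec = {
    "data": {"values": data},
    "mark": "point",
    "encoding": {
        "x": {"field": "temperature", "type": "quantitative"},  # position
        "y": {"field": "sightings", "type": "quantitative"},  # position
        "color": {"field": "species", "type": "nominal"},  # color
        "shape": {"field": "site", "type": "nominal"},  # shape
    },
}

print(json.dumps(spec, indent=2))
```

Changing which column feeds which channel is a one-line edit, which is part of what makes declarative tools convenient for exploring alternative representations.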

Why visualize data?

More insight into data: easier to see patterns and problems

Both calculations and graphs will contribute to understanding

Communicating insight

Presentations/papers: facilitate understanding
Communication with the public

.quote[reflect on how important and powerful data visualization is: COVID-19, politics, climate change, ...]

Because others do it or tell us to

And we often copy the style and culture

How do you read a paper?

How do you read posters during a poster session?

(reflect about the value of a good visualization)

What is your design process?

How I design plots

Sometimes: Sketch with pen and paper
Browse directories/galleries for inspiration: Vega-Altair, Matplotlib, Seaborn, Plotly, Bokeh, ggplot, PyNGL, K3D, ggplot2, Shiny, Data-Driven Documents, ...
Take an example that is close to what I want
Try to rerun it with original example data
Try to replace example data with my own data
Tweak and refine

Checklist for good visual communication

[This list is adapted from a similar list in a presentation by L. Garrison, "Share Your Science: Visualization for Communication"]

Define your goals
Show the data (go beyond summary statistics)
Be honest with your visuals
Consider accessibility
Avoid taxing working memory
Tell a story
Reflect on uncertainty and unknowns

Define your goals

"Before you start, define your goals in 1-3 sentences" .cite[L. Garrison, "Share Your Science: Visualization for Communication"]
Audience?
Time constraints

Show the data: strip-plot vs box-plot vs violin-plot

.cite[J. Matejka, G. Fitzmaurice, "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing"]

Be honest with your visuals

The principle of "proportional ink"

Examples with disproportional data/ink ratio:
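
A small arithmetic sketch of why a truncated axis breaks proportional ink: the ratio of the drawn bar heights no longer matches the ratio of the values. The function name `bar_height_ratio` and the two values are hypothetical and chosen only for illustration.

```python
def bar_height_ratio(a, b, axis_start=0.0):
    """Ratio of drawn bar heights for values a and b when the axis starts at axis_start."""
    if axis_start >= min(a, b):
        raise ValueError("axis_start must be below both values")
    return (b - axis_start) / (a - axis_start)


# Hypothetical values, chosen only for illustration.
a, b = 98.0, 100.0

true_ratio = b / a
honest = bar_height_ratio(a, b, axis_start=0.0)  # bars start at zero
truncated = bar_height_ratio(a, b, axis_start=95.0)  # axis cut off at 95

print(f"ratio of the values:            {true_ratio:.2f}")
print(f"bar ratio, axis starting at 0:  {honest:.2f}")
print(f"bar ratio, axis starting at 95: {truncated:.2f}")
```

With the axis starting at zero the amount of ink is proportional to the data. With the truncated axis a difference of about 2% is drawn as a bar that is roughly two thirds taller.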

Be honest with your visuals

Another bad example

Accessibility: Avoid 3D plots (unless it's a 3D object)

... unless you are plotting something inherently 3D (molecular structures, structure of an enzyme, a 3D relief of a terrain)

Accessibility: Colors

"We need five colors for the plot: black ... red ... green ... blue ... ... ... orange?"

Colors

Consider color vision deficiencies (CVD)

4% of the population is affected
View your color figures under CVD simulations
Use color scales designed to be CVD-friendly

Color scales: 3 types

.emph[Discrete/qualitative] color scales: designed to distinguish

.cite[Okabe, M., and K. Ito. 2008. "Color Universal Design (CUD): How to Make Figures and Presentations That Are Friendly to Colorblind People."]

.emph[Sequential/continuous] color scales: represent data values

.emph[Diverging] color scales: visualize deviation of data values relative to a neutral midpoint .cite[ColorBrewer pink to yellow-green]

Discrete/qualitative color scales: designed to distinguish

Great for scatter-plots.
What if you need more than 8 colors? Use direct labeling instead.
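
A minimal sketch of how a qualitative palette could be assigned in a script. The hex codes are the values commonly quoted for the Okabe-Ito palette (they are not listed in these slides), and `assign_colors` is a hypothetical helper that refuses to invent a ninth color.

```python
# Okabe-Ito palette; hex values as commonly quoted (assumption: not from the slides).
OKABE_ITO = {
    "black": "#000000",
    "orange": "#E69F00",
    "sky blue": "#56B4E9",
    "bluish green": "#009E73",
    "yellow": "#F0E442",
    "blue": "#0072B2",
    "vermilion": "#D55E00",
    "reddish purple": "#CC79A7",
}


def assign_colors(categories):
    """Map each distinct category to one palette color, in order of first appearance."""
    unique = list(dict.fromkeys(categories))  # removes duplicates, keeps order
    if len(unique) > len(OKABE_ITO):
        raise ValueError(
            f"{len(unique)} categories but only {len(OKABE_ITO)} colors: "
            "consider direct labeling instead"
        )
    return dict(zip(unique, OKABE_ITO.values()))


species = ["arctic fox", "walrus", "reindeer", "polar bear", "seal", "walrus"]
colors = assign_colors(species)
for name, hex_code in colors.items():
    print(f"{name}: {hex_code}")
```

Keeping the assignment in one place means every figure in a paper or talk uses the same color for the same category.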

Sequential/continuous color scales: represent data values

Great for choropleth plots (here plotting unemployment rate).
Color vision deficiencies are less of a concern for this type.
Avoid rainbow scales.

Diverging color scales: visualize deviation of data values relative to a neutral midpoint

Great for heatmaps.

Colors

Great resources

https://clauswilke.com/dataviz/color-pitfalls.html
https://blog.datawrapper.de/beautifulcolors/
Okabe, M., and K. Ito. 2008. "Color Universal Design (CUD): How to Make Figures and Presentations That Are Friendly to Colorblind People."
https://seaborn.pydata.org/tutorial/color_palettes.html
https://colorbrewer2.org/
https://www.fabiocrameri.ch/colourmaps/
https://venngage.com/tools/accessible-color-palette-generator

Problematic plots

Examples 1 to 12 (figures not reproduced in this text version)

Tell a story

Minard's Visualization Of Napoleon's 1812 March

Another great example: 1854 Broad Street cholera outbreak

There is a story in here: can you improve the text?

Reproducibility and FAIR principles

FAIR: Which problems can you anticipate?

Findable

Accessible

Interoperable

Reusable

Data formats

What problems can arise when storing data like this?

.emph[Format]: Limited interoperability with other programs
.emph[Error prone] (see e.g. this famous example)
Difficult to parse ("understand") by scripts: .emph[difficult to automate]
Not in tidy format (more about this later): .emph[difficult to extend/modify]

How should we arrange the data?


How can these 3 examples be problematic for .emph[automated data visualization]?

In the compact structure we need to split at the comma
If we add more species or more observation sites, we need to adapt the visualization pipeline

"Tidy data"

Columns are variables
Rows are observations/measurements
"Long form"
Order does not matter
.emph[Easy to extend] with more species and more sites
.emph[Structure for storing data] - this does not mean that this is ideal for tables in presentations or publications
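
As a sketch of what "long form" means in practice, the snippet below converts a "wide" table (one column per observation site) into tidy rows using only the standard library. The wide layout is a hypothetical example, and `wide_to_long` is a made-up helper name.

```python
import csv
import io

# Hypothetical "wide" table: one column per site, an empty cell means no record.
WIDE = """Species,A,B,C
arctic fox,3,1,
walrus,,1,1
reindeer,,10,1
polar bear,1,,1
seal,2,1,2
"""


def wide_to_long(text, id_column="Species"):
    """Turn one row per species into one row per (species, site) observation."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        species = record.pop(id_column)
        for site, count in record.items():
            if count != "":  # skip empty cells instead of inventing zeros
                rows.append(
                    {
                        id_column: species,
                        "Observation site": site,
                        "Number of sightings": int(count),
                    }
                )
    return rows


long_rows = wide_to_long(WIDE)
for row in long_rows:
    print(row)
```

Adding a new site to the long form is just more rows. In the wide form it would be a new column, and every script reading the table would need to know about it.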

Standard data formats

Comma-separated values (CSV)

Species,Observation site,Number of sightings
arctic fox,A,3
arctic fox,B,1
walrus,B,1
walrus,C,1
reindeer,B,10
reindeer,C,1
polar bear,A,1
polar bear,C,1
seal,A,2
seal,B,1
seal,C,2

CSV is often a good choice
Most visualization tools can read CSV data
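
Reading such a file from a script takes only a few lines. The sketch below parses the CSV data shown above with Python's standard `csv` module and sums the sightings per observation site. The data are embedded as a string so the example is self-contained; in practice they would live in a file.

```python
import csv
import io
from collections import defaultdict

CSV_TEXT = """Species,Observation site,Number of sightings
arctic fox,A,3
arctic fox,B,1
walrus,B,1
walrus,C,1
reindeer,B,10
reindeer,C,1
polar bear,A,1
polar bear,C,1
seal,A,2
seal,B,1
seal,C,2
"""

# Each row becomes a dictionary keyed by the column names in the header line.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

sightings_per_site = defaultdict(int)
for row in rows:
    sightings_per_site[row["Observation site"]] += int(row["Number of sightings"])

for site, total in sorted(sightings_per_site.items()):
    print(site, total)
```

Because the data are in tidy form, grouping by species instead of site only requires changing the column name used as the key.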

There are many more formats

JSON
XML
GeoJSON
NPY (NumPy arrays)
HDF5
SQL
Many domain-specific formats (such as NetCDF)
.emph[Use standard formats, don't invent your own]

Data cleaning

Often we want to visualize data sets with inconsistent or missing entries:

Date,Organization,Number of participants
2020-09-27,UiT,20
Oct 10 2020,UiT Norges arktiske universitet,15
"Nov. 11, 2020",UiT The Arctic University of Norway,40
2020-12-12,UiT The Arctic University of Norway,-

Data cleaning is a bit outside the scope of this course, but it is still good to know:

There are tools to clean and merge inconsistent data sets (e.g. OpenRefine, see also this Data Carpentry lesson)
This does not have to be done manually
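
For a table as small as the one above, a short script can also do the job. This is a minimal sketch, not a general solution: the list of date formats and the set of name aliases are assumptions that cover only the four example rows, and the month abbreviations assume the default English ("C") locale.

```python
import csv
import io
from datetime import datetime

RAW = """Date,Organization,Number of participants
2020-09-27,UiT,20
Oct 10 2020,UiT Norges arktiske universitet,15
"Nov. 11, 2020",UiT The Arctic University of Norway,40
2020-12-12,UiT The Arctic University of Norway,-
"""

# Assumptions: only these date formats and these name variants occur.
DATE_FORMATS = ("%Y-%m-%d", "%b %d %Y", "%b. %d, %Y")
CANONICAL_NAME = "UiT The Arctic University of Norway"
ALIASES = {"UiT", "UiT Norges arktiske universitet", CANONICAL_NAME}


def parse_date(text):
    """Try each known format in turn and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def clean(text):
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row["Organization"]
        count = row["Number of participants"]
        cleaned.append(
            {
                "Date": parse_date(row["Date"]),
                "Organization": CANONICAL_NAME if name in ALIASES else name,
                # "-" marks a missing value; keep it missing rather than guess
                "Number of participants": int(count) if count.isdigit() else None,
            }
        )
    return cleaned


cleaned_rows = clean(RAW)
for row in cleaned_rows:
    print(row)
```

Unknown date formats raise an error instead of being silently skipped, so new inconsistencies surface when the data change.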

Choosing the right tools

Choosing the right tools: scriptable

There is no single perfect language and no single perfect library for everything

You will have to choose what fits you and your group best
We will show examples using .emph[Python, R, and JavaScript]

No manual post-processing

Manual post-processing will bite you when you need to regenerate 50 figures one day before a submission deadline, or regenerate a set of figures after the person who created them has left the group.
Use software that can be scripted: batch processing and reproducibility (more about that in the next section).
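
A sketch of what batch processing can look like: one function turns a data file into a figure, and a loop applies it to every file in a directory. To stay dependency-free the example writes very simple SVG bar charts by hand; in practice the plotting function would call a library such as Matplotlib or Vega-Altair. The column names `label` and `value`, the function names, and the directory names are assumptions for this example.

```python
import csv
from pathlib import Path
from xml.sax.saxutils import escape


def bar_chart_svg(labels, values, width=400, bar_height=20, gap=5, label_width=100):
    """Return a minimal horizontal bar chart as an SVG string (bars start at zero)."""
    largest = max(values, default=0)
    scale = (width - label_width) / largest if largest > 0 else 0
    height = len(values) * (bar_height + gap)
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for i, (label, value) in enumerate(zip(labels, values)):
        y = i * (bar_height + gap)
        parts.append(f'<text x="0" y="{y + bar_height - 5}" font-size="12">{escape(label)}</text>')
        parts.append(
            f'<rect x="{label_width}" y="{y}" width="{value * scale:.1f}" '
            f'height="{bar_height}" fill="#0072B2"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


def regenerate_figures(data_dir, out_dir):
    """Create one SVG per CSV file in data_dir; return the paths that were written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        with csv_path.open(newline="") as handle:
            rows = list(csv.DictReader(handle))
        labels = [row["label"] for row in rows]  # assumed column name
        values = [float(row["value"]) for row in rows]  # assumed column name
        svg_path = out_dir / f"{csv_path.stem}.svg"
        svg_path.write_text(bar_chart_svg(labels, values))
        written.append(svg_path)
    return written
```

Calling `regenerate_figures("data", "figures")` would then rebuild every figure in one step, whether there are 5 or 50 of them.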

Choosing the right tools: free

Use free software and free tools

Even if the university pays for a license, what happens after you leave the university or after they stop paying? How can other groups build on your work?
.emph[Python and R are free], and popular for "notebook"-based pipelines, but a number of .emph[JavaScript frameworks] also exist, especially for maps.
Plain text files for small datasets.
Standard formats instead of proprietary formats.

For any academic discipline it will be a good investment to learn a bit of Python or R if you want to do data visualization

Visualization libraries (incomplete list)

Two main families: procedural (e.g. Matplotlib) and declarative.

Python

Vega-Altair: declarative visualization
Matplotlib: MATLAB users will be at home
Seaborn: statistical functions built in
Plotly: interactive graphs
Bokeh: also good for interactivity
ggplot: R users will be more at home
PyNGL: used in the weather forecast community
K3D: Jupyter notebook extension for 3D visualization

R

ggplot2: system for declaratively creating graphics, based on the grammar of graphics
Shiny: interactive graphs and notebooks

JavaScript

Data-Driven Documents

Data visualization using Python

https://coderefinery.github.io/data-visualization-python/

(co-created by the author of these slides)

Reproducible and reusable plots

.cite[Juliette Taka, Logilab and the OpenDreamKit project (2017), https://opendreamkit.org/2017/11/02/use-case-publishing-reproducible-notebooks/]

.emph[Demo]: visualization pipeline on Binder

Python/Altair on Jupyter served via Binder: https://github.com/bast/jupyter-binder-example
R/ggplot2 on RStudio/R Markdown served via Binder: https://github.com/bast/rstudio-binder-example

Other fantastic tools which I will not demonstrate

Data-Driven Documents with gallery of examples
Interactive plots with Shiny

Zenodo can give you a persistent identifier (DOI) and make your pipeline citable

Rather than specifying a GitHub repository when launching Binder, you can instead use a Zenodo DOI.
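
As a rough sketch, the two kinds of launch links can be built like this. The URL patterns are written from memory of how mybinder.org links usually look and should be checked against the form on mybinder.org; the DOI below is a made-up placeholder, not a real record.

```python
def binder_url_github(user, repo, ref="HEAD"):
    """Launch link for a GitHub repository (assumed mybinder.org URL pattern)."""
    return f"https://mybinder.org/v2/gh/{user}/{repo}/{ref}"


def binder_url_zenodo(doi):
    """Launch link for a Zenodo record identified by its DOI (assumed URL pattern)."""
    return f"https://mybinder.org/v2/zenodo/{doi}/"


# Repository from the demo above; the DOI is a hypothetical placeholder.
print(binder_url_github("bast", "jupyter-binder-example"))
print(binder_url_zenodo("10.5281/zenodo.1234567"))
```

The GitHub link follows the repository as it changes, while the DOI link points at one archived snapshot, which is what makes it suitable for citing.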

Progression

Start with a working example and try adapting it
Learn the very basics
- Learn a bit of Python
- Or R
It can be a good idea to start learning right away in a notebook
- Python: Jupyter
- R: R Markdown in RStudio
- Quarto
Later try Binder
Later learn how to get a DOI for your Binder
Now your plotting recipe can be cited and is reproducible

This takes time and it is OK to take time

.quote[If I had six hours to chop down a tree, I’d spend the first four hours sharpening the axe.] .cite[attributed to Abraham Lincoln]

Summary

Don't forget to tell a story
FAIR principles and reproducibility will be good for you (and for others)
Document all tools and dependencies used .emph[with versions]
Prefer .emph[free tools]
"Data visualization clinic" next week
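
On documenting tools with versions: a minimal sketch of how a script can record the versions it ran with, using the standard library's `importlib.metadata`. The package names in the example call are only illustrative; list whatever your pipeline actually imports.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Return the Python version and the installed version of each named package."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Illustrative package names; replace with the ones your figures depend on.
for name, version in collect_versions(["altair", "pandas"]).items():
    print(f"{name}=={version}")
```

Saving this output next to the generated figures makes it easier to rebuild the same environment later.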

Books

"Fundamentals of Data Visualization", C. O. Wilke
"Data Visualization: A practical introduction", K. Healy
"Data Visualisation: A Handbook for Data Driven Design", A. Kirk

Papers

N. P. Rougier, M. Droettboom, P. E. Bourne, "Ten Simple Rules for Better Figures", PLoS Comput Biol 10(9): e1003833 (2014)

Courses/talks

https://coderefinery.github.io/data-visualization-python/
https://courses.cs.washington.edu/courses/cse512/23sp/
https://swcarpentry.github.io/visualization-novice/
https://www.ub.uio.no/english/courses-events/events/all-libraries/2020/research-bazaar/visualisation.html
https://ajstewartlang.github.io/SIPS_2019/SIPS_presentation.html
