class: center, middle, gray-background

Reusable data visualization

Radovan Bast (fosstodon.org/@radovan)

UiT The Arctic University of Norway

 

Text: CC-BY 4.0


About me

.left-column30[ ]

.right-column70[


CodeRefinery

We teach all the essential tools which are usually skipped in academic education so everyone can make full use of software, computing, and data.

.left-column50[

]

Goals for this course/lesson

Our focus

  • Data visualization for .emph[publications and presentations] within and outside academia

  • .emph[Practical] recommendations

  • .emph[Reproducibility] for you and others

  • Know which tools exist -> .emph[good starting points]

What I will not focus on

  • Programming languages and technical details of tools

  • Data visualization for the general public (newspapers, television)


.quote["One thing I have learned over the years is that automation is your friend. I think figures should be autogenerated as part of the data analysis pipeline (which should also be automated), and they should come out of the pipeline ready to be sent to the printer, no manual post-processing needed."]

.cite["Fundamentals of Data Visualization", C. O. Wilke]

twitter post

.cite[https://twitter.com/kara_woo/status/1134878080567091200]


2 take-home messages

Prefer tools that can be automated/scripted

  • If data or requirements change, somebody will have to update figures.

  • Automation makes it a bit easier.

Optimize for comprehension and accessibility

  • So that we don't have to study the plot for 20 minutes with eyes hurting to get the message.

  • Font size, colors, suitable representation, good title, and caption.


class: center, middle, inverse

Why visualize data?


Anscombe's quartet

.left-column60[ Anscombe's quartet ]

.right-column40[ All four plots have the .emph[same] mean of x and y, sample variance of x and y, correlation between x and y, linear regression line, and R^2 coefficient.

.cite[https://en.wikipedia.org/wiki/Anscombe%27s_quartet]

.cite[https://seaborn.pydata.org/examples/anscombes_quartet.html] ]
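The claim above can be checked numerically. The following is a minimal sketch using only the Python standard library; the data values are the published Anscombe (1973) numbers, while the helper name `pearson_r` and the `summary` dictionary are illustrative choices, not part of the original slides.

```python
from statistics import mean, variance

# Anscombe's quartet; datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Collect the summary statistics for each of the four datasets.
summary = {
    name: {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),
        "var_y": variance(ys),
        "r": pearson_r(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four printed rows agree to two decimals, yet the plots look nothing alike, which is why the figure and not the table carries the message.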


Same Stats, Different Graphs

gif cycling through different graphics with same stats

.cite[A. Cairo, "Datasaurus: Never trust summary statistics alone; always visualize your data"]

.cite[J. Matejka, G. Fitzmaurice, "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing"]


How many 5s?

464418163541729611394089491019

103214981928889407852268902875

389879353920237244649469321810

290602004777144868218046078720

522890797338149835404330684291

.cite[Inspired by https://courses.cs.washington.edu/courses/cse512/23sp/, in turn inspired by J. Stasko]


How many 5s?

464418163.red[5]41729611394089491019

1032149819288894078.red[5]226890287.red[5]

3898793.red[5]3920237244649469321810

290602004777144868218046078720

.red[5]2289079733814983.red[5]404330684291

.cite[Inspired by https://courses.cs.washington.edu/courses/cse512/23sp/, in turn inspired by J. Stasko]


Data visualization is a

"Visual representation and presentation of data to facilitate understanding"

.cite["Fundamentals of Data Visualization", C. O. Wilke]

Data visualizations map .emph[data values] onto .emph[aesthetics/channels]

  • position
  • length
  • shape
  • size
  • color
  • line width
  • line type
  • (there exist many more)

Why visualize data?

More insight into data: easier to see patterns and problems

  • Both calculations and graphs will contribute to understanding

.left-column50[

Communicating insight

  • Presentations/papers: facilitate understanding
  • Communication with the public

.quote[Reflect on how important and powerful data visualization is: COVID-19, politics, climate change, ...] ]

.right-column50[

Because others do it or tell us to

  • And we often copy the style and culture ]

class: center, middle, inverse

How do you read a paper?

How do you read posters during a poster session?

(reflect on the value of a good visualization)


class: center, middle, inverse

What is your design process?


How I design plots




Checklist for good visual communication

[This list is adapted from a similar list in a presentation by L. Garrison, "Share Your Science: Visualization for Communication"]

  • Define your goals

  • Show the data (go beyond summary statistics)

  • Be honest with your visuals

  • Consider accessibility

  • Avoid taxing working memory

  • Tell a story

  • Reflect on uncertainty and unknowns


Define your goals

  • "Before you start, define your goals in 1-3 sentences" .cite[L. Garrison, "Share Your Science: Visualization for Communication"]

  • Audience?

  • Time constraints


Show the data: strip-plot vs box-plot vs violin-plot

.cite[J. Matejka, G. Fitzmaurice, "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing"]


Be honest with your visuals

The principle of "proportional ink"

Examples with disproportional data/ink ratio:

figure with disproportional data/ink ratio

figure with disproportional data/ink ratio

.cite[Both figures from https://www.callingbullshit.org/tools/tools_proportional_ink.html]
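A small worked example of the proportional ink principle, as a sketch: for a bar chart, the drawn bar length should be proportional to the value it represents, which only holds when the axis starts at zero. The function name `ink_ratio` and the two values are hypothetical and chosen only for illustration.

```python
def ink_ratio(value_a, value_b, axis_start=0.0):
    """Ratio of the bar lengths (ink) drawn for two values on a given axis."""
    return (value_a - axis_start) / (value_b - axis_start)


# Hypothetical values, not taken from the figures above.
a, b = 55.0, 50.0

data_ratio = a / b                            # 1.1: a is 10% larger than b
honest = ink_ratio(a, b, axis_start=0.0)      # bars start at zero
truncated = ink_ratio(a, b, axis_start=45.0)  # axis cut off at 45

print(f"data ratio:          {data_ratio:.2f}")
print(f"ink ratio from 0:    {honest:.2f}")
print(f"ink ratio from 45:   {truncated:.2f}")
```

With the truncated axis, one bar is drawn twice as long as the other although the values differ by only 10%.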


Be honest with your visuals

Another bad example

figure with an axis range which is misleading

.cite[Citation needed]


Accessibility: Avoid 3D plots (unless it's a 3D object)

... unless you are plotting something inherently 3D (molecular structures, structure of an enzyme, a 3D relief of a terrain)

3d scatterplot

.cite[https://matplotlib.org/3.1.1/gallery/mplot3d/scatter3d.html]


class: center, middle, inverse

Accessibility: Colors

"We need five colors for the plot: black ... red ... green ... blue ... ... ... orange?"


Colors

Consider color vision deficiencies (CVD)

.left-column50[ ishihara color test plate ]

.right-column50[

  • 4% of the population is affected

  • View your color figures under CVD simulations

  • Use color scales designed to be CVD-friendly ]


Color scales: 3 types

  • .emph[Discrete/qualitative] color scales: designed to distinguish

okabe ito color scale

.cite[Okabe, M., and K. Ito. 2008. "Color Universal Design (CUD): How to Make Figures and Presentations That Are Friendly to Colorblind People."]

  • .emph[Sequential/continuous] color scales: represent data values

blues color scale

rocket color scale

  • .emph[Diverging] color scales: visualize deviation of data values relative to a neutral midpoint .cite[ColorBrewer pink to yellow-green]

divergent color scale


Discrete/qualitative color scales: designed to distinguish

.left-column50[ okabe ito color scale

  • Great for scatter-plots.

  • What if you need more than 8 colors? Use direct labeling instead.

.cite[Okabe, M., and K. Ito. 2008] ]

.right-column50[ scatter plot

.cite[https://seaborn.pydata.org/examples/multiple_regression.html] ]
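A sketch of how the Okabe-Ito scale could be used in a script: the hex codes below are the commonly published values for the eight colors, while the function `assign_colors` and its behavior for more than eight categories are assumptions made for this illustration.

```python
# Okabe-Ito qualitative palette (eight colors).
OKABE_ITO = {
    "orange": "#E69F00",
    "sky blue": "#56B4E9",
    "bluish green": "#009E73",
    "yellow": "#F0E442",
    "blue": "#0072B2",
    "vermilion": "#D55E00",
    "reddish purple": "#CC79A7",
    "black": "#000000",
}


def assign_colors(categories):
    """Map each distinct category to one palette color, in order of first appearance."""
    unique = list(dict.fromkeys(categories))  # de-duplicate, keep order
    if len(unique) > len(OKABE_ITO):
        # More categories than colors: colors stop being distinguishable.
        raise ValueError("more than 8 categories: use direct labeling instead")
    return dict(zip(unique, OKABE_ITO.values()))


colors = assign_colors(["walrus", "seal", "walrus", "reindeer"])
print(colors)
```

The resulting dictionary can be passed to whichever plotting library is in use; failing loudly above eight categories is one way to enforce the direct-labeling advice.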


Sequential/continuous color scales: represent data values

.left-column50[ blues color scale rocket color scale

  • Great for choropleth plots (here plotting unemployment rate).

  • Color vision deficiencies are less of a concern for this type.

  • Avoid rainbow scales. ]

.right-column50[ choropleth plot

.cite[https://altair-viz.github.io/gallery/choropleth.html] ]


Diverging color scales: visualize deviation of data values relative to a neutral midpoint

.left-column50[ divergent color scale

  • Great for heatmaps.

.cite[ColorBrewer pink to yellow-green] ]

.right-column50[ heatmap plot

.cite[https://seaborn.pydata.org/examples/many_pairwise_correlations.html] ]


Colors

Great resources


Categories

  • So that we know what to search for
  • Source of inspiration

Good overviews


class: center, middle, inverse

Problematic plots

See also: https://viz.wtf


Example 1

problematic plot

.cite[Figure from https://twitter.com/GraphCrimes]


Example 2

problematic plot

.cite[Figure from https://www.callingbullshit.org/tools/tools_proportional_ink.html]


Example 3

problematic plot

.cite[Figure from https://twitter.com/GraphCrimes]


Example 4

problematic plot

.cite[Figure from https://www.callingbullshit.org/tools/tools_proportional_ink.html]


Example 5

problematic plot

.cite[Figure from https://twitter.com/GraphCrimes]


Example 6

.left-column50[ problematic plot ]

.right-column50[ .cite[Figure from https://twitter.com/GraphCrimes] ]


Example 7

problematic plot

.cite[Figure from https://twitter.com/GraphCrimes]


Example 8

problematic plot

.cite[Figure from https://twitter.com/GraphCrimes]


Example 9

problematic plot

.cite[Figure from https://twitter.com/GraphCrimes]


Example 10

problematic plot

.cite[Example taken from "Fundamentals of Data Visualization", C. O. Wilke]


Example 11

problematic plot

.cite[Example taken from "Fundamentals of Data Visualization", C. O. Wilke]


Example 12

problematic plot

.cite[Example taken from "Fundamentals of Data Visualization", C. O. Wilke]


class: center, middle, inverse

Tell a story


Minard's Visualization Of Napoleon's 1812 March

Minard's Visualization Of Napoleon's 1812 March

.cite[https://www.edwardtufte.com/tufte/minard]


There is a story in here: can you improve the text?


class: center, middle, inverse

Reproducibility and FAIR principles


Reproducibility and FAIR principles

.cite[(c) Scriberia for The Turing Way, CC-BY]


.cite[Heidi Seibold, CC-BY 4.0, https://twitter.com/HeidiBaya/status/1579385587865649153]


FAIR: Which problems can you anticipate?

Findable

.quote["On which of my external hard-drives is my script?"]

Accessible

.quote["Can you please give me access to your plotting scripts?"]

Interoperable

.quote["How can I convert this file format?"]

Reusable

.quote["I wish I could reuse this for my new data!"]


class: center, middle, inverse

Data formats


What problems can arise when storing data like this?

storing data in a spreadsheet


What problems can arise when storing data like this?

storing data in a spreadsheet

  • .emph[Format]: Limited interoperability with other programs
  • .emph[Error prone] (see e.g. this famous example)
  • Difficult to parse ("understand") by scripts: .emph[difficult to automate]
  • Not in tidy format (more about this later): .emph[difficult to extend/modify]

How should we arrange the data?

.left-column50[ compact table

table wide format

table wide format transposed

]

--

.right-column40[ For the moment let us not focus on the tool, but on the .emph[data structure]

How can these 3 examples be problematic for .emph[automated data visualization]?

  • In the compact structure we need to divide at the comma
  • If we add more species or more observation sites, we need to adapt the visualization pipeline ]

"Tidy data"

.left-column40[ table tidy format ]

.right-column60[

  • Columns are variables

  • Rows are observations/measurements

  • "Long form"

  • Order does not matter

  • .emph[Easy to extend] with more species and more sites

  • .emph[Structure for storing data] - this does not mean that this is ideal for tables in presentations or publications

.cite[H. Wickham, "Tidy Data"] ]
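To make the difference concrete, here is a minimal sketch that reshapes a wide table into tidy (long) form with plain Python. The wide layout is a hypothetical reconstruction (one row per species, one column per observation site) that is consistent with the sightings data used in these slides; the function name `to_tidy` is also an assumption.

```python
# Hypothetical wide table: one column per observation site, None = no record.
wide = [
    {"Species": "arctic fox", "A": 3, "B": 1, "C": None},
    {"Species": "walrus", "A": None, "B": 1, "C": 1},
    {"Species": "reindeer", "A": None, "B": 10, "C": 1},
    {"Species": "polar bear", "A": 1, "B": None, "C": 1},
    {"Species": "seal", "A": 2, "B": 1, "C": 2},
]


def to_tidy(rows, id_column, var_name, value_name):
    """Turn every non-empty cell of a wide table into one observation row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column or value is None:
                continue
            tidy.append({id_column: row[id_column], var_name: column, value_name: value})
    return tidy


tidy = to_tidy(wide, "Species", "Observation site", "Number of sightings")
for observation in tidy:
    print(observation)
```

Adding a new site to the wide table means adding a column, and every script that lists the columns must change. In the tidy form it only adds rows.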


Standard data formats

.left-column50[

Comma-separated values (CSV)

Species,Observation site,Number of sightings
arctic fox,A,3
arctic fox,B,1
walrus,B,1
walrus,C,1
reindeer,B,10
reindeer,C,1
polar bear,A,1
polar bear,C,1
seal,A,2
seal,B,1
seal,C,2

  • CSV is often a good choice
  • Most visualization tools can read CSV data ]
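As an illustration of how little code is needed to read such a file, the following sketch parses the CSV shown above with Python's standard `csv` module and sums the sightings per species. The data is embedded as a string so the example is self-contained; the function name `sightings_per_species` is an assumption.

```python
import csv
import io
from collections import Counter

CSV_TEXT = """Species,Observation site,Number of sightings
arctic fox,A,3
arctic fox,B,1
walrus,B,1
walrus,C,1
reindeer,B,10
reindeer,C,1
polar bear,A,1
polar bear,C,1
seal,A,2
seal,B,1
seal,C,2
"""


def sightings_per_species(csv_text):
    """Sum the 'Number of sightings' column for each species."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["Species"]] += int(row["Number of sightings"])
    return dict(totals)


totals = sightings_per_species(CSV_TEXT)
print(totals)
```

For a file on disk, `open(path, newline="")` would take the place of `io.StringIO`.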

.right-column50[

There are many more formats


Data cleaning

  • Often we want to visualize data sets with inconsistent or missing entries:

Date,Organization,Number of participants
2020-09-27,UiT,20
Oct 10 2020,UiT Norges arktiske universitet,15
"Nov. 11, 2020",UiT The Arctic University of Norway,40
2020-12-12,UiT The Arctic University of Norway,-

Data cleaning is a bit outside the scope of this course but still good to know:
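A minimal cleaning sketch for the table above, using only the standard library. It assumes that the three organization spellings all refer to the same institution and that "-" marks a missing value; both are assumptions for this illustration, as are the names `DATE_FORMATS`, `ORG_ALIASES`, `parse_date`, and `clean`.

```python
import csv
import io
from datetime import datetime

RAW = '''Date,Organization,Number of participants
2020-09-27,UiT,20
Oct 10 2020,UiT Norges arktiske universitet,15
"Nov. 11, 2020",UiT The Arctic University of Norway,40
2020-12-12,UiT The Arctic University of Norway,-
'''

# Date layouts seen in the data; %b relies on English month abbreviations.
DATE_FORMATS = ("%Y-%m-%d", "%b %d %Y", "%b. %d, %Y")

# Assumption: all spellings denote the same organization.
ORG_ALIASES = {
    "UiT": "UiT The Arctic University of Norway",
    "UiT Norges arktiske universitet": "UiT The Arctic University of Norway",
}


def parse_date(text):
    """Return the date in ISO 8601 form, trying each known layout in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def clean(raw_text):
    """Normalize dates and names; turn non-numeric counts into None."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        count = row["Number of participants"]
        cleaned.append({
            "Date": parse_date(row["Date"]),
            "Organization": ORG_ALIASES.get(row["Organization"], row["Organization"]),
            "Number of participants": int(count) if count.isdigit() else None,
        })
    return cleaned


cleaned = clean(RAW)
for row in cleaned:
    print(row)
```

Unknown date layouts raise an error instead of being guessed, so new inconsistencies surface early.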


class: center, middle, inverse

Choosing the right tools


Choosing the right tools: scriptable

There is no single perfect language and no single perfect library for everything

  • You will have to choose what fits you and your group best

  • We will show examples using .emph[Python, R, and JavaScript]

No manual post-processing

  • Manual post-processing will bite you when you need to regenerate 50 figures one day before the submission deadline, or regenerate a set of figures after the person who created them has left the group.

  • Use software that can be scripted: batch processing and reproducibility (more about that in the next section).
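What "scripted" means in practice can be sketched with a deliberately tiny example: one function draws a figure from data, and a loop regenerates all figures in one go. To stay dependency-free, this sketch writes bare-bones SVG by hand, which is an assumption of the illustration and not a recommendation; in a real pipeline a plotting library would take the place of `bar_chart_svg`. All function names and the output location are hypothetical.

```python
import tempfile
from pathlib import Path


def bar_chart_svg(title, values, bar_height=20, scale=10):
    """Return a minimal horizontal bar chart as an SVG string (bars start at zero)."""
    height = bar_height * (len(values) + 1)
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="400" height="{height}">',
        f'<text x="0" y="15">{title}</text>',
    ]
    for i, (label, value) in enumerate(values.items(), start=1):
        y = i * bar_height
        parts.append(f'<text x="0" y="{y + 14}">{label}</text>')
        parts.append(
            f'<rect x="100" y="{y}" width="{value * scale}" height="{bar_height - 4}"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


def regenerate_all(datasets, out_dir):
    """Write one SVG file per dataset and return the list of written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, values in datasets.items():
        path = out_dir / f"{name}.svg"
        path.write_text(bar_chart_svg(name, values), encoding="utf-8")
        written.append(path)
    return written


# Sightings per observation site, taken from the CSV example in these slides.
datasets = {
    "site_A": {"arctic fox": 3, "polar bear": 1, "seal": 2},
    "site_B": {"arctic fox": 1, "walrus": 1, "reindeer": 10, "seal": 1},
    "site_C": {"walrus": 1, "reindeer": 1, "polar bear": 1, "seal": 2},
}

figures = regenerate_all(datasets, tempfile.mkdtemp())
print([path.name for path in figures])
```

If the data changes, rerunning the script updates every figure; whether there are 3 or 50 makes no difference to the effort.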


Choosing the right tools: free

Use free software and free tools

  • Even if the university pays for a license, what happens after you leave the university or after they stop paying? How can other groups build on your work?

  • .emph[Python and R are free] and popular for "notebook"-based pipelines, but a number of .emph[JavaScript frameworks] also exist, especially for maps.

  • Plain text files for small datasets.

  • Standard formats instead of proprietary formats.

For any academic discipline it will be a good investment to learn a bit of Python or R if you want to do data visualization


Visualization libraries (incomplete list)

Two main families: procedural (e.g. Matplotlib) and declarative.

.left-column50[

Python

  • Vega-Altair: declarative visualization
  • Matplotlib: MATLAB users will be at home
  • Seaborn: statistical functions built in
  • Plotly: interactive graphs
  • Bokeh: also good for interactivity
  • ggplot: R users will be more at home
  • PyNGL: used in the weather forecast community
  • K3D: Jupyter notebook extension for 3D visualization ]

.right-column40[

R

  • ggplot2: system for declaratively creating graphics, based on the grammar of graphics
  • Shiny: interactive graphs and notebooks

JavaScript


class: center, middle, inverse

Data visualization using Python

https://coderefinery.github.io/data-visualization-python/

(co-created by the author of these slides)


class: center, middle, inverse

Reproducible and reusable plots


class: center, middle

.cite[Juliette Taka, Logilab and the OpenDreamKit project (2017), https://opendreamkit.org/2017/11/02/use-case-publishing-reproducible-notebooks/]


.emph[Demo]: visualization pipeline on Binder

Other fantastic tools which I will not demonstrate


Zenodo can give you a persistent identifier (DOI) and make your pipeline citable

Rather than specifying a GitHub repository when launching Binder, you can instead use a Zenodo DOI.
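The two launch variants differ only in the URL. The sketch below builds both forms; the URL pattern reflects my understanding of mybinder.org and should be checked against the Binder documentation, and the repository name and DOI are placeholders, not real records.

```python
def binder_url(provider, spec):
    """Build a mybinder.org launch URL for a provider prefix and a specification."""
    return f"https://mybinder.org/v2/{provider}/{spec}"


# Hypothetical GitHub repository (organization/repository/reference).
github_launch = binder_url("gh", "example-org/example-repo/HEAD")

# Placeholder Zenodo DOI; replace with the DOI of your own archived pipeline.
zenodo_launch = binder_url("zenodo", "10.5281/zenodo.0000000")

print(github_launch)
print(zenodo_launch)
```

A link to a repository follows whatever the reference currently points to, whereas a DOI identifies one archived snapshot, which is what makes the pipeline citable.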


Progression

  • Start with a working example and try adapting it
  • Learn the very basics
    • Learn a bit of Python
    • Or R
  • It can be a good idea to start learning right away in a notebook
    • Python: Jupyter
    • R: R Markdown in RStudio
    • Quarto
  • Later try Binder
  • Later learn how to get a DOI for your Binder
  • Now your plotting recipe can be cited and is reproducible

This takes time and it is OK to take time

.quote[If I had six hours to chop down a tree, I’d spend the first four hours sharpening the axe.] .cite[Abraham Lincoln]


Summary

  • Don't forget to tell a story

  • FAIR principles and reproducibility will be good for you (and for others)

  • Document all tools and dependencies used .emph[with versions]

  • Prefer .emph[free tools]

  • "Data visualization clinic" next week
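One way to follow the advice about documenting tools with versions is to let the plotting script record them itself. This is a minimal sketch using the standard library's `importlib.metadata`; the package names in the list are only examples, and the function name `describe_environment` is an assumption.

```python
import sys
from importlib import metadata


def describe_environment(packages):
    """Return the Python version and the installed version of each package."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this environment
    return report


# Example package names; list whatever your pipeline actually imports.
report = describe_environment(["matplotlib", "pandas", "altair"])
for name, version in report.items():
    print(f"{name}: {version if version else 'not installed'}")
```

Saving this output next to the generated figures gives a record of the environment that produced them.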


Books

Papers

Courses/talks