Skip to content

Latest commit

 

History

History
2750 lines (2269 loc) · 102 KB

recipes.org

File metadata and controls

2750 lines (2269 loc) · 102 KB

ggplotnim recipes

This document contains recipes to create specific kinds of plots. It’s aimed mainly at people who are new the grammar of graphics. It can also be seen as the equivalent of a plot gallery.

Note that this Org file is actually a “literate programming document”, which uses ntangle. If you want to generate the recipes locally, just run:

ntangle recipes.org

Note that the recipes directory of the repo already contains these files.

The document contains two kinds of recipes.

  1. First the “Get To The Point” (GTTP) kind of recipe. This is a minimal example to produce a specific plot, without any fancy options, interesting data etc.
  2. And secondly the “Tell Me A Story” (TMAS) recipes, which try to explain every important step, typically introduce some interesting data set which we’ll investigate to finally produce a plot of a certain kind. Here also alternative ways might be presented. For some people maybe the more interesting read. :)

GTTP “Get To The Point” recipes

The recipes in this section present the simplest way to achieve the desired plot, without any superfluous talking or complicating examples.

Simple line plot

Just a line plot of some data. Simple as that.

Some basic imports:

import ggplotnim, sequtils, seqmath

Create some data so that we have something to plot:

let x = linspace(0.0, 30.0, 1000)
let y = x.mapIt(pow(sin(it), 2.0))

Build a dataframe from it:

let df = toDf(x, y)

and create the plot:

ggplot(df, aes("x", "y")) + # x and y are the identifiers given above as strings
  geom_line() +
  ggsave("media/recipes/rSimpleLinePlot.png")

This results in the following plot:

./media/recipes/rSimpleLinePlot.png

(Line) plot of specific size, different filetype

Creating a plot with a specific output size and saving it as a different filetype is very easy. It’s basically exactly the same as the plot above with the only change in the ggsave call:

import ggplotnim, sequtils, seqmath
const
  width = 720
  height = 480
let x = linspace(0.0, 30.0, 1000)
let y = x.mapIt(pow(sin(it), 2.0))
let df = toDf(x, y)
ggplot(df, aes("x", "y")) +
  geom_line() +
  ggsave("media/recipes/rLinePlotSize.png", width = width, height = height)

You can see that ggsave accepts a width and height argument. The desired output filetype is deduced from the filename extension. Supported are:

  • .pdf
  • .svg
  • .png

The above generates the following plot:

./media/recipes/rLinePlotSize.png

Stacked histogram colored by some class

Often one has a certain type of data for several discrete different classes. In these cases a stacked histogram may be useful. Considering the mpg dataset, this is done via:

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("cty", fill = "class")) + 
  geom_histogram() + 
  ggsave("media/recipes/rStackedMpgHistogram.png")

This generates:

./media/recipes/rStackedMpgHistogram.png

Stacked lines colored by some class

NOTE: the first lines should actually go to 0 again. That’s a bug currently.

The same as above can also be represented with lines using geom_freqpoly:

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("cty", color = "class")) + 
  geom_freqpoly() + 
  ggsave("media/recipes/rStackedMpgFreqpoly.png")

This generates:

./media/recipes/rStackedMpgFreqpoly.png

Change alpha of the fill color

Most geoms (those which allow a fillColor argument / being classified by fill) also allow an alpha field since version v0.2.17.

This allows to change the alpha of the fill without affecting the actual color or the normal color field.

For instance we can use it to make the above plot a little more pretty:

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("cty", fill = "class")) + 
  geom_freqpoly(alpha = 0.3) + 
  ggsave("media/recipes/rFreqPolyWithAlpha.png")

Notice how if a fill color is set, the lines are always drawn down to 0! For instance subcompact and pickup reach down to the x axis.

This generates:

./media/recipes/rFreqPolyWithAlpha.png

Simple histogram with N bins

Given a continuous data column we may want to calculate a histogram with N bins:

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy")) + 
  geom_histogram(bins = 20) + # by default 30 bins are used
  ggsave("media/recipes/rMpgHistoNumBins.png")

This generates:

./media/recipes/rMpgHistoNumBins.png

Simple histogram with specific bin width

We can also set a specific bin width instead of a number of bins. E.g. we want to bin by 1.5 mpg, which can be done using the binWidth argument:

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy")) + 
  geom_histogram(binWidth = 1.5) +
  ggsave("media/recipes/rMpgHistoBinWidth.png")

This generates:

./media/recipes/rMpgHistoBinWidth.png

Histogram with specific bin edges

Or sometimes one has specific edges in mind one wants to investigate. This can be done via the breaks argument. NOTE: the breaks given are interpreted as left bin edges plus the right most edge of the last bin! So the below starts at 0 on the left side and the last bin ends at 40 on the right side.

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy")) + 
  geom_histogram(breaks = @[0'f64, 10, 15, 19, 23, 25, 40]) +
  ggsave("media/recipes/rMpgHistoCustomBreaks.png")

This generates:

./media/recipes/rMpgHistoCustomBreaks.png

Bin with geom_point and overlay on histogram with specific bin edges

While this is a bad example, there are use cases when aside from the bins for the histogram points should be used to indicate some data that is binned in the same way. For instance when comparing simulations with experimental data in particle physics. In this case here we’ll just bin the point data in the same way as the histogram itself and draw the points into the center bin positions.

import ggplotnim
let df = readCsv("data/mpg.csv")
let breaks = @[0'f64, 10, 15, 19, 23, 25, 40]
ggplot(df, aes("cty")) +
  geom_histogram(breaks = breaks) +
  geom_point(stat="bin", breaks = breaks, binPosition = "center") +
  ggsave("media/recipes/rMpgHistoPlusPoints.png")

Note both additional arguments to the geom_point call. The ~stat=”bin”~ tells ggplotnim that the user wants to perform binning of the given x columns (“cty”). With that setting the breaks argument won’t be ignored and will be applied in the same way as for the geom_histogram call. Finally the ~binPosition=”center”~ is used to draw the points not where the binning data is located (read: left bin edge), but rather in the center of the bins.

This generates:

./media/recipes/rMpgHistoPlusPoints.png

Scatter plot with different point color

Sometimes the black points (or color of some other non classified geom) might be boring. That’s why the geom_* procs also take a color argument. This just takes a chroma.Color object. Let’s color our points monokai pink:

import ggplotnim
let df = readCsv("data/mpg.csv")
let breaks = @[0'f64, 10, 15, 19, 23, 25, 40]
ggplot(df, aes("displ", "cty")) +
  geom_point(color = "#F92672") +
  ggsave("media/recipes/rMpgCustomColorPoint.png")

Gives:

./media/recipes/rMpgCustomColorPoint.png

Histogram from already binned data

When dealing with histograms it’s quite likely that the user has already computed the bin content with a custom binning before hand and simply wants to plot that information. This can also be done easily by building a DF from that prebinned data and using ~stat=”identity”~ as the histogram argument:

import ggplotnim
let bins = @[0, 2, 5, 9, 15]
let counts = @[0.1, 0.8, 0.3, 0.05, 0.0] # <- last element is dummy
let df = toDf({"bin_edges" : bins, "counts" : counts})
ggplot(df, aes("bin_edges", "counts")) + 
  geom_histogram(stat = "identity") +
  ggsave("media/recipes/rPrebinnedHisto.png")

There are a couple of important things to mention here. In the case above the user hands x as the bin edges (!) starting from the left bin edge of the first bin to the right edge of the last bin. However, this means we only have N - 1 actual bins. Yet the DF requires all columns to have the same number of entries (Note: technically that’s not true, it’ll be filled up with VNull values, yet it’s not a nice solution and will cause other issues).

There are several ways to deal with this. Either one hands one additional dummy value in the counts sequence, which will be ignored for the bin content. Alternatively, one may only hand essentially the left edges of the bins (and thus as many bins elements as real counts values) and let ggplotnim determine the bin width. This works fine as long as the bin width of the second to last bin is the same as the bin width of the last bin. So if that is the case for your data, feel free to only hand N - 1 elements.

Which generates the following:

./media/recipes/rPrebinnedHisto.png

Simple bar plot for N categories

Often one deals with categorical data with N classes, e.g. the different type of cars listed in the mpg dataset and wishes to count the number of elements in each class. For this we can use a geom_bar plot. NOTE: For the moment the only function supported is the number of counts. FormulaNode and other Nim function support will probably be added at some point (or when desired by someone):

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("class")) + 
  geom_bar() + 
  ggsave("media/recipes/rMpgSimpleBarPlot.png")

Giving us the following result:

./media/recipes/rMpgSimpleBarPlot.png

Bar plot with already computed statistics, manually set scale as discrete

Consider a case in which we have a datafile for N sensors, where each row corresponds to one measurement per minute. In that case we might want to plot the mean value measured for all channels in a bar plot. To achieve this, we can do the following.

We start by putting the table into long form and parsing the channel numbers to integers. Then we compute the mean by grouping by channel and finally we create the plot.

Note: Parsing the channel numbers and thus having to set the x scale as `dcDiscrete` is of course not required. But this way we combine two examples into one. :)

import ggplotnim, sequtils, seqmath, strutils
let cols = toSeq(0 .. 7).mapIt($it)
# make `parseInt` work on Values, so we can parse the long form 
# `Channel` column
#liftScalarStringProc(parseInt)
let df = readCsv("data/szinti_channel_counts.txt",
                 sep = '\t',
                 colNames = cols)
  .gather(cols, key = "Channel", value = "Count")
  .mutate(f{string -> int: "Channel" ~ parseInt( df["Channel"][idx] )})
let dfMean = df.group_by("Channel").summarize(f{float: "Mean counts / min" << mean( c"Count" )})
# calculate mean for each channel
ggplot(dfMean, aes("Channel", "Mean counts / min")) +
  geom_bar(stat = "identity", position = "identity") +
  scale_x_discrete(name = "Channel number") +
  ggtitle("Mean counts per channel") +
  ggsave("media/recipes/rBarPlotCompStats.png")

Gives the following plot:

./media/recipes/rBarPlotCompStats.png

Stacked bar plot for N categories split by some other category

Sometimes the above classes may be part of some other class though, which is supposed to be split by as well. For instance whether cars of a specific class are 4WD, RWD or FWD:

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("class", fill = "drv")) + 
  geom_bar() + 
  ggsave("media/recipes/rMpgStackedBarPlot.png")

Results in:

./media/recipes/rMpgStackedBarPlot.png

Points colored by some continuous scale

Sometimes a scatter plot is supposed to highlight some continuous value of the points. For instance we can color a geom_point plot of the mpg displacement versus the highway efficiency by the city efficiency to get an even fuller picture:

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("displ", "hwy", color = "cty")) + 
  geom_point() + 
  ggsave("media/recipes/rMpgContinuousColorPoints.png")

See:

./media/recipes/rMpgContinuousColorPoints.png

Using points and lines to show discrete counts

Instead of using bars to show some count data classified by two columns, we can show the same thing using points instead. The same plot as above is then:

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("class", color = "drv")) + 
  geom_point(stat = "count") + 
  geom_line(stat = "count") + 
  ggsave("media/recipes/rMpgStackedPointPlot.png")

Results in:

./media/recipes/rMpgStackedPointPlot.png

Using a discrete X scale with points

It’s also just possible to draw a point (or line, see below) with an x axis that contains discrete data while providing a continuous y axis.

import ggplotnim
let df = readCsv("data/mpg.csv")
# coloring by class is of course not required to make this work :)
ggplot(df, aes("cyl", "hwy", color = "class")) + 
  geom_point() + 
  ggsave("media/recipes/rMpgDiscreteXScale.png")

Results in:

./media/recipes/rMpgDiscreteXScale.png

Using a discrete y axis

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy", "cyl")) +
  geom_point() +
  ggsave("media/recipes/rDiscreteYAxis.png")

media/recipes/rDiscreteYAxis.png

Both axes discrete

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("class", "cyl", color = "class")) +
  geom_point() +
  ggsave("media/recipes/rBothDiscreteAxes.png")

media/recipes/rBothDiscreteAxes.png

Using a discrete X scale with lines

Inspired by #40.

A plot with a discrete X scale (using string values) and continuous y scale.

import ggplotnim, seqmath
import random

const paths = 10
const dates = 80

proc gaussian*(rnd: var Rand, mu = 0.0, sigma = 1.0): float =
  var
    s = 0.0
    u = 0.0
    v = 0.0
  while s >= 1.0 or s <= 0.0:
    u = 2.0 * rnd.rand(0.0..1.0) - 1.0
    v = 2.0 * rnd.rand(0.0..1.0) - 1.0
    s = (u * u) + (v * v)

  let x = u * sqrt(-2.0 * ln(s) / s)
  return (mu + (sigma * x))

proc createDataFrame(): DataFrame = 
  const sigma = 0.10
  var rnd = initRand(124325)
  var pathNames = newSeq[string](dates * paths)
  var pathVals = newSeq[float](dates * paths)
  var tenors = newSeq[int](dates * paths)
  for j in 0 ..< paths:
    pathVals[j * dates] = 100.0
    pathNames[j * dates] = "path" & $(j + 1)
    tenors[j * dates] = 0
    for i in 1 ..< dates:
      let idx = j * dates + i
      pathNames[idx] = "path" & $(j + 1)
      pathVals[idx] = (pathVals[idx - 1] * exp(-0.5 * sigma * sigma + sigma * gaussian(rnd)))
      tenors[idx] = i
  result = toDf({ "tenors" : tenors,
                      "pathNames" : pathNames,
                      "pathValues" : pathVals })

let df = createDataFrame()
ggplot(df, aes("tenors", "pathValues", color = "pathNames")) + 
  geom_line() +
  ggsave("media/recipes/rDiscreteXLine.png")

Results in:

./media/recipes/rDiscreteXLine.png

Plotting column Y1 and Y2 against X

In many practical cases we may end up with some data Y1 and Y2, which both is equivalent in the phase space sense and was measured against the same variable but is available only in the format of:

# X        Y1        Y2
x1       y1_1        y2_1
x2       y1_2        y2_2
... 

One case such as this would be having two different sensors for the same property, which both took data at the same time. In those cases one probably wants a plot of X against Y1 and X againt Y2 in one plot with two different colors.

The naive way to do this is the following:

import ggplotnim, sequtils, seqmath
let df = readCsv("data/50-18004.CSV")
ggplot(df) +
  geom_line(aes(x = "in_s", y = "C1_in_V", color = "C1")) +
  geom_line(aes(x = "in_s", y = "C2_in_V", color = "C2")) +
  ggsave("media/recipes/rTwoSensorsBadStyle.png")

This generates the following:

./media/recipes/rTwoSensorsBadStyle.png

Which means that we specify the x and y aesthetics only in the two geoms and give it a color according to a string, which does represent a column of the df. In that way the color is being set to “C1” and “C2”. While this works it’s maybe not the nicest way to handle this, since the gather proc is specifically there to convert a table from this short form of X, Y1, Y2 to a long format dataframe of the type X, Key, Value, where Key stores the name of the previous column (or a custom name) and Value the value corresponding to X of that column. This simplifies the plotting to a single call to geom_line by specifying the Channel column as the discrete color scale:

import ggplotnim, sequtils, seqmath
let df = readCsv("data/50-18004.CSV")
let dfnew = df.gather(["C1_in_V", "C2_in_V"], key = "Channel", value = "V")
# Plotting via `df` directly causes scale problems!
ggplot(dfNew, aes("in_s", "V", color = "Channel")) +
  geom_line() +
  ggsave("media/recipes/rTwoSensorsGoodStyle.png")

Which then results in the following nicer plot (note that the legend now says something more useful as its title):

./media/recipes/rTwoSensorsGoodStyle.png

Bar plot with many elements, rotate labels

When dealing with discrete data, which contains a large number of labels a typical problem is that there’s not enough space for all labels. In this case we want to rotate the labels and right align them. Considering some fake dataset with people who did some shifts listed by year and we want a bar plot of the number of shifts each person did per year. Also we need a much wider plot for this dataset:

import ggplotnim
let df = readCsv("data/fake_shifter_data.txt")
ggplot(df, aes("Shifters", fill = "Year")) +
  geom_bar() +
  xlab(rotate = -45.0, margin = 1.75, alignTo = "right") +
  ggtitle("Number of shifts done by each shifter by year") +
  ggsave("media/recipes/rBarPlotRotatedLabels.png", width = 5000, height = 1000)

NOTE: It’s recommended to generate a plot as this as a vector graphic (svg, pdf) instead of a png. However, since these recipe plots are part of the CI, we generate PNGs for all of them, since conversion from svg, pdf yields bad results on travis.

Gives us the following plot. Probably better to look at the original instead of the embedded plot, since it’s so wide.

./media/expected/rBarPlotRotatedLabels.svg

Bar plot with negative values

By using stat = "identity"= combined with =geom_bar or geom_histogram, one can also plot bars with negative height.

This recipe is taken from: #64

import ggplotnim

let trials = @["A", "B", "C", "D", "E"]
let values = @[1.0, 0.5, 0, -0.5, -1.0]

let df = toDf({ "Trial" : trials,
                    "Value" : values })

ggplot(df, aes(x="Trial", y="Value")) +
  geom_bar(stat="identity", position="identity") +
  ggsave("media/recipes/rNegativeBarPlot.png")

Gives us the following plot:

media/recipes/rNegativeBarPlot.png

Plot with custom annotation

Putting custom annotations onto the plot is supported. However the styling of the annotations is still somewhat limited.

Annotations can either be done via relative coordinates of the plot area (left, bottom) or via data coordiantes (x, y).

Note that the annotation is drawn before the data!

By default a white rectangular background is drawn behind the annotation. This can be modified using the backgroundColor argument.

An example is shown below where we print the largest 5 values as an annotation onto the plot.

import ggplotnim
import algorithm, sequtils, strformat, strutils
# get the data from one of the other recipes
let df = readCsv("data/50-18004.CSV")
let dfnew = df.gather(["C1_in_V", "C2_in_V"], key = "Channel", value = "V")
# assume we want to create an annotation that prints the largest 5 values of
# Channel 2; get largest values, sorted by time (`in_s`)
let dfChMax = dfNew.filter(f{c"Channel" == "C2_in_V"})
  .arrange("V", SortOrder.Descending)
  .head(5)
  .arrange("in_s") # sort again by x axis
# build an annotation:
var annot: string
let idxs = toSeq({'A'..'E'})
for j, id in idxs:
  let xVal = alignLeft(formatFloat(dfChMax["in_s", j, float], precision = 2), 9)
  let yVal = formatFloat(dfChMax["V", j, float], precision = 4)
  annot.add &"{id}: (x: {xVal}, y: {yVal})"
  if j < idxs.high:
    annot.add "\n"
# create a font to use using the `ggplotnim.font` helper
let font = font(12.0, family = "monospace")
# now create the plot and put the annotation where we want it
ggplot(dfNew, aes("in_s", "V", color = "Channel")) +
  geom_line() +
  # either for instance in relative coordinates of the plot viewport
  # Values smaller 0.0 or larger 1.0 work too. Puts the annotation outside
  # of the plot
  annotate(annot,
           left = 0.5,
           bottom = 1.0,
           font = font) +
  # or in data coordinates using `(x, y)`
  annotate(annot,
           x = -2e-6,
           y = 0.06,
           font = font,
           backgroundColor = parseHex("FFEBB7")) +
  ggsave("media/recipes/rCustomAnnotations.png")

Gives us this (somewhat ugly, but that’s not the point) plot: ./media/recipes/rCustomAnnotations.png

Setting plot limits using xlim, ylim

It’s also possible to limit / enlarge the plotting range using xlim and ylim. The behavior of points which possibly lie outside the plotting range is determined by the outsideRange argument, which can take the values:

  • “clip” (default): clip them to the maximum range of the limit
  • “drop”: drop those points from the plot
  • “none”: leave them as they are. Will potentially show them outside the plot area.

This becomes a little more complicated in combination with the xMargin and yMargin procs. See below for an example.

First of all we can use it to enlarge the x range:

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy", "cty")) +
  geom_point() +
  xlim(10.0, 60.0) +
  ggsave("media/recipes/rEnlargeXRange.png")

which gives us:

./media/recipes/rEnlargeXRange.png

On the other hand we can also use it to limit the range of a plot:

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy", "cty")) +
  geom_point() +
  xlim(10.0, 30.0) +
  ggsave("media/recipes/rLimitXRange.png")

./media/recipes/rLimitXRange.png

Notice how all values larger than 30.0 (compare with plot above) are being clipped to 30.0.

Creating a buffer zone with xMargin / yMargin

Sometimes it might be nice to have an explicit area in the plot, which is used to designate data points, which lie outside a desired data range (or are Inf). In this case the xMargin or yMargin procs can be used (possibly in combination with xlim, ylim above).

Assuming we want to create the same plot as above, but for some reason are only interested in y values up to 25, but we want to be easily aware all points, which are larger (but not equal!) that value. Let’s add margin in y to achieve that.

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy", "cty")) +
  geom_point() +
  ylim(5.0, 25.0) +
  yMargin(0.1) +
  ggsave("media/recipes/rCreateMarginBuffer.png")

./media/recipes/rCreateMarginBuffer.png

Notice how all values that are larger than 25 appear at the top of the plot, while values smaller (and equal to 25) appear where they belong.

Highlight min / max of data in a plot

If we wish to highlight certain points of a plot with a specific geom / style we can do this in the following way.

It is based on first filtering an additional dataset to the values to be highlighted and then using the optional color etc. arguments to the geom_* procs to set a certain style.

For instance let’s highlight the min and max values of the second channel from the Plotting column Y1 and Y2 against X example above.

import ggplotnim, algorithm
# base this on one of the above examples
let df = readCsv("data/50-18004.CSV")
  .gather(["C1_in_V", "C2_in_V"], key = "Channel", value = "V")
# filter to Channel 2 and sort by voltage
let dfSorted = df.filter(f{c"Channel" == "C2_in_V"})
  .arrange("V", SortOrder.Descending)
# get min and max
let dfMax = dfSorted.head(1)
let dfMin = dfSorted.tail(1)
ggplot(df, aes("in_s", "V", color = "Channel")) +
  geom_line() + # the actual data
  # add additional geom with `data =` arg and set styles. 
  geom_point(data = dfMax,
             color = "#FF0000", # named colors (e.g. "red") are also possible!
             size = 5.0,
             marker = mkCross) +
  geom_point(data = dfMin,
             color = "#0000FF",
             size = 5.0) +
  ggsave("media/recipes/rHighlightMinMax.png")

Results in the following plot:

./media/recipes/rHighlightMinMax.png

Applying a formula to an aesthetic

In some cases the data frame one has does not contain exactly the data we actually want to plot.

Take for instance the mpg dataset where the fuel economoy is (as the name implies) given in miler per gallon. People who use a sensible unit system will probably want the fuel economy in liters per 100 km.

There are two ways to plot L/100km instead of mpg.

Either (as shown as an example in the documentation) by mutating the data frame we have using transmute or mutate to create a new, transformed column.

The other way to achieve this is to provide a FormulaNode to the aes call, like so:

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes(f{235 / c"cty"}, "displ")) + 
  geom_point() +
  xlab("cty [L / 100km]") +
  ggsave("media/recipes/rFormulaAesthetic.png")

Which gives us:

media/recipes/rFormulaAesthetic.png

This approach can be used for (almost) arbitrary computations on (even more than one) column. Note that if you wish to apply a proc to a column, make sure it’s lifted and corresponds to one of the types explained in the README.

Plotting error bars

Starting from version v0.2.15 plotting of error bars is supported. This is done via geom_errorbar.

Error bars are handled by new fields of the Aesthetics object, namely xMin, xMax, yMin and yMax. The important thing to keep in mind is that these fields require absolute values. So if you have an error of 0.1 you don’t set yMin to -0.1 and yMax to 0.1, but rather you add and subtract from y using a formula. See the example below.

The example below assumes asymmetric, but constant errors of 0.03 down and 0.05 up. Note that xMin etc. are completely normal aesthetic fields. You can also assign a column of your DF with precomputed min and max values or even use more complex functionality provided by aesthetics via formulas, e.g. compute the square root error via yMax = f{`y` + sqrt(`y`)} for instance.

import ggplotnim, seqmath, sequtils
# create some polynomial data 
let x = linspace(0, 1.0, 10)
let y = x.mapIt(0.5 * it - 1.2 * it * it + 1.1 * it * it * it)
let df = toDf(x, y)
# let's assume we have asymmetric errors, 0.03 down and 0.05 up
ggplot(df, aes("x", "y")) +
  geom_point() +
  # define errors as a formula, which references our "y" scale
  geom_errorbar(aes(yMin = f{`y` - 0.03}, yMax = f{`y` + 0.05})) +
  ggsave("media/recipes/rErrorBar.png")

Which results in the following plot:

media/recipes/rErrorBar.png

Plot with multiple legends

If a plot contains multiple aesthetic scales, which require a legend, they will be attempted to be drawn above one another. However, at the time of version v.0.2.18 they are not made smaller so they fit. If too many elements are shown, they won’t fit the plot.

An example below in which we classify by color and size:

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("cty", "displ", size = "cyl", color = "cty")) +
  geom_point() +
  ggsave("media/recipes/rMultipleLegends.png")

media/recipes/rMultipleLegends.png

Simple tile plot

For a simple tile plot, let’s generate some data for a 28x28 tiling.

By default tiling will assume that the bin widhts of each tile is 1 in each dimension. You can provide custom bin widths by using the

  • width
  • height

aesthetic in the aes call. Same as with geom_errorbar this can either be a full column containing widths for each tile or a constant value via a formula, e.g.

aes(x = ..., ..., width = f{0.95}, height = f{0.95})

to get tiles which only have a width of 0.95.

Also for tiles it may be (more so than other times) be desirable to have each axis be considered discrete despite the data being continuous like. In that case use scale_*_continuous as shown (commented out) below.

import ggplotnim, random
var
  xs = newSeq[float]()
  ys = newSeq[float]()
  zs = newSeq[float]()
  rnd = initRand(42)
for x in 0 ..< 28:
  for y in 0 ..< 28:
    xs.add x.float
    ys.add y.float
    zs.add rnd.rand(1.0)
let df = toDf(xs, ys, zs)
ggplot(df, aes("xs", "ys", fill = "zs")) +
  geom_tile() +
  #scale_x_discrete() +
  #scale_y_discrete() +
  ggsave("media/recipes/rSimpleTile.png")

This gives us the following plot:

media/recipes/rSimpleTile.png

Note to get rid of the spacing on the upper and right side (or get even spacing), use xlim and ylim.

Large heatmaps - geom_raster for efficiency

When dealing with the above case Simple tile plot this approach quickly becomes unwieldly for a large number of tiles. That is because using geom_tile each tile is drawn separately. Especially when storing the result as a vector graphic this can result in bad performance and huge file sizes. geom_raster instead draws a single bitmap for the whole data (a cairo PNG surface is filled pixel wise). This comes at the (arbitrary, but done for simplicity’s sake) limitation that each tile has the same size. The width and height fields are available, but they are redundant.

We’ll modify the example from above a bit to include more elements and move the location to a non trivial position.

import ggplotnim, random
var
  xs = newSeq[float]()
  ys = newSeq[float]()
  zs = newSeq[float]()
  rnd = initRand(42)
for x in countup(-256, 254, 2):
  for y in 0 ..< 256:
    xs.add x.float
    ys.add y.float
    zs.add rnd.rand(1.0)
let df = toDf(xs, ys, zs)
ggplot(df, aes("xs", "ys", fill = "zs")) +
  geom_raster() +
  ggsave("media/recipes/rSimpleRaster.png")

./media/recipes/rSimpleRaster.png

And for a slight modification of two facet wrapped heatmaps (mainly as a regession test):

import ggplotnim, random
var
  xs = newSeq[float]()
  ys = newSeq[float]()
  zs1 = newSeq[float]()
  zs2 = newSeq[float]()
  rnd = initRand(42)
for x in 0 ..< 256:
  for y in 0 ..< 256:
    xs.add x.float
    ys.add y.float
    zs1.add rnd.rand(1.0)
    zs2.add rnd.rand(1.0)
let df = toDf(xs, ys, zs1, zs2)
  .gather(["zs1", "zs2"], key = "Map", value = "vals")
ggplot(df, aes("xs", "ys", fill = "vals")) +
  facet_wrap("Map") +
  xlim(0, 256) + ylim(0, 256) +
  geom_raster() +
  ggsave("media/recipes/rFacetRaster.png", width = 920)

./media/recipes/rFacetRaster.png

Different colormaps and custom colormaps

Since version v0.5.0 we provide 4 default colormaps to choose from:

  • Viridis
  • Magma
  • Inferno
  • Plasma

(these are the colormaps introduced in matplotlib in version 2. They are more or less perceptually homogeneous in color, which is a nice property to avoid being biased due to a perceptual large shift in color representing a small change in actual values)

In addition to those predefined scales you can either modify any of the existing colormaps or simply hand your own.

First let’s look at the existing colormaps by drawing gradients of the scales:

import ggplotnim, sequtils

# 1000 points to use
let xs = linspace(0.0, 1.0, 1000)
var plts: seq[GgPlot]
# generate data to show a gradient for (currently a single element in one
# axis isn't supported for `geom_raster`, as it's essentially discrete).
var df = newDataFrame()
for i in 0 ..< 5:
  df.add toDf({"x" : xs, "y" : i })
  
for scale in [viridis(), magma(), plasma(), inferno()]:
  plts.add ggplot(df, aes("x", "y", fill = "x"), backend = bkCairo, fType = fkPng) +
    scale_fill_gradient(scale) + # assign the correct scale
    geom_raster() +
    ylim(0, 5) +   # due to our weird data (height deduced as 1, set correct limits)
    theme_void() + # no scales
    hideLegend() + # no legend
    margin(top = 1, bottom = 0.3, right = 0.3) + # set small margin left / bottom 
    ggtitle("Colorscale: " & $scale.name) 
# now create a multi plot of all of them
ggmulti(plts, "media/recipes/rColormaps.png", widths = @[600], heights = @[150, 150, 150, 150],
        width = 600, height = 600)

This gives us the following color comparison:

./media/recipes/rColormaps.png

Any of the functions viridis(), magma(), plasma() and inferno() return a ColorScale object:

type
  ColorScale* = object
    name*: string # name of the used color scale
    colors*: seq[uint32] # the used color scale as colors encoded in uint32

The name is optional and can simply be used as we do in the example above to give the name in the title of each subplot.

The colors are stored as uint32 colors, which means:

let color = 0xAA_BB_CC_DD
#             ^  ^  ^  ^--- blue value as hex byte
#             |  |  |--- green value as hex byte
#             |  |--- red value as hex byte 
#             |--- alpha value of the color as hex byte

This is mainly done, as the most performance sensitive use of colorscales is for a geom_raster, in which case we have to use uint32 colors to color each pixel (e.g. in Cairo). As such it’s good idea to already use uint32 to avoid the need to convert the colorscale from some arbitrary color type to uint32.

As such, one can either create a custom colorscale by just taking such a ColorScale object and assigning a sequence of uint32 colors (any number of colors. ggplotnim will squeeze all input data into the corresponding number of colors) or directly hand a seq[uint32] to the scale_fill_gradient (or related scale_color_gradient) procedure.

Let’s look at an example of modifying a color scale to assign a transparent color to exact 0 values. We’ll plot a 2D gaussian:

import ggplotnim
import seqmath # for gauss

# generate some gaussian 2D data
var
  xs = newSeq[float]()
  ys = newSeq[float]()
  zs = newSeq[float]()
let coords = linspace(-1.0, 1.0, 200)
for y in coords:
  for x in coords:
    xs.add x
    ys.add y
    zs.add gauss(sqrt(x*x + y*y), mean = 0.0, sigma = 0.2) # small sigma to cover multiple σ

# get the Inferno colormap    
var customInferno = inferno()
customInferno.name = "InfernoWithTransparent"
# and assign the 0th element to exact transparent
customInferno.colors[0] = 0 # transparent

ggplot(toDf(xs, ys, zs), aes("xs", "ys", fill = "zs")) +
  geom_raster() +
  xlim(-1, 1) + ylim(-1, 1) +
  scale_fill_gradient(customInferno) +
  ggtitle("Modified Inferno colormap with 0 set to transparent") + 
  ggsave("media/recipes/rCustomColormap.png")

Note: we can see some substructure due to the small number of points we actually sample (200×200 points for 256 color values).

./media/recipes/rCustomColormap.png

Simple geom_text example

geom_text can be used to represent an additional scale on the plot via text. E.g. either to write a classification as a string onto the plot, overlay numbers onto points etc.

Take note that by default the text will be centered on the position given by the x and y scales. You can change the alignment using the alignKind argument to geom_text or by providing the optional font argument, which has an alignKind field.

There are many ways it can be useful. However, geom_text is a completely valid geom, which means we can replace e.g. a geom_point by geom_text and the resulting plot works as expected (although it may be messy):

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy", "displ")) + 
  geom_text(aes(text = "manufacturer")) + 
  ggsave("media/recipes/rSimpleGeomText.png")

This generates:

./media/recipes/rSimpleGeomText.png

Text can also be classified! colored and sized geom_text

We can take things even further by also applying additional scales to the plot, which change the color and size of the shown text. That way we can end up showing 5 different scales in a single plot! Again, the following may not be the most reasonable example, but well…

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy", "displ", color = "class", size = "cyl")) + 
  geom_text(aes(text = "manufacturer")) + 
  ggsave("media/recipes/rClassifiedGeomText.png")

This generates:

./media/recipes/rClassifiedGeomText.png

Annotating points using geom_text

In more practical terms we might want to annotate certain points with text in a plot. This can also be done using geom_text.

To avoid the text being drawn on top of the point, we can modify the x or y scale of the aesthetic for geom_text to nudge it to the side.

Let’s start from a simple combination of geom_point and geom_text:

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("hwy", "displ")) + 
  geom_point() +
  geom_text(aes(x = f{c"hwy" + 0.3}, 
            text = "manufacturer"),
            alignKind = taLeft, 
            # font = font(10.0, ...) <- you can also change the font
            ) + 
  ggsave("media/recipes/rAnnotateUsingGeomText.png")

This generates:

./media/recipes/rAnnotateUsingGeomText.png

This is still pretty messy and usually not what one might use geom_text for. Instead let’s consider a similar example where we want to annotate the car model with the best fuel economy:

import ggplotnim 
let df = readCsv("data/mpg.csv")
let dfMax = df.mutate(f{"mpgMean" ~ (`cty` + `hwy`) / 2.0})
  .arrange("mpgMean")
  .tail(1)
ggplot(df, aes("hwy", "displ")) + 
  geom_point(aes(color = "cty")) + # set point specific color mapping
  # Add the annotation for the car model below the point
  geom_text(data = dfMax,
            aes = aes(y = f{c"displ" - 0.2}, 
                      text = "model")) +
  # and add another annotation of the mean mpg above the point
  geom_text(data = dfMax,
            aes = aes(y = f{c"displ" + 0.2}, 
                      text = "mpgMean")) +
  ggsave("media/recipes/rAnnotateMaxValues.png")

This generates:

./media/recipes/rAnnotateMaxValues.png

Annotated heat map

geom_text and geom_tile can be nicely combined to create annotated heatmaps.

Let’s calculate the mean of highway fuel economy for each pair of (car class, number of cylinder) in the mpg dataset and create an annotated heatmap from the combination.

import ggplotnim, math

let df = readCsv("data/mpg.csv")
let dfRed = df.group_by(["class", "cyl"]).summarize(f{float: "meanHwy" << mean( c"hwy" )})
# stringification of formula is default name
let meanHwyCol = "meanHwy"
# fill only applies to `tile`, but not text. `color` would apply to text!
ggplot(dfRed, aes("class", "cyl", fill = meanHwyCol)) +
  geom_tile() +
  geom_text(aes(text = meanHwyCol)) +
  scale_y_discrete() +
  ggsave("media/recipes/rAnnotatedHeatmap.png")

Which results in:

media/recipes/rAnnotatedHeatmap.png

Plot consisting of multiple subplots

In certain domains one often ends up with the desire to create a plot from multiple subplots. This is supported already, but requires explicit use of ginger functinoality at the moment.

See for an example, inspired by here: https://staff.fnwi.uva.nl/r.vandenboomgaard/SP20162017/SystemsSignals/plottingsignals.html below.

The major point is to create two plots (but not draw them) using ggcreate with custom defined width and height. Then create a viewport, which will hold the two plots, with width and height such that the two plots fit. Then create a layout of rows / columns (theoretically custom sizes can be set, but for an equal sized subplot not required) and embed the plots.

Note: From version v0.4.10 on a backend must be given manually to the ggplot calls. This is because work started to make ginger less backend dependent. Previously the Cairo backend was simply assumed. This is only required when not directly saving a plot of course (using ggsave), i.e. when calling ggcreate manually.

import ggplotnim, seqmath, math, sequtils, complex, ginger
let t = linspace(-0.02, 0.05, 1000)
let y1 = t.mapIt(exp(im(2'f64) * Pi * 50 * it).re)
let y2 = t.mapIt(exp(im(2'f64) * Pi * 50 * it).im)
let df = toDf({ "t" : t,
                    "Re x(t)" : y1,
                    "Im x(t)" : y2 })
let plt1 = ggcreate(
  ggplot(df, aes("t", "Re x(t)"),
         backend = bkCairo, fType = fkPng) + # tell `ggsave` we wish to create the plot on the cairo backend and as a PNG 
    geom_line() + 
    xlim(-0.02, 0.05) + 
    ggtitle("Real part of x(t)=e^{j 100 π t}"),
  width = 800, height = 300
)
let plt2 = ggcreate(
  ggplot(df, aes("t", "Im x(t)"),
         backend = bkCairo, fType = fkPng) +
    geom_line() + 
    xlim(-0.02, 0.05) + 
    ggtitle("Imaginary part of x(t)=e^{j 100 π t}"),
  width = 800, height = 300
)
# combine both into a single viewport to draw as one image
var plt = initViewport(wImg = 800, hImg = 600, backend = bkCairo) # set backend of this viewport too
plt.layout(1, rows = 2)
# embed the finished plots into the the new viewport
plt.embedAt(0, plt1.view)
plt.embedAt(1, plt2.view)
plt.draw("media/recipes/rMultiSubplots.png")

Which gives us:

media/recipes/rMultiSubplots.png

Creating a neural spike plot in ggplotnim

These two examples are essentially the following gist: https://gist.github.com/Vindaar/9c32c0676ffddec9078e4c0917861fcd

which I wrote for @voltist on IRC, who wanted to create neural raster spike plots.

Just a disclaimer: I know nothing about these plots, so if they’re wrong in some way (axes, labels, whatever), please open an issue on how to make them correct!

I googled and found this: https://pythontic.com/visualization/charts/spikerasterplot

which is easy to do in ggplotnim using geom_linerange.

I’ll show two ways to do it. One “elegant” way (in terms of what’s considered typical usage of ggplot) and one rather weird way, which allows to use custom color codes.

The first version relies on creating a long format data frame, with one column for spike numbers, one for the time axis, containing when a neuron spiked and finally a line size column for (what I assume is) the amplitude of the spike.

# first start with auto selection of colors
import ggplotnim, sequtils
const numx = 50
const numy = 8
const lineSizes = [0.4, 0.3, 0.2, 0.8, 0.5, 0.6, 0.7, 0.9]
# NOTE: The creation of the data here could surely be done in a nicer
# way...
var spikes = newSeq[float]()
var sizes = newSeq[float]()
for y in 0 ..< numy:
  for x in 0 ..< numx:
    spikes.add y.float
    sizes.add lineSizes[y]
var df = newDataFrame()
df["spikes"] = toColumn spikes
df["neurons"] = toColumn randomTensor(numx * numy, 1.0)
df["lineSize"] = toColumn sizes

ggplot(df, aes("neurons", "spikes", color = factor("lineSize"))) +
  geom_linerange(aes(ymin = f{c"spikes" - c"lineSize" / 2.0},
                     ymax = f{c"spikes" + c"lineSize" / 2.0})) +
  scale_y_continuous() + # make sure y is considered cont.
  ylim(-1, 8) + # at the moment ymin, ymax are not considered for the plot range (that's a bug)
  ggtitle("Spike raster plot") +
  ggsave("media/recipes/rAutoColoredNeuralSpikes.png")

This gives us the following plot:

media/recipes/rAutoColoredNeuralSpikes.png

In constrast this version only has a data frame with one column for each neuron containing the times when they spiked. The spike number, line size and color are all constant for each neuron.

import ggplotnim, sequtils
const numx = 50
const numy = 8
const lineSizes = [0.4, 0.3, 0.2, 0.8, 0.5, 0.6, 0.7, 0.9]
# alternatively using fixed colors and one geom_linerange for each color
let colorCodes = @[color(0, 0, 0),
                   color(1, 0, 0),
                   color(0, 1, 0),
                   color(0, 0, 1),
                   color(1 , 1, 0),
                   color(1, 0, 1),
                   color(0, 1, 1),
                   color(1, 0, 1)]
var df = newDataFrame()
for nr in 0 ..< numy:
  df["neuron " & $nr] = toColumn randomTensor(numx, 1.0)
var plt = ggplot(df)
for nr in 0 ..< numy:
  # could combine with above loop, but for clarity
  plt = plt + geom_linerange(aes(x = ("neuron " & $nr),
                                 y = nr,
                                 ymin = nr.float - lineSizes[nr] / 2.0,
                                 ymax = nr.float + lineSizes[nr] / 2.0),
                             color = colorCodes[nr])
# finally add scales, title and plot
plt + scale_y_continuous() + # make sure y is considered cont.
  ylim(-1, 8) + # at the moment ymin, ymax are not considered for the plot range (that's a bug)
  xlab("Neurons") +
  ylab("Spikes") +
  ggtitle("Spike raster plot, manual colors") +
  ggsave("media/recipes/rCustomColoredNeuralSpikes.png")

This results in this plot:

media/recipes/rCustomColoredNeuralSpikes.png

Facet wrap for simple grid of subplots

Facet wraps are useful to display subplots consisting of different labels of a discrete variable. Essentially instead of e.g. coloring each geom based on each label an individual subplot is created. Of course those can be combined as well.

The only important thing is that the variables by which facetting is done have to be discrete. For continuous data the data is still interpreted as discrete, so this might result in many subplots!

By default the scales of each subplot is fixed to the same scale, which is the one determined from the x and y aesthetics (in the below displ and hwy). This can be overriden by the scales argument to facet_wrap, which can be one of the following:

  • ~”fixed”~: default, all the same scale
  • ~”free_x”~: x axis is free and will be determined based on the data of each label
  • ~”free_x”~: y axis is free and will be determined based on the data of each label
  • ~”free”~: both axes are free and will be determined based on the data of each label

Facetting can be done by as many variables as one likes, but again, this will result in a combinatorial explosion.

import ggplotnim

let mpg = readCsv("data/mpg.csv")
ggplot(mpg, aes("displ", "hwy")) +
  geom_point(aes(color = "manufacturer")) +
  facet_wrap(["drv", "cyl"]) + 
  ggsave("media/recipes/rSimpleFacet.png")

This gives us:

media/recipes/rSimpleFacet.png

Facet wrap for data of different ranges

Facet wraps can also be (ab)used to display data that is not quite the same data just for different labels, but rather different data sets, which may be somehow related and one wishes to get a glimpse of some geom for all those data sets.

Consider the example below. It is based on background data from my https://github.com/Vindaar/TimepixAnalysis of a gaseous detector with a pixelized readout (256x256 pixels) consisting mostly of cosmic muons, which typically look something like this:

media/event_run_305_num_59.png

For every event such as this clusters are identified and then (geometric) properties of each event are calculated. The ones we are going to look at here:

  • hits: number of active pixels in a cluster
  • pos_x: center position along x of the cluster
  • pos_y: center position along y of the cluster
  • ecc: eccentricity of the cluster (essentially ratio of long to short axis)
  • rms_trans: RMS of the cluster along the short axis

The other day I had a weird issue with one dataset, so I wanted to get an overview of the distributions of these properties. Thanks to facet_wrap that’s a few lines of code:

import ggplotnim
let df = readCsv("data/run_305_tpa_data.csv")
# gather all columns to a long format df
echo df
let dfLong = df.gather(getKeys(df), key = "Property", value = "Value")
ggplot(dfLong, aes("Value")) +
  facet_wrap("Property", 
             scales = "free") + # each property has very different data ranges, Leave free
  geom_histogram(bins = 100, position = "identity", 
                 binBy = "subset") + # `binBy subset` means the histogram will be calculated 
                                     # in the data range of each properties data range
  ggsave("media/recipes/rFacetTpa.png", width = 800, height = 600)

Which gives us the following plot:

media/recipes/rFacetTpa.png

Notice how in this facetting example each subplot has its own x and y axes and tick labels. That is because we set the scales to ~”free”~.

Also note that if we didn’t set the ~binBy = “subset”~ argument, all subplots would be binned to 100 bins in the same data range, which is the encompassing data range of all subplots. Especially thanks to having the hits property in there, this would lead to useless binning.

Custom tick labels via callback for dates

NOTE: Starting from version v0.4.4 this approach is not recommended to be used for dates, but only to customize other numerical data. In this approach the actual placement of the ticks does not change, only the label does. See <a href=”Adjusting tick labels according to custom date time information”>Adjusting tick labels according to custom date time information below for the best approach now.

Starting from version v0.3.7 (thanks to @cooldome !) it’s possible to hand a callback to define custom tick labels.

This example shows how to get tick labels showing dates (especially useful, since ggplotnim does not natively support DateTime objects so far!).

The callback is given to the scale_x/y_discrete/continuous procs via the labels parameter. For continuous labels the signature is:

proc(x: float): string

while for discrete labels it is:

proc(x: Value): string

where the Value corresponds to each discrete label on the discrete scale.

import ggplotnim, strutils, sequtils, seqmath, times

let x = linspace(0.0, 30.0, 1000)
let y = x.mapIt(pow(sin(it), 2.0))

let df = toDf(x, y)

# helper template to get a reproducible `DateTime` for CI!
template nowTmpl(): untyped = initDateTime(15, mMay, 2020, 00, 00, 00, 00, utc())

ggplot(df, aes("x", "y")) +
  geom_line() +
  scale_y_continuous(labels = proc(x: float): string =
                              x.formatFloat(ffDecimal, 1)) +
  scale_x_continuous(labels = proc(x: float): string =
                              getDateStr(nowTmpl() - int(x).months)) +
  xlab(label = " ") +
  ggsave("media/recipes/rFormatDatesPlot.png")

This gives us:

media/recipes/rFormatDatesPlot.png

Custom tick labels to use decimals with specific precision

In the same vain as the above, it might sometimes be useful to have control over the precision of the tick labels shown on the plot. This can also be done using the labels callback approach:

import ggplotnim, strutils, sequtils, seqmath

let x = linspace(0.0, 30.0, 1000)
let y = x.mapIt(pow(sin(it), 2.0))

let df = toDf(x, y)

ggplot(df, aes("x", "y")) + 
  geom_line() +
  scale_x_continuous(labels = proc(x: float): string =
                              x.formatFloat(ffDecimal, 2)) +
  xlab(label = " ") +
  ggsave("media/recipes/rFormatDecimalsPlot.png")

This gives us:

media/recipes/rFormatDecimalsPlot.png

Custom tick positions & number of ticks

Another sometimes desired feature is changing the number of ticks or placing ticks at predefined position. Both can be achieved by using the breaks argument to the scale_x/y_* procedures. A modified example of the rSimpleLine.nim recipe shows both possibilities:

import ggplotnim, sequtils, seqmath

let x = linspace(0.0, 30.0, 1000)
let y = x.mapIt(pow(sin(it), 2.0))
let df = toDf(x, y)

ggplot(df, aes("x", "y")) + # x and y are the identifiers given above as strings
  geom_line() +
  scale_x_continuous(breaks = @[0.0, 1.0, 2.0, 12.0]) + # set custom ticks along x
  scale_y_continuous(breaks = 50) + # set a custom number of ticks along y
  ggsave("media/recipes/rCustomBreaks.png")

Note in particular that if one uses an integer, the given number is only a “desired” number of ticks. Internally we still look for the closest number that yields “nice” looking tick labels.

media/recipes/rCustomBreaks.png

Histogram with weights for each entry

Since version v0.3.9 it’s finally possible to give the elements being binned when using geom_histogram (or any other geom with ~stat = “bin”~).

Note that for the time being the default label for the weighted axis is still “count”!

If each element to be binned has a corresponding weight, this is simply done by using the weight aes pointing to said column:

import ggplotnim
let df = readCsv("data/diamonds.csv")
ggplot(df, aes("carat", weight = "price")) + 
  geom_histogram() + 
  ylab("Binned carat weighted by price") + 
  ggsave("media/recipes/rWeightedHistogram.png")

Which gives the following:

./media/recipes/rWeightedHistogram.png

Feel free to compare this result to the same plot with a weight of 1 for each element!

Histograms showing density instead of counts

Another addition for binned geoms in version v0.3.9 is the support for density calculations. Instead of associating to each bin the number of counts, the density of each bin is returned instead.

Don’t confuse this with the (soon to be added™) geom_density, which calculates a kernel density estimation given a set of samples and returns a smooth estimation of the underlying distribution.

import ggplotnim
let df = readCsv("data/diamonds.csv")
ggplot(df, aes("carat")) + 
  geom_histogram(density = true) + 
  ggsave("media/recipes/rHistogramDensity.png")

./media/recipes/rHistogramDensity.png

Supplying custom colors to a geom

With version 0.3.17 it is finally possible to provide custom mappings for discrete scales, using scale_color/size/fill_manual.

Courtesy of @hffqyd, consider the following tile map, in which we override the default ggplot fill colors by a custom mapping:

import ggplotnim, sequtils, seqmath, chroma, tables

let
  pos = [1, 2, 3, 1, 2, 3]
  name = ["a", "a", "a", "b", "b", "b"]
  n = [0, 1, 4, 4, 2, 3]
  df = toDf(pos, name, n)

ggplot(df, aes("pos", "name")) +
  geom_tile(aes(fill = "n")) +
  geom_text(aes(text = "n"), size = 25.0) +
  scale_x_discrete() +
  scale_y_discrete() +
  scale_fill_manual({ 0 : color(1.0, 0.0, 0.0),
                      1 : color(0.0, 1.0, 0.0),
                      2 : color(0.0, 0.0, 1.0),
                      3 : color(1.0, 1.0, 0.0),
                      4 : color(1.0, 0.0, 1.0) }.toTable) +
  ggsave("media/recipes/rCustomFill.png")

Which gives us the following tiles:

media/recipes/rCustomFill.png

Customizing the margins around a plot

While ggplotnim tries to be reasonably smart about the margins a plot requires, there are still many cases in which user defined margins are required.

An example are very long labels. An example derived from the above rCustomFill case by @hffyd (ref: issue #89).

The margin procedure can be used to set the correct fields of a Theme object. By default it interprets the given quantities in centimeter. Inch, point and even relative values are also supported using its unit argument.

import ggplotnim

let
  pos = [1, 2, 3, 1, 2, 3]
  name = ["a very long long label", "a very long long label", "a very long long label", "b", "b", "b"]
  n = [0, 1, 4, 4, 2, 3]
  df = toDf(pos, name, n)

ggplot(df, aes("pos", "name")) +
  geom_tile(aes(fill = "n")) +
  geom_text(aes(text = "n"), size = 25.0) +
  scale_x_discrete() +
  scale_y_discrete() +
  margin(left = 6.0) +
  ggsave("media/recipes/rCustomMargins.png")

./media/recipes/rCustomMargins.png

Plot with a title including line breaks

In case of super long titles the title will automatically be wrapped.

However, by default the space in the top part of the plot will not be large enough to accomodate the required space for more than 1 line of text using the default font size. That’s why we set the margin at the top of the plot here to 2 cm.

If the user provides a manual wrapping (i.e. the title contains at least one \n) no automatic wrapping will be performed.

import ggplotnim, sequtils, seqmath

let x = linspace(0.0, 30.0, 1000)
let y = x.mapIt(pow(sin(it), 2.0))

let df = toDf(x, y)

ggplot(df, aes("x", "y")) +
  geom_line() +
  margin(top = 2) +
  ggtitle("This is a very long title which gets cropped on the right side as it's longer than the image width.") +
  ggsave("media/recipes/rLongTitleMultiline.png")

./media/recipes/rLongTitleMultiline.png

Histograms drawn as an outline

v0.3.21 adds an additional drawing option for histograms. Before histograms were always drawn by drawing individual bars. This can lead to problems:

  1. we can have aliasing effects where depending on the size the resulting plot is viewed at, the background may be visible in small lines between the bars, even though the bars should be touching perfectly (or even overlap slightly), see <a href=”Simple histogram with N bins”>Simple histogram with N bins for an example where this effect is visible.
  2. when plotting multiple histograms using ~position=”identity”~ (i.e. on top / behind one another) the edges of each bar can be very distracting visually, because even if one sets an alpha for the histograms the alphas of the bin edges might overlap making them stand out again.

By drawing histograms as outlines both of these problems are solved.

Let’s look at how to plot such a histogram and see what it looks like. We will plot a histogram of purely the outline (so no fill) to better visualize the idea, but be assured that issue 1 is remedied in case of filled histograms.

We will draw the example from Change alpha of the fill color as outlined histograms.

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("cty", color = "class")) + 
  geom_histogram(lineWidth = 2.0, 
                 alpha = 0.0, # make transparent (only fill)
                 hdKind = hdOutline) + # draw as outline
  ggsave("media/recipes/rHistogramOutline.png")

media/recipes/rHistogramOutline.png

Smoothing noisy data with geom_smooth

In case of dealing with noisy data, one might want to see the trend of the data geom_smooth is the tool for the job.

It provides (currently) two different smoothing methods to apply to data.

  1. A Savitzky-Golay filter (often called “LOESS” or local regression in certain bubbles)
  2. A polynomial fit of arbitrary order N.

(The latter can of course be used to perform a linear fit to the data)

Note: both the Savitzky-Golay filter use LAPACK to solve a least squares problem, which means you need a working LAPACK installation.

Let us look at the commits per day to all Nimble repositories (thanks to @haxscramper for the dataset!).

In this example we leave the x labels as a unix timestamp. Combine it with the plot below (<a href=”Adjusting tick labels according to custom date time information”>Adjusting tick labels according to custom date time information) to see how to improve on that.

import ggplotnim
let df = readCsv("data/commits_nimble.csv")
ggplot(df, aes("days", "count")) +
  geom_line() + # plot the raw data as a line
  geom_smooth() + # draw a default smoother. This is a Saitzky-Golay filter of
                  # order 5 and a window `span` of 70%
  geom_smooth(smoother = "poly", # add a polynomial smoother using the full range
              polyOrder = 7,     # of order 7
              color = "#FF0000", # and red line (named colors e.g. "red" also allowed)
              size = 1.0) + # that is not as thick
  ggsave("media/recipes/rGeomSmooth.png")

media/recipes/rGeomSmooth.png

As we can see, in this case both algorithms give more or less good results. The Savitzky-Golay filter represents more of the variation of the actual change in the data, whereas the polynomial fit is just a smooth interpolation of the data. The latter shows some artifacts in places, e.g. on the left is an unphysical bump (c/f Runge’s phenomenon), which make it less suited for some datasets and polynomial order combinations.

In case of the Savitzky-Golay filter it’s possible to keep more of the local information by choosing a smaller span size (argument span). That results in a more wiggly result, which may be preferable in some use cases.

Adjusting tick labels according to custom date time information

In the above example Smoothing noisy data with =geom_smooth= the x axis contains unix timestamps as the date format. This is often enough for quickly checking things, but once a plot should be presentable to others or for a better understanding of specific dates (as unix timestamps are hard to assign to actual calendar dates), assigning labels that are human readable is very useful.

Starting from ggplotnim version v0.4.4 one can achieve this with the added scale_x/y_date, which assigns a date scale to either the x or y axis.

It requires 3 of 4 arguments:

  • isTimestamp: bool: tells the procedure whether the associated axis (x or y) contains time data as a unix timestamp (as integer or float).
  • parseDate: if isTimestamp is false, we require a procedure of signature:
    proc(x: string): DateTime
        

    which simply performs the necessary conversion of a string storing date time information to a DateTime object (we don’t use a format string here as it is more restrictive in what is possible here. It may be added as an additional argument in the future though).

  • formatString: string: The format string to use to convert a DateTime object to a tick label. This defines what the tick label will look like later.

As these 3 arguments alone are not enough to determine where and how many ticks to place, a last argument is required:

  • dateSpacing: Duration: It simply gives a Duration object that stores the duration required between two ticks. Be careful when dealing with multiple years, as Duration does not support years and a full year is not exactly 52 weeks, but 52 weeks + 1 day { when ignoring leap years etc. }).

Let’s use this information to turn the above plot into one with a sensible x axis:

import ggplotnim, times
var df = readCsv("data/commits_nimble.csv")
ggplot(df, aes("days", "count")) +
  geom_line() +
  scale_x_date(isTimestamp = true,                       # x is unix timestamp
               formatString = "MMM-yyyy",                # format as 'Jan-1970'
               dateSpacing = initDuration(days = 1,      # one year is roughly
                                          weeks = 52)) + # 365.25 days, but we use 365
  geom_smooth(span = 0.64) +
  ylab("Commit count") + xlab("Date") +
  ggtitle("Daily commit counts in all nimble repositories") +
  ggsave("media/recipes/rScaleXDate.png")

As we can see, we now get nice yearly labels of the month. The label is placed exactly at the start of the month of that year!

Note: In principle using weeks = 52 is enough, as we internally first filter to all unique tick labels that are possible for the format string. Then we check the distance between the last acceptable tick and next ones to find that which is closest to being dateSpacing away from it. The first tick label is computed from the date range on the axis. With 52 weeks this would already always fall to the next exact year, as we only keep month + year information in the format string.

media/recipes/rScaleXDate.png

In addition to the used arguments there are 2 more arguments of interest to the scale_x/y_date procedures:

  • dateAlgo: allows the selection of an alternative algorithm for the determination of the tick positions. The default dtaFilter behaves as described in the note above. dtaAddDuration simply adds the given dateSpacing onto the first valid tick and re-formats the result with formatString for a correct placement and text. The latter is useful for sparse input data along the time column.
  • breaks: allows to hand custom timestamps (unix) at which to place ticks. Uses formatString to format the timestamps. Ignores the dateAlgo and dateSpacing.

Using geom_smooth to perform a linear fit

geom_smooth currently provides two ways to “smooth” data. Either using a Savitzky-Golay filter (to smooth using a convolution in a fixed window with specific polynomial coefficients) and regular polynomial fits to the data.

One special case that is commonly used of polynomial fits, is fitting a polynomial of order 1. Or in other words, fitting a line.

Let’s showcase this by fitting independent lines to multiple classes of a dataset. Namely, we will take our classy mpg dataset and plot the engine displacement displ against the highway fuel efficiency hwy. In addition we will classify these by color based on the class of car class each car model is in.

Then we’ll fit a linear model to all classes that tells us the trend in engine displacement vs. fuel economy for different classes of cars.

All we need to do to accomplish this, is add one geom_smooth layer, in which we use the =”poly”= smoother using polyOrder = 1 (a line):

import ggplotnim
let df = readCsv("data/mpg.csv")
ggplot(df, aes("displ", "hwy", color = "class")) +
  geom_point() +
  geom_smooth(smoother = "poly", polyOrder = 1) +
  ggsave("media/recipes/rLinearFit.png")

This gives us the following pretty plot:

media/recipes/rLinearFit.png

Native LaTeX plots using the TikZ backend

There are many reasons why one might want to produce a TeX file as the target for a plot. Two major ones:

  • matching fonts & font sizes for plots in a LaTeX document
  • type setting arbitrary text (maths, physics, chemistry equations & formulae, …)

From ggplotnim version v0.4.4 the TikZ backend allows one to do just that. In principle it generates a basic .tex file. Different options allow to define the document class (article or standalone), purely emitting a TikZ command (\begin{tikzpicture} ... \end{tikzpicture)) or just giving a full TeX template into which the TikZ image will be inserted (requires a strutils.% compatible string with a single $# argument).

Alternatively, it is possible to directly compile the generated TeX files to PDFs using a local TeX compiler. By default we use xelatex (for better unicode support).

Note: Placing text on the TikZ backend comes with some quirks:

  1. text placement may be slightly different than on the Cairo backend, as we currently use a hack to determine string widths / heights based on font size alone. ginger needs an overhaul to handle embedding of coordinates into viewports to keep string width / height information until the locations are written to the output file (so that we can make use of text size information straight from TeX)
  2. Text is placed into TikZ node elements. These have some quirky behavior for more complex LaTeX constructs. E.g. it is not really possible to use an equation environment in them (leads to “Missing $ inserted” errors).
  3. because of hacky string width / height determination placing a non transparent background for annotations leads to background rectangles that are too small. Keep the background color transparent for the time being.
  4. Do not include line breaks \n in your annotations if you wish to let LaTeX handle line breaks for you. Any manual line break \n will be handled by ginger. Due to the string height hack, this can give somewhat ugly results.

Let’s create a plot in which we draw a Landau distribution and annotate the text with the correct mathematical formula to showcase math typesetting.

import ggplotnim, math, sequtils, latexdsl, strutils, ginger
proc landauApprox(x: float): float =
  result = 1.0 / sqrt(2 * math.PI) * exp(- (x + exp(-x)) / 2 )

proc annotateText(): string =
  # pure math in raw string due to too many invalid nim constructs for `latex` 
  # macro (would result in ugly mix of strings and identifiers)
  # Need manual math mode via `$`!
  let eq = r"$p(x) = \frac{1}{2πi}\int_{a-i∞}^{a+i∞} e^{s \log(s) + xs}\, ds$" 
  let eqApprox = r"$p(x) \approx \frac{1}{\sqrt{2π}} \exp\left(- \frac{x + e^{-x}}{2} \right)$"
  # align text with math using 2 line breaks for better separation (single line break is too
  # squished. Cannot use `equation` or similar in a TikZ node afaiu :/)
  result = r"The Landau distribution\\ \\ " &
    eq & r"\\ \\" &
    r"reduces to the approximation:\\ \\ " &
    eqApprox & r"\\ \\ " & 
    r"for $μ = 0, c = 1$"
     
let x = linspace(-5.0, 15.0, 1000)
let y = x.mapIt(landauApprox(it))
let df = toDf(x, y)
ggplot(df, aes("x", "y")) +
  geom_line() +                             # draw our Landau data as a line
  annotate(annotateText(),                  # add our text annotation
           x = 5.0, y = 0.1,                # at this location in 'data space'
           backgroundColor = transparent) + # transparent background as we do manual TeX line breaks
  ggtitle(r"Approximation of Landau distribution: " & 
    r"$p(x) \approx \frac{1}{\sqrt{2π}} \exp\left(- \frac{x + e^{-x}}{2} \right)$",
    titleFont = font(12.0)) +
  ggsave("media/recipes/rTikZLandau.tex", standalone = true) # standalone to get TeX file w/ only plot
  # use:
  # ggsave("media/recipes/rTikZLandau.pdf", useTex = true, standalone = true)
  # to directly compile to standalone PDF (using local xelatex)

which results in a (document class ‘standalone’) TeX file. After compilation with xelatex it yields the following plot (converted to a PNG to show here):

media/recipes/rTikZLandau.png

The generated TeX file can be found here:

media/recipes/rTikZLandau.tex

Ridgeline plots

Ridgeline plots (sometimes also called joyplots, referencing the album cover of “Unknown Pleasures” by Joy Division, which shows a time series plot of the first discovered radio pulsar CP 1919) can sometimes be useful to see the trends visible in multiple histograms / density estimations of different times / classes etc. For few histograms it’s fine to plot them in one window, but for more than 4 or 5 histograms it can quickly become too visually busy (drawing a histogram as an outline can help to an extent).

Ridgeline plots solve this by stacking the individual plots in multiple “ridges”, sometimes done including an overlap of the different panes.

We’re going to look at an example here, showing the distribution of the “gaussianity” of a point-cloud over different runs of a detector.

We will start with a regular plot, where we will highlight both the mean value of each distribution in red and the mean of all datasets in black.

The input file gaussSigma_runs.csv contains 3 columns, bins, counts and Run. The first two are simply the binned data of the distribution and Run designates which run the data corresponds to.

import ggplotnim

proc getMean(bins, counts: Tensor[float]): float =
  ## we define a helper proc to compute each mean of each ridge
  ## individually for each `Run`
  for i in 0 ..< counts.size:
    result += bins[i] * counts[i]
  result /= counts.sum

let df = readCsv("data/gaussSigma_runs.csv")
echo df
let mean = getMean(df["bins", float], df["counts", float])
let ymax = df["counts", float].max
ggplot(df, aes("bins", "counts", fill = "Run")) +
  ggridges("Run", overlap = 3.0) + # use a large overlap
  geom_freqpoly(stat = "identity", # do not perform binning
                color = "white", # white outline
                size = 1.5) + # make the lines a bit thicker
  geom_linerange(aes = aes(yMin = 0, # draw black line of common mean
                           yMax = ymax, 
                           x = mean), 
                 color = "black") +
  geom_linerange(aes = aes(
    fill = "Run", yMin = 0, yMax = ymax, # draw red line for each mean 
    x = f{float -> float: getMean(`bins`, `counts`)}), # compute the mean of the current Run
    color = "#FF0000") + # color the line red
  margin(top = 2) + # increase top margin due to large overlap
  xlab("gaussSigma") + ylab("Counts") +
  ggsave("media/recipes/rRidgeLineGauss.png", width = 1200, height = 1200)

This is a nice plot:

media/recipes/rRidgeLineGauss.png

But of course it begs the question, can we make it pretty? ;)

Let’s remove those distracting lines for the mean, turn off the legend and paint it black!

import ggplotnim
let df = readCsv("data/gaussSigma_runs.csv")
ggplot(df, aes("bins", "counts", fill = "Run")) +
  ggridges("Run", overlap = 3.0) +
  geom_freqpoly(stat = "identity", color = "white",
                size = 3.0) +
  xlab("gaussSigma") + ylab("Counts") +
  margin(top = 3, right = 2.5, bottom = 2) +
  theme_void("black") + hideLegend() +
  ggsave("media/recipes/rRidgeLineGaussBlack.png", width = 1200, height = 1200)

media/recipes/rRidgeLineGaussBlack.png

That’s what I call pretty!

But well, some people might prefer to go full Joy Division (just remove the classification by “Run”):

import ggplotnim
let df = readCsv("data/gaussSigma_runs.csv")
ggplot(df, aes("bins", "counts")) +
  ggridges("Run", overlap = 3.0) +
  geom_freqpoly(stat = "identity", color = "white",
                size = 2.0) +
  margin(top = 3, right = 2.5, bottom = 2) +
  theme_void("black") + hideLegend() +
  ggsave("media/recipes/rJoyplot.png", width = 1200, height = 1200)

media/recipes/rJoyplot.png

We joyplots yet? Well, this may not be the prettiest joyplot ever, but given more classifying fields and a larger overlap, this would look like the original. ;)

Simple Vega-Lite example

From version v0.3.21 on most plots also work using Vega-Lite (with some quirks). This excludes plots, which use facetting or subplots of any kind (including ggridges).

Usage is very simple and just requires two things: import ggplotnim/ggplot_vega and replace the ggsave call by ggvega. A simple example:

import ggplotnim
import ggplotnim/ggplot_vega
let mpg = readCsv("data/mpg.csv")
ggplot(mpg, aes(x = "displ", y = "cty", color = "class")) +
  geom_point() +
  ggtitle("ggplotnim in Vega-Lite!") +
  ggvega("media/recipes/rSimpleVegaLite.html") # w/o arg creates a `/tmp/vega_lite_plot.html`

The behavior of the ggvega call can be adjusted with multiple arguments. By default without any arguments it will generate an HTML file in the temp directory and show it using the webview backend. The backend can be switched to using the default browser with the backend argument. The temporary file will be removed after, unless removeFile is set to false.

If the given filename is a .json file only such a file will be generated and no plot will be shown, which is useful to generate multiple Vega-Lite plots in one program. Pure HTML files can also be generated without showing the plots if the show argument is set to false.

This recipe gives us the following plot:

media/recipes/rSimpleVegaLite.png

and here as an interactive chart in the vega browser.

Example of changing font size in plot

Log(-log) plot

See TMAS section below for now.

Line + point plot (w/ different) number of elements per type

See TMAS section below for now.

Set custom margin on a label

Add lines in a plot to highlight something

See TMAS section below for now.

NOTE: starting from version v0.2.15, this can also be achieved using geom_linerange!

TMAS “Tell Me A Story” recipes

This section goes for a more cohesive, explanatory and hopefully more fun introduction to different kinds of plots. Also possible alternatives might be discussed.

Fun with elements

This example ports the idea from the plotnine documentation.

import ggplotnim, sequtils, seqmath, strutils

##
## This is a straight up adaptation from the genius `plotnine` example
## here:
## https://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_tile.html
##

var elements = readCsv("data/elements.csv")
  .mutate(f{Value -> int: "group" ~ (
    if `group` == "-": -1
    else: `group`.toInt)})
echo elements.pretty(5)

# split the lanthanides and actinides from the rest
var top = elements.filter(f{`group` != -1})
  .rename(f{"x" <- "group"},
          f{"y" <- "period"})
var bottom = elements.filter(f{`group` == -1})
echo top["x"]
echo top["y"]

const nrows = 2
const hshift = 3.5
const vshift = 3.0
bottom["x"] = toColumn cycle(arange(0, bottom.len div nrows), nrows).mapIt(it.float + hshift)
bottom = bottom.mutate(f{"y" ~ `period` + vshift})
const tile_width = 0.95
const tile_height = 0.95

# replace `elements` by stacked top and bottom
elements = bind_rows([top, bottom])

let splitDf = toDf({
  "y": @[6, 7],
  "metal": @["lanthanoid", "actinoid"]
})

func cycle[T](s: openArray[T]; nums: seq[int]): seq[T] =
  result = newSeq[T](nums.foldl(a + b))
  var idx = 0
  for i in 0 ..< nums.len:
    for j in 0 ..< nums[i]:
      result[idx] = s[i]
      inc idx
# finally define rows and cols
let groupdf = toDf({
    "group": arange(1, 19),
    "y": cycle(@[1, 2, 4, 2, 1], @[1, 1, 10, 5, 1])})
let periodDf = toDf({
    "period": arange(1, 8),
    "x": cycle(@[0.5], @[7])})

ggplot(elements, aes("x", "y", fill = "metal")) +
  geom_tile(aes = aes(width = tileWidth,
                      height = tileHeight)) +
  geom_tile(data = splitDf,
            aes = aes(x = 3 - tileWidth/4.0 + 0.25,
                      width = tileWidth / 2.0,
                      height = tileHeight)) +
  scale_y_continuous() +
  geom_text(aes(x = f{`x` + 0.15},
                y = f{`y` + 0.15},
                text = "atomic number"),
            font = font(6.0)) +
  geom_text(aes(x = f{`x` + 0.5},
                y = f{`y` + 0.4},
                text = "symbol"),
            font = font(9.0)) +
  geom_text(aes(x = f{`x` + 0.5},
                y = f{`y` + 0.6},
                text = "name"),
            font = font(4.5)) +
  geom_text(aes(x = f{`x` + 0.5},
                y = f{`y` + 0.8},
                text = "atomic mass"),
            font = font(4.5)) +
  geom_text(data = groupdf,
            aes = aes(x = f{`group` + 0.5},
                      y = f{`y` - 0.2},
                      text = "group"),
            font = font(9.0, color = color(0.5, 0.5, 0.5))) +
  geom_text(data = periodDf,
            aes = aes(x = f{`x` + 0.3},
                      y = f{`period` + 0.5},
                      text = "period"),
            font = font(9.0, color = color(0.5, 0.5, 0.5))) +
  legendPosition(0.82, 0.1) +
  theme_void() +
  margin(right = 5, bottom = 2) + # adjust margin as `theme_void` implies tight margins
  scale_y_reverse() +
  scale_x_continuous() +
  ggsave("media/recipes/rPeriodicTable.png",
         width = 1000,
         height = 500)

which gives the following amazing result (better to look at a PDF!):

media/recipes/rPeriodicTable.png

Check if a point is in one or more polygons

This came up recently in a discussion with a colleague. The question was how to determine whether a point in 2D space lies within one or more (not necessarily disjoint) polygons.

I thought it would be fun to implement this with a simple algorithm and use ggplotnim to show that it works.

Let’s start by defining a few data types to make life easier:

import ggplotnim, sequtils, chroma, options, sugar, random

type 
  Point = object
    x, y: float
  Vertex {.borrow: `.`.} = distinct Point
  Polygon = object
    vertices: seq[Vertex]

Not that it really matters for a toy example, but I actually wanted Point to be a generic, but then {.borrow.} is broken: nim-lang/Nim#14449

The Polygon type in this example could of course also just be an alias for seq[Vertex]. But maybe for a more complete example the polygon would store additional information.

Given our types, we can define few helpers to access information more easily and create a Polygon from a set of points (not using Point due to the issue mentioned above…):

func len(p: Polygon): int = p.vertices.len
func `[]`(p: Polygon, idx: int): Vertex = p.vertices[idx]
    
proc initPolygon[T](vs: varargs[tuple[x, y: T]]): Polygon =
  result.vertices = newSeq[Vertex](vs.len)
  for i, v in vs:
    result.vertices[i] = Point(x: v[0].float, y: v[1].float).Vertex

On the other hand later we want to visualize the result with ggplotnim, so we also need something to convert the data back into a form from which we can create a data frame (obviously the code as written here is super inefficient since we usl mapIt for clarity and thus loop multiple times!):

proc flatten(p: Polygon): (seq[float], seq[float]) =
  result = (p.vertices.mapIt(it.x), p.vertices.mapIt(it.y))
  # add first point to get proper drawn polygon with geom line!
  result[0].add p.vertices[0].x
  result[1].add p.vertices[0].y

where we appended the first point to the result again, because we have to “abuse” geom_line to draw a polygon, which doesn’t close the line by default.

Now comes the major part of the code, namely the check whether a given point lies within a certain polygon. This based on the code here: https://wrf.ecse.rpi.edu/Research/Short_Notes/pnpoly.html

proc inPolygon(p: Point, poly: Polygon): bool = 
  # based on: https://wrf.ecse.rpi.edu/Research/Short_Notes/pnpoly.html
  var j = poly.len - 1
  for i in 0 ..< poly.vertices.len:
    if ((poly[i].y <= p.y and p.y < poly[j].y) or
        (poly[j].y <= p.y and p.y < poly[i].y)) and
       (p.x < (poly[j].x - poly[i].x) * (p.y - poly[i].y) / (poly[j].y - poly[i].y) + poly[i].x):
      result = not result
    j = i

With this defined we can quickly define a proc that checks whether a point is in a sequence of polygons:

proc inAnyPolygon(p: Point, polys: seq[Polygon]): bool =
  for poly in polys:
    if p.inPolygon(poly): return true

Now we are essentially done. Let’s create a couple of polygons:

let p1 = initPolygon((0, 1), (6, 0), (5, 2), (4, 1), (2, 4))
let p2 = initPolygon((5, 4), (8, 10), (10, 2), (7, 4))

and create a data frame, which essentially stores the columns:

xyNum

where Num is the number of the polygon is part of. This layout allows us to use ggplotnims built-in functionality to draw them as separate polygons.

let df1 = toDf({ "x" : p1.flatten[0],
                    "y" : p1.flatten[1] })
let df2 = toDf({ "x" : p2.flatten[0],
                    "y" : p2.flatten[1] })
let df = bind_rows(("Polygon 1", df1), ("Polygon 2", df2), "Num")

All that is left is to draw a bunch of points and check whether they are in any polygon or not. This is stored as a column of bools, which we can then classify by in the same way as we will for the Num column of the polygons:

var rnd = initRand(42)
# now sample a bunch of points in (0, 10) plane and plot it
let points = collect(newSeq):
  for i in 0 ..< 300:
    Point(x: rnd.rand(10.0), y: rnd.rand(10.0))
let inPoly = points.mapIt(it.inAnyPolygon(@[p1, p2]))
  
let dfPoints = toDf({ "x" : points.mapIt(it.x),
                          "y" : points.mapIt(it.y),
                          "InPoly" : inPoly })

The ggplot call is reasonably simple. The main aes only contains x and y, because only these two columns are shared between the two data frames. The fillColor argument for the geom_line call is just to override the colorring that the additional fill aesthetic will otherwise provide (which we need to get separate polygons). The problem is that the colors would be the same as for the points (the outline color still shows that!). For the points we hand the additional data frame and define the classification via the InPoly column.

# TODO: results in vertical line at start of polygon
ggplot(df, aes(x, y)) +
  geom_line(aes = aes(fill = "Num"), fillColor = "#ebba34") +
  geom_point(data = dfPoints, aes = aes(color = "InPoly")) +
  ggsave("./media/recipes/rPointInPolygons.png")

Which gives us the following plot that shows us the algorithm works as expected:

./media/recipes/rPointInPolygons.png

Plot a function

Assuming we have some mathematical function we want to plot. While this library is no gnuplot, this is still very simple (goes on talking about not simple example…). Let’s pretend we want to plot the gravitational acceleration of a point mass according to Newton. The analytical description would be

./media/newton_eq.png,

where r is the radial distance, R the radius of Earth, m the mass of Earth and G the gravitational constant. It shows both the case inside a massive body and outside.

Let’s plot this for Earth in a range from Earth’s center to X km!

First import the stuff we need:

import ggplotnim
import seqmath # for linspace, pow
import sequtils # for mapIt

Now we define the func, which returns the acceleration of Earth depending on the radial distance r:

func newtonAcceleration(r: float): float =
  ## returns the graviational acceleration experienced by a test mass
  ## at different distances from Earth (or inside Earth).
  ## `r` is the radial distance given in `m`
  const R = 6371 * 1000 # mean radius of Earth in m
  const m_E = 5.972e24 # kg
  const G = 6.674e-11 # m^3 kg^-1 s^-2
  if r < R:
    result = G * m_E * r / pow(R, 3.0)
  else:
    result = G * m_E / (r * r)

We have to define the range we actually want to look at. Let’s consider Earth’s center up to roughly the geostationary orbit at ~ 35,000 km.

let radii = linspace(0.0, 35_000_000, 1000) # up to geostationary orbit
# and the corresponding accelerations
let a = radii.mapIt(newtonAcceleration(it))

This gives us two seq[float], but we need a DataFrame. So we combine the two:

var df = toDf({ "r / m" : radii,
                "g(r) / m s¯²" : a})

which gives us a data frame with two columns. The names are, as one can guess, the given strings. (Note that in practice one might not want to use unicode superscripts etc. It’s just to show that it’s possible and allows us to have it in the y label without setting the y label manually).

However, we might want to plot it against km instead of m, so let’s transmute the data frame:

df = df.transmute(f{"r / km" ~ c"r / m" / 1000.0}, f{"g(r) / m s¯²"})

The transmute function takes a variable number of elements. Only those columns that appear here (on the LHS of the ~) will be part of the resulting data frame.

And finally create the plot of the dependency:

ggplot(df, aes("r / km", "g(r) / m s¯²")) +
  geom_line() +
  ggtitle("Gravitational acceleration of Earth depending on radial distance") +
  ggsave("media/recipes/rNewtonAcceleration.png")

The resulting plot is the following:

./media/recipes/rNewtonAcceleration.png

and shows what we expect. A linear increase in acceleration up to the surface and a 1/r^2 drop from there. At this point we might ask “Do we recover the known 9.81 m/s^2 at the surface?”. Let’s see. There’s many different ways we could go on about this. We’ll use summarize:

let maxG = df.summarize(f{float: "g_max" << max(col("g(r) / m s¯²"))})

An alternative way would be to access the data column directly, like so:

let maxG_alt = df["g(r) / m s¯²"].toTensor(float).max

where access via [] returns a PersistentVector[Value]. To copy the values to a seq[Value], so that we can use procs like max on it, we can use vToSeq (it’s not just toSeq, because that breaks things: https://github.com/nim-lang/Nim/issues/7322…) Let’s see what we have:

echo "Max acceleration:\n ", maxG

which should give roughly 9.8 m / s^2. The deviation comes from the fact that we didn’t actually look at the value at the surface exactly, but took a rough grid from 0 - 35,000 km with 1000 points. Evaluating the proc at the radius exactly might give a better result:

echo "At surface = ", newtonAcceleration(6371000)

except now we see a value slightly too large (~ 9.82). Because now we’d have to include the rotation of Earth to account for the centripetal force… But since this isn’t a physics lesson (going down this rabbit hole is a lot of fun though, I promise!), we’ll stop here!

“Tell me an axion story without telling it to me” story

Creating a log-log plot is as easy as done in the makePlot proc at the bottom. And yes, there’ll be explanations soon here for the curious of you! :)

import sequtils, seqmath, ggplotnim

proc effPhotonMass(ne: float): float =
  ## returns the effective photon mass for a given electron number density
  const alpha = 1.0 / 137.0
  const me = 511e3 # 511 keV
  # note the 1.97e-7 cubed to account for the length scale in `ne`
  result = sqrt( pow(1.97e-7, 3) * 4 * PI * alpha * ne / me )

proc numDensity(c: float): float =
  ## converts a molar concentration in mol / m^3 to a number density
  const Z = 2 # number of electron in helium atom
  result = Z * 6.022e23 * c

proc molarAmount(p, vol, temp: float): float =
  ## calculates the molar amount of gas given a certain pressure,
  ## volume and temperature
  ## the pressure is assumed in mbar
  const gasConstant = 8.314 # joule K^-1 mol^-1
  let pressure = p * 1e2 # pressure in Pa
  result = pressure * vol / (gasConstant * temp)

proc babyIaxoEffMass(p: float): float =
  ## calculates the effective photon (and thus axion mass) for BabyIAXO given
  ## a certain helium pressure in the BabyIAXO magnet
  const vol = 10.0 * (PI * pow(0.3, 2)) # 10m length, bore radius 30 cm
  # UPDATE: IAXO will be run at 4.2 K instead of 1.7 K
  # const temp = 1.7 # assume 1.7 K same as CAST
  const temp = 4.2
  once:
    echo "BabyIAXO magnet volume is ", vol, " m^3"
    echo "BabyIAXO magnet temperature is ", temp, " K"
  let amountMol = molarAmount(p, vol, temp) # amount of gas in mol
  let numPerMol = numDensity(amountMol / vol) # number of electrons per m^3
  result = effPhotonMass(numPerMol)

proc logspace(start, stop: float, num: int, base = 10.0): seq[float] =
  ## generates evenly spaced points between start and stop in log space
  let linear = linspace(start, stop, num)
  result = linear.mapIt(pow(base, it))

proc makePlot(pstart, pstop: float, fname: string) =
  let pressures = logspace(pstart, pstop, 1000) # 1000 values from 1e-5 mbar to 500 mbar
  let masses = pressures.mapIt(babyIaxoEffMass(it)) # corresponding masses
  # convert both seqs to a dataframe
  let df = toDf({"P / mbar" : pressures, "m_a / eV" : masses})
  ggplot(df, aes("P / mbar", "m_a / eV")) +
    geom_line() +
    scale_x_log10() +
    scale_y_log10() +
    ggtitle("Sensitive axion mass in eV depending on helium pressure in mbar") +
    ggsave(fname)

makePlot(-6.0, 2.0, "media/recipes/rAxionMassesLogLog.png")

This creates the following plot:

./media/recipes/rAxionMassesLogLog.png

“Another one of the quiet stories” story

A typical problem is comparing some measured data from experiment with a theoretical model for which an analytical description may exist. So that the analytical model may be represented by O(1000) elements, while the measured data only consists of N << 1000 elements. This can be done as shown below. It creates a plot of the mass attenuation function of X-rays in the energy range between 0-10 keV (very soft X-rays) for ⁴He.

import sequtils, seqmath, ggplotnim

proc logMassAttenuation(e: float): float =
  ## calculates the logarithm of the mass attenuation coefficient for a given
  ## energy `e` in `keV` and the result in `cm^2/g`
  result = -1.5832 + 5.9195 * exp(-0.353808 * e) + 4.03598 * exp(-0.970557 * e)

let energies = linspace(0.0, 10.0 , 1000) # from 0 to 10 keV
let logMuOverRho = energies.mapIt(logMassAttenuation(it))
# now the non-log values
let muOverRho = logMuOverRho.mapIt(exp(it))

const massAttenuationFile = "data/mass_attenuation_nist_data.txt"
# skip one line after header, second header line
var dfMuRhoTab = readCsv(massAttenuationFile, header = "#", 
                         sep = ' ')
  # convert MeV energy to keV
  .mutate(f{"Energy" ~ c"Energy" * 1000.0})
  .filter(f{float: c"Energy" >= energies.min and c"Energy" <= energies.max})
# create df of interpolated values
let dfMuRhoInterp = toDf({ "E / keV" : energies,
                               "mu/rho" : muOverRho })
# rename the columns of the tabulated values df and create a log column
dfMuRhoTab = dfMuRhoTab.rename(f{"E / keV" <- "Energy"})
# build combined DF
let dfMuRho = bind_rows([("Interpolation", dfMuRhoInterp), 
                         ("NIST", dfMuRhoTab)], 
                        id = "type")

# and the plot of the raw mu/rho values
ggplot(dfMuRho, aes("E / keV", "mu/rho", color = "type")) + 
  geom_line(data = dfMuRho.filter(f{`type` == "Interpolation"})) +
  geom_point(data = dfMuRho.filter(f{`type` == "NIST"})) +
  scale_y_log10() +
  ggtitle("Mass attenuation coefficient interpolation and data") +
  ggsave("media/recipes/rMassAttenuationFunction.png")

Gives:

./media/recipes/rMassAttenuationFunction.png

Add lines in a plot to highlight something

See TMAS section below for now.

import sequtils, seqmath, ggplotnim

proc logspace(start, stop: float, num: int, base = 10.0): seq[float] =
  ## generates evenly spaced points between start and stop in log space
  let linear = linspace(start, stop, num)
  result = linear.mapIt(pow(base, it))

proc density(p: float, temp = 4.2): float = 
  ## returns the density of the gas for the given pressure.
  ## The pressure is assumed in `mbar` and the temperature (in `K`).
  ## The default temperature corresponds to BabyIAXO aim.
  ## Returns the density in `g / cm^3`
  const gasConstant = 8.314 # joule K^-1 mol^-1
  const M = 4.002602 # g / mol
  let pressure = p * 1e2 # pressure in Pa
  # factor 1000 for conversion of M in g / mol to kg / mol
  result = pressure * M / (gasConstant * temp * 1000.0)
  # convert the result to g / cm^3 for use with mass attenuations
  result = result / 1000.0

proc numDensity(c: float): float =
  ## converts a molar concentration in mol / m^3 to a number density
  const Z = 2 # number of electron in helium atom
  result = Z * 6.022e23 * c

proc effPhotonMass(ne: float): float =
  ## returns the effective photon mass for a given electron number density
  const alpha = 1.0 / 137.0
  const me = 511e3 # 511 keV
  # note the 1.97e-7 cubed to account for the length scale in `ne`
  result = sqrt( pow(1.97e-7, 3) * 4 * PI * alpha * ne / me )

proc pressureGivenEffPhotonMass(m_gamma: float, T = 4.2): float =
  ## calculates the inverse of `babyIaxoEffPhotonMass`, i.e. the pressure 
  ## from a given effective photon mass in BabyIAXO
  result = m_gamma * m_gamma * T / 0.01988

proc molarAmount(p, vol, temp: float): float =
  ## calculates the molar amount of gas given a certain pressure,
  ## volume and temperature
  ## the pressure is assumed in mbar
  const gasConstant = 8.314 # joule K^-1 mol^-1
  let pressure = p * 1e2 # pressure in Pa
  result = pressure * vol / (gasConstant * temp)

proc babyIaxoEffMass(p: float): float =
  ## calculates the effective photon (and thus axion mass) for BabyIAXO given 
  ## a certain helium pressure in the BabyIAXO magnet
  const vol = 10.0 * (PI * pow(0.3, 2)) # 10m length, bore radius 30 cm 
  const temp = 4.2
  let amountMol = molarAmount(p, vol, temp) # amount of gas in mol
  let numPerMol = numDensity(amountMol / vol) # number of electrons per m^3
  result = effPhotonMass(numPerMol) 

proc vacuumMassLimit(energy: float, magnetLength = 10.0): float =
  ## given an axion energy in keV, calculate the limit of coherence
  ## for the vacuum case in BabyIAXO
  let babyIaxoLen = magnetLength / 1.97e-7 # length in "eV"
  result = sqrt(PI * 2 * energy * 1e3 / babyIaxoLen) # convert energy to eV

const babyIaxoVacuumMassLimit = vacuumMassLimit(4.2)
proc m_a_vs_density(pstart, pstop: float) =
  let pressures = logspace(pstart.log10, pstop.log10, 1000)
  let densities = pressures.mapIt(density(it, 4.2))
  let masses = pressures.mapIt(babyIaxoEffMass(it)) # corresponding masses
  # convert both seqs to a dataframe
  let ref1Bar = density(1000, 293.15)
  let df1bar = toDf({"ρ / g/cm^3" : @[ref1Bar, ref1Bar], "m_a / eV" : @[1e-2, 1.0]})
  let ref3Bar = density(3000, 293.15)
  let df3bar = toDf({"ρ / g/cm^3" : @[ref3Bar, ref3Bar], "m_a / eV" : @[1e-2, 1.0]})
  let refVacLimit = density(pressureGivenEffPhotonMass(babyIaxoVacuumMassLimit))
  let dfVacLimit = toDf({"ρ / g/cm^3" : @[refVacLimit, refVacLimit], "m_a / eV" : @[1e-2, 1.0]})
  let df = toDf({"ρ / g/cm^3" : densities, "m_a / eV" : masses})
  let dfComb = bind_rows([("ma vs ρ", df),
                          ("1 bar @ 293 K", df1bar),
                          ("3 bar @ 293 K", df3bar),
                          ("Vacuum limit", dfVacLimit)],
                         id = "Legend")
  ggplot(dfComb, aes("ρ / g/cm^3", "m_a / eV", color = "Legend")) +
    geom_line() + 
    scale_x_log10() + 
    scale_y_log10() +
    ggtitle("Sensitive axion mass in eV depending on helium density in g / cm^3") +
    ggsave("media/recipes/rAxionMassVsDensity.png")

m_a_vs_density(pressureGivenEffPhotonMass(babyIaxoVacuumMassLimit) * 0.9,
               pressureGivenEffPhotonMass(0.4521) * 1.1)

Which gives us the following annotated plot:

./media/recipes/rAxionMassVsDensity.png

Other resources

BabyIAXO calculations

While the following document was mainly written for myself, it might be a nice example as to how one might use ggplotnim to explore some calculation, generate a bunch of plots in a literate environment (almost relieving us of our desire for a working repl…) etc. It showcases a variety of plots one might want to create. At some point those will be part of the recipes above… In fact the plots found in the document below correspond to the TODO items in the GTTP section.

It contains calculations (and a lot of plots) for the sensitive axion mass ranges achievable in the BabyIAXO experiment, a prototype for the IAXO experiment.