BioOutputs

This package contains common R scripts I use in my day to day data analysis of biological data. The scripts are primarily for plotting and visualisation, with some data organisation thrown in as well.

Gallery

Bi-Results	Correlation Plot	Frequency Table
Module Plot	Treatment Group Comparisons	Volcano Plot
Switching Gene Ids	Fold Change Heatmap	Significance Boxplots
Sankey Plots	Capitalize Phrases	Module Scores

Required packages

dplyr ggplot2 ggrepel ComplexHeatmap RColorBrewer Rmisc ggpubr gtools grid pBrackets biomaRt

Install Package

First install devtools to allow installation from gitub and any other required packages.

install.packages("devtools")
library("devtools")

library(devtools)
library(knitr)

Now install the BioOutputs package.

install_github("KatrionaGoldmann/BioOutputs")
library("BioOutputs")

bio_corr

Create a correlation plot. Taken from kassambara/ggpubr just changed the default arguments.

So we can use the classic example with the mtcars data frames:

kable(head(mtcars))

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	0	3	1

bio_corr(mtcars, "qsec", "wt")

bio_frequency

The bio_frequency() function generates a frequency table from factor or character vector columns in a data frame. This has the following arguments:

Argument
data	A data frame containing columns to be counted
columns	Column names or indices to be counted in data
freq.percent	Whether the table should include frequency counts, percentages or both (options = c("freq", "percent", "both")). Default="both"
include.na	Include NA values (options are TRUE/FALSE, default=TRUE)
remove.vars	Character vector of variables not to be included in the counts (e.g. remove.vars = c("") remove blanks from the count)

Then if we want to see the breakdown of, say, the gear column in mtcars we can apply:

kable(bio_frequency(mtcars, "gear"))

	3	4	5	Total
gear	15 (47%)	12 (38%)	5 (16%)	n = 32

And if wanted we can remove one variable from the table. This is useful if we have unknowns or the likes.

kable(bio_frequency(mtcars, "gear", remove.vars=c("5")))

	3	4	Total
gear	15 (56%)	12 (44%)	n = 27

bio_volcano

This function generates a volcano plot from a top table using ggplot. The function contains many parameters, use ?bio_volcano to interogate.

Argument
toptable	A data frame containing p value and fold change columns for parameters compared across multiple groups. The p value column should be named "pvalue".
fc.col	The column name which stores the fold change. Should be in the log2 format (default="log2FC")
padj.col	The column which contains adjusted p-values. If NULL adjusted pvalues will be calculated
padj.method	correction method. Options include: c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"). Default="fdr
padj.cutoff	The cutoff for adjusted pvalues. This adds a horizontal line of significance (default=NULL)
fc.cutoff	The log2(fold change) significance cut-off (default=1)
marker.colour	Character vector of four colours to map to the volcano plot. In the order non-significanct, fold-change significant, pvalue significant, significant in fold-change and pvalues (default=c("grey60", "olivedrab", "salmon", "darkturquoise"))
label.p.cutoff	The cutoff for adjusted pvalues for labelling (default=NULL). Not recommended if many significant rows.
label.row.indices	Indices of rows to be labelled (default=NULL)
label.colour	Colour of labels (default="black")
legend.labs	A character vector for theThe legend label names (default=c("Not Significant", "FC>fc.cutoff", "Padj<padj.cutoff", "FC>fc.cutoff& Padj<padj.cutoff"))
add.lines	Whether to add dashed lines at fc.cutoff and padj.cutoff (default=TRUE)
line.colour	The color of dashed significance lines (default="grey14")
main	Plot title
xlims, ylims	The plot limits

Lets look at the ALS patients carrying the C9ORF72 data set

toptable <- read.table("https://gist.githubusercontent.com/stephenturner/806e31fce55a8b7175af/raw/1a507c4c3f9f1baaa3a69187223ff3d3050628d4/results.txt", sep="", header=T)
rownames(toptable) = toptable$Gene

bio_volcano(toptable, fc.col="log2FoldChange", label.row.indices=1:10, main="ALS patients carrying the C9ORF72", add.lines=TRUE)

Lets look at the leukemia data set

BiocManager::install("leukemiasEset", version = "3.8")

library(leukemiasEset)
library(limma)

data(leukemiasEset)
ourData <- leukemiasEset[, leukemiasEset$LeukemiaType %in% c("ALL", "NoL")]
ourData$LeukemiaType <- factor(ourData$LeukemiaType)

design <- model.matrix(~ ourData$LeukemiaType)
fit <- lmFit(ourData, design)
fit <- eBayes(fit)
toptable <- topTable(fit)
toptable$pvalue = toptable$P.Value

bio_volcano(toptable, fc.col="logFC", label.row.indices=1:10, main="leukemia", add.lines=FALSE)

bio_bires

This function colours data according to whether it is below or above a defined plane. The plane is plotted as a line and data can be output as either a line or markers/points.

Argument
x	x column name in data
y	y column name in data
df.data	Data frame containing x and y columns
x.plane	column name for the x axis in df.plane
y.plane	column name for the y axis in df.plane
df.plane	Date frame modelling the plane
stepwise	logical whether to plot the cutoff plane as stepwise or smoothed
colours	colour vector for higher, lower and plane values (default=c("green", "red", "grey) respectively)
inc.equal	logical whethere points on the line should be counted as above (dafault=TRUE)
labels	label for the markers (default=c("above", "below"))
type	type of plot for data (options include point (dafault), line, stepwise)

Lets look at some beaver temperature data.

data(beavers)
df.plane = beaver1
df.data = beaver2
df.plane$temp <- df.plane$temp + 0.5
bio_bires(x="time", y="temp", df.data, x.plane="time", y.plane="temp", df.plane)

bio_mods

This function splits expression data into customisable modules and averages over catagories in a given variable.

Argument
exp	data frame containing the expression data
mod.list	A list of modules. Each element contains the list of genes for a modules. The gene names must match the rownames in the exp dataframe.
meta	Dataframe where each column contains an annotation/tract for samples in the heatmap. The order of samples in meta must match that of exp.
cluster.rows	The method to use for clustering of rows
cols	Chacter vector, or named vector to fix the order, defining the colours of each mean.var group.
main	Title of heatmap
show.names	Show row names. Logical.
mean.subjects	Logical to determine whether to add a row for the mean value for all subjects in a group
split.var	Character defining the meta column to average (mean) over.

In this example we will look at two Li modules and a custom one I made up.

exp = rld.syn
meta <- rld.metadata.syn

bio_mods(exp=exp, mod.list=mod_list, meta=meta, mean.var = "Pathotype", cluster.rows = FALSE)

bio_treatmentGroups

This function plots a line graph comparing two treatment groups over time.

Argument
df	Data frame containing the data for both groups
group.col	Column name which corresponding to the group of values in df
y.col	Column name corresponding to the y-axis values in df
x.col	Column name corresponding to the x-axis values in df
id.col	Column name which corresponds to subject ID
main	Title of pot
cols	Character vector for colours of lines
p.col	Colour of p-value text

Lets look at kidney disease survival in this example from the survival package.

library(survival)
data(kidney)

df <- kidney
df <- df[df$time < 150, ]

bio_treatmentGroups(df, group.col="status", y.col="frail", x.col="time")

bio_geneid

This function switches gene (or snp) ids using biomart

Argument
IDs	list of the Ids you want to convert
IDFrom	What format these IDs are in (default Ensembl)
IDTo	What format you want the IDs converted to (default gene names)
mart	The biomart to use. Typically, for humans you will want ensembl (default). Alternatives can be found at listEnsembl()
dataset	you want to use. To see the different datasets available within a biomaRt you can e.g. do: mart = useEnsembl('ENSEMBL_MART_ENSEMBL'), followed by listDatasets(mart).
attributes	list of variables you want output

For example genes TNF and A1BG:

IDs = c("TNF", "A1BG")
#kable(bio_geneid(IDs, IDFrom='hgnc_symbol', IDTo = 'ensembl_transcript_id'))

bio_fc_heatmap

This function exports a complex heatmap looking at the expression of different modules which can be customised.

Argument
exp	Expression Data
var	Vector classing samples by variables
prefix	Prefix to Heatmap titles
stars	whether pvales should be written as numeric or start (default=FALSE)
overlay	pvalues on fold change Heatmap or besidde (default = TRUE)
logp	Whether or not to log the pvalues
...	Other parameters to pass to Complex Heatmap

exp = rld.syn
meta <- rld.metadata.syn

bio_fc_heatmap(exp=exp, var=meta$Pathotype)

bio_boxplots

Creates boxplots showing the significance between groups.

Argument
data	Data frame containing columns x and y
x, y	x and y variable names for drawing.
p.cutoff	plot p-value if above p.cutoff threshold. To include all comparisons set as NULL.
stars	Logical. Whether significance shown as numeric or stars
method	a character string indicating which method to be used for comparing means.c("t.test", "wilcox.test")
star.vals	a list of arguments to pass to the function symnum for symbolic number coding of p-values. For example, the dafault is symnum.args <- list(cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1), symbols = c('**', '', '', '', 'ns')). In other words, we use the following convention for symbols indicating statistical significance: ns: p > 0.05; : p <= 0.05; : p <= 0.01; : p <= 0.001; **: p <= 0.0001

Lets look at the iris data:

bio_boxplots(iris, x="Species", y= "Sepal.Width", p.cutoff = 0.0001)

bio_boxplots(iris, x="Species", y= "Sepal.Width", NULL, stars=TRUE)

bio_sankey

Creates a sankey plot with heatmaps showing how individuals progress over time.

Argument
samp.orders	A list of named vectors for sample order at each timepoint. Vector names must correspond to matchable ids.

library(lme4)
library(ComplexHeatmap)
data(sleepstudy)
data = sleepstudy
data = data[data$Days %in% c(0, 1, 2), ]
data = data[data$Days !=1 | data$Subject != "308", ]
time.col="Days"
exp.cols = "Reaction"
sub.col = "Subject"

row.order=list()
for(i in unique(data[, time.col])){
  hm = Heatmap(data[data[, time.col] == i, exp.cols])
  row.order[[paste('timepoint', as.character(i))]] = setNames(unlist(row_order(hm)), data[data[, time.col] == i, sub.col])
}

bio_sankey(samp.order=row.order)

bio_capitalize

Converts strings to appropriate format for titles according to Chicago style.

Argument
titles	A character vector of phrases to be converted to titles
exeption.words	A character vector of words with case to be forced (for example abbreviations and roman numerals)
replace.chars	A named list of characters to replace in title. This works in order of appearance. E.g. c("\."=" ") replaces fullstops with spaces.

shakespeare_plays =c("all's well that ends well", "as you like it","comedy of errors","love's labour's lost", "measure for measure", "merchant of venice","merry wives of windsor","midsummer night's dream","much ado about nothing","taming of the shrew", "tempest", "twelfth night","two gentlemen of verona", "winter's tale", "henry iv, part i","henry iv, part ii", "henry v", "henry vi, part i","henry vi, part ii", "henry vi, part iii", "henry viii","king john", "pericles","richard ii", "richard iii", "antony and cleopatra","coriolanus","cymbeline", "hamlet","julius caesar", "king lear", "macbeth (the scottish play)", "othello", "romeo and juliet","timon of athens", "titus andronicus", "troilus and cressida")
exception.words= c("I", "II", "III", "IV", "V", "VI") # Pass in exceptions

bio_capitalize(as.character(shakespeare_plays[grepl("henry", shakespeare_plays)]), exception.words)

bio_modscores

Calculates the module expression from a list of genes.

Argument
exp	data frame containing the expression data
mod.list	A list of modules. Each element contains the list of genes for a modules. The gene names must match the rownames in the exp dataframe.

bio_mods(exp=exp, mod.list=mod_list)

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
R		R
README_files/figure-markdown_strict		README_files/figure-markdown_strict
data		data
figs		figs
man		man
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
BioOutputs.Rproj		BioOutputs.Rproj
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
README.md		README.md
README.rmd		README.rmd

KatrionaGoldmann/BioOutputs

Folders and files

Latest commit

History

Repository files navigation