Skip to content

KatrionaGoldmann/BioOutputs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

62 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation


BioOutputs

This package contains common R scripts I use in my day to day data analysis of biological data. The scripts are primarily for plotting and visualisation, with some data organisation thrown in as well.


Gallery

Bi-Results Correlation Plot Frequency Table
Module Plot Treatment Group Comparisons Volcano Plot
Switching Gene Ids Fold Change Heatmap Significance Boxplots
Sankey Plots Capitalize Phrases Module Scores

Required packages

dplyr ggplot2 ggrepel ComplexHeatmap RColorBrewer Rmisc ggpubr gtools grid pBrackets biomaRt


Install Package

First install devtools to allow installation from gitub and any other required packages.

install.packages("devtools")
library("devtools")

library(devtools)
library(knitr)

Now install the BioOutputs package.

install_github("KatrionaGoldmann/BioOutputs")
library("BioOutputs")

bio_corr

Create a correlation plot. Taken from kassambara/ggpubr just changed the default arguments.

So we can use the classic example with the mtcars data frames:

kable(head(mtcars))
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
bio_corr(mtcars, "qsec", "wt")


bio_frequency

The bio_frequency() function generates a frequency table from factor or character vector columns in a data frame. This has the following arguments:

Argument
data A data frame containing columns to be counted
columns Column names or indices to be counted in data
freq.percent Whether the table should include frequency counts, percentages or both (options = c("freq", "percent", "both")). Default="both"
include.na Include NA values (options are TRUE/FALSE, default=TRUE)
remove.vars Character vector of variables not to be included in the counts (e.g. remove.vars = c("") remove blanks from the count)

Then if we want to see the breakdown of, say, the gear column in mtcars we can apply:

kable(bio_frequency(mtcars, "gear"))
3 4 5 Total
gear 15 (47%) 12 (38%) 5 (16%) n = 32

And if wanted we can remove one variable from the table. This is useful if we have unknowns or the likes.

kable(bio_frequency(mtcars, "gear", remove.vars=c("5")))
3 4 Total
gear 15 (56%) 12 (44%) n = 27

bio_volcano

This function generates a volcano plot from a top table using ggplot. The function contains many parameters, use ?bio_volcano to interogate.

Argument
toptable A data frame containing p value and fold change columns for parameters compared across multiple groups. The p value column should be named "pvalue".
fc.col The column name which stores the fold change. Should be in the log2 format (default="log2FC")
padj.col The column which contains adjusted p-values. If NULL adjusted pvalues will be calculated
padj.method correction method. Options include: c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"). Default="fdr
padj.cutoff The cutoff for adjusted pvalues. This adds a horizontal line of significance (default=NULL)
fc.cutoff The log2(fold change) significance cut-off (default=1)
marker.colour Character vector of four colours to map to the volcano plot. In the order non-significanct, fold-change significant, pvalue significant, significant in fold-change and pvalues (default=c("grey60", "olivedrab", "salmon", "darkturquoise"))
label.p.cutoff The cutoff for adjusted pvalues for labelling (default=NULL). Not recommended if many significant rows.
label.row.indices Indices of rows to be labelled (default=NULL)
label.colour Colour of labels (default="black")
legend.labs A character vector for theThe legend label names (default=c("Not Significant", "FC>fc.cutoff", "Padj<padj.cutoff", "FC>fc.cutoff& Padj<padj.cutoff"))
add.lines Whether to add dashed lines at fc.cutoff and padj.cutoff (default=TRUE)
line.colour The color of dashed significance lines (default="grey14")
main Plot title
xlims, ylims The plot limits

Lets look at the ALS patients carrying the C9ORF72 data set

toptable <- read.table("https://gist.githubusercontent.com/stephenturner/806e31fce55a8b7175af/raw/1a507c4c3f9f1baaa3a69187223ff3d3050628d4/results.txt", sep="", header=T)
rownames(toptable) = toptable$Gene

bio_volcano(toptable, fc.col="log2FoldChange", label.row.indices=1:10, main="ALS patients carrying the C9ORF72", add.lines=TRUE)

Lets look at the leukemia data set

BiocManager::install("leukemiasEset", version = "3.8")

library(leukemiasEset)
library(limma)

data(leukemiasEset)
ourData <- leukemiasEset[, leukemiasEset$LeukemiaType %in% c("ALL", "NoL")]
ourData$LeukemiaType <- factor(ourData$LeukemiaType)

design <- model.matrix(~ ourData$LeukemiaType)
fit <- lmFit(ourData, design)
fit <- eBayes(fit)
toptable <- topTable(fit)
toptable$pvalue = toptable$P.Value

bio_volcano(toptable, fc.col="logFC", label.row.indices=1:10, main="leukemia", add.lines=FALSE)


bio_bires

This function colours data according to whether it is below or above a defined plane. The plane is plotted as a line and data can be output as either a line or markers/points.

Argument
x x column name in data
y y column name in data
df.data Data frame containing x and y columns
x.plane column name for the x axis in df.plane
y.plane column name for the y axis in df.plane
df.plane Date frame modelling the plane
stepwise logical whether to plot the cutoff plane as stepwise or smoothed
colours colour vector for higher, lower and plane values (default=c("green", "red", "grey) respectively)
inc.equal logical whethere points on the line should be counted as above (dafault=TRUE)
labels label for the markers (default=c("above", "below"))
type type of plot for data (options include point (dafault), line, stepwise)

Lets look at some beaver temperature data.

data(beavers)
df.plane = beaver1
df.data = beaver2
df.plane$temp <- df.plane$temp + 0.5
bio_bires(x="time", y="temp", df.data, x.plane="time", y.plane="temp", df.plane)

bio_mods

This function splits expression data into customisable modules and averages over catagories in a given variable.

Argument
exp data frame containing the expression data
mod.list A list of modules. Each element contains the list of genes for a modules. The gene names must match the rownames in the exp dataframe.
meta Dataframe where each column contains an annotation/tract for samples in the heatmap. The order of samples in meta must match that of exp.
cluster.rows The method to use for clustering of rows
cols Chacter vector, or named vector to fix the order, defining the colours of each mean.var group.
main Title of heatmap
show.names Show row names. Logical.
mean.subjects Logical to determine whether to add a row for the mean value for all subjects in a group
split.var Character defining the meta column to average (mean) over.

In this example we will look at two Li modules and a custom one I made up.

exp = rld.syn
meta <- rld.metadata.syn

bio_mods(exp=exp, mod.list=mod_list, meta=meta, mean.var = "Pathotype", cluster.rows = FALSE)

bio_treatmentGroups

This function plots a line graph comparing two treatment groups over time.

Argument
df Data frame containing the data for both groups
group.col Column name which corresponding to the group of values in df
y.col Column name corresponding to the y-axis values in df
x.col Column name corresponding to the x-axis values in df
id.col Column name which corresponds to subject ID
main Title of pot
cols Character vector for colours of lines
p.col Colour of p-value text

Lets look at kidney disease survival in this example from the survival package.

library(survival)
data(kidney)

df <- kidney
df <- df[df$time < 150, ]

bio_treatmentGroups(df, group.col="status", y.col="frail", x.col="time")

bio_geneid

This function switches gene (or snp) ids using biomart

Argument
IDs list of the Ids you want to convert
IDFrom What format these IDs are in (default Ensembl)
IDTo What format you want the IDs converted to (default gene names)
mart The biomart to use. Typically, for humans you will want ensembl (default). Alternatives can be found at listEnsembl()
dataset you want to use. To see the different datasets available within a biomaRt you can e.g. do: mart = useEnsembl('ENSEMBL_MART_ENSEMBL'), followed by listDatasets(mart).
attributes list of variables you want output

For example genes TNF and A1BG:

IDs = c("TNF", "A1BG")
#kable(bio_geneid(IDs, IDFrom='hgnc_symbol', IDTo = 'ensembl_transcript_id'))

bio_fc_heatmap

This function exports a complex heatmap looking at the expression of different modules which can be customised.

Argument
exp Expression Data
var Vector classing samples by variables
prefix Prefix to Heatmap titles
stars whether pvales should be written as numeric or start (default=FALSE)
overlay pvalues on fold change Heatmap or besidde (default = TRUE)
logp Whether or not to log the pvalues
... Other parameters to pass to Complex Heatmap
exp = rld.syn
meta <- rld.metadata.syn

bio_fc_heatmap(exp=exp, var=meta$Pathotype)

bio_boxplots

Creates boxplots showing the significance between groups.

Argument
data Data frame containing columns x and y
x, y x and y variable names for drawing.
p.cutoff plot p-value if above p.cutoff threshold. To include all comparisons set as NULL.
stars Logical. Whether significance shown as numeric or stars
method a character string indicating which method to be used for comparing means.c("t.test", "wilcox.test")
star.vals a list of arguments to pass to the function symnum for symbolic number coding of p-values. For example, the dafault is symnum.args <- list(cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1), symbols = c('****', '', '', '', 'ns')). In other words, we use the following convention for symbols indicating statistical significance: ns: p > 0.05; : p <= 0.05; : p <= 0.01; : p <= 0.001; ****: p <= 0.0001

Lets look at the iris data:

bio_boxplots(iris, x="Species", y= "Sepal.Width", p.cutoff = 0.0001)

bio_boxplots(iris, x="Species", y= "Sepal.Width", NULL, stars=TRUE)

bio_sankey

Creates a sankey plot with heatmaps showing how individuals progress over time.

Argument
samp.orders A list of named vectors for sample order at each timepoint. Vector names must correspond to matchable ids.
library(lme4)
library(ComplexHeatmap)
data(sleepstudy)
data = sleepstudy
data = data[data$Days %in% c(0, 1, 2), ]
data = data[data$Days !=1 | data$Subject != "308", ]
time.col="Days"
exp.cols = "Reaction"
sub.col = "Subject"

row.order=list()
for(i in unique(data[, time.col])){
  hm = Heatmap(data[data[, time.col] == i, exp.cols])
  row.order[[paste('timepoint', as.character(i))]] = setNames(unlist(row_order(hm)), data[data[, time.col] == i, sub.col])
}

bio_sankey(samp.order=row.order)

bio_capitalize

Converts strings to appropriate format for titles according to Chicago style.

Argument
titles A character vector of phrases to be converted to titles
exeption.words A character vector of words with case to be forced (for example abbreviations and roman numerals)
replace.chars A named list of characters to replace in title. This works in order of appearance. E.g. c("\."=" ") replaces fullstops with spaces.
shakespeare_plays =c("all's well that ends well", "as you like it","comedy of errors","love's labour's lost", "measure for measure", "merchant of venice","merry wives of windsor","midsummer night's dream","much ado about nothing","taming of the shrew", "tempest", "twelfth night","two gentlemen of verona", "winter's tale", "henry iv, part i","henry iv, part ii", "henry v", "henry vi, part i","henry vi, part ii", "henry vi, part iii", "henry viii","king john", "pericles","richard ii", "richard iii", "antony and cleopatra","coriolanus","cymbeline", "hamlet","julius caesar", "king lear", "macbeth (the scottish play)", "othello", "romeo and juliet","timon of athens", "titus andronicus", "troilus and cressida")
exception.words= c("I", "II", "III", "IV", "V", "VI") # Pass in exceptions

bio_capitalize(as.character(shakespeare_plays[grepl("henry", shakespeare_plays)]), exception.words)

bio_modscores

Calculates the module expression from a list of genes.

Argument
exp data frame containing the expression data
mod.list A list of modules. Each element contains the list of genes for a modules. The gene names must match the rownames in the exp dataframe.
bio_mods(exp=exp, mod.list=mod_list)

About

An R package to create common outputs used in bioinformatical analyses and visualisations

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages