Data Format and Normalisation

Alfred Ssekagiri edited this page Jan 20, 2018 · 1 revision

Data format/requirement

The data must be a phyloseq object (phyloseq-class) comprising taxa abundance information, taxonomy assignment and sample data, where the sample data combines measured environmental variables with any categorical variables describing the samples. A phylogenetic tree can be included if available, but it is not needed for most of the functionality implemented here so far. We use this format because it offers many options for manipulating the data as the analysis and visualisation progress. Details of the format and of manipulating phyloseq objects are available at https://github.com/joey711/phyloseq.

Example dataset

To test the functionality, we use a pit latrine dataset (pitlatrine) generated by 16S rRNA sequencing of samples taken at different depths from various latrines in Tanzania and Vietnam. The data files are available at http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/ecological.html and the associated paper is B Torondel, JHJ Ensink, O Gundogdu, UZ Ijaz, J Parkhill, F Abdelahi, V-A Nguyen, S Sugden, W Gibson, AW Walker, and C Quince. Assessment of the influence of intrinsic environmental and geographical factors on the bacterial ecology of pit latrines. Microbial Biotechnology, 9(2):209-223, 2016. The paper is also freely accessible.

To load the test data in phyloseq format:

library(microbiomeSeq)
data(pitlatrine) 

To check the components of the data, print the object to see its structure.

print(pitlatrine)
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 8883 taxa and 81 samples ]
sample_data() Sample Data:       [ 81 samples by 14 sample variables ]
tax_table()   Taxonomy Table:    [ 8883 taxa by 6 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 8883 tips and 8881 internal nodes ]

Generate a phyloseq object

To generate a phyloseq object for analysis, the phyloseq function merge_phyloseq can be used to combine the taxa abundance information (OTU), taxonomy assignment (TAX), sample data (SAM) and a phylogenetic tree (OTU_tree, read from a file in Newick format) as follows. More details on how to construct a phyloseq object are available from the phyloseq site cited earlier.

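library(phyloseq)
library(ape)  # provides compute.brlen()

# abund_table has samples as rows and taxa as columns
OTU <- otu_table(as.matrix(abund_table), taxa_are_rows = FALSE)
TAX <- tax_table(as.matrix(OTU_taxonomy))
SAM <- sample_data(meta_table)
# Assign branch lengths to the tree using Grafen's method
OTU_tree <- compute.brlen(OTU_tree, method = "Grafen")
physeq <- merge_phyloseq(phyloseq(OTU, TAX), SAM, OTU_tree)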
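
The input tables used above are not shown on this page, so the following is only a sketch of the shapes they are assumed to have. All values and names (S1, OTU1, pH, Country and so on) are hypothetical. The key point is that the tables are matched by name: the columns of abund_table should correspond to the rows of OTU_taxonomy, and the rows of abund_table to the rows of meta_table. The tree is left out of this toy case.

```r
# Toy inputs (hypothetical values) with the shapes the snippet above expects.
# Samples are rows of abund_table because taxa_are_rows = FALSE.
abund_table <- matrix(
  c(10, 0, 5,
     3, 7, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("S1", "S2"), c("OTU1", "OTU2", "OTU3"))
)

# One row per OTU, one column per taxonomic rank.
OTU_taxonomy <- data.frame(
  Phylum = c("Firmicutes", "Bacteroidetes", "Proteobacteria"),
  Class  = c("Clostridia", "Bacteroidia", "Gammaproteobacteria"),
  row.names = c("OTU1", "OTU2", "OTU3"),
  stringsAsFactors = FALSE
)

# One row per sample, one column per measured or categorical variable.
meta_table <- data.frame(
  pH      = c(7.1, 6.4),
  Country = c("T", "V"),
  row.names = c("S1", "S2"),
  stringsAsFactors = FALSE
)

# The tables are matched by name, so check the names before combining them.
names_match <- identical(colnames(abund_table), rownames(OTU_taxonomy)) &&
  identical(rownames(abund_table), rownames(meta_table))
stopifnot(names_match)

# Build the object only if phyloseq is installed (no tree in this toy case).
if (requireNamespace("phyloseq", quietly = TRUE)) {
  toy_physeq <- phyloseq::phyloseq(
    phyloseq::otu_table(abund_table, taxa_are_rows = FALSE),
    phyloseq::tax_table(as.matrix(OTU_taxonomy)),
    phyloseq::sample_data(meta_table)
  )
  print(toy_physeq)
}
```

Checking the names first tends to catch the most common construction problem, where taxa or samples are silently dropped because their identifiers differ between tables.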

Data normalisation

Microbial community data consists mainly of OTU/taxa abundances (counts) and the corresponding environmental data. Statistical methods differ in their requirements regarding the distribution and kind of data (for example counts, binary or fractional values), so the data usually needs to be transformed by a suitable normalisation method.

Normalising OTU abundance

Several methods are implemented for normalising taxa abundance: "relative", "log-relative", random subsampling ("randomsubsample"), edgeR ("edgernorm") and variance stabilisation ("varstab"). The function normalise_data takes a phyloseq object physeq and returns a similar object whose otu_table component is normalised by the selected method, as shown in the following examples.

physeq <- normalise_data(physeq, norm.method = "randomsubsample")
physeq <- normalise_data(physeq, norm.method = "varstab", fitType = "local")
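
To give a feel for what some of these methods do to a count table, here is a minimal base R sketch on a toy matrix. It is not the code used inside normalise_data, and the package may handle details such as zero counts or the subsampling depth differently. The counts and the helper subsample_row are hypothetical, and using the smallest sample total as the depth is an assumption made for illustration.

```r
set.seed(1)  # make the random subsampling reproducible

# Toy counts (hypothetical): 3 samples in rows, 4 taxa in columns.
counts <- matrix(
  c(10, 20, 30, 40,
     5,  5,  0, 10,
    50, 25, 25,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("S", 1:3), paste0("OTU", 1:4))
)

# Relative abundance: each count divided by its sample total.
relative <- counts / rowSums(counts)

# Log-relative abundance; zero counts become -Inf here, so a real
# implementation may add a pseudocount or treat zeros differently.
log_relative <- log(relative)

# Random subsampling: draw the same number of reads from every sample,
# without replacement, using the smallest sample total as the depth.
depth <- min(rowSums(counts))
subsample_row <- function(x, depth) {
  reads <- rep(seq_along(x), times = x)   # one entry per read, labelled by taxon
  picked <- sample(reads, size = depth)   # sample reads without replacement
  tabulate(picked, nbins = length(x))     # count the picked reads per taxon
}
subsampled <- t(apply(counts, 1, subsample_row, depth = depth))
dimnames(subsampled) <- dimnames(counts)
```

After relative normalisation every sample sums to 1, and after subsampling every sample has the same total, which is what makes samples of different sequencing depth comparable.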

Normalising sample data

To transform the sample_data component of a phyloseq object, set the logical argument norm.meta to TRUE in addition to choosing a suitable normalisation method. Note that, among the methods mentioned above, the norm.meta option is currently available only for "relative" and "log-relative".

physeq <- normalise_data(physeq, norm.method = "relative", norm.meta = TRUE)

To scale sample data, select "scale" as the norm.method. This function can also perform log2 and square root transformations of sample data, specified using the type argument as illustrated in the example below.

physeq <- normalise_data(physeq, norm.method = "scale", type="log")
physeq <- normalise_data(physeq, norm.method = "scale", type="sqrt")
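
As a rough illustration of these three transformations, the sketch below applies their base R equivalents to a toy table of numeric variables. The values and variable names (pH, Temp) are hypothetical, and it is an assumption that normalise_data behaves like scale(), log2() and sqrt() internally.

```r
# Toy sample data (hypothetical values): numeric environmental variables only.
meta_table <- data.frame(
  pH   = c(6.5, 7.0, 7.5, 8.0),
  Temp = c(20, 25, 30, 35),
  row.names = paste0("S", 1:4)
)

# Scaling: centre each variable to mean 0 and rescale to standard deviation 1.
meta_scaled <- as.data.frame(scale(meta_table))

# Log2 and square root transforms; sqrt needs non-negative values
# and log2 needs strictly positive values.
meta_log2 <- log2(meta_table)
meta_sqrt <- sqrt(meta_table)
```

Scaling puts variables measured in different units on a common footing, while the log2 and square root transforms are mainly useful for reducing skew in variables that span a wide range.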