Skip to content

Differential Abundance

Alfred Ssekagiri edited this page Feb 1, 2018 · 4 revisions

Differential abundance

Here we implement two functions to find features that are up or down regulated in the compared groups using DESeq2package and Kruskal-Wallis test. Functions produce plots of the top most features annotated with corresponding adjusted p-values and abundance distribution. In addition we implement classification using random forest classifier to find most important features amongst the differentially expressed features.

DESeq2 significance

Deseq procedure which was originally developed for differential expression analysis of RNA-seq data is used. In this case, it is supposed that abundance of a feature in a given sample is modelled as a negative binomial distribution, whose mean depends on sample specific size factor and concentration of that feature in a sample. The wald test is used to test significance of coefficients in a Negative Binomial GLM based on sample estimates.

physeq is a required phyloseq object containing taxa abundance and meta data. grouping_column is character string specifying the variable in the meta data with respect to which the data should be grouped, pvalue.threshold and lfc.threshold are thresholds for p-values and log2 fold change,normalisation method output_norm, is a character string specifying normalisation method for the output data to be plotted.

To find differentially abundant Families between Tanzania and Vietnam, we use this example.

physeq <- taxa_level(physeq, "Family")
NB_sig <- differential_abundance(physeq, grouping_column = "Country",output_norm=NULL,pvalue.threshold=0.05,lfc.threshold=0,filename=F)

The function plot_signif is used to produce the plots depending on supplied arguments as illustrated below.

To generate a plot showing differentially abundant taxa between among compared groups , corresponding adjusted p-values and rank of importance as detected by random forest classifier.

p<-plot_signif(NB_sig$plotdata, top.taxa = 20)
print(p)

Detailed information including base mean, log2 fold change, raw and adjusted p-values and a group in which a certain taxon is upregulated can be obtained by supplying a filename. For this case the, an example of such information is as shown below.

1               Gracilibacteraceae 21.651033      -5.299943 7.488104e-36 2.217308e-34           T
2               Acidaminococcaceae 13.203689       2.028611 4.367143e-05 1.072798e-04           V
3               Paenibacillaceae_1 10.711948       2.382468 2.051384e-07 7.358933e-07           V
4            Sporolactobacillaceae  2.400093       1.337766 1.570102e-03 3.007145e-03           V
5                    Bacillaceae_1 10.818641       2.514115 8.858409e-08 3.336667e-07           V

133           Pseudonocardiaceae    3.823966       1.839388 2.074989e-05 5.390201e-05           V
134           Micromonosporaceae    2.521131       1.326394 1.106597e-03 2.155955e-03           V
135         Propionibacteriaceae   32.791597      -2.463118 1.215185e-06 3.566648e-06           T
136 unclassified_Actinomycetales   80.257442       1.495950 1.950676e-03 3.704644e-03           V
137                 __Unknowns__ 1132.365032      -1.100679 6.833296e-03 1.197151e-02           T
138      Thermoanaerobacteraceae    1.980402      -1.533227 2.406417e-07 8.335673e-07           T

Random forest classifier is used to determine the importance of differentially expressed features/taxa to the microbial community. The measure used in this case is Mean Descrease in Accuracy which is reported for each of the features. This is obtained by removing the relationship of a feature and measuring increase in error. Consequently, the feature with high mean decrease in accuracy is considered most important.

To get a stand alone visual representation of important features as obtained by random forest classifer, The plot shows Taxa description and corresponding Mean Decrease Accuracy values. This shows a bit more details since in the significant features plot we show only the ranks of these features but in this, it is the measured values of Mean Decrease Accuracy.

p <- plot_MDA(NB_sig$importance, top.taxa=20)
print(p)

To get the Mean abundance plot (MA plot). This shows features that are detected as significant considering the threshold log fold supplied (lfc) against the mean abundance of the features.

p <- plot_MA(NB_sig$SignFeaturesTable, label=F)
print(p$maplot)

An optional visual representation of features that are either up or down regulated. It is generated by setting lfcplot=T, it is off by default. The information provided here is strictly limited to log fold change and basemean values annotated for each taxa/feature, it help with identification of features with extreme log folds more easily. The height of the bars correspond to log fold change, positive and negative values show up and down regulation respectively.

print(p$lfcplot)

Kruskal-Wallis significant features

We perform kruskal-wallis test on feature abundance under different groups/conditions. Kruskal-Wallis is a non parametric method for testing whether samples originate from the same distribution. The p-values values generated are corrected for multiple testing using family wise error rate. Significance is based on the corrected pvalue threshold which is specified via arguments. Significant features are assigned importance using random forest classifier. The measure of importance used in this case is mean decrese accuracy.

physeq is a required phyloseq object containing taxa abundance and meta data. grouping_column is character string specifying the variable in the meta data with respect to which the data should be grouped, pvalue.threshold is the significance threshold for p-values. To writes the information of up and down regulated features to a file, a filename should be supplied.

This examples shows how to find differentially expressed features between Tanzania and Vietnam using Kruskal-Wallis test. Firstly, we transform taxa abundance data using log-relative transformation.

physeq <- normalise_data(physeq, norm.method = "log-relative")
kw_sig <- kruskal_abundance(physeq, grouping_column = "Country")

The function plot_signif is used to produce the plots.

p <- plot_signif(kw_sig$plotdata, top.taxa = 20)

To plot features that are most significantly up or down regulated between the two groups with corresponding p-values and ranks of importance as detected by random forest classifier.

print(p)

To generate plot for multiple testing corrections.

plot_corrections(kw_sig$SignfeaturesTable)

Detailed information about significantly differentially expressed features is can be obtained by;

table_to_write <- kw_sig$SignfeaturesTable
head(table_to_write)
            id                      p.value      E.value       FWER      q.value.factor  q.value
Flavobacteriaceae                1.487223e-13 3.896525e-11 3.896525e-11      262.00000 3.896525e-11
Xanthomonadaceae                 1.060963e-12 2.779724e-10 2.779724e-10      131.00000 1.389862e-10
Cryomorphaceae                   2.248993e-11 5.892363e-09 5.892363e-09       87.33333 1.964121e-09
Synergistaceae                   3.970508e-11 1.040273e-08 1.040273e-08       65.50000 2.600683e-09
Alcaligenaceae                   1.623020e-10 4.252313e-08 4.252313e-08       52.40000 8.504627e-09
Gracilibacteraceae               2.709831e-10 7.099758e-08 7.099758e-08       43.66667 1.183293e-08
Pseudomonadaceae                 3.493510e-10 9.152996e-08 9.152996e-08       37.42857 1.307571e-08
Sphingobacteriaceae              8.398267e-10 2.200346e-07 2.200346e-07       32.75000 2.750433e-08
Syntrophomonadaceae              3.617744e-09 9.478489e-07 9.478484e-07       29.11111 1.053165e-07
Clostridiales_Incertae_Sedis_XII 4.870700e-09 1.276123e-06 1.276123e-06       26.20000 1.276123e-07

To get a stand alone visual representation of important features as obtained by random forest classifer, The plot shows Taxa description and corresponding Mean Decrease Accuracy values. This shows a bit more details since in the significant features plot we show only the ranks of these features but in this, it is the measured values of Mean Decrease Accuracy.

p <- plot_MDA(kw_sig$importance)
print(p)

Kernel-based differential analysis

Here we implement differential analysis using a distance-based kernel score test. It allows us to obtain a set(s) of differentially expresssed features (OTUs) or measured environmetal variables. The sets are obtained by grouping based on correlation of feature abundance/ measure of the environmental variables. The function returns a ggplot object showing significant feature/environmental variable sets annotated with corresponding adjusted p-values.

physeq is a phyloseq object containing taxa abundance and meta data information. grouping_column is a character string for variable with respect to which the data should be grouped. analyse is character string specifying whether to analyse taxa abundance ("abundance") or sample data ("meta"). Default is set to analyse taxa abundance. p.adjust.method character string for a method to be used for adjusting p-values for multitesting corrections, default is "BH" with options which include: "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none". adjusted.pvalue.thresholdis the threshold for the adjusted p-values. ... others arguments available to dscore and sscore functions.

The example below illustrates how to obtain sets of features that are differentially expressed between Tanzania and Vietnam. First, we apply log-relative transformation to the abundance data because this method is designed to handle fractional data and not counts. Therefore, this transformation avails the data in a type required by the method.

physeq <- normalise_data(physeq, norm.method = "log-relative")
kda_sig <- KDA(physeq, grouping_column="Country", analyse="abundance", method="dscore",p.adjust.method="BH",corr=0.99,adjusted.pvalue.threshold= 0.05)

Then use kda_plot function to visualise the results.

p <- kda_plot(kda_sig$plotdata)
print(p)

To use this method on meta data, a suitable normalisation should be applied. In this case, scale transformation is applied on the sample data. In addition, a list of variables of interest can be supplied via the argument select.variables. If not given, all the numerical variables in meta data are analysed as illustrated in the following example.

physeq <- normalise_data(physeq, norm.method = "scale")
kernel.meta <- KDA(physeq, "Country",analyse = "meta", select.variables=NULL)
p <- plot_kda(kernel.meta$plotdata)
print(p)