bramstone/Selecting-associations-in-microbial-datasets

Mining Significant Associations in Microbial Community Datasets

This function implements the algorithm of Lallich et al. (2006), which limits type I errors (false discoveries) when selecting the subset of correlations whose values the researcher deems suitable or interesting (1). The function returns its output as a data frame suitable for graphing with the igraph package. Work is under way to incorporate the false discovery control methods of Reiner et al. (2003) (2) and Benjamini and Yekutieli (2001) (3) into the function.

Building a network from biological or ecological interactions must contend with the fact that the features of interest (species or genes) are inherently interdependent, so more traditional methods of significance correction are not sufficient. Microbial datasets present further difficulties. They contain a large number of features, which yields many possible pairwise correlations, and features often occur infrequently across samples or records, which gives them non-normal abundance distributions. One solution is to generate bootstrapped null values against which the observed data are compared, and to use that comparison to mine the dataset for features of interest.
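The bootstrap idea can be illustrated with a small, dependency-free Python sketch. It is not the code in this repository (which is written for R), and every name in it (`average_ranks`, `spearman`, `bootstrap_deviations`, the `table` layout) is hypothetical. It resamples the samples with replacement, recomputes each pairwise Spearman correlation, and records how far each resampled value falls from the observed one. Treating a constant feature as having zero association is an assumption made for the sketch.

```python
import random
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant input (assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_deviations(table, n_boot=200, seed=0):
    """table maps feature name -> list of abundances, one per sample.

    Returns the observed correlations and, for each bootstrap replicate,
    the deviation (resampled value minus observed value) of every pair.
    """
    rng = random.Random(seed)
    names = sorted(table)
    pairs = list(combinations(names, 2))
    n_samples = len(table[names[0]])
    observed = {p: spearman(table[p[0]], table[p[1]]) for p in pairs}
    deviations = []
    for _ in range(n_boot):
        # Resample whole samples so dependence between features is kept.
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        resampled = {f: [table[f][i] for i in idx] for f in names}
        deviations.append(
            {p: spearman(resampled[p[0]], resampled[p[1]]) - observed[p]
             for p in pairs}
        )
    return observed, deviations
```

Resampling whole samples, rather than shuffling each feature separately, is a deliberate choice here: it keeps the dependence between features intact, which is the difficulty the paragraph above describes.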

To accommodate large datasets, this function breaks correlation matrices into blocks of 10,000 rows, each of which is assessed against the initial threshold of significance. This is done mainly to ensure that R does not try to perform computations on prohibitively large datasets in memory. Because microbial associations are frequently non-linear, it is recommended to assess microbial datasets using Spearman rank-based correlations. Other, more ecologically relevant association measures may be calculated using the vegan package, though they may not perform as well as Spearman or Pearson measures (4).
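The block-wise screening described above might look roughly like the following Python sketch. It is an illustration of the idea rather than the repository's R code: `screen_in_blocks`, its parameters, and the edge-list output are hypothetical names, and filtering on the absolute value of the association is an assumption. Only the associations of one block of rows are held at a time; values below the threshold are discarded before the next block is processed.

```python
def screen_in_blocks(features, assoc, threshold, block_size=10_000):
    """Screen all feature pairs one block of rows at a time.

    features   : dict mapping feature name -> list of abundances
    assoc      : callable(x, y) -> association value (e.g. a Spearman function)
    threshold  : minimum absolute association to keep (assumption: |value|)
    block_size : number of matrix rows handled per block

    Returns a list of (from, to, value) tuples, i.e. an edge list.
    """
    names = list(features)
    edges = []
    for start in range(0, len(names), block_size):
        block = names[start:start + block_size]
        # Temporary store for this block only; it is dropped after filtering.
        block_values = {}
        for offset, a in enumerate(block):
            row = start + offset
            # Upper triangle only: each unordered pair is computed once.
            for b in names[row + 1:]:
                block_values[(a, b)] = assoc(features[a], features[b])
        edges.extend(
            (a, b, value)
            for (a, b), value in block_values.items()
            if abs(value) >= threshold
        )
    return edges
```

Passing the association measure in as a callable keeps the blocking logic independent of the choice between Spearman, Pearson, or another measure. The returned (from, to, value) tuples mirror the kind of edge list that graph packages typically accept.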

Besides the choice of association measure, users must decide how many false positive associations they are willing to accept in their data, as well as the risk (a probability between 0 and 1) that the data contain more than the specified number of false positives. The defaults are one false positive and a risk of 0.05 (five percent) that more than one false positive is present in the dataset.
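To show how these two settings can interact with bootstrapped values, here is a hedged Python sketch written in the spirit of the bootstrap procedure of Lallich et al. (1); it is not a transcription of the function in this repository, and the names `bootstrap_margin` and `select_associations` are hypothetical. The assumed logic is: within each bootstrap replicate, take the deviation that ranks just below the allowed number of false positives; then take the (1 - risk) quantile of those values across replicates as a safety margin added to the researcher's threshold.

```python
import math


def bootstrap_margin(deviations, max_false_positives=1, risk=0.05):
    """Safety margin from bootstrap deviations (sketch, see assumptions above).

    deviations          : list of replicates, each a list of per-pair
                          deviations (bootstrap value minus observed value)
    max_false_positives : number of false positives the user accepts
    risk                : accepted probability of exceeding that number
    """
    if not 0 < risk < 1:
        raise ValueError("risk must lie strictly between 0 and 1")
    per_replicate = []
    for replicate in deviations:
        if max_false_positives >= len(replicate):
            raise ValueError("max_false_positives must be below the pair count")
        ranked = sorted(replicate, reverse=True)
        # The (max_false_positives + 1)-th largest deviation in this replicate.
        per_replicate.append(ranked[max_false_positives])
    per_replicate.sort()
    # Order-statistic quantile: smallest value with at least a
    # (1 - risk) share of the replicates at or below it.
    k = math.ceil((1 - risk) * len(per_replicate))
    return per_replicate[max(k, 1) - 1]


def select_associations(observed, threshold, margin):
    """Keep pairs whose observed value exceeds threshold + margin."""
    return {pair: value for pair, value in observed.items()
            if value > threshold + margin}
```

With the defaults, the margin is chosen so that, under the bootstrap distribution, the second-largest deviation exceeds it in only about five percent of replicates. Raising either setting gives a smaller margin and therefore a more permissive selection.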

References

  1. Lallich, S, O Teytaud, E Prudhomme. 2006. Association rule interestingness: measure and statistical validation. In: Hamilton, G (ed.) Quality measures in data mining. Springer.
  2. Reiner, A, D Yekutieli, Y Benjamini. 2003. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 19, 368–375.
  3. Benjamini, Y, D Yekutieli. 2001. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29, 1165–1188.
  4. Berry, D, S Widder. 2014. Deciphering microbial interactions and detecting keystone species with co-occurrence networks. Frontiers in Microbiology 5, 219.
