Statistical Analysis

Sam Minot edited this page Jun 8, 2020 · 5 revisions

Support for statistical analysis of microbiome survey datasets is built directly into the geneshot tool. This analysis is intended to help the user identify those CAGs (groups of co-abundant microbial genes) whose relative abundance is significantly associated with any of the metadata features provided by the user in the manifest.

To use this optional feature, the user needs to supply a model formula with the --formula flag and include every variable named in that formula as a column in the manifest.

For example, let us consider a manifest CSV which contains the following information:

specimen  R1  R2  participant  disease  bristol
pA_s1     <>  <>  pA           0        3
pA_s2     <>  <>  pA           0        6
...       <>  <>  ...          ...      ...
pZ_s5     <>  <>  pZ           0        4
pZ_s9     <>  <>  pZ           1        3

In this experiment we have obtained multiple microbiome samples from each of multiple participants. For each participant, some of the samples were collected at times when they were experiencing a transient disease process, as recorded in the disease column. The Bristol score has also been recorded for each sample.

It is recommended that binary variables be coded as 0 / 1, categorical variables as strings, and continuous variables as floats.
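The coding recommendation above can be checked before submitting a run. The following is a minimal sketch, not part of geneshot: the function name check_coding and its arguments are hypothetical, and it assumes the manifest is a comma-delimited file whose header matches the example table.

```python
import csv
import io


def check_coding(manifest_csv_text, binary=(), continuous=()):
    """Return a list of messages for values that do not follow the
    recommended coding (hypothetical helper, not a geneshot function)."""
    problems = []
    reader = csv.DictReader(io.StringIO(manifest_csv_text))
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(reader, start=2):
        for col in binary:
            if row[col] not in ("0", "1"):
                problems.append(f"line {line_no}: {col}={row[col]!r} is not 0 or 1")
        for col in continuous:
            try:
                float(row[col])
            except ValueError:
                problems.append(f"line {line_no}: {col}={row[col]!r} is not numeric")
    return problems


# Illustrative manifest text; the file names are made up.
example = (
    "specimen,R1,R2,participant,disease,bristol\n"
    "pA_s1,a_R1.fq.gz,a_R2.fq.gz,pA,0,3\n"
    "pZ_s9,z_R1.fq.gz,z_R2.fq.gz,pZ,1,3\n"
)
for message in check_coding(example, binary=["disease"], continuous=["bristol"]):
    print(message)
```

Categorical columns such as participant are left unchecked here, since any string is an acceptable label.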

This is one particular experimental design, used here for illustrative purposes; it will likely not match your own experiment.

In order to enable the statistical analysis, use the --formula flag. This formula will be used to run Corncob on each CAG individually, testing for association with those features described in the manifest. Multiple formulae may be specified as a comma-delimited list.

Examples:

  • --formula "disease": Test for the association of the relative abundance of every CAG with the binary disease label
  • --formula "disease + participant": Test for the association of the relative abundance of every CAG with the binary disease label, allowing for the intercept to vary by participant (because it is a categorical variable in the provided table)
  • --formula "disease,participant": In two independent models, test for the association of the relative abundance of every CAG with (a) the binary disease label and (b) the categorical participant label
  • --formula "disease + bristol + participant": Test for the association of the relative abundance of every CAG with the binary disease label and the continuous bristol label, while allowing the intercept to vary by participant
  • --formula "disease * bristol + participant": Test for the association of the relative abundance of every CAG with the binary disease label, the continuous bristol label, and their interaction (disease:bristol), while allowing the intercept to vary by participant
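To make the formula syntax in these examples concrete, the sketch below splits a comma-delimited --formula value into its independent models and expands each one into its terms, following the usual R convention that a * b means a + b + a:b. It is an illustration only, not how geneshot or corncob parse formulae: the function names are hypothetical, and only the + and * operators used in the examples above are handled.

```python
from itertools import combinations


def split_formulae(formula_arg):
    """Split a comma-delimited --formula value into separate model formulae."""
    return [f.strip() for f in formula_arg.split(",") if f.strip()]


def expand_terms(formula):
    """Expand a formula that uses only '+' and '*' into its model terms.

    'a * b' contributes the main effects a and b plus the interaction a:b.
    """
    terms = []
    for chunk in formula.split("+"):
        factors = [f.strip() for f in chunk.split("*") if f.strip()]
        for size in range(1, len(factors) + 1):
            for combo in combinations(factors, size):
                term = ":".join(combo)
                if term not in terms:
                    terms.append(term)
    # Stable sort: main effects first, then interactions.
    return sorted(terms, key=lambda t: t.count(":"))


for formula in split_formulae("disease,disease * bristol + participant"):
    print(formula, "->", expand_terms(formula))
```

Here "disease,disease * bristol + participant" describes two models: one with the single term disease, and one with the terms disease, bristol, participant, and disease:bristol.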

If the user provides this --formula flag, the first step of geneshot will be to perform a dry run and ensure that this test can be executed with the manifest provided. Importantly, this test must pass before any large-scale compute is allowed to start.
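A quick local check can catch the most common cause of a failed dry run, a formula that names a column missing from the manifest, before launching geneshot at all. The sketch below is an assumption-laden stand-in for that idea and does not reproduce geneshot's own dry run: the function missing_formula_columns is hypothetical, the manifest is assumed to be comma-delimited, and variable names are assumed to be plain identifiers.

```python
import csv
import io
import re


def missing_formula_columns(manifest_csv_text, formula_arg):
    """Map each formula to the variables it names that are absent from the
    manifest header (hypothetical helper, not a geneshot function)."""
    header = next(csv.reader(io.StringIO(manifest_csv_text)))
    columns = {name.strip() for name in header}
    missing = {}
    for formula in formula_arg.split(","):
        formula = formula.strip()
        # Treat every identifier-like token in the formula as a variable name.
        names = re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", formula)
        absent = sorted(set(names) - columns)
        if absent:
            missing[formula] = absent
    return missing


manifest = "specimen,R1,R2,participant,disease,bristol\n"
print(missing_formula_columns(manifest, "disease * bristol + participant"))
print(missing_formula_columns(manifest, "disease + age,participant"))
```

An empty result means every variable was found in the header; it does not guarantee that the model itself can be fit, which is what the dry run described above is for.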