
Analysis Modes

Sam Minot edited this page Jan 18, 2020 · 2 revisions

There are two fundamentally different ways to run geneshot:

  1. Perform de novo assembly for each specimen, deduplicate gene sequences across the entire experiment, and use that catalog of genes for analysis, or
  2. Use a previously-generated collection of gene sequences, and entirely skip the de novo assembly step.

The advantage of performing de novo assembly is that you can identify all of the microbial genes present in your experiment, as long as they are above ~5X sequencing depth in at least one sample (the practical limit for de novo assembly). The disadvantages are that (a) it is computationally resource-intensive for large datasets, and (b) it is computationally inefficient for very shallowly sequenced samples.

At either extreme of sequencing depth, it may be worthwhile to use a pre-generated catalog of microbial genes instead:

  • Large number of very deeply sequenced samples (i.e. over 30M reads per sample)
  • Any number of very shallowly sequenced samples (i.e. under 5M reads per sample)
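As a rough illustration, the rule of thumb above can be written as a small helper. This is only a sketch, not part of geneshot: the function name `suggest_mode`, the `large_experiment` cutoff of 100 samples (the text does not define "large number"), and the choice to require every sample to cross a threshold are all assumptions made for the example.

```python
# Depth thresholds taken from the rule of thumb in the text above.
DEEP_READS = 30_000_000     # "very deeply sequenced": over 30M reads per sample
SHALLOW_READS = 5_000_000   # "very shallowly sequenced": under 5M reads per sample


def suggest_mode(reads_per_sample, large_experiment=100):
    """Suggest 'catalog' (pre-generated genes) or 'de_novo' (assembly).

    reads_per_sample: list of read counts, one per sample.
    large_experiment: hypothetical cutoff for a "large number" of samples.
    """
    if not reads_per_sample:
        raise ValueError("need at least one sample")

    # Any number of very shallow samples: assembly is inefficient.
    if all(n < SHALLOW_READS for n in reads_per_sample):
        return "catalog"

    # Many very deep samples: assembly is resource-intensive.
    many_samples = len(reads_per_sample) >= large_experiment
    if many_samples and all(n > DEEP_READS for n in reads_per_sample):
        return "catalog"

    # Everything in between: de novo assembly is a reasonable default.
    return "de_novo"
```

For mixed-depth experiments this helper always falls back to `de_novo`; treat the result as a starting point rather than a decision.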

You can use the --gene_fasta flag to pass in a gzipped FASTA of deduplicated reference gene sequences. The help text for the tool also provides a path to a useful example generated from a large number of microbiome samples. You can also use the ref/genes.fasta.gz file output by geneshot as the reference for a subsequent run with new datasets (keeping the caveats of biological interpretation in mind).
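Before passing a file to --gene_fasta, it can be useful to confirm that it really is a readable gzipped FASTA with no repeated sequences. The sketch below is a standalone sanity check, not something geneshot provides: the function name `check_gene_fasta` is hypothetical, and it only detects exact (case-insensitive) duplicates, which may be a weaker criterion than whatever deduplication was used to build the catalog.

```python
import gzip


def check_gene_fasta(path):
    """Return (n_records, n_duplicate_sequences) for a gzipped FASTA file."""
    seen = set()        # every distinct sequence encountered so far
    chunks = []         # sequence lines of the record currently being read
    n_records = 0
    n_duplicates = 0

    def flush():
        # Finish the current record and compare it against earlier ones.
        nonlocal n_duplicates
        seq = "".join(chunks).upper()
        chunks.clear()
        if seq in seen:
            n_duplicates += 1
        seen.add(seq)

    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if n_records:
                    flush()      # close out the previous record
                n_records += 1
            else:
                chunks.append(line)

    if n_records:
        flush()                  # close out the final record
    return n_records, n_duplicates
```

Calling `check_gene_fasta("ref/genes.fasta.gz")` on the output of an earlier run should report zero duplicates if the catalog is deduplicated. The check holds every distinct sequence in memory, so for very large catalogs storing a hash of each sequence instead may be preferable.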