Overview

Merge species abundance files across samples. As input, the script takes a list of sample directories. As output, it produces matrix files containing relative abundances, marker-gene read-depth, and counts of reads mapped to marker genes, plus a table of species prevalence.

Usage

Usage: merge_midas.py species <outdir> [options]

positional arguments:
  outdir                Directory for output files

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT              Input to sample directories output by run_midas.py; see '-t' for details
  -t INPUT_TYPE         Specify one of the following:
                          list: -i is a comma-separated list (ex: /samples/sample_1,/samples/sample_2)
                          dir: -i is a directory containing all samples (ex: /samples)
                          file: -i is a file of paths to samples (ex: /sample_paths.txt)
  -d DB                 Path to reference database
                        By default the MIDAS_DB environmental variable is used
  --sample_depth FLOAT  Minimum per-sample marker-gene-depth for estimating species prevalence (1.0)
  --max_samples INT     Maximum number of samples to process.
                        Useful for testing (use all)

Examples

  1. provide list of paths to sample directories:
    merge_midas.py species /path/to/outdir -i /path/to/samples/sample_1,/path/to/samples/sample_2 -t list

  2. provide directory containing all samples:
    merge_midas.py species /path/to/outdir -i /path/to/samples -t dir

  3. provide file containing paths to sample directories:
    merge_midas.py species /path/to/outdir -i /path/to/samples/sample_paths.txt -t file

  4. run a quick test:
    merge_midas.py species /path/to/outdir -i /path/to/samples -t dir --max_samples 2
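
The invocations above can also be assembled from a script. The following is a minimal sketch, not part of MIDAS: `write_sample_paths` and `build_merge_command` are hypothetical helper names, and the one-path-per-line layout of the sample paths file is an assumption. Only the flags documented in the usage text are used.

```python
def write_sample_paths(sample_dirs, list_path):
    """Write sample directory paths to a file for use with '-t file'.

    Assumption: one path per line; the usage text does not specify the layout.
    """
    with open(list_path, "w") as handle:
        for path in sample_dirs:
            handle.write(path + "\n")


def build_merge_command(outdir, input_value, input_type,
                        sample_depth=None, max_samples=None):
    """Return the argument list for 'merge_midas.py species' (hypothetical helper)."""
    if input_type not in ("list", "dir", "file"):
        raise ValueError("input_type must be one of: list, dir, file")
    cmd = ["merge_midas.py", "species", outdir, "-i", input_value, "-t", input_type]
    # Optional flags are appended only when explicitly requested,
    # so the script's own defaults apply otherwise.
    if sample_depth is not None:
        cmd += ["--sample_depth", str(sample_depth)]
    if max_samples is not None:
        cmd += ["--max_samples", str(max_samples)]
    return cmd


# Equivalent to example 4 above.
quick_test = build_merge_command("/path/to/outdir", "/path/to/samples", "dir",
                                 max_samples=2)
```

The returned list can be passed to `subprocess.run(quick_test, check=True)` on a machine where `merge_midas.py` is on the `PATH`.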

Output files

relative_abundance.txt: relative abundance matrix (columns are samples, rows are species).
count_reads.txt: read count matrix (columns are samples, rows are species).
coverage.txt: genome coverage matrix (columns are samples, rows are species).
species_prevalence.txt: summary statistics for each species across samples.
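
As a rough illustration of how the matrix files might be read downstream, here is a sketch using only the Python standard library. It assumes the files are tab-delimited with a header row whose first column is the species identifier and whose remaining columns are sample names; that layout is an assumption, so check it against your own output. `read_matrix` is a hypothetical helper, and the values shown are made up for the example.

```python
import csv
import io


def read_matrix(handle):
    """Parse a species-by-sample matrix into (sample names, {species: values}).

    Assumes tab-delimited text with a header row: species column, then samples.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = [float(value) for value in row[1:]]
    return samples, matrix


# Made-up stand-in for relative_abundance.txt (rows are species, columns are samples).
example = (
    "species_id\tsample_1\tsample_2\n"
    "species_A\t0.75\t0.25\n"
    "species_B\t0.25\t0.75\n"
)
samples, matrix = read_matrix(io.StringIO(example))
```

For a real file, pass an open file handle instead, for example `read_matrix(open("/path/to/outdir/relative_abundance.txt"))`.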

Output formats

species_prevalence

  • species_id: species identifier
  • species_name: unique species name
  • mean_coverage: average read-depth across samples
  • median_coverage: median read-depth across samples
  • mean_abundance: average relative abundance across samples
  • median_abundance: median relative abundance across samples
  • prevalence: number of samples with coverage >= MIN_COV (the threshold set by --sample_depth; default 1.0)
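
To make the column definitions concrete, the sketch below computes the same kinds of statistics for a single species from its per-sample values. It is an illustration under stated assumptions, not the script's actual implementation: `summarize_species` is a hypothetical function, the input numbers are made up, and `min_cov` stands in for the `--sample_depth` threshold.

```python
from statistics import mean, median


def summarize_species(coverage, abundance, min_cov=1.0):
    """Summarize one species across samples (hypothetical helper).

    coverage  -- per-sample read-depth values for the species
    abundance -- per-sample relative abundances for the species
    min_cov   -- minimum depth for a sample to count toward prevalence
    """
    return {
        "mean_coverage": mean(coverage),
        "median_coverage": median(coverage),
        "mean_abundance": mean(abundance),
        "median_abundance": median(abundance),
        # Prevalence counts the samples that meet the depth threshold.
        "prevalence": sum(1 for depth in coverage if depth >= min_cov),
    }


# Three made-up samples; only two reach the default threshold of 1.0.
summary = summarize_species([0.5, 2.0, 3.5], [0.1, 0.2, 0.6])
```

Raising `min_cov` makes the prevalence count stricter without changing the mean and median columns.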

Memory usage

  • This step uses a negligible amount of memory

Next steps

Profile strain-level gene content of species
Profile strain-level SNPs of species