
The pipeline

Pavel V. Dimens edited this page Dec 6, 2021 · 2 revisions

So what does gust actually do? Well, a few things, so let's walk through them.

  1. FASTA-format assemblies are converted to FASTQ format, with every base given the dummy quality score J
  2. The FASTQ'd assemblies are "fragmented" by sliding a window along each sequence, advancing 1 bp at a time
  3. The fragmented assemblies are mapped against the reference genome
  4. The alignments are used to call SNPs with freebayes
  5. The raw SNPs are filtered to retain only the highest-quality sites
    • no missing data
    • a bunch of quality filters
    • indels decomposed
    • all alleles in the reference sample must be the reference allele (anything else is a genotyping error)
  6. SNPs are thinned to retain x SNPs every y basepairs (reduce data size and redundancy)
  7. The VCF is converted into FASTA for multiple-sequence alignment ("MSA")
  8. MAFFT performs MSA to get the best possible alignment under multiple scenarios
  9. RAxML is run once (bootstrapped) on the best MSA
  10. The mutation model is refined and optimized, and RAxML is rerun
  11. Basic plot of tree topology
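
To make steps 1, 2, and 6 more concrete, here is a minimal Python sketch of the ideas behind them. It is not gust's actual code: the function names, the window length, and the binning rule used for thinning (keep at most x SNPs in each consecutive y-bp bin) are all assumptions made for illustration.

```python
def fasta_to_fastq(name, seq, qual_char="J"):
    """Build a 4-line FASTQ record with the same dummy quality at every base."""
    return f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n"


def fragment(seq, window, step=1):
    """Yield (start, fragment) pairs from a window sliding along seq.

    `window` (the fragment length) is a hypothetical parameter; the
    1 bp step matches the description in step 2.
    """
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


def thin(positions, x, y):
    """Keep at most x SNP positions in each consecutive y-bp bin.

    This binning rule is one assumed reading of
    "retain x SNPs every y basepairs".
    """
    kept = []
    counts = {}
    for pos in sorted(positions):
        bin_id = pos // y
        if counts.get(bin_id, 0) < x:
            counts[bin_id] = counts.get(bin_id, 0) + 1
            kept.append(pos)
    return kept


if __name__ == "__main__":
    # Step 1: a toy assembly becomes a FASTQ record with all-J qualities
    print(fasta_to_fastq("contig1", "ACGTA"), end="")
    # Step 2: fragments of length 3, advancing 1 bp each time
    for start, frag in fragment("ACGTA", window=3):
        print(start, frag)
    # Step 6: at most 2 SNPs per 10 bp bin
    print(thin([1, 2, 3, 12, 25], x=2, y=10))
```

Because the window advances by only 1 bp, each base ends up covered by many overlapping fragments, which is what lets the fragments be mapped to the reference as if they were reads.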

Gust likes to be verbose in the message prompts for every task, so it will be very clear what it's doing as it's doing it.
