Skip to content

Rowena-h/GaeumannomycesGenomics

Repository files navigation

Gaeumannomyces Genomics

Workflow schematic

Bioinformatics workflow for:

Hill et al. (2024) Evolutionary genomics reveals variation in structure and genetic content implicated in virulence and lifestyle in the genus Gaeumannomyces. biorxiv. doi:10.1101/2024.02.15.580261

The pipeline was written for and run on Norwich BioScience Institutes' HPC cluster which uses the SLURM batch-queue system. This means that many of the bash scripts (.sh file endings) specify core allocation, run times and memory usage allocation that may need to be adapted for different platforms.

Associated data files: DOI


1 De novo genome assembly

cd 01_assembly πŸ“

Checking reads

  1. sbatch -a 1-9 001_kat_hist.sh produces kmer histograms for the raw HiFi reads using KAT.

Assembly

  1. sbatch -a 1-9 002_hifiasm.sh assembles HiFi reads using hifiasm.
  2. sbatch -a 1-9 003_kat_comp.sh checks for content correctness of assemblies with respect to the input HiFi reads using KAT.

Assessment

  1. sbatch 004_quast.sh produces assembly contiguity statistics using QUAST.
  2. sbatch -a 1-9 005_busco_asco.sh produces gene set completeness statistics using BUSCO.
  3. sbatch -a 1-9 006_blastn.sh searches the assembly against the NCBI nucleotide database using BLAST to produce input for BlobTools.
  4. sbatch -a 1-9 007_readsmap.sh maps HiFi reads to the assembly using minimap2 to produce input for BlobTools.
  5. sbatch -a 1-9 008_blobtools.sh runs a contamination check using BlobTools.

Filtering

  1. sbatch -a 1-9 009_kat_sect.sh estimates assembly coverage levels using KAT.
  2. sbatch -a 1-9 010_filter_lowcov.sh filters out small, low coverage sequences using a custom script, low_cov_deleter.R.

Final assessment

  1. sbatch -a 1-9 011_kat_comp.sh reruns KAT comp to check final content correctness of the filtered assemblies.
  2. sbatch -a 1-9 012_busco_asco.sh reruns BUSCO on the filtered assemblies.
  3. sbatch 013_quast.sh reruns QUAST on the filtered assemblies.
  4. sbatch -a 1-9 014_tapestry.sh predicts telomeric repeats on the ends of fragments using Tapestry.
    sbatch -a 1-9 014_tidk.sh predicts telomeric repeats across fragments using tidk.
  5. Script to plot figures: plot_missing_BUSCOs.R

2 Structural annotation

cd 02_structural_annotation πŸ“

  1. 000_liftoff.txt contains the commands to generate the liftover gene model using the previous annotation of G. tritici using LiftOff.
  2. 001_eirepeat.txt contains the commands used to predict and mask repeat content using the eirepeat pipeline, which is comprised of RepeatModeler, RepeatMasker and Red.
  3. 002_reat_transcriptome.txt contains the commands used to predict gene models using REAT transcriptome module. This makes Mikado to consolidate gene models.
  4. 003_reat_homology.txt contains the commands used to predict gene models using REAT homology module. This module uses Splan to generate protein to genome alignments and leverages Mikado to score them and generate a final conscensus of the best gene models.
  5. 004_reat_prediction.txt contains the commands used to predict gene models using REAT transcriptome module. It uses evidence provided by the final output of the previous two modules, Portcullis and eirepeat to train Augustus to generate four different annotations using differential weighting of the evidences. It also generate a conscensus track using EVM and runs Mikado to recover UTR information.
  6. 005_minos.txt contains the commands used to perform a first consolidation of the all the gene models produced before using MINOS. This tool runs blastp, kallisto and CPC2 to generate score metrics to use as an input in Mikado to make sensible and standarized decision of the best gene models given a series of high quality annota
  7. 006_rnammer.txt contains the commands used to predict ribosomal RNA sub units using RNAmmer.
  8. 007_liftover_multi.txt contains the commands used to perform an all-versus-all comparison of gene models across all strains using LiftOff.
  9. 008_filter_rRNA.txt contains the commands used to filter the input annotations to the final MINOS step to avoid scoring rRNA as possible protein coding genes.
  10. 009_rerun_minos.txt contains the commands used to perform a final minos including the liftover annotation of the all vs. all comparsion of all the previous strain annotations generated in the step 5.
  11. sbatch -a 1-9 010_mitohifi.sh identifies mitochondrial contigs and annotates mitochondrial genes.

3 Functional annotation

cd 03_functional_annotation πŸ“

  1. 001_ahrd.txt contains the commands used to perform functional annotation of protein sets using AHRD, via the snakemake pipeline eifunannot.
  2. sbatch -a 1-9 002_run_dbcan.sh predicts CAZymes from protein sets using run_dbcan.
  3. sbatch -a 1-9 003_antismash.sh predicts secondary metabolites from protein sets using antiSMASH.
  4. 004_eggnog-mapper.txt contains the commands used to perform functional annotation of protein sets using eggNOG-mapper.

CSEP prediction

  1. sbatch -a 1-9 005_CSEP_prediction.sh submits a suite of tools that feed into CSEP prediction: deepsig.sh (DeepSig), deeploc.sh (DeepLoc), effectorp1.sh (EffectorP v1), effectorp2.sh (EffectorP v2), effectorp3.sh (EffectorP v3), phobius.sh (Phobius), ps_scan.sh, signalp3.sh (SignalP v3), signalp4.sh (SignalP v4.1), signalp6.sh (SignalP v6), targetp.sh (TargetP), tmhmm.sh (TMHMM).
  2. sbatch 006_CSEPfilter.sh runs CSEPfilter to produce a list of CSEPs from the outputs of tools listed above.
  3. sbatch -a 1-9 007_blast.sh searches CSEP sequences against the PHI-base database (requires phi-base_current.fas and phi-base_current.csv to be downloaded into this directory from here). Also searches for avenacinase gene and MAT loci in assemblies.

4 Phylogenetic classification

cd 04_phylogenetic_classification πŸ“

This folder contains a file - markers - listing the genetic markers selected for building the trees.

Gene extraction

  1. sbatch 001_genepull.sh uses GenePull to extract selected genetic markers from filtered assemblies of each strain. Requires fasta files containing a single example sequence from a closely related taxon for each genetic marker being extracted.

Alignment and ML tree building

  1. sbatch 002_align_multicopy.sh makes gene alignments using MAFFT for gdo and ITS2 for distinguishing Gaeumannomyces genetic groups and the avenacinase gene.
  2. sbatch 003_raxmlng_genetree.sh builds ML gene trees for each marker using RAxML-NG with bootstrapping until convergence or up to 1,000 replicates (whichever first).

5 Phylogenomics

cd 05_phylogenomics πŸ“

  1. sbatch 001_orthofinder.sh infers phylogenetic hierarchical orthogroups (HOGs) using OrthoFinder.
  2. sbatch 002_align_singlecopy.sh submits batch array jobs to align all single-copy HOGs using MAFFT and trim using trimAl.
  3. sbatch 003_concat.sh concatenates single-copy HOG alignments using AMAS.
  4. sbatch 004_raxmlng.sh builds genome-scale ML species trees using RAXML-NG with bootstrapping until convergence or up to 1,000 replicates (whichever first).
  5. Script to plot figure: plot_trees.R

6 Synteny and structure

cd 06_synteny πŸ“

  1. sbatch 001_genespace.sh formats protein and gff3 files and submits genespace.R to infer synteny between strains using GENESPACE.
  2. sbatch -a 1,9 002_readsmap_filter.sh filters secondary alignments out of BAM files produced during assembly for visualising read alignment across translocations in R (see below).
  3. sbatch 003_gc.sh calculates GC content in 100,000 bp windows across each genome using bedtools.
  4. sbatch -a 1-9 004_RIP.sh calculates the composite RIP index (CRI) across each genomes using this perl script.
  5. sbatch -a 1-9 005_agp_prep.sh prepares preliminary AGP files to inform the arrangement of contigs into chromosomes.
  6. sbatch -a 1-9 006_agptools.sh creates new gapped fasta files from manually curated AGP files and updates annotation gff3 files accordingly. Calls label_variants.R to label alternatively spliced genes in the gff3.
  7. sbatch -a 1-9 007_table2asn.sh produces .sqn files combining the assembly and annotation for NCBI submission. Requires ncbi_template.sbt downloaded from here.
  8. Scripts to plot figures: plot_genespace.R, plot_read_coverage.R

7 Comparative genomics

cd 07_comparative_genomics πŸ“

  1. sbatch 001_orthogroup_assignment.sh submits orthogroup_assigner.R which makes abundance matrices of HOGs from OrthoFinder.
  2. sbatch 002_big-scape.sh predicts biosynthetic gene clusters from earlier antiSMASH output using BiG-SCAPE.
  3. sbatch 003_lifestyle_test.sh submits lifestyle_v_phylogeny.R which prepares input files and submits permanova.sh, a PERMANOVA-based test comparing the effect of phylogeny versus lifestyle on gene variance. run_edited.py is modified from the original script run.py by Mesny & Vannier.
  4. sbatch 004_starfish.sh predicts Starship giant mobile elements using starfish.
  5. sbatch 005_align_tyr.sh aligns tyrosine recombinase genes predicted by starfish using MAFFT and trims alignments using trimAl.
  6. sbatch 006_raxmlng_tyr.sh builds an ML gene tree for tyrosine recombinases using RAxML-NG with bootstrapping until convergence or up to 1,000 replicates (whichever first).
  7. Rscript go_enrichment.R runs a GO term enrichment of high copy-number HOGs using topGO.
  8. Scripts to plot figures: plot_ideograms.R, plot_gene_content.R

Citation

Hill et al. (2024) Evolutionary genomics reveals variation in structure and genetic content implicated in virulence and lifestyle in the genus Gaeumannomyces. biorxiv. doi:10.1101/2024.02.15.580261