Snakemake pipeline for bacterial SNP distance, recombination and phylogenetic analysis
All the code is in the Snakefile and is written in snakemake.
The pipeline takes a number of WGS bacterial genome fastq files and outputs:
-
VCF file with unique (minimum one sample doesn't have the variant), high quality variants
-
SNP-distance matrix based on the filtered VCF file
-
Phylogenetic tree (if more than 3 samples were used for the analysis) based on the filtered VCF file and created using RAxML
-
Summary statistics including the number of variants identified before and after region of recombination filtering,the number of positions included in the analysis and the number of positions in the reference genome.
Optional:
- Recombination event prediction and files 1. and 2. with excluded variants from predicted recombination sites.
Before running the pipeline, make sure that the following programs are installed and added to the path:
GNU parallel >=2013xxxx
Perl>=5.12
Perl Modules: Time::Piece (core with modern Perl)
Bioperl >= 1.6
bwa mem>=0.7.12
readseq>=2.0
samclip>=0.2
bedtools>2.0
freebayes>=1.1
vcflib>=1.0
vcftools>=0.1.16
snpeff>=4.3
minimap2>=2.6
seqtk>=1.2
snp-sites>=2.0
snippy>=4.1.0
vt>=0.5
bcftools>=1.9
samtools>=1.9
raxml>=8.2.11
Snippy4 doesn't work with python3. Python3 should be disabled at that step and Python2 should be available.
Reference genome should not contain plasmid sequences for the results to be more accurate. More than one sequence in the Genbank format reference genome is not accepted.
In order for the pipeline to run, a configuration file is needed. A configuration file requires 4 fields to be present:
- sample_dir - Directory in which all fastq files which need to be analyzed are present.
- output_dir - Output directory where all the output files will be created.
- ref - Reference genome in gbk format.
- name - Chosen name for the analysis. The name is used as a prefix in output and intermediate files.
- ClonalFrameML -
True
if ClonalFrameML recombination analysis should be run,False
if the analysis should be skipped.
Here is the example configuration file:
sample_dir: "/home/project_name/fastq"
output_dir: "/home/project_name/AX01"
ref: "/home/reference_genomes/GCF_001457475.1_NCTC10807_genomic.gbk"
name: "AX01"
ClonalFrameML: True
Input files should be in fastq format.
If there are less than 4 samples, RAxML phylogenetic analysis and ClonalFrame recombination analysis will be skipped. With 3 samples RAxML analysis with the reference genome will be performed.
In order to run the pipeline anaconda3 (version 4.0.0) has to be available. Snakemake is started from its directory:
-j option allows to choose the number of threads (1-28) used for the analysis (default:1)
--configfile option allows to chose the configuration file for the analysis
Here is the example code for running snakemake:
snakemake -j 10 --configfile config.yaml
Pipeline can be rerun when new samples are added, old samples are deleted, etc.
First, all the new samples have to be added to the sample directory.
Second, the following directories have to be deleted:
- temp
- output_files
- vcf_calls
After completing the steps a pipeline can be rerun with different samples.
Migle Gabrielaite, Maria-Anna Misiakou, Rasmus L. Marvig
BacDist: Snakemake pipeline for bacterial SNP distance and phylogeny analysis
doi: 10.5281/zenodo.3667680
Migle Gabrielaite | migle.gabrielaite@regionh.dk
Maria-Anna Misiakou | maria.anna.misiakou@regionh.dk