Skip to content

MigleSur/BacDist

Repository files navigation

BacDist

Snakemake pipeline for bacterial SNP distance, recombination and phylogenetic analysis

General information

All the code is in the Snakefile and is written in snakemake.

The pipeline takes a number of WGS bacterial genome fastq files and outputs:

  1. VCF file with unique (minimum one sample doesn't have the variant), high quality variants

  2. SNP-distance matrix based on the filtered VCF file

  3. Phylogenetic tree (if more than 3 samples were used for the analysis) based on the filtered VCF file and created using RAxML

  4. Summary statistics including the number of variants identified before and after region of recombination filtering,the number of positions included in the analysis and the number of positions in the reference genome.

Optional:

  1. Recombination event prediction and files 1. and 2. with excluded variants from predicted recombination sites.

Required software

Before running the pipeline, make sure that the following programs are installed and added to the path:

GNU parallel >=2013xxxx
Perl>=5.12
Perl Modules: Time::Piece (core with modern Perl)
Bioperl >= 1.6
bwa mem>=0.7.12
readseq>=2.0
samclip>=0.2
bedtools>2.0
freebayes>=1.1
vcflib>=1.0
vcftools>=0.1.16
snpeff>=4.3
minimap2>=2.6
seqtk>=1.2
snp-sites>=2.0
snippy>=4.1.0
vt>=0.5
bcftools>=1.9
samtools>=1.9
raxml>=8.2.11

Optional software to run recombination analysis:

ClonalFrameML>=20170927

Disclaimer

Snippy4 doesn't work with python3. Python3 should be disabled at that step and Python2 should be available.

Reference genome should not contain plasmid sequences for the results to be more accurate. More than one sequence in the Genbank format reference genome is not accepted.

Setting up the config.yaml file

In order for the pipeline to run, a configuration file is needed. A configuration file requires 4 fields to be present:

  • sample_dir - Directory in which all fastq files which need to be analyzed are present.
  • output_dir - Output directory where all the output files will be created.
  • ref - Reference genome in gbk format.
  • name - Chosen name for the analysis. The name is used as a prefix in output and intermediate files.
  • ClonalFrameML - True if ClonalFrameML recombination analysis should be run, False if the analysis should be skipped.

Here is the example configuration file:

sample_dir: "/home/project_name/fastq"
output_dir: "/home/project_name/AX01"
ref: "/home/reference_genomes/GCF_001457475.1_NCTC10807_genomic.gbk"
name: "AX01"
ClonalFrameML: True

Input file requirements

Input files should be in fastq format.

If there are less than 4 samples, RAxML phylogenetic analysis and ClonalFrame recombination analysis will be skipped. With 3 samples RAxML analysis with the reference genome will be performed.

Running the pipeline

In order to run the pipeline anaconda3 (version 4.0.0) has to be available. Snakemake is started from its directory:

-j option allows to choose the number of threads (1-28) used for the analysis (default:1)
--configfile option allows to chose the configuration file for the analysis

Here is the example code for running snakemake:

	snakemake -j 10 --configfile config.yaml 

Rerunning the pipeline

Pipeline can be rerun when new samples are added, old samples are deleted, etc.

First, all the new samples have to be added to the sample directory.

Second, the following directories have to be deleted:

  • temp
  • output_files
  • vcf_calls

After completing the steps a pipeline can be rerun with different samples.

Citation

Migle Gabrielaite, Maria-Anna Misiakou, Rasmus L. Marvig
BacDist: Snakemake pipeline for bacterial SNP distance and phylogeny analysis
doi: 10.5281/zenodo.3667680

Author

Migle Gabrielaite | migle.gabrielaite@regionh.dk

Contributors

Maria-Anna Misiakou | maria.anna.misiakou@regionh.dk

About

Snakemake pipeline for bacterial SNP distance and phylogeny analysis

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages