BacDist

Snakemake pipeline for bacterial SNP distance, recombination and phylogenetic analysis

General information

All the code is in the Snakefile and is written in snakemake.

The pipeline takes a number of WGS bacterial genome fastq files and outputs:

VCF file with unique (minimum one sample doesn't have the variant), high quality variants
SNP-distance matrix based on the filtered VCF file
Phylogenetic tree (if more than 3 samples were used for the analysis) based on the filtered VCF file and created using RAxML
Summary statistics including the number of variants identified before and after region of recombination filtering,the number of positions included in the analysis and the number of positions in the reference genome.

Optional:

Recombination event prediction and files 1. and 2. with excluded variants from predicted recombination sites.

Required software

Before running the pipeline, make sure that the following programs are installed and added to the path:

GNU parallel >=2013xxxx
Perl>=5.12
Perl Modules: Time::Piece (core with modern Perl)
Bioperl >= 1.6
bwa mem>=0.7.12
readseq>=2.0
samclip>=0.2
bedtools>2.0
freebayes>=1.1
vcflib>=1.0
vcftools>=0.1.16
snpeff>=4.3
minimap2>=2.6
seqtk>=1.2
snp-sites>=2.0
snippy>=4.1.0
vt>=0.5
bcftools>=1.9
samtools>=1.9
raxml>=8.2.11

Optional software to run recombination analysis:

ClonalFrameML>=20170927

Disclaimer

Snippy4 doesn't work with python3. Python3 should be disabled at that step and Python2 should be available.

Reference genome should not contain plasmid sequences for the results to be more accurate. More than one sequence in the Genbank format reference genome is not accepted.

Setting up the config.yaml file

In order for the pipeline to run, a configuration file is needed. A configuration file requires 4 fields to be present:

sample_dir - Directory in which all fastq files which need to be analyzed are present.
output_dir - Output directory where all the output files will be created.
ref - Reference genome in gbk format.
name - Chosen name for the analysis. The name is used as a prefix in output and intermediate files.
ClonalFrameML - True if ClonalFrameML recombination analysis should be run, False if the analysis should be skipped.

Here is the example configuration file:

sample_dir: "/home/project_name/fastq"
output_dir: "/home/project_name/AX01"
ref: "/home/reference_genomes/GCF_001457475.1_NCTC10807_genomic.gbk"
name: "AX01"
ClonalFrameML: True

Input file requirements

Input files should be in fastq format.

If there are less than 4 samples, RAxML phylogenetic analysis and ClonalFrame recombination analysis will be skipped. With 3 samples RAxML analysis with the reference genome will be performed.

Running the pipeline

In order to run the pipeline anaconda3 (version 4.0.0) has to be available. Snakemake is started from its directory:

-j option allows to choose the number of threads (1-28) used for the analysis (default:1)
--configfile option allows to chose the configuration file for the analysis

Here is the example code for running snakemake:

	snakemake -j 10 --configfile config.yaml

Rerunning the pipeline

Pipeline can be rerun when new samples are added, old samples are deleted, etc.

First, all the new samples have to be added to the sample directory.

Second, the following directories have to be deleted:

temp
output_files
vcf_calls

After completing the steps a pipeline can be rerun with different samples.

Citation

Migle Gabrielaite, Maria-Anna Misiakou, Rasmus L. Marvig
BacDist: Snakemake pipeline for bacterial SNP distance and phylogeny analysis
doi: 10.5281/zenodo.3667680

Author

Migle Gabrielaite | migle.gabrielaite@regionh.dk

Contributors

Maria-Anna Misiakou | maria.anna.misiakou@regionh.dk

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
Snakefile_non_computerome		Snakefile_non_computerome
jobgraph.svg		jobgraph.svg
jobgraph_including_ClonalFrameML.svg		jobgraph_including_ClonalFrameML.svg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LICENSE

LICENSE

README.md

README.md

Snakefile

Snakefile

Snakefile_non_computerome

Snakefile_non_computerome

jobgraph.svg

jobgraph.svg

jobgraph_including_ClonalFrameML.svg

jobgraph_including_ClonalFrameML.svg

Repository files navigation

BacDist

General information

Required software

Optional software to run recombination analysis:

Disclaimer

Setting up the config.yaml file

Input file requirements

Running the pipeline

Rerunning the pipeline

Citation

Author

Contributors

About

Releases 1

Packages

Contributors 2

Languages

License

MigleSur/BacDist

Folders and files

Latest commit

History

Repository files navigation

BacDist

General information

Required software

Optional software to run recombination analysis:

Disclaimer

Setting up the config.yaml file

Input file requirements

Running the pipeline

Rerunning the pipeline

Citation

Author

Contributors

About

Resources

License

Stars

Watchers

Forks

Languages