Pactyper

Snakemake pipeline for continuous clone type prediction of WGS-sequenced bacterial isolates based on their core genome.

General information

All the code is in the Snakefile and is written in Snakemake. Users who are not running the pipeline on Computerome should use the Snakefile_non_computerome file instead (rename it to 'Snakefile').

The pipeline takes (1) a fastq file of the genome of interest and (2) a core genome FASTA file as input, and produces the following outputs:

Output file description	Output file location
Alignment quality statistics for the input sample	`[output_dir]/sample_alignments/alignment_statistics.txt`
Aligned fasta file with SNPs applied to the core genome sequence	`[output_dir]/sample_alignments/[sample_name].aligned.fa`
VCF file with all the called variants in the core genome	`[output_dir]/sample_alignments/[sample_name].vcf.gz`
Clone type prediction for the input sample	`[output_dir]/sample_alignments/[sample_name]_clonetype_summary.txt`
SNP distance matrix which includes all the samples already present in the matrix	`[output_dir]/sample_alignments/[sample_name]_clonetype_snp_distances.txt`
SNP distance matrix visualization	IN PROGRESS, not implemented yet
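
To check a finished run, the output locations in the table above can be assembled from `output_dir` and the sample name. The sketch below is only an illustration: the function name `expected_outputs` and the dictionary keys are hypothetical and are not part of the pipeline.

```python
from pathlib import Path


def expected_outputs(output_dir: str, sample_name: str) -> dict:
    """Return the output paths listed in the table above (hypothetical helper)."""
    base = Path(output_dir) / "sample_alignments"
    return {
        "alignment_statistics": base / "alignment_statistics.txt",
        "aligned_fasta": base / f"{sample_name}.aligned.fa",
        "vcf": base / f"{sample_name}.vcf.gz",
        "clonetype_summary": base / f"{sample_name}_clonetype_summary.txt",
        "snp_distances": base / f"{sample_name}_clonetype_snp_distances.txt",
    }


def missing_outputs(output_dir: str, sample_name: str) -> list:
    """List the keys of expected output files that do not exist on disk."""
    paths = expected_outputs(output_dir, sample_name)
    return [key for key, path in paths.items() if not path.exists()]
```

Calling `missing_outputs(output_dir, sample_name)` after a run returns an empty list when every file from the table is present.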

Required software

Before running the pipeline, make sure that the following programs are installed and added to the path:

GNU parallel >=2013xxxx
Perl>=5.12
Perl Modules: Time::Piece (core with modern Perl)
Bioperl >= 1.6
bwa mem>=0.7.12
readseq>=2.0
samclip>=0.2
bedtools>2.0
freebayes>=1.1
vcflib>=1.0
vcftools>=0.1.16
snpeff>=4.3
minimap2>=2.6
seqtk>=1.2
snp-sites>=2.0
snippy>=4.1.0
vt>=0.5
samtools>=1.9
seqkit>=0.7
snp-dists>=0.6.3
datamash>=1.4

Setting up the config.yaml file

In order for the pipeline to run, a configuration file is needed. A configuration file requires 6 fields to be present:

Field name	Description
input_dir	Directory containing all fastq files to be analyzed
input_sample	Unique prefix of the sample for which the clone type is to be predicted
core_genome	Full path to the FASTA file containing all core genome genes
output_dir	Directory where all output files will be stored
include	Whether the input sample should be added to the final matrix with its predicted clone type and used in future iterations
SNP_distance	Number of SNP differences in the core genome at which isolates are defined as belonging to different clone types

Here is an example config.yaml file:

input_dir: "/home/project_name/fastqs"
input_sample: "551_12062011-DK10-0"
core_genome: "/home/project_name/Pseudomonas_aeruginosa/Pseudomonas_aeruginosa_core_genome.fasta"
output_dir: "/home/project_name/output_files"
include: True
SNP_distance: 5000
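
Because all six fields are required, it can help to check a configuration before starting a run. The following is a minimal sketch, assuming the configuration has already been loaded into a Python dictionary (for example with PyYAML's `yaml.safe_load`); the helper `missing_fields` is hypothetical, and the pipeline itself may not perform this check.

```python
# The six field names come from the table above.
REQUIRED_FIELDS = (
    "input_dir",
    "input_sample",
    "core_genome",
    "output_dir",
    "include",
    "SNP_distance",
)


def missing_fields(config: dict) -> list:
    """Return the required field names that are absent from the config (hypothetical helper)."""
    return [field for field in REQUIRED_FIELDS if field not in config]


# Same values as the example config.yaml above, written as a dictionary.
example_config = {
    "input_dir": "/home/project_name/fastqs",
    "input_sample": "551_12062011-DK10-0",
    "core_genome": "/home/project_name/Pseudomonas_aeruginosa/Pseudomonas_aeruginosa_core_genome.fasta",
    "output_dir": "/home/project_name/output_files",
    "include": True,
    "SNP_distance": 5000,
}
```

An empty list from `missing_fields(example_config)` means every required field is present.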

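Overwriting configuration file in the command line

Fields of config.yaml can be overridden on the command line with snakemake's --config option (see "Running the pipeline").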

Input files and input requirements

The code is written to (1) build and (2) apply and extend the clone type matrix over time.

The first input sample is assigned the [prefix]001 clone type. The include field is automatically set to "True" for the first and second input samples.
Starting from the second input sample, every input sample is compared to the existing samples. If its SNP distance to an existing sample is lower than the threshold defined in the configuration file, the clone type of that sample is assigned. If no existing clone type is close enough to the input sample, a new clone type is assigned.
If the configuration file states that the input sample should be included in the final matrix, the sample is added to the final distance matrix with its predicted clone type.
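
The assignment rule described above can be sketched as follows. This is not the code from the Snakefile, only an illustration under stated assumptions: the function `assign_clone_type`, the default prefix `"CT"`, picking the nearest sample when several are below the threshold, and numbering a new clone type as "number of existing clone types plus one" are all assumptions made for the example.

```python
from typing import Dict


def assign_clone_type(
    distances: Dict[str, int],
    clone_types: Dict[str, str],
    snp_distance: int,
    prefix: str = "CT",  # hypothetical prefix, stands in for [prefix]
) -> str:
    """Pick a clone type for a new sample (illustrative sketch only).

    distances:    SNP distance from the new sample to each existing sample
    clone_types:  clone type already assigned to each existing sample
    snp_distance: threshold from the configuration file
    """
    # Keep only existing samples strictly closer than the threshold.
    close = {s: d for s, d in distances.items() if d < snp_distance}
    if close:
        # Assumption: when several samples qualify, use the nearest one.
        nearest = min(close, key=close.get)
        return clone_types[nearest]
    # No sample is close enough: open a new clone type.
    # Assumption: types are numbered consecutively with three digits.
    n_existing = len(set(clone_types.values()))
    return f"{prefix}{n_existing + 1:03d}"
```

With an empty matrix the function returns the `001` clone type, matching the behaviour described for the first input sample.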

Disclaimer

No files should be deleted from the output directory (output_dir), or the pipeline will fail during the next run.

Snippy4 doesn't work with python3. Python3 should be disabled at that step and Python2 should be available.

Running the pipeline

To run the pipeline, anaconda3 (version 4.0.0) has to be available. Snakemake is started from the pipeline directory with the following options:

-j option sets the number of threads (1-28) used for the analysis (default: 1)
--configfile option selects the configuration file for the analysis
--config option overrides fields of the configuration file

Here is an example command for running snakemake:

snakemake -j 10 --configfile config.yaml --config input_sample="test_sample"

Rerunning the pipeline

The pipeline can be rerun when new samples are added, either by changing the sample name in the config.yaml file or by overriding the configuration file in the command line.
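
When several new samples have to be processed, the command line override can be scripted. The sketch below only builds the argument lists, using the same options as the example command in "Running the pipeline"; the helper `snakemake_command` and the sample names are hypothetical.

```python
def snakemake_command(sample: str, threads: int = 1, configfile: str = "config.yaml") -> list:
    """Build the snakemake call for one sample (hypothetical helper)."""
    return [
        "snakemake",
        "-j", str(threads),
        "--configfile", configfile,
        # Override only the sample name; all other fields come from the config file.
        "--config", f"input_sample={sample}",
    ]


# Hypothetical sample prefixes, one command per sample.
new_samples = ["sample_A", "sample_B"]
commands = [snakemake_command(sample, threads=10) for sample in new_samples]
```

Each list can then be passed to `subprocess.run(cmd, check=True)`, one sample at a time, since every run compares the new sample against the matrix left by the previous runs.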

Citation

Migle Gabrielaite, Helle K. Johansen, Søren Molin, Finn C. Nielsen, Rasmus L. Marvig
Gene Loss and Acquisition in Lineages of Pseudomonas aeruginosa Evolving in Cystic Fibrosis Patient Airways
doi: 10.1128/mBio.02359-20

Author

Migle Gabrielaite | migle.gabrielaite@regionh.dk
