MigleSur/Pactyper

Pactyper

Snakemake pipeline for continuous clone type prediction for WGS sequenced bacterial isolates based on their core genome.

General information

All the code is in the Snakefile and is written in Snakemake. Users who are not running the pipeline on Computerome should use the Snakefile_non_computerome file instead (rename it to 'Snakefile').

The pipeline takes (1) a FASTQ file of the genome of interest and (2) a core genome FASTA file as input and produces the following outputs:

Output file description | Output file location
Alignment quality statistics for the input sample | [output_dir]/sample_alignments/alignment_statistics.txt
Aligned FASTA file with SNPs applied to the core genome sequence | [output_dir]/sample_alignments/[sample_name].aligned.fa
VCF file with all the called variants in the core genome | [output_dir]/sample_alignments/[sample_name].vcf.gz
Clone type prediction for the input sample | [output_dir]/sample_alignments/[sample_name]_clonetype_summary.txt
SNP distance matrix which includes all the samples already present in the matrix | [output_dir]/sample_alignments/[sample_name]_clonetype_snp_distances.txt
SNP distance matrix visualization | IN PROGRESS, not implemented yet
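
As a rough illustration of the layout described above, the following Python sketch builds the expected output paths for one sample and reports which of them are absent. The function names and dictionary labels are hypothetical and not part of the pipeline; only the path patterns come from the list above.

```python
from pathlib import Path


def expected_outputs(output_dir: str, sample_name: str) -> dict:
    """Return the documented output paths for one sample, keyed by a short label."""
    base = Path(output_dir) / "sample_alignments"
    return {
        "alignment_statistics": base / "alignment_statistics.txt",
        "aligned_fasta": base / f"{sample_name}.aligned.fa",
        "vcf": base / f"{sample_name}.vcf.gz",
        "clonetype_summary": base / f"{sample_name}_clonetype_summary.txt",
        "snp_distances": base / f"{sample_name}_clonetype_snp_distances.txt",
    }


def missing_outputs(output_dir: str, sample_name: str) -> list:
    """Return the labels of expected output files that do not exist on disk."""
    paths = expected_outputs(output_dir, sample_name)
    return [label for label, path in paths.items() if not path.exists()]
```

Calling `missing_outputs(output_dir, sample_name)` after a run gives a quick indication of whether all five files were written.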

Required software

Before running the pipeline, make sure that the following programs are installed and added to the path:

GNU parallel >= 2013xxxx
Perl >= 5.12
Perl Modules: Time::Piece (core with modern Perl)
Bioperl >= 1.6
bwa mem >= 0.7.12
readseq >= 2.0
samclip >= 0.2
bedtools > 2.0
freebayes >= 1.1
vcflib >= 1.0
vcftools >= 0.1.16
snpeff >= 4.3
minimap2 >= 2.6
seqtk >= 1.2
snp-sites >= 2.0
snippy >= 4.1.0
vt >= 0.5
samtools >= 1.9
seqkit >= 0.7
snp-dists >= 0.6.3
datamash >= 1.4
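
A quick way to see which tools are not on the path is sketched below in Python. The executable names in the list are assumptions about how each program is usually installed and may differ on your system. The check does not verify versions, and it does not cover Perl modules, readseq or vcflib, which are not single executables.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
ASSUMED_EXECUTABLES = [
    "parallel", "perl", "bwa", "samclip", "bedtools", "freebayes",
    "vcftools", "snpEff", "minimap2", "seqtk", "snp-sites", "snippy",
    "vt", "samtools", "seqkit", "snp-dists", "datamash",
]


def missing_executables(names=ASSUMED_EXECUTABLES, which=shutil.which):
    """Return the names that cannot be located on PATH."""
    return [name for name in names if which(name) is None]
```

`missing_executables()` returns an empty list when every listed executable is found.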

Setting up the config.yaml file

The pipeline needs a configuration file to run. The configuration file must contain the following 6 fields:

Field name | Description
input_dir | Directory containing all the fastq files to be analyzed
input_sample | Unique prefix of the sample for which the clone type is to be predicted
core_genome | Full path to the FASTA file containing all the core genome genes
output_dir | Directory where all the output files will be stored
include | Whether the input sample should be included in the final matrix with its predicted clone type and used in future iterations
SNP_distance | SNP distance threshold in the core genome: isolates that differ by at least this many SNPs are defined as different clone types

Here is an example config.yaml file:

input_dir: "/home/project_name/fastqs"
input_sample: "551_12062011-DK10-0"
core_genome: "/home/project_name/Pseudomonas_aeruginosa/Pseudomonas_aeruginosa_core_genome.fasta"
output_dir: "/home/project_name/output_files"
include: True
SNP_distance: 5000
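
The sketch below shows one way to sanity-check an already loaded configuration (for example, the `config` dictionary that Snakemake builds from the YAML file) before starting a run. The function name and the type checks on include and SNP_distance are assumptions for illustration; the pipeline itself may validate the file differently or not at all.

```python
REQUIRED_FIELDS = (
    "input_dir", "input_sample", "core_genome",
    "output_dir", "include", "SNP_distance",
)


def check_config(config: dict) -> list:
    """Return a list of problems found in the configuration (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in config]
    # Assumed type expectations, based on the example config.yaml.
    if "include" in config and not isinstance(config["include"], bool):
        problems.append("include should be True or False")
    if "SNP_distance" in config:
        value = config["SNP_distance"]
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            problems.append("SNP_distance should be a non-negative integer")
    return problems
```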

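Overwriting the configuration file on the command line

Individual configuration values can be overridden at run time with the --config option (see "Running the pipeline").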

Input files and input requirements

The pipeline is designed to (1) build the clone type matrix and (2) apply and extend it over time.

  1. The first input sample is assigned the [prefix]001 clone type. include is automatically set to "True" for the first and second input samples.
  2. From the second input sample onwards, each input sample is compared to the existing samples. If its SNP distance to an existing sample is lower than the SNP_distance value in the configuration file, the clone type of that sample is assigned. If no existing clone type is close enough to the input sample, a new clone type is assigned.
  3. If the configuration file states that the input sample should be included in the final matrix, the input sample is added to the final distance matrix together with its predicted clone type.
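
The assignment rule in the steps above can be sketched in Python as follows. This is a simplified illustration, not the code from the Snakefile: the function name, the "CT" prefix, the choice of the nearest sample when several pass the threshold, and the sequential three-digit numbering of new clone types are all assumptions.

```python
def assign_clone_type(distances, clone_types, threshold, prefix="CT"):
    """Pick a clone type for a new sample.

    distances:   existing sample name -> SNP distance to the new sample
    clone_types: existing sample name -> clone type already assigned
    threshold:   the SNP_distance value from the configuration file
    """
    if distances:
        # Assumption: when several samples pass the threshold, use the nearest one.
        nearest = min(distances, key=distances.get)
        if distances[nearest] < threshold:
            return clone_types[nearest]
    # No existing clone type is close enough (or this is the first sample):
    # assumed numbering continues from the count of distinct clone types.
    next_number = len(set(clone_types.values())) + 1
    return f"{prefix}{next_number:03d}"
```

With no existing samples the function returns the first clone type (`CT001` under the assumed prefix), matching step 1.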

Disclaimer

Do not delete any files from the output directory (output_dir); otherwise the pipeline will fail during the next run.

Snippy4 does not work with Python3. Python3 should be disabled at that step and Python2 should be available.

Running the pipeline

To run the pipeline, anaconda3 (version 4.0.0) has to be available. Snakemake is started from the pipeline directory (the one containing the Snakefile) with the following options:

-j sets the number of threads (1-28) used for the analysis (default: 1)
--configfile selects the configuration file for the analysis
--config overrides individual values from the configuration file

Here is an example command for running snakemake:

snakemake -j 10 --configfile config.yaml --config input_sample="test_sample"
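
If the pipeline is launched from a script, the same command can be assembled programmatically. The helper below is a hypothetical sketch (its name and arguments are not part of the pipeline); it only uses the three snakemake options described above and applies the 1-28 thread range stated there.

```python
import shlex


def build_snakemake_command(threads=1, configfile="config.yaml", overrides=None):
    """Build the snakemake argument list for one run."""
    if not 1 <= threads <= 28:
        raise ValueError("threads should be between 1 and 28")
    command = ["snakemake", "-j", str(threads), "--configfile", configfile]
    if overrides:
        # --config takes KEY=VALUE pairs that override the config file.
        command.append("--config")
        command.extend(f"{key}={value}" for key, value in overrides.items())
    return command


command = build_snakemake_command(10, "config.yaml", {"input_sample": "test_sample"})
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the pipeline directory.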

Rerunning the pipeline

The pipeline can be rerun when new samples are added, either by changing the sample name in the config.yaml file or by overriding it on the command line.

Citation

Migle Gabrielaite, Helle K. Johansen, Søren Molin, Finn C. Nielsen, Rasmus L. Marvig
Gene Loss and Acquisition in Lineages of Pseudomonas aeruginosa Evolving in Cystic Fibrosis Patient Airways
doi: 10.1128/mBio.02359-20

Author

Migle Gabrielaite | migle.gabrielaite@regionh.dk