Milestone is an end-to-end sample-based MLST profile creation workflow for bacterial species.
- Milestone Workflow
- Schema Creation
- Allele Calling
- Citation
- Milestone has a fully-automated workflow.
- Milestone creates reference-related files:
Details of <reference>_info.txt
Position (POS), reference (REF), alternate (ALT), and quality (QUAL) information of each variation are separated by specific characters. All variations of an allele are given on the same line as the allele name (cdsName_alleleId) and are separated by commas (,).
- i.e. cdsName_alleleId POS*REF>ALT-QUAL,POS*REF>ALT-QUAL
- Each comma-separated part, POS*REF>ALT-QUAL, represents a variation of an allele.
- Each variation set on a line, POS*REF>ALT-QUAL,POS*REF>ALT-QUAL, represents an allele.
- Each line represents a single allele of a single CDS.
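The line format described above can be read with a few string splits. The following is a minimal sketch, not Milestone's own parser: the names `Variation` and `parse_info_line` are hypothetical, the allele name and the variation list are assumed to be separated by whitespace, and the sample line in the usage note is made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Variation:
    pos: int
    ref: str
    alt: str
    qual: str  # kept as text; the numeric format of QUAL is an open assumption


def parse_info_line(line: str) -> Tuple[str, str, List[Variation]]:
    """Split 'cdsName_alleleId POS*REF>ALT-QUAL,...' into its parts."""
    # Assumption: whitespace separates the allele name from the variation list.
    allele_name, variation_field = line.strip().split(maxsplit=1)
    # The allele ID is taken as the text after the last underscore.
    cds_name, allele_id = allele_name.rsplit("_", 1)
    variations = []
    for part in variation_field.split(","):
        pos, rest = part.split("*", 1)
        ref, rest = rest.split(">", 1)
        alt, qual = rest.split("-", 1)
        variations.append(Variation(int(pos), ref, alt, qual))
    return cds_name, allele_id, variations
```

For example, `parse_info_line("abcZ_2 12*A>G-60,45*T>C-58.2")` yields the CDS name `abcZ`, the allele ID `2`, and two `Variation` records.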
- Milestone assigns an allele ID to the sample's sequence aligned to the CDS based on the following criteria:
- <ID_from_the_reference>: If the variations of the sample's sequence aligned to the CDS completely match an allele-defining variation set given in the text-formatted reference file, it assigns the matching allele ID from the reference file.
- LNF: If the depth of coverage of the sample's CDS is lower than expected, it assigns LNF (Locus Not Found) as the allele ID of the sample's allele.
- 1: If the depth of coverage of the sample's aligned sequence is equal to or greater than the expected value and the sample does not have any variations for the CDS locus, it assigns the allele ID of the reference, which is the longest allele of the reference CDS.
- If the variations of the sample's aligned sequence match none of the allele-defining variation sets given in the text-formatted reference file, it checks the validity of the sample's sequence aligned to the CDS before declaring the sequence a novel allele of the CDS.
- LNF: If the length of the sequence is not a multiple of 3, or the sequence aligned to the CDS contains an in-frame stop codon, an invalid start codon, or an invalid stop codon, it assigns LNF as the allele ID. Bacterial genomes do not contain introns, so such a sequence is not a valid coding sequence.
- ASM: If the sequence passes the validation steps but is more than 20% shorter than the locus allele length mode, it assigns ASM (Alleles Smaller than Mode) to the sample's allele.
- ALM: If the sequence passes the validation steps but is more than 20% longer than the locus allele length mode, it assigns ALM (Alleles Larger than Mode) to the sample's allele.
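The criteria above can be sketched as a single decision function. This is an illustration under stated assumptions, not Milestone's implementation: the function names, the `min_depth` threshold, the set of accepted start codons, the reading of the 20% rule as 0.8 and 1.2 times the length mode, and the `"NOVEL"` placeholder returned for a valid new allele are all hypothetical. The depth check is placed first here because the other checks only make sense for a covered locus.

```python
from typing import Dict, FrozenSet

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons


def is_valid_cds(seq: str) -> bool:
    """Check length, start codon, stop codon, and absence of in-frame stops."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] not in START_CODONS or codons[-1] not in STOP_CODONS:
        return False
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def assign_allele_id(
    variations: FrozenSet[str],
    known_alleles: Dict[str, FrozenSet[str]],
    depth: float,
    min_depth: float,
    seq: str,
    length_mode: int,
) -> str:
    if depth < min_depth:
        return "LNF"  # locus not covered well enough
    if not variations:
        return "1"  # identical to the reference allele
    for allele_id, allele_variations in known_alleles.items():
        if allele_variations == variations:
            return allele_id  # complete match with a known allele
    if not is_valid_cds(seq):
        return "LNF"  # not a valid coding sequence
    if len(seq) < 0.8 * length_mode:
        return "ASM"
    if len(seq) > 1.2 * length_mode:
        return "ALM"
    return "NOVEL"  # hypothetical placeholder for a new allele ID
```

Here each allele is represented as a frozen set of variation strings, so two alleles match only when their variation sets are identical.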
-
Reference update is described below:
This tutorial shows how to create multilocus sequence typing (MLST) profiles from user-defined coding sequences and raw reads. Begin by setting up the environment for the Milestone run, following the instructions below.
-
- Setup
- 1.1. Setting up the data for the tutorial
- 1.2. Setting up the environment for the tutorial
- Linux
- i. Install pip (Pip Installs Packages) using APT (Advanced Packaging Tool)
- ii. Install conda
- iii. Create the conda environment
- macOS
- i. Install homebrew (The Missing Package Manager)
- ii. Install pip (Pip Installs Packages) using homebrew
- iii. Install conda
- iv. Create the conda environment
- 2.1. milestone.py schema_creation
- a. From genome assemblies of species
- b. From coding sequences
- b.1. Only coding sequences are available in the initial set.
- b.2. Coding sequences and their alleles are available in the initial set.
- 2.1.1. Input files
- 2.1.2. Parameters
- 2.1.2.a. Milestone parameters
- 2.1.2.b. Snakemake parameters (*optional)
- 2.1.3. Output files
- 2.2. milestone.py allele_calling
- 2.2.1. Input files
- 2.2.2. Parameters
- 2.2.2.a. Milestone parameters
- 2.2.2.a.1. VG
- 2.2.2.a.2. SBG
- 2.2.2.b. Snakemake parameters (*optional)
- 2.2.3. Output files
- 2.2.3.1. VG
- 2.2.3.2. SBG
- Create a directory named milestone_tutorial for these exercises.
- Copy files from ... into the milestone_tutorial directory.
i. Install pip (Pip Installs Packages) using APT (Advanced Packaging Tool)
sudo apt-get update
sudo apt-get install python3-pip
ii. Install conda
- Follow the instructions on conda's website.
iii. Create the conda environment
conda config --add channels bioconda
conda config --add channels conda-forge
conda create --name milestone bcftools=1.13 biopython=1.79 chewbbaca=2.7.0 htslib=1.13 fastp=0.12 freebayes=1.3.2 minimap2=2.22 pysam=0.16.0.1 samtools=1.13 snakemake=5.32.2 vg=1.34
- You can activate the created environment to work in it:
conda activate milestone
- When your analysis is done, you can deactivate the created environment:
conda deactivate
- Your environment will be kept unless you remove it. You can use it again by activating with the line given above.
i. Install homebrew (The Missing Package Manager)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
ii. Install pip (Pip Installs Packages) using homebrew
brew install python3.8
iii. Install conda
- Follow the instructions on conda's website.
iv. Create the conda environment
conda config --add channels bioconda
conda config --add channels conda-forge
conda create --name milestone bcftools=1.13 biopython=1.79 chewbbaca=2.7.0 htslib=1.13 fastp=0.12 freebayes=1.3.2 minimap2=2.22 pysam=0.16.0.1 samtools=1.13 snakemake=5.32.2
- VG is available through conda only for Linux, so on macOS you also need to install VG locally by following the steps on VG's website.
- You can activate the created environment to work in it:
conda activate milestone
- When your analysis is done, you can deactivate the created environment:
conda deactivate
- Your environment will be kept unless you remove it. You can use it again by activating with the line given above.
- Milestone runs in two modes:
python milestone.py schema_creation
python milestone.py allele_calling
- Milestone creates the Snakefile itself, so the --snakefile SNAKEFILE parameter is not required. Use this parameter only if you need a different layout.
- Milestone creates the config.yaml file itself, so you should not create this file.
- You can use chewBBACA to call alleles using public or user-provided genome assemblies belonging to the species.
- If you prefer using public genome assemblies of the species of interest, you can download the public data by running the download_species_reference_fasta.sh script with the command below:
bash download_species_reference_fasta.sh -s <species_name>
b.1. Only coding sequences are available in the initial set.
- Milestone appends all the coding sequences to create the <reference>.fasta file.
- It creates an empty <reference>_info.txt file for further analysis.
- It creates a <reference>.vcf file containing only a default header for further analysis.
b.2. Coding sequences and their alleles are available in the initial set.
- Milestone appends all the coding sequences to create the <reference>.fasta file.
- It creates a <reference>_info.txt file that identifies the allele set of the user-provided coding sequences for further analysis.
- It creates a <reference>.vcf file containing a default header and the variations between the alleles and their reference coding sequences for further analysis.
schema_name
|- CDS1.fasta
|- CDS2.fasta
|- ...
|- CDSn.fasta
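Before running schema creation, it can help to confirm that the schema directory has the layout shown above. The sketch below counts the FASTA records in each CDS file using only the standard library. The function name `count_alleles` is hypothetical, and it assumes every file in the directory uses the `.fasta` extension and that each `>` header line starts one allele.

```python
from pathlib import Path
from typing import Dict


def count_alleles(schema_dir: str) -> Dict[str, int]:
    """Map each CDS file name (without extension) to its number of FASTA records."""
    counts = {}
    for fasta_path in sorted(Path(schema_dir).glob("*.fasta")):
        with fasta_path.open() as handle:
            # Each header line starting with '>' is counted as one allele.
            counts[fasta_path.stem] = sum(1 for line in handle if line.startswith(">"))
    return counts
```

A CDS that reports a single record has only its reference sequence in the initial set, while a higher count means alleles were provided as well.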
--reference <reference>
Name of reference file to be given without extension and directory. It will be used to name <reference>.fasta, <reference>.vcf, and <reference>_info.txt files. [required: True]
--threads <threads>
Number of threads to be used in the workflow. [default: 1]
--schema_name <schema_name>
Name of directory containing user-defined coding sequences and their alleles. [required: True]
--output <output_directory>
Name of directory for the output reference files. [required: True]
--help
Show this help message and exit.
--dryrun
Display the commands without running. [default: False]
--unlock
Removes possible locks on Snakemake. [default: False]
--quiet
Do not output any progress. [default: False]
--rerun-incomplete
Rerun all incomplete jobs. [default: False]
--forceall
Run all rules regardless of whether their output has already been created. [default: False]
--printshellcmds
Prints the shell command to be executed. [default: False]
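The parameters above can be assembled into a schema_creation call from a script. This is a minimal sketch that only builds and prints the command. The function name `build_schema_creation_command` and the example values (`my_reference`, `schema_name`, `output_directory`) are placeholders, and it assumes milestone.py is in the current working directory.

```python
import shlex
from typing import List


def build_schema_creation_command(
    reference: str,
    schema_name: str,
    output: str,
    threads: int = 1,
    dryrun: bool = False,
) -> List[str]:
    """Build the argument list for 'milestone.py schema_creation'."""
    command = [
        "python", "milestone.py", "schema_creation",
        "--reference", reference,
        "--schema_name", schema_name,
        "--output", output,
        "--threads", str(threads),
    ]
    if dryrun:
        command.append("--dryrun")  # display the commands without running them
    return command


# Placeholder values for illustration only.
command = build_schema_creation_command(
    "my_reference", "schema_name", "output_directory", threads=4, dryrun=True
)
print(shlex.join(command))
```

To execute the command rather than print it, the list can be passed to `subprocess.run(command, check=True)`. Starting with `--dryrun` is a cheap way to inspect the planned steps first.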
output_directory
|- <reference>.fasta
|- <reference>.vcf.gz
|- <reference>_info.txt
- schema_name will only be used to add novel allele sequences identified in the analyzed sample.
- Allele info for the reference is already available in the <reference>_info.txt file.
output_directory
|- <reference>.fasta
|- <reference>.vcf.gz
|- <reference>_info.txt
<sample_read1>
<sample_read2>
schema_name
|- CDS1.fasta
|- CDS2.fasta
|- ...
|- CDSn.fasta
--aligner vg
--read1 <sample_read1>
First read of sample including its directory. [required: True]
--read2 <sample_read2>
Second read of sample including its directory. [required: True]
--reference <reference>
Name of reference file to be given without extension and directory. It will be used to name <reference>.fasta, <reference>.vcf, and <reference>_info.txt files. [required: True]
--threads <threads>
Number of threads to be used in the workflow. [default: 1]
--schema_name <schema_name>
Name of directory containing user-defined coding sequences and their alleles. [required: True]
--output <output_directory>
Name of directory for the output reference files. [required: True]
--update_reference
Updates <reference>_info.txt and <reference>.vcf files by adding the information coming from the analyzed sample. [default: False]
--aligner sbg
--read1 <sample_read1>
First read of sample including its directory. [required: True]
--read2 <sample_read2>
Second read of sample including its directory. [required: True]
--reference <reference>
Name of reference file to be given without extension and directory. It will be used to name <reference>.fasta, <reference>.vcf, and <reference>_info.txt files. [required: True]
--threads <threads>
Number of threads to be used in the workflow. [default: 1]
--schema_name <schema_name>
Name of directory containing user-defined coding sequences and their alleles. [required: True]
--output <output_directory>
Name of directory for the output reference files. [required: True]
--update_reference
Updates <reference>_info.txt and <reference>.vcf files by adding the information coming from the analyzed sample. [default: False]
--help
Show this help message and exit.
--dryrun
Display the commands without running. [default: False]
--unlock
Removes possible locks on Snakemake. [default: False]
--quiet
Do not output any progress. [default: False]
--rerun-incomplete
Rerun all incomplete jobs. [default: False]
--forceall
Run all rules regardless of whether their output has already been created. [default: False]
--printshellcmds
Prints the shell command to be executed. [default: False]
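An allele_calling run takes the same parameters for both aligners, so a small helper can build the command and reject an unknown aligner before anything is launched. This is a sketch under assumptions: the function name `build_allele_calling_command` is hypothetical, and milestone.py is assumed to be in the current working directory.

```python
from typing import List

ALIGNERS = ("vg", "sbg")  # the two aligner values listed above


def build_allele_calling_command(
    aligner: str,
    read1: str,
    read2: str,
    reference: str,
    schema_name: str,
    output: str,
    threads: int = 1,
    update_reference: bool = False,
) -> List[str]:
    """Build the argument list for 'milestone.py allele_calling'."""
    if aligner not in ALIGNERS:
        raise ValueError(f"aligner must be one of {ALIGNERS}, got {aligner!r}")
    command = [
        "python", "milestone.py", "allele_calling",
        "--aligner", aligner,
        "--read1", read1,
        "--read2", read2,
        "--reference", reference,
        "--schema_name", schema_name,
        "--output", output,
        "--threads", str(threads),
    ]
    if update_reference:
        # Adds the sample's information to <reference>_info.txt and <reference>.vcf.
        command.append("--update_reference")
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`. Leaving `update_reference` off keeps the reference files unchanged, which is useful when testing a sample.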
- If the --update_reference parameter is used, <reference>_info.txt and <reference>.vcf will be updated. (Note: <reference>.fasta does not need to be updated because the references of the coding sequences remain the same.)
- Files in vg or sbg are created from scratch, but the remaining files are updated versions of the input files.
output_directory
|- <reference>_info.txt
|- <reference>.fasta
|- <reference>.vcf
|- vg
|--- <sample>.vcf
|--- <sample>_mlst.tsv
|--- <sample>.bam
|--- <sample>.depth
schema_name
|- CDS1.fasta
|- CDS2.fasta
|- ...
|- CDSn.fasta
output_directory
|- <reference>_info.txt
|- <reference>.fasta
|- <reference>.vcf
|- sbg
|--- <sample>.vcf
|--- <sample>_mlst.tsv
|--- <sample>.bam
|--- <sample>.depth
schema_name
|- CDS1.fasta
|- CDS2.fasta
|- ...
|- CDSn.fasta
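After a run, the output layout shown above can be checked programmatically. The sketch below lists the files expected for one sample and reports those that are missing. The function names are hypothetical, and the file list simply mirrors the directory trees above, so it does not cover any other files a run may produce.

```python
from pathlib import Path
from typing import List


def expected_outputs(output_dir: str, reference: str, sample: str, aligner: str) -> List[Path]:
    """List the files the output trees above show for one sample."""
    root = Path(output_dir)
    reference_files = [
        root / f"{reference}_info.txt",
        root / f"{reference}.fasta",
        root / f"{reference}.vcf",
    ]
    # Per-sample files live in a subdirectory named after the aligner (vg or sbg).
    sample_files = [
        root / aligner / f"{sample}{suffix}"
        for suffix in (".vcf", "_mlst.tsv", ".bam", ".depth")
    ]
    return reference_files + sample_files


def missing_outputs(output_dir: str, reference: str, sample: str, aligner: str) -> List[Path]:
    """Return the expected files that do not exist yet."""
    return [
        path
        for path in expected_outputs(output_dir, reference, sample, aligner)
        if not path.exists()
    ]
```

An empty list from `missing_outputs` means every file in the tree for that sample and aligner is present.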
@todo