MILESTONE

Milestone is an end-to-end sample-based MLST profile creation workflow for bacterial species.

Milestone Workflow

Milestone has a fully-automated workflow.

Schema Creation

Milestone creates reference-related files:

Details of <reference>_info.txt

Each line of <reference>_info.txt describes one allele, named cdsName_alleleId. For each variation, the position (POS), reference (REF), alternate (ALT), and quality (QUAL) fields are joined by the separator characters shown below. The variations of an allele are listed on the same line, separated by commas (,).

i.e. cdsName_alleleId POS*REF>ALT-QUAL,POS*REF>ALT-QUAL
- Each comma-separated part POS*REF>ALT-QUAL represents a variation of an allele.
- Each variation set on a line, POS*REF>ALT-QUAL,POS*REF>ALT-QUAL, represents an allele.
- Each line represents a single allele of a single CDS.
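As a minimal sketch, a line in this format could be parsed as follows. The names `Variation` and `parse_info_line` are hypothetical and not part of Milestone. The sketch also assumes that the allele name and the variation list are separated by whitespace, that REF and ALT contain only letters, and that QUAL is a plain non-negative number.

```python
import re
from typing import List, NamedTuple, Tuple


class Variation(NamedTuple):
    pos: int
    ref: str
    alt: str
    qual: float


# One variation looks like POS*REF>ALT-QUAL, e.g. 12*A>G-60
# (assumption: REF/ALT are letters only, QUAL is a plain number).
_VARIATION = re.compile(r"^(\d+)\*([A-Za-z]+)>([A-Za-z]+)-(\d+(?:\.\d+)?)$")


def parse_info_line(line: str) -> Tuple[str, str, List[Variation]]:
    """Split one info line into (cds_name, allele_id, variations)."""
    name, variation_field = line.strip().split(None, 1)
    # The allele ID is the part after the last underscore of cdsName_alleleId.
    cds_name, _, allele_id = name.rpartition("_")
    variations = []
    for part in variation_field.split(","):
        match = _VARIATION.match(part.strip())
        if match is None:
            raise ValueError(f"Malformed variation: {part!r}")
        pos, ref, alt, qual = match.groups()
        variations.append(Variation(int(pos), ref, alt, float(qual)))
    return cds_name, allele_id, variations


if __name__ == "__main__":
    # Hypothetical example line, not taken from a real reference file.
    print(parse_info_line("abcZ_2 12*A>G-60,45*CT>C-33.5"))
```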

Allele Calling

Milestone assigns an allele ID to the sample's sequence aligned to each CDS based on the following criteria:
- <ID_from_the_reference> If the variations of the sample's sequence aligned to the CDS completely match an allele-defining variation set in the TEXT-formatted reference file, it assigns the matching allele ID from the reference file.
- LNF If the depth of coverage of the sample's CDS is lower than expected, it assigns LNF (Locus Not Found) as the allele ID of the sample's allele.
- 1 If the depth of coverage of the sample's aligned sequence is equal to or greater than the expected value and the sample has no variations at the CDS locus, it assigns the reference's allele ID (1). The reference is the longest allele of the CDS.
- If the variations of the sample's aligned sequence do not match any allele-defining variation set in the TEXT-formatted reference file, it checks the validity of the sample's sequence aligned to the CDS before declaring the sequence a novel allele of the CDS.
  - LNF If the length of the sequence is not a multiple of 3, or the sequence aligned to the CDS contains an in-frame stop codon, an invalid start codon, or an invalid stop codon, it assigns LNF as the allele ID. Bacterial genes are not interrupted by introns, so such a sequence is not a valid coding sequence.
  - ASM If the sequence passes the validation steps but is more than 20% shorter than the locus's allele length mode, it assigns ASM (Alleles Smaller than Mode) to the sample's allele.
  - ALM If the sequence passes the validation steps but is more than 20% longer than the locus's allele length mode, it assigns ALM (Alleles Larger than Mode) to the sample's allele.
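
The criteria above can be read as a decision procedure. The sketch below is one hedged reading of them, not Milestone's actual code. The function names, the `min_depth` argument, the order of the checks, the set of accepted start codons, the `NOVEL` placeholder return value, and the reading of the 20% rule as 0.8 and 1.2 times the mode length are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: common bacterial start codons; Milestone's own list may differ.
START_CODONS = {"ATG", "GTG", "TTG"}


def is_valid_cds(sequence):
    """Check length, start codon, stop codon, and in-frame stop codons."""
    sequence = sequence.upper()
    if len(sequence) < 6 or len(sequence) % 3 != 0:
        return False
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    if codons[0] not in START_CODONS or codons[-1] not in STOP_CODONS:
        return False
    # No stop codon may appear before the final codon.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def assign_allele_id(depth, min_depth, sample_variations,
                     reference_alleles, sample_sequence, mode_length):
    """Return an allele ID, LNF, ASM, ALM, or the placeholder NOVEL.

    reference_alleles maps allele ID -> collection of variation strings.
    """
    if depth < min_depth:
        return "LNF"
    if not sample_variations:
        return "1"  # identical to the reference allele
    for allele_id, variations in reference_alleles.items():
        if set(variations) == set(sample_variations):
            return allele_id
    if not is_valid_cds(sample_sequence):
        return "LNF"
    if len(sample_sequence) < 0.8 * mode_length:
        return "ASM"
    if len(sample_sequence) > 1.2 * mode_length:
        return "ALM"
    return "NOVEL"


if __name__ == "__main__":
    known = {"2": {"3*A>G"}}
    print(assign_allele_id(30, 10, {"3*A>G"}, known, "ATGAAATAA", 9))
```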
Reference update is described below:

Tutorial

This tutorial shows how to create a multilocus sequence typing (MLST) profile from user-defined coding sequences and raw reads. Begin by creating the environment for a Milestone run, following the instructions below.

1. Setup
- 1.1. Setting up the data for the tutorial
- 1.2. Setting up the environment for the tutorial
  - Linux
    - i. Install pip (Pip Installs Packages) using APT (Advanced Packaging Tool)
    - ii. Install conda
    - iii. Create the conda environment
  - macOS
    - i. Install homebrew (The Missing Package Manager)
    - ii. Install pip (Pip Installs Packages) using homebrew
    - iii. Install conda
    - iv. Create the conda environment
- 2.1. milestone.py schema_creation
  - a. From genome assemblies of species
  - b. From coding sequences
    - b.1. Only coding sequences are available in the initial set.
    - b.2. Coding sequences and their alleles are available in the initial set.
  - 2.1.1. Input files
  - 2.1.2. Parameters
    - 2.1.2.a. Milestone parameters
    - 2.1.2.b. Snakemake parameters (*optional)
  - 2.1.3. Output files
- 2.2. milestone.py allele_calling
  - 2.2.1. Input files
  - 2.2.2. Parameters
    - 2.2.2.a. Milestone parameters
      - 2.2.2.a.1. VG
      - 2.2.2.a.2. SBG
    - 2.2.2.b. Snakemake parameters (*optional)
  - 2.2.3. Output files
    - 2.2.3.1. VG
    - 2.2.3.2. SBG

1. Setup

1.1. Setting up the data for the tutorial

Create a directory named milestone_tutorial for these exercises.
Copy files from ... into the milestone_tutorial directory.

1.2. Setting up the environment for the tutorial

Linux

i. Install pip (Pip Installs Packages) using APT (Advanced Packaging Tool)

sudo apt-get update
sudo apt-get install python3-pip

ii. Install conda

Follow the instructions on conda's website.

iii. Create the conda environment

conda config --add channels bioconda
conda config --add channels conda-forge
conda create --name milestone bcftools=1.13 biopython=1.79 chewbbaca=2.7.0 htslib=1.13 fastp=0.12 freebayes=1.3.2 minimap2=2.22 pysam=0.16.0.1 samtools=1.13 snakemake=5.32.2 vg=1.34

You can activate the created environment to work in it:
- conda activate milestone
When your analysis is done, you can deactivate the environment:
- conda deactivate
- The environment is kept until you remove it, so you can use it again by activating it with the command given above.

macOS

i. Install homebrew (The Missing Package Manager)

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

ii. Install pip (Pip Installs Packages) using homebrew

brew install python3.8

iii. Install conda

Follow the instructions on conda's website.

iv. Create the conda environment

conda config --add channels bioconda
conda config --add channels conda-forge
conda create --name milestone bcftools=1.13 biopython=1.79 chewbbaca=2.7.0 htslib=1.13 fastp=0.12 freebayes=1.3.2 minimap2=2.22 pysam=0.16.0.1 samtools=1.13 snakemake=5.32.2

VG is available through conda only for Linux, so on macOS you need to install VG locally as an additional step by following the instructions on VG's website.
You can activate the created environment to work in it:
- conda activate milestone
When your analysis is done, you can deactivate the environment:
- conda deactivate
- The environment is kept until you remove it, so you can use it again by activating it with the command given above.

Milestone runs in two modes:

python milestone.py schema_creation
python milestone.py allele_calling

2.1. `milestone.py schema_creation`

Milestone generates the Snakefile itself, so you do not need the --snakefile SNAKEFILE parameter. Use this parameter only if you want a different layout.
Milestone also generates the config.yaml file, so you should not create this file yourself.

a. From genome assemblies of species

You can use chewBBACA to call alleles from public or user-provided genome assemblies of the species.
If you prefer to use public genome assemblies of the species of interest, you can download them by running the download_species_reference_fasta.sh script with the command below:

bash download_species_reference_fasta.sh -s <species_name>

b. From coding sequences

b.1. Only coding sequences are available in the initial set.

Milestone appends all the coding sequences to create the <reference>.fasta file.
It creates an empty <reference>_info.txt file for further analysis.
It creates a <reference>.vcf file containing only a default header for further analysis.
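
As a rough illustration of case b.1, the three reference files could be produced as in the sketch below. This is not Milestone's implementation: the function name `create_initial_reference` is hypothetical, and the two-line VCF header is only an assumed stand-in for the "default header" mentioned above.

```python
from pathlib import Path

# Assumption: a minimal VCF header; Milestone's default header may contain more lines.
MINIMAL_VCF_HEADER = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)


def create_initial_reference(schema_dir, output_dir, reference):
    """Concatenate CDS FASTA files and create empty info and header-only VCF files."""
    schema_dir, output_dir = Path(schema_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    fasta_path = output_dir / f"{reference}.fasta"
    with fasta_path.open("w") as out:
        for cds_file in sorted(schema_dir.glob("*.fasta")):
            text = cds_file.read_text()
            # Make sure each file ends with a newline before the next one starts.
            out.write(text if text.endswith("\n") else text + "\n")

    (output_dir / f"{reference}_info.txt").write_text("")
    (output_dir / f"{reference}.vcf").write_text(MINIMAL_VCF_HEADER)
    return fasta_path
```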

b.2. Coding sequences and their alleles are available in the initial set.

Milestone appends all the coding sequences to create the <reference>.fasta file.
It creates a <reference>_info.txt file that identifies the allele set of the user-provided coding sequences for further analysis.
It creates a <reference>.vcf file containing a default header and the variations between the alleles and their reference coding sequences for further analysis.

2.1.1. Input files

schema_name
|- CDS1.fasta
|- CDS2.fasta
|- ...
|- CDSn.fasta

2.1.2. Parameters

2.1.2.a. Milestone parameters

--reference <reference>
		Name of reference file to be given without extension and directory. It will be used to name <reference>.fasta, <reference>.vcf, and <reference>_info.txt files. [required: True]

--threads <threads>
		Number of threads to be used in the workflow. [default: 1]

--schema_name <schema_name>
		Name of directory containing user-defined coding sequences and their alleles. [required: True]

--output <output_directory>
		Name of directory for the output reference files. [required: True]
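
For scripted runs, the parameters above can be assembled into a command line. The sketch below only builds the argument list from the flags documented in this section. The helper name `schema_creation_command` and the example values are hypothetical.

```python
import shlex


def schema_creation_command(reference, schema_name, output, threads=1):
    """Build the argument list for `milestone.py schema_creation`."""
    return [
        "python", "milestone.py", "schema_creation",
        "--reference", reference,
        "--schema_name", schema_name,
        "--output", output,
        "--threads", str(threads),
    ]


if __name__ == "__main__":
    command = schema_creation_command("my_reference", "my_schema", "my_output", threads=4)
    # Print the command instead of running it; pass `command` to
    # subprocess.run(command, check=True) to actually execute it.
    print(" ".join(shlex.quote(part) for part in command))
```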

2.1.2.b. Snakemake parameters (*optional)

--help
		Show this help message and exit.

--dryrun
		Display the commands without running. [default: False]
		
--unlock
		Removes possible locks on Snakemake. [default: False]

--quiet
		Do not output any progress. [default: False]

--rerun-incomplete
		Rerun all incomplete jobs. [default: False]

--forceall
		Run all rules regardless of whether their output has already been created. [default: False]

--printshellcmds
		Prints the shell command to be executed. [default: False]

2.1.3. Output files

output_directory
|- <reference>.fasta
|- <reference>.vcf.gz
|- <reference>_info.txt

2.2. `milestone.py allele_calling`

2.2.1. Input files

schema_name will only be used to add novel allele sequences identified using the analyzed sample.
Allele info for the reference is already available in <reference>_info.txt file.

output_directory
|- <reference>.fasta
|- <reference>.vcf.gz
|- <reference>_info.txt

<sample_read1>

<sample_read2>

schema_name
|- CDS1.fasta
|- CDS2.fasta
|- ...
|- CDSn.fasta

2.2.2. Parameters

2.2.2.a. Milestone parameters

2.2.2.a.1. VG

--aligner vg

--read1 <sample_read1>
		Path to the first read file of the sample, including its directory. [required: True]

--read2 <sample_read2>
		Path to the second read file of the sample, including its directory. [required: True]

--reference <reference>
		Name of reference file to be given without extension and directory. It will be used to name <reference>.fasta, <reference>.vcf, and <reference>_info.txt files. [required: True]

--threads <threads>
		Number of threads to be used in the workflow. [default: 1]

--schema_name <schema_name>
		Name of directory containing user-defined coding sequences and their alleles. [required: True]

--output <output_directory>
		Name of directory for the output reference files. [required: True]

--update_reference
		Updates <reference>_info.txt and <reference>.vcf files by adding the information coming from the analyzed sample. [default: False]

2.2.2.a.2. SBG

--aligner sbg

--read1 <sample_read1>
		Path to the first read file of the sample, including its directory. [required: True]

--read2 <sample_read2>
		Path to the second read file of the sample, including its directory. [required: True]

--reference <reference>
		Name of reference file to be given without extension and directory. It will be used to name <reference>.fasta, <reference>.vcf, and <reference>_info.txt files. [required: True]

--threads <threads>
		Number of threads to be used in the workflow. [default: 1]

--schema_name <schema_name>
		Name of directory containing user-defined coding sequences and their alleles. [required: True]

--output <output_directory>
		Name of directory for the output reference files. [required: True]

--update_reference
		Updates <reference>_info.txt and <reference>.vcf files by adding the information coming from the analyzed sample. [default: False]
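
The VG and SBG parameter sets differ only in the value of --aligner, so one helper can cover both. The sketch below builds the argument list from the flags documented above and rejects any other aligner name. The helper name `allele_calling_command` and the example file names are hypothetical.

```python
import shlex

SUPPORTED_ALIGNERS = ("vg", "sbg")


def allele_calling_command(aligner, read1, read2, reference, schema_name,
                           output, threads=1, update_reference=False):
    """Build the argument list for `milestone.py allele_calling`."""
    if aligner not in SUPPORTED_ALIGNERS:
        raise ValueError(f"aligner must be one of {SUPPORTED_ALIGNERS}, got {aligner!r}")
    command = [
        "python", "milestone.py", "allele_calling",
        "--aligner", aligner,
        "--read1", read1,
        "--read2", read2,
        "--reference", reference,
        "--schema_name", schema_name,
        "--output", output,
        "--threads", str(threads),
    ]
    if update_reference:
        # Flag without a value; it is off unless given.
        command.append("--update_reference")
    return command


if __name__ == "__main__":
    command = allele_calling_command(
        "vg", "sample_1.fastq", "sample_2.fastq",
        "my_reference", "my_schema", "my_output", update_reference=True,
    )
    print(" ".join(shlex.quote(part) for part in command))
```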

2.2.2.b. Snakemake parameters (*optional)

--help
		Show this help message and exit.

--dryrun
		Display the commands without running. [default: False]
		
--unlock
		Removes possible locks on Snakemake. [default: False]

--quiet
		Do not output any progress. [default: False]

--rerun-incomplete
		Rerun all incomplete jobs. [default: False]

--forceall
		Run all rules regardless of whether their output has already been created. [default: False]

--printshellcmds
		Prints the shell command to be executed. [default: False]

2.2.3. Output files

If the --update_reference parameter is used, <reference>_info.txt and <reference>.vcf will be updated. (Note: <reference>.fasta does not need to be updated because the references of the coding sequences remain the same.)
Files in the vg or sbg directory are created from scratch, but the remaining files are updated versions of the input files.

2.2.3.1. VG

output_directory
|- <reference>_info.txt
|- <reference>.fasta
|- <reference>.vcf
|- vg
|--- <sample>.vcf
|--- <sample>_mlst.tsv
|--- <sample>.bam
|--- <sample>.depth

schema_name
|- CDS1.fasta
|- CDS2.fasta
|- ...
|- CDSn.fasta

2.2.3.2. SBG

output_directory
|- <reference>_info.txt
|- <reference>.fasta
|- <reference>.vcf
|- sbg
|--- <sample>.vcf
|--- <sample>_mlst.tsv
|--- <sample>.bam
|--- <sample>.depth

schema_name
|- CDS1.fasta
|- CDS2.fasta
|- ...
|- CDSn.fasta
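
After a run, the layouts above can be used to check that nothing is missing. The sketch below lists the expected paths for either aligner. The function names are hypothetical, and how Milestone derives `<sample>` from the read file names is not covered here, so the sample name is passed in explicitly.

```python
from pathlib import Path

SAMPLE_SUFFIXES = (".vcf", "_mlst.tsv", ".bam", ".depth")


def expected_outputs(output_dir, reference, sample, aligner):
    """Return the output paths listed in section 2.2.3 for one sample."""
    if aligner not in ("vg", "sbg"):
        raise ValueError(f"Unknown aligner: {aligner!r}")
    out = Path(output_dir)
    reference_files = [
        out / f"{reference}_info.txt",
        out / f"{reference}.fasta",
        out / f"{reference}.vcf",
    ]
    sample_files = [out / aligner / f"{sample}{suffix}" for suffix in SAMPLE_SUFFIXES]
    return reference_files + sample_files


def missing_outputs(output_dir, reference, sample, aligner):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(output_dir, reference, sample, aligner)
            if not p.exists()]
```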

Citation

@todo
