fatmakahveci/milestone

MILESTONE

Milestone is an end-to-end sample-based MLST profile creation workflow for bacterial species.


Table of Contents

  • Milestone Workflow
    • Schema Creation
    • Allele Calling
  • Citation

Milestone Workflow

  • Milestone has a fully automated workflow.

[Figure: Milestone workflow]

Schema Creation

  • Milestone creates reference-related files:

[Figure: Graph representation in files]

[Figure: Allele to VCF]

Details of <reference>_info.txt: Each line lists the variations of one allele (cdsName_alleleId). For each variation, the position (POS), reference (REF), alternate (ALT), and quality (QUAL) fields are joined by specific separator characters, and the variations of the same allele are separated by commas (,) on the same line.

  • Format: cdsName_alleleId POS*REF>ALT-QUAL,POS*REF>ALT-QUAL
    • Each comma-separated part, POS*REF>ALT-QUAL, represents one variation of an allele.
    • The full variation set on a line, POS*REF>ALT-QUAL,POS*REF>ALT-QUAL, defines an allele.
    • Each line represents a single allele of a single CDS.
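
The line format above can be read with a few string splits. The following is a minimal sketch rather than Milestone's own parser: the function and class names are hypothetical, and it assumes that the allele name and the variation list are separated by whitespace and that the allele ID is the part after the last underscore. QUAL is kept as text because its numeric format is not specified here.

```python
from typing import List, NamedTuple, Tuple


class Variation(NamedTuple):
    pos: int
    ref: str
    alt: str
    qual: str  # kept as text; the numeric format is an open assumption


def parse_info_line(line: str) -> Tuple[str, str, List[Variation]]:
    """Parse one 'cdsName_alleleId POS*REF>ALT-QUAL,...' line."""
    fields = line.strip().split(None, 1)
    name = fields[0]
    variations_field = fields[1] if len(fields) > 1 else ""
    # The CDS name itself may contain underscores, so split on the last one.
    cds_name, _, allele_id = name.rpartition("_")
    variations = []
    for item in variations_field.split(","):
        item = item.strip()
        if not item:
            continue
        pos, rest = item.split("*", 1)
        ref, rest = rest.split(">", 1)
        alt, qual = rest.rsplit("-", 1)
        variations.append(Variation(int(pos), ref, alt, qual))
    return cds_name, allele_id, variations


# Hypothetical example line; the CDS name and values are made up.
print(parse_info_line("abcZ_2 12*A>G-60,45*C>T-58"))
```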

Allele Calling

  • Milestone assigns an allele ID to the sample's sequence aligned to each CDS based on the following criteria:

    • <ID_from_the_reference> If the variations of the sample's sequence aligned to the CDS completely match an allele-defining variation set in the text-formatted reference file (<reference>_info.txt), it assigns the matching allele ID from the reference file.
    • LNF If the depth of coverage of the sample's CDS is lower than expected, it assigns LNF (Locus Not Found) as the allele ID of the sample's allele.
    • 1 If the depth of coverage of the sample's aligned sequence is equal to or greater than the expected value and the sample has no variations at the CDS locus, it assigns the reference's allele ID, where the reference is the longest allele of the reference CDS.
    • If the variations of the sample's aligned sequence match none of the allele-defining variation sets in the text-formatted reference file, it checks the validity of the sample's sequence aligned to the CDS before declaring the sequence a novel allele of the CDS.
      • LNF If the length of the sequence is not a multiple of 3, or the sequence aligned to the CDS contains an in-frame stop codon, an invalid start codon, or an invalid stop codon, it assigns LNF as the allele ID, because such a sequence is not a valid coding sequence (bacterial coding sequences are not split into exons).
      • ASM If the sequence passes the validation steps but is more than 20% shorter than the mode of the locus allele lengths, it assigns ASM (Alleles Smaller than Mode) to the sample's allele.
      • ALM If the sequence passes the validation steps but is more than 20% longer than the mode of the locus allele lengths, it assigns ALM (Alleles Larger than Mode) to the sample's allele.
  • Reference update is described below:

[Figure: Reference update]
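
The allele-assignment criteria listed under Allele Calling can be sketched as a single decision function. This is an illustrative sketch under stated assumptions, not Milestone's implementation: all names are hypothetical, the start codon set (ATG, GTG, TTG) is an assumed bacterial set, the 20% threshold is read as a relative distance from the length mode, and the "NOVEL" return value is only a placeholder because the ID given to a novel allele is not specified in the text above.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable

START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_valid_cds(seq: str) -> bool:
    """Length is a multiple of 3, valid start/stop codon, no in-frame stop."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] not in START_CODONS or codons[-1] not in STOP_CODONS:
        return False
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def length_mode(lengths: Iterable[int]) -> int:
    """Most frequent allele length of a locus (first one wins on ties)."""
    return Counter(lengths).most_common(1)[0][0]


def assign_allele_id(
    depth: float,
    min_depth: float,
    sample_variations: FrozenSet[str],
    reference_alleles: Dict[str, FrozenSet[str]],
    sample_seq: str,
    mode_len: int,
    size_threshold: float = 0.2,
) -> str:
    if depth < min_depth:
        return "LNF"  # coverage lower than expected
    if not sample_variations:
        return "1"  # identical to the reference allele
    for allele_id, variations in reference_alleles.items():
        if variations == sample_variations:
            return allele_id  # complete match with a known allele
    if not is_valid_cds(sample_seq):
        return "LNF"  # not a valid coding sequence
    if len(sample_seq) < mode_len * (1 - size_threshold):
        return "ASM"
    if len(sample_seq) > mode_len * (1 + size_threshold):
        return "ALM"
    return "NOVEL"  # placeholder label for a valid novel allele
```

Variations are compared as sets here, so the order in which they appear on a line does not matter; whether Milestone compares them the same way is an assumption.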


Tutorial

This tutorial shows how to create multilocus sequence typing (MLST) profiles from user-defined coding sequences and raw reads. Begin by creating the environment for the Milestone run, following the instructions below.


Table of Contents

    1. Setup
    • 1.1. Setting up the data for the tutorial
    • 1.2. Setting up the environment for the tutorial
      • Linux
        • i. Install pip (Pip Installs Packages) using APT (Advanced Packaging Tool)
        • ii. Install conda
        • iii. Create the conda environment
      • macOS
        • i. Install homebrew (The Missing Package Manager)
        • ii. Install pip (Pip Installs Packages) using homebrew
        • iii. Install conda
        • iv. Create the conda environment
    • 2.1. milestone.py schema_creation
      • a. From genome assemblies of species
      • b. From coding sequences
        • b.1. Only coding sequences are available in the initial set.
        • b.2. Coding sequences and their alleles are available in the initial set.
      • 2.1.1. Input files
      • 2.1.2. Parameters
        • 2.1.2.a. Milestone parameters
        • 2.1.2.b. Snakemake parameters (*optional)
      • 2.1.3. Output files
    • 2.2. milestone.py allele_calling
      • 2.2.1. Input files
      • 2.2.2. Parameters
        • 2.2.2.a. Milestone parameters
          • 2.2.2.a.1. VG
          • 2.2.2.a.2. SBG
        • 2.2.2.b. Snakemake parameters (*optional)
      • 2.2.3. Output files
        • 2.2.3.1. VG
        • 2.2.3.2. SBG

1. Setup

1.1. Setting up the data for the tutorial

  • Create a directory named milestone_tutorial for these exercises.

  • Copy files from ... into the milestone_tutorial directory.

1.2. Setting up the environment for the tutorial


Linux

i. Install pip (Pip Installs Packages) using APT (Advanced Packaging Tool)
  • sudo apt-get update
  • sudo apt-get install python3-pip
ii. Install conda
  • Follow the instructions in conda's website.
iii. Create the conda environment
conda config --add channels bioconda
conda config --add channels conda-forge
conda create --name milestone bcftools=1.13 biopython=1.79 chewbbaca=2.7.0 htslib=1.13 fastp=0.12 freebayes=1.3.2 minimap2=2.22 pysam=0.16.0.1 samtools=1.13 snakemake=5.32.2 vg=1.34
  • Activate the created environment to work in it:
    • source activate milestone
    • On recent conda versions, conda activate milestone is the preferred form.
  • When your analysis is done, deactivate the environment:
    • conda deactivate
    • The environment is kept until you remove it, so you can reuse it later by activating it with the command given above.

macOS

i. Install homebrew (The Missing Package Manager)

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

ii. Install pip (Pip Installs Packages) using homebrew
  • brew install python3.8
iii. Install conda
  • Follow the instructions in conda's website.
iv. Create the conda environment
conda config --add channels bioconda
conda config --add channels conda-forge
conda create --name milestone bcftools=1.13 biopython=1.79 chewbbaca=2.7.0 htslib=1.13 fastp=0.12 freebayes=1.3.2 minimap2=2.22 pysam=0.16.0.1 samtools=1.13 snakemake=5.32.2 
  • VG is available through conda only for Linux, so on macOS you need to install VG locally as an additional step by following the instructions on VG's website.
  • Activate the created environment to work in it:
    • source activate milestone
    • On recent conda versions, conda activate milestone is the preferred form.
  • When your analysis is done, deactivate the environment:
    • conda deactivate
    • The environment is kept until you remove it, so you can reuse it later by activating it with the command given above.
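
Once the environment is active, a quick check that the command-line tools are reachable can save a failed run later. The snippet below is a minimal sketch: the helper name is hypothetical, and the executable names are assumed from the packages in the conda create command above (bgzip is provided by htslib). On macOS, vg is expected to be reported as missing until it has been installed separately.

```python
import shutil
from typing import Iterable, List

# Assumed executable names for the packages listed in the conda command.
REQUIRED_TOOLS = (
    "bcftools", "bgzip", "fastp", "freebayes",
    "minimap2", "samtools", "snakemake", "vg",
)


def missing_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found.")
```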

  • Milestone runs in two modes:
  1. python milestone.py schema_creation
  2. python milestone.py allele_calling

2.1. milestone.py schema_creation

  • Milestone creates the Snakefile itself, so the --snakefile SNAKEFILE parameter is not required. Use this parameter only if you specifically want a different layout.
  • Milestone also creates the config.yaml file, so you should not create this file yourself.

a. From genome assemblies of species

  • You can use chewBBACA to call alleles using public or user-provided genome assemblies belonging to the species.

  • If you prefer to use public genome assemblies of the species of interest, you can download the public data by running the download_species_reference_fasta.sh script with the command below:

    bash download_species_reference_fasta.sh -s <species_name>


b. From coding sequences

b.1. Only coding sequences are available in the initial set.
  • It appends all the coding sequences to create the <reference>.fasta file.
  • It creates an empty <reference>_info.txt file for further analysis.
  • It creates a <reference>.vcf file containing only a default header for further analysis.
b.2. Coding sequences and their alleles are available in the initial set.
  • It appends all the coding sequences to create the <reference>.fasta file.
  • It creates a <reference>_info.txt file to identify the allele set of the user-provided coding sequences for further analysis.
  • It creates a <reference>.vcf file containing a default header and the variations between the alleles and their reference coding sequences for further analysis.
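
The step of appending the coding sequences into a single reference FASTA can be sketched as follows. This is a sketch under assumptions, not Milestone's code: the function names are hypothetical, the schema directory is assumed to hold one FASTA file per CDS, the longest allele in each file is taken as the reference (following the statement under Allele Calling that the reference is the longest allele of the CDS), and each reference record is named after its file, which is an illustrative choice.

```python
from pathlib import Path
from typing import Dict, Iterator, Tuple


def read_fasta(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_reference(schema_dir: str, reference_fasta: str) -> Dict[str, int]:
    """Write the longest allele of every CDS file into one reference FASTA.

    Returns a mapping from CDS name to the length of the chosen allele.
    """
    chosen = {}
    with open(reference_fasta, "w") as out:
        for cds_file in sorted(Path(schema_dir).glob("*.fasta")):
            records = list(read_fasta(cds_file))
            if not records:
                continue  # skip empty files
            _, seq = max(records, key=lambda record: len(record[1]))
            out.write(f">{cds_file.stem}\n{seq}\n")
            chosen[cds_file.stem] = len(seq)
    return chosen
```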

2.1.1. Input files

schema_name
|- CDS1.fasta
|- CDS2.fasta
|- ...
|- CDSn.fasta

2.1.2. Parameters

2.1.2.a. Milestone parameters
--reference <reference>
		Name of reference file to be given without extension and directory. It will be used to name <reference>.fasta, <reference>.vcf, and <reference>_info.txt files. [required: True]

--threads <threads>
		Number of threads to be used in the workflow. [default: 1]

--schema_name <schema_name>
		Name of directory containing user-defined coding sequences and their alleles. [required: True]

--output <output_directory>
		Name of directory for the output reference files. [required: True]
2.1.2.b. Snakemake parameters (*optional)
--help
		Show this help message and exit.

--dryrun
		Display the commands without running. [default: False]
		
--unlock
		Removes possible locks on Snakemake. [default: False]

--quiet
		Do not output any progress. [default: False]

--rerun-incomplete
		Rerun all incomplete jobs. [default: False]

--forceall
		Run all rules regardless of whether their output has already been created. [default: False]

--printshellcmds
		Prints the shell command to be executed. [default: False]
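
Putting the parameters above together, a schema_creation call can be assembled from Python, for example when scripting several runs. This is only a sketch: the helper name and the argument values are hypothetical, and it assumes the optional Snakemake flags are passed on the same milestone.py command line.

```python
import shlex
from typing import List


def schema_creation_cmd(
    reference: str,
    schema_name: str,
    output: str,
    threads: int = 1,
    dryrun: bool = False,
) -> List[str]:
    """Build the argument list for 'milestone.py schema_creation'."""
    cmd = [
        "python", "milestone.py", "schema_creation",
        "--reference", reference,
        "--schema_name", schema_name,
        "--output", output,
        "--threads", str(threads),
    ]
    if dryrun:
        cmd.append("--dryrun")  # only display the commands
    return cmd


# Hypothetical values; replace them with your own names.
cmd = schema_creation_cmd("my_reference", "my_schema", "my_output", threads=4, dryrun=True)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```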

2.1.3. Output files

output_directory
|- <reference>.fasta
|- <reference>.vcf.gz
|- <reference>_info.txt

2.2. milestone.py allele_calling


2.2.1. Input files

  • schema_name is used only to add the novel allele sequences identified in the analyzed sample.
  • Allele info for the reference is already available in the <reference>_info.txt file.
output_directory
|- <reference>.fasta
|- <reference>.vcf.gz
|- <reference>_info.txt

<sample_read1>

<sample_read2>

schema_name
|- CDS1.fasta
|- CDS2.fasta
|- ...
|- CDSn.fasta
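
Before starting allele_calling, it can help to confirm that the inputs shown above are in place. The helper below is a hypothetical sketch that checks for the three reference files in the output directory, the two read files, and the schema directory; it does not inspect the contents of any file.

```python
from pathlib import Path
from typing import List


def missing_allele_calling_inputs(
    output_dir: str,
    reference: str,
    read1: str,
    read2: str,
    schema_name: str,
) -> List[str]:
    """Return the expected input paths that do not exist."""
    out = Path(output_dir)
    expected_files = [
        out / f"{reference}.fasta",
        out / f"{reference}.vcf.gz",
        out / f"{reference}_info.txt",
        Path(read1),
        Path(read2),
    ]
    missing = [str(path) for path in expected_files if not path.is_file()]
    if not Path(schema_name).is_dir():
        missing.append(str(schema_name))
    return missing
```

An empty list means every expected input was found.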

2.2.2. Parameters

2.2.2.a. Milestone parameters
2.2.2.a.1. VG
--aligner vg

--read1 <sample_read1>
		First read of sample including its directory. [required: True]

--read2 <sample_read2>
		Second read of sample including its directory. [required: True]

--reference <reference>
		Name of reference file to be given without extension and directory. It will be used to name <reference>.fasta, <reference>.vcf, and <reference>_info.txt files. [required: True]

--threads <threads>
		Number of threads to be used in the workflow. [default: 1]

--schema_name <schema_name>
		Name of directory containing user-defined coding sequences and their alleles. [required: True]

--output <output_directory>
		Name of directory for the output reference files. [required: True]

--update_reference
		Updates <reference>_info.txt and <reference>.vcf files by adding the information coming from the analyzed sample. [default: False]
2.2.2.a.2. SBG
--aligner sbg

--read1 <sample_read1>
		First read of sample including its directory. [required: True]

--read2 <sample_read2>
		Second read of sample including its directory. [required: True]

--reference <reference>
		Name of reference file to be given without extension and directory. It will be used to name <reference>.fasta, <reference>.vcf, and <reference>_info.txt files. [required: True]

--threads <threads>
		Number of threads to be used in the workflow. [default: 1]

--schema_name <schema_name>
		Name of directory containing user-defined coding sequences and their alleles. [required: True]

--output <output_directory>
		Name of directory for the output reference files. [required: True]

--update_reference
		Updates <reference>_info.txt and <reference>.vcf files by adding the information coming from the analyzed sample. [default: False]
2.2.2.b. Snakemake parameters (*optional)
--help
		Show this help message and exit.

--dryrun
		Display the commands without running. [default: False]
		
--unlock
		Removes possible locks on Snakemake. [default: False]

--quiet
		Do not output any progress. [default: False]

--rerun-incomplete
		Rerun all incomplete jobs. [default: False]

--forceall
		Run all rules regardless of whether their output has already been created. [default: False]

--printshellcmds
		Prints the shell command to be executed. [default: False]
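
The VG and SBG parameter sets differ only in the --aligner value, so one helper can cover both. The following sketch uses a hypothetical function name and hypothetical argument values; it rejects aligner names other than the two documented above and adds --update_reference only on request.

```python
import shlex
from typing import List

ALIGNERS = ("vg", "sbg")  # the two aligners documented above


def allele_calling_cmd(
    aligner: str,
    read1: str,
    read2: str,
    reference: str,
    schema_name: str,
    output: str,
    threads: int = 1,
    update_reference: bool = False,
) -> List[str]:
    """Build the argument list for 'milestone.py allele_calling'."""
    if aligner not in ALIGNERS:
        raise ValueError(f"aligner must be one of {ALIGNERS}, got {aligner!r}")
    cmd = [
        "python", "milestone.py", "allele_calling",
        "--aligner", aligner,
        "--read1", read1,
        "--read2", read2,
        "--reference", reference,
        "--schema_name", schema_name,
        "--output", output,
        "--threads", str(threads),
    ]
    if update_reference:
        cmd.append("--update_reference")
    return cmd


# Hypothetical file and directory names.
print(shlex.join(allele_calling_cmd(
    "vg", "reads/sample_1.fastq.gz", "reads/sample_2.fastq.gz",
    "my_reference", "my_schema", "my_output", update_reference=True,
)))
```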

2.2.3. Output files

  • If the --update_reference parameter is used, <reference>_info.txt and <reference>.vcf will be updated. (Note: <reference>.fasta does not need to be updated because the references of the coding sequences remain the same.)
  • Files in the vg or sbg directory are created from scratch, but the remaining files are updated versions of the input files.
2.2.3.1. VG
output_directory
|- <reference>_info.txt
|- <reference>.fasta
|- <reference>.vcf
|- vg
|--- <sample>.vcf
|--- <sample>_mlst.tsv
|--- <sample>.bam
|--- <sample>.depth

schema_name
|- CDS1.fasta
|- CDS2.fasta
|- ...
|- CDSn.fasta
2.2.3.2. SBG
output_directory
|- <reference>_info.txt
|- <reference>.fasta
|- <reference>.vcf
|- sbg
|--- <sample>.vcf
|--- <sample>_mlst.tsv
|--- <sample>.bam
|--- <sample>.depth

schema_name
|- CDS1.fasta
|- CDS2.fasta
|- ...
|- CDSn.fasta

Citation

@todo