Gab0/straintables

About

straintables is a tool that helps evaluate differences among gene loci across multiple genomes of the same species. The software is composed of two parts. The first is an in-silico version of the PCR reaction, where regions designated as primer pairs are matched across the genomes so that the region between the primers can be retrieved. The second is a dissimilarity matrix generator that creates matrices based on the differences across the retrieved sequences.

Overview

straintables has different modes of operation. The primers used to fetch genomic regions may be user-defined or found by brute-force searches on top of the gene sequence. These searches use a user-provided identifier that points to a sequence located inside an annotation file to retrieve the desired region's boundaries.

The analysis counts the SNPs that differ among the primer-bound sequences found in each genome, then builds a dissimilarity matrix (DM) for each region.

Further clustering is then done based on the DMs.

The viewer interface is simple and shows the pairwise distances between regions from all analyzed genomes.

This package is composed of a few independent Python scripts, which are installed as system commands. The commands are listed in the Executable Scripts section.

Inside The Pipeline

1) Primer Docking: fetching Amplicons

This step is carried out by the module straintables.Executable.primerFinder.

For each designated locus, the app tries to find the original sequence and/or the complement of both primers in every genome. If both primers are found in a genome, the sequence between them is extracted and the search proceeds to the next genome.

If every genome yields an amplicon for the current locus, the script saves the sequences and moves on to the next locus.

If, for some reason, the given pair of primers does not succeed in every genome, the script retrieves the gene sequence from the master genome and fetches random sequences near the beginning and near the end of the gene, to be used as primers. This step only happens if the locus name defined by the user matches a gene name or locus in the genome annotation. Otherwise, the locus is discarded.

Some available genomes are reverse-complemented. The script makes sure that the locus sequences for every genome are in the same orientation.
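
The docking idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (exact string matches, a single contig per genome, the reverse primer annealing to the opposite strand); the function names are hypothetical and this is not the code used by straintables.Executable.primerFinder.

```python
# Illustrative sketch only; not the primerFinder implementation.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_amplicon(genome, forward, reverse):
    """Return the region bounded by both primers, or None if either is missing.

    The reverse primer is assumed to anneal to the opposite strand, so its
    reverse complement is searched downstream of the forward primer.
    """
    start = genome.find(forward)
    if start == -1:
        return None
    end = genome.find(reverse_complement(reverse), start + len(forward))
    if end == -1:
        return None
    return genome[start:end + len(reverse)]


def dock(genome, forward, reverse):
    """Try the genome as given, then its reverse complement.

    Returning the amplicon from whichever strand matches keeps the
    sequences from all genomes in the same orientation.
    """
    for strand in (genome, reverse_complement(genome)):
        amplicon = find_amplicon(strand, forward, reverse)
        if amplicon is not None:
            return amplicon
    return None
```

A genome for which dock returns None would be the trigger for the fallback primer search described above.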

2) Amplicon Sequences Alignment

After getting the locus sequences from all the genomes, the differences among genomes are visualized on two fronts:

2a) Dissimilarity Matrix

  1. The multifasta file containing the sequences for one locus across all genomes is passed through ClustalW2.
  2. The SNPs are detected and scored.
  3. One Dissimilarity Matrix is created, showing which groups of genomes have similar sequences at that locus.
  4. Dissimilarity Matrices can be viewed individually as .pdf files or .npy Python files, or grouped in the visualization tool stview.
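
As a rough illustration of steps 2 and 3, the sketch below scores SNPs between already-aligned sequences and fills a symmetric matrix. The scoring rule used here (mismatching columns where neither sequence has a gap, divided by the alignment length) is an assumption made for the example; the actual scoring in straintables may differ, and the function names are hypothetical.

```python
def count_snps(seq_a, seq_b):
    """Count aligned columns where both sequences have a base and the bases differ."""
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a != "-" and b != "-"
    )


def dissimilarity_matrix(aligned):
    """Build a symmetric matrix of SNP fractions.

    `aligned` maps genome names to aligned sequences of equal length.
    Returns the list of names (row/column order) and the matrix as nested lists.
    """
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(seq) != length for seq in aligned.values()):
        raise ValueError("aligned sequences must have equal length")
    matrix = [[0.0] * len(names) for _ in names]
    for i, name_i in enumerate(names):
        for j in range(i + 1, len(names)):
            score = count_snps(aligned[name_i], aligned[names[j]]) / length
            matrix[i][j] = matrix[j][i] = score
    return names, matrix
```

Genomes whose rows are close to zero against each other would form one group for that locus.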

2b) MeShClust Clustering

  1. The primary locus multifasta file is sent to MeShClust, which detects clusters among the genomes' loci. The default MeShClust identity parameter is 0.999.
  2. The output of MeShClust is parsed by the visualization tool, which decorates the genome names on the Dissimilarity Matrix labels according to their cluster group.

3) Visualization

After the pipeline executes the docking and evaluation scripts, the user can run stview <result_directory_path> to view the results.

More statistical analyses on the Dissimilarity Matrices are carried out, mostly using Python's skbio module. The interpretation of these analyses is under construction.

By looking at a pair of Dissimilarity Matrices at a time, both corresponding to neighboring loci, the user may gain insight into the studied organism, such as its recombination frequency.
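
One simple way to compare the matrices of two neighboring loci is to correlate their off-diagonal entries. The sketch below computes a plain Pearson correlation over the upper triangles; it is an assumed illustration of the idea, not the analysis that straintables or skbio performs, and it carries no significance test.

```python
from math import sqrt


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    size = len(matrix)
    return [matrix[i][j] for i in range(size) for j in range(i + 1, size)]


def matrix_correlation(matrix_a, matrix_b):
    """Pearson correlation between the upper triangles of two square matrices.

    Both matrices must list the genomes in the same order.
    Returns None when either matrix has no variation to correlate.
    """
    xs, ys = upper_triangle(matrix_a), upper_triangle(matrix_b)
    if len(xs) != len(ys) or not xs:
        raise ValueError("matrices must be square, non-trivial and of equal size")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)
```

A value near 1 means the two loci group the genomes in the same way; a low value for adjacent loci is the kind of pattern that may hint at recombination between them.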

Setup

Method 1: Download and Install this Python module

straintables requires Python3.6+

  1. From PyPI:
pip install setuptools numpy scipy cython --user

!! We run pip twice because the modules installed on the first step may have installation issues
!! If they fail to install, check the pip message log, it contains info for missing required system packages.


pip install straintables --user

!! Executable scripts are now at ~/.local/bin by default,
!! symlink them to your $PATH, add this folder to your $PATH,
!! or run pip without "--user" and with admin privileges, which is not recommended.

Method 2: Install the conda package

conda install -c gabzn straintables
conda update straintables
!! Then the executables should be available on conda's $PATH.
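
To confirm that the installed commands are reachable, a short check like the following can help. It is a minimal sketch that only uses Python's standard library; the command names come from the Executable Scripts list in this document, and the helper name locate_commands is hypothetical.

```python
import shutil

# Command names taken from the "Executable Scripts" list in this README.
COMMANDS = ["stdownload", "stprimer", "stgenomepline", "stview"]


def locate_commands(names=COMMANDS):
    """Map each command name to its resolved path, or None when it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_commands().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If a command is reported as not found after a pip install with --user, the ~/.local/bin folder is the first place to look.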

Setup issues:

If the setup commands shown above fail, there is probably a problem with the build of some required Python module. Take note of which module is failing, then open an issue on this repository and/or search the web for a solution. The Python modules numpy, scipy and cython, which should be installed before straintables, can raise errors during installation. These errors occur mostly due to missing system packages required by those modules, and the error message should point to the problem. This software has never been tested on Windows, but it should work.

Docker

A Dockerfile is provided as an experimental way of running this software for advanced users. This file may also be used as reference of the required packages on Linux systems.

External Software

Here is a list of external software that is required or optional for straintables' operation.
The executables should be available at your $PATH.

Clustal Omega [required]

The alignment step of straintables requires ClustalO installed on your system.

MeShClust [optional]

The recombination analysis step of straintables has MeShClust as an optional dependency.

Having it installed on the system enables genome group clustering to be totally independent from the alignment software, as MeShClust does the clustering on top of unaligned .fasta files.

Usage

Fetch genomes and annotation files

This step defines the organism under analysis, so it is advised to run it inside a new directory, keeping one directory for each organism.

The following commands download each genome matching the query organism from NCBI, along with one annotation file for one specified strain. Each of the commands below creates and populates the folders genomes and annotations, so choose one of the examples and run it.

To download Toxoplasma gondii genomes, strain ME49 annotation:
$stdownload --organism "Toxoplasma gondii" --strain ME49

With Lactobacillus plantarum, strain WCFS1 annotation:
$stdownload --organism "Lactobacillus plantarum" --strain WCFS1

Ten genomes of Saccharomyces cerevisiae:
$stdownload --organism "Saccharomyces cerevisiae" --max 10

Although the stdownload script contains various methods to ensure correct file names for the downloaded genomes, it's recommended to check the folder afterwards for odd names that would otherwise show up on the resulting matrices.

The user can manually add desired genomes and annotations, as explained in the next subsections:

Annotation

  • The annotation file serves as a guide for automatic primer docking, since it contains the boundaries of each locus.
  • One .gbff annotation file in the annotations folder is required.

Genomes

  • The genome files are the root of the analysis.
  • One multifasta file per strain.
  • They should be placed at the genomes folder.
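
Before running the analysis, the folder layout described above can be checked with a small script. This is a sketch under assumptions: the folder names genomes and annotations follow this document, while the accepted genome file extensions are a guess, since the text only says "multifasta". The function name check_layout is hypothetical.

```python
from pathlib import Path

# Assumed genome file extensions; the README only says "multifasta".
GENOME_SUFFIXES = {".fasta", ".fa", ".fna"}


def check_layout(workdir):
    """Return a list of problems found; an empty list means the layout looks usable."""
    root = Path(workdir)
    problems = []

    genomes_dir = root / "genomes"
    genomes = []
    if genomes_dir.is_dir():
        genomes = [p for p in genomes_dir.iterdir() if p.suffix.lower() in GENOME_SUFFIXES]
    if not genomes:
        problems.append("no genome multifasta files found in 'genomes'")

    annotations_dir = root / "annotations"
    annotations = []
    if annotations_dir.is_dir():
        annotations = [p for p in annotations_dir.iterdir() if p.suffix.lower() == ".gbff"]
    if not annotations:
        problems.append("no .gbff annotation file found in 'annotations'")

    return problems
```

Calling check_layout(".") from the organism's directory returns an empty list when both folders are populated.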

Analysis

  1. Put the wanted Locus names, ForwardPrimers and ReversePrimers in a .csv file inside the Primers folder. The primer sequences are optional; leave them blank to trigger the automatic primer search. See the examples below.

  2. stgenomepline is the pipeline script; it calls the analysis components in the proper order.

  3. Check the results in the result folder named after the Primer file selected for the run. Result folders are located under the Alignments folder.

Example 1: Automatic Locus Selection with Automatic Primer Search.

$ stprimer -d annotations -c X -o Primers/TEST.csv -p 0.01
$ stgenomepline -p Primers/TEST.csv
$ stview analysisResults/TEST

Example 2: Custom Locus Selection with Automatic Primer Search

  • Make your own primer .csv file, named Primers/chr_X.csv for this example. It should have blank primer fields.

@file: Primers/chr_X.csv

CDPK
IMC2A
AP2X1
TGME49_227830

Then, execute:

$stgenomepline -p Primers/chr_X.csv
  • Then view the similarity matrices and phylogenetic trees in the .pdf files inside the Alignments/chr_X folder.

Example 3: Custom Loci Selection, Custom Primer Search

  • Follow Example 2, except that the primer file can now have a pair of primers designed for each locus.
  • Primers that are missing or problematic will trigger the automatic primer search.

@file: Primers/chr_X.csv

LocusName,ForwardPrimer,ReversePrimer
CDPK1,ACAAAGGCTACTTCTACCTC,TTCTATGTGGGGATGCAGAG
IMC2A,,GACGGACGCATGGCTTGCTG
AP2X1,GCTCAAGCTGCTCCCCGGGC,TCGACGGAGGTGCTCCAACC
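
To see which loci in a primer file would fall back to the automatic search because of a blank field, the file can be read with Python's csv module. The sketch below uses the column names from the example above; treating a single blank primer as a trigger is an assumption based on the note about missing primers, and the function name is hypothetical.

```python
import csv
import io


def loci_needing_auto_search(csv_text):
    """Return the locus names whose forward or reverse primer field is blank."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pending = []
    for row in reader:
        forward = (row.get("ForwardPrimer") or "").strip()
        reverse = (row.get("ReversePrimer") or "").strip()
        if not forward or not reverse:
            pending.append(row["LocusName"])
    return pending


EXAMPLE = """LocusName,ForwardPrimer,ReversePrimer
CDPK1,ACAAAGGCTACTTCTACCTC,TTCTATGTGGGGATGCAGAG
IMC2A,,GACGGACGCATGGCTTGCTG
AP2X1,GCTCAAGCTGCTCCCCGGGC,TCGACGGAGGTGCTCCAACC
"""

print(loci_needing_auto_search(EXAMPLE))  # ['IMC2A']
```

Primers that are present but fail to dock in some genome are not detected by this check; that only shows up when the pipeline runs.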

Executable Scripts

stdownload [--help]
stprimer [--help]
stgenomepline [--help]
stview [--help]
stprotein [--help] (under development & undocumented)

Results

  • As the pipeline unfolds, the user-defined WorkingDirectory folder (argument -d) is created and populated with files of various kinds, in the order described below. There is no need to read these files manually if you stick to the stview visualization tool.
  1. .fasta Sequence files, each holding the amplicons found for one locus.
  2. .aln Alignment files, one for each locus.
  3. .aln.npy Dissimilarity Matrix files, one for each locus.
  4. .pdf Dissimilarity Matrix Plot files, one for each locus.
  • There are also some .csv files with information on those regions.
  1. MatchedRegions.csv: Information on matched regions, their position on each genome and more.

  2. AlignedRegions.csv: Information on matched regions after alignment, mostly the number of SNPs.

  3. PrimerData.csv: Information on matched primers, mostly their position on each genome and orientation.

  4. PWMAnalysis.csv: Extended analysis on matched regions, a comparison of each pair of regions.
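
For a quick inventory of a result folder, the files can be grouped by the extensions listed above. This is a minimal sketch using only the standard library; the group labels and the function name summarize_results are made up for the example.

```python
from pathlib import Path

# Extensions from the Results list; ".aln.npy" comes first so it wins over ".aln".
RESULT_TYPES = [
    (".aln.npy", "dissimilarity matrices"),
    (".fasta", "amplicon sequences"),
    (".aln", "alignments"),
    (".pdf", "matrix plots"),
    (".csv", "region tables"),
]


def summarize_results(result_dir):
    """Group the file names in a result folder by the kinds listed in RESULT_TYPES."""
    summary = {label: [] for _, label in RESULT_TYPES}
    for path in sorted(Path(result_dir).iterdir()):
        if not path.is_file():
            continue
        for suffix, label in RESULT_TYPES:
            if path.name.endswith(suffix):
                summary[label].append(path.name)
                break
    return summary
```

A locus that has a .fasta file but no matching .aln.npy file is a sign that the pipeline stopped before the matrix step for that locus.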

Result Analysis Tools

Some python scripts on the main module are not called within stgenomepline or stfastapline. They are optional analysis tools and should be launched by the user.

  1. stview: the basic one. It launches a web server at the default address localhost:5000, which you can open in your browser to view the dissimilarity matrices that were built.

Matrix from fasta region sequences

Alternatively, you can use straintables as you would use MatGAT: you have a few multifasta files with many compatible short sequences, one file per region, and just want to see the dissimilarity matrices for them. The entire workflow is described below:

$stfastapline -d DIRECTORY_WITH_FASTA_FILES
$stview DIRECTORY_WITH_FASTA_FILES

The first command is a mini pipeline and should be executed only once. As of the current version, you'll need more than one region to execute this.
