Pamir: Discovery and Genotyping of Novel Sequence Insertions in Many Sequenced Individuals

Pamir detects and genotypes novel sequence insertions in single or multiple datasets of paired-end WGS (Whole Genome Sequencing) Illumina reads by jointly analyzing one-end anchored (OEA) and orphan reads.

Table of contents

  1. Installation
  2. Running Pamir
  3. Example
  4. Visualization
  5. Publications
  6. Contact & Support

Installation

Installation from Source

Prerequisite. You will need g++ 5.2 or higher to compile the source code.

The first step in installing Pamir is to download the source code from our GitHub repository. After downloading, change the current directory to the source directory pamir and run make and make install in a terminal to create the necessary binary files.

git clone https://github.com/vpc-ccg/pamir.git --recursive
cd pamir
make
make install

Running Pamir

Prerequisites

Pamir's pipeline requires a number of external programs. You can either manually install them or take advantage of pamir's conda environment.yaml to install all the dependencies except the assembler:

conda env  create -f environment.yaml
source activate pamir-deps

Dependencies Version
Python 3.x
samtools >= 1.9
mrsfast >= 3.4.0
BLAST >= 2.9.0+
bedtools >= 2.26.0
bwa >= 0.7.17
snakemake >= 5.3.0
RepeatMasker >= 4.0.9
minia >= 3.2.0 *
abyss >= 2.2.3 *
spades >= 3.13 *

*Note: You only need to install one of the assemblers.
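
Before running the pipeline, it can help to confirm that the external programs are reachable. The following is a minimal sketch, not part of Pamir: it only looks each program up on PATH and does not check versions. The executable names (for example blastn for BLAST, abyss-pe for abyss, spades.py for spades) are assumptions and may differ on your system.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["samtools", "mrsfast", "blastn", "bedtools", "bwa",
                  "snakemake", "RepeatMasker"]
ASSEMBLERS = ["minia", "abyss-pe", "spades.py"]  # only one of these is needed


def check_dependencies(which=shutil.which):
    """Return (missing_tools, available_assemblers) based on a PATH lookup."""
    missing = [tool for tool in REQUIRED_TOOLS if which(tool) is None]
    available = [asm for asm in ASSEMBLERS if which(asm) is not None]
    return missing, available


if __name__ == "__main__":
    missing, assemblers = check_dependencies()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    if not assemblers:
        print("No assembler found; install one of:", ", ".join(ASSEMBLERS))
    if not missing and assemblers:
        print("All programs found; assemblers available:", ", ".join(assemblers))
```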

Project Configuration

To run pamir, you need to create a project configuration file named config.yaml. This configuration consists of a number of mandatory settings and some optional advanced settings. Below is the list of all the settings that you can set in your project.

config-parameter-name Type Description
path Mandatory Full path to project directory.
raw-data Mandatory Location of the input files (crams or bams) relative to path.
population Mandatory Population/cohort name. Note that the name cannot contain any space characters.
reference Mandatory Full path to the reference genome.
input Mandatory A list of input files per individual. Pamir 2.0 accepts BAM and CRAM files as input.
analysis-base Optional Location of intermediate files relative to path. default: {path}/analysis
results-base Optional Location of final results relative to the path. default: {path}/results
assembler Optional External assembler to use (minia, abyss, spades) default: minia
assembler_k Optional k-mer size to use for the external assembler. default: 47
pamir_partitition_per_thread Optional Number of internal pamir jobs to be completed per thread. This is an advanced setting; values that are too small or too large can heavily degrade performance. default: 1000
blastdb Optional Full path to blast database to remove possible contaminants from the data.
centromeres Optional Full path to the file in bed format that contains centromere locations. Calls in these regions will not be reported.
align_threads Optional Number of threads to use for alignment jobs. default: 16
assembly_threads Optional Number of threads to use for assembly jobs. default: 62
other_threads Optional Number of threads to use for other jobs. default: 16
minia_min_abundance Optional minia's internal assembly parameter. default: 5
min_contig_len Deprecated Minimum contig length from the external assembler to use. This is now calculated on the fly.
read_length Deprecated Read length of the input reads. This is now calculated on the fly.
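
To make the defaults in the table concrete, here is a minimal sketch of how a set of user-supplied settings could be completed with the documented default values. It is an illustration only, not Pamir's own code; the function name with_defaults is hypothetical, and the configuration is represented as a plain Python dictionary.

```python
# Default values copied from the settings table above.
DEFAULTS = {
    "assembler": "minia",
    "assembler_k": 47,
    "pamir_partitition_per_thread": 1000,
    "align_threads": 16,
    "assembly_threads": 62,
    "other_threads": 16,
    "minia_min_abundance": 5,
}


def with_defaults(config):
    """Return a copy of config with unset optional settings filled in."""
    resolved = dict(DEFAULTS)
    resolved.update(config)  # user-supplied values take precedence
    # The two directory defaults depend on the project path.
    resolved.setdefault("analysis-base", config["path"] + "/analysis")
    resolved.setdefault("results-base", config["path"] + "/results")
    return resolved
```

For example, with_defaults({"path": "/proj", "assembler": "spades"}) keeps spades as the assembler and fills in the remaining settings from the table.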

The following is an example of config.yaml with two individuals.

path:
    /full/path/to/project-directory
raw-data:
    raw-data
reference:
    /full/path/to/the/reference.fa
population:
    my-pop
input:
 "samplename1":
  - A.cram
 "samplename2":
  - B.bam
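
A configuration like the one above can be sanity-checked before launching the pipeline. The sketch below is hypothetical and not part of Pamir: it assumes the YAML has already been loaded into a dictionary, checks the mandatory settings and the no-spaces rule for population, and additionally assumes that input files are recognizable by a .bam or .cram extension.

```python
MANDATORY = ("path", "raw-data", "population", "reference", "input")
ASSEMBLERS = ("minia", "abyss", "spades")


def validate_config(config):
    """Return a list of problems found; an empty list means none were found."""
    problems = [f"missing mandatory setting: {key}"
                for key in MANDATORY if key not in config]
    if any(ch.isspace() for ch in config.get("population", "")):
        problems.append("population must not contain whitespace")
    if config.get("assembler", "minia") not in ASSEMBLERS:
        problems.append("assembler must be one of: " + ", ".join(ASSEMBLERS))
    # Assumption: input files are identified by their file extension.
    for sample, files in config.get("input", {}).items():
        for name in files:
            if not name.endswith((".bam", ".cram")):
                problems.append(f"{sample}: {name} is not a BAM or CRAM file")
    return problems
```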

Now, to run pamir on such a config file, you have to run the following command.

pamir.sh  --configfile /path/to/config.yaml

Since pamir.sh uses snakemake internally, you can pass any additional snakemake parameters to pamir.sh. Here are some examples:

pamir.sh --configfile /path/to/config.yaml -j [number of threads]
pamir.sh --configfile /path/to/config.yaml -np          # dry run
pamir.sh --configfile /path/to/config.yaml --forceall   # rerun all steps regardless of the current stage
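
If you launch Pamir from a script, the options above can be assembled programmatically. This is a minimal sketch using only the flags shown in this section; the function name build_pamir_command is hypothetical, and the sketch only builds and prints the command rather than running it.

```python
import shlex


def build_pamir_command(configfile, jobs=None, dry_run=False, force_all=False):
    """Assemble the pamir.sh argument list from the options shown above."""
    command = ["pamir.sh", "--configfile", configfile]
    if jobs is not None:
        command += ["-j", str(jobs)]
    if dry_run:
        command.append("-np")
    if force_all:
        command.append("--forceall")
    return command


# Print the command as it would be typed in a shell.
print(shlex.join(build_pamir_command("/path/to/config.yaml", jobs=16, dry_run=True)))
```

The returned list can be passed to subprocess.run(command, check=True) once pamir.sh is on your PATH.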

Output Formats

Pamir generates the following directory structure, including a VCF file of the detected novel sequence insertions for each individual.

[path]/
├── raw-data                       -> OR [raw-data]
│   ├── A.cram
│   ├── B.bam
├── analysis                       -> OR [analysis-base]
│   └── my-pop
└── results                        -> OR [results-base]
    └── my-pop
        ├── index.html             -> Summary of events
        ├── summary.js             -> Summary required by index.html
        ├── data.js                -> Data required by index.html
        ├── events.repeat.bed      -> annotation of repeats for detected events
        ├── events.fa              -> all the detected events with 1000bp flanking region
        ├── events.fa.fai          -> index of events.fa
        └── ind
            ├── A
            │   ├── events.bam     -> mapping of the reads in the events region
            │   ├── events.bam.bai -> index
            │   ├── events.bed     -> location of events
            │   └── events.vcf     -> genotyped insertion calls
            ├── B
            │   ├── events.bam
            │   ├── events.bam.bai
            │   ├── events.bed
            │   └── events.vcf
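
For downstream scripts, the per-individual paths in this layout can be derived from the project settings. The sketch below is an illustration under the assumption that the tree above is exact and that results-base is left at its default; the function name individual_outputs is hypothetical.

```python
from pathlib import PurePosixPath

# File names taken from the directory tree above.
PER_INDIVIDUAL = ("events.bam", "events.bam.bai", "events.bed", "events.vcf")


def individual_outputs(path, population, individual, results_base="results"):
    """List the expected result files for one individual."""
    base = PurePosixPath(path) / results_base / population / "ind" / individual
    return [str(base / name) for name in PER_INDIVIDUAL]


for output in individual_outputs("/full/path/to/project-directory", "my-pop", "A"):
    print(output)
```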

Example

curl -L https://ndownloader.figshare.com/files/22813988 --output example.tar.gz
tar xzvf example.tar.gz
cd example
chmod +x configure.sh
./configure.sh
pamir.sh -j16 --configfile config.yaml

Visualization

index.html provides a quick overview of the detected events and is a friendlier alternative to working with the VCF files directly. If IGV is running, you can jump back and forth between your events in IGV directly from index.html.
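
If you prefer to navigate IGV by hand, event locations can also be turned into locus strings to paste into its search box. This is a minimal sketch that assumes events.bed follows the standard BED convention (chromosome, 0-based start, and end in the first three columns); Pamir's exact column layout is not documented here, so verify it against your own output.

```python
def bed_to_igv_loci(bed_lines):
    """Convert BED records (0-based, half-open) to 1-based IGV locus strings."""
    loci = []
    for line in bed_lines:
        # Skip blank lines and BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        first = int(start) + 1
        last = max(int(end), first)  # keep zero-length records as one position
        loci.append(f"{chrom}:{first}-{last}")
    return loci
```

To use it on a result file, pass an open file handle, for example bed_to_igv_loci(open("events.bed")).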

Publications

Discovery and genotyping of novel sequence insertions in many sequenced individuals. P. Kavak*, Y-Y. Lin*, I. Numanagić, H. Asghari, T. Güngör, C. Alkan‡, F. Hach‡. Bioinformatics (ISMB-ECCB 2017 issue), 33 (14): i161-i169, 2017.

Contact and Support

Feel free to post any inquiries on the issue page.