Pamir: Discovery and Genotyping of Novel Sequence Insertions in Many Sequenced Individuals

Pamir detects and genotypes novel sequence insertions in single or multiple datasets of paired-end WGS (Whole Genome Sequencing) Illumina reads by jointly analyzing one-end anchored (OEA) and orphan reads.

Installation

Installation from Source

Prerequisite. You will need g++ 5.2 or higher to compile the source code.

The first step is to download the source code from our GitHub repository. Then change the current directory to the source directory pamir and run make and make install in a terminal to build the necessary binaries.

git clone https://github.com/vpc-ccg/pamir.git --recursive
cd pamir
make
make install
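
Before running make, it can help to confirm that the compiler meets the stated minimum. The following is only a sketch and is not part of Pamir: the helper name version_ge is hypothetical, and it assumes a sort that supports version sorting with -V (as in GNU coreutils).

```shell
# Hypothetical helper (not part of Pamir): succeeds if version $1 >= version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="5.2"   # minimum g++ version stated in the prerequisite above
if command -v g++ >/dev/null 2>&1; then
    found="$(g++ -dumpversion)"
    if version_ge "$found" "$required"; then
        echo "g++ $found is new enough"
    else
        echo "g++ $found is older than $required" >&2
    fi
else
    echo "g++ not found on PATH" >&2
fi
```

Note that recent GCC releases may print only the major version for -dumpversion (for example 9), which still compares correctly against 5.2 here.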

Running Pamir

Prerequisites

Pamir's pipeline requires a number of external programs. You can either install them manually or use Pamir's conda environment.yaml to install all the dependencies except the assembler:

conda env create -f environment.yaml
source activate pamir-deps

Dependencies	Version
Python	3.x
samtools	>= 1.9
mrsfast	>= 3.4.0
BLAST	>= 2.9.0+
bedtools	>= 2.26.0
bwa	>= 0.7.17
snakemake	>= 5.3.0
RepeatMasker	>= 4.0.9
minia	>= 3.2.0 *
abyss	>= 2.2.3 *
spades	>= 3.13 *

*Note: You only need to install one of the assemblers.
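
If you install the dependencies manually, a quick check that they are on your PATH can save a failed run later. This is a minimal sketch, not part of Pamir: the helper missing_tools is hypothetical, and the executable names are assumptions about a typical installation (for instance, BLAST is checked through blastn). Add the executable of whichever assembler you chose.

```shell
# Hypothetical helper: print each given command name that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names are assumptions; adjust them to your installation.
missing="$(missing_tools samtools mrsfast blastn bedtools bwa snakemake RepeatMasker)"
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing from PATH:" $missing >&2
else
    echo "All checked dependencies were found"
fi
```

This only confirms that each program exists; it does not compare versions against the table above.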

Project Configuration

To run Pamir, you need to create a project configuration file named config.yaml. This configuration consists of a number of mandatory settings and some optional advanced settings. Below is the list of all the settings that you can set in your project.

config-parameter-name	Type	Description
path	Mandatory	Full path to project directory.
raw-data	Mandatory	Location of the input files (crams or bams) relative to `path`.
population	Mandatory	Population/cohort name. Note that the name cannot contain any space characters.
reference	Mandatory	Full path to the reference genome.
input	Mandatory	A list of input files per individual. Pamir 2.0 accepts BAM and CRAM files as input.
analysis-base	Optional	Location of intermediate files relative to `path`. default: `{path}/analysis`
results-base	Optional	Location of final results relative to the `path`. default: `{path}/results`
assembler	Optional	External assembler to use (`minia`, `abyss`, `spades`). default: `minia`
assembler_k	Optional	k-mer size to use for the external assembler. default: 47
pamir_partitition_per_thread	Optional	Number of internal Pamir jobs to be completed per thread. This is an advanced setting; values that are too small or too large can both hurt performance. default: 1000
blastdb	Optional	Full path to the BLAST database used to remove possible contaminants from the data.
centromeres	Optional	Full path to a file in BED format that contains centromere locations. Calls in these regions will not be reported.
align_threads	Optional	Number of threads to use for alignment jobs. default: 16
assembly_threads	Optional	Number of threads to use for assembly jobs. default: 62
other_threads	Optional	Number of threads to use for other jobs. default: 16
minia_min_abundance	Optional	minia's internal assembly parameter. default: 5
min_contig_len	Deprecated	Minimum contig length from the external assembler to use. This is now calculated on the fly.
read_length	Deprecated	Read length of the input reads. This is now calculated on the fly.

The following is an example config.yaml with two individuals.

path:
    /full/path/to/project-directory
raw-data:
    raw-data
reference:
    /full/path/to/the/reference.fa
population:
    my-pop
input:
 "samplename1":
  - A.cram
 "samplename2":
  - B.bam
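
When setting up several projects, the same file can be generated from a few variables. The sketch below is not part of Pamir: write_config is a hypothetical helper that mirrors the example above, rejects population names containing spaces (as the settings table requires), and adds two optional settings with their documented defaults. Writing the optional settings as top-level keys in the same style as the mandatory ones is an assumption.

```shell
# Hypothetical helper: print a config.yaml for one population to stdout.
# Usage: write_config PROJECT_DIR REFERENCE POPULATION
write_config() {
    case "$3" in
        *" "*)
            echo "population name must not contain spaces: '$3'" >&2
            return 1
            ;;
    esac
    # Optional settings below use the defaults from the settings table;
    # their placement as top-level keys is an assumption.
    cat <<EOF
path:
    $1
raw-data:
    raw-data
reference:
    $2
population:
    $3
assembler:
    minia
assembler_k:
    47
input:
 "samplename1":
  - A.cram
 "samplename2":
  - B.bam
EOF
}

write_config /full/path/to/project-directory /full/path/to/the/reference.fa my-pop
```

Redirect the output to a file (for example, > config.yaml) once the printed configuration looks right.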

To run Pamir with this config file, use the following command:

pamir.sh --configfile /path/to/config.yaml

Since pamir.sh uses snakemake internally, you can pass any additional snakemake parameters to pamir.sh. Here are some examples:

pamir.sh --configfile /path/to/config.yaml -j [number of threads]
pamir.sh --configfile /path/to/config.yaml -np          # dry run
pamir.sh --configfile /path/to/config.yaml --forceall   # rerun all steps regardless of the current stage
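
A common pattern is to do a dry run first and start the real run only if the dry run succeeds. The wrapper below is a sketch, not part of Pamir: run_pamir and the PAMIR_CMD variable are hypothetical names, and the default of 16 jobs simply follows the example later in this document.

```shell
# Command to invoke; override PAMIR_CMD if pamir.sh is not on PATH.
PAMIR_CMD="${PAMIR_CMD:-pamir.sh}"

# Hypothetical wrapper: dry run first, then the real run if the dry run succeeds.
# Usage: run_pamir CONFIG [JOBS]
run_pamir() {
    "$PAMIR_CMD" --configfile "$1" -np || return 1
    "$PAMIR_CMD" --configfile "$1" -j "${2:-16}"
}
```

For example, run_pamir /path/to/config.yaml 8 would perform the dry run and then execute the pipeline with -j 8.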

Output Formats

Pamir generates the following directory structure, including a VCF file of the detected novel sequence insertions for each individual.

[path]/
├── raw-data                       -> OR [raw-data]
│   ├── A.cram
│   └── B.bam
├── analysis                       -> OR [analysis-base]
│   └── my-pop
└── results                        -> OR [results-base]
    └── my-pop
        ├── index.html             -> Summary of events
        ├── summary.js             -> Summary required by index.html
        ├── data.js                -> Data required by index.html
        ├── events.repeat.bed      -> annotation of repeats for detected events
        ├── events.fa              -> all the detected events with 1000bp flanking region
        ├── events.fa.fai          -> index of events.fa
        └── ind
            ├── A
            │   ├── events.bam     -> mapping of the reads in the events region
            │   ├── events.bam.bai -> index
            │   ├── events.bed     -> location of events
            │   └── events.vcf     -> genotyped insertion calls
            └── B
                ├── events.bam
                ├── events.bam.bai
                ├── events.bed
                └── events.vcf
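
Given this layout, a quick per-individual tally of calls can be obtained by counting the non-header lines of each events.vcf. This is a minimal sketch, not part of Pamir: count_calls is a hypothetical helper, and it relies only on the fact that VCF header lines start with '#'.

```shell
# Hypothetical helper: print "<individual><TAB><number of VCF records>" for
# every events.vcf under RESULTS_DIR/ind/, following the layout shown above.
# Usage: count_calls RESULTS_DIR
count_calls() {
    for vcf in "$1"/ind/*/events.vcf; do
        [ -f "$vcf" ] || continue
        ind="$(basename "$(dirname "$vcf")")"
        # VCF header lines start with '#'; every other line is one record.
        # grep exits non-zero when the count is 0, hence the `|| true`.
        n="$(grep -vc '^#' "$vcf" || true)"
        printf '%s\t%s\n' "$ind" "$n"
    done
}
```

For the example configuration, this would be called as count_calls /full/path/to/project-directory/results/my-pop.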

Example

curl -L https://ndownloader.figshare.com/files/22813988 --output example.tar.gz
tar xzvf example.tar.gz
cd example
chmod +x configure.sh
./configure.sh
pamir.sh -j16 --configfile config.yaml

Visualization

index.html provides a quick overview of the detected events and is a friendlier alternative to working with the VCF files directly. If IGV is running, you can easily jump back and forth between your events from index.html.

Publications

Discovery and genotyping of novel sequence insertions in many sequenced individuals. P. Kavak*, Y-Y. Lin*, I. Numanagić, H. Asghari, T. Güngör, C. Alkan‡, F. Hach‡. Bioinformatics (ISMB-ECCB 2017 issue), 33 (14): i161-i169, 2017.

Contact and Support

Feel free to post any inquiry on the issue page.
