nPoRe: n-Polymer Realigner for improved pileup variant calling

Introduction

npore is a read realigner which recalculates each read's fine-grained alignment in order to more accurately align "n-polymers" such as homopolymers (n=1) and tandem repeats (2 ≤ n ≤ 6). In other words, given an input BAM, it adjusts each read's CIGAR string to more accurately model the most likely sequencing errors and actual variants. Traditional affine gap penalties are context-agnostic and do not model the higher likelihood of INDELs in low-complexity regions (particularly n-polymers), leading to poor or inconsistent alignments. We find that npore improves pileup concordance across reads and results in slightly better variant calling performance.
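To make the term "n-polymer" concrete, here is a minimal Python sketch that locates homopolymers and short tandem repeats in a sequence. It is an illustration only and is not npore's implementation: the function name `find_n_polymers`, the `min_repeats` threshold, and the way overlapping repeats are handled are all assumptions made for this example.

```python
import re

def find_n_polymers(seq, max_unit=6, min_repeats=3):
    """Return sorted (start, end, unit, copies) tuples for repeated units.

    Illustrative only: `min_repeats` and the overlap handling are
    assumptions, not npore's actual definitions.
    """
    hits = []
    for n in range(1, max_unit + 1):
        # A unit of exactly n bases followed by at least (min_repeats - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (n, min_repeats - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip units that are themselves repeats of a shorter unit
            # (e.g. "AA" is a homopolymer of "A", not a 2-polymer).
            if any(unit == unit[:k] * (n // k) for k in range(1, n) if n % k == 0):
                continue
            hits.append((m.start(), m.end(), unit, (m.end() - m.start()) // n))
    return sorted(hits)

print(find_n_polymers("GGAAAAATTCACACACAGT"))
# [(2, 7, 'A', 5), (9, 17, 'CA', 4)]
```

Coordinates are 0-based and end-exclusive. INDELs inside regions like these are the ones a context-agnostic affine gap penalty scores no differently from INDELs elsewhere.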

Citation

Please cite the following pre-print if you use npore:

[bioRxiv] nPoRe: n-Polymer Realigner for improved pileup variant calling

@article {dunn-npore,
    author = {Dunn, Tim and Blaauw, David and Das, Reetuparna and Narayanasamy, Satish},
    title = {nPoRe: n-Polymer Realigner for improved pileup variant calling},
    elocation-id = {2022.02.15.480561},
    year = {2022},
    doi = {10.1101/2022.02.15.480561},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2022/02/18/2022.02.15.480561},
    eprint = {https://www.biorxiv.org/content/early/2022/02/18/2022.02.15.480561.full.pdf},
    journal = {bioRxiv}
}

Installation

Option 1: GitHub Source

First, clone the repository:

git clone https://github.com/TimD1/npore && cd npore

Next, set up a virtual environment, activate it, and install the required packages.

python3 -m venv venv3 --prompt "npore"
source ./venv3/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt

Please ensure this environment is activated whenever you build or run npore. Lastly, build npore and verify that the build succeeded.

make
python3 ./src/realign.py --help

Option 2: Docker Hub Image

A pre-built Docker image can be downloaded from Docker Hub using:

sudo docker pull timd1/npore
sudo docker run -it timd1/npore:latest python3 realign.py --help

Option 3: Dockerfile

Building the image from the Dockerfile may take some time; the previous option should be preferred in most cases.

git clone https://github.com/TimD1/npore && cd npore
sudo docker build -f ./Dockerfile -t timd1/npore:latest .
sudo docker run -it timd1/npore:latest python3 realign.py --help

Usage

Prerequisites

All input BAMs are required to:

Have MD tag annotations (for pysam read to ref mapping)
Be indexed (have an associated .bam.bai file)

You can prepare your input BAM using samtools:

samtools calmd -@ `nproc` -b -Q orig_reads.bam ref.fasta > reads.bam
samtools index reads.bam
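If you want to sanity-check the two prerequisites before running npore, a small sketch along these lines may help. It uses only the Python standard library; the helper names `has_md_tag` and `has_index` are hypothetical and are not part of npore. The first helper inspects a single SAM-format record (for example, one line printed by `samtools view reads.bam`), and the second looks for the `.bam.bai` index file next to the BAM.

```python
import os

def has_md_tag(sam_line):
    """True if a SAM-format alignment line carries an MD:Z: optional field."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 SAM columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    return any(field.startswith("MD:Z:") for field in fields[11:])

def has_index(bam_path):
    """True if an index named <bam_path>.bai exists (e.g. reads.bam.bai)."""
    return os.path.exists(bam_path + ".bai")

# Example record with hypothetical values, in the layout `samtools view` prints.
record = "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tMD:Z:4"
print(has_md_tag(record))
```

Checking one record is only a spot check; it does not guarantee that every read in the file has an MD tag.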

Option 1: GitHub Source

Here's an example usage of the main realign.py program, which will store results in realigned.sam.

export NPORE="$HOME/npore"
export DATA="$NPORE/test/data"
. $NPORE/venv3/bin/activate
python3 $NPORE/src/realign.py \
    --bam $DATA/reads.bam \
    --ref $DATA/ref.fasta \
    --out_prefix $DATA/realigned \
    --stats_dir $NPORE/guppy5_stats

For additional options, run python3 realign.py --help.
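When scripting many runs, it can be convenient to assemble the same command from Python. The sketch below only builds the argument list, using the four flags shown in the example above; the helper name `build_realign_command` and the relative paths are assumptions for illustration.

```python
import shlex

def build_realign_command(bam, ref, out_prefix, stats_dir, script="src/realign.py"):
    """Assemble the realign.py argument list from the flags shown in this README."""
    return [
        "python3", script,
        "--bam", bam,
        "--ref", ref,
        "--out_prefix", out_prefix,
        "--stats_dir", stats_dir,
    ]

# Paths are placeholders relative to a hypothetical checkout of the repository.
cmd = build_realign_command(
    bam="test/data/reads.bam",
    ref="test/data/ref.fasta",
    out_prefix="test/data/realigned",
    stats_dir="guppy5_stats",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from within the activated virtual environment.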

Option 2: Docker

Here's how to call the Docker container with the same arguments as above:

export NPORE="$HOME/npore"
export DATA="$NPORE/test/data"
sudo docker run \
    -v $DATA:$DATA \
    -v $NPORE/guppy5_stats:$NPORE/guppy5_stats \
    timd1/npore:v0.1.0 \
        python3 realign.py \
        --bam $DATA/reads.bam \
        --ref $DATA/ref.fasta \
        --out_prefix $DATA/realigned \
        --stats_dir $NPORE/guppy5_stats

Project Structure

`src/`	nPoRe source code
`realign.py`	Module for realigning a BAM file.
`standardize_vcf.py`	Module for standardizing a ground-truth VCF file to report variants in the same manner that nPoRe would align the reads.
`bed.py`	Module for computing n-polymer BED regions.
`purity.py`	Module for computing a BAM pileup's Gini purity, for measuring read concordance.
`filter.py`	Simple module for filtering overlapping variants.
`cfg.py`	Contains global variables and configuration.

All other src/ files (aln.pyx, bam.pyx, cig.pyx, vcf.py, util.py) contain functions used in the above modules.
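As a rough illustration of the read-concordance measure that purity.py reports, the sketch below computes a Gini purity for a single pileup column. It assumes the common definition (the sum of squared observation frequencies, i.e. one minus the Gini impurity); the exact formula and pileup handling in purity.py may differ, and the function name `gini_purity` is hypothetical.

```python
from collections import Counter

def gini_purity(column):
    """Sum of squared frequencies of the observations in one pileup column.

    Assumed definition (1 - Gini impurity); not taken from purity.py.
    """
    counts = Counter(column)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty pileup column")
    return sum((count / total) ** 2 for count in counts.values())

print(gini_purity("AAAA"))  # all reads agree: 1.0
print(gini_purity("AACC"))  # reads split evenly between two alleles: 0.5
```

Under this definition, a column where every read agrees scores 1.0, and the score drops as reads disagree about the base or INDEL at that position.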

`scripts/`	Helper scripts used during evaluation
`realign_pipeline.sh`	Main Clair3 retraining pipeline.
`happy.sh`	Runs `hap.py` evaluation of all configurations/regions.
`plot_results.py`	Plots final precision/recall graphs.
`plot_sankey.py`	Generates Sankey plot of actual/error INDELs by n-polymer BED region.
`calc_beds.sh`	Calculates n-polymer BEDs, running `bed.py`.
`sankey.py`	Custom Sankey plot library, extended from `pySankey`.
`purity.sh`	Calculates Gini purity.
`align.sh`	Aligns reads to a reference, allowing multiple input formats.
`tag_unphased.py`	Tags unphased reads with `HP:i:0`.

`test/`	Testing directory
`align.py`	Tests `align()` kernel.
`get_np_info.py`	Tests n-polymer info generation.
`realign.sh`	Tests full read realignment.
`test_std_vcf.sh`	Tests VCF standardization.
`profile_alignment.ipynb`	Line-by-line profiling of `align()` kernel.

`*stats/`	Directory storing cached confusion matrices

Data Sources

The Genome in a Bottle (GIAB) GRCh38 v4.1 ground-truth VCF and benchmarking regions were obtained from GIAB. The GRCh38 human reference sequence and R9.4.1 reads basecalled with the Guppy 5.0.6 super-accuracy model were downloaded from ONT Open Datasets.

Acknowledgements

We would like to thank the developers of samtools, minimap2, pysam, clair3, pepper-deepvariant, pysankey, igv, and swalign. We would also like to thank GIAB and ONT for making their data available publicly.
