at-cg/PanAligner

Getting Started

git clone https://github.com/at-cg/PanAligner
cd PanAligner && make
# Map sequence to graph
./PanAligner -cx lr test/MT.gfa test/MT-orangA.fa > out.gaf

Introduction

PanAligner is an efficient tool for aligning long reads or assembly contigs to a cyclic pangenome graph. It follows the seed-chain-extend procedure. We provide the first exact implementation of the co-linear chaining technique generalized to cyclic graphs. The details of the formulation and the algorithm are provided in our paper. If the input graph is a DAG, PanAligner behaves like minichain. Apart from co-linear chaining, the other necessary components build on open-source code from minichain, minigraph, and GraphChainer. PanAligner scales to human pangenome graphs and whole-genome sequencing read sets.

Users' Guide

Installation

To install PanAligner, run make in the source code directory.

Dependencies

  1. gcc 9 or later
  2. zlib
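
The two dependencies above can be checked before running make. The following is a minimal sketch and not part of PanAligner: the helper name check_gcc_major is hypothetical, and the zlib probe assumes that gcc can find zlib.h on its default include path.

```shell
# Hypothetical pre-build check; not shipped with PanAligner.

# Succeeds if the given version string (e.g. "9.4.0" or "11") has major version >= 9.
check_gcc_major() {
  major="${1%%.*}"   # keep everything before the first dot
  [ "$major" -ge 9 ]
}

if command -v gcc >/dev/null 2>&1; then
  # gcc -dumpversion prints the compiler version, e.g. "11" or "9.4.0"
  if check_gcc_major "$(gcc -dumpversion)"; then
    echo "gcc version OK"
  else
    echo "gcc is older than 9"
  fi
  # Assumption: preprocessing a file that includes zlib.h succeeds
  # only when the zlib development headers are installed.
  if echo '#include <zlib.h>' | gcc -x c -E - >/dev/null 2>&1; then
    echo "zlib headers found"
  else
    echo "zlib headers not found"
  fi
else
  echo "gcc not found in PATH"
fi
```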

Sequence mapping

PanAligner can be used for both sequence-to-sequence alignment and sequence-to-graph mapping. For sequence-to-sequence alignment, PanAligner maps reads to a reference in FASTA format and writes the mappings in PAF format. For sequence-to-graph mapping, PanAligner takes a graph in GFA or rGFA format as input and writes the mappings in GAF format.

# Map sequence to sequence
./PanAligner -cx lr test/MT-human.fa test/MT-orangA.fa > out.paf
# Map sequence to graph
./PanAligner -cx lr test/MT.gfa test/MT-orangA.fa > out.gaf
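
To get a quick overview of a mapping run, the GAF output can be summarized with awk. The sketch below assumes the output follows the standard GAF column layout (column 1 is the query name, column 12 is the mapping quality). The function name gaf_summary and the default threshold of 20 are illustrative choices, not part of PanAligner.

```shell
# Hypothetical helper: summarize a GAF file.
#   $1: path to the GAF file
#   $2: minimum mapping quality to count as confident (default: 20)
gaf_summary() {
  awk -F'\t' -v minq="${2:-20}" '
    {
      n++                                   # one alignment record per line
      if (!($1 in seen)) { seen[$1] = 1; reads++ }   # distinct query names
      if ($12 + 0 >= minq) hq++             # records at or above the MAPQ threshold
    }
    END {
      printf "records=%d reads=%d mapq_ge_%d=%d\n", n, reads, minq, hq
    }
  ' "$1"
}

# Summarize the file produced by the mapping command, if it exists.
if [ -f out.gaf ]; then
  gaf_summary out.gaf 20
fi
```

The same function also works on the PAF output of sequence-to-sequence mapping, since PAF uses the same positions for the query name and mapping quality.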

Hybrid method

The Hybrid method combines the strengths of minigraph and PanAligner to achieve efficient and accurate sequence-to-graph mapping. It identifies a subset of reads that are relatively "easy to align" and aligns them with the fast minigraph heuristics. The remaining reads are aligned with PanAligner.

Before running hybrid_method.sh, ensure that conda is installed and available in your PATH.

# One-time installation of dependencies
chmod +x get_dependencies.sh
./get_dependencies.sh

# Create a hybrid_test folder
mkdir hybrid_test
cp hybrid_method.sh hybrid_test
cd hybrid_test

# Map a sequence using the hybrid method in the hybrid_test folder
./hybrid_method.sh ../test/MT.gfa ../test/MT-human.fa out.gaf 4
# Arguments to hybrid_method.sh, in order:
#   1. graph file
#   2. query file
#   3. output GAF file
#   4. number of threads

Benchmark

We evaluated PanAligner and the Hybrid method against other sequence-to-graph aligners to assess their scalability and accuracy advantages. The evaluation used human pangenome graphs constructed from 94 high-quality haplotype assemblies provided by the Human Pangenome Reference Consortium, along with the CHM13 human genome assembly from the Telomere-to-Telomere consortium. The experiments used simulated long reads with 0.5× coverage and a 5% error rate, and cyclic graphs of sizes 10H, 40H, and 95H, where the integer prefix is the haplotype count in each graph. The results demonstrated superior read mapping precision, as shown in the figure. Notably, even on the largest graph with 95 haplotypes, PanAligner ran efficiently, taking 2 hours and 36 minutes with 44 GB of RAM and 32 threads on Perlmutter CPU nodes.

Plot

Citation

Jyotshna Rajput, Ghanshyam Chandra, Chirag Jain. Co-linear Chaining on Pangenome Graphs. WABI 2023