Use of GFA2 as a pangenome reference #118

oakeley · 2023-05-25T12:50:43Z

When we assemble a new genome (for example with hifiasm), we get five useful sequence files:
hap1.p_ctg.gfa
hap2.p_ctg.gfa
p_ctg.gfa
p_utg.gfa
r_utg.gfa

Each captures elements of the (diploid) genome, so there are already several files to keep track of per individual. A future (pan)genome reference will make this even more complicated.
I would like to be able to specify a GFA1/2 file as a reference rather than a legacy FASTA file for aligning PacBio reads.

Aligning PacBio IsoSeq reads with "allele-specific expression" or unexpected heterozygous differences to a graph would avoid the multimapper problem of alignment, because the "path of the read" would be one valid route through the GFA2 graph. In the case of an identical duplication in two distinct places in the genome, it would be nice to represent it as a single sequence with edges joining it to either "this chromosome" or "that" one. A unique mapping would then be like a street junction: either we only know that we are at the "junction" (conserved sequence on all paths), or we know that the read turned left or right from the junction, if the read is long enough to resolve the route.
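To make the junction idea concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the segment names (`dup`, `chrA_left`, ...), the toy sequences, and the helpers `enumerate_walks` and `place_read` are not part of any aligner or of the GFA specification. It assumes a tiny acyclic graph, forward-strand segments only, and exact matching, which a real long-read aligner would not.

```python
from collections import defaultdict

# Toy graph: one duplicated sequence ("dup") shared by two chromosomes.
segments = {
    "chrA_left": "ACGT",
    "chrB_left": "TTGA",
    "dup": "GGGCCC",
    "chrA_right": "AAAT",
    "chrB_right": "CCTA",
}
# Directed edges, all in forward orientation for simplicity.
edges = [
    ("chrA_left", "dup"),
    ("chrB_left", "dup"),
    ("dup", "chrA_right"),
    ("dup", "chrB_right"),
]


def enumerate_walks(edges):
    """List every source-to-sink walk of a small acyclic graph."""
    succ = defaultdict(list)
    has_pred = set()
    for a, b in edges:
        succ[a].append(b)
        has_pred.add(b)
    nodes = set(succ) | has_pred
    sources = sorted(n for n in nodes if n not in has_pred)
    walks = []

    def extend(walk):
        nxt = succ.get(walk[-1], [])
        if not nxt:
            walks.append(walk)
            return
        for n in nxt:
            extend(walk + [n])

    for s in sources:
        extend([s])
    return walks


def place_read(read, segments, edges):
    """Return the distinct segment tuples that an exactly matching read covers."""
    placements = set()
    for walk in enumerate_walks(edges):
        seq = "".join(segments[n] for n in walk)
        start = seq.find(read)
        while start != -1:
            end = start + len(read)
            touched, offset = [], 0
            for n in walk:
                seg_end = offset + len(segments[n])
                if offset < end and seg_end > start:
                    touched.append(n)
                offset = seg_end
            placements.add(tuple(touched))
            start = seq.find(read, start + 1)
    return placements


short_read = place_read("GGCC", segments, edges)    # lies entirely inside "dup"
long_read = place_read("CCCAAA", segments, edges)   # runs from "dup" into chrA
```

In this toy case the short read has a single placement, the `dup` segment itself (we are "at the junction"), whereas against a linear FASTA it would match both copies. The longer read has a single placement that continues into `chrA_right`, so the route out of the junction is resolved.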

Some code or process for merging GFA1/2 files via a multiple GFA alignment would facilitate the "all against all" search that is needed with FASTA views of a genome. For example, given the GFA files from hifiasm plus a public FASTA reference such as human T2T, it would merge all the possible paths into a single GFA2 pangenome in one file (to be used as a possible reference for a long read aligner). If we then assemble multiple individuals, merging the old and new GFA2 pangenomes to represent the newly observed haplotype paths would be desirable.
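As a very rough illustration of what such a merge step could look like, the Python sketch below combines several GFA1 texts by collapsing segments whose sequences are byte-for-byte identical and taking the union of their links. This is a deliberately naive stand-in, not a multiple graph alignment: it assumes explicit sequences on every `S` line (no `*` placeholders), ignores reverse complements, paths, and optional tags, and writes GFA1 rather than GFA2. The function names `parse_gfa1` and `merge_gfa1` and the `s1`, `s2`, ... naming scheme are hypothetical.

```python
def parse_gfa1(text):
    """Read S (segment) and L (link) lines from a GFA1 string."""
    segments, links = {}, []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            # from, from_orient, to, to_orient, overlap
            links.append(tuple(fields[1:6]))
    return segments, links


def merge_gfa1(texts):
    """Naively merge GFA1 graphs: identical sequences become one segment."""
    seq_to_name = {}      # sequence -> new shared segment name
    merged_links = set()  # de-duplicated links in the new namespace
    for text in texts:
        segments, links = parse_gfa1(text)
        rename = {}
        for old_name, seq in segments.items():
            if seq not in seq_to_name:
                seq_to_name[seq] = f"s{len(seq_to_name) + 1}"
            rename[old_name] = seq_to_name[seq]
        for a, a_orient, b, b_orient, overlap in links:
            merged_links.add((rename[a], a_orient, rename[b], b_orient, overlap))
    lines = ["H\tVN:Z:1.0"]
    lines += [f"S\t{name}\t{seq}" for seq, name in seq_to_name.items()]
    lines += ["L\t" + "\t".join(link) for link in sorted(merged_links)]
    return "\n".join(lines) + "\n"


# Two tiny graphs that share the segment "ACGT" but diverge afterwards.
assembly = "S\t1\tACGT\nS\t2\tGGG\nL\t1\t+\t2\t+\t0M\n"
reference = "S\ta\tACGT\nS\tb\tTTT\nL\ta\t+\tb\t+\t0M\n"
merged = merge_gfa1([assembly, reference])
```

Here the shared `ACGT` segment appears once in `merged`, with one link to each of the two diverging segments. A usable tool would need real sequence alignment to find shared regions that are similar but not identical, and would need to carry the haplotype paths through the merge.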
