Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparing minigraph and Dipcall #85

Open
cjain7 opened this issue Dec 26, 2022 · 1 comment
Open

Comparing minigraph and Dipcall #85

cjain7 opened this issue Dec 26, 2022 · 1 comment

Comments

@cjain7
Copy link

cjain7 commented Dec 26, 2022

Dear Heng,

I and @gsc74 tried to compare minigraph and Dipcall SV calls by using HG002 HiFi diploid assembly. We used a similar method for comparison as how you had used the Syndip benchmark in the minigraph paper. We didn't expect to reproduce the exact same numbers that were presented in the paper because significant improvements to minigraph code have been made since then. Unfortunately we are seeing worse performance with the latest minigraph code when compared to the paper. I am sharing the results and benchmarking methodology here. If you have any advice, please let us know.

Genomes used: Diploid HG002 assembly (HAP1=GCA_018852605.1, HAP2=GCA_018852615.1)
REF= GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
Minigraph version: 0.20-r559

Total SVs called by minigraph: 24004
Fraction of minigraph SVs that overlap with Dipcall SV calls: 91.2%
Total SVs called by Dipcall: 28479
Fraction of Dipcall SVs that overlap with minigraph SV calls: 85.5%
But the numbers from paper were 94% and 97.3% respectively for calling SVs of length ≥ 100bp.

Minigraph commands:

minigraph -t 4 -cxggs $REF $HAP1 $HAP2 > minigraph.gfa
gfatools bubble minigraph.gfa > minigraph.bed
cat minigraph.bed | awk '{ print $1"\t"$2"\t"$3}' | sort -k 1,1 -k2,2n > minigraph.SV.bed

Dipcall commands:

run-dipcall -x hs38.PAR.bed prefix $REF $HAP1 $HAP2 > prefix.mak
make -j8 -f prefix.mak
tabix -p vcf prefix.dip.vcf.gz
# convert VCF to BED
zcat prefix.dip.vcf.gz | awk '! /\#/' | awk '{if(length($4) > length($5)) print $1"\t"($2-1)"\t"($2+length($4)-1); else print $1"\t"($2-1)"\t"($2+length($5)-1)}' > dip.vcf.bed
# filter variants of length >=50
cat dip.vcf.bed | awk '{ print $1"\t"$2"\t"$3}' | awk '{if($3-$2 ≥ 50) print}' | sort -k 1,1 -k2,2n > dip.vcf.SV.bed

We checked how many minigraph SVs are supported by Dipcall SVs:

# expand Dipcall SV intervals by 1000 bp on both sides
bedops --range 1000 --everything dip.vcf.SV.bed > dip.vcf.SV.expand.bed
bedtools intersect -a minigraph.SV.bed -b dip.vcf.SV.expand.bed -u | wc -l

Similarly, we checked how many Dipcall SVs are supported by minigraph SVs:

bedops --range 1000 --everything minigraph.SV.bed > minigraph.SV.expand.bed
bedtools intersect -a dip.vcf.SV.bed -b minigraph.SV.expand.bed -u | wc -l

Do you think the above results are okay?

@cjain7 cjain7 changed the title Comparing minigraph and dipcall Comparing minigraph and Dipcall Dec 26, 2022
@lh3
Copy link
Owner

lh3 commented Dec 26, 2022

When calculating the fraction of dipcall SVs overlapping minigraph SVs, I was only considering SVs in confident regions. When I did the reverse, I didn't apply the confident region. For comparing SVs, usually we don't apply a hard cutoff on SV length. For example, if we want to evaluate SVs >=50bp, we typically include "truth SVs" >=30bp instead because SV representations are often different. You will get higher percentages with these changes.

You may still see lower percentages in comparison to the numbers in the paper for two reasons. 1) The paper considers SVs >=100bp. The consistency is usually lower for >=50bp SVs. 2) With base alignment, minigraph now produces smaller bubbles. The differences in SV representation will be amplified. For example, if we produce one bubble per chromosome, we will get a 100% consistency. In theory, it would be better to take SV lengths into account, but it is tricky in practice.

PS: I have looked at graphs generated with more recent versions. The graphs generally look tighter and more precise – overall an improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants