I found the following bug in the markdup algorithm:
If some alignment records have the same chr, start and end coordinates, they can be marked as duplicates even though their alignment details (such as indels) differ. I think this can lose the variant information carried by those alignments. For example, if a patient has an indel in some region, the reads carrying that indel can be marked as duplicates of reference reads covering the same region, and then nobody will find this indel in the downstream pipeline.
The test data is here: test.data.tar.gz
ref.fa is the reference
r1.fastq contains the reads from ref.fa
test.bam is the alignment of r1.fastq to ref.fa
mkdup.bam is the markdup output generated by
samtools markdup -s -f mkdup.stats test.bam mkdup.bam
mkdup.stats is the markdup stats file generated by the command above
I know samtools markdup relies strongly on the alignment coordinates. Should a CIGAR compatibility check be taken into consideration?
Would doing so be too computationally expensive?
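To make the concern concrete, here is a toy model of the behaviour described above. It is a minimal sketch and not samtools' actual implementation: the `Read` tuple, the `mark_duplicates` function, the `use_cigar` switch and the "keep the highest score" rule are all hypothetical names and assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    # Hypothetical, simplified alignment record (not an htslib structure).
    name: str
    chrom: str
    start: int
    end: int
    cigar: str
    score: int


def mark_duplicates(reads, use_cigar=False):
    """Return the names of reads that this toy model flags as duplicates.

    Reads are grouped by (chrom, start, end); if use_cigar is True the
    CIGAR string is added to the grouping key. Within each group the
    read with the highest score is kept and the rest are flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read.chrom, read.start, read.end)
        if use_cigar:
            key += (read.cigar,)
        groups[key].append(read)

    duplicates = set()
    for members in groups.values():
        best = max(members, key=lambda r: r.score)
        duplicates.update(r.name for r in members if r is not best)
    return duplicates


reads = [
    Read("ref_a", "chr1", 100, 199, "100M", 60),
    Read("ref_b", "chr1", 100, 199, "100M", 50),
    # Same coordinates, but this read carries a 2 bp deletion.
    Read("indel", "chr1", 100, 199, "49M2D49M", 40),
]

coordinate_only = mark_duplicates(reads)
cigar_aware = mark_duplicates(reads, use_cigar=True)
```

With coordinates alone, the indel read falls into the same group as the two reference reads and is flagged; with the CIGAR string in the key it forms its own group and survives.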
Yes, this is not so much a bug as a weakness in the algorithm (and once I've finished the documentation I am writing, a known weakness).
Though I am not sure how much of a weakness. Is this something you have come across with real data?
As well as slowing things down, matching CIGAR strings has its own problems. Reads can be true duplicates, but sequencing errors would make their CIGAR strings differ, and then you would miss the duplicates.
Still, it might be worth putting in as an optional check.
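Purely as an illustration of what such an optional check might compare, the sketch below reduces a CIGAR string to its insertions and deletions. The helper names `indel_signature` and `same_indels` are hypothetical and nothing here reflects existing samtools code; offsets are counted from the first aligned reference base, so the sketch assumes both reads start at the same position.

```python
import re

# One CIGAR operation: a length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_signature(cigar):
    """Return a tuple of (reference offset, op, length) for each I or D.

    The offset is counted from the first aligned reference base, so only
    operations that consume the reference (M, D, N, =, X) advance it.
    """
    signature = []
    ref_offset = 0
    for length, op in _CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "ID":
            signature.append((ref_offset, op, n))
        if op in "MDN=X":
            ref_offset += n
    return tuple(signature)


def same_indels(cigar_a, cigar_b):
    """True if both CIGAR strings describe the same insertions and deletions."""
    return indel_signature(cigar_a) == indel_signature(cigar_b)


# Two copies of one fragment, one with a spurious 1 bp deletion caused
# by a sequencing error: a strict check no longer treats them as duplicates.
split_by_error = not same_indels("100M", "60M1D39M")
```

This shows the trade-off raised above: a check strict enough to protect a real indel also separates true duplicates whenever a sequencing error changes the alignment.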
I am using samtools 1.16.1 (using htslib 1.16).